Easy scraping with curl + xpath (2)

This article is a continuation of the previous one.

So far we have used xpath, curl, and cookies.

xpath is very handy, so here is a recap of the basics (taken from a summary I wrote earlier).

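xpath	Meaning
//*	All elements
//a	All <a> nodes
(//a)[1]	Of all <a> nodes, the first one
(//a)[2]	Of all <a> nodes, the second one (like array indexing)
(//a[1])	Every <a> that is the first <a> child of its parent
//a/span	All span nodes whose parent is an <a>
//a/@href	The href attribute of every a node
//a/text()	The text() of every a node
//a[@href="/index.html"]	All a nodes whose href attribute equals "/index.html"
//a[contains(@href,"index.html")]	All a nodes whose href attribute contains index.html
//title | //meta	Both the title and meta tags
//img[ contains(@src,'jpg') or contains(@src,'png')]	All img nodes whose src contains jpg or png
//div[ contains(@class,'link') and contains(@class,'book')]	Roughly equivalent to div.link.book (substring match, so a class such as "bookmark" also matches)
//form[ ./input[@name="username"] ]	form nodes that have an input[@name="username"] as a child
//div[@id="main"]//form	All form nodes that are descendants of div[@id="main"]
//div/*	All child elements of a div
//div//*	All descendant elements of a div
//table//td[2]	Inside a table, every td that is the second td of its row (the whole second column)
//*[@id]	All elements that have an id attribute
id("tid_123")	The element whose id is tid_123 (id="tid_123")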

Practicing xpath

List the href of every a element on this page.

curl -s  http://takuya-1st.hatenablog.jp/ | xpath "//a/@href"

Extract the title and meta elements from this page.

curl -s  http://takuya-1st.hatenablog.jp/ | xpath "//title | //meta"

Extract the second form on this page.

curl -s  http://takuya-1st.hatenablog.jp/ | xpath "(//form)[2]"
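To reuse the result of "//a/@href" in later steps, it helps to reduce it to one URL per line. The following is a minimal sketch, assuming the xpath command prints attributes in the form href="..." (the exact output format can differ between versions). The function name hrefs_to_urls is made up for this example.

```shell
# hrefs_to_urls (hypothetical helper): read text containing href="..." on
# stdin and print only the attribute values, one per line.
hrefs_to_urls() {
  # grep -o prints each match on its own line; sed strips the href="..." wrapper.
  grep -o 'href="[^"]*"' | sed -e 's/^href="//' -e 's/"$//'
}

# Assumed usage (needs network access and the xpath command):
#   curl -s http://takuya-1st.hatenablog.jp/ | xpath "//a/@href" 2>/dev/null | hrefs_to_urls
```

Relative links such as /about come out as they are, so they may still need to be joined with the site's base URL before fetching.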

Update checking with curl + xpath + md5sum

Treat a page update as a change in the inner HTML of one element, and detect updates by watching that element for changes.

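url="http://localhost/"
xpath_exp="(//div[contains(@class, 'main')])[2]"
# Digest of the target element at startup
digest=$(curl -s "$url" | xpath "$xpath_exp" 2>/dev/null | md5sum)

while true; do
  # Fetch again and hash the same element (quoted, since the expression contains a space)
  current=$(curl -s "$url" | xpath "$xpath_exp" 2>/dev/null | md5sum)
  if [[ "$digest" != "$current" ]]; then
    echo 'changed!!'
    # sendmail ... (placeholder in the original: send your notification here)
    digest=$current
  fi
  sleep 1
done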

This way you can track page updates with just curl and xpath, as long as the selected element contains nothing that changes on every request.
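The loop above keeps the digest only in memory, so it is lost when the script stops. As a variation, here is a minimal sketch that stores the digest in a file so the check can be run from cron instead. The function name check_update and the state file path are assumptions for this example, not part of the original script.

```shell
# check_update STATE_FILE (hypothetical helper)
# Reads the extracted fragment on stdin, compares its MD5 with the digest
# saved in STATE_FILE, prints "changed" or "unchanged", and saves the new digest.
check_update() {
  local state_file="$1"
  local new old=""
  new=$(md5sum | cut -d' ' -f1)
  if [ -f "$state_file" ]; then
    old=$(cat "$state_file")
  fi
  if [ "$new" != "$old" ]; then
    printf '%s\n' "$new" > "$state_file"
    echo changed
  else
    echo unchanged
  fi
}

# Assumed usage (needs network access and the xpath command):
#   curl -s "$url" | xpath "$xpath_exp" 2>/dev/null | check_update /tmp/page.digest
```

Note that the very first run reports "changed", because there is no saved digest to compare against yet.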

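Fetching pages in sequence with curl

curl can also fetch several pages in sequence. That is what the --next option is for.

Form submission followed by data retrieval with --next

With --next, you can make curl behave much like a crawler you would normally write by hand.

The following example logs in to pitapa.com and then moves on to the top page.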

curl -v -k  -c  pitapa.cookie.yml -F id=takuyaXXX -F password=XXXXX \
  https://www2.pitapa.com/member/login.do\
 --next -k -c  pitapa.cookie.yml    https://www2.pitapa.com/member/top.do

You can chain as many --next as you like. Handy!
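Since --next starts a new request with its own options, per-request options such as the cookie jar have to be repeated after each --next, as the command above does with -k and -c. The following sketch builds such an argument list for any number of URLs. The name chain_fetch and the DRY_RUN switch are made up for this example, and adding -b next to -c is my assumption to make sending the saved cookies explicit (the original command uses only -c).

```shell
# chain_fetch JAR URL... (hypothetical helper)
# Fetches every URL in order with a single curl process, repeating the
# cookie options after each --next.
# Set DRY_RUN=1 to print the arguments, one per line, instead of running curl.
chain_fetch() {
  local jar="$1" first=1 url
  shift
  [ "$#" -ge 1 ] || return 1
  # The word list of "for" is expanded once, so rebuilding "$@" inside is safe.
  for url in "$@"; do
    if [ "$first" -eq 1 ]; then
      set --
      first=0
    else
      set -- "$@" --next
    fi
    set -- "$@" -s -b "$jar" -c "$jar" "$url"
  done
  if [ "${DRY_RUN:-0}" = 1 ]; then
    printf '%s\n' "$@"
  else
    curl "$@"
  fi
}

# Assumed usage (example.com URLs are placeholders):
#   chain_fetch cookie.txt https://example.com/top https://example.com/history
```

A login request with -F fields would still be written by hand as the first request, since its options differ from the rest.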

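Fetching even more data in bulk

You can list URLs and fetch them one after another. That is what the --config option is for.

The page.conf file

List the URLs you want to access, together with an output file for each, and curl goes and fetches every page.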

url="http://www.yahoo.co.jp/"
output="yahoo.html"

url="https://qiita.com"
output="qiita.com.html"

url="http://b.hatena.ne.jp/"
output="hatebu.html"

The list of pages can be built in advance with xpath.
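As a rough sketch of that idea, the function below turns a list of URLs (one per line, for example extracted with xpath and cleaned up) into url/output pairs in the page.conf format shown above. The name urls_to_conf and the pageN.html naming scheme are assumptions for this example. It also assumes the URLs contain no double quotes or backslashes, which would need escaping in a curl config file.

```shell
# urls_to_conf (hypothetical helper): read one URL per line on stdin and
# print a url="..." / output="..." pair for each, numbering the output files.
urls_to_conf() {
  local n=0 url
  while IFS= read -r url; do
    # Skip empty lines
    [ -z "$url" ] && continue
    n=$((n + 1))
    printf 'url="%s"\noutput="page%d.html"\n\n' "$url" "$n"
  done
}

# Assumed usage: urls.txt is a file with one absolute URL per line.
#   urls_to_conf < urls.txt > page.conf
```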

Sequential fetching

With the --config/-K option, curl fetches all of the listed URLs one after another as a batch.

curl -s -K page.conf

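This lets you work through a URL list built with xpath, one URL after another. Handy.

You can also specify things like the user-agent

As the name config suggests, the file can also contain the options you would otherwise pass to curl on the command line.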

user-agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.48 Safari/537.36"

url="http://www.yahoo.co.jp/"
output="1.html"
url="http://www.yahoo.co.jp/"
output="2.html"
url="http://www.yahoo.co.jp/"
output="3.html"
url="http://www.yahoo.co.jp/"
output="4.html"
url="http://www.yahoo.co.jp/"
output="5.html"
url="http://www.yahoo.co.jp/"
output="6.html"
url="http://www.yahoo.co.jp/"
output="7.html"

The --libcurl option

None of this is limited to the curl command. curl can also emit the equivalent C source code.

curl --libcurl get_urls.c -s -k -K curl.conf

This generates get_urls.c:

/********* Sample code generated by the curl command line tool **********
 * All curl_easy_setopt() options are documented at:
 * http://curl.haxx.se/libcurl/c/curl_easy_setopt.html
 ************************************************************************/
#include <curl/curl.h>

int main(int argc, char *argv[])
{
  CURLcode ret;
  CURL *hnd;

  hnd = curl_easy_init();
  curl_easy_setopt(hnd, CURLOPT_URL, "http://www.yahoo.co.jp/");
  curl_easy_setopt(hnd, CURLOPT_NOPROGRESS, 1L);
  curl_easy_setopt(hnd, CURLOPT_USERAGENT, "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.48 Safari/537.36");
  curl_easy_setopt(hnd, CURLOPT_MAXREDIRS, 50L);
  curl_easy_setopt(hnd, CURLOPT_SSL_VERIFYPEER, 0L);
  curl_easy_setopt(hnd, CURLOPT_SSL_VERIFYHOST, 0L);
  curl_easy_setopt(hnd, CURLOPT_TCP_KEEPALIVE, 1L);
(rest omitted)

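You can take an access pattern you use often, compile the generated C code, and turn it into a command of its own. Fun.

Summary

curl is handy
curl handles cookies properly
For batch processing with curl, use --next or --config
xpath is fun
Commands built with curl can be reused as C source code for libcurl.

curl is so handy that it is a great help when building a scraper.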

xpath references

http://yakinikunotare.boo.jp/orebase/index.php?XML%2FXPath%2FXPath%A4%CE%BD%F1%A4%AD%CA%FD

たのしいXML: XPath (basics)

https://sites.google.com/site/shin1ogawa/xsl/xpath

http://os0x.g.hatena.ne.jp/os0x/20080620/1213987223

それマグで！

Knowledge is best sipped slowly from a mug rather than a cup. takuya_1st's blog