Scraping Yahoo! News with Julia

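Introduction
Packages used
Yahoo! News topics lists
Getting the title, summary URL, and posting time from the topics list
Getting the full-text URL from the summary page
Getting the body text and news source from the full-text page
Summary
References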

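Introduction

In the previous article, "Checking Zipf's law with Julia" (Juliaでジップの法則（Zipf’s law）を確認), I checked Zipf's law against novels from Aozora Bunko. The results deviated considerably from the law, so I wanted to see how other corpora, such as news articles, behave. However, I had no such corpus at hand.
The first step is therefore to collect a news corpus. This time I collect articles from Yahoo! News.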

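Packages used

Since this is HTML scraping, the packages are the same ones used in an earlier scraping article.

HTTP.jl	HTTP access
Gumbo.jl	HTML parsing
Cascadia.jl	CSS selectors
URIs.jl	URL manipulation
Dates.jl	Date and time handling
JSON.jl	JSON handling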

Yahoo! News topics lists

The targets are the news topics lists. I collect the articles listed at each of the following URLs.

https://news.yahoo.co.jp/topics/domestic	Domestic
https://news.yahoo.co.jp/topics/world	World
https://news.yahoo.co.jp/topics/business	Business
https://news.yahoo.co.jp/topics/entertainment	Entertainment
https://news.yahoo.co.jp/topics/sports	Sports
https://news.yahoo.co.jp/topics/it	IT
https://news.yahoo.co.jp/topics/science	Science
https://news.yahoo.co.jp/topics/local	Local

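The scraping proceeds as follows.

From each topics list page above, get the title, the summary URL, and the posting time.
From the summary URL page, get the full-text URL.
From the full-text URL page, get the body text and the news source.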

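Getting the title, summary URL, and posting time from the topics list

The HTML tag names and page structure naturally differ from those in the earlier article, but the basic operations are the same.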

using HTTP
using Gumbo
using Cascadia
using Dates
using JSON

topics_tags = ["domestic", "world", "business", "entertainment", "sports", "it", "science", "local"]
base_url = "https://news.yahoo.co.jp/topics/"

news_list_p = []
for category in topics_tags
    topics_url = base_url * category

    # Fetch the news list for this topic
    response = HTTP.request("GET", topics_url)
    code = String(response.body)
    doc = Gumbo.parsehtml(code)
    # doc.root[1] is <head>, doc.root[2] is <body>
    body = doc.root[2]
    items = eachmatch(Selector(".newsFeed_item"), body)
    for item in items
        # Summary URL
        link_items = eachmatch(Selector(".newsFeed_item_link"), item)
        if isempty(link_items)   # skip advertisements
            continue
        end
        url = getattr(link_items[1], "href", nothing)
        # Title
        title = nodeText(eachmatch(Selector(".newsFeed_item_title"), item)[1])
        # Posting time (kept as the displayed string)
        dt = nodeText(eachmatch(Selector("time"), item)[1])
        result = Dict("url" => url, "title" => title, "datetime" => dt, "category" => category)
        push!(news_list_p, result)
    end
end

Note that each topics list spans two or more pages, but this code fetches only the first page.
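
If later pages are wanted, most of the work is generating more URLs. The following is a minimal sketch, assuming later pages are addressed by appending ?page=N to the topics URL, which this article does not verify. TOPICS_BASE and topic_page_urls are hypothetical names introduced only for illustration.

```julia
# Assumption: page N (N >= 2) of a topics list lives at "<topics URL>?page=N".
const TOPICS_BASE = "https://news.yahoo.co.jp/topics/"

# Hypothetical helper: URLs for pages 1..npages of one category.
function topic_page_urls(category::AbstractString, npages::Integer)
    return [n == 1 ? TOPICS_BASE * category :
                     string(TOPICS_BASE, category, "?page=", n)
            for n in 1:npages]
end

urls = topic_page_urls("domestic", 3)
foreach(println, urls)
```

Each URL could then be passed through the same parsing loop as the first page.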

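Getting the full-text URL from the summary page

The body of a news article is stored under a URL containing "articles", and the HTML of those pages is uniform.
Besides news articles, columns and similar pieces are also listed. Their HTML differs from writer to writer, and their URLs fall outside "articles", so they are not collected.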

news_list_d = []
# Columns and similar pieces use different markup, so only article URLs are kept
effective_url_base = r"^https://news\.yahoo\.co\.jp/articles/"
for ns in news_list_p
    response = HTTP.request("GET", ns["url"])
    ns_code = String(response.body)
    ns_doc = Gumbo.parsehtml(ns_code)
    body = ns_doc.root[2]
    detail_part = eachmatch(Selector("a[data-ual-gotocontent=\"true\"]"), body)
    if length(detail_part) > 0
        detail_url = getattr(detail_part[1], "href", nothing)
        if occursin(effective_url_base, detail_url)
            ns["url"] = detail_url
            push!(news_list_d, ns)
        end
    end
end

(Revised 2022/06/30)
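
The selection and filtering step can be tried offline on a hand-written snippet. This is only a sketch: the HTML below is a made-up stand-in for a summary page, and full_text_url and ARTICLE_PREFIX are hypothetical names that do not appear in the original code. It checks the prefix with startswith on a plain string in place of a regular expression.

```julia
using Gumbo
using Cascadia

# Made-up stand-in for a summary page; the real markup may differ.
html = """
<html><head></head><body>
  <a href="https://news.yahoo.co.jp/articles/abc123" data-ual-gotocontent="true">Read more</a>
</body></html>
"""

const ARTICLE_PREFIX = "https://news.yahoo.co.jp/articles/"

# Hypothetical helper: the full-text URL, or nothing when the page has no usable link.
function full_text_url(doc)
    links = eachmatch(Selector("a[data-ual-gotocontent=\"true\"]"), doc.root)
    isempty(links) && return nothing
    href = getattr(links[1], "href", nothing)
    return (href !== nothing && startswith(href, ARTICLE_PREFIX)) ? href : nothing
end

result = full_text_url(parsehtml(html))
empty_result = full_text_url(parsehtml("<html><body><p>no link</p></body></html>"))
```

Returning nothing for both a missing link and a non-article URL lets the caller skip the entry with a single check.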

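Getting the body text and news source from the full-text page

Next, get the body text. For articles that span multiple pages, only the first page is fetched.
If a news source is present, it is collected as well. If not, an empty string is stored.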

news_list = []
for ns in news_list_d
    response = HTTP.request("GET", ns["url"])
    code = String(response.body)
    doc = Gumbo.parsehtml(code)
    body = doc.root[2]
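    # News source: first <p> directly under the article body, or "" when absent
    source_parts = eachmatch(Selector("article > div.article_body > p"), body)
    source = isempty(source_parts) ? "" : nodeText(source_parts[1])
    # Body text: first <p> inside a child <div> of the article body (first page only)
    detail = nodeText(eachmatch(Selector("article > div.article_body > div > p"), body)[1])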
    ns["detail"] = detail
    ns["source"] = source
    push!(news_list, ns)
end
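
The "empty string when absent" rule can be factored into a small helper. The sketch below runs on a hand-written page, not on real Yahoo! News markup, and first_text is a hypothetical name introduced only for illustration.

```julia
using Gumbo
using Cascadia

# Hypothetical helper: text of the first node matching `sel`, or `default` when nothing matches.
function first_text(sel::AbstractString, root; default = "")
    nodes = eachmatch(Selector(sel), root)
    return isempty(nodes) ? default : nodeText(nodes[1])
end

# Made-up stand-in for an article page that has body text but no source line.
page = parsehtml("""
<html><head></head><body><article><div class="article_body">
<div><p>Body text of the article.</p></div>
</div></article></body></html>
""")

detail = first_text("article > div.article_body > div > p", page.root)
# No <p> sits directly under div.article_body here, so the default is returned.
source = first_text("article > div.article_body > p", page.root)
```

Using the same helper for the body text would also avoid an error on pages where the body selector matches nothing.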

The collected data is saved to a file in JSON format. When the body text contains line breaks, CSV or TSV becomes awkward to handle, so JSON is used instead.

# Save (the content is JSON, although the file name ends in .txt)
filename = "yahoo" * Dates.format(now(), "yyyymmddHHMMSS") * ".txt"
open(filename, "w") do f
    println(f, json(news_list))
end
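
The reason JSON copes with line breaks can be seen in a small round trip. This is a sketch with made-up sample data, not output from the scraper.

```julia
using JSON

# A record whose body text contains a line break (made-up sample data).
record = Dict("title" => "Sample", "detail" => "first line\nsecond line")

encoded = json([record])        # the line break is escaped inside the JSON string
decoded = JSON.parse(encoded)   # back to Julia values

has_raw_newline = occursin('\n', encoded)
same_detail = decoded[1]["detail"] == record["detail"]
```

Because the line break is escaped, the whole list fits on one line of the file and the original text comes back unchanged on parsing. A saved file can be read back in the same way with JSON.parsefile.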

Summary

The code above is collected into a single notebook (ipynb format), published at the repository below.

julialangjp/ScrapingYahooNews_jl (GitHub)