Scrapy: "load More Result" Pages
I was trying to write the following Scrapy script to scrape items from the following website. I was able to scrape the first page's items, but there are about 2000 more pages that I want s
Solution 1:
Each click on "LOAD MORE RESULTS" returns a JavaScript response with a JSON-like object inside:
if (typeof addMoreNewsResults == 'function') {
  addMoreNewsResults( {
    blob: 'National+Health+Investors%2C+Inc.',
    sortBy: 'relevance',
    dateRange: 'all',
    totalResultNumber: 1970,
    totalResultNumberStr: "1,970",
    news: [
      {
        id: "-pbm-push-idUSKBN1DG2CP",
        headline: "Diplomat Pharmacy plunges as <b>investors<\/b> fret over rapid PBM push",
        date: "November 16, 2017 11:22am EST",
        href: "/article/us-diplomat-stocks/diplomat-pharmacy-plunges-as-investors-fret-over-rapid-pbm-push-idUSKBN1DG2CP",
        blurb: "...(Reuters) - Shares of Diplomat Pharmacy <b>Inc<\/b> <DPLO.N> tumbled 20... <b>National<\/b> Pharmaceutical Services.\nSome analysts were not excited...",
        mainPicUrl: ""
      },
      {....
So you need to use a different parsing mechanism to get the information you want (import json, json.loads(), etc.).
There is a much easier way. You can get everything in one request (just change the numResultsToShow param to a value large enough to cover all results):
https://www.reuters.com/assets/searchArticleLoadMoreJson?blob=National+Health+Investors%2C+Inc.&bigOrSmall=big&articleWithBlog=true&sortBy=&dateRange=&numResultsToShow=2000&pn=1&callback=addMoreNewsResults
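If you want to vary the search term or the result count, the same URL can be assembled from its query parameters instead of being hard-coded. The following is only a sketch: the parameter names and values are copied from the URL above, while the helper name build_search_url and its arguments are made up for illustration, and it assumes the endpoint still accepts these parameters.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.reuters.com/assets/searchArticleLoadMoreJson"


def build_search_url(blob, num_results=2000, page=1):
    """Hypothetical helper: rebuild the "load more" URL from its parameters."""
    params = {
        "blob": blob,                      # search phrase, URL-encoded by urlencode
        "bigOrSmall": "big",
        "articleWithBlog": "true",
        "sortBy": "",
        "dateRange": "",
        "numResultsToShow": num_results,   # raise this to fetch everything at once
        "pn": page,
        "callback": "addMoreNewsResults",  # name of the JavaScript wrapper function
    }
    return BASE_URL + "?" + urlencode(params)


url = build_search_url("National Health Investors, Inc.")
print(url)
```

With the defaults, this produces the same URL as the one shown above.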
UPDATE
# -*- coding: utf-8 -*-
import json
import re

import scrapy


class ReutersSpider(scrapy.Spider):
    name = "reuters"
    start_urls = [
        'https://www.reuters.com/assets/searchArticleLoadMoreJson?blob=National+Health+Investors%2C+Inc.&bigOrSmall=big&articleWithBlog=true&sortBy=&dateRange=&numResultsToShow=2000&pn=1&callback=addMoreNewsResults',
    ]

    def parse(self, response):
        # Grab the object literal passed to addMoreNewsResults( ... );
        # response.text is the decoded body (response.body is bytes on Python 3)
        json_string = re.search(r'addMoreNewsResults\((.+?) \);', response.text, re.DOTALL).group(1)

        # The code below transforms the JavaScript-ish, JSON-like structure into JSON
        # 1. Quote the bare keys at the start of each line
        json_string = re.sub(r'^\s*(\w+):', r'"\1":', json_string, flags=re.MULTILINE)
        # 2. Quote bare values at the end of a line (1970 becomes the string "1970")
        json_string = re.sub(r'(\w+),\s*$', r'"\1",', json_string, flags=re.MULTILINE)
        # 3. Turn single-quoted values into double-quoted ones
        json_string = re.sub(r':\s*\'(.+?)\',\s*$', r': "\1",', json_string, flags=re.MULTILINE)

        results = json.loads(json_string)
        for result in results["news"]:
            item = {}
            item["href"] = result["href"]
            item["date"] = result["date"]
            yield item
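To check the conversion without hitting the site, the same three substitutions can be run on a saved response. This is a rough sketch: SAMPLE is a shortened copy of the response shown earlier, its closing lines ("} );") are an assumption about how the real response ends, and the function name extract_items is made up for illustration.

```python
import json
import re

# Shortened copy of the response above; the closing lines are assumed.
SAMPLE = """if (typeof addMoreNewsResults == 'function') {
  addMoreNewsResults( {
    blob: 'National+Health+Investors%2C+Inc.',
    sortBy: 'relevance',
    dateRange: 'all',
    totalResultNumber: 1970,
    totalResultNumberStr: "1,970",
    news: [
      {
        id: "-pbm-push-idUSKBN1DG2CP",
        date: "November 16, 2017 11:22am EST",
        href: "/article/us-diplomat-stocks/diplomat-pharmacy-plunges-as-investors-fret-over-rapid-pbm-push-idUSKBN1DG2CP",
        mainPicUrl: ""
      }
    ]
  } );
}"""


def extract_items(body):
    """Hypothetical helper: same steps as parse(), applied to a plain string."""
    # Pull out the object literal passed to addMoreNewsResults( ... );
    json_string = re.search(r'addMoreNewsResults\((.+?) \);', body, re.DOTALL).group(1)
    # Quote bare keys, quote bare values, then swap single quotes for double quotes
    json_string = re.sub(r'^\s*(\w+):', r'"\1":', json_string, flags=re.MULTILINE)
    json_string = re.sub(r'(\w+),\s*$', r'"\1",', json_string, flags=re.MULTILINE)
    json_string = re.sub(r':\s*\'(.+?)\',\s*$', r': "\1",', json_string, flags=re.MULTILINE)
    results = json.loads(json_string)
    return [{"href": r["href"], "date": r["date"]} for r in results["news"]]


items = extract_items(SAMPLE)
print(items)
```

Note that the second substitution turns bare numbers into strings, so totalResultNumber comes back as "1970" rather than 1970. The regexes also rely on each key sitting on its own line, so they may need adjusting if the response layout differs.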