
Scrapy-splash Active Content Selector Works In Shell But Not With Spider

I just started using scrapy-splash to retrieve the number of bookings from opentable.com. The following works fine in the shell: $ scrapy shell 'http://localhost:8050/render.html?u

Solution 1:

I think the problem is in your middlewares. First, add the following settings:

# settings.py
# Uncomment `DOWNLOADER_MIDDLEWARES` and add these entries to it
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

# URL of the Splash server
SPLASH_URL = 'http://localhost:8050'

# and some Splash-specific settings
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'

Then start Splash with Docker:

sudo docker run -it -p 8050:8050 scrapinghub/splash --disable-private-mode

After these steps, running the spider gives me:

scrapy crawl opentable

...

2018-06-23 11:23:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.opentable.com/new-york-restaurant-listings via http://localhost:8050/render.html> (referer: None)
2018-06-23 11:23:54 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.opentable.com/new-york-restaurant-listings>
{'bookings': [
    'Booked 44 times today',
    'Booked 24 times today',
    'and many others Booked values'
]}
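
The `via http://localhost:8050/render.html` part of the log shows that the request went through Splash's `render.html` endpoint instead of hitting the site directly. As a rough sketch of what such a URL looks like when built by hand (useful for testing in `scrapy shell`), assuming Splash's `url` and `wait` query parameters; the helper name `splash_render_url` and the default `wait` of 2 seconds are my own choices, not something from the question:

```python
from urllib.parse import urlencode


def splash_render_url(target_url, splash_url="http://localhost:8050", wait=2.0):
    """Build a Splash render.html URL for target_url (hypothetical helper)."""
    # `url` is the page Splash should load; `wait` is how many seconds Splash
    # lets the page's JavaScript run before returning the rendered HTML.
    # The 2.0 default is an assumption; tune it for the page you scrape.
    query = urlencode({"url": target_url, "wait": wait})
    return f"{splash_url.rstrip('/')}/render.html?{query}"


if __name__ == "__main__":
    print(splash_render_url("https://www.opentable.com/new-york-restaurant-listings"))
```

If the selector works against a URL like this in the shell but not in the spider, the spider is most likely fetching the page without going through Splash.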

Solution 2:

This does not work because that part of the page is rendered with JavaScript.

There are several ways to handle it:

1) Use Selenium.

2) Look at the API the page itself calls. If you request the URL <GET https://www.opentable.com/injector/stats/v1/restaurants/<restaurant_id>/reservations>, you get the current number of reservations for that specific restaurant (restaurant_id).
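
A minimal sketch of option 2 using only the standard library. The URL pattern is the one given above; the function names and the example ID are hypothetical, and since I have not checked the response format (or whether the endpoint is still available), the sketch just returns the raw body without parsing it:

```python
from urllib.parse import quote
from urllib.request import urlopen

# Base path taken from the URL pattern in this answer.
BASE = "https://www.opentable.com/injector/stats/v1/restaurants"


def reservations_url(restaurant_id):
    """Return the reservations URL for one restaurant (hypothetical helper)."""
    # Percent-encode the ID so it cannot break out of its path segment.
    return f"{BASE}/{quote(str(restaurant_id), safe='')}/reservations"


def fetch_reservations_raw(restaurant_id, timeout=10):
    """Fetch the raw response body; its format is not assumed here."""
    with urlopen(reservations_url(restaurant_id), timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    # 12345 is a made-up ID for illustration only.
    print(reservations_url(12345))
```

This avoids rendering JavaScript at all, but it depends on an undocumented endpoint that may change or require extra headers.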
