Can't Get Scrapy To Parse And Follow 301, 302 Redirects

August 20, 2024

I'm trying to write a very simple website crawler that lists URLs along with their referrer and status code, for HTTP statuses 200, 301, 302 and 404. Turns out that Scrapy works great a

Solution 1:

If you want to parse 301 and 302 responses and follow them at the same time, ask for 301 and 302 to be processed by your callback and mimic the behavior of RedirectMiddleware.

Test 1 (not working)

Let's illustrate with a simple spider to start with (not working as you intend yet):

import scrapy


class HandleSpider(scrapy.Spider):
    name = "handle"
    start_urls = (
        'https://httpbin.org/get',
        'https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F',
    )

    def parse(self, response):
        self.logger.info("got response for %r" % response.url)

Right now, the spider asks for 2 pages, and the 2nd one should redirect to http://example.com/:

$ scrapy runspider test.py
2016-09-30 11:28:17 [scrapy] INFO: Scrapy 1.1.3 started (bot: scrapybot)
2016-09-30 11:28:18 [scrapy] DEBUG: Crawled (200) <GET https://httpbin.org/get> (referer: None)
2016-09-30 11:28:18 [scrapy] DEBUG: Redirecting (302) to <GET http://example.com/> from <GET https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F>
2016-09-30 11:28:18 [handle] INFO: got response for 'https://httpbin.org/get'
2016-09-30 11:28:18 [scrapy] DEBUG: Crawled (200) <GET http://example.com/> (referer: None)
2016-09-30 11:28:18 [handle] INFO: got response for 'http://example.com/'
2016-09-30 11:28:18 [scrapy] INFO: Spider closed (finished)

The 302 is handled by RedirectMiddleware automatically and it does not get passed to your callback.

Test 2 (still not quite right)

Let's configure the spider to handle 301 and 302 responses in the callback, using handle_httpstatus_list:

import scrapy


class HandleSpider(scrapy.Spider):
    name = "handle"
    start_urls = (
        'https://httpbin.org/get',
        'https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F',
    )
    handle_httpstatus_list = [301, 302]

    def parse(self, response):
        self.logger.info("got response %d for %r" % (response.status, response.url))

Let's run it:

$ scrapy runspider test.py
2016-09-30 11:33:32 [scrapy] INFO: Scrapy 1.1.3 started (bot: scrapybot)
2016-09-30 11:33:32 [scrapy] DEBUG: Crawled (200) <GET https://httpbin.org/get> (referer: None)
2016-09-30 11:33:32 [scrapy] DEBUG: Crawled (302) <GET https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F> (referer: None)
2016-09-30 11:33:33 [handle] INFO: got response 200 for 'https://httpbin.org/get'
2016-09-30 11:33:33 [handle] INFO: got response 302 for 'https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F'
2016-09-30 11:33:33 [scrapy] INFO: Spider closed (finished)

Here, the 302 response reaches the callback, but we're missing the redirection: http://example.com/ is never requested.
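
To see why the redirect disappears, it can help to model the decision RedirectMiddleware appears to make. The sketch below is a simplified, stand-alone paraphrase and not Scrapy's actual code: the function name middleware_follows_redirect, its arguments, and the exact set of redirect status codes are assumptions made for illustration, and details such as the Location header check are left out.

```python
# Assumed set of statuses treated as redirects; older Scrapy versions may differ.
REDIRECT_STATUSES = (301, 302, 303, 307, 308)


def middleware_follows_redirect(status, spider_handled=(), meta=None):
    """Simplified model: would the redirect middleware follow this response?

    Hypothetical helper for illustration only. It does not import Scrapy.
    """
    meta = meta or {}
    if status not in REDIRECT_STATUSES:
        return False  # not a redirect, nothing to follow
    if meta.get("dont_redirect", False):
        return False  # redirect handling switched off for this request
    if status in spider_handled or status in meta.get("handle_httpstatus_list", ()):
        return False  # the spider asked to receive this status itself
    return True


# Test 1: nothing configured, so the 302 is followed behind the scenes.
print(middleware_follows_redirect(302))  # True
# Test 2: handle_httpstatus_list = [301, 302], so the 302 goes to parse() instead.
print(middleware_follows_redirect(302, spider_handled=[301, 302]))  # False
```

In other words, once a status is listed in handle_httpstatus_list, following the redirect becomes the callback's job.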

Test 3 (working)

Do the same as RedirectMiddleware but in the spider callback:

from urllib.parse import urljoin

import scrapy


class HandleSpider(scrapy.Spider):
    name = "handle"
    start_urls = (
        'https://httpbin.org/get',
        'https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F',
    )
    handle_httpstatus_list = [301, 302]

    def parse(self, response):
        self.logger.info("got response %d for %r" % (response.status, response.url))

        # do something with the response here...

        # handle redirection
        # this is copied/adapted from RedirectMiddleware
        if response.status >= 300 and response.status < 400:

            # HTTP header is ascii or latin1, redirected url will be percent-encoded utf-8
            # (on Python 3 the decoded value is already a str, so no extra conversion is needed)
            location = response.headers['location'].decode('latin1')

            # get the original request
            request = response.request
            # and the URL we got redirected to
            redirected_url = urljoin(request.url, location)

            if response.status in (301, 307) or request.method == 'HEAD':
                # keep the original method and body, only change the URL
                redirected = request.replace(url=redirected_url)
                yield redirected
            else:
                # switch to a GET without a body and drop the body-related headers
                redirected = request.replace(url=redirected_url, method='GET', body='')
                redirected.headers.pop('Content-Type', None)
                redirected.headers.pop('Content-Length', None)
                yield redirected

And run the spider again:

$ scrapy runspider test.py
2016-09-30 11:45:20 [scrapy] INFO: Scrapy 1.1.3 started (bot: scrapybot)
2016-09-30 11:45:21 [scrapy] DEBUG: Crawled (302) <GET https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F> (referer: None)
2016-09-30 11:45:21 [scrapy] DEBUG: Crawled (200) <GET https://httpbin.org/get> (referer: None)
2016-09-30 11:45:21 [handle] INFO: got response 302 for 'https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F'
2016-09-30 11:45:21 [handle] INFO: got response 200 for 'https://httpbin.org/get'
2016-09-30 11:45:21 [scrapy] DEBUG: Crawled (200) <GET http://example.com/> (referer: https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F)
2016-09-30 11:45:21 [handle] INFO: got response 200 for 'http://example.com/'
2016-09-30 11:45:21 [scrapy] INFO: Spider closed (finished)

We got redirected to http://example.com/ and we also got the 302 response through our callback.
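
If you want to check the redirect rules without running a crawl, the branch in the Test 3 callback can be pulled out into a plain function. This is only a sketch under assumptions: plan_redirect and its return shape (url, method, keep_body) are hypothetical names, and it covers just the URL and method logic, not the header cleanup.

```python
from urllib.parse import urljoin


def plan_redirect(request_url, method, status, location_header):
    """Return (url, method, keep_body) for the follow-up request.

    Hypothetical helper mirroring the branch in the Test 3 callback.
    location_header is the raw Location header value as bytes.
    """
    # Header bytes are ASCII or latin1; decoding as latin1 never fails.
    location = location_header.decode("latin1")
    # Resolve relative targets such as "/c" against the request URL.
    target = urljoin(request_url, location)
    if status in (301, 307) or method == "HEAD":
        # Same method and body, only the URL changes.
        return target, method, True
    # Otherwise switch to a GET without a body.
    return target, "GET", False


print(plan_redirect("https://httpbin.org/redirect-to", "GET", 302, b"http://example.com/"))
# ('http://example.com/', 'GET', False)
print(plan_redirect("https://example.com/a/b", "POST", 307, b"/c"))
# ('https://example.com/c', 'POST', True)
```

Keeping this logic in a separate function makes it easy to unit-test relative Location values and method changes before wiring it into the spider.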
