Can't Get Scrapy To Parse And Follow 301, 302 Redirects
Solution 1:
If you want to parse 301 and 302 responses and follow them at the same time, ask for 301 and 302 to be processed by your callback, then mimic the behavior of RedirectMiddleware inside that callback.
Test 1 (not working)
Let's illustrate with a simple spider to start with (not working as you intend yet):
import scrapy


class HandleSpider(scrapy.Spider):
    name = "handle"
    start_urls = (
        'https://httpbin.org/get',
        'https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F',
    )

    def parse(self, response):
        self.logger.info("got response for %r" % response.url)
Right now, the spider asks for 2 pages, and the 2nd one should redirect to http://example.com/:
$ scrapy runspider test.py
2016-09-30 11:28:17 [scrapy] INFO: Scrapy 1.1.3 started (bot: scrapybot)
2016-09-30 11:28:18 [scrapy] DEBUG: Crawled (200) <GET https://httpbin.org/get> (referer: None)
2016-09-30 11:28:18 [scrapy] DEBUG: Redirecting (302) to <GET http://example.com/> from <GET https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F>
2016-09-30 11:28:18 [handle] INFO: got response for 'https://httpbin.org/get'
2016-09-30 11:28:18 [scrapy] DEBUG: Crawled (200) <GET http://example.com/> (referer: None)
2016-09-30 11:28:18 [handle] INFO: got response for 'http://example.com/'
2016-09-30 11:28:18 [scrapy] INFO: Spider closed (finished)
The 302 is handled by RedirectMiddleware automatically and it does not get passed to your callback.
Test 2 (still not quite right)
Let's configure the spider to handle 301 and 302 responses in the callback, using handle_httpstatus_list:
import scrapy


class HandleSpider(scrapy.Spider):
    name = "handle"
    start_urls = (
        'https://httpbin.org/get',
        'https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F',
    )
    handle_httpstatus_list = [301, 302]

    def parse(self, response):
        self.logger.info("got response %d for %r" % (response.status, response.url))
Let's run it:
$ scrapy runspider test.py
2016-09-30 11:33:32 [scrapy] INFO: Scrapy 1.1.3 started (bot: scrapybot)
2016-09-30 11:33:32 [scrapy] DEBUG: Crawled (200) <GET https://httpbin.org/get> (referer: None)
2016-09-30 11:33:32 [scrapy] DEBUG: Crawled (302) <GET https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F> (referer: None)
2016-09-30 11:33:33 [handle] INFO: got response 200 for 'https://httpbin.org/get'
2016-09-30 11:33:33 [handle] INFO: got response 302 for 'https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F'
2016-09-30 11:33:33 [scrapy] INFO: Spider closed (finished)
Here, the 302 response reaches the callback, but we're missing the redirection: http://example.com/ is never requested.
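The 302 response that reached the callback still names its target in the Location header; nothing has followed it yet. As a minimal standard-library sketch (the helper name `resolve_location` and the sample header values are assumptions for illustration, not part of Scrapy), this is roughly how a raw header value can be turned into an absolute URL:

```python
from urllib.parse import urljoin


def resolve_location(request_url, raw_location):
    """Resolve a raw Location header (bytes) against the requesting URL."""
    # latin1 maps every byte to a character, so decoding cannot fail here.
    location = raw_location.decode("latin1")
    # urljoin copes with both absolute and relative Location values.
    return urljoin(request_url, location)


# Hypothetical header values, chosen to match the URLs used in this answer.
absolute = resolve_location(
    "https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F",
    b"http://example.com/",
)
relative = resolve_location("https://httpbin.org/a/b", b"/get")
print(absolute)
print(relative)
```

Resolving against the request URL matters because a server may send a relative Location, which is not usable as a request URL on its own.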
Test 3 (working)
Do the same as RedirectMiddleware but in the spider callback:
from six.moves.urllib.parse import urljoin

import scrapy
from scrapy.utils.python import to_native_str


class HandleSpider(scrapy.Spider):
    name = "handle"
    start_urls = (
        'https://httpbin.org/get',
        'https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F',
    )
    handle_httpstatus_list = [301, 302]

    def parse(self, response):
        self.logger.info("got response %d for %r" % (response.status, response.url))

        # do something with the response here...

        # handle redirection
        # this is copied/adapted from RedirectMiddleware
        if response.status >= 300 and response.status < 400:

            # HTTP header is ascii or latin1, redirected url will be percent-encoded utf-8
            location = to_native_str(response.headers['location'].decode('latin1'))

            # get the original request
            request = response.request
            # and the URL we got redirected to
            redirected_url = urljoin(request.url, location)

            if response.status in (301, 307) or request.method == 'HEAD':
                redirected = request.replace(url=redirected_url)
                yield redirected
            else:
                redirected = request.replace(url=redirected_url, method='GET', body='')
                redirected.headers.pop('Content-Type', None)
                redirected.headers.pop('Content-Length', None)
                yield redirected
And run the spider again:
$ scrapy runspider test.py
2016-09-30 11:45:20 [scrapy] INFO: Scrapy 1.1.3 started (bot: scrapybot)
2016-09-30 11:45:21 [scrapy] DEBUG: Crawled (302) <GET https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F> (referer: None)
2016-09-30 11:45:21 [scrapy] DEBUG: Crawled (200) <GET https://httpbin.org/get> (referer: None)
2016-09-30 11:45:21 [handle] INFO: got response 302 for 'https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F'
2016-09-30 11:45:21 [handle] INFO: got response 200 for 'https://httpbin.org/get'
2016-09-30 11:45:21 [scrapy] DEBUG: Crawled (200) <GET http://example.com/> (referer: https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F)
2016-09-30 11:45:21 [handle] INFO: got response 200 for 'http://example.com/'
2016-09-30 11:45:21 [scrapy] INFO: Spider closed (finished)
We got redirected to http://example.com/ and we also got the 302 response through our callback.
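The branch in the callback decides whether the follow-up request keeps its method and body or is downgraded to a plain GET. To make that rule easier to see in isolation, here is a sketch that applies the same condition to a plain dictionary instead of a Scrapy request. The function name `follow_redirect`, the dictionary layout, and the sample POST request are all hypothetical and exist only for this illustration.

```python
from urllib.parse import urljoin

# Headers that describe a request body and are dropped along with it.
ENTITY_HEADERS = ("Content-Type", "Content-Length")


def follow_redirect(request, status, location):
    """Build the follow-up request for a redirect, using the callback's rule."""
    redirected = dict(request, url=urljoin(request["url"], location))
    if status in (301, 307) or request["method"] == "HEAD":
        # Only the URL changes; method, body and headers are kept.
        return redirected
    # Any other redirect status becomes a body-less GET.
    redirected["method"] = "GET"
    redirected["body"] = ""
    redirected["headers"] = {
        name: value
        for name, value in request["headers"].items()
        if name not in ENTITY_HEADERS
    }
    return redirected


# Hypothetical request used only to exercise both branches.
post = {
    "url": "https://httpbin.org/post",
    "method": "POST",
    "body": "a=1",
    "headers": {
        "Content-Type": "application/x-www-form-urlencoded",
        "Content-Length": "3",
        "Accept": "*/*",
    },
}
after_302 = follow_redirect(post, 302, "/get")
after_307 = follow_redirect(post, 307, "/get")
print(after_302["method"], after_302["headers"])
print(after_307["method"], after_307["body"])
```

In the spider itself, only 301 and 302 reach this branch, because those are the only statuses listed in handle_httpstatus_list; other redirect statuses are still handled by RedirectMiddleware.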