Scrapy - Importing Excel .csv As Start_url
Solution 1:
Just generating a list for start_urls does not work by itself, as the Scrapy documentation makes clear.
From the documentation:

"You start by generating the initial Requests to crawl the first URLs, and specify a callback function to be called with the response downloaded from those requests. The first requests to perform are obtained by calling the start_requests() method which (by default) generates Request for the URLs specified in the start_urls and the parse method as callback function for the Requests."
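In other words, the default behaviour is roughly equivalent to this sketch (simplified, not Scrapy's actual source; the spider name and URL are hypothetical):

import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = ["https://example.com"]  # hypothetical URL

    # Sketch of what the default start_requests() does: each entry in
    # start_urls becomes a Request with self.parse as the callback.
    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        pass  # handle the downloaded response here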
I would rather do it this way:
import csv

import scrapy


def get_urls_from_csv():
    # newline='' is how the csv module expects files to be opened in Python 3
    with open('websites.csv', newline='') as csv_file:
        data = csv.reader(csv_file)
        scrapurls = []
        for row in data:
            # each row is a list of columns; the URL is in the first column
            scrapurls.append(row[0])
        return scrapurls


class DanishSpider(scrapy.Spider):
    ...

    def start_requests(self):
        return [scrapy.http.Request(url=start_url) for start_url in get_urls_from_csv()]
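For reference, the code above assumes websites.csv holds one URL in the first column of each row, something like this (hypothetical contents):

https://example.com/page-one
https://example.com/page-two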
Solution 2:
I find the following approach useful when needed:
import csv

import scrapy


class DanishSpider(scrapy.Spider):
    name = "rei"

    # runs at class-definition time and fills start_urls from the CSV;
    # the CSV is expected to have a header row with a 'Link' column
    with open("output.csv", "r") as f:
        reader = csv.DictReader(f)
        start_urls = [item['Link'] for item in reader]

    def parse(self, response):
        yield {"link": response.url}
Solution 3:
Try opening the .csv file inside the class (not outside, as you did before) and appending to start_urls. This solution worked for me. Hope this helps :-)
import scrapy


class DanishSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = []
    start_urls = []

    f = open('websites.csv', 'r')
    for i in f:
        u = i.split('\n')          # drop the trailing newline from each line
        start_urls.append(u[0])
    f.close()
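The same idea with a with block, so the file handle is closed automatically (a sketch, not part of the original answer):

class DanishSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = []

    # the file is read once, at class-definition time
    with open('websites.csv', 'r') as f:
        start_urls = [line.strip() for line in f]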
Solution 4:
for row in data:
    scrapurls.append(row)

row is a list: [column1, column2, ...]. So I think you need to extract the columns and append them to your start_urls.

for row in data:
    # if every column in the row is a URL string
    for column in row:
        scrapurls.append(column)
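Combined with the CSV reading from Solution 1, the whole thing looks roughly like this (websites.csv is an assumed filename):

import csv

scrapurls = []
with open('websites.csv', newline='') as csv_file:
    data = csv.reader(csv_file)
    for row in data:
        # treat every column of every row as a URL
        for column in row:
            scrapurls.append(column)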
Solution 5:
Try it this way as well:
filee = open("filename.csv", "r")
r = [i for i in filee]
# remove the '\n' (newline) from each url
start_urls = [r[j].replace('\n', '') for j in range(len(r))]
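The same result in a slightly more compact form (a sketch, not from the original answer):

with open("filename.csv") as filee:
    start_urls = [line.strip() for line in filee]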