Scrapy - Importing Excel .csv As Start_url

September 16, 2024

So I'm building a scraper that imports an Excel .csv file containing one row of ~2,400 websites (each website in its own column) and uses these as the start_urls. I keep getting t

Solution 1:

Just generating a list for start_urls does not work; this is explained in the Scrapy documentation.

From documentation:

You start by generating the initial Requests to crawl the first URLs, and specify a callback function to be called with the response downloaded from those requests.
The first requests to perform are obtained by calling the start_requests() method which (by default) generates Request for the URLs specified in the start_urls and the parse method as callback function for the Requests.

I would rather do it this way:

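import csv
import scrapy


def get_urls_from_csv():
    # newline='' is the file mode the csv module expects in Python 3
    with open('websites.csv', newline='') as csv_file:
        data = csv.reader(csv_file)
        scrapurls = []
        for row in data:
            # each row is a list of columns, so add its cells one by one
            scrapurls.extend(row)
        return scrapurls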


class DanishSpider(scrapy.Spider):

    ...

    def start_requests(self):
        return [scrapy.http.Request(url=start_url) for start_url in get_urls_from_csv()]

Solution 2:

I find the following useful when in need:

import csv
import scrapy

class DanishSpider(scrapy.Spider):
    name = "rei"

    # Read the start URLs from the 'Link' column when the class is defined
    with open("output.csv", "r") as f:
        reader = csv.DictReader(f)
        start_urls = [item['Link'] for item in reader]

    def parse(self, response):
        yield {"link": response.url}
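
To check the column lookup without launching a crawl, the DictReader step can be tried on an in-memory file. This is only a minimal sketch: the helper name `links_from_csv` and the sample URLs are made up for illustration, and it assumes the CSV has a header row with a `Link` column, as in the solution above.

```python
import csv
import io


def links_from_csv(csv_file, column="Link"):
    """Return the non-empty values of one named column from an open CSV file."""
    reader = csv.DictReader(csv_file)
    # row.get() returns None for a missing column or a short row, so those are skipped
    return [row[column].strip() for row in reader if row.get(column)]


# Hypothetical sample data standing in for output.csv
sample = io.StringIO("Name,Link\nA,http://example.com/a\nB,http://example.org/b\n")
urls = links_from_csv(sample)
print(urls)
```

With a real file, the same helper could be called inside `with open("output.csv", newline="") as f:` and its result assigned to `start_urls`.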

Solution 3:

Try opening the .csv file inside the class (not outside, as you have done before) and appending each line to start_urls. This solution worked for me. Hope this helps :-)

    class DanishSpider(scrapy.Spider):
        name = "dmoz"
        allowed_domains = []
        start_urls = []

        f = open('websites.csv', 'r')
        for i in f:
            # drop the trailing newline from each line
            u = i.split('\n')
            start_urls.append(u[0])
        f.close()

Solution 4:

for row in data:
    scrapurls.append(row)

row is a list [column1, column2, ...], so I think you need to extract the columns and append each one to your start_urls.

for row in data:
    # if every column is a URL string
    for column in row:
        scrapurls.append(column)
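
The difference between appending rows and appending columns can be checked on a small in-memory file. The sketch below is illustrative only: `flatten_rows` is a hypothetical helper and the URLs are placeholders. It assumes every non-empty cell is a URL, as in the question's one-row file.

```python
import csv
import io


def flatten_rows(csv_file):
    """Collect every non-empty cell of a CSV file into one flat list of strings."""
    urls = []
    for row in csv.reader(csv_file):
        for column in row:
            cell = column.strip()
            if cell:  # skip empty cells, e.g. from a trailing comma
                urls.append(cell)
    return urls


# Hypothetical one-row file: each website sits in its own column
sample = io.StringIO("http://example.com,http://example.org,http://example.net,\n")
flat = flatten_rows(sample)
print(flat)
```

Each element of the result is a plain string rather than a list, which is what a request URL needs to be.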

Solution 5:

Try this way also,

filee = open("filename.csv","r+")

# Removing the \n 'new line' from the url
r=[i for i in filee]
start_urls=[r[j].replace('\n','') for j in range(len(r))]
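
A slightly more defensive variant of the same idea is sketched here, assuming the file holds one URL per line. The helper name `urls_from_lines` and the sample contents are hypothetical. It uses `strip()` so that `\r\n` line endings and blank lines are handled too.

```python
import io


def urls_from_lines(lines):
    """Strip surrounding whitespace from each line and drop blank lines."""
    return [line.strip() for line in lines if line.strip()]


# Hypothetical file contents; with a real file, pass the object from
# `with open("filename.csv") as f:` so the file is closed afterwards.
sample = io.StringIO("http://example.com\r\nhttp://example.org\n\n")
start_urls = urls_from_lines(sample)
print(start_urls)
```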
