
Crawl A Full Domain And Load All H1 Into An Item

I am relatively new to Python and Scrapy. What I want to achieve is to crawl a number of websites, mainly company websites: crawl the full domain, extract all the h1, h2 and h3 headings, and create one item per domain that holds all of them.

Solution 1:

Making the yield statement "keep going until the next domain is up" cannot be done: requests are processed in parallel, and there is no way to make Scrapy crawl the domains serially.

What you can do is write a pipeline that accumulates the headings per domain and yields the entire structure in close_spider, something like this:

from scrapy.item import Item, Field

# This assumes your item looks like the following:
class MyItem(Item):
    domain = Field()
    hs = Field()


import collections

class DomainPipeline(object):

    # One set of headings per domain, updated by every processed item
    accumulator = collections.defaultdict(set)

    def process_item(self, item, spider):
        self.accumulator[item['domain']].update(item['hs'])
        return item

    def close_spider(self, spider):
        for domain, hs in self.accumulator.items():
            yield MyItem(domain=domain, hs=hs)
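
The answer never shows the spider side; as a minimal sketch, here is one way to feed such a pipeline. The spider name, allowed domain, start URL and the myproject.items import path are all placeholders, and it assumes a CrawlSpider that follows every in-domain link and yields one small MyItem per page:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

from myproject.items import MyItem  # placeholder path to the item defined above

class HeadingsSpider(CrawlSpider):
    name = 'headings'                    # placeholder
    allowed_domains = ['example.com']    # placeholder
    start_urls = ['http://example.com/']

    # Follow every in-domain link and run parse_page on each response
    rules = (Rule(LinkExtractor(), callback='parse_page', follow=True),)

    def parse_page(self, response):
        # Grab the text of every h1, h2 and h3 on the page
        hs = response.xpath('//h1/text() | //h2/text() | //h3/text()').extract()
        yield MyItem(domain=self.allowed_domains[0], hs=hs)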

usage:

>>> from scrapy.item import Item, Field
>>> class MyItem(Item):
...     domain = Field()
...     hs = Field()
... 
>>> from collections import defaultdict
>>> accumulator = defaultdict(set)
>>> items = []
>>> for i in range(10):
...     items.append(MyItem(domain='google.com', hs=[str(i)]))
... 
>>> items
[{'domain': 'google.com', 'hs': ['0']}, {'domain': 'google.com', 'hs': ['1']}, {'domain': 'google.com', 'hs': ['2']}, {'domain': 'google.com', 'hs': ['3']}, {'domain': 'google.com', 'hs': ['4']}, {'domain': 'google.com', 'hs': ['5']}, {'domain': 'google.com', 'hs': ['6']}, {'domain': 'google.com', 'hs': ['7']}, {'domain': 'google.com', 'hs': ['8']}, {'domain': 'google.com', 'hs': ['9']}]
>>> for item in items:
...     accumulator[item['domain']].update(item['hs'])
... 
>>> accumulator
defaultdict(<type 'set'>, {'google.com': set(['1', '0', '3', '2', '5', '4', '7', '6', '9', '8'])})
>>> for domain, hs in accumulator.items():
...     print MyItem(domain=domain, hs=hs)
... 
{'domain': 'google.com',
 'hs': set(['0', '1', '2', '3', '4', '5', '6', '7', '8', '9'])}
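
Finally, a pipeline only runs if it is enabled in the project's settings.py. A minimal sketch, assuming DomainPipeline lives in a (placeholder) myproject.pipelines module:

# settings.py -- 'myproject' is a placeholder package name
ITEM_PIPELINES = {
    'myproject.pipelines.DomainPipeline': 300,  # 300 is an arbitrary order value
}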
