Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
I am relatively new to Python and Scrapy. What I want to achieve is to crawl a number of websites, mainly company websites: crawl each full domain and extract all the h1, h2, and h3 headings. For each domain I want to create one record containing the domain name and a single string with all the h1/h2/h3 text from that domain. Basically, a domain item plus one large string containing all the headers.

I would like the output to be DOMAIN, STRING(h1,h2,h3), aggregated from all the URLs on that domain.

The problem I have is that each URL ends up in a separate item. I know I haven't gotten very far, but a hint in the right direction would be much appreciated. Essentially: how do I create an outer loop so that the yield statement keeps accumulating until the next domain is up?

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.loader import XPathItemLoader
from Autotask_Prospecting.items import WebsiteItem


class MySpider(CrawlSpider):
    name = 'Webcrawler'
    allowed_domains = [ l.strip() for l in open('Domains.txt').readlines() ]
    start_urls = [ l.strip() for l in open('start_urls.txt').readlines() ]


    rules = (
        # Follow every link on the site and parse each page with parse_item.
        # follow=True is needed here: a Rule that has a callback does NOT
        # follow links by default, so without it only the start pages
        # would be crawled.
        Rule(SgmlLinkExtractor(), callback='parse_item', follow=True),
        )

    def parse_item(self, response):
        loader = XPathItemLoader(item=WebsiteItem(), response=response)
        loader.add_xpath('h1', "//h1/text()")
        loader.add_xpath('h2', "//h2/text()")
        loader.add_xpath('h3', "//h3/text()")
        yield loader.load_item()
1 Answer

yield statement keeps going until the next domain is up.

This cannot be done: requests are processed in parallel, and there is no way to make Scrapy crawl one domain at a time.

What you can do is write a pipeline that accumulates the headings per domain and emits the combined structure when the spider closes, something like:

# this assumes your item looks like the following
from scrapy.item import Item, Field

class MyItem(Item):
    domain = Field()
    hs = Field()


import collections

class DomainPipeline(object):

    def __init__(self):
        # one set of headings per domain
        self.accumulator = collections.defaultdict(set)

    def process_item(self, item, spider):
        self.accumulator[item['domain']].update(item['hs'])
        return item

    def close_spider(self, spider):
        # note: depending on your Scrapy version, items produced here may not
        # reach the feed exporters; you may need to write them out yourself
        for domain, hs in self.accumulator.items():
            yield MyItem(domain=domain, hs=hs)
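Note that the pipeline keys items by item['domain'], but the spider in the question never fills that field. A minimal sketch of deriving it from the response URL with the standard library (the domain_of helper name is hypothetical, not part of Scrapy):

```python
from urllib.parse import urlparse  # on Python 2: from urlparse import urlparse

def domain_of(url):
    # Return the host part of a URL, so every page on a site
    # maps to the same accumulator key.
    return urlparse(url).netloc

# In parse_item, the field could then be filled with:
#     loader.add_value('domain', domain_of(response.url))
```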

usage:

>>> from scrapy.item import Item, Field
>>> class MyItem(Item):
...     domain = Field()
...     hs = Field()
... 
>>> from collections import defaultdict
>>> accumulator = defaultdict(set)
>>> items = []
>>> for i in range(10):
...     items.append(MyItem(domain='google.com', hs=[str(i)]))
... 
>>> items
[{'domain': 'google.com', 'hs': ['0']}, {'domain': 'google.com', 'hs': ['1']}, {'domain': 'google.com', 'hs': ['2']}, {'domain': 'google.com', 'hs': ['3']}, {'domain': 'google.com', 'hs': ['4']}, {'domain': 'google.com', 'hs': ['5']}, {'domain': 'google.com', 'hs': ['6']}, {'domain': 'google.com', 'hs': ['7']}, {'domain': 'google.com', 'hs': ['8']}, {'domain': 'google.com', 'hs': ['9']}]
>>> for item in items:
...     accumulator[item['domain']].update(item['hs'])
... 
>>> accumulator
defaultdict(<type 'set'>, {'google.com': set(['1', '0', '3', '2', '5', '4', '7', '6', '9', '8'])})
>>> for domain, hs in accumulator.items():
...     print MyItem(domain=domain, hs=hs)
... 
{'domain': 'google.com',
 'hs': set(['0', '1', '2', '3', '4', '5', '6', '7', '8', '9'])}
>>> 
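For Scrapy to actually call the pipeline, it has to be enabled in the project settings. A sketch, assuming the class lives in Autotask_Prospecting/pipelines.py (the package name comes from the question's imports; very old Scrapy versions used a plain list here instead of a dict):

```python
# settings.py: register the pipeline; the number is its ordering priority
# relative to other pipelines (lower runs first).
ITEM_PIPELINES = {
    'Autotask_Prospecting.pipelines.DomainPipeline': 300,
}
```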
