I have been tasked with building a web crawler that downloads all the .pdf files on a given site. The spider runs on my local machine and on Scrapinghub. For some reason, it only downloads some of the PDFs, not all of them. This can be seen by looking at the items in the output JSON.

I have set MEDIA_ALLOW_REDIRECTS = True and tried running it both on Scrapinghub and locally.

Here is my spider:

import scrapy
from scrapy.loader import ItemLoader
from poc_scrapy.items import file_list_Item
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class PdfCrawler(CrawlSpider):
    # loader = ItemLoader(item=file_list_Item())
    downloaded_set = set()  # PDF URLs already sent to the pipeline
    name = 'example'
    allowed_domains = ['www.groton.org']
    start_urls = ['https://www.groton.org']

    rules = (
        Rule(LinkExtractor(allow='www.groton.org'), callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        print('parsing', response)
        pdf_urls = []
        link_urls = []
        other_urls = []
        # print("this is the response", response.text)
        all_href = response.xpath('/html/body//a/@href').extract()

        # classify all links
        for href in all_href:
            if len(href) < 1:
                continue
            if href[-4:] == '.pdf':
                pdf_urls.append(href)
            elif href[0] == '/':
                link_urls.append(href)
            else:
                other_urls.append(href)

        # get the links that have pdfs and send them to the item pipeline
        for pdf in pdf_urls:
            # resolve relative links against the current page URL;
            # absolute links are kept as-is
            if not pdf.startswith('http'):
                new_pdf = response.urljoin(pdf)
            else:
                new_pdf = pdf

            if new_pdf in self.downloaded_set:
                # we have seen it before, don't do anything
                # print('skipping ', new_pdf)
                continue

            loader = ItemLoader(item=file_list_Item())
            self.downloaded_set.add(new_pdf)
            loader.add_value('file_urls', new_pdf)
            loader.add_value('base_url', response.url)
            yield loader.load_item()

settings.py

MEDIA_ALLOW_REDIRECTS = True
BOT_NAME = 'poc_scrapy'

SPIDER_MODULES = ['poc_scrapy.spiders']
NEWSPIDER_MODULE = 'poc_scrapy.spiders'

ROBOTSTXT_OBEY = True


DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'poc_scrapy.middlewares.UserAgentMiddlewareRotator': 400,
}


ITEM_PIPELINES = {
    'scrapy.pipelines.files.FilesPipeline': 1,
}
FILES_STORE = 'pdfs/'

AUTOTHROTTLE_ENABLED = True
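
items.py is not included above; since FilesPipeline requires both a file_urls input field and a files output field on the item, file_list_Item would need to look roughly like this (a minimal sketch, not the exact original file):

import scrapy

class file_list_Item(scrapy.Item):
    # input field read by FilesPipeline: list of URLs to download
    file_urls = scrapy.Field()
    # output field written by FilesPipeline: download results (path, checksum, url)
    files = scrapy.Field()
    # extra field used by the spider: the page the PDF link was found on
    base_url = scrapy.Field()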

Here is a small portion of the output:

    {
        "file_urls": [
            "https://www.groton.org/ftpimages/542/download/download_3402393.pdf"
        ],
        "base_url": [
            "https://www.groton.org/parents/business-office"
        ],
        "files": []
    },

As you can see, the PDF file is listed in file_urls but was not downloaded. There are 5 warning messages indicating that some of the files could not be downloaded, but more than 20 files are missing.

Here are the warning messages I get for some of the files:

[scrapy.pipelines.files] File (code: 301): Error downloading file from <GET http://groton.myschoolapp.com/ftpimages/542/download/Candidate_Statement_2013.pdf> referred in <None>

[scrapy.core.downloader.handlers.http11] Received more bytes than download warn size (33554432) in request <GET https://groton.myschoolapp.com/ftpimages/542/download/download_1474034.pdf>
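
The second warning comes from Scrapy's DOWNLOAD_WARNSIZE setting (the 33554432 in the message is the 32 MB default); it is only a warning, and downloads are cancelled only above DOWNLOAD_MAXSIZE (1 GB by default). If any of the missing PDFs are very large, raising these limits in settings.py is one thing to try (example values, not my current settings):

# Defaults: DOWNLOAD_WARNSIZE = 33554432 (32 MB), DOWNLOAD_MAXSIZE = 1073741824 (1 GB).
# Downloads larger than DOWNLOAD_MAXSIZE are cancelled; the warn size only logs a message.
DOWNLOAD_WARNSIZE = 104857600   # 100 MB
DOWNLOAD_MAXSIZE = 2147483648   # 2 GB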


I would expect all the files to be downloaded, or at least a warning message for every file that is not downloaded. Maybe there is a workaround.

Any feedback is greatly appreciated. Thanks!


1 Answer

UPDATE: I realized that the problem was that robots.txt was not allowing me to visit some of the PDFs. This can be fixed by using another service to download them, or by not following robots.txt.
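
If ignoring robots.txt is acceptable for this site, it can be disabled for just this spider via custom_settings instead of changing the project-wide setting (a minimal sketch):

class PdfCrawler(CrawlSpider):
    name = 'example'
    allowed_domains = ['www.groton.org']
    start_urls = ['https://www.groton.org']

    # Override the project-wide ROBOTSTXT_OBEY = True for this spider only,
    # so PDF URLs disallowed by robots.txt are no longer filtered out.
    custom_settings = {
        'ROBOTSTXT_OBEY': False,
    }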

