Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

I have a Flask app that takes a URL from the user and then crawls that website and returns the links found on that website. Previously, I had an issue where the crawler would only run once and after that, it wouldn't run again. I found the solution to that by using CrawlerRunner as opposed to CrawlerProcess. This is what my code looks like:

from flask import Flask, render_template, request, redirect, url_for, session, make_response
from flask_executor import Executor
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from twisted.internet import reactor
from urllib.parse import urlparse
from uuid import uuid4
import urllib3, requests, urllib.parse

app = Flask(__name__)
executor = Executor(app)

http = urllib3.PoolManager()
runner = CrawlerRunner()

list = set([])
list_validate = set([])
list_final = set([])

@app.route('/', methods=["POST", "GET"])
def index():
    if request.method == "POST":
        url_input = request.form["usr_input"]

        # Modifying URL
        if 'https://' in url_input and url_input[-1] == '/':
            url = str(url_input)
        elif 'https://' in url_input and url_input[-1] != '/':
            url = str(url_input) + '/'
        elif 'https://' not in url_input and url_input[-1] != '/':
            url = 'https://' + str(url_input) + '/'
        elif 'https://' not in url_input and url_input[-1] == '/':
            url = 'https://' + str(url_input)
        # Validating URL
        try:
            response = requests.get(url)
            error = http.request("GET", url)
            if error.status == 200:
                parse = urlparse(url).netloc.split('.')
                base_url = parse[-2] + '.' + parse[-1]
                start_url = [str(url)]
                allowed_url = [str(base_url)]

                # Crawling links
                class Crawler(CrawlSpider):
                    name = "crawler"
                    start_urls = start_url
                    allowed_domains = allowed_url
                    rules = [Rule(LinkExtractor(), callback='parse_links', follow=True)]

                    def parse_links(self, response):
                        base_url = url
                        href = response.xpath('//a/@href').getall()
                        list.add(urllib.parse.quote(response.url, safe=':/'))
                        for link in href:
                            if base_url not in link:
                                list.add(urllib.parse.quote(response.urljoin(link), safe=':/'))
                        for link in list:
                            if base_url in link:
                                list_validate.add(link)

                def start_spider():
                    d = runner.crawl(Crawler)

                    def start(d):
                        # Validating the crawled links
                        for link in list_validate:
                            error = http.request("GET", link)
                            if error.status == 200:
                                list_final.add(link)
                        # Writing the validated links to a file
                        with open('templates/file.txt', 'w') as f:
                            for link in list_final:
                                print(link, file=f)

                    d.addCallback(start)

                def run():
                    reactor.run(0)

                unique_id = uuid4().__str__()
                executor.submit_stored(unique_id, start_spider)
                executor.submit(run)
                return redirect(url_for('crawling', id=unique_id))

            elif error.status != 200:
                return render_template('index.html')

        except requests.ConnectionError as exception:
            return render_template('index.html')
    else:
        return render_template('index.html')

@app.route('/crawling-<string:id>')
def crawling(id):
    if not executor.futures.done(id):
        return render_template('start-crawl.html', refresh=True)
    else:
        executor.futures.pop(id)
        return render_template('finish-crawl.html')

I also have this code to refresh the page every 5 seconds in start-crawl.html:

{% if refresh %}
    <meta http-equiv="refresh" content="5">
{% endif %}

The problem is that start-crawl.html is rendered only while the site is being crawled, not while the links are being validated. What happens is that the app takes the URL and crawls it while rendering start-crawl.html, then switches to finish-crawl.html while validation is still running.

I believe the issue could be in start_spider(), in the line d.addCallback(start). My guess is that d = runner.crawl(Crawler) gets executed and then d.addCallback(start) runs in the background, which is why I am taken to finish-crawl.html while validation is still in progress. I want the entire function to run in the background, not just that part, which is why I have executor.submit_stored(unique_id, start_spider).
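To make the suspicion above concrete, here is a minimal sketch using only the standard library. It is not Scrapy or Flask-Executor code: ThreadPoolExecutor stands in for the executor, a plain thread stands in for the work that the reactor runs later, and all function names are hypothetical. It shows that a future is "done" as soon as the submitted function returns, even if that function only scheduled work to happen elsewhere.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

pool = ThreadPoolExecutor(max_workers=4)


def slow_validation(finished):
    """Hypothetical stand-in for the validation callback."""
    time.sleep(0.2)
    finished.set()


def schedule_only(finished):
    # Mirrors the shape of start_spider(): hand the work to something
    # else and return at once, without waiting for it.
    threading.Thread(target=slow_validation, args=(finished,)).start()


def schedule_and_wait(finished):
    # Same scheduling, but block until the scheduled work signals completion.
    threading.Thread(target=slow_validation, args=(finished,)).start()
    finished.wait()


def validation_finished_when_future_done(task):
    """Report whether validation had finished when the future completed."""
    finished = threading.Event()
    future = pool.submit(task, finished)
    future.result()  # returns once the submitted function itself returns
    was_finished = finished.is_set()
    finished.wait()  # let the background thread finish before returning
    return was_finished
```

With schedule_only the future completes while validation is still running, which matches the early jump to finish-crawl.html. With schedule_and_wait the future completes only after the work has signalled the event. Whether the same waiting pattern can be applied safely to a Deferred depends on how the reactor thread is set up, so treat this as an illustration of the timing rather than a drop-in fix.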

I want this code to take a URL, then crawl and validate it while rendering start-crawl.html. When it finishes, I want it to render finish-crawl.html.

Anyway, if that isn't the issue, does anyone know what it is and how to fix it? Please ignore the complexity of this code and anything that doesn't follow programming conventions. Thanks in advance to everyone.

question from:https://stackoverflow.com/questions/65713913/why-is-acrapy-spider-not-functioning-with-flask-correctly


1 Answer

Looking at the code, everything should work as long as the function run() is actually called at some point; as it stands, it does not appear to be called. Also, as mentioned in the comments, you should move the classes and functions out of the route into separate files. Basically, you should restructure the code so that the stack works correctly, and if you need to store state, use a temporary file or at least SQLite for the queue and results.
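
As a rough sketch of the SQLite suggestion, the snippet below keeps a job's status and its collected links in two tables instead of module-level sets. The table names, column names, status strings, and helper functions are all assumptions made for illustration; none of them come from the question's code.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a small store for crawl jobs and their links."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS jobs "
        "(id TEXT PRIMARY KEY, status TEXT NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS links "
        "(job_id TEXT NOT NULL, url TEXT NOT NULL, UNIQUE (job_id, url))"
    )
    return conn


def set_status(conn, job_id, status):
    # Insert the job, or overwrite its status if it already exists.
    conn.execute(
        "INSERT OR REPLACE INTO jobs (id, status) VALUES (?, ?)",
        (job_id, status),
    )
    conn.commit()


def get_status(conn, job_id):
    row = conn.execute(
        "SELECT status FROM jobs WHERE id = ?", (job_id,)
    ).fetchone()
    return row[0] if row else None


def add_link(conn, job_id, url):
    # The UNIQUE constraint plus OR IGNORE gives set-like behaviour.
    conn.execute(
        "INSERT OR IGNORE INTO links (job_id, url) VALUES (?, ?)",
        (job_id, url),
    )
    conn.commit()


def get_links(conn, job_id):
    rows = conn.execute(
        "SELECT url FROM links WHERE job_id = ? ORDER BY url", (job_id,)
    )
    return [row[0] for row in rows]
```

Under this layout, the background work could move a job through statuses such as "crawling", "validating", and "finished", and the polling route could render finish-crawl.html only once get_status returns "finished". A file path would be needed instead of ":memory:" if more than one connection has to see the data.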

