Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
I collect a URL from a Python command and then insert it into start_urls

from flask import Flask, jsonify, request
import scrapy
import subprocess

class ClassSpider(scrapy.Spider):
    name        = 'mySpider'
    #start_urls = []
    #pages      = 0
    news        = []

    def __init__(self, url, nbrPage):
        self.pages      = nbrPage
        self.start_urls = []
        self.start_urls.append(url)

    def parse(self, response):
        ...

    def run(self):
        subprocess.check_output(['scrapy', 'crawl', 'mySpider', '-a', f'url={self.start_urls}', '-a', f'nbrPage={self.pages}'])
        return self.news

app = Flask(__name__)
data = []

@app.route('/', methods=['POST'])
def getNews():
    mySpiderClass = ClassSpider(request.json['url'], 2)
    return jsonify({'data': mySpiderClass.run()})

if __name__ == "__main__":
    app.run(debug=True)

I got this error:

    scrapy.exceptions.NotSupported: Unsupported URL scheme '': no handler available for that scheme

When I add a print('my urls List: ' + str(self.start_urls)), it prints a list of URLs like this: my urls List: ['www.googole.com']

Any help, please?


1 Answer

I guess this happens because you first append url to self.start_urls, and then you call ClassSpider's run method with the whole list self.start_urls. That appends the list into another list, so you end up with a nested list instead of a list of strings.
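The nesting matters because run() interpolates the entire list into the -a argument. A quick sketch of what that f-string actually produces:

```python
# Reproduce the argument string that run() builds when the whole
# start_urls list is interpolated instead of a single URL string.
start_urls = ['www.googole.com']
arg = f'url={start_urls}'
print(arg)  # url=['www.googole.com'] -- not a usable URL
```

Scrapy then tries to treat the string "['www.googole.com']" as a URL, which has no scheme at all.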
To avoid this you should maybe change your __init__ method like this:

    def __init__(self, url, nbrPage):
        self.pages      = nbrPage
        self.url        = url
        self.start_urls = []
        self.start_urls.append(url)

And then pass self.url instead of self.start_urls in run:

    def run(self):
        subprocess.check_output(['scrapy', 'crawl', 'mySpider', '-a', f'url={self.url}', '-a', f'nbrPage={self.pages}'])
        return self.news
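Separately, the printed list shows a URL without a scheme ('www.googole.com'), which by itself would trigger the same "Unsupported URL scheme ''" error. A minimal sketch of normalizing the url argument before appending it — ensure_scheme is a hypothetical helper, not part of the code above:

```python
from urllib.parse import urlparse

def ensure_scheme(url: str, default: str = 'http') -> str:
    """Prepend a scheme when the URL has none, so Scrapy can pick a handler."""
    if not urlparse(url).scheme:
        return f'{default}://{url}'
    return url

print(ensure_scheme('www.googole.com'))      # http://www.googole.com
print(ensure_scheme('https://example.com'))  # unchanged
```

You could call this in __init__ before self.start_urls.append(url) so every URL reaching Scrapy has an explicit scheme.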
