Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share

I am working on a crawler and have a list of URLs that need to be requested. With the code below, several hundred requests are issued at the same time. I am afraid this will saturate my bandwidth or put too much load on the target website. What should I do?

Here is what I am doing:

urlList.forEach((url, index) => {
    console.log('Fetching ' + url);
    request(url, function(error, response, body) {
        // do something with body
    });
});

I want each request to start only after the previous one has completed.



1 Answer

The things you need to watch for are:

  1. Whether the target site has rate limiting, in which case you may be blocked if you send too many requests too quickly.

  2. How many simultaneous requests the target site can handle without degrading its performance.

  3. How much bandwidth your server has on its end.

  4. How many simultaneous requests your own server can have in flight and process without excessive memory usage or a pegged CPU.

In general, the way to manage all of this is to make the number of requests you launch tunable. You can limit the number of simultaneous requests, the number of requests per second, the amount of data transferred, and so on.

The simplest way to start would be to just control how many simultaneous requests you make. That can be done like this:

// fn(item) must return a promise; at most maxInFlight of them run at once
function runRequests(arrayOfData, maxInFlight, fn) {
    return new Promise((resolve, reject) => {
        let index = 0;      // position of the next item to start
        let inFlight = 0;   // number of requests currently running

        function next() {
            // start requests until the limit is hit or the list is exhausted
            while (inFlight < maxInFlight && index < arrayOfData.length) {
                ++inFlight;
                fn(arrayOfData[index++]).then(result => {
                    --inFlight;
                    next();
                }).catch(err => {
                    --inFlight;
                    console.log(err);
                    // purposely eat the error and let the rest of the processing continue
                    // if you want to stop further processing, you can call reject() here
                    next();
                });
            }
            if (inFlight === 0) {
                // nothing running and nothing left to start: all done
                resolve();
            }
        }
        next();
    });
}

You would then use it like this:

const rp = require('request-promise');

// run the whole urlList, no more than 10 at a time
runRequests(urlList, 10, function(url) {
    return rp(url).then(function(data) {
        // process fetched data here for one url
    }).catch(function(err) {
        console.log(url, err);
    });
}).then(function() {
    // all requests done here
});
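
runRequests resolves without a value, so any results have to be handled inside the callback. If you would rather get the results back in input order, one possible variant is a small pool of workers written with async/await. This is only a sketch: mapWithLimit and the { ok, value, error } result shape are hypothetical names chosen for illustration, not part of any library.

```javascript
// Hypothetical variant of runRequests that returns one result per input item.
// fn(item) is assumed to return a promise (for example, rp(url)).
async function mapWithLimit(arrayOfData, maxInFlight, fn) {
    const results = new Array(arrayOfData.length);
    let index = 0;   // position of the next item to start

    // each worker keeps claiming the next unstarted item until none are left
    async function worker() {
        while (index < arrayOfData.length) {
            const i = index++;
            try {
                results[i] = { ok: true, value: await fn(arrayOfData[i]) };
            } catch (error) {
                // record the failure and keep going with the remaining items
                results[i] = { ok: false, error };
            }
        }
    }

    // never start more workers than there are items
    const workerCount = Math.min(maxInFlight, arrayOfData.length);
    await Promise.all(Array.from({ length: workerCount }, worker));
    return results;
}
```

Each worker has at most one request in flight, so the number of workers is the concurrency limit.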

The runRequests approach can be made as sophisticated as you want by adding a time element to it (no more than N requests per second) or even a bandwidth element.
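
As a rough illustration of the time element, the sketch below spaces out the start of each request so that no more than a given number begin per second. The names runRateLimited and delay are hypothetical helpers, not part of any library. Note that it limits only the start rate: if responses are slow, the number of requests in flight can still grow, so you may want to combine it with a concurrency limit.

```javascript
// Hypothetical helper: resolves after roughly `ms` milliseconds.
function delay(ms) {
    return new Promise(resolve => setTimeout(resolve, ms));
}

// Hypothetical helper: starts at most `perSecond` tasks per second.
// fn(item) is assumed to return a promise (for example, rp(url)).
async function runRateLimited(arrayOfData, perSecond, fn) {
    const interval = 1000 / perSecond;   // minimum gap between two starts, in ms
    const pending = [];
    for (let i = 0; i < arrayOfData.length; i++) {
        if (i > 0) {
            await delay(interval);       // space out the starts
        }
        // start the task without awaiting it, so a slow response does not
        // hold up later starts; failures are recorded rather than thrown
        pending.push(
            Promise.resolve()
                .then(() => fn(arrayOfData[i]))
                .then(value => ({ ok: true, value }), error => ({ ok: false, error }))
        );
    }
    return Promise.all(pending);         // resolves once every task has settled
}
```

For example, runRateLimited(urlList, 5, url => rp(url)) would start roughly one request every 200 ms.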

As for wanting each request to start only after the previous one has completed:

That's a very slow way to do things. If you really want that, you can pass 1 for the maxInFlight parameter to the function above. Typically, though, things run a lot faster without causing problems if you allow somewhere between 5 and 50 simultaneous requests. Only testing will tell you where the sweet spot is for your particular target sites, your server infrastructure, and the amount of processing you need to do on the results.
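
If strictly sequential processing is what you need, it can also be written directly as a loop with async/await. The sketch below assumes an environment that supports async functions; runSequentially is a hypothetical name, and storing undefined for a failed item is just one possible choice.

```javascript
// Hypothetical helper: runs fn on each item strictly one at a time, in order.
// fn(item) is assumed to return a promise (for example, rp(url)).
async function runSequentially(arrayOfData, fn) {
    const results = [];
    for (const item of arrayOfData) {
        try {
            // the next iteration does not start until this promise settles
            results.push(await fn(item));
        } catch (err) {
            console.log(err);
            results.push(undefined);   // keep positions aligned with the input
        }
    }
    return results;
}
```

This behaves like a concurrency limit of 1, with the added convenience of returning the results in input order.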

