Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Hi guys, I'm fairly new to Python. What I'm trying to do is move my old code over to multiprocessing, but I'm facing some errors that I hope someone could help me with. My code checks a few thousand links, given in a text file, for certain tags; once a tag is found, the link is reported. Because I have a few thousand links to check, speed is an issue, hence the need to move to multiprocessing.

Update: I'm getting HTTP 503 errors back. Am I sending too many requests, or am I missing something?

Multiprocessing code:

from mechanize import Browser
from bs4 import BeautifulSoup
import sys
import socket
from multiprocessing.dummy import Pool  # This is a thread-based Pool
from multiprocessing import cpu_count

br = Browser()
br.set_handle_robots(False)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]

no_stock = []

def main(lines):
    done = False
    tries = 1
    while tries and not done:
        try:
            r = br.open(lines, timeout=15)
            r = r.read()
            soup = BeautifulSoup(r,'html.parser')
            done = True # exit the loop
        except socket.timeout:
            print('Failed socket retrying')
            tries -= 1 # to exit when tries == 0
        except Exception as e: 
            print '%s: %s' % (e.__class__.__name__, e)
            print sys.exc_info()[0]
            tries -= 1 # to exit when tries == 0
    if not done:
        print('Failed for {}\n'.format(lines))
    table = soup.find_all('div', {'class' : "empty_result"})
    results = soup.find_all('strong', style = 'color: red;')
    if table or results:
        no_stock.append(lines)

if __name__ == "__main__":
    r = br.open('http://www.randomweb.com/') #avoid redirection
    fileName = "url.txt"
    pool = Pool(processes=2)
    with open(fileName, "r+") as f:
        lines = pool.map(main, f)
    with open('no_stock.txt', 'w') as f:
        f.write('No. of out of stock items : ' + str(len(no_stock)) + '\n\n')
        for i in no_stock:
            f.write(i + '\n')

Traceback:

Traceback (most recent call last):
  File "test2.py", line 43, in <module>
    lines = pool.map(main, f)
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 251, in map
    return self.map_async(func, iterable, chunksize).get()
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 567, in get
    raise self._value
UnboundLocalError: local variable 'soup' referenced before assignment

my txt file is something like this:-

http://www.randomweb.com/item.htm?uuid=44733096229
http://www.randomweb.com/item.htm?uuid=4473309622789
http://www.randomweb.com/item.htm?uuid=447330962291
....etc

1 Answer

from mechanize import Browser
from bs4 import BeautifulSoup
import sys
import socket
from multiprocessing.dummy import Pool  # This is a thread-based Pool
from multiprocessing import cpu_count

br = Browser()

no_stock = []

def main(line):
    done = False
    tries = 3
    while tries and not done:
        try:
            r = br.open(line, timeout=15)
            r = r.read()
            soup = BeautifulSoup(r,'html.parser')
            done = True # exit the loop
        except socket.timeout:
            print('Failed socket retrying')
            tries -= 1 # to exit when tries == 0
        except:
            print('Random fail retrying')
            print(sys.exc_info()[0])
            tries -= 1 # to exit when tries == 0
    if not done:
        print('Failed for {}\n'.format(line))
        return  # give up on this URL; soup was never assigned
    table = soup.find_all('div', {'class' : "empty_result"})
    results = soup.find_all('strong', style = 'color: red;')
    if table or results:
        no_stock.append(line)

if __name__ == "__main__":
    fileName = "url.txt"
    pool = Pool(cpu_count() * 2)  # Creates a Pool with cpu_count * 2 threads.
    with open(fileName, "rb") as f:
        lines = pool.map(main, f)
    with open('no_stock.txt', 'w') as f:
        f.write('No. of out of stock items : ' + str(len(no_stock)) + '\n\n')
        for i in no_stock:
            f.write(i + '\n')

pool.map takes two parameters: the first is a function (in your code, main), the second is an iterable. Each item of the iterable is passed as the argument to the function (in your code, each line of the file).
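As a minimal sketch of that calling convention (the `shout` helper here is just an illustration, not part of your code):

```python
from multiprocessing.dummy import Pool  # thread-based Pool, same API as multiprocessing.Pool

def shout(word):
    # The function given to pool.map takes ONE argument:
    # a single item from the iterable.
    # strip() mirrors the fact that iterating a file yields
    # lines with a trailing newline.
    return word.strip().upper()

pool = Pool(2)  # two worker threads
# Each element of the list is handed to shout; results keep input order.
results = pool.map(shout, ['foo\n', 'bar\n', 'baz\n'])
pool.close()
pool.join()
print(results)  # ['FOO', 'BAR', 'BAZ']
```

Note that the worker gets the raw line, newline included, so in your main you may want to strip it before passing it to br.open.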

