web scraping - Find or select elements from python to scrape with beautifulsoup


asked Jan 31, 2022 in Technique by 深蓝 (71.8m points)

I am not sure how to select the items below inside the table with class="table-info".

Using Python and BeautifulSoup, I want to extract the:

phone
email
website
main activity (the li element's text without the nested div): "Computer consultancy activities"

 <table class="table-info">
 <tbody>
     <tr>
         <td class="col-1">
             <div class="col-1-text">Business name</div>
         </td>
         <td class="col-2">
             <div class="col-2-text">Company XYZ</div>
         </td>
     </tr>
     <tr>
         <td class="col-1">
             <div class="col-1-text">Register code:</div>
         </td>
         <td class="col-2">
             <div class="col-2-text">112233558</div>
         </td>
     </tr>


     <tr>
         <td class="col-1">
             <div class="col-1-text">Operating address:</div>
         </td>
         <td class="col-2">
             <div class="col-2-text"><a target="googlemaps" href="https://www.google.com/maps/place/Some-location"
                     class="link-location">Some location strt. 233</a></div>
         </td>
     </tr>
     <tr>
         <td class="col-1">
             <div class="col-1-text">Legal address</div>
         </td>
         <td class="col-2">
             <div class="col-2-text">
                 <a class="link-location" href="https://www.google.com/maps/place/Some-location" target="_new">Some
                     location
                 </a>
             </div>
         </td>
     </tr>
     <tr>
         <td class="col-1">
             <div class="col-1-text">VAT No:</div>
         </td>
         <td class="col-2">
             <div class="col-2-text"><a href="javascript:void(0)" onclick="return getVAT(this, '12345678')">Get VAT
                     liability</a></div>
         </td>
     </tr>
     <tr>
         <td class="col-1">
             <div class="col-1-text">Age:</div>
         </td>
         <td class="col-2">
             <div class="col-2-text">1 year&nbsp;3 months</div>
         </td>
     </tr>
     <tr>
         <td class="col-1">
             <div class="col-1-text">Founded:</div>
         </td>
         <td class="col-2">
             <div class="col-2-text">20/09/2019</div>
         </td>
     </tr>
     <tr>
         <td class="col-1">
             <div class="col-1-text">Capital:</div>
         </td>
         <td class="col-2">
             <div class="col-2-text">2000 USD</div>
         </td>
     </tr>
     <tr>
         <td colspan="2" class="sep"></td>
     </tr>
     <tr>
         <td class="col-1">
             <div class="col-1-text">Phone:</div>
         </td>
         <td class="col-2">
             <div class="col-2-text">123456789</div>
         </td>
     </tr>
     <tr>
         <td class="col-1">
             <div class="col-1-text">E-mail:</div>
         </td>
         <td class="col-2">
             <div class="col-2-text"><a href="mailto:some@one.com">some@one.com</a></div>
         </td>
     </tr>
     <tr>
         <td colspan="2" class="sep"></td>
     </tr>
     <tr>
         <td class="col-1">
             <div class="col-1-text">Representatives:</div>
         </td>
         <td class="col-2">
             <div class="col-2-text">
                 <div class="box-message">
                     <p class="desc">To access information, please</p>
                     <p>
                         <a href="#" onclick="return loginClicked(this, '#');"
                             class="btn btn-small btn-purple link-login">Log in</a>
                     </p>
                 </div>
             </div>
         </td>
     </tr>
     <tr>
         <td colspan="2" class="sep"></td>
     </tr>
     <tr>
         <td class="col-1">
             <div class="col-1-text">
                 Main activity:
                 <span class="tip info" title=""
                     data-original-title="Activities are classified according to EMTAK 2008"></span>
             </div>
         </td>
         <td class="col-2">
             <div class="col-2-text" id="activity_top5ffe2eab23d13">
                 <ul>
                     <li>
                         Computer consultancy activities
                         <div class="main_activities_top_link_wrapper">
                             <a href="https://www.somesite.com/" target="_blank"
                                 onclick="ga('send', 'event', 'check', 'top_btn', 'Anonym');"
                                 class="btn btn-simple btn-open-graph">
                                 <span>Open TOP 20</span> </a>
                         </div>
                     </li>
                 </ul>

             </div>
         </td>
     </tr>


 </tbody>
 </table>

Note: The HTML above is one example query result. Some companies have no email or no website, so it is important that the code does not raise an error when the element it is looking for is missing. I would rather select by class names or ids than count how deep the table/div nesting goes (as with XPath).
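To make the requirement concrete, here is a minimal sketch of a label-based lookup: find the row whose .col-1-text label matches, then read the neighbouring .col-2-text cell, returning None when the row is absent. The helper names (find_value_cell, cell_text, own_text, extract_company) are hypothetical, the sample HTML is a trimmed copy of the table above, and the "Website" label is an assumption, since the example table has no website row.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Trimmed-down copy of the table above (deliberately without a website row).
SAMPLE_HTML = """
<table class="table-info"><tbody>
<tr><td class="col-1"><div class="col-1-text">Phone:</div></td>
    <td class="col-2"><div class="col-2-text">123456789</div></td></tr>
<tr><td colspan="2" class="sep"></td></tr>
<tr><td class="col-1"><div class="col-1-text">E-mail:</div></td>
    <td class="col-2"><div class="col-2-text"><a href="mailto:some@one.com">some@one.com</a></div></td></tr>
<tr><td class="col-1"><div class="col-1-text">Main activity:
      <span class="tip info" title=""></span></div></td>
    <td class="col-2"><div class="col-2-text"><ul><li>
      Computer consultancy activities
      <div class="main_activities_top_link_wrapper"><a href="https://www.somesite.com/"><span>Open TOP 20</span></a></div>
    </li></ul></div></td></tr>
</tbody></table>
"""


def find_value_cell(table, label):
    """Return the .col-2-text element of the row whose label starts with `label`, or None."""
    if table is None:
        return None
    for row in table.select("tr"):
        label_cell = row.select_one(".col-1-text")
        value_cell = row.select_one(".col-2-text")
        if label_cell is None or value_cell is None:
            continue  # separator rows contain neither cell
        if label_cell.get_text(strip=True).lower().startswith(label.lower()):
            return value_cell
    return None


def cell_text(cell):
    """Whitespace-normalised text of a cell, or None if the cell is missing."""
    return cell.get_text(" ", strip=True) if cell is not None else None


def own_text(tag):
    """Text placed directly inside `tag`, ignoring text inside nested tags."""
    if tag is None:
        return None
    parts = [str(child).strip() for child in tag.children
             if isinstance(child, NavigableString)]
    return " ".join(p for p in parts if p) or None


def extract_company(html):
    soup = BeautifulSoup(html, "html.parser")
    table = soup.select_one("table.table-info")

    email_cell = find_value_cell(table, "E-mail")
    email_link = email_cell.select_one('a[href^="mailto:"]') if email_cell is not None else None

    # "Website" is an assumed label; adjust it to whatever the real page uses.
    website_cell = find_value_cell(table, "Website")
    website_link = website_cell.select_one("a[href]") if website_cell is not None else None

    activity_cell = find_value_cell(table, "Main activity")
    activity_item = activity_cell.select_one("li") if activity_cell is not None else None

    return {
        "phone": cell_text(find_value_cell(table, "Phone")),
        "email": email_link["href"].split(":", 1)[1] if email_link is not None else None,
        "website": website_link["href"] if website_link is not None else None,
        "main_activity": own_text(activity_item),
    }


result = extract_company(SAMPLE_HTML)
print(result)
```

Every lookup returns None instead of raising when a row or link is missing, so a company without an email or website simply produces empty fields.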

I have some code, but it is not working well at the moment:

import csv
import requests
import datetime
import time
 
from requests import get
from bs4 import BeautifulSoup
 
 
with open('data.csv', encoding='utf8') as csvfile:
    reader = csv.reader(csvfile, delimiter=';')
    next(reader)
 
    count = 0
     
    for row in reader:
         
        timestamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
 
        url = f'https://www.somedomain.com/result?country=en&q={row[1]}'
         
        headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
        cookies = {'__test': '1bb6e881021f013463740eeb74840b18'}
        content = get(url, headers=headers,  cookies=cookies).content
        soup = BeautifulSoup(content, "lxml")
 
        table_info = soup.select_one('.table-info')
 
        mail = table_info.select_one('.col-2 a[href^=mailto]')
        mail = mail.get('href')
        mail_clean = mail.split(':')[1]
 
        website = soup.find(text='Website:')
        website = table_info.select_one('.col-2 a[target^=_blank]')
        website = website.get('href') 
         
        collected_data = row[1], mail_clean, website, timestamp
 
        data_list = [["Regcode", "Email", "Website", "Timestamp"],collected_data]
        with open('extracted.csv', 'w', newline='') as file:
            writer = csv.writer(file, delimiter=';')
            writer.writerows(data_list)
 
        print(row[1], "|", mail_clean,"|", website,"|", timestamp)
        #print("Waiting 3 seconds...")
        #time.sleep(3)
        count+=1
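Two things in the script above look likely to cause trouble regardless of which selectors are used: select_one() returns None when nothing matches, so the following .get('href') raises AttributeError, and extracted.csv is opened with mode 'w' inside the loop, so each company overwrites the previous one. The sketch below covers only the output side, assuming missing values are stored as None; write_results and demo_rows are hypothetical names, and an in-memory buffer stands in for the real file.

```python
import csv
import io

FIELDNAMES = ["Regcode", "Email", "Website", "Timestamp"]


def write_results(rows, out_file):
    """Write the header once, then one line per company.

    `rows` is an iterable of dicts. Values that are None (element not
    found on the page) are written by the csv module as empty fields.
    """
    writer = csv.DictWriter(out_file, fieldnames=FIELDNAMES, delimiter=";")
    writer.writeheader()
    for row in rows:
        writer.writerow({name: row.get(name) for name in FIELDNAMES})


# Made-up rows standing in for scraped results; the second company has
# neither an email nor a website.
demo_rows = [
    {"Regcode": "112233558", "Email": "some@one.com", "Website": None,
     "Timestamp": "2022-01-31 07:21:30"},
    {"Regcode": "998877665", "Email": None, "Website": None,
     "Timestamp": "2022-01-31 07:21:33"},
]

# In the real script this would be:
#   with open('extracted.csv', 'w', newline='', encoding='utf8') as out_file:
# opened once, before the loop over the input rows.
buffer = io.StringIO()
write_results(demo_rows, buffer)
print(buffer.getvalue())
```

Passing a generator that yields one dict per scraped company keeps the output file open for the whole run, so results accumulate instead of being replaced.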


1 Answer

深蓝 · Answer 1 · 2022-01-31T07:21:30+0000

Have you considered using CSS selectors that count the table's rows? If your table always mirrors the example, it might be easier to use the :nth-child pseudo-class.

Phone: tr:nth-child(10) .col-2-text
Email: tr:nth-child(11) a
Website: span
Main Activity: li

I used Selector Gadget to find these selectors. You might want to run it on your page directly to see whether other selectors are easier to use.
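As a rough illustration, the phone and email selectors can be passed to BeautifulSoup's select_one(), which supports :nth-child. The table here is rebuilt from the question's rows up to the email row; build_table and select_text are hypothetical helpers. The second table drops the "Capital:" row to show what happens when the page does not mirror the example.

```python
from bs4 import BeautifulSoup

# Rows from the question's table, in order; None marks a separator row.
ROWS = [
    ("Business name", "Company XYZ"),
    ("Register code:", "112233558"),
    ("Operating address:", "Some location strt. 233"),
    ("Legal address", "Some location"),
    ("VAT No:", "Get VAT liability"),
    ("Age:", "1 year 3 months"),
    ("Founded:", "20/09/2019"),
    ("Capital:", "2000 USD"),
    None,
    ("Phone:", "123456789"),
    ("E-mail:", '<a href="mailto:some@one.com">some@one.com</a>'),
]


def build_table(rows):
    """Build a table-info table with the same cell structure as the question."""
    parts = ['<table class="table-info"><tbody>']
    for row in rows:
        if row is None:
            parts.append('<tr><td colspan="2" class="sep"></td></tr>')
        else:
            label, value = row
            parts.append(
                f'<tr><td class="col-1"><div class="col-1-text">{label}</div></td>'
                f'<td class="col-2"><div class="col-2-text">{value}</div></td></tr>'
            )
    parts.append("</tbody></table>")
    return "".join(parts)


def select_text(soup, selector):
    """Text of the first match for `selector`, or None if nothing matches."""
    tag = soup.select_one(selector)
    return tag.get_text(" ", strip=True) if tag is not None else None


full_soup = BeautifulSoup(build_table(ROWS), "html.parser")
print(select_text(full_soup, "tr:nth-child(10) .col-2-text"))  # phone row
print(select_text(full_soup, "tr:nth-child(11) a"))            # email link

# Same table without the "Capital:" row: every later row moves up by one.
short_rows = [r for r in ROWS if r is None or r[0] != "Capital:"]
short_soup = BeautifulSoup(build_table(short_rows), "html.parser")
print(select_text(short_soup, "tr:nth-child(10) .col-2-text"))  # now the email row
print(select_text(short_soup, "tr:nth-child(11) a"))            # no such row
```

With one row missing, position 10 holds the email instead of the phone, so positional selectors only give correct results when every result page has the same rows in the same order.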
