Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

I am not sure how to select the items below inside the table with class="table-info".

Using Python and BeautifulSoup, I want to extract:

  1. phone

  2. email

  3. website

  4. main activity (the li element's text without the nested div): "Computer consultancy activities".

     <table class="table-info">
     <tbody>
         <tr>
             <td class="col-1">
                 <div class="col-1-text">Business name</div>
             </td>
             <td class="col-2">
                 <div class="col-2-text">Company XYZ</div>
             </td>
         </tr>
         <tr>
             <td class="col-1">
                 <div class="col-1-text">Register code:</div>
             </td>
             <td class="col-2">
                 <div class="col-2-text">112233558</div>
             </td>
         </tr>
    
    
         <tr>
             <td class="col-1">
                 <div class="col-1-text">Operating address:</div>
             </td>
             <td class="col-2">
                 <div class="col-2-text"><a target="googlemaps" href="https://www.google.com/maps/place/Some-location"
                         class="link-location">Some location strt. 233</a></div>
             </td>
         </tr>
         <tr>
             <td class="col-1">
                 <div class="col-1-text">Legal address</div>
             </td>
             <td class="col-2">
                 <div class="col-2-text">
                     <a class="link-location" href="https://www.google.com/maps/place/Some-location" target="_new">Some
                         location
                     </a>
                 </div>
             </td>
         </tr>
         <tr>
             <td class="col-1">
                 <div class="col-1-text">VAT No:</div>
             </td>
             <td class="col-2">
                 <div class="col-2-text"><a href="javascript:void(0)" onclick="return getVAT(this, '12345678')">Get VAT
                         liability</a></div>
             </td>
         </tr>
         <tr>
             <td class="col-1">
                 <div class="col-1-text">Age:</div>
             </td>
             <td class="col-2">
                 <div class="col-2-text">1 year&nbsp;3 months</div>
             </td>
         </tr>
         <tr>
             <td class="col-1">
                 <div class="col-1-text">Founded:</div>
             </td>
             <td class="col-2">
                 <div class="col-2-text">20/09/2019</div>
             </td>
         </tr>
         <tr>
             <td class="col-1">
                 <div class="col-1-text">Capital:</div>
             </td>
             <td class="col-2">
                 <div class="col-2-text">2000 USD</div>
             </td>
         </tr>
         <tr>
             <td colspan="2" class="sep"></td>
         </tr>
         <tr>
             <td class="col-1">
                 <div class="col-1-text">Phone:</div>
             </td>
             <td class="col-2">
                 <div class="col-2-text">123456789</div>
             </td>
         </tr>
         <tr>
             <td class="col-1">
                 <div class="col-1-text">E-mail:</div>
             </td>
             <td class="col-2">
                 <div class="col-2-text"><a href="mailto:some@one.com">some@one.com</a></div>
             </td>
         </tr>
         <tr>
             <td colspan="2" class="sep"></td>
         </tr>
         <tr>
             <td class="col-1">
                 <div class="col-1-text">Representatives:</div>
             </td>
             <td class="col-2">
                 <div class="col-2-text">
                     <div class="box-message">
                         <p class="desc">To access information, please</p>
                         <p>
                             <a href="#" onclick="return loginClicked(this, '#');"
                                 class="btn btn-small btn-purple link-login">Log in</a>
                         </p>
                     </div>
                 </div>
             </td>
         </tr>
         <tr>
             <td colspan="2" class="sep"></td>
         </tr>
         <tr>
             <td class="col-1">
                 <div class="col-1-text">
                     Main activity:
                     <span class="tip info" title=""
                         data-original-title="Activities are classified according to EMTAK 2008"></span>
                 </div>
             </td>
             <td class="col-2">
                 <div class="col-2-text" id="activity_top5ffe2eab23d13">
                     <ul>
                         <li>
                             Computer consultancy activities
                             <div class="main_activities_top_link_wrapper">
                                 <a href="https://www.somesite.com/" target="_blank"
                                     onclick="ga('send', 'event', 'check', 'top_btn', 'Anonym');"
                                     class="btn btn-simple btn-open-graph">
                                     <span>Open TOP 20</span> </a>
                             </div>
                         </li>
                     </ul>
    
                 </div>
             </td>
         </tr>
    
    
     </tbody>
     </table>
    

Note: The HTML above is one example query result. Some companies have no email or no website, so it is important that the code does not raise an error when it cannot find the HTML it is looking for. I would rather rely on class names or ids than count how deep the table/div nesting goes (XPath).

My current code, which does not work well at the moment:

import csv
import requests
import datetime
import time
 
from requests import get
from bs4 import BeautifulSoup
 
 
with open('data.csv', encoding='utf8') as csvfile:
    reader = csv.reader(csvfile, delimiter=';')
    next(reader)
 
    count = 0
     
    for row in reader:
         
        timestamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
 
        url = f'https://www.somedomain.com/result?country=en&q={row[1]}'
         
        headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
        cookies = {'__test': '1bb6e881021f013463740eeb74840b18'}
        content = get(url, headers=headers,  cookies=cookies).content
        soup = BeautifulSoup(content, "lxml")
 
        table_info = soup.select_one('.table-info')
 
        mail = table_info.select_one('.col-2 a[href^=mailto]')
        mail = mail.get('href')
        mail_clean = mail.split(':')[1]
 
        website = soup.find(text='Website:')
        website = table_info.select_one('.col-2 a[target^=_blank]')
        website = website.get('href') 
         
        collected_data = row[1], mail_clean, website, timestamp
 
        data_list = [["Regcode", "Email", "Website", "Timestamp"],collected_data]
        with open('extracted.csv', 'w', newline='') as file:
            writer = csv.writer(file, delimiter=';')
            writer.writerows(data_list)
 
        print(row[1], "|", mail_clean,"|", website,"|", timestamp)
        #print("Waiting 3 seconds...")
        #time.sleep(3)
        count+=1
     
  

1 Answer

Have you considered using CSS selectors that count the table's children? If your table always mirrors the example code, it might be easier to use the :nth-child pseudo-class.

  • Phone: tr:nth-child(10) .col-2-text
  • Email: tr:nth-child(11) a
  • Website: span
  • Main Activity: li
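
As a rough sketch of how positional selectors like these could be applied with BeautifulSoup: the snippet below uses a shortened stand-in for the question's table, so the row numbers are 1 and 2 here rather than 10 and 11. The helper name `text_or_none` is made up for illustration, and `html.parser` is used only so that no extra parser needs to be installed. Because `select_one` returns `None` when nothing matches, the helper checks for that instead of raising.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the question's table. The real table has 15 rows,
# with Phone in row 10 and E-mail in row 11.
html = """
<table class="table-info"><tbody>
  <tr><td class="col-1"><div class="col-1-text">Phone:</div></td>
      <td class="col-2"><div class="col-2-text">123456789</div></td></tr>
  <tr><td class="col-1"><div class="col-1-text">E-mail:</div></td>
      <td class="col-2"><div class="col-2-text"><a href="mailto:some@one.com">some@one.com</a></div></td></tr>
</tbody></table>
"""


def text_or_none(root, selector):
    """Return the stripped text of the first match, or None if nothing matches."""
    node = root.select_one(selector)
    return node.get_text(strip=True) if node else None


soup = BeautifulSoup(html, "html.parser")
table = soup.select_one("table.table-info")

# Rows 1 and 2 in this shortened table; rows 10 and 11 on the full page.
phone = text_or_none(table, "tr:nth-child(1) .col-2-text")
email = text_or_none(table, "tr:nth-child(2) a")
missing = text_or_none(table, "tr:nth-child(3) .col-2-text")  # no such row

print(phone, email, missing)
```

Positional selectors shift as soon as a row is absent, so this only holds while every result page has the same rows in the same order.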

I used SelectorGadget to find these selectors. You may want to run it on your page directly to see whether there are other selectors that are easier to use.
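
Since the question notes that some rows (email, website) may be missing, an alternative is to find each row by its label text in `.col-1-text` rather than by position. The following is only a sketch under assumptions: the helpers `value_cell` and `own_text` are hypothetical names, the HTML is a trimmed copy of the question's sample, and the "Website" row is assumed to hold a plain link in `.col-2-text` (the sample does not show that row, so its markup is a guess).

```python
from bs4 import BeautifulSoup

# Trimmed copy of the question's table; it has no Website row on purpose.
html = """
<table class="table-info"><tbody>
  <tr><td class="col-1"><div class="col-1-text">Phone:</div></td>
      <td class="col-2"><div class="col-2-text">123456789</div></td></tr>
  <tr><td class="col-1"><div class="col-1-text">E-mail:</div></td>
      <td class="col-2"><div class="col-2-text"><a href="mailto:some@one.com">some@one.com</a></div></td></tr>
  <tr><td class="col-1"><div class="col-1-text">Main activity:
        <span class="tip info" title=""></span></div></td>
      <td class="col-2"><div class="col-2-text"><ul><li>
        Computer consultancy activities
        <div class="main_activities_top_link_wrapper">
          <a href="https://www.somesite.com/"><span>Open TOP 20</span></a></div>
      </li></ul></div></td></tr>
</tbody></table>
"""


def value_cell(table, label):
    """Return the .col-2-text div of the row whose label starts with `label`, else None."""
    for row in table.select("tr"):
        name = row.select_one(".col-1-text")
        if name and name.get_text(strip=True).startswith(label):
            return row.select_one(".col-2-text")
    return None


def own_text(tag):
    """Text placed directly inside `tag`, ignoring text of child elements."""
    if tag is None:
        return None
    return "".join(tag.find_all(string=True, recursive=False)).strip()


soup = BeautifulSoup(html, "html.parser")
table = soup.select_one("table.table-info")

phone_cell = value_cell(table, "Phone")
phone = phone_cell.get_text(strip=True) if phone_cell else None

email_cell = value_cell(table, "E-mail")
email_link = email_cell.select_one('a[href^="mailto:"]') if email_cell else None
email = email_link["href"].split(":", 1)[1] if email_link else None

# Assumption: a Website row, when present, contains an ordinary <a href="...">.
website_cell = value_cell(table, "Website")
website_link = website_cell.select_one("a[href]") if website_cell else None
website = website_link["href"] if website_link else None

# The <li> also contains a <div> with a button; own_text skips it.
activity_cell = value_cell(table, "Main activity")
activity = own_text(activity_cell.select_one("li")) if activity_cell else None

print(phone, email, website, activity)
```

Every lookup falls back to `None` instead of raising, so a company without an email or website simply yields empty fields.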

