Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share

I have an HTML document similar to the following:

<html xmlns="http://www.w3.org/1999/xhtml" xmlns="http://www.w3.org/1999/xhtml">
    <div id="Symbols" class="cb">
    <table class="quotes">
    <tr><th>Code</th><th>Name</th>
        <th style="text-align:right;">High</th>
        <th style="text-align:right;">Low</th>
    </tr>
    <tr class="ro" onclick="location.href='/xyz.com/A.htm';" style="color:red;">
        <td><a href="/xyz.com/A.htm" title="Display,A">A</a></td>
        <td>A Inc.</td>
        <td align="right">45.44</td>
        <td align="right">44.26</td>
    <tr class="re" onclick="location.href='/xyz.com/B.htm';" style="color:red;">
        <td><a href="/xyz.com/B.htm" title="Display,B">B</a></td>
        <td>B Inc.</td>
        <td align="right">18.29</td>
        <td align="right">17.92</td>
</div></html>

I need to extract code/name/high/low information from the table.

I used the following code, adapted from a similar example on Stack Overflow:

#############################
import urllib2
from lxml import html, etree

webpg = urllib2.urlopen('http://www.eoddata.com/stocklist/NYSE/A.htm').read()
table = html.fromstring(webpg)

for row in table.xpath('//table[@class="quotes"]/tbody/tr'):
    for column in row.xpath('./th[position()>0]/text() | ./td[position()=1]/a/text() | ./td[position()>1]/text()'):
        print column.strip(),
    print

#############################

I get no output at all. It only works if I change the XPath in the first loop from table.xpath('//table[@class="quotes"]/tbody/tr') to table.xpath('//tr').

I don't understand why xpath('//table[@class="quotes"]/tbody/tr') does not work.



1 Answer

You are probably looking at the HTML in Firebug, correct? The browser inserts an implicit <tbody> tag when the document does not contain one, so the DOM you inspect differs from the page source. The lxml library only processes the tags that are actually present in the raw HTML string.
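One way to check this yourself is to ask lxml which child tags it sees under the table. The snippet below is a minimal sketch, assuming Python 3 and a made-up miniature table (not the real page) that has no <tbody>, like the markup in the question:

```python
import lxml.html

# Hypothetical miniature table with no <tbody>, mirroring the question's markup.
raw_html = '<table class="quotes"><tr><th>Code</th></tr><tr><td>A</td></tr></table>'

table = lxml.html.fromstring(raw_html)

# Tags of the table's direct children, exactly as lxml parsed them.
child_tags = [child.tag for child in table]
print(child_tags)

# Rows matched by the tbody-based path from the question.
tbody_rows = table.xpath('//table[@class="quotes"]/tbody/tr')
print(tbody_rows)
```

If the rows show up as direct children of the table and the tbody path matches nothing, the <tbody> you saw came from the browser, not from the source.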

Omit the tbody level in your XPath. For example, this works:

>>> import lxml.html
>>> tree = lxml.html.fromstring(raw_html)  # raw_html holds the page source as a string
>>> tree.xpath('//table[@class="quotes"]/tr')
[<Element tr at 1014206d0>, <Element tr at 101420738>, <Element tr at 1014207a0>]
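Putting it together, here is a rough sketch of the full extraction, assuming Python 3. The RAW_HTML string is an abbreviated copy of the table from the question, and extract_rows is a hypothetical helper name, not part of lxml. The XPath union accepts rows with or without a <tbody> parent, so it should also cope with pages that do include the tag:

```python
import lxml.html

# Abbreviated copy of the table from the question; note there is no <tbody>.
RAW_HTML = """
<div id="Symbols" class="cb">
<table class="quotes">
<tr><th>Code</th><th>Name</th><th>High</th><th>Low</th></tr>
<tr class="ro"><td><a href="/xyz.com/A.htm">A</a></td><td>A Inc.</td>
    <td align="right">45.44</td><td align="right">44.26</td></tr>
<tr class="re"><td><a href="/xyz.com/B.htm">B</a></td><td>B Inc.</td>
    <td align="right">18.29</td><td align="right">17.92</td></tr>
</table>
</div>
"""

# Match <tr> directly under the table, or under a <tbody> if the source has one.
ROW_XPATH = '//table[@class="quotes"]/tr | //table[@class="quotes"]/tbody/tr'


def extract_rows(raw_html):
    """Return each table row as a list of stripped cell strings."""
    tree = lxml.html.fromstring(raw_html)
    rows = []
    for tr in tree.xpath(ROW_XPATH):
        # text_content() also picks up text nested inside <a> elements.
        cells = [cell.text_content().strip() for cell in tr.xpath('./th | ./td')]
        rows.append(cells)
    return rows


rows = extract_rows(RAW_HTML)
for row in rows:
    print(" ".join(row))
```

Using text_content() on each cell avoids the separate td/a/text() and td/text() branches from the original XPath, at the cost of flattening any nested markup into plain text.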
