php - DOMDocument for parsing HTML (instead of regex)

asked Oct 24, 2021 in Technique by 深蓝 (71.8m points)

I am trying to learn how to use DOMDocument to parse HTML.

I am only doing some simple work. I liked Gordon's answer on scraping data with regex and simplehtmldom, and I based my code on his work.

I found the documentation on PHP.net lacking: it has limited information and almost no examples, and most of the specifics concern parsing XML.

<?php
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTMLFile('http://www.nu.nl/internet/1106541/taalunie-keurt-open-sourcewoordenlijst-goed.html');
libxml_clear_errors();

$recipe = array();
$xpath = new DOMXPath($dom);
$contentDiv = $dom->getElementById('page'); // would have preferred getContentbyClass('content') (unique) in this case.

# title
print_r($xpath->evaluate('string(div/div/div/div/div/h1)', $contentDiv));

# content (this is not working)
#print_r($xpath->evaluate('string(div/div/div/div['content'])', $contentDiv)); // if only this worked
print_r($xpath->evaluate('string(div/div/div/div)', $contentDiv));
?>

For testing purposes, I am trying to get the title (between h1 tags) and the content (HTML) of a nu.nl news article.

As you can see, I can get the title, although I am not happy with that evaluate string: it only works because that happens to be the only h1 tag at that div level.

1 Answer

深蓝 · Answer 1 · 2021-10-23T18:25:46+0000

Here is how you could do it with DOM and XPath:

$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTMLFile('http://www.nu.nl/…');
libxml_clear_errors();

$xpath = new DOMXPath($dom);
echo $xpath->evaluate('string(id("leadarticle")/div/h1)');
echo $dom->saveHTML(
    $xpath->evaluate('id("leadarticle")/div[@class="content"]')->item(0)
);

The XPath string(id("leadarticle")/div/h1) returns the text content of the h1 that is a child of a div, which in turn is a child of the element with the id leadarticle.
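To try that expression without fetching the live page, here is a minimal sketch that runs it against a small inline document. The markup is hypothetical and only mimics the nesting described above; it is not the real nu.nl HTML. It also shows that string() yields an empty string when nothing matches.

```php
<?php
// Hypothetical markup standing in for the article page; not the real nu.nl HTML.
$html = <<<'HTML'
<!DOCTYPE html>
<html><body>
<div id="leadarticle">
  <div class="header"><h1>Example title</h1></div>
  <div class="content"><p>First paragraph.</p></div>
</div>
</body></html>
HTML;

$dom = new DOMDocument;
libxml_use_internal_errors(true);  // collect parse warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// string() converts the first matching node to its text content.
$title = $xpath->evaluate('string(id("leadarticle")/div/h1)');

// When nothing matches, string() yields an empty string rather than an error.
$missing = $xpath->evaluate('string(id("leadarticle")/div/h2)');

echo $title, PHP_EOL;
var_dump($missing === '');
```

Checking the result against an empty string is therefore a simple way to detect that the page structure has changed.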

The XPath id("leadarticle")/div[@class="content"] returns the div whose class attribute is exactly content and that is a child of the element with the id leadarticle.
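The question wished for a way to select by class. One caveat with [@class="content"] is that it compares the whole attribute value, so a div with class="content wide" would not match. A common XPath 1.0 idiom for matching a single class name is sketched below; the markup is hypothetical and chosen only to show the difference.

```php
<?php
// Hypothetical markup: the second div carries two class names.
$html = '<div id="leadarticle">'
      . '<div class="header"><h1>Example title</h1></div>'
      . '<div class="content wide"><p>Body text.</p></div>'
      . '</div>';

$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Exact comparison: "content wide" is not equal to "content", so nothing matches.
$exact = $xpath->evaluate('id("leadarticle")/div[@class="content"]');

// Token comparison: pad the class list with spaces and look for " content ".
$token = $xpath->evaluate(
    'id("leadarticle")/div[contains(concat(" ", normalize-space(@class), " "), " content ")]'
);

echo $exact->length, PHP_EOL;
echo $token->length, PHP_EOL;
```

Padding with spaces keeps the test from matching longer names such as contents or main-content.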

Because you want the outerHTML of the content div, you have to fetch the entire node and not just its text content, hence no string() function in that XPath. Passing a node to the DOMDocument::saveHTML() method (possible only as of PHP 5.3.6) then serializes that node back to HTML.
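If only the markup inside the content div is wanted (the innerHTML rather than the outerHTML), one possible approach is to serialize each child node in turn. The sketch below uses hypothetical inline markup and also guards against item(0) returning null when the expression matches nothing.

```php
<?php
// Hypothetical markup standing in for the article page.
$html = '<div id="leadarticle"><div class="content"><p>One</p><p>Two</p></div></div>';

$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$content = $xpath->evaluate('id("leadarticle")/div[@class="content"]')->item(0);

$outer = '';
$inner = '';
if ($content !== null) {  // item(0) is null when the expression matches nothing
    // Serializing the node itself includes its own tags ("outerHTML").
    $outer = $dom->saveHTML($content);

    // Serializing each child in turn leaves the wrapper out ("innerHTML").
    foreach ($content->childNodes as $child) {
        $inner .= $dom->saveHTML($child);
    }
}

echo $outer, PHP_EOL;
echo $inner, PHP_EOL;
```

The serializer may insert line breaks between block-level elements, so the output should not be compared byte for byte with the input.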
