php - scrape data using regex and simplehtmldom

Question

Welcome To Ask or Share your Answers For Others

php - scrape data using regex and simplehtmldom

asked Jan 31, 2022 in Technique[技术] by 深蓝 (71.8m points)

i am trying to scrape some data from this site : http://laperuanavegana.wordpress.com/ . actually i want the title of recipe and ingredients . ingredients is located inside two specific keyword . i am trying to get this data using regex and simplehtmldom . but its showing the full html text not just the ingredients . here is my code : <?php

include_once('simple_html_dom.php');
$base_url = "http://laperuanavegana.wordpress.com/";

traverse($base_url);


function traverse($base_url)
{
    
    $html = file_get_html($base_url);
    $k1="Ingredientes";
    $k2="Preparación";
    preg_match_all("/$k1(.*)$k2/s",$html->innertext,$out);
    echo $out[0][0];
}

?>

there is multiple ingredients in this page . i want all of them . so using preg_match_all() it will be helpful if anybody detect the bug of this code . thanks in advance.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

139 views

1 Answer

深蓝 · Answer 1 · 2022-01-31T07:16:12+0000

When you are already using an HTML parser (even a poor one like SimpleHtmlDom), why are you trying to mess up things with Regex then? That's like using a scalpel to open up the patient and then falling back to a sharpened spoon for the actual surgery.

Since I strongly believe no one should use SimpleHtmlDom because it has a poor codebase and is much slower than libxml based parsers, here is how to do it with PHP's native DOM extension and XPath. XPath is effectively the Regex or SQL for X(HT)ML documents. Learn it, so you will never ever have to touch Regex for HTML again.

$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTMLFile('https://laperuanavegana.wordpress.com/2011/06/11/ensalada-tibia-de-quinua-mango-y-tomate/');
libxml_clear_errors();

$recipe = array();
$xpath = new DOMXPath($dom);
$contentDiv = $dom->getElementById('content');
$recipe['title'] = $xpath->evaluate('string(div/h2/a)', $contentDiv);
foreach ($xpath->query('div/div/ul/li', $contentDiv) as $listNode) {
    $recipe['ingredients'][] = $listNode->nodeValue;
}
print_r($recipe);

This will output:

Array
(
    [title] => Ensalada tibia de quinua, mango y?tomate
    [ingredients] => Array
        (
            [0] => 250gr de quinua cocida tibia
            [1] => 1 mango grande
            [2] => 2 tomates
            [3] => Unas hojas de perejil
            [4] => Sal
            [5] => Aceite de oliva
            [6] => Vinagre balsámico
        )

)

Note that we are not parsing http://laperuanavegana.wordpress.com/ but the actual blog post. The main URL will change content whenever the blog owner adds a new post.

To get all the Recipes from the main page, you can use

$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTMLFile('https://laperuanavegana.wordpress.com');
libxml_clear_errors();
$contentDiv = $dom->getElementById('content');
$xp = new DOMXPath($dom);
$recipes = array();
foreach ($xp->query('div/h2/a|div/div/ul/li', $contentDiv) as $node) {
    echo
        ($node->nodeName === 'a') ? "
# " : '- ',
        $node->nodeValue,
        PHP_EOL;
}

This will output

# Ensalada tibia de quinua, mango y?tomate
- 250gr de quinua cocida tibia
- 1 mango grande
- 2 tomates
- Unas hojas de perejil
- Sal
- Aceite de oliva
- Vinagre balsámico

# Flan de?lúcuma
- 1 lúcuma grandota o 3 peque?as
- 1/2 litro de leche de soja evaporada
…

and so on

Also see

Categories

php - scrape data using regex and simplehtmldom

Please log in or register to add a comment.

Please log in or register to answer this question.

1 Answer

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags