I want to store:

  1. Product Name
  2. Category
  3. Subcategory
  4. Price
  5. Product Company

in my table named products_data, with fields named PID, product_name, category, subcategory, product_price and product_company.

I am using the curl_init() function in PHP to first scrape the website URL; next I want to store the product data in my database table. Here is what I have done so far:

$sites[0] = 'http://www.babyoye.com/';

// Connect to the database once, not once per scraped product.
$db_conn = mysql_connect('localhost', 'root', '') or die('error');
mysql_select_db('babyoye', $db_conn) or die(mysql_error());

foreach ($sites as $site)
{
    $ch = curl_init($site);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $html = curl_exec($ch);
    curl_close($ch);

    $title_start = '<div class="info">';

    $parts = explode($title_start, $html);
    foreach ($parts as $part) {
        // Extract the product link; skip fragments that don't contain one.
        $link = explode('<a href="/d/', $part);
        if (!isset($link[1])) {
            continue;
        }
        $link = explode('">', $link[1]);
        $url = 'http://www.babyoye.com/d/' . $link[0];

        // Now extract the title by a similar process.
        $title = explode('<h2>', $part);
        if (!isset($title[1])) {
            continue;
        }
        $title = explode('</h2>', $title[1]);
        $title = strip_tags($title[0]);

        // Escape the scraped values before inserting them.
        $sql = "INSERT INTO products_data (PID, product_name) VALUES ('"
             . mysql_real_escape_string($url) . "', '"
             . mysql_real_escape_string($title) . "')";
        mysql_query($sql) or die(mysql_error());
    }
}

I am a little confused about the database part, i.e. how to insert the data into the table. Any help?


1 Answer

There are a number of things you may wish to consider in your design phase, before writing any code:

  • Generalise your solutions as much as you can. If you have to write new PHP code for every scrape, then the development work required whenever a target site changes its layout may be too slow, and may disrupt the enterprise you are building. This is especially important if you intend to scrape a large number of sites, since the odds of some site restructuring are statistically greater.
  • One way to achieve this generalisation is to use off-the-shelf libraries that are already good at this. So, rather than using cURL directly, use Goutte or some other programmatic browser system. This will give you sessions for free, which on some sites is necessary to click from one page to another. You'll also get CSS selectors to specify which items of content you are interested in (see the first sketch after this list).
  • For tabular content, store a look-up table on your local site that converts a heading title to a database column name. For product grids, you could use a table that converts a CSS selector (relative to each grid cell, say) to a column (a sketch of this follows the list). Either of these will make it easier to respond to changes in the format of your target site(s).
  • If you are extracting text from a site, at a minimum you need to run it through a proper escape system, otherwise a target site could in theory add content to their pages that injects SQL of their choosing into your database. In any case, an apostrophe on their side would certainly cause your query to fail, so you should use mysql_real_escape_string, or better, a prepared statement (see the sketch after this list).
  • If you are extracting HTML from a site with a view to re-displaying it, always remember to clean it properly first. This means stripping tags you don't want, removing attributes that may be unwelcome, and ensuring the structure is well-nested. I've found HTMLPurifier is good for this.
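
To make the Goutte suggestion concrete, here is a minimal sketch of the same scrape done with a programmatic browser and CSS selectors instead of explode(). It assumes a recent Goutte install (composer require fabpot/goutte) and the <div class="info">/<h2>/<a href="/d/..."> markup shown in the question:

require 'vendor/autoload.php';

use Goutte\Client;

$client = new Client();
$crawler = $client->request('GET', 'http://www.babyoye.com/');

// Select each product block by CSS selector rather than string-splitting
$crawler->filter('div.info')->each(function ($node) {
    $title = $node->filter('h2')->text();       // product name
    $href  = $node->filter('a')->attr('href');  // e.g. "/d/12345"
    $url   = 'http://www.babyoye.com' . $href;
    // ... insert $title and $url into the database here
});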
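
Building on that, here is one way to implement the selector-to-column look-up: a layout change on the target site then only means editing the map. The selectors here are invented placeholders, not the real babyoye markup:

// Maps a CSS selector (relative to each product cell) to a column name.
$columnMap = array(
    'h2'     => 'product_name',    // hypothetical selectors; adjust to the site
    '.price' => 'product_price',
    '.brand' => 'product_company',
);

// $node is one product cell, as in the Goutte sketch above
$row = array();
foreach ($columnMap as $selector => $column) {
    $row[$column] = trim($node->filter($selector)->text());
}
// $row now holds column => value pairs ready for an INSERT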
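
On the escaping point, a prepared statement sidesteps the quoting problem entirely, and is safer than building SQL by concatenation. A sketch using PDO (rather than the old mysql_* functions), assuming the products_data table from the question:

$pdo = new PDO('mysql:host=localhost;dbname=babyoye;charset=utf8', 'root', '');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

// Placeholders mean scraped text can never break out of the SQL string
$stmt = $pdo->prepare(
    'INSERT INTO products_data (PID, product_name) VALUES (:pid, :name)'
);
$stmt->execute(array(':pid' => $url, ':name' => $title));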

When crawling, remember:

  • Be a good robot and define a unique USER_AGENT for yourself, so site operators can easily block you if they wish. It is poor etiquette to masquerade as a human using, say, Internet Explorer. Include a URL to a friendly help page in your user agent, as GoogleBot does (see the user-agent sketch after this list).
  • Don't crawl through proxies or other systems intended to hide your identity - crawl in the open.
  • Respect robots.txt; if a site wishes to block scrapers, they should be allowed to do so using respected conventions. If you are acting like a search engine, the odds of an operator wishing to block you are very low (don't most people want to be scraped by search engines?)
  • Always do some rate limiting, otherwise this happens. On my development laptop over a slow connection, I can scrape a site at a rate of two pages a second, even without using multi_curl. On a real server that is likely to be much faster - maybe 20? Either way, making that number of requests to one target IP/domain is a great way to find yourself on someone's blocklist. So if you scrape, do it slowly.
  • I maintain a table of HTTP accesses, with a rule that if I've made a request in the last 5 seconds, I "pause" that scrape and scrape something else instead. I come back to paused scrapes once sufficient time has passed. I may be inclined to increase this value, holding the concurrent state of a larger number of paused operations in memory (a rate-limiting sketch follows the list).
  • If you are scraping a number of sites, one way to maintain performance without sleeping excessively is to interleave the requests you wish to make on a round-robin basis. So, do one HTTP operation each on 50 sites, retain the state of each scrape, and then go back to the first one.
  • If you implement the interleaving of many sites, you can use multi_curl to parallelise your HTTP requests (a sketch follows the list). I wouldn't recommend using it against a single site, for the reasons already stated (the remote server may well limit the number of connections you can separately open to them anyway).
  • Be careful about basing your entire enterprise on the scraping of a single site. If they block you, you're fairly stuck. If your business model can rely on the scraping of many sites, then being blocked by one becomes less of a risk.
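
For the user-agent point, identifying your bot is a one-line change to the cURL code in the question. The bot name and info URL are placeholders; point them at your own project:

$ch = curl_init($site);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
// Honest user agent, with a pointer to a page describing your bot
curl_setopt($ch, CURLOPT_USERAGENT, 'MyProductBot/1.0 (+http://example.com/bot-info)');
$html = curl_exec($ch);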
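
Here is a sketch of the pause-and-resume rate limiting described above, using an in-memory array where my own setup uses a database table of HTTP accesses:

$lastAccess  = array();  // domain => unix timestamp of the last request
$minInterval = 5;        // seconds to wait between requests to one domain

function canFetch($domain, array &$lastAccess, $minInterval) {
    $now = time();
    if (isset($lastAccess[$domain]) && ($now - $lastAccess[$domain]) < $minInterval) {
        return false;  // "pause" this scrape; go scrape another domain instead
    }
    $lastAccess[$domain] = $now;
    return true;
}

// Usage: if (canFetch('babyoye.com', $lastAccess, $minInterval)) { /* fetch */ }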
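
And a sketch of one round-robin pass with multi_curl: one request per site in parallel, never more than one to the same site at a time. The site URLs are placeholders:

$sites = array('http://site-a.example/', 'http://site-b.example/');

$mh = curl_multi_init();
$handles = array();
foreach ($sites as $site) {
    $ch = curl_init($site);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_multi_add_handle($mh, $ch);
    $handles[$site] = $ch;
}

// Drive all transfers to completion
do {
    curl_multi_exec($mh, $running);
    if ($running) {
        curl_multi_select($mh);  // wait for activity instead of busy-looping
    }
} while ($running > 0);

// Collect each site's HTML, then schedule that site's next page
foreach ($handles as $site => $ch) {
    $html = curl_multi_getcontent($ch);
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);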

Also, it may be cost-effective to install third-party scraping software, or to get a third-party service to do the scraping for you. My own research in this area has turned up very few organisations that appear to be capable (and bear in mind that, at the time of writing, I've not tried any of them). So, you may wish to look at these:

