javascript - Get HTML content from another site


asked Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

I would like to dynamically retrieve the HTML content of another website. I have the company's permission.

Please don't point me to JSONP: I can't edit Site A, only Site B.


1 Answer

深蓝 · Answer 1 · 2021-10-23T17:55:42+0000

Because of the browser's same-origin policy, you won't be able to do this client-side unless you're content with an iframe.
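If an iframe is good enough, a minimal sketch of embedding the remote page might look like the following. The helper name embedRemotePage, its parameters, the container id and the page.php URL are all hypothetical; the document object is passed in only to keep the function easy to test. Keep in mind that the remote site can refuse to be framed (through an X-Frame-Options header or a Content-Security-Policy frame-ancestors directive), and that your own scripts still cannot read the framed document.

```javascript
// Hypothetical helper: creates an iframe pointing at the remote page
// and appends it to the given container element.
function embedRemotePage(doc, container, url) {
  const frame = doc.createElement("iframe");
  frame.src = url;          // the browser loads the remote page itself
  frame.width = "100%";
  frame.height = "600";
  container.appendChild(frame);
  return frame;
}

// In a browser you might call it like this (the id and URL are placeholders):
// embedRemotePage(document, document.getElementById("remote"),
//                 "http://thirdpartydomain.internet/page.php");
```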

With PHP, you can use several methods of "scraping" the content. The approach you use depends on whether you need to use cookies in your requests (i.e. the data is behind a login).

Either way, to start things off on the client side you'll issue a standard AJAX request to your own server:

$.ajax({
  type: "POST",
  url: "localProxy.php",
  data: {url: "maybe_send_your_url_here.php?product_id=1"}
}).done(function( html ) {
   // do something with your HTML!
});
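If you'd rather not depend on jQuery, the same request can be sketched with the browser's built-in fetch. This assumes the proxy still lives at localProxy.php and reads a form-encoded url field, as in the snippet above; the function names buildProxyRequest and fetchRemoteHtml are made up for illustration.

```javascript
// Hypothetical helper: builds the arguments for a POST to the local proxy.
// URLSearchParams takes care of URL-encoding the target address.
function buildProxyRequest(targetUrl, proxyUrl = "localProxy.php") {
  const body = new URLSearchParams({ url: targetUrl });
  return {
    proxyUrl,
    options: {
      method: "POST",
      headers: { "Content-Type": "application/x-www-form-urlencoded" },
      body: body.toString(),
    },
  };
}

// Sends the request and resolves with the HTML string returned by the proxy.
async function fetchRemoteHtml(targetUrl) {
  const { proxyUrl, options } = buildProxyRequest(targetUrl);
  const response = await fetch(proxyUrl, options);
  if (!response.ok) {
    throw new Error("Proxy request failed with status " + response.status);
  }
  return response.text();
}

// Usage in the page (placeholder URL taken from the jQuery example):
// fetchRemoteHtml("maybe_send_your_url_here.php?product_id=1")
//   .then(function (html) { /* do something with your HTML! */ });
```

Splitting the request building from the actual network call is just a convenience: it lets you check what will be sent without hitting the server.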

If you need cookies set (if the remote site requires login, you need 'em), you're going to use cURL. The full mechanics of logging in with post data and accepting cookies is a little beyond the scope of this answer, but your requests would look something like this:

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'http://thirdpartydomain.internet/login_url.php');
// disabling peer verification is insecure; only do this while testing against a self-signed certificate
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6");
curl_setopt($ch, CURLOPT_TIMEOUT, 60);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 0);
// return the response as a string instead of printing it
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
// cookies sent by the server are written to this file when the handle is closed
curl_setopt($ch, CURLOPT_COOKIEJAR, 'cookie.jar');
curl_setopt($ch, CURLOPT_POST, 1);
// http_build_query URL-encodes the credentials
curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query(array('email' => $username, 'password' => $password)));
$result = curl_exec($ch);
curl_close($ch);

At that point, you can check the $result variable and make sure the login worked. If so, you'd then use cURL to issue another request to grab the page content. The second request won't have all the post junk, and you'd use the URL that you're trying to fetch. Point CURLOPT_COOKIEFILE at the same cookie.jar so the session cookies from the login are sent along. You'd end up with a large string full of HTML.

If you only need a portion of that page's content, load the string into a DOMDocument as in the code further down, calling the loadHTML method instead of loadHTMLFile.

Speaking of DOMDocument, if you don't need cookies, then you can use DOMDocument directly to fetch the page, skipping cURL:

$doc = new DOMDocument('1.0', 'UTF-8');
// real-world HTML is rarely valid; collect parser warnings instead of printing them
libxml_use_internal_errors(true);
// fetch the remote page and parse it (if you already have the HTML in a string, e.g. from cURL, call $doc->loadHTML($result) instead)
$doc->loadHTMLFile('http://third_party_url_here.php?query=string');

// since we are working with HTML fragments here, remove <!DOCTYPE
$doc->removeChild($doc->firstChild);

// remove <html></html> and any junk
$body = $doc->getElementsByTagName('body');
$doc->replaceChild($body->item(0), $doc->firstChild);

// now, you can get any portion of the html (target a div, for example) using familiar DOM methods

// echo the HTML (or desired portion thereof)
die($doc->saveHTML());

Documentation

HTML iframe on MDN - https://developer.mozilla.org/en/HTML/Element/iframe
jQuery.ajax() - http://api.jquery.com/jQuery.ajax/
PHP's cURL - http://php.net/manual/en/book.curl.php
curl_setopt (information about using cookies) - http://www.php.net/manual/en/function.curl-setopt.php
PHP's DOMDocument - http://php.net/manual/en/class.domdocument.php
DOMDocument::loadHTMLFile - http://www.php.net/manual/en/domdocument.loadhtmlfile.php
DOMDocument::loadHTML - http://www.php.net/manual/en/domdocument.loadhtml.php
