Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
When scraping a page, I would like the images included with the text.

Currently I'm only able to scrape the text. For example, as a test script, I scraped Google's homepage and it only displayed the text, with no images (e.g., the Google logo).

I also created another test script using Redbox, with the same result. Here's my attempt at scraping the Redbox 'Find a Movie' page:

<?php

$url = 'http://www.redbox.com/Titles/AvailableTitles.aspx';

// Fetch the page and return it as a string instead of printing it directly.
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$result = curl_exec($ch);
curl_close($ch);
echo $result;

?>

The page rendered broken: missing box art, missing scripts, etc.

Using the 'Net' tool in Firefox's Firebug extension (which lets me check headers and file paths), I discovered that Redbox's images and CSS files were not loaded (404 Not Found). The reason: my browser was looking for Redbox's images and CSS files in the wrong place.

Apparently the Redbox images and CSS files are referenced relative to the domain, and likewise for Google's logo. So if my script above is using its own domain as the base for those file paths, how can I change this?

I tried altering the Host and Referer request headers with the script below, and I've googled extensively, but no luck.

My fix attempt:

<?php

$url = 'http://www.redbox.com/Titles/AvailableTitles.aspx';
$referer = 'http://www.redbox.com/Titles/AvailableTitles.aspx';

// Same fetch as before, but with the Host and Referer headers set
// to match the origin site.
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_HTTPHEADER, array('Host: www.redbox.com'));
curl_setopt($ch, CURLOPT_REFERER, $referer);
curl_setopt($ch, CURLOPT_NOBODY, false);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$result = curl_exec($ch);
curl_close($ch);
echo $result;

?>

I hope I made sense; if not, let me know and I'll try to explain it better. Any help would be great! Thanks.

UPDATE


Thanks to everyone (especially Marc and Wyatt); your answers helped me figure out a method to implement.
I was able to test successfully by following the steps below:
  1. Download the page and its requisites via Wget.
  2. Add <base href="..." /> to downloaded page's header.
  3. Upload the revised downloaded page and its original requisites via Wput to a temporary server.
  4. Test uploaded page on temporary server via browser
  5. If the uploaded page is not displayed properly, some of the requisites may still be missing (CSS, JS, etc.). Identify which ones via a tool that lets you view header responses (e.g., the 'Net' tool from Firefox's Firebug add-on). After locating the missing requisites, visit the original page that the uploaded page is based on, take note of the proper locations of the missing requisites, revise the downloaded page from step 1 to use those locations, and begin again at step 3. Otherwise, if the page renders properly, success!

Note: When revising the downloaded page I manually edited the code; I'm sure you could automate the process with a regex or a parsing library applied to cURL's response.
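As a rough sketch of automating step 2, one could inject the `<base>` tag with a regex right after fetching the page with cURL. This is a minimal, hypothetical example (untested against Redbox's actual markup); it assumes the page has a normal `<head>` tag:

```php
<?php

// Fetch the page, same cURL setup as in the question.
$url = 'http://www.redbox.com/Titles/AvailableTitles.aspx';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($ch);
curl_close($ch);

// Derive the base URL from the page's own address.
$parts = parse_url($url);
$base  = $parts['scheme'] . '://' . $parts['host'] . '/';

// Insert <base href="..."> immediately after the opening <head> tag
// (first match only), so relative paths resolve against the origin site.
$html = preg_replace('/<head([^>]*)>/i',
                     '<head$1><base href="' . $base . '" />',
                     $html, 1);

echo $html;

?>
```

A DOM parser would be more robust than a regex here, but for a one-off `<head>` insertion the regex keeps the sketch short.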


1 Answer

When you scrape a URL, you're retrieving a single file, be it HTML, an image, CSS, JavaScript, etc. The document you see displayed in a browser is almost always the result of MULTIPLE files: the original HTML, each separate image, each CSS file, each JavaScript file. You enter only a single address, but fully building/displaying the page requires many HTTP requests.

When you scrape the Google home page via cURL and output that HTML to the user, there's no way for the user's browser to know that it's actually viewing Google-sourced HTML; it appears as if the HTML came from your server, and your server only. The browser will happily suck in this HTML, find the image references, and request the images from YOUR server, not Google's. Since you're not hosting any of Google's images, your server responds with a proper 404 "Not Found" error.

To make the page work properly, you've got a few choices. The easiest is to parse the HTML of the page and insert a <base href="..." /> tag into the document's head block. This tells any viewing browser that "relative" links within the document should be fetched from this 'base' source (e.g., google.com).

A harder option is to parse the document and rewrite any references to external files (images, CSS, JS, etc.) to use the URL of the originating server, so the user's browser goes to the original site and fetches from there.
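A sketch of that rewriting approach, using PHP's built-in DOMDocument. The function name and the choice of tags/attributes here are illustrative, not a complete list; it also only handles paths relative to the domain root, which matches the questioner's observation:

```php
<?php

// Rewrite relative src/href attributes so they point at the origin site.
// $html is the scraped markup; $base is the origin, e.g. 'http://www.redbox.com'.
function absolutize_references($html, $base)
{
    $doc = new DOMDocument();
    @$doc->loadHTML($html);   // '@' suppresses warnings on sloppy real-world markup

    // Tag => attribute pairs to rewrite.
    $targets = array('img' => 'src', 'link' => 'href', 'script' => 'src');

    foreach ($targets as $tag => $attr) {
        foreach ($doc->getElementsByTagName($tag) as $node) {
            $val = $node->getAttribute($attr);
            // Only touch relative paths, not already-absolute URLs.
            if ($val !== '' && !preg_match('#^https?://#i', $val)) {
                $node->setAttribute($attr, rtrim($base, '/') . '/' . ltrim($val, '/'));
            }
        }
    }
    return $doc->saveHTML();
}

?>
```

CSS `url(...)` references and paths relative to the page (rather than the domain root) would need extra handling, which is part of why this option is harder than the `<base>` tag.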

The hardest option is to essentially set up a proxy server: if a request comes in for a file that doesn't exist on your server, try to fetch the corresponding file from Google via cURL and output it to the user.
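The bare bones of that proxy idea might look like the following. This is a deliberately minimal, hypothetical sketch; a real proxy would also need to forward cookies, query strings, caching headers, and so on:

```php
<?php

// Minimal pass-through proxy sketch: re-fetch whatever path was
// requested from the origin site and echo it back unchanged.
$origin = 'http://www.redbox.com';
$path   = isset($_SERVER['REQUEST_URI']) ? $_SERVER['REQUEST_URI'] : '/';

$ch = curl_init($origin . $path);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
$body = curl_exec($ch);
$type = curl_getinfo($ch, CURLINFO_CONTENT_TYPE);
curl_close($ch);

// Pass the origin's content type through so images and CSS
// are interpreted correctly by the browser.
if ($type) {
    header('Content-Type: ' . $type);
}
echo $body;

?>
```

This would typically be wired up via a rewrite rule that routes 404s on your server to the script.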

