I am using the Readability API to do this. Their example shows lead_image_url, but I could not fetch it.
Reference: https://www.readability.com/developers/api/parser
Is this the correct way to make a direct request:
It says: {"messages": "The API Key in the form of the 'token' parameter is invalid.", "error": true}
Another try:
<?php
define('TOKEN', "1b830931777ac7c2ac954e9f0d67df437175e66e");
define('API_URL', "https://www.readability.com/api/content/v1/parser?url=%s&token=%s");
function get_image($url) {
// sanitize it so we don't break our api url
$encodedUrl = urlencode($url);
$TOKEN = '1b830931777ac7c2ac954e9f0d67df437175e66e';
$API_URL = 'https://www.readability.com/api/content/v1/parser?url=%s&token=%s';
// $API_URL = 'http://blog.readability.com/2011/02/step-up-be-heard-readability-ideas';
// build our url
$url = sprintf($API_URL, $encodedUrl, $TOKEN);
// call the api
$response = file_get_contents($url);
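if( $response === false ) { // bail out only when the request failed
return false;
}
$json = json_decode($response, true); // true => associative array, so ['lead_image_url'] works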
if(!isset($json['lead_image_url'])) {
return false;
}
return $json['lead_image_url'];
}
Error: Warning: file_get_contents(https://www.readability.com/api/content/v1/parser?url=http%3A%2F%2Fthenwat.com%2Fthenwat%2Finvite%2Findex.php&token=1b830931777ac7c2ac954e9f0d67df437175e66e): failed to open stream: HTTP request failed! HTTP/1.1 403 FORBIDDEN in F:\wamp\www\inviteold\test2.php on line 32
One more:
require 'readability/lib/Readability.inc.php';
$url = 'http://www.nextbigwhat.com';
$html = file_get_contents($url);
$Readability = new Readability($html); // default charset is utf-8
$ReadabilityData = $Readability->getContent();
$image= $ReadabilityData['lead_image_url'];
$title= $ReadabilityData['title']; //This works fine.
$content = $ReadabilityData['word_count'];
echo "$content";
It says: Notice: Undefined index: lead_image_url in F:\wamp\www\inviteold\test2.php on line 13
First, in order to use the REST API that they provide, you need to create an account. Afterwards you can generate your own token to use in the call. The token provided by the examples will not work because it is purposefully invalid; it is there for illustration only.
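As a rough sketch of what the call could look like once you have your own token: the helper names buildParserUrl and extractLeadImage, and the YOUR_API_TOKEN placeholder, are made up for illustration; only the endpoint comes from the question. Note that json_decode() needs true as its second argument if you want to read the result with array syntax.

```php
<?php
// Hypothetical helper: builds the Parser API URL from the question's endpoint.
function buildParserUrl($articleUrl, $token) {
    $endpoint = 'https://www.readability.com/api/content/v1/parser';
    // http_build_query() URL-encodes both parameters for us
    return $endpoint . '?' . http_build_query(array(
        'url'   => $articleUrl,
        'token' => $token,
    ));
}

// Hypothetical helper: pulls lead_image_url out of a JSON response body.
function extractLeadImage($responseBody) {
    // Pass true to get an associative array instead of an object
    $json = json_decode($responseBody, true);
    if (!is_array($json) || empty($json['lead_image_url'])) {
        return false;
    }
    return $json['lead_image_url'];
}

// 'YOUR_API_TOKEN' is a placeholder; use the token from your own account
$requestUrl = buildParserUrl('http://example.com/post', 'YOUR_API_TOKEN');
```

You would then fetch $requestUrl and hand the response body to extractLeadImage().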
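Second, make sure the allow_url_fopen directive in your php.ini file is set to On, because file_get_contents() cannot open URLs without it. Note that allow_url_fopen can only be changed in php.ini or the server configuration; calling ini_set('allow_url_fopen', true); at the top of your page has no effect. If you cannot change your php.ini file (shared hosting solutions), make the request with the cURL extension instead.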
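A minimal sketch of that check, assuming the cURL extension is available as a fallback. The function name fetchUrl is hypothetical and not part of any library.

```php
<?php
// Hypothetical helper: fetch a URL with file_get_contents() when URL wrappers
// are enabled, otherwise fall back to cURL (assumes the cURL extension is loaded).
function fetchUrl($url) {
    // ini_get() returns a string such as "1" or ""; normalise it to a boolean
    if (filter_var(ini_get('allow_url_fopen'), FILTER_VALIDATE_BOOLEAN)) {
        // @ hides the warning; callers check for a false return value instead
        return @file_get_contents($url);
    }
    if (!function_exists('curl_init')) {
        return false; // no way to make the request
    }
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body as a string
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    return curl_exec($ch);                          // false on failure
}
```

Either branch returns the response body as a string, or false when the request fails.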
Lastly, the array returned by getContent() in your third attempt has no lead_image_url key (hence the notice), so you'll need to parse the images yourself by retrieving all image elements from the DOM of the page you fetched. Sometimes there won't be any images, and sometimes there will be. It depends on what page you're pulling from. Additionally, you'll need to resolve relative paths...
Your Code
require 'readability/lib/Readability.inc.php';
$url = 'http://www.nextbigwhat.com';
$html = file_get_contents($url);
$Readability = new Readability($html); // default charset is utf-8
$ReadabilityData = $Readability->getContent();
$image= $ReadabilityData['lead_image_url'];
$title= $ReadabilityData['title']; //This works fine.
$content = $ReadabilityData['word_count'];
echo "$content";
After executing Readability, you can utilize the DOMDocument class to retrieve your images from the contents you pulled. Instantiate a new DOMDocument and load in your HTML. Make sure to use the libxml_use_internal_errors function to suppress the errors the parser raises on most websites. We'll put this in a function to make it easier to use elsewhere if need be.
function sampleDomMedia($html) {
// Suppress validator errors
libxml_use_internal_errors(true);
// New document
$dom = new DOMDocument();
// Populate document
$dom->loadHTML($html);
//[...]
You can now retrieve all image elements from the document you instantiated, and then get their src attribute... like so:
//[...]
// Get image elements
$nodeList = $dom->getElementsByTagName('img');
// Get length
$length = $nodeList->length;
// Initialize array
$images = array();
// Iterate over our nodes
for($i=0;$i<$length;$i++) {
// Get the current node
$node = $nodeList->item($i);
// Retrieve the src attribute
$image = $node->getAttribute('src');
// Push image src into $images array
array_push($images,$image);
}
return $images;
}
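To see the same steps in isolation, here is a small self-contained sketch that runs against an inline HTML string instead of a downloaded page. The markup is made up for illustration.

```php
<?php
// Hypothetical sample markup standing in for a downloaded page
$html = '<html><body><p>Intro<img src="/img/a.png"><img src="b.jpg"></body></html>';

libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();              // discard the collected warnings

// DOMNodeList is traversable, so foreach works as well as the indexed loop
$sources = array();
foreach ($dom->getElementsByTagName('img') as $img) {
    $sources[] = $img->getAttribute('src');
}
print_r($sources);
```

The src values come back exactly as written in the markup, which is why they still need resolving.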
Now you have an array of images that you can present to the user for use. But before you do that, we forgot one more thing... We want to resolve all relative paths so that we always have an absolute path to the image that lives on another site.
To do this, we have to determine the base domain URL, and the relative path to the current page we're working with. We can do so using the parse_url() function provided by PHP. For simplicity's sake, we can throw this into a function.
function getUrls($url) {
// Parse URL
$urlArr = parse_url($url);
// Determine Base URL, with scheme, host, and port
$base = $urlArr['scheme']."://".$urlArr['host'];
if(array_key_exists("port",$urlArr) && $urlArr['port'] != 80) {
$base .= ":".$urlArr['port'];
}
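// parse_url() omits 'path' for a bare domain, so default to "/"
$path = isset($urlArr['path']) ? $urlArr['path'] : '/';
// Truncate the Path using the position of the last forward slash
$relative = $base.substr($path, 0, strrpos($path,"/")+1);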
// Return our two URLs
return array($base, $relative);
}
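For reference, this sketch shows what parse_url() hands back and how the directory part is cut out of the path. The example.com URLs are placeholders. One detail worth knowing: for a bare domain such as the one used in this answer, parse_url() returns no path key at all, so a default is needed before calling substr().

```php
<?php
// example.com URLs are placeholders for illustration
$parts = parse_url('http://example.com:8080/blog/2011/post.html?page=2');
// $parts now has the keys: scheme, host, port, path, query

// Directory of the current page: everything up to and including the last slash
$dir = substr($parts['path'], 0, strrpos($parts['path'], '/') + 1);

// A URL with no path at all has no 'path' key, so fall back to "/"
$bare = parse_url('http://example.com');
$barePath = isset($bare['path']) ? $bare['path'] : '/';
```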
Add an additional parameter to the original sampleDomMedia function, and we can call this function to get our paths. Then we can check the src attribute's value to determine what kind of path it is, and resolve it.
function sampleDomMedia($html, $url) {
// Retrieve our URLs
list($baseUrl, $relativeUrl) = getUrls($url);
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
$nodeList = $dom->getElementsByTagName('img');
$length = $nodeList->length;
$images = array();
for($i=0;$i<$length;$i++) {
$node = $nodeList->item($i);
$image = $node->getAttribute('src');
// Resolve relative paths
if(substr($image,0,2)=="//") { // Missing protocol
$image = "http:".$image;
} else if(substr($image,0,1)=="/") { // Path Relative to Base
$image = $baseUrl.$image;
} else if(substr($image,0,4)!=="http") { // Path relative to the current page
$image = $relativeUrl.$image;
}
array_push($images,$image);
}
return $images;
}
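To make the three branches concrete, here is a sketch that pulls the same checks into a standalone helper and runs them on a few sample values. The name resolveImageUrl and the example.com addresses are assumptions for illustration only.

```php
<?php
// Hypothetical helper mirroring the three branches used in sampleDomMedia()
function resolveImageUrl($src, $baseUrl, $relativeUrl) {
    if (substr($src, 0, 2) === '//') {      // protocol-relative
        return 'http:' . $src;
    }
    if (substr($src, 0, 1) === '/') {       // relative to the site root
        return $baseUrl . $src;
    }
    if (substr($src, 0, 4) !== 'http') {    // relative to the current page
        return $relativeUrl . $src;
    }
    return $src;                            // already absolute
}

$base = 'http://example.com';
$rel  = 'http://example.com/blog/';
$examples = array('//cdn.example.com/a.png', '/img/b.png', 'c.png', 'https://example.org/d.png');
foreach ($examples as $src) {
    echo $src, ' => ', resolveImageUrl($src, $base, $rel), "\n";
}
```

These prefix checks are deliberately simple: they do not collapse ../ segments, and a data: URI would be treated as a page-relative path.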
And last, but certainly not least, we're left with the two previous functions, and this piece of procedural code:
require 'readability/lib/Readability.inc.php';
$url = 'http://www.nextbigwhat.com';
$html = file_get_contents($url);
$Readability = new Readability($html); // default charset is utf-8
$ReadabilityData = $Readability->getContent();
// getContent() returned no 'lead_image_url' key, so collect the images ourselves
$images = sampleDomMedia($html, $url);
$title = $ReadabilityData['title']; //This works fine.
$content = $ReadabilityData['word_count'];
echo "$content";
Also, if you think the contents of the article may have an image inside of it (usually it doesn't), you can pass the contents returned from Readability rather than the $html variable, like so:
$title = $ReadabilityData['title']; //This works fine.
// Assumption: the article HTML is exposed under a 'content' key; check the array your version returns
$content = $ReadabilityData['content'];
$images = sampleDomMedia($content, $url);
I hope that helps.