Fetching URL data with cURL produces unexpected symbols in the result

I sometimes have a problem fetching URL data with cURL, especially when the website content is in another language such as Arabic. My cURL function is:

function file_get_contents_curl($url)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);

    $data = curl_exec($ch);
    $info = curl_getinfo($ch, CURLINFO_CONTENT_TYPE);

    //checking mime types
    if(strstr($info,'text/html')) {
        curl_close($ch);
        return $data;
    } else {
        return false;
    }
}

And this is how I extract the data:

$html = file_get_contents_curl($checkurl);
$grid = '';
if ($html)
{
    $doc = new DOMDocument();
    @$doc->loadHTML($html);
    $nodes = $doc->getElementsByTagName('title');
    @$title = $nodes->item(0)->nodeValue;
    @$metas = $doc->getElementsByTagName('meta');
    for ($i = 0; $i < $metas->length; $i++)
    {
        $meta = $metas->item($i);
        if ($meta->getAttribute('name') == 'description')
            $description = $meta->getAttribute('content');
    }
}

I get all the data correctly from some Arabic websites, such as http://www.emaratalyoum.com/multimedia/videos/2012-04-08-1.474873, but when I give it this YouTube URL, http://www.youtube.com/watch?v=Eyxljw31TtU&feature=g-logo&context=G2c4f841FOAAAAAAAFAA, it shows symbols instead of the text. What setting do I need so that the title and description are shown exactly as they appear on the page?

Introduction

Getting Arabic text can be tricky, but there are some basic steps you need to ensure:

  • Your document must output UTF-8
  • Your DOMDocument must read the input as UTF-8
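
One way to satisfy the second point without depending on how DOMDocument guesses the encoding is to turn every non-ASCII character into a numeric entity before parsing. This is only a sketch: loadUtf8Html() is a hypothetical helper that is not part of the answer's code, and it assumes the fetched markup really is UTF-8 and that the mbstring extension is available.

```php
<?php
// Hypothetical helper: make the markup ASCII-only so that DOMDocument
// cannot misread its encoding. Assumes $html is valid UTF-8.
function loadUtf8Html(string $html): DOMDocument
{
    // Replace every code point from U+0080 upward with a numeric entity.
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    $ascii = mb_encode_numericentity($html, $map, 'UTF-8');

    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // collect parse warnings instead of printing them
    $doc->loadHTML($ascii);
    libxml_clear_errors();

    return $doc;
}

// This file is assumed to be saved as UTF-8, so the literal below is UTF-8.
$doc = loadUtf8Html('<html><head><title>بنات سعوديات</title></head><body></body></html>');
$title = $doc->getElementsByTagName('title')->item(0)->nodeValue;

var_dump($title === 'بنات سعوديات');
```

DOMDocument decodes the entities while parsing and returns node values as UTF-8, so the title comes back unchanged and can be echoed after sending the UTF-8 Content-Type header.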

Problem

YouTube already serves its pages as UTF-8, but the retrieval process applies an additional round of UTF-8 encoding to the text. I'm not sure why this occurs, but a simple utf8_decode() fixes the issue.
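
The effect can be reproduced in isolation. The sketch below simulates the extra encoding pass by treating UTF-8 bytes as ISO-8859-1 and converting them to UTF-8 again, then undoes it. It uses mb_convert_encoding() for the reverse step, which performs the same UTF-8 to ISO-8859-1 conversion as utf8_decode() (deprecated as of PHP 8.2); the sample string and variable names are only illustrative.

```php
<?php
// Original UTF-8 text, as the remote page would send it.
// This file is assumed to be saved as UTF-8.
$original = 'بنات سعوديات';

// Simulate the accidental extra pass: the UTF-8 bytes are treated as
// ISO-8859-1 and converted to UTF-8 a second time.
$doubleEncoded = mb_convert_encoding($original, 'UTF-8', 'ISO-8859-1');

// Converting from UTF-8 back to ISO-8859-1 undoes exactly one pass,
// which is the same conversion utf8_decode() performs.
$repaired = mb_convert_encoding($doubleEncoded, 'ISO-8859-1', 'UTF-8');

var_dump($doubleEncoded === $original); // bool(false)
var_dump($repaired === $original);      // bool(true)
```

Because the reverse step removes exactly one layer, applying it to text that was not double-encoded would damage it, which is why the function below restricts it to the YouTube host.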

Example

header('Content-Type: text/html; charset=UTF-8');
echo displayMeta("http://www.emaratalyoum.com/multimedia/videos/2012-04-08-1.474873");
echo displayMeta("http://www.youtube.com/watch?v=Eyxljw31TtU&feature=g-logo&context=G2c4f841FOAAAAAAAFAA"); 

Output

emaratalyoum.com

التقطت عدسات الكاميرا حارس مرمى ريال مدريد إيكر كاسياس في موقف محرج قبل لحظات من بداية مباراة النادي الملكي مع أبويل القبرصي في ذهاب دور الثمانية لدوري أبطال أوروبا.ففي النفق المؤدي إلى الملعب، قام كاسياس بوضع إصبعه في أنفه، وبعدها قام بمسح يده في وجه أحد

youtube.com

بنات سعوديات: أريد "شايب يدللني ولا شاب يعللني"

Function Used

displayMeta

function displayMeta($checkurl) {
    $html = file_get_contents_curl($checkurl);
    $grid = '';
    if ($html) {
        $doc = new DOMDocument("1.0","UTF-8");
        @$doc->loadHTML($html);
        $nodes = $doc->getElementsByTagName('title');
        $title = $nodes->item(0)->nodeValue;
        $metas = $doc->getElementsByTagName('meta');
        for($i = 0; $i < $metas->length; $i ++) {
            $meta = $metas->item($i);
            if ($meta->getAttribute('name') == 'description') {
                $description = $meta->getAttribute('content');
                if (stripos(parse_url($checkurl, PHP_URL_HOST), "youtube") !== false)
                    return utf8_decode($description);
                else {
                    return $description;
                }
            }
        }
    }
}

file_get_contents_curl

function file_get_contents_curl($url) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);

    $data = curl_exec($ch);
    $info = curl_getinfo($ch, CURLINFO_CONTENT_TYPE);

    // checking mime types
    if (strstr($info, 'text/html')) {
        curl_close($ch);
        return $data;
    } else {
        return false;
    }
}

I believe this will work: apply utf8_decode() to the attribute value.

function file_get_contents_curl($url)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);

    $data = curl_exec($ch);
    $info = curl_getinfo($ch, CURLINFO_CONTENT_TYPE);

    //checking mime types
    if(strstr($info,'text/html')) {
        curl_close($ch);
        return $data;
    } else {
        return false;
    }
}

$html = file_get_contents_curl($checkurl);
$grid = '';
if ($html)
{
    $doc = new DOMDocument();
    @$doc->loadHTML($html);
    $nodes = $doc->getElementsByTagName('title');
    @$title = $nodes->item(0)->nodeValue;
    @$metas = $doc->getElementsByTagName('meta');
    for ($i = 0; $i < $metas->length; $i++)
    {
        $meta = $metas->item($i);
        if ($meta->getAttribute('name') == 'description')
            // Undo the extra layer of UTF-8 encoding on the attribute value.
            $description = utf8_decode($meta->getAttribute('content'));
    }
}

What happens here is that you're discarding the found Content-Type header that cURL returned in your file_get_contents_curl() function; DOMDocument needs that information to understand the character set that was used on the page.

A somewhat ugly but fairly generic hack is to prefix the returned page with a <meta> tag carrying the Content-Type value, character set included, from the response headers:

if (strstr($info, 'text/html')) {
    curl_close($ch);
    return '<meta http-equiv="Content-Type" content="' . $info . '" />' . $data;
}

DOMDocument will accept the misplaced meta tag and do the respective conversions automatically.
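
The effect of the prefix can be checked without a network request. In this sketch $data and $info are hard-coded stand-ins for the cURL body and the CURLINFO_CONTENT_TYPE value, and titleOf() is a hypothetical helper introduced only for the comparison. It assumes the server's Content-Type actually includes a charset parameter; without one the prefix gives DOMDocument nothing to work with.

```php
<?php
// Stand-in for the fetched body: UTF-8 markup with no charset declaration.
// This file is assumed to be saved as UTF-8.
$data = '<html><head><title>بنات سعوديات</title></head><body></body></html>';

// Stand-in for curl_getinfo($ch, CURLINFO_CONTENT_TYPE).
$info = 'text/html; charset=utf-8';

// Hypothetical helper: parse the markup and return the <title> text.
function titleOf(string $html): string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // the misplaced tags trigger parse warnings
    $doc->loadHTML($html);
    libxml_clear_errors();

    return $doc->getElementsByTagName('title')->item(0)->nodeValue;
}

// Without a hint, the parser has to guess the encoding and typically mangles the text.
$plain = titleOf($data);

// With the prefixed <meta> tag, the parser is told the charset up front.
$hinted = titleOf('<meta http-equiv="Content-Type" content="' . $info . '" />' . $data);

var_dump($plain === 'بنات سعوديات');
var_dump($hinted === 'بنات سعوديات');
```

The second comparison should hold. The first depends on how the underlying libxml2 version guesses the encoding when the markup declares none.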