Get specific elements from a page

I'm trying to pull some data from my website. It should be pretty simple, but I can't find any good examples or docs, so I'm having a tough time. I'm trying to make an API so my friends can use my blog. Let's assume I have a website at http://www.sample.com, and the HTML source for that website is:

<div class="container">
   <a href="/mywebsiteblogpost/">
      <h2 class="title">im the best</h2>
   </a>
   <span class="author">Josue Espinosa</span>
   <div class="thumb">
      <img src="http://www.sample.com/imgsrc" alt="">
      <span class="category">sports</span>
   </div>
   <p>preview text</p>
   <a class="more" href="/mywebsiteblogpost/">full text...</a>
</div>

From all of .container's children, I want to get: the href value of the first a child, the text of the elements with class title and author, the src of the img inside .thumb, and the text of the element with class category.

I started with the anchor's href, but I didn't even get that far. I thought $title would contain the href value of the first anchor tag inside the container, but it doesn't work.

$text = file_get_contents('http://www.sample.com');
$doc = new DOMDocument('1.0');
$doc->loadHTML($text);
foreach($doc->getElementsByTagName('div') AS $div) {
    $class = $div->getAttribute('class');
    if(strpos($class, 'container') !== FALSE) {
        // title doesnt retrieve the href value of title :(
        $title = 'TITLE'.$div->getElementsByTagName('a')->getAttribute('href').'<br>';
        //this echos all the text in all of the children of $div
        echo $div->textContent.'<br>';
    }
}

Can anyone explain why please?

The culprit is $div->getElementsByTagName('a')->getAttribute('href'). The first part, $div->getElementsByTagName('a'), retrieves a list of elements (a DOMNodeList), not a single element, so the chained ->getAttribute('href') fails: the list itself has no getAttribute() method.

To fix this, iterate over the list just as you do with the div tags:

foreach($div->getElementsByTagName('a') as $a) {
  $href = $a->getAttribute('href');
  if ($href) echo "TITLE$href<br>";
}
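
The same loop-and-check pattern can be extended to the other fields in the question (author, category). Below is a minimal sketch, not a definitive implementation: firstByClass() is a hypothetical helper introduced here for illustration, and the markup is a shortened copy of the question's sample, inlined so the snippet runs without fetching anything.

```php
<?php
// Hypothetical helper (not part of the DOM extension): returns the first
// descendant of $parent with tag $tag whose class list contains $class.
function firstByClass(DOMElement $parent, string $tag, string $class): ?DOMElement
{
    foreach ($parent->getElementsByTagName($tag) as $el) {
        // Split the class attribute so "author" does not match "coauthor".
        $classes = preg_split('/\s+/', trim($el->getAttribute('class')));
        if (in_array($class, $classes, true)) {
            return $el;
        }
    }
    return null;
}

// Shortened copy of the sample markup from the question.
$html = '<div class="container"><span class="author">Josue Espinosa</span>'
      . '<div class="thumb"><img src="http://www.sample.com/imgsrc" alt="">'
      . '<span class="category">sports</span></div></div>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$div = $doc->getElementsByTagName('div')->item(0);

$author   = firstByClass($div, 'span', 'author');
$category = firstByClass($div, 'span', 'category');

// Guard against null in case an element is missing from the page.
echo ($author ? $author->textContent : '') . ' / '
   . ($category ? $category->textContent : '') . "\n";
```

Returning null instead of throwing keeps the caller in charge of what a missing element means for a given page.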

OK, so first:

$div->getElementsByTagName('a')

returns a DOMNodeList object (http://php.net/manual/en/class.domnodelist.php). You need to take the first item from that list, for example with ->item(0), before you can read the attribute.
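
Taking the first item from the list might look like the sketch below. The markup is a shortened copy of the question's sample, inlined as an assumption so the snippet runs on its own; in the real script it would come from file_get_contents().

```php
<?php
// Shortened copy of the sample markup from the question.
$html = '<div class="container">'
      . '<a href="/mywebsiteblogpost/"><h2 class="title">im the best</h2></a>'
      . '<a class="more" href="/mywebsiteblogpost/">full text...</a>'
      . '</div>';

$doc = new DOMDocument();
$doc->loadHTML($html);

$div     = $doc->getElementsByTagName('div')->item(0);
$anchors = $div->getElementsByTagName('a');   // a DOMNodeList, not a single element

// item(0) returns the first node, or null when the list is empty.
$first = $anchors->item(0);
$href  = $first !== null ? $first->getAttribute('href') : '';

echo 'TITLE' . $href . '<br>';
```

The null check matters because a container without any anchor would otherwise trigger an error on getAttribute().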

Second:

$div->textContent

Doesn't this do what is intended, i.e. show all the text content in $div?

You may be better off looking at XPath queries (http://php.net/manual/en/class.domxpath.php) for this type of DOM searching.
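
As a rough illustration of the XPath route, the sketch below pulls every field the question asks for. It assumes the sample markup from the question (inlined here instead of being fetched), and the array keys and the $text helper are made-up names for illustration, not part of any API.

```php
<?php
// Sample markup from the question, inlined so the sketch runs without a network call.
$html = <<<'HTML'
<div class="container">
  <a href="/mywebsiteblogpost/"><h2 class="title">im the best</h2></a>
  <span class="author">Josue Espinosa</span>
  <div class="thumb"><img src="http://www.sample.com/imgsrc" alt="">
    <span class="category">sports</span>
  </div>
  <p>preview text</p>
  <a class="more" href="/mywebsiteblogpost/">full text...</a>
</div>
HTML;

libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Hypothetical helper: evaluate $expr relative to $ctx and return the first
// match as trimmed text ('' when nothing matches).
$text = function (string $expr, DOMNode $ctx) use ($xpath): string {
    return trim($xpath->evaluate("string($expr)", $ctx));
};

$posts = [];
// Match divs whose class list contains the whole word "container".
$query = "//div[contains(concat(' ', normalize-space(@class), ' '), ' container ')]";
foreach ($xpath->query($query) as $container) {
    $posts[] = [
        'href'     => $text(".//a/@href", $container),   // first anchor in document order
        'title'    => $text(".//*[@class='title']", $container),
        'author'   => $text(".//*[@class='author']", $container),
        'image'    => $text(".//div[@class='thumb']//img/@src", $container),
        'category' => $text(".//*[@class='category']", $container),
    ];
}

echo json_encode($posts, JSON_PRETTY_PRINT | JSON_UNESCAPED_SLASHES), "\n";
```

Wrapping each expression in string() makes XPath return the text of the first matching node, which avoids a separate item(0) and null check per field.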

I made some corrections to the PHP code you posted; maybe it can help you keep going:

$text = file_get_contents('http://www.sample.com');
$doc = new DOMDocument('1.0');
$doc->loadHTML($text);
foreach($doc->getElementsByTagName('div') AS $div) 
{
    $class = $div->getAttribute('class');
    // _($class);
    if(strpos($class, 'container') !== FALSE) 
    {
        // getElementsByTagName() returns a list, so take its first item (null if there is none)
        $a = $div->getElementsByTagName('a')->item(0);
        if ($a !== null) 
        {
            $title = 'TITLE'. $a->getAttribute('href').'<br>';
            echo $title;
        }
        //this echos all the text in all of the children of $div
        echo $div->textContent.'<br>';
    }
}