I use DOMDocument to load HTML from a database like this:
$doc = new DOMDocument();
@$doc->loadHTML($data);
$doc->encoding = 'utf-8';
$doc->saveHTML();
Then I get the body text like this:
$bodyNodes = $doc->getElementsByTagName("body");
$words = htmlspecialchars($bodyNodes->item(0)->textContent);
The resulting words include everything inside <body>, including the contents of <script> elements. How do I remove those and keep only the real text content?
You have to visit every node and return its text. If a node contains other nodes, visit those too.
This can be done with this basic recursive algorithm:
extractNode:
    if the node is a text node or a CDATA node, return its text
    if it is an element node, a document node, or a document fragment node:
        if it is a script node, return an empty string
        return the concatenation of the results of calling extractNode on all of its child nodes
    for everything else, return nothing
Implementation:
function extractText($node) {
    // Text and CDATA nodes carry the actual text.
    if (XML_TEXT_NODE === $node->nodeType || XML_CDATA_SECTION_NODE === $node->nodeType) {
        return $node->nodeValue;
    } elseif (XML_ELEMENT_NODE === $node->nodeType || XML_DOCUMENT_NODE === $node->nodeType || XML_DOCUMENT_FRAG_NODE === $node->nodeType) {
        // Skip script elements entirely.
        if ('script' === $node->nodeName) return '';
        $text = '';
        foreach ($node->childNodes as $childNode) {
            $text .= extractText($childNode);
        }
        return $text;
    }
    // Comments and any other node types contribute no text.
    return '';
}
This will return the textContent of the given $node, ignoring script tags and comments.
$words = htmlspecialchars(extractText($bodyNodes->item(0)));
Try it here: http://codepad.org/CS3nMp7U
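If other elements such as <style> should be dropped too, the same recursion can take a list of tag names to skip. This is only a sketch: extractVisibleText and its $skip parameter are names made up for illustration, and the sample markup is an assumption, not the asker's data.

```php
<?php
// Sketch: recursive text extraction with a configurable skip list.
// extractVisibleText and $skip are hypothetical names, not part of the DOM API.
function extractVisibleText(DOMNode $node, array $skip = ['script', 'style']): string
{
    if ($node->nodeType === XML_TEXT_NODE || $node->nodeType === XML_CDATA_SECTION_NODE) {
        return (string) $node->nodeValue;
    }
    if ($node->nodeType !== XML_ELEMENT_NODE) {
        return ''; // comments and other node types contribute no text
    }
    if (in_array(strtolower($node->nodeName), $skip, true)) {
        return ''; // element is on the skip list
    }
    $text = '';
    foreach ($node->childNodes as $child) {
        $text .= extractVisibleText($child, $skip);
    }
    return $text;
}

// Collect parser warnings internally instead of silencing them with @.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML('<html><body><p>Hello <b>world</b></p><script>var x = 1;</script><style>p{}</style><!-- note --></body></html>');
libxml_clear_errors();

$body = $doc->getElementsByTagName('body')->item(0);
echo htmlspecialchars(trim(extractVisibleText($body))), "\n"; // prints "Hello world"
```

Passing a different array as the second argument changes which elements are ignored, for example ['script', 'style', 'noscript'].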
You can use XPath for this.
Borrowing the HTML arnaud used for his example above:
$html = <<< HTML
<p>
test<span>foo<b>bar</b>
</p>
<script>
ignored
</script>
<!-- comment is ignored -->
<p>test</p>
HTML;
You simply query all text nodes that are not descendants of a script element and are not whitespace-only. Also set preserveWhiteSpace to false so that whitespace used only for formatting is not considered.
$dom = new DOMDocument;
$dom->preserveWhiteSpace = false;
$dom->loadHTML($html);
$xp = new DOMXPath($dom);
$nodes = $xp->query('/html/body//text()[
    not(ancestor::script) and
    not(normalize-space(.) = "")
]');
foreach ($nodes as $node) {
    var_dump($node->textContent);
}
This will output:
string(10) "
test"
string(3) "foo"
string(3) "bar"
string(4) "test"
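If a single string is wanted, as in the question, the matched text nodes can be joined after collapsing their whitespace. The following is a minimal sketch under that assumption; the sample markup and the choice of a single space as separator are illustrative, not part of the original answer.

```php
<?php
// Sample markup chosen for illustration only.
$html = '<html><body><p>test <span>foo</span><b>bar</b></p><script>ignored</script><!-- c --><p>test</p></body></html>';

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTML($html);
libxml_clear_errors();

$xp = new DOMXPath($dom);
$nodes = $xp->query('/html/body//text()[not(ancestor::script) and not(normalize-space(.) = "")]');

$parts = [];
foreach ($nodes as $node) {
    // Collapse runs of whitespace inside each text node, then trim the ends.
    $parts[] = trim(preg_replace('/\s+/', ' ', $node->nodeValue));
}

// Join with a single space and escape for HTML output.
$words = htmlspecialchars(implode(' ', $parts));
echo $words, "\n"; // prints "test foo bar test"
```

Joining with a space keeps words from adjacent elements apart; joining with an empty string would instead reproduce the raw concatenation that textContent gives.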