I would like to truncate a very long string that is formatted with HTML elements.
I need the first 500 words, so the function that truncates the string must not count HTML tags such as <p> and <br> as words. The result still has to keep those HTML elements, because it should be formatted the same way as the original, complete text.
What's the best way to truncate my string?
Example:
Original text
> <p><a href="/t/the-huffington-post">The Huffington Post</a> (via <a
> href="/t/daily-mail">Daily Mail</a>) is reporting that <a
> href="/t/misty">Misty</a> has been returned to a high kill shelter for
> farting too much! She appeared on Greenville County Pet Rescue’s
> “urgent” list, which means if she doesn’t get readopted, she will be
> euthanized!</p>
I need the first n words (n=10)
> <p><a href="/t/the-huffington-post">The Huffington Post</a> (via <a
> href="/t/daily-mail">Daily Mail</a>) is reporting that.. </p>
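A brute-force method is to split the string on blanks and iterate over the tokens. Tags are always written to the output, while non-tag tokens are counted and written only until the maximum is reached. Note that a tag containing blanks, such as <a href="...">, gets split into several tokens whose pieces are counted as words, so this works best for tags without attributes. Something along these lines:
    $string = "your string here";
    $output = "";
    $count = 0;   // non-tag tokens written so far
    $max = 10;    // maximum number of words to keep
    $tokens = preg_split('/ /', $string);
    foreach ($tokens as $token)
    {
        if (preg_match('/<.*?>/', $token)) {
            // token contains a complete tag: always output it, never count it
            $output .= "$token ";
        } elseif ($count < $max) {
            // plain word: output and count it until the limit is reached
            $output .= "$token ";
            $count += 1;
        }
    }
    print $output;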
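The same idea can be made tag-aware by splitting on whole tags instead of on blanks. The following is only a sketch under some assumptions: `truncateHtmlWords` and its parameters are hypothetical names, a "word" is any whitespace-separated run of visible text (so punctuation that a tag separates from its word counts on its own), and the markup is assumed to have no `>` inside attribute values or comments.

```php
<?php
// Hypothetical helper (not from the answers above): keep the first $maxWords
// visible words of $html, leave tags intact, and close any tag left open.
function truncateHtmlWords(string $html, int $maxWords, string $padding = '..'): string
{
    // Elements that never get a closing tag (assumed list, not exhaustive).
    $void = ['br', 'hr', 'img', 'input', 'meta', 'link'];

    // Alternating text and tag tokens; the capture group keeps the tags.
    $tokens = preg_split('/(<[^>]*>)/', $html, -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);

    $output = '';
    $count = 0;          // visible words written so far
    $open = [];          // names of tags opened but not yet closed
    $truncated = false;

    foreach ($tokens as $token) {
        $isTag = preg_match('/^<[^>]*>$/', $token) === 1;

        if ($count >= $maxWords) {
            // Limit reached: skip tags and blanks, stop at the next visible text.
            if (!$isTag && trim($token) !== '') {
                $truncated = true;
                break;
            }
            continue;
        }

        if ($isTag) {
            $output .= $token;
            if (preg_match('/^<(\/?)([a-z][a-z0-9]*)/i', $token, $m)) {
                $name = strtolower($m[2]);
                if ($m[1] === '/') {
                    // Closing tag: forget the most recent matching opening tag.
                    $keys = array_keys($open, $name);
                    if ($keys) {
                        unset($open[end($keys)]);
                    }
                } elseif (!in_array($name, $void, true) && substr($token, -2) !== '/>') {
                    $open[] = $name;
                }
            }
            continue;
        }

        // Text token: split into words and the whitespace between them.
        $pieces = preg_split('/(\s+)/', $token, -1,
            PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
        foreach ($pieces as $piece) {
            if (trim($piece) === '') {
                if ($count < $maxWords) {
                    $output .= $piece;   // keep original spacing before the limit
                }
                continue;
            }
            if ($count >= $maxWords) {
                $truncated = true;       // more words follow in this text token
                break 2;
            }
            $output .= $piece;
            $count++;
        }
    }

    if ($truncated) {
        $output .= $padding;
    }
    // Close whatever is still open, innermost first.
    foreach (array_reverse($open) as $name) {
        $output .= "</$name>";
    }
    return $output;
}

echo truncateHtmlWords('<p>one <b>two three</b> four five</p>', 3);
// prints: <p>one <b>two three..</b></p>
```

Because tags are never split, attributes such as `href="/t/daily-mail"` are not counted as words, and the padding ends up inside the innermost element that was still open at the cut.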
Some Googling turns up an article by Chirp Internet with a function like this:
    // Original PHP code by Chirp Internet: www.chirp.com.au
    // Please acknowledge use of this code by including this header.
    function restoreTags($input)
    {
        $opened = array();
        // loop through opened and closed tags in order
        if (preg_match_all("/<(\/?[a-z]+)>?/i", $input, $matches)) {
            foreach ($matches[1] as $tag) {
                if (preg_match("/^[a-z]+$/i", $tag, $regs)) {
                    // a tag has been opened (<br> never needs closing)
                    if (strtolower($regs[0]) != 'br') $opened[] = $regs[0];
                } elseif (preg_match("/^\/([a-z]+)$/i", $tag, $regs)) {
                    // a tag has been closed: forget its most recent opening
                    // (array_pop() needs a real variable, not a function result)
                    $keys = array_keys($opened, $regs[1]);
                    if ($keys) unset($opened[array_pop($keys)]);
                }
            }
        }
        // close tags that are still open
        if ($opened) {
            $tagstoclose = array_reverse($opened);
            foreach ($tagstoclose as $tag) $input .= "</$tag>";
        }
        return $input;
    }
When you combine it with another function mentioned in the article:
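    function truncateWords($input, $numwords, $padding="")
    {
        // take the first token, splitting on spaces and newlines
        $output = strtok($input, " \n");
        // append the following tokens until $numwords have been collected
        while (--$numwords > 0) $output .= " " . strtok(" \n");
        // add the padding only if something was actually cut off
        if ($output != $input) $output .= $padding;
        return $output;
    }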
Then you can just achieve what you're looking for by doing this:
$originalText = '...'; // some original text in HTML format
$output = truncateWords($originalText, 500); // This truncates to 500 words (ish...)
$output = restoreTags($output); // This fixes any open tags
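Since `truncateWords` splits on blanks without looking at the markup, pieces of tags count towards the limit, which is where the "(ish)" comes from. To check how many visible words a result really contains, something like the following could be used. This is only a sketch: `countVisibleWords` is a hypothetical name, and it treats every tag as a word boundary, so `Mail</a>)` counts as two words.

```php
<?php
// Hypothetical helper: count whitespace-separated words outside of tags.
function countVisibleWords(string $html): int
{
    // Replace each tag with a space so "</p><p>" does not glue two words together.
    $text = preg_replace('/<[^>]*>/', ' ', $html);
    // Split on runs of whitespace and drop empty pieces.
    $words = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);
    return count($words);
}

echo countVisibleWords('<p>one <b>two</b></p><p>three</p>'); // prints 3
```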