There are a number of questions on SO about removing whitespace, usually answered with preg_replace('/[\s]{2,}/', '', $string) or a similar answer that matches a run of two or more whitespace characters and either removes it or replaces it with a single character.
This gets more complicated when certain whitespace duplication is allowed (e.g. text blocks where both one line break and two line breaks are allowed and meaningful), and more so when different whitespace characters are combined (\r, \n).
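To make the problem concrete, here is a minimal sketch of what the usual answers do to text that contains a paragraph break. The sample string is made up for illustration; the first pattern is the one quoted above, and the second is the common "replace with one space" variant.

```php
<?php
// Hypothetical sample: two paragraphs separated by a blank line,
// with a doubled space inside the first one.
$sample = "First  paragraph.\n\nSecond paragraph.";

// Removing every run of 2+ whitespace characters joins words and paragraphs.
$removed = preg_replace('/[\s]{2,}/', '', $sample);

// Replacing every run of whitespace with one space keeps the words apart,
// but the paragraph break is still lost.
$collapsed = preg_replace('/\s+/', ' ', $sample);

echo $removed, "\n";   // Firstparagraph.Second paragraph.
echo $collapsed, "\n"; // First paragraph. Second paragraph.
```

Neither result keeps the blank line between the paragraphs, which is the part the rest of this question is about.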
Here is some example text that, whilst messy, covers the kind of input I think you could end up having to present in a reasonable manner (e.g. user input that was previously formatted with HTML which has since been stripped away):
$text = "
Dear Miss Test McTestFace,
We have received your customer support request about:
\tA bug on our website
\t
We will be in touch by :
\tNext Wednesday.
Thank you for your custom;
\t
If you have further questions please feel free to email us.
Sincerely
Customer service team
";
If our target was to have it in the format:
Dear Miss Test McTestFace,
We have received your customer support request about: A bug on our website
We will be in touch by : Next Wednesday.
Thank you for your custom;
If you have further questions please feel free to email us.
Sincerely
Customer service team
How would we achieve this - simple regex, more complex iteration or are there already libraries that can do this?
Also, are there ways we could make the test case more complex and thus arrive at a more robust overall algorithm?
For my own part I chose to attempt an iterative algorithm based on the idea that if we know the current context (are we in a paragraph, or in a series of line breaks/spaces?) we can make better decisions.
I chose to ignore the problem of tabs and simply stripped them out; I would be interested to see how they'd fit into the assumptions.
function strip_whitespace($string){
    $string = trim($string);
    // Normalise two-character line endings to a single "\n"
    $string = str_replace(["\r\n", "\n\r"], "\n", $string);
    // These three could be done as one, but splitting out
    // is easier to read and modify/play with
    $string = str_replace("\r", "\n", $string);
    $string = str_replace("\n", "\n", $string);
    $string = str_replace("\t", '', $string);
    $string_arr = str_split($string);
    $new_chars = [];
    // Number of line breaks emitted in the current run of whitespace
    $prev_char_return = 0;
    $prev_char_space = $had_space_recently = false;
    foreach ($string_arr as $char){
        switch ($char){
            case ' ':
                // Drop spaces that follow a line break or another space
                if ($prev_char_return || $prev_char_space){
                    continue 2;
                }
                $prev_char_space = true;
                $prev_char_return = 0;
                break;
            case "\n":
            case "\r":
                // Allow at most two line breaks in a row, and only one
                // when the first line break directly followed a space
                if ($prev_char_return>1 || $had_space_recently){
                    continue 2;
                }
                if ($prev_char_space){
                    $had_space_recently = true;
                }
                $prev_char_return += 1;
                $prev_char_space = false;
                break;
            default:
                $prev_char_space = $had_space_recently = false;
                $prev_char_return = 0;
        }
        $new_chars[] = $char;
    }
    $return = implode('', $new_chars);
    // Shouldn't be necessary as we trimmed to start, but may as well
    $return = trim($return);
    return $return;
}
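One assumption worth testing is the tab handling: deleting tabs outright joins two words that were separated only by a tab. A minimal sketch of the difference, using a made-up sample string (converting tabs to spaces is just one possible alternative, not something the function above does):

```php
<?php
// Hypothetical sample: two fields separated only by a tab.
$sample = "Name:\tTest McTestFace";

// Deleting tabs, as strip_whitespace() does, joins the two fields.
$deleted = str_replace("\t", '', $sample);

// Turning each tab into a space keeps them apart; a later pass can
// then collapse any repeated spaces.
$spaced = str_replace("\t", ' ', $sample);

echo $deleted, "\n"; // Name:Test McTestFace
echo $spaced, "\n";  // Name: Test McTestFace
```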
I'm still interested to see other ideas, and especially any text whose obvious interpretation for a function of this type would differ from what this function produces.
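Based on the example (and not looking at your code), it looks like the rule is: any run of whitespace containing two or more newlines is a paragraph separator and becomes a blank line, and any other run of whitespace becomes a single space.
If so, then one approach would be to: replace each paragraph separator with a placeholder, collapse every remaining run of whitespace into a single space, and then replace each placeholder with a blank line.
E.g.: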
$text = preg_replace(
    array('/\s*\n\s*\n\s*/', '/\s+/', '/<PARAGRAPH-SEP>/'),
    array('<PARAGRAPH-SEP>', ' ', "\n\n"),
    trim($text)
);
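To see the three replacements working together, here is a small self-contained sketch. The wrapper name normalize_paragraphs() and the sample string are hypothetical, and it assumes line endings are already plain "\n" and that the input never contains the literal text <PARAGRAPH-SEP>.

```php
<?php
// Hypothetical wrapper around the three-step replacement.
function normalize_paragraphs(string $text): string
{
    return preg_replace(
        array('/\s*\n\s*\n\s*/', '/\s+/', '/<PARAGRAPH-SEP>/'),
        array('<PARAGRAPH-SEP>', ' ', "\n\n"),
        trim($text)
    );
}

// Made-up sample: a tab-indented continuation line, then a messy blank line.
$sample = "Request about:\n\tA bug on our website\n \t\n\nWe will be in touch.\n";

echo normalize_paragraphs($sample);
// Request about: A bug on our website
//
// We will be in touch.
```

The patterns are applied in order, so the placeholder has to be free of whitespace, otherwise the second pattern would alter it.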
If the rule is more complicated, then it might be better to use preg_replace_callback, e.g.:
$text = preg_replace_callback('/\s+/', 'handle_whitespace', trim($text));
function handle_whitespace($matches)
{
    $whitespace = $matches[0];
    if (substr_count($whitespace, "\n") >= 2)
    {
        // paragraph-separator: replace with blank line
        return "\n\n";
    }
    else
    {
        // everything else: replace with single space character
        return " ";
    }
}
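As an illustration of a "more complicated" rule, the callback could also keep single line breaks, which the question says may be relevant. This is only a sketch under that assumption: collapse_whitespace() is a hypothetical name, the sample is made up, and line endings are assumed to be "\n" already.

```php
<?php
// Hypothetical extension: classify each whitespace run by the number
// of newlines it contains.
function collapse_whitespace(string $text): string
{
    return preg_replace_callback(
        '/\s+/',
        function (array $matches): string {
            $newlines = substr_count($matches[0], "\n");
            if ($newlines >= 2) {
                return "\n\n"; // paragraph separator: blank line
            }
            if ($newlines === 1) {
                return "\n";   // assumed rule: keep a single line break
            }
            return ' ';        // spaces and tabs only: single space
        },
        trim($text)
    );
}

$sample = "Sincerely\n  Customer service team\n\n\nP.S.\tThanks";
echo collapse_whitespace($sample);
// Sincerely
// Customer service team
//
// P.S. Thanks
```

Note that under this rule a line break followed by a tab stays a line break, so "about:" and "A bug on our website" from the example would remain on separate lines; whether that is wanted depends on how the target format is meant to be read.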