Complex PHP whitespace removal

There are a number of questions on SO about removing whitespace, usually answered with preg_replace('/[\s]{2,}/', '', $string) or a similar call that takes a run of two or more whitespace characters and either removes it or replaces it with a single character.

This gets more complicated when some whitespace duplication is allowed (e.g. text blocks where a single line break and a double line break are both allowed and meaningful), and more so when different whitespace characters are combined (\n, \r).
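To make the problem concrete, here is a small sketch of the usual one-liner applied to a made-up sample string (the string is only an illustration, not taken from real input). Every run of whitespace collapses to one space, so the paragraph break disappears along with the stray spaces and the tab:

```php
<?php
// Made-up sample: a double space, a paragraph break and a tab.
$string = "First  paragraph.\n\nSecond\tparagraph.";

// The common approach: any run of whitespace becomes a single space.
$collapsed = preg_replace('/\s+/', ' ', $string);

echo $collapsed; // First paragraph. Second paragraph.
```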

Here is some example text that, whilst messy, covers what I think you could end up trying to present in a reasonable manner (e.g. user input that was previously formatted with HTML, which has since been stripped away):

$text = "
Dear Miss           Test McTestFace,
  
 We  have received your customer support request about:
 \tA bug on our website
 \t 
 
 
 We will be in touch by : 
\tNext Wednesday. 
   
   
     Thank you for your custom; 
     \t     
       If you have further questions please feel free to email us. 
     

     
     Sincerely 
 
    Customer service team 
 
";

If our target was to have it in the format:

Dear Miss Test McTestFace,

We have received your customer support request about: A bug on our website

We will be in touch by : Next Wednesday.

Thank you for your custom;

If you have further questions please feel free to email us.

Sincerely

Customer service team

How would we achieve this - simple regex, more complex iteration or are there already libraries that can do this?

Also, are there ways we could make the test case more complex and thus arrive at a more robust overall algorithm?

For my own part I chose to attempt an iterative algorithm based on the idea that if we know the current context (are we in a paragraph, or in a series of line breaks/spaces?) we can make better decisions.

I chose to ignore the problem of tabs in this case and would be interested to see how they'd fit into the assumptions - in this case I simply stripped them out.

function strip_whitespace($string){
    $string = trim($string);
    $string = str_replace(["\r\n", "\n\r"], "\n", $string);

    // These three could be done as one, but splitting out
    // is easier to read and modify/play with
    $string = str_replace("\r", "\n", $string);
    $string = str_replace(" \n", "\n", $string);
    $string = str_replace("\t", '', $string);

    $string_arr = str_split($string);
    $new_chars = [];

    $prev_char_return = 0;
    $prev_char_space = $had_space_recently = false;
    foreach ($string_arr as $char){
        switch ($char){
            case ' ':
                if ($prev_char_return || $prev_char_space){
                    continue 2;
                }
                $prev_char_space = true;
                $prev_char_return = 0;
            break;
            case "\n":
            case "\r":
                if ($prev_char_return>1 || $had_space_recently){
                    continue 2;
                }
                if ($prev_char_space){
                    $had_space_recently = true;
                }
                $prev_char_return += 1;
                $prev_char_space = false;
            break;
            default:
                $prev_char_space = $had_space_recently = false;
                $prev_char_return = 0;
        }
        $new_chars[] = $char;
    }

    $return = implode('', $new_chars);
    // Shouldn't be necessary as we trimmed to start, but may as well
    $return = trim($return);

    return $return;
}

I'm still interested to see other ideas, and especially any text whose obvious interpretation by a function of this type would differ from what this function produces.

Based on the example (and not looking at your code), it looks like the rule is:

  • a span of whitespace containing at least 2 LF characters is a paragraph-separator (so convert it to a blank line);
  • any other span of whitespace is a word-separator (so convert it to a single space).

If so, then one approach would be to:

  1. Find the paragraph-separators and convert them to some string (not involving whitespace) that doesn't otherwise occur in the text.
  2. Convert remaining whitespace to single-space.
  3. Convert the paragraph-separator-indicators to \n\n (a blank line).

E.g.:

$text = preg_replace(
    array('/\s*\n\s*\n\s*/', '/\s+/', '/<PARAGRAPH-SEP>/'),
    array('<PARAGRAPH-SEP>', ' ',     "\n\n"),
    trim($text)
);

If the rule is more complicated, then it might be better to use preg_replace_callback, e.g.:

$text = preg_replace_callback('/\s+/', 'handle_whitespace', trim($text));

function handle_whitespace($matches)
{
    $whitespace = $matches[0];

    if (substr_count($whitespace, "\n") >= 2)
    {
        // paragraph-separator: replace with blank line
        return "\n\n";
    }
    else
    {
        // everything else: replace with single space character
        return " ";
    }
}
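
One thing the callback above does not cover is text that uses a lone \r as its line ending, since only \n is counted. A minimal sketch of one way to handle that, assuming it is acceptable to normalise all line endings to \n first (the function name normalize_whitespace and the sample string are hypothetical, chosen only for illustration):

```php
<?php
// Hypothetical helper: same rule as the callback above, but CRLF and lone CR
// are converted to LF first so every line-ending style is counted.
function normalize_whitespace(string $text): string
{
    // Order matters: "\r\n" must be replaced before any remaining lone "\r".
    $text = str_replace(["\r\n", "\r"], "\n", trim($text));

    return preg_replace_callback(
        '/\s+/',
        function (array $matches): string {
            // Two or more line feeds: paragraph break. Anything else: word gap.
            return substr_count($matches[0], "\n") >= 2 ? "\n\n" : ' ';
        },
        $text
    );
}

// Shortened, made-up variant of the example text with mixed line endings.
$sample = "Dear Miss      Test,\r\n  \r\n We  will be in touch by : \n\tNext Wednesday. \r \r Sincerely ";
echo normalize_whitespace($sample);
```

On this sample the result should be three paragraphs separated by blank lines: "Dear Miss Test,", "We will be in touch by : Next Wednesday." and "Sincerely". Tabs are treated like any other whitespace here, so a tab-indented continuation line joins the previous line with a single space.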