Split a string and remember where the splits occurred

Assume I have the following string:

I have | been very busy lately and need to go | to bed early

By splitting on "|", you get:

$arr = array(
  [0] => I have
  [1] => been very busy lately and need to go
  [2] => to bed early
)

The first split comes after 2 words, the second split 8 words after that, and 3 words remain. These word counts are stored as array(2, 8, 3). The string is then imploded and passed on to a custom string tagger:

tag_string('I have been very busy lately and need to go to bed early');

I don't know exactly what the output of tag_string will be, except that the total number of words will remain the same. Examples of output would be:

I have-nn been-vb very-vb busy lately and-rr need to-r go to bed early-p
I-ee have been-vb very busy-df lately-nn and need-f to go to bed-uu early-yy

This will lengthen the string by an unknown number of characters. I have no control over tag_string. What I know is that (1) the number of words will be the same as before and (2) the array was split after 2 words and then after another 8 words. I now need a way to explode the tagged string into the same array as before:

$string = "I have-nn been-vb very-vb busy lately and-rr need to-r go to bed early-p";
function split_string_again() {
  // split after 2nd, and thereafter after 8th word
}

With output:

$arr = array(
  [0] => I have-nn
  [1] => been-vb very-vb busy lately and-rr need to-r go
  [2] => to bed early-p
)

So to be clear (I wasn't before): I cannot split by remembering the strpos, because the strpos values before and after the string goes through the tagger aren't the same. I need to count the number of words. I hope I have made myself clearer :)

Interesting question. Although I think the rope data structure still applies, it might be a little overkill since word placement won't change. Here is my solution:

$str = "I have | been very busy lately and need to go | to bed early";

function get_breaks($str)
{
    $breaks = array();
    $arr = explode("|", $str);

    foreach($arr as $val)
    {
        $breaks[] = str_word_count($val);
    }

    return $breaks;
}

$breaks = get_breaks($str);

echo "<pre>" . print_r($breaks, 1) . "</pre>";

$str = str_replace("|", "", $str);

function rebreak($str, $breaks)
{
    $return = array();
    $old_break = 0;

    $arr = str_word_count($str, 1);

    foreach($breaks as $break)
    {
        $return[] = implode(" ", array_slice($arr, $old_break, $break));

        $old_break += $break;
    }

    return $return;
}

echo "<pre>" . print_r(rebreak($str, $breaks), 1) . "</pre>";

echo "<pre>" . print_r(rebreak("I have-nn been-vb very-vb busy lately and-rr need to-r go to bed early-p", $breaks), 1) . "</pre>";

Let me know if you have any questions, but it is pretty self-explanatory. There are definitely ways to improve this as well.
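One possible improvement, sketched here as an assumption rather than as part of the answer above: str_word_count($str, 1) only keeps letters, apostrophes and hyphens, so a tag containing a digit or an underscore would be cut off or split into extra words. If the tagger only ever separates words with whitespace (which the question's examples suggest but do not guarantee), splitting on whitespace keeps every tagged word intact. The function name rebreak_on_whitespace is hypothetical.

```php
<?php
// Hypothetical variant of rebreak(): split on whitespace instead of using
// str_word_count(), so every character of each tagged word is preserved.
function rebreak_on_whitespace(string $str, array $breaks): array
{
    // PREG_SPLIT_NO_EMPTY drops empty tokens from leading, trailing
    // or repeated whitespace.
    $words = preg_split('/\s+/', $str, -1, PREG_SPLIT_NO_EMPTY);

    $chunks = array();
    $offset = 0;
    foreach ($breaks as $count) {
        // Take the next $count words and glue them back together.
        $chunks[] = implode(' ', array_slice($words, $offset, $count));
        $offset += $count;
    }

    return $chunks;
}

$breaks = array(2, 8, 3);
$tagged = 'I have-nn been-vb very-vb busy lately and-rr need to-r go to bed early-p';

print_r(rebreak_on_whitespace($tagged, $breaks));
```

The word counts in $breaks must be collected the same way (by splitting each piece on whitespace), otherwise the two steps can disagree on what counts as a word.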

You wouldn't want to count the number of words; you would want to count the string length (strlen). If it is the same string without the pipes, then you can split it with substr after a certain number of characters.

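// Assumed setup (not in the original snippet): the pieces from explode().
$arr = explode('|', 'I have | been very busy lately and need to go | to bed early');
$strCounts = array();

foreach ($arr as $item) {
    $strCounts[] = strlen($item); // length of each piece in bytes
}

// Later on. Assumes $string is exactly the pieces joined back together.
$string = implode('', $arr);
$arr = array();
$i = 0;
foreach ($strCounts as $count) {
     $arr[] = substr($string, $i, $count);
     $i += $count; // increment the start position by the length
}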

I have not tested this; it is simply a "theory" and probably has some kinks to work out. There may be a better way to go about it, I just don't know it.

I'm not quite sure I understood what you actually wanted to achieve. But here are a couple of things that might help you:

str_word_count() counts the number of words in a string. preg_match_all('/\p{L}[\p{L}\p{Mn}\p{Pd}\x{2019}]*/u', $string, $foo); does pretty much the same, but on UTF-8 strings.
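As a rough check of the two counting approaches above, the following sketch runs both on the question's sample sentence before and after tagging. The variable names are only for illustration. For this ASCII sample both approaches should give the same count for the plain and the tagged string, because a hyphenated tag such as have-nn is treated as part of its word by both.

```php
<?php
$plain  = 'I have been very busy lately and need to go to bed early';
$tagged = 'I have-nn been-vb very-vb busy lately and-rr need to-r go to bed early-p';

// Pattern from the text: a letter, followed by letters, combining marks,
// dash punctuation (this covers the hyphen in "have-nn") or U+2019.
$pattern = '/\p{L}[\p{L}\p{Mn}\p{Pd}\x{2019}]*/u';

// preg_match_all() returns the number of full pattern matches.
$plainCount  = preg_match_all($pattern, $plain, $plainMatches);
$taggedCount = preg_match_all($pattern, $tagged, $taggedMatches);

echo 'str_word_count: ', str_word_count($plain), ' / ', str_word_count($tagged), "\n";
echo 'preg_match_all: ', $plainCount, ' / ', $taggedCount, "\n";
```

If the tagger ever emitted tags with digits or other symbols, both counts could drift apart from the original, so this equality is worth verifying against real tagger output.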

strpos() finds the first occurrence of a string within another. You could easily find the positions of all | with this:

$pos = -1;
$positions = array();
while (($pos = strpos($string, '|', $pos + 1)) !== false) {
  $positions[] = $pos;
}

I'm still not sure I understood why you can't just use explode() for this, though.

<?php
$string = 'I have | been very busy lately and need to go | to bed early';
$parts = explode('|', $string);
$words = array();
foreach ($parts as $s) {
  $words[] = str_word_count($s);
}