I am currently dealing with a performance issue where I cannot find a way to fix it. I want to search a text for usernames mentioned with the @
sign in front. The list of usernames is available as PHP array.
The problem is usernames may contain spaces or other special characters. There is no limitation for it. So I can't find a regex dealing with that. Currently I am using a function which gets the whole line after the @
and checks char by char which usernames could match for this mention, until there is just one username left which totally matches the mention. But for a long text with 5 mentions it takes several seconds (!!!) to finish. for more than 20 mentions the script runs endlessly.
I have some ideas, but I don't know if they may work.
How does Facebook or twitter and any other site do this? Are they parsing the text directly while typing and saving the mentioned usernames directly in the stored text of the message?
This is my current function:
$regular_expression_match = '#(?:^|\\s)@(.+?)(?:
|$)#';
$matches = false;
$offset = 0;
while (preg_match($regular_expression_match, $post_text, $matches, PREG_OFFSET_CAPTURE, $offset))
{
$line = $matches[1][0];
$search_string = substr($line, 0, 1);
$filtered_usernames = array_keys($user_list);
$matched_username = false;
// Loop, make the search string one by one char longer and see if we have still usernames matching
while (count($filtered_usernames) > 1)
{
$filtered_usernames = array_filter($filtered_usernames, function ($username_clean) use ($search_string, &$matched_username) {
$search_string = utf8_clean_string($search_string);
if (strlen($username_clean) == strlen($search_string))
{
if ($username_clean == $search_string)
{
$matched_username = $username_clean;
}
return false;
}
return (substr($username_clean, 0, strlen($search_string)) == $search_string);
});
if ($search_string == $line)
{
// We have reached the end of the line, so stop
break;
}
$search_string = substr($line, 0, strlen($search_string) + 1);
}
// If there is still one in filter, we check if it is matching
$first_username = reset($filtered_usernames);
if (count($filtered_usernames) == 1 && utf8_clean_string(substr($line, 0, strlen($first_username))) == $first_username)
{
$matched_username = $first_username;
}
// We can assume that $matched_username is the longest matching username we have found due to iteration with growing search_string
// So we use it now as the only match (Even if there are maybe shorter usernames matching too. But this is nothing we can solve here,
// This needs to be handled by the user, honestly. There is a autocomplete popup which tells the other, longer fitting name if the user is still typing,
// and if he continues to enter the full name, I think it is okay to choose the longer name as the chosen one.)
if ($matched_username)
{
$startpos = $matches[1][1];
// We need to get the endpos, cause the username is cleaned and the real string might be longer
$full_username = substr($post_text, $startpos, strlen($matched_username));
while (utf8_clean_string($full_username) != $matched_username)
{
$full_username = substr($post_text, $startpos, strlen($full_username) + 1);
}
$length = strlen($full_username);
$user_data = $user_list[$matched_username];
$mentioned[] = array_merge($user_data, array(
'type' => self::MENTION_AT,
'start' => $startpos,
'length' => $length,
));
}
$offset = $matches[0][1] + strlen($search_string);
}
Which way would you go? The problem is the text will be displayed often and parsing it every time will consume a lot of time, but I don't want to heavily modify what the user had entered as text.
I can't find out what's the best way, and even why my function is so time consuming.
A sample text would be:
Okay, @Firstname Lastname, I mention you! Listen @[TEAM] John, you are a team member. @Test is a normal name, but @Thât♥ should be tracked too. And see @Wolfs garden! I just mean the Wolf.
Usernames in that text would be
So yes, there is clearly nothing I know where a name may end. Only thing is the newline.
I think the main problem is, that you can't distinguish usernames from text and it's a bad idea, to lookup maybe thousands of usernames in a text, also this can lead to further problems, that John
is part of [TEAM] John
or JohnFoo
...
It's needed to separate the usernames from other text. Assuming that you're using UTF-8, could put the usernames inside invisible zero-w space \xE2\x80\x8B and non-joiner \xE2\x80\x8C.
The usernames can now be extracted fast and with little effort and if needed still verified in db.
$txt = "
Okay, @\xE2\x80\x8BFirstname Lastname\xE2\x80\x8C, I mention you!
Listen @\xE2\x80\x8B[TEAM] John\xE2\x80\x8C, you are a team member.
@\xE2\x80\x8BTest\xE2\x80\x8C is a normal name, but
@\xE2\x80\x8BThât?\xE2\x80\x8C should be tracked too.
And see @\xE2\x80\x8BWolfs\xE2\x80\x8C garden! I just mean the Wolf.";
// extract usernames
if(preg_match_all('~@\xE2\x80\x8B\K.*?(?=\xE2\x80\x8C)~s', $txt, $out)){
print_r($out[0]);
}
Array ( [0] => Firstname Lastname 1 => [TEAM] John 2 => Test 3 => Thât♥ 4 => Wolfs )
echo $txt;
Okay, @Firstname Lastname, I mention you!
Listen @[TEAM] John, you are a team member.
@Test is a normal name, but
@Thât♥ should be tracked too.
And see @Wolfs garden! I just mean the Wolf.
Could use any characters you like and that possibly don't occur elsewhere for separation.
Regex FAQ, Test at eval.in (link will expire soon)