Hi, I'm trying to use this pattern: /^(–*\s*2\.2\.|2\.2\.)/
to match these strings; each line is a different string. EDIT: sorry about the poor data formatting.
<?php
$final_texts = array();
$pattern = '/^(–*\s*2\.2\.|2\.2\.)/'; // this is generated automatically elsewhere, btw
$texts = array(
    "– 2.2.04 R",
    "–– 2.2.04.10 C",
    "–– 2.2.04.1 CO",
    "–– 2.2.04.2 CO",
    "–– 2.2.04.3 CO",
    "–– 2.2.04.4 CO",
    "–– 2.2.04.5 CO",
    "–– 2.2.04.6 CO",
    "–– 2.2.04.7 CO",
    "–– 2.2.04.8 CO",
    "–– 2.2.04.9 CO",
    "foooooooooooo",
    "barrrrrrrrrr",
    "-- foobar",
    "- 1123",
);
foreach ($texts as $key => $text) {
    if (preg_match($pattern, $text)) {
        $final_texts[] = $text;
    }
}
print_r($final_texts);
?>
This is what I'm using: preg_match($pattern, $string)
As I understand it, * means zero or more of the preceding element, but I'm no expert.
But it only matches the first string and not the ones with more than one dash "–". Keep in mind that they are different strings inside an array, and I iterate over it to do something. Should I be doing something different in the pattern? I'm trying to match all strings that start with any number of dashes and spaces followed by the string 2.2. I will have this problem with other numbers too, and I may have strings with more than two dashes in the future, so I don't see how I can solve this without regex. I've already tested it here: http://preg_match.onlinephpfunctions.com/ and have the same problem. Demo (thanks to @hwnd for showing me this!)
I believe the cause of this is the Unicode dash you have placed in your regular expression. I recommend using the Unicode property \p{Pd} (any kind of hyphen or dash) to match these characters.
/^(\p{Pd}+\s*2\.2\.|2\.2\.)/mu
Note: The m (multi-line) modifier causes ^ to match the beginning of each line; since each string here is a single line, it is optional. The u modifier turns on additional functionality of PCRE, and pattern and subject strings are treated as UTF-8.
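As a rough check of the pattern above, here is a minimal sketch run against a few hypothetical sample strings modelled on the question's data. The en dash is written as raw bytes (`"\xE2\x80\x93"`) only so the example does not depend on how the file is saved. The variable names are illustrative.

```php
<?php
// U+2013 EN DASH written as raw UTF-8 bytes so the file encoding does not matter.
$dash = "\xE2\x80\x93";

// Hypothetical sample data modelled on the question's array.
$samples = array(
    $dash . " 2.2.04 R",
    $dash . $dash . " 2.2.04.10 C",
    "2.2.05 X",
    "-- foobar",
);

// \p{Pd} needs the u modifier so the subject is read as UTF-8 characters.
$pattern = '/^(\p{Pd}+\s*2\.2\.|2\.2\.)/mu';

$matched = array();
foreach ($samples as $sample) {
    if (preg_match($pattern, $sample) === 1) {
        $matched[] = $sample;
    }
}

// Expected to keep the first three strings and drop "-- foobar".
print_r($matched);
```

Note that \p{Pd} also covers the plain ASCII hyphen, so "-- foobar" is rejected only because it is not followed by 2.2.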
Just a thought: instead of iterating over your array, use preg_grep() here.
$final_texts = preg_grep('/^(\p{Pd}+\s*2\.2\.|2\.2\.)/mu', $texts);
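One detail worth knowing when swapping the loop for preg_grep(): the returned array keeps the keys of the input array, so it may have gaps. A small sketch with hypothetical sample data; array_values() is only needed if a plain 0-based list is wanted.

```php
<?php
$dash = "\xE2\x80\x93"; // U+2013 EN DASH as UTF-8 bytes

// Hypothetical sample data.
$texts = array(
    $dash . " 2.2.04 R",
    "foooooooooooo",
    $dash . $dash . " 2.2.04.1 CO",
);

$hits = preg_grep('/^(\p{Pd}+\s*2\.2\.|2\.2\.)/mu', $texts);

// preg_grep() preserves the original keys (here 0 and 2) ...
print_r(array_keys($hits));

// ... so reindex if a plain list is needed.
$final_texts = array_values($hits);
print_r($final_texts);
```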
An en dash is encoded as three bytes (E2 80 93) in UTF-8. Without the u modifier, a quantifier is applied only to the last byte, so –* is equivalent to \x{e2}\x{80}\x{93}*.
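To see the byte-level behaviour for yourself, the following sketch prints the encoding of the en dash and runs the question's pattern with and without the u modifier. The subject strings are hypothetical samples in the style of the question.

```php
<?php
$dash = "\xE2\x80\x93"; // U+2013 EN DASH as UTF-8 bytes

echo bin2hex($dash), "\n"; // e28093

$one = $dash . " 2.2.04 R";
$two = $dash . $dash . " 2.2.04.10 C";

// The question's pattern, no u modifier: * binds to the last byte (0x93) only.
$bytePattern = '/^(' . $dash . '*\s*2\.2\.|2\.2\.)/';
$byteOne = preg_match($bytePattern, $one); // 1: E2 80, then a single 93
$byteTwo = preg_match($bytePattern, $two); // 0: the second dash starts with E2, not 93

// Same pattern with u: the dash is one character, so * repeats the whole dash.
$utf8Pattern = '/^(' . $dash . '*\s*2\.2\.|2\.2\.)/u';
$utf8Two = preg_match($utf8Pattern, $two); // 1

var_dump($byteOne, $byteTwo, $utf8Two);
```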
You can simply wrap the Unicode character in parentheses, (–)*, to apply the quantifier to all three bytes. Or, if you don't want to capture it, use a non-capturing group: (?:–)*.
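A minimal sketch of the grouping approach, again with hypothetical sample strings. Because the group may repeat zero times, this version also matches strings that start directly with 2.2., so the second alternative from the question's pattern is left out here; that simplification is mine, not part of the answer.

```php
<?php
$dash = "\xE2\x80\x93"; // U+2013 EN DASH as UTF-8 bytes

// Non-capturing group around the three bytes, no u modifier needed.
$pattern = '/^(?:' . $dash . ')*\s*2\.2\./';

// Hypothetical sample data.
$samples = array(
    $dash . " 2.2.04 R",
    str_repeat($dash, 3) . " 2.2.04.10 C",
    "2.2.05 X",
    "- 1123",
);

$results = array();
foreach ($samples as $sample) {
    $results[] = preg_match($pattern, $sample);
}

print_r($results); // 1, 1, 1, 0
```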
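Character sets will also work with Unicode characters: [–]. Note that without the u modifier the set contains the three individual bytes rather than one character, so it also accepts those bytes in any order.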
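The sketch below illustrates the character-set variant. It matches the well-formed sample either way, and in byte mode (no u modifier) it also accepts a scrambled byte sequence. The scrambled subject is a made-up example to show that looseness, not something from the question's data.

```php
<?php
$dash = "\xE2\x80\x93"; // U+2013 EN DASH as UTF-8 bytes

$subject = $dash . $dash . " 2.2.04.2 CO";

// With u, the set holds one character: the en dash.
$withU = preg_match('/^[' . $dash . ']*\s*2\.2\./u', $subject);

// Without u, the set holds three separate bytes: E2, 80 and 93.
$withoutU = preg_match('/^[' . $dash . ']*\s*2\.2\./', $subject);

// Made-up subject: the same three bytes in the wrong order (not valid UTF-8).
$scrambled = "\x93\xE2\x80 2.2.04.2 CO";
$loose = preg_match('/^[' . $dash . ']*\s*2\.2\./', $scrambled);

var_dump($withU, $withoutU, $loose); // int(1), int(1), int(1)
```

If the input is known to be clean UTF-8, the difference rarely matters; adding the u modifier is the stricter choice.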
See runnable.