I am trying to parse some files line by line and split each line into columns. Two consecutive columns contain words, and the columns are separated by more than one space. Because a column value can itself contain single spaces, I am having trouble separating these two columns.
Examples of lines:
2236 ARGEMIRO PATROCINIO ARGEMIRO I I UBC 3,8462
1150721 ZACHARY F CONDON ZACH CONDON I I FINTAGE 8,3333
50300 COMERCIAL FONOGRAFICA RGE LTDA. PF LI ABRAMUS 25,0000
Note: the post is not showing all the spaces between '2236', 'ARGEMIRO PATROCINIO', 'ARGEMIRO', 'I', 'I', 'UBC' and '3,8462'.
I am using this regex:
(\d+)\s+([\.a-zA-Z\s,'À-úÀ-ÿ()\?\-\/\d]+)\s{2,}([\.a-zA-Z\s,'À-úÀ-ÿ()\?\-\/\d]+)\s{2,}(I|PF|MA)\s{2,}(I|PF|PL|LI|MA|CV|MJ)\s{2,}(\w+)\s{2,}(\d+,\d{4})
Unfortunately, "ARGEMIRO PATROCINIO" is captured together with the second "ARGEMIRO", "ZACHARY F CONDON" together with the second "ZACH CONDON", and so on.
So,
Thank you!
I'm not actually seeing double spaces in the data you pasted, but you describe it as having them. You can do this to split wherever there are two or more consecutive whitespace characters:
preg_split("/[\s]{2,}/", $data);
DEMO: http://www.phpliveregex.com/p/jWZ (click "preg_split" on the right)
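As a minimal sketch of that split in PHP: the sample line below is reconstructed with two spaces between columns, which is an assumption, since the pasted data shows only single spaces. `splitColumns` is a hypothetical helper name.

```php
<?php
// Hypothetical helper: split one line into columns on runs of 2+ whitespace characters.
function splitColumns(string $line): array
{
    // trim() avoids empty first/last elements caused by leading or trailing whitespace.
    return preg_split('/\s{2,}/', trim($line));
}

// Assumed sample: the first record from the question, with double spaces restored by hand.
$line = "2236  ARGEMIRO PATROCINIO  ARGEMIRO  I  I  UBC  3,8462";
$columns = splitColumns($line);

print_r($columns);
```

Single spaces inside a column are left alone, so "ARGEMIRO PATROCINIO" stays one element. Note that this approach does not validate the columns, and a record with a missing column simply yields fewer elements.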
It helps to understand how greedy and lazy quantifiers work. A greedy subpattern takes as many characters as it can and gives them back only when the rest of the pattern fails. A lazy subpattern does the opposite: it first matches the minimum it is allowed to (one character for +?), and the engine then tries the subsequent subpatterns. Only if they fail does the engine return to the lazily quantified subpattern, extend it by one more character, and test the subsequent subpatterns again. The mechanism is similar to backtracking, but it moves forward.
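A small PHP sketch of the difference, using a simplified character class ([A-Z\s]) and a hand-made sample string with double spaces between columns (both are assumptions for illustration):

```php
<?php
// Assumed sample: two name columns and a flag, separated by two spaces.
$line = "ARGEMIRO PATROCINIO  ARGEMIRO  I";

// Greedy +: takes as much as possible, then gives characters back
// until the following \s{2,} can match.
preg_match('/^([A-Z\s]+)\s{2,}/', $line, $greedy);

// Lazy +?: starts with one character and grows only until \s{2,} can match.
preg_match('/^([A-Z\s]+?)\s{2,}/', $line, $lazy);

echo $greedy[1], PHP_EOL; // ARGEMIRO PATROCINIO  ARGEMIRO
echo $lazy[1], PHP_EOL;   // ARGEMIRO PATROCINIO
```

The greedy version swallows the second column because \s is part of the character class; the lazy version stops at the first run of two spaces.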
So, what you can do is make the second and third column patterns lazy. (I am guessing you are using the /U greediness-swapping modifier; my advice is not to use it, so that the pattern stays as clear as possible):
(\d+)\s+([-.a-zA-Z\s,'À-úÀ-ÿ()?\/\d]+?)\s{2,}([-.a-zA-Z\s,'À-úÀ-ÿ()?\/\d]+?)\s{2,}(I|PF|MA)\s{2,}(I|PF|PL|LI|MA|CV|MJ)\s{2,}(\w+)\s{2,}(\d+,\d{4})
Add anchors (^ at the start and $ at the end) and the /m modifier if you need to match full lines only.
See the regex demo.
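In PHP, the lazy pattern with anchors and the /m modifier might be used roughly like this. It is a sketch under assumptions: the sample lines are reconstructed with two spaces between columns, and the /u modifier is my addition so that the À-ú and À-ÿ ranges are read as Unicode code points (it requires UTF-8 input). `$pattern`, `$data` and `$rows` are illustrative names.

```php
<?php
// The lazy pattern with ^ ... $ anchors and the m modifier added.
// The u modifier is an assumption (UTF-8 input), see the note above.
// A nowdoc is used so the ' inside the character class needs no escaping.
$pattern = <<<'REGEX'
/^(\d+)\s+([-.a-zA-Z\s,'À-úÀ-ÿ()?\/\d]+?)\s{2,}([-.a-zA-Z\s,'À-úÀ-ÿ()?\/\d]+?)\s{2,}(I|PF|MA)\s{2,}(I|PF|PL|LI|MA|CV|MJ)\s{2,}(\w+)\s{2,}(\d+,\d{4})$/mu
REGEX;

// Assumed sample data: double spaces between columns restored by hand.
$data = "2236  ARGEMIRO PATROCINIO  ARGEMIRO  I  I  UBC  3,8462\n"
      . "1150721  ZACHARY F CONDON  ZACH CONDON  I  I  FINTAGE  8,3333";

// PREG_SET_ORDER gives one array per matched line: [full match, col 1, ..., col 7].
preg_match_all($pattern, $data, $rows, PREG_SET_ORDER);

foreach ($rows as $row) {
    echo $row[2], ' | ', $row[3], PHP_EOL;
}
```

With these two sample lines, columns 2 and 3 come out as separate captures for each record.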
Look at the ([-.a-zA-Z\s,'À-úÀ-ÿ()?\/\d]+?) patterns: they use the +? lazy quantifier, which matches one or more characters, as few as possible.
Note that I made some cosmetic changes, too: . does not need to be escaped inside a character class, and - never needs to be escaped to denote a literal - when it is placed at the start of a character class.
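A quick way to convince yourself of both points in PHP (the subject string is made up for the check):

```php
<?php
$subject = "a.b-c";

// Escaped dot and hyphen inside the character class.
preg_match_all('/[\.\-]/', $subject, $escaped);

// Unescaped: the dot is literal inside a class, and a leading hyphen cannot start a range.
preg_match_all('/[-.]/', $subject, $plain);

var_dump($escaped[0] === $plain[0]); // bool(true)

// Inside a class the dot does not mean "any character", so 'x' is not matched.
var_dump(preg_match('/[.]/', 'x')); // int(0)
```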
Normally, I would say this is the regex you need:
/(\d+)\s{2,}([.a-zA-Z,'À-úÀ-ÿ()?\-\/\d]+(?:\s?[.a-zA-Z,'À-úÀ-ÿ()?\-\/\d])*)\s{2,}([.a-zA-Z,'À-úÀ-ÿ()?\-\/\d]+(?:\s?[.a-zA-Z,'À-úÀ-ÿ()?\-\/\d])*)\s{2,}(I|PF|MA)\s{2,}(I|PF|PL|LI|MA|CV|MJ)\s{2,}(\w+)\s{2,}(\d+,\d{4})/
But since the last record has only 6 columns, it won't match that record: https://regex101.com/r/YynbpP/1
My suggestion is that you rethink which columns could be optional, then adjust the regex accordingly.
For example, groups 2 and 3 are identical in structure.
If you expect the second of them to be optional, the proper regex is this:
/(\d+)\s{2,}([.a-zA-Z,'À-úÀ-ÿ()?\-\/\d]+(?:\s?[.a-zA-Z,'À-úÀ-ÿ()?\-\/\d])*)(?|\s{2,}((?:[.a-zA-Z,'À-úÀ-ÿ()?\-\/\d]+(?:\s?[.a-zA-Z,'À-úÀ-ÿ()?\-\/\d])*))|())\s{2,}(I|PF|MA)\s{2,}(I|PF|PL|LI|MA|CV|MJ)\s{2,}(\w+)\s{2,}(\d+,\d{4})/
https://regex101.com/r/ohtTfO/2
This maintains the column structure.
Note that if the 3rd column entry is missing, the line most likely does not contain an extra \s{2,} separator for it either, so the separator and the data have to be optional together. You can't simply make that whole part optional, though, because that would turn column 3 into a null instead of an empty string.
To fix that, I used a branch reset, (?|\s{2,}(data)|()), which always matches column 3 and makes it an empty string if the data is not there.
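Here is a reduced PHP sketch of that difference. The patterns are simplified stand-ins (plain \w+ words and only four columns), and the sample line is made up with the third column missing. Under PHP's default flags an unmatched group in the middle already comes back as an empty string, so the sketch passes PREG_UNMATCHED_AS_NULL (PHP 7.2+) to make the difference visible.

```php
<?php
// Simplified stand-ins (assumption): id, name, optional second name, type.
// Variant 1: separator + column 3 wrapped in an optional non-capturing group.
$optional    = '/^(\d+)\s{2,}(\w+(?:\s\w+)*)(?:\s{2,}(\w+(?:\s\w+)*))?\s{2,}(I|PF|MA)$/';
// Variant 2: branch reset, both alternatives define group 3.
$branchReset = '/^(\d+)\s{2,}(\w+(?:\s\w+)*)(?|\s{2,}(\w+(?:\s\w+)*)|())\s{2,}(I|PF|MA)$/';

// Assumed sample record with the third column missing.
$line = "50300  COMERCIAL FONOGRAFICA  PF";

preg_match($optional, $line, $a, PREG_UNMATCHED_AS_NULL);
preg_match($branchReset, $line, $b, PREG_UNMATCHED_AS_NULL);

var_dump($a[3]); // NULL: group 3 never took part in the match
var_dump($b[3]); // string(0) "": the empty alternative matched as group 3
var_dump($b[4]); // string(2) "PF"
```

In both variants the group after column 3 keeps its number (4), so the column layout of the result is the same whether or not the third column is present.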
Formatted (for ease of use)
( \d+ )                       # (1)
\s{2,}
(                             # (2 start)
     [.a-zA-Z,'À-úÀ-ÿ()?\-/\d]+
     (?:
          \s?
          [.a-zA-Z,'À-úÀ-ÿ()?\-/\d]
     )*
)                             # (2 end)
(?|
     \s{2,}
     (                             # (3 start)
          (?:
               [.a-zA-Z,'À-úÀ-ÿ()?\-/\d]+
               (?:
                    \s?
                    [.a-zA-Z,'À-úÀ-ÿ()?\-/\d]
               )*
          )
     )                             # (3 end)
  |  ( )                           # (3)
)
\s{2,}
( I | PF | MA )               # (4)
\s{2,}
( I | PF | PL | LI | MA | CV | MJ )     # (5)
\s{2,}
( \w+ )                       # (6)
\s{2,}
( \d+ , \d{4} )               # (7)
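To use a free-spacing layout like this in PHP, the pattern needs the x modifier. The sketch below is one possible way to wire it up, not the only one. Assumptions: `ROW_PATTERN` and `parseRow` are hypothetical names, the delimiter is ~ because the formatted pattern contains an unescaped /, the u modifier is my addition so the accented ranges are read as Unicode code points (UTF-8 input), and the sample lines are reconstructed with two spaces between columns.

```php
<?php
// Free-spacing (x) version of the pattern; whitespace and # comments are ignored
// outside character classes. Nowdoc, so the ' in the class needs no escaping.
const ROW_PATTERN = <<<'REGEX'
~
( \d+ )                                                             # (1)
\s{2,}
( [.a-zA-Z,'À-úÀ-ÿ()?\-/\d]+ (?: \s? [.a-zA-Z,'À-úÀ-ÿ()?\-/\d] )* ) # (2)
(?|
     \s{2,}
     ( [.a-zA-Z,'À-úÀ-ÿ()?\-/\d]+ (?: \s? [.a-zA-Z,'À-úÀ-ÿ()?\-/\d] )* ) # (3)
  |  ( )                                                            # (3) empty
)
\s{2,}
( I | PF | MA )                                                     # (4)
\s{2,}
( I | PF | PL | LI | MA | CV | MJ )                                 # (5)
\s{2,}
( \w+ )                                                             # (6)
\s{2,}
( \d+ , \d{4} )                                                     # (7)
~xu
REGEX;

// Hypothetical helper: returns the 7 columns of one record, or null if the line does not match.
function parseRow(string $line): ?array
{
    if (preg_match(ROW_PATTERN, $line, $m) !== 1) {
        return null;
    }
    return array_slice($m, 1); // drop the full match, keep groups 1..7
}

// Assumed samples: double spaces restored by hand; the second record has no third column.
print_r(parseRow("2236  ARGEMIRO PATROCINIO  ARGEMIRO  I  I  UBC  3,8462"));
print_r(parseRow("50300  COMERCIAL FONOGRAFICA RGE LTDA.  PF  LI  ABRAMUS  25,0000"));
```

For the second sample, the third element of the result is an empty string and the remaining columns stay in their usual positions.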