Regex to allow one or more words separated by a single space, using two or more spaces as the column separator

I am trying to parse some files line by line and split each line into columns. Two consecutive columns contain words, and the column separator is a run of more than one space. Because the columns themselves can contain single spaces, I am having trouble separating these two.

Examples of lines:

2236        ARGEMIRO PATROCINIO                                   ARGEMIRO                 I       I          UBC            3,8462

1150721     ZACHARY F CONDON                                      ZACH CONDON               I       I          FINTAGE        8,3333

50300       COMERCIAL FONOGRAFICA RGE LTDA.                                                 PF      LI         ABRAMUS       25,0000

(fixed)

Note: the rendering is not showing all the spaces between '2236', 'ARGEMIRO PATROCINIO', 'ARGEMIRO', 'I', 'I', 'UBC' and '3,8462'.

I am using this regex:

(\d+)\s+([\.a-zA-Z\s,'À-úÀ-ÿ()\?\-\/\d]+)\s{2,}([\.a-zA-Z\s,'À-úÀ-ÿ()\?\-\/\d]+)\s{2,}(I|PF|MA)\s{2,}(I|PF|PL|LI|MA|CV|MJ)\s{2,}(\w+)\s{2,}(\d+,\d{4})

Unfortunately, "ARGEMIRO PATROCINIO" is captured together with the second "ARGEMIRO", "ZACHARY F CONDON" together with the second "ZACH CONDON", and so on.

So,

How can I fix this regex to separate these two "columns"?
What would another regex look like that grabs anything between runs of two or more spaces across these 7 columns?

Thank you!

I'm not actually seeing double spaces in the data you pasted, but you are describing it as such. You can do this to split anywhere there are two or more consecutive whitespace characters:

preg_split("/[\s]{2,}/", $data);

DEMO: http://www.phpliveregex.com/p/jWZ (click "preg_split" on the right)
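As a rough sketch of the split approach, the snippet below runs it on two hand-built lines. The exact number of spaces between columns is an assumption (the post does not preserve them), and `splitColumns` is a hypothetical helper name.

```php
<?php
// Sample lines rebuilt by hand; the run lengths of spaces are assumed.
$lines = [
    "2236        ARGEMIRO PATROCINIO        ARGEMIRO        I       I          UBC            3,8462",
    "50300       COMERCIAL FONOGRAFICA RGE LTDA.              PF      LI         ABRAMUS       25,0000",
];

// Hypothetical helper: split on runs of two or more whitespace characters,
// so single spaces stay inside a column.
function splitColumns(string $line): array
{
    return preg_split('/\s{2,}/', trim($line));
}

foreach ($lines as $line) {
    $columns = splitColumns($line);
    echo count($columns), " columns: ", implode(' | ', $columns), PHP_EOL;
}
```

One consequence of this design: a blank column simply collapses, so the second line yields 6 fields instead of 7, and the positions after the missing column shift left.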

You should understand how greediness works. When a subpattern is lazily quantified, the engine first matches only the required minimum with it and tries the subsequent patterns. Only if no match is found does the engine return to the lazy subpattern, consume one more character with it, and test the subsequent subpatterns again. The mechanism is similar to backtracking, but it moves forward.
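A tiny illustration of that difference, using a made-up string rather than the real data:

```php
<?php
// Made-up input: three "columns" separated by two spaces each.
$text = "AB  CD  EF";

// Greedy: (.+) grabs everything first, then backtracks until the rest can match,
// so the separator it settles on is the LAST run of two spaces.
preg_match('/^(.+)\s{2,}(.+)$/', $text, $greedy);

// Lazy: (.+?) grows one character at a time and stops as soon as the rest matches,
// so the separator is the FIRST run of two spaces.
preg_match('/^(.+?)\s{2,}(.+)$/', $text, $lazy);

echo $greedy[1], PHP_EOL; // "AB  CD"
echo $lazy[1], PHP_EOL;   // "AB"
```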

So, what you can do is make sure the second and third column patterns are lazy. (Note: I guess you are using the /U greediness-swapping modifier; my advice is not to use it, so that the pattern stays as clear as possible.)

(\d+)\s+([-.a-zA-Z\s,'À-úÀ-ÿ()?\/\d]+?)\s{2,}([-.a-zA-Z\s,'À-úÀ-ÿ()?\/\d]+?)\s{2,}(I|PF|MA)\s{2,}(I|PF|PL|LI|MA|CV|MJ)\s{2,}(\w+)\s{2,}(\d+,\d{4})

Add anchors (^ at the start and $ at the end) and the /m modifier if you need to match full lines only.

See the regex demo.

See the ([-.a-zA-Z\s,'À-úÀ-ÿ()?\/\d]+?) patterns: they have the +? lazy quantifier, which matches one or more characters, as few as possible.

Note I made some cosmetic changes, too: . does not need to be escaped in a character class, and -, when placed at the start of a character class, never needs to be escaped to denote a literal -.
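Here is a minimal sketch of the lazy pattern applied with `preg_match`. The anchors and the `u` modifier are additions for the example (the file is assumed to be saved as UTF-8), the variable names are hypothetical, and the sample line's spacing is reconstructed by hand.

```php
<?php
// Lazy character class shared by columns 2 and 3 (double-quoted, so \\s becomes \s).
$name = "[-.a-zA-Z\\s,'À-úÀ-ÿ()?\\/\\d]+?";

$pattern = '/^(\d+)\s+(' . $name . ')\s{2,}(' . $name . ')'
         . '\s{2,}(I|PF|MA)\s{2,}(I|PF|PL|LI|MA|CV|MJ)\s{2,}(\w+)\s{2,}(\d+,\d{4})$/u';

// Hand-built sample line; the exact spacing is an assumption.
$line = "1150721     ZACHARY F CONDON        ZACH CONDON        I       I          FINTAGE        8,3333";

if (preg_match($pattern, $line, $m)) {
    echo $m[2], PHP_EOL; // "ZACHARY F CONDON"
    echo $m[3], PHP_EOL; // "ZACH CONDON"
}
```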

Normally, I would say this is the regex you need:

/(\d+)\s{2,}([.a-zA-Z,'À-úÀ-ÿ()?\-\/\d]+(?:\s?[.a-zA-Z,'À-úÀ-ÿ()?\-\/\d])*)\s{2,}([.a-zA-Z,'À-úÀ-ÿ()?\-\/\d]+(?:\s?[.a-zA-Z,'À-úÀ-ÿ()?\-\/\d])*)\s{2,}(I|PF|MA)\s{2,}(I|PF|PL|LI|MA|CV|MJ)\s{2,}(\w+)\s{2,}(\d+,\d{4})/

But since the last record has only 6 columns, it won't match that record: https://regex101.com/r/YynbpP/1

My suggestion is you rethink which columns could be optional.
Then adjust the regex accordingly.

For example, groups 2 and 3 are identical in structure.
If you expect the second of them to be optional, the proper regex is this:

/(\d+)\s{2,}([.a-zA-Z,'À-úÀ-ÿ()?\-\/\d]+(?:\s?[.a-zA-Z,'À-úÀ-ÿ()?\-\/\d])*)(?|\s{2,}((?:[.a-zA-Z,'À-úÀ-ÿ()?\-\/\d]+(?:\s?[.a-zA-Z,'À-úÀ-ÿ()?\-\/\d])*))|())\s{2,}(I|PF|MA)\s{2,}(I|PF|PL|LI|MA|CV|MJ)\s{2,}(\w+)\s{2,}(\d+,\d{4})/

https://regex101.com/r/ohtTfO/2

This maintains the column structure.

Note that if the 3rd column entry is missing, the line most likely does not
contain an extra \s{2,} separator for it, so you can't simply make the whole
group optional: that would turn column 3 into a null instead of an empty string.

To fix that, I used a branch reset,
(?|\s{2,}(data)|()), which always sets column 3
and makes it an empty string when the data is not there.

Formatted (for ease of use)

 ( \d+ )                                  # (1)
 \s{2,} 
 (                                        # (2 start)
      [.a-zA-Z,'À-úÀ-ÿ()?\-/\d]+ 
      (?:
           \s? 
           [.a-zA-Z,'À-úÀ-ÿ()?\-/\d] 
      )*
 )                                        # (2 end)
 (?|
      \s{2,} 
      (                                        # (3 start)
           (?:
                [.a-zA-Z,'À-úÀ-ÿ()?\-/\d]+ 
                (?:
                     \s? 
                     [.a-zA-Z,'À-úÀ-ÿ()?\-/\d] 
                )*
           )
      )                                        # (3 end)
   |  ( )                                      # (3)
 )
 \s{2,} 
 ( I | PF | MA )                          # (4)
 \s{2,} 
 ( I | PF | PL | LI | MA | CV | MJ )      # (5)
 \s{2,} 
 ( \w+ )                                  # (6)
 \s{2,} 
 ( \d+ , \d{4} )                          # (7)
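
The branch-reset pattern can be tried out with a short PHP sketch like the one below. The `u` modifier, the helper variables `$n` and `$field`, and the spacing of the sample lines are assumptions made for the example; the file is assumed to be UTF-8.

```php
<?php
// One allowed character of a name column (no whitespace inside the class).
$n = "[.a-zA-Z,'À-úÀ-ÿ()?\\-\\/\\d]";
// Words separated by at most one whitespace character.
$field = $n . '+(?:\s?' . $n . ')*';

// (?| ... | ... ) is a branch reset: both alternatives fill group 3.
$pattern = '/(\d+)\s{2,}(' . $field . ')(?|\s{2,}(' . $field . ')|())'
         . '\s{2,}(I|PF|MA)\s{2,}(I|PF|PL|LI|MA|CV|MJ)\s{2,}(\w+)\s{2,}(\d+,\d{4})/u';

// Hand-built sample lines; the exact spacing is assumed.
$full    = "2236        ARGEMIRO PATROCINIO        ARGEMIRO        I       I          UBC            3,8462";
$missing = "50300       COMERCIAL FONOGRAFICA RGE LTDA.              PF      LI         ABRAMUS       25,0000";

preg_match($pattern, $full, $a);
preg_match($pattern, $missing, $b);

var_dump($a[3]); // string "ARGEMIRO"
var_dump($b[3]); // empty string, and groups 4 to 7 keep their positions
```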