I've got a string like...
"labour 18909, liberals 12,365,conservatives 14,720"
...and I'd like a regex which can get rid of any thousands separators so I can pull out the numbers easily. Or even a regex which could give me a tidy array like:
(labour => 18909, liberals => 12365, conservatives => 14720)
Oh, I wish I had the time to figure out regexes! Maybe I'll buy one as a toilet book, mmm.
Two-liner. Will also get Independents:
preg_match_all('/([a-zA-Z]+)\s*([\d,]+)(?:,|$)/', $str, $matches);
// str_replace accepts an array, so this strips the thousands separators from every captured number
$totals = array_combine($matches[1], str_replace(',', '', $matches[2]));
/* $totals:
Array
(
[labour] => 18909
[liberals] => 12365
[conservatives] => 14720
)
*/
You could do a search and replace, for example with sed:
> echo '"labour 18909, liberals 12,365,conservatives 14,720"'
| sed -r -e 's/([0-9]),([0-9]{3})/\1\2/g'
"labour 18909, liberals 12365,conservatives 14720"
I'm not entirely certain what the PHP syntax would be, but it basically takes a pattern consisting of a digit (X), a comma, and three more digits (Y) and replaces them with just XY.
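In PHP the equivalent would be preg_replace (a sketch only, assuming the raw line is already in $str):

// Same idea as the sed expression: a digit, a comma, three digits -> keep just the digits
$clean = preg_replace('/(\d),(\d{3})/', '$1$2', $str);
echo $clean; // labour 18909, liberals 12365,conservatives 14720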
Well, using the following regular expression you can separate the numbers from the rest:
labour\s*([\d,.]+),\s*liberals\s*([\d,.]+),\s*conservatives\s*([\d,.]+)
After all, a number ends at the point where no further digit follows. You can then proceed to remove the commas from the captured values.
PowerShell demo (a little bit condensed, sorry):
PS Home:\> $s -match 'labour\s*(?<labour>[\d,.]+),\s*liberals\s*(?<liberals>[\d,.]+),\s*conservatives\s*(?<conservatives>[\d,.]+)' |
Out-Null
PS Home:\> "Labour: {0}`nLiberals: {1}`nConservatives: {2}" -f `
($Matches['labour'],$Matches['liberals'],$Matches['conservatives'] |
foreach { $_ -replace ',' })
Labour: 18909
Liberals: 12365
Conservatives: 14720
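For the PHP side of things, a rough equivalent of the same named-group approach might look like this (just a sketch; it assumes the input is in $str, which the PowerShell demo also leaves out):

preg_match('/labour\s*(?<labour>[\d,.]+),\s*liberals\s*(?<liberals>[\d,.]+),\s*conservatives\s*(?<conservatives>[\d,.]+)/', $str, $m);
// Strip the thousands separators from each named capture
echo str_replace(',', '', $m['labour']), "\n";        // 18909
echo str_replace(',', '', $m['liberals']), "\n";      // 12365
echo str_replace(',', '', $m['conservatives']), "\n"; // 14720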
What you want seems to be to remove commas only if they are surrounded by digits. Sorry, I don't know the particulars of PHP regex syntax, but a couple of more abstract examples are:
str.replace("(\d+),(\d+)", "$1$2")
s/([0-9]+),([0-9]+)/\1\2/g
These would catch all the correct numbers, but would also match something that isn't really a proper number, such as "2,41,11".
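In PHP that would presumably be preg_replace (a sketch, assuming the raw string is in $str):

// Remove any comma that has digits on both sides
$clean = preg_replace('/(\d+),(\d+)/', '$1$2', $str);
// Note: on malformed input such as "2,41,11" this still merges the "2,41" part into "241"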
In a former life, I did a lot of data processing like this, except there were hundreds of millions of records taking days to process.
I always found it wise to follow this strategy:
1. Know your data.
   - The customer will always say their data is perfect, well formed and correct.
   - It is invariably a pile of steaming dodo poop.
2. Define the rules for the data; sometimes it is easier to define what the data isn't.
3. Use a regex, or even a macro search and replace within an editor, to find where the data breaks the rules.
4. Repair, request new data sets, or discard data.
5. Repeat steps 3 and 4 till the data is clean.
6. Now think about the format of the data: can the regex matching be simplified by some simple manipulation of the data? (See the PHP sketch after this list.)
   - For example, in your case, replace a comma followed by white space with a single comma.
   - Then strip every comma surrounded by numbers.
   - Strip multiple white space (leave single white space).
   - Strip white space immediately before an alpha character.
7. Define rules for this new data set and make sure it's clean.
   - This can now include range checking on the numeric data.
   - Even more complex rules.
8. Now your data looks like "labour 18909,liberals 12365,conservatives 14720".
9. Build your import tool for this new data set (the easy bit).

Make sure you have a repeatable system for 1..9, as the customer will always want a simple change or just that extra little bit they need right now.
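As an illustration of step 6, the clean-up could be sketched in PHP roughly like this (the variable names are mine; the rules are just the four bullets above):

$raw = 'labour 18909, liberals 12,365,conservatives 14,720';
$s = preg_replace('/,\s+/', ',', $raw);          // comma followed by white space -> single comma
$s = preg_replace('/(\d),(\d)/', '$1$2', $s);    // strip every comma surrounded by numbers
$s = preg_replace('/\s{2,}/', ' ', $s);          // strip multiple white space (leave single)
$s = preg_replace('/\s+(?=[A-Za-z])/', '', $s);  // strip white space immediately before an alpha character
echo $s; // labour 18909,liberals 12365,conservatives 14720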