I am hoping the regular expression experts can tell me why this is going wrong:
This regex:
$pattern = '/(?<percent>[0-9]{1,3}\.[0-9]{1,2})% of (?<filesize>.+) at/';
Should match this sort of string:
[download] 87.1% of 4.40M at 107.90k/s ETA 00:05
[download] 89.0% of 4.40M at 107.88k/s ETA 00:04
[download] 91.4% of 4.40M at 106.09k/s ETA 00:03
[download] 92.9% of 4.40M at 105.55k/s ETA 00:03
Correct? Is there anything that could go wrong with that regex that will not get it to match with the above input? Full usage here:
while (!feof($handle))
{
    $progress = fread($handle, 8192);
    $pattern = '/(?<percent>[0-9]{1,3}\.[0-9]{1,2})% of (?<filesize>.+) at/';
    if (preg_match_all($pattern, $progress, $matches)) {
        // matched
    }
}
Could the amount of data read by fread be affecting whether the regex works correctly?
I really need confirmation, as I am trying to identify why it isn't working on a new server. This question is related to "Change in Server Permits script not to work". Can this be due to php.ini being different?
Thanks all
I have made a standalone script to test the regex, but even on its own it doesn't match:
<?php
error_reporting(E_ALL);
echo 'Start';
$progress = "[download]75.1% of 4.40M at 115.10k/s ETA 00:09 [download] 77.2% of 4.40M at 112.36k/s ETA 00:09 [download] 78.6% of 4.40M at 111.41k/s ETA 00:08 [download] 80.3% of 4.40M at 110.80k/s ETA 00:07 [download] 82.3% of 4.40M at 110.30k/s ETA 00:07 [download] 84.3% of 4.40M at 108.33k/s ETA 00:06 [download] 85.7% of 4.40M at 107.62k/s ETA 00:05 [download] 87.5% of 4.40M at 107.21k/s ETA 00:05 [download] 89.5% of 4.40M at 105.10k/s ETA 00:04 [download] 90.7% of 4.40M at 106.45k/s ETA 00:03 [download] 93.2% of 4.40M at 104.92k/s ETA 00:02 [download] 94.8% of 4.40M at 104.40k/s ETA 00:02 [download] 96.5% of 4.40M at 102.47k/s ETA 00:01 [download] 97.7% of 4.40M at 103.48k/s ETA 00:01 [download] 100.0% of 4.40M at 103.15k/s ETA 00:00 [download] 100.0% of 4.40M at 103.16k/s ETA 00:00
";
$pattern = '/(?<percent>\d{1,3}\.\d{1,2})%\s+of\s+(?<filesize>[\d.]+[kBM]) at/';
if (preg_match_all($pattern, $progress, $matches)) {
    echo 'match';
}
echo '<br>Done<br>';
?>
I am not that familiar with named capture, but I think in PHP it should be:
$pattern = '/(?P<percent>[0-9]{1,3}\.[0-9]{1,2})% of (?P<filesize>.+) at/';
Notice the P after the question mark.
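To illustrate the point, here is a minimal sketch that runs the question's pattern with the (?P<name>...) spelling against one of the sample lines. As far as I recall from the PHP manual, the shorter (?<name>...) form is only accepted from PHP 5.2.2 onwards, so an older PHP on one of the servers could explain the difference; that is an assumption about your setup, not something I can confirm. The variable names $line and $m are just for illustration.

```php
<?php
// Sample line copied from the question.
$line = '[download] 87.1% of 4.40M at 107.90k/s ETA 00:05';

// (?P<name>...) is the named-group spelling that older PCRE builds also accept.
$pattern = '/(?P<percent>[0-9]{1,3}\.[0-9]{1,2})% of (?P<filesize>.+) at/';

if (preg_match($pattern, $line, $m)) {
    // Named groups show up as string keys in the match array.
    echo $m['percent'], ' of ', $m['filesize'], "\n";
} else {
    echo "no match\n";
}
```

If this prints a match on both servers while the (?<name>...) version only works on one, the PHP/PCRE version is the likely culprit.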
Source:
The regex seems okay to me.
However, there are some things I would improve:
- match whitespace with "\s+", instead of " "
- match digits with "\d", not with "[0-9]" (same thing, it's just shorter)
- replace ".+" with something more specific
This would be my version:
(?<percent>\d{1,3}\.\d{1,2})%\s+of\s+(?<filesize>[\d.]+[kBM])
Depending on how likely you think malformed numbers are (I would guess: not very), you can shorten it to:
(?<percent>[\d.]+)%\s+of\s+(?<filesize>[\d.]+[kBM])
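As a rough sketch of how the stricter pattern behaves on input without newlines, the snippet below runs it through preg_match_all with PREG_SET_ORDER on three records taken from the question. I have swapped in the (?P<name>...) spelling so it also runs on older PHP versions; that substitution and the variable names are my own choices.

```php
<?php
// Three progress records from the question, joined on one line (no newlines).
$progress = "[download] 87.1% of 4.40M at 107.90k/s ETA 00:05 "
          . "[download] 89.0% of 4.40M at 107.88k/s ETA 00:04 "
          . "[download] 100.0% of 4.40M at 103.15k/s ETA 00:00";

// Specific character classes instead of ".+", so a match cannot run past one record.
$pattern = '/(?P<percent>\d{1,3}\.\d{1,2})%\s+of\s+(?P<filesize>[\d.]+[kBM])/';

// PREG_SET_ORDER gives one array per match, each holding both named groups.
$count = preg_match_all($pattern, $progress, $matches, PREG_SET_ORDER);

foreach ($matches as $m) {
    echo $m['percent'], '% of ', $m['filesize'], "\n";
}
```

Because the filesize group can only consume digits, dots and a single unit letter, each record is matched separately even though everything sits on one line.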
If your stream delivers more than 8 KB of data at a time, fread() will probably cut the last line of the chunk in half, which will prevent that line from being matched. Try reading the stream one line at a time using fgets() instead.
I would use fgets() to read the stream line by line, since I assume you want to match per line. If you match per line, you no longer need preg_match_all; preg_match is enough.
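A minimal sketch of the line-by-line approach follows. The in-memory stream is only a stand-in for your real $handle, and the \S+ filesize group and the $seen array are my own assumptions for the example.

```php
<?php
// Stand-in for the real process handle: an in-memory stream with sample lines.
$handle = fopen('php://memory', 'w+');
fwrite($handle, "[download] 91.4% of 4.40M at 106.09k/s ETA 00:03\n");
fwrite($handle, "[download] 92.9% of 4.40M at 105.55k/s ETA 00:03\n");
rewind($handle);

$pattern = '/(?P<percent>\d{1,3}\.\d{1,2})%\s+of\s+(?P<filesize>\S+)\s+at/';
$seen = array();

// fgets() returns one line per call and false once the stream is exhausted.
while (($line = fgets($handle)) !== false) {
    // One record per line, so a single preg_match is sufficient.
    if (preg_match($pattern, $line, $m)) {
        $seen[] = $m['percent'];
        echo $m['percent'], '% of ', $m['filesize'], "\n";
    }
}
fclose($handle);
```

This only helps if the output really is newline-terminated; if the tool separates updates some other way, fgets() will hand back one long line.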
You only seem to have one decimal place in your percentage, but you match one or two digits ({1,2})?
Is there anything that could go wrong with that regex that will not get it to match with the above input?
Not that I can see, but there's something that does go wrong to make it match far too much: if you really don't have newlines, then this:
(?P<filesize>.+) at
can match greedily from the start to the last “ at” in the input. So if I match against the whole example input you posted, I get a <percent> of:
75.1
(good) and a filesize of:
4.40M at 115.10k/s ETA 00:09 [download] 77.2% of 4.40M at 112.36k/s ETA 00:09 [download] 78.6% of 4.40M at 111.41k/s ETA 00:08 [download] 80.3% of 4.40M at 110.80k/s ETA 00:07 [download] 82.3% of 4.40M at 110.30k/s ETA 00:07 [download] 84.3% of 4.40M at 108.33k/s ETA 00:06 [download] 85.7% of 4.40M at 107.62k/s ETA 00:05 [download] 87.5% of 4.40M at 107.21k/s ETA 00:05 [download] 89.5% of 4.40M at 105.10k/s ETA 00:04 [download] 90.7% of 4.40M at 106.45k/s ETA 00:03 [download] 93.2% of 4.40M at 104.92k/s ETA 00:02 [download] 94.8% of 4.40M at 104.40k/s ETA 00:02 [download] 96.5% of 4.40M at 102.47k/s ETA 00:01 [download] 97.7% of 4.40M at 103.48k/s ETA 00:01 [download] 100.0% of 4.40M at 103.15k/s ETA 00:00 [download] 100.0% of 4.40M
(not quite so good). To avoid this use the non-greedy match “.+?”, or a more specific expression like “[^ ]+” or Tomalak's version.
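Here is a small sketch of the difference, using two records from the question on a single line. The only change between the two patterns is ".+" versus ".+?"; I used the (?P<name>...) spelling so it runs on older PHP as well, and the variable names are made up for the example.

```php
<?php
// Two records on one line, as in the test input without newlines.
$input = "[download] 75.1% of 4.40M at 115.10k/s ETA 00:09 "
       . "[download] 77.2% of 4.40M at 112.36k/s ETA 00:09";

$greedy = '/(?P<percent>[0-9]{1,3}\.[0-9]{1,2})% of (?P<filesize>.+) at/';
$lazy   = '/(?P<percent>[0-9]{1,3}\.[0-9]{1,2})% of (?P<filesize>.+?) at/';

preg_match($greedy, $input, $g);
preg_match($lazy, $input, $l);

// The greedy group runs on to the last " at"; the lazy one stops at the first.
echo "greedy: ", $g['filesize'], "\n";
echo "lazy:   ", $l['filesize'], "\n";

// With the lazy version, preg_match_all finds each record separately.
$total = preg_match_all($lazy, $input, $all);
echo "records found with lazy pattern: ", $total, "\n";
```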
Could the amount of data read by fread be affecting whether the regex works correctly?
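Yes. Reading in chunks is quite unreliable: if a ‘[download]’ line is split over a chunk boundary, it will not match and will be lost. You can either read the stream line by line with fgets(), as the other answers suggest, or buffer the input yourself so that a partial line is completed by the next read.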
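One possible way to keep chunked reads without losing split lines is to carry the unfinished tail of each chunk over to the next read. The sketch below assumes records end in "\n"; extractLines() is a hypothetical helper, and the $chunks array simulates two fread() results where a line is cut in the middle.

```php
<?php
// Hypothetical helper: append a chunk to the buffer and return only complete lines.
function extractLines(&$buffer, $chunk)
{
    $buffer .= $chunk;
    $parts = explode("\n", $buffer);
    // The last piece has no trailing newline yet, so keep it for the next call.
    $buffer = array_pop($parts);
    return $parts;
}

$pattern = '/(?P<percent>\d{1,3}\.\d{1,2})%\s+of\s+(?P<filesize>\S+)\s+at/';

// Simulated fread() results: the second record is split across the boundary.
$chunks = array(
    "[download] 87.1% of 4.40M at 107.90k/s ETA 00:05\n[download] 89.",
    "0% of 4.40M at 107.88k/s ETA 00:04\n",
);

$buffer = '';
$percents = array();
foreach ($chunks as $chunk) {
    foreach (extractLines($buffer, $chunk) as $line) {
        if (preg_match($pattern, $line, $m)) {
            $percents[] = $m['percent'];
        }
    }
}
print_r($percents);
```

In the real loop you would pass each fread() result to the helper instead of iterating over $chunks, and check whatever is left in $buffer once the stream ends.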
As for server differences, the only thing I can think of is that if one of the servers is Windows and one a *ix, they will have different ideas of what a newline is, which might cause the “are there newlines or not?” confusion.
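If line endings turn out to be the problem, one option is to split on any of the common conventions before matching. This is only a sketch: the mixed endings in $raw are invented to show the idea, and I do not know which ones your downloader actually emits.

```php
<?php
// Invented sample mixing Windows (\r\n), bare carriage return (\r) and Unix (\n) endings.
$raw = "[download] 96.5% of 4.40M at 102.47k/s ETA 00:01\r\n"
     . "[download] 97.7% of 4.40M at 103.48k/s ETA 00:01\r"
     . "[download] 100.0% of 4.40M at 103.15k/s ETA 00:00\n";

// Try \r\n first so it is not split twice; PREG_SPLIT_NO_EMPTY drops blank pieces.
$lines = preg_split('/\r\n|\r|\n/', $raw, -1, PREG_SPLIT_NO_EMPTY);

$pattern = '/(?P<percent>\d{1,3}\.\d{1,2})%\s+of\s+(?P<filesize>\S+)\s+at/';
$percents = array();
foreach ($lines as $line) {
    if (preg_match($pattern, $line, $m)) {
        $percents[] = $m['percent'];
    }
}
echo count($lines), " lines, last percent: ", end($percents), "\n";
```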