Regex to clean up a URL

I am looking for a way to get a valid URL out of a string like:

$string = 'http://somesite.com/directory//sites/9/my_forms/3-895a3e/somefilename.jpg|:||:||:||:|19845';

My original solution was:

preg_match('#^[^:|]*#', str_replace('//', '/', $string), $modifiedPath);

But obviously it's going to remove a slash from the http:// as well, not only the one in the middle of the string.

My expected output that I want from the original is:

http://somesite.com/directory/sites/9/my_forms/3-895a3e/somefilename.jpg

I could always break off the http part of the string first but would like a more elegant solution in the form of regex if possible. Thanks.

This will do exactly what you are asking:

 <?php

$string = 'http://somesite.com/directory//sites/9/my_forms/3-895a3e/somefilename.jpg|:||:||:||:|19845';

preg_match('/^([^|]+)/', $string, $m); // get everything up to and NOT including the first pipe (|)
$string = $m[1];

$string = preg_replace('/(?<!:)\/\//', '/', $string); // replace all occurrences of // as long as they are not preceded by :

echo $string; // outputs: http://somesite.com/directory/sites/9/my_forms/3-895a3e/somefilename.jpg

exit;

 ?>

EDIT:

(?<!X) in regular expressions is the syntax for what is called a lookbehind. The X is replaced with the character(s) we are testing for.

The following expression would match every instance of double slashes (//):

\/\/

But we need to make sure that the match we are looking for is NOT preceded by the : character so we need to 'lookbehind' our match to see if the : character is there. If it is then we don't want it to be counted as a match:

(?<!:)\/\/

The ! is what says NOT to match in our lookbehind. If we changed it to (?<=:)\/\/ then it would only match the double slashes that do have the : preceding them.

Here is a quick tutorial on lookahead and lookbehind that can explain it all better than I can.
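To see the difference between the two kinds of lookbehind in practice, here is a small sketch that runs both patterns against the URL from the question (already cut at the first pipe) and reports where each one matches. The variable names are only for illustration.

```php
<?php
// Sample URL from the question, already cut at the first pipe.
$url = 'http://somesite.com/directory//sites/9/my_forms/3-895a3e/somefilename.jpg';

// Negative lookbehind: match // only when it is NOT preceded by a colon.
$negativeCount = preg_match_all('/(?<!:)\/\//', $url, $negative, PREG_OFFSET_CAPTURE);

// Positive lookbehind: match // only when it IS preceded by a colon.
$positiveCount = preg_match_all('/(?<=:)\/\//', $url, $positive, PREG_OFFSET_CAPTURE);

// With PREG_OFFSET_CAPTURE each match is a [text, offset] pair.
echo 'negative: ', $negativeCount, ' match at offset ', $negative[0][0][1], "\n";
echo 'positive: ', $positiveCount, ' match at offset ', $positive[0][0][1], "\n";
```

The negative lookbehind finds only the // after "directory" (offset 29), while the positive one finds only the // right after "http:" (offset 5).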

Assuming all your strings are in the form given, you don't need any but the simplest of regexes to do this; if you want an elegant solution, then a regex is definitely not what you need. Also, double slashes are legal in a URL, just like in a Unix path, and most web servers treat them the same as a single slash, so you usually don't need to get rid of them at all.

Why not just

$url = preg_split('/\|/', $string)[0]; // index directly: array_shift() takes its argument by reference and complains about a function result

If you really, really care about getting rid of the double slashes in the URL, then you can follow this with

$url = preg_replace('/([^:])\/\//', '$1/', $url);

or even combine them into

$url = preg_replace('/([^:])\/\//', '$1/', preg_split('/\|/', $string)[0]);

although that last form gets a little bit hairy.
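If the hairy one-liner is a concern, the two steps can be wrapped in a small function. This is only a sketch: the name cleanUrl is made up for illustration, it swaps preg_split for a plain explode() (no regex needed to cut at a literal pipe), and it uses a lookbehind rather than a capture group so that runs of three or more slashes are collapsed too.

```php
<?php
// Hypothetical helper; the name cleanUrl is an assumption for illustration.
function cleanUrl(string $raw): string
{
    // Keep only the part before the first pipe; explode() needs no regex.
    $url = explode('|', $raw, 2)[0];

    // Collapse any run of two or more slashes that does not directly follow a colon.
    return preg_replace('/(?<!:)\/{2,}/', '/', $url);
}

$string = 'http://somesite.com/directory//sites/9/my_forms/3-895a3e/somefilename.jpg|:||:||:||:|19845';
$cleaned = cleanUrl($string);

echo $cleaned, "\n"; // http://somesite.com/directory/sites/9/my_forms/3-895a3e/somefilename.jpg
```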

Since this is a quite strictly defined situation, I'd consider just one preg to be the most elegant solution.

Off the top of my head:

$sanitizedURL = preg_replace('~((?<!:)/(?=/)|\\|.+)~', '', $rawURL);

Basically, what this does is look for any forward slash that IS NOT preceded by a colon (:) and IS followed by another forward slash. It also matches the first pipe character and everything following it.

Anything found is removed from the result.
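For reference, here is the same idea written with the x (extended) modifier so each branch can carry its own comment, and run against the string from the question. It is a sketch rather than a drop-in: the outer capture group is dropped because nothing uses it, and .+ is loosened to .* so that a trailing pipe with nothing after it is removed as well.

```php
<?php
$rawURL = 'http://somesite.com/directory//sites/9/my_forms/3-895a3e/somefilename.jpg|:||:||:||:|19845';

// In x mode, whitespace in the pattern is ignored and # starts a comment.
$pattern = '~
    (?<!:) /      # a slash that is not preceded by a colon
    (?=/)         # and is followed by another slash (checked, not consumed)
  |               # OR
    \| .*         # the first pipe and everything after it
~x';

// Everything the pattern matches is replaced with an empty string.
$sanitizedURL = preg_replace($pattern, '', $rawURL);

echo $sanitizedURL, "\n"; // http://somesite.com/directory/sites/9/my_forms/3-895a3e/somefilename.jpg
```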

I can explain the RegEx in more detail if you like.