I have an HTML page in a string, and I need to replace all the spaces in the href attributes of its a tags with %20 so that my parser understands it.
So for example:
<a href="file with spaces.mp3">file with spaces.mp3</a>
needs to turn into
<a href="file%20with%20spaces.mp3">file with spaces.mp3</a>
One space works fine since I can just use
(.+?)([ *])(.+?)
and then substitute it with %20 in between $1 and $3
But how would you do it for an unknown number of spaces, while still keeping the parts of the file name to put the %20's in between?
While using regex on HTML is not recommended, here's a potential regex that works for your example (replace each match with %20):
(?:<a href="|\G)\S*\K (?=[^">]*")
(?:
  <a href="     # Match <a href=" literally
  |
  \G            # Or start the match from the previous end-match
)
\S*             # Match any non-space characters
\K              # Reset the match so only the following matches are replaced
[ ]             # Match a single space (the character that gets replaced)
(?=[^">]*")     # Ensure that the matching part is still within the href link
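A minimal sketch of applying this pattern in PHP with preg_replace, replacing each matched space with %20. The variable names are illustrative, and it assumes the attribute is double-quoted and written exactly as <a href=":

```php
<?php
// The pattern from above: it matches one space at a time inside a
// double-quoted href value.
$pattern = '/(?:<a href="|\G)\S*\K (?=[^">]*")/';

$html = '<a href="file with spaces.mp3">file with spaces.mp3</a>';

// Everything before \K is dropped from the match, so only the space
// itself is replaced.
$result = preg_replace($pattern, '%20', $html);

echo $result, PHP_EOL;
// <a href="file%20with%20spaces.mp3">file with spaces.mp3</a>
```

The link text is left alone because the lookahead requires a closing quote before the next > character.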
The above regex can still break on certain edge cases, so I recommend using DOMDocument, as in Amal's excellent answer, which is more robust.
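One such edge case, as a sketch: \G also matches at the very start of the string, so when the input does not begin with <a href=", the space inside the tag itself gets matched. The sample input below is made up for illustration, and the (?!^) guard is a common idiom rather than a complete fix:

```php
<?php
$pattern = '/(?:<a href="|\G)\S*\K (?=[^">]*")/';

// Hypothetical input: the link is wrapped in another element.
$html = '<p><a href="a b.mp3">x</a></p>';

$broken = preg_replace($pattern, '%20', $html);
echo $broken, PHP_EOL;
// <p><a%20href="a%20b.mp3">x</a></p>

// Guarding \G with (?!^) stops it from matching at the start of the string.
$guarded = '/(?:<a href="|(?!^)\G)\S*\K (?=[^">]*")/';

$fixed = preg_replace($guarded, '%20', $html);
echo $fixed, PHP_EOL;
// <p><a href="a%20b.mp3">x</a></p>
```

Even with the guard, the pattern still assumes double quotes and href as the first attribute, which is why a DOM parser is the safer choice.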
HTML is not a regular language and cannot be properly parsed using a regular expression. Use a DOM parser instead. Here's a solution using PHP's built-in DOMDocument class:
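$dom = new DOMDocument;
$dom->loadHTML($html);

// Rewrite the href of every <a> element, encoding spaces as %20.
foreach ($dom->getElementsByTagName('a') as $tag) {
    $href = $tag->getAttribute('href');
    $href = str_replace(' ', '%20', $href);
    $tag->setAttribute('href', $href);
}

// Serialize the modified document back to a string.
$html = $dom->saveHTML();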
It basically iterates over all the links and changes each href attribute using str_replace.
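As a sketch of how this could be wrapped for reuse: by default, saveHTML() serializes a full document, adding a doctype and html/body wrappers around a fragment. The libxml flags below ask the parser not to add them. The function name encodeHrefSpaces is hypothetical, and the flags are known to behave oddly when the fragment has several top-level elements, so treat this as an assumption-laden example for single-element input:

```php
<?php
// Hypothetical helper: encode spaces in every href of an HTML fragment.
function encodeHrefSpaces(string $html): string
{
    $dom = new DOMDocument;

    // Collect parser warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);

    // Do not add implied <html>/<body> elements or a default doctype.
    $dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($dom->getElementsByTagName('a') as $tag) {
        if ($tag->hasAttribute('href')) {
            $href = str_replace(' ', '%20', $tag->getAttribute('href'));
            $tag->setAttribute('href', $href);
        }
    }

    return $dom->saveHTML();
}

$out = encodeHrefSpaces('<a href="file with spaces.mp3">file with spaces.mp3</a>');
echo $out;
```

The link text keeps its spaces because only the href attribute is touched.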