For a project, I need to fetch a website's content and alter the HTML code. Every link on that website has to be replaced with my own as well. I used str_replace until I realized that links sometimes have classes assigned to them.
I've tried the preg_replace function to prepend my own URL to every href value that sits inside an <a> tag. It shouldn't matter whether the fetched website in $content uses href="" or href=''.
$content = preg_replace('~(<a\b[^>]*\shref=")([^"]*)(")~igs', '\1http://website.com/fetch.php?url=\2\3', $content);
This does not work and I can't find the error. It should behave as follows:
<a class="link" href="http://google.com">Google</a>
should turn into
<a class="link" href="http://website.com/fetch.php?url=http://google.com">Google</a>
Can someone help me find the error? Thank you in advance.
Don't half-arse a regex that will miss plenty of cases. Just read each document into a DOM tree (give this html5 DOM parser a go), use XPath to get all links with href attributes, update them, then save the result.
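A minimal sketch of the DOM approach described above. The HTML5 parser the answer links to is not named here, so this assumes PHP's bundled DOMDocument and DOMXPath instead; the function name rewriteLinks and the proxy prefix are illustrative only. Note that DOMDocument wraps a fragment in html and body tags when saving, and that urlencode is applied so the original URL survives as a query parameter.

```php
<?php
// Sketch only: rewriteLinks() and the proxy prefix are assumptions for illustration.
function rewriteLinks(string $html, string $proxyPrefix): string
{
    $doc = new DOMDocument();

    // Real-world markup is rarely valid; collect libxml warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Select every <a> element that carries an href attribute.
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//a[@href]') as $link) {
        $original = $link->getAttribute('href');
        $link->setAttribute('href', $proxyPrefix . urlencode($original));
    }

    // Serialize the whole document back to an HTML string.
    return $doc->saveHTML();
}

$content = '<a class="link" href="http://google.com">Google</a>';
echo rewriteLinks($content, 'http://website.com/fetch.php?url=');
```

Because the parser reads the attribute itself, it makes no difference whether the source used single or double quotes, or where the class attribute sits.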
Just use SimpleXML to read the href and preg_replace to swap it:
<?php
$string = '<a class="link" href="http://google.com">Google</a>';
// Parse the tag to read its href attribute (the input must be well-formed XML).
$a = new SimpleXMLElement($string);
$newurl = "http://website.com/fetch.php?url=" . urlencode((string) $a['href']);
// Match the attribute value between the quotes, single or double.
$pattern = "/(?<=href=(\"|'))[^\"']+(?=(\"|'))/";
$body = preg_replace($pattern, $newurl, $string);
echo $body;
?>
OUTPUT:
<a class="link" href="http://website.com/fetch.php?url=http%3A%2F%2Fgoogle.com">Google</a>
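As for the error in the question's pattern: PHP's PCRE functions have no g modifier, so preg_replace emits an "Unknown modifier 'g'" warning and returns null. preg_replace already replaces every match by default. The sketch below drops that flag and accepts either quote style through a backreference; the function name prefixLinks is hypothetical, and the URL is passed through urlencode as in the snippet above. It assumes quoted href values and will still miss edge cases that a DOM parser handles.

```php
<?php
// Sketch only: prefixLinks() is an illustrative name, not an existing API.
function prefixLinks(string $html, string $proxyPrefix): string
{
    // Group 1: '<a ... href=', group 2: the opening quote, group 3: the URL.
    // \2 requires the closing quote to match the opening one.
    $pattern = '~(<a\b[^>]*\shref=)(["\'])(.*?)\2~is';

    $result = preg_replace_callback(
        $pattern,
        function (array $m) use ($proxyPrefix): string {
            return $m[1] . $m[2] . $proxyPrefix . urlencode($m[3]) . $m[2];
        },
        $html
    );

    // preg_replace_callback returns null on failure; fall back to the input.
    return $result ?? $html;
}

$content = '<a class="link" href="http://google.com">Google</a>';
echo prefixLinks($content, 'http://website.com/fetch.php?url=');
```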