I want to extract all the URLs from an XML file, excluding the tracking code in each URL.
Here's an example URL; they all follow the same format:
http://www.domain.com.au/category/pXXXXXX?uni_id=XXXXXX&cid=1_demo_1
The only thing that changes between URLs is XXXXXX, which is a numerical value.
The end result I want is:
http://www.domain.com.au/category/pXXXXXX
I tried using preg_replace in the code below, but it ended up replacing the whole URL with what looks like a random number:
$data = preg_replace('/http\:\/\/www\.domain\.com.au\/[^\?]+([^.]+)/','',$data);
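Match all URLs in the XML with preg_match_all() (preg_match() stops after the first match). Excluding < and " from the character class keeps the closing XML tag or attribute quote out of the match:
preg_match_all("(http://[^\s<\"]+|ftp://[^\s<\"]+)", $input, $matches);
Then use preg_replace() with a pattern that matches only the part of the string that needs to be removed. Keep the return value, because preg_replace() returns the new string and does not modify its argument:
$urls = array();
foreach ($matches[0] as $value)
{
    // Strip the "?" and everything after it (the tracking parameters).
    $urls[] = preg_replace("(\?[^\s]+)", "", $value);
}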
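Putting the two steps together, here is a minimal, self-contained sketch. The function name extractCleanUrls and the sample XML are made up for illustration, and the pattern also accepts https://, which the question does not mention, so adjust it to your data:

```php
<?php
// Hypothetical helper (not from the original post): find every URL in
// $xml and return it without its query string.
function extractCleanUrls(string $xml): array
{
    // Stop at whitespace, angle brackets and quotes so that surrounding
    // XML markup is not captured as part of the URL.
    preg_match_all('~(?:https?|ftp)://[^\s<>"\']+~', $xml, $matches);

    $clean = [];
    foreach ($matches[0] as $url) {
        // Remove the "?" and everything after it.
        $clean[] = preg_replace('~\?.*$~', '', $url);
    }
    return $clean;
}

// Sample input, assumed for illustration only.
$xml = '<items><link>http://www.domain.com.au/category/p123456?uni_id=123456&amp;cid=1_demo_1</link></items>';
print_r(extractCleanUrls($xml));
```

For the sample input, the returned array should hold the single entry http://www.domain.com.au/category/p123456. If the XML is well formed, parsing it with SimpleXML or DOMDocument and cleaning each node's value may be more robust than running a regex over the raw text.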