I have a variable that contains an entire article including text and some links.
I need to loop through the content in the variable and find all links containing a specific word. Once they have been found, I need to remove everything after the last / in each of those URLs.
For example: Let's say the page has 8 links - 4 of them contain the word "article". I need to find each of those links that contain the word "article" and then remove everything after the last occurrence of / in each of those links.
So far I've tried using some Regex such as:
/<a.*?href\s*=\s*["\']([^"\'>]*article[^"\'>]*)["\'][^>]*>.*?<\/a>/si
But I haven't found a way to actually remove everything after the last /.
Any ideas on how this could be accomplished?
Using DOM tools means that you care much more about your CPU. I'm not saying that regex, which is meant purely for text processing, can't offer a solution here, but a specific tool for a specific job is almost always cleaner and performs better.
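For reference, the DOM route could look roughly like the sketch below. It is only an illustration: the function name trimLinksContaining is made up, and it assumes the content is an HTML fragment that DOMDocument can parse and that plain default encoding handling is good enough.

```php
<?php
// Hypothetical helper (the name is illustrative): for every <a> whose href
// contains $word, drop everything after the last "/" in that href.
function trimLinksContaining(string $html, string $word = 'article'): string
{
    $doc = new DOMDocument();

    // Real-world markup is rarely perfect, so collect libxml warnings quietly.
    $previous = libxml_use_internal_errors(true);
    // Wrapping in a <div> gives the fragment a single root we can find again.
    // Assumption: non-ASCII content may need an explicit encoding hint here.
    $doc->loadHTML('<div>' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');
        if (strpos($href, $word) === false) {
            continue; // not one of the links we care about
        }
        $cut = strrpos($href, '/');
        if ($cut !== false) {
            // Keep the slash itself, remove what follows it.
            $a->setAttribute('href', substr($href, 0, $cut + 1));
        }
    }

    // Serialize only the children of the wrapper, not the <html>/<body>
    // elements that libxml adds around it.
    $wrapper = $doc->getElementsByTagName('div')->item(0);
    $out = '';
    foreach ($wrapper->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

$html = '<p>See <a href="http://example.com/article/123">one</a> '
      . 'and <a href="http://example.com/news/5">two</a>.</p>';
echo trimLinksContaining($html);
```

Only the first link is touched here, because the second href does not contain "article". Note that the parser may normalize the surrounding markup slightly when it serializes it back.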
Based on what you have described, I modified your regex as follows:
(<a(?>.+?)href\s*=\s*(["'])(?>[^"'><]*?article)(?>[^>]*?/))(.*?)(\2.*?>[^<]++</a>)
You then only need to replace each complete match with the first and fourth captured groups, which drops the third group (the part of the URL to be removed). So the code would be:
echo preg_replace('~(<a(?>.+?)href\s*=\s*(["\'])(?>[^"\'><]*?article)(?>[^>]*?/))(.*?)(\2.*?>[^<]++</a>)~s', '\1\4', $html);
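If you would rather do the cutting in PHP than inside the pattern, a variant is to capture the whole href value and trim it in a callback. This is a different technique from the pattern above, sketched under assumptions: the function name trimHrefsWithWord is hypothetical, href values are assumed to be quoted, and the match is case-insensitive.

```php
<?php
// Hypothetical helper (the name is illustrative): capture each quoted href that
// contains $word, then cut it after its last "/" inside a callback.
function trimHrefsWithWord(string $html, string $word = 'article'): string
{
    // Group 1: everything from "<a" up to and including the opening quote.
    // Group 2: the quote character itself (nested inside group 1).
    // Group 3: the href value, which must contain $word.
    // Group 4: the matching closing quote.
    $pattern = '~(<a\s(?:[^>]*?\s)?href\s*=\s*(["\']))'
             . '([^"\'>]*' . preg_quote($word, '~') . '[^"\'>]*)(\2)~i';

    return preg_replace_callback($pattern, function (array $m): string {
        $cut = strrpos($m[3], '/');
        // No slash at all: leave the value alone. Otherwise keep up to the slash.
        $href = $cut === false ? $m[3] : substr($m[3], 0, $cut + 1);
        return $m[1] . $href . $m[4];
    }, $html);
}

$html = '<a href="http://example.com/article/123">one</a> '
      . "<a href='http://example.com/news/5'>two</a>";
echo trimHrefsWithWord($html);
// <a href="http://example.com/article/">one</a> <a href='http://example.com/news/5'>two</a>
```

Because strrpos finds the last slash in the captured value, the cut does not depend on where the word appears in the URL.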
I have made a live demo as well.