Many Instagram posts end with a plethora of hashtags, for example:
"This is one of the amazing Mountains you can find in the National Forest Park in #Zhangjiajie #Chinawhich is where James Cameron drew his inspiration for the flying mountains in #Avatar..
Credit: @phototravelnomads
#pictoura #gydr
#destinationearth #earthpix #ourlonelyplanet#wonderful_earthLife #timeoutsociety#fantastic_earthpics #liveoutdoors #igglobalclub#awesomeearth #mist_vision #earthdeluxe
# #worldbestgram #mthrworld #fantastic_earth#famouscaptures #destination_wow #dreamlifepix#wonderful_places #igworldclub #ig_global_life
#natureaddict #beautifuldestinations #traveler #guider#locals"
I'm looking to process the captions to remove the hashtag collection at the end, while leaving the rest intact. What would be a good approach to doing this? I'm sure I can figure out a brute force way, but I'm hoping to get some thoughts on an elegant solution. Doesn't have to be actual code. :)
Edit per burna's comment: The expected result would be:
"This is one of the amazing Mountains you can find in the National Forest Park in #Zhangjiajie #Chinawhich is where James Cameron drew his inspiration for the flying mountains in #Avatar..
Credit: @phototravelnomads"
Edit per Alan Moore's answer: This works quite well, but not in every situation. For instance, if the input text were:
"This is one of the amazing Mountains you can find in the National Forest Park in #Zhangjiajie #Chinawhich is where James Cameron drew his inspiration for the flying mountains in #Avatar"
... it would be cut off from "#Zhangjiajie" on.
I'm thinking there's probably a bit more logic required, perhaps splitting the string into an array; checking if it ends in hashtags; if so then how many; if more than X (4?), cut it off from the first one in the last complete series.
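One way the heuristic above could be sketched in PHP, assuming a "series" means the unbroken run of hashtags (and whitespace) that ends the caption. The function name `stripTrailingHashtags` and the `$maxKept` threshold are made up for illustration; this is a rough sketch rather than a tested solution for every caption.

```php
<?php
// Hypothetical helper: the function name and the default threshold are illustrative only.
function stripTrailingHashtags(string $caption, int $maxKept = 4): string
{
    // Find the run of hashtags (plus separating whitespace) that ends the caption.
    // "#\w*" also accepts a lone "#", which appears in the sample data.
    if (!preg_match('/(?:\s*#\w*)+\s*\z/', $caption, $m, PREG_OFFSET_CAPTURE)) {
        return $caption; // the caption does not end in hashtags
    }

    [$run, $offset] = $m[0];

    // Count the real hashtags in that trailing run.
    $count = preg_match_all('/#\w+/', $run);

    // A short run is assumed to be part of the sentence, so keep it.
    if ($count <= $maxKept) {
        return $caption;
    }

    // Otherwise cut the caption where the run starts.
    return rtrim(substr($caption, 0, $offset));
}
```

With the default threshold, a caption that merely ends in "#Avatar" should come back unchanged, while a caption ending in a long block of tags should be cut right after the credit line. The threshold is a guess and would need tuning against real captions.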
If I understand correctly, the following should work:
$hashTag="pictoura #gydr
destinationearth #earthpix #ourlonelyplanet#wonderful_earthLife #timeoutsociety#fantastic_earthpics #liveoutdoors #igglobalclub#awesomeearth #mist_vision #earthdeluxe
#worldbestgram #mthrworld #fantastic_earth#famouscaptures #destination_wow #dreamlifepix#wonderful_places #igworldclub #ig_global_life
natureaddict #beautifuldestinations #traveler #guider#locals";
echo preg_replace('/(#.*\s*)/','',$hashTag);
That outputs:
pictoura destinationearth natureaddict
Good luck!!
It looks like this will do it:
$result = preg_replace('/#[#\w\s]*\z/', '', $subject);
The regex matches a hash (#), followed by zero or more of the characters that make up hashtags plus the whitespace that separates them ([#\w\s]*), followed by the end of the string (\z).
\w is equivalent to [A-Za-z0-9_]. If there are other characters that are allowed in hashtags, or if digits are not allowed, let me know and I'll update the regex.
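A quick demonstration of that regex on two made-up captions (the sample strings are illustrative, not taken from the question). It shows the intended case and also the case where hashtags inside the sentence run all the way to the end of the text:

```php
<?php
$pattern = '/#[#\w\s]*\z/';

// Caption that ends in a separate block of hashtags.
$withTagBlock = "Misty peaks in the park.\nCredit: @someone\n#earthpix #liveoutdoors\n#traveler";

// Caption whose only hashtags are part of the sentence.
$inlineOnly = "Peaks in #Zhangjiajie that inspired #Avatar";

// The trailing block is removed; the line break before it is left in place.
$a = preg_replace($pattern, '', $withTagBlock);

// Everything from the first hashtag on consists only of word characters,
// whitespace and hashes, so the match starts at "#Zhangjiajie".
$b = preg_replace($pattern, '', $inlineOnly);

var_dump($a, $b);
```

Here `$a` keeps the text and the credit line, while `$b` is reduced to "Peaks in ", so captions that end in an in-sentence hashtag need extra handling.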
UPDATE: If you want to remove all robo-tags while leaving the legitimate ones, there's probably no reliable way--certainly not with regexes alone. However, this will remove all but the first line of hashtags:
$result = preg_replace('/^(#[#\w\h]+\R)#[#\w\s]*\z/m', '$1', $subject);
\h matches only horizontal whitespace (space, tab, nbsp...), and \R matches any line separator (\r\n or any single vertical whitespace character).
As for hashtag-like things in the text, this won't touch them because it's anchored to the end of the text. The beginning-of-line anchor (^ in multiline mode) isn't really necessary, but it may help future readers of the regex (including yourself) understand what it does. Of course, comments will help even more. ;)
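On the point about comments: one option is the x (extended) flag, which lets the pattern carry its own comments. Below is a sketch of the same multi-line regex written that way. The sample `$subject` is made up for illustration. Note that in x mode an unescaped # outside a character class starts a comment, so the literal hashes there are written as \#.

```php
<?php
// The "keep only the first line of hashtags" pattern, with inline comments.
$pattern = '/
    ^                   # start of a line (m flag)
    ( \# [#\w\h]+ \R )  # group 1: the first line of hashtags and its line break
    \# [#\w\s]*         # all remaining hashtags, across any number of lines
    \z                  # the very end of the string
/mx';

// Illustrative input, not taken from the question.
$subject = "Misty peaks.\n#earthpix #liveoutdoors\n#traveler #guider\n#locals";

// Replace the whole trailing block with just its first line.
$result = preg_replace($pattern, '$1', $subject);

var_dump($result);
```

For this input, `$result` keeps "Misty peaks." and the "#earthpix #liveoutdoors" line and drops the two lines of tags after it.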