I would like to grab all the hashtags using PHP from http://search.twitter.com/search.atom?q=%23eu-jele%C4%A1%C4%A1i
The hashtags are in the content and title nodes of the feed, and each one is prefixed with #.
The problem I am having is with non-English letters (outside of the range a-zA-Z).
If you look at the feed and then view the HTML source, my struggle might be clearer.
<title>And more: #eu-jeleġġi #eu-kiest #ue-wybiera #eu-eleger #ue-alege #eu-vyvolenej #eu-izvoli #eu-elegir #eu-välja #eu-elect</title>
Do I need to do something with the title node before I look for my regexp matches?
My ultimate aim is to replace each hashtag with its Twitter search URL, e.g. http://search.twitter.com/search.atom?q=%23eu-jele%C4%A1%C4%A1i
Here is some sample code to help you along.
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
</head>
<body>
<?php
$title="And more: #eu-jeleġġi #eu-kiest #ue-wybiera #eu-eleger #ue-alege #eu-vyvolenej #eu-izvoli #eu-elegir #eu-välja #eu-elect";
// this is the regexp that hashtags.org use (http://twitter.pbwiki.com/Hashtags)
// note: inside a double-quoted PHP string "\1" is an octal escape, so the backreference is written as $1
$r = preg_replace("/(?:(?:^#|[\s\(\[]#(?!\d\s))(\w+(?:[_\-\.\+\/]\w+)*)+)/"," <a href=\"http://search.twitter.com/search?q=%23$1\">$1</a> ", $title);
echo "<p>$r</p>";
$r = preg_replace("/(#.+?)(?:(\s|$))/"," <a href=\"http://search.twitter.com/search?q=$1\">$1</a> ", $title);
echo "<p>$r</p>";
// This is my desired end result
echo "<p><a href=\"http://search.twitter.com/search?q=%23eu-jeleġġi\">#eu-jeleġġi</a></p>";
?>
</body>
</html>
Any advice or solution would be greatly appreciated.
Grab a '#' plus all characters until you hit a whitespace character:
(#.+?)(?:\s)
Or, a little more flexible (also allows end of string):
(#.+?)(?:(\s|$))
Or just
(#\S+)
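As a rough illustration of the last pattern, here is a minimal sketch that turns each match into a search link. The helper name `linkHashtags` is made up for this example, and the `/u` modifier and the use of `rawurlencode()` for the query string are assumptions of the sketch rather than something stated in the question.

```php
<?php
// Hypothetical helper (the name is illustrative, not from the original post).
function linkHashtags(string $text): string
{
    $result = preg_replace_callback(
        '/#\S+/u', // '#' plus every following non-whitespace character; /u = treat input as UTF-8
        function (array $m): string {
            // rawurlencode() percent-encodes '#' and the UTF-8 bytes of non-ASCII letters
            $url = 'http://search.twitter.com/search?q=' . rawurlencode($m[0]);
            return '<a href="' . $url . '">' . htmlspecialchars($m[0], ENT_QUOTES, 'UTF-8') . '</a>';
        },
        $text
    );

    // preg_replace_callback() returns null on failure (e.g. invalid UTF-8 input)
    return $result ?? $text;
}

echo linkHashtags('And more: #eu-jeleġġi #eu-elect'), "\n";
```

Because `\S` matches anything that is not whitespace, trailing punctuation such as a comma directly after a tag would be swallowed into the match as well.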
Why are you using a regexp? Remove anything that's not preceded by a hash, then explode by hash. Regexp seems unnecessarily complicated and ill-suited to the problem.
Perhaps you can explain further why this needs to be done in a regexp?
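One possible reading of the non-regexp approach is sketched below. It differs slightly from the description above: it splits on spaces and keeps the words that start with a hash. The function name `extractHashtags` is hypothetical, and the sketch assumes tags are separated by single spaces.

```php
<?php
// Hypothetical helper: collects space-separated words that begin with '#'.
function extractHashtags(string $text): array
{
    $tags = [];
    foreach (explode(' ', $text) as $word) {
        // Require at least one character after the '#'.
        if (strlen($word) > 1 && $word[0] === '#') {
            $tags[] = $word;
        }
    }
    return $tags;
}

print_r(extractHashtags('And more: #eu-jeleġġi #eu-välja plain'));
```

This works on raw bytes, which is safe for UTF-8 input here because the bytes of a multibyte character never equal an ASCII space or '#'.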
Here's what I would use:
(?<![^\s#])(#[^\s#]+)(?=(\s|$))
Example: matching against this string:
#test #test#test #test-test test#test
Hope this is helpful.
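To see what that pattern picks out of the sample string, a small sketch like the following can be run. The `/u` modifier is an addition for UTF-8 input and is not part of the pattern as posted.

```php
<?php
$pattern = '/(?<![^\s#])(#[^\s#]+)(?=(\s|$))/u';
$subject = '#test #test#test #test-test test#test';

preg_match_all($pattern, $subject, $matches);

// $matches[1] holds the captured hashtags.
print_r($matches[1]);
// Array ( [0] => #test [1] => #test-test )
```

Only the tags that stand alone between whitespace are matched; `#test#test` and `test#test` are skipped because of the lookbehind and lookahead.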
If you need the exact regular expression Twitter uses to render hashtags, Twitter provides it, along with the patterns for link, mentions, etc., in this open source library.
(^|[^0-9A-Z&/]+)(#|\uFF03)([0-9A-Z_]*[A-Z_]+[a-z0-9_\\u00c0-\\u00d6\\u00d8-\\u00f6\\u00f8-\\u00ff]*)
The above pattern can be pieced together from this Java file. Validation tests for this pattern are located in this file, around line 115.
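The pattern is written for Java, so it needs some adaptation before PHP's PCRE functions accept it. The sketch below is one possible port, not Twitter's own code: `\uXXXX` escapes become `\x{XXXX}`, the `/u` modifier is added for UTF-8, and `/i` is added on the assumption that the original is compiled case-insensitively. The function name `twitterStyleHashtags` is hypothetical.

```php
<?php
// Hypothetical PHP port of the pattern above; the /i and /u modifiers are assumptions.
function twitterStyleHashtags(string $text): array
{
    $pattern = '/(^|[^0-9A-Z&\/]+)(#|\x{FF03})'
             . '([0-9A-Z_]*[A-Z_]+[a-z0-9_\x{00c0}-\x{00d6}\x{00d8}-\x{00f6}\x{00f8}-\x{00ff}]*)/iu';

    preg_match_all($pattern, $text, $matches);

    // Group 3 is the tag text without the leading '#'.
    return $matches[3] ?? [];
}

print_r(twitterStyleHashtags('#välja #eu_elect'));   // välja, eu_elect
print_r(twitterStyleHashtags('And more: #eu-välja')); // eu
```

Note that this pattern stops at a hyphen and only covers the Latin-1 accented range, so tags like #eu-jeleġġi from the question would be cut short at "eu".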