Can anyone crack this Twitter regexp?

I would like to grab all the hashtags using PHP from http://search.twitter.com/search.atom?q=%23eu-jele%C4%A1%C4%A1i

The hashtags are in the content and title nodes of the feed. They are prefixed with #.

The problem I am having is with non-English letters (outside of the range a-zA-Z).

If you look at the feed and then view the HTML source, my struggle might be clearer.

    <title>And more: #eu-jele&#289;&#289;i #eu-kiest #ue-wybiera #eu-eleger #ue-alege #eu-vyvolenej #eu-izvoli #eu-elegir #eu-v&#228;lja #eu-elect</title>

Do I need to do something with the title node before I look for my regexp matches?

My ultimate aim is to replace the hashtag with the twitter search url e.g. http://search.twitter.com/search.atom?q=%23eu-jele%C4%A1%C4%A1i

Here is some sample code to help you along.


    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
    <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
    </head>

    <body>
    <?php
    $title="And more: #eu-jele&#289;&#289;i #eu-kiest #ue-wybiera #eu-eleger #ue-alege #eu-vyvolenej #eu-izvoli #eu-elegir #eu-v&#228;lja #eu-elect";

    // this is the regexp that hashtags.org use (http://twitter.pbwiki.com/Hashtags)
    // note: the backreference is written \\1 because "\1" in a double-quoted PHP string is an octal escape
    $r = preg_replace("/(?:(?:^#|[\s\(\[]#(?!\d\s))(\w+(?:[_\-\.\+\/]\w+)*)+)/"," <a href=\"http://search.twitter.com/search?q=%23\\1\">\\1</a> ", $title);
    echo "<p>$r</p>";

    // second attempt: a '#' followed by anything up to whitespace or end of string
    $r = preg_replace("/(#.+?)(?:(\s|$))/"," <a href=\"http://search.twitter.com/search?q=\\1\">\\1</a> ", $title);
    echo "<p>$r</p>";

    // This is my desired end result
    echo "<p><a href=\"http://search.twitter.com/search?q=%23eu-jeleġġi\">#eu-jeleġġi</a></p>";
    ?>

    </body>
    </html>

Any advice or solution would be greatly appreciated.

Grab a '#' plus all characters until you hit a whitespace character:

    (#.+?)(?:\s)

Or, a little more flexible (also allows end of string):

    (#.+?)(?:(\s|$))

Or just

    (#\S+)
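To connect this to the question's entity-encoded title, here is a minimal sketch, assuming the title arrives with numeric entities such as &#289; still in it. It decodes the entities first, then links each match of the last pattern above. The helper name linkHashtags() is made up for illustration, and the search URL is the one from the question.

```php
<?php
// Sketch only: linkHashtags() is a hypothetical helper, not a library function.
function linkHashtags(string $title): string
{
    // Turn numeric entities such as &#289; into real UTF-8 characters first,
    // otherwise the '#' inside "&#289;" looks like the start of a hashtag.
    $decoded = html_entity_decode($title, ENT_QUOTES, 'UTF-8');

    // The /u modifier makes PCRE treat pattern and subject as UTF-8.
    return preg_replace_callback(
        '/#\S+/u',
        function (array $m): string {
            // rawurlencode() percent-encodes '#' and the non-ASCII bytes.
            $url  = 'http://search.twitter.com/search?q=' . rawurlencode($m[0]);
            $text = htmlspecialchars($m[0], ENT_QUOTES, 'UTF-8');
            return '<a href="' . $url . '">' . $text . '</a>';
        },
        $decoded
    );
}

echo linkHashtags('And more: #eu-jele&#289;&#289;i #eu-elect'), "\n";
```

Text outside the hashtags comes back decoded and unescaped, so it would still need escaping before being printed into a page if the feed is not trusted.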

Why are you using a regexp? Remove anything that's not preceded by a hash, then explode by hash. Regexp seems unnecessarily complicated and ill-suited to the problem.

Perhaps you can explain further why this needs to be done in a regexp?
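One way to read the no-regexp suggestion, as a rough sketch: split the text on '#', throw away whatever came before the first hash, and keep each remaining piece up to its first whitespace. The function name hashtagsByExplode() is hypothetical, and this is only one possible interpretation of "explode by hash".

```php
<?php
// Sketch only: hashtagsByExplode() is a hypothetical helper name.
function hashtagsByExplode(string $text): array
{
    $pieces = explode('#', $text);
    array_shift($pieces); // the text before the first '#' is not a tag

    $tags = [];
    foreach ($pieces as $piece) {
        // Length of the leading run that contains no whitespace.
        $length = strcspn($piece, " \t\r\n");
        if ($length > 0) {
            $tags[] = '#' . substr($piece, 0, $length);
        }
    }
    return $tags;
}

print_r(hashtagsByExplode('And more: #eu-kiest #ue-wybiera #eu-elect'));
```

Numeric entities like &#289; contain a '#' themselves, so with this approach too the title would have to be entity-decoded before splitting.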

Here's what I would use :)

    (?<![^\s#])(#[^\s#]+)(?=(\s|$))

For example, try matching it against this string:

    #test #test#test #test-test test#test

Hope this is helpful.
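To see what that pattern picks out of the sample string, a quick check with preg_match_all() can be sketched like this (the variable names are arbitrary):

```php
<?php
// The pattern and the sample string are copied from the answer above.
$pattern = '/(?<![^\s#])(#[^\s#]+)(?=(\s|$))/';
$subject = '#test #test#test #test-test test#test';

// Group 1 holds the hashtag itself; group 2 is only the lookahead's capture.
preg_match_all($pattern, $subject, $matches);
print_r($matches[1]);
```

Only the tags that stand alone between whitespace are returned; the run-together #test#test and the trailing test#test are skipped.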

If you need the exact regular expression Twitter uses to render hashtags, Twitter provides it, along with the patterns for links, mentions, etc., in this open source library.

Hashtag Match Pattern

    (^|[^0-9A-Z&/]+)(#|\uFF03)([0-9A-Z_]*[A-Z_]+[a-z0-9_\\u00c0-\\u00d6\\u00d8-\\u00f6\\u00f8-\\u00ff]*)

The above pattern can be pieced together from this Java file. Validation tests for this pattern are located in this file around line 115.
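For use from PHP, the pattern would need translating to PCRE syntax. The following is a rough sketch of such a translation, not something taken from the library: the \uXXXX escapes are rewritten as \x{XXXX}, '~' is used as the delimiter because the pattern contains '/', and the case-insensitive flag is an assumption about how the Java pattern is compiled.

```php
<?php
// Assumed PCRE translation of the Java pattern above.
// The i flag is an assumption; the u flag enables UTF-8 and \x{...} escapes.
$twitterHashtag = '~(^|[^0-9A-Z&/]+)(#|\x{FF03})'
    . '([0-9A-Z_]*[A-Z_]+[a-z0-9_\x{00c0}-\x{00d6}\x{00d8}-\x{00f6}\x{00f8}-\x{00ff}]*)~iu';

// "\u{E4}" is the letter a-umlaut, written as an escape (PHP 7+) so the
// example does not depend on the file's encoding.
$subject = "#eu-v\u{E4}lja #v\u{E4}lja";

// Group 3 is the tag text without the leading '#'.
preg_match_all($twitterHashtag, $subject, $matches);
print_r($matches[3]);
```

Under these assumptions the first tag is cut off at the hyphen (only "eu" is captured), and characters outside the listed ranges, such as ġ (U+0121), would end a tag as well, so this pattern may not suit hashtags like the ones in the question.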