How to match multiple languages

I am writing a regex in Go to capture hashtags that may be written in different languages. English is the obvious one, but users writing in other Latin-script languages or in Arabic will create hashtags with those character sets. I am aware of the Unicode character class names, but how can I use several of them at once without writing a separate regex for each one?

Example code:

r, err := regexp.Compile(`\B(\#[[:ascii:]]+\b)[^?!;]*`)

This matches "#hello #ذوق" and outputs []string{#hello, #ذوق}, but it does not match "#ذوق" on its own.

I suggest using

\B#[\p{L}\p{N}\p{M}_]+

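where [\p{L}\p{N}\p{M}_] is roughly a Unicode-aware version of \w. Several Unicode classes can be listed inside one bracket expression, so a single regex covers all of them: \p{L} matches any Unicode letter, \p{M} matches any combining mark, and \p{N} matches any Unicode numeric character (digits included).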

See the Go demo:

package main

import (
    "fmt"
    "regexp"
)

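func main() {
    // Hashtags in English, Arabic, and accented Latin script.
    text := "#hello #ذوق #citroën"
    // '#' followed by one or more Unicode letters, digits, marks, or '_'.
    r := regexp.MustCompile(`\B#[\p{L}\p{N}\p{M}_]+`)
    // -1 returns every match instead of stopping at a fixed count.
    res := r.FindAllString(text, -1)
    for _, element := range res {
        fmt.Println(element)
    }
}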

Output:

#hello
#ذوق
#citroën

With text := "#ذوق", the output is #ذوق.
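
If you would rather allow only specific scripts than every Unicode letter, the same bracket-expression approach should work with script classes such as \p{Latin} and \p{Arabic}, which Go's regexp syntax supports. The following is a minimal sketch; the names scriptTagRe and findScriptHashtags are made up for illustration, and the choice of scripts is an assumption about which languages you want to accept.

```go
package main

import (
    "fmt"
    "regexp"
)

// scriptTagRe (hypothetical name) accepts only Latin and Arabic script
// characters, plus numbers, combining marks, and '_', after the '#'.
var scriptTagRe = regexp.MustCompile(`\B#[\p{Latin}\p{Arabic}\p{N}\p{M}_]+`)

// findScriptHashtags returns every hashtag written in the allowed scripts.
func findScriptHashtags(text string) []string {
    return scriptTagRe.FindAllString(text, -1)
}

func main() {
    // The Cyrillic tag is skipped because its script is not in the class.
    for _, tag := range findScriptHashtags("#hello #ذوق #привет") {
        fmt.Println(tag)
    }
}
```

Adding another script is a matter of adding one more \p{...} item to the bracket expression, so there is still only one regex to maintain.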

See the regex demo.