Problem with the # character in SQL search strings

I use REGEXP in some of my MySQL search queries, and it works fine unless the query contains a # character.

The REGEXP matches on word boundaries because the field being searched holds entire resumes / curricula vitae stored in the SQL database.

For instance, this works as expected and returns the correct number of results:

SELECT COUNT(*) n FROM candidate c WHERE (c.CV REGEXP '[[:<:]]java[[:>:]]');

However, this one doesn't: it returns 0 results when it should return a few hundred:

SELECT COUNT(*) n FROM candidate c WHERE (c.CV REGEXP '[[:<:]]c#[[:>:]]');

I understand now that this is because I am matching based on word boundaries and # cannot be the end of the word. Interestingly, "C++" works fine though.
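
The behaviour can be reproduced on string literals, without touching the table. This is a minimal sketch, assuming a server older than 8.0.4 where the [[:<:]] and [[:>:]] markers are available; the sample strings are made up for illustration.

```sql
-- Word characters sit on both sides of the markers, so this matches.
SELECT 'senior java developer' REGEXP '[[:<:]]java[[:>:]]';  -- 1

-- '#' is not a word character, so no word can end right after it.
SELECT 'senior c# developer' REGEXP '[[:<:]]c#[[:>:]]';      -- 0

-- Dropping the trailing marker matches, but would also accept text such as 'c#x'.
SELECT 'senior c# developer' REGEXP '[[:<:]]c#';             -- 1
```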

Is there a way of modifying this REGEXP so it also works with a string like "c#"?

You might be able to use something like this:

SELECT 'c#' REGEXP '(^|[^a-zA-Z0-9_])c#($|[^a-zA-Z0-9_])';
SELECT 'java' REGEXP '(^|[^a-zA-Z0-9_])java($|[^a-zA-Z0-9_])';

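In newer MySQL versions (8.0.4+), which implement regex through ICU instead of Henry Spencer's library, you can use \w, which looks a bit cleaner. The backslash has to be doubled, because \ is also the escape character inside MySQL string literals and a single '\w' would reach the regex engine as a plain w:

SELECT 'c#' REGEXP '(^|[^\\w])c#($|[^\\w])';
SELECT 'java' REGEXP '(^|[^\\w])java($|[^\\w])';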
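
Applied to the table from the question, the explicit character-class version might look like the sketch below. It assumes the candidate table and CV column from the question. The second statement is an alternative for ICU-based servers only, and relies on lookarounds (which ICU supports) so the neighbouring characters are checked without being consumed. Backslashes are doubled on the assumption that the default string escaping is in effect.

```sql
-- Works with either regex engine: no engine-specific boundary markers.
SELECT COUNT(*) AS n
FROM candidate c
WHERE c.CV REGEXP '(^|[^a-zA-Z0-9_])c#($|[^a-zA-Z0-9_])';

-- ICU only (MySQL 8.0.4+): zero-width lookbehind and lookahead.
SELECT COUNT(*) AS n
FROM candidate c
WHERE c.CV REGEXP '(?<!\\w)c#(?!\\w)';
```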

One option could be substitution:

SELECT COUNT(*) n 
FROM (SELECT REPLACE(cv, '#','sharp') AS cv FROM candidate) c 
WHERE (c.CV REGEXP '[[:<:]]csharp[[:>:]]');
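
The same substitution can also be written without the derived table, by calling REPLACE directly in the WHERE clause. This is only a sketch against the question's schema, and it keeps the Spencer-style word markers, so it assumes a server older than 8.0.4. A CV that already contains the literal word "csharp" would be counted as well.

```sql
-- Rewrite '#' to 'sharp' on the fly, then match 'csharp' as a whole word.
SELECT COUNT(*) AS n
FROM candidate c
WHERE REPLACE(c.CV, '#', 'sharp') REGEXP '[[:<:]]csharp[[:>:]]';
```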

I think you can achieve more or less what you want using this:

SELECT COUNT(*) n FROM candidate c WHERE (c.CV REGEXP '[^[:alpha:]]java[^[:alpha:]]');

which also works for the C# case, like this:

SELECT COUNT(*) n FROM candidate c WHERE (c.CV REGEXP '[^[:alpha:]]c#[^[:alpha:]]');

Note that if you just replace c# with c++ you will run into problems, because this regex is invalid:

SELECT COUNT(*) n FROM candidate c WHERE (c.CV REGEXP '[^[:alpha:]]c++[^[:alpha:]]');

whereas

SELECT COUNT(*) n FROM candidate c WHERE (c.CV REGEXP '[^[:alpha:]]c\\+\\+[^[:alpha:]]');

works for me (using the mysql CLI).
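
If doubling backslashes is awkward, a bracket expression is another way to make + literal, since + has no special meaning inside [...]. A small sketch, first on literals and then on the question's table:

```sql
SELECT 'c++' REGEXP 'c[+][+]';  -- 1
SELECT 'c+'  REGEXP 'c[+][+]';  -- 0 (only one plus sign)

-- Same query as above, with the plus signs escaped by brackets.
SELECT COUNT(*) AS n
FROM candidate c
WHERE c.CV REGEXP '[^[:alpha:]]c[+][+][^[:alpha:]]';
```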

If you are fussy about these words appearing at the very start or end of the text, you can use something like this:

SELECT COUNT(*) n FROM candidate c WHERE (c.CV REGEXP '[^[:alpha:]]c#[^[:alpha:]]|^c#|c#$');

This is pretty close to the word-boundary requirement.
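
One gap remains: the ^c# and c#$ branches each check only one side of the match. A possible tightening, shown here as a sketch on made-up literals, is to group the anchors with the character classes so both sides are always checked:

```sql
-- Original alternation: the c#$ branch ignores what precedes 'c#'.
SELECT 'abc#' REGEXP '[^[:alpha:]]c#[^[:alpha:]]|^c#|c#$';         -- 1 (false positive)

-- Grouped form: start/end anchors are alternatives to a non-letter neighbour.
SELECT 'abc#'      REGEXP '(^|[^[:alpha:]])c#([^[:alpha:]]|$)';    -- 0
SELECT 'c#'        REGEXP '(^|[^[:alpha:]])c#([^[:alpha:]]|$)';    -- 1
SELECT 'java, c#.' REGEXP '(^|[^[:alpha:]])c#([^[:alpha:]]|$)';    -- 1
```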

In some regex flavors # is special (for example, as a pattern delimiter), so you could try escaping it with a backslash:

'[[:<:]]c\#[[:>:]]'

I don't see why you can't use something like this:

[[:<:]]c#([^#a-zA-Z0-9_]|$)

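[[:>:]] is an end-of-word boundary: it requires a word character behind it and a non-word character ahead. Since # is not a word character, [[:>:]] can never match right after it. You still need a non-word character ahead, though (and, I assume, not another #), which is what the trailing group checks explicitly.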