I use REGEXP in my MySQL search queries, and it works fine unless the query contains a # character.
The regex matches on word boundaries because the field being searched holds entire resumes / curricula vitae stored in the SQL database.
For instance this works as expected and returns the correct number of results:
SELECT COUNT(*) n FROM candidate c WHERE (c.CV REGEXP '[[:<:]]java[[:>:]]');
However, this one doesn't: it returns 0 results when it should return a few hundred:
SELECT COUNT(*) n FROM candidate c WHERE (c.CV REGEXP '[[:<:]]c#[[:>:]]');
I understand now that this is because I am matching based on word boundaries and # cannot be the end of the word. Interestingly, "C++" works fine though.
Is there a way of modifying this REGEXP so it also works with a string like "c#"?
You might be able to use something like this:
SELECT 'c#' REGEXP '(^|[^a-zA-Z0-9_])c#($|[^a-zA-Z0-9_])'
SELECT 'java' REGEXP '(^|[^a-zA-Z0-9_])java($|[^a-zA-Z0-9_])'
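A few literal checks may help show how these explicit boundaries behave, including cases that should not match. This is only a sketch: the sample strings are made up, and the last query assumes the `candidate` table and `CV` column from the question.

```sql
-- REGEXP returns 1 on a match and 0 otherwise.

-- 'c#' preceded by a space and followed by a comma: expected 1
SELECT 'skills: c#, java' REGEXP '(^|[^a-zA-Z0-9_])c#($|[^a-zA-Z0-9_])' AS hit;

-- 'c#' glued to a preceding letter: expected 0
SELECT 'abc#' REGEXP '(^|[^a-zA-Z0-9_])c#($|[^a-zA-Z0-9_])' AS hit;

-- 'c#' glued to a following letter: expected 0
SELECT 'c#x' REGEXP '(^|[^a-zA-Z0-9_])c#($|[^a-zA-Z0-9_])' AS hit;

-- The same pattern applied to the original count query
SELECT COUNT(*) AS n
FROM candidate c
WHERE c.CV REGEXP '(^|[^a-zA-Z0-9_])c#($|[^a-zA-Z0-9_])';
```

The pattern spells out "start of string or a non-word character" and "end of string or a non-word character" by hand, so it does not depend on # being treated as a word character.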
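In newer MySQL versions (8.0.4+), which implement regular expressions with ICU instead of Henry Spencer's library, you can use \w, which looks a bit cleaner. Note that the backslash has to be doubled inside a MySQL string literal, otherwise the pattern MySQL actually sees is [^w]:
SELECT 'c#' REGEXP '(^|[^\\w])c#($|[^\\w])'
SELECT 'java' REGEXP '(^|[^\\w])java($|[^\\w])'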
One option could be substitution:
SELECT COUNT(*) n
FROM (SELECT REPLACE(cv, '#','sharp') AS cv FROM candidate) c
WHERE (c.CV REGEXP '[[:<:]]csharp[[:>:]]');
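The same substitution idea can also be written without the derived table, and extended to other symbols. The following is a rough sketch, not a tuned query: the replacement tokens `sharp` and `plus` are arbitrary choices, and the `[[:<:]]` / `[[:>:]]` markers assume a MySQL version before 8.0.4 (on ICU-based versions, `\\b` is the word-boundary marker instead).

```sql
-- Replace '#' inline, then match the resulting plain word.
SELECT COUNT(*) AS n
FROM candidate c
WHERE REPLACE(c.CV, '#', 'sharp') REGEXP '[[:<:]]csharp[[:>:]]';

-- The same trick for "c++": turn each '+' into 'plus' first.
SELECT COUNT(*) AS n
FROM candidate c
WHERE REPLACE(c.CV, '+', 'plus') REGEXP '[[:<:]]cplusplus[[:>:]]';
```

One side effect to keep in mind is that a CV containing the literal text "csharp" or "cplusplus" will match as well, which may or may not be what you want.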
I think you can achieve more or less what you want using this:
SELECT COUNT(*) n FROM candidate c WHERE (c.CV REGEXP '[^[:alpha:]]java[^[:alpha:]]');
which can work for the C# case, like this
SELECT COUNT(*) n FROM candidate c WHERE (c.CV REGEXP '[^[:alpha:]]c#[^[:alpha:]]');
Note that if you just replace c# with c++ you will run into problems, because this regex is invalid:
SELECT COUNT(*) n FROM candidate c WHERE (c.CV REGEXP '[^[:alpha:]]c++[^[:alpha:]]');
whereas
SELECT COUNT(*) n FROM candidate c WHERE (c.CV REGEXP '[^[:alpha:]]c\\+\\+[^[:alpha:]]');
works for me (using the mysql CLI)
If you are fussy about these words appearing as the start/end of the text, you can use something like this
SELECT COUNT(*) n FROM candidate c WHERE (c.CV REGEXP '[^[:alpha:]]c#[^[:alpha:]]|^c#|c#$');
This is pretty close to the word-boundary requirement.
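One caveat with the ungrouped alternation: `^c#` and `c#$` are separate alternatives, so each of them checks only one side of the keyword. A possible tightening, shown here as a sketch with made-up sample strings, is to group the anchors with the character classes so that both sides are always checked.

```sql
-- Ungrouped form: the '^c#' alternative matches even though a letter
-- follows the keyword, so this is expected to return 1.
SELECT 'c#abc' REGEXP '[^[:alpha:]]c#[^[:alpha:]]|^c#|c#$' AS hit;

-- Grouped form: start-or-non-letter before, non-letter-or-end after.
-- The trailing letter now blocks the match: expected 0.
SELECT 'c#abc' REGEXP '(^|[^[:alpha:]])c#([^[:alpha:]]|$)' AS hit;

-- A keyword at the very start followed by a space still matches: expected 1.
SELECT 'c# developer' REGEXP '(^|[^[:alpha:]])c#([^[:alpha:]]|$)' AS hit;
```

Because `[^[:alpha:]]` excludes only letters, digits and underscores still count as boundaries here, so something like "c#2" would match.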
# can be used as a regex delimiter, so you need to escape it with a backslash:
'[[:<:]]c\#[[:>:]]'
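I don't see why you can't use something like this:
[[:<:]]c#([^#a-zA-Z0-9_]|$)
[[:>:]] is an end-of-word boundary, meaning a word character behind it AND a non-word character ahead of it.
But # is not a word character, so [[:>:]] can never match right after it. You still need to require a non-word character (or the end of the string) ahead, and I assume you don't want another # ahead either.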