PHP regular expression to find specific Arabic keywords

I need to find a reliable method to search for Arabic words using PHP. The text which I'll be searching may be in English or Arabic, so English words mustn't break the system.

I've been reading the PHP manual and some other material and think that I have a correct solution, but would be grateful for some opinions from some regex mavens.

One major complication to this task is that I don't speak or read a word of Arabic, or know how it works.

One thing that definitely doesn't work is the \b word-boundary marker. For some reason it is unreliable on Arabic text (it works for some words but not for others).

My regex is:

/\X(?<!\p{Arabic})(my_arabic_keyword)(?!\p{Arabic})/ui

and my reasoning for this is:

The \X escape is there so that Unicode characters which can be written either as two separate code points (a character plus an accent) or as a single code point are all taken into account.

The (?<!\p{Arabic}) and (?!\p{Arabic}) parts are to ensure that anything preceding or following the word is a unicode character in the Arabic range. I'm worried that I'm not doing this right. For one thing, it seems to be matching spaces on either side. Which is good because I need to isolate words, but this makes me think that I haven't really understood the function of the \p{Arabic}. Does it have to match one character of Arabic either side of my keyword with the regex above?

Also, someone has suggested \p{L}, but as far as I can see this means any letter at all, so I don't see the point in that. I really just want a substitute for the \b boundary markers, so I need to match whitespace and the beginning and end of the string.

The u modifier is, I believe, necessary with PHP to say that the pattern and text are Unicode (UTF-8).

The i modifier is to make the matching case-insensitive. I have no idea whether Arabic has capital letters or, if it does, whether the case-insensitive modifier would work in the same way.

So basically I want to find specific Arabic keywords with definite word boundaries without resorting to the \b boundary markers (because they don't work). The regex mustn't break if it is given English text, but should just return false. Do you think I have achieved this with my regex?

Many thanks

I'll try to answer the lookbehind and lookahead part.

(?<!a)SomeWord is a negative lookbehind, i.e. it will match if SomeWord is not preceded by an "a".

SomeWord(?!a) is a negative lookahead, i.e. it will match if SomeWord is not followed by an "a".
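To make the two lookarounds concrete, here is a small sketch with plain ASCII. The word "Word" and the test strings are made up for illustration; only preg_match() itself is standard PHP.

```php
<?php
// (?<!a)Word : "Word" must not be directly preceded by "a".
$notAfterA = '/(?<!a)Word/';
// Word(?!a) : "Word" must not be directly followed by "a".
$notBeforeA = '/Word(?!a)/';

$r1 = preg_match($notAfterA, 'xWord');   // 1: preceded by "x"
$r2 = preg_match($notAfterA, 'aWord');   // 0: preceded by "a"
$r3 = preg_match($notAfterA, 'Word');    // 1: nothing before it, so no "a" before it
$r4 = preg_match($notBeforeA, 'Worda');  // 0: followed by "a"
$r5 = preg_match($notBeforeA, 'Word');   // 1: nothing after it, so no "a" after it
```

Note that both lookarounds are zero-width: they check the neighbouring character without making it part of the match, and they also succeed at the very start or end of the string.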

\p{Arabic} matches a single code point that belongs to the Arabic script (I have never used this myself). See http://www.regular-expressions.info/unicode.html
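A quick way to see what \p{Arabic} accepts is to test single characters against it. This is only a sketch: isArabicChar() is a hypothetical helper name, and the sample characters are chosen for illustration. Whether Arabic punctuation counts is deliberately not tested here, since that is worth checking on your own PHP version.

```php
<?php
// Hypothetical helper: true if $char is exactly one character
// belonging to the Arabic script. The u modifier makes PCRE treat
// the pattern and the subject as UTF-8.
function isArabicChar(string $char): bool
{
    return preg_match('/^\p{Arabic}$/u', $char) === 1;
}

$arabicLetter = isArabicChar('ب'); // true: an Arabic letter
$latinLetter  = isArabicChar('b'); // false: a Latin letter
$space        = isArabicChar(' '); // false: a space is not in the Arabic script
```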

So (?<!\p{Arabic})SomeArabicWord(?!\p{Arabic}) should match "SomeArabicWord" when it is not preceded or followed by an Arabic letter. That would make sense for finding the word boundaries, but I don't know whether punctuation marks are included in \p{Arabic} or not.
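As a rough sketch of how that could be used as a \b replacement: containsArabicKeyword() is a hypothetical helper, and the keyword and sample sentences are my own assumptions, not from the question. The leading \X from the question is left out of the helper, because \X consumes one character before the keyword, which is probably why a space showed up in the match and why the original pattern cannot match at the start of the string. The last two lines illustrate that.

```php
<?php
// Hypothetical helper, not from the original post: true if $keyword occurs
// in $text with no Arabic-script character directly before or after it.
function containsArabicKeyword(string $text, string $keyword): bool
{
    // preg_quote() escapes any regex metacharacters in the keyword.
    $pattern = '/(?<!\p{Arabic})' . preg_quote($keyword, '/') . '(?!\p{Arabic})/u';
    return preg_match($pattern, $text) === 1;
}

$keyword = 'كتاب'; // sample keyword, assumed for illustration

$spaced   = containsArabicKeyword('هذا كتاب جديد', $keyword); // true: spaces on both sides
$alone    = containsArabicKeyword('كتاب', $keyword);           // true: start and end of string
$prefixed = containsArabicKeyword('الكتاب', $keyword);         // false: Arabic letters directly before
$english  = containsArabicKeyword('This is plain English text', $keyword); // false, no error

// The question's variant with a leading \X, for comparison.
$withX   = '/\X(?<!\p{Arabic})' . preg_quote($keyword, '/') . '(?!\p{Arabic})/u';
$xAlone  = preg_match($withX, 'كتاب');               // 0: \X needs a character before the keyword
$xSpaced = preg_match($withX, 'هذا كتاب جديد', $m); // 1: $m[0] includes the leading space
```

English-only input simply produces false here, because neither lookaround can fail in a way that raises an error.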

If you instead want the keyword to be preceded and followed by an Arabic character, as your description says, then use the positive versions: (?<=\p{Arabic})SomeArabicWord(?=\p{Arabic})
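For contrast, a small sketch of what the positive versions do. The keyword and sample strings are assumptions chosen for illustration. With positive lookarounds the keyword only matches when it sits inside a longer run of Arabic characters, which is the opposite of a word-boundary check.

```php
<?php
$keyword = 'كتاب'; // sample keyword, assumed for illustration

// Positive lookbehind and lookahead: an Arabic-script character is
// required directly before AND directly after the keyword.
$embedded = '/(?<=\p{Arabic})' . preg_quote($keyword, '/') . '(?=\p{Arabic})/u';

$standalone = preg_match($embedded, 'هذا كتاب جديد'); // 0: spaces on both sides
$inside     = preg_match($embedded, 'الكتابة');       // 1: Arabic letters on both sides
```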