在mysql中搜索utf8 enocded字符串,显示相同但utf代码不同

I am having a problem when searching the database for utf8 enocded strings in MySQL. I have a kind of a social website with users and they are allowed to add descriptions for their profile and because in my country we use cyrillic alphabet the obvious thing is to use UTF8. I have a search field that searches for the descriptions of the profiles and it is something like this:

SELECT usr.* FROM user AS usr WHERE usr.city = '{$city}' AND usr.desc LIKE '%{$srch}%'

I am using this in PHP by the way and in most of the cases it works. The thing is that some search results can't be searched and I found out that the problem is that some of the users for some reason have the same representations of some letters (so the letter displays exactly the same) but the encoding behind it is not the same. For example the text:

'Оптички стакла' = ÐпÑиÑки ÑÑакла

when encoded and then written in the most common way while using the keyboard language support the most OSes have. But this string of some user:

'Oптички ​​​стaклa' = OпÑиÑки âÑÑaклa

outputs a different code when enocded with UTF8. So because of this the search doesn't work in all the cases and I don't know how to solve it. I think that my database is set properly I tried many combinations and now I am out of ideas. Any help would be appreciated.

Thanks in advance.

I too found out that the case is like @duskwuff said, the problem was that not only one user input this kind of data, but at least it was rare. I managed to find a solution myself. Because in every case this happened on the letters 'A', 'a', 'O', 'o' I just check every letter in the word and if the word is mostly ASCII but an UTF8 is found just convert it like this:

function convert_ascii_to_utf($str)
{
        $length = strlen($str);
        $ascii = false;
        $utf8 = false;
        $mixed_encode = false;

        //the new string
        $new_str = '';

        //check for mixed encoding in the same string
        for($i = 0; $i < $length; $i++)
        {
            if(mb_detect_encoding($str[$i]) == 'ASCII')
            {
                $ascii = true;
            }
            if(mb_detect_encoding($str[$i]) == 'UTF-8')
            {
                $utf8 = true;
            }

            if($ascii == true && $utf8 == true)
            {
                $mixed_encode = true;
                break;
            }
        }

        if($mixed_encode)
        {
            for($i = 0; $i < $length; $i++)
            {
                if($str[$i] == 'a') { $new_str .= 'а'; }
                else if($str[$i] == 'A') { $new_str .= 'А'; }
                else if($str[$i] == 'o') { $new_str .= 'о'; }
                else if($str[$i] == 'O') { $new_str .= 'О'; }
                else { $new_str .= $str[$i]; }
            }

            return $new_str;
        }
        else
        {
            return $str;
        }
    }

That's "Mojibake". It usually comes from

  • The bytes you have in the client are correctly encoded in utf8 (good).
  • You connected with SET NAMES latin1 (or set_charset('latin1') or ...), probably by default. (It should have been utf8.)
  • The column in the tables may or may not have been CHARACTER SET utf8, but it should have been that.

The second string is bizarre. It has a Latin "O" instead of the Cyrillic "О", two Latin "a"s instead of Cyrillic "а"s, and contains three zero-width spaces before the second word. Here's a dump of what you entered:

U+4F   'O'  LATIN CAPITAL LETTER O
U+43F  'п'  CYRILLIC SMALL LETTER PE
U+442  'т'  CYRILLIC SMALL LETTER TE
U+438  'и'  CYRILLIC SMALL LETTER I
U+447  'ч'  CYRILLIC SMALL LETTER CHE
U+43A  'к'  CYRILLIC SMALL LETTER KA
U+438  'и'  CYRILLIC SMALL LETTER I
U+20   ' '  SPACE
U+200B      ZERO WIDTH SPACE
U+200B      ZERO WIDTH SPACE
U+200B      ZERO WIDTH SPACE
U+441  'с'  CYRILLIC SMALL LETTER ES
U+442  'т'  CYRILLIC SMALL LETTER TE
U+61   'a'  LATIN SMALL LETTER A
U+43A  'к'  CYRILLIC SMALL LETTER KA
U+43B  'л'  CYRILLIC SMALL LETTER EL
U+61   'a'  LATIN SMALL LETTER A

I'm… honestly not sure how a user would have ended up with that. It's certainly not something that would be entered on purpose. Unless this is a common occurrence, I'd ignore it.