Remove all ï»¿ï»¿-type characters

I have constant problems with data where odd characters like ï»¿ï»¿ show up in our database, causing everything to break at some point down the line. I need to get a system in place that only allows specific characters through and ignores all of these crazy things that can be pasted from Microsoft Office. Is there something like this built in, or should I start from scratch?

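Well, you can remove all such characters via e.g. $text = preg_replace('@[^\d\w\s,.;:]@', '', $text); where the negated class [^\d\w\s,.;:] matches every character that is not in the set you want to keep (\w covers letters, digits, and the underscore, so \d is redundant; \s covers whitespace). Amend the set with other characters you do want to keep. Note that without the u modifier the pattern works byte by byte, so in the default locale it also strips every non-ASCII letter from UTF-8 text.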
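
If you do go the whitelist route on UTF-8 data, a Unicode-aware variant is probably safer. This is only a sketch: keep_allowed_chars is a made-up helper name and the allowed set is an assumption you would adjust. It relies on the u modifier and the \p{L} / \p{N} property classes of PHP's PCRE functions.

```php
<?php
// Hypothetical helper (not part of the original answer): keep only whitelisted characters.
function keep_allowed_chars(string $text): string
{
    // [^...] matches anything NOT in the set, so those characters are removed.
    // \p{L} = any letter, \p{N} = any number, \s = whitespace.
    // The u modifier makes PCRE treat pattern and subject as UTF-8.
    $clean = preg_replace('/[^\p{L}\p{N}\s,.;:]/u', '', $text);

    // With the u modifier, preg_replace() returns null if $text is not valid UTF-8.
    // Returning an empty string here is an assumption; you may prefer to reject the input.
    return $clean ?? '';
}

// The "!" and the invisible U+FEFF character are dropped, the rest is kept.
echo keep_allowed_chars("Hello, world! \u{FEFF}ok"), "\n"; // Hello, world ok
```

Accented letters such as ë survive this filter, which the byte-wise pattern would not allow.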

However, that is the wrong approach. You should instead ensure that your entire application is using and processing UTF-8 from the ground up, so that you can store and handle those characters correctly. Making an ASCII or ISO Latin site in this day and age is just weird and essentially causes data loss, because it cuts out characters that people actually use...
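
As a rough illustration of handling UTF-8 at the input boundary, here is a minimal sketch. It assumes the mbstring extension is loaded and that anything which is not valid UTF-8 was pasted as Windows-1252; to_utf8 is a hypothetical helper name, and the fallback encoding is a guess you would need to confirm for your own data.

```php
<?php
// Hypothetical helper: return $input as valid UTF-8.
// Assumption: input that is not valid UTF-8 is Windows-1252 (typical for Office pastes).
function to_utf8(string $input, string $fallback = 'Windows-1252'): string
{
    // Already valid UTF-8: leave it alone. Note that a short Windows-1252 string
    // can happen to be valid UTF-8 too, so this check is a heuristic.
    if (mb_check_encoding($input, 'UTF-8')) {
        return $input;
    }

    // Otherwise reinterpret the bytes as the fallback encoding and convert.
    return mb_convert_encoding($input, 'UTF-8', $fallback);
}

// "caf\xE9" is "café" in Windows-1252; the result is the UTF-8 form of the same text.
echo bin2hex(to_utf8("caf\xE9")), "\n"; // 636166c3a9
```

This only covers incoming strings. The same charset also has to be declared on the database connection, in the table definitions, and in the page's Content-Type header, none of which this sketch touches.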

OK, I am no expert in character encoding, but I was told about this specific problem and why you're getting it. As stated in my comment above, you have to verify that all your character sets match.

However, here is why you're getting that specific set of characters:

"That particular sequence of characters is the 3-byte UTF-8 code for the [?] [unknown] character you see in Firefox. you get that when you display a 1-byte Windows-1252 character as UTF-8 in a form, and then submit it back to the database. The browser sends the 3-byte UTF-8 character in its place"

Understanding charsets is a challenge and I highly recommend you read more on this subject. Here is a good start: Character Sets / Character Encoding Issues
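
To check which bytes are actually behind a garbled sequence like the one in the question, a small diagnostic can help. This is a sketch under assumptions: it needs the mbstring extension, it assumes the garbage was produced by reading UTF-8 bytes as Windows-1252, and mojibake_bytes is a hypothetical helper name.

```php
<?php
// Hypothetical diagnostic: given the garbled text as you see it (stored as UTF-8),
// map each visible character back to its single Windows-1252 byte and show the hex.
function mojibake_bytes(string $seen): string
{
    return bin2hex(mb_convert_encoding($seen, 'Windows-1252', 'UTF-8'));
}

// The three characters ï » ¿ written as code points U+00EF, U+00BB, U+00BF.
echo mojibake_bytes("\u{EF}\u{BB}\u{BF}"), "\n"; // efbbbf
// EF BB BF is the UTF-8 encoding of U+FEFF (the byte order mark).
```

Once you know the underlying bytes, you can look up the character they encode and work out at which step the text was decoded with the wrong charset.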