I am having a problem with a web form that is submitted to a PHP script, which then inserts the data into a MySQL database.
The problem lies with copy and paste from Microsoft Word or similar word-processing software. It mostly affects bullets, but sometimes affects double and single quotes as well. I am not able to sniff the character encoding the person is submitting.
I have the following code (functions) at the top of the file that processes the data:
function init_byte_map(){
global $byte_map;
for($x=128;$x<256;++$x){
$byte_map[chr($x)]=utf8_encode(chr($x));
}
$cp1252_map=array(
"\x80"=>"\xE2\x82\xAC", // EURO SIGN
"\x82" => "\xE2\x80\x9A", // SINGLE LOW-9 QUOTATION MARK
"\x83" => "\xC6\x92", // LATIN SMALL LETTER F WITH HOOK
"\x84" => "\xE2\x80\x9E", // DOUBLE LOW-9 QUOTATION MARK
"\x85" => "\xE2\x80\xA6", // HORIZONTAL ELLIPSIS
"\x86" => "\xE2\x80\xA0", // DAGGER
"\x87" => "\xE2\x80\xA1", // DOUBLE DAGGER
"\x88" => "\xCB\x86", // MODIFIER LETTER CIRCUMFLEX ACCENT
"\x89" => "\xE2\x80\xB0", // PER MILLE SIGN
"\x8A" => "\xC5\xA0", // LATIN CAPITAL LETTER S WITH CARON
"\x8B" => "\xE2\x80\xB9", // SINGLE LEFT-POINTING ANGLE QUOTATION MARK
"\x8C" => "\xC5\x92", // LATIN CAPITAL LIGATURE OE
"\x8E" => "\xC5\xBD", // LATIN CAPITAL LETTER Z WITH CARON
"\x91" => "\xE2\x80\x98", // LEFT SINGLE QUOTATION MARK
"\x92" => "\xE2\x80\x99", // RIGHT SINGLE QUOTATION MARK
"\x93" => "\xE2\x80\x9C", // LEFT DOUBLE QUOTATION MARK
"\x94" => "\xE2\x80\x9D", // RIGHT DOUBLE QUOTATION MARK
"\x95" => "\xE2\x80\xA2", // BULLET
"\x96" => "\xE2\x80\x93", // EN DASH
"\x97" => "\xE2\x80\x94", // EM DASH
"\x98" => "\xCB\x9C", // SMALL TILDE
"\x99" => "\xE2\x84\xA2", // TRADE MARK SIGN
"\x9A" => "\xC5\xA1", // LATIN SMALL LETTER S WITH CARON
"\x9B" => "\xE2\x80\xBA", // SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
"\x9C" => "\xC5\x93", // LATIN SMALL LIGATURE OE
"\x9E" => "\xC5\xBE", // LATIN SMALL LETTER Z WITH CARON
"\x9F" => "\xC5\xB8" // LATIN CAPITAL LETTER Y WITH DIAERESIS
);
foreach($cp1252_map as $k=>$v){
$byte_map[$k]=$v;
}
}
function fix_latin($instr){
if(mb_check_encoding($instr,'UTF-8'))return $instr; // no need for the rest if it's all valid UTF-8 already
global $nibble_good_chars,$byte_map;
$outstr='';
$char='';
$rest='';
while((strlen($instr))>0){
if(1==preg_match($nibble_good_chars,$instr,$match)){ // match against the function parameter $instr ($input was undefined)
$char=$match[1];
$rest=$match[2];
$outstr.=$char;
}elseif(1==preg_match('@^(.)(.*)$@s',$instr,$match)){
$char=$match[1];
$rest=$match[2];
$outstr.=$byte_map[$char];
}
$instr=$rest;
}
return $outstr;
}
$byte_map=array();
init_byte_map();
$ascii_char='[\x00-\x7F]';
$cont_byte='[\x80-\xBF]';
$utf8_2='[\xC0-\xDF]'.$cont_byte;
$utf8_3='[\xE0-\xEF]'.$cont_byte.'{2}';
$utf8_4='[\xF0-\xF7]'.$cont_byte.'{3}';
$utf8_5='[\xF8-\xFB]'.$cont_byte.'{4}';
$nibble_good_chars = "@^($ascii_char+|$utf8_2|$utf8_3|$utf8_4|$utf8_5)(.*)$@s";
I then receive each form field and run the fix_latin function.
foreach ($jobdata AS $field => $string)
{
$string = fix_latin($string);
$jobdata[$field] = addslashes(str_replace("\n", '<br />', htmlspecialchars($string)));
}
The data is entered in the database and also e-mailed to the system admin for approval. Today I received an admin e-mail that had the following for a bullet point:
Job Description: Responsibilities:
路 Assist multi-state companies
And when I view the database or edit within the script, the bullet is replaced with a square box, not the &bull; entity.
Forms should submit with the same character encoding as their host document. In theory you can override the character encoding by using <form accept-charset="UTF-8"> when declaring your form, but this doesn't work in Internet Explorer (surprise, surprise).
If you use the same character encoding for the page that contains the form as you want your data to be submitted in, you should get data using the correct character encoding.
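A minimal sketch of that idea, assuming the form page is generated by PHP and that the mbstring extension is available (the question already uses mb_check_encoding). The function names form_page_content_type and is_valid_utf8_field are hypothetical helpers made up for illustration, not part of the original script:

```php
<?php
// Hypothetical helper: the header line that declares the form page as UTF-8.
// A browser normally submits a form in the encoding of the page that hosts it.
function form_page_content_type(): string
{
    return 'Content-Type: text/html; charset=UTF-8';
}

// Hypothetical helper: true only if a submitted field is well-formed UTF-8.
function is_valid_utf8_field(string $value): bool
{
    return mb_check_encoding($value, 'UTF-8');
}

// Send the header before any output when running under a web server.
if (PHP_SAPI !== 'cli' && !headers_sent()) {
    header(form_page_content_type());
}
```

With the page served this way, a field that fails is_valid_utf8_field can be logged or rejected instead of being guessed at byte by byte.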
Additionally, if your script is both sending an e-mail and storing the data to a table, then you need to make sure both the e-mail and the table are using the same character encoding. You need to set the appropriate headers in your email to make sure the reader knows what character encoding you're using.
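For the e-mail side, something along these lines might work. It is only a sketch: build_utf8_mail_headers and encode_utf8_subject are hypothetical helper names, the message is assumed to be plain text, and the subject encoding shown does not split long subjects into multiple encoded words as the MIME rules require:

```php
<?php
// Hypothetical helper: headers that tell the mail reader the body is UTF-8.
function build_utf8_mail_headers(string $from): string
{
    return implode("\r\n", [
        'From: ' . $from,
        'MIME-Version: 1.0',
        'Content-Type: text/plain; charset=UTF-8',
        'Content-Transfer-Encoding: 8bit',
    ]);
}

// Hypothetical helper: wrap a UTF-8 subject as a base64 MIME encoded-word.
// Assumes a short subject; long ones would need to be split into several words.
function encode_utf8_subject(string $subject): string
{
    return '=?UTF-8?B?' . base64_encode($subject) . '?=';
}

// Usage (not executed here; $to, $subject, $body and $from are assumed to exist):
// mail($to, encode_utf8_subject($subject), $body, build_utf8_mail_headers($from));
```

Without the charset in the Content-Type header, the mail client has to guess the encoding, which is one plausible way a UTF-8 bullet ends up displayed as an unrelated character.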
I'd recommend using UTF-8 throughout: make sure both your database and your web pages are encoded as UTF-8, and that any e-mails your scripts send also set a header indicating that they're encoded as UTF-8. That should hopefully eliminate the need for cumbersome conversion functions like the one you've been using. I ran into similar problems myself in a project, and at first I tried your approach, but in the end it was simply too much to deal with, as there are thousands of potential inputs you need to catch and handle.
In the meantime, a simple workaround is to not paste directly from Word, but to paste from Word into a plain-text editor such as Notepad, then copy and paste from Notepad into the browser.
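On the database side, the connection itself also needs to be told to use UTF-8, not just the tables. A rough sketch, assuming PDO with the MySQL driver is available and that the server supports the utf8mb4 character set (MySQL 5.5.3 or later). The function name build_mysql_dsn, the database name and the table and column names are all made up for illustration:

```php
<?php
// Hypothetical helper: a PDO DSN that asks MySQL to use UTF-8 (utf8mb4)
// for the connection, so bytes are not reinterpreted on the way in or out.
function build_mysql_dsn(string $host, string $dbname): string
{
    return sprintf('mysql:host=%s;dbname=%s;charset=utf8mb4', $host, $dbname);
}

// Usage (not executed here; credentials, table and column are assumptions):
// $pdo  = new PDO(build_mysql_dsn('localhost', 'jobs'), $user, $pass);
// $stmt = $pdo->prepare('INSERT INTO jobs (description) VALUES (?)');
// $stmt->execute([$description]); // a prepared statement, so no addslashes()
```

If the script uses mysqli rather than PDO, calling $mysqli->set_charset('utf8mb4') after connecting serves the same purpose.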