I'm prepping a cURL response so that particular data can be inserted into a MySQL table.
I noticed some special characters in the saved data for certain URLs.
$curldata = curl_exec($curl);
$encoding = mb_detect_encoding($curldata);
That brought back ASCII encoding.
Okay, don't want that.
The tables in my database are InnoDB with a utf8mb4_unicode_ci collation.
Added this to my curl options:
curl_setopt($curl, CURLOPT_ENCODING, 1);
And an iconv call on save, based on the mb_detect_encoding / $encoding variable above:
$curldata = iconv($encoding, "UTF-8", $curldata);
// save to file to test output
file_put_contents('test.html', $curldata);
Not sure if this is the best way to go about this, but my test.html output no longer has any encoding problems with special characters, so... (perhaps) mission accomplished.
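A side note on the steps above, as a hedged sketch rather than a definitive fix: CURLOPT_ENCODING controls the Accept-Encoding header (gzip/deflate compression), not the character set, and mb_detect_encoding reports ASCII whenever every byte is 7-bit, in which case the data is already valid UTF-8 and iconv has nothing to do. The helper name to_utf8 and the candidate list below are assumptions for illustration, not something from the original post.

```php
<?php
// Hypothetical helper: the name and the default candidate list are assumptions.
function to_utf8(string $raw, array $candidates = ['ASCII', 'UTF-8', 'ISO-8859-1']): string
{
    // Strict mode rejects any candidate for which the bytes are invalid.
    $detected = mb_detect_encoding($raw, $candidates, true);
    if ($detected === false) {
        throw new RuntimeException('Could not detect the source encoding');
    }
    // ASCII is a subset of UTF-8, so neither case needs converting.
    if ($detected === 'ASCII' || $detected === 'UTF-8') {
        return $raw;
    }
    return mb_convert_encoding($raw, 'UTF-8', $detected);
}
```

If the server sends a charset in its Content-Type header, trusting that is usually more reliable than guessing from the bytes; detection is only a fallback.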
As I parse through the data, I then notice this character:
,
Not an ordinary comma... [Comparison: ,/,]
But it acts like one. Try doing a ctrl+f and try to find a comma. It treats them as the same, and both as a UTF-8 character - var_dump(mb_detect_encoding(','));
I look at my table and see a row inserted as such:
8,8
If I try to search for a , it does indeed bring back the instances where , is present.
Vice versa, if I search for , it brings back all instances where that or an ordinary comma occurs.
Basically for all intents and purposes it is a comma, yet obviously isn't.
This is of course workable, but rather annoying and feels riddled with inconsistency.
Can anyone explain why the two commas are the same, yet obviously different?
Is there a way to prevent these odd characters from entering my cURL response, or further along in my DOM processing and PDO insert?
edit:
If relevant,
// dom
$dom = new DOMDocument('1.0', 'utf-8');
libxml_use_internal_errors(true);
$dom->preserveWhiteSpace = FALSE;
$dom->loadHTML(mb_convert_encoding($curldata, 'HTML-ENTITIES', 'UTF-8'));
// pdo
$pdoquery = "INSERT INTO `table` (`Attr`) VALUES (?)";
$value = "8,8"; // contains the odd comma
$stmt = $pdo->prepare($pdoquery);
$stmt->execute([$value]);
edit 2:
Well, it appears to be a FULLWIDTH COMMA...
var_dump(utf8_to_unicode(','));
string '%uff0c' (length=6)
var_dump(utf8_to_unicode(','));
string '%2c' (length=3)
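Since utf8_to_unicode is not a PHP built-in (it is presumably a user-defined helper), the same check can be sketched with standard functions. This assumes the mbstring extension; the name describe_char is hypothetical.

```php
<?php
// Hypothetical helper (not from the original post): describe one UTF-8 character.
function describe_char(string $char): string
{
    $codePoint = mb_ord($char, 'UTF-8');      // Unicode code point as an integer
    $bytes     = strtoupper(bin2hex($char));  // raw UTF-8 bytes in hex
    return sprintf('U+%04X (bytes %s)', $codePoint, $bytes);
}

echo describe_char("\u{FF0C}"), PHP_EOL; // fullwidth comma: U+FF0C (bytes EFBC8C)
echo describe_char(','), PHP_EOL;        // ordinary comma:  U+002C (bytes 2C)
```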
Starting to make more sense... now to figure out how to prevent such characters from entering the curl response/DOM/database...
You might want the function mb_convert_kana, which can convert characters of different widths into a uniform width.
$s = 'This is a string with ,, (commas having different widths)';
echo 'original : ', $s, PHP_EOL;
echo 'converted: ', mb_convert_kana($s, 'a');
result:
original : This is a string with ,, (commas having different widths)
converted: This is a string with ,, (commas having different widths)
PHP documentation: mb_convert_kana
To get an idea of what "width" means here, see also http://unicode.org/reports/tr11-2/
By convention, 1/2 Em wide characters of East Asian legacy encodings are called "half-width" (or hankaku characters in Japanese), the others are called correspondingly "full-width" (or zenkaku) characters.
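To keep such characters out of the database, one option is to apply the conversion right before the insert. The following is a minimal sketch under that assumption; normalize_width is a hypothetical name, and adding the 's' flag (fullwidth space to ordinary space) is my own choice, not part of the example above.

```php
<?php
// Hypothetical helper: fold fullwidth ASCII variants to their ordinary forms
// before the value reaches the database.
function normalize_width(string $value): string
{
    // 'a' = fullwidth alphanumerics and punctuation to halfwidth,
    // 's' = fullwidth (ideographic) space to an ordinary space.
    return mb_convert_kana($value, 'as', 'UTF-8');
}

$value = normalize_width("8\u{FF0C}8"); // "8,8" with an ordinary comma
```

The result can then be passed to the prepared statement as usual, e.g. $stmt->execute([normalize_width($value)]). As far as I know, the 'a' mode leaves a few characters (quotes, backslash, tilde) untouched, so check the documentation if those matter.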
With a suitable COLLATION, the two commas are treated as equal:
mysql> SELECT ',' = ',' COLLATE utf8mb4_general_ci;
+----------------------------------------+
| ',' = ',' COLLATE utf8mb4_general_ci  |
+----------------------------------------+
|                                      0 |
+----------------------------------------+
1 row in set (0.00 sec)
mysql> SELECT ',' = ',' COLLATE utf8mb4_unicode_ci;
+----------------------------------------+
| ',' = ',' COLLATE utf8mb4_unicode_ci  |
+----------------------------------------+
|                                      1 |
+----------------------------------------+
1 row in set (0.00 sec)
mysql> SELECT ',' = ',' COLLATE utf8mb4_unicode_520_ci;
+--------------------------------------------+
| ',' = ',' COLLATE utf8mb4_unicode_520_ci  |
+--------------------------------------------+
|                                          1 |
+--------------------------------------------+
1 row in set (0.00 sec)
It would be better to talk in terms of HEX, not Unicode:
mysql> SELECT HEX(','), HEX(',');
+------------+----------+
| HEX(',')  | HEX(',') |
+------------+----------+
| EFBC8C     | 2C       |
+------------+----------+
1 row in set (0.00 sec)