I'm trying to import a CSV file which looks something like this:
"source "," destination "
férfi-/ruházat-Öltöny," férfi-/ruházat-blézer_zakó",
Note that this is just a sample of the CSV, not the whole CSV.
The way I'm reading the file is pretty straightforward:
$line = fgets($this->fileHandle);
$line = mb_convert_encoding($line, 'UTF-8', mb_detect_encoding($line));
Where $this->fileHandle is just a resource pointing to the file opened using fopen. So nothing too special there.
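For reference, the read loop described above can be sketched roughly like this. The function name readLines and the binary-mode fopen flag are illustrative assumptions, not part of my actual class. fgets hands back the raw bytes of each line and does no charset conversion of its own, so whatever encoding the file has on disk is what ends up in $line.

```php
<?php
// Hypothetical helper (not from my real code): read every line of a file.
function readLines($path)
{
    $handle = fopen($path, 'rb');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $path");
    }

    $lines = array();
    // fgets() returns false at end of file (or on error), which ends the loop.
    while (($line = fgets($handle)) !== false) {
        // Strip only the trailing line break; the other bytes stay untouched.
        $lines[] = rtrim($line, "\r\n");
    }

    fclose($handle);
    return $lines;
}
```

If the bytes coming out of this loop are already valid UTF-8, the garbling has to happen later, during string manipulation or output.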
I want to do some string manipulation on the strings inside the CSV. I can import it just fine.
When I read from the file, whether using fgets, fread or any other function I can think of, I end up with garbled text.
Something along the lines of this:
So far I've tried mb_internal_encoding("UTF-8"), then switching it to ISO-8859-2 and a few other encodings. Nothing worked.
I've also tried mb_convert_encoding($line, 'UTF-8', mb_detect_encoding($line)) where $line is the line read from the CSV. Again, nothing. Still garbled text.
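One thing worth ruling out with the attempt above: mb_detect_encoding is only a heuristic, so feeding its guess straight into mb_convert_encoding can convert text that was already fine. A more defensive sketch, assuming the file is either valid UTF-8 or in one known legacy encoding, might look like this. The helper name toUtf8 and the ISO-8859-2 fallback are assumptions for illustration only.

```php
<?php
// Hypothetical helper: normalise one line to UTF-8.
// Assumption: anything that is not valid UTF-8 is in $fallbackEncoding.
function toUtf8($line, $fallbackEncoding = 'ISO-8859-2')
{
    // Valid UTF-8 is returned untouched, so it is never converted twice.
    if (mb_check_encoding($line, 'UTF-8')) {
        return $line;
    }

    // Otherwise convert from the single encoding we decided to trust.
    return mb_convert_encoding($line, 'UTF-8', $fallbackEncoding);
}
```

This does not guess between several legacy encodings; it only avoids re-encoding lines that are already UTF-8.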
Next I assumed it might be something to do with my OS. I'm using a Mac (High Sierra v10.13.4) with a Docker instance running Ubuntu.
Running the locale command in the Mac terminal gives me:
LANG="C.UTF-8"
LC_COLLATE="C"
LC_CTYPE="C"
LC_MESSAGES="C"
LC_MONETARY="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_ALL=
As for the Docker instance:
Distributor ID: Ubuntu
Description: Ubuntu 14.04.5 LTS
Release: 14.04
Codename: trusty
# locale
LANG=C.UTF-8
LANGUAGE=
LC_CTYPE="C.UTF-8"
LC_NUMERIC="C.UTF-8"
LC_TIME="C.UTF-8"
LC_COLLATE="C.UTF-8"
LC_MONETARY="C.UTF-8"
LC_MESSAGES="C.UTF-8"
LC_PAPER="C.UTF-8"
LC_NAME="C.UTF-8"
LC_ADDRESS="C.UTF-8"
LC_TELEPHONE="C.UTF-8"
LC_MEASUREMENT="C.UTF-8"
LC_IDENTIFICATION="C.UTF-8"
LC_ALL=
So everything seems to be fine in that regard.
I've also tried an online PHP interpreter and that works fine. So clearly the issue is on my side.
To be honest I have no idea where the issue lies.
Any pointers in the right direction are greatly appreciated.
To answer my own question:
I had to call ini_set("default_charset", "UTF-8");. The default was an empty string.
I have no idea how it worked without it until now; I assume PHP has some sort of fallback encoding.
Either way, I hope this helps anybody else who gets stuck on this.
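For anyone who wants to check their own setup, here is a minimal sketch of the fix plus a quick way to inspect the relevant settings. The charsetReport helper is hypothetical and only there for debugging; the ini_set call is the part that mattered for me.

```php
<?php
// Set this once, early in the script, before any string handling or output.
ini_set('default_charset', 'UTF-8');

// Hypothetical debugging helper: show the settings involved in the problem.
function charsetReport()
{
    return array(
        // What PHP assumes when no more specific charset is given.
        'default_charset'   => ini_get('default_charset'),
        // What the mb_* functions use when no encoding argument is passed.
        'internal_encoding' => mb_internal_encoding(),
    );
}

print_r(charsetReport());
```

The same value can also be set permanently with default_charset = "UTF-8" in php.ini, which avoids repeating the call in every script.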