Hungarian/Bulgarian characters in a CSV file end up garbled in PHP

I'm trying to import a CSV file which looks something like this:

"source "," destination "

férfi-/ruházat-Öltöny," férfi-/ruházat-blézer_zakó",

Note that this is just a sample of the CSV, not the whole CSV.

The way I'm reading the file is pretty straightforward:

$line = fgets($this->fileHandle);
$line = mb_convert_encoding($line, 'UTF-8', mb_detect_encoding($line));

Where $this->fileHandle is just a resource pointing to the file opened using fopen. So nothing too special there.
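One thing worth noting about the conversion line above: mb_detect_encoding() called without a candidate list only considers mbstring's configured detect order, and it returns false when nothing matches. A minimal sketch of a more explicit approach follows. The helper name toUtf8() and the ISO-8859-2 fallback are assumptions for illustration, not something from my actual code, and the mbstring extension is assumed to be loaded.

```php
<?php
// Hypothetical helper: return $line as UTF-8.
// Assumption: any input that is not valid UTF-8 is ISO-8859-2
// (a common legacy encoding for Hungarian); adjust for the real source.
function toUtf8($line, $fallback = 'ISO-8859-2')
{
    // Valid UTF-8 is passed through untouched to avoid double encoding.
    if (mb_check_encoding($line, 'UTF-8')) {
        return $line;
    }
    // Otherwise convert from the assumed legacy encoding.
    return mb_convert_encoding($line, 'UTF-8', $fallback);
}
```

Usage would be $line = toUtf8(fgets($this->fileHandle));. The point of the design is that the source encoding is stated rather than guessed per line.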

I want to do some string manipulation on the strings inside the CSV. I can import it just fine.

When I read from the file, using fgets, fread or whatever other function I can think of, I end up with garbled text.

Something along the lines of this:

[screenshot of the garbled output]

So far I've tried setting mb_internal_encoding() to "UTF-8", to "ISO-8859-2", and to a few other encodings. Nothing worked.

I've also tried mb_convert_encoding($line, 'UTF-8', mb_detect_encoding($line)), where $line is the line read from the CSV. Again, nothing. Still garbled text.
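A way to narrow this down is to look at the raw bytes of a line before any conversion, which shows whether the data is already wrong after reading or only gets mangled on output. The sketch below is illustrative only: describeBytes() is a hypothetical helper, and it assumes the mbstring extension is loaded.

```php
<?php
// Hypothetical debugging helper: describe the raw bytes of a string.
function describeBytes($line)
{
    return [
        // Hex dump of the bytes exactly as they were read.
        'hex'        => bin2hex($line),
        // True if the bytes form a well-formed UTF-8 sequence.
        'valid_utf8' => mb_check_encoding($line, 'UTF-8'),
        // True if the string starts with a UTF-8 byte order mark.
        'has_bom'    => strncmp($line, "\xEF\xBB\xBF", 3) === 0,
    ];
}
```

For example, "é" shows up as c3a9 when the file is UTF-8 and as e9 when it is ISO-8859-2. If the bytes are already valid UTF-8, converting them again would only make things worse.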

Next I assumed it might be something in my OS. I'm using a Mac with a Docker instance running Ubuntu.

The Mac is running macOS High Sierra 10.13.4.

A locale command in the terminal gives me:

LANG="C.UTF-8"
LC_COLLATE="C"
LC_CTYPE="C"
LC_MESSAGES="C"
LC_MONETARY="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_ALL= 

As for the Docker instance:

Distributor ID: Ubuntu
Description:    Ubuntu 14.04.5 LTS
Release:    14.04
Codename:   trusty

# locale
LANG=C.UTF-8
LANGUAGE=
LC_CTYPE="C.UTF-8"
LC_NUMERIC="C.UTF-8"
LC_TIME="C.UTF-8"
LC_COLLATE="C.UTF-8"
LC_MONETARY="C.UTF-8"
LC_MESSAGES="C.UTF-8"
LC_PAPER="C.UTF-8"
LC_NAME="C.UTF-8"
LC_ADDRESS="C.UTF-8"
LC_TELEPHONE="C.UTF-8"
LC_MEASUREMENT="C.UTF-8"
LC_IDENTIFICATION="C.UTF-8"
LC_ALL=

So everything seems to be fine in that regard.
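The shell locale is not necessarily what PHP itself uses, so the same check can be done from inside PHP. A minimal sketch, assuming the mbstring extension is loaded; charsetSettings() is a hypothetical name used only for illustration.

```php
<?php
// Hypothetical diagnostic: collect the charset-related settings PHP sees.
function charsetSettings()
{
    return [
        // The ini value; may be an empty string if nothing is configured.
        'default_charset'      => ini_get('default_charset'),
        // The encoding mbstring functions use when none is passed.
        'mb_internal_encoding' => mb_internal_encoding(),
        // Passing '0' only queries the current locale, it does not change it.
        'LC_CTYPE'             => setlocale(LC_CTYPE, '0'),
    ];
}

print_r(charsetSettings());
```

Running this inside the container, through the same SAPI that serves the script, shows the values that actually apply there.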


I've also tried an online PHP interpreter and that works fine. So clearly the issue is on my side.

To be honest I have no idea where the issue lies.

Any pointers in the right direction would be greatly appreciated.

To answer my own question:

I had to call ini_set("default_charset", "UTF-8"). The default was an empty string.

I have no idea how it worked without it so far; I assume PHP has some sort of fallback encoding.
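For anyone who wants to see it in context, here is a minimal sketch of the fix. The explicit encoding argument to htmlspecialchars() and the helper escapeForHtml() are my own additions for illustration, on the assumption that the text ends up in HTML output; they are not required by the fix itself.

```php
<?php
// Sketch only: force UTF-8 at the start of the script.
ini_set('default_charset', 'UTF-8');
mb_internal_encoding('UTF-8');

// Hypothetical helper: escape text for HTML output.
function escapeForHtml($text)
{
    // Passing the encoding explicitly avoids relying on any ini default.
    return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
}

echo escapeForHtml("férfi-/ruházat-Öltöny"), "\n";
```

Setting the same value in php.ini would have the same effect without touching the code.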

Either way, I hope this helps anybody else who gets stuck on this.