I have a text with diacritic characters that are displayed incorrectly, like this: ¤ or ˇ or ˘. I don't know what charset the text was originally in. Is there an easy way to figure it out? It would be nice if there were an online charset detector, or maybe a charset conversion previewer. I am thinking of an application that would show me how specific diacritic characters look when malformed in each available encoding, so I could track down the one that matches the characters I have in the text.
Any ideas?
In Windows PowerShell:
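# Read the raw bytes so that no encoding is assumed up front
$bytes = [IO.File]::ReadAllBytes('some file.txt')
[Text.Encoding]::GetEncodings() |
    ForEach-Object {
        # Attach the text decoded with this encoding to the EncodingInfo object
        $_ | Add-Member -PassThru NoteProperty Text ($_.GetEncoding().GetString($bytes))
    } | Format-List Name,CodePage,Text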
Adjust the path to the file and browse the results until you see something that looks correct ;-)
This simply iterates through all encodings that are known to .NET and converts the text into a string using the respective encoding.
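For readers without PowerShell at hand, the same idea can be sketched in Python. This is only an illustration of the approach above, not a port of the .NET code: the function name `preview_decodings` and the short `CANDIDATES` list are assumptions made up for this example, and the sample sentence is just test data. Extend the list with whatever code pages are plausible for your language.

```python
# Hypothetical shortlist of encodings to try; adjust it to the language of the text.
CANDIDATES = ["utf-8", "cp1250", "cp1252", "iso8859-2", "iso8859-15", "cp852", "mac_latin2"]


def preview_decodings(data, encodings=CANDIDATES):
    """Return {encoding name: decoded text} for every encoding that accepts the bytes."""
    results = {}
    for name in encodings:
        try:
            results[name] = data.decode(name)
        except (UnicodeDecodeError, LookupError):
            # Skip encodings that reject these bytes or are not installed.
            continue
    return results


if __name__ == "__main__":
    # Sample data: Czech text stored as ISO 8859-2, standing in for the unknown file.
    sample = "Příliš žluťoučký kůň".encode("iso8859-2")
    for name, text in preview_decodings(sample).items():
        print(f"{name:12} {text}")
```

For a real file, replace the sample with `open(path, "rb").read()` and look for the row that reads correctly. Encodings that cannot decode the bytes at all (UTF-8 is a common case) simply drop out of the list, which already narrows the search.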
In C#:
// Needs: using System.IO; using System.Text;
foreach (EncodingInfo encodingInfo in Encoding.GetEncodings())
    using (FileStream fileStream = File.OpenRead(filePath))
    // The final 'false' turns off byte order mark detection
    using (StreamReader reader = new StreamReader(fileStream, encodingInfo.GetEncoding(), false))
        textBox1.Text += encodingInfo.DisplayName + ":\t " + reader.ReadToEnd() + "\n";
where textBox1 is a large multiline TextBox (or any other suitable control).
Some caveats I learnt:
File.ReadAllText attempts to automatically detect the encoding of a file based on the presence of byte order marks, even when another encoding is explicitly specified. The only way to suppress this is through the StreamReader constructor overload that takes a detectEncodingFromByteOrderMarks argument, passing false for it.
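To check whether a byte order mark is present at all, and so whether automatic detection could be overriding the encoding you asked for, you can inspect the first bytes of the file yourself. The following is a rough Python sketch; `split_bom` and the `BOMS` table are hypothetical names for this example, and only the BOM constants from the standard `codecs` module are relied on.

```python
import codecs

# Known byte order marks. The UTF-32 entries come first because the UTF-32 LE
# mark begins with the same two bytes as the UTF-16 LE mark.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def split_bom(data):
    """Return (encoding implied by the BOM, or None; the data with the BOM removed)."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, data[len(bom):]
    return None, data
```

If this returns an encoding name, the file most likely really is in that encoding and there is little point in trying the others. If it returns None, no BOM is present and the encoding has to be found by trial, as in the snippets above.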