I have a standard file upload where the user is supposed to upload a text file. But one "text file" is not equal to another: the same file can come in different encodings, such as UTF-8, UTF-7, UTF-16, UTF-32, ASCII, and ANSI.
To be clearer: I noticed that some encodings cannot represent all the characters that another encoding can.
Four questions:
1. Which encoding is "the most complete", one that any other encoding can be converted into without losing content?
2. Check if the file is a text file and not a binary.
3. Check if the content of the text file is Base64 encoded or not.
4. If the uploaded encoding is not "the most complete", change the encoding "on the fly" to "the most complete" encoding (see question 1).
I do not want to spam you with the whole code, so let's assume I have the form with action="upload.php". Now comes the part where I need to check the above.
$target_dir = "uploads/";
$target_file = $target_dir . basename($_FILES["fileToUpload"]["name"]);
[...]
// this is the check after the upload
if (isset($_POST["submit"])) {
    // check 1: which encoding has been uploaded?
    // check 2: is the file a text file and not a binary?
    // check 3: is the content of the file Base64-encoded text?
}
// if the encoding differs from the "most preferred" one, convert it to the "most preferred" encoding
[...]
Can you please help quickly?
Which encoding is "the most complete", one that any other encoding can be converted into without losing content?
Unicode. Choose any of the common encodings of the Unicode standard, like UTF-8 or UTF-16. The de facto standard on the internet is UTF-8.
Check if the file is a text file and not a binary
There's no real difference between the two. Text files also just contain binary data; it just so happens that this binary data, interpreted in the right encoding, results in human-readable text.
You can try to check whether the file contains a lot of "control characters", NUL bytes or such; if it does, it may not be text.
You can try confirming whether the file is valid in any of your expected encodings. Have a list of supported/expected encodings at hand and check against that list. Note though that any random binary garbage is "valid" in any single-byte encoding like ISO-8859-1...
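As a rough sketch of the control-character check, something like the following could work. The function name `looksLikeText` and the 10% threshold are made up for illustration; there is no established cut-off, so treat the result as a hint and not as proof.

```php
<?php
// Heuristic only: returns true when the bytes look like text.
// The default 10% threshold is an arbitrary assumption, not an established rule.
function looksLikeText(string $bytes, float $maxControlRatio = 0.10): bool
{
    if ($bytes === '') {
        return true; // an empty file contains nothing that rules out text
    }

    // A NUL byte is a strong hint of binary data. Caveat: UTF-16 and UTF-32
    // text contains NUL bytes too, so this check would reject such files.
    if (strpos($bytes, "\0") !== false) {
        return false;
    }

    // Count control characters other than tab, line feed and carriage return.
    $controls = preg_match_all('/[\x01-\x08\x0B\x0C\x0E-\x1F\x7F]/', $bytes);

    return ($controls / strlen($bytes)) <= $maxControlRatio;
}
```

In the upload script from the question you would presumably feed it the uploaded bytes, e.g. `looksLikeText(file_get_contents($_FILES["fileToUpload"]["tmp_name"]))`.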
Check if the content of the text file is Base64 encoded or not
Try to decode it as Base64. If it decodes properly, then it was probably Base64 encoded. If it can't be decoded due to bad/malformed characters, then it probably wasn't. However, this can easily yield false positives, as simple short text sequences may look like Base64 encoded text.
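A minimal sketch of that check, using PHP's `base64_decode()` in strict mode plus a round trip through `base64_encode()`. The helper name `looksLikeBase64` is hypothetical, and requiring the canonical padded form is a choice I made here, not a rule.

```php
<?php
// Hypothetical helper: guesses whether $data is Base64-encoded.
function looksLikeBase64(string $data): bool
{
    // Drop whitespace first, since Base64 text is often wrapped across lines.
    $compact = (string) preg_replace('/\s+/', '', $data);
    if ($compact === '') {
        return false;
    }

    // Strict mode returns false for characters outside the Base64 alphabet.
    $decoded = base64_decode($compact, true);
    if ($decoded === false) {
        return false;
    }

    // Round trip: re-encoding must reproduce the input exactly,
    // which also rejects input with missing padding.
    return base64_encode($decoded) === $compact;
}
```

Mind the false positives mentioned above: `looksLikeBase64("text")` is true, because the four letters happen to form a valid Base64 group, whereas `looksLikeBase64("Hello, world!")` is false because of the comma, space and exclamation mark.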
If the uploaded encoding is not "the most complete", change the encoding "on the fly" to "the most complete" encoding (see question 1)
If it's not UTF-8 encoded, convert it to UTF-8... from its original encoding...
How do you know its original encoding? You don't. You can guess. Again, have a list of encodings at hand and check them off one by one, using the one that seems most likely.
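Checking candidates off one by one might look roughly like this. It assumes the mbstring extension is available; the function name `guessAndConvertToUtf8` and the default candidate list are only examples, so put in whatever encodings you actually expect.

```php
<?php
// Hypothetical helper; requires the mbstring extension.
// Order matters: list strict encodings first and "anything goes"
// single-byte encodings such as ISO-8859-1 last, since those accept any bytes.
function guessAndConvertToUtf8(string $bytes, array $candidates = ['UTF-8', 'ISO-8859-1']): ?array
{
    foreach ($candidates as $encoding) {
        // "Valid in this encoding" is not proof that this is the real encoding.
        if (mb_check_encoding($bytes, $encoding)) {
            return [
                'guessed' => $encoding,
                'text'    => mb_convert_encoding($bytes, 'UTF-8', $encoding),
            ];
        }
    }

    return null; // none of the candidates could decode the bytes
}
```

With the default list, the bytes `"caf\xE9"` are not valid UTF-8, so they fall through to ISO-8859-1 and come back as the UTF-8 bytes `"caf\xC3\xA9"`. Whether ISO-8859-1 really was the original encoding is still only a guess.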
This doesn't sound very sane to you? Well, that's because it isn't.
Trying to handle unknown encodings is a nightmare you'd best avoid outright.
There is no right answer. There will be false positives. You cannot be sure without having a human confirm the result. If you have a text file in an unknown encoding, try to interpret it in all known encodings, rule out the ones in which it cannot be decoded correctly, and let a human pick the best result. There are libraries which implement such guessing/detection logic, probably paired with statistical text analysis to guesstimate the likelihood of decoded text being actual text, but be aware that all such libraries fundamentally can only provide you with a best guess.
Or know what the encoding is to begin with: from metadata, or by having a human tell you.