What makes a file UTF-8?

I've read that adding the UTF-8 byte order mark (three bytes: EF BB BF) at the start of a text file makes it a UTF-8 file, but I've also read that Unicode recommends against using the BOM for UTF-8.

I'm generating files in PHP and I have a requirement that the files be UTF-8. I've added the UTF-8 BOM to the start of the file but I've received feedback about garbage characters at the start of the file from the company that is parsing the files and that gave me the requirement to make the files UTF-8.

If I open the file in Notepad, it doesn't show the BOM, and if I go to Save As, it shows UTF-8 as the default choice.

Opening the file in Textpad32 shows the 3 characters at the start of the file.

So what makes a file UTF-8?

Text is UTF-8 because it's valid as UTF-8 and the author decides it is.
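The "valid as UTF-8" half of that statement can be checked mechanically. Here is a minimal sketch in Python (the helper name `is_valid_utf8` is made up for illustration), assuming the whole file fits in memory:

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode as strict UTF-8."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# Text encoded as UTF-8 passes the check.
print(is_valid_utf8("na\u00efve caf\u00e9".encode("utf-8")))
# A lone 0xE9 byte (e-acute in Windows-1252) is not a complete UTF-8 sequence.
print(is_valid_utf8(b"caf\xe9"))
```

Passing this check only shows that the bytes could be UTF-8. Whether the author meant them as UTF-8 still has to be communicated some other way.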

How the author communicates that decision to the consumer is a different question. It involves convention, guessing, and various schemes for in-band or out-of-band signalling: an HTTP or HTML charset declaration, a BOM (which improves guessing), an envelope or embedding format, additional data streams, file naming, and many more.

UTF-8 is a particular encoding. All 7-bit ASCII files are also valid UTF-8, and it can encode every Unicode character as well.
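Both claims are easy to see in a few lines. This is a small Python sketch; the sample characters are arbitrary picks for illustration:

```python
ascii_text = "plain ASCII text"

# For pure ASCII input, the ASCII and UTF-8 encodings produce identical bytes.
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")

# Characters outside ASCII take two to four bytes each in UTF-8:
# U+00E9 (e-acute), U+20AC (euro sign), U+1F600 (an emoji).
samples = ["\u00e9", "\u20ac", "\U0001F600"]
byte_lengths = [len(ch.encode("utf-8")) for ch in samples]

print(same_bytes)
print(byte_lengths)
```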

You will often get the advice to save as UTF-8 without a BOM. In practice, it is very unlikely that a file in a legacy encoding (such as code page 1252, Big5 or Shift-JIS) would just happen to look like valid UTF-8 unless it is an intentionally-ambiguous test case. Many programs, such as web browsers, are good in practice at figuring out when a file is UTF-8. Most recent software uses UTF-8 as its preferred text encoding unless it’s forced to default to something else for compatibility with last century. (LaTeX, for example, changed its default source encoding to UTF-8 in April 2018, and both the LuaLaTeX and XeLaTeX engines had been doing the same for years.)
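The kind of guessing described above can be sketched as "try strict UTF-8 first, fall back to a legacy code page". The Python below is only an illustration: the function name `decode_with_fallback` and the choice of Windows-1252 as the fallback are assumptions, and real detectors such as those in browsers are considerably more elaborate.

```python
from typing import Tuple


def decode_with_fallback(data: bytes, fallback: str = "cp1252") -> Tuple[str, str]:
    """Decode as strict UTF-8 if possible, otherwise use a legacy code page.

    Returns the decoded text and the name of the encoding that was used.
    """
    try:
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Legacy text with accented letters almost never forms valid UTF-8.
        return data.decode(fallback, errors="replace"), fallback


legacy_bytes = "d\u00e9j\u00e0 vu".encode("cp1252")
utf8_bytes = "d\u00e9j\u00e0 vu".encode("utf-8")

print(decode_with_fallback(legacy_bytes)[1])
print(decode_with_fallback(utf8_bytes)[1])
```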

There are some document types with special requirements. For example, the default encoding of web pages is theoretically Windows-1252, although browsers in the real world will take their best guess. The current best practice on the Web is to save as UTF-8 without a BOM. Instead, you write inside the <head> of the document either <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> or <meta charset="utf-8"/>. This tells the user agent explicitly what the character encoding is.

On the other hand, some older versions of software either break if they see a BOM, or only recognize UTF-8 if there is a BOM. Microsoft in the ’aughts was especially guilty of this. Its software doesn’t want to break any files that used to work back then, and so, to this day, I save my C source files as UTF-8 with a BOM. This is the only format that just works on every compiler I use: even the latest version of MSVC might guess wrong if you don’t give it either a BOM or the right command-line flag, whereas Clang only supports UTF-8 and has no option to read files in any other encoding. Some older versions of MSVC that I was once forced to use cannot understand UTF-8 at all unless the BOM is there, and do not provide any way to override their autodetection.
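If you are on the consuming side and cannot control whether producers add a BOM, one option is to accept both forms. A minimal Python sketch, using the standard `utf-8-sig` codec (the helper name `read_utf8_text` and the sample content are made up for illustration):

```python
import codecs


def read_utf8_text(data: bytes) -> str:
    """Decode UTF-8 bytes, dropping a leading BOM if one is present."""
    # "utf-8-sig" skips an initial EF BB BF and otherwise behaves like "utf-8".
    return data.decode("utf-8-sig")


with_bom = codecs.BOM_UTF8 + "int main(void)".encode("utf-8")
without_bom = "int main(void)".encode("utf-8")

# Both inputs decode to the same text.
print(read_utf8_text(with_bom) == read_utf8_text(without_bom))
```

Decoding the BOM version with plain `utf-8` would instead leave U+FEFF as the first character of the string.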

The file doesn't need any explicit indicator that it is UTF-8. Modern text editors should detect UTF-8 from the content, since UTF-8 byte sequences are quite distinctive.

Also, as you experienced for yourself, PHP doesn't get along with the BOM. It often messes up the script output and creates more problems than it solves.
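The "garbage characters" at the start of the file are probably the BOM bytes being shown in a legacy code page. That is a guess, since the parser on the other end is unknown. The Python sketch below shows what the three bytes look like when read as Windows-1252, and one way to drop them before handing the file over (the helper name `without_bom` and the sample payload are made up for illustration):

```python
import codecs

# The UTF-8 BOM is the three bytes EF BB BF.
bom = codecs.BOM_UTF8

# A consumer that decodes those bytes as Windows-1252 sees three junk characters.
garbage = bom.decode("cp1252")


def without_bom(data: bytes) -> bytes:
    """Return the data with a leading UTF-8 BOM removed, if present."""
    return data[len(bom):] if data.startswith(bom) else data


payload = bom + "id,name\n".encode("utf-8")

print(repr(garbage))
print(without_bom(payload))
```

The same check works in PHP by comparing the first three bytes of the output against "\xEF\xBB\xBF".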

HTML has its own way of declaring the encoding of a file. You can do it within the HTML itself:

<head>
    <meta charset="UTF-8">
</head>

Or declare the encoding in the HTTP headers, here with PHP:

header('Content-Type: text/html; charset=utf-8');
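On the receiving side, the declared charset can be read back out of that header. A small Python sketch using the standard library's `email.message.Message`; the function name `charset_from_content_type` and the UTF-8 default for a missing charset are assumptions for illustration:

```python
from email.message import Message


def charset_from_content_type(header_value: str, default: str = "utf-8") -> str:
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() lower-cases the value and returns the
    # fallback argument when no charset parameter is present.
    return msg.get_content_charset(default)


print(charset_from_content_type("text/html; charset=utf-8"))
print(charset_from_content_type("text/html"))
```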

Modern browsers are also good at detecting UTF-8 when no encoding is specified, but declaring it explicitly is safer. UTF-8 is the standard encoding of the web, after all.