file_get_contents() breaks ISO-8859-1 encoding

I am trying to read a page using file_get_contents() but I cannot get the character encoding to work.

This is my code:

    $username = "masked";
    $password = "maskedPass";
    $remote_url = 'https://utfws.utfpr.edu.br/aluno01/sistema/mplistahorario.inicio?p_curscodnr=212';

    // Create a stream
    $opts = array(
        'http'=>array(
            'method'=>"GET",
            'header' => array(
                "Authorization: Basic " . base64_encode("$username:$password"),
                'Accept-Charset: iso-8859-1'
            )

        )
    );

    $context = stream_context_create($opts);

    // Open the file using the HTTP headers set above
    $file = file_get_contents($remote_url, false, $context);

    echo $file;

I tried to change the character encoding to UTF-8, but I always get a page with question marks instead of áéíóúãõç.

When I open the page directly in my browser it works just fine. Why is this happening?

It sounds to me like this might just be a problem of lost encoding details.

What you're describing is:

  1. Request the document from the web server, asking for ISO-8859-1.
  2. The server responds with the document in the requested encoding, along with a header stating that the encoding is ISO-8859-1. This looks correct in a browser.
  3. Output the document (but not the header data!) from PHP. Where this output goes isn't specified.
  4. Open the data in some sort of viewer.

See where the encoding specification was lost, there in step 3?

The data can be decoded correctly as ISO-8859-1, but that only happens if the viewer is configured to use that encoding by default. Some apps may default to ISO-8859-1, but UTF-8 is a lot more common these days.
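To make that concrete, here is a minimal sketch of how the same bytes look under the two encodings. It assumes the `mbstring` extension is loaded, and the two-byte sample string is made up for illustration rather than taken from the page in the question.

```php
<?php
// "áé" as an ISO-8859-1 page would deliver it: one byte per character.
$latin1Bytes = "\xE1\xE9";

// The same bytes are not a valid UTF-8 sequence, so a viewer that
// assumes UTF-8 has nothing sensible to display for them.
$validAsUtf8 = mb_check_encoding($latin1Bytes, 'UTF-8');

// Converting with the correct source encoding gives the two-byte UTF-8 forms.
$utf8Bytes = mb_convert_encoding($latin1Bytes, 'UTF-8', 'ISO-8859-1');

var_dump($validAsUtf8);          // bool(false)
echo bin2hex($utf8Bytes), "\n";  // c3a1c3a9
```

Nothing is wrong with the bytes themselves; the viewer simply needs to be told which encoding they are in.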

If you load the data into a different storage engine, say MySQL, the problem may compound. MySQL associates a charset with text data. If your database defaults to UTF-8 and you don't tell it the data is actually in ISO-8859-1, you're feeding it data that is assumed to be UTF-8, and the data will be treated as such in the database going forward. Now even if you ask the database for ISO-8859-1 in the future, the data will be re-encoded from UTF-8 to ISO-8859-1, but the stored bytes were never valid UTF-8, so the result is yet another incorrect set of bytes.
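The mislabeling effect can be imitated in plain PHP, without a database. This is only an analogy using `mb_convert_encoding()` (again assuming `mbstring` is available); what MySQL actually does with invalid input depends on its version and settings.

```php
<?php
// "ção" encoded as ISO-8859-1 (sample bytes, chosen for illustration).
$latin1Bytes = "\xE7\xE3o";

// Wrong: claim the bytes are UTF-8, then ask for ISO-8859-1 back.
$mislabeled = mb_convert_encoding($latin1Bytes, 'ISO-8859-1', 'UTF-8');

// Right: declare the real source encoding, keep UTF-8 internally,
// and convert back only on the way out.
$stored    = mb_convert_encoding($latin1Bytes, 'UTF-8', 'ISO-8859-1');
$roundTrip = mb_convert_encoding($stored, 'ISO-8859-1', 'UTF-8');

var_dump($mislabeled === $latin1Bytes); // bool(false): the accented bytes were lost
var_dump($roundTrip === $latin1Bytes);  // bool(true): the original bytes survive
```

In the mislabeled case PHP substitutes a placeholder character for each invalid sequence, which is one common way to end up with question marks.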

To address this problem, specify the encoding when you view the data, or when you save it to a database.
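One way to do that in PHP is to read the charset from the response headers and re-encode the body as UTF-8 before using it. The sketch below is a rough illustration, not a definitive implementation: `charsetFromHeaders()` and `bodyToUtf8()` are hypothetical helper names, the ISO-8859-1 fallback is an assumption based on the question, and the sample header lines are made up so the snippet runs without a network request.

```php
<?php
/**
 * Hypothetical helper: return the charset named in a Content-Type header
 * line, or null if none of the header lines declares one.
 */
function charsetFromHeaders(array $headers): ?string
{
    foreach ($headers as $line) {
        if (preg_match('/^Content-Type:.*?charset\s*=\s*"?([\w.:-]+)/i', $line, $m)) {
            return $m[1];
        }
    }
    return null;
}

/**
 * Hypothetical helper: re-encode a response body as UTF-8. Falls back to
 * ISO-8859-1 (an assumption for this page) when no charset is declared.
 */
function bodyToUtf8(string $body, array $headers, string $fallback = 'ISO-8859-1'): string
{
    $charset = charsetFromHeaders($headers) ?? $fallback;
    return mb_convert_encoding($body, 'UTF-8', $charset);
}

// Sample header lines in the shape PHP exposes after an HTTP request.
$sampleHeaders = [
    'HTTP/1.1 200 OK',
    'Content-Type: text/html; charset=ISO-8859-1',
];

$utf8 = bodyToUtf8("hor\xE1rio", $sampleHeaders);
echo bin2hex($utf8), "\n";
```

With the code from the question, the header lines would come from `$http_response_header`, which PHP fills in after `file_get_contents()` on an HTTP URL, and you would send `header('Content-Type: text/html; charset=UTF-8');` before echoing the converted body. Alternatively, leave the bytes untouched and send `header('Content-Type: text/html; charset=ISO-8859-1');` before `echo $file;`, assuming the server really does respond in ISO-8859-1.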