Why is the text encoding different when using io.Copy in Go?

I am trying to build a tee-like utility in Go on Windows, but I found that the encoding of the output is not always the same.

To simplify the problem, I wrote this program:

package main

import (
    "fmt"
    "io"
    "os"
)

func main() {
    count, err := io.Copy(os.Stdout, os.Stdin)
    fmt.Println(count, err)
}

I named it test. In the Windows command console, I got this output:

>test
中
中
5 <nil>

It works fine with no pipe or redirection.

>echo 中 | test
��
5 <nil>

The output is garbled if stdin comes from a pipe.

>echo 中 | test > test.txt

>type test.txt
中
5 <nil>

It works again when I redirect the output to a file.

>test > test.txt
中

>type test.txt
荳ュ
5 <nil>

But it does not work when I use the normal console stdin and redirect the output to a file. If I open this test.txt in another editor such as Notepad++, I find that it is encoded in UTF-8 and the content is 中.

If I use Cygwin with a UTF-8 encoded console on Windows, everything works fine.

From the output, I know that the number of bytes the program copied is 5, which means it is using UTF-8 inside the program no matter what stdin is. But as far as I know, the Windows command-line console basically uses a non-Unicode encoding, so why is the input converted into UTF-8? And is there a way to make the program just copy what stdin sends without any conversion?

By the way, if I run the same test with tee from GnuWin32, everything works fine.

>where tee
D:\Tools\gnuWin32\bin\tee.exe

>echo 中 | tee
中

>tee tee.txt
中
中
^C
>type tee.txt
中

Does anyone know the reason for this, and what the solution is?

It is not using UTF-8. Five bytes were written because there is a space (0x20) after 中 in your command. Without the space, only four bytes are copied:

C:\Users\jan>echo 中| go run src/main.go
00000000  d6 d0 0d 0a                                       |....|
��
4 <nil>

So on my system the console uses GBK, not UTF-8.

The bug occurs because the Windows console cannot change a character that is already on screen, even when a byte appended later would turn it into a different character. For example, 'd6 d0' is 中, but once d6 is already on screen as �, appending d0 does not merge the two bytes into one displayed character.

To test this, I wrote a C# console program:

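using System;
using System.IO;

class Program
{
    static void Main(string[] args)
    {
        // 'A' plus both bytes of 中 (GBK: d6 d0) through a single stream.
        using (Stream stdout = Console.OpenStandardOutput())
        {
            stdout.WriteByte((byte)'A');
            stdout.WriteByte(0xd6);
            stdout.WriteByte(0xd0);
        }

        // 'B' plus only the lead byte d6.
        using (Stream stdout = Console.OpenStandardOutput())
        {
            stdout.WriteByte((byte)'B');
            stdout.WriteByte(0xd6);
        }

        // The trail byte d0 arrives through a separate stream.
        using (Stream stdout = Console.OpenStandardOutput())
        {
            stdout.WriteByte(0xd0);
        }
    }
}

The result is: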

A中BPress any key to continue . . .

So I guess the Windows libc has a buffer in front of stdout that assembles the two bytes into one character before printing it to the console.
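To make that guess concrete, here is a rough Go sketch of what such a buffer might do: hold back a trailing GBK lead byte until its trail byte arrives. This is purely illustrative and not how the Windows C runtime or console is known to work. `splitCompleteGBK` is a hypothetical name, and it assumes well-formed two-byte GBK only (lead byte 0x81 to 0xFE followed by one trail byte), ignoring GB18030 four-byte sequences.

```go
package main

import "fmt"

// splitCompleteGBK (hypothetical helper) splits b into a prefix that ends on
// a character boundary and a pending lead byte, if the data ends mid-character.
// Assumption: two-byte GBK only, scanned from the start of b.
func splitCompleteGBK(b []byte) (complete, pending []byte) {
	i := 0
	for i < len(b) {
		if b[i] < 0x81 || b[i] == 0xFF {
			i++ // single-byte character (ASCII and friends)
			continue
		}
		if i+1 >= len(b) {
			// Lead byte with no trail byte yet: hold it back.
			return b[:i], b[i:]
		}
		i += 2 // complete two-byte character
	}
	return b, nil
}

func main() {
	// Second write from the C# test: 'B' followed by a lone lead byte.
	complete, pending := splitCompleteGBK([]byte{'B', 0xd6})
	fmt.Printf("write now: % x, hold back: % x\n", complete, pending)

	// Third write: the trail byte arrives and completes the character.
	complete, pending = splitCompleteGBK(append(pending, 0xd0))
	fmt.Printf("write now: % x, hold back: % x\n", complete, pending)
}
```

A writer built on this idea would only ever hand complete characters to the console, which would match the A中B result above if no such buffering happens between separate streams.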

The interesting thing I found is that, even when the Windows console is on the GBK code page, Go can write to stdout with UTF-8 encoding. It seems that bytes written to os.Stdout are not passed directly to the console.

package main

import (
    "fmt"
    "os"
)

func main() {
    os.Stdout.Write([]byte{0xe4,0xb8,0xad})
    fmt.Print("\xe4")
    fmt.Print("\xb8")
    fmt.Println("\xad")
}

Output:

C:\Users\jan>go run src/main.go
中中

C:\Users\jan>