String to UCS-2

I want to translate my Python program, which converts a Unicode string to a UCS-2 hex string, to Go.

In Python 2, it's quite simple:

u"Bien joué".encode('utf-16-be').encode('hex')
-> 004200690065006e0020006a006f007500e9

I am a beginner in Go and the simplest way I found is:

package main

import (
    "fmt"
    "strings"
)

func main() {
    str := "Bien joué"
    fmt.Printf("str: %s\n", str)

    ucs2HexArray := []rune(str)
    s := fmt.Sprintf("%U", ucs2HexArray)
    a := strings.Replace(s, "U+", "", -1)
    b := strings.Replace(a, "[", "", -1)
    c := strings.Replace(b, "]", "", -1)
    d := strings.Replace(c, " ", "", -1)
    fmt.Printf("->: %s", d)
}

str: Bien joué
->: 004200690065006E0020006A006F007500E9

I really think this is clearly not efficient. How can I improve it?

Thank you

Make this conversion a function; then you can easily improve the conversion algorithm in the future. For example,

package main

import (
    "fmt"
    "strings"
    "unicode/utf16"
)

func hexUTF16FromString(s string) string {
    hex := fmt.Sprintf("%04x", utf16.Encode([]rune(s)))
    return strings.Replace(hex[1:len(hex)-1], " ", "", -1)
}

func main() {
    str := "Bien joué"
    fmt.Println(str)
    hex := hexUTF16FromString(str)
    fmt.Println(hex)
}

Output:

Bien joué
004200690065006e0020006a006f007500e9
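If you want to skip the `Sprintf`-then-`Replace` round trip entirely, you can format each 16-bit code unit directly into a `strings.Builder`. This is a sketch, not part of the original answer; the function name `hexUTF16BE` is made up for illustration:

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf16"
)

// hexUTF16BE returns the UTF-16 (big-endian) encoding of s
// as a lowercase hex string, four hex digits per code unit.
func hexUTF16BE(s string) string {
	var b strings.Builder
	for _, u := range utf16.Encode([]rune(s)) {
		fmt.Fprintf(&b, "%04x", u)
	}
	return b.String()
}

func main() {
	fmt.Println(hexUTF16BE("Bien joué"))
	// 004200690065006e0020006a006f007500e9
}
```

This avoids building and then editing an intermediate string, at the cost of one `Fprintf` call per code unit.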

NOTE:

You say "convert an unicode string to a UCS-2 string" but your Python example uses UTF-16:

u"Bien joué".encode('utf-16-be').encode('hex')

The Unicode Consortium

UTF-16 FAQ

Q: What is the difference between UCS-2 and UTF-16?

A: UCS-2 is obsolete terminology which refers to a Unicode implementation up to Unicode 1.1, before surrogate code points and UTF-16 were added to Version 2.0 of the standard. This term should now be avoided.

UCS-2 does not describe a data format distinct from UTF-16, because both use exactly the same 16-bit code unit representations. However, UCS-2 does not interpret surrogate code points, and thus cannot be used to conformantly represent supplementary characters.

Sometimes in the past an implementation has been labeled "UCS-2" to indicate that it does not support supplementary characters and doesn't interpret pairs of surrogate code points as characters. Such an implementation would not handle processing of character properties, code point boundaries, collation, etc. for supplementary characters.

The standard library has the built-in utf16.Encode() (https://golang.org/pkg/unicode/utf16/#Encode) function for this purpose.

For anything other than trivially short input (and possibly even then), I'd use the golang.org/x/text/encoding/unicode package to convert to UTF-16 (as @peterSo and @JimB point out, slightly different from obsolete UCS-2).

The advantage (over unicode/utf16) of using this (and the golang.org/x/text/transform package) is that you get BOM support, big or little endian, and that you can encode/decode short strings or bytes, but you can also apply this as a filter to an io.Reader or to an io.Writer to transform your data as you process it instead of all up front (i.e. for a large stream of data you don't need to have it all in memory at once).

E.g.:

package main

import (
    "bytes"
    "fmt"
    "io"
    "io/ioutil"
    "log"
    "strings"

    "golang.org/x/text/encoding/unicode"
    "golang.org/x/text/transform"
)

const input = "Bien joué"

func main() {
    // Get a `transform.Transformer` for encoding.
    e := unicode.UTF16(unicode.BigEndian, unicode.IgnoreBOM)
    t := e.NewEncoder()
    // For decoding, allows a Byte Order Mark at the start to
    // switch to corresponding Unicode decoding (UTF-8, UTF-16BE, or UTF-16LE)
    // otherwise we use `e` (UTF-16BE without BOM):
    t2 := unicode.BOMOverride(e.NewDecoder())
    _ = t2 // we don't show/use this

    // If you have a string:
    str := input
    outstr, n, err := transform.String(t, str)
    if err != nil {
        log.Fatal(err)
    }
    fmt.Printf("string:   n=%d, bytes=%02x\n", n, []byte(outstr))

    // If you have a []byte:
    b := []byte(input)
    outbytes, n, err := transform.Bytes(t, b)
    if err != nil {
        log.Fatal(err)
    }
    fmt.Printf("bytes:    n=%d, bytes=%02x\n", n, outbytes)

    // If you have an io.Reader for the input:
    ir := strings.NewReader(input)
    r := transform.NewReader(ir, t)
    // Now just read from r as you normal would and the encoding will
    // happen as you read, good for large sources to avoid pre-encoding
    // everything. Here we'll just read it all in one go though which negates
    // that benefit (normally avoid ioutil.ReadAll).
    outbytes, err = ioutil.ReadAll(r)
    if err != nil {
        log.Fatal(err)
    }
    fmt.Printf("reader: len=%d, bytes=%02x\n", len(outbytes), outbytes)

    // If you have an io.Writer for the output:
    var buf bytes.Buffer
    w := transform.NewWriter(&buf, t)
    _, err = fmt.Fprint(w, input) // or io.Copy from an io.Reader, or whatever
    if err != nil {
        log.Fatal(err)
    }
    fmt.Printf("writer: len=%d, bytes=%02x\n", buf.Len(), buf.Bytes())
}

// Whichever of these you need you could of
// course put in a single simple function. E.g.:

// NewUTF16BEWriter returns a new writer that wraps w
// by transforming the bytes written into UTF-16-BE.
func NewUTF16BEWriter(w io.Writer) io.Writer {
    e := unicode.UTF16(unicode.BigEndian, unicode.IgnoreBOM)
    return transform.NewWriter(w, e.NewEncoder())
}

// ToUTF16BE converts UTF-8 `b` into UTF-16-BE.
func ToUTF16BE(b []byte) ([]byte, error) {
    e := unicode.UTF16(unicode.BigEndian, unicode.IgnoreBOM)
    out, _, err := transform.Bytes(e.NewEncoder(), b)
    return out, err
}

Gives:

string:   n=10, bytes=004200690065006e0020006a006f007500e9
bytes:    n=10, bytes=004200690065006e0020006a006f007500e9
reader: len=18, bytes=004200690065006e0020006a006f007500e9
writer: len=18, bytes=004200690065006e0020006a006f007500e9