Converting from []byte to string and vice versa

I always seem to be converting strings to []byte to string again over and over. Is there a lot of overhead with this? Is there a better way?

For example, here is a function that accepts a UTF-8 string, normalizes it, removes accents, then converts special characters to their ASCII equivalents:

var transliterations = map[rune]string{'Æ':"AE",'Ð':"D",'Ł':"L",'Ø':"OE",'Þ':"Th",'ß':"ss",'æ':"ae",'ð':"d",'ł':"l",'ø':"oe",'þ':"th",'Œ':"OE",'œ':"oe"}
func RemoveAccents(s string) string {
    b := make([]byte, len(s))
    t := transform.Chain(norm.NFD, transform.RemoveFunc(isMn), norm.NFC)
    n, _, e := t.Transform(b, []byte(s), true)
    if e != nil { panic(e) }
    r := string(b[:n]) // only the first n bytes of b were written

    var f bytes.Buffer
    for _, c := range r {
        temp := rune(c)
        if val, ok := transliterations[temp]; ok {
            f.WriteString(val)
        } else {
            f.WriteRune(temp)
        }
    }
    return f.String()
}

So I'm starting with a string because that's what I get, then I'm converting it to a byte slice, then back to a string, then to a byte slice again, then back to a string again. Surely this is unnecessary, but I can't figure out how to avoid it? And does it really have a lot of overhead, or do I not have to worry about slowing things down with excessive conversions?

(Also if anyone has the time I've not yet figured out how bytes.Buffer actually works, would it not be better to initialize a buffer of 2x the size of the string, which is the maximum output size of the return value?)

In Go, strings are immutable, so any change creates a new string. As a general rule, convert from a string to a byte or rune slice once, and convert back to a string once. If you don't know the exact output size, over-allocate small, transient buffers to provide a safety margin and avoid reallocations.

For example,

package main

import (
    "bytes"
    "fmt"
    "unicode"
    "unicode/utf8"

    "golang.org/x/text/transform"
    "golang.org/x/text/unicode/norm"
)

var isMn = func(r rune) bool {
    return unicode.Is(unicode.Mn, r) // Mn: nonspacing marks
}

var transliterations = map[rune]string{
    'Æ': "AE", 'Ð': "D", 'Ł': "L", 'Ø': "OE", 'Þ': "Th",
    'ß': "ss", 'æ': "ae", 'ð': "d", 'ł': "l", 'ø': "oe",
    'þ': "th", 'Œ': "OE", 'œ': "oe",
}

func RemoveAccents(b []byte) ([]byte, error) {
    mnBuf := make([]byte, len(b)*125/100) // 25% headroom: NFD decomposition can expand the input
    t := transform.Chain(norm.NFD, transform.RemoveFunc(isMn), norm.NFC)
    n, _, err := t.Transform(mnBuf, b, true)
    if err != nil {
        return nil, err
    }
    mnBuf = mnBuf[:n]
    tlBuf := bytes.NewBuffer(make([]byte, 0, len(mnBuf)*125/100))
    for i, w := 0, 0; i < len(mnBuf); i += w {
        r, width := utf8.DecodeRune(mnBuf[i:])
        if s, ok := transliterations[r]; ok {
            tlBuf.WriteString(s)
        } else {
            tlBuf.WriteRune(r)
        }
        w = width
    }
    return tlBuf.Bytes(), nil
}

func main() {
    in := "test stringß"
    fmt.Println(in)
    inBytes := []byte(in)
    outBytes, err := RemoveAccents(inBytes)
    if err != nil {
        fmt.Println(err)
    }
    out := string(outBytes)
    fmt.Println(out)
}

Output:

test stringß
test stringss

There is a small overhead in converting a string to a byte slice (not an array; that's a different type): namely, allocating the space for the byte slice and copying the data into it.
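A minimal sketch of why the conversion must allocate and copy (variable names are illustrative): the resulting slice is independent of the immutable string, so mutating it does not affect the original.

```go
package main

import "fmt"

func main() {
	s := "hello"
	b := []byte(s) // allocates a new slice and copies the bytes of s
	b[0] = 'H'     // mutating the copy...

	fmt.Println(s)         // ...leaves the immutable string untouched: "hello"
	fmt.Println(string(b)) // "Hello" — converting back allocates and copies again
}
```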

A string is its own type and is an interpretation of a sequence of bytes, but not every sequence of bytes is a useful string. Strings are also immutable. If you look at the strings package, you will see that strings are sliced a lot.

In your example you can omit the second conversion back to a string and work on the byte slice directly. You can also iterate over a byte slice, though note that ranging over a []byte yields individual bytes, not runes.

As with every question about performance: you will probably need to measure. Is the allocation of byte slices really your bottleneck?

You can initialize your bytes.Buffer like so:

f := bytes.NewBuffer(make([]byte, 0, len(s)*2))

which gives it a length of 0 and a capacity of twice the size of your string. If you can estimate the size of your buffer, it is probably good to do that. It will save you a few reallocations of the underlying byte slice.

There is no answer to this question. If these conversions are a performance bottleneck in your application you should fix them. If not: Not.

Did you profile your application under realistic load and RemoveAccents is the bottleneck? No? So why bother?

Really: I assume one could do better (in the sense of less garbage, fewer iterations, and fewer conversions), e.g. by chaining in some "TransliterationTransformer". But I doubt it would be worth the hassle.