I'm trying to map UTF-8 characters to their "similar" ISO8859-1 representation: removing diacritics, but also replacing characters like Ł with L or ı with i.
Example: José Kakışır should become Jose Kakisir.
I'm aware that removing diacritics can be done this way:
// (From https://blog.golang.org/normalization#TOC_10.)
import (
	"unicode"

	"golang.org/x/text/transform"
	"golang.org/x/text/unicode/norm"
)

isMn := func(r rune) bool {
	return unicode.Is(unicode.Mn, r) // Mn: nonspacing marks
}
// Decompose, drop the combining marks, then recompose.
t := transform.Chain(norm.NFD, transform.RemoveFunc(isMn), norm.NFC)
result, _, err := transform.String(t, "José Kakışır")
if err != nil {
	panic(err)
}
println(result)
Which prints out Jose Kakısır: ş is replaced with s, but ı is not replaced with i.
What's the best way to achieve that in Go?
I believe the charmap package does what you want with a charmap.ISO8859_1.NewEncoder().
Edit: never mind, that will return an error on unsupported runes instead of substituting a similar character. Sorry. It may be worth looking into this package some more though.
Ultimately, it feels like you will need to find (or create) a mapping from UTF-8 to ISO8859. I don't think you'll find a "standard" one out there though; the mapping is too arbitrary.
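As a rough sketch of what such a hand-made mapping could look like: characters like Ł and ı have no canonical decomposition, so the NFD step in the question never gets a combining mark to strip from them. A small lookup table can cover those cases. The table contents and the names specialFolds and foldSpecial below are my own illustration, not a standard mapping, and the list is far from complete.

```go
package main

import (
	"fmt"
	"strings"
)

// specialFolds is a hypothetical, hand-picked table of runes that have no
// canonical decomposition, so NFD plus mark removal leaves them untouched.
// Extend it with whatever characters your data actually contains.
var specialFolds = map[rune]rune{
	'Ł': 'L',
	'ł': 'l',
	'ı': 'i',
	'Ø': 'O',
	'ø': 'o',
	'Đ': 'D',
	'đ': 'd',
}

// foldSpecial replaces every rune found in specialFolds and leaves all
// other runes (including ones that still carry diacritics) unchanged.
func foldSpecial(s string) string {
	return strings.Map(func(r rune) rune {
		if repl, ok := specialFolds[r]; ok {
			return repl
		}
		return r
	}, s)
}

func main() {
	// Input here is what the normalization chain from the question produces.
	fmt.Println(foldSpecial("Jose Kakısır")) // prints "Jose Kakisir"
}
```

This only handles the table entries; it does not remove diacritics by itself, so it is meant to run before or after the NFD / remove-marks / NFC chain. If you prefer a single pass, I believe the same func(rune) rune can be wrapped with runes.Map from golang.org/x/text/runes and added to transform.Chain, but I have not tested that combination here.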