How to detect and handle the multiple ways Unicode encodes accents on letters

Believe it or not, it appears that the iota (the last letter) of this word has been encoded in two different ways in Unicode:

  • εἰμί (GREEK SMALL LETTER IOTA WITH TONOS, U+03AF)
  • εἰμί (GREEK SMALL LETTER IOTA WITH OXIA, U+1F77)

I assume that sometimes the letter is being encoded as a single letter, and at other times it is encoded as a letter+accent.

Is there some kind of map or database, importable into my code, that converts between one form and the other?

Believe it or not

Let's leave the world of fantasy.

Duplicated vowel+oxia characters in Greek Unicode range

The Unicode Consortium

Unicode: Frequently Asked Questions: Normalization

The Go Blog: Text normalization in Go


For example,

package main

import (
    "bytes"
    "fmt"

    "golang.org/x/text/unicode/norm"
)

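// Equal reports whether a and b are the same text after NFKD normalization.
// It compares one normalized segment at a time, so neither string has to be
// normalized in full before the comparison starts.
// NFKD also folds compatibility variants; norm.NFD would limit the
// comparison to canonical equivalence.
func Equal(a, b string) bool {
    var ia, ib norm.Iter
    ia.InitString(norm.NFKD, a)
    ib.InitString(norm.NFKD, b)
    for !ia.Done() && !ib.Done() {
        if !bytes.Equal(ia.Next(), ib.Next()) {
            return false
        }
    }
    // Equal only if both iterators were exhausted together.
    return ia.Done() && ib.Done()
}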

func main() {
    a := "εἰμ\u03AF"
    b := "εἰμ\u1F77"
    fmt.Println(a)
    fmt.Println(b)
    fmt.Println(a == b)
    fmt.Println(Equal(a, b))
}

Output:

εἰμί
εἰμί
false
true
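
The question also asks for a map that converts one form into the other. With the package used above, norm.NFC.String(s) returns a normalized copy of s. It covers the whole of Unicode and is probably what you want in most cases. As a narrower, dependency-free illustration, here is a minimal sketch of a hand-written table for the lowercase vowel+oxia duplicates only. The names oxiaToTonos and foldOxia are made up for this example, and the table deliberately omits the uppercase and dialytika variants.

```go
package main

import (
	"fmt"
	"strings"
)

// oxiaToTonos (hypothetical name) maps the lowercase vowel+oxia code points
// in the Greek Extended block to the vowel+tonos code points in the basic
// Greek block. Unicode treats each pair as canonically equivalent.
// Assumption: only these seven lowercase vowels matter for the input text.
var oxiaToTonos = strings.NewReplacer(
	"\u1F71", "\u03AC", // alpha
	"\u1F73", "\u03AD", // epsilon
	"\u1F75", "\u03AE", // eta
	"\u1F77", "\u03AF", // iota
	"\u1F79", "\u03CC", // omicron
	"\u1F7B", "\u03CD", // upsilon
	"\u1F7D", "\u03CE", // omega
)

// foldOxia (hypothetical helper) rewrites every oxia vowel listed above to
// its tonos counterpart and leaves all other characters untouched.
func foldOxia(s string) string {
	return oxiaToTonos.Replace(s)
}

func main() {
	a := "εἰμ\u03AF"
	b := "εἰμ\u1F77"

	// After folding, the two spellings compare equal with plain ==.
	fmt.Println(foldOxia(a) == foldOxia(b))

	// Show the code point of the final letter after folding.
	r := []rune(foldOxia(b))
	fmt.Printf("%U\n", r[len(r)-1])
}
```

A table like this only handles the characters it lists. Text that mixes precomposed letters with separate combining accents still needs real normalization, which is why the norm package is the safer default.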