From https://en.wikipedia.org/wiki/UTF-8#Invalid_code_points, I learned that the code points U+D800 through U+DFFF are invalid. In decimal, that is 55296 through 57343.
The maximum valid Unicode code point is '\U0010FFFF', which is 1114111 in decimal.
My code:
package main

import "fmt"
import "unicode/utf8"

func main() {
	fmt.Println("Case 1(Invalid Range)")
	str := fmt.Sprintf("%c", rune(55296+1))
	if !utf8.ValidString(str) {
		fmt.Print(str, " is not a valid Unicode")
	} else {
		fmt.Println(str, " is valid unicode character")
	}
	fmt.Println("Case 2(More than maximum valid range)")
	str = fmt.Sprintf("%c", rune(1114111+1))
	if !utf8.ValidString(str) {
		fmt.Print(str, " is not a valid Unicode")
	} else {
		fmt.Println(str, " is valid unicode character")
	}
}
Why does the ValidString function not return false for the invalid Unicode characters given as input? I am sure my understanding is wrong; could someone explain?
Your problem happens in Sprintf. Since you give it an invalid code point, Sprintf replaces it with rune(65533) (U+FFFD),
which is the Unicode replacement character used in place of invalid characters. So your string is valid UTF-8.
This will also happen if you do something like this: str := string([]rune{ 55297 })
so the replacement is not specific to Sprintf: converting a rune to a string encodes it as UTF-8, and a rune that is not a valid code point is encoded as U+FFFD. It's not immediately obvious from: https://blog.golang.org/strings
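To make the rune-to-string behaviour concrete, here is a minimal sketch. The helper name runeToString is made up for illustration; it only wraps the built-in conversion string([]rune{r}) and shows that both a surrogate and a value above U+10FFFF come out as the replacement character.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runeToString is a hypothetical helper that converts a single rune
// to a string using the built-in []rune -> string conversion.
func runeToString(r rune) string {
	return string([]rune{r})
}

func main() {
	// 'A' is valid, 0xD801 is a surrogate, 0x110000 is above the maximum.
	for _, r := range []rune{'A', 0xD801, 0x10FFFF + 1} {
		s := runeToString(r)
		fmt.Printf("%#x -> %q valid=%t replaced=%t\n",
			r, s, utf8.ValidString(s), s == "\uFFFD")
	}
}
```

In every case the resulting string is valid UTF-8, which is why ValidString cannot detect the original bad value after the conversion.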
If you want to force your string to contain invalid UTF-8, you can build the first string from raw bytes like this:
str := string([]byte{237, 159, 193})
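As a rough check of the byte-level approach, the sketch below feeds a few byte sequences to utf8.Valid and utf8.DecodeRune. The helper describe is hypothetical. The second input is what a naive three-byte encoding of U+D801 would look like, which is an assumption added here for comparison and not something from the answer above.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// describe is a hypothetical helper: it reports whether b is valid UTF-8
// and what decoding the first rune of b yields.
func describe(b []byte) (valid bool, r rune, size int) {
	r, size = utf8.DecodeRune(b)
	return utf8.Valid(b), r, size
}

func main() {
	inputs := [][]byte{
		{237, 159, 193},    // bytes from the answer: 193 (0xC1) is not a continuation byte
		{0xED, 0xA0, 0x81}, // naive encoding of the surrogate U+D801 (assumed for comparison)
		{0xEF, 0xBF, 0xBD}, // genuine encoding of U+FFFD
	}
	for _, b := range inputs {
		valid, r, size := describe(b)
		fmt.Printf("% X valid=%t rune=%U size=%d\n", b, valid, r, size)
	}
}
```

Note that DecodeRune returns utf8.RuneError for all three inputs; only the size tells a decoding failure (size 1) apart from a real U+FFFD in the input (size 3).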
You take an invalid value and convert it using Sprintf. It's converted to the error value, utf8.RuneError (U+FFFD). You then check the error value, which is a valid Unicode code point.
package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	fmt.Println("Case 1: Invalid Range")
	str := fmt.Sprintf("%c", rune(55296+1))
	fmt.Printf("%q %X %d %d\n", str, str, []rune(str)[0], utf8.RuneError)
	if !utf8.ValidString(str) {
		fmt.Print(str, " is not a valid Unicode")
	} else {
		fmt.Println(str, " is valid unicode character")
	}
	fmt.Println("Case 2: More than maximum valid range")
	str = fmt.Sprintf("%c", rune(1114111+1))
	fmt.Printf("%q %X %d %d\n", str, str, []rune(str)[0], utf8.RuneError)
	if !utf8.ValidString(str) {
		fmt.Print(str, " is not a valid Unicode")
	} else {
		fmt.Println(str, " is valid unicode character")
	}
}
Output:
Case 1: Invalid Range
"�" EFBFBD 65533 65533
� is valid unicode character
Case 2: More than maximum valid range
"�" EFBFBD 65533 65533
� is valid unicode character
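Since the value is already replaced by the time it becomes a string, one option is to test the rune itself before formatting it. The sketch below uses utf8.ValidRune for that; the classify helper and its labels are invented for illustration and are not part of the standard library.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// classify is a hypothetical helper that inspects the rune value directly,
// before any conversion to a string can replace it with U+FFFD.
func classify(r rune) string {
	switch {
	case utf8.ValidRune(r):
		return "valid"
	case r >= 0xD800 && r <= 0xDFFF:
		return "surrogate"
	default:
		return "out of range"
	}
}

func main() {
	// The two cases from the question, plus the maximum valid code point.
	for _, r := range []rune{55296 + 1, 1114111 + 1, 1114111} {
		fmt.Printf("%d (%#x): %s\n", r, r, classify(r))
	}
}
```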