From the documentation of Go's unicode package:
func IsSpace
func IsSpace(r rune) bool
IsSpace reports whether the rune is a space character as defined by Unicode's White Space property; in the Latin-1 space this is
'\t', '\n', '\v', '\f', '\r', ' ', U+0085 (NEL), U+00A0 (NBSP).
Other definitions of spacing characters are set by category Z and property Pattern_White_Space.
My question is: What does it mean that "other definitions" are set by the Z category and Pattern_White_Space? Does this mean that calling unicode.IsSpace(), checking whether a character is in the Z category, and checking whether a character is in Pattern_White_Space will all yield different results? If so, what are the differences? And why are there differences?
The IsSpace function first checks whether your rune is in the Latin-1 range (U+0000 through U+00FF). If it is, it uses the space characters you listed to decide whether the rune is white space.
If not, isExcludingLatin (http://golang.org/src/unicode/letter.go?h=isExcludingLatin#L170) is called, which looks like:
func isExcludingLatin(rangeTab *RangeTable, r rune) bool {
	r16 := rangeTab.R16
	if off := rangeTab.LatinOffset; len(r16) > off && r <= rune(r16[len(r16)-1].Hi) {
		return is16(r16[off:], uint16(r))
	}
	r32 := rangeTab.R32
	if len(r32) > 0 && r >= rune(r32[0].Lo) {
		return is32(r32, uint32(r))
	}
	return false
}
The *RangeTable being passed in is White_Space, which is defined here:
http://golang.org/src/unicode/tables.go?h=White_Space#L6069
var _White_Space = &RangeTable{
	R16: []Range16{
		{0x0009, 0x000d, 1},
		{0x0020, 0x0020, 1},
		{0x0085, 0x0085, 1},
		{0x00a0, 0x00a0, 1},
		{0x1680, 0x1680, 1},
		{0x2000, 0x200a, 1},
		{0x2028, 0x2029, 1},
		{0x202f, 0x202f, 1},
		{0x205f, 0x205f, 1},
		{0x3000, 0x3000, 1},
	},
	LatinOffset: 4,
}
To answer your main question, the IsSpace check is not limited to Latin-1.
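As a quick check of that claim, here is a minimal sketch that calls unicode.IsSpace on runes inside and outside Latin-1. The helper name describe is made up for this example; only unicode.IsSpace comes from the standard library.

```go
package main

import (
	"fmt"
	"unicode"
)

// describe is a hypothetical helper for this example: it formats a rune
// as U+XXXX together with the result of unicode.IsSpace.
func describe(r rune) string {
	return fmt.Sprintf("%U IsSpace=%t", r, unicode.IsSpace(r))
}

func main() {
	runes := []rune{
		' ',      // Latin-1: ordinary space
		'\u00a0', // Latin-1: no-break space
		'\u1680', // outside Latin-1: Ogham space mark
		'\u3000', // outside Latin-1: ideographic space
		'a',      // not a space at all
	}
	for _, r := range runes {
		fmt.Println(describe(r))
	}
}
```

U+1680 and U+3000 lie well outside Latin-1, yet IsSpace should report true for both, because they are resolved through the range table rather than the Latin-1 special case.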
EDIT
For clarification: if the character you are testing is not in the Latin-1 range, the range table lookup is used. The Range16 values in the table represent ranges of 16-bit code points as {Lo, Hi, Stride}. isExcludingLatin calls is16 with the part of the 16-bit table (R16) that starts at index LatinOffset (which is 4 in this case) and determines whether the rune provided falls in any of those ranges.
So, that is checking these ranges:
{0x1680, 0x1680, 1},
{0x2000, 0x200a, 1},
{0x2028, 0x2029, 1},
{0x202f, 0x202f, 1},
{0x205f, 0x205f, 1},
{0x3000, 0x3000, 1},
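The lookup over those ranges can be sketched roughly as follows. This is not the standard library's is16 (which is unexported and may search differently); inNonLatinRanges and postTable are names invented for this example, and postTable simply copies the table quoted above, which may not match the layout shipped with your Go version.

```go
package main

import (
	"fmt"
	"unicode"
)

// postTable is a local copy of the White_Space table as quoted in this
// answer. It is an assumption that it matches that listing, not
// necessarily the table in the Go release you are running.
var postTable = &unicode.RangeTable{
	R16: []unicode.Range16{
		{Lo: 0x0009, Hi: 0x000d, Stride: 1},
		{Lo: 0x0020, Hi: 0x0020, Stride: 1},
		{Lo: 0x0085, Hi: 0x0085, Stride: 1},
		{Lo: 0x00a0, Hi: 0x00a0, Stride: 1},
		{Lo: 0x1680, Hi: 0x1680, Stride: 1},
		{Lo: 0x2000, Hi: 0x200a, Stride: 1},
		{Lo: 0x2028, Hi: 0x2029, Stride: 1},
		{Lo: 0x202f, Hi: 0x202f, Stride: 1},
		{Lo: 0x205f, Hi: 0x205f, Stride: 1},
		{Lo: 0x3000, Hi: 0x3000, Stride: 1},
	},
	LatinOffset: 4,
}

// inNonLatinRanges is a hypothetical, simplified stand-in for the
// 16-bit lookup: a linear scan over the R16 entries from LatinOffset on.
func inNonLatinRanges(tab *unicode.RangeTable, r rune) bool {
	if r < 0 || r > 0xFFFF {
		return false // this sketch only covers the 16-bit table
	}
	c := uint16(r)
	for _, rg := range tab.R16[tab.LatinOffset:] {
		if c < rg.Lo {
			return false // ranges are sorted, so no later range can match
		}
		if c <= rg.Hi {
			// Inside [Lo, Hi]: a match only if c lands on the stride.
			return (c-rg.Lo)%rg.Stride == 0
		}
	}
	return false
}

func main() {
	for _, r := range []rune{0x1680, 0x2005, 0x200b, 0x3000} {
		fmt.Printf("%U sketch=%t IsSpace=%t\n",
			r, inNonLatinRanges(postTable, r), unicode.IsSpace(r))
	}
}
```

Slicing at LatinOffset is what skips the four Latin-1 entries, so only the six ranges listed above are scanned. U+200B (zero width space) sits just past the 0x2000 to 0x200a range and is rejected.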
The ranges listed above cover the following Unicode code points:
http://www.fileformat.info/info/unicode/char/1680/index.htm
http://www.fileformat.info/info/unicode/char/2000/index.htm
http://www.fileformat.info/info/unicode/char/2001/index.htm
http://www.fileformat.info/info/unicode/char/2002/index.htm
http://www.fileformat.info/info/unicode/char/2003/index.htm
http://www.fileformat.info/info/unicode/char/2004/index.htm
http://www.fileformat.info/info/unicode/char/2005/index.htm
http://www.fileformat.info/info/unicode/char/2006/index.htm
http://www.fileformat.info/info/unicode/char/2007/index.htm
http://www.fileformat.info/info/unicode/char/2008/index.htm
http://www.fileformat.info/info/unicode/char/2009/index.htm
http://www.fileformat.info/info/unicode/char/200a/index.htm
http://www.fileformat.info/info/unicode/char/2028/index.htm
http://www.fileformat.info/info/unicode/char/2029/index.htm
http://www.fileformat.info/info/unicode/char/202f/index.htm
http://www.fileformat.info/info/unicode/char/205f/index.htm
http://www.fileformat.info/info/unicode/char/3000/index.htm
All of the above are considered "white space".
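On the part of the question about whether the three definitions give different results: one way to see it is to test the same runes against all three. The sketch below uses unicode.IsSpace, unicode.Is with the unicode.Z table, and unicode.Is with the unicode.Pattern_White_Space table; the verdict type and classify function are names made up for this example.

```go
package main

import (
	"fmt"
	"unicode"
)

// verdict is a hypothetical record of how one rune fares under each
// of the three definitions mentioned in the documentation.
type verdict struct {
	Space   bool // unicode.IsSpace (White_Space property)
	Z       bool // general category Z (separators)
	Pattern bool // Pattern_White_Space property
}

// classify checks a rune against all three definitions.
func classify(r rune) verdict {
	return verdict{
		Space:   unicode.IsSpace(r),
		Z:       unicode.Is(unicode.Z, r),
		Pattern: unicode.Is(unicode.Pattern_White_Space, r),
	}
}

func main() {
	runes := []rune{
		'\t',     // control character
		' ',      // ordinary space
		'\u00a0', // no-break space
		'\u200e', // left-to-right mark
		'\u2028', // line separator
	}
	for _, r := range runes {
		v := classify(r)
		fmt.Printf("%U IsSpace=%t Z=%t Pattern_White_Space=%t\n",
			r, v.Space, v.Z, v.Pattern)
	}
}
```

So yes, the three can disagree. A tab is a control character, so it is White_Space and Pattern_White_Space but not in category Z. The no-break space is in Z and White_Space but not Pattern_White_Space. The left-to-right mark is only in Pattern_White_Space. As far as I understand the reason, Z is a general category for separator characters, White_Space also covers control characters that act as spacing, and Pattern_White_Space is a small fixed set intended for parsing syntax rather than for text.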