The Go scanner package in text/scanner/scanner.go
uses trick to find whitespace:
const GoWhitespace = 1<<'\t' | 1<<'
' | 1<<'' | 1<<' '
And then:
// skip white space
for s.Whitespace&(1<<uint(ch)) != 0 {
ch = s.next()
}
Since character values shift left by more than 31, can there be cases where this is not unique? I mean, when some char is the same as tab modulo 32, it will be recognized as whitespace?
Fully answer:
Spec explicitly say that for operations on unsigned, we get high bits masked out so the low bits are really "wrap around".
The reason it works is:
Scanner.Whitespace
is actually uint64
so value of GoWhitespace
fully fits.Whitespace&(1<<uint(ch))
on unsigned integer at runtime can have intermediate values arbitrary large and will wrap around. So if say char is "a" (96) we have 1 << 96
which overflows so modulo size of 64-bit int it's 0.As mentioned in the spec, the <<
is a shift operation:
The shift operators shift the left operand by the shift count specified by the right operand. They implement arithmetic shifts if the left operand is a signed integer and logical shifts if it is an unsigned integer. There is no upper limit on the shift count. Shifts behave as if the left operand is shifted n times by 1 for a shift count of n. As a result, x << 1 is the same as x*2 and x >> 1 is the same as x/2 but truncated towards negative infinity.
For large values of ch
, 1<<uint(ch)
will cause an overflow:
For unsigned integer values, the operations +, -, *, and << are computed modulo 2n, where n is the bit width of the unsigned integer's type. Loosely speaking, these unsigned integer operations discard high bits upon overflow, and programs may rely on ``wrap around''.
For signed integers, the operations +, -, *, and << may legally overflow and the resulting value exists and is deterministically defined by the signed integer representation, the operation, and its operands. No exception is raised as a result of overflow. A compiler may not optimize code under the assumption that overflow does not occur. For instance, it may not assume that x < x + 1 is always true.
So implementing <<
with a bitwise rotation operator (what you seem to be describing) would violate the spec. 1<<uint(ch)
will evaluate to zero for values of ch
larger than the size of the int
type, so won't cause any false positives.