Strange behaviour of bufio.Scanner when reading a file line by line

I use bufio.Scanner to read a file line by line into the variable wordlist (a [][]byte).

This is the code (tested with Go 1.1 / 1.3):

package main

import (
    "bufio"
    "fmt"
    "log"
    "os"
)

func main() {
    fle, err := os.Open("words.txt")
    if err != nil {
        log.Fatal(err)
    }
    defer fle.Close()

    scanner := bufio.NewScanner(fle)

    n := 1000
    dCnt := 5
    var wordlist [][]byte

    for scanner.Scan() {
        if len(wordlist) == n {
            break
        }
        word := scanner.Bytes()
        for ii := 0; ii < len(wordlist); ii++ {
            if string(word) == string(wordlist[ii]) {
                log.Println(ii, string(word), string(wordlist[ii]))
                log.Println(len(wordlist), "double")

                dCnt--
                if dCnt == 0 {
                    for i, v := range wordlist {
                        fmt.Println(i, string(v))
                    }
                    log.Fatal("double")
                }
            }
        }
        wordlist = append(wordlist, word)
    }
    if err := scanner.Err(); err != nil {
        log.Fatal(err)
    }
}

words.txt is a file of 5040 lines containing the permutations of the sequence "abcdefg":

line 1 .. 
abcdefg
abcdegf
abcdfeg
abcdfge
..
line 510 ..
afcdbge
afcdebg
afcdegb
afcdgbe
afcdgeb
.. line 5040

It was generated by this small Python script:

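from itertools import permutations

c = "abcdefg"
# Binary mode keeps the line endings as plain "\n" on every platform.
with open('words.txt', 'wb') as outFle:
    # 7! = 5040 permutations, one per line
    for p in permutations(c, len(c)):
        outFle.write(''.join(p).encode('ascii') + b'\n')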

The problem is that after running the Go program above, wordlist contains the following:

index string(wordlist[])

0 afcdebg      <-- this is line 513 of words.txt
1 afcdegb
2 afcdgbe
3 afcdgeb
...
510 bdefcag
511 bdefcga
512 afcdebg    <-- this is the beginning of a repetition of lines 513 .. 1024 of words.txt
513 afcdegb
514 afcdgbe

Instead, wordlist should contain the first 1000 lines of words.txt.

Any ideas?

The answer was given by Daniel Darabos (see below).

Changing

word := scanner.Bytes()

to

word := scanner.Text()

did the job.

(Thanks for your help!)

The documentation of Scanner.Bytes says:

The underlying array may point to data that will be overwritten by a subsequent call to Scan.

So if you save the returned slice, you can expect to see its contents change. This wreaks havoc in your application. Better to not save the returned slice!

A nice solution is to build a string from the bytes:

word := string(scanner.Bytes())

Then you can work with strings everywhere and the code becomes more pleasant.
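As a rough illustration of the string-based approach, here is a minimal sketch. The names firstLines and sampleLines are made up for this example, and the input is a short in-memory string rather than words.txt. scanner.Text() is equivalent to string(scanner.Bytes()) in that both allocate a new string per line.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"log"
	"strings"
)

// firstLines returns up to n lines from r as strings.
// scanner.Text() allocates a new string for every line, so the
// stored values are not affected by later calls to Scan.
func firstLines(r io.Reader, n int) ([]string, error) {
	var lines []string
	scanner := bufio.NewScanner(r)
	for len(lines) < n && scanner.Scan() {
		lines = append(lines, scanner.Text())
	}
	return lines, scanner.Err()
}

// sampleLines runs firstLines on a small in-memory input
// (hypothetical data standing in for words.txt).
func sampleLines() []string {
	input := "abcdefg\nabcdegf\nabcdfeg\nabcdfge\n"
	lines, err := firstLines(strings.NewReader(input), 3)
	if err != nil {
		log.Fatal(err)
	}
	return lines
}

func main() {
	for i, w := range sampleLines() {
		fmt.Println(i, w)
	}
}
```

With a []string, the duplicate check from the question can compare the values directly instead of converting on every comparison.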

What is going on?

Why does Scanner.Bytes hate me? The answer is also in the documentation:

It does no allocation.

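This makes the Scanner nicely efficient. From what you see, I guess it reuses one internal buffer instead of allocating per line. Each line of words.txt takes 8 bytes including the newline, so 512 lines are exactly 4096 bytes, which matches the Scanner's initial buffer size of 4096 bytes: the buffer is refilled, and the old contents overwritten, every 512 lines.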
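One way to check the buffer reuse yourself is a small experiment like the one below. It is only a sketch: buildInput and observeReuse are names invented for this example, and the input is generated in memory with the same 8-byte line length as words.txt. Exactly when the kept slice gets overwritten depends on the Scanner's internal buffer size, which is an implementation detail, so treat the result as an observation rather than a guarantee.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// buildInput returns n lines of 7 digits each (8 bytes with the newline),
// the same line length as words.txt.
func buildInput(n int) string {
	var sb strings.Builder
	for i := 0; i < n; i++ {
		fmt.Fprintf(&sb, "%07d\n", i)
	}
	return sb.String()
}

// observeReuse keeps the slice returned by Bytes for the first line and
// reports its contents before and after the rest of the input is scanned.
func observeReuse() (before, after string) {
	scanner := bufio.NewScanner(strings.NewReader(buildInput(1024)))
	if !scanner.Scan() {
		return "", ""
	}
	first := scanner.Bytes() // no copy: this aliases the Scanner's buffer
	before = string(first)
	for scanner.Scan() {
		// Keep scanning so the Scanner has to refill its buffer.
	}
	after = string(first)
	return before, after
}

func main() {
	before, after := observeReuse()
	fmt.Println("before:", before)
	fmt.Println("after: ", after)
}
```

If the two printed values differ, the slice saved from the first line was overwritten by a later line, which is the effect seen in the question.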

This is not a problem in applications where you do not need to keep references to the lines. (For example, a grep-like program only looks at each line once.) Often you parse the line and store only the parsed result. But if you want to store the raw byte data, you are responsible for copying it out of the Scanner.
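If you do want to keep the raw bytes, copying each line out could look roughly like this. This is a sketch, not the only way to do it: copyLines and sampleCopies are hypothetical names, and the input is generated in memory instead of being read from words.txt.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"log"
	"strings"
)

// copyLines reads all lines from r and stores a private copy of each one.
func copyLines(r io.Reader) ([][]byte, error) {
	var lines [][]byte
	scanner := bufio.NewScanner(r)
	for scanner.Scan() {
		// Appending to a nil slice allocates fresh memory, so the stored
		// line no longer shares the Scanner's buffer.
		line := append([]byte(nil), scanner.Bytes()...)
		lines = append(lines, line)
	}
	return lines, scanner.Err()
}

// sampleCopies runs copyLines on 1024 generated 8-byte lines (8192 bytes),
// more than the Scanner's initial 4096-byte buffer can hold at once.
func sampleCopies() [][]byte {
	var sb strings.Builder
	for i := 0; i < 1024; i++ {
		fmt.Fprintf(&sb, "%07d\n", i)
	}
	lines, err := copyLines(strings.NewReader(sb.String()))
	if err != nil {
		log.Fatal(err)
	}
	return lines
}

func main() {
	lines := sampleCopies()
	fmt.Println(len(lines), string(lines[0]), string(lines[len(lines)-1]))
}
```

The cost is one allocation per line, which is exactly what Bytes avoids, so it only makes sense when the lines really have to outlive the next call to Scan.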

This may be a hassle, but while you can implement the convenient behavior on top of the inconvenient one, it would be impossible to implement the efficient behavior on top of the inefficient one.

Also, here is a simpler script for generating the input (it prints to standard output):

import itertools
for p in itertools.permutations('abcdefg'):
    print(''.join(p))