I use bufio.Scanner to read a file line by line into the variable wordlist ([][]byte).
This is the code (tested with Go 1.1 / 1.3):
package main

import (
	"bufio"
	"fmt"
	"log"
	"os"
)

func main() {
	fle, err := os.Open("words.txt")
	if err != nil {
		log.Fatal(err)
	}
	defer fle.Close()
	scanner := bufio.NewScanner(fle)
	n := 1000
	dCnt := 5
	var wordlist [][]byte
	for scanner.Scan() {
		if len(wordlist) == n {
			break
		}
		word := scanner.Bytes()
		for ii := 0; ii < len(wordlist); ii++ {
			if string(word) == string(wordlist[ii]) {
				log.Println(ii, string(word), string(wordlist[ii]))
				log.Println(len(wordlist), "double")
				dCnt--
				if dCnt == 0 {
					for i, v := range wordlist {
						fmt.Println(i, string(v))
					}
					log.Fatal("double")
				}
			}
		}
		wordlist = append(wordlist, word)
	}
	if err := scanner.Err(); err != nil {
		log.Fatal(err)
	}
}
words.txt is a file of 5040 lines, the permutations of the sequence "abcdefg":
line 1 ..
abcdefg
abcdegf
abcdfeg
abcdfge
..
line 510 ..
afcdbge
afcdebg
afcdegb
afcdgbe
afcdgeb
.. line 5040
generated by this small python script:
from itertools import permutations as perm
c = "abcdefg"
p = perm(c, len(c))
with file('words.txt','wb') as outFle:
    for i in xrange(5040):
        n = ''.join(p.next())
        print >> outFle, n
The problem is that after running the above Go program, wordlist contains the following:
index string(wordlist[])
0 afcdebg <-- this is line 513 of words.txt
1 afcdegb
2 afcdgbe
3 afcdgeb
...
510 bdefcag
511 bdefcga
512 afcdebg <-- this is the beginning of a repetition of lines 513 .. 1024 of words.txt
513 afcdegb
514 afcdgbe
Instead, wordlist should contain the first 1000 lines of words.txt.
Any ideas?
The answer was given by Daniel Darabos (see below). Changing
word := scanner.Bytes()
to
word := scanner.Text()
did the job.
(Thanks for your help!)
The documentation of Scanner.Bytes says:
The underlying array may point to data that will be overwritten by a subsequent call to Scan.
So if you save the returned slice, you can expect to see its contents change. This wreaks havoc in your application. Better to not save the returned slice!
A nice solution is to build a string from the bytes:
word := string(scanner.Bytes())
Then you can work with strings everywhere and the code becomes more pleasant.
Why does Scanner.Bytes hate me? The answer is also in the documentation:
It does no allocation.
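This makes the Scanner nicely efficient. From what you see, I guess it does not allocate per line but reuses one internal buffer: each line here takes 8 bytes (7 letters plus the newline), so 512 lines are exactly 4096 bytes, which would match a 4 KB buffer that is refilled and overwritten.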
This is not a problem in applications where you do not need to keep references to the lines. (For example a grep-like program only looks at each line once.) Often you parse the line and store a reference to that. But if you want to store the raw byte data, you are responsible for copying it out from the Scanner.
This may be a hassle, but while you can implement the convenient behavior on top of the inconvenient one, it would be impossible to implement the efficient behavior on top of the inefficient one.
Also a simpler script for generating the input:
import itertools
for p in itertools.permutations('abcdefg'):
    print ''.join(p)