Here is the sample code:
// test.go
package main

import (
	"bufio"
	"os"
)

func main() {
	if len(os.Args) != 2 {
		println("Usage:", os.Args[0], "")
		os.Exit(1)
	}
	fileName := os.Args[1]
	fp, err := os.Open(fileName)
	if err != nil {
		println(err.Error())
		os.Exit(2)
	}
	defer fp.Close()
	r := bufio.NewScanner(fp)
	var lines []string
	for r.Scan() {
		lines = append(lines, r.Text())
	}
}
c:\>go build test.go
c:\>test.exe test.txt
I then monitored the process with Process Monitor while it ran. Part of the output is:
test.exe ReadFile SUCCESS Offset: 4,692,375, Length: 8,056
test.exe ReadFile SUCCESS Offset: 4,700,431, Length: 7,198
test.exe ReadFile SUCCESS Offset: 4,707,629, Length: 8,134
test.exe ReadFile SUCCESS Offset: 4,715,763, Length: 7,361
test.exe ReadFile SUCCESS Offset: 4,723,124, Length: 8,056
test.exe ReadFile SUCCESS Offset: 4,731,180, Length: 4,322
test.exe ReadFile END OF FILE Offset: 4,735,502, Length: 8,192
The equivalent Java code is:
//Test.java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;

public class Test {
    public static void main(String[] args) {
        try {
            FileInputStream in = new FileInputStream("test.txt");
            BufferedReader br = new BufferedReader(new InputStreamReader(in));
            String strLine;
            while ((strLine = br.readLine()) != null) {
                ;
            }
        } catch (Exception e) {
            System.out.println(e);
        }
    }
}
c:\>javac Test.java
c:\>java Test
Then part of the monitoring output is:
java.exe ReadFile SUCCESS Offset: 4,694,016, Length: 8,192
java.exe ReadFile SUCCESS Offset: 4,702,208, Length: 8,192
java.exe ReadFile SUCCESS Offset: 4,710,400, Length: 8,192
java.exe ReadFile SUCCESS Offset: 4,718,592, Length: 8,192
java.exe ReadFile SUCCESS Offset: 4,726,784, Length: 8,192
java.exe ReadFile SUCCESS Offset: 4,734,976, Length: 526
java.exe ReadFile END OF FILE Offset: 4,735,502, Length: 8,192
As you can see, the buffer size in Java is 8192 and it reads 8192 bytes each time. Why does the Length in Go change from one read to the next?
I have also tried bufio.ReadString('\n') and bufio.ReadBytes('\n'), and both of them show the same behavior.
[Update] I have tested the sample in C,
//test.c
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

int main(void)
{
    FILE *fp;
    char *line = NULL;
    size_t len = 0;
    ssize_t read;

    fp = fopen("test.txt", "r");
    if (fp == NULL)
        exit(EXIT_FAILURE);
    while ((read = getline(&line, &len, fp)) != -1) {
        /* read is an ssize_t, so %zd is the matching format */
        printf("Retrieved line of length %zd :\n", read);
    }
    if (line)
        free(line);
    fclose(fp);
    return EXIT_SUCCESS;
}
The output is similar to that of the Java code (the buffer size is 65536 on my system). So why is Go so different here?
Reading the source of bufio.Scanner's Scan method shows that while the buffer size is 4096, the amount it reads depends on how much "empty" space is left in the buffer, specifically this part:
n, err := s.r.Read(s.buf[s.end:len(s.buf)])
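If you want to see this without Process Monitor, here is a minimal sketch that wraps the input in a reader which logs the size of every Read request the Scanner makes. The names loggingReader, sampleInput and scanReadSizes are made up for this illustration and are not part of bufio, and the in-memory input stands in for your test.txt.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// loggingReader is a hypothetical helper (not part of bufio) that records
// len(p) for every Read call, i.e. how much free buffer space was offered.
type loggingReader struct {
	r         io.Reader
	requested []int
}

func (l *loggingReader) Read(p []byte) (int, error) {
	l.requested = append(l.requested, len(p))
	return l.r.Read(p)
}

// sampleInput builds n newline-terminated lines of varying length.
func sampleInput(n int) string {
	var sb strings.Builder
	for i := 0; i < n; i++ {
		sb.WriteString(strings.Repeat("x", 10+i%50))
		sb.WriteByte('\n')
	}
	return sb.String()
}

// scanReadSizes scans input line by line and returns the size of every
// Read request the Scanner issued, plus the number of lines it produced.
func scanReadSizes(input string) ([]int, int, error) {
	lr := &loggingReader{r: strings.NewReader(input)}
	s := bufio.NewScanner(lr)
	lines := 0
	for s.Scan() {
		lines++
	}
	return lr.requested, lines, s.Err()
}

func main() {
	requested, lines, err := scanReadSizes(sampleInput(1000))
	if err != nil {
		fmt.Println("scan error:", err)
		return
	}
	fmt.Println("lines:", lines)
	fmt.Println("read request sizes:", requested)
}
```

Each logged size is the length of the slice passed to Read, so it should shrink by however many bytes of a partial line were still sitting in the buffer when the Scanner needed more data.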
Performance-wise, I'm almost positive that whatever file system you're using will be smart enough to read ahead and cache the data, so the buffer size shouldn't make much of a difference.
This may be the reason:
In all of the examples you cite, the output of the Scan function is determined by line endings.
Go's default scan function splits by line (http://golang.org/pkg/bufio/#Scanner.Scan):
the default split function breaks the input into lines with line termination stripped
And bufio.ReadString('\n') and bufio.ReadBytes('\n') have the same problem due to the '\n' character.
Try removing all newlines from your test file and check whether the ReadFile records still show lengths that are not multiples of 4096.
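A rough way to try that experiment in memory is sketched below. It uses ReadString('\n') on a bufio.Reader rather than a Scanner, because a Scanner with default settings gives up with bufio.ErrTooLong once a single token exceeds bufio.MaxScanTokenSize. The requestLogger type, the linesInput and flatInput helpers, and the explicit 4096-byte buffer are assumptions made for this illustration, not anything taken from the question.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// bufSize is an assumed buffer size, passed explicitly to NewReaderSize.
const bufSize = 4096

// requestLogger is a hypothetical wrapper (not part of bufio) that records
// the size of the slice handed to each Read call.
type requestLogger struct {
	r     io.Reader
	sizes []int
}

func (l *requestLogger) Read(p []byte) (int, error) {
	l.sizes = append(l.sizes, len(p))
	return l.r.Read(p)
}

// linesInput builds n newline-terminated lines of varying length.
func linesInput(n int) string {
	var sb strings.Builder
	for i := 0; i < n; i++ {
		sb.WriteString(strings.Repeat("x", 10+i%50))
		sb.WriteByte('\n')
	}
	return sb.String()
}

// flatInput builds n bytes that contain no newline at all.
func flatInput(n int) string {
	return strings.Repeat("x", n)
}

// readStringRequests drains input with ReadString('\n') and returns the
// size of every Read request the bufio.Reader made along the way.
func readStringRequests(input string) ([]int, error) {
	lg := &requestLogger{r: strings.NewReader(input)}
	br := bufio.NewReaderSize(lg, bufSize)
	for {
		_, err := br.ReadString('\n')
		if err == io.EOF {
			return lg.sizes, nil
		}
		if err != nil {
			return nil, err
		}
	}
}

func main() {
	withLines, err := readStringRequests(linesInput(1000))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	flat, err := readStringRequests(flatInput(8 * bufSize))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("with newlines:   ", withLines)
	fmt.Println("without newlines:", flat)
}
```

With the newline-free input sized as a multiple of the buffer, every request should come out at the full buffer size, while the line-based input should show requests shortened by whatever partial line was left over. That would point at the leftover-data handling in bufio rather than at the file system.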
As some have suggested, what you're seeing may actually be due to the I/O strategy used by the bufio package.