I have the following Go program to process TSV input, but it is slower than awk and cut. I know cut uses string manipulation tricks to achieve fast performance:
https://github.com/coreutils/coreutils/blob/master/src/cut.c
Is it possible to achieve the same performance as cut with Go (or at least better than awk)? What should be used in Go to achieve better performance?
$ ./main_.sh | indent.sh
time ./main.go 10 < "$infile" > /dev/null
real 0m1.431s
user 0m0.978s
sys 0m0.436s
time cut -f 10 < "$infile" > /dev/null
real 0m0.252s
user 0m0.225s
sys 0m0.025s
time awk -v FS='\t' -v OFS='\t' -e '{ print $10 }' < "$infile" > /dev/null
real 0m1.134s
user 0m1.108s
sys 0m0.024s
$ cat.sh main_.sh
#!/usr/bin/env bash
# vim: set noexpandtab tabstop=2:
infile=$(mktemp)
seq 10000000 | paste -s -d $'\t\t\t\t\t\t\t\t\t\n' > "$infile"
set -v
time ./main.go 10 < "$infile" > /dev/null
time cut -f 10 < "$infile" > /dev/null
time awk -v FS='\t' -v OFS='\t' -e '{ print $10 }' < "$infile" > /dev/null
$ cat main.go
#!/usr/bin/env gorun
// vim: set noexpandtab tabstop=2:
package main
import (
	"bufio"
	"fmt"
	"os"
	"strings"
	"strconv"
)
func main() {
	idx, _ := strconv.Atoi(os.Args[1])
	col := idx - 1
	scanner := bufio.NewScanner(os.Stdin)
	for scanner.Scan() {
		line := strings.TrimRight(scanner.Text(), "\n")
		fields := strings.Split(line, "\t")
		fmt.Printf("%s\n", fields[col])
	}
}
If you profile the application, it shows that most of the time is spent in:
fmt.Printf("%s\n", fields[col])
The main issue there is the 10,000,000 write syscalls you're making to stdout: os.Stdout is unbuffered, so every Printf call results in its own write. Making stdout buffered will significantly reduce the execution time. Removing the overhead of the fmt calls will help even further.
The next step would be to reduce allocations, which you can do by using byte slices rather than strings. Combining these would lead to something like:
// Replaces the loop in main; col is computed as before and "bytes" must be imported.
stdout := bufio.NewWriter(os.Stdout)
defer stdout.Flush()
scanner := bufio.NewScanner(os.Stdin)
for scanner.Scan() {
	// Bytes returns a view into the scanner's buffer, so no string is allocated per line.
	line := scanner.Bytes()
	fields := bytes.Split(line, []byte{'\t'})
	stdout.Write(fields[col])
	stdout.Write([]byte{'\n'})
}