I wrote a program in Go to mimic the standard grep command, but its speed is far behind grep's. Could someone give me any advice? Here is the code:
package main

import (
	"bufio"
	"fmt"
	"log"
	"os"
	"strings"
	"sync"
)

func parse_args() (file, pat string) {
	if len(os.Args) < 3 {
		log.Fatal("usage: gorep2 <file_name> <pattern>")
	}
	file = os.Args[1]
	pat = os.Args[2]
	return
}

func readFile(file string, to chan<- string) {
	f, err := os.Open(file)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	freader := bufio.NewReader(f)
	for {
		line, er := freader.ReadBytes('\n')
		if er == nil {
			to <- string(line)
		} else {
			break
		}
	}
	close(to)
}
func grepLine(pat string, from <-chan string, result chan<- bool) {
	var wg sync.WaitGroup
	for line := range from {
		wg.Add(1)
		go func(l string) {
			defer wg.Done()
			if strings.Contains(l, pat) {
				result <- true
			}
		}(line)
	}
	wg.Wait()
	close(result)
}
func main() {
	file, pat := parse_args()
	text_chan := make(chan string, 10)
	result_chan := make(chan bool, 10)
	go readFile(file, text_chan)
	go grepLine(pat, text_chan, result_chan)
	var total uint = 0
	for r := range result_chan {
		if r {
			total++
		}
	}
	fmt.Printf("Total %d\n", total)
}
The time in Go:
>>> time gogrep /var/log/task.log DEBUG
Total 21089
real 0m0.156s
user 0m0.156s
sys 0m0.015s
The time with grep:
>>> time grep DEBUG /var/log/task.log | wc -l
21089
real 0m0.069s
user 0m0.046s
sys 0m0.064s
For an easily reproducible benchmark, I counted the number of occurrences of the text "and" in Shakespeare.
gogrep:
$ go build gogrep.go && time ./gogrep /home/peter/shakespeare.txt and
Total 21851

real    0m0.613s
user    0m0.651s
sys     0m0.068s

grep:
$ time grep and /home/peter/shakespeare.txt | wc -l
21851

real    0m0.108s
user    0m0.107s
sys     0m0.014s

petergrep:
$ go build petergrep.go && time ./petergrep /home/peter/shakespeare.txt and
Total 21851

real    0m0.098s
user    0m0.092s
sys     0m0.008s
petergrep is written in Go. It's fast.
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"log"
	"os"
)

func parse_args() (file, pat string) {
	if len(os.Args) < 3 {
		log.Fatal("usage: petergrep <file_name> <pattern>")
	}
	file = os.Args[1]
	pat = os.Args[2]
	return
}

func grepFile(file string, pat []byte) int64 {
	patCount := int64(0)
	f, err := os.Open(file)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		if bytes.Contains(scanner.Bytes(), pat) {
			patCount++
		}
	}
	if err := scanner.Err(); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
	return patCount
}

func main() {
	file, pat := parse_args()
	total := grepFile(file, []byte(pat))
	fmt.Printf("Total %d\n", total)
}
Data: Shakespeare: pg100.txt
Go regular expressions are fully UTF-8 aware, and I think that has some overhead. They also have a different theoretical basis (RE2-style automata rather than backtracking), which means they always run in time proportional to the length of the input. It is noticeable that Go regexps just aren't as fast as the PCRE engines used by other languages; if you look at the regexp test in the Computer Language Benchmarks Game you'll see what I mean.
You can always use the PCRE library directly if you want a bit more speed, though.
A datapoint on the relevance of UTF-8 in regexp parsing: I have a long-used custom Perl 5 script for source grepping. I recently modified it to support UTF-8 so it could match fancy golang symbol names, and it ran a FULL ORDER OF MAGNITUDE slower in repeated tests. So while golang regexps do pay a price for the predictability of their runtime, we also have to factor UTF-8 handling into the equation.