Optimizing a Go file-reading program

I'm trying to process a log file, each line of which looks something like this:

flow_stats: 0.30062869162666672 gid 0 fid 1 pkts 5.0 fldur 0.30001386666666674 avgfldur 0.30001386666666674 actfl 3142 avgpps 16.665896331902879 finfl 1

I'm interested in the pkts field and the fldur field. I've got a Python script that can read in a million-line log file, build a list of durations for each packet count, sort those lists, and figure out the medians in about 3 seconds.

I'm playing around with the Go programming language and thought I'd rewrite this, in the hope that it would run faster.

So far, I've been disappointed. Just reading the file into the data structure takes about 5.5 seconds. So I'm wondering if some of you wonderful people can help me make this go (hehe) faster.

Here's my loop:

data := make(map[int][]float32)
infile, err := os.Open("tmp/flow.tr")
if err != nil {
  panic(err)
}
defer infile.Close()
reader := bufio.NewReader(infile)

line, err := reader.ReadString('\n')
for {
  if len(line) == 0 {
    break
  }
  if err != nil && err != io.EOF {
    panic(err)
  }
  split_line := strings.Fields(line)
  num_packets, err := strconv.ParseFloat(split_line[7], 32)
  duration, err := strconv.ParseFloat(split_line[9], 32)
  data[int(num_packets)] = append(data[int(num_packets)], float32(duration))

  line, err = reader.ReadString('\n')
}

Note that I do actually check the errs in the loop -- I've omitted that for brevity. google-pprof indicates that a majority of the time is being spent in strings.Fields, via strings.FieldsFunc, unicode.IsSpace, and runtime.stringiter2.

How can I make this run faster?

Replacing

split_line := strings.Fields(line)

with

split_line := strings.SplitN(line, " ", 11)

yielded a ~4x speed improvement on a 1M-line randomly generated file that mimicked the format you provided above:

strings.Fields version: Completed in 4.232525975s

strings.SplitN version: Completed in 1.111450755s

Some of the efficiency comes from being able to stop parsing and splitting the input line once the duration field has been extracted, but most of it comes from the simpler splitting logic in SplitN. Even splitting all of the fields doesn't take much longer than stopping after the duration. Using:

split_line := strings.SplitN(line, " ", -1)

Completed in 1.554971313s

SplitN and Fields are not the same. Fields assumes tokens are bounded by one or more whitespace characters, whereas SplitN treats tokens as anything bounded by the separator string. If your input had multiple spaces between tokens, split_line would contain an empty token for each extra space.

Sorting and calculating the median does not add much time. I changed the code to use a float64 rather than a float32 as a matter of convenience when sorting. Here's the complete program:

package main

import (
    "bufio"
    "fmt"
    "io"
    "os"
    "sort"
    "strconv"
    "strings"
    "time"
)

// sortKeys returns the keys of a map[int][]float64 in ascending order.
func sortKeys(items map[int][]float64) []int {
    keys := make([]int, len(items))
    i := 0
    for k := range items {
        keys[i] = k
        i++
    }
    sort.Ints(keys)
    return keys
}

// median calculates the median value of an unsorted slice of float64.
func median(d []float64) (m float64) {
    sort.Float64s(d)
    length := len(d)
    if length%2 == 1 {
        m = d[length/2]
    } else {
        m = (d[length/2] + d[length/2-1]) / 2
    }
    return m
}

func main() {
    data := make(map[int][]float64)
    infile, err := os.Open("sample.log")
    if err != nil {
        panic(err)
    }
    defer infile.Close()
    reader := bufio.NewReaderSize(infile, 256*1024)

    s := time.Now()
    for {
        line, err := reader.ReadString('\n')
        if len(line) == 0 {
            break
        }
        if err != nil && err != io.EOF {
            panic(err)
        }
        split_line := strings.SplitN(line, " ", 11)
        num_packets, err := strconv.ParseFloat(split_line[7], 32)
        if err != nil {
            panic(err)
        }
        duration, err := strconv.ParseFloat(split_line[9], 32)
        if err != nil {
            panic(err)
        }
        pkts := int(num_packets)
        data[pkts] = append(data[pkts], duration)
    }

    for _, k := range sortKeys(data) {
        fmt.Printf("pkts: %d, median: %f\n", k, median(data[k]))
    }
    fmt.Println("\nCompleted in ", time.Since(s))
}

And the output:

pkts: 0, median: 0.498146
pkts: 1, median: 0.511023
pkts: 2, median: 0.501408
...
pkts: 99, median: 0.501517
pkts: 100, median: 0.491499

Completed in  1.497052072s