Golang的地图访问瓶颈

I am using Golang to implement naive bayesian classification for a dataset with over 30000 possible tags. I have built the model and I am in the classification phase. I am working on classifying 1000 records and this is taking up to 5 minutes. I have profiled the code with pprof functionality; the top10 are shown below:

Total: 28896 samples
   16408  56.8%  56.8%    24129  83.5% runtime.mapaccess1_faststr
    4977  17.2%  74.0%     4977  17.2% runtime.aeshashbody
    2552   8.8%  82.8%     2552   8.8% runtime.memeqbody
    1468   5.1%  87.9%    28112  97.3% main.(*Classifier).calcProbs
     861   3.0%  90.9%      861   3.0% math.Log
     435   1.5%  92.4%      435   1.5% runtime.markspan
     267   0.9%  93.3%      302   1.0% MHeap_AllocLocked
     187   0.6%  94.0%      187   0.6% runtime.aeshashstr
     183   0.6%  94.6%     1137   3.9% runtime.mallocgc
     127   0.4%  95.0%      988   3.4% math.log10

Surprisingly the map access seems to be the bottleneck. Has anyone experienced this. What other key, value datastructure can be used to avoid this bottleneck? All the map access is done in the following piece of code given below:

func (nb *Classifier) calcProbs(data string) *BoundedPriorityQueue{
    probs := &BoundedPriorityQueue{} 
    heap.Init(probs)

    terms := strings.Split(data, " ")
    for class, prob := range nb.classProb{
        condProb := prob
        clsProbs := nb.model[class]
        for _, term := range terms{
            termProb := clsProbs[term]
            if termProb != 0{
                condProb += math.Log10(termProb)
            }else{
                condProb += -6 //math.Log10(0.000001)
            }
        }
       entry := &Item{
            value: class,
            priority: condProb,
        }
        heap.Push(probs,entry)
    }
    return probs
}

The maps are nb.classProb which is map[string]float64 while the nb.model is a nested map of type

map[string]map[string]float64

In addition to what @tomwilde said, another approach that may speed up your algorithm is string interning. Namely, you can avoid using a map entirely if you know the domain of keys ahead of time. I wrote a small package that will do string interning for you.

Yes, the map access will be the bottleneck in this code: it's the most significant operation inside the two nested loops.

It's not possible to tell for sure from the code that you've included, but I expect you've got a limited number of classes. What you might do, is number them, and store the term-wise class probabilities like this:

map[string][NumClasses]float64

(ie: for each term, store an array of class-wise probabilities [or perhaps their logs already precomputed], and NumClasses is the number of different classes you have).

Then, iterate over terms first, and classes inside. The expensive map lookup will be done in the outer loop, and the inner loop will be iteration over an array.

This'll reduce the number of map lookups by a factor of NumClasses. This may need more memory if your data is extremely sparse.

The next optimisation is to use multiple goroutines to do the calculations, assuming you've more than one CPU core available.