I am using Golang to implement naive bayesian classification for a dataset with over 30000 possible tags. I have built the model and I am in the classification phase. I am working on classifying 1000 records and this is taking up to 5 minutes. I have profiled the code with pprof functionality; the top10 are shown below:
Total: 28896 samples
16408 56.8% 56.8% 24129 83.5% runtime.mapaccess1_faststr
4977 17.2% 74.0% 4977 17.2% runtime.aeshashbody
2552 8.8% 82.8% 2552 8.8% runtime.memeqbody
1468 5.1% 87.9% 28112 97.3% main.(*Classifier).calcProbs
861 3.0% 90.9% 861 3.0% math.Log
435 1.5% 92.4% 435 1.5% runtime.markspan
267 0.9% 93.3% 302 1.0% MHeap_AllocLocked
187 0.6% 94.0% 187 0.6% runtime.aeshashstr
183 0.6% 94.6% 1137 3.9% runtime.mallocgc
127 0.4% 95.0% 988 3.4% math.log10
Surprisingly the map access seems to be the bottleneck. Has anyone experienced this. What other key, value datastructure can be used to avoid this bottleneck? All the map access is done in the following piece of code given below:
func (nb *Classifier) calcProbs(data string) *BoundedPriorityQueue{
probs := &BoundedPriorityQueue{}
heap.Init(probs)
terms := strings.Split(data, " ")
for class, prob := range nb.classProb{
condProb := prob
clsProbs := nb.model[class]
for _, term := range terms{
termProb := clsProbs[term]
if termProb != 0{
condProb += math.Log10(termProb)
}else{
condProb += -6 //math.Log10(0.000001)
}
}
entry := &Item{
value: class,
priority: condProb,
}
heap.Push(probs,entry)
}
return probs
}
The maps are nb.classProb which is map[string]float64
while the nb.model is a nested map of type
map[string]map[string]float64
In addition to what @tomwilde said, another approach that may speed up your algorithm is string interning. Namely, you can avoid using a map entirely if you know the domain of keys ahead of time. I wrote a small package that will do string interning for you.
Yes, the map access will be the bottleneck in this code: it's the most significant operation inside the two nested loops.
It's not possible to tell for sure from the code that you've included, but I expect you've got a limited number of classes. What you might do, is number them, and store the term-wise class probabilities like this:
map[string][NumClasses]float64
(ie: for each term, store an array of class-wise probabilities [or perhaps their logs already precomputed], and NumClasses is the number of different classes you have).
Then, iterate over terms first, and classes inside. The expensive map lookup will be done in the outer loop, and the inner loop will be iteration over an array.
This'll reduce the number of map lookups by a factor of NumClasses. This may need more memory if your data is extremely sparse.
The next optimisation is to use multiple goroutines to do the calculations, assuming you've more than one CPU core available.