How to search a huge slice of map[string]string in parallel

I need to search a huge slice of map[string]string. My thought was that this is a good chance to use Go's channels and goroutines.

The plan was to divide the slice into parts and search them in parallel. But I was kind of shocked that my parallel version timed out, while the plain search of the whole slice did the trick.

I am not sure what I am doing wrong. Below is the code I used to test the concept. The real code would involve more complexity.

// Search scans the data for a given term.
// The function gets passed the data that needs to be searched
// and the search term, and it returns the matched maps.
// The data is pretty simple: each map contains { key: someText }.
func Search(data []map[string]string, term string) []map[string]string {
    set := []map[string]string{}
    for _, v := range data {
        if v["key"] == term {
            set = append(set, v)
        }
    }
    return set
}

So this works pretty well for searching the slice of maps for a given search term.

Now I thought that if my slice had, say, 20K entries, I would like to do the search in parallel.

// All searches all records concurrently.
// It has the same function signature as the Search function,
// but its main task is to fan the slice out into 5 parts and search
// them in parallel.
func All(data []map[string]string, term string) []map[string]string {
    countOfSlices := 5

    part := len(data) / countOfSlices

    fmt.Printf("Size of the data:%v
", len(data))
    fmt.Printf("Fragemnt Size:%v
", part)

    timeout := time.After(60000 * time.Millisecond)

    c := make(chan []map[string]string)

    for i := 0; i < countOfSlices; i++ {
        // Fragments of the array passed on to the search method
        go func() { c <- Search(data[(part*i):(part*(i+1))], term) }()

    }

    result := []map[string]string{}

    for i := 0; i < part-1; i++ {
        select {
        case records := <-c:
            result = append(result, records...)
        case <-timeout:
            fmt.Println("timed out!")
            return result
        }
    }
    return result
}

Here are my tests:

I have a function to generate my test data and 2 tests.

// GenerateTestData builds 20K records, each with a random key,
// and counts how many of them match the given search term.
func GenerateTestData(search string) ([]map[string]string, int) {
    rand.Seed(time.Now().UTC().UnixNano())
    strin := []string{"String One", "This", "String Two", "String Three", "String Four", "String Five"}
    var matchCount int
    numOfRecords := 20000
    set := []map[string]string{}
    for i := 0; i < numOfRecords; i++ {
        p := rand.Intn(len(strin))
        s := strin[p]
        if s == search {
            matchCount++
        }
        set = append(set, map[string]string{"key": s})
    }
    return set, matchCount
}

The two tests: the first just traverses the slice, and the second searches in parallel.

func TestSearchItem(t *testing.T) {

    tests := []struct {
        InSearchTerm string
        Fn           func(data []map[string]string, term string) []map[string]string
    }{
        {
            InSearchTerm: "This",
            Fn:           Search,
        },
        {InSearchTerm: "This",
            Fn: All,
        },
    }

    for i, test := range tests {

        startTime := time.Now()
        data, expectedMatchCount := GenerateTestData(test.InSearchTerm)
        result := test.Fn(data, test.InSearchTerm)

        fmt.Printf("Test: [%v]:
Time: %v 

", i+1, time.Since(startTime))
        assert.Equal(t, len(result), expectedMatchCount, "expected: %v to be: %v", len(result), expectedMatchCount)

    }
}

It would be great if someone could explain to me why my parallel code is so slow. What is wrong with the code, what am I missing here, and what is the recommended way to search huge in-memory slices (50K+ entries)?

This looks like just a simple typo. The problem is that you divide your original big slice into 5 pieces (countOfSlices), and you properly launch 5 goroutines to search each part:

for i := 0; i < countOfSlices; i++ {
    // Fragments of the array passed on to the search method
    go func() { c <- Search(data[(part*i):(part*(i+1))], term) }()

}

This means you should expect 5 results, but you don't: you expect part - 1 = 4000 - 1 results:

for i := 0; i < part-1; i++ {
    select {
    case records := <-c:
        result = append(result, records...)
    case <-timeout:
        fmt.Println("timed out!")
        return result
    }
}

Obviously, if you only launched 5 goroutines, each of which delivers a single result, you can only expect as many (5). And since your loop waits for a lot more results (which will never come), it times out as expected.

Change the condition to this:

for i := 0; i < countOfSlices; i++ {
    // ...
}

There are two problems. As icza notes, you never finish the select, because you need to use countOfSlices as the loop bound. And your call to Search will not get the data you want, because the goroutine closure reads the shared loop variable i after it has changed; you need to evaluate the slice fragment before calling go func(), so allocate the fragment outside and pass it in.
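For illustration, here is a minimal sketch of All with both fixes applied. It keeps the question's names, Search function, and 60-second timeout; the fragment is evaluated before the goroutine is launched so the closure no longer touches the loop variable:

func All(data []map[string]string, term string) []map[string]string {
    countOfSlices := 5
    part := len(data) / countOfSlices

    timeout := time.After(60000 * time.Millisecond)
    c := make(chan []map[string]string)

    for i := 0; i < countOfSlices; i++ {
        // Evaluate the fragment now; the goroutine must not read the
        // shared loop variable i after the loop has moved on.
        fragment := data[part*i : part*(i+1)]
        go func() { c <- Search(fragment, term) }()
    }

    result := []map[string]string{}

    // Each goroutine delivers exactly one result, so wait for
    // exactly countOfSlices sends.
    for i := 0; i < countOfSlices; i++ {
        select {
        case records := <-c:
            result = append(result, records...)
        case <-timeout:
            fmt.Println("timed out!")
            return result
        }
    }
    return result
}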

You might find it still isn't faster, though, to do this particular work in parallel with such simple data (perhaps with more complex data on a machine with lots of cores it would be worthwhile).

Be sure when testing that you try swapping the order of your test runs - you might be surprised by the results! Also, try the benchmarking tools available in the testing package, which run your code lots of times for you and average the results; this might help you get a better idea of whether the fan-out speeds things up.
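For example, two benchmarks along these lines (a sketch that assumes the question's Search, All, and GenerateTestData live in the same test package; run with go test -bench=.):

func BenchmarkSearch(b *testing.B) {
    data, _ := GenerateTestData("This")
    b.ResetTimer() // exclude the data generation from the measurement
    for n := 0; n < b.N; n++ {
        Search(data, "This")
    }
}

func BenchmarkAll(b *testing.B) {
    data, _ := GenerateTestData("This")
    b.ResetTimer()
    for n := 0; n < b.N; n++ {
        All(data, "This")
    }
}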

Concurrency is not parallelism. Go is a massively concurrent language, not a parallel one. Even on a multicore machine you will pay for data exchange between CPUs when accessing your shared slice from the computation threads. You can take advantage of concurrency by searching for just the first match, for example, or by doing something with the results (say printing them, writing them to some Writer, or sorting them) while the search continues.

func Search(data []map[string]string, term string, ch chan map[string]string) {
    for _, v := range data {
        if v["key"] == term {
            ch <- v
        }
    }
}
func main() {
...
    go Search(datapart1, term, ch)
    go Search(datapart2, term, ch)
    go Search(datapart3, term, ch)
...
    // ch must be closed once all searches are done,
    // otherwise this range blocks forever (see below).
    for vv := range ch {
        fmt.Println(vv) // do something with each match concurrently
    }
...
}
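To make that sketch runnable, the channel has to be closed once every searcher is done. A minimal complete version using a sync.WaitGroup to do the closing (the tiny data set, the three-way split, and the extra WaitGroup parameter on Search are just for illustration):

package main

import (
    "fmt"
    "sync"
)

func Search(data []map[string]string, term string, ch chan<- map[string]string, wg *sync.WaitGroup) {
    defer wg.Done()
    for _, v := range data {
        if v["key"] == term {
            ch <- v
        }
    }
}

func main() {
    data := []map[string]string{
        {"key": "This"}, {"key": "That"}, {"key": "This"},
        {"key": "Other"}, {"key": "This"}, {"key": "That"},
    }
    term := "This"

    ch := make(chan map[string]string)
    var wg sync.WaitGroup

    // Fan out over three fragments.
    third := len(data) / 3
    parts := [][]map[string]string{data[:third], data[third : 2*third], data[2*third:]}
    for _, p := range parts {
        wg.Add(1)
        go Search(p, term, ch, &wg)
    }

    // Close the channel once all searchers finish, so the range terminates.
    go func() {
        wg.Wait()
        close(ch)
    }()

    for vv := range ch {
        fmt.Println(vv) // consume matches as they arrive
    }
}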

The recommended way to search a huge slice would be to keep it sorted, or to build a binary tree. There is no magic.
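For instance, if the slice is kept sorted by the key value, the standard library's sort.Search does a binary search in O(log n). A small sketch with made-up data, not the question's exact setup:

package main

import (
    "fmt"
    "sort"
)

func main() {
    // Keep the slice sorted by the "key" value up front.
    data := []map[string]string{
        {"key": "String Five"}, {"key": "String One"},
        {"key": "String Two"}, {"key": "This"},
    }
    sort.Slice(data, func(i, j int) bool { return data[i]["key"] < data[j]["key"] })

    term := "This"
    // Binary search: index of the first element whose key is >= term.
    i := sort.Search(len(data), func(i int) bool { return data[i]["key"] >= term })
    if i < len(data) && data[i]["key"] == term {
        fmt.Println("found:", data[i])
    } else {
        fmt.Println("not found")
    }
}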