前往：稀疏阵列读写的线程安全并发问题

I'm writing a search engine in Go in which I have an inverted index of words to the corresponding results for each word. There is a set dictionary of words and so the words are already converted into a StemID, which is an integer starting from 0. This allows me to use a slice of pointers (i.e. a sparse array) to map each StemID to the structure which contains the results of that query. E.g. var StemID_to_Index []*resultStruct. If aardvark is 0 then the pointer to the resultStruct for aardvark is located at StemID_to_Index[0], which will be nil if the result for this word is currently not loaded.

There is not enough memory on the server to store all of this in memory, so the structure for each StemID will be saved as separate files and these can be loaded into the StemID_to_Index slice. If StemID_to_Index is currently nil for this StemID then the result is not cached and needs to be loaded, otherwise it's already loaded (cached) and so can be used directly. Each time a new result is loaded the memory usage is checked and if it's over the threshold then 2/3 of the loaded results are thrown away (StemID_to_Index is set to nil for these StemIDs and a garbage collection is forced.)

My problem is the concurrency. What is the fastest and most efficient way in which I can have multiple threads searching at the same time without having problems with different threads trying to read and write to the same place at the same time? I'm trying to avoid using mutexes on everything as that would slow down every single access attempt.

Do you think I would get away with loading the results from disk in the working thread and then delivering the pointer to this structure to an "updater" thread using channels, which then updates the nil value in the StemID_to_Index slice to the pointer of the loaded result? This would mean that two threads would never attempt to write at the same time, but what would happen if another thread tried to read from that exact index of StemID_to_Index while the "updater" thread was updating the pointer? It doesn't matter if a thread is given a nil pointer for a result which is currently being loaded, because it will just be loaded twice and while that is a waste of resources it would still deliver the same result and since that is unlikely to happen very often, it's forgiveable.

Additionally, how would the working thread which send the pointer to be updated to the "updater" thread know when the "updater" thread has finished updating the pointer in the slice? Should it just sleep and keep checking, or is there an easy way for the updater to send a message back to the specific thread which pushed to the channel?

UPDATE

I made a little test script to see what would happen if attempting to access a pointer at the same time as modifying it... it seems to always be OK. No errors. Am I missing something?

package main

import (
    "fmt"
    "sync"
)

type tester struct {
 a uint
}

var things *tester

func updater() {
    var a uint
    for {
        what := new(tester)
        what.a = a
        things = what
        a++
    }
}

func test() {
    var t *tester
    for {
        t = things
        if t != nil {
            if t.a < 0 {
                fmt.Println(`Error1`)
            }
        } else {
            fmt.Println(`Error2`)
        }
    }
}

func main() {
    var wg sync.WaitGroup
    things = new(tester)
    go test()
    go test()
    go test()
    go test()
    go test()
    go test()
    go updater()
    go test()
    go test()
    go test()
    go test()
    go test()
    wg.Add(1)
    wg.Wait()
}

UPDATE 2

Taking this further, even if I read and write from multiple threads to the same variable at the same time... it makes no difference, still no errors:

From above:

func test() {
    var a uint
    var t *tester
    for {
        t = things
        if t != nil {
            if t.a < 0 {
                fmt.Println(`Error1`)
            }
        } else {
            fmt.Println(`Error2`)
        }
        what := new(tester)
        what.a = a
        things = what
        a++
    }
}

This implies I don't have to worry about concurrency at all... again: am I missing something here?

My top answer would be to use elasticsearch with a client like elastigo.

If that's not an option, it would really help to know how much you care about race-y behavior. If you don't care, a write could happen right after a read finishes, the user finishing the read will get stale data. You can just have a queue of write and read operations and have multiple threads feed into that queue and one dispatcher issue the operations to the map one-at-a-time as they come it. In all other scenarios, you will need a mutex if there are multiple readers and writers. Maps aren't thread safe in go.

Honestly though, I would just add a mutex to make things simple for now and optimize by analyzing where your bottlenecks actually lie. It seems like you checking a threshold and then purging 2/3 of your cache is a bit arbitrary, and I wouldn't be surprised if you kill performance by doing something like that. Here's on situation where that would break down:

Requesters 1, 2, 3, and 4 are frequently accessing many of the same words on files A & B. Requester 5, 6, 7 and 8 are frequently accessing many of the same words stored on files C & D.

Now when requests interleaved between these requesters and files happen in rapid succession, you may end up purging your 2/3 of your cache over and over again of results that may be requested shortly after. There are a couple other approaches:

Cache words that are frequently accessed at the same time on the same box and have multiple caching boxes.
Cache on a per-word basis with some sort of ranking of how popular that word is. If a new word is accessed from a file while the cache is full, see if other more popular words live in that file and purge less popular entries in the cache in hopes that those words will have a higher hit rate.
Both approaches 1 & 2.

This sounds like a perfect use case for a memory mapped file:

package main

import (
    "log"
    "os"
    "unsafe"

    "github.com/edsrzf/mmap-go"
)

func main() {
    // Open the backing file
    f, err := os.OpenFile("example.txt", os.O_RDWR|os.O_CREATE, 0644)
    if err != nil {
        log.Fatalln(err)
    }
    defer f.Close()

    // Set it's size
    f.Truncate(1024)

    // Memory map it
    m, err := mmap.Map(f, mmap.RDWR, 0)
    if err != nil {
        log.Fatalln(err)
    }
    defer m.Unmap()

    // m is a byte slice
    copy(m, "Hello World")
    m.Flush()

    // here's how to use it with a pointer
    type Coordinate struct{ X, Y int }
    // first get the memory address as a *byte pointer and convert it to an unsafe
    // pointer
    ptr := unsafe.Pointer(&m[20])
    // next convert it into a different pointer type
    coord := (*Coordinate)(ptr)
    // now you can use it directly
    *coord = Coordinate{1, 2}
    m.Flush()
    // and vice-versa
    log.Println(*(*Coordinate)(unsafe.Pointer(&m[20])))
}

The memory map can be larger than real memory and the operating system will handle all the messy details for you.

You will still need to make sure that separate goroutines never read/write to the same segment of memory at the same time.