I am currently hitting an API to gather data for my own processing. Right now I am doing about 100 http.Get requests per second and am wondering what the best approach is to do around 1000 concurrent http.Get requests per second.
Here is what I have right now:
waitTime := time.Second
var lastID uint64 = 1234567890
for {
    now := time.Now()
    for i := 0; i < 100; i++ {
        var tmpID uint64 = lastID
        lastID++
        go func(ID uint64) {
            err := scrape(ID) // this does the http.Get and saves the
            // resulting json into postgresql
            if err != nil {
                errStr := strings.TrimSpace(err.Error())
                if strings.HasSuffix(errStr, "Too Many request to server") {
                    log.Println("hit a real 429")
                    panic(err)
                }
            }
        }(tmpID)
    }
    time.Sleep(waitTime - time.Since(now)) // this is here to
    // ensure I don't go over the limit
}
The API I am hitting is rate limited to 1000 req/s.
The reason I pass the ID into go func(ID) is so I can keep incrementing lastID without needing a lock around "what the next ID is". I just feel like I am doing this wrong. I am pretty new to Go in general as well.
I also assume I have to raise the ulimit on my Ubuntu server to something over 1000 to handle all these open connections.
Any tips or suggestions are greatly appreciated!
Does your HTTP client cache connections? The default one does.
By default, Transport caches connections for future re-use. This may leave many open connections when accessing many hosts. This behavior can be managed using Transport's CloseIdleConnections method and the MaxIdleConnsPerHost and DisableKeepAlives fields.
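Since everything goes to a single host, it is probably worth bumping MaxIdleConnsPerHost so keep-alive connections actually get reused instead of constantly being closed and reopened. A rough sketch, assuming scrape switches from http.Get to a shared client; the names client/fetch-style usage and the specific numbers here are placeholders to tune, not anything from the original post:

package main

import (
    "log"
    "net/http"
    "time"
)

// Shared client with a tuned Transport. The default MaxIdleConnsPerHost is
// tiny (2), so with ~1000 in-flight requests to a single host most keep-alive
// connections get thrown away instead of reused.
var client = &http.Client{
    Timeout: 10 * time.Second, // don't let a stuck request pin a goroutine forever
    Transport: &http.Transport{
        MaxIdleConnsPerHost: 1000,
    },
}

func main() {
    // client.Get instead of http.Get, so the tuned Transport is used
    resp, err := client.Get("https://example.com/")
    if err != nil {
        log.Fatal(err)
    }
    resp.Body.Close()
}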
Why do you spawn goroutines in a loop instead of spawning a fixed number of goroutines, each with a loop inside? Then, if you hit the limit, a worker can back off for a bit.
Primitive example (I did not test it. May contain typos).
numWorkers := 1000

delay := 10 * time.Millisecond     // initial backoff
maxDelay := 100 * time.Millisecond // give up once the backoff passes this

quit := make(chan struct{})
for i := 0; i < numWorkers; i++ {
    go func(base uint64, shift int) {
        iter := 0
        curDelay := delay
        for {
            select {
            case <-quit:
                return
            default:
                // worker 0:   base + 0*1000 + 0, base + 1*1000 + 0, base + 2*1000 + 0, ...
                // worker 1:   base + 0*1000 + 1, base + 1*1000 + 1, base + 2*1000 + 1, ...
                // ...
                // worker 999: base + 0*1000 + 999, base + 1*1000 + 999, base + 2*1000 + 999, ...
                curID := base + uint64(iter*numWorkers+shift)
                err := scrape(curID) // this does the http.Get and saves the
                // resulting json into postgresql
                if err != nil {
                    errStr := strings.TrimSpace(err.Error())
                    if strings.HasSuffix(errStr, "Too Many request to server") {
                        log.Println("hit a real 429")
                        if curDelay > maxDelay {
                            return // or panic, whatever you want
                        }
                        time.Sleep(curDelay)
                        curDelay *= 2 // exponential backoff: 10ms, 20ms, 40ms, 80ms, then return/panic
                        continue      // no increment on iter
                    }
                }
                // increment on success
                iter++
                time.Sleep(time.Second) // 1000 workers, each making one request then sleeping a second ≈ 1000 req/s
            }
        }
    }(lastID, i)
}
IDs never overlap, but there will probably be holes. You can't avoid that without synchronization (a mutex is fine), and at 1000 req/s you can probably afford it, but performance will suffer with a bigger number of workers.
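If gap-free IDs matter, here is a minimal sketch of a shared counter the workers could pull from instead of the per-worker arithmetic above; I'm using sync/atomic rather than a mutex, and nextID/claimID are my own names, not anything from the code above:

package main

import (
    "fmt"
    "sync/atomic"
)

var nextID uint64 = 1234567890

// claimID hands out unique, gap-free IDs to any number of goroutines.
// atomic.AddUint64 returns the value after the increment, so subtracting 1
// gives the ID this call just claimed.
func claimID() uint64 {
    return atomic.AddUint64(&nextID, 1) - 1
}

func main() {
    fmt.Println(claimID(), claimID()) // 1234567890 1234567891
}

Each worker would then call curID := claimID() in place of the base + iter*numWorkers + shift line.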
close(quit)
when you want to stop.
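For example, one way to trigger that from Ctrl-C (the signal handling is my own addition, not part of the sketch above):

package main

import (
    "log"
    "os"
    "os/signal"
)

func main() {
    quit := make(chan struct{})

    // Close quit on the first interrupt so every worker's select sees <-quit and returns.
    sig := make(chan os.Signal, 1)
    signal.Notify(sig, os.Interrupt)
    go func() {
        <-sig
        close(quit)
    }()

    <-quit // in the real program the workers would be running here
    log.Println("shutting down")
}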