The aws s3 sync CLI command can download a large collection of files very quickly, and I cannot achieve the same performance with the AWS Go SDK. I have millions of files in the bucket, so this is critical to me. I also need to list objects page by page so that I can filter on a prefix, which the sync CLI command does not support well.
I have tried using multiple goroutines (10 to 1000) to make requests to the server, but it is still much slower than the CLI. Each Go GetObject call takes about 100 ms, which is unacceptable for the number of files I have. I know the AWS CLI uses the Python SDK under the hood, so how does it get so much better performance? (I tried my script with boto as well as Go.)
I am using ListObjectsV2Pages and GetObject. My region is the same as the S3 server's.
logMtx := &sync.Mutex{}
logBuf := bytes.NewBuffer(make([]byte, 0, 100000000))

err = s3c.ListObjectsV2Pages(
	&s3.ListObjectsV2Input{
		Bucket:  bucket,
		Prefix:  aws.String("2019-07-21-01"),
		MaxKeys: aws.Int64(1000),
	},
	func(page *s3.ListObjectsV2Output, lastPage bool) bool {
		fmt.Println("Received", len(page.Contents), "objects in page")

		// Token pool capping the number of in-flight GetObject calls at 10.
		worker := make(chan bool, 10)
		for i := 0; i < cap(worker); i++ {
			worker <- true
		}

		wg := &sync.WaitGroup{}
		wg.Add(len(page.Contents))

		objIdx := 0
		objIdxMtx := sync.Mutex{}
		for {
			<-worker
			objIdxMtx.Lock()
			if objIdx == len(page.Contents) {
				break
			}
			go func(idx int, obj *s3.Object) {
				gs := time.Now()
				resp, err := s3c.GetObject(&s3.GetObjectInput{
					Bucket: bucket,
					Key:    obj.Key,
				})
				check(err)
				fmt.Println("Get: ", time.Since(gs))

				rs := time.Now()
				// Reads into the shared buffer are serialized by logMtx.
				logMtx.Lock()
				_, err = logBuf.ReadFrom(resp.Body)
				check(err)
				logMtx.Unlock()
				fmt.Println("Read: ", time.Since(rs))

				err = resp.Body.Close()
				check(err)

				worker <- true
				wg.Done()
			}(objIdx, page.Contents[objIdx])
			objIdx += 1
			objIdxMtx.Unlock()
		}
		fmt.Println("ok")
		wg.Wait()
		return true
	},
)
check(err)
Many results look like:
Get: 153.380727ms
Read: 51.562µs
I ended up settling for my script in the initial post. I tried 20 goroutines and that seemed to work pretty well. On my laptop (i7, 8 threads, 16 GB RAM, NVMe SSD), the script is definitely slower than the CLI. On an EC2 instance, however, the difference was small enough that it was not worth my time to optimize further. I used a c5.xlarge instance in the same region as the S3 server.
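Since the script is the same as in the question, the only change needed for 20 goroutines should be the size of the worker pool; a minimal sketch of that tweak (everything else in the per-page loop stays the same):

// Same token-pool pattern as in the question, just with a larger pool.
const workers = 20 // was 10 in the original snippet
worker := make(chan bool, workers)
for i := 0; i < cap(worker); i++ {
	worker <- true
}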
Have you tried using the s3manager package? https://docs.aws.amazon.com/sdk-for-go/api/service/s3/s3manager/
iter := new(s3manager.DownloadObjectsIterator)
var files []*os.File
defer func() {
	for _, f := range files {
		f.Close()
	}
}()

err := client.ListObjectsV2PagesWithContext(ctx, &s3.ListObjectsV2Input{
	Bucket: aws.String(bucket),
	Prefix: aws.String(prefix),
}, func(output *s3.ListObjectsV2Output, last bool) bool {
	for _, object := range output.Contents {
		nm := filepath.Join(dstdir, *object.Key)

		err := os.MkdirAll(filepath.Dir(nm), 0755)
		if err != nil {
			panic(err)
		}

		f, err := os.Create(nm)
		if err != nil {
			panic(err)
		}

		log.Println("downloading", *object.Key, "to", nm)

		iter.Objects = append(iter.Objects, s3manager.BatchDownloadObject{
			Object: &s3.GetObjectInput{
				Bucket: aws.String(bucket),
				Key:    object.Key,
			},
			Writer: f,
		})

		files = append(files, f)
	}
	return true
})
if err != nil {
	panic(err)
}

downloader := s3manager.NewDownloader(s)
err = downloader.DownloadWithIterator(ctx, iter)
if err != nil {
	panic(err)
}
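For what it's worth, the Downloader also exposes Concurrency and PartSize knobs that can be set when it is constructed; these mainly help with large objects, since Concurrency parallelizes the ranged parts of a single download rather than downloads of separate objects. A hedged sketch of swapping a tuned downloader into the end of the snippet above, with placeholder values (s is the same session as before):

// Replaces the plain NewDownloader(s) call above.
downloader := s3manager.NewDownloader(s, func(d *s3manager.Downloader) {
	d.Concurrency = 16           // ranged-GET goroutines per object (default 5)
	d.PartSize = 8 * 1024 * 1024 // bytes per ranged GET (default 5 MB)
})
err = downloader.DownloadWithIterator(ctx, iter)
if err != nil {
	panic(err)
}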