Duplicate records in BigQuery when loading files from Google Cloud Storage using .Run(ctx)

For each day-wise partition, we load files into BigQuery every 3 minutes; each file is approximately 200 MB (.gz). Sometimes I get duplicate records and I am not sure why. I have already verified that the input file contains the data only once, and the logs show that the file was processed only once. What could be the possible reasons for the duplication? Is there any way to prevent it before uploading to BigQuery?

client, err := bigquery.NewClient(ctx, loadJob.ProjectID, clientOption)
if err != nil {
    return nil, jobID, err
}
defer client.Close()
ref := bigquery.NewGCSReference(loadJob.URIs...)
if loadJob.Schema == nil {
    ref.AutoDetect = true
} else {
    ref.Schema = loadJob.Schema
}
ref.SourceFormat = bigquery.JSON
dataset := client.DatasetInProject(loadJob.ProjectID, loadJob.DatasetID)
if err := dataset.Create(ctx, nil); err != nil {
    // Create the dataset if it does not exist; if it already exists,
    // ignore the duplicate error and continue.
    if !strings.Contains(err.Error(), ErrorDuplicate) {
        return nil, jobID, err
    }
}
loader := dataset.Table(loadJob.TableID).LoaderFrom(ref)
loader.CreateDisposition = bigquery.CreateIfNeeded
loader.WriteDisposition = bigquery.WriteAppend
loader.JobID = jobID
job, err := loader.Run(ctx)
if err != nil {
    return nil, jobID, err
}
status, err := job.Wait(ctx)
return status, jobID, err

BigQuery load jobs are atomic, so if a job returns success, the data is guaranteed to have been loaded exactly once.

That said, duplication is possible in case of job retries that succeed on the backend for both the original and the retried attempts.

From the code snippet, I cannot tell whether that retry happens in the client implementation (some clients retry the same load if the connection drops). The usual way to prevent duplication is to submit BigQuery load jobs with the same job_id for the same data: the BigQuery front end will dedupe the retries as long as the original submission is still running.
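Since your code already sets `loader.JobID = jobID`, one way to apply this is to make that ID deterministic for a given set of input files. Here is a minimal sketch (the `deterministicJobID` helper and the prefix are my own, not part of any library) that derives the job ID from a hash of the source URIs, so a retried submission reuses the same ID and the duplicate gets rejected instead of loading the data twice:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// deterministicJobID derives a stable BigQuery job ID from the set of
// source URIs: the same files always map to the same ID, so a retry of
// the same load reuses the ID and can be deduped by the BigQuery
// front end. Job IDs may contain letters, digits, underscores, and
// dashes (up to 1024 characters), so a hex digest is safe.
func deterministicJobID(prefix string, uris []string) string {
	h := sha256.Sum256([]byte(strings.Join(uris, "\n")))
	return fmt.Sprintf("%s_%s", prefix, hex.EncodeToString(h[:]))
}

func main() {
	uris := []string{"gs://my-bucket/2019-01-01/part-000.json.gz"}
	fmt.Println(deterministicJobID("load", uris))
}
```

With this scheme, if `loader.Run(ctx)` fails because a job with that ID already exists, the original submission already carried the data, and the caller can treat that specific error as success rather than resubmitting under a fresh ID.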