A CUDA kernel that doesn't work and I can't see why, please advise

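This is the device-side code. It copies the data in tmp0 into the array arr. Each block has 256 threads, and the number of blocks is derived from the input n, though I don't know whether that is relevant. Any help would be appreciated.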

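__syncthreads() is a barrier: every thread in a block has to reach it before any of them continues. You put it inside a for loop, so could that be the problem? If some threads in the block run fewer iterations than others, not all of them reach the barrier, and the kernel can hang or misbehave.
For what you describe, a plain element-by-element copy, there should be no for statement at all.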

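 __global__ void test(int* arr, int* tmp0)
 {
     // One thread per element: global index across all blocks.
     int tid = blockDim.x*blockIdx.x+threadIdx.x;
     // Each thread copies exactly one element; no loop and no barrier needed.
     arr[tid] = tmp0[tid];
 }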

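Just add an if (tid < n) check around the copy, with n passed to the kernel as an extra parameter. When n is not a multiple of 256, the last block has threads whose tid is >= n, and without the check they read and write past the end of the arrays.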
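Putting the two replies together, a complete program might look like the sketch below. It is only an illustration under some assumptions: the extra parameter n, the host-side names (h_src, h_dst, d_src, d_dst), the test size of 1000 and the rounding-up block count are mine, since the original launch code was not posted.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Copy n ints from tmp0 to arr, one element per thread.
// The parameter n is an assumption; the kernel posted above does not take it.
__global__ void test(int* arr, const int* tmp0, int n)
{
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    if (tid < n) {              // threads past the end of the array do nothing
        arr[tid] = tmp0[tid];
    }
}

int main()
{
    const int n = 1000;                              // deliberately not a multiple of 256
    const int threads = 256;                         // threads per block, as in the question
    const int blocks = (n + threads - 1) / threads;  // round up so every element is covered
    const size_t bytes = n * sizeof(int);

    int* h_src = (int*)malloc(bytes);
    int* h_dst = (int*)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_src[i] = i; h_dst[i] = -1; }

    int* d_src = nullptr;
    int* d_dst = nullptr;
    cudaMalloc(&d_src, bytes);
    cudaMalloc(&d_dst, bytes);
    cudaMemcpy(d_src, h_src, bytes, cudaMemcpyHostToDevice);

    test<<<blocks, threads>>>(d_dst, d_src, n);

    // Report launch errors and errors raised while the kernel ran.
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    cudaMemcpy(h_dst, d_dst, bytes, cudaMemcpyDeviceToHost);

    // Verify the copy on the host.
    int mismatches = 0;
    for (int i = 0; i < n; ++i) {
        if (h_dst[i] != h_src[i]) ++mismatches;
    }
    printf("mismatches: %d\n", mismatches);

    cudaFree(d_src);
    cudaFree(d_dst);
    free(h_src);
    free(h_dst);
    return mismatches != 0;
}
```

With n = 1000 and 256 threads per block this launches 4 blocks, so 24 threads in the last block fall outside the array; the if (tid < n) check is what keeps them from touching memory they do not own.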