CUDA编程中的错误处理

CUDA程序往往很难高度，特别是核函数出错时，甚至有些情况下核函数中内容访问错误时，运行程序之后系统返回的报错信息是free() error之类的错误，会严重误导错误查找过程。所以编写CUDA程序一定记得进行错误错误处理，多写几行代码，有可能省出很多的时间。假如自己的CUDA程序出现了各种奇葩、难以理解的错误，可以考虑加上错误处理，也许就柳岸花明了。
在CUDA学习笔记最后的总结中刻意提到编写CUDA程序一定要做错误处理，但其中一直未提及如何进行错误处理，这里给补上。

CUDA编程中的错误处理实际上非常简单，错误处理相关的函数一共也只有4个：

__host__ __device__ const char* cudaGetErrorName( cudaError_t error ) 接受一个错误码，返回错误名称
__host__ __device__ const char* cudaGetErrorString( cudaError_t error ) 接受一个错误码，返回错误描述
__host__ __device__ cudaError_t cudaGetLastError( void ) 返回上一次CUDA runtime调用时的错误码，并将CUDA错误码置为cudaSuccess
__host__ __device__ cudaError_t cudaPeekAtLastError( void ) 与上面功能一样，但不会重置CUDA的错误信息
当核函数被调用或者CUDA API被调用时，通常都会返回一个错误码，如果返回的错误码为0，则表示该调用没有出错，即cudaSuccess（即错误名称），CUDA的错误码一共有82个（截止到2017.3.29），所有的错误码信息可以在这里查看cudaError，其实就是个enum类型的数据。

根据官方提代的4个错误处理函数，明显可以看出对于cudaMemcpy这类API调用函数，可以直接传递给cudaGetErrorString来进行错误检查，而核函数调用之后必然会更新CUDA Runtime的错误值，那么就可以在核函数调用之后调用cudaGetLastError()来进行核函数调用的错误检查。由于CPU与GPU程序时异步执行，所以为了避免获取到的核函数调用错误信息有误，应该在调用cudaGetLastError()这类函数之前使用cudaDeviceSynchronize()对CPU进行阻塞，以同步GPU与CPU的线程，保证核函数已经调用完成。
为了便于错误处理，通常会将错误处理写成函数，或者进行宏定义，不然每次获取到错误码之后，都还要去调用cudaGetErrorString来查看错误描述。
下面是我常用的一个错误处理宏定义，兼容C与C++，可以参考：

//API调用错误处理，可以接受CUDA的API函数调用作为参数
#define CHECK_ERROR(error) checkCudaError(error, __FILE__, __LINE__)
//检查CUDA Runtime状态码，可以接受一个指定的提示信息
#define CHECK_STATE(msg) checkCudaState(msg, __FILE__, __LINE__)

inline void checkCudaError(cudaError_t error, const char *file, const int line)
{
   if (error != cudaSuccess) {
      std::cerr << "CUDA CALL FAILED:" << file << "( " << line << ")- " << cudaGetErrorString(error) << std::endl;
      exit(EXIT_FAILURE);
   }
}

inline void checkCudaState(const char *msg, const char *file, const int line)
{
   cudaError_t error = cudaGetLastError();
   if (error != cudaSuccess) {
      std::cerr << "---" << msg << " Error---" << std::endl;
      std::cerr << file << "( " << line << ")- " << cudaGetErrorString(error) << std::endl;
      exit(EXIT_FAILURE);
   }
}

调用举例，下面是从以前一个项目代码中抽出来的代码片断：

//统计用于GPU计算的时间
cudaEvent_t start, stop;
CHECK_ERROR(cudaEventCreate(&start));
CHECK_ERROR(cudaEventCreate(&stop));
CHECK_ERROR(cudaEventRecord(start, 0));
CHECK_ERROR(cudaEventSynchronize(start));

CHECK_ERROR(cudaMemcpy(dev_grid_in, grid_in, SIZE * sizeof(double), cudaMemcpyHostToDevice));
CHECK_ERROR(cudaMemcpy(dev_grid_out, grid_out, SIZE * sizeof(double), cudaMemcpyHostToDevice));

for (int i=0; i<times; ++i) {
   kernel<<<grids, blocks>>>(dev_grid_in, dev_grid_out);
   std::swap(dev_grid_in, dev_grid_out);
}
cudaDeviceSynchronize();

CHECK_STATE("kernel call");

CHECK_ERROR(cudaMemcpy(grid_in, dev_grid_in, SIZE * sizeof(double), cudaMemcpyDeviceToHost));

//计算统计的时间
CHECK_ERROR(cudaEventRecord(stop, 0));
CHECK_ERROR(cudaEventSynchronize(stop));
float elapsedTime;
CHECK_ERROR(cudaEventElapsedTime(&elapsedTime, start, stop));

上述代码片断的完整代码链接在https://github.com/blueyi/stencil_computation_sample/blob/master/7_point_stencil_jacobi3d_Optimize.cu

参考：
1.http://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__ERROR.html#group__CUDART__ERROR
2.How to Query Device Properties and Handle Errors in CUDA C/C++
3.How to do error checking in CUDA