I'm trying to run a simple example of dynamic parallelism in CUDA. The code in the .cu file is:
__global__ void child_launch(int *data) {
    data[threadIdx.x] = data[threadIdx.x] + 1;
}

__global__ void parent_launch(int *data) {
    data[threadIdx.x] = threadIdx.x;
    __syncthreads();
    if (threadIdx.x == 0) {
        child_launch<<< 1, 256 >>>(data);
        cudaDeviceSynchronize();
    }
    __syncthreads();
}
Here parent_launch is the kernel I want MATLAB to run, and each thread of parent_launch can launch a grid of blocks running the kernel child_launch (in this example only thread 0 actually launches such a grid, but that's just for illustration).
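To make the intended behaviour concrete, here is a minimal standalone sketch of the same parent/child pair driven from a plain host program instead of MATLAB. The host helper run_example, the macro N, the file name example.cu and the build line in the comment are all assumptions made for illustration, not part of my actual setup. The sketch also leaves out the device-side cudaDeviceSynchronize(), since newer CUDA toolkits deprecate it in device code, and relies on the host-side synchronization instead.

```cuda
#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical build line (an assumption, not the command I actually used):
//   nvcc -rdc=true example.cu -o example -lcudadevrt
// Dynamic parallelism needs a GPU of compute capability 3.5 or higher.

#define N 256  // assumed array length, one element per thread

// Child grid: every thread increments its own element.
__global__ void child_launch(int *data) {
    data[threadIdx.x] = data[threadIdx.x] + 1;
}

// Parent grid: fill the array, then let thread 0 launch the child grid.
__global__ void parent_launch(int *data) {
    data[threadIdx.x] = threadIdx.x;
    __syncthreads();  // all writes of this block are done before the launch
    if (threadIdx.x == 0) {
        child_launch<<<1, N>>>(data);
        // No device-side cudaDeviceSynchronize() here; the parent grid is
        // not considered finished until its child grid has finished.
    }
}

// Hypothetical host helper: runs the parent once and copies the result
// into out, which must hold N ints. Returns 0 on success, 1 on any error.
int run_example(int *out) {
    int *d_data = NULL;
    if (cudaMalloc(&d_data, N * sizeof(int)) != cudaSuccess) return 1;
    parent_launch<<<1, N>>>(d_data);
    cudaError_t err = cudaDeviceSynchronize();  // waits for parent and child
    if (err == cudaSuccess)
        err = cudaMemcpy(out, d_data, N * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    return err == cudaSuccess ? 0 : 1;
}

int main() {
    int out[N];
    if (run_example(out) != 0) {
        printf("CUDA error\n");
        return 1;
    }
    // Each element was set to its index by the parent, then incremented once.
    for (int i = 0; i < N; ++i) assert(out[i] == i + 1);
    printf("ok\n");
    return 0;
}
```

The result I expect from the MATLAB route is the same: element i of data should end up holding i + 1.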
I tried to run it all by compiling the .cu file into a .ptx file and then executing the following commands in MATLAB:
k = parallel.gpu.CUDAKernel('file_name.ptx', 'file_name.cu');
k.GridSize = [1,256];
k.ThreadBlockSize = [1,256];
r1 = feval(k, data); % data is an array of ints on the GPU
The problem is that when I tried to compile the .cu file, I got the following error:
error: kernel launch from __device__ or __global__ functions requires separate compilation mode
Does anyone know how to fix it?