Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
menu search
person
Welcome To Ask or Share your Answers For Others

Categories

I have an array of square matrices int *M[10]; so that M[i] locates the first element of the i-th matrix. I want to multiply all the matrices M[i] by another matrix N, so that I receive an array of square matrices int *P[10] as output.

There are different possibilities I see:

  1. Assing the computation of a different element of M[i] to a different thread; for example, I have 10 matrices, 4x4 sized, so that the number of involved threads would be 160; how to use CUDA to implement this approach?
  2. In the framework of the example above, creating a composite matrix size 40x40 (i.e., collecting 10, 4x4 sized matrices together) and use 40x40 threads; but this approach seems to require more time; I'm trying with the array of matrices, but I think I'm doing something wrong; how can I use this approach with 10 matrices? How to code it in Kernel function?

This is what I'm trying;

void GPU_Multi(int *M[2], int *N, int *P[2], size_t width)
{

    int *devM[2];
    int *devN[2];
    int *devP[2];
    size_t allocasize =sizeof(int) *width*width;

    for(int i = 0 ; i < 10 ; i ++ ) 
    {
        cudaMalloc((void**)&devM[ i ], allocasize );
        cudaMalloc((void**)&devP[ i ], allocasize ); 
    }

    cudaMalloc((void**)&devN, allocasize );

    for(int i = 0 ; i < 10 ; i ++ ) {

        cudaMemcpy(devM[ i ],M[ i ], allocasize , cudaMemcpyHostToDevice);
        cudaMemcpy(devN, N, allocasize , cudaMemcpyHostToDevice);
        dim3 block(width*2, width*2);
        dim3 grid(1,1,1);
        Kernel_Function<<<grid, block>>>  (devM[2], devN, devP[2],width);

        for(int i = 0 ; i < 10 ; i ++ ) 
        {
            cudaMemcpy(P[ i ], P[ i ], allocatesize, cudaMemcpyDeviceToHost);
            cudaFree(devM[ i ]);   
            cudaFree(devP[ i ]);
        }

    }
See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
thumb_up_alt 0 like thumb_down_alt 0 dislike
245 views
Welcome To Ask or Share your Answers For Others

1 Answer

As it emerged from the comments above and the answer by Robert Crovella, there are different possible approaches. Each of the approach can be better suited for a different situation, i.e., for a different number N of matrices to multiply and for different matrix dimensions MxM. Let me summarize them below:

  1. If N is small and M is large, perhaps the best approach would be to use cublas<t>gemm called from host code;
  2. If N is moderate and M is moderate, and if a device with compute capability of at least 3.5 is available, then a good possibility would be to use dynamic parallelism, namely, creating a thread grid of N threads and launching a cublas<t>gemm from within a kernel; perhaps this approach would fail for large N or M due to the large number of threads required;
  3. If N is large and M is small, then the cuBLAS batched approach linked to by Robert Crovella could be of interest;
  4. Similarly, if N is large and M is small, then a cuBLAS stream-based approach would be worth a try, as also mentioned in Robert's comment;
  5. If N is large and M is very small, an approach using a thread grid of N threads, each "manually" calculating an optimized matrix multiplication could be appealing; for example, if one has to construct a matrix multiplication algorithm for 4x4 matrices, then one could optimize the matrix multiplication performed by each thread according to Number of elementary multiplications for multiplying 4x4 matrices.

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
thumb_up_alt 0 like thumb_down_alt 0 dislike
Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
...