I'm using a machine with 8 cores and 32 GB of RAM. On this machine I'm running a C++ program (VS2010, Windows x64) that takes 3 days to build 8 trees (8 is the number of outer threads). I profiled it for bottlenecks and found that the crossCorrelate method takes around 75-80% of the time. Now I'm trying to make that method more efficient. The code looks like this:

int main(){
    int numThread = 8;
    // create numThread threads, run the build_tree method in each of them,
    // and join them all after starting them
}

// I'm creating 8 trees

void build_tree(int i){   // called millions of times
    for(/* some_value to another_val */){
        // do some stuff
        // read the corresponding matrices
        crossCorrelate(mat1, mat2);
    }
    // write the results to a file
}

// each tree works with its own data; there is no dependency between trees.
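
For reference, the outer threading can be sketched like this (a minimal sketch only, using boost::thread_group since boost is already used further down; the real code may create and join the threads differently):

#include <boost/thread.hpp>
#include <boost/bind.hpp>

void build_tree(int i);                                      // as defined above

int main(){
    const int numThread = 8;
    boost::thread_group workers;
    for(int i = 0; i < numThread; ++i)
        workers.create_thread(boost::bind(build_tree, i));   // one tree per thread
    workers.join_all();                                      // wait for all 8 trees
    return 0;
}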

Mat crossCorrelate(Mat mat1_real, Mat mat2_real){
    Mat mat1, mat2, result;
    clock_t s1 = clock();                    // timing start (declared further up in the real code; shown here so the snippet compiles)

    // 1st multi-threading part   // around 20 ms
    Scalar mean1 = mean(mat1_real);
    subtract(mat1_real, (float)mean1[0], mat1);   // mat1 = mat1_real - mean

    Scalar mean2 = mean(mat2_real);
    subtract(mat2_real, (float)mean2[0], mat2);   // mat2 = mat2_real - mean
    // 1st part ends

    Mat tilted_mat2 = flip_cross(mat2);      // flip_cross: my own helper (not shown)

    Mat planes[]  = {Mat_<float>(mat1),        Mat::zeros(mat1.size(), CV_32F)};
    Mat planes2[] = {Mat_<float>(tilted_mat2), Mat::zeros(mat1.size(), CV_32F)};

    Mat complexI;

    // 2nd multi-threaded part    // around 150 ms
    merge(planes, 2, complexI);              // pack real + imaginary planes of mat1
    dft(complexI, complexI);                 // forward DFT
    split(complexI, planes);

    merge(planes2, 2, complexI);             // same for the tilted mat2
    dft(complexI, complexI);
    split(complexI, planes2);
    // 2nd m-t part ends

    // do some operations with mat1, mat2, planes etc.
    clock_t s11 = clock();
    cout << "total time diff " << s11 - s1 << endl;

    return result;
}

This is the method I want to make more efficient. Each call takes around 600 ms. My idea was to make some independent parts of the method multi-threaded, and I found two places that can run in parallel.

For this, I wrote a small piece of code for each of them (the 1st and 2nd m-t parts) and ran those methods in their own threads:

t1 = boost::thread(subtract_mean, mat1_real, boost::ref(mat1));   // boost::thread copies its arguments, so the output Mat has to be passed with boost::ref

void subtract_mean(const Mat& mat_ori, Mat& mat){
    Scalar mean1 = mean(mat_ori);
    subtract(mat_ori, (float)mean1[0], mat);
}

Similarly, for the 2nd part I create two threads, one for each dft (dft_thread); a sketch of what that looks like follows below.
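
Roughly like this (a sketch only; dft_pair is an illustrative name, the merge/dft/split sequence is the one from crossCorrelate above):

void dft_pair(Mat* planes){                    // planes points to the 2-element plane array
    Mat complexI;
    merge(planes, 2, complexI);
    dft(complexI, complexI);
    split(complexI, planes);
}

// inside crossCorrelate:
boost::thread t2(dft_pair, planes);            // forward DFT of mat1
boost::thread t3(dft_pair, planes2);           // forward DFT of tilted mat2
t2.join();
t3.join();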

The code involves a lot of computation, so when I run it the CPU usage is around 90%.

I was expecting a better result with the inner threads, but it is not better.

Here are my questions: Why does my code run faster without dft_thread and sub_thread? How can I make crossCorrelate faster? Could I create an inner thread once and reuse it over and over, and would that make my code faster? Is there a clever way of adding inner threads to my code?

EDIT: I did some new tests. With no inner threads, I checked what happens when the number of outer threads is 1, 2, 4, 6 or 8 for tree size = 16. Here are the results:

numThread    1      2      4      6      8
total time  29     35     51     77    104    (sec)
avg time    29   17.5   12.7   12.8     13    (sec per tree)

I think this shows that I can only get about 2.5 times faster with threads. I was expecting/thinking it would be 5-6 times faster with 8 threads. Is this what it should be? Am I doing something wrong, or is my understanding of threads failing?

EDIT2: I did one more test:

The first run: the code with 6 threads in a single process.

The second run: I copied the Visual Studio project 5 times and ran 6 processes at the same time, each of them with one thread (multithreading vs parallel processing).

Multithreading takes 141 mins, whereas parallel processing takes 70 mins.

Note that running one process with one thread takes 53 mins.

What could be the reason for that? Has anybody seen such an abnormal situation? I'm thinking both should run at the same speed (maybe multithreading a bit faster) since they use the same amount of resources. Am I wrong?

Thanks,


1 Answer

Well, not really an answer, but as a comment it would not be very readable:

  1. try to avoid any parameters/returns for any function that is called often

    const parameters are not enough; if possible, use global variables instead (one set per thread, of course), it is much faster than the heap trashing. For example, in crossCorrelate(mat1,mat2) the mat1, mat2 can be global. In the best case the parameters are passed by pointer/reference; then it is not a big deal, but it can still buy some time. In the worst case they are copied into a new object on every call, and when your matrices are big that takes time. Also do not forget that the constructor/destructor is called too ... (see the pass-by-reference sketch below)
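
    One possible shape of this (a sketch only, based on the question's crossCorrelate, not exact code): pass the inputs by const reference and the result through a reference, so no Mat objects are created per call.

    // sketch: signature change only, the body stays as in the question
    void crossCorrelate(const Mat& mat1_real, const Mat& mat2_real, Mat& result){
        // ... same body as before, writing into 'result' instead of returning it
    }

    // per-thread call site: 'result' is created once and reused for every call
    Mat result;
    crossCorrelate(mat1, mat2, result);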

  2. avoid dynamic allocation in often executed code

    Allocate only once if possible. Modern C/C++ runtimes have pretty good memory managers, so this will not buy much time, but even 1-5% sometimes counts (see the buffer-reuse sketch below).
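
    A sketch of what buffer reuse could look like (the CorrBuffers name and the extra parameter are illustrative, not this answer's code; OpenCV functions reuse an output Mat if it already has the right size and type, so nothing is reallocated after the first call):

    struct CorrBuffers {
        Mat mat1, mat2, tilted, complexI, result;    // temporaries reused on every call
    };

    void build_tree(int i){
        CorrBuffers buf;                             // one set of buffers per tree/thread
        for(/* ... */){
            // crossCorrelate fills buf.* instead of allocating fresh local Mats each call
            crossCorrelate(mat1, mat2, buf);
        }
    }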

  3. check the DFT

    As mentioned before, it should be computed as a DFFT. I am not sure if you have a fast enough implementation of DFFT, but if your input data always has the same matrix size then you can pre-compute the weights once and use them all the time. It will speed up the DFFT / IDFFT significantly.

    BTW merge, dft, split can be rewritten too (to be in place and without parameters), or you can use double buffering techniques (swap pointers on execute). See the OpenCV sketch below this item.

    As you wrote, you cannot go inside the source, so try to use a different DFFT/IDFFT.

    What about NTT/INTT? If your algorithm just uses the FFT for its properties, then the NTT is sometimes faster, but if your input data is complex then you have no other choice.
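
    One way to avoid the merge/split round trips entirely (a hedged sketch, not this answer's code): OpenCV's dft() accepts a single-channel CV_32F matrix and returns a packed (CCS) spectrum, and mulSpectrums(..., conjB = true) multiplies by the conjugate, which turns convolution into correlation, so the explicit flip can be dropped as well. specA/specB are reusable per-thread buffers; the result may differ from the flip_cross version by an index shift, so it would need checking against the existing output.

    void crossCorrelateDFT(const Mat& a, const Mat& b, Mat& corr,
                           Mat& specA, Mat& specB)
    {
        dft(a, specA);                               // forward DFT, packed real spectrum
        dft(b, specB);
        mulSpectrums(specA, specB, specA, 0, true);  // specA = specA * conj(specB)
        dft(specA, corr, DFT_INVERSE | DFT_SCALE);   // back to the spatial domain
    }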

  4. you are reading matrices (I assume from some file)

    Check the performance of that. If it is a binary file then there is nothing to improve, but if it is in text form check the reading efficiency. For example, WinAPI ini file reading is about 1000x slower than an efficiently written ini parser in C++, especially for big files (a raw binary read sketch follows below).
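
    A minimal sketch of a raw binary read into a pre-allocated CV_32F matrix (the file layout, plain row-major floats with known rows/cols, is an assumption):

    #include <cstdio>

    bool read_matrix_binary(const char* path, Mat& m, int rows, int cols)
    {
        m.create(rows, cols, CV_32F);            // no-op if m already has this size/type
        FILE* f = fopen(path, "rb");
        if(!f) return false;
        size_t want = (size_t)rows * cols;
        size_t got  = fread(m.ptr<float>(0), sizeof(float), want, f);
        fclose(f);
        return got == want;                      // true only if the whole matrix was read
    }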

  5. you can try to improve performance by better thread management

    • use a thread count that matches the CPU core count. If you have 4 CPUs, then 100 threads will not be faster than 4 threads; they will actually be slower.
    • you can change the thread/process priority/class
    • sometimes a well-placed Sleep() actually speeds things up
    • you can play with affinity masks to control which thread runs on which CPU; sometimes it helps (see the WinAPI sketch after this list)
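
    A hedged WinAPI sketch of the last two points (Windows only; which core and which priority to pick is workload-specific):

    #include <windows.h>

    void pin_current_thread(int core)
    {
        SetThreadAffinityMask(GetCurrentThread(), ((DWORD_PTR)1) << core);   // run only on 'core'
        SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_ABOVE_NORMAL); // raise priority
    }

    // e.g. at the start of build_tree(i):  pin_current_thread(i % 8);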

PS. btw how big are your matrices ?

[Edit1]

When you go parallel/multi-threaded you are accessing N times more resources at once ...

  • a single matrix of yours is 1K x 1K x float = 4 MB
  • after the FFT you switch to complex, so it becomes 8 MB
  • you are doing some operations on 2 matrices (A-B for example), so that is 16 MB
  • if you do not use in-place algorithms (C=A-B instead of A=A-B) then you are using 24 MB

So check the CACHE size on your computer, and there is your bottleneck (at least in my opinion). Also, when you have a matrix as an operand or return value without a reference (an object, not a pointer), you can add 8 MB for each of them. Also consider the amount of recursion calls inside a 2D (I)DFFT when N = 1024 !!! The amount of heap trashing is horrible. A small in-place example follows below.
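
A minimal example of the in-place point from the list above (A and B are assumed CV_32F and already filled):

void example(Mat& A, const Mat& B)
{
    // out-of-place: touches A, B and a freshly allocated C (three buffers in the working set)
    Mat C = A - B;

    // in-place: touches only A and B (two buffers), no new allocation
    subtract(A, B, A);                  // A = A - B
}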

