Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
menu search
person
Welcome To Ask or Share your Answers For Others

Categories

Is it safe to use __syncthreads() in a block where I have purposefully dropped threads using return?

The documentation states that __syncthreads() must be called by every thread in the block or else it will lead to a deadlock, but in practice I have never experienced such behavior.

Sample code:

__global__ void kernel(float* data, size_t size) {
    // Drop excess threads if user put too many in kernel call.
    // After the return, there are `size` active threads.
    if (threadIdx.x >= size) {
        return;
    }

    // ... do some work ...

    __syncthreads(); // Is this safe?

    // For the rest of the kernel, we need to drop one excess thread
    // After the return, there are `size - 1` active threads
    if (threadIdx.x + 1 == size) {
        return;
    }

     // ... do more work ...

    __syncthreads(); // Is this safe?
}
question from:https://stackoverflow.com/questions/6666382/can-i-use-syncthreads-after-having-dropped-threads

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
thumb_up_alt 0 like thumb_down_alt 0 dislike
601 views
Welcome To Ask or Share your Answers For Others

1 Answer

The answer to the short question is "No". Warp level branch divergence around a __syncthreads() instruction will cause a deadlock and result in a kernel hang. Your code example is not guaranteed to be safe or correct. The correct way to implement the code would be like this:

__global__ void kernel(...)

    if (tidx < N) {
        // Code stanza #1
    }

    __syncthreads();


    if (tidx < N) {
        // Code stanza #2
    }

    // etc
}

so that the __syncthreads() instructions are executed unconditionally.


EDIT: Just to add a bit of additional information which confirms this assertion, __syncthreads() calls get compiled into the PTX bar.sync instruction on all architectures. The PTX2.0 guide (p133) documents bar.sync and includes the following warning:

Barriers are executed on a per-warp basis as if all the threads in a warp are active. Thus, if any thread in a warp executes a bar instruction, it is as if all the threads in the warp have executed the bar instruction. All threads in the warp are stalled until the barrier completes, and the arrival count for the barrier is incremented by the warp size (not the number of active threads in the warp). In conditionally executed code, a bar instruction should only be used if it is known that all threads evaluate the condition identically (the warp does not diverge). Since barriers are executed on a per-warp basis, the optional thread count must be a multiple of the warp size.

So despite any assertions to the contrary, it is not safe to have conditional branching around a __syncthreads() call unless you can be 100% certain that every thread in any given warp follows the same code path and no warp divergence can occur.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
thumb_up_alt 0 like thumb_down_alt 0 dislike
Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
...