This depends upon your application's execution profile. It probably doesn't pay to try any detailed profiling: simply run a series of experiments to find the "sweet spot" for your performance.
Start with an artificially low number, such as 16. Try the even numbers through 24, measuring your performance with whatever metric you've chosen. Once you've identified the local maximum that way, try the odd numbers on either side of it to find the best fit.
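As a rough sketch of that two-pass search, assuming you can wrap one experiment in a function that takes a process count and returns a score where higher is better: the names `find_sweet_spot` and `measure` are hypothetical, and the 16 to 24 range just mirrors the numbers above.

```python
from typing import Callable, Dict, Tuple


def find_sweet_spot(
    measure: Callable[[int], float],  # hypothetical: runs one experiment, higher is better
    low: int = 16,
    high: int = 24,
) -> Tuple[int, Dict[int, float]]:
    """Coarse pass over every second value, then a fine pass next to the winner."""
    # Coarse pass: low, low + 2, ... up to high.
    results = {n: measure(n) for n in range(low, high + 1, 2)}
    best = max(results, key=results.get)

    # Fine pass: the two values skipped on either side of the coarse winner.
    for n in (best - 1, best + 1):
        if n >= 1 and n not in results:
            results[n] = measure(n)

    best = max(results, key=results.get)
    return best, results
```

In practice `measure` would launch your workload with `n` processes and return something like samples per second. Real measurements are noisy, so it is probably worth averaging a few runs per setting before trusting small differences.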
This is a common technique in systems. My team did it to train deep learning models. We found that we needed to keep a bit over 10% of the processors free for typical OS operations: model I/O and other resource maintenance.
Additional comment from @Steve:
I've done a lot of this sort of testing over the years, and you'll often be surprised by the answer you come up with. I'd suggest that you make it easy to reconsider the optimal number, and do so regularly, as a seemingly insignificant code change will sometimes alter the optimum value quite a bit.
If finding the optimum is important enough to your bottom line ($ and/or throughput), you'd benefit by building a system that determines the optimum and adjusts for it dynamically and somewhat continuously. This isn't a terribly difficult thing to do.
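One simple way to approach that kind of continuous adjustment is a hill climber: nudge the count by one, compare the new score with the previous one, and turn around when things get worse. This is only a minimal sketch of one possible design, not necessarily what the comment has in mind; `WorkerCountTuner` and its parameters are hypothetical names, and it assumes a higher score is better.

```python
class WorkerCountTuner:
    """Hill-climbing tuner: keeps stepping in one direction until the score drops."""

    def __init__(self, start: int, minimum: int, maximum: int) -> None:
        self.current = start
        self.minimum = minimum
        self.maximum = maximum
        self.step = 1           # direction of the next move (+1 or -1)
        self.last_score = None  # score seen at the previous setting

    def update(self, score: float) -> int:
        """Report the score measured at `current`; returns the next count to try."""
        if self.last_score is not None and score < self.last_score:
            self.step = -self.step  # the last move made things worse: turn around
        self.last_score = score

        nxt = self.current + self.step
        if not (self.minimum <= nxt <= self.maximum):
            self.step = -self.step  # hit a bound: head back the other way
            nxt = self.current + self.step
        self.current = max(self.minimum, min(self.maximum, nxt))
        return self.current
```

You would call `update` once per measurement window with the throughput observed at the current setting, then resize your worker pool to the returned value. With a stable workload the count ends up hovering within one step of the optimum, and it drifts to a new optimum if a code change moves it. Noisy metrics would need smoothing (for example, averaging over a window) before being fed in.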