Thank you, @tromp! I tested this on our “super” box, where you also have an account and which you can use for testing now that the code is (almost) open source (it still needs a license, as Zooko pointed out). On that box, eqcuda and feqcuda sometimes fail to find solutions, and take multiple seconds to complete when that happens. For example, the first time I ran them, they reported 0 solutions. With other nonce values I got non-zero solution counts, and then trying nonce 0 again finally gave the expected 3 solutions. Retrying after some other tests gave 0 solutions again. You probably have an uninitialized variable somewhere.
Failing run:
$ time ./eqcuda -n 0
Looking for wagner-tree on ("",0) with 10 20-bits digits and 8192 threads (128 per block)
Digit 0
Digit 1
Digit 2
Digit 3
Digit 4
Digit 5
Digit 6
Digit 7
Digit 8
Digit 9
9 rounds completed in 3.900 seconds.
0 solutions
0 total solutions
real 0m5.344s
user 0m2.875s
sys 0m2.281s
Working run:
$ time ./eqcuda
Looking for wagner-tree on ("",0) with 10 20-bits digits and 8192 threads (128 per block)
Digit 0
Digit 1
Digit 2
Digit 3
Digit 4
Digit 5
Digit 6
Digit 7
Digit 8
Digit 9
9 rounds completed in 0.096 seconds.
3 solutions
3 total solutions
real 0m1.532s
user 0m0.081s
sys 0m1.265s
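Since the failure is intermittent, it may help to rerun the same command many times and count how often the result is wrong. Here is a minimal Python sketch of that idea. It assumes the solver prints its summary to stdout in the "N total solutions" form shown above; the helper names `total_solutions` and `count_failures` are made up for this post and are not part of the solver.

```python
import re
import subprocess

def total_solutions(output: str) -> int:
    """Return N from the 'N total solutions' line, or -1 if no such line exists."""
    match = re.search(r"^(\d+) total solutions$", output, flags=re.MULTILINE)
    return int(match.group(1)) if match else -1

def count_failures(cmd, runs=20, expected=3):
    """Run cmd `runs` times and count the runs that do not report `expected` solutions."""
    failures = 0
    for _ in range(runs):
        # Assumption: the summary goes to stdout, as in the pasted output above.
        result = subprocess.run(cmd, capture_output=True, text=True)
        if total_solutions(result.stdout) != expected:
            failures += 1
    return failures
```

For example, `count_failures(["./eqcuda", "-n", "0"])` would report how many of 20 runs at nonce 0 missed the expected 3 solutions.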
The 0.096 seconds reported by the working run would suggest 1.88/0.096 = 19.6 Sol/s, right? Per nvidia-smi, this runs on the Maxwell Titan X. The box also has an old Kepler Titan, but you don’t seem to have included an option to choose the CUDA device.
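Until the solver has its own device option, one possible workaround is to restrict which GPU the process can see through the CUDA runtime's `CUDA_VISIBLE_DEVICES` environment variable. The Python sketch below is only an illustration: the helper names are made up, and which index corresponds to the Kepler Titan on this box is something I have not checked. Setting `CUDA_DEVICE_ORDER=PCI_BUS_ID` is meant to make the indices match the order nvidia-smi shows.

```python
import os
import subprocess

def env_for_device(device_index: int) -> dict:
    """Copy the current environment, exposing only one GPU to the child process."""
    env = dict(os.environ)
    # Enumerate GPUs by PCI bus ID so the index matches nvidia-smi's listing.
    env["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    env["CUDA_VISIBLE_DEVICES"] = str(device_index)
    return env

def run_on_device(cmd, device_index):
    """Run cmd so that the chosen GPU is the only CUDA device it can see."""
    return subprocess.run(cmd, env=env_for_device(device_index),
                          capture_output=True, text=True)
```

Something like `run_on_device(["./eqcuda", "-n", "0"], 1)` would then run the solver on whichever card nvidia-smi lists as GPU 1 (a hypothetical index here).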
I also tried CPU runs. They work great on an i7-4770K, but scaling to 32 threads on the 2x E5-2670 in this “super” box is poor. Perhaps running several independent instances with fewer threads each (maybe just 1 thread per instance) would be faster. That would eat up more RAM, which is fine at least for testing, since there is 128 GB here. Feel free to experiment with this, too.
Edit: “-t 12288” (upping the CUDA thread count in proportion to the difference between GTX 980 and GTX Titan X) somehow makes eqcuda slightly slower, but speeds up feqcuda, which now gets the following (again not every time, only when it’s lucky):
$ time ./feqcuda -t 12288
Looking for wagner-tree on ("",0) with 10 20-bits digits and 12288 threads (128 per block)
Digit 0
Digit 1
Digit 2
Digit 3
Digit 4
Digit 5
Digit 6
Digit 7
Digit 8
Digit 9
9 rounds completed in 0.076 seconds.
3 solutions
3 total solutions
real 0m1.524s
user 0m0.070s
sys 0m1.328s
This is apparently 1.88/0.076 = 24.7 Sol/s.
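For reference, the two Sol/s estimates above can be reproduced with a few lines of Python. This sketch assumes, as the arithmetic above does, that 1.88 is the average number of solutions per run, and it divides by the solver-reported round time only, so the roughly 1.5 seconds of wall-clock time per invocation is not counted.

```python
AVG_SOLUTIONS_PER_RUN = 1.88  # figure used in the estimates above (assumed average per run)

def sols_per_second(solver_seconds: float,
                    avg_solutions: float = AVG_SOLUTIONS_PER_RUN) -> float:
    """Estimate solutions per second from the solver-reported time for one run."""
    return avg_solutions / solver_seconds

# Times taken from the "9 rounds completed in ..." lines of the working runs.
for label, seconds in [("eqcuda", 0.096), ("feqcuda -t 12288", 0.076)]:
    print(f"{label}: {sols_per_second(seconds):.1f} Sol/s")
```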