Utilizing all CPUs on Voxl2
-
Hi,
I have a question about utilizing all the cores available on the VOXL2. I’m using a C++ library that runs a multithreaded optimization using OpenMP and lets me specify the number of threads. Performance increases as I raise the thread count up to 4, but beyond that there is no further improvement. I would have expected performance to keep scaling up to 8 threads, since the VOXL2 has 8 CPUs. During my tests, voxl-inspect-cpu shows high utilization on CPUs 0-3 and very low utilization on CPUs 4-7, even when using 8 threads.
Could you shed some light on whether this is expected, and if so, how I could better utilize the last 4 CPUs? (I’m fairly confident that the optimization is resource limited: the same optimization problem runs faster on a laptop with more CPUs.)
Thanks!
-
@Mrunal-Sarvaiya Set the CPU mode to performance with voxl-set-cpu-mode and then try running your experiment. By default, some CPU cores are slower than the others, even in perf mode.
-
@Darshit-Desai Yup, all these tests were run with the cpu mode set to performance mode
-
@Mrunal-Sarvaiya , VOXL2 has three types of cores: 0-3 are low-power cores, 4-6 are medium, and core 7 is the fastest. It seems that OpenMP may not automatically understand how to use this type of CPU architecture. You may want to look into explicitly specifying which cores OpenMP should use (assigning each thread to a specific core). You may also need to provide a CPU type as a build flag.
The CPU on VOXL2 is the Kryo 585 (Snapdragon 865), which combines Cortex-A55 and Cortex-A77 cores. OpenMP may treat these as completely different CPUs and not "dare" to use the 4 faster cores by default.
Alex
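To see the core grouping described above on your own board, one option is to read each core's maximum frequency from the standard Linux cpufreq sysfs files. This is only a sketch: it assumes the kernel exposes cpuinfo_max_freq for each core, and the function name list_core_max_freqs is made up for illustration. Cores that report the same value most likely belong to the same cluster.

```shell
# Print the maximum frequency (in kHz) of each CPU core, where the kernel
# exposes it through the cpufreq sysfs interface. Cores without a readable
# cpuinfo_max_freq file are silently skipped.
list_core_max_freqs() {
    for cpu_dir in /sys/devices/system/cpu/cpu[0-9]*; do
        freq_file="$cpu_dir/cpufreq/cpuinfo_max_freq"
        if [ -r "$freq_file" ]; then
            # ${cpu_dir##*/} strips the path and leaves e.g. "cpu4"
            echo "${cpu_dir##*/}: $(cat "$freq_file") kHz"
        fi
    done
}

list_core_max_freqs
```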
-
@Alex-Kushleyev Thanks a ton! I was able to explicitly specify the cpus to use and the optimization is 3-4x faster.
Posting the commands needed here in case someone else stumbles upon this post. Export the following environment variables:
export OMP_PROC_BIND=close            # this may not be necessary
export GOMP_CPU_AFFINITY="4 5 6 7"    # here 4 5 6 7 specifies the cpus to use
-
@Mrunal-Sarvaiya , so you still used 4 cores, but just the more powerful ones? Were you able to make use of all 8 cores? (You would need 8 threads in your application.)
Alex
-
@Alex-Kushleyev Correct, I just used the last 4, more powerful CPUs. That was enough of a performance boost to run my optimization at the frequency I was hoping for. I didn't try using all 8 threads, but I can give it a shot if that's useful information for you.
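
For anyone who wants to try other core sets (including all 8), here is a rough sketch using the standard OpenMP variables OMP_PLACES and OMP_NUM_THREADS instead of the GCC-specific GOMP_CPU_AFFINITY. It assumes an OpenMP runtime that supports the OpenMP 4.0 affinity variables; the helper name omp_places_for is made up for illustration, and this has not been benchmarked on VOXL2.

```shell
# Build an OMP_PLACES string such as "{4},{5},{6},{7}" from a list of core
# IDs, so that each OpenMP place contains exactly one core.
omp_places_for() {
    omp_places=""
    for core in "$@"; do
        # Add a comma separator only when the string is already non-empty.
        omp_places="${omp_places:+$omp_places,}{$core}"
    done
    printf '%s\n' "$omp_places"
}

# Pin 4 threads to the faster cores, one thread per core.
export OMP_NUM_THREADS=4
export OMP_PLACES="$(omp_places_for 4 5 6 7)"
export OMP_PROC_BIND=close
```

To experiment with all 8 cores, the same helper can be called as omp_places_for 0 1 2 3 4 5 6 7 with OMP_NUM_THREADS=8. Whether that helps is an open question, since the slower cores may hold back a workload that splits work evenly across threads.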