darkclown
Joined: 27 Sep 06
Posts: 5
Credit: 300356
RAC: 0
24 Aug 2019 23:23:41 UTC
Topic 219455
It's tough to tell, but do any of the CPU apps here take advantage of AVX-512? I've got one system with 2x AVX-512 units, and on other projects it tears it up compared to FMA3/AVX2.
I have wondered the same for
I have wondered the same for a while.
Short Answer: I did not see a significant speedup for v2.09 of the Gravity Wave application using AVX-512 hardware.
Detail: I have been running an Intel i7-9700 with AVX2 hardware for a while and had a good baseline for the range of times. Since it is currently summer, and given the RAM requirements of the current version, I was running 4 GW tasks at a time.
I recently upgraded to an Intel i7-11700 with AVX-512 hardware. Times decreased about 15% across the range when running the same 4 GW tasks as before. I don't have any way to actually tell whether the code used AVX-512 instructions, but I did not see the drastic speedup often quoted in AVX-512 benchmarks.
I believe three things account for the 15% speedup: 1) 4.6 GHz versus 4.8 GHz CPU clocks (by itself only about a 4% difference), 2) 12 MB versus 16 MB of L3 cache, and 3) general instruction performance improvements across two generations of CPU.
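On the "no way to tell" point: whether the hardware supports AVX-512 is at least easy to check, since GCC and Clang expose __builtin_cpu_supports for it (whether the app binary actually contains AVX-512 instructions is a separate question; disassembling it and searching for zmm registers would answer that). A minimal sketch, not taken from the project's code:

/* avx512check.c: report whether this CPU advertises AVX-512 Foundation.
 * This confirms hardware capability only, not that a given application
 * actually executes AVX-512 instructions.
 * Build: gcc -O2 -o avx512check avx512check.c
 */
#include <stdio.h>

int main(void)
{
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx512f"))
        printf("CPU supports AVX-512F\n");
    else
        printf("No AVX-512F support\n");
    return 0;
}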
Installed base of CPUs that
Installed base of CPUs that support AVX512 is very low, so it's probably not worth the extensive code adaptation necessary. Intel has also ditched AVX512 for its upcoming CPU generation, and no AMD CPUs support it.
Oh, and you should notice an increase in thermals, and your AVX512 CPU throttling down, when these instructions are actually run. AVX512 is very demanding.
Exard3k wrote: Oh and you
I remember reading some study where AVX2 was actually faster than AVX512, due to the heavy CPU throttling incurred when running AVX512.
Ian&Steve C. wrote: I
There are certainly applications that use AVX512 out there, and I've seen some Phoronix benchmarks where the Xeon W-3175X (Intel's 28-core flagship) outperforms 64-core Zen 2 Threadrippers. AVX512 is no joke if the application is built for it. But both the CPUs that support it and the applications optimized for it are niche.
Intel retains the instructions on its server CPUs (only big cores there; the Atom-derived E-cores can't handle it), and AMD plans to include it in future Epyc generations and maybe TR Pro.
Ian&Steve C. wrote: Exard3k
That shouldn't be the case unless there's something wrong, and I don't recall seeing any results where it was (maybe if you've got a weird mix of scalar and AVX code, or are thrashing the memory subsystem?). AVX-512 normally runs at a bit more than two-thirds of the normal clock speed, but it is 2x as fast per cycle as AVX-256, which works out to around a 50% speedup. And because CPUs are more efficient at lower clock rates, the roughly 50% reduction in power needed to stay inside the thermal envelope should cost a good bit less than a 50% drop in clock rate.
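To put rough numbers on the width argument: one 512-bit instruction processes 16 single-precision floats where a 256-bit AVX2 instruction processes 8, so even at roughly three-quarters of the clock the wider path nets out ahead (2 x 0.75 = 1.5x). A minimal intrinsics sketch, a hypothetical standalone demo rather than anything from the Einstein@Home apps:

/* avx_width.c: the same 16-float add done as two 256-bit operations
 * versus one 512-bit operation.
 * Build (needs an AVX-512F CPU to run): gcc -O2 -mavx2 -mavx512f avx_width.c
 */
#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    float a[16], b[16], c[16];
    for (int i = 0; i < 16; i++) { a[i] = (float)i; b[i] = 1.0f; }

    /* AVX2: 8 floats per instruction, so 16 floats take two adds */
    _mm256_storeu_ps(c,     _mm256_add_ps(_mm256_loadu_ps(a),     _mm256_loadu_ps(b)));
    _mm256_storeu_ps(c + 8, _mm256_add_ps(_mm256_loadu_ps(a + 8), _mm256_loadu_ps(b + 8)));

    /* AVX-512: all 16 floats in one add -- twice the work per instruction */
    _mm512_storeu_ps(c, _mm512_add_ps(_mm512_loadu_ps(a), _mm512_loadu_ps(b)));

    printf("c[15] = %.1f\n", c[15]);  /* 15 + 1 = 16.0 */
    return 0;
}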
As much as I'd like to have
As much as I'd like to have AVX512 (it's not simple to work with, judging by some rather intimidating instruction tables), I'd prefer to see AVX2 on FGRP tasks first; the GW app is 5 years newer and already has AVX2 built in, according to the wiki and a look at the runtimes.
But I'd really like to see a full-blown Zen4 EPYC Genoa crunching 1,000,000 RAC on its own in 2023. That is totally possible with good AVX512 code.
I have taken a look at the
I have taken a look at the GPU code (OpenCL) of this project. I do not know the order or the frequency of the GPU kernel calls, but I do know what the kernels do.
To make use of AVX512 (or whatever instruction set, or a new GPU), the processor or the compiler must be really good at 32-bit floating-point arithmetic and at parallelising the code, must make efficient memory accesses with both reads and writes of stride 6*4 or 8*4 bytes, and must have an exceptionally good FFT library.
Not even a hand-coded cache management scheme can fix a bad choice of data arrangement. As an exercise in memory and thought: how would you lay out an array of structs or a struct of arrays full of elements like {int32 x; int32 y; int32 z;} most efficiently when the writes or reads are done in parallel?
The array of structs is bad, bad and bad... It always leads to a strided memory access pattern (you touch only every n-th memory location, losing both memory and cache bandwidth), as the sketch below shows.
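A minimal sketch of that exercise, with hypothetical names rather than anything from the project's code: with an array of structs every x sits 12 bytes from the next, so a pass over x alone wastes two thirds of each cache line, while a struct of arrays turns the same pass into one dense, vectorizable stream.

/* aos_vs_soa.c: summing the x components under both layouts. */
#include <stdint.h>
#include <stdio.h>

#define N 1024

/* Array of structs: consecutive x values are 12 bytes apart (strided) */
struct point { int32_t x, y, z; };
static struct point aos[N];

/* Struct of arrays: each component is contiguous (unit stride) */
static struct { int32_t x[N], y[N], z[N]; } soa;

static int64_t sum_x_aos(void)
{
    int64_t s = 0;
    for (int i = 0; i < N; i++)
        s += aos[i].x;   /* touches only every 3rd int32 in memory */
    return s;
}

static int64_t sum_x_soa(void)
{
    int64_t s = 0;
    for (int i = 0; i < N; i++)
        s += soa.x[i];   /* dense stream; compilers vectorize this cleanly */
    return s;
}

int main(void)
{
    for (int i = 0; i < N; i++) { aos[i].x = i; soa.x[i] = i; }
    printf("%lld %lld\n", (long long)sum_x_aos(), (long long)sum_x_soa());
    return 0;
}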
--
petri33