O3AS Broken on RDNA3 (Linux, opencl-ati), Worked a Few Months Ago, Need Help Building Test Env

Paul
Joined: 3 May 07
Posts: 132
Credit: 1840411720
RAC: 388843
Topic 232049

A few months ago, I got new drivers, and O3AS, which had been my best-performing E@H app, started throwing errors within a few seconds of starting.  I recently tried it again, and the error looks the same as far as I can remember.  Here's an example:

https://einsteinathome.org/task/1722984570

But it used to work, so I'm confused.  This isn't entirely out of character for RDNA3, though.  At the same time, another E@H app (maybe MeerKAT?) started working when it never had worked before.  And there are yet other problems, like crashes, that I'm not even thinking about right now.

In any case, I could use some help troubleshooting it.  This seems like a good place to start because it's 100% reproducible and occurs very quickly.  I've reported some of the past problems upstream, but they say they cannot help without being able to debug the live code for themselves.  I pointed them to the source page, but neither they nor I can quite get started building a test environment.  I would appreciate it if someone could help me build a test environment, so I can document it and explain it to the AMDGPU devs so they can reproduce the problem.  I realize this will require a bit more interaction, but there's no hurry; we can work asynchronously here, by DM, or maybe by personal e-mail.  Whatever works.

ahorek's team
Joined: 16 Dec 05
Posts: 40
Credit: 249842315
RAC: 13230

It appears to be an issue related to incompatibility with Fedora's drivers.

https://einsteinathome.org/cs/content/all-sky-gravitational-wave-search-o3-v107-tasks-compilation-fail-ldlld-error-undefined-symbo

https://github.com/ROCm/ROCm/issues/3575

Only the developers can confirm whether there is a way around it in the code. You could try contacting Oliver Behnke.

Paul
Joined: 3 May 07
Posts: 132
Credit: 1840411720
RAC: 388843

Yeah, thanks.  I believe I have contacted the right people, but that's a new name.

tictoc
Joined: 1 Jan 13
Posts: 47
Credit: 7850834284
RAC: 6672035

There really shouldn't be any issues running O3AS on a 7900 XTX.

I see that you also have an A750 in that system.  Do you have mesa-libOpenCL installed?  There can be conflicts between the two OpenCL drivers.  
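One way to check for that kind of conflict is to look at which OpenCL drivers are registered with the ICD loader.  Below is a minimal Python sketch, assuming the usual Linux registration directory /etc/OpenCL/vendors; the function name list_opencl_icds is made up for illustration, and your distro may register drivers elsewhere as well.

```python
from pathlib import Path


def list_opencl_icds(vendors_dir="/etc/OpenCL/vendors"):
    """Return {icd_file_name: contents} for every *.icd file in vendors_dir.

    Each .icd file normally contains a single line naming the driver
    library that the OpenCL ICD loader should load.
    """
    registered = {}
    root = Path(vendors_dir)
    if not root.is_dir():
        # Assumption: a missing directory simply means nothing is registered here.
        return registered
    for icd in sorted(root.glob("*.icd")):
        registered[icd.name] = icd.read_text(errors="replace").strip()
    return registered


if __name__ == "__main__":
    for name, library in list_opencl_icds().items():
        print(f"{name}: {library}")
```

If more than one entry shows up for the same GPU (for example a Mesa entry next to a ROCm one), that would be a candidate for the conflict described above.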

Paul
Joined: 3 May 07
Posts: 132
Credit: 1840411720
RAC: 388843

Interesting suggestion.  I'm not sure why I don't, since I have everything else from Mesa, but no, I don't have mesa-libOpenCL installed.  I wonder if I ran into this conflict before, removed it, and forgot about it.

I think I finally see what ahorek's team was saying about the error.  I now see that it looks like a simple case of a missing function.  I'm not sure why O3AS is the only app that calls it, out of all the ones I run or have tried recently, but that is what the error suggests.

Paul
Joined: 3 May 07
Posts: 132
Credit: 1840411720
RAC: 388843

So, follow-up question: what is __printf_alloc()?  The suggestion above is that there is something wrong with Fedora.  But, that isn't a good explanation for the symptom.  I can find __printf_alloc() in both llvm libs and rocm-comgr.  So, it doesn't *seem* like it's missing.

It also looks like an internal call, so I'm really confused as to how any library could be built if it were missing internal calls.  Something doesn't add up.
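For anyone who wants to repeat that search, here is a rough Python sketch of it.  It only looks for the raw name as bytes inside each file, so a hit shows that the string appears somewhere in the library, not that it is an exported symbol the linker can use.  The function name files_containing and the directories in the example are assumptions; adjust them to wherever your LLVM and ROCm libs live.

```python
from pathlib import Path


def files_containing(symbol, directories, pattern="*.so*"):
    """Return paths of files matching pattern whose raw bytes contain symbol."""
    needle = symbol.encode()
    hits = []
    for directory in directories:
        root = Path(directory)
        if not root.is_dir():
            continue
        for path in sorted(root.glob(pattern)):
            if not path.is_file():
                continue
            try:
                if needle in path.read_bytes():
                    hits.append(str(path))
            except OSError:
                # Unreadable file (permissions, dangling symlink): skip it.
                continue
    return hits


if __name__ == "__main__":
    # Hypothetical search locations; not taken from the thread.
    for hit in files_containing("__printf_alloc", ["/usr/lib64", "/opt/rocm/lib"]):
        print(hit)
```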

So, I'm back to my original question.  Can someone please help me actually build a test environment for E@H?  It seems like the only way I can satisfy the people who are willing to help me is if I can actually figure out how to build and test this app for myself.

ahorek's team
Joined: 16 Dec 05
Posts: 40
Credit: 249842315
RAC: 13230

btw what is your glibc version?

ldd --version

__printf_alloc is included in glibc 2.34+, but Einstein apps usually include all libraries statically for compatibility with older systems. The app is simply searching for a function in a dynamic library that is either missing on your system or doesn't match the expected version.

Unfortunately, the current O3AS source doesn’t appear to be publicly available, so only the Einstein developers can assist. They may try building it with different flags, libraries, or configurations...

Paul
Joined: 3 May 07
Posts: 132
Credit: 1840411720
RAC: 388843

ldd (GNU libc) 2.40

Ah, okay, that is very helpful information about O3AS not being available.  That definitely is a problem for my current approach.

Other people I've asked say __printf_alloc is NOT in gcc libs.  I've looked for it myself, and I cannot find it.

Thank you for your continued help!  I have to run, but I'll do more digging later...

ahorek's team
Joined: 16 Dec 05
Posts: 40
Credit: 249842315
RAC: 13230

well, it could be a custom function, for instance:
https://github.com/stjordanis/ROCm-Device-Libs/blob/master/opencl/src/misc/printf.cl#L18

it's more likely because it crashes when building the OpenCL code
ld.lld: error: undefined symbol: __printf_alloc
Error: Creating the executable from LLVM IRs failed.
XLAL Error - XLALOpenCLGetProgramFromSource (/home/jenkins/workspace/workspace/EaH-GW-OpenCL-Testing/SLAVE/LIBC215/TARGET/linux-x86_64/EinsteinAtHome/source/lalsuite/lalpulsar/lib/GPUUtils/OpenCLUtils.c:705): clBuildProgram failed with OpenCL error: CL_BUILD_PROGRAM_FAILURE

unfortunately, this log doesn't show the OpenCL backtrace.
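if you want to collect the same information from several failed tasks, the undefined symbol names can be pulled out of the stderr text automatically. A small Python sketch, assuming the linker line always has the form shown in the log above (the helper name undefined_symbols is made up):

```python
import re

# Matches linker lines like: "ld.lld: error: undefined symbol: __printf_alloc"
UNDEFINED_SYMBOL = re.compile(r"ld\.lld: error: undefined symbol: (\S+)")


def undefined_symbols(log_text):
    """Return the distinct undefined symbol names, in order of first appearance."""
    names = []
    for match in UNDEFINED_SYMBOL.finditer(log_text):
        if match.group(1) not in names:
            names.append(match.group(1))
    return names


if __name__ == "__main__":
    sample = (
        "ld.lld: error: undefined symbol: __printf_alloc\n"
        "Error: Creating the executable from LLVM IRs failed.\n"
    )
    print(undefined_symbols(sample))
```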

I attempted to find references on https://git.ligo.org/lscsoft/lalsuite, but it appears that the current O3AS code is sourced from a different location. Without developers who have access to and are knowledgeable about the current codebase, there's not much we can do.

Paul
Joined: 3 May 07
Posts: 132
Credit: 1840411720
RAC: 388843

I appreciate your help!

Yes, that's what I was thinking, too.

You got me thinking about other versions of libs that might be on my system, like mesa.  So, I checked for LLVM and found several different versions installed.  Since I think __printf_alloc() might be in there, I removed some unnecessary LLVM libs.  I haven't tried O3AS again, yet, though.

I have to have more than one, because it seems Intel uses llvm-15, while ROCm is built against llvm-18, and llvm-19 is actually the current version of the standard llvm-libs pkg on F41, and several pkgs, like mesa-dri-drivers and mesa-libEGL, are built against it.  So, I like this line of thinking about conflicting libs being available at OCL compile time.
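To keep track of which LLVM versions are still installed after a cleanup like that, something along these lines could help.  It is only a sketch: the file-name patterns (libLLVM-15.so, libLLVM.so.19.1), the llvm*/lib subdirectory layout, and the default directories are assumptions about how the libs are packaged, and the function name is made up.

```python
import re
from pathlib import Path

# Assumed naming schemes: "libLLVM-15.so" or "libLLVM.so.19.1".
LLVM_LIB = re.compile(r"^libLLVM(?:-|\.so\.)(\d+)")


def installed_llvm_majors(lib_dirs=("/usr/lib64", "/usr/lib")):
    """Map each LLVM major version found to the library paths that carry it."""
    found = {}
    for directory in lib_dirs:
        root = Path(directory)
        if not root.is_dir():
            continue
        # Look in the directory itself and in compat subdirs such as llvm18/lib.
        for pattern in ("libLLVM*", "llvm*/lib/libLLVM*"):
            for path in sorted(root.glob(pattern)):
                match = LLVM_LIB.match(path.name)
                if match:
                    found.setdefault(int(match.group(1)), []).append(str(path))
    return dict(sorted(found.items()))


if __name__ == "__main__":
    for major, paths in installed_llvm_majors().items():
        print(major, *paths, sep="\n  ")
```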

I'll enable O3AS again, but I'm not hopeful.  I think we might be on the right track, but I'm not sure what to try/check, next.
