An Einstein Schizoid Embolism?

Hello-Urgo

Joined: 6 Feb 21

Posts: 3

Credit: 800430900

RAC: 0

19 Jan 2022 9:23:09 UTC

Topic 226801


Now here's the real question. How is it possible that you get WUs you haven't subscribed to?

Since joining Einstein, I've had problems getting GW WUs to run smoothly and reliably on just one of my workstations. That HW config is:

AMD Ryzen 9 3900XT 12-Core Processor
[2] NVIDIA GeForce GTX 1070 Ti
Windows 10 Enterprise x64 Edition
BOINC client version:7.16.11
Memory:65460.88 MiB
Air-cooled system
With all 24 logical CPU cores and all GPU cores at 100%, the max CPU temp averages around 57C.
I run my client with web-based profiles enabled.

I have other apps that task the HW just as much and they don't produce errors. Adobe Premiere Pro, for example, runs all CPU and GPU cores @100% while rendering, sometimes for 6-8 hours, and it never craps out on me.

So I started trying to limit GW WUs from downloading, and the whole experience has been so frustratingly difficult that I've considered going to another project, which I don't want to do because I think this is seriously important cutting-edge research. Rather than go through the list of weirdness I've seen, I'll just address where things currently stand, which BTW I have repeated a half dozen times over the last year with essentially the same outcome each time.

4-5 days ago, my current profile was set to download 1.5 days of work for FGRP#5 and FGRPB1G#1 and had been doing so reliably for a couple of months. Then I tried again to tick GW O2 Multi-Directional GPU and Gravitational Wave search O3 All-Sky. When the WUs came down, there were in fact 7 days of work for each CPU/GPU core in my system, in other words more than 4x what I asked for. When they began running, all ran at High Priority immediately after downloading, and the GPU WUs would only use 1 GPU, leaving the other idle. Then the time-outs and errors started, and because the existing WUs in my local queue were displaced by the downloaded GW WUs, they were going to time out as well.

So I changed my web profile back to FGRP#5 and FGRPB1G#1 only, aborted the GW WUs in the queue that had displaced everything else (otherwise they kept running at High Priority even though they showed another 6 days until they expired), and waited for the client to update. When it did, it downloaded only GW WUs and FGRP#5 WUs for CPUs, and again gave me 7 days of WUs for each core when I had only asked for 1.5.

I changed my web profile to download 0.1 days of work and repeated the steps above. Finally, I got just 7 WUs at a time, but they were still GW WUs, which are not checked in my web profile.

That was two days ago. Checking the server status indicates there are plenty of FGRPB1G WUs that still need to be processed; I'm just not getting any of them. For whatever reason, Einstein keeps pushing GW WUs that are not ticked in my profile, and if 10% of those are going to error out or displace other WUs, then I'd just as soon steer clear of them.

As I said earlier, I've tried the steps above many times over the last year and each time the outcome is about the same. I literally have to leave the project for a few weeks, let all WU's complete as they normally do, crunch numbers for some other project for a few weeks, then come back to Einstein and try again.

I'm an IT professional, so I've tried a number of suggestions made on this site by others, and so far nothing has worked reliably every time. It's almost as though the WUs sent my way are in fact configured to run the way they do despite the settings I declare, either in the web profile or via an app_config.xml.

The thing is, I don't want thousands of WUs that run and then time out or error out at the last minute when I only asked for a couple hundred. Nor do I want WUs to jump to running at High Priority as soon as they download. If I can't contribute WUs that get the job done accurately and "safely on my HW", what's the point in sharing my HW and compute cycles? And in case you're wondering, this is not about getting max credits in a race to get to the top; it's about getting reliable results consistently. For reasons I have yet to figure out, GW WUs are problematic on my hardware, and Einstein doesn't seem able to do the basic math necessary to limit downloads to 1 day's worth of work.

At one point in the past, while troubleshooting this issue, I set my web profile to download 4 days of work. When I checked on it the next day, I had enough WUs to run each CPU and GPU core for more than a month. There were something like 2,900 WUs downloaded. That's just nuts!!

Maybe what I need at this point is a nuclear option that zaps everything back to the original defaults, and then some guidance on how to avoid falling into this dilemma again, if that's even possible. If so, I'm open to suggestions.

Hello-Urgo

Joined: 6 Feb 21

Posts: 3

Credit: 800430900

RAC: 0


19 Jan 2022 11:46:34 UTC

Message 191893


After posting this, I placed two more systems on the same web profile. After a couple of hours, those two did the same thing: they downloaded only GW WUs when none were selected in the profile.

Thinking this sounded like operator error as opposed to anything else, I made a couple more changes to the web profile: I limited WUs to the Gamma-ray pulsar binary search #1 (GPU) only, set "Request CPU-only tasks from this project" to NO, and set "Run CPU versions of applications for which GPU versions are available" to NO. Then I waited. An hour later, all 3 systems were downloading Gamma-ray pulsar binary search #1 (GPU) again, solving 1 of 2 problems.

The 2nd problem, downloading more WUs than a system can complete in the allotted time, is still happening. One of the systems that switched to the updated profile had 234 FGRP#5 WUs in the queue. After reading the profile again, it downloaded another 700-plus WUs for a total of 956. At 14.25 hours per WU, on a system with 11 cores dedicated to that task, that's 51.6 days of work per core, when you're only allotted 6 days for the work. This is with a profile setting of 0.1 days of work. I don't see any settings that could contribute to this kind of math error.
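The backlog figure can be checked with a few lines of Python. This is only a sketch of the arithmetic using the numbers quoted in this post (956 WUs, 14.25 hours each, 11 cores, a 6-day deadline); the function name is made up for illustration, and it assumes every core runs one WU at a time around the clock.

```python
def days_of_work_per_core(task_count, hours_per_task, cores):
    """Return the queued work, in days, that each core has to get through."""
    total_hours = task_count * hours_per_task
    return total_hours / cores / 24


# Numbers taken from the post: 956 queued WUs, 14.25 h each, 11 cores.
queued = days_of_work_per_core(956, 14.25, 11)
deadline_days = 6

print(f"{queued:.1f} days of work per core")          # 51.6 days of work per core
print(f"{queued / deadline_days:.1f}x the deadline")  # 8.6x the deadline
```

So the queue holds roughly eight and a half times what the machine could finish before the deadline.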

Maybe I have misinterpreted these settings in the past and I will Google a bit more to see if I can get a more detailed explanation. In the meantime, if anyone would like to share their interpretation of the settings I've posted here, please do.

Which one of these was allowing GW WUs to be downloaded, when they hadn't been selected in the profile?

What settings could be causing Einstein to download 8x the number of WUs a system can complete, in the time allowed, usually 6-7 days?

mikey

Joined: 22 Jan 05

Posts: 12656

Credit: 1839052161

RAC: 4469


19 Jan 2022 12:03:09 UTC

Message 191896 in response to message 191893


Carter9304 wrote:

After posting this, I placed two more systems on the same web profile. After a couple of hours, those two did the same thing: they downloaded only GW WUs when none were selected in the profile.

Thinking this sounded like operator error as opposed to anything else, I made a couple more changes to the web profile: I limited WUs to the Gamma-ray pulsar binary search #1 (GPU) only, set "Request CPU-only tasks from this project" to NO, and set "Run CPU versions of applications for which GPU versions are available" to NO. Then I waited. An hour later, all 3 systems were downloading Gamma-ray pulsar binary search #1 (GPU) again, solving 1 of 2 problems.

The 2nd problem, downloading more WUs than a system can complete in the allotted time, is still happening. One of the systems that switched to the updated profile had 234 FGRP#5 WUs in the queue. After reading the profile again, it downloaded another 700-plus WUs for a total of 956. At 14.25 hours per WU, on a system with 11 cores dedicated to that task, that's 51.6 days of work per core, when you're only allotted 6 days for the work. This is with a profile setting of 0.1 days of work. I don't see any settings that could contribute to this kind of math error.

Maybe I have misinterpreted these settings in the past and I will Google a bit more to see if I can get a more detailed explanation. In the meantime, if anyone would like to share their interpretation of the settings I've posted here, please do.

Which one of these was allowing GW WUs to be downloaded, when they hadn't been selected in the profile?

What settings could be causing Einstein to download 8x the number of WUs a system can complete, in the time allowed, usually 6-7 days?

Einstein has no clue that you only allow 11 CPU cores to be used for Einstein, because the BOINC client doesn't tell it that. So you get a ton of tasks because Einstein thinks you want enough tasks to fill your cache for all 24 CPU cores. The easiest answer is to set a zero resource share for the venue this PC is on; that way it only gets tasks as needed instead of filling up a cache of tasks you can't possibly finish before the deadline. Then, once you are getting the tasks you want to run and want a bigger cache, you can raise the resource share a little bit at a time.
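To put rough numbers on the core-count mismatch, here is a small Python sketch. It uses a deliberately simplified model that is an assumption for illustration, not the actual BOINC work-fetch logic: the cache is treated as cache_days of wall-clock time on every core, divided by a fixed runtime per task. The function name is hypothetical; the 0.1-day cache, 14.25-hour runtime, and 24 versus 11 cores come from the posts above.

```python
def tasks_to_fill_cache(cache_days, cores, hours_per_task):
    """Simplified estimate (an assumption, not BOINC's real algorithm):
    how many tasks keep `cores` busy for `cache_days` days."""
    return cache_days * 24 * cores / hours_per_task


# Values from the thread: 0.1-day cache, 14.25 h per FGRP#5 task.
assumed_by_server = tasks_to_fill_cache(0.1, 24, 14.25)  # server assumes 24 cores
actually_usable = tasks_to_fill_cache(0.1, 11, 14.25)    # only 11 cores run Einstein

ratio = assumed_by_server / actually_usable
print(f"request is inflated by a factor of {ratio:.2f}")
```

Under this model the request scales linearly with the core count, so assuming 24 cores instead of 11 inflates it by 24/11.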

Harri Liljeroos

Joined: 10 Dec 05

Posts: 4310

Credit: 3187334892

RAC: 1979985


19 Jan 2022 12:11:32 UTC

Message 191898


There is a bug in the BOINC client that makes it request tasks again and again if you have a max_concurrent setting in your app_config.
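A quick way to see whether a given app_config.xml sets such a limit is to parse it. The Python sketch below is only illustrative: the sample file content and the app name example_cpu_app are made up, and the helper simply lists any max_concurrent or project_max_concurrent elements it finds. Whether the project-wide element is affected by the same bug is not established in this thread.

```python
import xml.etree.ElementTree as ET

# Hypothetical app_config.xml content; the app name is a placeholder,
# not a real Einstein@Home application name.
SAMPLE_APP_CONFIG = """
<app_config>
  <app>
    <name>example_cpu_app</name>
    <max_concurrent>11</max_concurrent>
  </app>
  <project_max_concurrent>12</project_max_concurrent>
</app_config>
"""


def find_max_concurrent(xml_text):
    """Return (scope, limit) pairs for every concurrency limit in the file."""
    root = ET.fromstring(xml_text)
    found = []
    for app in root.findall("app"):
        limit = app.findtext("max_concurrent")
        if limit is not None:
            found.append((app.findtext("name", default="?"), int(limit)))
    project_limit = root.findtext("project_max_concurrent")
    if project_limit is not None:
        found.append(("<project>", int(project_limit)))
    return found


limits = find_max_concurrent(SAMPLE_APP_CONFIG)
for scope, limit in limits:
    print(f"{scope}: at most {limit} concurrent tasks")
```

To check a real file, read its contents from the project directory and pass the text to find_max_concurrent; an empty list means no concurrency limit is set there.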

Gandolph1

Joined: 20 Feb 05

Posts: 180

Credit: 389633764

RAC: 525


19 Jan 2022 21:11:12 UTC

Message 191906


I wonder if adding a command line option to the App_Config file would work?

"--fetch_minimal_work"

Fetch only enough jobs to use all device instances (CPU, GPU). Used with --exit_when_idle, the client will use all devices (possibly with a single multicore job), then exit when this initial set of jobs is completed.

It doesn't appear that you are required to use the "--exit" option...

Gandolph1

Joined: 20 Feb 05

Posts: 180

Credit: 389633764

RAC: 525


19 Jan 2022 21:13:41 UTC

Message 191907


Just wanted to add - Mine seems to be doing the same thing BEFORE I even had the app_config file, that's why I had CPU tasks shut off.

Keith Myers

Joined: 11 Feb 11

Posts: 4960

Credit: 18651232247

RAC: 5534890


19 Jan 2022 21:24:59 UTC

Message 191908


The OP should update to BOINC version 7.16.20 which includes the fix for Issue#4592 max_concurrent scheduling bug.

Harri Liljeroos

Joined: 10 Dec 05

Posts: 4310

Credit: 3187334892

RAC: 1979985


19 Jan 2022 21:51:41 UTC

Message 191910 in response to message 191908


Keith Myers wrote:

The OP should update to BOINC version 7.16.20 which includes the fix for Issue#4592 max_concurrent scheduling bug.

Does it contain that? 7.16.20 was published in October 2021 and the fix was made in December 2021. Also, I remember Richard Haselgrove posting some time ago that this fix isn't yet in any published BOINC version. Sorry if I am wrong. I don't want to give wrong information.

Gandolph1

Joined: 20 Feb 05

Posts: 180

Credit: 389633764

RAC: 525


19 Jan 2022 23:45:54 UTC

Message 191913


I have my "Store at Least" set to 0.2 and my "Store additional" set to 0.1. I'm using BOINC client v7.16.20 and it still downloaded HUNDREDS. There is no way they will be completed in time. If I can't fix this, I guess I will have to deselect it again. For those with no GPU on a machine, I'm not sure how you manage it.

Another strange thing is my Intel system doesn't appear to be doing this and I have the clients setup the same...

Keith Myers

Joined: 11 Feb 11

Posts: 4960

Credit: 18651232247

RAC: 5534890


19 Jan 2022 23:51:32 UTC

Message 191914 in response to message 191910


The Master commit list shows Issue#4592 merged into the Master codebase on December 7, 2021

Master branch commit list

Yes, you are correct. I was mistaken thinking that the 7.16.20 release was based off the Master.

So you would need to either build the Master yourself or grab one of the artifacts built after December 7, 2021.

The December 18, 2021 artifact contains the Issue#4592 fix.

December 18, 2021 artifact builds

Gandolph1

Joined: 20 Feb 05

Posts: 180

Credit: 389633764

RAC: 525


20 Jan 2022 2:03:44 UTC

Message 191920


Giving it a try right now. Running v7.19.0
