WU become longer and longer

Rainer Baumeister
Joined: 21 Dec 20
Posts: 3
Credit: 9,018,486
RAC: 8
Message 2075 - Posted: 23 May 2023, 16:27:22 UTC
Last modified: 23 May 2023, 17:02:42 UTC

Dear Team, (translated with DeepL.com)
For the past few days the remaining time of the tasks has been "extrapolated" upward.
Instead of decreasing, the remaining time keeps getting longer and longer.

After a restart, things run normally for a few hours, and then it starts all over again :-(

By now a task has been computing for 4 days and its remaining time is over 6 days, and increasing.

What is going wrong here?

Greetings, Rainer

Edit:

I just found out that it is enough to suspend the tasks in BOINC Manager and then switch back to "Run always".
But of course that is not a permanent solution.
I have several computers and can put my time to better use...
Greetings, Rainer
hoarfrost
Volunteer moderator
Project administrator
Project developer

Joined: 11 Oct 20
Posts: 328
Credit: 24,162,301
RAC: 12,493
Message 2076 - Posted: 24 May 2023, 5:48:05 UTC - in response to Message 2075.  

Hello! Some workunits really can take very long, because of the properties of the compounds and how they are distributed into packages. For example, on my Ryzen, in one unusual task one compound was processed in 111 seconds, but the next took 2338 seconds. If a task includes more "long" compounds than usual, it takes longer.
Also, the time needed for modeling a compound and docking it to the target varies from target to target.

If you see an unusual task, simply check the "CPU time since last checkpoint" property in BOINC Manager. If it shows hours on a computer with a "Ryzen-class" CPU, you can check the last-modified time (or the contents) of the file docking_out.log in the task's slot directory. If the file is still changing, the computation is fine.
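The check described above can be scripted. This is only a minimal sketch in Python, not a project tool: the function name `log_ages` and the slots path in the usage part are assumptions for illustration, and the path must be adjusted to your own BOINC data directory.

```python
import time
from pathlib import Path


def log_ages(slots_dir, log_name="docking_out.log", now=None):
    """Return {slot directory name: seconds since log_name was last modified}."""
    now = time.time() if now is None else now
    ages = {}
    # Each running task has its own numbered sub-directory under "slots".
    for log in Path(slots_dir).glob(f"*/{log_name}"):
        ages[log.parent.name] = now - log.stat().st_mtime
    return ages


if __name__ == "__main__":
    # Assumed location of the BOINC slots directory; adjust for your installation.
    slots = Path(r"C:\ProgramData\BOINC\slots")
    if slots.is_dir():
        for slot, age in sorted(log_ages(slots).items()):
            print(f"slot {slot}: docking_out.log last changed {age / 60:.1f} min ago")
```

A log that was modified recently suggests the task is still making progress, even if the remaining-time estimate in BOINC Manager looks wrong.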

P.S. I also see on my computer that tasks have become longer in the last 2 days. Don't panic, that is normal. :)
Rainer Baumeister
Joined: 21 Dec 20
Posts: 3
Credit: 9,018,486
RAC: 8
Message 2078 - Posted: 24 May 2023, 6:26:05 UTC

Moin, (translated with DeepL.com)

No, that reply does not address my post:
we are talking past each other.

I am not talking about long run times as such, but about the remaining time rising drastically!
The last time I looked, the remaining time showed more than 420 days and kept jumping higher.

I have been crunching for BOINC since 2006 and know about the possible long run times.
Greetings, Rainer
hoarfrost
Volunteer moderator
Project administrator
Project developer

Joined: 11 Oct 20
Posts: 328
Credit: 24,162,301
RAC: 12,493
Message 2079 - Posted: 24 May 2023, 11:10:20 UTC - in response to Message 2078.  

... I am not talking about long run times as such, but about the remaining time rising drastically!
The last time I looked, the remaining time showed more than 420 days and kept jumping higher. ...

Yes, a predicted time like this is abnormal. Could you post some information about the computer (or its host ID) and about one or two of the strange tasks: elapsed time; estimated time; the contents of "docking_out.log" from the slot directory; the time "docking_out.log" was last changed; and the current time.

Thank you!
Rainer Baumeister
Joined: 21 Dec 20
Posts: 3
Credit: 9,018,486
RAC: 8
Message 2080 - Posted: 24 May 2023, 18:28:42 UTC - in response to Message 2079.  

Sorry, DeepL.com, which I use for translation, is down.
My English is very poor.

My computer IDs are 41159, 41160, etc.
All WUs have been cancelled, so I have no info.
My hardware: Ryzen 3700X with 32 GB RAM, stock, no OC;
1700 with 32 GB, stock.
It happens on any hardware.
docking_out.log is 0 bytes.
Greetings, Rainer
hoarfrost
Volunteer moderator
Project administrator
Project developer

Joined: 11 Oct 20
Posts: 328
Credit: 24,162,301
RAC: 12,493
Message 2081 - Posted: 29 May 2023, 21:09:41 UTC - in response to Message 2080.  

I looked at the data for the tasks computed by your hosts. The successful tasks were completed in the usual time. If an anomaly really occurred, then I think it was necessary to restart BOINC or the computer. You can also compare with my computer.
jurma

Joined: 15 Jul 23
Posts: 3
Credit: 870,924
RAC: 3,488
Message 2094 - Posted: 19 Jul 2023, 16:28:52 UTC - in response to Message 2075.  

I have something similar: one task shows 344 days and another 103 days to complete, although they're supposed to be completed by the 29th of this month. Is there an error in the WUs, I wonder?
hoarfrost
Volunteer moderator
Project administrator
Project developer

Joined: 11 Oct 20
Posts: 328
Credit: 24,162,301
RAC: 12,493
Message 2095 - Posted: 20 Jul 2023, 6:02:27 UTC - in response to Message 2094.  

Hi! :)
I have something similar: one task shows 344 days and another 103 days to complete, although they're supposed to be completed by the 29th of this month. Is there an error in the WUs, I wonder?

How much time has this task already consumed, and what progress has it reached?
Thorak

Joined: 13 Sep 21
Posts: 9
Credit: 758,030
RAC: 0
Message 2103 - Posted: 2 Aug 2023, 20:14:56 UTC - in response to Message 2095.  
Last modified: 2 Aug 2023, 20:22:47 UTC

I'll make it simple: I am cancelling the project. I don't accept a project that prolongs deadlines. That means they don't know what they are doing, the time-slot window is wrong, and they forget that a PC can be used for other things.
This is what happens when you micromanage things and have no knowledge of human behavior.
Of course, I'm sure you buy a PC only to run the client and have no use for it to do other things.

P.S.
Any WU longer than 24 hours, which takes over 3 days when the work runs offline only during sleep time, should start to raise an issue for a normal user.

Good luck with it.
I have shut this project down as a user until they change the WU time-slot window. Good luck, and send a message to the client so that I open up for traffic again; I will not bother to read here or waste my PC resources on bad management.

Any pause, reboot, or client restart will cause even more prolonged WU times. This has been seen before in other time-index screwups and lost saves; the lack of options in BOINC Manager is nothing new.
kotenok2000
Joined: 7 Nov 20
Posts: 15
Credit: 478,270
RAC: 275
Message 2114 - Posted: 4 Oct 2023, 4:57:41 UTC

C:\programdata\BOINC\slots\#\docking_log should display a more accurate estimated time to completion.
Covid killer

Joined: 27 Aug 23
Posts: 4
Credit: 123,198
RAC: 1
Message 2119 - Posted: 29 Oct 2023, 17:20:30 UTC

I have experienced similar issues. Usually a WU takes about 12 hours to complete. Some tasks not only take waaaay longer, but the completion % drops and the estimated completion time is completely inaccurate.

It's honestly put me off running this project, but I will give it another shot today.
Speedy51

Joined: 11 Feb 21
Posts: 17
Credit: 464,084
RAC: 658
Message 2120 - Posted: 5 Nov 2023, 20:51:00 UTC - in response to Message 2114.  
Last modified: 5 Nov 2023, 20:52:17 UTC

C:\programdata\BOINC\slots\#\docking_log should display a more accurate estimated time to completion.

In some cases I have found that BOINC client 7.24.1 shows a more realistic completion time than the log in the slot directory. This is on Windows 11 Pro.
adrianxw
Joined: 4 Nov 20
Posts: 23
Credit: 2,852,461
RAC: 3,258
Message 2215 - Posted: 8 Mar 2024, 8:32:18 UTC
Last modified: 8 Mar 2024, 8:34:19 UTC

I have three (possibly more...) jobs with a problem. All had run for over a day and had been running normally. Because of a printer issue, I had to power the system down and up again. After doing so, I now see that one has 169 days remaining, one 232 days and the other 316 days, all rapidly increasing. I've suspended them for now. Others that were running appeared to start again, but watching them, the remaining time looks to be in a short cycle. I have suspended all but one of them also. If it doesn't change for a while, I'll suspend that one as well. It would seem you have an issue with tasks that are running when a power cycle is necessary...

That is several CPU days wasted. No new tasks set.
Wave upon wave of demented avengers march cheerfully out of obscurity into the dream...
Brian Nixon

Joined: 10 Feb 21
Posts: 21
Credit: 4,828,469
RAC: 1,001
Message 2216 - Posted: 9 Mar 2024, 9:52:08 UTC - in response to Message 2215.  

There are a couple of known bugs in the CmDock science app:

  • On Windows, BOINC doesn’t know about its checkpoints. After a restart, the percentage complete usually recovers after a few minutes, but the previous elapsed CPU time is lost and so the BOINC credit will be lower. The problem has been fixed at source, though the project hasn’t updated its application yet.

  • If a task resumes from a checkpoint but gets restarted before it has saved another checkpoint, it might restart from the beginning or report completion and fail validation. (Also mentioned in threads here and here.) The bug has been reported.

Mad_Max

Joined: 27 Dec 21
Posts: 18
Credit: 13,669,242
RAC: 25,647
Message 2299 - Posted: 22 Jul 2024, 3:55:03 UTC - in response to Message 2216.  

There are a couple of known bugs in the CmDock science app:

  • On Windows, BOINC doesn’t know about its checkpoints. After a restart, the percentage complete usually recovers after a few minutes, but the previous elapsed CPU time is lost and so the BOINC credit will be lower. The problem has been fixed at source, though the project hasn’t updated its application yet.

Yes, although the bug itself was identified and fixed long ago (more than a year ago), the problem remains that the fixed code is not used in the actual application sent to BOINC clients here. It has not been updated for more than 1.5 years:
Platform: Microsoft Windows running on an AMD x86_64 or Intel EM64T CPU
Version: 2.02
Created: 21 Jan 2023, 16:49:05 UTC
Average computing: 24,051 GigaFLOPS

I even made a workaround for this problem on all of my Windows machines working for SiDock (currently 4 of them), without waiting for the official fix.
Here is a description of it; maybe it will be useful to someone, because you don't want to wait another year or more for an official fix.
Since the checkpoint files themselves are written correctly and the problem is caused only by the LACK of a timestamp update when the checkpoint is written, I wrote a simple script that regularly does exactly this: it updates the modification date of all checkpoint files.
The code looks like this (Windows CMD batch file):
for /R "D:\Boinc\Data\slots\" %%a in (docking*.chk) do touch --no-create %%a

This one-line script does two things:
1 - recursively scans all sub-folders of the BOINC "slots" directory and finds all docking*.chk files, which store the CmDock checkpoints
2 - calls the "touch" CLI utility on them, which updates the timestamp of the specified file to the current date and time without changing the file's contents.

"touch" is a standard *nix utility, but a long time ago I added it to my Windows machines along with some other handy CLI tools (like "head" and "tail") from the GnuWin32 package: https://gnuwin32.sourceforge.net/packages/coreutils.htm

After that, I set this script to run on a schedule every 10 minutes. The interval is chosen fairly arbitrarily: long enough for the task to have time to write at least one more checkpoint (although even if it does not, no problems will result), yet short enough that the execution time "lost" by a WU on restart is minimal.
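For machines without the GnuWin32 tools, the same idea can be sketched in Python. This is not the script described above, only an assumed equivalent: the function name `touch_checkpoints` is made up for illustration, and the slots path is the one from the batch line and must match your own installation.

```python
import os
from pathlib import Path


def touch_checkpoints(slots_dir, pattern="docking*.chk"):
    """Set the modification time of every existing checkpoint file to now.

    Like `touch --no-create`: files are only updated, never created,
    and their contents are left untouched. Returns the touched paths.
    """
    touched = []
    # rglob walks all sub-folders of the slots directory recursively.
    for chk in Path(slots_dir).rglob(pattern):
        if chk.is_file():
            os.utime(chk)  # times=None: access and modification time become "now"
            touched.append(chk)
    return touched


if __name__ == "__main__":
    # Path taken from the batch example; adjust to your BOINC data directory.
    slots = Path(r"D:\Boinc\Data\slots")
    if slots.is_dir():
        for path in touch_checkpoints(slots):
            print("touched", path)
```

Like the batch version, it would have to be run periodically (for example from Windows Task Scheduler) to have any effect.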

  • If a task resumes from a checkpoint but gets restarted before it has saved another checkpoint, it might restart from the beginning or report completion and fail validation. (Also mentioned in threads here and here.) The bug has been reported.

Yes, I can confirm this bug too. And I hate it!

Specifically, for me the task does not restart from scratch; instead, for some reason it suddenly decides that its calculation is already complete (although it is NOT), packs the results (which are not yet fully formed) and sends them to the server. Naturally, during validation the server marks ALL such tasks as invalid and sends them to someone else for recalculation.

Fresh examples of tasks that failed (marked as invalid) due to this bug:
https://www.sidock.si/sidock/result.php?resultid=82293580
https://www.sidock.si/sidock/result.php?resultid=82283666
https://www.sidock.si/sidock/result.php?resultid=82282278
https://www.sidock.si/sidock/result.php?resultid=82287949

What makes the 2nd problem even worse (and is perhaps its original cause) is that there is also a 3rd bug with checkpoints.
For some reason, the application writes checkpoints shifted back by 2-3 targets. Current standard WUs process 500 target substances, and a checkpoint is written after each substance has been fully calculated, i.e. 500 times over the entire WU calculation, regardless of how much real time it takes.

But for some reason the checkpoint is written with a backward shift.
Let's say the application has finished processing the 250th substance and starts calculating the 251st; at this point it records in the checkpoint that the calculation is complete only up to the 248th.
As a result, if the application is restarted again shortly after a previous restart, the checkpoints may even move backward!
Taking the example above: if the application is restarted while calculating the 251st substance, then after the restart it will not calculate the 251st again, as would be correct, but will start from the 249th. And when that one completes, the first checkpoint written will indicate that the calculation is complete up to the 247th. I.e., the checkpoint has moved back (from the 248th to the 247th).
Perhaps this causes the 2nd problem.
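The backward drift can be reproduced with a toy model. This is only a sketch of the behavior described above, not CmDock code: the function names are invented, and the lag of 2 is an assumption taken from the 250 -> 248 example (the observed shift is 2-3 targets).

```python
LAG = 2  # assumption from the example: finishing the 250th substance records "248"


def checkpoint_after_finishing(n, lag=LAG):
    """Value written to the checkpoint once substance n has been completed."""
    return max(n - lag, 0)


def resume_from(checkpoint):
    """First substance processed after a restart."""
    return checkpoint + 1


def simulate(restarts, first_finished=250):
    """Restart repeatedly, each time after writing just one checkpoint.

    Returns the checkpoint value on disk before the first restart and
    after each restart cycle.
    """
    checkpoint = checkpoint_after_finishing(first_finished)
    history = [checkpoint]
    for _ in range(restarts):
        start = resume_from(checkpoint)                  # e.g. 249 instead of 251
        checkpoint = checkpoint_after_finishing(start)   # e.g. 247
        history.append(checkpoint)
    return history


print(simulate(3))  # [248, 247, 246, 245]
```

In this model every restart that happens soon after the previous one costs one more substance of already completed work, which matches the 248 -> 247 step in the example.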


©2024 SiDock@home Team