Tasks hanging -

Bryn Mawr

Joined: 16 Aug 21
Posts: 36
Credit: 13,394,909
RAC: 26,460
Message 1784 - Posted: 8 Jan 2023, 15:48:12 UTC

Has anyone else had a problem with the 0.2.0 long tasks hanging?

I aborted https://www.sidock.si/sidock/result.php?resultid=76865427 while it was in high-priority mode, with 4 days estimated remaining against a 3-day deadline, because progress had stopped and the estimated time remaining kept rising.

I now have 2 more in the same state, for example corona_Sprot_delta_v1_RM_sidock_00411620_r1_s1000.0_0 at 51.004% after nearly 15 hours.
Vato

Joined: 23 Oct 20
Posts: 5
Credit: 2,165,435
RAC: 4,659
Message 1785 - Posted: 8 Jan 2023, 20:23:44 UTC - in response to Message 1784.  

I have 2 of these hanging tasks ongoing. Should I abort them?
Both are using a full core of CPU time.
In both cases, the only files in the slot directory that are being updated are boinc_mmap_file and init_data.xml.
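The slot-directory check described above can be sketched in Python. This is only an illustration, not a BOINC tool: the function names, the 10-minute window, and the idea of treating `boinc_mmap_file` and `init_data.xml` as bookkeeping files that say nothing about real progress are assumptions taken from the observation in this post.

```python
import os
import time

# Files that this post reports as still being updated while a task is stuck.
# Treating them as "bookkeeping only" is an assumption, not documented behavior.
BOOKKEEPING = {"boinc_mmap_file", "init_data.xml"}


def recently_modified(slot_dir, max_age_s=600.0, now=None):
    """Return sorted names of regular files modified within max_age_s seconds."""
    now = time.time() if now is None else now
    recent = []
    for entry in os.scandir(slot_dir):
        if entry.is_file() and now - entry.stat().st_mtime <= max_age_s:
            recent.append(entry.name)
    return sorted(recent)


def looks_stalled(slot_dir, max_age_s=600.0, now=None):
    """True when no file other than the bookkeeping files was touched recently."""
    touched = set(recently_modified(slot_dir, max_age_s, now))
    return not (touched - BOOKKEEPING)
```

Point `slot_dir` at the numbered slot directory of the suspect task inside your BOINC data directory; the exact location depends on the installation.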
Bryn Mawr

Joined: 16 Aug 21
Posts: 36
Credit: 13,394,909
RAC: 26,460
Message 1786 - Posted: 8 Jan 2023, 23:17:56 UTC - in response to Message 1785.  

I've suspended the 3 I have in progress until admin have a chance to review them.
hoarfrost
Volunteer moderator
Project administrator
Project developer

Joined: 11 Oct 20
Posts: 323
Credit: 22,882,096
RAC: 12,033
Message 1787 - Posted: 9 Jan 2023, 7:08:40 UTC

Hello! Yes, if the progress counter stays frozen for an unusually long time, you can try suspending and resuming the task, or stopping BOINC and starting it again. The second option is more reliable.
Vato

Joined: 23 Oct 20
Posts: 5
Credit: 2,165,435
RAC: 4,659
Message 1789 - Posted: 9 Jan 2023, 10:23:31 UTC - in response to Message 1787.  

Suspending the task (without the "leave in memory" option set) looks like it did the trick.
What sort of time threshold should we use before suspecting a hung task and digging into the slots dir?
4 hours? More?
Bryn Mawr

Joined: 16 Aug 21
Posts: 36
Credit: 13,394,909
RAC: 26,460
Message 1790 - Posted: 9 Jan 2023, 11:05:28 UTC - in response to Message 1789.  

Percentage static and time remaining increasing over a 5 minute period.
hoarfrost
Volunteer moderator
Project administrator
Project developer

Joined: 11 Oct 20
Posts: 323
Credit: 22,882,096
RAC: 12,033
Message 1791 - Posted: 9 Jan 2023, 12:21:21 UTC

I agree: 5 to 10 minutes. For a Raspberry Pi, 5 to 30 minutes.
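The rule of thumb from the last two posts (percentage static and time remaining rising over a 5-minute period, with a longer window on slow hardware) can be written as a small check. This is a minimal sketch under assumptions: the `Sample` record and `looks_hung` function are hypothetical names, and how the samples are collected from the BOINC client is left open.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    t: float              # seconds since monitoring began
    fraction_done: float  # progress as reported by the client, 0.0 to 1.0
    remaining_s: float    # estimated time remaining, in seconds


def looks_hung(samples, window_s=300.0):
    """True if progress has been static for at least window_s seconds
    while the remaining-time estimate has gone up."""
    if len(samples) < 2:
        return False
    last = samples[-1]
    first = last
    # Walk backwards to the start of the current run of unchanged progress.
    for s in reversed(samples):
        if s.fraction_done != last.fraction_done:
            break
        first = s
    static_long_enough = last.t - first.t >= window_s
    estimate_rising = last.remaining_s > first.remaining_s
    return static_long_enough and estimate_rising
```

With the thresholds suggested here, `window_s=300` to `600` would suit a desktop CPU and up to `1800` a Raspberry Pi.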
Bryn Mawr

Joined: 16 Aug 21
Posts: 36
Credit: 13,394,909
RAC: 26,460
Message 1794 - Posted: 12 Jan 2023, 19:11:54 UTC

Confirmed: this issue was caused by low memory. Running 3 CPDN OpenIFS apps and 21 mixed SiDock, TN-Grid and WCG tasks in 16GB was always going to be tight, but since upgrading to 64GB the hanging tasks have disappeared.
Mad_Max

Joined: 27 Dec 21
Posts: 12
Credit: 11,234,523
RAC: 26,534
Message 1799 - Posted: 15 Jan 2023, 2:52:29 UTC

I am also seeing some of these "stuck" tasks with the latest app version; I never saw this behavior with the previous version.

The CPU core is still fully used, but actual progress stops.

To make it worse, the application seems to have no watchdog timer (or its settings are inadequate).
Normal tasks complete in 1.5 to 3 hours each on a single core on my computers (depending on the CPU; I have a few different ones), but a bad one can occupy a core for a day or two and never finishes until I cancel or restart it. Without such a failure, 10 to 20 other tasks could have been completed on the same core in that time.

If you cannot find and eliminate the root cause of the failures, I would recommend adding a guard timer. Preferably it should not cover the entire task (WU, a BOINC work unit), because each task actually contains many separate micro-tasks (modeling attempts; judging by the logs, about 500 are packed into each "long" task by default).
If an individual micro-task runs for more than 10 to 15 minutes (normal run times on relatively modern CPUs are under 1 minute), it will never finish and should be restarted or canceled.
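The guard-timer idea can be illustrated with a short Python sketch. It is not the SiDock application's code and does not reflect how that application runs its modeling attempts; it assumes, purely for illustration, that each micro-task can be launched as a separate child process. `run_with_watchdog` and its parameters are hypothetical names.

```python
import subprocess
import sys


def run_with_watchdog(cmd, timeout_s, max_attempts=2):
    """Run cmd as a child process with a per-attempt time limit.

    Returns (status, attempts), where status is "ok", "failed" or "timed_out".
    """
    for attempt in range(1, max_attempts + 1):
        try:
            result = subprocess.run(cmd, timeout=timeout_s)
        except subprocess.TimeoutExpired:
            # subprocess.run kills the child before raising, so just retry.
            continue
        status = "ok" if result.returncode == 0 else "failed"
        return status, attempt
    return "timed_out", max_attempts


if __name__ == "__main__":
    # Stand-in for one micro-task: a child process that finishes immediately.
    # 900 seconds corresponds to the 15-minute limit suggested above.
    print(run_with_watchdog([sys.executable, "-c", "pass"], timeout_s=900))
```

Putting the limit on each micro-task rather than on the whole work unit means a single stuck attempt costs at most `timeout_s * max_attempts` seconds instead of a day or two of core time.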
hoarfrost
Volunteer moderator
Project administrator
Project developer

Joined: 11 Oct 20
Posts: 323
Credit: 22,882,096
RAC: 12,033
Message 1800 - Posted: 15 Jan 2023, 7:12:31 UTC - in response to Message 1799.  

Hello Max! Thank you for the report! Good idea.
PMH_UK

Joined: 23 Dec 20
Posts: 20
Credit: 1,360,768
RAC: 0
Message 1802 - Posted: 15 Jan 2023, 9:27:57 UTC - in response to Message 1784.  

The task below ran to completion after a restart (the restart had been needed earlier because of another stuck task).
https://www.sidock.si/sidock/result.php?resultid=77388891

Paul.
arcturus

Joined: 27 Nov 22
Posts: 20
Credit: 3,479,366
RAC: 26,354
Message 1808 - Posted: 15 Jan 2023, 15:13:12 UTC

I have had 12 tasks hang across 4 different computers over the past few days. The lost time represents roughly 116 tasks which could have been processed.
hoarfrost
Volunteer moderator
Project administrator
Project developer

Joined: 11 Oct 20
Posts: 323
Credit: 22,882,096
RAC: 12,033
Message 1814 - Posted: 15 Jan 2023, 17:50:43 UTC
Last modified: 15 Jan 2023, 17:51:33 UTC

Hi folks! I caught one of them and am now trying to reproduce the problem. If that succeeds, it will help greatly.
Thank you for the reports!
Bryn Mawr

Joined: 16 Aug 21
Posts: 36
Credit: 13,394,909
RAC: 26,460
Message 1820 - Posted: 16 Jan 2023, 12:50:54 UTC - in response to Message 1814.  

I note that all the tasks giving me problems are RdRp_v2_sample, whereas all of the successful tasks are Sprot_delta_v1_RM_sidock.

Are the sample jobs a faulty batch and should I abort them on sight?
PMH_UK

Joined: 23 Dec 20
Posts: 20
Credit: 1,360,768
RAC: 0
Message 1821 - Posted: 16 Jan 2023, 13:27:08 UTC - in response to Message 1820.  
Last modified: 16 Jan 2023, 13:27:43 UTC

Not so for me; I have not got to the RdRp tasks yet.
The ones I have seen hang are Sprot.
The current one is
https://www.sidock.si/sidock/workunit.php?wuid=49502557

Paul.
Bryn Mawr

Joined: 16 Aug 21
Posts: 36
Credit: 13,394,909
RAC: 26,460
Message 1822 - Posted: 16 Jan 2023, 13:38:55 UTC - in response to Message 1821.  

I reset the 3 that appeared to be hanging, and they're running OK for now, with an apparent expected run time of 17 hours.
arcturus

Joined: 27 Nov 22
Posts: 20
Credit: 3,479,366
RAC: 26,354
Message 1824 - Posted: 16 Jan 2023, 15:35:26 UTC

I have reset the project and set it to fetch no new work units, given the number of them hanging.

On to another project until this issue is resolved.
mikey

Joined: 24 Oct 20
Posts: 19
Credit: 9,803,224
RAC: 5,041
Message 1825 - Posted: 16 Jan 2023, 16:01:17 UTC - in response to Message 1824.  

Early in the past week I aborted a dozen or so tasks across several PCs, but over the last few days I haven't had to abort any. I don't know whether I've been lucky or the bad tasks are being worked through.
Crissy

Joined: 18 Oct 22
Posts: 9
Credit: 9,184,528
RAC: 0
Message 1828 - Posted: 16 Jan 2023, 18:27:54 UTC
Last modified: 16 Jan 2023, 18:31:04 UTC

Teams, Crunchers,

There are still lots of buggy workunits around.
I consider those as being a waste of energy and computing power.

I am fully aware that this is a voluntary effort that everyone is contributing to here.
However, for the sake of the scientists as well as the volunteers, issues should be fixed.

I feel there is either no attention or no progress on fixing anything.
I will now start aborting all suspicious tasks, hoping an increase in aborted items will catch someone's attention.

Finally, if there should be no improvement visible, I will dedicate computing capacity to other projects.
Very sad to talk like this but I do not see any other option.

Cheers
hoarfrost
Volunteer moderator
Project administrator
Project developer

Joined: 11 Oct 20
Posts: 323
Credit: 22,882,096
RAC: 12,033
Message 1829 - Posted: 16 Jan 2023, 19:02:44 UTC - in response to Message 1828.  

Tasks like "corona_RdRp_v2_*" are much longer than "corona_Sprot_*" tasks. The estimated time on Ryzen CPUs is 16 hours or more, at 30 to 600 seconds per step (depending on the "luck" of the ligand; modeling takes longer for the lucky ones).

©2024 SiDock@home Team