lots of tasks error out suddenly

Message boards : Number crunching : lots of tasks error out suddenly

bfromcolo

Joined: 31 Dec 20
Posts: 3
Credit: 2,816,167
RAC: 0
Message 1377 - Posted: 6 Dec 2021, 2:22:53 UTC

I have been running nonstop since 11/22 with no issues, and suddenly 40 tasks failed all at once. Bad batch? There is not much to see in the error log.

<core_client_version>7.16.6</core_client_version>
<![CDATA[
<message>
process exited with code 195 (0xc3, -61)</message>
<stderr_txt>
17:58:05 (36465): wrapper (7.17.26016): starting
17:58:06 (36465): wrapper (7.17.26016): starting
17:58:06 (36465): wrapper: running cmdock (-c -j 1 -r target.prm -p "/var/lib/boinc-client/slots/14/data/scripts/dock.prm" -f htvs.ptc -i ligands.sdf -o docking_out)
18:01:16 (36465): cmdock exited; CPU time 189.741309
18:01:16 (36465): app exit status: 0x8b
18:01:16 (36465): called boinc_finish(195)
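
The `0x8b` in the log above is probably the most informative part. Here is a minimal sketch of decoding it, assuming the wrapper prints the raw POSIX wait status of the child process (an assumption, the log itself does not say so) and assuming the usual Linux encoding, where the low 7 bits hold the signal number and 0x80 is the core-dump flag. `describe_wait_status` is a hypothetical helper name. The 195 passed to `boinc_finish` is, as far as I know, just the generic "child failed" code, so it says little on its own.

```python
import signal


def describe_wait_status(status: int) -> str:
    """Describe a raw wait status using the usual Linux bit layout (assumed)."""
    sig = status & 0x7F              # low 7 bits: terminating signal, 0 if none
    if sig == 0:
        # Normal exit: the exit code sits in the next byte up.
        return f"exited with code {(status >> 8) & 0xFF}"
    try:
        name = signal.Signals(sig).name
    except ValueError:
        name = f"signal {sig}"
    core = " (core dumped)" if status & 0x80 else ""
    return f"killed by {name}{core}"


print(describe_wait_status(0x8B))    # status taken from the log above
```

Under those assumptions the status decodes to signal 11, a segmentation fault in cmdock itself rather than a clean error exit.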
Natalia
Volunteer moderator
Project administrator
Project developer
Project scientist

Joined: 9 Oct 20
Posts: 181
Credit: 2,689,264
RAC: 42
Message 1378 - Posted: 6 Dec 2021, 8:08:23 UTC - in response to Message 1377.  

Could you specify the computer, please? I see several computers belonging to you in the database. There are some error results, but some of them seem to have been resolved.
Falconet

Joined: 24 Oct 20
Posts: 23
Credit: 9,020
RAC: 0
Message 1379 - Posted: 6 Dec 2021, 14:23:35 UTC - in response to Message 1378.  
Last modified: 6 Dec 2021, 14:28:23 UTC

Looks like it's this one - https://www.sidock.si/sidock/show_host_detail.php?hostid=23025


Enough RAM available?
I see you are using Ubuntu, maybe a permissions issue? - Never mind that, I see 1 WU successfully completed while others errored out.
You could try a project reset and see if that helps.
bfromcolo

Joined: 31 Dec 20
Posts: 3
Credit: 2,816,167
RAC: 0
Message 1380 - Posted: 6 Dec 2021, 17:46:10 UTC

Yes, computer 23025 is the one I am questioning. It is a 12c/24t Xeon with 32 GB of memory, and it was not running anything else demanding at the time, so I doubt memory was an issue, although I did add memory recently. It continued to run some tasks successfully even after the 40 that failed, all of which failed within a very short time period. I moved it to another project and it is running fine there, which is why I was questioning whether there was something up with the tasks that failed. If it is some system issue, I am not sure what to look for. I have not restarted the OS or BOINC since this occurred. I will try moving work back to it and see what happens.
mmstick

Joined: 30 Nov 21
Posts: 2
Credit: 1,245,009
RAC: 0
Message 1381 - Posted: 6 Dec 2021, 18:51:21 UTC - in response to Message 1380.  

If you're running Linux, you should check the kernel logs (`journalctl -k` / `dmesg`) for any errors.
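
If the kernel log is long, a small script can narrow it down. This is only a sketch: it assumes `journalctl` is installed and that your user is allowed to read the kernel log, and the helper names (`read_kernel_log`, `suspicious_lines`) and the list of marker strings are made up for illustration, not taken from any tool.

```python
import subprocess

# Substrings that commonly appear in kernel reports of crashed or killed
# processes. This list is an assumption; extend it as needed.
DEFAULT_MARKERS = ("segfault at", "Out of memory", "general protection")


def read_kernel_log() -> str:
    """Return the kernel log as text via `journalctl -k`."""
    result = subprocess.run(
        ["journalctl", "-k", "--no-pager"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def suspicious_lines(log_text: str, markers=DEFAULT_MARKERS) -> list:
    """Keep only the log lines that contain one of the marker substrings."""
    return [
        line for line in log_text.splitlines()
        if any(marker in line for marker in markers)
    ]
```

Calling `suspicious_lines(read_kernel_log())` would then list only the crash-related entries, and the same filter works on text saved from `dmesg`.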
bfromcolo

Joined: 31 Dec 20
Posts: 3
Credit: 2,816,167
RAC: 0
Message 1382 - Posted: 7 Dec 2021, 0:42:02 UTC

I don't know if it helps, but I found a number of these in my system log from the time when these all failed.

cmdock[36125]: segfault at 5634063e8400 ip 00007f1fa00126d6 sp 00007ffe7b743ba0 error 4 in libcmdock.so.0[7f1f9fd52000+45b000]
cmdock[36541]: segfault at 55adf716bb30 ip 00007fb85fe95539 sp 00007fffc47c7d00 error 4 in libcmdock.so.0[7fb85fb46000+45b000]
cmdock[36441]: segfault at 557dcd3dace0 ip 00007f99ecbb66d9 sp 00007ffc3a8657a0 error 4 in libcmdock.so.0[7f99ec8f6000+45b000]

I won't paste all 40 of them here, but they all end the same way: "error 4 in libcmdock.so.0[************+45b000]"
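
For what it's worth, lines like these can be decoded a little further. Below is a rough sketch, assuming the x86 page-fault error-code bits (bit 0: protection violation rather than unmapped page, bit 1: write rather than read, bit 2: fault came from user mode) and assuming the bracketed value is the library's load address, so that `ip` minus that base gives an offset into `libcmdock.so.0`. `decode_segfault` is a hypothetical helper name.

```python
import re

SEGFAULT_RE = re.compile(
    r"(?P<comm>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+) "
    r"in (?P<lib>[^\[\s]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def decode_segfault(line: str):
    """Parse one kernel segfault line; return None if it does not match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    err = int(m["err"], 16)          # the kernel prints the error code in hex
    return {
        "pid": int(m["pid"]),
        "library": m["lib"],
        # Offset of the faulting instruction inside the mapped library.
        "offset": hex(int(m["ip"], 16) - int(m["base"], 16)),
        "access": "write" if err & 2 else "read",
        "cause": "protection violation" if err & 1 else "page not mapped",
        "user_mode": bool(err & 4),
    }


lines = [
    "cmdock[36125]: segfault at 5634063e8400 ip 00007f1fa00126d6 sp 00007ffe7b743ba0 error 4 in libcmdock.so.0[7f1f9fd52000+45b000]",
    "cmdock[36541]: segfault at 55adf716bb30 ip 00007fb85fe95539 sp 00007fffc47c7d00 error 4 in libcmdock.so.0[7fb85fb46000+45b000]",
    "cmdock[36441]: segfault at 557dcd3dace0 ip 00007f99ecbb66d9 sp 00007ffc3a8657a0 error 4 in libcmdock.so.0[7f99ec8f6000+45b000]",
]
for line in lines:
    print(decode_segfault(line))
```

With those assumptions, "error 4" reads as a user-mode read of an unmapped page. The three lines pasted above give offsets 0x2c06d6, 0x34f539 and 0x2c06d9, so two of them land within a few bytes of each other. If a build of the library with debug symbols is available, an offset like that could be passed to `addr2line -e libcmdock.so.0` to find the source line.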
mmstick

Joined: 30 Nov 21
Posts: 2
Credit: 1,245,009
RAC: 0
Message 1383 - Posted: 7 Dec 2021, 3:54:28 UTC - in response to Message 1382.  

If it's not a typical C/C++ programming error such as a use-after-free, then it's most likely caused by memory corruption from unstable memory or CPU. It could also be caused by a cosmic ray, however unlikely.

©2024 SiDock@home Team