Posts by threatripper

1) Message boards : Cafe : I wonder how SiDock (distributed computing) compares with supercomputing (Message 2009)
Posted 25 Feb 2023 by Profile threatripper
Post:
Seems like they (the University of Toledo, in Ohio) may be doing vaguely similar research, but with a very expensive computer cluster:

https://www.theregister.com/2023/02/23/petaflops_covid/
2) Message boards : Cafe : interesting numbers (Message 1896)
Posted 20 Jan 2023 by Profile threatripper
Post:
in decimal? base 9? base 12? hexadecimal?
3) Message boards : Number crunching : Tasks hanging - (Message 1852)
Posted 17 Jan 2023 by Profile threatripper
Post:
Thanks very much. I'm still getting quite a few slow tasks. I'll do a system reboot soon, and that should restart boinc.

If that doesn't improve things, then one more thing to look into is whether the code for the simulation has changed over the past week (as opposed to just new data). In my experience things were running super smoothly just a week ago, before these symptoms appeared.
4) Message boards : Number crunching : Tasks hanging - (Message 1843)
Posted 17 Jan 2023 by Profile threatripper
Post:
Thank you for the clarifications and helpful tip regarding the logs.

I checked on one task that was about 87% complete. It seems to have hung on one of the records. Record #432 contains the last estimate of the completion time, with 69 records remaining.

RECORD #432
NAME:   ZINC001026963444
                     RNG seed:std::random_device
Numer of docking runs done:   20 (0 errors)
Ligand docking duration:      22.0796 second(s)

Average duration per ligand:  24.7868 second(s)
Approximately 69 record(s) remaining, will finish Sat Jan 14 07:49:39 2023

**************************************************
RECORD #433
NAME:   ZINC001026963481
                     RNG seed:std::random_device
Numer of docking runs done:   20 (0 errors)
Ligand docking duration:      23.5675 second(s)

**************************************************
RECORD #434
NAME:   ZINC001026963482
 [blah.... truncated for brevity]
**************************************************
RECORD #435
NAME:   ZINC001026963521 
 [blah.... truncated for brevity]
**************************************************
RECORD #436
 [blah.... truncated for brevity]
**************************************************
RECORD #437   [blah.... truncated for brevity]
**************************************************
RECORD #438
NAME:   ZINC001026963524
                     RNG seed:std::random_device
Numer of docking runs done:   20 (0 errors)
Ligand docking duration:      23.1778 second(s)

**************************************************
RECORD #439
NAME:   ZINC001026963525
                     RNG seed:std::random_device



The log file ends at record #439. The last record logged with complete information was the previous one, #438.
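
For reference, the per-ligand average reported at record #432 puts the whole remainder at under half an hour of work, so this isn't just slow progress. A quick back-of-the-envelope check (figures taken from the log above):

# Back-of-the-envelope check using the figures from RECORD #432 above.
remaining_records = 69
avg_seconds_per_ligand = 24.7868          # "Average duration per ligand"
eta = remaining_records * avg_seconds_per_ligand
print(f"~{eta:.0f} s (~{eta / 60:.0f} min) of work should have remained")
# prints: ~1710 s (~29 min) of work should have remained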

Why this ZINC001026963525 simulation seems to have hung, I don't know.

However, I think I will kill the task, as well as the other tasks that seem to be hanging. I'll make a zip file of the logs in slot 4 in case they contain further information that sheds light on the cause.

The directory listing shows that the only file still being updated in the slot 4 directory is the boinc_mmap_file; all the other files haven't changed for hours (a quick way to check this is sketched after the listing):


root@mars2:/var/lib/boinc-client/slots/4# date; ls -alt
Mon 16 Jan 2023 09:29:08 PM EST
total 12012
-rw-r--r--  1 boinc boinc    8192 Jan 16 21:28 boinc_mmap_file
drwxrwx--x  4 boinc boinc    4096 Jan 15 06:34 .
-rw-r--r--  1 boinc boinc    6358 Jan 15 06:34 init_data.xml
-rw-r--r--  1 boinc boinc     517 Jan 14 07:23 boinc_task_state.xml
-rw-r--r--  1 boinc boinc      28 Jan 14 07:23 wrapper_checkpoint.txt
-rw-r--r--  1 boinc boinc  107863 Jan 14 07:23 docking_out.log
-rw-r--r--  1 boinc boinc   47151 Jan 14 07:23 docking_log
-rw-r--r--  1 boinc boinc 1617067 Jan 14 07:23 docking_out
-rw-r--r--  1 boinc boinc     255 Jan 14 07:23 docking_out.chk
-rw-r--r--  1 boinc boinc       8 Jan 14 07:23 docking_out.progress
-rw-r--r--  1 boinc boinc       0 Jan 14 04:22 boinc_lockfile
-rw-r--r--  1 boinc boinc     274 Jan 14 04:22 stderr.txt
-rw-r--r--  1 boinc boinc      56 Jan 14 04:22 htvs.ptc
-rw-r--r--  1 boinc boinc  180365 Jan 14 04:22 target.mol2
-rw-r--r--  1 boinc boinc 7856840 Jan 14 04:22 target.as
-rwxr-xr-x  1 boinc boinc  408352 Jan 14 04:22 cmdock
-rw-r--r--  1 boinc boinc     100 Jan 14 04:22 cmdock-l_wrapper_2.0_x86_64-pc-linux-gnu
-rw-r--r--  1 boinc boinc     721 Jan 14 04:22 job.xml
-rw-r--r--  1 boinc boinc    1033 Jan 14 04:22 target.prm
drwxr-xr-x 13 boinc boinc    4096 Jan 12 02:27 ..
drwxr-xr-x  6 boinc boinc    4096 Dec 21 03:53 data
drwxr-xr-x  2 boinc boinc    4096 Dec 21 03:53 lib
-rw-rw-r--  1 boinc boinc 1983385 Jan 25  2022 ligands.sdf
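
A small sketch of the staleness check mentioned above (the slot path is the one from this host):

# Report how long ago each file in the slot directory was last written,
# to confirm that nothing besides boinc_mmap_file is still being updated.
import os, time

slot = "/var/lib/boinc-client/slots/4"    # path from the listing above
now = time.time()
for name in sorted(os.listdir(slot)):
    path = os.path.join(slot, name)
    if os.path.isfile(path):
        age_h = (now - os.path.getmtime(path)) / 3600
        print(f"{age_h:8.1f} h since last write  {name}")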


Boincmgr displays the following under the task properties:

Application CurieMarieDock 0.2.0 long tasks 2.00 
Name corona_Sprot_delta_v1_RM_sidock_00518854_r4_s1000.0
State Running
Received Sat 14 Jan 2023 12:27:51 AM EST
Report deadline Thu 19 Jan 2023 12:27:51 AM EST
Estimated computation size 30,000 GFLOPs
CPU time 2d 15:20:50
CPU time since checkpoint 2d 12:21:11
Elapsed time 2d 16:10:37
Estimated time remaining 09:05:03 
Fraction done 87.600%
Virtual memory size 141.95 MB
Working set size 2.73 MB
Directory slots/4
Process ID 275748
Progress rate 1.440% per hour
Executable cmdock-l_wrapper_2.0_x86_64-pc-linux-gnu


Maybe putting a timeout breakpoint in the code and adding some debugging output would shed light on why some simulations seem to hang.
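
Until there's a timeout inside the docking code itself, something like the rough watchdog below could at least flag hung slots. This is only a sketch: the log path is the one from the listing above, the two-hour threshold is arbitrary, and it only reports the stall rather than killing anything.

# Rough external watchdog: warn when the slot's docking log stops advancing.
import os, time

SLOT_LOG = "/var/lib/boinc-client/slots/4/docking_out.log"   # from the listing above
STALL_SECONDS = 2 * 3600    # arbitrary example: 2 hours without a new record

while True:
    age = time.time() - os.path.getmtime(SLOT_LOG)
    if age > STALL_SECONDS:
        print(f"docking_out.log hasn't advanced for {age / 3600:.1f} h - task may be hung")
    time.sleep(600)   # re-check every 10 minutes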
5) Message boards : Number crunching : Tasks hanging - (Message 1835)
Posted 16 Jan 2023 by Profile threatripper
Post:
I don't mind long tasks, as long as they can be counted on to complete within a reasonable or known time limit.
6) Message boards : Number crunching : Tasks hanging - (Message 1833)
Posted 16 Jan 2023 by Profile threatripper
Post:
I'm having the same issue right now as well. Over half my threads seem to be bogged down.

Previously I've only had to kill the rare hanging task after a system reboot (which points to a possible inability of a task to resume from a checkpoint).

This time around I have a batch full of tasks that are either running really slowly or apparently hanging. They do peg the CPU at about 100%, so they're definitely using the CPU, but the code may be stuck in a loop.

One circumstance here might be heavy system load. I have only 8 threads (8c/8t AM3 processor), and much of the time my system load is quite a bit above 10. However, I have BOINC set to keep computing regardless of system load, which (afaik) should rule out errors arising from a task being unable to properly suspend and resume a compute thread. Another possible factor is that I've had Steam running, which is Chromium/Google based. Google regularly ships questionable code, so Chromium derivatives often end up vulnerable to exploits from memory-safety issues and other open-source problems, which are then exploited by malware dished out through real-time-bidding ads; there might be some possibility this Chrome derivative is borking the system to the point of corrupting other processes.

Given the other reports, though, the likeliest bet seems to be bugs or suboptimal programming in the work units.

One last possibility: is there any chance these work units depend on the results of other work units? I did manually change the order in which the tasks would normally start (the reason being that I like to avoid high system load during gaming sessions, so I keep all unstarted tasks suspended and manually start them when system load is low).



