1)
Message boards :
Cafe :
I wonder how SiDock (distributed computing) compares with supercomputing
(Message 2009)
Posted 25 Feb 2023 by threatripper Post: Seems like they (the University of Toledo, in Ohio) may be doing vaguely similar research, but with a very expensive computer cluster: https://www.theregister.com/2023/02/23/petaflops_covid/ |
2)
Message boards :
Cafe :
interesting numbers
(Message 1896)
Posted 20 Jan 2023 by threatripper Post: in decimal? base 9? base 12? hexadecimal? |
3)
Message boards :
Number crunching :
Tasks hanging -
(Message 1852)
Posted 17 Jan 2023 by threatripper Post: Thanks very much. I'm still getting quite a few slow tasks. I'll do a system reboot soon, which should restart BOINC. If that doesn't improve things, then one more thing to look into is whether the code for the simulation has changed over the past week (as opposed to just new data). In my experience, things were running smoothly just a week ago, before these symptoms appeared. |
4)
Message boards :
Number crunching :
Tasks hanging -
(Message 1843)
Posted 17 Jan 2023 by threatripper Post: Thank you for the clarifications and the helpful tip regarding the logs. I checked on one task which was about 87% complete. It seems to have hung on one of the records. Record #432 contains the last estimate made for the time of completion, with 69 records remaining:

RECORD #432
NAME: ZINC001026963444
RNG seed:std::random_device
Numer of docking runs done: 20 (0 errors)
Ligand docking duration: 22.0796 second(s)
Average duration per ligand: 24.7868 second(s)
Approximately 69 record(s) remaining, will finish Sat Jan 14 07:49:39 2023
**************************************************
RECORD #433
NAME: ZINC001026963481
RNG seed:std::random_device
Numer of docking runs done: 20 (0 errors)
Ligand docking duration: 23.5675 second(s)
**************************************************
RECORD #434
NAME: ZINC001026963482
[blah.... truncated for brevity]
**************************************************
RECORD #435
NAME: ZINC001026963521
[blah.... truncated for brevity]
**************************************************
RECORD #436
[blah.... truncated for brevity]
**************************************************
RECORD #437
[blah.... truncated for brevity]
**************************************************
RECORD #438
NAME: ZINC001026963524
RNG seed:std::random_device
Numer of docking runs done: 20 (0 errors)
Ligand docking duration: 23.1778 second(s)
**************************************************
RECORD #439
NAME: ZINC001026963525
RNG seed:std::random_device

The log file ends with record #439. The last complete record logged without missing information was the previous one, #438. Why this ZINC001026963525 simulation seems to have hung, I don't know. However, I think I will kill the task, as well as other tasks that seem to be hanging. I'll make a zip file of the logs in slot 4 in case there's further info in them to shed some light on the cause.
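[Editor's note: the "last complete record" check described above can be automated. The sketch below is a hypothetical helper, not part of the original post; it assumes the log layout shown in the excerpt: records separated by asterisk lines, each starting with "RECORD #N", with a "Ligand docking duration" line marking a record as finished.]

```python
import re

def last_complete_record(log_text):
    """Return (record_number, name) of the last record whose
    'Ligand docking duration' line was logged, or None if none completed."""
    last = None
    current_num, current_name, has_duration = None, None, False
    for line in log_text.splitlines():
        m = re.match(r"RECORD #(\d+)", line.strip())
        if m:
            # A new record begins; commit the previous one if it finished.
            if current_num is not None and has_duration:
                last = (current_num, current_name)
            current_num = int(m.group(1))
            current_name, has_duration = None, False
            continue
        if line.strip().startswith("NAME:"):
            current_name = line.split("NAME:", 1)[1].strip()
        if "Ligand docking duration" in line:
            has_duration = True
    # Commit the final record if it completed.
    if current_num is not None and has_duration:
        last = (current_num, current_name)
    return last
```

Run against the excerpt above, this would report record #438 as the last complete one, since #439 never logged a duration.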
The directory listing indicates that the only file being updated in the slot 4 directory is some sort of mmap file. All the other files have stopped changing for hours:

root@mars2:/var/lib/boinc-client/slots/4# date; ls -alt
Mon 16 Jan 2023 09:29:08 PM EST
total 12012
-rw-r--r--  1 boinc boinc    8192 Jan 16 21:28 boinc_mmap_file
drwxrwx--x  4 boinc boinc    4096 Jan 15 06:34 .
-rw-r--r--  1 boinc boinc    6358 Jan 15 06:34 init_data.xml
-rw-r--r--  1 boinc boinc     517 Jan 14 07:23 boinc_task_state.xml
-rw-r--r--  1 boinc boinc      28 Jan 14 07:23 wrapper_checkpoint.txt
-rw-r--r--  1 boinc boinc  107863 Jan 14 07:23 docking_out.log
-rw-r--r--  1 boinc boinc   47151 Jan 14 07:23 docking_log
-rw-r--r--  1 boinc boinc 1617067 Jan 14 07:23 docking_out
-rw-r--r--  1 boinc boinc     255 Jan 14 07:23 docking_out.chk
-rw-r--r--  1 boinc boinc       8 Jan 14 07:23 docking_out.progress
-rw-r--r--  1 boinc boinc       0 Jan 14 04:22 boinc_lockfile
-rw-r--r--  1 boinc boinc     274 Jan 14 04:22 stderr.txt
-rw-r--r--  1 boinc boinc      56 Jan 14 04:22 htvs.ptc
-rw-r--r--  1 boinc boinc  180365 Jan 14 04:22 target.mol2
-rw-r--r--  1 boinc boinc 7856840 Jan 14 04:22 target.as
-rwxr-xr-x  1 boinc boinc  408352 Jan 14 04:22 cmdock
-rw-r--r--  1 boinc boinc     100 Jan 14 04:22 cmdock-l_wrapper_2.0_x86_64-pc-linux-gnu
-rw-r--r--  1 boinc boinc     721 Jan 14 04:22 job.xml
-rw-r--r--  1 boinc boinc    1033 Jan 14 04:22 target.prm
drwxr-xr-x 13 boinc boinc    4096 Jan 12 02:27 ..
drwxr-xr-x  6 boinc boinc    4096 Dec 21 03:53 data
drwxr-xr-x  2 boinc boinc    4096 Dec 21 03:53 lib
-rw-rw-r--  1 boinc boinc 1983385 Jan 25  2022 ligands.sdf

Boincmgr displays the following under the task properties:

Application: CurieMarieDock 0.2.0 long tasks 2.00
Name: corona_Sprot_delta_v1_RM_sidock_00518854_r4_s1000.0
State: Running
Received: Sat 14 Jan 2023 12:27:51 AM EST
Report deadline: Thu 19 Jan 2023 12:27:51 AM EST
Estimated computation size: 30,000 GFLOPs
CPU time: 2d 15:20:50
CPU time since checkpoint: 2d 12:21:11
Elapsed time: 2d 16:10:37
Estimated time remaining: 09:05:03
Fraction done: 87.600%
Virtual memory size: 141.95 MB
Working set size: 2.73 MB
Directory: slots/4
Process ID: 275748
Progress rate: 1.440% per hour
Executable: cmdock-l_wrapper_2.0_x86_64-pc-linux-gnu

Maybe putting a timeout breakpoint in the code and adding some debugging output will shed light on why some simulations seem to hang. |
5)
Message boards :
Number crunching :
Tasks hanging -
(Message 1835)
Posted 16 Jan 2023 by threatripper Post: I don't mind long tasks, as long as you can be sure they complete within a reasonable or known time limit. |
6)
Message boards :
Number crunching :
Tasks hanging -
(Message 1833)
Posted 16 Jan 2023 by threatripper Post: I'm having the same issue right around now as well. Over half my threads seem to be bogged down. Previously I've only had to kill the rare hanging task after a system reboot (which points to a possible inability of a task to resume from a checkpoint). This time around I have a batch full of tasks that are going either really slowly or are apparently hanging. They do peg the CPU at about 100%, so they're definitely using the CPU, but the code may be stuck in a loop.

One circumstance might be heavy system load. I have only 8 threads (8c/8t AM3 processor), and many times my system load is quite a bit above 10. However, I have BOINC set to keep computing regardless of system load, which, afaik, should rule out the suspend/resume path as a cause (i.e., any errors that might arise if the program can't properly suspend and resume a compute thread).

Another possible factor is that I've had Steam running, which is Chromium/Google based. Google regularly has questionable code, so Chromium derivatives often end up being vulnerable to exploits from memory-safety issues and other matters of open sores, which are then exploited by malware dished out by real-time-bidding ads. There might be some possibility that this Chrome derivative is basically borking the system to the point of corrupting other processes. Given the other reports, though, it seems like the likeliest bet is bugs or suboptimal programming in the work units.

One last possibility: is there any chance these work units depend on the results of other work units? I did manually change the order in which the compute units would normally start (I like to avoid high system loads during gaming sessions, so I keep all unstarted tasks suspended and manually start them when system load is low). |
©2024 SiDock@home Team