Posts by Mad_Max

1) Message boards : Science : Other uses of docking than viruses? (Message 2236)
Posted 17 days ago by Mad_Max
Post:
Dear colleagues,

I am glad to inform you that we did some trials with our version of the GPU (OpenCL) application. The scoring function for the GPU version is currently different from that of the CPU version,
and currently we are cleaning the code and are searching for bugs. We will very soon start with beta testing, ...

And more than half a year has passed again. Apparently, "very soon" in some scientific circles can stretch indefinitely... lol
2) Message boards : News : Target # 22: corona_RdRp_v2 (Message 1948)
Posted 25 Jan 2023 by Mad_Max
Post:
For the RdRp_v2 target this is perfectly normal. These are long tasks: even on modern computers the computing time for them can reach up to a day of pure CPU time. On older ones, you can expect 2 days or more.

WUs for the Sprot_delta target, by contrast, are relatively short tasks (about 10-15 times less computing time than RdRp_v2).
3) Message boards : Science : Other uses of docking than viruses? (Message 1947)
Posted 25 Jan 2023 by Mad_Max
Post:
I don't understand why so little time and attention is paid to GPU development in general (not only in your particular project).
I know it is not an easy and simple task...

But in my opinion, and that of many other active volunteers, for tasks where it is applicable at all (and molecular dynamics is definitely one such area), developing/porting a GPU version should generally be the number one priority.

Because the work that the project has been doing for a year could in this case be done in just a few weeks. I'm not overestimating or exaggerating. This is not only because GPU computing is much more productive/fast, but also because in volunteer computing there is great demand for projects that do important, meaningful scientific work in biomedical research. Most of the existing BOINC projects for GPU (especially for AMD/Intel = OpenCL; for NV/CUDA the choice is somewhat larger) are devoted to pure, far-from-life theory, such as solving abstract mathematical problems or astrophysics. So when a new project appears that solves more applied and significant tasks, we can expect a large influx of new participants and computing power, in addition to the gain from more efficient GPU computing itself.

This was demonstrated at least twice by medical projects within WCG (Help Conquer Cancer a few years ago and OpenPandemics last year). As soon as a well-functioning GPU application was developed and launched, the overall calculation speed of the project increased not just several times but by more than an order of magnitude. From that moment on, the amount of available computing power ceased to be a problem at all. The overall throughput of the project was limited only by the server resources available for database operation and by the download servers that generate and process huge quantities of WUs, while the volunteers' computers stood in line waiting for more work to process.
4) Message boards : Science : Other uses of docking than viruses? (Message 1946)
Posted 25 Jan 2023 by Mad_Max
Post:
I know that the CPU version is in active development.
But up to this point I had heard nothing at all about the development of the GPU version. And I was VERY surprised to read in an old (you could say archival!) forum topic that, according to the old plans, it should already have been ready about a year ago.

And there have been no updates since then. Why: have the deadlines shifted a lot, have the plans changed, or was it canceled completely? If it was not canceled, what stage is it at now, and when, at least approximately, can we expect the first beta versions to test?
5) Message boards : Science : Other uses of docking than viruses? (Message 1941)
Posted 24 Jan 2023 by Mad_Max
Post:

We expect a fully functional version of GPU CMDOCK at the end of this year or the beginning of next year. The GPU version will use OPENCL and run on NVIDIA and AMD graphics cards.
Crtomir


Oops! The next year has not only begun but has already ended (that was 2022, now in the past - 1.5 years have already passed).
But not only has no new GPU docking application been released a year after the originally planned deadline, there has not even been any news or update about the progress of its development.

What happened to this project, where did it silently and without a trace go?
6) Message boards : News : СmDock "long" and "short" tasks applications (Message 1940)
Posted 24 Jan 2023 by Mad_Max
Post:
P.S.
Here is an example of a "saved" WU.
It was stuck running at 100% of one core for about 4 days (24/7) but made no actual progress for the last ~1.5 days.
(Judging by the last-modified time of the <docking_out.log> file, which is updated correctly when it is written to.)
It's even more disappointing that it got stuck at 97%, just 3% from the finish. And restarting means losing all CPU time and credits.
But I updated the "last modified" timestamp of <docking_out.chk> before the restart, and BOINC correctly restored everything after the restart.
You can see this from the less than 1 hour of runtime (CPU time 3371) after the restart, while all the time since the first start of the WU was restored, added correctly and reported just ~13h before the deadline expired:

https://www.sidock.si/sidock/result.php?resultid=77591932

The same goes for https://www.sidock.si/sidock/result.php?resultid=77607538 although it is not finished yet (it should be by the time you read this): I found it also stuck, at 75% after ~3.5 days of running.
The timestamp fix + restart seems to have fixed it too.
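
The timestamp fix described above can be scripted. This is only a minimal sketch of the manual workaround, not anything provided by SiDock or BOINC: the helper name `bump_mtime` is made up for illustration, and the path to the slot directory is an assumption you would have to adapt. It only rewrites the "last modified" time of the file and leaves its content untouched.

```python
import os
import time
from pathlib import Path


def bump_mtime(path, when=None):
    """Set the 'last modified' time of `path` to `when` (default: now).

    The access time is kept as it was. Returns the new modification time
    as reported by the file system.
    """
    p = Path(path)
    st = p.stat()
    new_mtime = time.time() if when is None else when
    os.utime(p, (st.st_atime, new_mtime))
    return p.stat().st_mtime


# Hypothetical usage; the slot number and BOINC data directory are assumptions:
# bump_mtime(r"C:\ProgramData\BOINC\slots\3\docking_out.chk")
```

Run it on docking_out.chk shortly before restarting BOINC; whether the counters are then restored still depends on the behavior observed in the post.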
7) Message boards : News : СmDock "long" and "short" tasks applications (Message 1938)
Posted 24 Jan 2023 by Mad_Max
Post:
After a special "restart test" under Ubuntu, I did the same test for Windows 10 + BOINC 7.16.11. About 1 hour before the tasks completed I restarted a VM with Windows. The first task, for workunit 49655115, is complete. And as you see, CPU time was not lost:
.........
Maybe it does not work on some systems, but on the Windows system nearest to me it works.

Hello.
I did some debugging too and have found part of the problem with checkpoints and losing the CPU/elapsed time counters after each WU/BOINC/computer restart.
It may indeed be OS related, but not caused by the OS itself (all other projects running on the same computer and in the same BOINC installation do not have such problems). I see it on all of my computers, but they all have the same OS version installed (Win7 Pro x64).
Maybe it is something like a new OS API call that was added only in the latest Windows versions and does not work properly (partial support) on older ones?

I monitored the files that a running SiDock WU writes to disk in its working "/slots/" folder during a checkpoint save.
And I found a very interesting thing: the checkpoint files themselves are written to disk OK, but the file metadata is not updated. After the app modifies these files:
docking_out.progress
docking_out.chk

their "last modified" timestamps do not change and always stay equal to the timestamp of the initial file creation at the WU's first start.

Doesn't look like a significant problem? It's just file timestamps...
At first I thought so too, but just in case I tried changing the last-modified timestamp of the file "docking_out.chk". BOINC noticed it immediately (instantly!), created the file boinc_task_state.xml (which had been missing until then, even though the calculation of the WU was already coming to an end) and updated the time of the last checkpoint in the GUI (just a few seconds since the last checkpoint).
The file "wrapper_checkpoint.txt" was also created at the same moment. It had been missing too, despite >20 hours of WU computation. So maybe it is not BOINC but the intermediate wrapper app that depends so much on timestamps?

I'm not familiar enough with the internal algorithms, and in general it looks like a strange (some would even say stupid) programming decision... But it seems that BOINC (or is it a failure of the SiDock wrapper app?) detects that the science app has written a new checkpoint ONLY by the modification date/time of the files. And if the timestamps are not updated when these files are written, it does not notice at all that a brand-new checkpoint was recorded by the working application.

Another interesting fact:
When other files are modified, such as
docking_out.log
docking_log
docking_out
wrapper_checkpoint.txt

their timestamps are updated correctly after each modification/addition! For some reason only the <docking_out.progress> and <docking_out.chk> files miss timestamp updates when they are written.

I don't even understand how this can happen at all...
Why does writing files by the same program (not even just the same program, but the same process already loaded and running in memory), on the same computer and OS and even in the same folder, update the modification timestamp in one case and not in the others?
Are these files handled in different modules, written by different programmers and using different APIs/libs to access the disk?
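
The monitoring described in this post can be reproduced with a small script. This is a rough sketch under assumptions, not the method actually used: it takes two snapshots of a slot folder and lists files whose content changed while the "last modified" time stayed the same. The function names are invented for illustration, and on Windows a file held open by the science app might not be readable at all times, which this sketch does not handle.

```python
import hashlib
import os


def snapshot(folder):
    """Map file name -> (mtime, sha256 of content) for regular files in `folder`."""
    snap = {}
    for name in os.listdir(folder):
        path = os.path.join(folder, name)
        if os.path.isfile(path):
            with open(path, "rb") as fh:
                digest = hashlib.sha256(fh.read()).hexdigest()
            # os.stat asks the file system directly for the current mtime
            snap[name] = (os.stat(path).st_mtime, digest)
    return snap


def stale_mtime_files(before, after):
    """Names whose content differs between snapshots while mtime is unchanged."""
    return sorted(
        name
        for name, (mtime, digest) in after.items()
        if name in before
        and before[name][1] != digest
        and before[name][0] == mtime
    )
```

Taking one snapshot, waiting for at least one checkpoint interval and taking a second one should, if the observation above holds, report docking_out.progress and docking_out.chk but not docking_out.log.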
8) Message boards : News : СmDock "long" and "short" tasks applications (Message 1881)
Posted 18 Jan 2023 by Mad_Max
Post:

Errors 0xc2, 0xc3 - looks interesting, thank you for the report.

You can ignore these additional errors (all of them occurred only on the computer with hostid 25851, named Cruiser-2).
I know the exact reason, and it is not SiDock or BOINC related. It was a buggy third-party (non-BOINC) app with a nasty memory leak running on the same computer. Yesterday it simply ate up all the memory (including the virtual/swap file, about 24 GB in total), and other programs started crashing because they ran out of RAM. After I noticed it and restarted it to free the trashed RAM, all these errors stopped immediately.

But this has nothing to do with the problem of time counters, progress bar and credit calculations being reset, which I see on all of my computers.
9) Message boards : News : СmDock "long" and "short" tasks applications (Message 1875)
Posted 18 Jan 2023 by Mad_Max
Post:
After restart time counts from last successful checkpoint (as for any other project).

Yes, it works this way for all OTHER projects, but not for SiDock - for SiDock it resets to zero after a restart!
Maybe it is due to the use of a 2-level wrapped app (the app launched by BOINC is not the actual app but just a wrapper, which launches the actual app that does all the real work/computations). I do not run any other projects with wrapped apps to compare.

Just reproduced it again. There were 4 SiDock WUs running and a few WUs from another project (from WCG this time, but it also works OK with Rosetta@Home, Einstein@Home and MilkyWay@Home).
I restarted BOINC.
For the WCG WUs the CPU/elapsed time counters and progress bar were immediately restored to values close to those before the restart (from the latest checkpoint, I guess).
But for all SiDock WUs all time counters and progress bars were reset to zero.
After 5-7 minutes of computation the progress bar recovered to near its pre-restart value. But the CPU/elapsed times are still counted from the moment of the restart.

Suspending/resuming WUs (with the "leave in RAM" option turned off) also kills the time counters, but the progress bar % is preserved if BOINC Manager is not restarted.
....
Oh, looks like I have just found the problem (or at least part of it) - BOINC does not see the checkpoints from SiDock at all:
CPU time: 01:05:37
CPU time since checkpoint: 01:05:34
Elapsed time: 01:05:46

Does SiDock use its own implementation of checkpoints? Or does it not report to BOINC properly after a checkpoint is saved? I know the checkpoints actually work fine. But BOINC does not see/know about them.
Anyway, BOINC thinks no checkpoint has been made for the WU, and that's why it resets the CPU/elapsed time counters.

Also, SiDock does not report progress % to BOINC properly.
In the working slot directory, in the boinc_task_state.xml files of all running SiDock WUs, I see
<fraction_done>0.000000</fraction_done>
while in the BOINC GUI I see correct values.

Probably it reports progress via the API (app-to-app communication on the fly) but does not write the same info to the state file as it should? That could explain the strange progress bar behavior after a restart: BOINC always reads the files first, sees fraction_done = 0 and so reverts the progress bar in the GUI to zero too. But later it gets the actual progress % some other way and corrects the progress bar.
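
The check of boinc_task_state.xml described above can be automated. This is a small sketch only: the `fraction_done` element name comes from the post, while the surrounding XML layout (the `active_task` wrapper in the sample) and the helper name are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def read_fraction_done(xml_text):
    """Return the first <fraction_done> value found in the XML text, or None."""
    root = ET.fromstring(xml_text)
    node = next(root.iter("fraction_done"), None)
    if node is None or node.text is None:
        return None
    return float(node.text)


# Sample content; only the fraction_done line is taken from the post,
# the wrapping element is assumed.
sample = "<active_task><fraction_done>0.000000</fraction_done></active_task>"
print(read_fraction_done(sample))
```

Running it against the state file of each slot while the GUI shows a non-zero progress value would confirm (or refute) that the file and the GUI disagree.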

P.S.
I use the latest BOINC (v7.20.2) on x64 Windows. Maybe on *nix it works differently...
10) Message boards : Number crunching : Tasks hanging - (Message 1870)
Posted 18 Jan 2023 by Mad_Max
Post:

This reported runtime is IMPOSSIBLE because it was running on a single thread and the machine returned 48 other WU in the same time period..

Something is wrong with these work units.

It loses (resets to zero) the CPU time stats after each restart (a full restart, without leaving the app in RAM). So only the CPU/elapsed time since the last app restart is counted. Looks like another bug...
I already posted about it in detail in another thread before I saw your message: https://www.sidock.si/sidock/forum_thread.php?id=225&postid=1866#1866
11) Message boards : Number crunching : Tasks hanging - (Message 1869)
Posted 18 Jan 2023 by Mad_Max
Post:

I note that all the tasks giving me problems are RdRp_v2_sample whereas all of the successful tasks are Sprot_delta_v1_RM_sidock

Are the sample jobs a faulty batch and should I abort them on sight?

No. I have only seen "hung" tasks in Sprot_delta.
RdRp_v2_sample runs OK (at least I have never come across a hung task from this series).

It's just that these tasks are MUCH (about 10-20 times) longer than the previous ones from the Sprot_delta series. Calculation times exceeding a day (and on weak computers, more than 2 days of non-stop computing) are a NORMAL situation for these tasks and not a failure!

Although such long tasks can be a problem in themselves - the admins need to at least increase the BOINC deadline setting for them, because weaker computers (or modern ones that work not 24/7 but only a few hours a day) simply will not have enough time to finish all calculations before the deadline.
12) Message boards : News : СmDock "long" and "short" tasks applications (Message 1866)
Posted 18 Jan 2023 by Mad_Max
Post:
This new app loses its CPU/elapsed time stats if restarted (a full restart, without leaving it in memory). And so it loses points/credits as well.
At the same time, actual progress is NOT lost. That is, checkpoints are working. After restarting the app (a BOINC restart, or the BOINC manager just switching to another project without the "leave in memory while suspended" option active), calculations continue from the last checkpoint as intended, but all the stats counters reported to BOINC (elapsed time, CPU time and time elapsed since the last checkpoint) are reset to zero.
Examples of such tasks:
https://www.sidock.si/sidock/result.php?resultid=77568221 - 13,613.24/13,521.96 sec of elapsed/CPU time

https://www.sidock.si/sidock/result.php?resultid=77568222 - 13,731.34/13,642.55 sec of elapsed/CPU time and 543.41 credits

https://www.sidock.si/sidock/result.php?resultid=77568208 - 1,302/1,280 sec of elapsed/CPU time and 50.30 credits

https://www.sidock.si/sidock/result.php?resultid=77568215 - 11,880/11,686 sec of elapsed/CPU time and 431.07 credits

While the actual run times were about 40 000 - 60 000 sec for all of these tasks (~100 sec per ligand on average in docking_out.log, and there are 500 of them in each task).
I just restarted the computer a few times during the computation. After each restart the tasks continued from the checkpoint successfully, but all time counters were reset to zero each time.

The problem that users have recently complained about in other topics (a very small amount of credit granted for some tasks) is probably related to this as well: if a task was restarted often during the calculation, only the calculation time since the last restart is taken into account and evaluated. It looks like credit calculations are based on the CPU time used by the task as reported by BOINC.
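
To get a feeling for how much credit goes missing, here is a back-of-the-envelope sketch. It rests on one loud assumption: that credit scales linearly with reported CPU time, which the post only suspects and which the project has not confirmed. The numbers are those of result 77568222 above; the function name is made up for illustration.

```python
def estimate_credit(reported_credit, reported_cpu_s, actual_cpu_s):
    """Scale granted credit linearly from reported to actual CPU seconds.

    Assumption: credit is proportional to CPU time (not confirmed by the project).
    """
    if reported_cpu_s <= 0:
        raise ValueError("reported CPU time must be positive")
    return reported_credit * actual_cpu_s / reported_cpu_s


# Result 77568222: 543.41 credits for 13,642.55 s of reported CPU time,
# while the actual run time was roughly 40 000 - 60 000 s.
low = estimate_credit(543.41, 13642.55, 40_000)
high = estimate_credit(543.41, 13642.55, 60_000)
print(f"estimated credit if all time had been counted: {low:.0f} to {high:.0f}")
```

Under that assumption the task would have earned roughly three to four times the credit that was actually granted.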

P.S.
The BOINC progress bar (% of task completed) also resets to zero after each restart. But it is restored to the correct value after some time (usually a few minutes). The time counters are not restored.
13) Message boards : Number crunching : Tasks hanging - (Message 1799)
Posted 15 Jan 2023 by Mad_Max
Post:
I also see some such "stuck" tasks with the latest app (I never saw such behavior with the previous version).

CPU core is still fully used, but actual progress stops.

To make it worse, it seems the application has no "watchdog" timer (or it is set inadequately).
Normal tasks complete successfully in 1.5-3 hours each on a single core on my computers (depending on the CPU - I have a few different ones), but a bad one can occupy a processor for a day or two and never end until I cancel or restart it. Without such a failure, 10-20 other tasks could have been completed on the same core in that time.

If you do not manage to find and eliminate the root cause of the failures, I would recommend adding a guard timer. And preferably not for the entire task (WU - BOINC work unit) but for the many separate micro-tasks inside it (modeling attempts; judging by the logs, about 500 of them are packed into each "long" task by default).
If such an individual micro-task has not finished after 10-15 minutes (normal run times on relatively modern CPUs are <1 min), it will never finish and should be restarted or canceled.
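
The guard timer suggested above could look roughly like this. It is only a conceptual sketch in Python and not how the SiDock application is built: it assumes each micro-task can be run as a separate command, which may not match the real app, and the function name, the retry count and the example command line are all hypothetical.

```python
import subprocess


def run_with_watchdog(cmd, timeout_s, max_attempts=2):
    """Run `cmd`; if it exceeds `timeout_s` seconds, kill it and try again.

    Returns the exit code of the first attempt that finishes in time,
    or None if every attempt hit the time limit.
    """
    for _attempt in range(max_attempts):
        try:
            return subprocess.run(cmd, timeout=timeout_s).returncode
        except subprocess.TimeoutExpired:
            # subprocess.run has already killed the child process here
            continue
    return None


# Hypothetical usage: allow 15 minutes per ligand, retry once, then skip it.
# code = run_with_watchdog(["dock_one_ligand", "ligand_0421.sd"], timeout_s=15 * 60)
```

With a limit of 10-15 minutes per micro-task, a hung ligand would cost at most about half an hour instead of blocking a core for days.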




©2024 SiDock@home Team