1)
Message boards :
Number crunching :
No short tasks
(Message 2327)
Posted 20 Oct 2024 by Mad_Max Post: I don't think there is a max job limit, unless it's some sort of multiple of number of cores. I easily get way more than 12 jobs at a time. Could it be your local preference settings or app_config? There is a limit on the number of tasks per host in SiDock@home:
19/10/2024 19:13:25 | SiDock@home | Sending scheduler request: To fetch work.
19/10/2024 19:13:25 | SiDock@home | Requesting new tasks for CPU
19/10/2024 19:13:26 | SiDock@home | Scheduler request completed: got 0 new tasks
19/10/2024 19:13:26 | SiDock@home | No tasks sent
19/10/2024 19:13:26 | SiDock@home | This computer has reached a limit on tasks in progress
19/10/2024 19:13:26 | SiDock@home | Project requested delay of 21 seconds
19/10/2024 19:28:15 | SiDock@home | Sending scheduler request: To fetch work.
19/10/2024 19:28:15 | SiDock@home | Requesting new tasks for CPU
19/10/2024 19:28:16 | SiDock@home | Scheduler request completed: got 0 new tasks
19/10/2024 19:28:16 | SiDock@home | No tasks sent
19/10/2024 19:28:16 | SiDock@home | This computer has reached a limit on tasks in progress
19/10/2024 19:28:16 | SiDock@home | Project requested delay of 21 seconds
But you were right in your guess: it is not set as a fixed number, but as a multiple of the number of cores. At the moment, it is set to a maximum of 4 WUs per CPU core; more precisely, 4 WUs per CPU thread (virtual core). By the way, it would be nice to increase this limit! It was set to 4 WUs back when the project had very long COVID tasks, each of which took from 1 to 3 days to compute, depending on CPU performance. A limit of 4 tasks was more than enough then, because it corresponded to a cache of 4 to 12 days of non-stop (24/7) computation. But since work on the Ebola sub-project began, all the tasks are short. On a modern, decent processor one takes only a little over an hour (for example, on my AMD Ryzen 5600X @ 4 GHz with SMT off it takes 80-90 min of CPU time per WU on average). So the current limit of 4 tasks per CPU core is quite a hindrance now: the maximum work cache reaches only 5-7 hours of work before hitting this limit. |
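The cache arithmetic above can be checked with a small back-of-the-envelope helper. This is a sketch under stated assumptions: the limit is exactly 4 in-progress tasks per CPU thread, the running task counts as part of the cache, and the task durations are the averages quoted in the post. The function names are illustrative only.

```python
import math


def cache_hours(tasks_per_thread: int, avg_task_minutes: float) -> float:
    """Hours of work a host can hold before hitting a per-thread task limit.

    Each thread holds at most tasks_per_thread tasks (the running one included)
    and works through them one after another, so the thread count cancels out.
    """
    return tasks_per_thread * avg_task_minutes / 60.0


def tasks_needed(target_cache_hours: float, avg_task_minutes: float) -> int:
    """Hypothetical: per-thread limit needed to hold a cache of the given length."""
    return math.ceil(target_cache_hours * 60.0 / avg_task_minutes)


# Short Ebola tasks (80-90 min each) under a 4-per-thread limit:
print(cache_hours(4, 80), cache_hours(4, 90))  # roughly 5.3 and 6.0 hours
# Old COVID tasks (1-3 days each) under the same limit, in days:
print(cache_hours(4, 24 * 60) / 24, cache_hours(4, 3 * 24 * 60) / 24)
```

For example, holding a one-day cache of ~85-minute tasks would need a limit of about `tasks_needed(24, 85)` tasks per thread.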
2)
Questions and Answers :
Windows :
Not receiving tasks
(Message 2303)
Posted 22 Jul 2024 by Mad_Max Post: Someone has run into this situation; I think the issue was that administrator rights are needed, and downloading the file simply doesn't replace it. Most likely; strange that no warning appeared. I know this thread is old, but this may be useful to someone, since the problem is not obvious. These are "quirks" of the Windows file virtualization service. http://msdn.microsoft.com/en-us/library/windows/desktop/bb756960.aspx By default (unless it is explicitly disabled via the registry or group policy), when a program that does NOT have administrator rights tries to write something to a protected folder (one that can only be written to with admin rights), Windows does not reject it with a "NOT ALLOWED!" error. Instead, it creates a virtual store for that program (by default in Users\<current user>\AppData\Local\VirtualStore\) and writes the files there. When the same program later reads the file from that folder, it gets its own copy back. But no other program sees this file anymore, because it actually lives in a completely different place. And if a file with that name was already there, other programs see its original/previous version. Only the program that tried to replace the file sees the replaced version (because, for that program specifically, reads are also redirected to its personal virtual store). This was done mainly for compatibility with old programs that know nothing about file system access rights (for example, programs originally developed for FAT or other simple file systems) and that stop working entirely when they receive such a denial, because handling that situation was never built into them. In this particular case, the browser was being virtualized when it tried to download a file from the internet directly into one of the system folders.
P.S. On my machines I disabled this feature completely: it is better for a program to fail immediately with an error (because it cannot write the file) than to puzzle later over why something doesn't work or doesn't update because a file ended up in the virtual store. For the couple of old programs that did turn out to need it, I simply granted access rights instead. Not full admin rights (which is bad from a security point of view), but specifically write permission to the particular folder/file they needed in order to work. |
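To check whether a file has been silently redirected, you can look for its shadow copy in VirtualStore. The Python sketch below assumes the usual mapping (the original path, minus its drive letter, appended under %LOCALAPPDATA%\VirtualStore); the function names and the example path are hypothetical, and the code only locates the copy, it does not change any settings.

```python
import ntpath
import os
from typing import Optional


def virtual_store_path(protected_path: str, local_appdata: str) -> str:
    """Where UAC file virtualization would redirect a write to protected_path.

    Assumed layout: drop the drive letter and append the rest under
    <LOCALAPPDATA>\\VirtualStore, e.g.
    C:\\Program Files\\App\\x.ini -> <LOCALAPPDATA>\\VirtualStore\\Program Files\\App\\x.ini
    """
    _drive, rest = ntpath.splitdrive(protected_path)
    return ntpath.join(local_appdata, "VirtualStore", rest.lstrip("\\/"))


def find_virtualized_copy(protected_path: str) -> Optional[str]:
    """Return the path of an existing shadow copy, or None if there is none."""
    local = os.environ.get("LOCALAPPDATA")
    if not local:  # not on Windows, or the variable is missing
        return None
    candidate = virtual_store_path(protected_path, local)
    return candidate if os.path.exists(candidate) else None
```

If `find_virtualized_copy(...)` returns a path, the program that wrote the file has been reading and writing that copy instead of the original.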
3)
Message boards :
Number crunching :
Gpu app?
(Message 2302)
Posted 22 Jul 2024 by Mad_Max Post: Now it's been a full 8 months and counting, and still no updates? So we weren't anywhere close last year then... |
4)
Message boards :
Number crunching :
Multi CPU core enabled work units/tasks
(Message 2301)
Posted 22 Jul 2024 by Mad_Max Post: From the point of view of computational efficiency, the best possible multicore processing is X independent tasks, each running in its own single-threaded process, as is now (and has been from the start) implemented in SiDock and most other BOINC projects. Parallelism is provided at the data level and scales efficiently to any number of available cores/threads without additional effort. No multi-threaded application can surpass this; at best it can only come pretty close. The only significant drawback of the "X independent tasks, one per CPU core" approach compared with an MP (multi-core) app is much higher RAM consumption. But since RAM usage in SiDock specifically is quite low (usually under 200 MB per running process), SiDock does not really need MP applications at all. All of the developers' available resources and time would be much better spent on completing the GPU application of CmDock instead, which has been in the works for more than two years now and which they promise to "finish soon" (the last such "soon" was about 8 months ago), but so far they have not been able to finish it. |
5)
Message boards :
Number crunching :
Is it possible to turn on the setting "Max # of simultaneous tasks" ?
(Message 2300)
Posted 22 Jul 2024 by Mad_Max Post: Note that the <max_concurrent> tag in the app_config.xml file sets the maximum number of tasks of a specific application to run at a given time. It does not control the work cache size, meaning if you set the max_concurrent to "2" say, it will download tasks to fill whatever cache size and the cpu thread to use that you specified. It will only run 2 tasks irrespective of your cache size. However there is a bug in the boinc manager (not sure what version that is) that could potentially load more tasks than you need and override the cache size. Just fyi. Actually, this setting DOES affect the work cache size. Not directly, in the sense that BOINC cannot download more tasks than the number specified in this parameter, but in the sense that no more tasks are downloaded than can be computed within the time specified in the cache size setting, using the number of threads specified in the max_concurrent setting. A simple example. Source data: the machine has 6 cores/12 threads and a cache setting of 2 days, and 1 task takes an average of 16 hours (note: the average computation time per task is for this particular machine, taken from the client's local statistics, not from the server). Without additional restrictions, BOINC can download up to 36 tasks from one project to such a machine:
2*24/16*12 = 36
and thus fill the entire cache with tasks from this one project alone. This may happen, for example, if at some point no tasks are available from the other attached projects. Now if, all other things being equal, we add the setting <project_max_concurrent>4</project_max_concurrent> to the same machine, BOINC will not download more than 12 tasks to the cache:
2*24/16*4 = 12
P.S. Yes, there was a critical bug in this mechanism in older versions of BOINC (below 7.20.xx), because of which, with the max_concurrent option active, tasks could be downloaded to the cache in huge quantities (up to 1000 tasks from one project, regardless of the specified cache size). But this bug was fixed about 2 years ago, in BOINC client version 7.20.0: https://boinc.berkeley.edu/wiki/Release_Notes Changes in 7.20.0. BOINC versions >= 7.20 should work correctly, according to the logic described at the beginning of this post: limiting both the number of simultaneously running tasks and the number of tasks in the work cache. |
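The arithmetic in the example above can be written as a tiny helper. This is only a model of the rule described in the post (cache length in hours, divided by the average task duration, times the number of usable threads, with max_concurrent capping that thread count); the real BOINC work-fetch logic has more inputs, and the function names are illustrative.

```python
import math
from typing import Optional


def effective_threads(cpu_threads: int, max_concurrent: Optional[int] = None) -> int:
    """Threads the client plans to use for one project; max_concurrent caps it."""
    return cpu_threads if max_concurrent is None else min(cpu_threads, max_concurrent)


def max_cached_tasks(cache_days: float, avg_task_hours: float, threads: int) -> int:
    """Upper bound on tasks fetched to fill the cache: days*24 / hours_per_task * threads."""
    return math.floor(cache_days * 24 / avg_task_hours * threads)


# 6 cores / 12 threads, 2-day cache, 16 h per task:
print(max_cached_tasks(2, 16, effective_threads(12)))     # no cap
print(max_cached_tasks(2, 16, effective_threads(12, 4)))  # project_max_concurrent = 4
```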
6)
Message boards :
Number crunching :
WU become longer and longer
(Message 2299)
Posted 22 Jul 2024 by Mad_Max Post: There are a couple of known bugs in the CmDock science app:
* If a task resumes from a checkpoint but gets restarted before it has saved another checkpoint, it might restart from the beginning or report completion and fail validation. (Also mentioned in threads here and here.) The bug has been reported.
Yes, the error itself was identified and fixed a long time ago (more than a year ago), but the problem remains that the fixed code is not used here in the actual application sent to BOINC clients; it has not been updated for more than 1.5 years:
Platform | Version | Created | Average computing
Microsoft Windows running on an AMD x86_64 or Intel EM64T CPU | 2.02 | 21 Jan 2023, 16:49:05 UTC | 24,051 GigaFLOPS
I even made a workaround for this problem on all of my Windows machines working for SiDock (4 of them currently), without waiting for the official fix. Here is its description; maybe it will be useful to someone, because you don't want to wait another year or more for an official fix. Since the checkpoint files themselves are written correctly and the problem is caused only by the LACK of a timestamp update when the checkpoint is written, I wrote a simple script that regularly does exactly that: it updates the modification date of all checkpoint files. The code is like this (Windows CMD batch file):
for /R "D:\Boinc\Data\slots\" %%a in (docking*.chk) do touch --no-create "%%a"
This one-line script does two things:
1 - recursively scans all sub-folders of the BOINC "slots" directory and finds all docking*.chk files, which store CmDock checkpoints;
2 - calls the "touch" CLI utility on them, which updates each file's timestamps to the current date and time without changing its contents.
"touch" is a standard *nix utility, but I added it to my Windows machines long ago, along with some other handy CLI tools (like "head" and "tail"), from the GnuWin32 package: https://gnuwin32.sourceforge.net/packages/coreutils.htm
After that, I set this script to run on a schedule every 10 minutes. The interval is chosen somewhat arbitrarily: long enough for the task to add at least one more checkpoint (although even if it doesn't, no problems result), and on the other hand short enough that the "lost" WU execution time on a restart is minimal. |
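If you would rather not install GnuWin32, the same workaround can be sketched in Python with only the standard library. This mirrors the CMD one-liner (recursive search for docking*.chk, then a timestamp update that, like `touch --no-create`, never creates files or alters contents); the function name is illustrative, and the slots path from the post is only an example.

```python
import os
import time
from pathlib import Path
from typing import List, Optional


def touch_checkpoints(slots_dir, pattern: str = "docking*.chk",
                      now: Optional[float] = None) -> List[Path]:
    """Set atime/mtime of every matching checkpoint file under slots_dir to `now`.

    Only existing files are touched (no file is created) and their contents are
    left untouched. Returns the list of files that were updated.
    """
    ts = time.time() if now is None else now
    updated = []
    for path in Path(slots_dir).rglob(pattern):
        if path.is_file():
            os.utime(path, (ts, ts))
            updated.append(path)
    return updated
```

Scheduled every 10 minutes (for example via Windows Task Scheduler), `touch_checkpoints(r"D:\Boinc\Data\slots")` would play the same role as the batch file.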
7)
Message boards :
Science :
Other uses of docking than viruses?
(Message 2236)
Posted 8 Apr 2024 by Mad_Max Post: Dear colleagues, And more than half a year has passed again. Apparently, "very soon" in some scientific circles can stretch indefinitely... lol |
8)
Message boards :
News :
Target # 22: corona_RdRp_v2
(Message 1948)
Posted 25 Jan 2023 by Mad_Max Post: For the RdRp_v2 target this is perfectly normal. These are long tasks: even on modern computers their computing time can reach up to a day of pure CPU time, and on older ones you can expect 2 days or more. WUs for the Sprot_delta target, by contrast, are relatively short (roughly 10-15 times less computing time than RdRp_v2). |
9)
Message boards :
Science :
Other uses of docking than viruses?
(Message 1947)
Posted 25 Jan 2023 by Mad_Max Post: I don't understand why so little time and attention is paid to GPU development in general (not only in your particular project). I know it is not an easy or simple task... But in my personal opinion, shared by many other active volunteers, for those tasks where it is applicable at all (and molecular dynamics is definitely one such area), developing/porting a GPU version should be the number one priority, because the work the project currently does in a year could then be done in just a few weeks. I'm not overestimating or exaggerating. This is not only because GPU computing is much more productive/fast, but also because in volunteer computing there is great demand for projects that do important, meaningful scientific work in biomedical research. Most existing BOINC projects for GPU (especially for AMD/Intel via OpenCL; for NVIDIA/CUDA the choice is somewhat larger) are devoted to pure theory far removed from everyday life, such as solving abstract mathematical problems or astrophysics. So when a new project appears that solves more applied and significant tasks, we can expect a large influx of new participants and computing power, in addition to the greater efficiency of GPU computing itself.
This was clearly demonstrated at least twice with medical projects within WCG (Help Conquer Cancer a few years ago and OpenPandemics last year): as soon as a well-functioning GPU application was developed and launched, the overall computation speed of the project increased not just several times, but by more than an order of magnitude. From that moment on, the amount of available computing power ceased to be a problem at all, and the project's overall performance/throughput was limited only by the server resources available for database operations and by the download servers that generate and process huge quantities of WUs for the volunteers' computers, standing in line and waiting to get more work to process. |
10)
Message boards :
Science :
Other uses of docking than viruses?
(Message 1946)
Posted 25 Jan 2023 by Mad_Max Post: I know that the CPU version is in active development. But until now, I had heard nothing at all about the development of the GPU version. And I was VERY surprised to read in an old (you could say archival!) forum topic that, according to the old plans, it should have been ready about a year ago. And there have been no updates since then: have the deadlines shifted a lot, have the plans changed, or was it canceled completely? If it was not canceled, what stage is it at now, and when, at least approximately, can we expect the first beta versions to test? |
11)
Message boards :
Science :
Other uses of docking than viruses?
(Message 1941)
Posted 24 Jan 2023 by Mad_Max Post:
Oops! The next year has not only begun, but has already ended (it was 2022, and 1.5 years have already passed). Yet not only has no new GPU docking application been released, a year after the originally planned deadline, but there has not even been any news or update about its development progress. What happened to this project, and where did it go, silently and without a trace? |
12)
Message boards :
News :
CmDock "long" and "short" tasks applications
(Message 1940)
Posted 24 Jan 2023 by Mad_Max Post: P.S. Here is an example of a "saved" WU. It had been running at 100% of one core for about 4 days (24/7) but had made no actual progress for the last ~1.5 days (judging by the last modification time of the <docking_out.log> file, whose timestamp is updated correctly when it is written). It's even more disappointing that it was stuck at 97%, just 3% from the finish, and restarting would mean losing all the CPU time and credits. But I updated the "last modified" timestamp of <docking_out.chk> before the restart, and BOINC correctly restored everything afterwards. You can see this from the runtime of less than 1 hour (CPU time 3371 s) after the restart; all the time since the WU's first start was restored and added correctly, and the result was reported just ~13 h before the deadline expired: https://www.sidock.si/sidock/result.php?resultid=77591932 Same with https://www.sidock.si/sidock/result.php?resultid=77607538: it has not finished yet (but it should have by the time you read this), and I also found it stuck, at 75% after ~3.5 days of running. The timestamp fix plus a restart seems to have fixed it too. |
13)
Message boards :
News :
CmDock "long" and "short" tasks applications
(Message 1938)
Posted 24 Jan 2023 by Mad_Max Post: After a special "restart test" under Ubuntu, I did the same test on Windows 10 + BOINC 7.16.11. About 1 hour before the tasks completed, I restarted a VM with Windows. The first task, for workunit 49655115, is complete. And as you can see, CPU time is not lost: Hello. I did some debugging too and have found part of the problem with checkpoints and the loss of CPU/elapsed time counters after each WU/BOINC/computer restart. It may indeed be OS-related, but it is not caused by the OS itself (since all other projects running on the same computer and in the same BOINC installation do not have such problems). I see it on all of my computers, but they all have the same OS version installed (Win7 Pro x64). Maybe it is something like a new OS API call that was added only in the latest Windows versions and does not work properly (partial support) on older ones? I monitored the files that a running SiDock WU writes to disk in its working "/slots/" folder when saving a checkpoint, and found very interesting things: while the checkpoint files are written to disk OK, the file metadata write is missing. After the app modifies these files:
docking_out.progress
docking_out.chk
their "last modified" timestamps do not change and always stay equal to the timestamp of the initial file creation at the WU's first startup. Doesn't look like a significant problem? It's just file timestamps... At first I thought so too, but just in case I tried changing the last-modified timestamp of the file "docking_out.chk", and BOINC immediately (instantly!) noticed it, created the file boinc_task_state.xml (which had been missing until then, even though the WU computation was already nearing its end), and updated the information in the GUI about the time of the last checkpoint (just a few seconds since the last checkpoint). The file "wrapper_checkpoint.txt" was also created at the same moment; it had been missing too, despite >20 hours of WU computation.
So maybe it is not BOINC but the intermediate wrapper app that depends so heavily on timestamps? I'm not familiar enough with the internal algorithms, and in general it looks like a strange (some would even call it stupid) programming decision... But it seems that BOINC (or its SiDock wrapper app?) detects that the actual science app has recorded a new checkpoint ONLY by the date/time of file modification. And if the timestamps are not updated when these files are written, it does not notice at all that a brand new checkpoint was recorded by the running application. Another interesting fact: when other files are modified, such as
docking_out.log
docking_log
docking_out
wrapper_checkpoint.txt
their timestamps are updated correctly after each modification/addition! For some reason, only the <docking_out.progress> and <docking_out.chk> files miss timestamp updates when they are updated. I don't even understand how this can happen at all... Why does writing files by the same program (not even just the same program, but the same process already loaded and running in memory), on the same computer and OS, and even in the same folder, update the modification timestamps in one case but not in others? Are these files handled by different modules written by different programmers, using different APIs/libs to access the disk? |
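The behavior described above is what you would expect if the checkpoint detector only compares modification times. As far as I know, the generic BOINC wrapper's job.xml offers a <checkpoint_filename> option that works along these lines, but whether SiDock uses it is an assumption. The Python sketch below is a hypothetical model, not the wrapper's actual code: it shows why rewriting a file without advancing its mtime goes unnoticed, and why manually bumping the timestamp is enough to trigger detection.

```python
import os
from typing import Optional


class MtimeCheckpointWatcher:
    """Hypothetical model of checkpoint detection by file modification time.

    poll() reports a new checkpoint only when the file's mtime advances. If the
    writer replaces the contents but the mtime stays the same, nothing is
    reported, which matches the symptom described in the post.
    """

    def __init__(self, path: str):
        self.path = path
        self.last_mtime = self._mtime()

    def _mtime(self) -> Optional[float]:
        try:
            return os.stat(self.path).st_mtime
        except FileNotFoundError:
            return None

    def poll(self) -> bool:
        current = self._mtime()
        if current is not None and (self.last_mtime is None or current > self.last_mtime):
            self.last_mtime = current
            return True
        return False
```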
14)
Message boards :
News :
CmDock "long" and "short" tasks applications
(Message 1881)
Posted 18 Jan 2023 by Mad_Max Post:
You can ignore these additional errors (all of them occurred only on the computer with hostid 25851, named Cruiser-2). I know the exact reason, and it is not related to SiDock or BOINC. It was a buggy third-party (non-BOINC) app running on the same computer, with a nasty memory leak. Yesterday it ate up all the memory (including virtual memory/the swap file, about 24 GB in total), and other programs started crashing because they ran out of RAM. After I noticed it and restarted it to free the trashed RAM, all these errors stopped immediately. But this has nothing to do with the problem of resets of the time counters, progress bar, and credit calculations, which I see on all of my computers. |
15)
Message boards :
News :
СmDock "long" and "short" tasks applications
(Message 1875)
Posted 18 Jan 2023 by Mad_Max Post: After restart time counts from last successful checkpoint (as for any other project). Yes, it works this way for any OTHER project, but not for SiDock: for SiDock it resets to zero after a restart! Maybe this is due to the use of a 2-level wrapped app (the app launched by BOINC is not the actual app but just a wrapper, which launches the actual app that does all the computation); I don't run any other projects with wrapped apps to compare. I just reproduced it again. There were 4 SiDock WUs running, plus a few WUs from another project (WCG this time, but it also works OK with Rosetta@Home, Einstein@Home and MilkyWay@Home). I restarted BOINC. For the WCG WUs, the CPU/elapsed time counters and progress bars were immediately restored to values close to those before the restart (from the latest checkpoint, I guess). But for all the SiDock WUs, all time counters and progress bars were reset to zero. After 5-7 minutes of computation the progress bars recovered to near their pre-restart values, but the CPU/elapsed times kept counting from the moment of the restart. Suspending/resuming WUs (with the "leave in RAM" option turned off) also kills the time counters, but keeps the progress bar % if the BOINC manager is not restarted. .... Oh, it looks like I have just found the problem (or at least part of it): BOINC does not see checkpoints from SiDock at all:
CPU time 01:05:37
CPU time since checkpoint 01:05:34
Elapsed time 01:05:46
Does SiDock use its own implementation of checkpoints? Or does it not report to BOINC properly after a checkpoint is saved? I know the checkpoints actually work fine, but BOINC does not see/know about them. Either way, BOINC thinks no checkpoint has been made for the WU, and that's why it resets the CPU/elapsed time counters. SiDock also does not report progress % to BOINC properly. In the working slot directories, in the boinc_task_state.xml files of all running SiDock WUs, I see <fraction_done>0.000000</fraction_done>, while in the BOINC GUI I see correct values.
Probably it reports progress via the API (app-to-client communication on the fly) but does not write the same info to the state file as it should? That could explain the strange progress bar behavior after a restart: BOINC always reads the files first, sees fraction_done = 0, and so reverts the progress bar in the GUI to zero too. Later it gets the actual progress % some other way and corrects the progress bar. P.S. I use the latest BOINC (v7.20.2) on x64 Windows. Maybe on *nix it works differently... |
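To check the fraction_done values yourself across all slots, a small Python script is enough. This sketch assumes only that each boinc_task_state.xml is well-formed XML with a <fraction_done> element somewhere under its root element, and that slot folders sit directly under the slots directory; the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET
from pathlib import Path
from typing import Dict, Optional


def read_fraction_done(xml_text: str) -> Optional[float]:
    """Return the <fraction_done> value from a task state document, or None if absent."""
    root = ET.fromstring(xml_text)
    node = root.find(".//fraction_done")
    if node is None or node.text is None:
        return None
    return float(node.text)


def scan_slots(slots_dir) -> Dict[str, Optional[float]]:
    """Map each slot folder name to the fraction_done stored in its state file."""
    results = {}
    for state in Path(slots_dir).glob("*/boinc_task_state.xml"):
        results[state.parent.name] = read_fraction_done(state.read_text())
    return results
```

A slot that shows 0.0 here while the manager GUI shows real progress is exactly the mismatch described above.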
16)
Message boards :
Number crunching :
Tasks hanging -
(Message 1870)
Posted 18 Jan 2023 by Mad_Max Post:
It loses (resets to zero) the CPU time stats after each restart (a full restart, without leaving the task in RAM), so only the CPU/elapsed time since the last app restart is counted. Looks like another bug... I had already posted about it in detail in another thread before I saw your message: https://www.sidock.si/sidock/forum_thread.php?id=225&postid=1866#1866 |
17)
Message boards :
Number crunching :
Tasks hanging -
(Message 1869)
Posted 18 Jan 2023 by Mad_Max Post:
No. I have only seen "hung" tasks in Sprot_delta. RdRp_v2_sample runs OK (at least I have never come across a hung task from this series). It's just that these tasks are MUCH longer (about 10-20 times) than the previous ones from the Sprot_delta series. Computation times exceeding a day (and on weak computers, more than 2 days of non-stop computing) are a NORMAL situation for these tasks, not a failure! Although such long tasks can be a problem in themselves: the admins need to at least increase the BOINC deadline for them, because weaker computers (or modern ones that don't run 24/7, but only a few hours a day) simply will not have enough time to finish all the calculations before the deadline. |
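The deadline concern is simple arithmetic: a host that crunches only part of each day needs proportionally more calendar days. The sketch below makes that explicit; the deadline value you compare against is whatever the project sets, and the example numbers are hypothetical.

```python
def days_to_finish(cpu_hours_needed: float, hours_per_day: float) -> float:
    """Calendar days needed when the host computes only hours_per_day each day."""
    return cpu_hours_needed / hours_per_day


def fits_deadline(cpu_hours_needed: float, hours_per_day: float, deadline_days: float) -> bool:
    """True if the task can finish before a (hypothetical) deadline in days."""
    return days_to_finish(cpu_hours_needed, hours_per_day) <= deadline_days


# A task needing 24 CPU hours on a host that runs 6 hours a day:
print(days_to_finish(24, 6))  # 4 calendar days
```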
18)
Message boards :
News :
CmDock "long" and "short" tasks applications
(Message 1866)
Posted 18 Jan 2023 by Mad_Max Post: This new app loses its CPU/elapsed time stats if restarted (full restarts, without leaving it in memory), and so loses points/credits as well. At the same time, actual progress is NOT lost; that is, checkpoints work. After the app is restarted (a BOINC restart, or the BOINC manager simply switching to another project without the "leave in memory while suspended" option active), calculations continue from the last checkpoint as intended, but all the stats counters reported to BOINC (elapsed time, CPU time, and time since the last checkpoint) are reset to zero. Examples of such tasks:
https://www.sidock.si/sidock/result.php?resultid=77568221 - 13,613.24/13,521.96 sec of elapsed/CPU time
https://www.sidock.si/sidock/result.php?resultid=77568222 - 13,731.34/13,642.55 sec of elapsed/CPU time and 543.41 credits
https://www.sidock.si/sidock/result.php?resultid=77568208 - 1,302/1,280 sec of elapsed/CPU time and 50.30 credits
https://www.sidock.si/sidock/result.php?resultid=77568215 - 11,880/11,686 sec of elapsed/CPU time and 431.07 credits
The actual run times were about 40,000-60,000 sec for all of these tasks (~100 sec per ligand on average according to docking_out.log, and there are 500 ligands in each task). I simply restarted the computer a few times during the computation, and after each restart the tasks continued from a checkpoint successfully, but all the time counters were reset to zero each time. The problem users have recently complained about in other topics (a very small amount of credit granted for some tasks) is probably related to this as well: if a task was restarted often during computation, then only the computation time since the last restart is taken into account and credited, since credit calculations appear to be based on the CPU time used by the task as reported by BOINC. P.S. The BOINC progress bar (% of task completed) also resets to zero after each restart, but it is restored to the correct value after some time (usually a few minutes). The time counters are not restored. |
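The size of the loss can be estimated from the log figures above. This sketch assumes, as the post suspects, that credit is proportional to the CPU time reported by BOINC, and uses the post's rough figures of 500 ligands at ~100 sec each; the function names are illustrative, and the results are estimates, not project data.

```python
def estimated_cpu_sec(ligands: int = 500, sec_per_ligand: float = 100.0) -> float:
    """Rough real CPU time of one task, from the average time per ligand in the log."""
    return ligands * sec_per_ligand


def reported_share(reported_cpu_sec: float, ligands: int = 500,
                   sec_per_ligand: float = 100.0) -> float:
    """Fraction of the estimated real CPU time that BOINC actually reported."""
    return reported_cpu_sec / estimated_cpu_sec(ligands, sec_per_ligand)


def proportional_credit(granted_credit: float, reported_cpu_sec: float,
                        full_cpu_sec: float) -> float:
    """Credit the task would get if credit scaled linearly with reported CPU time (assumption)."""
    return granted_credit * full_cpu_sec / reported_cpu_sec


# First example task: 13,521.96 s reported out of roughly 50,000 s of real work.
print(round(reported_share(13521.96), 2))
```

With these assumptions, only about a quarter of the real CPU time of that task was reported.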
19)
Message boards :
Number crunching :
Tasks hanging -
(Message 1799)
Posted 15 Jan 2023 by Mad_Max Post: I also see some of these "stuck" tasks with the latest app (I never saw such behavior with the previous version). The CPU core is still fully used, but actual progress stops. To make it worse, it seems the application has no "watchdog" timer (or its settings are inadequate). Normal tasks complete successfully in 1.5-3 hours each on a single core on my computers (depending on the CPU; I have a few different ones), but a bad one can occupy a processor for a day or two and never end until I cancel or restart it. During that time, if there had been no such failure, 10-20 other tasks could have been completed on the same core. If you do not manage to find the root cause of the failures and eliminate it, I would recommend adding a watchdog timer. And preferably not for the entire task (the WU, BOINC work unit): there are actually many separate micro-tasks in it (modeling attempts; judging by the logs, about 500 of them are packed into each "long" task by default). If such an individual micro-task does not finish within 10-15 minutes (normal run times on relatively modern CPUs are under 1 min), it will never finish and should be restarted or canceled. |
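The suggested per-micro-task guard could look roughly like the Python sketch below. It is a hypothetical design, not CmDock code: the class name, the 15-minute default (taken from the 10-15 minute threshold above) and keying micro-tasks by an ID such as a ligand name are assumptions. Deciding what to do with an overdue micro-task (restart or skip it) is left to the caller.

```python
import time
from typing import Callable, Dict, List


class MicroTaskWatchdog:
    """Hypothetical per-micro-task watchdog.

    Records when each micro-task (e.g. one docking run for one ligand) started
    and flags those running longer than timeout_sec, instead of timing the
    whole WU.
    """

    def __init__(self, timeout_sec: float = 15 * 60,
                 clock: Callable[[], float] = time.monotonic):
        self.timeout_sec = timeout_sec
        self.clock = clock
        self.started: Dict[str, float] = {}

    def start(self, task_id: str) -> None:
        self.started[task_id] = self.clock()

    def finish(self, task_id: str) -> None:
        self.started.pop(task_id, None)

    def overdue(self) -> List[str]:
        """IDs of micro-tasks that have exceeded the timeout and are still running."""
        now = self.clock()
        return [tid for tid, t0 in self.started.items() if now - t0 > self.timeout_sec]
```

Injecting the clock keeps the class easy to test and lets the app use wall-clock or CPU-time measurements, whichever suits it better.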
©2025 SiDock@home Team