CmDock "long" and "short" tasks applications
Author | Message |
---|---|
Send message Joined: 11 Oct 20 Posts: 334 Credit: 25,572,042 RAC: 6,715 |
Hi folks! Some news: we have decided to stop using one of the application's new functions. A new "sub-version" of the cmdock-l application, number 2.02, has been deployed. Please report any problems that occur. |
Send message Joined: 10 Dec 20 Posts: 24 Credit: 10,767,590 RAC: 0 |
I suggest you move on to a different project (like I and others) or simply stop until the issues are resolved. It's not worth getting bent out of shape over. I have WUs still in progress that I'm spending an hour or more per day babysitting, and I want to keep working on this project. They need to hear how much time we are spending on managing their project's WUs. We're not directly being paid for this work, but they are. COVID research is highly paid research and the vaccine companies have made record profits over the last 2 years. I already watch top contributors leaving this project and I will shortly follow when there is no improvement. Greger, the top contributor of our team, who had a 230k RAC, has pulled out. To put this in perspective: SiDock has a beta test server that one of my machines is still waiting on tasks from. They have beta test capability, with volunteers willing to accept the risks of beta WUs, and it could have been used to prevent what happened last week. |
Send message Joined: 11 Oct 20 Posts: 334 Credit: 25,572,042 RAC: 6,715 |
After the special "restart test" under Ubuntu, I ran the same test on Windows 10 + BOINC 7.16.11. About 1 hour before the tasks were due to complete, I restarted the Windows VM. The first task, for workunit 49655115, is complete, and as you can see, CPU time was not lost (columns: task, workunit, sent, reported, status, run time (s), CPU time (s), credit, application): 77614604 49655066 20 Jan 2023, 19:14:36 UTC 21 Jan 2023, 20:29:48 UTC Completed and validated 85,387.91 85,211.20 1,004.96 CurieMarieDock 0.2.0 long tasks v2.00 windows_x86_64; 77614663 49655123 20 Jan 2023, 19:14:36 UTC 21 Jan 2023, 20:00:16 UTC Completed and validated 76,331.40 76,198.72 898.37 CurieMarieDock 0.2.0 long tasks v2.00 windows_x86_64; 77614655 49655115 20 Jan 2023, 19:14:35 UTC 21 Jan 2023, 18:28:11 UTC Completed and validated 76,226.16 76,041.19 897.14 CurieMarieDock 0.2.0 long tasks v2.00 windows_x86_64; 77614657 49655117 20 Jan 2023, 19:14:35 UTC 21 Jan 2023, 20:29:20 UTC Completed and validated 81,194.31 80,985.95 955.61 CurieMarieDock 0.2.0 long tasks v2.00 windows_x86_64. Maybe it does not work on some systems, but on the Windows system I have at hand it works. |
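One way to read the four results above is to compare each task's reported run time with the interval between "sent" and "reported". The sketch below does only that arithmetic; the timestamps and run times are copied from the post, while the function name and the interpretation (that a counter reset about an hour before completion would have left only roughly 3,600 s of run time) are assumptions made for illustration.

```python
from datetime import datetime

# (sent, reported, run time in seconds), copied from the results above; all times UTC.
TASKS = {
    77614604: (datetime(2023, 1, 20, 19, 14, 36), datetime(2023, 1, 21, 20, 29, 48), 85387.91),
    77614663: (datetime(2023, 1, 20, 19, 14, 36), datetime(2023, 1, 21, 20, 0, 16), 76331.40),
    77614655: (datetime(2023, 1, 20, 19, 14, 35), datetime(2023, 1, 21, 18, 28, 11), 76226.16),
    77614657: (datetime(2023, 1, 20, 19, 14, 35), datetime(2023, 1, 21, 20, 29, 20), 81194.31),
}


def run_time_share(sent: datetime, reported: datetime, run_time_s: float) -> float:
    """Fraction of the sent-to-reported interval that is accounted for as run time."""
    return run_time_s / (reported - sent).total_seconds()


if __name__ == "__main__":
    for task_id, (sent, reported, run_s) in TASKS.items():
        share = run_time_share(sent, reported, run_s)
        print(f"task {task_id}: run time covers {share:.1%} of the sent-to-reported interval")
```

If the counters had been reset by the restart, the shares would be a few percent rather than most of the interval, which is consistent with the claim in the post.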
Send message Joined: 11 Oct 20 Posts: 334 Credit: 25,572,042 RAC: 6,715 |
"SiDock has a beta test server that one of my machines is still waiting on tasks from." The first SiDock@home server (sidocktest) was frozen after its credit was moved to the main project, and it is now stopped. Before deployment, this application was tested for several days and no problems were found. For example, after one short ("Sprot") task hung on my computer, I ran several dozen (~40) copies of that task on my machine and all of them completed successfully, without any problem. Only with the help of an auxiliary server were we able to reproduce a similar hang for a long ("RdRp_v2") task. Actually, finding this problem is a side result of our project, but a very important one, and we have it already. |
Send message Joined: 24 Oct 20 Posts: 21 Credit: 10,159,102 RAC: 0 |
Yes, I did that. The long tasks are too long. Start running the badge program wuprop and start LOVING the extra long tasks!! http://wuprop.boinc-af.org/ |
Send message Joined: 18 Oct 22 Posts: 9 Credit: 13,517,152 RAC: 168,125 |
Folks, I was away for two days, so no babysitting for the WUs. Guess what, some tasks claimed to be > 100 hrs and still have to run. I suspended and resumed those items and they instantly returned to only a fraction of the already reported and wasted CPU time. That's it for me, I will quit the project. However I might return from time to time to review the postings. If something improved I might consider coming back. Take care and best of success. Chris |
Send message Joined: 10 Dec 20 Posts: 24 Credit: 10,767,590 RAC: 0 |
Folks, It's some improvement. If the tasks report they are checkpointing every 10 min or less then those are the 'good' WU. The ones that only create a checkpoint at the 1st second, and never again, are risky and will lose you credit if you restart to get them moving again. Not sure how many more of these non-checkpointing WU are to come (I'm posting a question about it in Crunching forum). You could abort every WU that refuses to checkpoint within 11 minutes after they start? They still have a high chance of completing as long as you do not pause the client (my electric company has peak hours of 9x pricing to avoid). I'm going to abort 1 of about 40 received on my 2700X in the last day. |
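The rule of thumb in the post above (a task that has only its first-second checkpoint after 11 minutes is a "risky" one) can be sketched as a small predicate. This is a minimal illustration, not a BOINC API: the `TaskStatus` record and its field names are hypothetical, and the values would have to be read by hand from the BOINC Manager or some other source.

```python
from dataclasses import dataclass


@dataclass
class TaskStatus:
    """Hypothetical snapshot of one task; the field names are illustrative only."""
    name: str
    elapsed_s: float          # how long the task has been running
    last_checkpoint_s: float  # elapsed time at which the last checkpoint was written


def never_checkpoints(task: TaskStatus,
                      grace_s: float = 11 * 60,
                      first_checkpoint_s: float = 1.0) -> bool:
    """True if the task is past the grace period and only has its start-up checkpoint."""
    return task.elapsed_s > grace_s and task.last_checkpoint_s <= first_checkpoint_s


if __name__ == "__main__":
    tasks = [
        TaskStatus("good_wu", elapsed_s=1800.0, last_checkpoint_s=1500.0),
        TaskStatus("risky_wu", elapsed_s=1800.0, last_checkpoint_s=1.0),
    ]
    for task in tasks:
        label = "risky" if never_checkpoints(task) else "ok"
        print(f"{task.name}: {label}")
```

The 11-minute grace period comes from the post; whether to abort a flagged task or leave it running without pausing the client remains the judgment call described there.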
Send message Joined: 10 Dec 20 Posts: 24 Credit: 10,767,590 RAC: 0 |
"After the special 'restart test' under Ubuntu, I ran the same test on Windows 10 + BOINC 7.16.11. About 1 hour before the tasks were due to complete, I restarted the Windows VM. The first task, for workunit 49655115, is complete, and as you can see, CPU time was not lost:" With these results, could you please comment on my post about checkpointing and some WUs that still refuse to checkpoint, here: https://www.sidock.si/sidock/forum_thread.php?id=231 |
Send message Joined: 10 Dec 20 Posts: 24 Credit: 10,767,590 RAC: 0 |
"SiDock has a beta test server that one of my machines is still waiting on tasks from." Trying to clarify: so the WUs that went on for 3-4 days, and maybe hung, were sent intentionally, and the bug in the application you were trying to uncover was found through the results returned by the BOINC community running SiDock last week? And the beta server is down, so you only beta test in-house now and then send the new apps straight out? |
Send message Joined: 18 Oct 22 Posts: 9 Credit: 13,517,152 RAC: 168,125 |
Thanks Marmot, But I really expect that I can leave a system with its workload without standing aside watching and managing each WU.. This is a minimum that I feel I can get in return for donating my hardware and energy to a project. Chris |
Send message Joined: 11 Oct 20 Posts: 334 Credit: 25,572,042 RAC: 6,715 |
"SiDock has a beta test server that one of my machines is still waiting on tasks from." It looks like I need to repeat myself. SiDock@home never had a separate project for wide testing (like Albert@Home). In the first phase, the project did not export statistics, and because of this it was called SiDockTest. After a while, we turned on export and renamed the project to SiDock@home. Later, we migrated to another infrastructure, and the first server was shut down and saved as an image file only. The new (at this moment no longer the latest) version of the application was initially tested by the developers, then used for internal computing over a sample set, and then deployed. Only in the real environment was this unexpected behavior caught. About 15 hours after detection, on January 9th, a recommendation was posted to restart BOINC as the more reliable way to solve the problem. There was no need to "babysit" tasks, because you could simply restart the BOINC client on your machines 1 or 2 times per day. Moreover, you could do this with Windows Scheduler. Were you afraid that the CPU time before the last checkpoint on your computers would not be saved? You could have tested it on one of your machines and then restarted BOINC on all the others, instead of "forum sitting" in several threads simultaneously for the next two weeks. What we have with the application at this moment: 3 days ago we deployed a new subversion of the application (# 2.02) that excludes the use of one of the new features, which presumably leads to the hangs. If anyone faces any problems with this application, please post about it. Most of the tasks linked to the previous version will be processed during the next several days, and the problem will go away, I hope. Of course, the absence of a problem is much better than having one. But if we have identified "the problem habitat" correctly, the hang is not predetermined: it depends on random numbers, and because of this it does not reproduce after a restart. Many thanks to all crunchers for catching this problem. |
Send message Joined: 27 Dec 21 Posts: 19 Credit: 16,468,469 RAC: 12,268 |
"After the special 'restart test' under Ubuntu, I ran the same test on Windows 10 + BOINC 7.16.11. About 1 hour before the tasks were due to complete, I restarted the Windows VM. The first task, for workunit 49655115, is complete, and as you can see, CPU time was not lost:" Hello. I did some debugging too and have found part of the problem with checkpoints and losing the CPU/elapsed time counters after each WU/BOINC/computer restart. It may indeed be OS-related, but not caused by the OS itself (all other projects running on the same computer and in the same BOINC installation do not have such problems). I see it on all of my computers, but they all have the same OS version installed (Win7 Pro x64). Maybe it is something like a new OS API call that was added only in the latest Windows versions and does not work properly (partial support) on older versions? I monitored the files that a running SiDock WU writes to disk during a checkpoint save in its working "/slots/" folder, and found very interesting things: while the checkpoint files are written to disk OK, the file metadata is not updated. After the app modifies the files docking_out.progress and docking_out.chk, their "last modified" timestamps do not change and always stay equal to the timestamp of the initial file creation at the WU's first start-up. Doesn't look like a significant problem? It's just file timestamps... At first I thought so too, but just in case I tried changing the last-modified timestamp of the file "docking_out.chk": BOINC immediately (instantly!) noticed it, created the file boinc_task_state.xml (it had been missing until then, despite the fact that the calculation of the WU was already coming to an end) and updated the information in the GUI about the time of the last checkpoint (just a few seconds since the last checkpoint). The file "wrapper_checkpoint.txt" was also created at the same moment. It had been missing too, despite >20 hours of WU computation. So maybe it is not BOINC but the intermediate wrapper app that depends so much on timestamps?
I'm not familiar enough with the internal program algorithms, and in general it looks like a strange (some would even call it stupid) programming decision... But it seems that BOINC (or is it a failure of the SiDock wrapper app?) determines that the science app has actually recorded a new checkpoint ONLY by the date/time of file modification. If the timestamps do not update when these files are written, it does not notice at all that a brand-new checkpoint was recorded by the working application. Another interesting fact: when other files such as docking_out.log, docking_log, docking_out and wrapper_checkpoint.txt are modified, their timestamps are updated correctly after each modification/addition! For some reason only the <docking_out.progress> and <docking_out.chk> files miss timestamp updates when they are updated. I don't even understand how this can happen at all... Why does writing files by the same program (not even just the same program, but the same process already loaded and running in memory), on the same computer and OS and even in the same folder, update the modification timestamp in one case and not in others? Are these files used in different modules written by different programmers, using different APIs/libs to access disk functions? |
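The manual check and workaround described in the post above (watching the "last modified" times in the slot folder, then bumping the timestamp of the checkpoint file by hand) could be sketched roughly as follows. The two file names come from the post; the function names, the 10-minute staleness threshold and the temporary directory standing in for a real slot folder are assumptions for illustration, and nothing here is part of BOINC or CmDock.

```python
import os
import tempfile
import time
from pathlib import Path

# File names taken from the post; everything else in this sketch is illustrative.
CHECKPOINT_FILES = ("docking_out.progress", "docking_out.chk")


def stale_files(slot_dir, names=CHECKPOINT_FILES, max_age_s=600.0, now=None):
    """Return the existing files whose last-modified time is older than max_age_s."""
    now = time.time() if now is None else now
    stale = []
    for name in names:
        path = Path(slot_dir) / name
        if path.exists() and now - path.stat().st_mtime > max_age_s:
            stale.append(name)
    return stale


def touch(path):
    """Set the access and modification times of an existing file to the current time."""
    os.utime(path, None)


if __name__ == "__main__":
    # A temporary directory stands in for a real BOINC slot folder.
    with tempfile.TemporaryDirectory() as slot:
        chk = Path(slot) / "docking_out.chk"
        chk.write_bytes(b"checkpoint data")
        an_hour_ago = time.time() - 3600
        os.utime(chk, (an_hour_ago, an_hour_ago))  # simulate a timestamp that never moved
        print("stale before touch:", stale_files(slot))
        touch(chk)
        print("stale after touch:", stale_files(slot))
```

Touching the file only changes its metadata; whether that is enough for the wrapper to register a checkpoint is what the post reports observing, not something this sketch can confirm.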
Send message Joined: 27 Dec 21 Posts: 19 Credit: 16,468,469 RAC: 12,268 |
P.S. Here is an example of a "saved" WU. It had been running at 100% of one core for about 4 days (24/7) but had made no actual progress for the last ~1.5 days (judging by the last-modified time of the <docking_out.log> file, which is updated correctly when written to). It's even more disappointing that it was stuck at 97%, just 3% from the finish, and restarting would mean the loss of all CPU time and credits. But I updated the "last modified" timestamp of <docking_out.chk> before the restart, and BOINC correctly restored everything after the restart. You can tell by the less than 1 hour of run time (CPU time 3371) after the restart, yet all the time since the WU's first start was restored and added correctly, and it was reported just ~13h before the deadline expired: https://www.sidock.si/sidock/result.php?resultid=77591932 The same goes for https://www.sidock.si/sidock/result.php?resultid=77607538 although it has not finished yet (but it should have by the time you read this); I found it also stuck at 75% after ~3.5 days of running. The timestamp fix + restart seems to have fixed it too. |
Send message Joined: 11 Oct 20 Posts: 334 Credit: 25,572,042 RAC: 6,715 |
We will check what is happening with these files (docking_out.progress and docking_out.chk). Maybe something changed in the OS API, maybe not. Inside my Windows 10 VM, the metadata of these files does not stall during computing, but the changes do not happen at the same time as for the other files. Maybe that is normal, maybe not. |
Send message Joined: 3 Mar 22 Posts: 4 Credit: 8,334,432 RAC: 0 |
An up-to-date list of tasks for which processing time was exceeded. https://www.sidock.si/sidock/result.php?resultid=77646081 https://www.sidock.si/sidock/result.php?resultid=77646087 https://www.sidock.si/sidock/result.php?resultid=77646088 https://www.sidock.si/sidock/result.php?resultid=77646025 https://www.sidock.si/sidock/result.php?resultid=77624328 https://www.sidock.si/sidock/result.php?resultid=77624332 |
Send message Joined: 12 Nov 22 Posts: 1 Credit: 77,699 RAC: 0 |
Hi there! I'm using BOINC on a Raspberry Pi. Like the person above, I'm not sure how to accomplish "switching" to short tasks. Can someone with a Raspberry Pi chime in and give clarity? Thanks. |
Send message Joined: 11 Oct 20 Posts: 334 Credit: 25,572,042 RAC: 6,715 |
"Hi there!" Hi! My steps. Maybe not all of them are needed for you. Good luck! P.S. Also de-select all applications except "CurieMarieDock 0.2.0 short tasks" in the project preferences for the "venue" of your RPi computers. |
Send message Joined: 10 Feb 22 Posts: 24 Credit: 475,114,483 RAC: 112,482 |
Can confirm: "2.00" jobs were often/always using CPU cycles indefinitely; no problem with 2.02. I deleted all 2.00 jobs from all my queues, and that seems to be the solution to all the trouble. |
Send message Joined: 10 Feb 21 Posts: 21 Credit: 4,831,331 RAC: 0 |
"I did some debugging too and have found part of the problem with checkpoints and losing the CPU/elapsed time counters after each WU/BOINC/computer restart." I get this all the time. It seems to be a Windows thing. From the documentation of the WriteFile function: "When writing to a file, the last write time is not fully updated until all handles used for writing have been closed. Therefore, to ensure an accurate last write time, close the file handle immediately after writing to the file." So AFAICT what is happening is that CmDock is writing the checkpoint to docking_out.chk but not closing the handle, so the wrapper (which polls the last-write time of that file to report the last-checkpoint time back to BOINC) does not see any change. |
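The mechanism suggested above has two sides: a writer that should close its handle after each checkpoint, and a poller that notices a checkpoint only when the last-write time changes. The sketch below illustrates both under stated assumptions: it is not CmDock's or the BOINC wrapper's actual code, and `write_checkpoint` and `CheckpointWatcher` are hypothetical names.

```python
import os
import tempfile


def write_checkpoint(path, payload: bytes) -> None:
    """Write a checkpoint and close the handle before returning.

    Leaving the `with` block closes the file, which is the step the quoted
    WriteFile documentation recommends for an accurate last write time.
    """
    with open(path, "wb") as fh:
        fh.write(payload)
        fh.flush()
        os.fsync(fh.fileno())


class CheckpointWatcher:
    """Detects new checkpoints purely from changes in the file's modification time."""

    def __init__(self, path):
        self.path = path
        self.last_mtime_ns = self._mtime_ns()

    def _mtime_ns(self):
        try:
            return os.stat(self.path).st_mtime_ns
        except FileNotFoundError:
            return None

    def poll(self) -> bool:
        """Return True if the modification time changed since the previous poll."""
        current = self._mtime_ns()
        changed = current is not None and current != self.last_mtime_ns
        self.last_mtime_ns = current
        return changed


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as slot:
        chk = os.path.join(slot, "docking_out.chk")
        watcher = CheckpointWatcher(chk)
        write_checkpoint(chk, b"state 1")
        print("checkpoint seen:", watcher.poll())
        print("second poll without a new write:", watcher.poll())
```

A watcher built this way is blind to any write that does not move the timestamp, which matches the symptom reported earlier in the thread for docking_out.chk.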
Send message Joined: 11 Oct 20 Posts: 334 Credit: 25,572,042 RAC: 6,715 |
That is a good remark. I have posted the issue in the CmDock repository. Thank you! |
©2024 SiDock@home Team