CmDock "long" and "short" tasks applications

hoarfrost
Volunteer moderator
Project administrator
Project developer

Joined: 11 Oct 20
Posts: 332
Credit: 25,126,350
RAC: 12,910
Message 1913 - Posted: 21 Jan 2023, 16:58:08 UTC

Hi folks! Some news: we decided to exclude the use of one of the application's new functions. A new "sub-version" of the cmdock-l application, number 2.02, has been deployed. Please report any problems that occur.
marmot

Joined: 10 Dec 20
Posts: 24
Credit: 10,767,590
RAC: 0
Message 1914 - Posted: 21 Jan 2023, 17:26:35 UTC - in response to Message 1908.  
Last modified: 21 Jan 2023, 17:44:02 UTC

I suggest you move on to a different project (like I and others) or simply stop until the issues are resolved. It's not worth getting bent out of shape over.

I have WUs still in progress that I'm spending an hour or more per day babysitting, and I want to keep working on this project.
They need to hear how much time we are spending on managing their project's WUs.
We're not directly being paid for this work; but they are.
COVID research is highly paid research and the vaccine companies have made record profits the last 2 years.

I already watch top contributors leaving this project and I will shortly follow when there is no improvement.
Greger, the top contributor on our team with a RAC of 230k, has pulled out.


To put this in perspective; SiDock has a beta test server that one of my machines is still waiting on tasks from.
They have beta test capability, with volunteers willing to accept the risks of beta WU's, and it could have been used to prevent what happened last week.
hoarfrost
Volunteer moderator
Project administrator
Project developer

Joined: 11 Oct 20
Posts: 332
Credit: 25,126,350
RAC: 12,910
Message 1921 - Posted: 21 Jan 2023, 18:41:38 UTC
Last modified: 21 Jan 2023, 20:33:11 UTC

After a special "restart test" under Ubuntu, I ran the same test on Windows 10 + BOINC 7.16.11. About 1 hour before the tasks were due to complete, I restarted the Windows VM. The first task, for workunit 49655115, is complete. And as you can see, the CPU time was not lost:
Task	Work unit	Sent	Time reported	Status	Run time (sec)	CPU time (sec)	Credit	Application
77614604	49655066	20 Jan 2023, 19:14:36 UTC	21 Jan 2023, 20:29:48 UTC	Completed and validated	85,387.91	85,211.20	1,004.96	CurieMarieDock 0.2.0 long tasks v2.00 windows_x86_64
77614663	49655123	20 Jan 2023, 19:14:36 UTC	21 Jan 2023, 20:00:16 UTC	Completed and validated	76,331.40	76,198.72	898.37	CurieMarieDock 0.2.0 long tasks v2.00 windows_x86_64
77614655	49655115	20 Jan 2023, 19:14:35 UTC	21 Jan 2023, 18:28:11 UTC	Completed and validated	76,226.16	76,041.19	897.14	CurieMarieDock 0.2.0 long tasks v2.00 windows_x86_64
77614657	49655117	20 Jan 2023, 19:14:35 UTC	21 Jan 2023, 20:29:20 UTC	Completed and validated	81,194.31	80,985.95	955.61	CurieMarieDock 0.2.0 long tasks v2.00 windows_x86_64

Maybe it does not work on some systems, but on the Windows system nearest to hand for me, it works.
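
One rough way to read the table above, offered as a sketch rather than as part of the original test: if the counters had been reset by the restart roughly 1 hour before completion, the reported CPU time would be on the order of 1 hour. The rows below are copied from the table; the 20-hour threshold is an arbitrary assumption chosen only for illustration, and the sent-to-reported interval is only an upper bound on run time because it includes any time the task waited in the queue.

```python
from datetime import datetime

# (sent, reported, run time in seconds, CPU time in seconds), copied from the table.
ROWS = [
    (datetime(2023, 1, 20, 19, 14, 36), datetime(2023, 1, 21, 20, 29, 48), 85387.91, 85211.20),
    (datetime(2023, 1, 20, 19, 14, 36), datetime(2023, 1, 21, 20, 0, 16), 76331.40, 76198.72),
    (datetime(2023, 1, 20, 19, 14, 35), datetime(2023, 1, 21, 18, 28, 11), 76226.16, 76041.19),
    (datetime(2023, 1, 20, 19, 14, 35), datetime(2023, 1, 21, 20, 29, 20), 81194.31, 80985.95),
]


def summarize(rows):
    """Compare the reported counters with the sent-to-reported wall-clock interval."""
    summary = []
    for sent, reported, run_s, cpu_s in rows:
        wall_s = (reported - sent).total_seconds()
        summary.append(
            {
                "wall_s": wall_s,                 # upper bound on the real run time
                "run_fraction": run_s / wall_s,   # share of that interval reported as run time
                "cpu_hours": cpu_s / 3600.0,      # reported CPU time in hours
            }
        )
    return summary


SUMMARY = summarize(ROWS)
for entry in SUMMARY:
    print(f"cpu_hours={entry['cpu_hours']:.1f}  run_fraction={entry['run_fraction']:.2f}")
```

Every row reports far more than one hour of CPU time, which is consistent with the counters surviving the restart in this particular test.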
hoarfrost
Volunteer moderator
Project administrator
Project developer

Joined: 11 Oct 20
Posts: 332
Credit: 25,126,350
RAC: 12,910
Message 1922 - Posted: 21 Jan 2023, 18:55:58 UTC - in response to Message 1914.  
Last modified: 21 Jan 2023, 20:28:16 UTC

SiDock has a beta test server that one of my machines is still waiting on tasks from.

The first SiDock@home server (sidocktest) was frozen after credit was moved to the main project, and it is now stopped. Before deployment, this application was tested for several days and no problems were found. For example, when one short ("Sprot") task hung on my computer, I ran several dozen (~40) copies of that task on my machine and all of them completed successfully, without any problem. Only with the help of an auxiliary server were we able to get a similar hang for a long ("RdRp_v2") task.

Actually, finding this problem is a side result of our project, but a very important one, and we have it already.
mikey
Joined: 24 Oct 20
Posts: 20
Credit: 10,159,102
RAC: 0
Message 1924 - Posted: 22 Jan 2023, 1:50:27 UTC - in response to Message 1909.  

Yes, I did that. The long tasks are too long.


Start running the badge program wuprop and start LOVING the extra long tasks!!

http://wuprop.boinc-af.org/
Crissy

Joined: 18 Oct 22
Posts: 9
Credit: 9,184,528
RAC: 0
Message 1927 - Posted: 22 Jan 2023, 23:31:52 UTC

Folks,

I was away for two days, so no babysitting for the WUs.
Guess what, some tasks claimed to be > 100 hrs and still have to run.
I suspended and resumed those items and they instantly returned to only a fraction of the already reported and wasted CPU time.
That's it for me, I will quit the project.
However I might return from time to time to review the postings. If something improved I might consider coming back.

Take care and best of success.

Chris
marmot

Joined: 10 Dec 20
Posts: 24
Credit: 10,767,590
RAC: 0
Message 1929 - Posted: 23 Jan 2023, 18:28:43 UTC - in response to Message 1927.  
Last modified: 23 Jan 2023, 18:45:52 UTC

Folks,

I was away for two days, so no babysitting for the WUs.
Guess what, some tasks claimed to be > 100 hrs and still have to run.
I suspended and resumed those items and they instantly returned to only a fraction of the already reported and wasted CPU time.
That's it for me, I will quit the project.
However I might return from time to time to review the postings. If something improved I might consider coming back.

Take care and best of success.

Chris


It's some improvement.

If the tasks report they are checkpointing every 10 min or less then those are the 'good' WU.

The ones that only create a checkpoint at the 1st second, and never again, are risky and will lose you credit if you restart to get them moving again.

Not sure how many more of these non-checkpointing WU are to come (I'm posting a question about it in Crunching forum).

You could abort every WU that refuses to checkpoint within 11 minutes after they start?

They still have a high chance of completing as long as you do not pause the client (my electric company has peak hours of 9x pricing to avoid).

I'm going to abort 1 of about 40 received on my 2700X in the last day.
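
The rule of thumb above could be written down roughly like this. It is only a sketch: `is_risky` and its parameters are hypothetical names, and the two inputs (elapsed time and the elapsed time at the last checkpoint) are assumed to be read off by hand from whatever your BOINC client shows for the task.

```python
GRACE_S = 11 * 60  # the 11-minute window suggested above


def is_risky(elapsed_s, last_checkpoint_elapsed_s, grace_s=GRACE_S):
    """Return True for a task that is past the grace window and whose only
    checkpoint is the one written in the first second of the run."""
    if elapsed_s < grace_s:
        # Too early to judge: the task may still write its first real checkpoint.
        return False
    return last_checkpoint_elapsed_s <= 1.0


# A task one hour in whose last checkpoint is still the one from the first second.
print(is_risky(3600, 1))
# A task one hour in that checkpointed at 50 minutes.
print(is_risky(3600, 3000))
```

The function only flags a task; whether to abort it or let it run uninterrupted is still a judgment call, as noted above.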
marmot

Joined: 10 Dec 20
Posts: 24
Credit: 10,767,590
RAC: 0
Message 1931 - Posted: 23 Jan 2023, 18:43:52 UTC - in response to Message 1921.  

After special "restart test" under Ubuntu, I did the same test for Windows 10 + BOINC 7.16.11. Before ~ 1 hour of tasks completion I restart a VM with Windows. First task, for workunit 49655115 is complete. And as you see, CPU time does not lost:
77614604	49655066	20 Jan 2023, 19:14:36 UTC	21 Jan 2023, 20:29:48 UTC	Completed and validated	85,387.91	85,211.20	1,004.96	CurieMarieDock 0.2.0 long tasks v2.00 windows_x86_64
77614663	49655123	20 Jan 2023, 19:14:36 UTC	21 Jan 2023, 20:00:16 UTC	Completed and validated	76,331.40	76,198.72	898.37	CurieMarieDock 0.2.0 long tasks v2.00 windows_x86_64
77614655	49655115	20 Jan 2023, 19:14:35 UTC	21 Jan 2023, 18:28:11 UTC	Completed and validated	76,226.16	76,041.19	897.14	CurieMarieDock 0.2.0 long tasks v2.00 windows_x86_64
77614657	49655117	20 Jan 2023, 19:14:35 UTC	21 Jan 2023, 20:29:20 UTC	Completed and validated	81,194.31	80,985.95	955.61	CurieMarieDock 0.2.0 long tasks v2.00 windows_x86_64

Maybe it does not work under some systems, but under nearest for me Windows system it works.


With these results, could you please comment on my post about checkpointing and some WU that still refuse to checkpoint here: https://www.sidock.si/sidock/forum_thread.php?id=231
marmot

Joined: 10 Dec 20
Posts: 24
Credit: 10,767,590
RAC: 0
Message 1932 - Posted: 23 Jan 2023, 18:48:00 UTC - in response to Message 1922.  
Last modified: 23 Jan 2023, 18:51:34 UTC

SiDock has a beta test server that one of my machines is still waiting on tasks from.

First server of SiDock@home (sidocktest) was frozen after credit was moved in main project and now is stopped. Before deploy this application was tested during several days and no any problems were found. For example: where one short ("Sprot") task is hung on my computer, I perform a run of several dozens (~40) of copies of this task on my machine and all of them were completed successfully, without any problem. Only with help of auxiliary server, we were able to get a similar task hung for a long ("RdRp_v2") task.

Actually, finding of this problem is a side, but very important result of our project also, received already now.


Trying to clarify.
So the WU that went on for 3-4 days, and maybe hung, were sent intentionally and the bug with the application you were trying to uncover was found by the results returned by the BOINC community running SiDock last week?
And the beta server is down and you only beta test in house now then send the new apps straight out?
Crissy

Joined: 18 Oct 22
Posts: 9
Credit: 9,184,528
RAC: 0
Message 1933 - Posted: 23 Jan 2023, 21:45:04 UTC - in response to Message 1929.  

Thanks Marmot,

But I really expect that I can leave a system with its workload without standing by, watching and managing each WU.
This is a minimum that I feel I can get in return for donating my hardware and energy to a project.

Chris
hoarfrost
Volunteer moderator
Project administrator
Project developer

Joined: 11 Oct 20
Posts: 332
Credit: 25,126,350
RAC: 12,910
Message 1935 - Posted: 23 Jan 2023, 23:13:26 UTC - in response to Message 1932.  
Last modified: 23 Jan 2023, 23:18:12 UTC

SiDock has a beta test server that one of my machines is still waiting on tasks from.

First server of SiDock@home (sidocktest) was frozen after credit was moved in main project and now is stopped. Before deploy this application was tested during several days and no any problems were found. For example: where one short ("Sprot") task is hung on my computer, I perform a run of several dozens (~40) of copies of this task on my machine and all of them were completed successfully, without any problem. Only with help of auxiliary server, we were able to get a similar task hung for a long ("RdRp_v2") task.

Actually, finding of this problem is a side, but very important result of our project also, received already now.


Trying to clarify.
So the WU that went on for 3-4 days, and maybe hung, were sent intentionally and the bug with the application you were trying to uncover was found by the results returned by the BOINC community running SiDock last week?
And the beta server is down and you only beta test in house now then send the new apps straight out?

It looks like I need to repeat myself.
SiDock@home never had a separate project for wide tests (like Albert@Home). In the first phase, the project did not export statistics, and because of this it was called SiDockTest. After a while, we turned on the export and renamed the project to SiDock@home. Later, we migrated to another infrastructure, and the first server was shut down and saved as an image file only.
The new (at this moment no longer the latest) version of the application was first tested by the developers, then used for internal computing over a sample set, and only then deployed. Only in the real environment was this unexpected behavior caught. About 15 hours after detection, on January 9th, a recommendation was posted to use a BOINC restart as the more reliable way to solve the problem. There was no need to "babysit" tasks, because you could simply restart the BOINC client on your machines 1 or 2 times per day. Moreover, you could do this with the Windows Task Scheduler.
Were you afraid that the CPU time before the last checkpoint on your computers would not be saved? You could have tested it on one of your machines and then restarted BOINC on all the others, instead of "forum sitting" in several threads simultaneously for the next two weeks.


Where we are with the application at this moment: 3 days ago we deployed a new subversion of the application (# 2.02) that excludes the use of one of the new features, which presumably leads to the hangs. If anyone faces problems with this application, please post about it. Most of the tasks linked to the previous version will be processed during the next several days, and the problem will go away, I hope.

Of course, the absence of a problem is much better than having one. But if we have identified "the problem habitat" correctly, the hang is not predetermined: it depends on random numbers, and because of this it does not reproduce after a restart. Many thanks to all crunchers for catching this problem.
Mad_Max

Joined: 27 Dec 21
Posts: 18
Credit: 15,630,025
RAC: 19,073
Message 1938 - Posted: 24 Jan 2023, 1:03:22 UTC - in response to Message 1921.  
Last modified: 24 Jan 2023, 1:53:06 UTC

After special "restart test" under Ubuntu, I did the same test for Windows 10 + BOINC 7.16.11. Before ~ 1 hour of tasks completion I restart a VM with Windows. First task, for workunit 49655115 is complete. And as you see, CPU time does not lost:
.........
Maybe it does not work under some systems, but under nearest for me Windows system it works.

Hello.
I did some debugging too and found part of the problem with checkpoints and with losing the CPU/elapsed time counters after each WU/BOINC/computer restart.
It may indeed be OS related, but it is not caused by the OS itself (all other projects running on the same computer and in the same BOINC installation do not have such problems). I see it on all of my computers, but they all have the same OS version installed (Win7 Pro x64).
Maybe it is something like a new OS API call that was added only in the latest Windows versions and does not work properly (partial support) on older versions?

I monitored the files that a running SiDock WU writes to disk during a checkpoint save in the working "/slots/" folder.
And I found something very interesting: while the checkpoint files are written to disk OK, the file metadata is not updated. After the app modifies these files:
docking_out.progress
docking_out.chk

the "last modified" file timestamps do not change and always stay equal to the timestamp of the initial file creation at the WU's first start.

Doesn't look like a significant problem? It's just file timestamps...
At first I thought so too, but just in case I tried changing the last-modified timestamp of the file "docking_out.chk". BOINC immediately (instantly!) noticed it, created a file boinc_task_state.xml (which had been missing until then, even though the WU's calculation was already coming to an end) and updated the time of the last checkpoint in the GUI (just a few seconds since the last checkpoint).
The file "wrapper_checkpoint.txt" was also created at the same moment. It had been missing too, despite >20 hours of WU computation. So maybe it is not BOINC but the intermediate wrapper app that depends so heavily on timestamps?

I'm not familiar enough with the internal program algorithms, and in general it looks like a strange (some would even call it stupid) programming decision... But it seems that BOINC (or is it a failure of the SiDock wrapper app?) determines whether the science app has actually recorded a new checkpoint ONLY by the modification date/time of the files? And if the timestamps are not updated when these files are written, it does not notice at all that a brand new checkpoint was recorded by the working application.

Another interesting fact:
When other files are modified, such as
docking_out.log
docking_log
docking_out 
wrapper_checkpoint.txt

the timestamps are updated correctly after each file modification/addition! For some reason only <docking_out.progress> and <docking_out.chk> miss timestamp updates when they are written.

I don't even understand how this can happen at all...
Why does writing files by the same program (not even just the same program, but the same process already loaded and running in memory), on the same computer and OS and even in the same folder, update the file's modification timestamp in one case and not in the others?
Are these files used in different modules, written by different programmers and using different APIs/libs to access the disk?
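
For anyone who wants to repeat this kind of monitoring, here is a minimal sketch of the idea: take a content hash and the last-modified time of a file at two moments and flag the case where the content changed but the timestamp did not. The function names are hypothetical, and it is an assumption that the file can be opened for reading while the science app still holds it open; on some systems that read may fail.

```python
import hashlib
import os


def snapshot(path):
    """Return (sha256 of the file content, last-modified time in nanoseconds)."""
    with open(path, "rb") as handle:
        digest = hashlib.sha256(handle.read()).hexdigest()
    return digest, os.stat(path).st_mtime_ns


def stale_timestamp(before, after):
    """True when the content changed between two snapshots but the mtime did not."""
    return before[0] != after[0] and before[1] == after[1]
```

Typical use would be to call `snapshot` on `docking_out.chk` in the slot folder, wait longer than the checkpoint interval, call it again, and pass both results to `stale_timestamp`.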
Mad_Max

Joined: 27 Dec 21
Posts: 18
Credit: 15,630,025
RAC: 19,073
Message 1940 - Posted: 24 Jan 2023, 4:44:51 UTC

P.S.
Here is an example of a "saved" WU.
It was stuck running at 100% of one core for about 4 days (24/7) but had not made any actual progress for the last ~1.5 days
(judging by the last-modified time of the <docking_out.log> file, which is updated correctly when it is written to).
It's even more disappointing that it was stuck at 97%, just 3% from the finish. And restarting would mean losing all the CPU time and credit.
But I updated the "last modified" timestamp of <docking_out.chk> before the restart, and BOINC correctly restored everything after the restart.
You can see it from the <1 hour of runtime (CPU time 3371) after the restart, while all the time since the WU's first start was restored and added correctly, and the result was reported just ~13 h before the deadline expired:

https://www.sidock.si/sidock/result.php?resultid=77591932

The same goes for https://www.sidock.si/sidock/result.php?resultid=77607538 although it has not finished yet (it should have by the time you read this); I found it also stuck at 75% after ~3.5 days of running.
The timestamp fix + restart seems to have fixed it too.
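
The post does not say which tool was used to change the timestamp. As one possible way to do it, a small sketch with Python's standard `os.utime` is shown here; the `touch` helper is a hypothetical name, and whether the call succeeds while the science app still has the file open has not been verified.

```python
import os


def touch(path, when=None):
    """Set the access and modification time of `path`.

    `when` is a time in seconds since the epoch; None means "now".
    Returns the resulting modification time.
    """
    if when is None:
        os.utime(path, None)
    else:
        os.utime(path, (when, when))
    return os.stat(path).st_mtime
```

For example, `touch("docking_out.chk")`, run from inside the task's slot folder, would bump the checkpoint file's last-modified time to the current time.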
hoarfrost
Volunteer moderator
Project administrator
Project developer

Joined: 11 Oct 20
Posts: 332
Credit: 25,126,350
RAC: 12,910
Message 1942 - Posted: 24 Jan 2023, 7:59:17 UTC

We will check what is happening with these files (docking_out.progress and docking_out.chk). Maybe something changed in the OS API, maybe not. Inside my Windows 10 VM, the metadata of these files does not stall during computation, but the changes do not happen at the same time as for the other files. Maybe that is normal, maybe not.
3man001

Joined: 3 Mar 22
Posts: 4
Credit: 8,334,432
RAC: 0
Message 1944 - Posted: 24 Jan 2023, 15:50:07 UTC

An up-to-date list of tasks for which processing time was exceeded.
https://www.sidock.si/sidock/result.php?resultid=77646081
https://www.sidock.si/sidock/result.php?resultid=77646087
https://www.sidock.si/sidock/result.php?resultid=77646088
https://www.sidock.si/sidock/result.php?resultid=77646025
https://www.sidock.si/sidock/result.php?resultid=77624328
https://www.sidock.si/sidock/result.php?resultid=77624332
JahSkinny
Joined: 12 Nov 22
Posts: 1
Credit: 77,699
RAC: 0
Message 1955 - Posted: 28 Jan 2023, 2:04:47 UTC - in response to Message 1796.  

Hi there!
I use BOINC to run your project on iOS 11.7.2 . Wondering if you good give me some clarity on how to do the above, and switch to short tasks?

I'm using BOINC on a Raspberry Pi. Like the person above, I'm not sure how to accomplish "switching" to short tasks. Can someone with a Raspberry Pi chime in and give clarity?
Thanks.
hoarfrost
Volunteer moderator
Project administrator
Project developer

Joined: 11 Oct 20
Posts: 332
Credit: 25,126,350
RAC: 12,910
Message 1956 - Posted: 28 Jan 2023, 2:41:57 UTC - in response to Message 1955.  
Last modified: 28 Jan 2023, 21:17:41 UTC

Hi there!
I use BOINC to run your project on iOS 11.7.2 . Wondering if you good give me some clarity on how to do the above, and switch to short tasks?

I'm using BOINC on a Raspberry Pi. Like the person above, I'm not sure how to accomplish "switching" to short tasks. Can someone with a Raspberry Pi chime in and give clarity?
Thanks.

Hi! My steps. Maybe not all of them are needed in your case. Good luck!

P.S. And de-select all applications except "CurieMarieDock 0.2.0 short tasks" in the project preferences for the "venue" of your RPi computers.
danwat1234
Joined: 10 Feb 22
Posts: 23
Credit: 468,368,039
RAC: 237,972
Message 1984 - Posted: 7 Feb 2023, 7:16:29 UTC

Can confirm that "2.00" jobs were often/always using CPU cycles indefinitely; no problem with 2.02. I deleted all 2.00 jobs from all my queues, and that seems to be the solution to all the trouble.
Brian Nixon

Joined: 10 Feb 21
Posts: 21
Credit: 4,831,331
RAC: 78
Message 2008 - Posted: 25 Feb 2023, 19:55:58 UTC - in response to Message 1938.  

I did some debug to and have found part of the problem with checkpoints and loosing cpu/elapsed time counters after each WU/BOINC/Computer restart.
It may be OS related indeed. But not caused by OS itself (as all other projects running on same computer and in the same BOINC installation do not have such problems). I see it on all of my computers, but they use same OS ver installed (Win7 Pro x64).
May be something like new OS API call which was added only on latest win ver and do not work properly(partial support) on older versions?

I did some monitoring of files which running SiDock WU writes to the disk during checkpoint save in working "/slots/" folder.
And have found very interesting things: while checkpoint files written to the disk OK it miss write of file metadata: after modifying these files by app
docking_out.progress
docking_out.chk

File timestamps of "last modified" do not change and always stays same(equal) as time-stamp of initial file creation at WUs first start up.

I get this all the time. It seems to be a Windows thing. From the documentation of the WriteFile function:
When writing to a file, the last write time is not fully updated until all handles used for writing have been closed. Therefore, to ensure an accurate last write time, close the file handle immediately after writing to the file.

So AFAICT what is happening is that CmDock is writing the checkpoint to docking_out.chk, but not closing the handle – so the wrapper (which is polling the last-write time of that file to report the last-checkpoint time back to BOINC) does not see any change.
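
To illustrate the pattern only (this is not CmDock's or the wrapper's actual code, and both function names are hypothetical), here is a small Python sketch of a writer that closes its handle after every checkpoint and a poller that reads the last-write time, as the wrapper is described as doing above:

```python
import os


def write_checkpoint(path, payload):
    """Write a checkpoint and close the handle straight away.

    The `with` block closes the file on exit, which is what the quoted
    WriteFile documentation recommends for an accurate last-write time.
    """
    with open(path, "wb") as handle:
        handle.write(payload)


def last_write_time(path):
    """What a polling wrapper would look at: the file's last-write time."""
    return os.path.getmtime(path)
```

A writer that instead keeps one handle open for the whole run and rewrites the file through it is the case the quoted documentation warns about: the data may reach the disk while the last-write time seen by the poller stays unchanged.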
hoarfrost
Volunteer moderator
Project administrator
Project developer

Joined: 11 Oct 20
Posts: 332
Credit: 25,126,350
RAC: 12,910
Message 2011 - Posted: 26 Feb 2023, 9:25:17 UTC - in response to Message 2008.  

That is a good remark. I posted the issue in the CmDock repository. Thank you!

©2024 SiDock@home Team