Posts by sam6861

1) Message boards : Number crunching : Tasks hanging - (Message 2102)
Posted 28 Jul 2023 by sam6861
Post:
Another update: I looked at the random number code in the source. I may have been wrong about hardware errors; this is more likely a software bug. The randomizer just makes the problem show up by random chance.

I believe I found a bug in a function in the source code, RbtRand::generate_cauchy.
https://gitlab.com/Jukic/cmdock/-/blob/v0.2.0/src/lib/RbtRand.cxx Line 216

val is a random decimal number between -0.5 and 0.5.
The problem is the use of tan(pi * val), with the angle in radians. As val gets close to -0.5 or 0.5, pi * val gets close to pi/2, where tan grows without limit.

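tan(pi*0.4999999999) in Linux Qalculate is about 3183098861.84 (roughly 3.2e9), a huge number. At val = 0.5 exactly, tan of the double closest to pi/2 is 16331239353195370 (about 1.6e16), which is the same value as the relStepSize seen in gdb.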

RbtRand::generate_cauchy in src/lib/RbtRand.cxx line 220
RbtRand::GetCauchyRandom in src/lib/RbtRand.cxx line 69
RbtChromElement::CauchyMutate in src/lib/RbtChromElement.cxx line 86

CauchyMutate then calls RbtChrom::Mutate, which is where relStepSize=16331239353195370 goes on to get stuck in RbtChromDihedralElement::StandardisedValue.

I am not sure how to solve this problem in RbtRand::generate_cauchy. I guess it is mostly up to other people to find a good fix for this random huge number.
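
One possible direction, as a minimal sketch only: limit the deviate before it reaches the mutation code. The helper names and the max_abs parameter below are hypothetical and are not taken from the CmDock source; only the tan(pi * val) mapping comes from the post above.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

const double kPi = 3.14159265358979323846;

// Hypothetical helper, not the CmDock code: maps a uniform value u in
// [-0.5, 0.5] to a Cauchy-style deviate using tan(pi * u).
// u = 0.5 gives roughly 1.6e16 in double precision.
double cauchy_deviate(double u, double scale) {
    return scale * std::tan(kPi * u);
}

// Same deviate, limited to [-max_abs, max_abs]. max_abs is an assumed
// tuning parameter; a sensible bound depends on what the caller can handle.
double bounded_cauchy_deviate(double u, double scale, double max_abs) {
    assert(max_abs >= 0.0);
    double dev = cauchy_deviate(u, scale);
    return std::max(-max_abs, std::min(dev, max_abs));
}
```

Clamping piles the rare extreme samples onto the two bounds, so it slightly changes the distribution. Drawing a new random value whenever the deviate is out of range would be another option.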
2) Message boards : Number crunching : Tasks hanging - (Message 2098)
Posted 26 Jul 2023 by sam6861
Post:
Update: I am guessing this may be faulty hardware making random errors. The more I look at where the number (relStepSize=16331239353195370) may have come from, the more I think my hardware could be making random wrong calculations.

I have started to believe it could be faulty RAM or a faulty CPU in my N270 HP Mini 110-1000, a very old computer. I have had some difficulty just getting this computer to start. Often the display just stays blank with no boot, and I have to power cycle a few times. Once this computer randomly lost power while all other devices stayed on. I guess this computer may fail soon. Oh well, I have some other computers I can use.

Software can be written with checks for faulty numbers to reduce the chance of getting stuck in an endless loop. PrimeGrid is an example of a project with lots of checks for possible errors.
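
As a rough illustration of such a check, here is a minimal sketch. The function name and the max_abs limit are hypothetical and not part of CmDock; the idea is only that a value can be rejected before it is used, whatever produced it.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical guard, not part of CmDock: returns true only when a step
// size is a finite number whose magnitude does not exceed max_abs.
bool is_plausible_step(double step, double max_abs) {
    assert(max_abs > 0.0);  // the limit itself must be sensible
    return std::isfinite(step) && std::fabs(step) <= max_abs;
}
```

A caller could skip the mutation or report an error when this returns false, instead of passing the value on.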
3) Message boards : Number crunching : Tasks hanging - (Message 2096)
Posted 26 Jul 2023 by sam6861
Post:
Hung task on an Intel Atom N270, 32-bit, manually compiled.
With "leave non-GPU task in memory while suspended" turned off, pausing and resuming can get the task unstuck.

corona_RdRp_v2_sidock-s_98_00014708_r1_s-20_0
https://www.sidock.si/sidock/result.php?resultid=79096387

With gdb, I got some information:
RbtChromDihedralElement::StandardisedValue (dihedralAngle=2.8241407014048006e+17)
At: src/lib/RbtChromDihedralElement.cxx:152

That is a very huge angle, 282414070140480060. The function StandardisedValue repeatedly subtracts 360. At this size that takes about 7.8e14 subtractions (2.82e17 / 360), and each result is also rounded, because a 64-bit float (called a double in programming languages) holds only about 16 significant digits. In practice this is an endless loop, an endless task.

A possible source code fix is to use remainder or fmod in the StandardisedValue function. The remainder function is probably the best one to use. This avoids the need for loops.

dihedralAngle = remainder(dihedralAngle, 360);
// OR //
dihedralAngle = fmod(dihedralAngle, 360);
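
A slightly fuller sketch of the remainder variant, assuming the intended output range is [-180, 180) degrees. The function name wrap_degrees is hypothetical and the range is my assumption about what StandardisedValue is meant to return.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical replacement for the subtract-360 loop, assuming the
// intended output range is [-180, 180) degrees.
double wrap_degrees(double angle) {
    assert(std::isfinite(angle));  // remainder of inf or NaN is NaN
    // std::remainder is computed exactly; the result lies in [-180, 180].
    double r = std::remainder(angle, 360.0);
    if (r >= 180.0) {
        r -= 360.0;  // fold +180 onto -180 to keep the range half-open
    }
    return r;
}
```

This runs in constant time for any finite input, including the 2.8e17 value from gdb. fmod would also work, but it keeps the sign of the input and returns values in (-360, 360), so it needs one more adjustment step.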

(gdb) bt
#0  0x083144e8 in RbtChromDihedralElement::StandardisedValue (dihedralAngle=2.8241407014048006e+17)
    at ../src/lib/RbtChromDihedralElement.cxx:152
#1  0x08314890 in RbtChromDihedralElement::Mutate (this=0xd537120, relStepSize=16331239353195370)
    at ../src/lib/RbtChromDihedralElement.cxx:73
#2  0x08311c58 in RbtChrom::Mutate (relStepSize=16331239353195370, this=<optimized out>) at ../src/lib/RbtChrom.cxx:56
#3  RbtChrom::Mutate (this=<optimized out>, relStepSize=16331239353195370) at ../src/lib/RbtChrom.cxx:56
#4  0x082e4140 in RbtPopulation::GAstep (this=0xb514750, nReplicates=nReplicates@entry=1100, relStepSize=relStepSize@entry=1, 
    equalityThreshold=equalityThreshold@entry=0.10000000000000001, pcross=pcross@entry=0.40000000000000002, 
    xovermut=xovermut@entry=true, cmutate=cmutate@entry=false) at ../src/lib/RbtPopulation.cxx:105
#5  0x082806f8 in RbtGATransform::Execute (this=<optimized out>) at ../include/RbtSmartPointer.h:131
#6  0x0830e73d in RbtTransformAgg::Execute (this=0x85c3960) at ../src/lib/RbtTransformAgg.cxx:165
#7  0x081dfcb3 in RbtWorkSpace::Run (this=0x85b02b0) at ../src/lib/RbtWorkSpace.cxx:170
#8  0x080955d0 in main (argc=<optimized out>, argv=<optimized out>) at ../src/exe/cmdock.cxx:775
4) Message boards : Number crunching : Never ending WU's ? (Message 1632)
Posted 22 May 2022 by sam6861
Post:
This topic is 1 year old. We now have checkpoints, and more accurate percent complete status.

Each task has a 5-day deadline at the time of this reply, plenty of time for my slow Intel Atom N270, which mostly takes 1 to 3 days to complete 2 tasks.
5) Message boards : Number crunching : Task's_0 file is too large to upload (Message 1585)
Posted 20 Mar 2022 by sam6861
Post:
Have you caught a rare large fish! I have increased limit up to 32 Mb, try again, please!
Stuck upload of a 32.89 MB file. The upload limit could use an increase.
https://www.sidock.si/sidock/workunit.php?wuid=26638889
6) Message boards : Number crunching : Problems accessing the project and web site (Message 1528)
Posted 26 Feb 2022 by sam6861
Post:
I have noticed that once the web browser slowly and eventually loads the SiDock website, the website then loads very fast for the rest of the time, until you stop clicking on links and idle for more than 5 seconds. This looks like keep-alive connection behavior. Probably the server's connection limit is configured too low and/or has been reached. I have to reload the page (F5 button), copy and paste, then hit post, all in less than 5 seconds.

For BOINC clients, limiting to 1 file transfer at a time does help, because the transfers then reuse the same connection within the 5-second keep-alive.
cc_config.xml
<cc_config><options>
<max_file_xfers_per_project>1</max_file_xfers_per_project>
</options></cc_config>

I found sched_reply_www.sidock.si_sidock.xml in the BOINC files. It shows:
<request_delay>7</request_delay>
Other BOINC project servers have request_delay set to 31, 121, or 181. Setting it too low can make the server overload more easily. The server owner/admin can raise this number to see if it helps. I still believe the problem is mostly the web server hitting its TCP connection limit.
7) Message boards : Number crunching : Problems accessing the project and web site (Message 1522)
Posted 26 Feb 2022 by sam6861
Post:
Hi, I have recently been seeing slow connection problems with both BOINC and the web browser when trying to access this website.
From Linux, I ran this command a few times on my computer:
nmap -p 80,443 sidock.si
80/tcp http: open 100% of the time.
443/tcp https: filtered 58% of the time, open 42% of the time, at random. It looks like there is something wrong with this port. Possibly it is overloaded.

I have found some possible ways to reduce this problem.

For Windows 10 client users, these TCP connection adjustments gave me a higher connection success rate, at the cost of taking longer to time out when it doesn't connect. Open a command prompt (cmd) and try these commands.
netsh interface tcp show global
netsh interface tcp set global initialrto=300
netsh interface tcp set global maxsynretransmissions=7
You can go back to initialrto=1000 maxsynretransmissions=4 if you don't like the extra-long timeout when there is no connection.

------
For the server owner or admin, this may help allow more connections when using the Apache web server.
/etc/apache2/conf-enabled/custom.conf (can make a new file)
MaxRequestWorkers 1000
ServerLimit 1000
ListenBackLog 10000

service apache2 restart

The web server can be stress tested locally to check speed:
ab -n 10000 -c 1000 http://127.0.0.1/
8) Message boards : Number crunching : Peer certificate cannot be authenticated with given CA certificates (Message 1244)
Posted 30 Sep 2021 by sam6861
Post:
Windows 10 with BOINC 7.16.11
Thu 30 Sep 2021 01:26:57 PM CDT | SiDock@home | Scheduler request failed: Peer certificate cannot be authenticated with given CA certificates.

Works fine: BOINC version 7.16.16 on Linux Debian 11 bullseye on both my Linux computers (32 bit anonymous platform and 64 bit).
9) Message boards : Number crunching : Have memory requirements for tasks increased? (Message 1130)
Posted 3 Sep 2021 by sam6861
Post:
Got enough free space?

My slow Intel Atom N270 with 2 GB RAM has a huge amount of free space, 437 GB free on a 500 GB SSD. It continues to receive tasks just fine.

I looked at the server-specified limits in client_state.xml, which shows:
<rsc_memory_bound>500000000</rsc_memory_bound>
<rsc_disk_bound>1000000000</rsc_disk_bound>
Requires 1000 MB of free storage space? On my computer, the SiDock project uses 55.61 MB of storage space. Perhaps the server admins could drop rsc_disk_bound down to 100000000 (100 MB, 1 less zero).
10) Message boards : Number crunching : Have memory requirements for tasks increased? (Message 1124)
Posted 1 Sep 2021 by sam6861
Post:
What's your memory limit set to? Some tasks may not start if the BOINC client's global memory setting is as low as 10% and/or overall memory is low.

In the BOINC GUI, look at Options, Disk and memory, Memory: "when computer is in use" 100%, "when computer is not in use" 100%.

You can change the memory limit there. It is also possible to set max memory usage to 200% in a config file (not in the GUI).
/var/lib/boinc/global_prefs_override.xml
<global_preferences>
<ram_max_used_busy_pct>200.000000</ram_max_used_busy_pct>
<ram_max_used_idle_pct>200.000000</ram_max_used_idle_pct>
</global_preferences>
boinccmd --read_global_prefs_override

This 200% max memory setting lets BOINC start tasks past the system RAM size, and the Linux kernel can fall back to any enabled swap when low on memory. Note: low memory with no swap can run out of memory and cause computation errors. This works great to get Rosetta@home to run 2 tasks at the same time on a limited 2 GB of memory with a fast Samsung Evo 860 SSD.

2 CmDock tasks, custom compiled for 32-bit, use just 198 MiB of total memory according to the Linux free command on my Intel Atom N270 with 2 GB RAM running 32-bit Linux. 2 Rosetta@home tasks have pushed total memory used to 1800 MiB.
11) Message boards : Number crunching : SiDock tasks on Raspberry Pi keep running when waiting or suspended (Message 1096)
Posted 25 Jul 2021 by sam6861
Post:
I have the same problem, but only if BOINC is started as a service. 32-bit Intel Atom N270, Debian GNU/Linux 10 (buster) [4.19.0-16-686|libc 2.28 (Debian GLIBC 2.28-10)]
~ Suspend tasks failed to pause.
~ killall -SIGTSTP cmdock ...failed to pause.

I found a way to pause successfully. When I stop the boinc-client service and manually run boinc as the boinc user, it is able to pause, both by suspending the task and by SIGTSTP/SIGCONT signals. I guess something may be wrong with the 32-bit Debian 10 service, which blocks the SIGTSTP signal or refuses to pause. My AMD64 Debian 11 BOINC service does not have this problem.
service boinc-client stop
su -l boinc -s /bin/bash
cd /var/lib/boinc-client
boinc & 
I may soon update my 32-bit system to Debian 11 to see if that fixes the BOINC service failing to pause tasks.
12) Message boards : Number crunching : 100% (Message 1093)
Posted 19 Jul 2021 by sam6861
Post:
Got the 100% bug on some tasks.
corona_Eprot_v1_run_2_nb3di_109561_4_1

On Linux, you can look at the files in the BOINC slots directory. The slot number may be different. Hex 0a is a newline.
while :
do
hexdump -C /var/lib/boinc-client/slots/2/docking_out.progress
sleep 30
done
00000000  30 2e 31 32 38 32 35 37 0a 0a 0a  |0.128257...|
00000000  30 2e 31 33 32 0a 35 37 0a 0a 0a  |0.132.57...|
00000000  30 2e 31 33 36 32 37 33 0a 0a 0a  |0.136273...|

It looks like CmDock seeks to the beginning of the file and writes 0.132, which has fewer characters. Then, I guess, the BOINC wrapper reads the last line containing a number, 57, which would be 5700% if BOINC didn't cap the value at 100%.

The extra newlines (0a 0a) are left over from when the number lost 2 digits as it went from 0.00999999 to 0.0111111, then to 0.111111.

For now, you can temporarily override the progress. This mostly lasts just 1 minute, then it goes back to showing the actual progress.
printf "0.5\n" > /var/lib/boinc-client/slots/2/docking_out.progress

I would like to see a fix. Preferably it would always write the same number of characters, "0.13200000". I guess a possibly easier fix is to add spaces to clear away the extra digits, "0.132 ".
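
A minimal sketch of both the suspected problem and the fixed-width idea, using an in-memory string in place of the real file. Both function names are hypothetical, and whether CmDock actually writes its progress file this way is only my guess from the hexdump.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Simulates writing `text` at offset 0 of a file whose current bytes are
// `contents`, without truncating the file afterwards.
std::string overwrite_at_start(std::string contents, const std::string& text) {
    if (text.size() > contents.size()) {
        contents.resize(text.size());
    }
    contents.replace(0, text.size(), text);
    return contents;
}

// Formats a progress fraction with a fixed number of digits, so every
// write has the same length and fully covers the previous one.
std::string fixed_width_progress(double fraction) {
    assert(fraction >= 0.0 && fraction <= 1.0);
    char buf[32];
    std::snprintf(buf, sizeof buf, "%.8f\n", fraction);
    return std::string(buf);
}
```

Writing "0.132\n" over "0.128257\n\n\n" with overwrite_at_start leaves "0.132\n57\n\n\n", the same bytes as the second hexdump line. With fixed_width_progress every write is 11 bytes, so nothing stale is left behind. Truncating the file after each write would be another way to get the same result.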
13) Message boards : Number crunching : is it only me? (Message 383)
Posted 3 Jan 2021 by sam6861
Post:
I noticed the credits for each task going up and down. I guess this gets worse with more CPU threads. Tasks have different runtimes, mostly between 30 and 60 minutes.

Just imagine what happens on a CPU with 128 threads. All 128 tasks start at the same time with the same FLOPS and the same estimated runtime, then the fastest 10 tasks complete first. This can make the credit system think the computer is turbo fast and push the credits per hour for the next few tasks very high. Then the last few slowest tasks eventually complete, which makes the credits per hour or per task go very low, as if the computer were now very slow.

This up-and-down in credits happens somewhat with my 16-thread Ryzen 2700X if I start all tasks at the same time.

I guess one way this problem could be reduced is to pre-calculate or estimate each task's FLOPS, high for long tasks and low for short tasks. I would like to see the estimated remaining time be a little different for each task not yet started.



