Tasks hanging -

jstateson
Joined: 12 Feb 23
Posts: 1
Credit: 4,058,877
RAC: 33,357
Message 2039 - Posted: 26 Mar 2023, 11:30:07 UTC

I have several systems running SiDock, all on long tasks (0.2.0).
Unaccountably, 7 tasks seem to be hung on a dual-Xeon, 24-thread machine running Linux.
Time to complete is several weeks past the deadline. The other 11 tasks do not show any problem.
None of my other systems have this problem, but they run Windows.

jstateson@dual-linux:~$ free -l
              total        used        free      shared  buff/cache   available
Mem:       12271956     5256616     5031304       27584     1984036     6622240
Low:       12271956     7240652     5031304
High:             0           0           0
Swap:       2097148           0     2097148


htop shows 100% usage, but temperatures are strange on one CPU. Both CPUs are liquid cooled with separate closed-loop systems, so there could be some differences.

jstateson@dual-linux:~$ sensors
coretemp-isa-0000
Adapter: ISA adapter
Core 0:       +31.0°C  (high = +80.0°C, crit = +96.0°C)
Core 1:       +35.0°C  (high = +80.0°C, crit = +96.0°C)
Core 2:       +34.0°C  (high = +80.0°C, crit = +96.0°C)
Core 8:       +36.0°C  (high = +80.0°C, crit = +96.0°C)
Core 9:       +32.0°C  (high = +80.0°C, crit = +96.0°C)
Core 10:      +34.0°C  (high = +80.0°C, crit = +96.0°C)

amdgpu-pci-0400
Adapter: PCI adapter
vddgfx:      1000.00 mV
fan1:        2991 RPM  (min =    0 RPM, max = 3700 RPM)
edge:         +45.0°C  (crit = +94.0°C, hyst = -273.1°C)
power1:       89.03 W  (cap =  90.00 W)

coretemp-isa-0001
Adapter: ISA adapter
Core 0:       +27.0°C  (high = +80.0°C, crit = +96.0°C)
Core 1:       +26.0°C  (high = +80.0°C, crit = +96.0°C)
Core 2:       +29.0°C  (high = +80.0°C, crit = +96.0°C)
Core 8:       +28.0°C  (high = +80.0°C, crit = +96.0°C)
Core 9:       +22.0°C  (high = +80.0°C, crit = +96.0°C)
Core 10:      +24.0°C  (high = +80.0°C, crit = +96.0°C)
ID: 2039 · Rating: 0
mikey
Joined: 24 Oct 20
Posts: 19
Credit: 9,927,290
RAC: 4,031
Message 2043 - Posted: 28 Mar 2023, 3:22:48 UTC

Can we multi-thread these tasks, or are they single-core-only tasks?
ID: 2043 · Rating: 0
mikey
Joined: 24 Oct 20
Posts: 19
Credit: 9,927,290
RAC: 4,031
Message 2044 - Posted: 28 Mar 2023, 3:24:25 UTC - in response to Message 1945.  

The newer 2.02 units seem to be completing ok, one puzzle that remains is inconsistent granted credit, in some cases a variation of up to 25% for the same computer.

https://www.sidock.si/sidock/workunit.php?wuid=49667156
https://www.sidock.si/sidock/workunit.php?wuid=49667239


That's what happens when you use an algorithm to figure out credits instead of a fixed amount.
ID: 2044 · Rating: 0
hoarfrost
Volunteer moderator
Project administrator
Project developer

Joined: 11 Oct 20
Posts: 324
Credit: 23,207,807
RAC: 9,001
Message 2045 - Posted: 28 Mar 2023, 5:57:42 UTC - in response to Message 2039.  
Last modified: 28 Mar 2023, 5:58:11 UTC

I have several systems running SiDock, all on long tasks (0.2.0).
Unaccountably, 7 tasks seem to be hung on a dual-Xeon, 24-thread machine running Linux.

Hello! Would you post the task names and their current run times?

Thank you!
ID: 2045 · Rating: 0
sam6861
Joined: 28 Dec 20
Posts: 13
Credit: 8,957,185
RAC: 0
Message 2096 - Posted: 26 Jul 2023, 14:33:18 UTC - in response to Message 2045.  

Hung task on an Intel Atom N270, 32-bit. Manually compiled.
With the "leave non-GPU tasks in memory while suspended" option turned off, pausing and resuming can get the task unstuck.

corona_RdRp_v2_sidock-s_98_00014708_r1_s-20_0
https://www.sidock.si/sidock/result.php?resultid=79096387

With gdb, I got some information:
RbtChromDihedralElement::StandardisedValue (dihedralAngle=2.8241407014048006e+17)
At: src/lib/RbtChromDihedralElement.cxx:152

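That is an enormous angle, about 282414070140480060. The function StandardisedValue repeatedly subtracts 360. For a 64-bit float (a double in C++) of this size, each subtraction is rounded to the nearest representable value and hundreds of trillions of iterations would be needed; for even larger values, 360 is too small to change the number at all. Either way, the result is an effectively endless loop and an endless task.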
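
A quick way to gauge the rounding effect at this magnitude is to measure the gap between neighbouring doubles. The sketch below is not taken from the cmdock source; `SpacingAt` and `SubtractionsNeeded` are hypothetical helper names used only for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Hypothetical helper: gap between x and the next representable double above it.
double SpacingAt(double x) {
    assert(std::isfinite(x));
    return std::nextafter(x, std::numeric_limits<double>::infinity()) - x;
}

// Rough number of 360-degree subtractions needed to bring the angle near zero.
double SubtractionsNeeded(double angle) {
    return std::fabs(angle) / 360.0;
}
```

For the angle in the backtrace, `SpacingAt` reports 32, so each subtraction of 360 still moves the value, but `SubtractionsNeeded` comes to roughly 7.8e14 iterations. Near 1e19 the spacing is 2048, at which point subtracting 360 rounds back to the same number and the loop could never finish.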

A possible source code fix is to use remainder or fmod in the StandardisedValue function. The remainder function is probably the better choice. Either one avoids the need for a loop.

dihedralAngle = remainder(dihedralAngle, 360);
// OR //
dihedralAngle = fmod(dihedralAngle, 360);
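
A minimal sketch of that loop-free idea, assuming the function is meant to wrap an angle in degrees into the range -180 to 180. That range and the name `StandardisedValueSketch` are assumptions; this is not the cmdock implementation. The sketch also rejects non-finite input, which a subtraction loop cannot handle either.

```cpp
#include <cassert>
#include <cmath>
#include <stdexcept>

// Hypothetical replacement sketch, not the actual cmdock code.
// Wraps an angle in degrees into [-180, 180] without looping.
double StandardisedValueSketch(double dihedralAngle) {
    if (!std::isfinite(dihedralAngle)) {
        // NaN or infinity has no meaningful wrapped value.
        throw std::invalid_argument("dihedral angle is not finite");
    }
    // std::remainder returns x - n*360, where n is the integer nearest x/360,
    // so the result always lies in [-180, 180].
    const double wrapped = std::remainder(dihedralAngle, 360.0);
    assert(wrapped >= -180.0 && wrapped <= 180.0);
    return wrapped;
}
```

std::fmod would also terminate, but it keeps the sign of its input and returns a value between -360 and 360, so it would need an extra adjustment step to reach the same range.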

(gdb) bt
#0  0x083144e8 in RbtChromDihedralElement::StandardisedValue (dihedralAngle=2.8241407014048006e+17)
    at ../src/lib/RbtChromDihedralElement.cxx:152
#1  0x08314890 in RbtChromDihedralElement::Mutate (this=0xd537120, relStepSize=16331239353195370)
    at ../src/lib/RbtChromDihedralElement.cxx:73
#2  0x08311c58 in RbtChrom::Mutate (relStepSize=16331239353195370, this=<optimized out>) at ../src/lib/RbtChrom.cxx:56
#3  RbtChrom::Mutate (this=<optimized out>, relStepSize=16331239353195370) at ../src/lib/RbtChrom.cxx:56
#4  0x082e4140 in RbtPopulation::GAstep (this=0xb514750, nReplicates=nReplicates@entry=1100, relStepSize=relStepSize@entry=1, 
    equalityThreshold=equalityThreshold@entry=0.10000000000000001, pcross=pcross@entry=0.40000000000000002, 
    xovermut=xovermut@entry=true, cmutate=cmutate@entry=false) at ../src/lib/RbtPopulation.cxx:105
#5  0x082806f8 in RbtGATransform::Execute (this=<optimized out>) at ../include/RbtSmartPointer.h:131
#6  0x0830e73d in RbtTransformAgg::Execute (this=0x85c3960) at ../src/lib/RbtTransformAgg.cxx:165
#7  0x081dfcb3 in RbtWorkSpace::Run (this=0x85b02b0) at ../src/lib/RbtWorkSpace.cxx:170
#8  0x080955d0 in main (argc=<optimized out>, argv=<optimized out>) at ../src/exe/cmdock.cxx:775
ID: 2096 · Rating: 0
sam6861
Joined: 28 Dec 20
Posts: 13
Credit: 8,957,185
RAC: 0
Message 2098 - Posted: 26 Jul 2023, 19:10:12 UTC - in response to Message 2096.  
Last modified: 26 Jul 2023, 19:26:20 UTC

Update: I am guessing this may be faulty hardware on my side making random errors. The more I look at where the number (relStepSize=16331239353195370) could have come from, the more I think my hardware is producing random wrong calculations.

I have started to believe it could be faulty RAM or a faulty CPU in the N270 HP Mini 110-1000, a very old computer. I have had difficulty just getting this computer to start: often the display stays blank with no boot, and I have to power cycle it a few times. Once this computer randomly lost power while all other devices stayed on. I guess it may fail soon. Oh well, I have other computers I can use.

Software could be written to check for faulty numbers to reduce the chance of getting stuck in an endless loop; PrimeGrid is an example of a project with lots of checks for possible errors.
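
One hedged illustration of such a check is a guard that refuses implausible step sizes before they reach the mutation code. The names `IsPlausibleStep` and `SanitizeStep` and the limit `kMaxStep` are made up for this sketch; they are not taken from cmdock or PrimeGrid.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical limit, chosen arbitrarily for illustration.
constexpr double kMaxStep = 1.0e6;

// True if the step is a finite number of reasonable magnitude.
bool IsPlausibleStep(double relStepSize) {
    return std::isfinite(relStepSize) && std::fabs(relStepSize) <= kMaxStep;
}

// Returns the step unchanged if plausible, otherwise 0 (no mutation).
double SanitizeStep(double relStepSize) {
    const double result = IsPlausibleStep(relStepSize) ? relStepSize : 0.0;
    assert(std::isfinite(result));
    return result;
}
```

With this guard, the relStepSize value from the backtrace would be replaced by 0 instead of being passed on. Whether skipping the mutation is the right fallback is a design decision for the project developers.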
ID: 2098 · Rating: 0
hoarfrost
Volunteer moderator
Project administrator
Project developer

Joined: 11 Oct 20
Posts: 324
Credit: 23,207,807
RAC: 9,001
Message 2099 - Posted: 28 Jul 2023, 9:17:01 UTC

sam6861, thank you for the interesting report!
ID: 2099 · Rating: 0
sam6861
Joined: 28 Dec 20
Posts: 13
Credit: 8,957,185
RAC: 0
Message 2102 - Posted: 28 Jul 2023, 20:05:42 UTC - in response to Message 2099.  

Another update: I looked at the random number code in the source. I may have been wrong about hardware errors; this is more likely a software bug. The randomizer just makes the problem happen by random chance.

I believe I found a bug in the function RbtRand::generate_cauchy.
https://gitlab.com/Jukic/cmdock/-/blob/v0.2.0/src/lib/RbtRand.cxx Line 216

val is a random decimal number between -0.5 and 0.5.
The problem is the use of tan(pi * val), with the argument in radians: the result grows without bound as val approaches -0.5 or 0.5.

tan(pi*0.4999999999) in Linux Qalculate is about 3183098861.84, already a huge number.

RbtRand::generate_cauchy in src/lib/RbtRand.cxx line 220
RbtRand::GetCauchyRandom in src/lib/RbtRand.cxx line 69
RbtChromElement::CauchyMutate in src/lib/RbtChromElement.cxx line 86

--- The next function CauchyMutate calls is RbtChrom::Mutate, which is where relStepSize=16331239353195370 went before getting stuck in RbtChromDihedralElement::StandardisedValue.

I am not sure how to solve this problem in RbtRand::generate_cauchy; I guess it is mostly up to other people to find a good fix for this occasional huge random number.
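
To illustrate the failure mode, here is a minimal sketch of a standard Cauchy sample computed as tan(pi * val), plus one possible guard. It is not the cmdock code: `CauchyFromUniform`, `BoundedCauchyFromUniform`, `kEdge`, and the choice to clamp val away from the ends of its range are all assumptions made for illustration. In double precision, tan of the double nearest pi/2 evaluates to about 1.633e16, which matches the relStepSize seen in the backtrace and suggests val was 0.5 or extremely close to it.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

constexpr double kPi = 3.14159265358979323846;
// Hypothetical margin keeping val away from the poles of tan at -0.5 and 0.5.
constexpr double kEdge = 1.0e-6;

// Unguarded form: val in [-0.5, 0.5] maps to a standard Cauchy value.
double CauchyFromUniform(double val) {
    return std::tan(kPi * val);
}

// Guarded sketch: clamp val so the magnitude stays bounded (about 3.2e5 here).
double BoundedCauchyFromUniform(double val) {
    assert(val >= -0.5 && val <= 0.5);
    const double clamped = std::max(-0.5 + kEdge, std::min(val, 0.5 - kEdge));
    return std::tan(kPi * clamped);
}
```

Clamping slightly distorts the extreme tails of the distribution. Drawing a new random val whenever it falls too close to the ends of the range would be an alternative; which trade-off suits the docking algorithm is for the developers to judge.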
ID: 2102 · Rating: 0