32 GB of system RAM, running with hyperthreading off to keep CPU temps a bit lower. The RX6900 runs
4 O3AS tasks at once, which seems to fully 'saturate' the GPU at peaks, so I dialed it down from 8 at
first, then to 6, and now to 4. Four seems better; the tasks just run faster when there are fewer of them.
Lately I get a total system crash (requiring a full reboot) about every 36-48 hours. Here is the last
snippet from my kernel.log. It has some info, but probably not the full dump a developer would need to
really troubleshoot. The RX6900 has 16 GB of RAM, which is why I got away with running more tasks (up to a point).
-------------------------------------------------------------------------------------------------------------------------------------
Dec 27 00:07:12 kernel: [107576.642108] [drm] Unknown EDID CEA parser results
Dec 27 00:23:50 kernel: [108574.671480] [drm] Unknown EDID CEA parser results
Dec 27 00:41:01 kernel: [109605.667960] amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:88 vmid:8 pasid:32771, for process einstein_O3AS_1 pid 9689 thread einstein_O3AS_1 pid 9689)
Dec 27 00:41:01 kernel: [109605.667967] amdgpu 0000:03:00.0: amdgpu: in page starting at address 0x00007f87e6ccc000 from client 0x1b (UTCL2)
Dec 27 00:41:01 kernel: [109605.667970] amdgpu 0000:03:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x008008B0
Dec 27 00:41:01 kernel: [109605.667971] amdgpu 0000:03:00.0: amdgpu: Faulty UTCL2 client ID: CPF (0x4)
Dec 27 00:41:01 kernel: [109605.667972] amdgpu 0000:03:00.0: amdgpu: MORE_FAULTS: 0x0
Dec 27 00:41:01 kernel: [109605.667973] amdgpu 0000:03:00.0: amdgpu: WALKER_ERROR: 0x0
Dec 27 00:41:01 kernel: [109605.667974] amdgpu 0000:03:00.0: amdgpu: PERMISSION_FAULTS: 0xb
Dec 27 00:41:01 kernel: [109605.667975] amdgpu 0000:03:00.0: amdgpu: MAPPING_ERROR: 0x0
Dec 27 00:41:01 kernel: [109605.667976] amdgpu 0000:03:00.0: amdgpu: RW: 0x0
Dec 27 00:47:22 kernel: [109985.919646] [drm] Unknown EDID CEA parser results
Dec 27 00:47:22 kernel: [109985.932499] [drm] Unknown EDID CEA parser results
----------------------------------------------------------------------------------------------------------------------------------------
BTW, that last line there is literally the last line of the kernel log after the system crashed.
Those 'Unknown EDID' notifications are just another nuisance message afaik; I haven't had much luck finding info
on them, but I get them the whole time the GPU is running. This is running on the latest AMD driver, 21.40.1.40501-1,
at least as of last month. If someone can direct me to better debug info that I can get my hands on, I'd be happy to seek
it out and forward it. 'gfxhub page fault' - maybe a problem with the GPU card's memory?
Happy Holidays and CHEERS! to all,
-Mike
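For anyone who wants to cross-check the decoded fields in the log above, here is a minimal Python sketch that splits the raw GCVM_L2_PROTECTION_FAULT_STATUS value into its bit fields. The bit positions in FIELDS are an assumption (recalled from the amdgpu register definitions for this GPU generation, not taken from this thread), and decode_fault_status is a made-up helper name. The only check applied here is that the result matches the values the driver printed for 0x008008B0.

```python
# Bit layout below is an ASSUMPTION about GCVM_L2_PROTECTION_FAULT_STATUS;
# it is only checked against the values the driver printed in the log.
# Each entry maps a field name to (shift, width in bits).
FIELDS = {
    "MORE_FAULTS": (0, 1),
    "WALKER_ERROR": (1, 3),
    "PERMISSION_FAULTS": (4, 4),
    "MAPPING_ERROR": (8, 1),
    "CID": (9, 9),      # "Faulty UTCL2 client ID" in the log
    "RW": (18, 1),
    "VMID": (20, 4),
}


def decode_fault_status(status):
    """Return a dict of field name -> value extracted from the raw status word."""
    return {
        name: (status >> shift) & ((1 << width) - 1)
        for name, (shift, width) in FIELDS.items()
    }


if __name__ == "__main__":
    # Raw value copied from the kernel log line in the post.
    for name, value in decode_fault_status(0x008008B0).items():
        print(f"{name}: {value:#x}")
```

Run against the value from the log, this gives PERMISSION_FAULTS 0xb, CID 0x4 and VMID 0x8, which agrees with what the driver itself reported. It adds no information beyond the log; it is just a way to decode a status word if only the raw hex value is available.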
Are you running CPU tasks from either Einstein or another project? From what I remember, the gravitational wave tasks were a good bit CPU bound, so maybe some conflicts are occurring there. Personally, I would re-enable hyperthreading, even if only to give the CPU some more oomph in processing.
How is the GPU connected? Are you using a riser of any kind, or is the GPU plugged directly into the motherboard? How about other peripherals that might be sharing PCIe resources with the GPU? What motherboard do you have, and which slot is populated by the GPU?
Finally, if you're feeling up to it, you might look into the ROCm driver stack (the AMD-shipped drivers only have rocr, which doesn't have the same level of support as the full ROCm stack); in my experience, ROCm has better support than the AMDGPU-PRO drivers. It installs fairly easily on Ubuntu, but there are sometimes GPU detection issues in BOINC with ROCm that require a workaround.
_________________________________________________________________________
Thanks for your detailed reply.
I'm only running this project on this system now. I have my config files set to use one CPU
per einstein_O3AS task. They peak in heat during that last "99%" phase but never more than
~130F tops. The GPU only shows 80F on the 'edge' sensor using a hardware monitor.
I still have hyperthreading off for now, so this box, on 8 cores, runs 4 x O3AS tasks and, right
now, 4 x FGRP5 CPU-only tasks.
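For anyone wanting to reproduce that setup, here is a rough Python sketch that writes out a BOINC app_config.xml for 4 GPU tasks at a time with one CPU core reserved per task. The function name build_app_config is made up for illustration, and the app name "einstein_O3AS" is an assumption based on the process name in the kernel log; check the name your client actually reports before using it.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name, tasks_per_gpu, cpus_per_task):
    """Build the text of a BOINC app_config.xml for one GPU application."""
    root = ET.Element("app_config")
    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name
    gpu = ET.SubElement(app, "gpu_versions")
    # gpu_usage is the fraction of a GPU each task claims: 0.25 -> 4 tasks per GPU.
    ET.SubElement(gpu, "gpu_usage").text = f"{1 / tasks_per_gpu:.2f}"
    # cpu_usage is the number of CPU cores budgeted for each GPU task.
    ET.SubElement(gpu, "cpu_usage").text = f"{cpus_per_task:.1f}"
    ET.indent(root)  # pretty-print; needs Python 3.9 or newer
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # "einstein_O3AS" is an ASSUMED app name, taken from the log's process name.
    print(build_app_config("einstein_O3AS", tasks_per_gpu=4, cpus_per_task=1))
```

The output would go into the project's folder inside the BOINC data directory, and the client needs to re-read its config files (or be restarted) before the change takes effect.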
The GPU is connected to the slot 1 PCIEX on an ASUS Prime Z590-A with 32 GB of RAM. I bought
an 850 W PSU at the same time as the RX6900 and have all 3 power connectors hooked up.
I 'built' the latest drivers by running the script AMD now offers for download - the 21.40.1 release dated 11/11/21.
I installed with only the basic options 'opencl=rocr,legacy' and nothing else.
Might be time to read up on the ROCm stack next. And thanks again for taking a look.
I also posted the original kernel message to the AMD support community site.
Cheers,
Mike
Seems kind of crazy, but I turned hyperthreading back on for this 8-core Intel, and the CPU-intensive processing that occurs when the O3AS work units show 99% complete can take 16 minutes for that last 1%. That's with 2 at a time in the last-1% phase on that 6900 GPU, at least with my rig. The total load on the GPU drops by half at this point, I guess because the CPUs are doing the final work; the other 2 O3AS units are still running, since the GPU is supposed to be running 4x. So yes, I would say CPU intensive for sure.
I then turned hyperthreading off again in the BIOS, and the units run in about 7-8 minutes, with the last 1% taking ~2 minutes. Talk about a huge difference. I can't believe I didn't look at this more closely before.
With 4 GPU tasks running, I have two running very close together and the other two about half a cycle behind, if I'm lucky. That seems to be the most efficient. Waiting to see if it crashes again. I also moved that box to a cooler room in the house.
Happy New Year,
-Mike