einsteinbinary_: page allocation failure

Cat22
Joined: 13 May 21
Posts: 28
Credit: 915488605
RAC: 1457016
Topic 231539

I am getting this kernel fault when running the Einstein binary on Linux with two NVIDIA RTX 2060 GPUs. The OS is fully up to date.

The kernel is 6.11.0-1 and the NVIDIA drivers are 550.107.02; the OS is openSUSE Tumbleweed.

It appears to be running "Binary Radio Pulsar Search (MeerKAT)" 0.16 (BRP7-cuda102).

[Tue Oct  1 04:59:12 2024] [T1662263] einsteinbinary_: page allocation failure: order:5, mode:0x40cc0(GFP_KERNEL|__GFP_COMP), nodemask=(null),cpuset=/,mems_allowed=0
[Tue Oct  1 04:59:12 2024] [T1662263] CPU: 16 UID: 1000 PID: 1662263 Comm: einsteinbinary_ Tainted: P           OE      6.11.0-1-default #1 openSUSE Tumbleweed 461f7965cd54a3c599f269012cdb3d6ce81b3260
[Tue Oct  1 04:59:12 2024] [T1662263] Tainted: [P]=PROPRIETARY_MODULE, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
[Tue Oct  1 04:59:12 2024] [T1662263] Hardware name: ASUS System Product Name/ROG MAXIMUS Z790 HERO, BIOS 2301 05/22/2024
[Tue Oct  1 04:59:12 2024] [T1662263] Call Trace:
[Tue Oct  1 04:59:12 2024] [T1662263]  <TASK>
[Tue Oct  1 04:59:12 2024] [T1662263]  dump_stack_lvl+0x5a/0x80
[Tue Oct  1 04:59:12 2024] [T1662263]  warn_alloc+0x139/0x160
[Tue Oct  1 04:59:12 2024] [T1662263]  ? __alloc_pages_direct_compact+0x1c1/0x2d0
[Tue Oct  1 04:59:12 2024] [T1662263]  __alloc_pages_slowpath.constprop.0+0xc62/0xd60
[Tue Oct  1 04:59:12 2024] [T1662263]  __alloc_pages_noprof+0x321/0x350
[Tue Oct  1 04:59:12 2024] [T1662263]  ___kmalloc_large_node+0x69/0xf0
[Tue Oct  1 04:59:12 2024] [T1662263]  ? handle_mm_fault+0x1bb/0x2c0
[Tue Oct  1 04:59:12 2024] [T1662263]  __kmalloc_large_node_noprof+0x1d/0xa0
[Tue Oct  1 04:59:12 2024] [T1662263]  __kmalloc_noprof+0x32a/0x440
[Tue Oct  1 04:59:12 2024] [T1662263]  ? __gup_longterm_locked+0x5ad/0xa00
[Tue Oct  1 04:59:12 2024] [T1662263]  ? __gup_longterm_locked+0x5ad/0xa00
[Tue Oct  1 04:59:12 2024] [T1662263]  __gup_longterm_locked+0x5ad/0xa00
[Tue Oct  1 04:59:12 2024] [T1662263]  ? __kmalloc_noprof+0x280/0x440
[Tue Oct  1 04:59:12 2024] [T1662263]  pin_user_pages+0x6e/0xb0
[Tue Oct  1 04:59:12 2024] [T1662263]  os_lock_user_pages+0xb3/0x1a0 [nvidia e810b5fb1a2a882eafdb0ea19ed3a500028d28e3]
[Tue Oct  1 04:59:12 2024] [T1662263]  _nv000662rm+0x67/0x110 [nvidia e810b5fb1a2a882eafdb0ea19ed3a500028d28e3]
[Tue Oct  1 04:59:12 2024] [T1662263]  _nv000731rm+0xc96/0xeb0 [nvidia e810b5fb1a2a882eafdb0ea19ed3a500028d28e3]
[Tue Oct  1 04:59:12 2024] [T1662263]  rm_ioctl+0x58/0xb0 [nvidia e810b5fb1a2a882eafdb0ea19ed3a500028d28e3]
[Tue Oct  1 04:59:12 2024] [T1662263]  nvidia_unlocked_ioctl+0x529/0x8b0 [nvidia e810b5fb1a2a882eafdb0ea19ed3a500028d28e3]
[Tue Oct  1 04:59:12 2024] [T1662263]  __x64_sys_ioctl+0x94/0xd0
[Tue Oct  1 04:59:12 2024] [T1662263]  do_syscall_64+0x82/0x160
[Tue Oct  1 04:59:12 2024] [T1662263]  ? handle_mm_fault+0x1bb/0x2c0
[Tue Oct  1 04:59:12 2024] [T1662263]  ? do_user_addr_fault+0x36c/0x620
[Tue Oct  1 04:59:12 2024] [T1662263]  ? exc_page_fault+0x73/0x170
[Tue Oct  1 04:59:12 2024] [T1662263]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[Tue Oct  1 04:59:12 2024] [T1662263] RIP: 0033:0x7fb6e7b0f70f
[Tue Oct  1 04:59:12 2024] [T1662263] Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 18 48 8b 44 24 18 64 48 2b 04 25 28 00 00
[Tue Oct  1 04:59:12 2024] [T1662263] RSP: 002b:00007ffeaf83c440 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[Tue Oct  1 04:59:12 2024] [T1662263] RAX: ffffffffffffffda RBX: 00007ffeaf83c540 RCX: 00007fb6e7b0f70f
[Tue Oct  1 04:59:12 2024] [T1662263] RDX: 00007ffeaf83c540 RSI: 00000000c0384627 RDI: 0000000000000012
[Tue Oct  1 04:59:12 2024] [T1662263] RBP: 00007ffeaf83c4f0 R08: 00007ffeaf83c540 R09: 00007ffeaf83c568
[Tue Oct  1 04:59:12 2024] [T1662263] R10: 0000000000000000 R11: 0000000000000246 R12: 00000000c0384627
[Tue Oct  1 04:59:12 2024] [T1662263] R13: 0000000000000012 R14: 00007ffeaf83c568 R15: 00007ffeaf83c4b0
[Tue Oct  1 04:59:12 2024] [T1662263]  </TASK>
[Tue Oct  1 04:59:12 2024] [T1662263] Mem-Info:
[Tue Oct  1 04:59:12 2024] [T1662263] active_anon:117150 inactive_anon:805190 isolated_anon:0
                                      active_file:992908 inactive_file:5342046 isolated_file:0
                                      unevictable:20 dirty:37006 writeback:0
                                      slab_reclaimable:489448 slab_unreclaimable:81950
                                      mapped:260686 shmem:64374 pagetables:9150
                                      sec_pagetables:282 bounce:0
                                      kernel_misc_reclaimable:0
                                      free:152490 free_pcp:311 free_cma:0
[Tue Oct  1 04:59:12 2024] [T1662263] Node 0 active_anon:468600kB inactive_anon:3220760kB active_file:3971632kB inactive_file:21368184kB unevictable:80kB isolated(anon):0kB isolated(file):0kB mapped:1042744kB dirty:148024kB writeback:0kB shmem:257496kB shmem_thp:0kB shmem_pmdmapped:0kB anon_thp:671744kB writeback_tmp:0kB kernel_stack:20160kB pagetables:36600kB sec_pagetables:1128kB all_unreclaimable? no
[Tue Oct  1 04:59:12 2024] [T1662263] Node 0 DMA free:1068kB boost:0kB min:28kB low:40kB high:52kB reserved_highatomic:0KB active_anon:4kB inactive_anon:884kB active_file:3524kB inactive_file:5776kB unevictable:0kB writepending:0kB present:15992kB managed:15360kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[Tue Oct  1 04:59:12 2024] [T1662263] lowmem_reserve[]: 0 609 31728 0 0
[Tue Oct  1 04:59:12 2024] [T1662263] Node 0 DMA32 free:8736kB boost:0kB min:1296kB low:1920kB high:2544kB reserved_highatomic:0KB active_anon:24900kB inactive_anon:235136kB active_file:112932kB inactive_file:347496kB unevictable:0kB writepending:0kB present:801824kB managed:735744kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[Tue Oct  1 04:59:12 2024] [T1662263] lowmem_reserve[]: 0 0 31118 0 0
[Tue Oct  1 04:59:12 2024] [T1662263] Node 0 Normal free:600660kB boost:194968kB min:261220kB low:293084kB high:324948kB reserved_highatomic:2048KB active_anon:443696kB inactive_anon:2983636kB active_file:3855176kB inactive_file:21014656kB unevictable:80kB writepending:148024kB present:32505856kB managed:31872992kB mlocked:80kB bounce:0kB free_pcp:1128kB local_pcp:0kB free_cma:0kB
[Tue Oct  1 04:59:12 2024] [T1662263] lowmem_reserve[]: 0 0 0 0 0
[Tue Oct  1 04:59:12 2024] [T1662263] Node 0 DMA: 3*4kB (M) 2*8kB (UM) 1*16kB (U) 2*32kB (UM) 1*64kB (U) 1*128kB (U) 1*256kB (U) 1*512kB (U) 0*1024kB 0*2048kB 0*4096kB = 1068kB
[Tue Oct  1 04:59:12 2024] [T1662263] Node 0 DMA32: 108*4kB (ME) 143*8kB (M) 27*16kB (ME) 13*32kB (ME) 13*64kB (ME) 8*128kB (M) 3*256kB (M) 1*512kB (M) 1*1024kB (M) 1*2048kB (M) 0*4096kB = 8632kB
[Tue Oct  1 04:59:12 2024] [T1662263] Node 0 Normal: 59497*4kB (UME) 17048*8kB (UME) 11852*16kB (UME) 1107*32kB (UME) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 1*2048kB (H) 0*4096kB = 601476kB
[Tue Oct  1 04:59:12 2024] [T1662263] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[Tue Oct  1 04:59:12 2024] [T1662263] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[Tue Oct  1 04:59:12 2024] [T1662263] 6399359 total pagecache pages
[Tue Oct  1 04:59:12 2024] [T1662263] 0 pages in swap cache
[Tue Oct  1 04:59:12 2024] [T1662263] Free swap  = 0kB
[Tue Oct  1 04:59:12 2024] [T1662263] Total swap = 0kB
[Tue Oct  1 04:59:12 2024] [T1662263] 8330918 pages RAM
[Tue Oct  1 04:59:12 2024] [T1662263] 0 pages HighMem/MovableOnly
[Tue Oct  1 04:59:12 2024] [T1662263] 174894 pages reserved
[Tue Oct  1 04:59:12 2024] [T1662263] 0 pages cma reserved
[Tue Oct  1 04:59:12 2024] [T1662263] 0 pages hwpoisoned
[Tue Oct  1 04:59:12 2024] [T1662263] Cannot map memory with base addr 0x7fb676000000 and size of 0x3000 pages
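
For anyone reading the dump above: "order:5" means the kernel wanted 2^5 physically contiguous pages, which is 128 kB with 4 kB pages. The "Node 0 Normal:" line lists the free blocks per size, so it can be checked for anything that large. Below is a minimal Python sketch of that check. The helper names (`free_blocks`, `usable_for`) are made up for illustration, and the 4 kB page size is an assumption that holds on typical x86-64 systems.

```python
import re

PAGE_KB = 4  # assumption: 4 kB base pages (x86-64 default)

def free_blocks(buddy_line):
    """Parse a 'Node N <zone>:' free-list line into (block_kb, count, flags) tuples."""
    # Matches entries such as '1107*32kB (UME)' or '0*64kB'; the trailing
    # '= 601476kB' total has no '*' and is therefore skipped.
    entries = re.findall(r"(\d+)\*(\d+)kB(?: \((\w+)\))?", buddy_line)
    return [(int(kb), int(count), flags) for count, kb, flags in entries]

def usable_for(order, blocks):
    """Return the non-empty free-list entries big enough for a 2**order page request."""
    need_kb = PAGE_KB << order
    return [(kb, count, flags) for kb, count, flags in blocks
            if kb >= need_kb and count > 0]

# The Normal zone line copied from the log above.
normal = ("Node 0 Normal: 59497*4kB (UME) 17048*8kB (UME) 11852*16kB (UME) "
          "1107*32kB (UME) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB "
          "1*2048kB (H) 0*4096kB = 601476kB")

blocks = free_blocks(normal)
print("order 5 needs", PAGE_KB << 5, "kB contiguous")
print("candidate blocks:", usable_for(5, blocks))
```

The only candidate it finds is the single 2048 kB block flagged (H). As far as I understand the kernel's flag letters, H marks the high-atomic reserve, which an ordinary GFP_KERNEL request cannot use. So although about 600 MB is free in the Normal zone, the largest generally available chunk is 32 kB. That reads as memory fragmentation rather than a shortage of RAM.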

 

Keith Myers
Joined: 11 Feb 11
Posts: 4963
Credit: 18701132778
RAC: 6262381

I found that the latest 550 driver update from the .107 point to the .120 point release is broken with the 6.11.0 kernel.

There is a missing file in the nvidia-dkms-550 module that causes dpkg errors, so the NVIDIA driver module is not built properly against the kernel.

The best solution is to remove the 550 level drivers and either backlevel or uplevel the drivers.  I went with the 555 series branch and have had no issues installing them into the 6.11.0 kernel.

Might want to read this pertaining to your specific issue.

https://forums.opensuse.org/t/issues-with-kernel-6-11-0-1-default-and-nvidia-drivers/178932/41

 

Cat22
Joined: 13 May 21
Posts: 28
Credit: 915488605
RAC: 1457016

Thanks, I will look into it. I'm currently on 550.120

Is there a way for me to get an email when someone replies to my topic post?

Keith Myers
Joined: 11 Feb 11
Posts: 4963
Credit: 18701132778
RAC: 6262381

Yes.  Toggle the Subscribe button at the top of the page and you will be notified of any further replies on the thread.

 

Cat22
Joined: 13 May 21
Posts: 28
Credit: 915488605
RAC: 1457016

I tried an earlier driver (550.100) but it failed to start X properly, so I had to return to the 550.120 driver. Looks like I'm stuck until they fix the NVIDIA driver.

Keith Myers
Joined: 11 Feb 11
Posts: 4963
Credit: 18701132778
RAC: 6262381

I suggest moving up to the 555 drivers.  That is what I did.  I had no issues with them.  Also, I think that whatever was installed with the 555 drivers makes it possible to revert cleanly to the 550.120 drivers.

I had no issues with missing files during the installation of the 550.120 drivers after the brief detour to the 555 drivers.

 

Cat22
Joined: 13 May 21
Posts: 28
Credit: 915488605
RAC: 1457016

I'm thinking of trying the new 560 beta version.

Keith Myers
Joined: 11 Feb 11
Posts: 4963
Credit: 18701132778
RAC: 6262381

Sure, why not.  You can always backlevel to what was working before.  Test one thing, analyze the result for the desired effect, then change one other thing and test again.

 

Cat22
Joined: 13 May 21
Posts: 28
Credit: 915488605
RAC: 1457016

Well, the 560 driver didn't work, so now I will go to the 555.58.02 driver and see what cooks.

Cat22
Joined: 13 May 21
Posts: 28
Credit: 915488605
RAC: 1457016

The 555 driver won't compile against the 6.11.0 kernel, so I guess I'm stuck until NVIDIA fixes the issue. But what I wonder is what is going on in the Einstein binary. As long as I don't run that I have no issues, so I don't think it's only an NVIDIA issue.
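
On what the binary was doing at that moment, the log itself gives a hint. The call trace goes from an ioctl into the NVIDIA module and then into pin_user_pages, and the last line reports a mapping of 0x3000 pages. Here is a rough Python sketch of the arithmetic. It assumes 4 kB pages, and it assumes the kernel needs one 8-byte pointer per pinned page in a single contiguous array. Neither assumption is stated in the log, and `pin_request` is only an illustrative name.

```python
PAGE_SIZE = 4096     # bytes; assumption: 4 kB pages (x86-64 default)
POINTER_SIZE = 8     # bytes; assumption: one 64-bit pointer tracked per pinned page

def pin_request(nr_pages):
    """Return (buffer_bytes, array_bytes, order) for pinning nr_pages user pages."""
    buffer_bytes = nr_pages * PAGE_SIZE
    array_bytes = nr_pages * POINTER_SIZE
    # Pages needed to hold the pointer array, rounded up.
    pages_needed = -(-array_bytes // PAGE_SIZE)
    # Smallest order such that 2**order pages cover the array.
    order = (pages_needed - 1).bit_length()
    return buffer_bytes, array_bytes, order

buffer_bytes, array_bytes, order = pin_request(0x3000)
print("buffer being pinned:", buffer_bytes // (1024 * 1024), "MiB")
print("per-page pointer array:", array_bytes // 1024, "KiB")
print("allocation order for that array:", order)
```

Under those assumptions, pinning a 48 MiB buffer needs a 96 KiB array, which rounds up to an order-5 (128 kB) allocation. That is the same order as the one that failed in the log. If that reading is right, any program that asks the driver to pin a buffer this large could hit the same failure once memory is fragmented, and the Einstein task just happens to be the one doing it on this machine.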

Ian&Steve C.
Joined: 19 Jan 20
Posts: 3945
Credit: 46621852642
RAC: 64201858

How are you installing the driver? With the NVIDIA runfile? Or are you using some package manager with openSUSE?
 

With the runfile install you do need some prereqs, and a newer version of GCC.
