Gentoo Forums
system freeze, capturing logs ?

 
Gentoo Forums Forum Index » Kernel & Hardware
sdauth
Guru


Joined: 19 Sep 2018
Posts: 569
Location: Ásgarðr

Posted: Wed Jul 20, 2022 8:52 am    Post subject: system freeze, capturing logs ?

Hello,
One of my laptops froze completely recently with kernel 5.15.52. It was impossible to switch to another console, and the keyboard was totally unresponsive. No remote SSH access was possible either. I was forced to do a hard shutdown (by holding the power button).
My question: when this happens, is there a way to trigger an automatic reboot (after a certain wait, like 2 min) and also to capture some logs to find out the root cause of the lockup?
My kernel is really trimmed down, so I don't have CONFIG_MAGIC_SYSRQ or CONFIG_DEBUG_FS enabled. Are these mandatory?
Generally speaking, what is the best way to handle this?

Thanks
jpsollie
Apprentice


Joined: 17 Aug 2013
Posts: 291

Posted: Wed Jul 20, 2022 12:20 pm

The best way to handle this is to configure your kernel to write the panic message to a persistent store (UEFI variables, a reserved portion of RAM, or a block device), and to trigger kexec to boot a rescue kernel so you can retrieve the panic message from the persistent store.
I'm not sure about the details, but usually neither SysRq nor debugfs is needed.
sdauth
Guru


Joined: 19 Sep 2018
Posts: 569
Location: Ásgarðr

Posted: Wed Jul 20, 2022 5:30 pm

Thanks jpsollie, do you have more detailed instructions, if possible? For now I have enabled SysRq (and also kexec and debugfs, just in case) to see if I can at least trigger a clean shutdown with the key combination if it happens again (assuming the keyboard responds next time :o ). I use OpenRC, if that's relevant. Also, my system uses two partitions (with LVM), one for / and one for /home.
sublogic
Apprentice


Joined: 21 Mar 2022
Posts: 222
Location: Pennsylvania, USA

Posted: Thu Jul 21, 2022 2:14 am    Post subject: Re: system freeze, capturing logs ?

sdauth wrote:
Hello,
One of my laptops froze completely recently with kernel 5.15.52. It was impossible to switch to another console, and the keyboard was totally unresponsive. No remote SSH access was possible either. I was forced to do a hard shutdown (by holding the power button).
That looks like a normal kernel panic. The panic message is probably on tty1 but, as you noted, it is too late to switch consoles. It's also too late for Magic SysRq, by the way.

sdauth wrote:

My question: when this happens, is there a way to trigger an automatic reboot (after a certain wait, like 2 min) and also to capture some logs to find out the root cause of the lockup?
Yes, by enabling kernel crash dumps and preloading a capture kernel. When your main kernel panics, the capture kernel does a warm boot and the crash dump is available in /proc/vmcore . You copy that somewhere and analyze it later with the crash utility.

For the kernel config, see https://wiki.gentoo.org/wiki/Kernel_Crash_Dumps.
For userspace tools,
Code:
# emerge -av sys-apps/kexec-tools dev-util/crash
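Once the tools are installed and a capture kernel has been preloaded (kexec -p), its state can be checked from userspace. A minimal C sketch, assuming a kernel with kexec support, which exposes /sys/kernel/kexec_crash_loaded, and a capture kernel with /proc/vmcore support; read_sysfs_flag() and in_capture_kernel() are hypothetical helpers, not part of kexec-tools:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: reads a sysfs flag file such as
 * /sys/kernel/kexec_crash_loaded, which contains "1" once a capture
 * kernel has been loaded with kexec -p and "0" otherwise.
 * Returns 1 or 0, or -1 if the file is missing or unreadable. */
static int read_sysfs_flag(const char *path)
{
    int value;
    FILE *f;

    assert(path != NULL);
    f = fopen(path, "r");
    if (f == NULL)
        return -1;
    if (fscanf(f, "%d", &value) != 1)
        value = -1;
    fclose(f);
    return value < 0 ? -1 : value != 0;
}

/* Hypothetical helper: /proc/vmcore is only created in the capture
 * kernel, so its presence means we were warm-booted after a crash and
 * the dump is waiting to be copied somewhere safe. */
static int in_capture_kernel(void)
{
    return access("/proc/vmcore", F_OK) == 0;
}
```

If read_sysfs_flag("/sys/kernel/kexec_crash_loaded") returns 0, no capture kernel is loaded and a panic will not trigger the warm boot; in_capture_kernel() is the kind of check a boot script could use to decide whether to copy /proc/vmcore away before analysing it with crash.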

My own adventures with this process are in this monologue. It may be hard to read because it's really a crash dump of my brain while I could remember what happened. It's all a blur now.

The wiki article is a bit dated but still useful. The kernel tree has more info in Documentation/admin-guide/kdump/kdump.rst. If you get into deep debugging like I did, you need CONFIG_DEBUG_INFO, but that bloats your kernel! If I remember correctly, you don't need DEBUG_INFO to extract the panic message.

The references I collected about the crash utility are in post 8700413.

Prepare yourself for a learning experience . . . Good luck.
sdauth
Guru


Joined: 19 Sep 2018
Posts: 569
Location: Ásgarðr

Posted: Thu Jul 21, 2022 7:30 am

Wow, very informative sublogic :wink: ! Your thread is exactly what I was looking for. I should have used the search function :)
OK, so that's what I feared: too late for the SysRq magic. It was indeed totally frozen.
I will install the utilities mentioned and double-check that everything needed is enabled in my kernel. For now, though, I keyworded 5.15.55 and I'm not seeing any crash. (The last one was really random, though; I was just scrolling a webpage.) Anyway, all that info is really nice to have, so thanks!
sublogic
Apprentice


Joined: 21 Mar 2022
Posts: 222
Location: Pennsylvania, USA

Posted: Thu Jul 21, 2022 2:59 pm

sdauth wrote:
Wow, very informative sublogic :wink: ! Your thread is exactly what I was looking for.

Uh, there's a better way now. I wish I had known that earlier.
https://blogs.oracle.com/linux/post/pstore-linux-kernel-persistent-storage-file-system
That must be what jpsollie was referring to ?
sublogic
Apprentice


Joined: 21 Mar 2022
Posts: 222
Location: Pennsylvania, USA

Posted: Sat Jul 23, 2022 1:28 am

sdauth, why don't you look under /sys/fs/pstore/? There might be a little something for you already.

I say this because, on my two computers, pstore is already configured and I didn't do anything !

Also, try running
Code:
$ sudo grep pstore /var/log/dmesg
[    0.576049] pstore: Registered erst as persistent store backend
[    1.896699] pstore: Using crash dump compression: deflate

The above is from my desktop. On my (old!) laptop, the one that crashed earlier, the pstore messages are absent. I'll try CONFIG_PSTORE_BLK on that computer.
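If you want to script that check, the registration message has a fixed shape, so the backend name can be pulled out of a dmesg line. A small C sketch, assuming the message format shown above ("pstore: Registered <name> as persistent store backend"); pstore_backend_from_line() is a hypothetical helper:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: extracts the backend name from a dmesg line like
 * "[    0.576049] pstore: Registered erst as persistent store backend".
 * Returns 1 and writes the NUL-terminated name into `name` on a match,
 * 0 if the line is not a pstore registration message. */
static int pstore_backend_from_line(const char *line, char *name, size_t size)
{
    static const char prefix[] = "pstore: Registered ";
    static const char suffix[] = " as persistent store backend";
    const char *start, *end;
    size_t len;

    assert(line != NULL && name != NULL && size > 0);
    start = strstr(line, prefix);
    if (start == NULL)
        return 0;
    start += sizeof prefix - 1;          /* skip past the prefix text */
    end = strstr(start, suffix);
    if (end == NULL || end == start)
        return 0;
    len = (size_t)(end - start);
    if (len >= size)
        len = size - 1;                  /* truncate, keep room for NUL */
    memcpy(name, start, len);
    name[len] = '\0';
    return 1;
}
```

Fed the two lines grepped above, it reports "erst" for the first and no match for the compression line.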
sdauth
Guru


Joined: 19 Sep 2018
Posts: 569
Location: Ásgarðr

Posted: Sat Jul 23, 2022 2:23 pm

sublogic wrote:
sdauth, why don't you look under /sys/fs/pstore/? There might be a little something for you already.

I say this because, on my two computers, pstore is already configured and I didn't do anything !

Also, try running
Code:
$ sudo grep pstore /var/log/dmesg
[    0.576049] pstore: Registered erst as persistent store backend
[    1.896699] pstore: Using crash dump compression: deflate

The above is from my desktop. On my (old!) laptop, the one that crashed earlier, the pstore messages are absent. I'll try CONFIG_PSTORE_BLK on that computer.


Nice find!
Unfortunately, it was empty, and dmesg has no "pstore: Registered erst [...]" line either. Maybe I need to enable CONFIG_ACPI_APEI_ERST_DEBUG?
On my laptop :

Code:
zgrep PSTORE /proc/config.gz
CONFIG_PSTORE=y
CONFIG_PSTORE_DEFAULT_KMSG_BYTES=10240
CONFIG_PSTORE_DEFLATE_COMPRESS=y
# CONFIG_PSTORE_LZO_COMPRESS is not set
# CONFIG_PSTORE_LZ4_COMPRESS is not set
# CONFIG_PSTORE_LZ4HC_COMPRESS is not set
# CONFIG_PSTORE_842_COMPRESS is not set
# CONFIG_PSTORE_ZSTD_COMPRESS is not set
CONFIG_PSTORE_COMPRESS=y
CONFIG_PSTORE_DEFLATE_COMPRESS_DEFAULT=y
CONFIG_PSTORE_COMPRESS_DEFAULT="deflate"
# CONFIG_PSTORE_CONSOLE is not set
# CONFIG_PSTORE_PMSG is not set
# CONFIG_PSTORE_RAM is not set
# CONFIG_PSTORE_BLK is not set


This looks much simpler indeed. I'll enable that right now. This is much better than enabling kexec & debug_info :o
So I guess I have to enable :

CONFIG_PSTORE_BLK
I guess I need to add a very small partition in my LVM VG to hold the logs. Outside of the root LV obviously.

CONFIG_PSTORE_CONSOLE seems to log everything (including oops and panic) so it's a bit overkill. I just want some log when it crashes.
For now I have not been able to reproduce the crash I experienced.. but it will be handy next time it happens! Thank you sublogic !
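To compare kernel configs like the one above without eyeballing them, a small C helper can check whether a bool/tristate option is built in or modular. A sketch, assuming the text has already been decompressed (for example with zcat /proc/config.gz); kconfig_enabled() is a hypothetical helper, and string options such as CONFIG_PSTORE_COMPRESS_DEFAULT are deliberately not handled:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: returns 1 if `option` (e.g. "CONFIG_PSTORE_BLK") is
 * set to y (built in) or m (module) in the text of a kernel .config, and 0
 * if it is absent or appears as "# CONFIG_... is not set". Only bool and
 * tristate options are handled. */
static int kconfig_enabled(const char *config, const char *option)
{
    const char *line = config;
    size_t n;

    assert(config != NULL && option != NULL);
    n = strlen(option);
    while (*line != '\0') {
        const char *eol = strchr(line, '\n');
        size_t len = eol != NULL ? (size_t)(eol - line) : strlen(line);

        /* An enabled option is exactly "OPTION=y" or "OPTION=m". */
        if (len == n + 2 && strncmp(line, option, n) == 0 &&
            line[n] == '=' && (line[n + 1] == 'y' || line[n + 1] == 'm'))
            return 1;
        if (eol == NULL)
            break;
        line = eol + 1;
    }
    return 0;
}
```

The exact-length check keeps CONFIG_PSTORE from matching CONFIG_PSTORE_BLK=y, so each option in the list above can be queried on its own.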
Hu
Moderator


Joined: 06 Mar 2007
Posts: 21630

Posted: Sat Jul 23, 2022 4:12 pm

A panic is a hard crash, and your original description is consistent with a panic. An oops is a kernel bug and should not happen in normal operation. Therefore, I would not find it excessive to have a mechanism that logs every oops and every panic.

The system could also be completely hung. I have seen a failing motherboard result in a system that just locks up for no reason, and the kernel stops so suddenly that there is nothing you can do in Linux to prepare or handle it.
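For reference, whether an oops as well as a panic ends up in pstore is decided by a numeric "max reason" threshold; that is what CONFIG_PSTORE_BLK_MAX_REASON sets (sdauth's config in this thread uses 2). A simplified userspace model in C, assuming the enum order of include/linux/kmsg_dump.h in 5.15; would_record() is a hypothetical stand-in for the per-dumper check in the kernel's kmsg_dump(), not kernel code:

```c
#include <assert.h>
#include <stdbool.h>

/* Same order as enum kmsg_dump_reason in include/linux/kmsg_dump.h (5.15). */
enum kmsg_dump_reason {
    KMSG_DUMP_UNDEF,    /* 0: dumper did not set a limit */
    KMSG_DUMP_PANIC,    /* 1 */
    KMSG_DUMP_OOPS,     /* 2 */
    KMSG_DUMP_EMERG,    /* 3 */
    KMSG_DUMP_SHUTDOWN, /* 4 */
    KMSG_DUMP_MAX
};

/* Simplified model of the per-dumper filter: an event is recorded only if
 * its reason does not exceed the dumper's max_reason. An UNDEF max_reason
 * falls back to OOPS (printk.always_kmsg_dump is ignored here). */
static bool would_record(enum kmsg_dump_reason reason,
                         enum kmsg_dump_reason max_reason)
{
    assert(reason > KMSG_DUMP_UNDEF && reason < KMSG_DUMP_MAX);
    if (max_reason == KMSG_DUMP_UNDEF)
        max_reason = KMSG_DUMP_OOPS;
    return reason <= max_reason;
}
```

With a threshold of 2 (OOPS), both panics and oopses are captured, which is the "log every oops and every panic" behaviour described above; a threshold of 1 would keep panics only.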
sdauth
Guru


Joined: 19 Sep 2018
Posts: 569
Location: Ásgarðr

Posted: Sat Jul 23, 2022 5:17 pm

@Hu: OK, I enabled PSTORE_CONSOLE so it logs everything.
So I added a 12M partition and formatted it as ext4 for pstore.
I added the needed options and rebuilt my kernel, but ERST doesn't seem to register.
I should see:
Code:
[    0.576049] pstore: Registered erst as persistent store backend
[    1.896699] pstore: Using crash dump compression: deflate


but
grep pstore /var/log/dmesg
returns nothing.

and
grep ERST /var/log/dmesg
Code:
ERST DBG: ERST support is disabled


config :
zgrep APEI /proc/config.gz
Code:
CONFIG_HAVE_ACPI_APEI=y
CONFIG_HAVE_ACPI_APEI_NMI=y
CONFIG_ACPI_APEI=y
CONFIG_ACPI_APEI_GHES=y
CONFIG_ACPI_APEI_PCIEAER=y
CONFIG_ACPI_APEI_MEMORY_FAILURE=y
CONFIG_ACPI_APEI_ERST_DEBUG=y


zgrep PSTORE /proc/config.gz
Code:
CONFIG_PSTORE=y
CONFIG_PSTORE_DEFAULT_KMSG_BYTES=10240
CONFIG_PSTORE_DEFLATE_COMPRESS=y
# CONFIG_PSTORE_LZO_COMPRESS is not set
# CONFIG_PSTORE_LZ4_COMPRESS is not set
# CONFIG_PSTORE_LZ4HC_COMPRESS is not set
# CONFIG_PSTORE_842_COMPRESS is not set
# CONFIG_PSTORE_ZSTD_COMPRESS is not set
CONFIG_PSTORE_COMPRESS=y
CONFIG_PSTORE_DEFLATE_COMPRESS_DEFAULT=y
CONFIG_PSTORE_COMPRESS_DEFAULT="deflate"
CONFIG_PSTORE_CONSOLE=y
# CONFIG_PSTORE_PMSG is not set
# CONFIG_PSTORE_RAM is not set
CONFIG_PSTORE_ZONE=y
CONFIG_PSTORE_BLK=y
CONFIG_PSTORE_BLK_BLKDEV="PARTUUID=c317b1e1-c66f-46f8-add9-5e4dbe68f722"
CONFIG_PSTORE_BLK_KMSG_SIZE=512
CONFIG_PSTORE_BLK_MAX_REASON=2
CONFIG_PSTORE_BLK_CONSOLE_SIZE=1024


Any idea what's wrong here? Thanks
sdauth
Guru


Joined: 19 Sep 2018
Posts: 569
Location: Ásgarðr

Posted: Sat Jul 23, 2022 6:06 pm

If I compile CONFIG_ACPI_APEI_ERST_DEBUG as a module :

Code:
modprobe -v erst-dbg
insmod /lib/modules/5.15.56-gentoo-gnu-x200/kernel/drivers/acpi/apei/erst-dbg.ko
modprobe: ERROR: could not insert 'erst_dbg': No such device


:?:
spica
Apprentice


Joined: 04 Jun 2021
Posts: 287

Posted: Sat Jul 23, 2022 6:40 pm

I think you need to look at the erst_disable kernel parameter: https://www.kernel.org/doc/html/v5.15/admin-guide/kernel-parameters.html

I guess it is 1 by default, i.e. ERST is disabled by default; at least, this is what I found by grepping the code:
/usr/src/linux/drivers/acpi/apei/erst-dbg.c - see the two blocks with return -ENODEV
/usr/src/linux/drivers/acpi/apei/erst.c - see what value is assigned to erst_disable
sdauth
Guru


Joined: 19 Sep 2018
Posts: 569
Location: Ásgarðr

Posted: Sat Jul 23, 2022 6:56 pm

Hi spica,
I tried passing erst_disable=0; result:

grep ERST /var/log/dmesg
Code:
ERST: Error Record Serialization Table (ERST) support is disabled.


Also, I disabled APEI_ERST_DEBUG; it is not needed, I was confused at first.
I only need ACPI_APEI (built-in), but ERST still fails to register. I wonder if that's related to my laptop (it uses coreboot, so maybe something is buggy with ACPI, who knows).

EDIT: I asked the coreboot devs; indeed, ERST is not implemented for this old laptop. :o
spica
Apprentice


Joined: 04 Jun 2021
Posts: 287

Posted: Sat Jul 23, 2022 9:20 pm

Hi sdauth,

It looks like my assumption about erst_disable was wrong.
I did some experiments locally to check it.
I put a few debug prints into erst.c and passed erst_disable=0, and I see that this completely disables ERST. The following routine is called whenever erst_disable is present on the kernel command line, whatever its value, and it disables ERST unconditionally:
Code:
static int __init setup_erst_disable(char *str)
{
  pr_info("=== setup_erst_disable was called"); // added
  erst_disable = 1;
  return 1;
}

__setup("erst_disable", setup_erst_disable);
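Why erst_disable=0 still disables ERST: the handler is registered for the name "erst_disable", receives whatever follows that name ("=0" here), and never looks at it. A simplified userspace model in C; dispatch_param() is a hypothetical stand-in for the kernel's __setup() matching in init/main.c, which additionally treats '-' and '_' as equivalent and handles early parameters (both omitted here):

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

static int erst_disable;  /* 0 until the boot parameter is seen */

/* Same logic as the handler quoted above: the argument is ignored. */
static int setup_erst_disable(char *str)
{
    (void)str;            /* "=0", "=1" or "" all end up here unused */
    erst_disable = 1;
    return 1;
}

/* Hypothetical stand-in for __setup() dispatch: a boot parameter that
 * starts with the registered name is passed to the handler together with
 * everything after that name. */
static bool dispatch_param(char *param)
{
    static const char name[] = "erst_disable";
    const size_t n = sizeof name - 1;

    assert(param != NULL);
    if (strncmp(param, name, n) != 0)
        return false;
    return setup_erst_disable(param + n) == 1;
}
```

So the only way to keep ERST enabled from the command line is to leave erst_disable out entirely.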

--------------------------------------

I had a thought: if we see different dmesg output with and without erst_disable, then, when erst_disable is absent, there must be another condition that turns ERST off.
So I added an info message before each goto err in drivers/acpi/apei/erst.c, so that I can easily grep for them:
Code:
static int __init erst_init(void)
{
	/* ... */
	if (acpi_disabled) {
		pr_info("=== acpi_disabled"); // added
		goto err;
	}

	if (erst_disable) {
		pr_info("=== erst_disable"); // added
		pr_info(
	"Error Record Serialization Table (ERST) support is disabled.\n");
		goto err;
	}

	status = acpi_get_table(ACPI_SIG_ERST, 0,
				(struct acpi_table_header **)&erst_tab);
	if (status == AE_NOT_FOUND) {
		pr_info("=== AE_NOT_FOUND"); // added
		goto err;
	} else if (ACPI_FAILURE(status)) {
		const char *msg = acpi_format_exception(status);
		pr_err("Failed to get table, %s\n", msg);
		rc = -EINVAL;
		pr_info("=== ACPI_FAILURE"); // added
		goto err;
	}
	/* ... */
err:
	pr_info("=== the err: point reached"); // added
	erst_disable = 1;
	return rc;
}


And according to my dmesg output, ACPI reports that the ERST table is absent:
Code:
# dmesg | grep ===
[    0.970745] ERST: === AE_NOT_FOUND
[    0.970746] ERST: === the err: point reached


I think my findings prove that my laptop does not have an ERST table in ACPI either.
Near the start of the dmesg output there is a list of ACPI tables; if ERST were present, I think there would be a line for it there. I do not see one in my dmesg.
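The same question (does the firmware provide an ERST table at all?) can also be answered without patching erst.c: with ACPI enabled, the kernel exposes each firmware table as a file named after its signature under /sys/firmware/acpi/tables/. A small C sketch assuming that layout; acpi_table_present() is a hypothetical helper, and the directory is a parameter only so the check can be tested:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <sys/stat.h>

/* Hypothetical helper: true if `tables_dir` (normally
 * /sys/firmware/acpi/tables) contains a table file named `sig`,
 * e.g. "ERST". Only existence is checked, so root is not required. */
static bool acpi_table_present(const char *tables_dir, const char *sig)
{
    char path[512];
    struct stat st;

    assert(tables_dir != NULL && sig != NULL);
    if (snprintf(path, sizeof path, "%s/%s", tables_dir, sig) >= (int)sizeof path)
        return false;  /* path too long to be a real table entry */
    return stat(path, &st) == 0 && S_ISREG(st.st_mode);
}
```

On the coreboot/SeaBIOS laptop discussed here, acpi_table_present("/sys/firmware/acpi/tables", "ERST") would be expected to return false, matching the AE_NOT_FOUND message.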

--------------------------------------

Does the device boot via UEFI? If so, you can tell PSTORE to use the EFI backend.
I'm not sure these options are enough; maybe something else needs to be set too. The first must be "y", the second "n":
Code:
CONFIG_EFI_VARS_PSTORE=y
CONFIG_EFI_VARS_PSTORE_DEFAULT_DISABLE=n


EFI backend works on my laptop.
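To answer the UEFI question from userspace: with CONFIG_EFI enabled, the kernel creates /sys/firmware/efi only when it was started by UEFI firmware, so its absence means a legacy BIOS boot (for example coreboot with SeaBIOS), where the EFI pstore backend cannot be used. A C sketch; booted_via_uefi() is a hypothetical helper, and the sysfs root is a parameter only so the check can be tested:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <sys/stat.h>

/* Hypothetical helper: true if <sys_root>/firmware/efi is a directory.
 * On a real system pass "/sys". */
static bool booted_via_uefi(const char *sys_root)
{
    char path[256];
    struct stat st;

    assert(sys_root != NULL);
    if (snprintf(path, sizeof path, "%s/firmware/efi", sys_root) >= (int)sizeof path)
        return false;
    return stat(path, &st) == 0 && S_ISDIR(st.st_mode);
}
```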

It was interesting for me to look at ERST because I recently had a problem: efibootmgr refused to set a boot entry because NVRAM was filled up with logs from PSTORE.
systemd users do not have this issue because systemd has a service (systemd-pstore) that does the housekeeping, moving the data to disk, but OpenRC does not seem to have an equivalent.
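On OpenRC, that housekeeping could be approximated with a small boot-time helper that copies each record out of /sys/fs/pstore and then deletes it; deleting a file in pstore asks the backend to erase the stored record, which is what frees the NVRAM. A rough C sketch, assuming the archive directory already exists on persistent storage; archive_pstore() and copy_file() are hypothetical helpers, not a port of systemd-pstore:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Copies one file byte for byte; returns 0 on success, -1 on any error. */
static int copy_file(const char *from, const char *to)
{
    char buf[4096];
    size_t n;
    int rc = 0;
    FILE *in = fopen(from, "rb");
    FILE *out;

    if (in == NULL)
        return -1;
    out = fopen(to, "wb");
    if (out == NULL) {
        fclose(in);
        return -1;
    }
    while ((n = fread(buf, 1, sizeof buf, in)) > 0)
        if (fwrite(buf, 1, n, out) != n) {
            rc = -1;
            break;
        }
    if (ferror(in))
        rc = -1;
    if (fclose(out) != 0)
        rc = -1;
    fclose(in);
    return rc;
}

/* Hypothetical helper: moves every regular file from pstore_dir (normally
 * /sys/fs/pstore) into archive_dir. A record is deleted only after its copy
 * succeeded. Returns the number of records moved, or -1 if pstore_dir
 * cannot be opened. */
static int archive_pstore(const char *pstore_dir, const char *archive_dir)
{
    DIR *dir;
    struct dirent *entry;
    int moved = 0;

    assert(pstore_dir != NULL && archive_dir != NULL);
    dir = opendir(pstore_dir);
    if (dir == NULL)
        return -1;
    while ((entry = readdir(dir)) != NULL) {
        char src[1024], dst[1024];
        struct stat st;

        if (snprintf(src, sizeof src, "%s/%s", pstore_dir, entry->d_name) >= (int)sizeof src ||
            snprintf(dst, sizeof dst, "%s/%s", archive_dir, entry->d_name) >= (int)sizeof dst)
            continue;                     /* path too long, skip it */
        if (stat(src, &st) != 0 || !S_ISREG(st.st_mode))
            continue;                     /* skips ".", ".." and non-records */
        if (copy_file(src, dst) == 0 && unlink(src) == 0)
            moved++;
    }
    closedir(dir);
    return moved;
}
```

It could be run from an OpenRC local.d start script; a real tool would also add a timestamp or per-boot subdirectory so that records from different boots do not overwrite each other.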
sdauth
Guru


Joined: 19 Sep 2018
Posts: 569
Location: Ásgarðr

Posted: Sat Jul 23, 2022 9:56 pm

spica wrote:
Does the device boot via UEFI? If so, you can tell PSTORE to use the EFI backend.
I'm not sure these options are enough; maybe something else needs to be set too. The first must be "y", the second "n":
Code:
CONFIG_EFI_VARS_PSTORE=y
CONFIG_EFI_VARS_PSTORE_DEFAULT_DISABLE=n

EFI backend works on my laptop.

This laptop (running coreboot) doesn't have EFI support at all, unless you use a special payload (TianoCore / EDK2), which could provide the ERST table or at least allow the use of the EFI backend instead, as you said. Currently I'm using coreboot with the SeaBIOS payload, so it's classic BIOS boot only, and the ERST table is missing :lol:
Also, to make sure it really wasn't a config error on my side, I tried a Debian install to confirm that ERST was missing, and it was.
Anyway, good findings spica, thanks.
I'll update this thread if I decide to try the EFI stuff.. Not motivated for now :lol:
sublogic
Apprentice


Joined: 21 Mar 2022
Posts: 222
Location: Pennsylvania, USA

Posted: Sun Jul 24, 2022 10:52 pm

sdauth wrote:
This [pstore] looks much simpler indeed [than kexec -p]. I'll enable that right now.
So I guess I have to enable :

CONFIG_PSTORE_BLK
I guess I need to add a very small partition in my LVM VG to hold the logs. Outside of the root LV obviously.

sdauth wrote:
So I added a 12M partition and formatted it as ext4 for pstore.
I added the needed options and rebuilt my kernel, but ERST doesn't seem to register.


If I understand correctly, PSTORE_BLK would replace ACPI_APEI (the erst backend, which neither of us has). If it works, dmesg should say something about a block device being registered. Also, I think there is no need to put a filesystem on the partition.

Reading between the lines in fs/pstore/Kconfig and Documentation/admin-guide/pstore-blk.rst I suspect that the PSTORE_BLK_BLKDEV has to be a real disk partition, not a logical volume under /dev/mapper. I tried a logical volume anyway, specified as major:minor. Nothing (as expected).

My next move would be run parted and shrink my /dev/sda1, BIOS boot partition, which is larger than necessary. Then make a new partition between sda1 and sda2. I might have to edit /etc/fstab and re-run grub-install but, hey, I have rescue media. What could possibly go wrong ?
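The suspicion that PSTORE_BLK_BLKDEV needs a real partition rather than a logical volume can be checked before rebooting. A C sketch, assuming the usual sysfs layout where /sys/dev/block/<major>:<minor>/ contains a dm/ directory for device-mapper devices (such as LVM volumes) and a partition attribute for disk partitions; classify_blkdev() and its enum are hypothetical names:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>
#include <unistd.h>

enum blkdev_kind { NOT_BLOCK_DEVICE, WHOLE_DISK, PARTITION, DEVICE_MAPPER };

/* Hypothetical helper: stats `path`, reports its major:minor numbers and
 * tells whether it is a whole disk, a partition, or a device-mapper
 * device such as an LVM logical volume. */
static enum blkdev_kind classify_blkdev(const char *path,
                                        unsigned *maj, unsigned *min)
{
    struct stat st;
    char attr[128];

    assert(path != NULL && maj != NULL && min != NULL);
    if (stat(path, &st) != 0 || !S_ISBLK(st.st_mode))
        return NOT_BLOCK_DEVICE;
    *maj = major(st.st_rdev);
    *min = minor(st.st_rdev);

    /* Device-mapper devices carry a dm/ subdirectory in sysfs. */
    snprintf(attr, sizeof attr, "/sys/dev/block/%u:%u/dm", *maj, *min);
    if (access(attr, F_OK) == 0)
        return DEVICE_MAPPER;

    /* Partitions expose a "partition" attribute holding their number. */
    snprintf(attr, sizeof attr, "/sys/dev/block/%u:%u/partition", *maj, *min);
    return access(attr, F_OK) == 0 ? PARTITION : WHOLE_DISK;
}
```

For an LVM volume (a hypothetical /dev/mapper/vg-pstore, say) this would be expected to return DEVICE_MAPPER, and the reported major:minor pair is the form sublogic tried for the logical volume.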
sublogic
Apprentice


Joined: 21 Mar 2022
Posts: 222
Location: Pennsylvania, USA

Posted: Sat Jul 30, 2022 1:50 am

Well, I had some free space at the end of /dev/sda ("print free" in parted). Made a partition out of it. Tried to use it with PSTORE_BLK. Nothing.

It's not so simple. PSTORE_BLK doesn't work by itself. Grepping the kernel source tree for calls to register_pstore_device(), the only backend for PSTORE_BLK is in drivers/mtd/ (Memory Technology Device (MTD) support). There is no pstore backend to save panic messages to disk. Based on the notes at the bottom of Documentation/admin-guide/pstore-blk.rst, it would be extraordinarily difficult (and dangerous!) to write one.

sdauth, back to post 8728766 for you!
Goverp
Advocate


Joined: 07 Mar 2007
Posts: 2006

Posted: Sat Jul 30, 2022 12:15 pm

FWIW, some time back I successfully found compressed dumps in pstore using the following settings:
Code:
CONFIG_EFI_VARS_PSTORE=m
CONFIG_PSTORE=y
CONFIG_PSTORE_DEFAULT_KMSG_BYTES=10240
CONFIG_PSTORE_ZSTD_COMPRESS=y
CONFIG_PSTORE_COMPRESS=y
CONFIG_PSTORE_ZSTD_COMPRESS_DEFAULT=y
CONFIG_PSTORE_COMPRESS_DEFAULT="zstd"

All other PSTORE variables are unset. That said, I haven't seen anything in there recently, even when I thought there should be, so maybe I'm missing another setting somewhere.

This obviously uses EFIVAR storage, which is strictly limited and not to be overworked.
All times are GMT