[Solved] Journal aborted after random amount of time

Stamper · n00b Joined: 20 Apr 2016 Posts: 5

Hi everyone!

I have been using Gentoo for a couple of months now, so I know the basics, but now I am treading advanced waters where I don't yet know how to swim.
Hopefully some of you guys and girls might be willing and able to help shed some light on this issue.

The Issue / Symptoms
After a fresh install, we receive a "Journal aborted" error on an ext4-formatted Samsung 850 Pro 512GB SSD. The error appears an arbitrary amount of time after login. Once it occurs, the system does not recognize newly plugged-in USB keyboards or mice, does not respond to SSH connection requests, and does not respond to input from an already plugged-in USB keyboard.

The literal error:
[<timestamp>] EXT4-fs error (Device sda3): ext4_journal_check_start:56: Detected aborted journal
[<timestamp>] EXT4-fs (sda3): Remounting filesystem read-only

sda3 is our root partition, so this seems to crash the OS into a completely unrecoverable, non-responsive state. We tried looking through the logs in /var/log, but they contain no new entries from around the time of the symptoms.

My thoughts
First of all, I was completely stumped when I first experienced this error! We are still checking whether it could be a hardware-related issue. I wanted to start this thread to see if anyone knows what is going on here, and whether it might be a software- or driver-related issue.

We have two identical desktops/servers. The issue started on only one of them, about 1.5 hours after install. The other one did not initially show the error, but after swapping the SSDs between the servers we experienced the error on the "supposedly okay" server. However, if that was due to an already corrupt filesystem on the swapped SSD, then the supposedly okay server may still be okay.

At the moment we are using an image we pre-built to reinstall both servers when we think it is necessary during the debug process.

What we have tried
- We tried a completely new install.
- We ran e2fsck on our pre-built image, which came back clean.
- We are trying an Ubuntu install to see if it produces the same errors.
- We tried ext3, but with no luck. It shows the exact same symptoms except for the error in the console.
- We ran a series of hardware tests (partially under Windows): HDTune surface scan (slow), 10 passes of Memtest86, Furmark, 1 hour of Prime95, and writing 12GB of data to the SSDs and checking the checksums before and after.
- We are trying to deduce if it is a hardware-related issue by swapping hardware. This is still ongoing.
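For anyone who wants to repeat the write-and-checksum test from the list above, here is a minimal sketch in Python. It is not the tool we actually used; the function names, the chunk size and the example path are all assumptions made for illustration.

```python
import hashlib
import os


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large files do not need to fit in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_and_verify(path, total_bytes, chunk_size=1024 * 1024):
    """Write random data to path, then re-read it and compare SHA-256 sums."""
    expected = hashlib.sha256()
    remaining = total_bytes
    with open(path, "wb") as handle:
        while remaining > 0:
            chunk = os.urandom(min(chunk_size, remaining))
            handle.write(chunk)
            expected.update(chunk)  # checksum of what we intended to write
            remaining -= len(chunk)
        handle.flush()
        os.fsync(handle.fileno())  # ask the OS to push the data to the device
    return expected.hexdigest() == sha256_of_file(path)


# Hypothetical usage (path and size are examples only):
# ok = write_and_verify("/data/checksum-test.bin", 12 * 1024**3)
```

Keep in mind that the read-back may be served from the page cache rather than from the disk, so a test like this says more about the drive when the file is much larger than RAM or the caches are dropped in between.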

The hardware
HDD: Samsung 850 Pro 512GB SSD
Mobo: Gigabyte X99-UD3 LGA 2011-3
Proc: Intel i7 5820K
Memory: 2x8GB DDR4 Corsair 2133MHz Vengeance LPX

The software
- Gentoo amd64
- Installed from the minimal installation + Ubuntu Linux (needed Ubuntu for the UEFI grub2 install)
- Installed Stage3 from 31-03-2016
- Installed using genkernel
- 4 partitions: 1{200MiB /boot FAT32 ESP} 2{16GiB swap} 3{100GiB / Ext4} 4{400GiB /data Ext4}
- Used the Gentoo Handbook

Any help or pointers on where to check to rule out a software issue are appreciated!

Thanks in advance!

Regards,

Stamper

kazdva · n00b Joined: 14 Mar 2016 Posts: 26

Try flushing the journal of that partition with the e2fsck -fy /dev/sda3 command.

Stamper · n00b Joined: 20 Apr 2016 Posts: 5

Thanks for your reply!

Sadly, that would just fix it once and the problem would reappear after a random amount of time.

After some deep googling, I found a post similar to mine: http://www.eightforums.com/drivers-hardware/66257-possible-ssd-failure-4.html
It seems it might be a hardware issue with the SATA port. We have seen that it can take anywhere from 2 hours to 7 days for the issue to appear.
We tried Ubuntu and Windows on one of the systems and couldn't reproduce the issue with 1.5 days of uptime. We were, however, able to reproduce the issue on a completely different, older hard drive, one that we know has functioned properly for years.

Also, the errors themselves seem to be random as well. These are some of the others we have gotten in the meantime:

EXT4-fs (sda3): Delayed block allocation failed for inode <points to /var/log/syslog> at logical offset 139 with max blocks 3 with error 30
EXT4-fs error (device sda3): ext4_journal_check_start:56: Detected aborted journal
EXT4-fs (sda3): Remounting filesystem read-only
EXT4-fs error (device sda3) in ext4_reserve_inode_write:4980: Journal has aborted
EXT4-fs error (device sda3) in ext4_reserve_inode_write:4980: Journal has aborted
EXT4-fs error (device sda4) in __ext4_new_inode:849: Journal has aborted
EXT4-fs error (device sda4) in ext4_create:2538: Journal has aborted
EXT4-fs error (device sda3) in ext4_orphan_add:2908: Journal has aborted
EXT4-fs error (device sda3) mpage_map_and_submit_extent:2248: comm kworker/u24:1: Failed to mark inode <points to /var/log/syslog> dirty
EXT4-fs error (device sda3) in ext4_writepages:2539: Journal has aborted
EXT4-fs error (device sda3) in ext4_dirty_inode:5105: Journal has aborted
EXT4-fs error (device sda3): ext4_journal_check_start:56:
EXT4-fs (sda3): Remounting filesystem read-only
EXT4-fs error (device sda3) in ext4_reserve_inode_write:4980: Journal has aborted
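To see at a glance which devices and code paths are affected, output like the above could be tallied with a short script. This is only a sketch: the regular expression is an assumption fitted to the lines quoted in this thread, not a complete parser for kernel messages, and the function name is made up.

```python
import re
from collections import Counter

# Matches the three shapes seen in this thread:
#   "EXT4-fs error (device sda3): func:line: ..."
#   "EXT4-fs error (device sda3) in func:line: ..."
#   "EXT4-fs error (device sda3) func:line: ..."
EXT4_ERROR = re.compile(
    r"EXT4-fs error \(device (?P<dev>\w+)\)(?::| in)? (?P<func>\w+):(?P<line>\d+)",
    re.IGNORECASE,
)


def tally_ext4_errors(lines):
    """Count EXT4-fs error lines per (device, kernel function) pair."""
    counts = Counter()
    for line in lines:
        match = EXT4_ERROR.search(line)
        if match:
            counts[(match.group("dev"), match.group("func"))] += 1
    return counts


if __name__ == "__main__":
    # A few lines copied from the output above.
    sample = [
        "EXT4-fs error (device sda3): ext4_journal_check_start:56: Detected aborted journal",
        "EXT4-fs (sda3): Remounting filesystem read-only",
        "EXT4-fs error (device sda3) in ext4_reserve_inode_write:4980: Journal has aborted",
        "EXT4-fs error (device sda4) in __ext4_new_inode:849: Journal has aborted",
    ]
    for (dev, func), count in sorted(tally_ext4_errors(sample).items()):
        print(dev, func, count)
```

Lines such as "Remounting filesystem read-only" are deliberately skipped, since they are a consequence rather than an error site.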

To me, these errors sound as if the connection to the disk is lost.

We also tried to reproduce the issue with onerror=continue added to the fstab entry for sda3. This results in a crashed system with no error at all, even when ext4 is used.
Also, the issue has been reproduced on both systems. This means it is not limited to a single machine anymore.

Could anyone give me a sanity check based on this information?

krinn · Watchman Joined: 02 May 2003 Posts: 7470

This reminds me that, guess what, the 850 PRO 512GB is on the list.
While I think it was fixed, it would explain your issue exactly: your SSD might need a firmware update or something, which is why moving it to another computer doesn't fix the issue.
http://techreport.com/news/28473/some-samsung-ssds-may-suffer-from-a-buggy-trim-implementation

Stamper · n00b Joined: 20 Apr 2016 Posts: 5

Hi Krinn!

Thanks for your reply and the source! I hadn't found that one yet, but the bug seems to be limited to RAID 0 and 10 (we aren't running any RAID configuration), and it has already been fixed: http://techreport.com/news/28724/samsung-docs-detail-linux-trim-bug-and-fix

Also, as said in my earlier post, we have been able to reproduce the issue on a completely different, older HDD (Western Digital 750GB Black), so we have already excluded the SSDs as the cause.

We have also tried a different SATA port that is wired to a different SATA controller, but that didn't fix anything either. This was suggested in the forum thread I linked in my previous post.

We are now performing a long-term test using Ubuntu to see if we can reproduce the issue on another OS/kernel. It would be odd if we couldn't reproduce it on Ubuntu, as the core is the same as Gentoo's except for a different kernel version.

Again, any tips or help is appreciated! So feel free to pitch in any ideas!

P.S. We received a completely new error on the different SATA port crash:

EXT4-fs error (device sda4) in add_dirent_to_buf:1951: Journal has aborted
EXT4-fs error (device sda3): ext4_journal_check_start:56: Detected aborted journal
EXT4-fs (sda3): Remounting filesystem read-only
EXT4-fs error (device sda4): ext4_journal_check_start:56: Detected aborted journal
EXT4-fs (sda4): Remounting filesystem read-only
EXT4-fs error (device sda4) in ext4_evict_inode:240: Journal has aborted

Stamper · n00b Joined: 20 Apr 2016 Posts: 5

Hi again everyone!

So far, the Ubuntu test has been running clean for about 5 days. So it seems the issue might be configuration- or software-related. My guess is that it might be power-management-related, so I had a look into that.

I checked the link power management policies and they are all set to max_performance (a.k.a. disabled).
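The policies are exposed through sysfs, so a check like this can be scripted. Below is a minimal sketch in Python, assuming the usual /sys/class/scsi_host/host*/link_power_management_policy files are present on the system. The function name and the sysfs_root parameter are made up for illustration, and this is not the exact command used for the check described here.

```python
import glob
import os


def read_lpm_policies(sysfs_root="/sys/class/scsi_host"):
    """Return {host name: policy string} for every host that exposes a policy."""
    policies = {}
    pattern = os.path.join(sysfs_root, "host*", "link_power_management_policy")
    for path in sorted(glob.glob(pattern)):
        host = os.path.basename(os.path.dirname(path))
        with open(path) as handle:
            policies[host] = handle.read().strip()
    return policies


if __name__ == "__main__":
    for host, policy in read_lpm_policies().items():
        # max_performance means link power management is effectively off
        print(f"{host}: {policy}")
```

Hosts that do not expose the file (for example non-AHCI controllers) are simply left out of the result.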

Stamper · n00b Joined: 20 Apr 2016 Posts: 5

It has been quite a journey but we think we found the issue!

After installing Ubuntu for the long-term test, we had some stability issues. The watchdog logged soft CPU core lock-ups, which were caused by the Nouveau driver. We replaced it with the Nvidia proprietary driver, which resulted in a stable system. It ran for 9 days without issues.
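Soft lock-ups of this kind show up in the kernel log, so a saved log can be scanned for them. Here is a minimal sketch in Python, assuming the usual "BUG: soft lockup - CPU#N stuck for Ns! [comm:pid]" message format. The function name and the regular expression are illustrative, and the exact wording may differ between kernel versions.

```python
import re

# Assumed message shape: "BUG: soft lockup - CPU#3 stuck for 22s! [name:pid]"
SOFT_LOCKUP = re.compile(
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>[^\]]+):(?P<pid>\d+)\]"
)


def find_soft_lockups(lines):
    """Return (cpu, seconds, process name, pid) for each soft lockup line."""
    hits = []
    for line in lines:
        match = SOFT_LOCKUP.search(line)
        if match:
            hits.append(
                (
                    int(match.group("cpu")),
                    int(match.group("secs")),
                    match.group("comm"),
                    int(match.group("pid")),
                )
            )
    return hits


# Hypothetical usage against a saved kernel log (the path is an example only):
# with open("/var/log/kern.log", errors="replace") as log:
#     for hit in find_soft_lockups(log):
#         print(hit)
```

The message only names the process that was stuck; finding the responsible driver still means reading the call trace that follows it in the log.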

So we got the idea to replace the Nouveau driver with the Nvidia proprietary driver on the Gentoo system as well. This has resulted in two stable systems for 1.5 days now! We have to wait to see if they stay stable with long-term usage, but we are optimistic! One system had never reached the 7-hour mark before, and it went past it without issue with this 'fix'.
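To confirm which of the two drivers a running system actually has loaded, the module list can be checked. Here is a minimal sketch in Python, assuming the drivers are built as modules named nouveau and nvidia and therefore appear in /proc/modules. The function names are made up for illustration.

```python
def loaded_modules(proc_modules="/proc/modules"):
    """Return the set of module names listed in a /proc/modules-style file."""
    names = set()
    with open(proc_modules) as handle:
        for line in handle:
            fields = line.split()
            if fields:
                names.add(fields[0])  # first column is the module name
    return names


def gpu_drivers_loaded(modules):
    """Pick out the two Nvidia-related driver modules discussed here."""
    return sorted(modules & {"nouveau", "nvidia"})


if __name__ == "__main__":
    print(gpu_drivers_loaded(loaded_modules()))
```

A driver compiled directly into the kernel will not be listed there, so an empty result does not prove that Nouveau is absent.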

Hopefully this info helps others!

Thanks to everyone for their time and effort to help create a solution!