Gentoo Forums
[Solved] Journal aborted after random amount of time
Stamper
n00b


Joined: 20 Apr 2016
Posts: 5

Posted: Wed Apr 20, 2016 9:24 am    Post subject: [Solved] Journal aborted after random amount of time

Hi everyone!

I have been using Gentoo for a couple of months now, so I know the basics, but now I am treading advanced waters where I don't yet know how to swim :P.
Hopefully some of you guys and girls might be willing and able to help shed some light on this issue.

The Issue / Symptoms
After a fresh install, we receive a "Journal aborted" error on an ext4-formatted Samsung 850 Pro 512GB SSD. The error appears an arbitrary amount of time after login. Once it does, the system does not detect newly plugged-in USB keyboards or mice, does not respond to SSH connection requests, and does not respond to input from an already-plugged-in USB keyboard.

The literal error:
[<timestamp>] EXT4-fs error (Device sda3): ext4_journal_check_start:56: Detected aborted journal
[<timestamp>] EXT4-fs (sda3): Remounting filesystem read-only

sda3 is our root partition, so it seems this puts the OS into a completely unrecoverable, non-responsive state. I looked through the various logs in /var/log, but they contain no new entries from around the time of the symptoms.
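
A root filesystem that has been remounted read-only can no longer accept writes to /var/log, so the kernel ring buffer (or a saved copy of the console output) is the more likely place to find these messages. A minimal sketch for filtering such text, assuming the function name storage_errors and the pattern list, both of which are only illustrative:

```shell
#!/bin/sh
# Filter kernel log text (read from stdin) down to lines that hint at
# filesystem or block-layer trouble. The pattern list is an assumption
# made for illustration, not a complete catalogue of kernel messages.
storage_errors() {
    grep -E 'EXT4-fs|I/O error'
}

# Possible use while the machine still responds (reading the ring
# buffer may require root):
#   dmesg | storage_errors
```

The same filter can be run over a text capture of the console if the machine has already stopped responding.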


My thoughts
First of all, I was completely stumped when I first experienced this error! We are still checking whether it could be a hardware-related issue. I wanted to start this thread to see if anyone knows what is going on here, and whether it might be a software- or driver-related issue.

We have two identical desktops/servers. The issue started on only one of them, about 1.5 hours after install. The other one had not shown the error (yet), but after swapping the SSDs between the servers we experienced the error on the "supposedly okay" server. However, if that was due to an already corrupt filesystem on the swapped SSD, then the supposedly okay server may still be okay.

At the moment we are using an image we pre-built to reinstall both servers when we think it is necessary during the debug process.

What we have tried
- We tried a completely new install.
- We ran e2fsck on our pre-built image, which came back clean.
- We are trying an Ubuntu install to see if it produces the same errors.
- We have tried ext3, but no luck: it shows exactly the same symptoms, except that no error appears on the console.
- We have run a series of hardware tests (partially under Windows): HDTune surface scan (slow), 10 passes of Memtest86, Furmark, 1 hour of Prime95, and writing 12GB of data to the SSDs and comparing the checksums before and after.
- We are trying to deduce if it is a hardware-related issue by swapping hardware. This is still ongoing.

The hardware
HDD: Samsung 850 Pro 512GB SSD
Mobo: Gigabyte X99-UD3 LGA 2011-3
Proc: Intel i7 5820K
Memory: 2x8GB DDR4 Corsair 2133MHz Vengeance LPX

The software
- Gentoo amd64
- Installed from the minimal installation media + Ubuntu Linux (needed Ubuntu for the UEFI GRUB2 install)
- Installed Stage3 from 31-03-2016
- Installed using genkernel
- 4 partitions: 1{200MiB /boot FAT32 ESP} 2{16GiB swap} 3{100GiB / ext4} 4{400GiB /data ext4}
- Used the Gentoo Handbook

Any help or pointers on where to look to rule out a software issue are appreciated!

Thanks in advance!

Regards,

Stamper


Last edited by Stamper on Wed May 11, 2016 6:08 pm; edited 1 time in total
kazdva
n00b


Joined: 14 Mar 2016
Posts: 26

Posted: Sat Apr 23, 2016 8:46 pm

Try to flush the journal of that partition with e2fsck -fy /dev/sda3 command.
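
Note that e2fsck should not be run on a filesystem that is mounted read-write, and sda3 is the root partition here, so the check would normally be done from rescue or live media. A minimal sketch of a guard around the command, assuming the helper names is_mounted and safe_fsck (both hypothetical); on some systems the root device is listed under a different name such as /dev/root, which this simple comparison would miss:

```shell
#!/bin/sh
# Succeed if the given device appears as a mount source in the mounts
# table. The second argument defaults to /proc/mounts and can be
# overridden only so the check can be tried against a sample file.
is_mounted() {
    awk -v dev="$1" '$1 == dev { found = 1 } END { exit !found }' "${2:-/proc/mounts}"
}

# Refuse to check a filesystem that is still mounted; otherwise run the
# forced, auto-answering check suggested above.
safe_fsck() {
    if is_mounted "$1"; then
        echo "$1 is mounted; unmount it or boot rescue media first" >&2
        return 1
    fi
    e2fsck -fy "$1"
}
```

From rescue media this would be called as safe_fsck /dev/sda3.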
Stamper
n00b


Joined: 20 Apr 2016
Posts: 5

Posted: Tue Apr 26, 2016 12:14 pm

Thanks for your reply!

Sadly, that would just fix it once and the problem would reappear after a random amount of time.

After some deep googling, I found a post similar to mine: http://www.eightforums.com/drivers-hardware/66257-possible-ssd-failure-4.html
It seems it might be a hardware issue with the SATA port. We have seen that it can take anywhere from 2 hours to 7 days for the issue to appear.
We tried Ubuntu and Windows on one of the systems and couldn't reproduce the issue within 1.5 days of uptime. We were, however, able to reproduce it on a completely different, older hard drive, one we know has functioned properly for years.

Also, the errors themselves seem to be random. These are some of the others we have gotten in the meantime:

  • EXT4-fs (sda3): Delayed block allocation failed for inode <points to /var/log/syslog> at logical offset 139 with max blocks 3 with error 30
    EXT4-fs error (device sda3): ext4_journal_check_start:56: Detected aborted journal
    EXT4-fs (sda3): Remounting filesystem read-only

  • EXT4-fs error (device sda3) in ext4_reserve_inode_write:4980: Journal has aborted
    EXT4-fs error (device sda3) in ext4_reserve_inode_write:4980: Journal has aborted
    EXT4-fs error (device sda4) in __ext4_new_inode:849: Journal has aborted
    EXT4-fs error (device sda4) in ext4_create:2538: Journal has aborted
    EXT4-fs error (device sda3) in ext4_orphan_add:2908: Journal has aborted
    EXT4-fs error (device sda3) mpage_map_and_submit_extent:2248: comm kworker/u24:1: Failed to mark inode <points to /var/log/syslog> dirty
    EXT4-fs error (device sda3) in ext4_writepages:2539: Journal has aborted
    EXT4-fs error (device sda3) in ext4_dirty_inode:5105: Journal has aborted
    EXT4-fs error (device sda3): ext4_journal_check_start:56:
    EXT4-fs (sda3): Remounting filesystem read-only
    EXT4-fs error (device sda3) in ext4_reserve_inode_write:4980: Journal has aborted

To me, these errors sound as if the connection to the disk is lost.

We also tried to reproduce the issue with errors=continue added to the sda3 entry in fstab. This results in a crashed system with no error message at all, even when ext4 is used.
Also, the issue has now been reproduced on both systems, so it is no longer limited to a single machine.
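
For reference, ext4 selects its on-error behaviour through the errors= mount option, which accepts continue, remount-ro, or panic. A minimal sketch for reading back what a given fstab entry requests, assuming the function name fstab_errors_mode (hypothetical) and the usual whitespace-separated fstab layout:

```shell
#!/bin/sh
# Print the errors= value set in an fstab-style file for a mount point,
# or "default" when the option is absent (the filesystem then falls back
# to the behaviour stored in its superblock). The file argument defaults
# to /etc/fstab and is overridable for trying this on a sample file.
fstab_errors_mode() {
    awk -v mp="$1" '
        $1 !~ /^#/ && $2 == mp {
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++)
                if (opts[i] ~ /^errors=/) {
                    sub(/^errors=/, "", opts[i])
                    print opts[i]
                    found = 1
                }
        }
        END { if (!found) print "default" }
    ' "${2:-/etc/fstab}"
}
```

Calling fstab_errors_mode / would show whether the root entry actually carries the option that was intended.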

Could anyone give me a sanity check based on this information?
krinn
Watchman


Joined: 02 May 2003
Posts: 7470

Posted: Tue Apr 26, 2016 2:34 pm

This reminds me that, guess what, the 850 PRO 512GB is in the list.
While I think it was fixed, it would exactly explain your issue: your SSD might need a firmware update or something, which is why moving it to another computer doesn't fix the issue.
http://techreport.com/news/28473/some-samsung-ssds-may-suffer-from-a-buggy-trim-implementation
Stamper
n00b


Joined: 20 Apr 2016
Posts: 5

Posted: Thu Apr 28, 2016 7:36 am

Hi Krinn!

Thanks for your reply and the source! I hadn't found that one yet, but the bug seems to be limited to RAID 0 and 10 (we aren't running any RAID configuration), and it has already been fixed: http://techreport.com/news/28724/samsung-docs-detail-linux-trim-bug-and-fix

Also, as said in my earlier post, we have been able to reproduce the issue on a completely different, older HDD (Western Digital 750GB Black), so we have already excluded the SSDs as the cause.

We have also tried a different SATA port, which is attached to a different SATA controller, but that didn't fix anything either. This was suggested in the forum post I linked in my previous post.

We are now performing a long-term test using Ubuntu to see if we can reproduce the issue on another OS/kernel. It would be odd if we couldn't reproduce it on Ubuntu, as the core is the same as Gentoo's except for a different kernel version.

Again, any tips or help is appreciated! So feel free to pitch in any ideas!


P.S. We received a completely new error after the crash on the different SATA port:


  • EXT4-fs error (device sda4) in add_dirent_to_buf:1951: Journal has aborted
    EXT4-fs error (device sda3): ext4_journal_check_start:56: Detected aborted journal
    EXT4-fs (sda3): Remounting filesystem read-only
    EXT4-fs error (device sda4): ext4_journal_check_start:56: Detected aborted journal
    EXT4-fs (sda4): Remounting filesystem read-only
    EXT4-fs error (device sda4) in ext4_evict_inode:240: Journal has aborted
Stamper
n00b


Joined: 20 Apr 2016
Posts: 5

Posted: Mon May 02, 2016 12:31 pm

Hi again everyone!

So far, the Ubuntu test has been running clean for about 5 days. So it seems the issue might be configuration/software related. My guess is that it might be power management related, so I had a look into that.

I checked the link power management policies and they are all set to max_performance (a.k.a. disabled):
Code:

cat /sys/class/scsi_host/host*/link_power_management_policy


However, running it on both a Gentoo system and an Ubuntu system, I noticed that the Gentoo system has SCSI hosts 0 through 5 while Ubuntu only has hosts 0 through 3. They are, however, identical systems except for the OS. Might this cause any of my issues?
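
When comparing the two systems host by host, it helps to print each policy next to the host it belongs to, since a plain cat over the glob only prints the values. A minimal sketch, assuming the function name show_lpm (hypothetical); the optional directory argument exists only so it can be pointed at a test tree instead of /sys:

```shell
#!/bin/sh
# Print "hostN: policy" for every SCSI host that exposes a
# link_power_management_policy attribute. Hosts without a readable
# attribute are skipped silently.
show_lpm() {
    base="${1:-/sys/class/scsi_host}"
    for f in "$base"/host*/link_power_management_policy; do
        [ -r "$f" ] || continue
        host=$(basename "$(dirname "$f")")
        printf '%s: %s\n' "$host" "$(cat "$f")"
    done
}
```

Running show_lpm on both machines and diffing the output would make the extra hosts on the Gentoo side easy to spot.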

Also, trying the following results in an input/output error:
Code:

cat /sys/bus/scsi/devices/host*/power/autosuspend_delay_ms
cat /sys/class/scsi_host/host*/power/autosuspend_delay_ms

Should that happen?

I had a look at this page: https://wiki.archlinux.org/index.php/Power_management#SATA_Active_Link_Power_Management
It seems quite comprehensive, and the only option related to the issues I am having seems to be the ALPM setting. As that is disabled, are there any other settings I could check?

Thanks again!
Stamper
n00b


Joined: 20 Apr 2016
Posts: 5

Posted: Wed May 11, 2016 6:08 pm

It has been quite a journey but we think we found the issue!

After installing Ubuntu for the long-term test, we had some stability issues. The watchdog logged soft CPU core lockups, which were caused by the Nouveau driver. We replaced it with the Nvidia proprietary driver, which resulted in a stable system. It ran for 9 days without issues.

So we got the idea to replace the Nouveau driver with the Nvidia proprietary driver on the Gentoo system as well. This has resulted in two stable systems for 1.5 days now! We have to wait to see if it stays stable with long-term usage, but we are optimistic! One system had never reached the 7-hour mark before, and with this 'fix' it went past it without issue.
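
For anyone wanting to check which driver is actually bound to the graphics card, lspci -k lists a "Kernel driver in use" line under each device. A minimal sketch that pulls that line out for display controllers, assuming the function name gpu_driver (hypothetical) and the usual lspci -k layout, where device lines start with the bus address and detail lines are indented:

```shell
#!/bin/sh
# Read `lspci -k` output on stdin and print the kernel driver bound to
# each VGA or 3D controller. Device header lines start with a hex bus
# address; the indented lines that follow belong to that device.
gpu_driver() {
    awk '
        /^[0-9a-f]/ { in_gpu = /VGA compatible controller|3D controller/ }
        in_gpu && /Kernel driver in use:/ {
            sub(/.*Kernel driver in use: */, "")
            print
        }
    '
}

# Possible use:
#   lspci -k | gpu_driver
```

On the systems described here this would be expected to print nouveau before the switch and nvidia after it.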

Hopefully this info helps others!

Thanks to everyone for their time and effort to help create a solution!
All times are GMT