Gentoo Forums
[SOLVED!]Ext-4 Data Corruption Bug Hits Stable Linux Kernels
BitJam
Advocate

Joined: 12 Aug 2003
Posts: 2508
Location: Silver City, NM

Posted: Wed Oct 24, 2012 5:00 pm    Post subject: [SOLVED!]Ext-4 Data Corruption Bug Hits Stable Linux Kernels

link
Quote:
As a warning for those who are normally quick to upgrade to the latest stable vanilla kernel releases, a serious EXT4 data corruption bug worked its way into the stable Linux 3.4, 3.5, and 3.6 kernel series.

Forum member szczerb posted this news in a thread but I think it deserves a thread of its own.

TL;DR: In recent kernels ext-4 journal playback can in some cases bork your file system.

Edit: fixed


Last edited by BitJam on Wed Oct 31, 2012 6:18 pm; edited 1 time in total

khayyam
Watchman

Joined: 07 Jun 2012
Posts: 6227
Location: Room 101

Posted: Wed Oct 24, 2012 6:05 pm

BitJam ...

Note that you'll only get hit if the journal hasn't been wrapped, so give the journal something to work on or don't reboot so often ;) ... hehe.

I applied Ted's patch to 3.6.3 earlier today but haven't rebooted yet. Anyhow, as I've been running an affected kernel (3.6.2) for a week or so without issues, I'm not inclined to panic.

best ... khay

e3k
Guru

Joined: 01 Oct 2007
Posts: 513
Location: Inner Space

Posted: Wed Oct 24, 2012 8:55 pm    Post subject: linux-3.5.7-gentoo is my pc affected?

I didn't follow the Phoronix article: which 3.5 kernels are not affected?
_________________

Flux & Contemplation - Portrait of an Artist in Isolation


szczerb
Veteran

Joined: 24 Feb 2007
Posts: 1709
Location: Poland => Lodz

Posted: Wed Oct 24, 2012 9:17 pm

This comment https://bugs.gentoo.org/show_bug.cgi?id=439502#c0 seems to suggest that 3.5.x < 3.5.7 should be safe. I just booted the 3.5.7 at work yesterday so I'm waiting things out without rebooting or shutting down for now. Patches seem to be flowing around fast.

EDIT: BitJam, you're right - I should've made it a separate thread. I was rather swarmed at work, so didn't think of it.
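
The rule of thumb quoted from the bug comment above (3.5.x releases below 3.5.7 should be safe) can be sketched as a small check. This is only an illustration: the helper names `parse_release` and `is_3_5_older_than_3_5_7` are made up for the example, and the check deliberately says nothing about the 3.4 or 3.6 series, because the thread gives no cut-off for those.

```python
import re


def parse_release(release):
    """Return (major, minor, patch) from a string such as '3.5.7-gentoo'."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    major, minor, patch = match.groups()
    # A release like '3.6' has no patch level; treat it as patch 0.
    return int(major), int(minor), int(patch or 0)


def is_3_5_older_than_3_5_7(release):
    """True only for 3.5.x releases older than 3.5.7.

    Assumption: the cut-off comes from the bug comment linked above.
    A False result does NOT mean a kernel is affected; it only means
    this particular rule of thumb does not cover it.
    """
    major, minor, patch = parse_release(release)
    return (major, minor) == (3, 5) and patch < 7


print(is_3_5_older_than_3_5_7("3.5.4-hardened-r1"))
print(is_3_5_older_than_3_5_7("3.5.7-gentoo"))
```

On a running system the string to test would typically be the output of `uname -r` (or `platform.release()` in Python).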

Hu
Moderator

Joined: 06 Mar 2007
Posts: 21518

Posted: Wed Oct 24, 2012 10:23 pm

Linux Weekly News has a free to read article following this. The situation is evolving. Ted now believes that journal wrapping may not be involved. Additionally, nix has now stated that the affected system has some rather unusual shutdown behavior that may cause it to halt without all filesystems finishing their unmount. If that is what happened to him and if the corruption occurs only on multiple journal replays, then standard systems that gracefully unmount (or remount readonly) all their filesystems are at much lower risk than suggested by early reports. However, those are substantial qualifiers and there is insufficient evidence to determine whether they are met in the reported cases.

jimmij
Tux's lil' helper

Joined: 02 Dec 2008
Posts: 139

Posted: Thu Oct 25, 2012 6:33 am

Can someone advise me on the safest way to switch/downgrade to 3.3.8, given that I've been running 3.5.7 on my system for 2 days now without rebooting? If I understood correctly, the problem mostly appears during reboot, so maybe it is better to leave the system running and not downgrade at all until the next reboot?
_________________
Vanitas vanitatum et omnia vanitas.
Libera temet ex inferis.

szczerb
Veteran

Joined: 24 Feb 2007
Posts: 1709
Location: Poland => Lodz

Posted: Thu Oct 25, 2012 9:41 am

jimmij wrote:
Can someone advise me on the safest way to switch/downgrade to 3.3.8, given that I've been running 3.5.7 on my system for 2 days now without rebooting? If I understood correctly, the problem mostly appears during reboot, so maybe it is better to leave the system running and not downgrade at all until the next reboot?
I'm doing just that - waiting with my system on.

depontius
Advocate

Joined: 05 May 2004
Posts: 3509

Posted: Thu Oct 25, 2012 12:48 pm

So the problem appears to be "failing to wrap the journal" before rebooting. How much filesystem activity does it take to "wrap the journal"? Simply waiting may not do the trick, it sounds as if you really need to generate some filesystem activity. Most likely, "emerge --sync" would do the trick, but obviously not more often than once daily. Since we're talking about the journal, likely reads wouldn't do spit - it takes writes or updates. Any idea how big/many writes?
_________________
.sigs waste space and bandwidth

NoDataFound
n00b

Joined: 01 Aug 2011
Posts: 34

Posted: Thu Oct 25, 2012 2:14 pm

I'd like to know what kind of corruption it produces.
Having a bug is bad in itself, although not the end of the world, but it's better if it's recoverable...

khayyam
Watchman

Joined: 07 Jun 2012
Posts: 6227
Location: Room 101

Posted: Thu Oct 25, 2012 3:20 pm

depontius wrote:
So the problem appears to be "failing to wrap the journal" before rebooting. How much filesystem activity does it take to "wrap the journal"? Simply waiting may not do the trick, it sounds as if you really need to generate some filesystem activity. Most likely, "emerge --sync" would do the trick, but obviously not more often than once daily. Since we're talking about the journal, likely reads wouldn't do spit - it takes writes or updates. Any idea how big/many writes?

depontius ... the situation seems to have moved on (as Hu noted above); it's no longer thought to be related to wrapping.

Note: "Update: It now looks like the reproduction involved something very esoteric indeed, involving using umount -l and shutdowns while the file system was still being unmounted --- and the user had nobarrier specified in the mount options as well." Ted Ts'o

So, I don't think there is much reason to panic. If this weren't a corner case there would be hundreds of reports of data loss, and the actual reported cases so far are few.

For the paranoid there is the option to keep the machine up until a patch or update is provided, or a 'while :; do emerge -e @world ; done' for the truly insane :) ... but for the rest of us it's best not to blow this out of proportion.

best ... khay
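
For anyone wondering whether their own setup resembles the one in Ted Ts'o's update, here is a minimal sketch that looks for ext4 mounts with barriers disabled. It assumes the input is text in the format of /proc/mounts, and it treats both `nobarrier` and `barrier=0` as "barriers off". The function name and the sample data are made up for illustration, and the `umount -l` half of the scenario cannot be detected this way at all.

```python
def ext4_mounts_without_barriers(mounts_text):
    """Return (mount_point, matching_options) for ext4 entries with barriers off.

    mounts_text is expected to look like /proc/mounts:
    device, mount point, filesystem type, options, dump, pass.
    """
    flagged = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[2] != "ext4":
            continue
        options = fields[3].split(",")
        hits = [opt for opt in options if opt in ("nobarrier", "barrier=0")]
        if hits:
            flagged.append((fields[1], hits))
    return flagged


# Hypothetical sample; on a real system read /proc/mounts instead.
SAMPLE_MOUNTS = """\
/dev/sda1 / ext4 rw,relatime,data=ordered 0 0
/dev/sda3 /home ext4 rw,noatime,nobarrier,data=ordered 0 0
server:/export /mnt/net nfs rw,relatime 0 0
"""

print(ext4_mounts_without_barriers(SAMPLE_MOUNTS))
```

To check a live system, something like `ext4_mounts_without_barriers(open("/proc/mounts").read())` should do; an empty list means no ext4 filesystem was found with those options.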

leifbk
Guru

Joined: 05 Jan 2004
Posts: 415
Location: Bærum, Norway

Posted: Thu Oct 25, 2012 4:18 pm

khayyam wrote:

For the paranoid there is the option to keep the machine up until a patch or update is provided, or a 'while :; do emerge -e @world ; done' for the truly insane :) ... but for the rest of us it's best not to blow this out of proportion.

best ... khay


I don't think that will help for other filesystems than the one /var/tmp is mounted on. What about /home for instance?

I for one am still running good ol' ext3, and will keep my newly compiled 3.5.7 kernel until a new one hits stable.
_________________
Grumpy old man

bandreabis
Advocate

Joined: 18 Feb 2005
Posts: 2489
Location: In Lodi, Italy

Posted: Thu Oct 25, 2012 4:39 pm

I can't see any visible difference between 3.3.8 and 3.5.7 (freshly compiled) so I remain with the "not hard masked" one.

khayyam
Watchman

Joined: 07 Jun 2012
Posts: 6227
Location: Room 101

Posted: Thu Oct 25, 2012 6:02 pm

leifbk wrote:
khayyam wrote:
For the paranoid there is the option to keep the machine up until a patch or update is provided, or a 'while :; do emerge -e @world ; done' for the truly insane :) ... but for the rest of us it's best not to blow this out of proportion.

I don't think that will help for other filesystems than the one /var/tmp is mounted on. What about /home for instance?

leifbk ... certainly keeping the machine up would be enough to not encounter the problem, but as for the jocularity ... it wouldn't help at all *regardless* of what filesystems were modified (note that I pointed out that the current understanding is that it's not caused by 'wrapping'), but there is no telling the "truly insane" that. So, was that attempt at humour missed?

leifbk wrote:
I for one am still running good ol' ext3, and will keep my newly compiled 3.5.7 kernel until a new one hits stable.

A serious bug in the Linux kernel has caused users to believe that there is a serious bug in the Linux kernel. In a post made to the LKML, Linus Torvalds stated "we're not really sure if this is a bug or not, but we can assure everyone we're reading all of the hullabaloo on Slashdot and we'll know more as and when news hits critical mass". The bug, code named "worse than Y2K, Stuxnet, and Windows 98 combined (WTY2KSTUXNET&W98)", is thought to affect at least three users, and more than ten million blogs and news sites. Users, who until recently had thought that the designation "stable" was an acronym for "no need for backups any mo", are lining up to throw themselves under the wheels of this runaway train. As one commentator noted, "it's worse than Fukushima Daiichi and that other thing ... didn't you read my blog post?" :)

best of the bwaaaaa ... khay

John R. Graham
Administrator

Joined: 08 Mar 2005
Posts: 10587
Location: Somewhere over Atlanta, Georgia

Posted: Thu Oct 25, 2012 6:33 pm

Goodness. That's more sarcastic than me!

- John
_________________
I can confirm that I have received between 0 and 499 National Security Letters.

khayyam
Watchman

Joined: 07 Jun 2012
Posts: 6227
Location: Room 101

Posted: Thu Oct 25, 2012 7:07 pm

John R. Graham wrote:
Goodness. That's more sarcastic than me!

John ... the intention was to deflate the rise in panic with some humor. It seems that this serious bug, though no doubt an annoyance to those hit, is most likely a corner case, and so all the "hullabaloo" needs to step down a gear or three. It's already been said that this is reflecting badly on ext4, and some of the reporting has been out of proportion to the actual severity, so I guess my sarcasm reflects this.

best ... khay

John R. Graham
Administrator

Joined: 08 Mar 2005
Posts: 10587
Location: Somewhere over Atlanta, Georgia

Posted: Thu Oct 25, 2012 7:24 pm

Never explain sarcasm; it just ruins it. ;)

- John
_________________
I can confirm that I have received between 0 and 499 National Security Letters.

energyman76b
Advocate

Joined: 26 Mar 2003
Posts: 2048
Location: Germany

Posted: Thu Oct 25, 2012 9:21 pm

short: don't do anything stupid and you won't hit the bug.

It is really that simple. Phoronix, in the meantime, is working hard to earn that Moronix moniker.
_________________
Study finds stunning lack of racial, gender, and economic diversity among middle-class white males

I identify as a dirty penismensch.

Jaglover
Watchman

Joined: 29 May 2005
Posts: 8291
Location: Saint Amant, Acadiana

Posted: Thu Oct 25, 2012 9:25 pm

John R. Graham wrote:
Never explain sarcasm; it just ruins it. ;)

- John


:lol: +1

BTW, I'm not the fourth user hit by this.
_________________
My Gentoo installation notes.
Please learn how to denote units correctly!
Back to top
View user's profile Send private message
leifbk
Guru
Guru


Joined: 05 Jan 2004
Posts: 415
Location: Bærum, Norway

PostPosted: Thu Oct 25, 2012 9:34 pm    Post subject: Reply with quote

khayyam wrote:
leifbk wrote:
khayyam wrote:
For the paranoid there is the option to keep the machine up until a patch or update is provided, or a 'while :; do emerge -e @world ; done' for the truly insane :) ... but for the rest of us it's best not to blow this out of proportion.

I don't think that will help for other filesystems than the one /var/tmp is mounted on. What about /home for instance?

leifbk ... certainly keeping the machine up would be enough to not encounter the problem, but as for the jocularity ... it wouldn't help at all *regardless* of what filesystems were modified (note that I pointed out that the current understanding is that it's not caused by 'wrapping'), but there is no telling the "truly insane" that. So, was that attempt at humour missed?


Not quite, but I got carried away by the implications. BTW, you came up with an excellent method for converting a Gentoo box into a fan heater, but you forgot the --keep-going option. It's getting cold here in Norway now.
_________________
Grumpy old man

Jaglover
Watchman

Joined: 29 May 2005
Posts: 8291
Location: Saint Amant, Acadiana

Posted: Thu Oct 25, 2012 10:27 pm

... I have three boxes always on ... is that why my central AC unit just kicked in? Or maybe it's because it's 30C out there? Patting myself on the back for moving from Nordic to Tropic. :P
_________________
My Gentoo installation notes.
Please learn how to denote units correctly!

khayyam
Watchman

Joined: 07 Jun 2012
Posts: 6227
Location: Room 101

Posted: Fri Oct 26, 2012 1:25 am

leifbk wrote:
BTW, you came up with an excellent method for converting a Gentoo box into a fan heater, but you forgot the --keep-going option. It's getting cold here in Norway now.

leifbk ... what? and miss the opportunity to discover another bug, no ... and why not do as we more southern europeans do and throw another IKEA poang rocking chair with korndal brown cushion on the fire ... or is that too Swedish for Norwegian sensibilities?

best ... khay

anyNiXwilldo
Apprentice

Joined: 20 Feb 2004
Posts: 176
Location: US

Posted: Fri Oct 26, 2012 2:26 am

Well I hadn't rebooted in several days, but I noticed this morning 3.6.2 was masked. I knew why, from yesterday's articles. The info making the rounds today was saying it's a rather esoteric (hard to reproduce) bug, which probably meant I had nothing to worry about. However, given I run almost 100% stable, except for things like qpdfview, nomacs and the kernel, I felt it best to back the kernel back down to stable from ~amd64. I umounted my data partition after building 3.5.4-hardened-r1-gnu, prior to rebooting with that kernel. Everything seems to be fine. I just know I don't have the nerves to deal with these newer kernels and whatever very scary bugs they might have.
_________________
Of course you can have my root password. I'm on Hardened!

platojones
Veteran

Joined: 23 Oct 2002
Posts: 1602
Location: Just over the horizon

Posted: Fri Oct 26, 2012 2:30 am

Reading the latest updates in the thread below, and considering that this has shown up on only 2 machines that anybody knows of so far (nobody has been able to independently reproduce it yet), I'd say it's looking very anti-climactic:

http://thread.gmane.org/gmane.linux.kernel/1379725/focus=1381772

leifbk
Guru

Joined: 05 Jan 2004
Posts: 415
Location: Bærum, Norway

Posted: Fri Oct 26, 2012 5:22 am

khayyam wrote:
... and why not do as we more southern europeans do and throw another IKEA poang rocking chair with korndal brown cushion on the fire ... or is that too Swedish for Norwegian sensibilities?


We love to burn cheap Swedish furniture :D

We still haven't forgiven the Swedes for Karl XII, who was shot through the head during his Norwegian campaign in 1718. Nobody knows for certain if the bullet was Norwegian or Swedish, but we love to claim the credit. This tends to make the Swedes irate.
_________________
Grumpy old man

ulenrich
Veteran

Joined: 10 Oct 2010
Posts: 1480

Posted: Fri Oct 26, 2012 11:44 am

platojones wrote:
Reading the latest updates in the thread below, and considering that this has shown up on only 2 machines that anybody knows of so far (nobody has been able to independently reproduce it yet),

Yes, but the cause was the setup, not the hardware:
cascading mounts of mixed ext4 and network devices, where the machine was deliberately configured to reboot very fast: a lazy "umount -l" was used so as not to wait for the network devices. And a local ext4 partition was mounted on top of a network mount??? And the "nobarrier" mount option??

In this special case _and_ if, additionally, a crash forced a reboot, then:
there was data loss after the second reboot!
A clean bit was set when there hadn't been a journal cleanup yet (writeback?). A workaround for this setup would have been forcefsck on the boot cmdline. This would have played to the strengths of a journaled filesystem: the missing data would have been written back. But with the additional forcefsck the system would no longer boot quickly ...

:) not a very commonly used setup :)
This is why Greg Kroah-Hartman isn't rushing out a release to fix the issue. At first all of us who follow the stable patchlevel releases felt a panic attack :( because we knew there had been an ext4 feature backport for linux-3.6.2. But the jbd2 patch which apparently caused the data loss would have been applied anyway: it was thought to be a fix. Greg Kroah-Hartman should serialize such feature backports to reduce our psycho panics.

[edit]Don't take the last sentence as a serious suggestion, but as a tool to self audit (for me at least).
[edit2]Because Greg does it already when possible.
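
The two pieces of state mentioned above (the "clean" bit and a journal that still needs replaying) can both be read from the superblock summary that `dumpe2fs -h` prints. Below is a rough sketch that parses such text. It assumes the usual "Filesystem state" and "Filesystem features" lines from e2fsprogs; the function name and the sample header are invented for illustration. Note that `needs_recovery` is normally set while a journaled filesystem is mounted read-write, so this is only meaningful for an unmounted device.

```python
def summarize_dumpe2fs_header(header_text):
    """Pull the state and the needs_recovery feature out of `dumpe2fs -h` text."""
    state = None
    features = []
    for line in header_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a "label: value" line
        key = key.strip()
        if key == "Filesystem state":
            state = value.strip()
        elif key == "Filesystem features":
            features = value.split()
    return {"state": state, "needs_recovery": "needs_recovery" in features}


# Hypothetical, shortened header; real output has many more fields.
SAMPLE_HEADER = """\
Filesystem volume name:   <none>
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype needs_recovery extent
Filesystem state:         clean
"""

print(summarize_dumpe2fs_header(SAMPLE_HEADER))
```

A result with state "clean" together with needs_recovery on an unmounted filesystem would at least be worth a forced fsck before the next mount; whether it matches the reported bug is a separate question this sketch cannot answer.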


Last edited by ulenrich on Fri Oct 26, 2012 4:19 pm; edited 2 times in total

All times are GMT