| Author |
Message |
BitJam Advocate

Joined: 12 Aug 2003 Posts: 2343 Location: Silver City, NM
|
Posted: Wed Oct 24, 2012 5:00 pm Post subject: [SOLVED!]Ext-4 Data Corruption Bug Hits Stable Linux Kernels |
|
|
link | Quote: | | As a warning for those who are normally quick to upgrade to the latest stable vanilla kernel releases, a serious EXT4 data corruption bug worked its way into the stable Linux 3.4, 3.5, and 3.6 kernel series. |
Forum member szczerb posted this news in a thread but I think it deserves a thread of its own.
TL;DR: In recent kernels ext-4 journal playback can in some cases bork your file system.
Edit: fixed
Last edited by BitJam on Wed Oct 31, 2012 6:18 pm; edited 1 time in total |
 |
khayyam Veteran


Joined: 07 Jun 2012 Posts: 1253
|
Posted: Wed Oct 24, 2012 6:05 pm Post subject: |
|
|
BitJam ...
Note that you'll only get hit if the journal hasn't been wrapped, so give the journal something to work on or don't reboot so often ;) ... hehe.
I applied Ted's patch to 3.6.3 earlier today but haven't rebooted as yet; anyhow, as I've been running an affected kernel (3.6.2) for a week or so without issues, I'm not inclined to panic.
best ... khay |
 |
e3k Apprentice


Joined: 01 Oct 2007 Posts: 151 Location: Slovakia
|
Posted: Wed Oct 24, 2012 8:55 pm Post subject: linux-3.5.7-gentoo is my pc affected? |
|
|
I didn't quite follow the Phoronix article: which 3.5 kernels are not affected? _________________ all meetings should be optional. |
 |
szczerb Veteran

Joined: 24 Feb 2007 Posts: 1626 Location: Poland => Lodz
|
Posted: Wed Oct 24, 2012 9:17 pm Post subject: |
|
|
This comment https://bugs.gentoo.org/show_bug.cgi?id=439502#c0 seems to suggest that 3.5.x < 3.5.7 should be safe. I just booted the 3.5.7 at work yesterday so I'm waiting things out without rebooting or shutting down for now. Patches seem to be flowing around fast.
EDIT: BitJam, you're right - I should've made it a separate thread. I was rather swarmed at work, so didn't think of it. |
 |
Hu Watchman

Joined: 06 Mar 2007 Posts: 7610
|
Posted: Wed Oct 24, 2012 10:23 pm Post subject: |
|
|
| Linux Weekly News has a free to read article following this. The situation is evolving. Ted now believes that journal wrapping may not be involved. Additionally, nix has now stated that the affected system has some rather unusual shutdown behavior that may cause it to halt without all filesystems finishing their unmount. If that is what happened to him and if the corruption occurs only on multiple journal replays, then standard systems that gracefully unmount (or remount readonly) all their filesystems are at much lower risk than suggested by early reports. However, those are substantial qualifiers and there is insufficient evidence to determine whether they are met in the reported cases. |
 |
jimmij n00b

Joined: 02 Dec 2008 Posts: 51
|
Posted: Thu Oct 25, 2012 6:33 am Post subject: |
|
|
Can someone advise me on the safest way to switch/downgrade to 3.3.8 if I've been running 3.5.7 on my system for 2 days now without rebooting? If I understood correctly, the problem mostly appears during reboot, so maybe it is better to leave the system running and not downgrade at all until the next reboot? _________________ jimmij |
 |
szczerb Veteran

Joined: 24 Feb 2007 Posts: 1626 Location: Poland => Lodz
|
Posted: Thu Oct 25, 2012 9:41 am Post subject: |
|
|
| jimmij wrote: | | Can someone advise me on the safest way to switch/downgrade to 3.3.8 if I've been running 3.5.7 on my system for 2 days now without rebooting? If I understood correctly, the problem mostly appears during reboot, so maybe it is better to leave the system running and not downgrade at all until the next reboot? | I'm doing just that - waiting with my system on. |
 |
depontius Advocate

Joined: 05 May 2004 Posts: 2156
|
Posted: Thu Oct 25, 2012 12:48 pm Post subject: |
|
|
So the problem appears to be "failing to wrap the journal" before rebooting. How much filesystem activity does it take to "wrap the journal"? Simply waiting may not do the trick, it sounds as if you really need to generate some filesystem activity. Most likely, "emerge --sync" would do the trick, but obviously not more often than once daily. Since we're talking about the journal, likely reads wouldn't do spit - it takes writes or updates. Any idea how big/many writes? _________________ .sigs waste space and bandwidth |
 |
NoDataFound n00b

Joined: 01 Aug 2011 Posts: 17
|
Posted: Thu Oct 25, 2012 2:14 pm Post subject: |
|
|
I'd like to know what kind of corruption it produces.
Having a bug is bad in itself, although not the end of the world, but it's better if it's recoverable... |
 |
khayyam Veteran


Joined: 07 Jun 2012 Posts: 1253
|
Posted: Thu Oct 25, 2012 3:20 pm Post subject: |
|
|
| depontius wrote: | | So the problem appears to be "failing to wrap the journal" before rebooting. How much filesystem activity does it take to "wrap the journal"? Simply waiting may not do the trick, it sounds as if you really need to generate some filesystem activity. Most likely, "emerge --sync" would do the trick, but obviously not more often than once daily. Since we're talking about the journal, likely reads wouldn't do spit - it takes writes or updates. Any idea how big/many writes? |
depontius ... the situation seems to have moved on (as Hu noted above); it's no longer thought to be related to wrapping.
Note: "Update: It now looks like the reproduction involved something very esoteric indeed, involving using umount -l and shutdowns while the file system was still being unmounted --- and the user had nobarrier specified in the mount options as well." Ted Ts'o
So, I don't think there is much reason to panic: if this weren't a corner case there would be hundreds of reports of data loss, and the actual reported cases so far are few.
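For anyone who wants to check whether a box even uses the mount option named in Ted's update, here is a minimal Python sketch (not from the thread). It assumes the usual /proc/mounts layout (device, mount point, filesystem type, options) and that disabled barriers show up as either `nobarrier` or `barrier=0`, which may differ between kernel versions. The function name `ext4_mounts_with_nobarrier` and the sample text are made up for illustration.

```python
def ext4_mounts_with_nobarrier(mounts_text):
    """Return (device, mount_point) pairs for ext4 mounts with barriers disabled.

    Assumes /proc/mounts-style lines: device, mount point, type, options, ...
    """
    flagged = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        device, mount_point, fs_type, options = fields[:4]
        if fs_type != "ext4":
            continue
        opts = options.split(",")
        # Assumption: barriers off is reported as "nobarrier" or "barrier=0".
        if "nobarrier" in opts or "barrier=0" in opts:
            flagged.append((device, mount_point))
    return flagged


# Hypothetical sample in /proc/mounts format, for illustration only.
SAMPLE = """\
/dev/sda1 / ext4 rw,relatime,nobarrier 0 0
/dev/sda2 /home ext4 rw,relatime 0 0
/dev/sdb1 /data ext3 rw,barrier=0 0 0
"""

for device, mount_point in ext4_mounts_with_nobarrier(SAMPLE):
    print(f"{device} on {mount_point}: write barriers disabled")
```

On a live system one would pass `open("/proc/mounts").read()` instead of the sample. This says nothing about whether `umount -l` is used during shutdown; that would have to be checked by hand in the init scripts.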
For the paranoid there is the option to keep the machine up until a patch or update is provided, or a 'while :; do emerge -e @world ; done' for the truly insane :) ... but for the rest of us it's best not to blow this out of proportion.
best ... khay |
 |
leifbk Apprentice


Joined: 05 Jan 2004 Posts: 234 Location: Bærum, Norway
|
Posted: Thu Oct 25, 2012 4:18 pm Post subject: |
|
|
| khayyam wrote: |
For the paranoid there is the option to keep the machine up until a patch or update is provided, or a 'while :; do emerge -e @world ; done' for the truly insane ... but for the rest of us it's best not to blow this out of proportion.
best ... khay |
I don't think that will help for other filesystems than the one /var/tmp is mounted on. What about /home for instance?
I for one am still running good ol' ext3, and will keep my newly compiled 3.5.7 kernel until a new one hits stable. _________________ The Yggdrasil Genealogy Project |
 |
bandreabis Advocate


Joined: 18 Feb 2005 Posts: 2031 Location: Somewhere over the rainbow... bluebirds fly!
|
Posted: Thu Oct 25, 2012 4:39 pm Post subject: |
|
|
| I can't see any visible difference between 3.3.8 and 3.5.7 (freshly compiled) so I remain with the "not hard masked" one. |
 |
khayyam Veteran


Joined: 07 Jun 2012 Posts: 1253
|
Posted: Thu Oct 25, 2012 6:02 pm Post subject: |
|
|
| leifbk wrote: | | khayyam wrote: | | For the paranoid there is the option to keep the machine up until a patch or update is provided, or a 'while :; do emerge -e @world ; done' for the truly insane :) ... but for the rest of us it's best not to blow this out of proportion. |
I don't think that will help for other filesystems than the one /var/tmp is mounted on. What about /home for instance? |
leifbk ... certainly keeping the machine up would be enough to not encounter the problem, but as for the jocularity ... it wouldn't help at all *regardless* of what filesystems were modified (note that I pointed out that the current understanding is it's not caused by 'wrapping'), but there is no telling the "truly insane" that. So, was that attempt at humour missed?
| leifbk wrote: | | I for one am still running good ol' ext3, and will keep my newly compiled 3.5.7 kernel until a new one hits stable. |
A serious bug in the Linux kernel has caused users to believe that there is a serious bug in the Linux kernel. In a post made to the LKML, Linus Torvalds stated "we're not really sure if this is a bug or not, but we can assure everyone we're reading all of the hullabaloo on slashdot and we'll know more as and when news hits critical mass". The bug, code named "worse than y2k, stuxnet, and Windows 98 combined (WTY2KSTUXNET&W98)", is thought to affect at least three users, and more than ten million blogs and news sites. Users, who until recently had thought that the designation "stable" was an acronym for "no need for backups any mo", are lining up to throw themselves under the wheels of this runaway train. As one commentator noted, "it's worse than Fukushima Daiichi and that other thing ... didn't you read my blog post?" :)
best of the bwaaaaa ... khay |
 |
John R. Graham Administrator


Joined: 08 Mar 2005 Posts: 6431 Location: Somewhere over Atlanta, Georgia
|
Posted: Thu Oct 25, 2012 6:33 pm Post subject: |
|
|
Goodness. That's more sarcastic than me!
- John _________________ This space intentionally left blank. |
 |
khayyam Veteran


Joined: 07 Jun 2012 Posts: 1253
|
Posted: Thu Oct 25, 2012 7:07 pm Post subject: |
|
|
| John R. Graham wrote: | | Goodness. That's more sarcastic than me! |
John ... the intention was to deflate the rise in panic with some humor. It seems that this serious bug, though no doubt an annoyance to those hit, is most likely a corner case, and so all the "hullabaloo" needs to step down a gear or three. It's already been said that this is reflecting badly on ext4, and some of the reporting has been out of proportion to the actual severity, so I guess my sarcasm reflects this.
best ... khay |
 |
John R. Graham Administrator


Joined: 08 Mar 2005 Posts: 6431 Location: Somewhere over Atlanta, Georgia
|
Posted: Thu Oct 25, 2012 7:24 pm Post subject: |
|
|
Never explain sarcasm; it just ruins it.
- John _________________ This space intentionally left blank. |
 |
energyman76b Advocate


Joined: 26 Mar 2003 Posts: 2022 Location: Germany
|
Posted: Thu Oct 25, 2012 9:21 pm Post subject: |
|
|
short: don't do anything stupid and you won't hit the bug.
It is really that simple. Phoronix in the meantime is working hard to earn that Moronix moniker. _________________
| AidanJT wrote: |
Libertardian denial of reality is wholly unimpressive and unconvincing, and simply serves to demonstrate what a bunch of delusional fools they all are.
|
Satan's got perfectly toned abs and rocks a c-cup. |
 |
Jaglover Advocate


Joined: 29 May 2005 Posts: 3974 Location: Saint Amant, Acadiana
|
Posted: Thu Oct 25, 2012 9:25 pm Post subject: |
|
|
| John R. Graham wrote: | Never explain sarcasm; it just ruins it.
- John |
+1
BTW, I'm not the fourth user hit by this. _________________ Please learn how to denote units correctly! |
 |
leifbk Apprentice


Joined: 05 Jan 2004 Posts: 234 Location: Bærum, Norway
|
Posted: Thu Oct 25, 2012 9:34 pm Post subject: |
|
|
| khayyam wrote: | | leifbk wrote: | | khayyam wrote: | For the paranoid there is the option to keep the machine up until a patch or update is provided, or a 'while :; do emerge -e @world ; done' for the truly insane ... but for the rest of us it's best not to blow this out of proportion. |
I don't think that will help for other filesystems than the one /var/tmp is mounted on. What about /home for instance? |
leifbk ... certainly keeping the machine up would be enough to not encounter the problem, but as for the jocularity ... it wouldn't help at all *regardless* of what filesystems were modified (note that I pointed out that the current understanding is it's not caused by 'wrapping'), but there is no telling the "truly insane" that. So, was that attempt at humour missed?
|
Not quite, but I got carried away by the implications. BTW, you came up with an excellent method for converting a Gentoo box into a fan heater, but you forgot the --keep-going option. It's getting cold here in Norway now. _________________ The Yggdrasil Genealogy Project |
 |
Jaglover Advocate


Joined: 29 May 2005 Posts: 3974 Location: Saint Amant, Acadiana
|
Posted: Thu Oct 25, 2012 10:27 pm Post subject: |
|
|
... I have three boxes always on ... is that why my central AC unit just kicked in? Or maybe it's because it's 30C out there? Patting myself on the back for moving from Nordic to Tropic.  _________________ Please learn how to denote units correctly! |
 |
khayyam Veteran


Joined: 07 Jun 2012 Posts: 1253
|
Posted: Fri Oct 26, 2012 1:25 am Post subject: |
|
|
| leifbk wrote: | | BTW, you came up with an excellent method for converting a Gentoo box into a fan heater, but you forgot the --keep-going option. It's getting cold here in Norway now. |
leifbk ... what? and miss the opportunity to discover another bug, no ... and why not do as we more southern europeans do and throw another IKEA poang rocking chair with korndal brown cushion on the fire ... or is that too Swedish for Norwegian sensibilities?
best ... khay |
 |
anyNiXwilldo Apprentice

Joined: 20 Feb 2004 Posts: 172 Location: US
|
Posted: Fri Oct 26, 2012 2:26 am Post subject: |
|
|
Well I hadn't rebooted in several days, but I noticed this morning 3.6.2 was masked. I knew why, from yesterday's articles. The info making the rounds today was saying it's a rather esoteric (hard to reproduce) bug, which probably meant I had nothing to worry about. However, given I run almost 100% stable, except for things like qpdfview, nomacs and the kernel, I felt it best to back the kernel back down to stable from ~amd64. I umounted my data partition after building 3.5.4-hardened-r1-gnu, prior to rebooting with that kernel. Everything seems to be fine. I just know I don't have the nerves to deal with these newer kernels and whatever very scary bugs they might have. _________________ Of course you can have my root password. I'm on Hardened! |
 |
platojones Veteran


Joined: 23 Oct 2002 Posts: 1491 Location: Just over the horizon
|
Posted: Fri Oct 26, 2012 2:30 am Post subject: |
|
|
Reading the latest updates at the thread below and considering the fact that this isn't showing up but on 2 machines that anybody knows of so far (nobody has been able to independently reproduce yet), I'd say it's looking very anti-climactic:
http://thread.gmane.org/gmane.linux.kernel/1379725/focus=1381772 |
 |
leifbk Apprentice


Joined: 05 Jan 2004 Posts: 234 Location: Bærum, Norway
|
Posted: Fri Oct 26, 2012 5:22 am Post subject: |
|
|
| khayyam wrote: | | ... and why not do as we more southern europeans do and throw another IKEA poang rocking chair with korndal brown cushion on the fire ... or is that too Swedish for Norwegian sensibilities? |
We love to burn cheap Swedish furniture
We still haven't forgiven the Swedes for Karl XII, who was shot through the head during his Norwegian campaign in 1718. Nobody knows for certain if the bullet was Norwegian or Swedish, but we love to claim the credit. This tends to make the Swedes irate. _________________ The Yggdrasil Genealogy Project |
 |
ulenrich Guru

Joined: 10 Oct 2010 Posts: 463
|
Posted: Fri Oct 26, 2012 11:44 am Post subject: |
|
|
| platojones wrote: | | Reading the latest updates at the thread below and considering the fact that this isn't showing up but on 2 machines that anybody knows of so far (nobody has been able to independently reproduce yet), |
Yes, and it was related to setup rather than hardware:
cascading mounts of mixed ext4 and network devices, where the system was forcibly configured to reboot very fast: lazy "umount -l" was used so as not to wait for net devices. And a local ext4 partition was mounted on top of a net mount??? And the "nobarrier" mount option??
In this special case, _and_ if a crash additionally forced reboots,
there was data loss after the second reboot!
A clean bit was set when there had not yet been a journal cleanup (writeback?). A workaround for this setup would have been forcefsck on the boot command line. That would have brought the capabilities of a journaled filesystem into play: the missing data would have been written back. But with the additional forcefsck the system would not have booted quickly ...
Not a very commonly used setup.
This is why Greg Kroah-Hartman isn't rushing out a release to fix the issue. At first all of us who watch the stable patchlevel releases felt a panic attack because we knew there had been an ext4 feature backport for linux-3.6.2. But the jbd2 patch which obviously caused the data loss would have been applied anyway: it was (thought to be) a fix. Greg Kroah-Hartman should serialize such feature backports to reduce our psycho panics.
[edit]Don't take the last sentence as a serious suggestion, but as a tool to self audit (for me at least).
[edit2]Because Greg does it already when possible. _________________ fun2gen2
Last edited by ulenrich on Fri Oct 26, 2012 4:19 pm; edited 2 times in total |