Gentoo Forums

Ext3 fs corruption on raid/lvm
Zefiro
n00b

Joined: 19 Aug 2004
Posts: 8
Location: Karlsruhe / Germany

Posted: Wed Dec 14, 2005 5:20 am    Post subject: Ext3 fs corruption on raid/lvm

I have frightening problems with my filesystem, and what is happening is beyond my knowledge. I suspect bugs in layers I thought to be stable and reliable, perhaps triggered by high system load, race conditions or something else not visible to me. The visible symptom is my ext3 filesystem suddenly becoming read-only, with this message in the syslog:

Code:
Dec 14 04:03:34 Alexandria EXT3-fs error (device dm-3): ext3_readdir: bad entry in directory #42024961: directory entry across blocks - offset=0, inode=4023054245, rec_len=62276, name_len=117
Dec 14 04:03:34 Alexandria Aborting journal on device dm-3.
Dec 14 04:03:34 Alexandria ext3_abort called.
Dec 14 04:03:34 Alexandria EXT3-fs error (device dm-3): ext3_journal_start_sb: Detected aborted journal
Dec 14 04:03:34 Alexandria Remounting filesystem read-only
Dec 14 04:03:35 Alexandria __journal_remove_journal_head: freeing b_committed_data


Some background first:
I intended to build a new fileserver for private use. It should sit inside the local network, acting as a place to put files like mp3s, movies, photos, programs and other stuff to be accessed by the Windows boxes (using samba) and by visitors. It would hold primarily data which could be fetched from the net again, though that would be a lot of work. The personal stuff I would still back up to DVDs, but I hoped for longer intervals between backups by using stable systems and raid. Doing a complete backup is impractical due to the intended size (I start with 500 GB but plan to extend to 1-2 TB in the future).

The hardware used is an Asus K8N-E Deluxe (nForce3) with an AMD Athlon64 2800+ Newcastle CPU and 512 MB RAM (PC400 CL3 Infineon). Connected to the onboard nForce SATA controller are the current system disk (250 GB Maxtor 7Y250M0, /dev/sdd) and an identical disk currently used for backups of another PC. The onboard Silicon Image controller holds three identical 250 GB Seagate ST3250823AS disks (/dev/sda-c). The board also has an onboard Ethernet adapter which I use (100 Mbit/s mode).

For the software I installed a Gentoo 2005.1 system. Since it's a home server I used a simple partitioning scheme: a small /boot (100 MB), swap (1 GB) and a bigger / partition (20 GB), using ext3. I try to avoid unstable packages as long as stable versions are available. I'm using kernel 2.6.13-gentoo-r5 with the amd64 keyword. I don't want to bloat this entry with the kernel config, so you can find it on nomorepasting.com with ID #54432. The output of lsmod can be found as ID #54433.

For the storage area I use a layered approach to achieve my goals of robust, reliable and secure storage. The three Seagate SATA disks are combined into one software RAID 5 array using md. To allow for growing it later I set this up using EVMS. The only other tool I am aware of that can grow a RAID is raidreconf, which I read somewhere is abandoned and unreliable.

On top of the RAID I wanted an encryption layer, for which I chose dm-crypt with the new LUKS scheme. Since EVMS is not capable of this, the rest of the setup was done using the traditional command-line tools.

On top of dm-crypt comes LVM2, currently with only one big logical volume, which holds the ext3 filesystem.

My setup uses these devices:
Code:

/dev/sda1               data raid5 partition 1
/dev/sdb1               data raid5 partition 2
/dev/sdc1               data raid5 partition 3
/dev/sdd1               /boot
/dev/sdd2               swap
/dev/sdd3               /
/dev/md1                raid5 partition (using /dev/sda1, sdb1, sdc1)
/dev/mapper/cdisk2      encrypted partition (using /dev/md1)
/dev/mapper/crypta-library   logical volume on a physical volume using /dev/mapper/cdisk2


Exact details on how I did this can be found in my diary, though currently you need an OpenID or LiveJournal account to see it.
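
In case it helps, here is a rough sketch of how such a stack can be put together with the plain command-line tools (I actually created the RAID via EVMS rather than mdadm, and the sizes here are only examples, so treat this as an illustration, not exactly what I typed):

Code:

# software RAID 5 across the three Seagate disks
# (I did this step via EVMS; mdadm shown as the rough equivalent)
mdadm --create /dev/md1 --level=5 --raid-devices=3 /dev/sda1 /dev/sdb1 /dev/sdc1

# dm-crypt with LUKS on top of the array
cryptsetup luksFormat /dev/md1
cryptsetup luksOpen /dev/md1 cdisk2

# LVM2 on top of the encrypted device, with one big logical volume
pvcreate /dev/mapper/cdisk2
vgcreate crypta /dev/mapper/cdisk2
lvcreate -L 450G -n library crypta    # size only an example

# ext3 on the logical volume
mke2fs -j /dev/mapper/crypta-library
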

I assumed that every component used - kernel 2.6, software RAID, dm-crypt, LVM, ext3 - is considered production-stable and should pose no problems.

This is the second time in the short timeframe since I installed this that the ext3 filesystem has gone read-only. Both times I was copying large amounts of data (several dozen GB) onto the system, using multiple copy commands, primarily over samba.

The first time it happened I immediately halted the system. This is the logfile extract from the first occurrence (I found nothing else which I think is related):
Code:
Dec  5 07:21:02 Alexandria EXT3-fs error (device dm-3): ext3_readdir: bad entry in directory #46497793: directory entry across blocks - offset=0, inode=3058234988, rec_len=42308, name_len=172
Dec  5 07:21:02 Alexandria Aborting journal on device dm-3.
Dec  5 07:21:02 Alexandria __journal_remove_journal_head: freeing b_committed_data
Dec  5 07:21:02 Alexandria ext3_abort called.
Dec  5 07:21:02 Alexandria EXT3-fs error (device dm-3): ext3_journal_start_sb: Detected aborted journal
Dec  5 07:21:02 Alexandria Remounting filesystem read-only
Dec  5 07:21:20 Alexandria EXT3-fs error (device dm-3): ext3_readdir: bad entry in directory #46497793: directory entry across blocks - offset=0, inode=3058234988, rec_len=42308, name_len=172
Dec  5 07:21:52 Alexandria EXT3-fs error (device dm-3): ext3_readdir: bad entry in directory #46497793: directory entry across blocks - offset=0, inode=3058234988, rec_len=42308, name_len=172
Dec  5 07:21:55 Alexandria EXT3-fs error (device dm-3): ext3_readdir: bad entry in directory #46497793: directory entry across blocks - offset=0, inode=3058234988, rec_len=42308, name_len=172
Dec  5 07:21:59 Alexandria EXT3-fs error (device dm-3): ext3_readdir: bad entry in directory #46497793: directory entry across blocks - offset=0, inode=3058234988, rec_len=42308, name_len=172
Dec  5 07:22:08 Alexandria EXT3-fs error (device dm-3): ext3_readdir: bad entry in directory #46497793: directory entry across blocks - offset=0, inode=3058234988, rec_len=42308, name_len=172
Dec  5 07:22:08 Alexandria EXT3-fs error (device dm-3): ext3_readdir: bad entry in directory #46497793: directory entry across blocks - offset=0, inode=3058234988, rec_len=42308, name_len=172
Dec  5 07:22:13 Alexandria EXT3-fs error (device dm-3): ext3_readdir: bad entry in directory #46497793: directory entry across blocks - offset=0, inode=3058234988, rec_len=42308, name_len=172
Dec  5 07:22:18 Alexandria EXT3-fs error (device dm-3): ext3_readdir: bad entry in directory #46497793: directory entry across blocks - offset=0, inode=3058234988, rec_len=42308, name_len=172
Dec  5 07:22:20 Alexandria EXT3-fs error (device dm-3): ext3_readdir: bad entry in directory #46497793: directory entry across blocks - offset=0, inode=3058234988, rec_len=42308, name_len=172
Dec  5 07:22:21 Alexandria EXT3-fs error (device dm-3): ext3_readdir: bad entry in directory #46497793: directory entry across blocks - offset=0, inode=3058234988, rec_len=42308, name_len=172
Dec  5 07:22:21 Alexandria EXT3-fs error (device dm-3): ext3_readdir: bad entry in directory #46497793: directory entry across blocks - offset=0, inode=3058234988, rec_len=42308, name_len=172
Dec  5 07:22:21 Alexandria EXT3-fs error (device dm-3): ext3_readdir: bad entry in directory #46497793: directory entry across blocks - offset=0, inode=3058234988, rec_len=42308, name_len=172
Dec  5 07:26:02 Alexandria shutdown[11437]: shutting down for system halt


The e2fsck didn't say anything that sounded alarming, but then I don't know what e2fsck output should or should not look like. I have not kept it. This second time it did at least say that the filesystem contains errors (without specifying exactly what kind of errors and whether they could be fixed or had corrupted something):

Code:
Alexandria ~ # e2fsck -vt /dev/mapper/crypta-library
e2fsck 1.38 (30-Jun-2005)
/dev/mapper/crypta-library: recovering journal
/dev/mapper/crypta-library contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Peak memory: Memory used: 524k/78864k (315k/210k), time: 913.08/ 9.52/29.09
Pass 4: Checking reference counts
Pass 5: Checking group summary information

  135367 inodes used (0%)
   21308 non-contiguous inodes (15.7%)
         # of inodes with ind/dind/tind blocks: 61042/5974/1
77326116 blocks used (63%)
       0 bad blocks
       2 large files

  127210 regular files
    8148 directories
       0 character device files
       0 block device files
       0 fifos
       0 links
       0 symbolic links (0 fast symbolic links)
       0 sockets
--------
  135358 files
Memory used: 524k/0k (10k/515k), time: 953.28/14.44/29.78


The filesystem is mountable afterwards without apparent problems - though I can't verify that none of the files are corrupt.

Code:
Dec 14 06:19:58 Alexandria kjournald starting.  Commit interval 5 seconds
Dec 14 06:19:58 Alexandria EXT3 FS on dm-3, internal journal
Dec 14 06:19:58 Alexandria EXT3-fs: mounted filesystem with ordered data mode.


The hardware is new (<6 months old), and I assume there is no hardware/cabling error, as that should trigger errors in the raid layer, not above it (I can't use smartctl since it's SATA). I did some googling on "EXT3-fs error (device dm-3): ext3_readdir: bad entry in directory" (and also searched this forum), but found nothing really useful. What I did find - e.g. this thread or this one - suggests that this happens to quite a few people, not only me. It seems to especially affect the combination of md and lvm with ext3 (one person stated that in his tests reiserfs did not find any error, while ext3 turned read-only), especially under load, but it is unclear at which layer (md, lvm, ext3 or somewhere else) this occurs.

I really hope someone here has good knowledge about this problem, as building a new storage server with spontaneous data corruption in the back of my mind is not something I could get comfortable with. When I do it, I at least want to know it's stable. (One sysadmin I asked said the whole concept of a source distribution is dangerous, since no one has exactly the same binaries, so there is no real 'stable' version at all.)
jkt
Retired Dev

Joined: 06 Feb 2004
Posts: 1250
Location: Prague, Czech republic, EU

Posted: Wed Dec 14, 2005 4:51 pm    Post subject: Re: Ext3 fs corruption on raid/lvm

Zefiro wrote:
one sysadmin I asked said the whole concept of a source distribution is dangerous since noone has exactly the same binaries, so there is no real 'stable' version at all

Tell the sysadmin that userspace stuff shouldn't affect kernel things like filesystems, RAID and device mapper.
_________________
cd /local/pub && more beer > /dev/mouth

Česká dokumentace
Zefiro
n00b

Joined: 19 Aug 2004
Posts: 8
Location: Karlsruhe / Germany

Posted: Tue Dec 20, 2005 7:51 pm    Post subject: Re: Ext3 fs corruption on raid/lvm

jkt wrote:
Zefiro wrote:
one sysadmin I asked said the whole concept of a source distribution is dangerous since noone has exactly the same binaries, so there is no real 'stable' version at all

Tell the sysadmin that userspace stuff shouldn't affect kernel things like filesystems, RAID and device mapper.

That was not his point.

You're not helping with the general problem, either.

Is there really nobody who knows anything about this problem? I know it's quite rare, but googling told me I'm not the only one - which leads me to the conclusion that there indeed are bugs somewhere in these layers (kernelspace).

Since I currently can't trust this system anymore, I have to think about alternatives - which might even mean giving up on Gentoo altogether. But I hoped I would find some answers, or at least more clues, here in the forums.
jkt
Retired Dev

Joined: 06 Feb 2004
Posts: 1250
Location: Prague, Czech republic, EU

Posted: Wed Dec 21, 2005 12:11 am    Post subject: Re: Ext3 fs corruption on raid/lvm

Zefiro wrote:
jkt wrote:
Zefiro wrote:
one sysadmin I asked said the whole concept of a source distribution is dangerous since noone has exactly the same binaries, so there is no real 'stable' version at all

Tell the sysadmin that userspace stuff shouldn't affect kernel things like filesystems, RAID and device mapper.

That was not his point.

You're not helping with the general problem, either.

Okay, but come on: if you have issues with kernel stuff like filesystems, device mapper, RAID and LVM, how is this Gentoo-specific? That's my point. You can run exactly the same kernel on your Gentoo box as on a RHEL machine...

Quote:
Is there really nobody who knows anything about this problem? I know it's quite rare, but googling told me I'm not the only one - which leds me to the conclusion that there indeed are bugs, somewhere in this layers (kernelspace).

Which sounds quite dangerous :-(.

Quote:
Since currently I can't trust this system anymore I have to think about alternatives. Which might even be to cancel using Gentoo at all. But I hoped I would find some answers or at least more clues here in the forums.

As I stated above, I can't see how switching to another distribution would help...
_________________
cd /local/pub && more beer > /dev/mouth

Česká dokumentace
Gentoo Server
Apprentice

Joined: 21 Jul 2003
Posts: 279

Posted: Mon Jan 30, 2006 9:21 am

I can confirm the ext3 failure:

copying to a raid5 (with dm_crypt)

high load

ext3 went read-only (filesystem needed recovery)

This happened while copying to a freshly ext3-formatted raid5 array, after about 200 GB.

I think it's a kernel bug, or ext3 can't survive high load.

I reformatted the same raid config with xfs and am now copying 800 GB.

So far 129 GB copied with no problems.
Zefiro
n00b

Joined: 19 Aug 2004
Posts: 8
Location: Karlsruhe / Germany

Posted: Mon Jan 30, 2006 2:05 pm

Gentoo Server wrote:
I can confirm ext3 failure

Thanks :)

So did my Google searches back then. Unfortunately the problem seems to exist, but rarely, so no solution was presented anywhere. Some ideological geeks even claimed it wasn't real ("I can't see an error, never have, so there is no error").

Gentoo Server wrote:
copy to a raid5 (with dm_crypt)
high load
ext3 went to readonly (filesystem needed recovery)

Yes, exactly my setup, plus LVM in between. Did you use LVM?

The other sources all seemed to have either LVM or softraid underneath (without dm_crypt), and one person spoke of having no problems with reiserfs. So I suspect it's the combination of softraid, and perhaps lvm, with ext3. Considering what I read, I doubt it has anything to do with dm_crypt. Oh, and yes - high load, always. In my case it was two concurrent disk writes (a copy from another hdd and one from the network), so low CPU, but high IO activity.

Gentoo Server wrote:
i think its a kernel bug or ext3 cant survive high load

Quite possible. Do you think ext2 works better? I chose ext3 for its long ext2 heritage and assumed it was stable enough. But on second thought I am considering non-journaling filesystems without a reserved root area (-m 0 on ext) to have maximum space for my files.
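
(For the record, the reserved-area setting I mean is just the -m option of the ext2/ext3 tools; the device name here is only an example:)

Code:

# create an ext2 filesystem with no reserved root blocks
mke2fs -m 0 /dev/mapper/crypta-library

# or drop the reserved blocks on an existing ext2/ext3 filesystem
tune2fs -m 0 /dev/mapper/crypta-library
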
jkt
Retired Dev

Joined: 06 Feb 2004
Posts: 1250
Location: Prague, Czech republic, EU

Posted: Mon Jan 30, 2006 3:09 pm

Have you folks submitted a bug?
_________________
cd /local/pub && more beer > /dev/mouth

Česká dokumentace
Gentoo Server
Apprentice

Joined: 21 Jul 2003
Posts: 279

Posted: Mon Jan 30, 2006 5:35 pm

Zefiro wrote:
Gentoo Server wrote:
I can confirm ext3 failure

Thanks :)

So did my googlesearches back then. Unfortunately it seems to be existant, but rare, so no solution was presented anywhere. Some ideological geeks even claimed it to be wrong ("I can't see an error, never have, so there is no error").

Gentoo Server wrote:
copy to a raid5 (with dm_crypt)
high load
ext3 went to readonly (filesystem needed recovery)

Yes, exactly my setup, plus LVM inbetween. Did you use LVM?

The other sources all seemed to have either LVM or softraid underneath (without dm_crypt), and one person spoke from having no problems with reiserfs. So I suspect it's the combination of softraid and perhaps lvm with ext3. Considering what I read I doubt it's something to do with dm_crypt. Oh, and yes - high load, always. In my case it were two concurrent disk write (copy from another hdd and from network), so low CPU, but high IO activity.

Gentoo Server wrote:
i think its a kernel bug or ext3 cant survive high load

Quite possible. Do you think ext2 works better? I chose ext3 for it's long ext2 experience and I assumed it was stable enough. But on second thought I considered using non-journaling, non-special-root-area filesystems (-m 0 on ext) to have maximum space for my files.



I don't think it's a rare bug.

Just create an 8-drive raid5 over dmcrypt,

ext3 with writeback,

then copy onto that ext3 from other sources at max speed (in my test from 2 drives).

I think the important part is a fully loaded system, which is pretty easy to achieve with dmcrypt.

I got that error pretty early on a new ext3 fs, without any crash or other problem.

The fs just broke.
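
Roughly what I did to trigger it (just a sketch - device names are examples, and by "writeback" I mean the data=writeback mount option):

Code:

# raid5 + dmcrypt device already set up as /dev/mapper/cryptraid
mkfs.ext3 /dev/mapper/cryptraid
mount -o data=writeback /dev/mapper/cryptraid /mnt/raid

# two copy streams from separate source disks to max out the IO load
cp -a /mnt/disk1/. /mnt/raid/disk1 &
cp -a /mnt/disk2/. /mnt/raid/disk2 &
wait
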


No, I didn't submit that bug.

I had a kernel bug with reiser 3.6 too, so I can only suggest to everyone: don't use reiser or ext3.

Now I have done the exact same copy onto an xfs filesystem, with higher speed and no problems so far (518 GB copied so far).

Ah yes, one extra piece of info: I copied from 2 single ext3 drives to that encrypted ext3 raid5 array
-> fs crash
jkt
Retired Dev

Joined: 06 Feb 2004
Posts: 1250
Location: Prague, Czech republic, EU

Posted: Mon Jan 30, 2006 5:47 pm

Gentoo Server wrote:
no i didnt submit that bug

Ah, I thought you guys wanted to have the issue fixed...
_________________
cd /local/pub && more beer > /dev/mouth

Česká dokumentace
Gentoo Server
Apprentice

Joined: 21 Jul 2003
Posts: 279

Posted: Mon Jan 30, 2006 7:03 pm

jkt wrote:
Gentoo Server wrote:
no i didnt submit that bug

Ah, I thought you guys wanted to have the issue fixed...


I fixed it easily by replacing ext3 with xfs.
jkt
Retired Dev

Joined: 06 Feb 2004
Posts: 1250
Location: Prague, Czech republic, EU

Posted: Mon Jan 30, 2006 7:09 pm

Gentoo Server wrote:
I fixed it easy by replacing ext3 with xfs

Nope, you haven't fixed it. It seems that something in the combination of device mapper/raid/whatever and the ext3 filesystem causes trouble. It might be something in the ext3 code, or something in the SW RAID that gets triggered by the way ext3 accesses data. You've just made a workaround for the real problem, so you can't call it a fix. The bug is still there.
_________________
cd /local/pub && more beer > /dev/mouth

Česká dokumentace
Gentoo Server
Apprentice

Joined: 21 Jul 2003
Posts: 279

Posted: Mon Jan 30, 2006 7:39 pm

jkt wrote:
Gentoo Server wrote:
I fixed it easy by replacing ext3 with xfs

Nope, you haven't fixed it. It seems that something in the combination of device mapper/raid/whatever and ext3 filesystem couses trouble. It might be something in the ext3 code or something in the SW RAID that gets triggered by the ext3's way it accesses data. You've just made a workaround to the real problem, so you can't call it a fix. The bug is still here.



Sure, it's just a fix for me, not a fix for ext3.

On the other hand, xfs works perfectly, and reiser3 gave me a kernel crash too on raid5/dm_crypt.
I think ext3 and reiser can't really handle high load.
Gentoo Server
Apprentice

Joined: 21 Jul 2003
Posts: 279

Posted: Mon Jan 30, 2006 7:43 pm

I found this!

Can I use ReiserFS with software RAID.
Not with raid5, our journaling and their raid code step on each
other in the buffering code. Also, you must use the mirror syncing
tools with the FS unmounted. Otherwise, yes, you may do striping and
concatenating and mirroring.

Software RAID users: Using any journaled FS on top of software raid
will result in data corruption right now. We are working with the
ext3 and software raid developers to fix some conflicts in buffer
cache usage.



OMG, it looks like reiser and ext3 crash with softraid5.
Gentoo Server
Apprentice

Joined: 21 Jul 2003
Posts: 279

Posted: Tue Jan 31, 2006 6:14 pm

I am finished now with my conversion.

The system is a P4-HT (smp),

8 HDDs in raid5, dmcrypt, no lvm.

ext3: fs failure after a short time
xfs: zero problems after 20h of 100% heavy load
reiser3: kernel crash
Zefiro
n00b

Joined: 19 Aug 2004
Posts: 8
Location: Karlsruhe / Germany

Posted: Tue Jan 31, 2006 7:17 pm

jkt: I haven't submitted a bug since I didn't know if it was one, or perhaps some fault on my side. So I first presented the problem here, in the hope someone could help me identify what was wrong, or whether it really was a bug. Also, I didn't know where exactly the bug was. I assumed all of softraid, lvm and ext3 to be quite stable, and from what I found it could have to do with any of them. So I didn't even know what to write in the bug report, or which program to associate it with.
(Your first post wasn't very helpful, either.)

Gentoo Server wrote:
i dont think its a rare bug

I didn't know. I found very little about this, and no answers, so I just assumed it was rather rare. If it's reproducible, that would be great, as that helps with finding and fixing it.

jkt wrote:
Ah, I thought you guys wanted to have the issue fixed...

Yes, we do. To be honest, I've never used a bug tracker before, so I wouldn't know what to include in a report. So I first posted here, with all the information I had. What would you suggest?

Gentoo Server wrote:
I found this!

Wow! Now, that's interesting reading.
It not only states that there is indeed a problem, but also that it is known and where it is (the buffering code). The way it is written, it sounds like it won't get fixed very fast. Though I must admit I do not really understand what the problem is. I thought every layer is separated from the others, treating the output of softraid just like any other device. So where is the problem? And how could I help fix it?

Please, can you give me a link to where you found this?

If it's really a problem with journaling filesystems, I think that is acceptable for me - for my current project I can live without journaling. But it's still a major pitfall for the unaware, as I had never read about this kind of problem before, not even while searching for it. It could be made much more explicit in the softraid or filesystem documentation (e.g. the man pages).
Zefiro
n00b

Joined: 19 Aug 2004
Posts: 8
Location: Karlsruhe / Germany

Posted: Tue Jan 31, 2006 8:23 pm

So, a bit of searching later, it seems that this 'journaling has problems with softraid' had indeed been an issue, but with kernel 2.2.x - we are at 2.6.x nowadays, so this information is outdated. It still seems to be a problem for us, though.

From the reiser faq on http://www.namesys.com/faq.html#raid:
Quote:
Can I use ReiserFS with the software RAID.
Yes, for all Raid levels using any Linux >= 2.4.1, but DO NOT use Raid5 with Linux 2.2.x. Our journaling and their Raid code step on each other in the buffering code. Also, mirroring is not safe in the 2.2.x kernels because online mirror rebuilds in 2.2.x break the write ordering requirements for the log. If you crash in the middle of an online rebuild, your meta-data may be corrupted. The only Raid level that is safe with ReiserFS in the 2.2.x kernels is the striping/concatenation level.


This mail also quotes "Umm, don't do that. 2.2's soft raid is incompatible with journaling of any form, and ext3 is no exception: this _will_ break.", but it is from 2001 and also talks about 2.2 kernels.

This mail is more current and states, for ext3, that the exact ordering of the writes is not an issue, as long as they get written correctly.

So as I see it, we have these statements/facts:
- there was a known issue with softraid and journaling filesystems in 2.2 kernels
- this shouldn't be a problem anymore with 2.4 and 2.6 kernels
- the problem was in the buffering code and thus in the write ordering, which shouldn't pose a problem for ext3
- if all layers are working correctly, it shouldn't matter at all whether softraid is used or not, or whether journaling is used or not (apart from performance and recovery possibilities)
- we still have - apparently somewhat reproducible - fs errors using softraid and ext3/reiser
- it is unclear whether this is a softraid, an ext3, a reiser or a general design problem, and thus unclear which layer should get fixed (and receive the bug report)

What would be best to do now?
jkt
Retired Dev

Joined: 06 Feb 2004
Posts: 1250
Location: Prague, Czech republic, EU

Posted: Tue Jan 31, 2006 9:03 pm

Zefiro wrote:
jkt: I haven't submitted a bug since I didn't know if it was one or perhaps some fault on my side. So I first presented the problem here, in the hope someone could help me identify what was wrong or if it really was a bug. Then, I didn't know where exactly the bug was. I assumed all of softraid, lvm, ext3 to be quite stable and from what I found it could have to do with all of it. So I didn't even know what to write in the bug report, or which program to associate it too.


I'd suggest either LKML or kernel bugzilla.

Quote:
(your first post wasn't quite helpful, too)

What would you expect if you mention that someone said "it would work on another distribution" while speaking about kernel issues?

Quote:
jkt wrote:
Ah, I thought you guys wanted to have the issue fixed...

Yes, we do. To be honest, I've never before used the bug tracker system, so I wouldn't know what to include in a report. So I first posted here, with all information I had. What would you suggest?

See above.

Quote:
Gentoo Server wrote:
I found this!

Wow! Now, that's interesting reading.
It does not only state that there is indeed a problem, but also that it is known and where it is (buffering code). Sounds like it won't get fixed too fast, the way it is written. Though I must admit I do not really understand what the problem is. I thought every layer is seperated from another, treating the output of softraid just like every other device. So where is this problem? And how could I help to fix it?

I personally wouldn't trust all the stuff written on the reiserfs/namesys homepage, as there was some outdated information there last time I checked.

Quote:
If it's really a problem with journaling fs I think this is acceptable for me - for my current project I can live without journaling. But still it's a major pitfall for the unaware, as I have never read about this kind of problem before and even while I searched for it. Could be made quite more explicit in the softraid or filesystem documentation (man page e.g.)

If it is really a bug, it's pretty serious, IMHO (I don't know anything about internal kernel workings, though)... I've talked to some folks from our kernel project and they said they'll look at the issue.
_________________
cd /local/pub && more beer > /dev/mouth

Česká dokumentace
Gentoo Server
Apprentice

Joined: 21 Jul 2003
Posts: 279

Posted: Fri Feb 03, 2006 11:54 pm

http://archives.free.net.ph/message/20060121.172151.1a49c5e6.en.html

Here is another report.
I can only suggest to everyone to drop ext3 until it's stable again.
miraage
n00b

Joined: 14 Oct 2002
Posts: 15

Posted: Sat Aug 19, 2006 3:18 pm

I'm now experiencing exactly the same issue, except this is RAID 1. It's pretty painful.

Aug 19 03:19:00 zayin kernel: [141017.386918] Aborting journal on device md1.
Aug 19 03:21:22 zayin kernel: [141159.355326] ext3_abort called.
Aug 19 03:21:22 zayin kernel: [141159.355350] EXT3-fs error (device md1): ext3_journal_start_sb: Detected aborted journal
Aug 19 03:21:22 zayin kernel: [141159.355389] Remounting filesystem read-only

I'll try to rebuild my /dev/md1 on reiserfs since that seems to work better. The devices

/dev/mapper/lvm--raid-home on /home type reiserfs (rw,noatime)
/dev/mapper/lvm--raid-media on /media type reiserfs (rw,noatime)
/dev/mapper/lvm--raid-opt on /opt type reiserfs (rw,noatime)
/dev/mapper/lvm--raid-usr on /usr type reiserfs (rw,noatime)
/dev/mapper/lvm--raid-var on /var type reiserfs (rw,noatime)

are all living on top of a RAID 1 array.
_________________
Visit my shameless plug!
Gentoo Server
Apprentice

Joined: 21 Jul 2003
Posts: 279

Posted: Sat Aug 19, 2006 5:59 pm

I had hard kernel crashes with reiser3 too.
Now I am on XFS, which looks pretty good as long as you don't have lots of HDD failures.
miraage
n00b

Joined: 14 Oct 2002
Posts: 15

Posted: Sat Aug 19, 2006 7:29 pm

Luckily, I've experienced no problems on reiserfs (yet?). I'm using the latest gentoo-sources-2.6.17.
_________________
Visit my shameless plug!
chojin
n00b

Joined: 26 Jun 2005
Posts: 42

Posted: Sun Oct 22, 2006 12:56 pm

I have exactly the same error.
I am running stable Gentoo 2006.0 with kernel 2.6.17-gentoo-r8 on an nforce2 chipset with a Silicon Image SATA controller. At first I had only one disk, holding my system and one Linux raid partition on which I created a degraded raid1 (so I could add another disk later which wasn't available at the time of creation). On it I put 2 LVM2 volumes, one for home and one for data, both formatted with ext3. This ran well and without any problems for a few months, until I was finally able to add the second disk to the degraded raid1 to make it clean.
The second disk's raid partition is equal in size to the raid partition on the first disk. The second disk had always performed well in my fileserver, so I did not fear any HW errors on that disk.
But after adding the second disk (configured with one Linux raid partition), I suddenly noticed my home partition going read-only. With fsck I did not have much luck either: the filesystem was severely corrupted and I lost a lot of files. After that I started working again, and suddenly the data partition became read-only, also with severe corruption and a lot of lost files.
This scenario repeated itself a few times, but since it had run well with the degraded raid before, I removed the second disk from the raid again. After that I still get filesystem corruption, but now only on my home partition and not the data partition, and it can always be fixed automatically by fsck now... but still, every day the home partition goes read-only, even without heavy load.
It's always only ext3 errors in the logs; the raid and lvm layers report no problem at all.
Someone on linuxquestions suggested my memory could be bad, but a memtest86 run of nearly 24h (61 passes) showed no errors...

I also have my fileserver configured with raid1 + lvm2 + ext3 on 2 (identical) SATA disks with a Silicon Image SATA controller, on which I haven't experienced any problems yet, even after resizing or heavy load. But now I'm scared of ever having to degrade it to replace an HD and then finding that data corruption comes up...
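
For reference, the degraded-then-completed raid1 procedure I describe above looks roughly like this (device names are examples, not my exact ones):

Code:

# create the raid1 with only one member; the second slot stays "missing"
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda3 missing

# months later: add the second disk's raid partition, the array then resyncs
mdadm /dev/md0 --add /dev/sdb3
cat /proc/mdstat    # watch the rebuild
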
Januszzz
Guru

Joined: 04 Feb 2006
Posts: 367
Location: Opole, Poland

Posted: Fri Dec 01, 2006 1:42 pm    Post subject: The same.

The same ->
kernel gentoo-sources-2.6.17-r8,
machine TI UltraSparc II (BlackBird)
config RAID 5 with three SCSI disks.

I went for xfs, without errors now.
mattdev121
n00b

Joined: 17 Jul 2006
Posts: 3

Posted: Wed Dec 06, 2006 3:35 am

I'm not sure if it's an actual failure yet, but I noticed my Raid1 array (two SATA drives, ext3) was marked as [faulty]. I'm running a data check on it now, and will run an fsck on it, but I hope I don't have to rebuild the system on a non-journaling fs.
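
(The data check and fsck I mean are roughly the following - /dev/md0 is just an example name for my array:)

Code:

# ask md to verify the mirror, then watch progress
echo check > /sys/block/md0/md/sync_action
cat /proc/mdstat

# afterwards, force a filesystem check (with the filesystem unmounted)
fsck.ext3 -f /dev/md0
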

My question is: if journaling FSes are what causes raid to corrupt the data, will compiling ext3 support OUT of the kernel (mounting the partition in ext2 fallback mode) solve the problem?

It may just be paranoia but it seems to fit what's been outlined here.
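
Something like this is what I have in mind (assuming /dev/md1 is the array - I don't know yet whether it actually avoids the problem):

Code:

# an ext3 filesystem with a clean journal can be mounted as plain ext2
fsck.ext3 -f /dev/md1
mount -t ext2 /dev/md1 /mnt/data

# or remove the journal for good, turning it back into ext2 (unmounted!)
tune2fs -O ^has_journal /dev/md1
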
JB318
n00b

Joined: 26 Apr 2005
Posts: 27
Location: Tulsa, Oklahoma

Posted: Tue Jan 02, 2007 5:58 am

I'm curious if this issue was caused by the data corruption bug that was just fixed:

http://kerneltrap.org/node/7518
_________________
"The life of every man is a diary, in which he means to write one story, and writes another."
-- _Cheers For Miss Bishop_ (1941)
Page 1 of 2