Gentoo Forums

Ext3 fs corruption on raid/lvm
Zefiro
n00b

Joined: 19 Aug 2004
Posts: 8
Location: Karlsruhe / Germany

Posted: Wed Dec 14, 2005 5:20 am    Post subject: Ext3 fs corruption on raid/lvm

I have frightening problems with my filesystem, and what is happening is beyond my knowledge. I suspect bugs in layers I thought to be stable and reliable, perhaps triggered by high system load, race conditions or something else not visible to me. The visible symptom is my ext3 filesystem suddenly becoming read-only, with this message in the syslog:

Code:
Dec 14 04:03:34 Alexandria EXT3-fs error (device dm-3): ext3_readdir: bad entry in directory #42024961: directory entry across blocks - offset=0, inode=4023054245, rec_len=62276, name_len=117
Dec 14 04:03:34 Alexandria Aborting journal on device dm-3.
Dec 14 04:03:34 Alexandria ext3_abort called.
Dec 14 04:03:34 Alexandria EXT3-fs error (device dm-3): ext3_journal_start_sb: Detected aborted journal
Dec 14 04:03:34 Alexandria Remounting filesystem read-only
Dec 14 04:03:35 Alexandria __journal_remove_journal_head: freeing b_committed_data


Some background first:
I intended to build a new fileserver for private use. It should sit inside the local network, acting as a place to put files like mp3s, movies, photos, programs and other stuff to be accessed by the Windows boxes (using samba) and by visitors. It would hold primarily data which could be fetched from the net again, though that would be a lot of work. The personal stuff I would still back up to DVDs, but I hoped for longer intervals between backups by using stable systems and raid. Doing a complete backup is impractical due to the intended size (I start with 500 GB but plan to extend to 1-2 TB in the future).

The hardware used is an Asus K8N-E Deluxe (nForce3) with an AMD Athlon64 2800+ Newcastle CPU and 512 MB RAM (PC400 CL3 Infineon). Connected to the onboard nForce SATA controller are the current system disk (250 GB Maxtor 7Y250M0, /dev/sdd) and an identical disk currently used for backups of another PC. The onboard Silicon Image controller holds three identical 250 GB Seagate ST3250823AS disks (/dev/sda-c). The board also has an onboard Ethernet adapter which I use (100 Mbit/s mode).

For the software I installed a Gentoo 2005.1 system. Since it's a home server I used a simple partitioning scheme: a small /boot (100 MB), swap (1 GB) and a bigger / partition (20 GB), using ext3. I try to avoid unstable packages as long as stable versions are available. I'm using kernel 2.6.13-gentoo-r5 with the amd64 keyword. I don't want to bloat this entry with the kernel config, so you can find it on nomorepasting.com with ID #54432. The output of lsmod can be found as ID #54433.

For the storage area I use a layered approach to achieve my goals of robust, reliable and secure storage. The three Seagate SATA disks are combined into one software RAID 5 array using md. To allow for growing it later I set this up using EVMS. The only other tool I am aware of that can grow a RAID is raidreconf, which I read somewhere is abandoned and unreliable.

On top of the RAID I wanted an encryption layer, for which I chose dm-crypt with the new LUKS scheme. Since EVMS is not capable of this, the rest of the setup was done using the traditional command-line tools.

On top of dm-crypt comes LVM2, currently with only one big logical volume, which holds the ext3 filesystem.

My setup uses these devices:
Code:

/dev/sda1               data raid5 partition 1
/dev/sdb1               data raid5 partition 2
/dev/sdc1               data raid5 partition 3
/dev/sdd1               /boot
/dev/sdd2               swap
/dev/sdd3               /
/dev/md1                raid5 partition (using /dev/sda1, sdb1, sdc1)
/dev/mapper/cdisk2      encrypted partition (using /dev/md1)
/dev/mapper/crypta-library   logical volume on a physical volume using /dev/mapper/cdisk2


Exact details on how I did this can be found in my diary, though currently you need an OpenID or LiveJournal account to see it.
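
In case it helps, here is a rough sketch of how such a stack can be put together with the plain command-line tools (I actually created the RAID via EVMS rather than mdadm, and the sizes here are only examples, so treat this as an illustration, not exactly what I typed):

Code:

# software RAID 5 across the three Seagate disks
# (I did this step via EVMS; mdadm shown as the rough equivalent)
mdadm --create /dev/md1 --level=5 --raid-devices=3 /dev/sda1 /dev/sdb1 /dev/sdc1

# dm-crypt with LUKS on top of the array
cryptsetup luksFormat /dev/md1
cryptsetup luksOpen /dev/md1 cdisk2

# LVM2 on top of the encrypted device, with one big logical volume
pvcreate /dev/mapper/cdisk2
vgcreate crypta /dev/mapper/cdisk2
lvcreate -L 450G -n library crypta    # size only an example

# ext3 on the logical volume
mke2fs -j /dev/mapper/crypta-library
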

I assumed that every component used - kernel 2.6, software RAID, dm-crypt, LVM, ext3 - is considered production-stable and should pose no problems.

This is the second time in the short timeframe since I installed this that the ext3 filesystem has gone read-only. Both times I was copying large amounts of data (several dozen GB) onto the system, using multiple copy commands, primarily over samba.

The first time it happened I immediately halted the system. This is the logfile extract from the first occurrence (I found nothing else which I think is related):
Code:
Dec  5 07:21:02 Alexandria EXT3-fs error (device dm-3): ext3_readdir: bad entry in directory #46497793: directory entry across blocks - offset=0, inode=3058234988, rec_len=42308, name_len=172
Dec  5 07:21:02 Alexandria Aborting journal on device dm-3.
Dec  5 07:21:02 Alexandria __journal_remove_journal_head: freeing b_committed_data
Dec  5 07:21:02 Alexandria ext3_abort called.
Dec  5 07:21:02 Alexandria EXT3-fs error (device dm-3): ext3_journal_start_sb: Detected aborted journal
Dec  5 07:21:02 Alexandria Remounting filesystem read-only
Dec  5 07:21:20 Alexandria EXT3-fs error (device dm-3): ext3_readdir: bad entry in directory #46497793: directory entry across blocks - offset=0, inode=3058234988, rec_len=42308, name_len=172
Dec  5 07:21:52 Alexandria EXT3-fs error (device dm-3): ext3_readdir: bad entry in directory #46497793: directory entry across blocks - offset=0, inode=3058234988, rec_len=42308, name_len=172
Dec  5 07:21:55 Alexandria EXT3-fs error (device dm-3): ext3_readdir: bad entry in directory #46497793: directory entry across blocks - offset=0, inode=3058234988, rec_len=42308, name_len=172
Dec  5 07:21:59 Alexandria EXT3-fs error (device dm-3): ext3_readdir: bad entry in directory #46497793: directory entry across blocks - offset=0, inode=3058234988, rec_len=42308, name_len=172
Dec  5 07:22:08 Alexandria EXT3-fs error (device dm-3): ext3_readdir: bad entry in directory #46497793: directory entry across blocks - offset=0, inode=3058234988, rec_len=42308, name_len=172
Dec  5 07:22:08 Alexandria EXT3-fs error (device dm-3): ext3_readdir: bad entry in directory #46497793: directory entry across blocks - offset=0, inode=3058234988, rec_len=42308, name_len=172
Dec  5 07:22:13 Alexandria EXT3-fs error (device dm-3): ext3_readdir: bad entry in directory #46497793: directory entry across blocks - offset=0, inode=3058234988, rec_len=42308, name_len=172
Dec  5 07:22:18 Alexandria EXT3-fs error (device dm-3): ext3_readdir: bad entry in directory #46497793: directory entry across blocks - offset=0, inode=3058234988, rec_len=42308, name_len=172
Dec  5 07:22:20 Alexandria EXT3-fs error (device dm-3): ext3_readdir: bad entry in directory #46497793: directory entry across blocks - offset=0, inode=3058234988, rec_len=42308, name_len=172
Dec  5 07:22:21 Alexandria EXT3-fs error (device dm-3): ext3_readdir: bad entry in directory #46497793: directory entry across blocks - offset=0, inode=3058234988, rec_len=42308, name_len=172
Dec  5 07:22:21 Alexandria EXT3-fs error (device dm-3): ext3_readdir: bad entry in directory #46497793: directory entry across blocks - offset=0, inode=3058234988, rec_len=42308, name_len=172
Dec  5 07:22:21 Alexandria EXT3-fs error (device dm-3): ext3_readdir: bad entry in directory #46497793: directory entry across blocks - offset=0, inode=3058234988, rec_len=42308, name_len=172
Dec  5 07:26:02 Alexandria shutdown[11437]: shutting down for system halt


The e2fsck didn't say anything that sounded alarming, but then I don't know what e2fsck output should or should not look like. I have not kept it. This second time it did at least say that the filesystem contains errors (without specifying exactly what kind of errors and whether they could be fixed or had corrupted something):

Code:
Alexandria ~ # e2fsck -vt /dev/mapper/crypta-library
e2fsck 1.38 (30-Jun-2005)
/dev/mapper/crypta-library: recovering journal
/dev/mapper/crypta-library contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Peak memory: Memory used: 524k/78864k (315k/210k), time: 913.08/ 9.52/29.09
Pass 4: Checking reference counts
Pass 5: Checking group summary information

  135367 inodes used (0%)
   21308 non-contiguous inodes (15.7%)
         # of inodes with ind/dind/tind blocks: 61042/5974/1
77326116 blocks used (63%)
       0 bad blocks
       2 large files

  127210 regular files
    8148 directories
       0 character device files
       0 block device files
       0 fifos
       0 links
       0 symbolic links (0 fast symbolic links)
       0 sockets
--------
  135358 files
Memory used: 524k/0k (10k/515k), time: 953.28/14.44/29.78


The filesystem is mountable afterwards without apparent problems - though I can't verify that none of the files are corrupt.

Code:
Dec 14 06:19:58 Alexandria kjournald starting.  Commit interval 5 seconds
Dec 14 06:19:58 Alexandria EXT3 FS on dm-3, internal journal
Dec 14 06:19:58 Alexandria EXT3-fs: mounted filesystem with ordered data mode.


The hardware is new (<6 months old), and I assume there is no hardware/cabling error, as that should trigger errors in the raid layer, not above it (I can't use smartctl since it's SATA). I did some googling on "EXT3-fs error (device dm-3): ext3_readdir: bad entry in directory" (and also searched this forum), but found nothing really useful. What I did find - e.g. this thread or this one - suggests that this happens to quite a few people, not only me. It seems to especially affect the combination of md and lvm with ext3 (one person stated that in his tests reiserfs did not find any error, while ext3 turned read-only), especially under load, but it is unclear at which layer (md, lvm, ext3 or somewhere else) this occurs.

I really hope someone here has good knowledge about this problem, as building a new storage server with spontaneous data corruption in the back of my mind is not something I could get comfortable with. When I do it, I at least want to know it's stable. (One sysadmin I asked said the whole concept of a source distribution is dangerous, since no one has exactly the same binaries, so there is no real 'stable' version at all.)
jkt
Retired Dev

Joined: 06 Feb 2004
Posts: 1250
Location: Prague, Czech republic, EU

Posted: Wed Dec 14, 2005 4:51 pm    Post subject: Re: Ext3 fs corruption on raid/lvm

Zefiro wrote:
one sysadmin I asked said the whole concept of a source distribution is dangerous since noone has exactly the same binaries, so there is no real 'stable' version at all

Tell the sysadmin that userspace stuff shouldn't affect kernel things like filesystems, RAID and device mapper.
_________________
cd /local/pub && more beer > /dev/mouth

Česká dokumentace
Zefiro
n00b

Joined: 19 Aug 2004
Posts: 8
Location: Karlsruhe / Germany

Posted: Tue Dec 20, 2005 7:51 pm    Post subject: Re: Ext3 fs corruption on raid/lvm

jkt wrote:
Zefiro wrote:
one sysadmin I asked said the whole concept of a source distribution is dangerous since noone has exactly the same binaries, so there is no real 'stable' version at all

Tell the sysadmin that userspace stuff shouldn't affect kernel things like filesystems, RAID and device mapper.

That was not his point.

You're not helping with the general problem, either.

Is there really nobody who knows anything about this problem? I know it's quite rare, but googling told me I'm not the only one - which leads me to the conclusion that there indeed are bugs somewhere in these layers (kernelspace).

Since I currently can't trust this system anymore, I have to think about alternatives - which might even mean giving up on Gentoo altogether. But I hoped I would find some answers, or at least more clues, here in the forums.
jkt
Retired Dev

Joined: 06 Feb 2004
Posts: 1250
Location: Prague, Czech republic, EU

Posted: Wed Dec 21, 2005 12:11 am    Post subject: Re: Ext3 fs corruption on raid/lvm

Zefiro wrote:
jkt wrote:
Zefiro wrote:
one sysadmin I asked said the whole concept of a source distribution is dangerous since noone has exactly the same binaries, so there is no real 'stable' version at all

Tell the sysadmin that userspace stuff shouldn't affect kernel things like filesystems, RAID and device mapper.

That was not his point.

You're not helping with the general problem, either.

Okay, but come on: if you have issues with kernel stuff like filesystems, device mapper, RAID and LVM, how is this Gentoo-specific? That's my point. You can run exactly the same kernel on your Gentoo box as on a RHEL machine...

Quote:
Is there really nobody who knows anything about this problem? I know it's quite rare, but googling told me I'm not the only one - which leds me to the conclusion that there indeed are bugs, somewhere in this layers (kernelspace).

Which sounds quite dangerous :-(.

Quote:
Since currently I can't trust this system anymore I have to think about alternatives. Which might even be to cancel using Gentoo at all. But I hoped I would find some answers or at least more clues here in the forums.

As I stated above, I can't see how switching to another distribution would help...
_________________
cd /local/pub && more beer > /dev/mouth

Česká dokumentace
Gentoo Server
Apprentice

Joined: 21 Jul 2003
Posts: 279

Posted: Mon Jan 30, 2006 9:21 am

I can confirm the ext3 failure:

copying to a raid5 (with dm_crypt)

high load

ext3 went read-only (filesystem needed recovery)

This happened while copying to a freshly ext3-formatted raid5 array, after about 200 GB.

I think it's a kernel bug, or ext3 can't survive high load.

I reformatted the same raid config with xfs and am now copying 800 GB.

So far 129 GB copied with no problems.
Zefiro
n00b

Joined: 19 Aug 2004
Posts: 8
Location: Karlsruhe / Germany

Posted: Mon Jan 30, 2006 2:05 pm

Gentoo Server wrote:
I can confirm ext3 failure

Thanks :)

So did my Google searches back then. Unfortunately the problem seems to exist, but rarely, so no solution was presented anywhere. Some ideological geeks even claimed it wasn't real ("I can't see an error, never have, so there is no error").

Gentoo Server wrote:
copy to a raid5 (with dm_crypt)
high load
ext3 went to readonly (filesystem needed recovery)

Yes, exactly my setup, plus LVM in between. Did you use LVM?

The other sources all seemed to have either LVM or softraid underneath (without dm_crypt), and one person spoke of having no problems with reiserfs. So I suspect it's the combination of softraid, and perhaps lvm, with ext3. Considering what I read, I doubt it has anything to do with dm_crypt. Oh, and yes - high load, always. In my case it was two concurrent disk writes (a copy from another hdd and one from the network), so low CPU, but high IO activity.

Gentoo Server wrote:
i think its a kernel bug or ext3 cant survive high load

Quite possible. Do you think ext2 works better? I chose ext3 for its long ext2 heritage and assumed it was stable enough. But on second thought I am considering non-journaling filesystems without a reserved root area (-m 0 on ext) to have maximum space for my files.
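
(For the record, the reserved-area setting I mean is just the -m option of the ext2/ext3 tools; the device name here is only an example:)

Code:

# create an ext2 filesystem with no reserved root blocks
mke2fs -m 0 /dev/mapper/crypta-library

# or drop the reserved blocks on an existing ext2/ext3 filesystem
tune2fs -m 0 /dev/mapper/crypta-library
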
jkt
Retired Dev

Joined: 06 Feb 2004
Posts: 1250
Location: Prague, Czech republic, EU

Posted: Mon Jan 30, 2006 3:09 pm

Have you folks submitted a bug?
_________________
cd /local/pub && more beer > /dev/mouth

Česká dokumentace
Gentoo Server
Apprentice

Joined: 21 Jul 2003
Posts: 279

Posted: Mon Jan 30, 2006 5:35 pm

Zefiro wrote:
Gentoo Server wrote:
I can confirm ext3 failure

Thanks :)

So did my googlesearches back then. Unfortunately it seems to be existant, but rare, so no solution was presented anywhere. Some ideological geeks even claimed it to be wrong ("I can't see an error, never have, so there is no error").

Gentoo Server wrote:
copy to a raid5 (with dm_crypt)
high load
ext3 went to readonly (filesystem needed recovery)

Yes, exactly my setup, plus LVM inbetween. Did you use LVM?

The other sources all seemed to have either LVM or softraid underneath (without dm_crypt), and one person spoke from having no problems with reiserfs. So I suspect it's the combination of softraid and perhaps lvm with ext3. Considering what I read I doubt it's something to do with dm_crypt. Oh, and yes - high load, always. In my case it were two concurrent disk write (copy from another hdd and from network), so low CPU, but high IO activity.

Gentoo Server wrote:
i think its a kernel bug or ext3 cant survive high load

Quite possible. Do you think ext2 works better? I chose ext3 for it's long ext2 experience and I assumed it was stable enough. But on second thought I considered using non-journaling, non-special-root-area filesystems (-m 0 on ext) to have maximum space for my files.



I don't think it's a rare bug.

Just create an 8-drive raid5 over dmcrypt,

ext3 with writeback,

then copy onto that ext3 from other sources at max speed (in my test from 2 drives).

I think the important part is a fully loaded system, which is pretty easy to achieve with dmcrypt.

I got that error pretty early on a new ext3 fs, without any crash or other problem.

The fs just broke.
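
Roughly what I did to trigger it (just a sketch - device names are examples, and by "writeback" I mean the data=writeback mount option):

Code:

# raid5 + dmcrypt device already set up as /dev/mapper/cryptraid
mkfs.ext3 /dev/mapper/cryptraid
mount -o data=writeback /dev/mapper/cryptraid /mnt/raid

# two copy streams from separate source disks to max out the IO load
cp -a /mnt/disk1/. /mnt/raid/disk1 &
cp -a /mnt/disk2/. /mnt/raid/disk2 &
wait
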


No, I didn't submit that bug.

I had a kernel bug with reiser 3.6 too, so I can only suggest to everyone: don't use reiser or ext3.

Now I have done the exact same copy onto an xfs filesystem, with higher speed and no problems so far (518 GB copied so far).

Ah yes, one extra piece of info: I copied from 2 single ext3 drives to that encrypted ext3 raid5 array
-> fs crash
jkt
Retired Dev

Joined: 06 Feb 2004
Posts: 1250
Location: Prague, Czech republic, EU

Posted: Mon Jan 30, 2006 5:47 pm

Gentoo Server wrote:
no i didnt submit that bug

Ah, I thought you guys wanted to have the issue fixed...
_________________
cd /local/pub && more beer > /dev/mouth

Česká dokumentace
Gentoo Server
Apprentice

Joined: 21 Jul 2003
Posts: 279

Posted: Mon Jan 30, 2006 7:03 pm

jkt wrote:
Gentoo Server wrote:
no i didnt submit that bug

Ah, I thought you guys wanted to have the issue fixed...


I fixed it easily by replacing ext3 with xfs.
jkt
Retired Dev

Joined: 06 Feb 2004
Posts: 1250
Location: Prague, Czech republic, EU

Posted: Mon Jan 30, 2006 7:09 pm

Gentoo Server wrote:
I fixed it easy by replacing ext3 with xfs

Nope, you haven't fixed it. It seems that something in the combination of device mapper/raid/whatever and the ext3 filesystem causes trouble. It might be something in the ext3 code, or something in the SW RAID that gets triggered by the way ext3 accesses data. You've just made a workaround for the real problem, so you can't call it a fix. The bug is still there.
_________________
cd /local/pub && more beer > /dev/mouth

Česká dokumentace
Gentoo Server
Apprentice

Joined: 21 Jul 2003
Posts: 279

Posted: Mon Jan 30, 2006 7:39 pm

jkt wrote:
Gentoo Server wrote:
I fixed it easy by replacing ext3 with xfs

Nope, you haven't fixed it. It seems that something in the combination of device mapper/raid/whatever and ext3 filesystem couses trouble. It might be something in the ext3 code or something in the SW RAID that gets triggered by the ext3's way it accesses data. You've just made a workaround to the real problem, so you can't call it a fix. The bug is still here.



Sure, it's just a fix for me, not a fix for ext3.

On the other hand, xfs works perfectly, and reiser3 gave me a kernel crash too on raid5/dm_crypt.
I think ext3 and reiser can't really handle high load.
Gentoo Server
Apprentice

Joined: 21 Jul 2003
Posts: 279

Posted: Mon Jan 30, 2006 7:43 pm

I found this!

Can I use ReiserFS with software RAID.
Not with raid5, our journaling and their raid code step on each
other in the buffering code. Also, you must use the mirror syncing
tools with the FS unmounted. Otherwise, yes, you may do striping and
concatenating and mirroring.

Software RAID users: Using any journaled FS on top of software raid
will result in data corruption right now. We are working with the
ext3 and software raid developers to fix some conflicts in buffer
cache usage.



OMG, it looks like reiser and ext3 crash with softraid5.
Gentoo Server
Apprentice

Joined: 21 Jul 2003
Posts: 279

Posted: Tue Jan 31, 2006 6:14 pm

I am finished now with my conversion.

The system is a P4-HT (smp),

8 HDDs in raid5, dmcrypt, no lvm.

ext3: fs failure after a short time
xfs: zero problems after 20h of 100% heavy load
reiser3: kernel crash
Zefiro
n00b

Joined: 19 Aug 2004
Posts: 8
Location: Karlsruhe / Germany

Posted: Tue Jan 31, 2006 7:17 pm

jkt: I haven't submitted a bug since I didn't know if it was one, or perhaps some fault on my side. So I first presented the problem here, in the hope someone could help me identify what was wrong, or whether it really was a bug. Also, I didn't know where exactly the bug was. I assumed all of softraid, lvm and ext3 to be quite stable, and from what I found it could have to do with any of them. So I didn't even know what to write in the bug report, or which program to associate it with.
(Your first post wasn't very helpful, either.)

Gentoo Server wrote:
i dont think its a rare bug

I didn't know. I found very little about this, and no answers, so I just assumed it was rather rare. If it's reproducible, that would be great, as that helps with finding and fixing it.

jkt wrote:
Ah, I thought you guys wanted to have the issue fixed...

Yes, we do. To be honest, I've never used a bug tracker before, so I wouldn't know what to include in a report. So I first posted here, with all the information I had. What would you suggest?

Gentoo Server wrote:
I found this!

Wow! Now, that's interesting reading.
It not only states that there is indeed a problem, but also that it is known and where it is (the buffering code). The way it is written, it sounds like it won't get fixed very fast. Though I must admit I do not really understand what the problem is. I thought every layer is separated from the others, treating the output of softraid just like any other device. So where is the problem? And how could I help fix it?

Please, can you give me a link to where you found this?

If it's really a problem with journaling filesystems, I think that is acceptable for me - for my current project I can live without journaling. But it's still a major pitfall for the unaware, as I had never read about this kind of problem before, not even while searching for it. It could be made much more explicit in the softraid or filesystem documentation (e.g. the man pages).
Zefiro
n00b

Joined: 19 Aug 2004
Posts: 8
Location: Karlsruhe / Germany

Posted: Tue Jan 31, 2006 8:23 pm

So, a bit of searching later, it seems that this 'journaling has problems with softraid' had indeed been an issue, but with kernel 2.2.x - we are at 2.6.x nowadays, so this information is outdated. It still seems to be a problem for us, though.

From the reiser faq on http://www.namesys.com/faq.html#raid:
Quote:
Can I use ReiserFS with the software RAID.
Yes, for all Raid levels using any Linux >= 2.4.1, but DO NOT use Raid5 with Linux 2.2.x. Our journaling and their Raid code step on each other in the buffering code. Also, mirroring is not safe in the 2.2.x kernels because online mirror rebuilds in 2.2.x break the write ordering requirements for the log. If you crash in the middle of an online rebuild, your meta-data may be corrupted. The only Raid level that is safe with ReiserFS in the 2.2.x kernels is the striping/concatenation level.


This mail also quotes "Umm, don't do that. 2.2's soft raid is incompatible with journaling of any form, and ext3 is no exception: this _will_ break.", but it is from 2001 and also talks about 2.2 kernels.

This mail is more current and states, for ext3, that the exact ordering of the writes is not an issue, as long as they get written correctly.

So as I see it, we have these statements/facts:
- there was a known issue with softraid and journaling filesystems in 2.2 kernels
- this shouldn't be a problem anymore with 2.4 and 2.6 kernels
- the problem was in the buffering code and thus in the write ordering, which shouldn't pose a problem for ext3
- if all layers are working correctly, it shouldn't matter at all whether softraid is used or not, or whether journaling is used or not (apart from performance and recovery possibilities)
- we still have - apparently somewhat reproducible - fs errors using softraid and ext3/reiser
- it is unclear whether this is a softraid, an ext3, a reiser or a general design problem, and thus unclear which layer should get fixed (and receive the bug report)

What would be best to do now?
jkt
Retired Dev

Joined: 06 Feb 2004
Posts: 1250
Location: Prague, Czech republic, EU

Posted: Tue Jan 31, 2006 9:03 pm

Zefiro wrote:
jkt: I haven't submitted a bug since I didn't know if it was one or perhaps some fault on my side. So I first presented the problem here, in the hope someone could help me identify what was wrong or if it really was a bug. Then, I didn't know where exactly the bug was. I assumed all of softraid, lvm, ext3 to be quite stable and from what I found it could have to do with all of it. So I didn't even know what to write in the bug report, or which program to associate it too.


I'd suggest either LKML or kernel bugzilla.

Quote:
(your first post wasn't quite helpful, too)

What would you expect if you mention that someone said "it would work on another distribution" while speaking about kernel issues?

Quote:
jkt wrote:
Ah, I thought you guys wanted to have the issue fixed...

Yes, we do. To be honest, I've never before used the bug tracker system, so I wouldn't know what to include in a report. So I first posted here, with all information I had. What would you suggest?

See above.

Quote:
Gentoo Server wrote:
I found this!

Wow! Now, that's interesting reading.
It does not only state that there is indeed a problem, but also that it is known and where it is (buffering code). Sounds like it won't get fixed too fast, the way it is written. Though I must admit I do not really understand what the problem is. I thought every layer is seperated from another, treating the output of softraid just like every other device. So where is this problem? And how could I help to fix it?

I personally wouldn't trust all the stuff written on the reiserfs/namesys homepage, as there was some outdated information there last time I checked.

Quote:
If it's really a problem with journaling fs I think this is acceptable for me - for my current project I can live without journaling. But still it's a major pitfall for the unaware, as I have never read about this kind of problem before and even while I searched for it. Could be made quite more explicit in the softraid or filesystem documentation (man page e.g.)

If it is really a bug, it's pretty serious, IMHO (I don't know anything about internal kernel workings, though)... I've talked to some folks from our kernel project and they said they'll look at the issue.
_________________
cd /local/pub && more beer > /dev/mouth

Česká dokumentace
Gentoo Server
Apprentice

Joined: 21 Jul 2003
Posts: 279

Posted: Fri Feb 03, 2006 11:54 pm

http://archives.free.net.ph/message/20060121.172151.1a49c5e6.en.html

Here is another report.
I can only suggest to everyone to drop ext3 until it's stable again.
miraage
n00b

Joined: 14 Oct 2002
Posts: 15

Posted: Sat Aug 19, 2006 3:18 pm

I'm now experiencing exactly the same issue, except this is RAID 1. It's pretty painful.

Aug 19 03:19:00 zayin kernel: [141017.386918] Aborting journal on device md1.
Aug 19 03:21:22 zayin kernel: [141159.355326] ext3_abort called.
Aug 19 03:21:22 zayin kernel: [141159.355350] EXT3-fs error (device md1): ext3_journal_start_sb: Detected aborted journal
Aug 19 03:21:22 zayin kernel: [141159.355389] Remounting filesystem read-only

I'll try to rebuild my /dev/md1 on reiserfs since that seems to work better. The devices

/dev/mapper/lvm--raid-home on /home type reiserfs (rw,noatime)
/dev/mapper/lvm--raid-media on /media type reiserfs (rw,noatime)
/dev/mapper/lvm--raid-opt on /opt type reiserfs (rw,noatime)
/dev/mapper/lvm--raid-usr on /usr type reiserfs (rw,noatime)
/dev/mapper/lvm--raid-var on /var type reiserfs (rw,noatime)

are all living on top of a RAID 1 array.
_________________
Visit my shameless plug!
Gentoo Server
Apprentice

Joined: 21 Jul 2003
Posts: 279

Posted: Sat Aug 19, 2006 5:59 pm

I had hard kernel crashes with reiser3 too.
Now I am on XFS, which looks pretty good as long as you don't have lots of HDD failures.
miraage
n00b

Joined: 14 Oct 2002
Posts: 15

Posted: Sat Aug 19, 2006 7:29 pm

Luckily, I've experienced no problems on reiserfs (yet?). I'm using the latest gentoo-sources-2.6.17.
_________________
Visit my shameless plug!
chojin
n00b

Joined: 26 Jun 2005
Posts: 42

Posted: Sun Oct 22, 2006 12:56 pm

I have exactly the same error.
I am running stable Gentoo 2006.0 with kernel 2.6.17-gentoo-r8 on an nforce2 chipset with a Silicon Image SATA controller. At first I had only one disk, holding my system and one Linux raid partition on which I created a degraded raid1 (so I could add another disk later which wasn't available at the time of creation). On it I put 2 LVM2 volumes, one for home and one for data, both formatted with ext3. This ran well and without any problems for a few months, until I was finally able to add the second disk to the degraded raid1 to make it clean.
The second disk's raid partition is equal in size to the raid partition on the first disk. The second disk had always performed well in my fileserver, so I did not fear any HW errors on that disk.
But after adding the second disk (configured with one Linux raid partition), I suddenly noticed my home partition going read-only. With fsck I did not have much luck either: the filesystem was severely corrupted and I lost a lot of files. After that I started working again, and suddenly the data partition became read-only, also with severe corruption and a lot of lost files.
This scenario repeated itself a few times, but since it had run well with the degraded raid before, I removed the second disk from the raid again. After that I still get filesystem corruption, but now only on my home partition and not the data partition, and it can always be fixed automatically by fsck now... but still, every day the home partition goes read-only, even without heavy load.
It's always only ext3 errors in the logs; the raid and lvm layers report no problem at all.
Someone on linuxquestions suggested my memory could be bad, but a memtest86 run of nearly 24h (61 passes) showed no errors...

I also have my fileserver configured with raid1 + lvm2 + ext3 on 2 (identical) SATA disks with a Silicon Image SATA controller, on which I haven't experienced any problems yet, even after resizing or heavy load. But now I'm scared of ever having to degrade it to replace an HD and then finding that data corruption comes up...
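
For reference, the degraded-then-completed raid1 procedure I describe above looks roughly like this (device names are examples, not my exact ones):

Code:

# create the raid1 with only one member; the second slot stays "missing"
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda3 missing

# months later: add the second disk's raid partition, the array then resyncs
mdadm /dev/md0 --add /dev/sdb3
cat /proc/mdstat    # watch the rebuild
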
Januszzz
Guru

Joined: 04 Feb 2006
Posts: 367
Location: Opole, Poland

Posted: Fri Dec 01, 2006 1:42 pm    Post subject: The same.

The same ->
kernel gentoo-sources-2.6.17-r8,
machine TI UltraSparc II (BlackBird)
config RAID 5 with three SCSI disks.

I went for xfs, without errors now.
mattdev121
n00b

Joined: 17 Jul 2006
Posts: 3

Posted: Wed Dec 06, 2006 3:35 am

I'm not sure if it's an actual failure yet, but I noticed my Raid1 array (two SATA drives, ext3) was marked as [faulty]. I'm running a data check on it now, and will run an fsck on it, but I hope I don't have to rebuild the system on a non-journaling fs.
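
(The data check and fsck I mean are roughly the following - /dev/md0 is just an example name for my array:)

Code:

# ask md to verify the mirror, then watch progress
echo check > /sys/block/md0/md/sync_action
cat /proc/mdstat

# afterwards, force a filesystem check (with the filesystem unmounted)
fsck.ext3 -f /dev/md0
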

My question is: if journaling FSes are what causes raid to corrupt the data, will compiling ext3 support OUT of the kernel (mounting the partition in ext2 fallback mode) solve the problem?

It may just be paranoia but it seems to fit what's been outlined here.
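
Something like this is what I have in mind (assuming /dev/md1 is the array - I don't know yet whether it actually avoids the problem):

Code:

# an ext3 filesystem with a clean journal can be mounted as plain ext2
fsck.ext3 -f /dev/md1
mount -t ext2 /dev/md1 /mnt/data

# or remove the journal for good, turning it back into ext2 (unmounted!)
tune2fs -O ^has_journal /dev/md1
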
JB318
n00b

Joined: 26 Apr 2005
Posts: 27
Location: Tulsa, Oklahoma

Posted: Tue Jan 02, 2007 5:58 am

I'm curious if this issue was caused by the data corruption bug that was just fixed:

http://kerneltrap.org/node/7518
_________________
"The life of every man is a diary, in which he means to write one story, and writes another."
-- _Cheers For Miss Bishop_ (1941)
Page 1 of 2