Gentoo Forums
Weird IO problem
therealjrd
Tux's lil' helper


Joined: 18 May 2006
Posts: 120

Posted: Mon Nov 12, 2012 6:49 pm    Post subject: Weird IO problem

Apols up front if this is not the right area; I couldn't quite figure out which forum was appropriate.

I've got a file server which I use for all sorts of things. It's an ASUS Eee Box running a hardened profile, with two 2TB USB disks in a RAID-1 configuration. Yes, I was after low power and cheap more than flat-out performance.

I found the issue when using this server as a repo for ISO images when provisioning oVirt. I'd launch my VMs, they'd get partway through booting and installing from the ISO, and then croak with disk errors. Validating the checksums of the ISOs revealed corruption. Downloading fresh copies didn't help.

After some experiments, I determined that I couldn't get a correct sha256sum for these images, either locally on the server or from any machine on my NFS network. Better yet, on whatever machine I ran sha256sum, if I caused the file cache to turn over, I could run sha256sum on the same file again and get a different result, including locally on the server.
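For anyone wanting to reproduce that test, here is a minimal sketch in Python rather than the exact commands I ran. It hashes the same file several times and forces the page cache to turn over between reads. The function names are made up for illustration, and the cache flush assumes a Linux box where root can write to /proc/sys/vm/drop_caches.

```python
import hashlib
import sys

def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 of the file at `path`, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def drop_page_cache():
    """Ask the kernel to drop clean cached pages (Linux only, needs root)."""
    with open("/proc/sys/vm/drop_caches", "w") as f:
        f.write("3\n")

def repeated_hashes(path, rounds=3, flush=None):
    """Hash `path` `rounds` times, calling `flush` (if given) before each re-read."""
    results = []
    for i in range(rounds):
        if flush is not None and i > 0:
            flush()
        results.append(sha256_of(path))
    return results

if __name__ == "__main__" and len(sys.argv) > 1:
    hashes = repeated_hashes(sys.argv[1], flush=drop_page_cache)
    for h in hashes:
        print(h)
    print("stable" if len(set(hashes)) == 1 else "MISMATCH between reads")
```

On a healthy read path every round prints the same digest; what I'm seeing corresponds to the MISMATCH case.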

I've run tests on smaller files, and those worked. I've run iozone tests against this server, both locally and over NFS, and those report no errors. I had a backup covering part of the contents of my homedir on this server, and all the files which were still valid checked out.

I'm a bit stumped. The symptom suggests that there's some kind of corruption happening in the read path from the RAID device, but only on pretty big files.

I've looked at the md stats and so on; nothing is reporting any errors that I can see.

What else should I try? Are there known issues with RAID-1 across USB disks (I couldn't find any)? Are there any known issues with hardened, i.e. should I switch back to a standard server profile?

Thanks in advance for ideas/hints.
BradN
Advocate


Joined: 19 Apr 2002
Posts: 2391
Location: Wisconsin (USA)

Posted: Fri Nov 16, 2012 11:39 pm

Ok, so the corruption appears to be happening on the server itself.

We've got some possibilities:

A wonky hard drive / USB adapter chip
Data corruption within the USB interface
Data corruption between USB <-> RAM
RAM corruption (though if it were this bad you probably wouldn't be running many programs successfully)
CPU failure (again, unlikely)

OR....

RAID volume corruption (for instance, different data in each drive)
Filesystem corruption
Filesystem driver not working properly

One piece of advice: if you're overclocking the system, be careful that it doesn't result in an overclocked USB interface as well - some motherboards use shared clocks for these.

Let's narrow down whether our problem is before or after the "OR....": try md5summing each drive individually, twice, and check whether you get the same result on both trials.

If the drives seem to be returning correct info that way, try upping the ante a bit and md5 both drives at the same time. This could bring out potential DMA sharing problems or similar. If it's still good, chances are your hardware is working properly.
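The two checks above could be sketched roughly like this in Python. Treat it as an illustration under assumptions rather than a tested recipe: the function names are invented, the device paths you pass in (something like /dev/sdb and /dev/sdc) are hypothetical, you need root to read raw devices, and the array has to be stopped or idle so the contents aren't changing between reads. Each drive is only compared against itself, since the two members won't necessarily hash the same as whole devices.

```python
import hashlib
import sys
from concurrent.futures import ThreadPoolExecutor

def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 of a file or block device, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def hash_sequentially(paths, trials=2):
    """Hash each path `trials` times, one at a time. Returns {path: [digests]}."""
    return {p: [md5_of(p) for _ in range(trials)] for p in paths}

def hash_concurrently(paths):
    """Hash all paths at the same time, one thread per path."""
    with ThreadPoolExecutor(max_workers=len(paths)) as pool:
        return dict(zip(paths, pool.map(md5_of, paths)))

if __name__ == "__main__" and len(sys.argv) > 1:
    devices = sys.argv[1:]  # hypothetical device names supplied by the user
    solo = hash_sequentially(devices)      # first test: each drive alone, twice
    together = hash_concurrently(devices)  # second test: all drives at once
    for dev in devices:
        ok = len(set(solo[dev])) == 1 and together[dev] == solo[dev][0]
        print(dev, "consistent" if ok else "INCONSISTENT", solo[dev], together[dev])
```

If a drive is consistent on its own but inconsistent when both are read together, that points toward the shared USB/DMA path rather than the individual disk.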

Then, you would want to look at possible raid volume corruption. I can't remember the exact details, but one thing to try is bringing up (in READ ONLY MODE) the raid using only one drive (hint: use keyword 'missing' to specify the nonexistent drive), and then bringing up another raid (again, READ ONLY MODE) using the other drive. Now you have two raids running which should have the same data contents. md5 the two md devices and see if they're the same. If they're not the same, md5 them again to see if something weird is happening with the raid layer added in.
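As a variation on md5summing the two md devices, a direct chunk-by-chunk comparison also tells you where they differ, not just that they do. A minimal Python sketch, assuming both single-drive raids are already up read-only; the function name and the paths you pass in are hypothetical, and this does not assemble the arrays for you:

```python
import sys

def find_mismatches(path_a, path_b, chunk_size=1 << 20, limit=10):
    """Return start offsets of up to `limit` chunks that differ between two devices."""
    mismatches = []
    offset = 0
    with open(path_a, "rb") as a, open(path_b, "rb") as b:
        while len(mismatches) < limit:
            block_a = a.read(chunk_size)
            block_b = b.read(chunk_size)
            if not block_a and not block_b:
                break  # both reached end of device
            if block_a != block_b:
                mismatches.append(offset)  # also catches a length difference
            offset += chunk_size
    return mismatches

if __name__ == "__main__" and len(sys.argv) > 2:
    bad = find_mismatches(sys.argv[1], sys.argv[2])
    if bad:
        print("first differing chunk offsets (bytes):", bad)
    else:
        print("contents identical")
```

Run it twice: if the reported offsets change between runs, the reads themselves are unstable; if they stay put, the two halves of the mirror really do hold different data.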

If the contents are the same, everything seems good thus far and it must be a filesystem-related problem - use whatever error checking is available for that filesystem and look for posts about data corruption with it.

If your data access to the usb devices seems okay, but it's wonking out when raid is involved, try a kernel upgrade (your filesystem is probably hosed and you'll have all kinds of problems... I'd recommend recreating the whole raid in that case).
All times are GMT