Gentoo Forums
ReiserFS and 2TB disk

 
binro
l33t

Joined: 06 May 2005
Posts: 724
Location: Bangkok, Thailand

Posted: Fri Dec 07, 2012 12:02 pm    Post subject: ReiserFS and 2TB disk

Two weeks ago I upgraded two ageing disks to a single 2TB Seagate (ST2000DM001-9YN164). I use LVM and formatted the LVs with ReiserFS. In particular, the /home partition is 1TB. Having restored my system, everything looked fine, but when I returned after several hours the KDE desktop would not wake up properly. Switching to a console, neither the sync nor the umount command would complete; they just hung. This happened a couple of times, so I thought the backup might have been a bit corrupt, completely reinstalled @system and @world, and built the latest kernel, 3.6.8. Returning last night, I found the same thing had occurred; looking at htop from a console I could see lots of identical processes that had been started and had just hung. In the syslog I could see kernel messages relating to hung tasks:

    Dec 7 00:39:49 opal kernel: INFO: task apache2:28115 blocked for more than 120 seconds.
    Dec 7 00:39:49 opal kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    Dec 7 00:39:49 opal kernel: apache2 D 0000000000000000 0 28115 10408 0x00000000
    Dec 7 00:39:49 opal kernel: ffff88012f27d980 0000000000000086 0000000000000002 ffffffff815b7420
    Dec 7 00:39:49 opal kernel: 0000000000011280 ffff8800b6727fd8 0000000000011280 ffff8800b6726010
    Dec 7 00:39:49 opal kernel: ffff8800b6727fd8 0000000000011280 ffff88012f27d980 0000000000011280
    Dec 7 00:39:49 opal kernel: Call Trace:
    Dec 7 00:39:49 opal kernel: [<ffffffff8106ac92>] ? load_balance+0x102/0x790
    Dec 7 00:39:49 opal kernel: [<ffffffff8107a609>] ? debug_mutex_add_waiter+0x29/0x70
    Dec 7 00:39:49 opal kernel: [<ffffffff814312cf>] ? __mutex_lock_slowpath+0x22f/0x310
    Dec 7 00:39:49 opal kernel: [<ffffffff8102c455>] ? default_spin_lock_flags+0x5/0x10
    Dec 7 00:39:49 opal kernel: [<ffffffff8143401b>] ? _raw_spin_lock_irqsave+0x3b/0x60
    Dec 7 00:39:49 opal kernel: [<ffffffff8118cd81>] ? queue_log_writer+0x91/0xe0
    Dec 7 00:39:49 opal kernel: [<ffffffff81066a80>] ? try_to_wake_up+0x2b0/0x2b0
    Dec 7 00:39:49 opal kernel: [<ffffffff81192a18>] ? do_journal_begin_r+0x238/0x380
    Dec 7 00:39:49 opal kernel: [<ffffffff81192bef>] ? journal_begin+0x8f/0x170
    Dec 7 00:39:49 opal kernel: [<ffffffff81173e49>] ? reiserfs_create+0xf9/0x260
    Dec 7 00:39:49 opal kernel: [<ffffffff8110ab1f>] ? generic_permission+0xff/0x240
    Dec 7 00:39:49 opal kernel: [<ffffffff8110ce29>] ? vfs_create+0xb9/0x110
    Dec 7 00:39:49 opal kernel: [<ffffffff8110e1c2>] ? do_last+0x9b2/0xe70
    Dec 7 00:39:49 opal kernel: [<ffffffff810c57b0>] ? release_pages+0x180/0x1d0
    Dec 7 00:39:49 opal kernel: [<ffffffff8110e741>] ? path_openat+0xc1/0x500
    Dec 7 00:39:49 opal kernel: [<ffffffff8110ecad>] ? do_filp_open+0x4d/0xc0
    Dec 7 00:39:49 opal kernel: [<ffffffff81433cf5>] ? _raw_spin_unlock+0x15/0x40
    Dec 7 00:39:49 opal kernel: [<ffffffff8111b686>] ? alloc_fd+0x106/0x130
    Dec 7 00:39:49 opal kernel: [<ffffffff810fd2e8>] ? do_sys_open+0x108/0x1f0
    Dec 7 00:39:49 opal kernel: [<ffffffff81434a39>] ? system_call_fastpath+0x16/0x1b

Eventually the system just hangs completely. Since this started with the new disk, I am wondering if ReiserFS actually works with new, huge disks. If not, what else could be causing this? This is a bit desperate. :(

TIA
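
A quick way to summarise a trace like the one above is to pull out the blocked task names and the functions on the stack. This is only a minimal sketch in Python, assuming syslog lines in the format shown in this post; the helper names `hung_tasks` and `trace_functions` are made up for illustration.

```python
import re
from collections import Counter

# Matches the hung-task warning line, e.g. "INFO: task apache2:28115 blocked for more than 120 seconds."
HUNG = re.compile(r"INFO: task (\S+):(\d+) blocked for more than (\d+) seconds")
# Matches one stack frame, e.g. "[<ffffffff81192bef>] ? journal_begin+0x8f/0x170"
FRAME = re.compile(r"\[<[0-9a-f]+>\]\s+(?:\?\s+)?([\w.]+)\+0x")

def hung_tasks(log_text):
    """Count hung-task warnings per task name."""
    return Counter(m.group(1) for m in HUNG.finditer(log_text))

def trace_functions(log_text):
    """Return the function names of all stack frames, in log order."""
    return FRAME.findall(log_text)

# Three lines copied from the log in this post.
SAMPLE = (
    "Dec 7 00:39:49 opal kernel: INFO: task apache2:28115 blocked for more than 120 seconds.\n"
    "Dec 7 00:39:49 opal kernel: [<ffffffff81192bef>] ? journal_begin+0x8f/0x170\n"
    "Dec 7 00:39:49 opal kernel: [<ffffffff81173e49>] ? reiserfs_create+0xf9/0x260\n"
)

print(hung_tasks(SAMPLE))
print(trace_functions(SAMPLE))
```

Frames such as journal_begin and reiserfs_create show the task was waiting inside the ReiserFS journal code, which says where it blocked but not why.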
_________________
"Ship me somewheres east of Suez, where the best is like the worst,
Where there ain't no Ten Commandments an' a man can raise a thirst"
from "Mandalay" by Rudyard Kipling

Merlin-TC
l33t

Joined: 16 May 2003
Posts: 603
Location: Germany

Posted: Fri Dec 07, 2012 2:58 pm

Sawadee Binro,

ReiserFS doesn't have any problems with volumes up to 16 TB, so I doubt ReiserFS itself is the problem.

1. Is there any additional output of dmesg?
2. Can you reproduce it or does it feel "random"?
3. Is the system under heavy load when this is happening?

You could try another io scheduler just to narrow down the problem.
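
For reference, the active I/O scheduler of a block device is exposed in sysfs at /sys/block/<device>/queue/scheduler, with the active one shown in square brackets. The sketch below (Python, read-only) parses that line; the sample string and the helper names are assumptions for illustration, and the scheduler names available depend on how the kernel was built.

```python
from pathlib import Path

def parse_schedulers(text):
    """Split a sysfs scheduler line such as 'noop deadline [cfq]'
    into (active, available)."""
    active = None
    available = []
    for name in text.split():
        if name.startswith("[") and name.endswith("]"):
            name = name[1:-1]
            active = name
        available.append(name)
    return active, available

def current_scheduler(device="sda"):
    """Read the scheduler line for one device (the device name is an assumption)."""
    path = Path("/sys/block") / device / "queue" / "scheduler"
    return parse_schedulers(path.read_text())

# Illustrative input only; run current_scheduler() on the real machine instead.
print(parse_schedulers("noop deadline [cfq]"))
```

Switching is done as root by writing one of the listed names into the same file, and the change lasts until reboot.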

srs5694
Guru

Joined: 08 Mar 2004
Posts: 434
Location: Woonsocket, RI

Posted: Fri Dec 07, 2012 3:52 pm

You might also run a SMART utility like GSmartControl, the SMART functions of Palimpsest, or smartctl. These will tell you if you've got a new disk that's defective. (Sadly, it happens sometimes.) The output can be difficult to interpret sometimes, though, so post for help interpreting the output if you need it.

binro
Posted: Fri Dec 07, 2012 4:26 pm

Merlin-TC wrote:
Sawadee Binro,

ReiserFS doesn't have any problems with volumes up to 16 TB, so I doubt ReiserFS itself is the problem.

1. Is there any additional output of dmesg?
2. Can you reproduce it or does it feel "random"?
3. Is the system under heavy load when this is happening?

You could try another io scheduler just to narrow down the problem.

I examined the syslog and everything looks normal; there is no unusual load. It is not random, but inevitable. I am beginning to suspect it is caused by the graphics, the nvidia driver or KDE in some way, since the system is stable if I don't log on. But this never happened before I changed the disk.

Khawp khun khrup!

binro
Posted: Fri Dec 07, 2012 4:27 pm

srs5694 wrote:
You might also run a SMART utility like GSmartControl, the SMART functions of Palimpsest, or smartctl. These will tell you if you've got a new disk that's defective. (Sadly, it happens sometimes.) The output can be difficult to interpret sometimes, though, so post for help interpreting the output if you need it.

The smartd daemon is running and reports the disk to be entirely healthy!

binro
Posted: Sat Dec 08, 2012 4:20 pm

This gets stranger and stranger. I disabled the screen-saver and now the system is stable again! A screen-saver wouldn't interfere with process execution, would it?

srs5694
Posted: Sat Dec 08, 2012 4:34 pm

binro wrote:
This gets stranger and stranger. I disabled the screen-saver and now the system is stable again! A screen-saver wouldn't interfere with process execution, would it?


It might, especially if it uses an advanced video feature and if that feature has a buggy implementation in a video driver. Video drivers are increasingly relying on kernel-level code, and then all bets are off; a buggy kernel driver could interfere with just about anything.

Thus, you might try upgrading your video driver, if possible, or switch drivers (from Nvidia's proprietary driver to nouveau or vice-versa, for instance). If that's too much hassle or otherwise impractical, try adjusting your screen saver to use just one module that does the simplest thing possible -- ideally just blank the screen. You'll do without the eye candy that way, but that's better than having a system that hangs randomly.

binro
Posted: Sat Dec 08, 2012 8:51 pm

I was thinking along the same lines, except that before the restore onto the new disk this all worked perfectly. I can't help thinking that something in my system has been subtly corrupted.

srs5694
Posted: Sun Dec 09, 2012 12:49 am

How did you transfer your system to the new disk? (dd, tar, etc.?) It could be there's a malfunction in the video drivers that's related to a subtle permission problem introduced in the transfer; or maybe a bit or two got flipped during the copying. If you've still got the original disk, you could plug it in and write a script to compare every file between the two systems.
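
Such a comparison script could look roughly like the following. It is a minimal Python sketch, assuming both systems are mounted somewhere (the mount points in the usage note are hypothetical); it checks only regular files, their permission bits and their contents, not ownership, symlinks or extended attributes.

```python
import hashlib
import os

def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file, read in chunks so large files fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def compare_trees(old_root, new_root):
    """Return sorted (relative_path, problem) pairs for regular files under
    old_root that are missing, have different mode bits, or differ in content
    under new_root."""
    problems = []
    for dirpath, _dirnames, filenames in os.walk(old_root):
        for name in filenames:
            old_path = os.path.join(dirpath, name)
            # Skip symlinks and special files; only regular files are compared.
            if os.path.islink(old_path) or not os.path.isfile(old_path):
                continue
            rel = os.path.relpath(old_path, old_root)
            new_path = os.path.join(new_root, rel)
            if not os.path.isfile(new_path):
                problems.append((rel, "missing"))
            elif os.stat(old_path).st_mode != os.stat(new_path).st_mode:
                problems.append((rel, "mode differs"))
            elif file_digest(old_path) != file_digest(new_path):
                problems.append((rel, "content differs"))
    return sorted(problems)
```

Usage would be along the lines of `compare_trees("/mnt/old", "/mnt/new")`, run as root so that every file is readable. Files that legitimately changed since the transfer (logs, caches) will show up as differences too.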

salahx
Guru

Joined: 12 Mar 2005
Posts: 530

Posted: Sun Dec 09, 2012 1:19 am

Actually, looking at the stack trace and the explanation of symptoms, this could be a genuine bug. It sounds like there is a race condition in reiserfs that's causing a deadlock. The screen saver is innocent in this matter; it just happens to widen the window in which the race can occur.

It may be worth recompiling the kernel with CONFIG_PROVE_LOCKING=y
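
To confirm that the option actually made it into the running kernel, the config can be read back from /proc/config.gz, which only exists if the kernel was built with CONFIG_IKCONFIG_PROC. A minimal Python sketch; the function names and the sample text are assumptions for illustration.

```python
import gzip

def option_state(config_text, name):
    """Return the value of a kernel config option ('y', 'm', ...),
    'n' if it is explicitly not set, or None if it is not mentioned."""
    for line in config_text.splitlines():
        line = line.strip()
        if line.startswith(name + "="):
            return line.split("=", 1)[1]
        if line == "# " + name + " is not set":
            return "n"
    return None

def running_kernel_option(name, path="/proc/config.gz"):
    """Look the option up in the running kernel's embedded config."""
    with gzip.open(path, "rt") as f:
        return option_state(f.read(), name)

# Hypothetical config fragment, not taken from the poster's system.
SAMPLE = "CONFIG_PROVE_LOCKING=y\n# CONFIG_DEBUG_LOCK_ALLOC is not set\n"
print(option_state(SAMPLE, "CONFIG_PROVE_LOCKING"))
```

The same `option_state` function also works on the text of a `.config` file in the kernel source tree.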

binro
Posted: Sat Jan 19, 2013 1:08 pm

srs5694 wrote:
How did you transfer your system to the new disk? (dd, tar, etc.?) It could be there's a malfunction in the video drivers that's related to a subtle permission problem introduced in the transfer; or maybe a bit or two got flipped during the copying. If you've still got the original disk, you could plug it in and write a script to compare every file between the two systems.

The system is backed up using dar, which is a sound utility and checks the backup against the original disk every time.

binro
Posted: Sat Jan 19, 2013 1:10 pm

salahx wrote:
Actually, looking at the stack trace and the explanation of symptoms, this could be a genuine bug. It sounds like there is a race condition in reiserfs that's causing a deadlock. The screen saver is innocent in this matter; it just happens to widen the window in which the race can occur.

It may be worth recompiling the kernel with CONFIG_PROVE_LOCKING=y

Thanks, I will try that.

binro
Posted: Mon Feb 18, 2013 11:37 am

I am back looking at this again. The lock proving idea did not work because the kernel disabled it when the evil NVidia binary module tainted the kernel! I am now seeing this in the logging:

    Feb 18 06:04:00 opal kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
    Feb 18 06:04:00 opal kernel: ata1.00: BMDMA stat 0x25
    Feb 18 06:04:00 opal kernel: ata1.00: failed command: READ DMA EXT
    Feb 18 06:04:00 opal kernel: ata1.00: cmd 25/00:18:f8:7b:57/00:00:93:00:00/e0 tag 0 dma 12288 in
    Feb 18 06:04:00 opal kernel: res 51/40:00:f8:7b:57/40:00:93:00:00/00 Emask 0x9 (media error)
    Feb 18 06:04:00 opal kernel: ata1.00: status: { DRDY ERR }
    Feb 18 06:04:00 opal kernel: ata1.00: error: { UNC }
    Feb 18 06:04:03 opal kernel: ata1.00: configured for UDMA/133
    Feb 18 06:04:03 opal kernel: sd 0:0:0:0: [sda] Unhandled sense code
    Feb 18 06:04:03 opal kernel: sd 0:0:0:0: [sda]
    Feb 18 06:04:03 opal kernel: Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
    Feb 18 06:04:03 opal kernel: sd 0:0:0:0: [sda]
    Feb 18 06:04:03 opal kernel: Sense Key : Medium Error [current] [descriptor]
    Feb 18 06:04:03 opal kernel: Descriptor sense data with sense descriptors (in hex):
    Feb 18 06:04:03 opal kernel: 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
    Feb 18 06:04:03 opal kernel: 93 57 7b f8
    Feb 18 06:04:03 opal kernel: sd 0:0:0:0: [sda]
    Feb 18 06:04:03 opal kernel: Add. Sense: Unrecovered read error - auto reallocate failed
    Feb 18 06:04:03 opal kernel: sd 0:0:0:0: [sda] CDB:
    Feb 18 06:04:03 opal kernel: Read(10): 28 00 93 57 7b f8 00 00 18 00
    Feb 18 06:04:03 opal kernel: end_request: I/O error, dev sda, sector 2471984120
    Feb 18 06:04:03 opal kernel: ata1: EH complete
    Feb 18 06:04:03 opal kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
    Feb 18 06:04:03 opal kernel: ata1.00: BMDMA stat 0x25
    Feb 18 06:04:03 opal kernel: ata1.00: failed command: READ DMA EXT
    Feb 18 06:04:03 opal kernel: ata1.00: cmd 25/00:08:f8:7b:57/00:00:93:00:00/e0 tag 0 dma 4096 in
    Feb 18 06:04:03 opal kernel: res 51/40:00:f8:7b:57/40:00:93:00:00/00 Emask 0x9 (media error)
    Feb 18 06:04:03 opal kernel: ata1.00: status: { DRDY ERR }
    Feb 18 06:04:03 opal kernel: ata1.00: error: { UNC }
    Feb 18 06:04:03 opal kernel: ata1.00: configured for UDMA/133
    Feb 18 06:04:03 opal kernel: sd 0:0:0:0: [sda] Unhandled sense code
    Feb 18 06:04:03 opal kernel: sd 0:0:0:0: [sda]
    Feb 18 06:04:03 opal kernel: Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
    Feb 18 06:04:03 opal kernel: sd 0:0:0:0: [sda]
    Feb 18 06:04:03 opal kernel: Sense Key : Medium Error [current] [descriptor]
    Feb 18 06:04:03 opal kernel: Descriptor sense data with sense descriptors (in hex):
    Feb 18 06:04:03 opal kernel: 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
    Feb 18 06:04:03 opal kernel: 93 57 7b f8
    Feb 18 06:04:03 opal kernel: sd 0:0:0:0: [sda]
    Feb 18 06:04:03 opal kernel: Add. Sense: Unrecovered read error - auto reallocate failed
    Feb 18 06:04:03 opal kernel: sd 0:0:0:0: [sda] CDB:
    Feb 18 06:04:03 opal kernel: Read(10): 28 00 93 57 7b f8 00 00 08 00
    Feb 18 06:04:03 opal kernel: end_request: I/O error, dev sda, sector 2471984120
    Feb 18 06:04:03 opal kernel: Buffer I/O error on device dm-3, logical block 9603455
    Feb 18 06:04:03 opal kernel: ata1: EH complete
    Feb 18 06:04:07 opal kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
    Feb 18 06:04:07 opal kernel: ata1.00: BMDMA stat 0x25
    Feb 18 06:04:07 opal kernel: ata1.00: failed command: READ DMA EXT
    Feb 18 06:04:07 opal kernel: ata1.00: cmd 25/00:10:20:7c:57/00:00:93:00:00/e0 tag 0 dma 8192 in
    Feb 18 06:04:07 opal kernel: res 51/40:00:20:7c:57/40:00:93:00:00/00 Emask 0x9 (media error)
    Feb 18 06:04:07 opal kernel: ata1.00: status: { DRDY ERR }
    Feb 18 06:04:07 opal kernel: ata1.00: error: { UNC }
    Feb 18 06:04:07 opal kernel: ata1.00: configured for UDMA/133
    Feb 18 06:04:07 opal kernel: sd 0:0:0:0: [sda] Unhandled sense code
    Feb 18 06:04:07 opal kernel: sd 0:0:0:0: [sda]
    Feb 18 06:04:07 opal kernel: Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
    Feb 18 06:04:07 opal kernel: sd 0:0:0:0: [sda]
    Feb 18 06:04:07 opal kernel: Sense Key : Medium Error [current] [descriptor]
    Feb 18 06:04:07 opal kernel: Descriptor sense data with sense descriptors (in hex):
    Feb 18 06:04:07 opal kernel: 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
    Feb 18 06:04:07 opal kernel: 93 57 7c 20
    Feb 18 06:04:07 opal kernel: sd 0:0:0:0: [sda]
    Feb 18 06:04:07 opal kernel: Add. Sense: Unrecovered read error - auto reallocate failed
    Feb 18 06:04:07 opal kernel: sd 0:0:0:0: [sda] CDB:
    Feb 18 06:04:07 opal kernel: Read(10): 28 00 93 57 7c 20 00 00 10 00
    Feb 18 06:04:07 opal kernel: end_request: I/O error, dev sda, sector 2471984160
    Feb 18 06:04:07 opal kernel: ata1: EH complete
    Feb 18 06:04:10 opal kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
    Feb 18 06:04:10 opal kernel: ata1.00: BMDMA stat 0x25
    Feb 18 06:04:10 opal kernel: ata1.00: failed command: READ DMA EXT
    Feb 18 06:04:10 opal kernel: ata1.00: cmd 25/00:08:20:7c:57/00:00:93:00:00/e0 tag 0 dma 4096 in
    Feb 18 06:04:10 opal kernel: res 51/40:00:20:7c:57/40:00:93:00:00/00 Emask 0x9 (media error)
    Feb 18 06:04:10 opal kernel: ata1.00: status: { DRDY ERR }
    Feb 18 06:04:10 opal kernel: ata1.00: error: { UNC }
    Feb 18 06:04:10 opal kernel: ata1.00: configured for UDMA/133
    Feb 18 06:04:10 opal kernel: sd 0:0:0:0: [sda] Unhandled sense code
    Feb 18 06:04:10 opal kernel: sd 0:0:0:0: [sda]
    Feb 18 06:04:10 opal kernel: Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
    Feb 18 06:04:10 opal kernel: sd 0:0:0:0: [sda]
    Feb 18 06:04:10 opal kernel: Sense Key : Medium Error [current] [descriptor]
    Feb 18 06:04:10 opal kernel: Descriptor sense data with sense descriptors (in hex):
    Feb 18 06:04:10 opal kernel: 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
    Feb 18 06:04:10 opal kernel: 93 57 7c 20
    Feb 18 06:04:10 opal kernel: sd 0:0:0:0: [sda]
    Feb 18 06:04:10 opal kernel: Add. Sense: Unrecovered read error - auto reallocate failed
    Feb 18 06:04:10 opal kernel: sd 0:0:0:0: [sda] CDB:
    Feb 18 06:04:10 opal kernel: Read(10): 28 00 93 57 7c 20 00 00 08 00
    Feb 18 06:04:10 opal kernel: end_request: I/O error, dev sda, sector 2471984160
    Feb 18 06:04:10 opal kernel: Buffer I/O error on device dm-3, logical block 9603460
    Feb 18 06:04:10 opal kernel: ata1: EH complete
    Feb 18 06:04:13 opal kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
    Feb 18 06:04:13 opal kernel: ata1.00: BMDMA stat 0x25
    Feb 18 06:04:13 opal kernel: ata1.00: failed command: READ DMA EXT
    Feb 18 06:04:13 opal kernel: ata1.00: cmd 25/00:20:48:7c:57/00:00:93:00:00/e0 tag 0 dma 16384 in
    Feb 18 06:04:13 opal kernel: res 51/40:00:48:7c:57/40:00:93:00:00/00 Emask 0x9 (media error)
    Feb 18 06:04:13 opal kernel: ata1.00: status: { DRDY ERR }
    Feb 18 06:04:13 opal kernel: ata1.00: error: { UNC }
    Feb 18 06:04:13 opal kernel: ata1.00: configured for UDMA/133
    Feb 18 06:04:13 opal kernel: sd 0:0:0:0: [sda] Unhandled sense code
    Feb 18 06:04:13 opal kernel: sd 0:0:0:0: [sda]
    Feb 18 06:04:13 opal kernel: Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
    Feb 18 06:04:13 opal kernel: sd 0:0:0:0: [sda]
    Feb 18 06:04:13 opal kernel: Sense Key : Medium Error [current] [descriptor]
    Feb 18 06:04:13 opal kernel: Descriptor sense data with sense descriptors (in hex):
    Feb 18 06:04:13 opal kernel: 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
    Feb 18 06:04:13 opal kernel: 93 57 7c 48
    Feb 18 06:04:13 opal kernel: sd 0:0:0:0: [sda]
    Feb 18 06:04:13 opal kernel: Add. Sense: Unrecovered read error - auto reallocate failed
    Feb 18 06:04:13 opal kernel: sd 0:0:0:0: [sda] CDB:
    Feb 18 06:04:13 opal kernel: Read(10): 28 00 93 57 7c 48 00 00 20 00
    Feb 18 06:04:13 opal kernel: end_request: I/O error, dev sda, sector 2471984200
    Feb 18 06:04:13 opal kernel: ata1: EH complete
    Feb 18 06:04:16 opal kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
    Feb 18 06:04:16 opal kernel: ata1.00: BMDMA stat 0x25
    Feb 18 06:04:16 opal kernel: ata1.00: failed command: READ DMA EXT
    Feb 18 06:04:16 opal kernel: ata1.00: cmd 25/00:08:48:7c:57/00:00:93:00:00/e0 tag 0 dma 4096 in
    Feb 18 06:04:16 opal kernel: res 51/40:00:48:7c:57/40:00:93:00:00/00 Emask 0x9 (media error)
    Feb 18 06:04:16 opal kernel: ata1.00: status: { DRDY ERR }
    Feb 18 06:04:16 opal kernel: ata1.00: error: { UNC }
    Feb 18 06:04:16 opal kernel: ata1.00: configured for UDMA/133
    Feb 18 06:04:16 opal kernel: sd 0:0:0:0: [sda] Unhandled sense code
    Feb 18 06:04:16 opal kernel: sd 0:0:0:0: [sda]
    Feb 18 06:04:16 opal kernel: Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
    Feb 18 06:04:16 opal kernel: sd 0:0:0:0: [sda]
    Feb 18 06:04:16 opal kernel: Sense Key : Medium Error [current] [descriptor]
    Feb 18 06:04:16 opal kernel: Descriptor sense data with sense descriptors (in hex):
    Feb 18 06:04:16 opal kernel: 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
    Feb 18 06:04:16 opal kernel: 93 57 7c 48
    Feb 18 06:04:16 opal kernel: sd 0:0:0:0: [sda]
    Feb 18 06:04:16 opal kernel: Add. Sense: Unrecovered read error - auto reallocate failed
    Feb 18 06:04:16 opal kernel: sd 0:0:0:0: [sda] CDB:
    Feb 18 06:04:16 opal kernel: Read(10): 28 00 93 57 7c 48 00 00 08 00
    Feb 18 06:04:16 opal kernel: end_request: I/O error, dev sda, sector 2471984200
    Feb 18 06:04:16 opal kernel: Buffer I/O error on device dm-3, logical block 9603465
    Feb 18 06:04:16 opal kernel: ata1: EH complete
    Feb 18 06:04:26 opal kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
    Feb 18 06:04:26 opal kernel: ata1.00: BMDMA stat 0x25
    Feb 18 06:04:26 opal kernel: ata1.00: failed command: READ DMA EXT
    Feb 18 06:04:26 opal kernel: ata1.00: cmd 25/00:08:60:7b:57/00:00:93:00:00/e0 tag 0 dma 4096 in
    Feb 18 06:04:26 opal kernel: res 51/40:00:60:7b:57/40:00:93:00:00/00 Emask 0x9 (media error)
    Feb 18 06:04:26 opal kernel: ata1.00: status: { DRDY ERR }
    Feb 18 06:04:26 opal kernel: ata1.00: error: { UNC }
    Feb 18 06:04:26 opal kernel: ata1.00: configured for UDMA/133
    Feb 18 06:04:26 opal kernel: sd 0:0:0:0: [sda] Unhandled sense code
    Feb 18 06:04:26 opal kernel: sd 0:0:0:0: [sda]
    Feb 18 06:04:26 opal kernel: Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
    Feb 18 06:04:26 opal kernel: sd 0:0:0:0: [sda]
    Feb 18 06:04:26 opal kernel: Sense Key : Medium Error [current] [descriptor]
    Feb 18 06:04:26 opal kernel: Descriptor sense data with sense descriptors (in hex):
    Feb 18 06:04:26 opal kernel: 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
    Feb 18 06:04:26 opal kernel: 93 57 7b 60
    Feb 18 06:04:26 opal kernel: sd 0:0:0:0: [sda]
    Feb 18 06:04:26 opal kernel: Add. Sense: Unrecovered read error - auto reallocate failed
    Feb 18 06:04:26 opal kernel: sd 0:0:0:0: [sda] CDB:
    Feb 18 06:04:26 opal kernel: Read(10): 28 00 93 57 7b 60 00 00 08 00
    Feb 18 06:04:26 opal kernel: end_request: I/O error, dev sda, sector 2471983968
    Feb 18 06:04:26 opal kernel: ata1: EH complete
    Feb 18 06:04:29 opal kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
    Feb 18 06:04:29 opal kernel: ata1.00: BMDMA stat 0x25
    Feb 18 06:04:29 opal kernel: ata1.00: failed command: READ DMA EXT
    Feb 18 06:04:29 opal kernel: ata1.00: cmd 25/00:08:60:7b:57/00:00:93:00:00/e0 tag 0 dma 4096 in
    Feb 18 06:04:29 opal kernel: res 51/40:00:60:7b:57/40:00:93:00:00/00 Emask 0x9 (media error)
    Feb 18 06:04:29 opal kernel: ata1.00: status: { DRDY ERR }
    Feb 18 06:04:29 opal kernel: ata1.00: error: { UNC }
    Feb 18 06:04:29 opal kernel: ata1.00: configured for UDMA/133
    Feb 18 06:04:29 opal kernel: sd 0:0:0:0: [sda] Unhandled sense code
    Feb 18 06:04:29 opal kernel: sd 0:0:0:0: [sda]
    Feb 18 06:04:29 opal kernel: Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
    Feb 18 06:04:29 opal kernel: sd 0:0:0:0: [sda]
    Feb 18 06:04:29 opal kernel: Sense Key : Medium Error [current] [descriptor]
    Feb 18 06:04:29 opal kernel: Descriptor sense data with sense descriptors (in hex):
    Feb 18 06:04:29 opal kernel: 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
    Feb 18 06:04:29 opal kernel: 93 57 7b 60

This was during a nightly backup. Also...

    Feb 18 17:24:08 opal smartd[10040]: Device: /dev/sda, 112 Currently unreadable (pending) sectors
    Feb 18 17:24:08 opal smartd[10040]: Device: /dev/sda, 112 Offline uncorrectable sectors
    Feb 18 17:24:08 opal smartd[10040]: Device: /dev/sda, SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 117 to 108
    Feb 18 17:24:08 opal smartd[10040]: Device: /dev/sda, SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 57 to 60
    Feb 18 17:24:08 opal smartd[10040]: Device: /dev/sda, SMART Usage Attribute: 194 Temperature_Celsius changed from 43 to 40
    Feb 18 17:24:08 opal smartd[10040]: Device: /dev/sda, previous self-test completed with error (read test element)
    Feb 18 17:24:08 opal smartd[10040]: Device: /dev/sda, Self-Test Log error count increased from 2 to 3
    Feb 18 17:24:08 opal smartd[10040]: Sending warning via mail to root@localhost ...
    Feb 18 17:24:09 opal smartd[10040]: Warning via mail to root@localhost: successful
    Feb 18 17:24:09 opal smartd[10040]: Device: /dev/sda, ATA error count increased from 107 to 123

Signs of a failing disk?
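
The sector numbers in the kernel messages can be cross-checked against the raw command bytes on the "Read(10):" lines. In a READ(10) command, bytes 2-5 hold the starting LBA and bytes 7-8 the number of blocks, both big-endian. A minimal Python sketch (the function name is made up for illustration, and 512-byte logical sectors are assumed):

```python
def decode_read10(cdb_hex):
    """Decode a SCSI READ(10) CDB given as a hex string.
    Returns (starting LBA, number of blocks)."""
    cdb = bytes.fromhex(cdb_hex)  # fromhex() skips the spaces between bytes
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a READ(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")
    blocks = int.from_bytes(cdb[7:9], "big")
    return lba, blocks

# First failing command from the log above.
lba, blocks = decode_read10("28 00 93 57 7b f8 00 00 18 00")
print(lba, blocks, blocks * 512)
```

For the first failing command this gives LBA 2471984120 and 24 blocks, i.e. 12288 bytes, which agrees with the "sector 2471984120" and "dma 12288" figures in the log.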

Merlin-TC
Posted: Mon Feb 18, 2013 5:12 pm

I wouldn't call it a sign of a failing disk; it is failing right now.
If there is anything important on it, copy it off while you can.
It also seems as if your hard drive doesn't have any spare sectors left, so you really should replace it.

This is a hardware error for sure.
It could of course be a faulty cable/sata port but I doubt it.
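
The counters that matter here can be picked out of `smartctl -A` style attribute rows automatically. The following is a rough Python sketch, not a full parser: the set of watched attributes is my own choice, and the sample rows are the ones binro posts further down this thread.

```python
# Attributes whose raw value should normally stay at zero (an assumption
# that holds for these counters, not for every SMART attribute).
WATCHED = {
    "Reallocated_Sector_Ct",
    "Reported_Uncorrect",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
}

def flag_attributes(report):
    """Return {attribute_name: raw_value} for watched attributes with raw > 0."""
    flagged = {}
    for line in report.splitlines():
        fields = line.split()
        # Columns: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW
        if len(fields) >= 10 and fields[0].isdigit() and fields[1] in WATCHED:
            raw = int(fields[9])
            if raw > 0:
                flagged[fields[1]] = raw
    return flagged

SAMPLE = """\
5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 0
187 Reported_Uncorrect 0x0032 001 001 000 Old_age Always - 123
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 112
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 112
"""

print(flag_attributes(SAMPLE))
```

Note that the overall health self-assessment can still read PASSED while these counters are non-zero, so the raw values are worth checking on their own.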

NeddySeagoon
Administrator

Joined: 05 Jul 2003
Posts: 54096
Location: 56N 3W

Posted: Mon Feb 18, 2013 5:20 pm

binro,

the output of smartctl -a for that drive would be good.

Code:
    Feb 18 17:24:08 opal smartd[10040]: Device: /dev/sda, SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 57 to 60


Cooling air at 60C over a disk. I would be worried if mine went over 40C.
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.

binro
Posted: Mon Feb 18, 2013 8:14 pm


    # smartctl -a /dev/sda
    smartctl 5.42 2011-10-20 r3458 [x86_64-linux-3.7.7-gentoo] (local build)
    Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

    === START OF INFORMATION SECTION ===
    Device Model: ST2000DM001-9YN164
    Serial Number: S1E0MATD
    LU WWN Device Id: 5 000c50 0517daeab
    Firmware Version: CC4B
    User Capacity: 2,000,398,934,016 bytes [2.00 TB]
    Sector Sizes: 512 bytes logical, 4096 bytes physical
    Device is: Not in smartctl database [for details use: -P showall]
    ATA Version is: 8
    ATA Standard is: ATA-8-ACS revision 4
    Local Time is: Tue Feb 19 03:04:35 2013 ICT
    SMART support is: Available - device has SMART capability.
    SMART support is: Enabled

    === START OF READ SMART DATA SECTION ===
    SMART overall-health self-assessment test result: PASSED

    General SMART Values:
    Offline data collection status:  (0x82) Offline data collection activity
                                            was completed without error.
                                            Auto Offline Data Collection: Enabled.
    Self-test execution status:      ( 121) The previous self-test completed having
                                            the read element of the test failed.
    Total time to complete Offline
    data collection:                 ( 575) seconds.
    Offline data collection
    capabilities:                    (0x7b) SMART execute Offline immediate.
                                            Auto Offline data collection on/off support.
                                            Suspend Offline collection upon new
                                            command.
                                            Offline surface scan supported.
                                            Self-test supported.
                                            Conveyance Self-test supported.
                                            Selective Self-test supported.
    SMART capabilities:            (0x0003) Saves SMART data before entering
                                            power-saving mode.
                                            Supports SMART auto save timer.
    Error logging capability:        (0x01) Error logging supported.
                                            General Purpose Logging supported.
    Short self-test routine
    recommended polling time:        (   1) minutes.
    Extended self-test routine
    recommended polling time:        ( 226) minutes.
    Conveyance self-test routine
    recommended polling time:        (   2) minutes.
    SCT capabilities:              (0x3085) SCT Status supported.

    SMART Attributes Data Structure revision number: 10
    Vendor Specific SMART Attributes with Thresholds:
    ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
      1 Raw_Read_Error_Rate     0x000f   108   099   006    Pre-fail  Always       -       16533576
      3 Spin_Up_Time            0x0003   095   095   000    Pre-fail  Always       -       0
      4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       32
      5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
      7 Seek_Error_Rate         0x000f   078   060   030    Pre-fail  Always       -       82984983
      9 Power_On_Hours          0x0032   098   098   000    Old_age   Always       -       2028
     10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
     12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       32
    183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
    184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
    187 Reported_Uncorrect      0x0032   001   001   000    Old_age   Always       -       123
    188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
    189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
    190 Airflow_Temperature_Cel 0x0022   057   053   045    Old_age   Always       -       43 (Min/Max 35/44)
    191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
    192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       23
    193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       52
    194 Temperature_Celsius     0x0022   043   047   000    Old_age   Always       -       43 (0 27 0 0 0)
    197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       112
    198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       112
    199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
    240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       247445950826471
    241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       96929235318
    242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       606478393055

    SMART Error Log Version: 1
    ATA Error Count: 123 (device log contains only the most recent five errors)
    CR = Command Register [HEX]
    FR = Features Register [HEX]
    SC = Sector Count Register [HEX]
    SN = Sector Number Register [HEX]
    CL = Cylinder Low Register [HEX]
    CH = Cylinder High Register [HEX]
    DH = Device/Head Register [HEX]
    DC = Device Command Register [HEX]
    ER = Error register [HEX]
    ST = Status register [HEX]
    Powered_Up_Time is measured from power on, and printed as
    DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
    SS=sec, and sss=millisec. It "wraps" after 49.710 days.

    Error 123 occurred at disk power-on lifetime: 2007 hours (83 days + 15 hours)
    When the command that caused the error occurred, the device was active or idle.

    After command completion occurred, registers were:
    ER ST SC SN CL CH DH
    -- -- -- -- -- -- --
    40 51 00 ff ff ff 0f Error: UNC at LBA = 0x0fffffff = 268435455

    Commands leading to the command that caused the error were:
    CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
    -- -- -- -- -- -- -- -- ---------------- --------------------
    25 00 08 ff ff ff ef 00 1d+11:21:48.160 READ DMA EXT
    c8 00 18 78 97 ff e9 00 1d+11:21:48.159 READ DMA
    c8 00 18 50 97 ff e9 00 1d+11:21:48.142 READ DMA
    25 00 10 ff ff ff ef 00 1d+11:21:48.142 READ DMA EXT
    25 00 08 ff ff ff ef 00 1d+11:21:48.138 READ DMA EXT

    Error 122 occurred at disk power-on lifetime: 2007 hours (83 days + 15 hours)
    When the command that caused the error occurred, the device was active or idle.

    After command completion occurred, registers were:
    ER ST SC SN CL CH DH
    -- -- -- -- -- -- --
    40 51 00 ff ff ff 0f Error: UNC at LBA = 0x0fffffff = 268435455

    Commands leading to the command that caused the error were:
    CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
    -- -- -- -- -- -- -- -- ---------------- --------------------
    25 00 10 ff ff ff ef 00 1d+11:21:45.114 READ DMA EXT
    35 00 80 ff ff ff ef 00 1d+11:21:45.113 WRITE DMA EXT
    35 00 10 ff ff ff ef 00 1d+11:21:45.113 WRITE DMA EXT
    35 00 08 ff ff ff ef 00 1d+11:21:45.113 WRITE DMA EXT
    35 00 08 ff ff ff ef 00 1d+11:21:45.113 WRITE DMA EXT

    Error 121 occurred at disk power-on lifetime: 2007 hours (83 days + 15 hours)
    When the command that caused the error occurred, the device was active or idle.

    After command completion occurred, registers were:
    ER ST SC SN CL CH DH
    -- -- -- -- -- -- --
    40 51 00 ff ff ff 0f Error: UNC at LBA = 0x0fffffff = 268435455

    Commands leading to the command that caused the error were:
    CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
    -- -- -- -- -- -- -- -- ---------------- --------------------
    25 00 08 ff ff ff ef 00 1d+11:21:41.561 READ DMA EXT
    c8 00 08 38 31 4f ea 00 1d+11:21:41.551 READ DMA
    c8 00 30 90 99 ff e9 00 1d+11:21:41.550 READ DMA
    c8 00 70 18 99 ff e9 00 1d+11:21:41.537 READ DMA
    25 00 08 ff ff ff ef 00 1d+11:21:41.526 READ DMA EXT

    Error 120 occurred at disk power-on lifetime: 2007 hours (83 days + 15 hours)
    When the command that caused the error occurred, the device was active or idle.

    After command completion occurred, registers were:
    ER ST SC SN CL CH DH
    -- -- -- -- -- -- --
    40 51 00 ff ff ff 0f Error: UNC at LBA = 0x0fffffff = 268435455

    Commands leading to the command that caused the error were:
    CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
    -- -- -- -- -- -- -- -- ---------------- --------------------
    25 00 10 ff ff ff ef 00 1d+11:21:38.611 READ DMA EXT
    ea 00 00 ff ff ff af 00 1d+11:21:38.581 FLUSH CACHE EXT
    35 00 08 ff ff ff ef 00 1d+11:21:38.581 WRITE DMA EXT
    25 00 08 ff ff ff ef 00 1d+11:21:38.566 READ DMA EXT
    ea 00 00 ff ff ff af 00 1d+11:21:38.533 FLUSH CACHE EXT

    Error 119 occurred at disk power-on lifetime: 2007 hours (83 days + 15 hours)
    When the command that caused the error occurred, the device was active or idle.

    After command completion occurred, registers were:
    ER ST SC SN CL CH DH
    -- -- -- -- -- -- --
    40 51 00 ff ff ff 0f Error: UNC at LBA = 0x0fffffff = 268435455

    Commands leading to the command that caused the error were:
    CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
    -- -- -- -- -- -- -- -- ---------------- --------------------
    25 00 08 ff ff ff ef 00 1d+11:21:35.045 READ DMA EXT
    25 00 08 ff ff ff ef 00 1d+11:21:35.028 READ DMA EXT
    35 00 08 ff ff ff ef 00 1d+11:21:35.028 WRITE DMA EXT
    35 00 20 ff ff ff ef 00 1d+11:21:35.028 WRITE DMA EXT
    35 00 08 ff ff ff ef 00 1d+11:21:35.028 WRITE DMA EXT

    SMART Self-test log structure revision number 1
    Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
    # 1 Short offline Completed: read failure 90% 2007 2471984032
    # 2 Short offline Completed: read failure 10% 1985 2471984032
    # 3 Short offline Completed without error 00% 1949 -
    # 4 Short offline Completed without error 00% 1925 -
    # 5 Short offline Completed without error 00% 1901 -
    # 6 Short offline Completed without error 00% 1877 -
    # 7 Short offline Completed without error 00% 1853 -
    # 8 Short offline Completed without error 00% 1829 -
    # 9 Short offline Completed without error 00% 1802 -
    #10 Short offline Completed without error 00% 1778 -
    #11 Short offline Completed without error 00% 1754 -
    #12 Short offline Completed without error 00% 1734 -
    #13 Short offline Completed without error 00% 1710 -
    #14 Short offline Completed without error 00% 1686 -
    #15 Short offline Completed without error 00% 1662 -
    #16 Extended offline Completed: read failure 40% 1644 2471983952
    #17 Short offline Completed without error 00% 1614 -
    #18 Short offline Completed without error 00% 1590 -
    #19 Short offline Completed without error 00% 1566 -
    #20 Short offline Completed without error 00% 1542 -
    #21 Short offline Completed without error 00% 1518 -

    SMART Selective self-test log data structure revision number 1
    SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
    1 0 0 Not_testing
    2 0 0 Not_testing
    3 0 0 Not_testing
    4 0 0 Not_testing
    5 0 0 Not_testing
    Selective self-test flags (0x0):
    After scanning selected spans, do NOT read-scan remainder of disk.
    If Selective self-test is pending on power-up, resume after 0 minute delay.


I live in Bangkok, so 60C is not so hot in the middle of the night when the aircon is off. Kit does tend to expire more quickly out here, but this unit has been operating for only 83 days! Well, Bangkok, as well as being hot, is also the hard disk capital of the world, so I should be able to get it replaced. :)
_________________
"Ship me somewheres east of Suez, where the best is like the worst,
Where there ain't no Ten Commandments an' a man can raise a thirst"
from "Mandalay" by Rudyard Kipling
NeddySeagoon
Administrator
Joined: 05 Jul 2003
Posts: 54096
Location: 56N 3W

Posted: Mon Feb 18, 2013 8:48 pm

binro,

Code:
  5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 0
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 112

The drive has 112 sectors pending reallocation (Current_Pending_Sector), and none have been reallocated yet (Reallocated_Sector_Ct is 0).
The hard (UNC) errors in the error log show that at least some sectors can no longer be read.
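
For anyone who wants to check these counters without reading the whole dump by eye, here is a minimal sketch in Python. It assumes the attribute lines have the same whitespace-separated layout as the smartctl output posted above (ID, name, flag, value, worst, threshold, type, updated, when-failed, raw value). The function names and the "anything above zero is a warning" rule are my own illustration, not part of smartmontools.

```python
def parse_smart_attributes(text):
    """Return {attribute_name: raw_value} from smartctl-style attribute lines."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # An attribute line starts with a numeric ID and has a hex flag
        # such as 0x0012 in the third column; skip everything else.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        if not fields[2].startswith("0x"):
            continue
        try:
            # The raw value may be followed by extra text, e.g.
            # "43 (Min/Max 35/44)", so only the first token is used.
            raw = int(fields[9])
        except ValueError:
            continue
        attrs[fields[1]] = raw
    return attrs


def assess(attrs):
    """Return a list of warnings (hypothetical rule: any count above zero)."""
    warnings = []
    pending = attrs.get("Current_Pending_Sector", 0)
    uncorrectable = attrs.get("Reported_Uncorrect", 0)
    reallocated = attrs.get("Reallocated_Sector_Ct", 0)
    if pending > 0:
        warnings.append(f"{pending} sectors pending reallocation")
    if uncorrectable > 0:
        warnings.append(f"{uncorrectable} uncorrectable errors reported")
    if reallocated > 0:
        warnings.append(f"{reallocated} sectors already reallocated")
    return warnings


# Three attribute lines copied from the output in this thread.
SAMPLE = """\
  5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 0
187 Reported_Uncorrect 0x0032 001 001 000 Old_age Always - 123
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 112
"""

if __name__ == "__main__":
    for warning in assess(parse_smart_attributes(SAMPLE)):
        print(warning)
```

Feeding it the saved text of the attribute table should flag the pending sectors and the uncorrectable errors; it does not replace reading the self-test log.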

The Seagate Website says
Code:
In Warranty
Expiration 22-Sep-2013


Don't mess about - save your data and return the drive.
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.