binro l33t
Joined: 06 May 2005 Posts: 724 Location: Bangkok, Thailand
|
Posted: Fri Dec 07, 2012 12:02 pm Post subject: ReiserFS and 2TB disk |
|
|
Two weeks ago I upgraded two ageing disks to a single 2TB Seagate (ST2000DM001-9YN164). I use LVM and formatted the LVs with ReiserFS. In particular, the /home partition is 1TB. Having restored my system, everything looked fine, but when I returned after several hours the KDE desktop would not wake up properly. Switching to a console, neither the sync nor the umount command would complete; they just hung. This happened a couple of times, so I thought the backup might have been a bit corrupt and completely reinstalled @system and @world, and built the latest kernel, 3.6.8. When I returned last night the same thing had occurred; looking at htop from a console I could see lots of identical processes that had been started and had just hung. In the syslog I could see kernel messages relating to hung tasks:
Dec 7 00:39:49 opal kernel: INFO: task apache2:28115 blocked for more than 120 seconds.
Dec 7 00:39:49 opal kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 7 00:39:49 opal kernel: apache2 D 0000000000000000 0 28115 10408 0x00000000
Dec 7 00:39:49 opal kernel: ffff88012f27d980 0000000000000086 0000000000000002 ffffffff815b7420
Dec 7 00:39:49 opal kernel: 0000000000011280 ffff8800b6727fd8 0000000000011280 ffff8800b6726010
Dec 7 00:39:49 opal kernel: ffff8800b6727fd8 0000000000011280 ffff88012f27d980 0000000000011280
Dec 7 00:39:49 opal kernel: Call Trace:
Dec 7 00:39:49 opal kernel: [<ffffffff8106ac92>] ? load_balance+0x102/0x790
Dec 7 00:39:49 opal kernel: [<ffffffff8107a609>] ? debug_mutex_add_waiter+0x29/0x70
Dec 7 00:39:49 opal kernel: [<ffffffff814312cf>] ? __mutex_lock_slowpath+0x22f/0x310
Dec 7 00:39:49 opal kernel: [<ffffffff8102c455>] ? default_spin_lock_flags+0x5/0x10
Dec 7 00:39:49 opal kernel: [<ffffffff8143401b>] ? _raw_spin_lock_irqsave+0x3b/0x60
Dec 7 00:39:49 opal kernel: [<ffffffff8118cd81>] ? queue_log_writer+0x91/0xe0
Dec 7 00:39:49 opal kernel: [<ffffffff81066a80>] ? try_to_wake_up+0x2b0/0x2b0
Dec 7 00:39:49 opal kernel: [<ffffffff81192a18>] ? do_journal_begin_r+0x238/0x380
Dec 7 00:39:49 opal kernel: [<ffffffff81192bef>] ? journal_begin+0x8f/0x170
Dec 7 00:39:49 opal kernel: [<ffffffff81173e49>] ? reiserfs_create+0xf9/0x260
Dec 7 00:39:49 opal kernel: [<ffffffff8110ab1f>] ? generic_permission+0xff/0x240
Dec 7 00:39:49 opal kernel: [<ffffffff8110ce29>] ? vfs_create+0xb9/0x110
Dec 7 00:39:49 opal kernel: [<ffffffff8110e1c2>] ? do_last+0x9b2/0xe70
Dec 7 00:39:49 opal kernel: [<ffffffff810c57b0>] ? release_pages+0x180/0x1d0
Dec 7 00:39:49 opal kernel: [<ffffffff8110e741>] ? path_openat+0xc1/0x500
Dec 7 00:39:49 opal kernel: [<ffffffff8110ecad>] ? do_filp_open+0x4d/0xc0
Dec 7 00:39:49 opal kernel: [<ffffffff81433cf5>] ? _raw_spin_unlock+0x15/0x40
Dec 7 00:39:49 opal kernel: [<ffffffff8111b686>] ? alloc_fd+0x106/0x130
Dec 7 00:39:49 opal kernel: [<ffffffff810fd2e8>] ? do_sys_open+0x108/0x1f0
Dec 7 00:39:49 opal kernel: [<ffffffff81434a39>] ? system_call_fastpath+0x16/0x1b
Eventually the system just hangs completely. Since this started with the new disk, I am wondering if ReiserFS actually works with new, huge disks. If not, what else could be causing this? This is a bit desperate.
TIA _________________ "Ship me somewheres east of Suez, where the best is like the worst,
Where there ain't no Ten Commandments an' a man can raise a thirst"
from "Mandalay" by Rudyard Kipling |
Merlin-TC l33t
Joined: 16 May 2003 Posts: 603 Location: Germany
|
Posted: Fri Dec 07, 2012 2:58 pm Post subject: |
|
|
Sawadee Binro,
ReiserFS doesn't have any problems with volumes up to 16 TB, so I doubt ReiserFS itself is the problem.
1. Is there any additional output of dmesg?
2. Can you reproduce it or does it feel "random"?
3. Is the system under heavy load when this is happening?
You could try another io scheduler just to narrow down the problem. |
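For anyone following along, here is a minimal sketch of how the active I/O scheduler could be checked. The kernel exposes it in /sys/block/<dev>/queue/scheduler, listing the available schedulers with the active one in square brackets. The device name sda and the helper name parse_scheduler are assumptions made for illustration.

```python
from pathlib import Path


def parse_scheduler(text):
    """Split a sysfs scheduler line such as 'noop deadline [cfq]'.

    Returns (available, active); active is None if no name is bracketed.
    """
    available, active = [], None
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            token = token[1:-1]
            active = token
        available.append(token)
    return available, active


if __name__ == "__main__":
    # Assumed device name; adjust it to the disk under test.
    sysfs = Path("/sys/block/sda/queue/scheduler")
    if sysfs.exists():
        names, current = parse_scheduler(sysfs.read_text())
        print("available:", names, "active:", current)
```

Switching schedulers is normally done by writing one of the listed names to the same file as root; which names appear depends on how the kernel was configured.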
srs5694 Guru
Joined: 08 Mar 2004 Posts: 434 Location: Woonsocket, RI
|
Posted: Fri Dec 07, 2012 3:52 pm Post subject: |
|
|
You might also run a SMART utility like GSmartControl, the SMART functions of Palimpsest, or smartctl. These will tell you if you've got a new disk that's defective. (Sadly, it happens sometimes.) The output can be difficult to interpret sometimes, though, so post for help interpreting the output if you need it. |
binro l33t
Joined: 06 May 2005 Posts: 724 Location: Bangkok, Thailand
|
Posted: Fri Dec 07, 2012 4:26 pm Post subject: |
|
|
Merlin-TC wrote: | Sawadee Binro,
ReiserFS doesn't have any problems with volumes up to 16 TB, so I doubt ReiserFS itself is the problem.
1. Is there any additional output of dmesg?
2. Can you reproduce it or does it feel "random"?
3. Is the system under heavy load when this is happening?
You could try another io scheduler just to narrow down the problem. |
I examined the syslog and everything looks normal; there is no unusual load. It is not random but inevitable. I am beginning to suspect it is caused in some way by the graphics, the nvidia driver or KDE, since the system is stable if I don't log on. But this never happened before I changed the disk.
Khawp khun khrup!
binro l33t
Joined: 06 May 2005 Posts: 724 Location: Bangkok, Thailand
|
Posted: Fri Dec 07, 2012 4:27 pm Post subject: |
|
|
srs5694 wrote: | You might also run a SMART utility like GSmartControl, the SMART functions of Palimpsest, or smartctl. These will tell you if you've got a new disk that's defective. (Sadly, it happens sometimes.) The output can be difficult to interpret sometimes, though, so post for help interpreting the output if you need it. |
The smartd daemon is running and reports the disk to be entirely healthy!
binro l33t
Joined: 06 May 2005 Posts: 724 Location: Bangkok, Thailand
|
Posted: Sat Dec 08, 2012 4:20 pm Post subject: |
|
|
This gets stranger and stranger. I disabled the screen-saver and now the system is stable again! A screen-saver wouldn't interfere with process execution, would it?
srs5694 Guru
Joined: 08 Mar 2004 Posts: 434 Location: Woonsocket, RI
|
Posted: Sat Dec 08, 2012 4:34 pm Post subject: |
|
|
binro wrote: | This gets stranger and stranger. I disabled the screen-saver and now the system is stable again! A screen-saver wouldn't interfere with process execution, would it? |
It might, especially if it uses an advanced video feature and if that feature has a buggy implementation in a video driver. Video drivers are increasingly relying on kernel-level code, and then all bets are off; a buggy kernel driver could interfere with just about anything.
Thus, you might try upgrading your video driver, if possible, or switch drivers (from Nvidia's proprietary driver to nouveau or vice-versa, for instance). If that's too much hassle or otherwise impractical, try adjusting your screen saver to use just one module that does the simplest thing possible -- ideally just blank the screen. You'll do without the eye candy that way, but that's better than having a system that hangs randomly. |
binro l33t
Joined: 06 May 2005 Posts: 724 Location: Bangkok, Thailand
|
Posted: Sat Dec 08, 2012 8:51 pm Post subject: |
|
|
I was thinking along the same lines, except that before the restore onto the new disk this all worked perfectly. I can't help thinking that something in my system has been subtly corrupted.
srs5694 Guru
Joined: 08 Mar 2004 Posts: 434 Location: Woonsocket, RI
|
Posted: Sun Dec 09, 2012 12:49 am Post subject: |
|
|
How did you transfer your system to the new disks? (dd, tar, etc.?) It could be there's a malfunction in the video drivers that's related to a subtle permission problem introduced in the transfer; or maybe a bit or two got flipped during the copying. If you've still got the original disk, you could plug it in and write a script to compare every file between the two systems.
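A comparison script along those lines might look like the sketch below. It walks the old tree, then reports regular files that are missing on the new side, differ in mode or ownership, or differ in content (by SHA-256). The mount points /mnt/old and /mnt/new and the function names are hypothetical; symlinks and special files are skipped for brevity.

```python
import hashlib
import os


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_trees(old_root, new_root):
    """Yield (relative_path, reason) for regular files that differ."""
    for dirpath, _dirnames, filenames in os.walk(old_root):
        for name in filenames:
            old_path = os.path.join(dirpath, name)
            rel = os.path.relpath(old_path, old_root)
            new_path = os.path.join(new_root, rel)
            # Skip symlinks, sockets, device nodes and the like.
            if os.path.islink(old_path) or not os.path.isfile(old_path):
                continue
            if os.path.islink(new_path) or not os.path.isfile(new_path):
                yield rel, "missing"
                continue
            old_stat, new_stat = os.stat(old_path), os.stat(new_path)
            old_meta = (old_stat.st_mode, old_stat.st_uid, old_stat.st_gid)
            new_meta = (new_stat.st_mode, new_stat.st_uid, new_stat.st_gid)
            if old_meta != new_meta:
                yield rel, "permissions"
            if file_digest(old_path) != file_digest(new_path):
                yield rel, "content"


if __name__ == "__main__":
    # Hypothetical mount points for the old and new root filesystems.
    for rel, reason in compare_trees("/mnt/old", "/mnt/new"):
        print(f"{reason}: {rel}")
```

Files that legitimately changed after the restore (logs, caches, anything rebuilt by the reinstall) will show up too, so the output needs reading with that in mind.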
salahx Guru
Joined: 12 Mar 2005 Posts: 530
|
Posted: Sun Dec 09, 2012 1:19 am Post subject: |
|
|
Actually, looking at the stack trace and the explanation of symptoms, this could be a genuine bug. It sounds like there is a race condition in reiserfs that's causing a deadlock. The screen saver is innocent in this matter; it just happens to widen the window in which the race can occur.
It may be worth recompiling the kernel with CONFIG_PROVE_LOCKING=y
binro l33t
Joined: 06 May 2005 Posts: 724 Location: Bangkok, Thailand
|
Posted: Sat Jan 19, 2013 1:08 pm Post subject: |
|
|
srs5694 wrote: | How did you transfer your system to the new disks? (dd, tar, etc.?) It could be there's a malfunction in the video drivers that's related to a subtle permission problem introduced in the transfer; or maybe a bit or two got flipped during the copying. If you've still got the original disk, you could plug it in and write a script to compare every file between the two systems. |
The system is backed up using dar, which is a sound utility and checks the backup against the original disk every time.
binro l33t
Joined: 06 May 2005 Posts: 724 Location: Bangkok, Thailand
|
Posted: Sat Jan 19, 2013 1:10 pm Post subject: |
|
|
salahx wrote: | Actually, looking at the stack trace and the explanation of symptoms, this could be a genuine bug. It sounds like there is a race condition in reiserfs that's causing a deadlock. The screen saver is innocent in this matter; it just happens to widen the window in which the race can occur.
It may be worth recompiling the kernel with CONFIG_PROVE_LOCKING=y |
Thanks, I will try that.
binro l33t
Joined: 06 May 2005 Posts: 724 Location: Bangkok, Thailand
|
Posted: Mon Feb 18, 2013 11:37 am Post subject: |
|
|
I am back looking at this again. The lock proving idea did not work because the kernel disabled it when the evil NVidia binary module tainted the kernel! I am now seeing this in the logging:
Feb 18 06:04:00 opal kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Feb 18 06:04:00 opal kernel: ata1.00: BMDMA stat 0x25
Feb 18 06:04:00 opal kernel: ata1.00: failed command: READ DMA EXT
Feb 18 06:04:00 opal kernel: ata1.00: cmd 25/00:18:f8:7b:57/00:00:93:00:00/e0 tag 0 dma 12288 in
Feb 18 06:04:00 opal kernel: res 51/40:00:f8:7b:57/40:00:93:00:00/00 Emask 0x9 (media error)
Feb 18 06:04:00 opal kernel: ata1.00: status: { DRDY ERR }
Feb 18 06:04:00 opal kernel: ata1.00: error: { UNC }
Feb 18 06:04:03 opal kernel: ata1.00: configured for UDMA/133
Feb 18 06:04:03 opal kernel: sd 0:0:0:0: [sda] Unhandled sense code
Feb 18 06:04:03 opal kernel: sd 0:0:0:0: [sda]
Feb 18 06:04:03 opal kernel: Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Feb 18 06:04:03 opal kernel: sd 0:0:0:0: [sda]
Feb 18 06:04:03 opal kernel: Sense Key : Medium Error [current] [descriptor]
Feb 18 06:04:03 opal kernel: Descriptor sense data with sense descriptors (in hex):
Feb 18 06:04:03 opal kernel: 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
Feb 18 06:04:03 opal kernel: 93 57 7b f8
Feb 18 06:04:03 opal kernel: sd 0:0:0:0: [sda]
Feb 18 06:04:03 opal kernel: Add. Sense: Unrecovered read error - auto reallocate failed
Feb 18 06:04:03 opal kernel: sd 0:0:0:0: [sda] CDB:
Feb 18 06:04:03 opal kernel: Read(10): 28 00 93 57 7b f8 00 00 18 00
Feb 18 06:04:03 opal kernel: end_request: I/O error, dev sda, sector 2471984120
Feb 18 06:04:03 opal kernel: ata1: EH complete
Feb 18 06:04:03 opal kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Feb 18 06:04:03 opal kernel: ata1.00: BMDMA stat 0x25
Feb 18 06:04:03 opal kernel: ata1.00: failed command: READ DMA EXT
Feb 18 06:04:03 opal kernel: ata1.00: cmd 25/00:08:f8:7b:57/00:00:93:00:00/e0 tag 0 dma 4096 in
Feb 18 06:04:03 opal kernel: res 51/40:00:f8:7b:57/40:00:93:00:00/00 Emask 0x9 (media error)
Feb 18 06:04:03 opal kernel: ata1.00: status: { DRDY ERR }
Feb 18 06:04:03 opal kernel: ata1.00: error: { UNC }
Feb 18 06:04:03 opal kernel: ata1.00: configured for UDMA/133
Feb 18 06:04:03 opal kernel: sd 0:0:0:0: [sda] Unhandled sense code
Feb 18 06:04:03 opal kernel: sd 0:0:0:0: [sda]
Feb 18 06:04:03 opal kernel: Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Feb 18 06:04:03 opal kernel: sd 0:0:0:0: [sda]
Feb 18 06:04:03 opal kernel: Sense Key : Medium Error [current] [descriptor]
Feb 18 06:04:03 opal kernel: Descriptor sense data with sense descriptors (in hex):
Feb 18 06:04:03 opal kernel: 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
Feb 18 06:04:03 opal kernel: 93 57 7b f8
Feb 18 06:04:03 opal kernel: sd 0:0:0:0: [sda]
Feb 18 06:04:03 opal kernel: Add. Sense: Unrecovered read error - auto reallocate failed
Feb 18 06:04:03 opal kernel: sd 0:0:0:0: [sda] CDB:
Feb 18 06:04:03 opal kernel: Read(10): 28 00 93 57 7b f8 00 00 08 00
Feb 18 06:04:03 opal kernel: end_request: I/O error, dev sda, sector 2471984120
Feb 18 06:04:03 opal kernel: Buffer I/O error on device dm-3, logical block 9603455
Feb 18 06:04:03 opal kernel: ata1: EH complete
Feb 18 06:04:07 opal kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Feb 18 06:04:07 opal kernel: ata1.00: BMDMA stat 0x25
Feb 18 06:04:07 opal kernel: ata1.00: failed command: READ DMA EXT
Feb 18 06:04:07 opal kernel: ata1.00: cmd 25/00:10:20:7c:57/00:00:93:00:00/e0 tag 0 dma 8192 in
Feb 18 06:04:07 opal kernel: res 51/40:00:20:7c:57/40:00:93:00:00/00 Emask 0x9 (media error)
Feb 18 06:04:07 opal kernel: ata1.00: status: { DRDY ERR }
Feb 18 06:04:07 opal kernel: ata1.00: error: { UNC }
Feb 18 06:04:07 opal kernel: ata1.00: configured for UDMA/133
Feb 18 06:04:07 opal kernel: sd 0:0:0:0: [sda] Unhandled sense code
Feb 18 06:04:07 opal kernel: sd 0:0:0:0: [sda]
Feb 18 06:04:07 opal kernel: Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Feb 18 06:04:07 opal kernel: sd 0:0:0:0: [sda]
Feb 18 06:04:07 opal kernel: Sense Key : Medium Error [current] [descriptor]
Feb 18 06:04:07 opal kernel: Descriptor sense data with sense descriptors (in hex):
Feb 18 06:04:07 opal kernel: 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
Feb 18 06:04:07 opal kernel: 93 57 7c 20
Feb 18 06:04:07 opal kernel: sd 0:0:0:0: [sda]
Feb 18 06:04:07 opal kernel: Add. Sense: Unrecovered read error - auto reallocate failed
Feb 18 06:04:07 opal kernel: sd 0:0:0:0: [sda] CDB:
Feb 18 06:04:07 opal kernel: Read(10): 28 00 93 57 7c 20 00 00 10 00
Feb 18 06:04:07 opal kernel: end_request: I/O error, dev sda, sector 2471984160
Feb 18 06:04:07 opal kernel: ata1: EH complete
Feb 18 06:04:10 opal kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Feb 18 06:04:10 opal kernel: ata1.00: BMDMA stat 0x25
Feb 18 06:04:10 opal kernel: ata1.00: failed command: READ DMA EXT
Feb 18 06:04:10 opal kernel: ata1.00: cmd 25/00:08:20:7c:57/00:00:93:00:00/e0 tag 0 dma 4096 in
Feb 18 06:04:10 opal kernel: res 51/40:00:20:7c:57/40:00:93:00:00/00 Emask 0x9 (media error)
Feb 18 06:04:10 opal kernel: ata1.00: status: { DRDY ERR }
Feb 18 06:04:10 opal kernel: ata1.00: error: { UNC }
Feb 18 06:04:10 opal kernel: ata1.00: configured for UDMA/133
Feb 18 06:04:10 opal kernel: sd 0:0:0:0: [sda] Unhandled sense code
Feb 18 06:04:10 opal kernel: sd 0:0:0:0: [sda]
Feb 18 06:04:10 opal kernel: Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Feb 18 06:04:10 opal kernel: sd 0:0:0:0: [sda]
Feb 18 06:04:10 opal kernel: Sense Key : Medium Error [current] [descriptor]
Feb 18 06:04:10 opal kernel: Descriptor sense data with sense descriptors (in hex):
Feb 18 06:04:10 opal kernel: 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
Feb 18 06:04:10 opal kernel: 93 57 7c 20
Feb 18 06:04:10 opal kernel: sd 0:0:0:0: [sda]
Feb 18 06:04:10 opal kernel: Add. Sense: Unrecovered read error - auto reallocate failed
Feb 18 06:04:10 opal kernel: sd 0:0:0:0: [sda] CDB:
Feb 18 06:04:10 opal kernel: Read(10): 28 00 93 57 7c 20 00 00 08 00
Feb 18 06:04:10 opal kernel: end_request: I/O error, dev sda, sector 2471984160
Feb 18 06:04:10 opal kernel: Buffer I/O error on device dm-3, logical block 9603460
Feb 18 06:04:10 opal kernel: ata1: EH complete
Feb 18 06:04:13 opal kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Feb 18 06:04:13 opal kernel: ata1.00: BMDMA stat 0x25
Feb 18 06:04:13 opal kernel: ata1.00: failed command: READ DMA EXT
Feb 18 06:04:13 opal kernel: ata1.00: cmd 25/00:20:48:7c:57/00:00:93:00:00/e0 tag 0 dma 16384 in
Feb 18 06:04:13 opal kernel: res 51/40:00:48:7c:57/40:00:93:00:00/00 Emask 0x9 (media error)
Feb 18 06:04:13 opal kernel: ata1.00: status: { DRDY ERR }
Feb 18 06:04:13 opal kernel: ata1.00: error: { UNC }
Feb 18 06:04:13 opal kernel: ata1.00: configured for UDMA/133
Feb 18 06:04:13 opal kernel: sd 0:0:0:0: [sda] Unhandled sense code
Feb 18 06:04:13 opal kernel: sd 0:0:0:0: [sda]
Feb 18 06:04:13 opal kernel: Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Feb 18 06:04:13 opal kernel: sd 0:0:0:0: [sda]
Feb 18 06:04:13 opal kernel: Sense Key : Medium Error [current] [descriptor]
Feb 18 06:04:13 opal kernel: Descriptor sense data with sense descriptors (in hex):
Feb 18 06:04:13 opal kernel: 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
Feb 18 06:04:13 opal kernel: 93 57 7c 48
Feb 18 06:04:13 opal kernel: sd 0:0:0:0: [sda]
Feb 18 06:04:13 opal kernel: Add. Sense: Unrecovered read error - auto reallocate failed
Feb 18 06:04:13 opal kernel: sd 0:0:0:0: [sda] CDB:
Feb 18 06:04:13 opal kernel: Read(10): 28 00 93 57 7c 48 00 00 20 00
Feb 18 06:04:13 opal kernel: end_request: I/O error, dev sda, sector 2471984200
Feb 18 06:04:13 opal kernel: ata1: EH complete
Feb 18 06:04:16 opal kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Feb 18 06:04:16 opal kernel: ata1.00: BMDMA stat 0x25
Feb 18 06:04:16 opal kernel: ata1.00: failed command: READ DMA EXT
Feb 18 06:04:16 opal kernel: ata1.00: cmd 25/00:08:48:7c:57/00:00:93:00:00/e0 tag 0 dma 4096 in
Feb 18 06:04:16 opal kernel: res 51/40:00:48:7c:57/40:00:93:00:00/00 Emask 0x9 (media error)
Feb 18 06:04:16 opal kernel: ata1.00: status: { DRDY ERR }
Feb 18 06:04:16 opal kernel: ata1.00: error: { UNC }
Feb 18 06:04:16 opal kernel: ata1.00: configured for UDMA/133
Feb 18 06:04:16 opal kernel: sd 0:0:0:0: [sda] Unhandled sense code
Feb 18 06:04:16 opal kernel: sd 0:0:0:0: [sda]
Feb 18 06:04:16 opal kernel: Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Feb 18 06:04:16 opal kernel: sd 0:0:0:0: [sda]
Feb 18 06:04:16 opal kernel: Sense Key : Medium Error [current] [descriptor]
Feb 18 06:04:16 opal kernel: Descriptor sense data with sense descriptors (in hex):
Feb 18 06:04:16 opal kernel: 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
Feb 18 06:04:16 opal kernel: 93 57 7c 48
Feb 18 06:04:16 opal kernel: sd 0:0:0:0: [sda]
Feb 18 06:04:16 opal kernel: Add. Sense: Unrecovered read error - auto reallocate failed
Feb 18 06:04:16 opal kernel: sd 0:0:0:0: [sda] CDB:
Feb 18 06:04:16 opal kernel: Read(10): 28 00 93 57 7c 48 00 00 08 00
Feb 18 06:04:16 opal kernel: end_request: I/O error, dev sda, sector 2471984200
Feb 18 06:04:16 opal kernel: Buffer I/O error on device dm-3, logical block 9603465
Feb 18 06:04:16 opal kernel: ata1: EH complete
Feb 18 06:04:26 opal kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Feb 18 06:04:26 opal kernel: ata1.00: BMDMA stat 0x25
Feb 18 06:04:26 opal kernel: ata1.00: failed command: READ DMA EXT
Feb 18 06:04:26 opal kernel: ata1.00: cmd 25/00:08:60:7b:57/00:00:93:00:00/e0 tag 0 dma 4096 in
Feb 18 06:04:26 opal kernel: res 51/40:00:60:7b:57/40:00:93:00:00/00 Emask 0x9 (media error)
Feb 18 06:04:26 opal kernel: ata1.00: status: { DRDY ERR }
Feb 18 06:04:26 opal kernel: ata1.00: error: { UNC }
Feb 18 06:04:26 opal kernel: ata1.00: configured for UDMA/133
Feb 18 06:04:26 opal kernel: sd 0:0:0:0: [sda] Unhandled sense code
Feb 18 06:04:26 opal kernel: sd 0:0:0:0: [sda]
Feb 18 06:04:26 opal kernel: Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Feb 18 06:04:26 opal kernel: sd 0:0:0:0: [sda]
Feb 18 06:04:26 opal kernel: Sense Key : Medium Error [current] [descriptor]
Feb 18 06:04:26 opal kernel: Descriptor sense data with sense descriptors (in hex):
Feb 18 06:04:26 opal kernel: 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
Feb 18 06:04:26 opal kernel: 93 57 7b 60
Feb 18 06:04:26 opal kernel: sd 0:0:0:0: [sda]
Feb 18 06:04:26 opal kernel: Add. Sense: Unrecovered read error - auto reallocate failed
Feb 18 06:04:26 opal kernel: sd 0:0:0:0: [sda] CDB:
Feb 18 06:04:26 opal kernel: Read(10): 28 00 93 57 7b 60 00 00 08 00
Feb 18 06:04:26 opal kernel: end_request: I/O error, dev sda, sector 2471983968
Feb 18 06:04:26 opal kernel: ata1: EH complete
Feb 18 06:04:29 opal kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Feb 18 06:04:29 opal kernel: ata1.00: BMDMA stat 0x25
Feb 18 06:04:29 opal kernel: ata1.00: failed command: READ DMA EXT
Feb 18 06:04:29 opal kernel: ata1.00: cmd 25/00:08:60:7b:57/00:00:93:00:00/e0 tag 0 dma 4096 in
Feb 18 06:04:29 opal kernel: res 51/40:00:60:7b:57/40:00:93:00:00/00 Emask 0x9 (media error)
Feb 18 06:04:29 opal kernel: ata1.00: status: { DRDY ERR }
Feb 18 06:04:29 opal kernel: ata1.00: error: { UNC }
Feb 18 06:04:29 opal kernel: ata1.00: configured for UDMA/133
Feb 18 06:04:29 opal kernel: sd 0:0:0:0: [sda] Unhandled sense code
Feb 18 06:04:29 opal kernel: sd 0:0:0:0: [sda]
Feb 18 06:04:29 opal kernel: Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Feb 18 06:04:29 opal kernel: sd 0:0:0:0: [sda]
Feb 18 06:04:29 opal kernel: Sense Key : Medium Error [current] [descriptor]
Feb 18 06:04:29 opal kernel: Descriptor sense data with sense descriptors (in hex):
Feb 18 06:04:29 opal kernel: 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
Feb 18 06:04:29 opal kernel: 93 57 7b 60
This was during a nightly backup. Also...
Feb 18 17:24:08 opal smartd[10040]: Device: /dev/sda, 112 Currently unreadable (pending) sectors
Feb 18 17:24:08 opal smartd[10040]: Device: /dev/sda, 112 Offline uncorrectable sectors
Feb 18 17:24:08 opal smartd[10040]: Device: /dev/sda, SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 117 to 108
Feb 18 17:24:08 opal smartd[10040]: Device: /dev/sda, SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 57 to 60
Feb 18 17:24:08 opal smartd[10040]: Device: /dev/sda, SMART Usage Attribute: 194 Temperature_Celsius changed from 43 to 40
Feb 18 17:24:08 opal smartd[10040]: Device: /dev/sda, previous self-test completed with error (read test element)
Feb 18 17:24:08 opal smartd[10040]: Device: /dev/sda, Self-Test Log error count increased from 2 to 3
Feb 18 17:24:08 opal smartd[10040]: Sending warning via mail to root@localhost ...
Feb 18 17:24:09 opal smartd[10040]: Warning via mail to root@localhost: successful
Feb 18 17:24:09 opal smartd[10040]: Device: /dev/sda, ATA error count increased from 107 to 123
Signs of a failing disk?
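The hex bytes in the kernel messages above can be tied back to the sector numbers the kernel prints. In a SCSI READ(10) command block, bytes 2 to 5 hold the starting LBA and bytes 7 to 8 the transfer length, both big-endian. The sketch below decodes the CDB lines from the log; the function name decode_read10 is made up for illustration.

```python
def decode_read10(cdb_hex):
    """Decode a SCSI READ(10) CDB given as space-separated hex bytes.

    Returns (lba, sector_count).
    """
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a READ(10) command block")
    lba = int.from_bytes(cdb[2:6], "big")      # starting logical block
    count = int.from_bytes(cdb[7:9], "big")    # number of blocks requested
    return lba, count


if __name__ == "__main__":
    # CDB lines copied from the log above.
    for cdb in ("28 00 93 57 7b f8 00 00 18 00",
                "28 00 93 57 7c 20 00 00 10 00",
                "28 00 93 57 7b 60 00 00 08 00"):
        lba, count = decode_read10(cdb)
        print(f"LBA {lba}, {count} sectors ({count * 512} bytes)")
```

The first CDB decodes to LBA 2471984120, the same sector the end_request line reports, and its 24-sector length matches the 12288-byte DMA transfer. All the failing sectors in this excerpt fall within a few hundred sectors of one another, which would fit a localized patch of bad media.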
Merlin-TC l33t
Joined: 16 May 2003 Posts: 603 Location: Germany
|
Posted: Mon Feb 18, 2013 5:12 pm Post subject: |
|
|
I wouldn't say it's a sign of a failing disk; the disk is failing right now.
If there is anything important on it, copy it off while you can.
It also seems as if your hard drive doesn't have any spare sectors left, so you really should replace it.
This is a hardware error for sure.
It could of course be a faulty cable/sata port but I doubt it. |
NeddySeagoon Administrator
Joined: 05 Jul 2003 Posts: 54236 Location: 56N 3W
|
Posted: Mon Feb 18, 2013 5:20 pm Post subject: |
|
|
binro,
the output of smartctl -a for that drive would be good.
Code: | Feb 18 17:24:08 opal smartd[10040]: Device: /dev/sda, SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 57 to 60 |
Cooling air at 60C over a disk. I would be worried if mine went over 40C. _________________ Regards,
NeddySeagoon
Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail. |
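A note on reading that smartd line: by default smartd tracks changes in the normalized VALUE column, not in the raw reading. The figures in this thread are consistent with attribute 190 being stored on this drive as 100 minus the temperature. The smartctl output further down shows VALUE 057 next to a raw value of 43, and the same smartd run reports Temperature_Celsius going from 43 to 40. If that reading is right, "57 to 60" means the airflow cooled from 43C to 40C. The mapping is an inference from the numbers posted here, not something taken from vendor documentation; a minimal sketch:

```python
def airflow_celsius(normalized_value):
    """Convert attribute 190's normalized value to degrees Celsius.

    Assumes the 100-minus-temperature encoding inferred from this thread.
    """
    return 100 - normalized_value


if __name__ == "__main__":
    for value in (57, 60):
        print(f"normalized {value} -> {airflow_celsius(value)} C")
```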
binro l33t
Joined: 06 May 2005 Posts: 724 Location: Bangkok, Thailand
|
Posted: Mon Feb 18, 2013 8:14 pm Post subject: |
|
|
# smartctl -a /dev/sda
smartctl 5.42 2011-10-20 r3458 [x86_64-linux-3.7.7-gentoo] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF INFORMATION SECTION ===
Device Model: ST2000DM001-9YN164
Serial Number: S1E0MATD
LU WWN Device Id: 5 000c50 0517daeab
Firmware Version: CC4B
User Capacity: 2,000,398,934,016 bytes [2.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: 8
ATA Standard is: ATA-8-ACS revision 4
Local Time is: Tue Feb 19 03:04:35 2013 ICT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 121) The previous self-test completed having
the read element of the test failed.
Total time to complete Offline
data collection: ( 575) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 226) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x3085) SCT Status supported.
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 108 099 006 Pre-fail Always - 16533576
3 Spin_Up_Time 0x0003 095 095 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 32
5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 078 060 030 Pre-fail Always - 82984983
9 Power_On_Hours 0x0032 098 098 000 Old_age Always - 2028
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 32
183 Runtime_Bad_Block 0x0032 100 100 000 Old_age Always - 0
184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0
187 Reported_Uncorrect 0x0032 001 001 000 Old_age Always - 123
188 Command_Timeout 0x0032 100 100 000 Old_age Always - 0
189 High_Fly_Writes 0x003a 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0022 057 053 045 Old_age Always - 43 (Min/Max 35/44)
191 G-Sense_Error_Rate 0x0032 100 100 000 Old_age Always - 0
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 23
193 Load_Cycle_Count 0x0032 100 100 000 Old_age Always - 52
194 Temperature_Celsius 0x0022 043 047 000 Old_age Always - 43 (0 27 0 0 0)
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 112
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 112
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline - 247445950826471
241 Total_LBAs_Written 0x0000 100 253 000 Old_age Offline - 96929235318
242 Total_LBAs_Read 0x0000 100 253 000 Old_age Offline - 606478393055
SMART Error Log Version: 1
ATA Error Count: 123 (device log contains only the most recent five errors)
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.
Error 123 occurred at disk power-on lifetime: 2007 hours (83 days + 15 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 ff ff ff 0f Error: UNC at LBA = 0x0fffffff = 268435455
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
25 00 08 ff ff ff ef 00 1d+11:21:48.160 READ DMA EXT
c8 00 18 78 97 ff e9 00 1d+11:21:48.159 READ DMA
c8 00 18 50 97 ff e9 00 1d+11:21:48.142 READ DMA
25 00 10 ff ff ff ef 00 1d+11:21:48.142 READ DMA EXT
25 00 08 ff ff ff ef 00 1d+11:21:48.138 READ DMA EXT
Error 122 occurred at disk power-on lifetime: 2007 hours (83 days + 15 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 ff ff ff 0f Error: UNC at LBA = 0x0fffffff = 268435455
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
25 00 10 ff ff ff ef 00 1d+11:21:45.114 READ DMA EXT
35 00 80 ff ff ff ef 00 1d+11:21:45.113 WRITE DMA EXT
35 00 10 ff ff ff ef 00 1d+11:21:45.113 WRITE DMA EXT
35 00 08 ff ff ff ef 00 1d+11:21:45.113 WRITE DMA EXT
35 00 08 ff ff ff ef 00 1d+11:21:45.113 WRITE DMA EXT
Error 121 occurred at disk power-on lifetime: 2007 hours (83 days + 15 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 ff ff ff 0f Error: UNC at LBA = 0x0fffffff = 268435455
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
25 00 08 ff ff ff ef 00 1d+11:21:41.561 READ DMA EXT
c8 00 08 38 31 4f ea 00 1d+11:21:41.551 READ DMA
c8 00 30 90 99 ff e9 00 1d+11:21:41.550 READ DMA
c8 00 70 18 99 ff e9 00 1d+11:21:41.537 READ DMA
25 00 08 ff ff ff ef 00 1d+11:21:41.526 READ DMA EXT
Error 120 occurred at disk power-on lifetime: 2007 hours (83 days + 15 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 ff ff ff 0f Error: UNC at LBA = 0x0fffffff = 268435455
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
25 00 10 ff ff ff ef 00 1d+11:21:38.611 READ DMA EXT
ea 00 00 ff ff ff af 00 1d+11:21:38.581 FLUSH CACHE EXT
35 00 08 ff ff ff ef 00 1d+11:21:38.581 WRITE DMA EXT
25 00 08 ff ff ff ef 00 1d+11:21:38.566 READ DMA EXT
ea 00 00 ff ff ff af 00 1d+11:21:38.533 FLUSH CACHE EXT
Error 119 occurred at disk power-on lifetime: 2007 hours (83 days + 15 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 ff ff ff 0f Error: UNC at LBA = 0x0fffffff = 268435455
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
25 00 08 ff ff ff ef 00 1d+11:21:35.045 READ DMA EXT
25 00 08 ff ff ff ef 00 1d+11:21:35.028 READ DMA EXT
35 00 08 ff ff ff ef 00 1d+11:21:35.028 WRITE DMA EXT
35 00 20 ff ff ff ef 00 1d+11:21:35.028 WRITE DMA EXT
35 00 08 ff ff ff ef 00 1d+11:21:35.028 WRITE DMA EXT
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed: read failure 90% 2007 2471984032
# 2 Short offline Completed: read failure 10% 1985 2471984032
# 3 Short offline Completed without error 00% 1949 -
# 4 Short offline Completed without error 00% 1925 -
# 5 Short offline Completed without error 00% 1901 -
# 6 Short offline Completed without error 00% 1877 -
# 7 Short offline Completed without error 00% 1853 -
# 8 Short offline Completed without error 00% 1829 -
# 9 Short offline Completed without error 00% 1802 -
#10 Short offline Completed without error 00% 1778 -
#11 Short offline Completed without error 00% 1754 -
#12 Short offline Completed without error 00% 1734 -
#13 Short offline Completed without error 00% 1710 -
#14 Short offline Completed without error 00% 1686 -
#15 Short offline Completed without error 00% 1662 -
#16 Extended offline Completed: read failure 40% 1644 2471983952
#17 Short offline Completed without error 00% 1614 -
#18 Short offline Completed without error 00% 1590 -
#19 Short offline Completed without error 00% 1566 -
#20 Short offline Completed without error 00% 1542 -
#21 Short offline Completed without error 00% 1518 -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
I live in Bangkok, so 60C is not so hot in the middle of the night when the aircon is off. Kit does tend to expire more quickly out here, but this unit has only been operating for 83 days! Well, Bangkok, as well as being hot, is also the hard disk capital of the world, so I should be able to get it replaced.
_________________
"Ship me somewheres east of Suez, where the best is like the worst,
Where there ain't no Ten Commandments an' a man can raise a thirst"
from "Mandalay" by Rudyard Kipling
NeddySeagoon Administrator
Joined: 05 Jul 2003 Posts: 54236 Location: 56N 3W
|
Posted: Mon Feb 18, 2013 8:48 pm Post subject: |
|
|
binro,
Code: | 5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 0
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 112 |
The drive has 112 pending sectors that it would like to reallocate, and none have been reallocated yet.
That you get hard (UNC) read errors shows that at least some sectors can no longer be read.
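For anyone who wants to check those two counters from a script, here is a minimal sketch. It assumes the usual ten-column `smartctl -A` attribute table layout (ID, name, flag, value, worst, threshold, type, updated, when-failed, raw value); the two rows are embedded as sample text and spaced out for readability. The function names `parse_attributes` and `assess`, and the "any pending sector means act now" rule, are illustrative assumptions, not part of smartmontools.

```python
import re

# Sample rows in the assumed smartctl -A column layout (values from the post above).
SMART_ROWS = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       112
"""

# ID, NAME, FLAG, VALUE, WORST, THRESH, TYPE, UPDATED, WHEN_FAILED, RAW_VALUE
ROW = re.compile(
    r"^\s*(?P<id>\d+)\s+(?P<name>\S+)\s+0x[0-9a-fA-F]{4}\s+"
    r"(?P<value>\d+)\s+(?P<worst>\d+)\s+(?P<thresh>\d+)\s+"
    r"\S+\s+\S+\s+\S+\s+(?P<raw>\d+)"
)


def parse_attributes(text):
    """Return {attribute_name: raw_value} for every row that matches ROW."""
    attrs = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            attrs[match.group("name")] = int(match.group("raw"))
    return attrs


def assess(attrs):
    """Hypothetical policy: any pending or reallocated sector is worth acting on."""
    pending = attrs.get("Current_Pending_Sector", 0)
    reallocated = attrs.get("Reallocated_Sector_Ct", 0)
    if pending > 0:
        return (f"{pending} sectors pending, {reallocated} reallocated: "
                "back up now and consider replacing the drive")
    if reallocated > 0:
        return f"{reallocated} sectors already reallocated: keep a close eye on it"
    return "no pending or reallocated sectors reported"


if __name__ == "__main__":
    print(assess(parse_attributes(SMART_ROWS)))
```

On a live system the text would come from the real `smartctl -A` output for the drive rather than from an embedded string.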
The Seagate Website says Code: | In Warranty
Expiration 22-Sep-2013 |
Don't mess about - save your data and return the drive.
_________________
Regards,
NeddySeagoon
Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.