Gentoo Forums
Gentoo Forums
Gentoo Forums
Quick Search: in
How do I find out my file move takes so long (on btrfs)?
View unanswered posts
View posts from last 24 hours

 
Reply to topic    Gentoo Forums Forum Index Kernel & Hardware
View previous topic :: View next topic  
Author Message
Letharion
Veteran
Veteran


Joined: 13 Jun 2005
Posts: 1344
Location: Sweden

PostPosted: Mon Aug 18, 2014 7:02 pm    Post subject: How do I find out my file move takes so long (on btrfs)? Reply with quote

I started moving a file, about 2 GiB from my SSD to my BTRFS raid-1. This wouldn't normally be a big deal, so I didn't expect it to take very long.

Now, 5 hours later, it still hasn't finished, and I have no idea why.

top shows very high values for "wait", hovering around 70 for long time periods, but going up and down.

java, which is doing the move, takes up about 6 - 7% CPU according to top.

Connecting to the long running process with strace, shows

Code:
$ strace -p 15184
Process 15184 attached
futex(0x7fe9160649d0, FUTEX_WAIT, 15189, NULL


and nothing more happens, as far as I can tell.

Is there any way I can tell what's happening? I could of course kill the process and hope for the best, but I'd prefer to understand what's going on first.
Back to top
View user's profile Send private message
vaxbrat
l33t
l33t


Joined: 05 Oct 2005
Posts: 731
Location: DC Burbs

PostPosted: Mon Aug 18, 2014 7:14 pm    Post subject: look at btrfs threads in the system Reply with quote

When a copy stalls like that it can often mean that the i/o is being blocked due to problems with the hardware. Look at your messages file to see if any btrfs events happened:

Code:
grep btrfs /var/log/messages


Run top and look for btrfs-something threads taking a lot of cpu time. If a drive is about to relocate a bad block, the btrfs i/o may be getting blocked by the low level device. Do you see a lot of repetitive blinking of your hard drive led every second or so? That often means a drive is trying to write and verify to a sector that has gone bad, and everything in the system will pile up behind the block.

That could also be on the reading end of your copy where the ssd is having troubles reading a sector.

If you have smartctl installed, you can do a "smartctl -a /dev/sd<whatever>" to see whether your drives are reporting errrors in their logs.
Back to top
View user's profile Send private message
Letharion
Veteran
Veteran


Joined: 13 Jun 2005
Posts: 1344
Location: Sweden

PostPosted: Mon Aug 18, 2014 7:58 pm    Post subject: Re: look at btrfs threads in the system Reply with quote

Thanks vaxbrat!

Based on your help, I've tried to continue diagnosing, but unfortunately I'm not sure I'm much wiser, see below.

vaxbrat wrote:
When a copy stalls like that it can often mean that the i/o is being blocked due to problems with the hardware. Look at your messages file to see if any btrfs events happened:
Code:
grep btrfs /var/log/messages

The logs only mention btrfs at the times when discs were mounted, and I see nothing else that appears to indicate errors. (Other than Chromium segfaulting at one point)

vaxbrat wrote:
Run top and look for btrfs-something threads taking a lot of cpu time.
btrfs-transact occasionally shows up, but rarely, and mostly with usage below 1%. Most of the time it appears to be doing so little work that it doesn't even show up in tops list.

vaxbrat wrote:
Do you see a lot of repetitive blinking of your hard drive led every second or so?
There's an occasional blink, but nothing that appears repetitive or even frequent about it. The average time between blinks might be 5 or even 10 seconds.

vaxbrat wrote:
If you have smartctl installed, you can do a "smartctl -a /dev/sd<whatever>" to see whether your drives are reporting errrors in their logs.
I'm not completely sure what to look for, but the three drives are sd{c,d,e}.

Is the below checking for the correct logs?
Code:
# for i in c d e; do smartctl -a /dev/sd${i} | grep -A1 "SMART Error"; done
SMART Error Log Version: 1
No Errors Logged
SMART Error Log Version: 1
No Errors Logged
SMART Error Log Version: 1
No Errors Logged
Back to top
View user's profile Send private message
vaxbrat
l33t
l33t


Joined: 05 Oct 2005
Posts: 731
Location: DC Burbs

PostPosted: Mon Aug 18, 2014 8:39 pm    Post subject: Reply with quote

The low stats for btrfs-transact suggest to me that your btrfs raid is just fine but not getting much of anything thrown to it. That 5-10 second blip on the drive light is probably journal flush and commit.

So it sounds like either your SSD is having trouble reading, or something else (file lock?) may be blocking your copy. If you are doing this in a gui, maybe a pop-up really became a pop-under and it's blocking while asking you some stupid question about whether you want to replace something, etc.

For your smartctl you really want to look at the raw values for SMART attributes to spot things like ECC errors re-allocated and offline uncorrectable. For example here's a dump of a brand new WD Red 4tb drive which reports as clean as a whistle:

Code:
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   100   253   021    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       4
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       1033
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       4
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       3
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       595
194 Temperature_Celsius     0x0022   116   114   000    Old_age   Always       -       36
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0


WD does tend to gloss over things a bit in their tracked SMART attributes. That Raw_Read_Error_Rate value of 0 is probably a lie. Seagate drives will always show non-zero raw read rates with older drives having errors in the millions even though those drives run fine without any reallocated sectors.
Back to top
View user's profile Send private message
Letharion
Veteran
Veteran


Joined: 13 Jun 2005
Posts: 1344
Location: Sweden

PostPosted: Tue Aug 19, 2014 6:32 am    Post subject: Reply with quote

I see, now I understand better what I'm looking at.

The oldest drive by far, which is one of the raids, looks like this:
Code:
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   114   099   006    Pre-fail  Always       -       67013599
  3 Spin_Up_Time            0x0003   092   091   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       969
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   066   060   030    Pre-fail  Always       -       163353565268
  9 Power_On_Hours          0x0032   071   071   000    Old_age   Always       -       25970
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       6
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       970
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   098   000    Old_age   Always       -       34360262664
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   055   037   045    Old_age   Always   In_the_past 45 (0 4 46 22 0)
194 Temperature_Celsius     0x0022   045   063   000    Old_age   Always       -       45 (0 14 0 0 0)
195 Hardware_ECC_Recovered  0x001a   023   015   000    Old_age   Always       -       67013599
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       463
High numbers for things like Raw read error rate looks nasty. I already knew the drive was old, so I'm not surprised if it's nearing EOL.

Other raid drive:

Code:
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   179   176   021    Pre-fail  Always       -       6050
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       38
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   096   096   000    Old_age   Always       -       3508
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       38
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       4
193 Load_Cycle_Count        0x0032   192   192   000    Old_age   Always       -       25642
194 Temperature_Celsius     0x0022   116   109   000    Old_age   Always       -       34
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0


and SSD:

Code:
SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       4355
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       66
177 Wear_Leveling_Count     0x0013   094   094   000    Pre-fail  Always       -       190
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   100   100   010    Pre-fail  Always       -       0
181 Program_Fail_Cnt_Total  0x0032   100   100   010    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   010    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0013   100   100   010    Pre-fail  Always       -       0
187 Uncorrectable_Error_Cnt 0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0032   067   058   000    Old_age   Always       -       33
195 ECC_Rate                0x001a   200   200   000    Old_age   Always       -       0
199 CRC_Error_Count         0x003e   100   100   000    Old_age   Always       -       0
235 POR_Recovery_Count      0x0012   099   099   000    Old_age   Always       -       11
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       8957187966


Now, because I was gonna try to copy files around within the SSD to see if that was the issue, I screwed it up and deleted the files I'm waiting for a move one >_<, so now I probably won't find out what's happening anyway.

I did look for pop unders though and couldn't find anything. Though it would have been somewhat funny to find it being that simple. ;)

I'll just kill the java process and reboot just to be sure the btrfs kernel driver is not in a bad state. And then find a new copy of the files.
Back to top
View user's profile Send private message
vaxbrat
l33t
l33t


Joined: 05 Oct 2005
Posts: 731
Location: DC Burbs

PostPosted: Wed Aug 20, 2014 2:45 am    Post subject: Reply with quote

On your first drive:

Code:
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       463


Could be a sign of a bad cable or SATA port and might just be why things were taking so long. It's hard to tell not knowing what the count may have been from before. If your computer and hard drive have trouble communicating, there will be a lot of retries and thus the long delays. Next time you run your transfer and have problems see if this is going up significantly. Try re-seating or swapping the cable or go to another SATA port if you have one free. Btrfs should be able to find the pair if one of the device names changes.
Back to top
View user's profile Send private message
vaxbrat
l33t
l33t


Joined: 05 Oct 2005
Posts: 731
Location: DC Burbs

PostPosted: Wed Aug 20, 2014 2:51 am    Post subject: Reply with quote

Also notice:

Code:
  1 Raw_Read_Error_Rate     0x000f   114   099   006    Pre-fail  Always       -       67013599


matches

Code:
195 Hardware_ECC_Recovered  0x001a   023   015   000    Old_age   Always       -       67013599


This is telling us that the seek errors are no big deal since the low level ECC correction is dealing with it.
Back to top
View user's profile Send private message
davidm
Guru
Guru


Joined: 26 Apr 2009
Posts: 557
Location: US

PostPosted: Wed Aug 27, 2014 3:52 am    Post subject: Reply with quote

All versions 3.15.0 - 3.17 rc2 (at least) have a nasty bug causing deadlock type conditions when compression is enabled. 3.14.x also had some sort of variation of this from what I understand but is still more stable. Patches exist and will probably land in a future kernel release. See the mailing list for details. Just thought I'd mention it in case this is what you are running into and it's not hardware related at all.
Back to top
View user's profile Send private message
Display posts from previous:   
Reply to topic    Gentoo Forums Forum Index Kernel & Hardware All times are GMT
Page 1 of 1

 
Jump to:  
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum