devsk Advocate
Joined: 24 Oct 2003 Posts: 2995 Location: Bay Area, CA
|
Posted: Tue Jun 29, 2010 5:41 am Post subject: BTRFS: The SSD killer |
|
|
So, recently I have been noticing that the erase cycle count on my SSD was going up very rapidly. It's a measure of how much data is being written to it. What I noticed was that something was writing about 1-2MB of data to sda every minute, sometimes even more, even when nothing was going on in the system. And this data was random IOs. During a 10 hour period it had written around 2GB of data just idling and used up about 2 erase cycles on a 120GB disk (i.e. the firmware on the SSD is bad at combining smaller random writes and is erasing about 240GB for 2GB of real data written by the OS, i.e. a write amplification of about 120... Looks like a firmware bug).
Anyhow, to further debug this, I shut down X, stopped all services and unmounted all other FSs: almost like single user mode. It was still writing about 1-2MB every minute. Then I had a bright idea. I moved the stuff to another partition on the same disk and formatted that one with ext4. Booted back in and noted the writes: 650KB written in 10 mins (**) i.e. around 1KB per second, compared to anywhere between 15 and 400KB per second (averaged over long periods, i.e. it's not necessarily writing every second) with BTRFS.
What the heck is it writing MBs of data for on an idle system?
I have now officially abandoned BTRFS for my root as well as other data I had on it. No BTRFS for me!
(**) I do wanna know what the heck is ext4 writing this much data for on an idle system if someone knows? |
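For anyone who wants to repeat this kind of measurement, here is a minimal sketch of one way to do it. It assumes a Linux system where field 7 of /sys/block/<dev>/stat is the cumulative count of sectors written, in 512-byte units. The function names are made up for illustration and are not part of any tool mentioned in this thread.

```shell
#!/bin/sh
# Sketch: estimate the average write rate of a block device.
# Assumption: field 7 of /sys/block/<dev>/stat counts sectors written
# in 512-byte units (Linux block layer statistics).

# sectors_written <dev>: print the cumulative sectors-written counter.
sectors_written() {
    awk '{print $7}' "/sys/block/$1/stat"
}

# write_rate_kb <start_sectors> <end_sectors> <seconds>:
# print the average rate in whole KB per second (integer division).
write_rate_kb() {
    echo $(( ($2 - $1) * 512 / 1024 / $3 ))
}

# sample_writes <dev> <seconds>: read the counter twice, <seconds> apart,
# and report the amount written and the average rate.
sample_writes() {
    start=$(sectors_written "$1")
    sleep "$2"
    end=$(sectors_written "$1")
    echo "$1: $(( (end - start) / 2 )) KB written in $2 s," \
         "about $(write_rate_kb "$start" "$end" "$2") KB/s"
}
```

Something like `sample_writes sda 600` would sample sda over 10 minutes. Plugging in the ext4 figure above, 650KB is 1300 sectors, and `write_rate_kb 0 1300 600` prints 1, which matches the "around 1KB per second" estimate.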
devsk Advocate
Joined: 24 Oct 2003 Posts: 2995 Location: Bay Area, CA
|
Posted: Tue Jun 29, 2010 6:06 am Post subject: |
|
|
BTW, this is not the only reason for quitting BTRFS. I got bitten by the silent corruptions which were reported in the other BTRFS thread a couple of times. |
max_power n00b
Joined: 01 Aug 2004 Posts: 48 Location: /dev/bed
|
Posted: Tue Jun 29, 2010 9:33 am Post subject: |
|
|
how do you read out the cycle count of the ssd? |
mbar Veteran
Joined: 19 Jan 2005 Posts: 1990 Location: Poland
|
Posted: Tue Jun 29, 2010 11:34 am Post subject: |
|
|
Yeah, how do I check such statistics? I also have an SSD (Samsung with TRIM), BTRFS compressed on root and home and would like to find out if something wrong is going on...
Also this is not really encouraging: http://lkml.org/lkml/2010/6/3/313 Quote: | Unbound(?) internal fragmentation in Btrfs |
d2_racing Bodhisattva
Joined: 25 Apr 2005 Posts: 13047 Location: Ste-Foy,Canada
|
Posted: Tue Jun 29, 2010 11:58 am Post subject: |
|
|
It's still in heavy development, so maybe wait a couple of months and then retry. |
P.Kosunen Guru
Joined: 21 Nov 2005 Posts: 309 Location: Finland
|
Posted: Tue Jun 29, 2010 12:09 pm Post subject: |
|
|
max_power wrote: | how do you read out the cycle count of the ssd? |
Smartmontools/smartctl i think. |
devsk Advocate
Joined: 24 Oct 2003 Posts: 2995 Location: Bay Area, CA
|
Posted: Tue Jun 29, 2010 4:17 pm Post subject: |
|
|
P.Kosunen wrote: | max_power wrote: | how do you read out the cycle count of the ssd? |
Smartmontools/smartctl i think. | Yes, 'smartctl -a /dev/sda'. Use the latest smartmontools package and the attribute names are self-explanatory. |
max_power n00b
Joined: 01 Aug 2004 Posts: 48 Location: /dev/bed
|
Posted: Tue Jun 29, 2010 6:26 pm Post subject: |
|
|
and which scheduler do you use with your ssd? i set mine to noop, but i am not sure if this is the optimum. but at least the system should not read or write on the drive if the fifo stack is empty. |
devsk Advocate
Joined: 24 Oct 2003 Posts: 2995 Location: Bay Area, CA
|
Posted: Tue Jun 29, 2010 6:37 pm Post subject: |
|
|
max_power wrote: | and which scheduler do you use with your ssd? i set mine to noop, but i am not sure if this is the optimum. but at least the system should not read or write on the drive if the fifo stack is empty. | I use deadline. Plain and simple. No mickey mouse CFQ! |
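To check which scheduler a drive is actually using, the kernel exposes a list in /sys/block/<dev>/queue/scheduler with the active entry in square brackets, for example `noop [deadline] cfq`. The sketch below just extracts the bracketed name; `active_scheduler` is a made-up helper name, and the set of schedulers listed depends on how the kernel was built.

```shell
#!/bin/sh
# Sketch: report the active I/O scheduler.
# Assumption: input looks like the contents of
# /sys/block/<dev>/queue/scheduler, e.g. "noop [deadline] cfq".

# active_scheduler: read the scheduler list on stdin and print the
# entry enclosed in square brackets.
active_scheduler() {
    sed -n 's/.*\[\([^]]*\)\].*/\1/p'
}
```

Usage would be along the lines of `active_scheduler < /sys/block/sda/queue/scheduler`. Switching is done by writing a name to the same file as root, e.g. `echo noop > /sys/block/sda/queue/scheduler`.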
mbar Veteran
Joined: 19 Jan 2005 Posts: 1990 Location: Poland
|
Posted: Tue Jun 29, 2010 7:53 pm Post subject: |
|
|
I use SIO, so let's compare results. Here are mine; I'm going to give you a full dump since I'm going to sleep in a few minutes (I did an ATA Secure Erase before installing Gentoo approx. 2 months ago, maybe less):
Code: | gentoo-xps64 ~ # smartctl -a /dev/sda
smartctl 5.39.1 2010-01-28 r3054 [x86_64-pc-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF INFORMATION SECTION ===
Device Model: SAMSUNG SSD PM800 2.5" 256GB
Serial Number: YF11700953SY953B3844
Firmware Version: VBM24D1Q
User Capacity: 256,060,514,304 bytes
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: 7
ATA Standard is: ATA/ATAPI-7 T13 1532D revision 1
Local Time is: Tue Jun 29 21:50:27 2010 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 720) seconds.
Offline data collection
capabilities: (0x53) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
No Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 12) minutes.
Extended self-test routine
recommended polling time: ( 72) minutes.
SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
9 Power_On_Hours 0x0032 099 099 000 Old_age Always - 790
12 Power_Cycle_Count 0x0032 099 099 000 Old_age Always - 286
175 Program_Fail_Count_Chip 0x0032 099 099 011 Old_age Always - 1
176 Erase_Fail_Count_Chip 0x0032 100 100 011 Old_age Always - 0
177 Wear_Leveling_Count 0x0013 099 099 017 Pre-fail Always - 17
178 Used_Rsvd_Blk_Cnt_Chip 0x0013 077 077 011 Pre-fail Always - 28
179 Used_Rsvd_Blk_Cnt_Tot 0x0013 093 093 010 Pre-fail Always - 538
180 Unused_Rsvd_Blk_Cnt_Tot 0x0013 093 093 010 Pre-fail Always - 7398
181 Program_Fail_Cnt_Total 0x0032 099 099 010 Old_age Always - 2
182 Erase_Fail_Count_Total 0x0032 100 100 010 Old_age Always - 0
183 Runtime_Bad_Block 0x0013 099 099 010 Pre-fail Always - 2
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
195 Hardware_ECC_Recovered 0x001a 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 253 253 000 Old_age Always - 0
232 Available_Reservd_Space 0x0013 077 077 011 Pre-fail Always - 96
233 Media_Wearout_Indicator 0x0032 099 099 000 Old_age Always - 2660
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 472 -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
|
bollucks l33t
Joined: 27 Oct 2004 Posts: 606
|
Posted: Wed Jun 30, 2010 1:19 am Post subject: Re: BTRFS: The SSD killer |
|
|
devsk wrote: | (**) I do wanna know what the heck is ext4 writing this much data for on an idle system if someone knows? |
All journalled file systems will write out a certain amount of data to the journal at a regular interval. By default that interval is 5 seconds. If you want a truly idle filesystem, try running ext4 without a journal (nolog is the ReiserFS option; on ext4 the journal has to be removed with tune2fs -O ^has_journal), but of course you'll lose the filesystem safety of journalling if you do this. |
devsk Advocate
Joined: 24 Oct 2003 Posts: 2995 Location: Bay Area, CA
|
Posted: Wed Jun 30, 2010 2:20 am Post subject: Re: BTRFS: The SSD killer |
|
|
bollucks wrote: | devsk wrote: | (**) I do wanna know what the heck is ext4 writing this much data for on an idle system if someone knows? |
All journalled file systems will write out a certain amount of data to the journal every certain time period. By default that time period is 5 seconds. If you want a truly idle filesystem, try booting ext4 without the journal enabled (nolog), but of course you'll lose the filesystem safety of journalling if you do this. | I do know they need to write the journal every 5 (or "commit") seconds, but if the journal size is greater than the data written in an idle system (hence making overall write a multiple of useful data written), the FS is doing something wrong, which is the main problem described in this thread.
I think ext4 is also writing more metadata than the data that's being written. The question can be rephrased like this: How can the system write 1MB of log files in an idle system in an hour but write 5MB of journal? Those numbers are not real but they are meant to illustrate the question. |
devsk Advocate
Joined: 24 Oct 2003 Posts: 2995 Location: Bay Area, CA
|
Posted: Wed Jun 30, 2010 2:22 am Post subject: |
|
|
@mbar: your Samsung disk is different from my OCZ Vertex and SMART data is completely different. I don't know what's what. You will have to get help from Samsung forums/techs about what that data really means. |
mbar Veteran
Joined: 19 Jan 2005 Posts: 1990 Location: Poland
|
Posted: Wed Jun 30, 2010 6:06 am Post subject: |
|
|
ok, could you please post yours? |
haarp Guru
Joined: 31 Oct 2007 Posts: 535
|
Posted: Wed Jun 30, 2010 6:12 am Post subject: |
|
|
What brand of SSD do you use? I have an Intel one and now I'm afraid I'll have to get rid of btrfs as well if this is indeed the case :/ |
devsk Advocate
Joined: 24 Oct 2003 Posts: 2995 Location: Bay Area, CA
|
Posted: Wed Jun 30, 2010 6:28 am Post subject: |
|
|
mbar wrote: | ok, could you please post yours? |
Code: | # ssd-stats sdb
Drive sdb:
184 Initial_Bad_Block_Count 44
195 Program_Failure_Blk_Ct 0
196 Erase_Failure_Blk_Ct 0
197 Read_Failure_Blk_Ct 0
198 Read_Sectors_Tot_Ct 5448667150
199 Write_Sectors_Tot_Ct 3942772932
200 Read_Commands_Tot_Ct 128687675
201 Write_Commands_Tot_Ct 29873046
202 Error_Bits_Flash_Tot_Ct 1886205
203 Corr_Read_Errors_Tot_Ct 1788537
204 Bad_Block_Full_Flag 0
205 Max_PE_Count_Spec 5000
206 Min_Erase_Count 4
207 Max_Erase_Count 3847
208 Average_Erase_Count 134
209 Remaining_Lifetime_Perc 98 | It looks nicer on a terminal than here.
ssd-stats is a script wrapper around smartctl.
Code: | $ cat ssd-stats
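#!/bin/sh
# Print the ID, name and raw value of selected SMART attributes.
if [ $# -eq 0 ]
then
    echo "Usage: $0 <device>"
    echo "       $0 sda sdb"
    echo ""
    exit 1
fi
for i in "$@"
do
    echo ""
    echo "Drive $i:"
    # Resolve symlinks to the real device node; fall back to /dev/<name>.
    drv=$(readlink -f "/dev/$i")
    [ -z "$drv" ] && drv="/dev/$i"
    # Keep lines whose attribute ID matches ^[12][089] (184 and 195-209 here)
    # and print the ID, the name and RAW_VALUE (column 10).
    smartctl -a "$drv" | grep "^[12][089]" | awk '{print $1"\t"$2"\t"$10}'
    echo ""
done |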
devsk Advocate
Joined: 24 Oct 2003 Posts: 2995 Location: Bay Area, CA
|
Posted: Wed Jun 30, 2010 6:33 am Post subject: |
|
|
haarp wrote: | What brand of SSD do you use? I have an Intel one and now I'm afraid I'll have to get rid of btrfs as well if this is indeed the case :/ | Mine is OCZ Vertex. The controller is different, the firmware is different. So, I don't know how much write-amplification matters with Intel's controller. But the Indilinx firmware is pretty bad with write-amplification. I just posted on the OCZ forums about writing 242MB of data in 23 hours and using up 5 erase cycles, i.e. 600GB of data erased by the firmware for 242MB of data written by the OS. That's a write-amplification of about 2500! Unheard of! Something really screwy is going on with the 1.6 firmware on Indilinx drives.
Also, it seems like BTRFS writes are small in size and random in placement. |
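The write-amplification figures in this thread come from simple arithmetic, sketched below. It assumes that one erase cycle means the whole drive capacity was erased once, and that 1GB = 1024MB; both are simplifications, and `write_amplification` is just an illustrative name.

```shell
#!/bin/sh
# Sketch: estimate write amplification from erase cycles.
# Assumptions: each counted erase cycle erases the full drive capacity
# once, and 1 GB = 1024 MB. The result is rounded down.

# write_amplification <erase_cycles> <capacity_gb> <host_written_mb>
write_amplification() {
    echo $(( $1 * $2 * 1024 / $3 ))
}

# 2 cycles on a 120GB drive for 2GB (2048MB) written by the OS:
write_amplification 2 120 2048    # prints 120
# 5 cycles on a 120GB drive for 242MB written by the OS:
write_amplification 5 120 242     # prints 2538
```

The second result is where the "about 2500" figure comes from: 600GB erased for 242MB actually written.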
mbar Veteran
Joined: 19 Jan 2005 Posts: 1990 Location: Poland
|
Ant P. Watchman
Joined: 18 Apr 2009 Posts: 6920
|
Posted: Tue Jul 06, 2010 8:03 pm Post subject: |
|
|
Don't want to cause a panic here, but it seems btrfs may be FUBAR by design. |
devsk Advocate
Joined: 24 Oct 2003 Posts: 2995 Location: Bay Area, CA
|
Posted: Tue Jul 06, 2010 8:10 pm Post subject: |
|
|
Ant_P wrote: | Don't want to cause a panic here, but it seems btrfs may be FUBAR by design. | That was already discussed here on Gentoo forums. |
cach0rr0 Bodhisattva
Joined: 13 Nov 2008 Posts: 4123 Location: Houston, Republic of Texas
|
Posted: Wed Jul 07, 2010 12:16 am Post subject: |
|
|
devsk wrote: | Ant_P wrote: | Don't want to cause a panic here, but it seems btrfs may be FUBAR by design. | That was already discussed here on Gentoo forums. |
++
and it's already resulting in a patch |
d2_racing Bodhisattva
Joined: 25 Apr 2005 Posts: 13047 Location: Ste-Foy,Canada
|
Posted: Wed Jul 07, 2010 11:39 am Post subject: |
|
|
I hope that they resolve that kind of problem, because it seems that BTRFS may become the next standard like EXT2/EXT3 was a couple of years ago. |
DigitalCorpus Apprentice
Joined: 30 Jul 2007 Posts: 283
|
Posted: Thu Jul 08, 2010 10:53 pm Post subject: |
|
|
I have to ask since this was never mentioned, but did you use the ssd mount option for BTRFS? |
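One way to answer that question on a running system is to look at the options listed for each btrfs mount. The sketch below assumes the standard /proc/mounts layout (device, mount point, filesystem type, comma-separated options) and that btrfs lists `ssd` or `ssd_spread` there when the option is active. `btrfs_ssd_status` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: report whether btrfs mounts carry the ssd/ssd_spread option.
# Assumption: the input file uses the /proc/mounts layout, with the
# filesystem type in column 3 and the mount options in column 4.

# btrfs_ssd_status [mounts_file]: defaults to /proc/mounts.
btrfs_ssd_status() {
    awk '$3 == "btrfs" {
        n = split($4, opts, ",")
        state = "ssd option not set"
        for (i = 1; i <= n; i++)
            if (opts[i] == "ssd" || opts[i] == "ssd_spread")
                state = "ssd option set (" opts[i] ")"
        print $2 ": " state
    }' "${1:-/proc/mounts}"
}
```

Running `btrfs_ssd_status` with no arguments checks the live system. If the option is missing, it can be added with `-o ssd` at mount time or in the options field of /etc/fstab.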
Shining Arcanine Veteran
Joined: 24 Sep 2009 Posts: 1110
|
Posted: Thu Jul 08, 2010 10:59 pm Post subject: |
|
|
max_power wrote: | and which scheduler do you use with your ssd? i set mine to noop, but i am not sure if this is the optimum. but at least the system should not read or write on the drive if the fifo stack is empty. |
It is only optimal on a single core system. On a multicore system, CFQ can do optimizations on requests such that multiple requests to the same region of the disk can be merged, increasing performance. |
DestroyFX n00b
Joined: 05 Dec 2005 Posts: 44
|
Posted: Sat Jul 10, 2010 2:58 pm Post subject: |
|
|
For SSD, you must:
- use the NOOP scheduler
- align partitions with the drive's blocks and use the same sector size if possible
- use the noatime, compress, ssd_spread and nodiratime mount options
The *atime options are useful for not writing the access time of everything....
I have a cheapo Patriot Warp 2 128GB SSD and the only FS working without stuttering is btrfs+ssd options.
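For the alignment point, a quick check is possible from sysfs, which reports each partition's start in 512-byte sectors in /sys/block/<disk>/<partition>/start. The sketch below tests that start against a boundary. The 1 MiB default is an assumption chosen because it is a multiple of common flash page and erase block sizes, not a value taken from this thread, and `is_aligned` is an illustrative name.

```shell
#!/bin/sh
# Sketch: test whether a partition start falls on an alignment boundary.
# Assumptions: the start is given in 512-byte sectors (as in
# /sys/block/<disk>/<partition>/start); the default boundary of 1 MiB
# is a guess that covers common erase block sizes.

# is_aligned <start_sector> [boundary_bytes]: exit status 0 if aligned.
is_aligned() {
    boundary=${2:-1048576}
    [ $(( $1 * 512 % boundary )) -eq 0 ]
}
```

For example, `is_aligned "$(cat /sys/block/sda/sda1/start)" && echo aligned`. A partition starting at sector 63, the old DOS default, fails this check, while one starting at sector 2048 passes.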
Also, I recommend to use TMPFS for