I've been experimenting a lot with setting up a Hybrid Raid Array using a couple of 200Gb Maxtor SATA disks via device mapper
and I've found a couple of things that I couldn't find on google / gentoo forums or the dm-devel list
this relates to dmsetup / dmraid and Hybrid Raid arrays (especially with regards to Raid 1 and dmsetup)
So I decided to write this little HowTo
(this is also for when I forget all about this and have to remember how I setup my Raid arrays in the first place)
In my own case I've been booting off of a 3rd disk, from which I'll eventually copy the contents to the Raid Array
although it may also be possible to use the LiveCD as well, I've not actually tried this myself yet
a large amount of the info here has been pieced together from other messages on the gentoo forum and from experimentation
Please feel free to copy or indicate if any part of this is incorrect
(i.e. I take no responsibility for any loss of data if your PC blows up etc)
also this is my first HowTo and use of BBCode, so apologies if I haven't got the formatting right
(I'm also starting to think that I've written far too much here for one document)
Table of Contents
1.0 A bit about Raid in general
1.1 Hybrid Raid
1.2 dmsetup
1.3 dmraid
2.0 Determining the size of an individual disk
2.1 Setting up Raid 1
2.2 Setting up Raid 1 Disk synchronization
2.3 Setting up Raid 1 Without synchronization
2.4 Setting up Raid 1 Specific options for the mirror target
3.0 Setting up Raid 0
4.0 Hiding the Raid / Bios metadata
5.0 Mapping out the partitions
6.0 Automating via scripts
7.0 Performance
1.0 A bit about Raid in general
At the moment there are 3 different types of Raid implementation available under Linux
- Software Raid - This uses the user space tools within Linux to present a typical /dev/md0 access. This is Raid support native to Linux but not usually compatible with Windows
- True Hardware Raid - typically only seen on Servers, or machines with separate Hardware PCI Raid card add ons
- There is also another type of raid that is seen on some of the new motherboards. It's not true hardware raid, as a lot of the work is still carried out by the CPU and the operating system
For the rest of this document I'll refer to this type as Hybrid Raid. This is the type that this HowTo is concerned with
The motherboard I was using has two of these controllers: a VIA VT8237 and the Fast Track Promise 20378 RAID controller
My end Goal was to get a dual boot system up and running with Win XP and Gentoo Linux
something that could use the hybrid raid setup so that Win XP and Linux would both be raided and could co-exist
Also something that would use the VIA controller in preference, as initial indications using Windows benchmarking tools appear to show that this is a little faster than the Fastrack controller when using both disks in combination.
I've noticed that it is possible to use Linux software raid for Linux, and the Hybrid Raid for Win XP
and both will be compatible (at least with Raid 1). But this is only if the super block option is switched off for software raid
This makes it a pain to setup. Also since device mapper is closer to the kernel I was hoping that it may perform better
or at the very least be easier to implement
For Raid 1 purposes I also used the windows VIA utility after trying different setups with Linux to confirm that the disks were still considered in-sync by the Hybrid Array setup
1.1 Hybrid Raid
Typically with hybrid raid, the setup / initialization is controlled from the bios when you first boot up
With the controllers that I was using (at least in my case the VIA VT8237), typically 1 or 2 bytes is written
near to the partition table as an indicator that the hybrid raid array is there
Also the meta data which describes how the raid array is setup (size of array, number of disks, type etc)
appears to be located somewhere right at the end of the disk on each of the raid members
I believe this is what the bios usually reads / writes to when initializing the array at bootup
when you view the disk via DOS / windows / bios etc, it sees the end result of the raid array
which is a disk slightly shorter than normal
(I'd guess this is to prevent the meta data from being overwritten / affected at the end of the disk)
Linux on the other hand doesn't see the disks via the bios
it sees the disks as non-raided entities including the meta data at the end
in my case as an example with the latest kernel 2.6.9-rc2-love4 my SATA drives were showing up as /dev/sda /dev/sdb
as the newer SATA drivers now appear to be a part of the SCSI driver set
1.2 dmsetup
The new 2.6 kernel doesn't currently have support for Hybrid Arrays like 2.4 used to have
For 2.4 there were some drivers that could be used, including a proprietary driver for the VIA system
but for 2.6 this is now moving towards device mapper
Device mapper is a kernel feature that is controlled by a user space program called dmsetup
the way to imagine device mapper / dmsetup is this:
a block device is fed in, it is then manipulated, with a resultant block device out
you specify the output block device name and a map file
the map file contains what input block devices there are, the type of target: linear / stripe / mirror etc
with a couple of other parameters
block devices created from dmsetup usually end up within /dev/mapper/
For one example of this
If we imagine /dev/hda as a single hard disk, that has 10000 sectors as an example (0 - 10000)
and /dev/hda1 which is the first partition, starts from sector 63 and ends at sector 5063
accessing sector 0 on hda1 actually accesses sector 63 on hda
also accessing sector 5000 on hda1 actually accesses sector 5063 on hda
in the case of hda and hda1 these are both setup automatically when the kernel first boots up
but hda1 responds in the same way that a linear device map would
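The arithmetic of the linear target can be sketched in a couple of lines of shell (the 63-sector offset is the one from the example above):

```shell
# a linear target is just a fixed offset: sector N on the mapped device
# becomes sector N + offset on the underlying disk
OFFSET=63
for N in 0 5000; do
    echo "hda1 sector $N -> hda sector $((N + OFFSET))"
done
# prints: hda1 sector 0 -> hda sector 63
#         hda1 sector 5000 -> hda sector 5063
```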
using the above example if we'd used dmsetup with the above disk
Code: Select all
echo "0 5000 linear /dev/hda 63" | dmsetup create testpart
1.3 dmraid
One tool I've been experimenting with is dmraid
this calls dmsetup with a map to setup Raid arrays automatically by reading the meta data off the disks in use
however it's still in beta testing at the moment
also dmraid doesn't currently recognize the VIA VT8237
for the Fastrack 20378 (dmraid pdc driver) it was able to identify the Raid 0 array I'd setup correctly
However the raid 1 array appeared to be only half the size that it should be (100Gb instead of 200Gb)
this is probably just a bug, but it's something to be aware of (the version used was dmraid-1.0.0-rc4)
homepage is here
http://people.redhat.com/~heinzm/sw/dmraid/
EDIT
I've just found that someone else has made a much better ebuild here
http://bugs.gentoo.org/show_bug.cgi?id=63041
to use with gentoo portage overlay, create the directory and place the ebuild within /usr/local/portage/sys-fs/dmraid/
make sure that PORTDIR_OVERLAY="/usr/local/portage" is set within /etc/make.conf
Code: Select all
cd /usr/local/portage/sys-fs/dmraid/
ebuild dmraid-1.0.0_rc4.ebuild fetch
ebuild dmraid-1.0.0_rc4.ebuild digest
emerge dmraid
2.0 Determining the size of an individual disk
One of the values we may need, to create the map, is the size of one of the individual raid members (single disk)
this can be viewed by a couple of different ways
- by looking at /sys/block/<block device>/size, assuming that /dev/sda is one of the raid members
Code: Select all
cat /sys/block/sda/size
- by using the blockdev command
Code: Select all
blockdev --getsize /dev/sda
for the length of the disk in bytes this value can be multiplied by 512, (typically 1 sector = 512 bytes)
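As a worked example of this arithmetic, assuming a disk that reports 398297088 sectors (the figure from my own disks, yours will differ):

```shell
# convert the sector count reported by /sys/block/sda/size into bytes;
# 398297088 is the value from my own 200Gb disks and is just an example
SECTORS=398297088
BYTES=$((SECTORS * 512))    # 1 sector = 512 bytes
echo "$BYTES bytes"         # prints: 203928109056 bytes
```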
2.1 Setting up Raid 1
Raid 1 is primarily designed for resilience, in that if one disk fails, there is still another disk with an identical copy of data
It is possible to setup Raid 1 just using Device mapper without Software Raid
what we need to do is to create a single block device to represent the Raid 1 Array of the 2 disks
we feed 2 disks in and get 1 block device out
whatever is written to the output block device is written to both disks at the same time
anything read could be read from either disk
the below sections indicate how to put the map together
the map can be stored in a file and then called by
Code: Select all
dmsetup create <device name> <map file>
alternatively the map can be piped in directly, e.g.
Code: Select all
echo "0 398283480 mirror core 2 128 nosync 2 /dev/sda 0 /dev/sdb 0" | dmsetup create testdevice
2.2 Setting up Raid 1 Disk synchronization
so far the only mention of this table I've seen documented for Raid 1 device mapper is the following
Code: Select all
0 398283480 mirror core 1 128 2 /dev/sda 0 /dev/sdb 0
this will cause the disks to synchronize with each other; sometimes this is something you want to happen to ensure that both disks have identical data
However for normal boot up this is not suitable
on my own system while windows takes around 1Hr to synchronize the disks, I've worked out it would take around 5Hrs to wait for the disks to synchronize using dmsetup in this way, for a pair of 200Gb disks
Also from what I've observed it would appear that data is not currently written to both disks at the same time while the synchronization is taking place
This is noticeable if synchronization is stopped half way through with dmsetup remove_all
the only parameters you'll probably need to alter (assuming you want your disks to synchronize)
- The length of the Array. The figure used above 398283480 is the full length of one of the individual disks in my case, see the above section to get this value
- The device nodes /dev/sda and /dev/sdb may be different for your system but represent the block device for each individual disk
the full parameter list for a Raid 1 map is listed in a section further below
One other consideration is that this will present the full disk as a Raid array, which means that the meta data that the Bios uses at bootup will also be visible (if this was corrupted then potentially the system could become unbootable)
see the section relating to hiding the metadata to get around this.
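To avoid mistyping the long table, the length can be substituted in from a variable; a minimal sketch (the size and the device names /dev/sda, /dev/sdb are from my own setup and will differ on yours):

```shell
# build the synchronizing Raid 1 (mirror) table from a variable; 398283480,
# /dev/sda and /dev/sdb are my own values and are only an example
SIZE=398283480
TABLE="0 $SIZE mirror core 1 128 2 /dev/sda 0 /dev/sdb 0"
echo "$TABLE"
# to actually create the array (as root):
#   echo "$TABLE" | dmsetup create raidarray
```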
2.3 Setting up Raid 1 Without synchronization
I spent a long time trying to figure this one out. A way for dmsetup to setup a Raid 1 array without the disks synchronizing
which ideally is what's required for normal use / bootup
I finally figured something out by looking at the map that dmraid had created on my FastTrack Array with Raid 1
we can up the number of options to 2 and specify nosync as the second option
Code: Select all
0 398283480 mirror core 2 128 nosync 2 /dev/sda 0 /dev/sdb 0
this has the effect of setting up the output device node, but doesn't do the syncing of the whole disk as above, which is what we need for normal operation
2.4 Setting up Raid 1 Specific options for the mirror target
The mirror target uses the following syntax for the table
<output start> <output length> <target> <log type> <number of options> <... option values> <number of devices> ... <device name> <offset>
the numbers used here are a measurement of the number of sectors on the disk
the first 2 parameters affect the output block device
the rest of the parameters affect the devices going into the map
- the first is the offset for the output, this should always be 0
- the second parameter represents the length of the output device. typically for Raid 1 this should be the size of a single Raid member
- the target parameter in this case is mirror for Raid 1
- for log type, the only one supported at the moment is "core" (there is also another one called disk from looking at the kernel sources, but I wouldn't try to use this at the moment)
- next this is the number of options to feed into the mirror target. this can be 1 or 2 (unless someone is aware of a 3rd option)
- the next 1 or 2 parameters can be specified here. first is the region size, (see below for more info on this). next if the number of options is 2, you can specify nosync here as well
- next is the number of disks going into the map, in the case of Raid 1 this will always be 2
- finally we specify the device blocks going in and the offset (typically the offset should always be 0 for raid 1)
I tested the region size by timing the number of blocks synchronized using a synchronize map and reading off the number of blocks completed within a minute
using
Code: Select all
dmsetup status /dev/mapper/<block device>
anything smaller than 128 appears to have no effect, while anything larger appears to slow things down
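The synchronized/total counter at the end of the status line can be turned into a percentage with a little awk; the sample status line below is only my assumption of the format, so check it against your own dmsetup status output first:

```shell
# parse the synced/total sector counts from a sample `dmsetup status` line;
# the sample line is an assumption of the output format and may vary by kernel
STATUS="raidarray: 0 398283480 mirror 2 8:0 8:16 199141740/398283480"
echo "$STATUS" | awk '{ split($NF, a, "/"); printf "%.1f%% synchronized\n", a[1] * 100 / a[2] }'
# prints: 50.0% synchronized
```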
3.0 Setting up Raid 0
Raid 0 has the advantages of using both disks for maximum capacity (2 x 200Gb = 400Gb)
Also depending on the size of the file, (for large files) an increase in the read performance may be obtained.
The disadvantages are that if one disk fails then the whole array is lost
This means that resilience is half that of a single Disk (in other words make sure you have lots of backups)
The data is striped across both disks, e.g. with a 64K stripe size the first 64K is written to the first disk, the second 64K to the second, the third back to the first disk just after the first stripe and so on in an alternating fashion
we can use Device mapper in the same way to setup access to a Raid 0 Array
what we need to do is to create a single block device to represent the Raid 0 Array of the 2 disks
we feed 2 disks in and get 1 block device out
to keep things simple if we use the full size of a single disk e.g. 398283480 sectors (see the above sections on how to obtain this)
now we multiply it by 2 (398283480 * 2 = 796566960)
we'll use this figure as the full size of the Raid 0 array
Code: Select all
0 796566960 striped 2 128 /dev/sda 0 /dev/sdb 0
- in this example the 1st parameter should always be 0 as this is the start offset for the resultant output block device
- the next parameter represents the full size of the raid array when it is created, this is one that you will need to set based on the size of your own disks in the array
- this parameter specifies a striped type of target, which is always required for Raid 0
- this parameter specifies the number of disks involved, in most cases it is always 2 disks
- this parameter is linked to the stripe size, if for example when creating the array in the Bios menu you've used a stripe size of 64K then this value will be (64 * 2 = 128) or for a 32K stripe (32 * 2 = 64)
- finally we specify each disk followed by the offset, the offset is the number of sectors to skip before reading / writing the first stripe on the disk
e.g. if the raid members are sda and sdb then sda usually comes first
or for hde and hdg, hde would usually have the first stripe
however this might not always be the case and will depend on the raid controller that you are using
in my own case I've used an offset of 0 for both sources disks
But depending on your controller sometimes it is necessary to set the offset for the 2nd source disk to a value other than 0
e.g. on one thread within the gentoo forum I've seen one person mention that a sector offset of 10 would be required for the second disk for the HPT374 controller
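Putting the parameters together, a sketch that derives the striped table from the member disk size and the stripe size (398283480 sectors and a 64K stripe are my own values, both assumptions for your setup):

```shell
# derive the Raid 0 table: total length is 2x one member disk, and the chunk
# value is the stripe size in sectors; 398283480 and 64K are my own values
DISK=398283480                  # sectors on one member disk
CHUNK=$((64 * 1024 / 512))      # 64K stripe = 128 sectors
TOTAL=$((DISK * 2))
echo "0 $TOTAL striped 2 $CHUNK /dev/sda 0 /dev/sdb 0"
# prints: 0 796566960 striped 2 128 /dev/sda 0 /dev/sdb 0
```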
In order to test that dmsetup has mapped the Raid 0 Array the same way as the bios
- create a partition within DOS / windows with a filesystem on
- use dmsetup to access the array within Linux
- setup the linear maps for the partitions (see below sections for this)
- attempt to mount the filesystem
dmraid may be a better option if it works with your system
Also if you want to set the overall size of the array to hide the Raid Bios meta data then see the below sections
4.0 Hiding the Raid / Bios meta data
In the above examples for setting up Raid 1 or Raid 0 Arrays, the full span or size of the disk has been used to present the raid array as a block device
one thing to note with this however is that in some cases the bios will store its information about the raid array at the end of the disk
If a partition has been created on the array within Linux that crosses over into this area, then you risk overwriting this data which could potentially make the system unbootable
i.e. Grub would probably read its information from the bios, which would in turn be unable to read the Array if the meta data has been corrupted
If you've used a windows or DOS utility such as PQMagic to setup your partitions then you probably don't need to worry about this
as these utilities would see the array via the bios which would in turn display the Raid Array as slightly shorter than the physical disk
this way the last partition on the disk won't cross over into this area
If you want to make sure that the resultant device nodes for the array under Linux cannot view the meta data
then we also need to make the Raid Array seem slightly shorter than the full span in order to hide it
this way any partitioning tools used under Linux (such as sfdisk or fdisk) won't be able to see or allocate the space used by the meta data, also any backup / restore utility that may affect the entire disk won't interfere as well
There are 2 ways to do this
- if your raid controller is supported, then use dmraid as it is able to read off the specific values from the meta data and set the correct lengths
- do it manually
unfortunately this relies on using DOS or windows
- use a Windows or DOS utility (such as PQMagic) to create a partition located right at the end of the Raid Array
- boot into Linux and setup the Raid array using the full span of the disk to begin with
- run sfdisk -l -uS to find the end sector (last sector used) for the partition created at the end of the Array, since the partition was created under DOS / windows it won't go right to the end of the disk
in my case as an example the full size of the disk was 398297088 sectors but the last partition created under PQMagic ended at sector 398283479 on a Raid 1 Array
now add 1 to this value (as the partition needs to sit within the disk) 398283479+1 = 398283480
398283480 is now the value I use for the length of the Raid 1 Array
e.g.
Code: Select all
0 398283480 mirror core 2 128 nosync 2 /dev/sda 0 /dev/sdb 0
For a Raid 0 Array you could just add 1 the same as above, but to be sure when I tried this myself
I wanted to make sure that the length of the array was an even number of stripes
this may be over complicating things a bit, but as an example
size of individual disk - 398297088
Full size of Raid 0 Array - (398297088 x 2) = 796594176
end sector of the last partition on the disk 796583024
for 64K stripe size 65536 / 512 = 128 sectors for each stripe on the disk
796583024 / 128 = 6223304.875 stripes
rounding this up to an even whole number = 6223306 stripes
working backwards 6223306 * 128 = 796583168 sectors
which is the value I've used in the raid map for Raid 0
Code: Select all
0 796583168 striped 2 128 /dev/sda 0 /dev/sdb 0
Realistically I've found that the Win / DOS tools won't go right to the end of the array when creating the partition, which means the array is probably slightly longer than this value, but since we're only talking about a couple of Mb or so and we want to make sure to hide the Bios Raid meta data, this appears to be a safe value to use. (dmraid is more accurate in this regard assuming it can recognize your setup)
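The rounding steps above can be written as integer arithmetic so the fractional stripe count never appears; a sketch using my own figures:

```shell
# round the end sector of the last partition up to an even number of stripes,
# reproducing the 796583168 figure used for the Raid 0 map; the input values
# are from my own disks and are just an example
END=796583024                              # end sector of last partition (sfdisk)
CHUNK=128                                  # 64K stripe = 128 sectors
STRIPES=$(( (END + CHUNK - 1) / CHUNK ))   # round up to a whole stripe
STRIPES=$(( STRIPES + (STRIPES % 2) ))     # then up to an even count
echo $((STRIPES * CHUNK))                  # prints: 796583168
```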
Also something to note is that some hybrid raid array controllers have the option for a Gigabyte Boundary in the bios setup for Raid 1
All this means is that the Bios will shorten the length of the Array to the nearest Gb, that way if a replacement Disk is not exactly the same size as the old one, it will still function in the Array, as long as it is the same length in Gb
This can also have the effect of making a Raid 1 Array appear shorter than it might otherwise be, and will also affect the end sector for the last partition on the disk
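As an illustration of what a Gigabyte Boundary option might do, here's a sketch that rounds my disk's sector count down to a whole number of binary gigabytes; whether your bios actually uses binary or decimal gigabytes is an assumption you'd need to check:

```shell
# round an array length down to a whole number of Gb, as a Gigabyte Boundary
# option might; 2^30-byte gigabytes are an assumption here, and 398297088
# is the sector count of my own disks
SECTORS=398297088
GB_SECTORS=$((1024 * 1024 * 1024 / 512))   # 2097152 sectors per Gb
echo $(( (SECTORS / GB_SECTORS) * GB_SECTORS ))
# prints: 396361728
```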
5.0 Mapping out the Partitions
while the above maps for Raid 1 and Raid 0 will create device nodes for the entire array within /dev/mapper
we still need to create device nodes for the individual partitions as this is something which isn't done automatically
this is similar to sda1, sda2 for sda or hda1, hda2 for hda etc
the easy way to do this is to just use the partition-mapper script mentioned at the end of this How-to
if we assume that the raid array has been setup as /dev/mapper/raidarray
and that you've already used a partitioning tool to setup the partitions on the disk
we need to use a map with a linear target
first we run
Code: Select all
sfdisk -l -uS /dev/mapper/raidarray
this gives an output similar to the following
Code: Select all
Device Boot Start End #sectors Id System
/dev/mapper/raidarray1 63 102414374 102414312 c W95 FAT32 (LBA)
/dev/mapper/raidarray2 102414375 204828749 102414375 c W95 FAT32 (LBA)
/dev/mapper/raidarray3 204828750 307243124 102414375 c W95 FAT32 (LBA)
/dev/mapper/raidarray4 307243125 398283479 91040355 c W95 FAT32 (LBA)
Code: Select all
echo "0 102414312 linear /dev/mapper/raidarray 63" | dmsetup create raidarray1
the second value 63 is the offset from the beginning of the raidarray device node taken from the output of sfdisk
assuming the partition has a filesystem on it we can now mount /dev/mapper/raidarray1
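The linear tables can also be generated straight from the sfdisk listing with awk; this sketch just prints the tables rather than piping them into dmsetup so they can be checked first. The sample row is the first partition from the listing above, and it assumes no boot flag ('*') in the second column, something the partition-mapper.sh script in the scripts section below does handle:

```shell
# turn an `sfdisk -l -uS` row (device start end sectors ...) into a linear
# table; sample row taken from the listing above, no boot '*' column assumed
printf '%s\n' "/dev/mapper/raidarray1 63 102414374 102414312 c W95 FAT32 (LBA)" |
awk '/^\/dev/ { printf "0 %s linear /dev/mapper/raidarray %s\n", $4, $2 }'
# prints: 0 102414312 linear /dev/mapper/raidarray 63
```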
If you want to be really clever you could feed the output of the linear map into cryptsetup to encrypt the partition as well
(but there are already other HowTo's for how to do that)
6.0 Automating via Scripts
There are a couple of scripts that I've spotted on another thread which can be useful for setting up raidmaps and partitionmaps
I'm not taking credit for either of these but I did modify one slightly to be more compatible with sfdisk
(sometimes sfdisk will list partitions for a device node by placing p1 at the end of the name, but for device nodes with long names it will sometimes just add a number without the p)
first I started off by creating a directory called /etc/dmmaps
the first 2 scripts I placed within this directory
while the last one was located within /etc/init.d
dm-mapper.sh script
Code: Select all
#!/bin/sh
SELF=`basename $0`
BASEDIR=`dirname $0`
if [[ $# < 1 || $1 == "--help" ]]
then
echo usage: $SELF mapping-file
exit 1;
fi
# setup vars for mapping-file, device-name and device-path
FNAME=$1
NAME=`basename $FNAME .devmap`
DEV=/dev/mapper/$NAME
# create device using device-mapper
dmsetup create $NAME $FNAME
if [[ ! -b $DEV ]]
then
echo $SELF: could not map device: $DEV
exit 1;
fi
# create a linear mapping for each partition
$BASEDIR/partition-mapper.sh $DEV
partition-mapper.sh script
Code: Select all
#!/bin/sh
SELF=`basename $0`
if [[ $# < 1 || $1 == "--help" ]]
then
echo usage: $SELF map-device
exit 1;
fi
NAME=$1
if [[ ! -b $NAME ]]
then
echo $SELF: unable to access device: $NAME
exit 1;
fi
# create a linear mapping for each partition
sfdisk -l -uS $NAME | awk '/^\// {
if ( $2 == "*" ) {start = $3;size = $5;}
else {start = $2;size = $4;}
if ( size == 0 ) next;
part = substr($1,length($1)-1);
("basename " $1) | getline dev;
print 0, size, "linear", base, start | ("dmsetup create " dev); }' base=$NAME
dmraidmapper script
Code: Select all
#!/sbin/runscript
depend() {
need modules
}
start() {
ebegin "Initializing software mapped RAID devices"
for devmap in /etc/dmmaps/*.devmap
do
/etc/dmmaps/dm-mapper.sh $devmap
done
eend $? "Error initializing software mapped RAID devices"
}
stop() {
ebegin "Removing software mapped RAID devices"
dmsetup remove_all
eend $? "Failed to remove software mapped RAID devices."
}
next create a file within /etc/dmmaps containing a raidmap
e.g. I have one called via_rd1.devmap that contains
Code: Select all
0 398283480 mirror core 2 128 nosync 2 /dev/sda 0 /dev/sdb 0
then to setup the array run
Code: Select all
cd /etc/dmmaps
./dm-mapper.sh via_rd1.devmap
dm-mapper.sh will by default automatically call partition-mapper.sh
partition-mapper.sh takes one parameter as input which is the block device of the raid array
e.g.
Code: Select all
partition-mapper.sh /dev/mapper/via_rd1
starting dmraidmapper as a service manually
Code: Select all
/etc/init.d/dmraidmapper start
although please note that if your root filesystem is on the array you'll probably need to setup a manual initrd that contains these scripts / devmaps and sfdisk to make the root filesystem available for boot
7.0 Performance
One interesting thing I've also been looking into is the performance of the different methods of accessing the disks
to see which is the fastest
zcav is a part of the bonnie++ toolset and reads 100Mb at a time from the block device and outputs the K/s per 100Mb of data
Note / disclaimer - these are not precise benchmarks; also for Raid 0 I've always used a stripe size of 64K
better results in certain circumstances may be obtained with different stripe sizes
Also measuring the performance in this way, is at a disk level not at a particular software level
using the form
Code: Select all
zcav -c3 /dev/<input device> >output_result.dat
the steps in the graph appear to represent the different zones on the disk
from the way that I interpret this (I could be wrong)
the steps on the graph appear to indicate that there shouldn't be a bottleneck between the disk and zcav
a flat line may be an indication of a bottle neck of the Raid implementation or SATA raid controller on the motherboard
This gave some very interesting results
- Accessing a single disk from the Promise or Via chip set gives the same result - full speed of the disk used as it zones down
- when accessing both disks in combination for Raid 0 via Device Mapper, the Promise controller appears to bottle neck at around 95Mb/s as a straight line
while the via controller appears to use the full capacity of the disks starting at 120Mb/s and slowly zoning down
- Raid 1 via Device Mapper follows the performance of a single disk almost identically
I'm wondering if in the future this may improve, if there is an option for the data to be written to both disks at the same time but read from different disks in a stripe fashion to improve read performance
- Software Raid 1 appears to follow the zone of the disk as a fuzzy line, just under the performance of a single disk
(in other words it's probably around 3Kb/s slower than using Device mapper, which isn't that much of a difference)
- For Software Raid 0 compared to Device Mapper Raid 0 there appears to be a large difference (at least for a 64K stripe)
Device Mapper appears to be around 30Mb/s better off at Raid 0 starting at the beginning of the disk
while software Raid 0 appears to flat line further down the graph
Also the X axis (disk position) appeared to use a different scale for diskspeed32, so I had to write a small C program to multiply the X axis by a certain factor to get the graphs to match up. Considering that the performance of a single disk under XP and Linux appears to match, I believe I've got this right.
I've included a picture of the graph and the raw data if anyone's interested
Graph
Raw Data
for gnuplot I just edited the gp.txt file to include / exclude different results
and used
Code: Select all
load 'gp.txt'
Next I'm going to see if I can get grub to work properly, along with setting up an initrd
and to compare the bootup times for Raid 0 / Raid 1 as I'm setting the array up for final use


