Monday, March 14, 2011

Linux software raid: mdadm

Recently I had to start using linux software raid. I was more used to lvm, but software raid can come in pretty handy in certain circumstances. A friend of mine had a dual boot linux-windows installation, and he wanted to be able to access the linux data from the windows environment, plus some kind of redundancy in order not to lose important data. So I thought about a basic raid-1 in linux (2 partitions mirrored) to provide redundancy. He could then access one of the partitions from the windows install. But then again, I didn't want to use NTFS file systems (I don't really know if they can be mirrored with linux mdadm, I haven't tried that out yet). So I decided to use ext3, and use the ext2/ext3 driver available for windows here.
It's a nice tool for accessing linux ext3 file systems, and though it's perfectly able to work in read/write mode, I recommended mounting the partitions read-only. But there is an important limitation: in order to be able to use that driver in windows, the file system has to be created with an inode size of 128, otherwise it won't work (it's a driver limitation) and the partition will not be accessible from windows (it will be reported as not formatted). This can (and in fact has to) be done at file system creation time, with e.g. (note the capital -I, which sets the inode size; lowercase -i sets the bytes-per-inode ratio instead, and -j adds the ext3 journal):
mke2fs -j -I 128 /dev/sda1
Also check out /etc/mke2fs.conf for defaults for ext3/ext4 file system creation.
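To double-check an existing file system before trying it from windows, the inode size can be read back from the superblock. This is only a minimal sketch, assuming tune2fs from e2fsprogs is installed; the helper name inode_size is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: print the inode size found in `tune2fs -l` output
# that is fed in on stdin.
inode_size() {
    awk -F: '/^Inode size/ { gsub(/[ \t]/, "", $2); print $2 }'
}

# Usage (needs read access to the device, usually root):
#   tune2fs -l /dev/sda1 | inode_size
```

If this prints anything other than 128, the windows driver limitation described above would apply.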

But my point here is not how to access ext3 file systems from windows. I'm more interested in mdadm in linux. I think it's very easy to use and more straightforward than lvm, and it may be easier to understand for the beginner. Of course it lacks the full power of lvm. But I decided to use it also because the windows driver mentioned above has one more limitation: it can't access lvm extents. And one of the things I like most about mdadm is that you can create an array using an already existing file system without harm: it's not necessary to create a new file system on both disks (but be careful with the order in which you add the disks).

Well, here's my cheat sheet, which btw is based roughly on this wiki. There are a lot of software raid tutorials and howtos out there, many of them probably much better and more detailed than this one, but I deliberately want to keep it short.

· Build the raid:
    You can define the raid array by just using one of the partitions (you can add the other one later):
    mdadm --create /dev/md1 --level=1 --raid-disks=2 missing /dev/sda1

    · Add a disk to the disk array

      Successive partitions can be added any time like this:
      mdadm --add /dev/md1 /dev/sdb1

      · Check out status:
        You can check out the status of your array any time by reading the file /proc/mdstat. Here's an example of the file content:

        # cat /proc/mdstat
        Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
        md1 : active raid1 sdb1[0] sda1[1]
              104320 blocks [2/2] [UU]
        unused devices: <none>
         
        where, as you can see, we have one raid1 array (md1) consisting of 2 partitions, /dev/sda1 and /dev/sdb1, and the array status is ok ([2/2] [UU] means both members are up). Of course, both partitions in a raid1 array should be equal in size, otherwise the array is limited to the size of the smaller one (check out sfdisk to clone partition tables).
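The [UU] / [_U] member map in that output is easy to check from a script. The following is a rough sketch, not a replacement for mdadm --monitor; the function name md_status is an assumption of this example, and it only looks at the last field of each "blocks" line.

```shell
#!/bin/sh
# Hypothetical helper: print "<array> ok" or "<array> degraded" for every
# array listed in an mdstat-style file (default: /proc/mdstat).
md_status() {
    awk '
        /^md/ { name = $1 }                 # e.g. "md1 : active raid1 ..."
        /blocks/ && name != "" {
            # the last field is the member map, e.g. [UU] or [_U]
            state = ($NF ~ /_/) ? "degraded" : "ok"
            print name, state
            name = ""
        }
    ' "${1:-/proc/mdstat}"
}

# Usage:
#   md_status            # reads /proc/mdstat
#   md_status saved.txt  # reads a saved copy
```

An underscore in the map marks a missing or failed member, so any "_" is reported as degraded.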

        · Mount the array
        The raid array can be mounted just like any other formatted partition, but now the partition is something like /dev/md1 or /dev/md/d1 or so, instead of something like /dev/sdc3. So mount it like:

        mount -t ext3 /dev/md1 /mnt/myraid

        And, if you like, add it to your /etc/fstab. In this regard, though, I had a problem with Ubuntu 10.10: it seems the device wasn't available yet when init ran the mounting script, so I took it out of /etc/fstab and mounted the device from the /etc/init.d/mdadm script that ships with the mdadm package (at least in Debian-like distros).
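If the problem really is that the device node shows up late, one possible workaround is to wait for it before mounting. This is a sketch under that assumption only; wait_for_path is a made-up helper, and it tests for mere existence (-e) rather than for a block device.

```shell
#!/bin/sh
# Hypothetical helper: wait up to $2 seconds (default 10) for path $1 to
# exist. Returns 0 as soon as it shows up, non-zero on timeout.
wait_for_path() {
    path=$1
    tries=${2:-10}
    while [ "$tries" -gt 0 ]; do
        [ -e "$path" ] && return 0
        sleep 1
        tries=$((tries - 1))
    done
    [ -e "$path" ]
}

# Usage, e.g. near the end of an init script:
#   wait_for_path /dev/md1 15 && mount /mnt/myraid
```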

        · Raid info:
        Run:
        mdadm --detail --scan
        or
        mdadm --examine --scan
        or, when you already know which array you want to get info about (e.g. size):
        mdadm --query /dev/md1
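The scan output can also be used to query every array in one go. A small sketch, assuming the scan prints one line per array starting with the word ARRAY followed by the device name; array_devices is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: extract the device names from the output of
# `mdadm --detail --scan` (or `mdadm --examine --scan`) given on stdin.
array_devices() {
    awk '$1 == "ARRAY" { print $2 }'
}

# Usage (as root):
#   mdadm --detail --scan | array_devices | while read -r dev; do
#       mdadm --query "$dev"
#   done
```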

        · Start the array:
        With commands like the ones listed above you may find out that, surprisingly, your array is not active; you can activate it with:
        mdadm -R /dev/md1
        But usually it will be necessary to reassemble the array.


        · Reassemble / restart the array:

        mdadm -Ac partitions -m 1 /dev/md1

        If you check out /proc/mdstat now for array status, you may see something like this:

        md_d1 : active raid1 sdb1[2] sda1[1]
        104320 blocks [2/1] [_U]
        [=========>...........] recovery = 45.0% (47872/104320) finish=0.0min speed=15957K/sec

        unused devices: <none>



        This is useful if you've run into trouble with the array.
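To follow a rebuild without staring at the whole file, the percentage can be pulled out of the recovery line shown above. A minimal sketch; recovery_progress is a name invented here, and it simply prints any field ending in "%" on a recovery or resync line.

```shell
#!/bin/sh
# Hypothetical helper: print the progress percentage of a running
# recovery/resync from an mdstat-style file (default: /proc/mdstat).
# Prints nothing when no rebuild is in progress.
recovery_progress() {
    awk '/recovery|resync/ {
        for (i = 1; i <= NF; i++)
            if ($i ~ /%$/) print $i
    }' "${1:-/proc/mdstat}"
}

# Usage:
#   while sleep 5; do recovery_progress; done
```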

        · Disassemble, modify and reassemble:
        Let's assume that, for some reason, you want to access one of the partitions of the array separately, make some changes, and have these changes persist after you reassemble the array. Here's the only way I found to do that, if e.g. we want to make changes in partition /dev/sda1:
        1. mark the other disk (/dev/sdb1) as failed, which detaches it from the array:
        mdadm /dev/md1 --fail /dev/sdb1
        2. stop the array:
        mdadm -S /dev/md1
        3. mount the partition you want to change (/dev/sda1) outside the array
        4. add or remove files on the mounted /dev/sda1
        5. umount /dev/sda1
        6. reassemble the raid array:
        mdadm -Ac partitions -m 1 /dev/md1
        7. re-add the disk previously marked as failed (/dev/sdb1) to the array:
        mdadm --re-add /dev/md1 /dev/sdb1
        8. watch the array resync in /proc/mdstat
        9. mount the disk array again
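The steps above can be sketched as two shell functions. This is an illustration only: the mount point /mnt/tmp, the variable names and the run wrapper are assumptions of this example, and by default it just prints the commands instead of executing them.

```shell
#!/bin/sh
# Sketch of the numbered steps. All names below are placeholders.
ARRAY=/dev/md1
KEEP=/dev/sda1      # partition to modify outside the array
OTHER=/dev/sdb1     # partition that gets failed and re-added
MNT=/mnt/tmp        # hypothetical temporary mount point
DRY_RUN=1           # 1 = only print the commands, 0 = really run them

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Steps 1-3. Note: the array itself has to be unmounted before it
# can be stopped.
detach_and_mount() {
    run mdadm "$ARRAY" --fail "$OTHER"
    run mdadm -S "$ARRAY"
    run mount "$KEEP" "$MNT"
}

# Steps 5-8 (step 4, changing files under $MNT, happens in between).
reassemble() {
    run umount "$MNT"
    run mdadm -Ac partitions -m 1 "$ARRAY"
    run mdadm --re-add "$ARRAY" "$OTHER"
    run cat /proc/mdstat
}
```

Calling detach_and_mount, editing the files, and then calling reassemble follows the order of the list; step 9 (mounting the array again) is left to the caller.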


        I put that into the /etc/init.d/mdadm script in order to be able to access one of the partitions from the windows installation I mentioned in the introduction, make modifications to the file system, and make them persistent after reassembling the array. I append it here, just in case somebody finds it interesting (or for my own later retrieval). It's just the file delivered with the package with a few modifications: the "my variables" block, the stop_my_array and start_my_array functions, and the calls to them.

        #!/bin/sh
        #
        # Start the MD monitor daemon for all active MD arrays if desired.
        #
        # Copyright © 2001-2005 Mario Jou/3en

        # Copyright © 2005-2008 Martin F. Krafft

        # Distributable under the terms of the GNU GPL version 2.
        #
        ### BEGIN INIT INFO
        # Provides: mdadm
        # Required-Start: checkroot
        # Required-Stop: umountroot
        # Should-Start: module-init-tools
        # Default-Start: S
        # Default-Stop: 0 6
        # Short-Description: MD monitoring daemon
        # Description: mdadm provides a monitor mode, in which it will scan for
        # problems with the MD devices. If a problem is found, the
        # administrator is alerted via email, or a custom script is
        # run.
        ### END INIT INFO
        #
        set -eu

        MDADM=/sbin/mdadm
        RUNDIR=/var/run/mdadm
        PIDFILE=$RUNDIR/monitor.pid
        DEBIANCONFIG=/etc/default/mdadm

        # my variables
        ARRAY=/dev/md1
        DISK=/dev/sdb1
        RAIDMOUNT=/mnt/raid
        MOUNT=/bin/mount
        UMOUNT=/bin/umount

        test -x "$MDADM" || exit 0

        test -f /proc/mdstat || exit 0

        START_DAEMON=true
        test -f $DEBIANCONFIG && . $DEBIANCONFIG

        . /lib/lsb/init-functions

        stop_my_array () {
            $MDADM $ARRAY --fail $DISK
            $UMOUNT $RAIDMOUNT
            $MDADM -S $ARRAY
        }

        start_my_array () {
            $MDADM -A -m1 $ARRAY
            $MDADM --re-add $ARRAY $DISK
            $MOUNT $RAIDMOUNT
        }


        is_true()
        {
           case "${1:-}" in
           [Yy]es|[Yy]|1|[Tt]|[Tt]rue) return 0;;
           *) return 1;
           esac
        }

        case "${1:-}" in
            start)
                if is_true $START_DAEMON; then
                    log_daemon_msg "Starting MD monitoring service" "mdadm --monitor"
                    mkdir -p $RUNDIR
                    set +e
                    start-stop-daemon -S -p $PIDFILE -x $MDADM -- \
                        --monitor --pid-file $PIDFILE --daemonise --scan ${DAEMON_OPTIONS:-}
                    log_end_msg $?
                    set -e
                    start_my_array
                fi
                ;;
            stop)
                if [ -f $PIDFILE ] ; then
                    log_daemon_msg "Stopping MD monitoring service" "mdadm --monitor"
                    stop_my_array
                    set +e
                    start-stop-daemon -K -p $PIDFILE -x $MDADM
                    rm -f $PIDFILE
                    log_end_msg $?
                    set -e
                fi
                ;;
            restart|reload|force-reload)
                ${0:-} stop
                ${0:-} start
                ;;
            *)
                echo "Usage: ${0:-} {start|stop|restart|reload|force-reload}" >&2
                exit 1
                ;;
        esac

        exit 0

        Thursday, March 3, 2011

        translate atax:xx to actual disk (and partition?)

        What if we're getting a message like:

        ata4.01: error: { UNC }

        or

        ata4.01: failed command: READ DMA

        one day in our messages or via dmesg? Well, maybe it's a bug somewhere in the kernel or in a kernel module, but most probably something's wrong with one of your hard drives. Let's assume the latter, and anyway it's a good idea to check it out. If we've got several drives, which one is it?
        That's what happened to me recently and it took me quite a while to find out which disk and partition the messages were referring to. I'm sure there are some great scripts out there to help us out in such a situation, but I didn't find them.

        Well, the steps to find that out can be summarized as:
        1. find out what the boot messages say about it, something like:

        # dmesg |grep ata4.01|head
        [ 0.962250] ata4.01: ATA-7: ST3320620AS, 3.AAK, max UDMA/133
        (...)

        2. cross-check that with the output from lshw -businfo, e.g.
        # lshw -businfo|grep ST3320620AS
        scsi@3:0.1.0 /dev/sdd disk 320GB ST3320620AS

        (nice option btw this -businfo from lshw)
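Steps 1 and 2 can be chained so the model string does not have to be copied by hand. A rough sketch, assuming the boot message has the "ataN.NN: ATA-x: MODEL, ..." shape shown above; ata_model is a made-up helper name.

```shell
#!/bin/sh
# Hypothetical helper: read kernel messages on stdin and print the drive
# model announced for the given ata port (e.g. ata4.01).
ata_model() {
    sed -n "s/.*$1: ATA-[0-9]*: \([^,]*\),.*/\1/p" | head -n 1
}

# Usage (lshw usually needs root):
#   model=$(dmesg | ata_model ata4.01)
#   lshw -businfo | grep "$model"
```

If two drives of the same model are installed, the grep will list both, so this only narrows the search down.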
        There it is, we already got the disk! Now, which partition is it that we're having trouble with?
        Actually, the messages listed above don't refer to partitions at all, but they target the whole disk: ata4.01 is the same as scsi notation 3:0:1:0, which is again (in my system, not everywhere) the same as /dev/sdd.
        This can also be seen with the tool lsscsi:

        # lsscsi
        [0:0:0:0] disk ATA Maxtor 6Y120L0 YAR4 /dev/sda
        [0:0:1:0] cd/dvd TSSTcorp CD/DVDW SH-S162A TS01 /dev/sr0
        [2:0:0:0] disk ATA ST3250824AS 3.AA /dev/sdb
        [3:0:0:0] disk ATA ST3250310AS 3.AA /dev/sdc
        [3:0:1:0] disk ATA ST3320620AS 3.AA /dev/sdd

        as long as we agree (I'm not sure if this is correct) that we can identify:
        ata1 -> 0:0:0:0
        ata2 -> 1:0:0:0
        ata3 -> 2:0:0:0
        and e.g.
        ata2.01 -> 1:0:1:0
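Written as code, that guessed mapping looks like the sketch below. It encodes nothing more than the assumption stated above (ataN.MM corresponds to N-1:0:MM:0), which held on my system and may well be wrong elsewhere; ata_to_scsi is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical helper: translate an ata port name (ata4.01, ata2, ...)
# into host:channel:id:lun notation, ASSUMING ataN.MM -> (N-1):0:MM:0.
ata_to_scsi() {
    port=${1#ata}            # "4.01"
    host=${port%%.*}         # "4"
    case $port in
        *.*) dev=${port#*.} ;;   # "01"
        *)   dev=0 ;;
    esac
    # strip leading zeros so "01" becomes "1"
    dev=$(echo "$dev" | sed 's/^0*//')
    dev=${dev:-0}
    echo "$((host - 1)):0:$dev:0"
}

# Usage:
#   lsscsi | grep "$(ata_to_scsi ata4.01)"
```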

        So, if we want to know which partition is having trouble, we need to check out something else in the messages, and look for lines like:
        [24649.946461] Buffer I/O error on device sdd1, logical block 280748933
        or
        [743.150286] lost page write due to I/O error on sdd1
        or maybe
        [743.232014] JBD: Detected IO errors while flushing file data on sdd1
        which finally tells us what partition should be a concern to us.
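Picking those partition names out of a long log can be automated. A minimal sketch, assuming the messages look like the three examples above and the devices are named sdXN; io_error_partitions is an invented name, and grep -o is a GNU/BSD extension.

```shell
#!/bin/sh
# Hypothetical helper: read kernel messages on stdin and list the sdXN
# partitions mentioned in I/O error lines, one per line, without repeats.
io_error_partitions() {
    grep -E 'I/O error|IO errors' | grep -oE 'sd[a-z]+[0-9]+' | sort -u
}

# Usage:
#   dmesg | io_error_partitions
```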

        It still remains to be determined whether it's a hardware failure or just a corrupt file system (at this point I can't tell). What to do now?
        •  Try to access your data and back it up
        •  Run a file system check. If it's an ext3 or ext4 file system, maybe run an e2fsck -c /dev/sdd1 to check for bad blocks (or, for reiserfs, reiserfsck --rebuild-tree, but be careful with that and read the man page first). Take your time, it's pretty slow.
        But again, first try to access and back up your data.