Thursday, March 3, 2011

Translate ataX.XX to the actual disk (and partition?)

What if we're getting a message like:

ata4.01: error: { UNC }

or

ata4.01: failed command: READ DMA

one day in our messages or via dmesg? Well, maybe it's a bug somewhere in the kernel or in a kernel module, but most probably something's wrong with one of our hard drives. Let's assume the latter; it's a good idea to check it out anyway. If we've got several drives, which one is it?
That's what happened to me recently, and it took me quite a while to find out which disk and partition the messages were referring to. I'm sure there are some great scripts out there to help in such a situation, but I didn't find them.

Well, the steps to find that out can be summarized as:
1. Look in the boot messages for what the kernel said about the device, something like:

# dmesg |grep ata4.01|head
[ 0.962250] ata4.01: ATA-7: ST3320620AS, 3.AAK, max UDMA/133
(...)

2. Cross-check that against the output of lshw -businfo, e.g.
# lshw -businfo|grep ST3320620AS
scsi@3:0.1.0 /dev/sdd disk 320GB ST3320620AS

(A nice option, by the way, this -businfo from lshw.)
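Steps 1 and 2 can be chained so the model string does not have to be copied by hand. The following is only a minimal sketch: ata_model is a hypothetical helper name, and it assumes the boot line has the "ataN.NN: ATA-<n>: MODEL, ..." shape shown above (ATAPI devices such as CD drives are reported differently and will not match).

```shell
#!/bin/sh
# Hypothetical helper (not a standard tool): read kernel log lines on stdin
# and print the model string reported for the ATA device named in $1.
# Assumption: the line looks like
#   "ata4.01: ATA-7: ST3320620AS, 3.AAK, max UDMA/133"
ata_model() {
    # Escape the dot in e.g. "ata4.01" so sed treats it literally.
    name=$(printf '%s' "$1" | sed 's/\./\\./g')
    # Keep the text between "ATA-<n>: " and the first comma; first match only.
    sed -n "s/.*${name}: ATA-[0-9]*: \([^,]*\),.*/\1/p" | head -n 1
}
```

It would be used as lshw -businfo | grep "$(dmesg | ata_model ata4.01)". If the boot messages have already rotated out of the dmesg buffer, the helper prints nothing, and if two drives share the same model, lshw lists both and the question stays open.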
There it is, we've already got the disk! Now, which partition is it that we're having trouble with?
Actually, the messages listed above don't refer to partitions at all; they target the whole disk: ata4.01 corresponds to 3:0:1:0 in SCSI notation, which in turn (on my system, not everywhere) is /dev/sdd.
This can also be seen with the tool lsscsi:

# lsscsi
[0:0:0:0] disk ATA Maxtor 6Y120L0 YAR4 /dev/sda
[0:0:1:0] cd/dvd TSSTcorp CD/DVDW SH-S162A TS01 /dev/sr0
[2:0:0:0] disk ATA ST3250824AS 3.AA /dev/sdb
[3:0:0:0] disk ATA ST3250310AS 3.AA /dev/sdc
[3:0:1:0] disk ATA ST3320620AS 3.AA /dev/sdd

as long as we agree (I'm not sure this is always correct) that we can map:
ata1 -> 0:0:0:0
ata2 -> 1:0:0:0
ata3 -> 2:0:0:0
and e.g.
ata2.01 -> 1:0:1:0
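Written out as a small function, the guessed rule looks like this. It is a sketch of the assumption above and nothing more: ata_to_scsi is a hypothetical name, and the rule "port number minus one, device number as the third field" is what I observed on my system, not something the kernel guarantees.

```shell
#!/bin/sh
# Hypothetical helper: apply the guessed rule  ataN.DD -> (N-1):0:DD:0.
# Assumption only: it matches my machine; check the result against lsscsi.
ata_to_scsi() {
    name=${1#ata}              # "ata4.01" -> "4.01"
    port=${name%%.*}           # "4"
    case $name in
        *.*) dev=${name#*.} ;; # "01"
        *)   dev=0 ;;          # plain "ata4" has no device suffix
    esac
    # Strip leading zeros so "01" becomes "1"; an all-zero suffix becomes "0".
    dev=$(printf '%s' "$dev" | sed 's/^0*//')
    dev=${dev:-0}
    echo "$((port - 1)):0:${dev}:0"
}
```

For example, ata_to_scsi ata4.01 prints 3:0:1:0, which can then be looked up with lsscsi | grep '3:0:1:0'.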

So, if we want to know which partition is having trouble, we need to check elsewhere in the messages and look for lines like:
[24649.946461] Buffer I/O error on device sdd1, logical block 280748933
or
[743.150286] lost page write due to I/O error on sdd1
or maybe
[743.232014] JBD: Detected IO errors while flushing file data on sdd1
which finally tell us which partition we should be concerned about.
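Picking those lines out by eye gets tedious in a long log, so here is a rough filter for it. It is only a sketch: failing_partitions is a hypothetical name, it recognizes just the "I/O error" and "IO errors" wording quoted above, and it assumes partition names of the sdXN form.

```shell
#!/bin/sh
# Hypothetical helper: read kernel log lines on stdin and print each
# partition name (assumed to look like sdd1) that appears in an I/O error line.
failing_partitions() {
    grep -E 'I/O error|IO errors' |   # keep only the error lines
        grep -oE 'sd[a-z]+[0-9]+' |   # extract names such as sdd1
        sort -u                       # list each partition once
}
```

Run it as dmesg | failing_partitions. Other kernels and file systems word their errors differently, so an empty result does not prove that all is well.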

It still remains to determine whether it's a hardware failure or just a corrupt file system (at this point, I can't tell). What to do now?
  •  Try to access your data and back it up
  •  Run a file system check on the unmounted partition. If it's an ext3 or ext4 file system, maybe run e2fsck -c /dev/sdd1 to check for bad blocks as well. For ReiserFS, start with reiserfsck --check; --rebuild-tree is a repair step, so be careful with that and read the man page first. Take your time, it's pretty slow.
But again, first try to access and back up your data.
