
From: Dan Cross <cross@mat...>
Subject: Re: [9fans] Oh....Hell.  File server problems.
Date: Fri, 27 Apr 2001 10:14:53 -0400 (EDT)

In article <20010427070646.E8BC2199C1@mai...> you write:
>>>I seem to have done a bad thing; my file server thinks that its dump
>>>disk (pseudo-worm) is full, even though it's really not (uhh, don't ask).
>>>Now, every time I try and boot the file server, it panics.  I don't care
>
>don't ask?  knowing what the configuration was and what went wrong might
>allow recovery.  depending on what you did it's possible the data is still
>there.

Well, it's embarrassing.  :-)  The FS is using Eric Dorman's patches for
IDE disks, and the pseudo-worm lives on a 10GB IDE disk.  Cache lives on
a 9GB SCSI disk.  The config is as straightforward as can be; the entire
IDE disk is devoted to the pseudo-worm (no partitions, no nothing), and
the entire SCSI disk to cache.

The problem is that there was a very small bug in the IDE FS code wherein
size calculations for disks larger than ~4GB would overflow, leaving the
file server to believe that it had significantly less space available than
it really did.  A patch was sent out to 9fans for it a few months ago
(sorry, I don't remember who wrote the patch!), but I never applied it.
Hence, my FS thought that the dump disk was somewhere on the order of 2GB
instead of 10.  Whoops.  (See?  I said it was embarrassing....  :-)
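
A minimal sketch of how this kind of overflow can arise, assuming the
size in bytes is computed from a sector count in 32-bit arithmetic.  The
actual IDE FS code is not shown here, so the 512-byte sector size and the
names disk_bytes_32 and disk_bytes_64 are assumptions made purely for
illustration:

```c
#include <stdint.h>

enum { SECTOR_SIZE = 512 };	/* assumed sector size in bytes */

/*
 * Overflowing version: the product is formed in 32 bits, so any
 * result of 4GiB or more wraps around modulo 2^32.
 */
uint32_t
disk_bytes_32(uint32_t nsectors)
{
	return nsectors * (uint32_t)SECTOR_SIZE;
}

/*
 * Widened version: convert to 64 bits before multiplying so the
 * full byte count is preserved.
 */
uint64_t
disk_bytes_64(uint32_t nsectors)
{
	return (uint64_t)nsectors * SECTOR_SIZE;
}
```

Under these assumptions a disk of 20971520 sectors (10GiB) comes out as
2147483648 bytes (2GiB) from disk_bytes_32, while disk_bytes_64 gives the
full 10737418240.  Whether the real bug had exactly this form is not
something the description above settles.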

Anyway, I got Eric's patches again, and the patch to the patch, built
another file server kernel (from my stand-alone laptop) and tried
rebooting the file server with that.  This time, the file server
panicked on boot after not being able to find its superblock.  When I
switched the kernels back and rebooted, it came up, but a few files
were giving me ``phase error--cannot happen'' diagnostics when I tried
to cat or otherwise read them.  I was going around trying to remove all
these so I could get a snapshot of the filesystem when the thing
crashed the last time, refusing to come up after that.  It occurred to
me that I should have just tried to tar the latest dump, which seemed
to be unaffected.

I have no reason to believe that the data itself has been affected;
it seems to be more a metadata issue.  :-(

>have you tried the recover command in config mode, or doesn't it get even
>that far?

I have tried the recover command, and the machine indeed comes up into
config mode, but as soon as I try to ``end'' to make the recover happen,
the machine panics with ``panic: worm rbounds xxxx'', where xxxx is the
size of what the FS thinks the worm is, which is greater than it thinks
that it *can* be.
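
The shape of a check that would produce this sort of panic can be
sketched as follows.  This is a hypothetical illustration only, not the
real file server code; the function name worm_size_ok and its parameters
are invented for the example:

```c
#include <stdint.h>

/*
 * Hypothetical sanity check: the worm size recorded in the file
 * system's metadata must fit within the size the kernel computes
 * for the underlying device.
 * Returns 1 if the sizes are consistent, 0 if the recorded size
 * is out of range.
 */
int
worm_size_ok(uint64_t recorded_blocks, uint64_t device_blocks)
{
	return recorded_blocks <= device_blocks;
}
```

A kernel would presumably panic where this sketch returns 0.  If two
kernels disagree about the size of the device, metadata written under
one can fail a check like this under the other.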

It's interesting, and perhaps a little scary, to notice how the file server
deals with the worm when it gets full.  I've noticed that it will return
a diagnostic to the user (``file system full'') and continue working okay
for a few seconds after that, but then freeze; even a ``halt'' on the
console is ineffective.  Yikes!

	- Dan C.