Jerry Jelinek's blog


Month: April 2005

The latest release of Solaris Express came out yesterday.
As usual, a summary is on Dan’s blog.

One new Solaris Volume Manager (SVM) capability in this release
is better integration with the Solaris
Reconfiguration Coordination Manager (RCM).
This is the fix for bug:

4927518 SVM could be more intelligent when components are being unconfigured

SVM has had some RCM integration since Solaris 9 update 2. Starting
in that release, if you attempted to Dynamically Reconfigure (DR) out
a disk that was in use by SVM, you would get a nice message explaining
how SVM was using the disk. However, this was really only the
bare minimum of what we could do to integrate with the RCM
framework. The problem that is identified by bug 4927518 is that
if the disk has died you should just be able to DR it out so you can
replace it. Up to now what would happen is you would get the message
explaining that the disk was part of an SVM configuration. You had to
manually unconfigure the SVM metadevice before you could DR out the disk.
For example, if the disk was on one side of a mirror, you would have had
to detach the submirror, delete it, DR out the disk, DR in the
new one, recreate the submirror and reattach it to the mirror. Then
you would have incurred a complete resync of the newly attached
submirror. Obviously this is not very clean.
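The old sequence looked roughly like this, sketched with hypothetical names (mirror d0, submirrors d1/d2, failed disk c1t1d0 on controller c1); the exact cfgadm attachment-point syntax depends on your hardware:

```shell
metadetach -f d0 d2                    # force-detach the submirror on the dead disk
metaclear d2                           # delete the submirror metadevice
cfgadm -c unconfigure c1::dsk/c1t1d0   # DR out the dead disk
cfgadm -c configure c1::dsk/c1t1d0     # DR in the replacement
metainit d2 1 1 c1t1d0s0               # recreate the submirror
metattach d0 d2                        # reattach; triggers a full resync
```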

With the latest Solaris Express release SVM will clean up the system’s
internal state for the dead disk so that you can just DR it out
without detaching and deleting the submirror. Once
you DR in a new disk you can just enable it into the mirror and
only that disk will be resynced, not the whole submirror. This is
a big improvement for the manageability of SVM. There are more
integration improvements like this coming for SVM. In another
blog entry I’ll also try to highlight some of the areas of integration
we already have with SVM and other Solaris features.

I have had a few questions about some of my previous blog
entries so I thought I would try to write up some general responses
to the various questions.

First, here are the bugids for some of the bugs I mentioned
in my previous posts:

6215065 Booting off single disk from mirrored root pair causes panic reset

This is the UFS logging bug that you can hit when you don’t have
metadb quorum.

6236382 boot should enter single-user when there is no mddb quorum

This one is pretty self-explanatory. The system will boot all
the way to multi-user, but root remains read-only. To recover
from this, delete the metadb on the dead disk and reboot so
that you have metadb quorum.
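That recovery can be sketched as follows, assuming the dead disk is c1t1d0 with its replicas on slice 7 (hypothetical names):

```shell
metadb -i         # inspect replica status; replicas on the dead disk show error flags
metadb -d c1t1d0s7   # delete the replicas on the dead disk
reboot            # boot again, this time with metadb quorum
```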

6250010 Cannot boot root mirror

This is the bug where the V20z and V40z don’t use the altbootpath
to failover when the primary boot disk is dead.

There was a question about what to do if you get into the infinite
panic-reboot cycle. This would happen if you were hitting bug 6215065.
To recover from this, you need to boot off of some other media
so that you can cleanup. You could boot off of a Solaris netinstall image
or the Solaris install CD-ROM, for example. Once you do that
you can mount the disk that is still ok and change the /etc/vfstab
entry so that the root filesystem is mounted without logging. Since
logging was originally in use, you want to make sure the log is rolled
and that UFS stops using the log. Here are some example commands
to do this:

# mount -o nologging /dev/dsk/c1t0d0s0 /a

# [edit /a/etc/vfstab; change the last field for / to “nologging”]

# umount /a

# reboot

By mounting this way UFS should roll the existing log and then mark
the filesystem so that logging won’t be used for the current mount on /a.
This kind of recovery procedure where you boot off of the install image
should only be used when one side of the root mirror is already dead.
If you were to use this approach for other kinds of recovery you
would leave the mirror in an inconsistent state since only one side
was actually modified.

Here is a general procedure for accessing the root mirror when
you boot off of the install image. An example where you might use
this procedure is if you forgot the root password and wanted to
mount the mirror so you could clear the field in /etc/shadow.

First, you need to boot off the install image and get a copy of
the md.conf file from the root filesystem.
Mount one of the underlying root mirror disks read-only to get a copy.

# mount -o ro /dev/dsk/c0t0d0s0 /a

# cp /a/kernel/drv/md.conf /kernel/drv/md.conf

# umount /a

Now update the SVM driver to load the configuration.

# update_drv -f md

[ignore any warning messages printed by update_drv]

# metainit -r

If you have mirrors you should run metasync
to get them synced.

Your SVM metadevices should now be accessible and you should
be able to mount them and perform whatever recovery you need.

# mount /dev/md/dsk/d10 /a

We are also getting this simple procedure into our docs so that
it is easier to find.

Finally, there was another comment about using a USB memory disk
to hold a copy of the metadb so that quorum would be maintained even
if one of the disks in the root mirror died.

This is something that has come up internally in the past but nobody on our
team had actually tried this so I went out this weekend and bought a
USB memory disk to see if I could make this work.
It turns out this worked fine for me, but there are a lot of variables
so your mileage may vary.

Here is what worked for me. I got a 128MB memory disk since this
was the cheapest one they sold and you don’t need much space for a
copy of the metadb (only about 5MB is required for one copy).
First I used the “rmformat -l” command to figure out the name of the disk.
The disk came formatted with a pcfs filesystem already on it and
a single fdisk partition of type DOS-BIG. I used the fdisk(1M)
command to delete that partition and create a single Solaris fdisk
partition for the whole memory disk.
After that I just put a metadb on the disk.

# metadb -a /dev/dsk/c6t0d0s2
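The formatting steps above can be sketched as follows; the device name is from my system, and rmformat will report something different on yours:

```shell
rmformat -l                # list removable devices; note the USB disk's name
fdisk /dev/rdsk/c6t0d0p0   # interactively delete the DOS-BIG partition and
                           # create one Solaris partition covering the disk
```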

Once I did all of this, I could reboot with one of the disks removed
from my root mirror and I still had metadb quorum since I had a 3rd
copy of the metadb available.

There are several caveats here. First, I was running this on
a current nightly build of Solaris. I haven’t tried it yet on the
shipping Solaris 10 bits but I think this will probably work. Going
back to the earlier S9 bits I would be less certain since a lot of
work went into the USB code for S10. The main thing here is that
the system has to see the USB disk at the time the SVM driver is reading
the metadbs. This happens fairly early in the boot sequence. If
we don’t see the USB disk, then that metadb replica will be marked
in error and it won’t help maintain quorum.

The second thing to watch out for is that SVM keeps track of mirror
resync regions in some of the copies of the metadbs. This is used
for the optimized resync feature that SVM supports. Currently there
is no good way to see which metadbs SVM is using for this and there
is no way to control which metadbs will be used. You wouldn’t want
these writes going to the USB memory disk since it will probably
wear out faster and might be slower too. We need to improve this
in order for the USB memory disk to really be a production-quality option.

Another issue to watch out for is whether your hardware supports the
memory disk. I tried this on an older x86 box and it couldn’t see
the USB disk. I am pretty sure the older system did not support USB 2.0
which is what the disk I bought supports. This worked fine when I tried it
on a V65 and on a Dell 650, both of which are newer systems.

I need to do more work in this area before we could really recommend
this approach, but it might be a useful solution today for people
who need to get around the metadb quorum problem on a two disk
configuration. You would really have to play around to make sure
the configuration worked ok in your environment.
We are also working internally on some other solutions to this same problem.
Hopefully I can write more about that soon.

One of the problems I have been working on recently has to
do with running Solaris Volume Manager on
V20z and V40z servers. In general, SVM does not care what kinds of disks
it is layered on top of. It just passes the I/O requests through
to the drivers for the underlying storage. However, we were seeing
problems on the V20z when it was configured for root mirroring.

With root mirroring, a common test is to pull the primary boot disk
and reboot to verify that the system comes up properly on the other
side of the mirror. What was happening was that Solaris would start
to boot, then panic.

It turns out that we were hitting a limitation in the boot support for
disks that are connected to the system with an mpt(7D) based HBA. The
problematic code exists in bootconf.exe within the Device Configuration
Assistant (DCA). The DCA is responsible for loading and starting the
Solaris kernel.
The problem is that the bootconf code was not failing over to the
altbootpath device path, so Solaris would
start to boot but then panic because the DCA was passing it the old
bootpath device path. With the primary boot disk removed from the
system, this was no longer the correct boot device path.
You can see if this limitation might impact you by using the “prtconf -D”
command and looking for the mpt driver.

We have some solutions for this limitation in the pipeline, but in
the meantime, there is an easy workaround for this. You need to edit
the /boot/solaris/bootenv.rc file and remove the entries for bootpath
and altbootpath. At this point, the DCA should
automatically detect the correct boot device path and pass it into the kernel.

There are a couple of limitations to this workaround. First, it only
works for S10 and later. In S9, it won’t automatically boot. Instead
it will enter the DCA and you will have to manually choose the boot
device. Also, it only works for systems where both disks in the root
mirror are attached
via the mpt HBA. This is a typical configuration for the V20z and V40z. We are working
on better solutions to this limitation, but hopefully this workaround
is useful in the meantime.
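The edit amounts to deleting two lines from the file. As a sketch, the commands below work on a copy rather than the real /boot/solaris/bootenv.rc, and the device paths are invented for illustration; on a real system, save a backup and edit the file itself:

```shell
# Create a sample bootenv.rc to demonstrate on (paths are made up).
cat > /tmp/bootenv.rc <<'EOF'
setprop bootpath '/pci@0,0/pci1022,7450@a/sd@0,0:a'
setprop altbootpath '/pci@0,0/pci1022,7450@a/sd@1,0:a'
setprop console 'ttya'
EOF
# Drop the bootpath and altbootpath entries so the DCA autodetects the boot device.
grep -v "^setprop bootpath " /tmp/bootenv.rc | \
    grep -v "^setprop altbootpath " > /tmp/bootenv.rc.new
cat /tmp/bootenv.rc.new
```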

Solaris Volume Manager has had support for a feature called
“disksets” for a long time, going back to when it was
an unbundled product named SDS. Disksets are a way to group
a collection of disks together for use by SVM. Originally
this was designed for sharing disks, and the metadevices
on top of them, between two or more hosts. For example,
disksets are used by SunCluster. However, having a way to
manage a set of disks for exclusive use by SVM simplifies administration
and with S10 we have made several enhancements to the diskset
feature which make it more useful, even on a single host.

If you have never used SVM disksets before, I’ll briefly summarize
a couple of the key differences from the more common use
of SVM metadevices outside of a diskset. First, with a diskset,
the whole disk is given over to SVM control. When you do this
SVM will repartition the disk and create a slice 7 for metadb
placement. SVM automatically manages metadbs in disksets so
you don’t have to worry about creating those. As disks are
added to the set, SVM will create new metadbs and automatically
rebalance them across the disks as needed. Another difference
with disksets is that hosts can also be assigned to the set.
The local host must be in the set when you
create it. Disksets implement the concept of a set owner
so you can release ownership of the set and a different host
can take ownership. Only the host that owns the set can access
the metadevices within the set. An exception to this is the
new multi-owner diskset which is a large topic on its own so
I’ll save that for another post.
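Creating and populating a set looks roughly like this; the set name, host name, and disk names here are all hypothetical:

```shell
metaset -s blue -a -h myhost       # create the set "blue" with the local host in it
metaset -s blue -a c2t1d0 c2t2d0   # add whole disks; SVM repartitions each one and
                                   # puts a metadb on the new slice 7 automatically
metainit -s blue d10 1 1 /dev/dsk/c2t1d0s0   # build a metadevice inside the set
```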

With S10 SVM has added several new features based around the
diskset. One of these is the metaimport(1M) command which
can be used to load a complete diskset configuration onto
a separate system. You might use this if your host died and
you physically moved all of your storage to a different machine. SVM
uses the disk device IDs to figure out how to put the configuration
together on the new machine. This is required since the
disk names themselves (e.g. c2t1d0, c3t5d0, …) will probably
be different on the new system. In S10, disks in a diskset
are self-identifying. What this means is that if you use
remote block replication software like HDS TrueCopy or the SNDR
feature of Sun’s StorEdge Availability Suite to do
remote replication you can still use metaimport(1M) to import
the remotely replicated disks, even though the device IDs
of the remote disks will be different.
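Usage is along these lines (a sketch; the disk and set names are hypothetical):

```shell
metaimport -r                 # scan attached disks and report importable disksets
metaimport -s newset c2t1d0   # import the set containing c2t1d0 as "newset"
```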

A second new feature is the metassist(1M) command. This command
automates the creation of metadevices so that you don’t have
to run all of the separate meta* commands individually. Metassist
has quite a lot of features which I won’t delve into
in this post, but you can read more in the docs.

The one idea I wanted to discuss is that metassist uses
disksets to implement the concept of a storage pool. Metassist
relies on the automatic management of metadbs that disksets
provide. Also, since the whole disk is now under control of
SVM, metassist can repartition the disk as necessary to automatically
create the appropriate metadevices within the pool. Metassist
will use the disks in the pool to create new metadevices and only
try to add additional disks to the pool if it can’t configure the new
metadevice using the available space on the existing disks in the pool.
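As a sketch, asking metassist for a new redundant volume looks something like this; the pool name is hypothetical and the option letters are from memory, so check metassist(1M) before relying on them:

```shell
# Request a 10GB volume with two-way redundancy in the pool "storagepool";
# metassist picks the disks, partitions them, and creates the metadevices.
metassist create -s storagepool -S 10gb -r 2
```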

When disksets were first implemented back in SDS they were intended
for use by multiple hosts. Since the metadevices in the set were
only accessible by the host that owned the set, the assumption was
that some other piece of software, outside of SDS, would manage the
diskset ownership. Since SVM is now making greater use of disksets
we added a new capability called “auto-take” which allows the local
host to automatically take ownership of the diskset during boot and thus have
access to the metadevices in the set. This means that you can
use vfstab entries to mount filesystems built on metadevices within
the set and those will “just work” during the system boot.
The metassist command relies on this feature and the storage pools (i.e. disksets)
it uses will all have auto-take enabled. Auto-take
can only be enabled for disksets which have the single, local host
in the set. If you have multiple hosts in the set then you are really
using the set in the traditional manner and you’ll need something
outside of SVM to manage which host owns the set (again, this ignores
the new multi-owner sets). You use the new “-A” option on the metaset
command to enable or disable auto-take.
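For example (a sketch, with a hypothetical set name):

```shell
metaset -s blue -A enable    # take the set automatically at boot
metaset -s blue -A disable   # revert to manual take/release
```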

Finally, since use of disksets is expected to be more common with these
new features, we added the “-a” option to the metastat command so that
you can see the status of metadevices in all disksets without having to
run a separate command for each set. We also added a “-c” option to
metastat which gives a compact, one-line output for each metadevice.
This is particularly useful as the number of metadevices in the
configuration increases. For example, on one of our local servers we have 93
metadevices. With the original, verbose metastat output this resulted in
842 lines of status, which makes it hard to see exactly what is going
on. The “-c” option reduces this down to 93 lines.
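For example, combining the two options gives compact status for every set; the output shown below is invented, but illustrates the one-line-per-metadevice shape:

```shell
metastat -ac      # compact, one-line status for metadevices in all disksets
# Illustrative output (invented):
# blue/d10   m    2.0GB   blue/d11 blue/d12
# blue/d11   s    2.0GB   c2t1d0s0
```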

This is just a brief overview of some of the new features we have
implemented around SVM disksets. There is lots more detail in
the docs.

The metaimport(1M) command was available starting in S9 9/04
and the metassist command was available starting in S9 6/04. However,
the remote replication feature for metaimport requires S10.

There are a couple of bugs that we found in S10
that make it look like Solaris Volume Manager root
mirroring does not work at all. Unfortunately
we found these bugs after the release went out.
These bugs will be patched but I wanted to describe
the problems a bit and describe some workarounds.

On a lot of systems that use SVM to do root mirroring
there are only two disks. When you set up the configuration
you put one or more metadbs on each disk to hold the
SVM configuration information.

SVM implements a metadb quorum rule which means that
during the system boot, if more than 50% of the metadbs
are not available, the system should boot into single-user
mode so that you can fix things up. You can read more
about this in the docs.

On a two disk system there is no way to set things
up so that more than 50% of the metadbs will be available
if one of the disks dies.
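The arithmetic behind this can be sketched directly: with two replicas on each of two disks, losing one disk leaves exactly half of the replicas, which does not satisfy the more-than-50% rule:

```shell
# Quorum requires strictly more than half of all metadb replicas.
total=4   # two disks, two replicas per disk
alive=2   # one disk dead: its two replicas are gone
if [ $((alive * 2)) -gt $total ]; then
    echo "quorum"
else
    echo "no quorum"
fi
```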

When SVM does not have metadb quorum during the boot it
is supposed to leave all of the metadevices read-only and
boot into single-user. This gives you a chance to confirm that
you are using the
right SVM configuration and that you don’t corrupt any of
your data before having a chance to cleanup the dead metadbs.

What a lot of people do when they set up a root
mirror is pull one of the disks to check if the system
will still boot and run ok. If you do this experiment
on a two disk configuration running S10 the system will
panic really early in the boot process and it will go
into an infinite panic/reboot cycle.

What is happening here is that we found a bug related
to UFS logging, which is on by default in S10. Since
the root mirror stays read-only because there is no
metadb quorum we hit a bug in the UFS log rolling code.
This in turn leaves UFS in a bad state which causes
the system to panic.

We’re testing the fix for this bug right now but in the
meantime, it is easy to workaround this bug by just
disabling logging on the root filesystem. You can do
that by specifying the “nologging” option in the last
field of the vfstab entry for root. You should reboot
once before doing any SVM experiments (like pulling a disk)
to ensure that UFS has rolled the log and is no longer
using logging on root.
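For a root metadevice named d0 (a hypothetical name), the vfstab entry would look something like this, with the mount-options field last on the line:

```
#device to mount  device to fsck   mount point  FS type  fsck pass  mount at boot  mount options
/dev/md/dsk/d0    /dev/md/rdsk/d0  /            ufs      1          no             nologging
```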

Once a patch for this bug is out you will definitely want to remove
this workaround from the vfstab entry since UFS logging
offers so many performance and availability benefits.

By the way, UFS logging is also on by default in the S9 9/04
release but that code does not suffer from this bug.

The second problem we found is not as serious as the UFS bug.
This has to do with an interaction with the
Service Management Facility (SMF),
which is new in S10; again, this is related to not
having metadb quorum during the boot. What should happen
is that the system should enter single-user so you can clean
up the dead metadbs. Instead it boots all the way to multi-user
but since the root device is still read-only things don’t work
very well. This turned out to be a missing dependency which
we didn’t catch when we integrated SVM and SMF. We’ll have
a patch for this too but this problem is much less serious.
You can still login as root and clean up the dead metadbs so that
you can then reboot with a good metadb quorum.

Both of the problems result because there is no metadb quorum
so the root metadevice remains read-only after a boot with
a dead disk. If you have a third disk which you can use to
add a metadb onto, then you can reduce the likelihood of hitting
this problem since losing one disk won’t cause you to lose
quorum during boot.

Given these kinds of problems, you might wonder why SVM
bothers to implement the metadb quorum at all. Why not just trust
the metadbs that are alive? SVM is conservative and always
chooses the path to ensure that you won’t lose
data or use stale data. There are various corner cases to
worry about when SVM cannot be sure it is using the most current
data. For example, in a two disk mirror configuration, you might run
for a while on the first disk with the second disk powered down. Later
you might reboot off the second disk (because the disk was
now powered up) and the first disk might now be powered down.
At this point you would be using the stale data on the mirror,
possibly without even realizing it. The metadb quorum rule
gives you a chance to intervene and fix up the configuration when
SVM cannot do it automatically.
