Jerry Jelinek's blog


I have had a few questions about some of my previous blog
entries so I thought I would try to write up some general responses
to the various questions.

First, here are the bugids for some of the bugs I mentioned
in my previous posts:

6215065 Booting off single disk from mirrored root pair causes panic reset

This is the UFS logging bug that you can hit when you don’t have
metadb quorum.

6236382 boot should enter single-user when there is no mddb quorum

This one is pretty self explanatory. The system will boot all
the way to multi-user, but root remains read-only. To recover
from this, delete the metadb on the dead disk and reboot so
that you have metadb quorum.
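As a rough sketch of that cleanup (the slice name here is just an example; use whatever slice holds the replicas on the dead disk), the steps look something like this:

# mount -o rw,remount /

# metadb -d c1t1d0s7

# reboot

The remount is needed because root stays read-only while there is no quorum.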

6250010 Cannot boot root mirror

This is the bug where the V20z and V40z don’t use the altbootpath
to failover when the primary boot disk is dead.

There was a question about what to do if you get into the infinite
panic-reboot cycle. This would happen if you were hitting bug 6215065.
To recover from this, you need to boot off of some other media
so that you can cleanup. You could boot off of a Solaris netinstall image
or the Solaris install CD-ROM, for example. Once you do that
you can mount the disk that is still ok and change the /etc/vfstab
entry so that the root filesystem is mounted without logging. Since
logging was originally in use, you want to make sure the log is rolled
and that UFS stops using the log. Here are some example commands
to do this:

# mount -o nologging /dev/dsk/c1t0d0s0 /a

# [edit /a/etc/vfstab; change the last field for / to “nologging”]

# umount /a

# reboot

By mounting this way UFS should roll the existing log and then mark
the filesystem so that logging won’t be used for the current mount on /a.
This kind of recovery procedure where you boot off of the install image
should only be used when one side of the root mirror is already dead.
If you were to use this approach for other kinds of recovery you
would leave the mirror in an inconsistent state since only one side
was actually modified.

Here is a general procedure for accessing the root mirror when
you boot off of the install image. An example where you might use
this procedure is if you forgot the root password and wanted to
mount the mirror so you could clear the field in /etc/shadow.

First, you need to boot off the install image and get a copy of
the md.conf file from the root filesystem.
Mount one of the underlying root mirror disks read-only to get a copy.

# mount -o ro /dev/dsk/c0t0d0s0 /a

# cp /a/kernel/drv/md.conf /kernel/drv/md.conf

# umount /a

Now update the SVM driver to load the configuration.

# update_drv -f md

[ignore any warning messages printed by update_drv]

# metainit -r

If you have mirrors you should run metasync
to get them synced.
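For example, if your root mirror is d10 (the name is just a placeholder for whatever your configuration uses):

# metasync d10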

Your SVM metadevices should now be accessible and you should
be able to mount them and perform whatever recovery you need.

# mount /dev/md/dsk/d10 /a

We are also getting this simple procedure into our docs so that
it is easier to find.

Finally, there was another comment about using a USB memory disk
to hold a copy of the metadb so that quorum would be maintained even
if one of the disks in the root mirror died.

This is something that has come up internally in the past but nobody on our
team had actually tried this so I went out this weekend and bought a
USB memory disk to see if I could make this work.
It turns out this worked fine for me, but there are a lot of variables
so your mileage may vary.

Here is what worked for me. I got a 128MB memory disk since this
was the cheapest one they sold and you don’t need much space for a
copy of the metadb (only about 5MB is required for one copy).
First I used the “rmformat -l” command to figure out the name of the disk.
The disk came formatted with a pcfs filesystem already on it and
a single fdisk partition of type DOS-BIG. I used the fdisk(1M)
command to delete that partition and create a single Solaris fdisk
partition for the whole memory disk.
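If you want to do that non-interactively, something like this should work (this is the device name from my setup; yours will differ):

# fdisk -B /dev/rdsk/c6t0d0p0

The -B option creates a single Solaris fdisk partition that spans the whole disk.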
After that I just put a metadb on the disk.

# metadb -a /dev/dsk/c6t0d0s2

Once I did all of this, I could reboot with one of the disks removed
from my root mirror and I still had metadb quorum since I had a 3rd
copy of the metadb available.

There are several caveats here. First, I was running this on
a current nightly build of Solaris. I haven’t tried it yet on the
shipping Solaris 10 bits but I think this will probably work. Going
back to the earlier S9 bits I would be less certain since a lot of
work went into the USB code for S10. The main thing here is that
the system has to see the USB disk at the time that the SVM driver is reading
the metadbs. This happens fairly early in the boot sequence. If
we don’t see the USB disk, then that metadb replica will be marked
in error and it won’t help maintain quorum.

The second thing to watch out for is that SVM keeps track of mirror
resync regions in some of the copies of the metadbs. This is used
for the optimized resync feature that SVM supports. Currently there
is no good way to see which metadbs SVM is using for this and there
is no way to control which metadbs will be used. You wouldn’t want
these writes going to the USB memory disk since it will probably
wear out faster and might be slower too. We need to improve this
in order for the USB memory disk to really be a production quality
solution.

Another issue to watch out for is whether your hardware supports the
memory disk. I tried this on an older x86 box and it couldn’t see
the USB disk. I am pretty sure the older system did not support USB 2.0
which is what the disk I bought supports. This worked fine when I tried it
on a V65 and on a Dell 650, both of which are newer systems.

I need to do more work in this area before we could really recommend
this approach, but it might be a useful solution today for people
who need to get around the metadb quorum problem on a two disk
configuration. You would really have to play around to make sure
the configuration worked ok in your environment.
We are also working internally on some other solutions to this same problem.
Hopefully I can write more about that soon.

One of the problems I have been working on recently has to
do with running Solaris Volume Manager on V20z and V40z servers.
In general, SVM does not care what kinds of disks
it is layered on top of. It just passes the I/O requests through
to the drivers for the underlying storage. However, we were seeing
problems on the V20z when it was configured for root mirroring.

With root mirroring, a common test is to pull the primary boot disk
and reboot to verify that the system comes up properly on the other
side of the mirror. What was happening was that Solaris would start
to boot, then panic.

It turns out that we were hitting a limitation in the boot support for
disks that are connected to the system with an mpt(7D) based HBA. The
problematic code exists in bootconf.exe within the Device Configuration
Assistant (DCA). The DCA is responsible for loading and starting the
Solaris kernel.
The problem is that the bootconf code was not failing over to the
altbootpath device path, so Solaris would
start to boot but then panic because the DCA was passing it the old
bootpath device path. With the primary boot disk removed from the
system, this was no longer the correct boot device path.
You can see if this limitation might impact you by using the “prtconf -D”
command and looking for the mpt driver.
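For example, a quick check is:

# prtconf -D | grep mpt

If nothing comes back, your boot disks are not attached through an mpt HBA and this limitation does not apply to you.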

We have some solutions for this limitation in the pipeline, but in
the meantime, there is an easy workaround for this. You need to edit
the /boot/solaris/bootenv.rc file and remove the entries for bootpath
and altbootpath. At this point, the DCA should
automatically detect the correct boot device path and pass it into the kernel.
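The entries you are removing look something like the following (these device paths are just examples; yours will be whatever was recorded for your disks):

setprop bootpath '/pci@0,0/pci8086,2545@3/pci8086,1460@1d/pci8086,341a@7,1/sd@0,0:a'

setprop altbootpath '/pci@0,0/pci8086,2545@3/pci8086,1460@1d/pci8086,341a@7,1/sd@1,0:a'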

There are a couple of limitations to this workaround. First, it only
works for S10 and later. In S9, it won’t automatically boot. Instead
it will enter the DCA and you will have to manually choose the boot
device. Also, it only works for systems where both disks in the root
mirror are attached
via the mpt HBA. This is a typical configuration for the V20z and V40z. We are working
on better solutions to this limitation, but hopefully this workaround
is useful in the meantime.

Solaris Volume Manager has had support for a feature called
“disksets” for a long time, going back to when it was
an unbundled product named SDS. Disksets are a way to group
a collection of disks together for use by SVM. Originally
this was designed for sharing disks, and the metadevices
on top of them, between two or more hosts. For example,
disksets are used by SunCluster. However, having a way to
manage a set of disks for exclusive use by SVM simplifies administration,
and with S10 we have made several enhancements to the diskset
feature which make it more useful, even on a single host.

If you have never used SVM disksets before, I’ll briefly summarize
a couple of the key differences from the more common use
of SVM metadevices outside of a diskset. First, with a diskset,
the whole disk is given over to SVM control. When you do this
SVM will repartition the disk and create a slice 7 for metadb
placement. SVM automatically manages metadbs in disksets so
you don’t have to worry about creating those. As disks are
added to the set, SVM will create new metadbs and automatically
rebalance them across the disks as needed. Another difference
with disksets is that hosts can also be assigned to the set.
The local host must be in the set when you
create it. Disksets implement the concept of a set owner
so you can release ownership of the set and a different host
can take ownership. Only the host that owns the set can access
the metadevices within the set. An exception to this is the
new multi-owner diskset which is a large topic on its own so
I’ll save that for another post.

With S10 SVM has added several new features based around the
diskset. One of these is the metaimport(1M) command which
can be used to load a complete diskset configuration onto
a separate system. You might use this if your host died and
you physically moved all of your storage to a different machine. SVM
uses the disk device IDs to figure out how to put the configuration
together on the new machine. This is required since the
disk names themselves (e.g. c2t1d0,c3t5d0,…) will probably
be different on the new system. In S10 disk images in a diskset
are self-identifying. What this means is that if you use
remote block replication software like HDS TrueCopy or the SNDR
feature of Sun’s StorEdge Availability Suite to do
remote replication you can still use metaimport(1M) to import
the remotely replicated disks, even though the device IDs
of the remote disks will be different.
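As a quick sketch of how this looks (the set name is made up and the disk names are just examples), you can first ask metaimport to report on what it finds and then do the import:

# metaimport -r -v

# metaimport -s newset c2t1d0 c3t5d0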

A second new feature is the metassist(1M) command. This command
automates the creation of metadevices so that you don’t have
to run all of the separate meta* commands individually. Metassist
has quite a lot of features which I won’t delve into
in this post, but you can read more here.
The one idea I wanted to discuss is that metassist uses
disksets to implement the concept of a storage pool. Metassist
relies on the automatic management of metadbs that disksets
provide. Also, since the whole disk is now under control of
SVM, metassist can repartition the disk as necessary to automatically
create the appropriate metadevices within the pool. Metassist
will use the disks in the pool to create new metadevices and only
try to add additional disks to the pool if it can’t configure the new
metadevice using the available space on the existing disks in the pool.
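To give a feel for this, a single command along these lines (the pool name and size are only illustrative) will build a redundant volume, creating the diskset, metadbs and underlying metadevices as needed:

# metassist create -s storagepool -S 10gb -r 2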

When disksets were first implemented back in SDS they were intended
for use by multiple hosts. Since the metadevices in the set were
only accessible by the host that owned the set, the assumption was
that some other piece of software, outside of SDS, would manage the
diskset ownership. Since SVM is now making greater use of disksets
we added a new capability called “auto-take” which allows the local
host to automatically take ownership of the diskset during boot and thus have
access to the metadevices in the set. This means that you can
use vfstab entries to mount filesystems built on metadevices within
the set and those will “just work” during the system boot.
The metassist command relies on this feature and the storage pools (i.e. disksets)
it uses will all have auto-take enabled. Auto-take
can only be enabled for disksets which have the single, local host
in the set. If you have multiple hosts in the set then you are really
using the set in the traditional manner and you’ll need something
outside of SVM to manage which host owns the set (again, this ignores
the new multi-owner sets). You use the new “-A” option on the metaset
command to enable or disable auto-take.
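For example, to turn auto-take on or off for a set (the set name is just an example):

# metaset -s myset -A enable

# metaset -s myset -A disable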

Finally, since use of disksets is expected to be more common with these
new features, we added the “-a” option to the metastat command so that
you can see the status of metadevices in all disksets without having to
run a separate command for each set. We also added a “-c” option to
metastat which gives a compact, one-line output for each metadevice.
This is particularly useful as the number of metadevices in the
configuration increases. For example, on one of our local servers we have 93
metadevices. With the original, verbose metastat output this resulted in
842 lines of status, which makes it hard to see exactly what is going
on. The “-c” option reduces this down to 93 lines.
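So the compact, all-sets view is simply:

# metastat -ac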

This is just a brief overview of some of the new features we have
implemented around SVM disksets. There is lots more detail in the docs here.
The metaimport(1M) command was available starting in S9 9/04
and the metassist command was available starting in S9 6/04. However,
the remote replication feature for metaimport requires S10.

There are a couple of bugs that we found in S10
that make it look like Solaris Volume Manager root
mirroring does not work at all. Unfortunately
we found these bugs after the release went out.
These bugs will be patched, but I wanted to describe
the problems a bit and offer some workarounds.

On a lot of systems that use SVM to do root mirroring
there are only two disks. When you set up the configuration
you put one or more metadbs on each disk to hold the
SVM configuration information.

SVM implements a metadb quorum rule which means that
during the system boot, if more than 50% of the metadbs
are not available, the system should boot into single-user
mode so that you can fix things up. You can read more about this here.

On a two disk system there is no way to set things
up so that more than 50% of the metadbs will be available
if one of the disks dies.

When SVM does not have metadb quorum during the boot it
is supposed to leave all of the metadevices read-only and
boot into single-user. This gives you a chance to confirm that
you are using the
right SVM configuration and that you don’t corrupt any of
your data before having a chance to clean up the dead metadbs.

What a lot of people do when they set up a root
mirror is pull one of the disks to check if the system
will still boot and run ok. If you do this experiment
on a two disk configuration running S10 the system will
panic really early in the boot process and it will go
into an infinite panic/reboot cycle.

What is happening here is that we found a bug related
to UFS logging, which is on by default in S10. Since
the root mirror stays read-only because there is no
metadb quorum we hit a bug in the UFS log rolling code.
This in turn leaves UFS in a bad state which causes
the system to panic.

We’re testing the fix for this bug right now but in the
meantime, it is easy to workaround this bug by just
disabling logging on the root filesystem. You can do
that by specifying the “nologging” option in the last
field of the vfstab entry for root. You should reboot
once before doing any SVM experiments (like pulling a disk)
to ensure that UFS has rolled the log and is no longer
using logging on root.
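For a root filesystem on a mirror named d10 (the metadevice name is just an example), the vfstab entry would look something like this:

/dev/md/dsk/d10 /dev/md/rdsk/d10 / ufs 1 no nologging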

Once a patch for this bug is out you will definitely want to remove
this workaround from the vfstab entry since UFS logging
offers so many performance and availability benefits.

By the way, UFS logging is also on by default in the S9 9/04
release but that code does not suffer from this bug.

The second problem we found is not as serious as the UFS bug.
This has to do with an interaction with the Service Management Facility (SMF),
which is new in S10, and again, this is related to not having
metadb quorum during the boot. What should happen
is that the system should enter single-user so you can clean
up the dead metadbs. Instead it boots all the way to multi-user
but since the root device is still read-only things don’t work
very well. This turned out to be a missing dependency which
we didn’t catch when we integrated SVM and SMF. We’ll have
a patch for this too but this problem is much less serious.
You can still login as root and clean up the dead metadbs so that
you can then reboot with a good metadb quorum.

Both of the problems result because there is no metadb quorum
so the root metadevice remains read-only after a boot with
a dead disk. If you have a third disk which you can use to
add a metadb onto, then you can reduce the likelihood of hitting
this problem since losing one disk won’t cause you to lose
quorum during boot.
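For example, adding a replica on a slice of a third disk (the slice name here is hypothetical) is a one-liner:

# metadb -a c3t0d0s7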

Given these kinds of problems you might wonder why does SVM
bother to implement the metadb quorum? Why not just trust
the metadbs that are alive? SVM is conservative and always
chooses the path to ensure that you won’t lose
data or use stale data. There are various corner cases to
worry about when SVM cannot be sure it is using the most current
data. For example, in a two disk mirror configuration, you might run
for a while on the first disk with the second disk powered down. Later
you might reboot off the second disk (because the disk was
now powered up) and the first disk might now be powered down.
At this point you would be using the stale data on the mirror,
possibly without even realizing it. The metadb quorum rule
gives you a chance to intervene and fix up the configuration when
SVM cannot do it automatically.

I am one of the engineers working on Solaris Volume Manager. There are some questions I get
asked a lot and I thought it would be useful to get this information out to a broader
audience so I am starting this blog. We have also done a lot of work on the volume manager
in the recent Solaris
releases and I’d like to talk about some of that too.

One of the questions I get asked most frequently is how to do root mirroring on x86
systems. Unfortunately our docs don’t explain this very well yet. This is a short
description I wrote about root mirroring on x86 which hopefully explains how to set
this up.

Root mirroring on x86 is more complex than root mirroring
on SPARC. Specifically, there are issues with being able to boot
from the secondary side of the mirror when the primary side fails.
Compared to SPARC systems, with x86 machines you have to be sure
to properly configure the system BIOS and set up your fdisk partitioning.

The x86 BIOS is analogous to the PROM interpreter on SPARC.
The BIOS is responsible for finding the right device to boot
from, then loading and executing the master boot record from that
device.

All modern x86 BIOSes are configurable to some degree but the
discussion of how to configure them is beyond the scope of this
post. In general you can usually select the order of devices that
you want the BIOS to probe (e.g. floppy, IDE disk, SCSI disk, network)
but you may be limited in configuring at a more granular level.
For example, it may not be possible to configure the BIOS to try the first
IDE disk and then fall back to the second IDE disk. These limitations may be a factor
with some hardware configurations (e.g. a system with two IDE
disks that are root mirrored). You will need to understand
the capabilities of the BIOS that is on your hardware. If your
primary boot disk fails and your BIOS is set up properly, then it
will automatically boot from the second disk in the mirror.
Otherwise, you may need to break into the BIOS while the machine is
booting and reconfigure to boot from the second disk or you may even need
to boot from a floppy with the Solaris Device Configuration Assistant (DCA)
on it so that you can select the alternate disk to boot from.

On x86 machines fdisk partitions are used and it is common to have
multiple operating systems installed. Also, there are different flavors
of master boot programs (e.g. LILO or Grub), in addition to the standard
Solaris master boot program. The boot(1M) man page is a good resource
for a detailed discussion of the multiple components that are used during
booting on Solaris x86.

Since SVM can only mirror Solaris slices within the Solaris fdisk
partition this post focuses on a configuration that only
has Solaris installed. If you have multiple fdisk partitions
then you will need to use some other approach to protect the data
outside of the Solaris fdisk partition. SVM can’t mirror that data.

For an x86 system with Solaris installed there are two common
fdisk partitioning schemes. One approach uses two fdisk partitions.
There is a Solaris fdisk partition and another, small fdisk
partition of about 10MB called the x86 boot partition.
This partition has an Id value of 190. The Solaris system installation
software will create a configuration with these two fdisk partitions
as the default. The x86 boot partition is needed in some cases,
such as when you want to use live-upgrade on a single disk configuration,
but it is problematic when using root mirroring. The Solaris
system installation software only allows one x86 boot partition for
the entire system and it places important data on that fdisk
partition. That partition is mounted in the vfstab with this
entry:

/dev/dsk/c2t1d0p0:boot - /boot pcfs - no -

Because this fdisk partition is outside of the Solaris fdisk
partition it cannot be mirrored by SVM. Furthermore, because
there is only a single copy of this fdisk partition it represents
a single point of failure.

Since the x86 boot partition is not required in most cases
it is recommended that you do not use this as the default when setting
up a system that will have root mirroring. Instead, just use
a single Solaris fdisk partition and omit the x86 boot partition for
your installation. It is easiest to do this at the time you install
Solaris. If you already have Solaris installed and you created
the x86 boot partition as part of that process then the easiest
thing would be to delete that with the fdisk(1M) command and reinstall,
taking care not to create the x86 boot partition during the installation
process.

Once your system is installed you create your metadbs and root
mirror using the normal procedures.
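For reference, a minimal sketch of that normal procedure looks like this (all of the device and metadevice names are examples; adjust them for your layout):

# metadb -a -f c1t0d0s7 c1t1d0s7

# metainit -f d11 1 1 c1t0d0s0

# metainit d12 1 1 c1t1d0s0

# metainit d10 -m d11

# metaroot d10

# [reboot, then attach the second submirror]

# metattach d10 d12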

When you use fdisk to partition your second disk you must make the
disk bootable with a master boot program. The -b option of fdisk(1M)
does this.

For example:
fdisk -b /usr/lib/fs/ufs/mboot /dev/rdsk/c2t1d0p0

On x86 machines the Solaris VTOC (the slices within the
Solaris fdisk partition) is slightly different from what is seen
on SPARC. On SPARC there are 8 VTOC slices (0-7) but on x86
there are more. In particular slice 8 is used as a “boot”
slice. You will see that this slice is 1 cylinder in size
and starts at the beginning of the disk (offset 0). The other
slices will come after that, starting at cylinder 1.
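You can see this layout by dumping the VTOC; for example, using the same second disk as elsewhere in this post:

# prtvtoc /dev/rdsk/c2t1d0s2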

Slice 8 is necessary for booting Solaris from this fdisk partition. It
holds the partition boot record (pboot), the Solaris VTOC for the disk
and the bootblk. This information is disk specific so it
is not mirrored with SVM. However, you must ensure that
both disks are bootable so that you can boot from the secondary
disk if the primary fails. You can use the installboot program
to setup the second disk as a Solaris bootable disk (see installboot(1M)).
An example command is:

installboot /usr/platform/i86pc/lib/fs/ufs/pboot \
/usr/platform/i86pc/lib/fs/ufs/bootblk /dev/rdsk/c2t1d0s2

There is one further consideration for booting an x86 disk. You must
ensure that the root slice has a slice tag of “root” and the root slice
must be slice 0. See format(1M) for checking and setting the slice tag
field.

Solaris x86 emulates some of the behavior of the SPARC eeprom. See
eeprom(1M). The boot device is stored in the “bootpath” property that
you can see with the eeprom command. The value is the device tree path
of the primary boot disk. For example:

bootpath=/pci@0,0/pci8086,2545@3/pci8086,1460@1d/pci8086,341a@7,1/sd@0,0:a

You should set up the alternate bootpath via the eeprom command so that
the system will try to boot from the second side of the root mirror.
First you must get the device tree path for the other disk in the mirror.
You can use the ls command. For example:

# ls -l /dev/dsk/c2t1d0s0
lrwxrwxrwx 1 root root 78 Sep 28 23:41 /dev/dsk/c2t1d0s0 -> ../../devices/pci@0,0/pci8086,2545@3/pci8086,1460@1d/pci8086,341a@7,1/sd@1,0:a

The device tree path is the portion of the output following “../devices”.

Use the eeprom command to set up the alternate boot path. For example:

eeprom altbootpath='/pci@0,0/pci8086,2545@3/pci8086,1460@1d/pci8086,341a@7,1/sd@1,0:a'

If your primary disk fails you will boot from the secondary disk.
This may either be automatic if your BIOS is configured properly,
you may need to manually enter the BIOS and boot from the secondary
disk or you may even need to boot from a floppy with the DCA on it.
Once the system starts to boot it will try to boot from the “bootpath”
device. Assuming the primary boot disk is the dead disk in the root
mirror the system will attempt to boot from the “altbootpath” device.

If the system fails to boot from the altbootpath device for some reason then
you need to finish booting manually. The boot should drop into the Device
Configuration Assistant (DCA). You must choose the secondary disk as the
boot disk within the DCA. After the system has booted you should update the
“bootpath” value with the device path that you used for the secondary
disk (the “altbootpath”) so that the machine will boot automatically.

For example, run the following to set the boot device to the second
scsi disk (target 1):

eeprom bootpath='/pci@0,0/pci8086,2545@3/pci8086,1460@1d/pci8086,341a@7,1/sd@1,0:a'

Note that all of the usual considerations of mddb quorum apply
to x86, just as they do for SPARC.
