Jerry Jelinek's blog


The latest release of Solaris Express came out the other day. Dan has his usual excellent summary.
He mentions one cool new SVM feature but it might be easy to overlook
it since there are so many other new things in this release. The new
feature is the ability to cancel a mirror resync that is underway.
The resync is checkpointed and you can restart it later. It will simply pick
up where it left off. This is handy
if the resync is affecting performance and you’d like to wait until
later to let it run. Another use for this is if you need to reboot.
With the old code, if a full resync was underway and you rebooted,
the resync would start over from the beginning. Now, if you cancel it
before rebooting, the checkpoint will allow the resync to pick up where
it left off.

This code is already in OpenSolaris. You can see the CLI changes in metasync.c and the library changes in meta_mirror_resync_kill.
The changes are fairly small because most of the underlying support
was already implemented for multi-node disksets. All we had to do was
add the CLI option and hook into the existing ioctl. You can see some of this resync functionality in the resync_unit function. There is a nice big comment there which explains some of this code.
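
To give a feel for what "hooking into the existing ioctl" amounts to, here is a rough userland sketch. The real path goes from metasync through libmeta (meta_mirror_resync_kill) rather than a raw ioctl like this, and the ioctl command constant below is only a placeholder, not the real one:

    /*
     * Hypothetical sketch only -- the real code path is metasync -> libmeta
     * -> meta_mirror_resync_kill(); the command constant is a placeholder.
     */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/ioctl.h>

    #define MD_ADMIN                "/dev/md/admin" /* SVM administrative device */
    #define MD_IOC_RESYNC_CANCEL    0x1234          /* placeholder, not the real cmd */

    static int
    cancel_resync(int mnum)
    {
            int fd, ret;

            if ((fd = open(MD_ADMIN, O_RDWR)) < 0) {
                    perror("open " MD_ADMIN);
                    return (-1);
            }
            /* Ask the md driver to checkpoint and stop the resync. */
            ret = ioctl(fd, MD_IOC_RESYNC_CANCEL, &mnum);
            (void) close(fd);
            return (ret);
    }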

Technorati Tags: OpenSolaris, Solaris

Now that OpenSolaris is here it is a lot easier to talk about some of the interesting implementation details in the code. In this post I wanted to discuss the first project I did after I started to work on the Solaris Volume Manager (SVM). This is on my mind right now because it also happens to be related to one of my most recent changes to the code. This change is not even in Solaris Express yet; it is only available in OpenSolaris. Early access to these kinds of changes is just one small reason why OpenSolaris is so cool.

My first SVM project was to add support for the B_FAILFAST flag.
This flag is defined in /usr/include/sys/buf.h and it was
implemented in some of the disk drivers so that I/O requests
that were queued in the driver could be cleared out quickly when
the driver saw that a disk was malfunctioning. For SVM the
big requester for this feature was our clustering software. The
problem they were seeing was that in a production environment
there would be many concurrent I/O requests queued up down in
the sd driver. When the disk was failing the sd driver would
need to process each of these requests, wait for the timeouts
and retries and slowly drain its queue. The cluster software
could not failover to another node until all of these pending
requests had been cleared out of the system. The B_FAILFAST flag
is the exact solution to this problem. It tells the driver
to do two things. First, it reduces the number of retries that
the driver does to a failing disk before it gives up and returns
an error. Second, when the first I/O buf that is queued up in
the driver gets an error, the driver will immediately error
out all of the other pending bufs in its queue. Furthermore,
any new bufs sent down with the B_FAILFAST flag will immediately
return with an error.
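
To make that concrete, here is roughly what it looks like for a layered driver to pass the flag down. This is only a simplified sketch; B_FAILFAST, the buf(9S) flags field and bdev_strategy(9F) are the real interfaces, but the function wrapped around them is made up:

    #include <sys/types.h>
    #include <sys/buf.h>            /* B_FAILFAST */
    #include <sys/sunddi.h>         /* bdev_strategy() */

    /*
     * Simplified sketch: pass a child I/O down to the underlying disk
     * driver with B_FAILFAST set, so a failing disk errors out quickly
     * instead of going through its full timeout/retry sequence.
     */
    static void
    submit_failfast(struct buf *cb, boolean_t use_failfast)
    {
            if (use_failfast)
                    cb->b_flags |= B_FAILFAST;
            (void) bdev_strategy(cb);       /* hand off to sd/ssd, etc. */
    }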

This seemed fairly straightforward to implement in SVM. The
code had to be modified to detect if the underlying devices
supported the B_FAILFAST flag and if so, the flag should be
set in the buf that was being passed down from the md driver
to the underlying drivers that made up the metadevice. For
simplicity we decided we would only add this support to the
mirror metadevice in SVM. However, the more I looked at this,
the more complicated it seemed to be. We were worried about
creating new failure modes with B_FAILFAST and the big concern was the
possibility of a “spurious” error. That is, getting back an
error on the buf that we would not have seen if we had let the
underlying driver perform its full set of timeouts and retries.
This concern eventually drove the whole design of the initial B_FAILFAST
implementation within the mirror code. To handle this spurious
error case I implemented an algorithm within the driver so that when we
got back an errored B_FAILFAST buf we would resubmit that buf without the
B_FAILFAST flag set. During this retry, all of the other failed I/O bufs would
also immediately come back up into the md driver. I queued those
up so that I could either fail all of them after the retried buf
finally failed or I could resubmit them back down to the underlying
driver if the retried I/O succeeded. Implementing this correctly took a lot longer than I originally expected when I took on this first project, and it was one of those things that worked but that I was never very happy with. The code was complex and I never felt
completely confident that there wasn’t some obscure error condition
lurking here that would come back to bite us later. In addition,
because of the retry, the failover of a component within a mirror
actually took *longer* now if there was only a single I/O
being processed.
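
In rough C pseudocode, the original algorithm looked something like the sketch below. Everything here except B_FAILFAST itself is a made-up name standing in for the real mirror driver plumbing:

    #include <sys/types.h>
    #include <sys/buf.h>

    /* Hypothetical helpers standing in for the real md driver plumbing. */
    extern void resubmit(struct buf *);
    extern void fail_io(struct buf *);
    extern void park(struct buf *);
    extern struct buf *next_parked(void);

    static struct buf *retry_bp;    /* the buf being retried without B_FAILFAST */

    /* Called when a B_FAILFAST child I/O comes back with an error. */
    static void
    failfast_error(struct buf *bp)
    {
            if (retry_bp == NULL) {
                    /* First error: retry it the slow, careful way. */
                    bp->b_flags &= ~B_FAILFAST;
                    retry_bp = bp;
                    resubmit(bp);
            } else {
                    /* Everything else waits for the verdict on the retry. */
                    park(bp);
            }
    }

    /* Called when the careful retry finally completes. */
    static void
    retry_done(int failed)
    {
            struct buf *bp;

            while ((bp = next_parked()) != NULL) {
                    if (failed)
                            fail_io(bp);    /* the disk really is bad */
                    else
                            resubmit(bp);   /* spurious error, try again */
            }
            retry_bp = NULL;
    }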

This original algorithm was delivered in the S10 code and was also released
as a patch for S9 and SDS 4.2.1. It has been in use for a couple of years
which gave me some direct experience with how well the B_FAILFAST
option worked in real life. We actually have seen one or two
of these so-called spurious errors, but in all cases there were real,
underlying problems with the disks. The storage was marginal
and SVM would have been better off just erroring out those components
within the mirror and immediately failing over to the good side
of the mirror. By this time I was comfortable with this idea so
I rewrote the B_FAILFAST code within the mirror driver. This new
algorithm is what you’ll see today in the OpenSolaris code base. I basically decided to just trust the error we get
back when B_FAILFAST is set. The code will follow the normal error
path so that it puts the submirror component into the maintenance state
and just uses the other, good side of the mirror from that point onward.
I was able to remove the queue and simplify the logic almost back to
what it was before we added support for B_FAILFAST.

However, there is still one special case we have to worry about when
using B_FAILFAST. As I mentioned above, when B_FAILFAST is set, all
of the pending I/O bufs that are queued down in the underlying driver
will fail once the first buf gets an error. When we are down to the
last side of a mirror the SVM code will continue to try to do I/O to
those last submirror components, even though they are taking errors.
This is called the LAST_ERRED state within SVM and is an attempt to try to
provide access to as much of your data as possible. When using B_FAILFAST
it is probable that not all of the failed I/O bufs will have been
seen by the disk and given a chance to succeed. With the new algorithm
the code detects this state and reissues all of the I/O bufs without
B_FAILFAST set. There is no longer any queueing; we just resubmit the I/O
bufs without the flag and all future I/O to the submirror is done
without the flag. Once the LAST_ERRED state is cleared the code will
return to using the B_FAILFAST flag.
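
Here is a small sketch of that special case. Again, only B_FAILFAST is real; the helper names and the per-submirror flag are stand-ins for illustration:

    #include <sys/types.h>
    #include <sys/buf.h>

    /* Hypothetical helpers and state; only B_FAILFAST is the real interface. */
    extern void resubmit(struct buf *);
    extern void error_component(struct buf *);

    static boolean_t use_failfast = B_TRUE; /* kept per-submirror in the real code */

    static void
    handle_failfast_error(struct buf *bp, boolean_t last_erred)
    {
            if (last_erred) {
                    /*
                     * Last usable side of the mirror: stop using B_FAILFAST
                     * and give every buf the driver's full retry treatment.
                     */
                    use_failfast = B_FALSE;
                    bp->b_flags &= ~B_FAILFAST;
                    resubmit(bp);
                    return;
            }
            /* Otherwise trust the fast error and use the good side. */
            error_component(bp);
    }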

All of this is really an implementation detail of mirroring in SVM.
There is no user-visible component of this except for a change in
the behavior of how quickly the mirror will fail the errored drives
in the submirror. All of the code is contained within the mirror
portion of the SVM driver and you can see it in
mirror.c.
The function mirror_check_failfast is used to determine if all of the components in a submirror support using the B_FAILFAST flag. The mirror_done function is called when the I/O to the underlying submirror is complete. In this function we check if the I/O failed and if B_FAILFAST was set. If so we call the submirror_is_lasterred function to check for that condition, and the last_err_retry function is called only when we need to resubmit the I/O. This function is actually executed in a helper thread since the I/O completes in a thread separate from the thread that initiated the I/O down into the md driver.
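
Stripped of all the bookkeeping, the B_FAILFAST-related decisions in that completion path look roughly like this. The named functions are real, but their signatures and the glue around them are simplified here for illustration:

    #include <sys/types.h>
    #include <sys/buf.h>

    /*
     * submirror_is_lasterred() and last_err_retry() exist in mirror.c, but
     * the signatures and helpers below are simplified for illustration.
     */
    extern int submirror_is_lasterred(struct buf *);
    extern void last_err_retry(struct buf *);
    extern void start_helper(void (*)(struct buf *), struct buf *);
    extern void io_done(struct buf *);
    extern void error_component(struct buf *);

    static void
    mirror_done_sketch(struct buf *bp)
    {
            if ((bp->b_flags & B_ERROR) == 0) {
                    io_done(bp);            /* normal completion */
                    return;
            }
            if ((bp->b_flags & B_FAILFAST) && submirror_is_lasterred(bp)) {
                    /*
                     * Last usable side of the mirror: hand the buf to a helper
                     * thread that reissues it without B_FAILFAST, since we
                     * cannot block in this completion context.
                     */
                    start_helper(last_err_retry, bp);
                    return;
            }
            /*
             * Otherwise trust the error: put the submirror component into the
             * maintenance state and use the good side of the mirror.
             */
            error_component(bp);
    }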

To wrap up, the SVM md driver code lives in the source tree at
usr/src/uts/common/io/lvm.
The main md driver is in the md subdirectory and each specific kind of metadevice also has its own subdirectory (mirror, stripe, etc.). The SVM command line utilities live in usr/src/cmd/lvm and the shared library code that SVM uses lives in usr/src/lib/lvm. Libmeta is the primary library. In another post
I’ll talk in more detail about some of these other components of SVM.

Technorati Tags: OpenSolaris, Solaris

In a previous blog I talked about integration of Solaris Volume Manager (SVM) and
RCM. Proper integration of SVM with the other subsystems in Solaris
is one of the things I am particularly interested in.

Today I’d like to talk about some of the work I did to integrate SVM with the new Service Management Facility (SMF) that was introduced in S10.
Previously SVM had a couple of RC scripts that would run when the
system booted, even if you had not configured SVM and were not using
any metadevices. There were also several SVM specific RPC daemons
that were enabled. One of the ideas behind SMF is that only the services
that are actually needed should be enabled. This speeds up
boot and makes for a cleaner system. Another thing is that not
all of the RPC daemons need to be enabled when using SVM. Different
daemons will be used based upon the way SVM is configured.
SMF allows us to clean this up and manage these services within the code
so that the proper services are enabled and disabled as you
reconfigure SVM.

The following is a list of the services used by SVM:

svc:/network/rpc/mdcomm

svc:/network/rpc/metamed

svc:/network/rpc/metamh

svc:/network/rpc/meta

svc:/system/metainit

svc:/system/mdmonitor

The system/mdmonitor, system/metainit and network/rpc/meta services are the
core services. These will be enabled when you create the first metadb.
Once you create your first diskset the network/rpc/metamed and
network/rpc/metamh services will be enabled. When you create your first
multi-node diskset the network/rpc/mdcomm service will also be enabled.
As you delete these portions of your configuration the corresponding
services will be disabled.

Integrating this coordination of SVM and SMF is easy since SMF offers
a full API which allows programs to monitor and reconfigure the services
they use. The primary functions used are smf_get_state, smf_enable_instance
and smf_disable_instance, all of which are documented on the smf_enable_instance(3SCF) man page. This could all have been done previously using various hacks to rename scripts and edit configuration files, but it is trivially simple with SMF. Furthermore, the code can always tell when there is something wrong with the services it depends on. Recently I integrated some new code that will notify you whenever you check the status of SVM with one of the CLI commands (metastat, metaset or metadb) and there is a problem with the SVM services. We have barely scratched the surface here, but SMF lays a good foundation for enabling us to deliver a true
self-healing system.
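
For a sense of scale, enabling a service from C is about this much code. The libscf calls are the documented ones from the man page above; the FMRI instance name and the trigger logic are just an example, not the actual libmeta code:

    #include <stdlib.h>
    #include <string.h>
    #include <libscf.h>     /* smf_get_state(), smf_enable_instance() */

    /* Example FMRI; the ":default" instance name is an assumption. */
    #define METAMED_FMRI    "svc:/network/rpc/metamed:default"

    /*
     * Example only: make sure the metamed service is enabled, the sort of
     * thing the SVM code does when the first diskset is created.
     */
    static int
    ensure_metamed_enabled(void)
    {
            char *state = smf_get_state(METAMED_FMRI);

            if (state != NULL &&
                strcmp(state, SCF_STATE_STRING_ONLINE) == 0) {
                    free(state);
                    return (0);     /* already running */
            }
            free(state);
            return (smf_enable_instance(METAMED_FMRI, 0));
    }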

Sanjay Nadkarni, another member of the Solaris Volume Manager engineering team, has just started a blog.
Sanjay was the technical lead for the project
that added new clustering capabilities to SVM so that it can now support
concurrent readers and writers. If you read my blog because
you are interested in SVM you will want to take a look at his too.
Welcome Sanjay.

In an earlier blog I talked about using a USB memory disk to store a Solaris Volume Manager (SVM) metadb
on a two-disk configuration. This would reduce the likelihood
of hitting the mddb quorum problem I have talked about.
The biggest problem with this approach is that there was
no way to control where SVM would place its optimized resync regions.
I just putback a fix for this limitation this week. It should show up in an upcoming Solaris Express release. With this fix the code will no longer
place the optimized resync regions on a USB disk or any other removable
disk for that matter. The only I/O to these devices should be
the SVM configuration change writes or the initialization reads of
the metadbs, which is a lot less frequent than the optimized resync writes.
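
I won’t walk through the new placement code here, but the kind of question it has to ask looks roughly like the following sketch. DKIOCREMOVABLE is the standard ioctl for asking a disk driver whether the media is removable; the real check is done inside SVM when replica locations are chosen, not with a userland ioctl like this:

    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <sys/dkio.h>   /* DKIOCREMOVABLE */

    /*
     * Rough illustration only: ask the disk driver whether the media is
     * removable, the sort of test used to keep optimized resync regions
     * off USB and other removable disks.
     */
    static int
    is_removable(const char *rawdev)
    {
            int fd, removable = 0;

            if ((fd = open(rawdev, O_RDONLY | O_NDELAY)) < 0)
                    return (-1);
            if (ioctl(fd, DKIOCREMOVABLE, &removable) < 0)
                    removable = -1;
            (void) close(fd);
            return (removable);
    }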

I had another interesting experience this week. I was working on
a bug fix for the latest release of Solaris and I had to run
a test of an x86 upgrade for S2.7 with a Solstice DiskSuite (SDS) 4.2
root mirror to the latest release of Solaris. This was interesting
to me for a number of reasons. First, this code is over six years old, but because of the long support lifetimes for Solaris releases we still have to be sure things like this will work. Second, it was
truly painful to see this ancient release of Solaris and SDS running
on x86. It was quite a reminder of how far Solaris 10 has come
on the x86 platform. It will be exciting to see where the Solaris community takes OpenSolaris on the x86 platform, as well as other platforms, over the next few years.

The latest release of Solaris Express came out yesterday. As usual a summary is on Dan’s blog.

One new Solaris Volume Manager (SVM) capability in this release
is better integration with the Solaris Reconfiguration Coordination Manager (RCM) framework. This is the fix for bug:

4927518 SVM could be more intelligent when components are being unconfigured

SVM has had some RCM integration since Solaris 9 update 2. Starting
in that release, if you attempted to Dynamically Reconfigure (DR) out
a disk that was in use by SVM, you would get a nice message explaining
how SVM was using the disk. However, this was really only the
bare minimum of what we could do to integrate with the RCM
framework. The problem that is identified by bug 4927518 is that
if the disk has died you should just be able to DR it out so you can
replace it. Up to now what would happen is you would get the message
explaining that the disk was part of an SVM configuration. You had to
manually unconfigure the SVM metadevice before you could DR out the disk.
For example, if the disk was on one side of a mirror, you would have had
to detach the submirror, delete it, DR out the disk, DR in the
new one, recreate the submirror and reattach it to the mirror. Then
you would have incurred a complete resync of the newly attached
submirror. Obviously this is not very clean.

With the latest Solaris Express release SVM will clean up the system's internal state for the dead disk so that you can just DR it out
without detaching and deleting the submirror. Once
you DR in a new disk you can just enable it into the mirror and
only that disk will be resynced, not the whole submirror. This is
a big improvement for the manageability of SVM. There are more
integration improvements like this coming for SVM. In another
blog entry I’ll also try to highlight some of the areas of integration we already have between SVM and other Solaris features.
