What to do if you have problems with Vinum
If you have problems with Vinum, it could be due to misunderstanding the setup, or it could be a bug. This page describes some of the more common pitfalls of setting up Vinum and then describes the state of the known bugs.
For similar reasons, the vinum start command will not accept a drive on partition c. Partition c is used by the system to represent the whole disk, and must be of type unused. Clearly there is a conflict here, which vinum resolves by not using the c partition.
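For illustration, a label edited with disklabel -e might look something like this (the drive name and sizes here are invented); the c partition covers the whole slice and stays of type unused, while a separate partition of type vinum is handed to Vinum:

```
# disklabel -e da1s1        (hypothetical drive)
8 partitions:
#        size   offset    fstype   [fsize bsize bps/cpg]
  c:  4226725        0    unused        0     0      # whole slice: leave alone
  h:  4226725        0    vinum                      # this one belongs to Vinum
```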
In practice, people aren't too interested in what was in the plex when it was created, and other volume managers cheat by marking newly created plexes up anyway. Vinum provides two ways to ensure that newly created plexes are up:
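One of these ways, per vinum(8), is the setupstate keyword on the volume definition, which tells Vinum to consider all plexes of the volume up without reviving them. A configuration sketch (the drive and volume names here are made up):

```
drive d1 device /dev/da1s1h
drive d2 device /dev/da2s1h
volume mirror setupstate
  plex org concat
    sd length 512m drive d1
  plex org concat
    sd length 512m drive d2
```

The other way is to initialize the subdisks explicitly with the vinum init command, which zeroes them so that both plexes really do contain identical data.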
vinumioctl: invalid ioctl from process 247 (vinum): c0e44642
This error may also occur if the versions of the kernel module (kld) and the userland vinum(8) program do not match, for example after updating one but not the other.
vinum -> start test.p1.s0
Can't start test.p1.s0: Device busy (16)
To get past this problem, you could first set the state to obsolete:
vinum -> setstate obsolete test.p1.s0
vinum -> start test.p1.s0
Reviving test.p1.s0 in the background
Status: I'm not sure whether the fix got committed. Watch this space.
Technical explanation: This is a bug. I had forgotten to write the code.
Workaround: When reviving striped plexes, ensure that no other write I/O takes place. The simplest way to achieve this is to unmount the volume.
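As a sketch of the workaround (the volume and plex names here are hypothetical), take the volume out of service for the duration of the revive:

```
# umount /dev/vinum/stripe        # stop all other I/O to the volume
# vinum start stripe.p1           # revive the stale plex
# mount /dev/vinum/stripe /mnt    # remount once the revive completes
```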
Status: Fix in planning.
Technical explanation: It's possible that Vinum isn't releasing drive resources.
Status: This doesn't seem to happen any more.
# newfs /dev/vinum/rvol
# fsck /dev/vinum/rvol
OK
# vinum stop vol
# vinum start vol
# fsck /dev/vinum/rvol
Errors detected
Technical explanation: I haven't investigated this one yet. It appears likely that there is some issue with flushing data to the disk, since a global vinum stop does not have this problem.
Workaround: Don't do that then.
Status: This was reported once nearly a year ago. It's likely that either it wasn't a bug in the first place, or that other changes in this area have fixed it.
brw-r-----  1 root  operator   4, 0x00010002 Sep 28 12:01 /dev/da0
brw-r-----  1 root  operator   4,          0 Sep 28 12:01 /dev/da0a
brw-r-----  1 root  operator   4,          1 Sep 28 12:01 /dev/da0b
brw-r-----  1 root  operator   4,          2 Sep 28 12:01 /dev/da0c
brw-r-----  1 root  operator   4,          3 Sep 28 12:01 /dev/da0d
brw-r-----  1 root  operator   4,          4 Sep 28 12:01 /dev/da0e
brw-r-----  1 root  operator   4,          5 Sep 28 12:01 /dev/da0f
brw-r-----  1 root  operator   4,          6 Sep 28 12:01 /dev/da0g
brw-r-----  1 root  operator   4,          7 Sep 28 12:01 /dev/da0h
brw-r-----  1 root  operator   4, 0x00020002 Sep 28 12:01 /dev/da0s1
brw-r-----  1 root  operator   4, 0x00020000 Sep 28 12:01 /dev/da0s1a
brw-r-----  1 root  operator   4, 0x00020001 Sep 28 12:01 /dev/da0s1b
brw-r-----  1 root  operator   4, 0x00020002 Sep 28 12:01 /dev/da0s1c
brw-r-----  1 root  operator   4, 0x00020003 Sep 28 12:01 /dev/da0s1d
brw-r-----  1 root  operator   4, 0x00020004 Sep 28 12:01 /dev/da0s1e
brw-r-----  1 root  operator   4, 0x00020005 Sep 28 12:01 /dev/da0s1f
brw-r-----  1 root  operator   4, 0x00020006 Sep 28 12:01 /dev/da0s1g
brw-r-----  1 root  operator   4, 0x00020007 Sep 28 12:01 /dev/da0s1h
brw-r-----  1 root  operator   4, 0x00030002 Sep 28 12:01 /dev/da0s2
brw-r-----  1 root  operator   4, 0x00040002 Sep 28 12:01 /dev/da0s3
brw-r-----  1 root  operator   4, 0x00050002 Sep 28 12:01 /dev/da0s4
In particular, we have seen panics when devices of the /dev/da0s1a kind are missing.
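If such entries are missing, they can normally be recreated with the MAKEDEV script; the disk name here is only an example, and MAKEDEV's exact behaviour may vary between releases:

```
# cd /dev
# sh MAKEDEV da0s1a      # should create the da0s1 partition devices
```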
Technical explanation: The function vinum_scandisk was trying to read from a drive which had been invalidated. This caused a null pointer dereference.
Status (13 October 1999): Fixed in 4.0-CURRENT (file vinumio.c, revision 1.45) and 3.3-STABLE (file vinumio.c, revision 1.7.2.11). If you're running 3.3-RELEASE, you should upgrade to 3.3-STABLE or 3.4-RELEASE.
Will not be fixed in older releases, if indeed it exists there.
Fatal trap 12: page fault while in kernel mode
fault virtual address   = 0x0
fault code              = supervisor read, page not present
instruction pointer     = 0x8:0x0
Note particularly the instruction pointer value. This happens during normal file access. In all cases I have investigated, the system was using soft updates. I have one report of a similar panic without soft updates, but it hasn't been substantiated.
This problem doesn't happen everywhere, but where it does, it's quite reproducible. If you're planning to use this combination, be sure to test it carefully.
Technical explanation: A buffer header gets corrupted between the time the top half of the driver issues the request to the disk driver, and when the I/O completes. Currently, the evidence is pointing towards the disk driver, but the corruption is of such an unusual nature that it's difficult to guess what's going on.
Status: Fix committed to FreeBSD-CURRENT on 5 January 2000 and to -STABLE on 11 May 2000.
Technical explanation: A deadlock arose between code locking stripes on a RAID-5 plex (vrlock) and code waiting for buffers to be freed (flswai).
Status: Fix committed to FreeBSD-CURRENT on 5 January 2000 and to -STABLE on 11 May 2000.
Technical explanation: This was due to sloppy coding (incorrect use of the buffered I/O routines).
Status (29 February 2000): Fixed.