Greg's diary--June 2000

This page is effectively the contents of a report that I wrote on 23 June 2000, relating to the beginnings of the FreeBSD SMP rewrite project. It seems that the main addressees were on a mailing list of my employer of the time, Linuxcare. The last paragraph (Timing) includes information that relates to the week after the meeting.

I also collected some other documents: two sets of slides here and here, and a anonymously modified version of this page.

Thursday, 15 June 2000	Sunnyvale	Images for 15 June 2000
		Top of page

On the 15th and 16th of June we had a seminar at Yahoo! in Sunnyvale about the recent changes to the BSD/OS kernel designed to improve SMP performance. This report has been delayed by a week, so I'll include details which have become known since the meeting.

Participants were:

Don Brady	Apple Computer	File systems
Ramesh	Apple Computer
Ted Walker	Apple Computer	network drivers
Jeffrey Hsu	FreeBSD project
Chuck Paterson	BSDi	Chief developer
Jonathan Lemon	Cisco, FreeBSD project
Matt Dillon	FreeBSD project	VM, NFS
Paul Saab	Yahoo!
Kirk McKusick
Peter Wemm	Yahoo!
Jayanth	Yahoo!
Doug Rabson	FreeBSD project	Alpha port
Jason Evans	FreeBSD project	kernel threads
David Greenman	FreeBSD project	chief architect
Justin Gibbs	Adaptec, FreeBSD project	SCSI, 0 copy TCP
Greg Lehey	Linuxcare, FreeBSD project	storage management
Mike Smith	BSDi, FreeBSD project	hardware, iA64 port
Alfred Perlstein	Wintel, FreeBSD project
David O'Brien	BSDi, FreeBSD project	compilers, binutils
Ceren Ercen	Linuxcare	Daemon babe

The purpose of the meeting was to describe the changes that BSDi have made to BSD/OS between the currently released version 4.1 and development versions 4.2 and 5.0 in order to improve SMP performance. The intention is to use the experience gained with this development, and probably a large proportion of the code, to gain similar advantages in FreeBSD. Although not discussed in this meeting, BSDi is also talking with the NetBSD and OpenBSD people about releasing the code to these projects as well.

The question of source code access.

BSD/OS is not an open source operating system. Source code licenses are available for a fee which must now be in the $3000 range. Nevertheless, it's based on the same free sources (4.4BSD) as the free BSD operating systems. It is very similar to FreeBSD, though the differences are enough to be painful.

A few weeks back, BSDi made the source code of BSD/OS 5.0 (development) available to all FreeBSD developers (i.e. those with a "commit bit", who are allowed to directly modify the FreeBSD source tree). No "need to know" criterion was applied, but it was made clear that this is still not free software. During the meeting we discussed what this really means, and Kirk McKusick (amongst other things chairman of the board of BSDi) said "well, we're quite happy for you to take generous chunks, but if you ended up taking it all, we might get a little uneasy". I don't see any reason why Linuxcare researchers shouldn't take a look at it. If you're interested, let me know and I'll give you access. I take it for granted that you'd respect BSDi's intention.

This reference to Linuxcare reflects the purpose of the original message: I was working for Linuxcare at the time, and this was part of my job. I didn't expect anybody to ask for access, and my expectation proved correct.
The current problems.
- UNIX was written for single processor machines, and many of the design choices are not just suboptimal for SMP, they're just plain ugly. In particular the synchronization mechanisms don't work well with more than one processor. Briefly,
- the process context, including the upper half of device drivers, doesn't need to protect itself. The kernel is non-preemptive, in other words as long as a process is executing in the kernel, no other process can execute in the kernel. If another process, even with higher priority, becomes executable while a process is executing kernel code, it will have to wait until the active process leaves the kernel or sleeps.
- processes protect themselves against the interrupt context, primarily the bottom half of device drivers, by masking interrupts. The original PDP-11 UNIX used the hardware priority levels (numbered 4 to 7), and even today you'll find function calls like spl4() and spl7() in System V code. BSD changed the names to more descriptive terms like splbio(), splnet() and splhigh(), and also replaced the fixed priorities by interrupt masks, but the principle remains the same. It's not always easy to solve the question of which interrupts need to be masked in which context, and one of the interesting observations at this meeting was that the interrupt masks were getting blacker, in other words each spl() was masking off more bits in the interrupt mask register. This is not good for performance.
- Processes synchronize with each other using the sleep() or tsleep() calls. Traditional UNIX, including System V, uses sleep(), and BSD prefers tsleep(), which provides nice strings which ps(1) displays to show what the process is waiting for. FreeBSD no longer has a sleep() call, while BSD/OS has both, but sleep() is deprecated. tsleep() is used both for voluntary process synchronization (e.g. send a request to another process and wait until it is finished), and for involuntary synchronization (e.g. wait for a shared resource to become available).
  
  Processes sleep on a specific address. In many cases, the address in itself has no meaning, and it's probably easier to think of it as a number. When a process sleeps, it is put on a sleep queue. The wakeup() function takes the sleep address, walks through the sleep queue, and wakes *every* process which is sleeping on this address. This can cause massive problems even on single processor machines: UNIX was never really intended to have hundreds of processes waiting on the same resource, and a number of Apache performance problems centre around this behaviour. As a partial solution, FreeBSD also has an additional function, wakeup_one(), which only wakes one process.
There are a number of reasons why this concept is not a good solution for SMP. Firstly, the simplistic assumption "nothing else can be executing in the kernel while I am" falls flat. We currently haven't implemented a solution for this. Instead, we found a way of enforcing this illogical state, the Giant Kernel Lock. Any process entering the kernel must first obtain the GKL; if a process executing on another processor has the lock, then the current processor spins; it can't even schedule another process to run, because that requires entering the kernel.
In March this year Mike Smith did some testing of Oracle on FreeBSD. The results were impressive. On an SMP machine--I believe a 4 processor Xeon--which had been running a tuned Oracle/Solaris-86 combination, and which achieved 200 tps on that platform, he initially got 14 tps. After some effort he got the performance up to 40 tps, still only 20% of what Solaris did. The investigations he made showed that Oracle was continually calling sigprocmask (2) to set and reset SIGALRM. Each time this happened, the other CPUs ground to a halt: this would by itself explain a 75% performance degradation.

Aside: I think this is a bug in an Oracle library. Some System V libraries simulate sockets with STREAMS, and in the process they play fast and loose with SIGALRM. A version of Oracle for FreeBSD would use native sockets, of course.

The other issue is with masking interrupts. This is also quite a problem for SMP machines, since it requires masking the interrupts on all processors, again requiring an expensive synchronization.
The BSD/OS solution.

BSD/OS started working on this problem by first solving the GKL spin lock problem: instead of a single lock, it added a scheduler lock which allowed a process blocking on the GKL to at least schedule another process. This is implemented in BSD/OS 4.2, which I believe will soon be released.

More interesting is the solution chosen for BSD/OS 5.0:
- The GKL remains, but becomes increasingly meaningless. In particular, it is not always necessary to obtain it in order to enter the kernel.
- Instead the system protects shared data structures with mutexes (or mutexs, as Chuck Paterson spells it). I would call these semaphores. These mutexes replace calls to tsleep() when waiting on shared resources (the involuntary process synchronization I mentioned above). In contrast to traditional UNIX, I understand that mutexes will be used much more frequently (in particular, to protect data structures which were previously implicitly protected by the non-preemptive nature of the kernel). This mechanism will replace calls to tsleep.
  
  Compared with the use of tsleep, mutexes have a number of advantages:
  - Each mutex has its own wait (sleep) queue. When a process releases a mutex, it automatically schedules the next process waiting on the queue. This is more efficient than searching a possibly very long, linear sleep queue. It also avoids the flooding when multiple processes get scheduled, and most of them have to go back to sleep again.
  - Mutexes can be a combination of spin and sleep mutexes: for a resource which may be held only for a very short period of time, even the overhead of sleeping and rescheduling may be higher than waiting in a tight loop. A spin/sleep lock might first wait in a tight loop for 2 µs and then sleep if the lock is still not available at that time. This is an issue which Sun has investigated in great detail with Solaris, but BSDi has not looked at in as much detail. It's possibly an area for us to investigate once the system is up and limping again.
- We eliminate interrupt lockout (spl()) altogether. Instead, interrupt functions use mutexes for synchronization. This means that an interrupt function must be capable of blocking, which is currently impossible: in order to block, the function must have a "process" context, where I use the term "process" loosely. In particular, this could include kernel threads.
  
  BSD/OS 5.0 on Intel currently uses light-weight interrupt threads to process interrupts, while 5.0-SPARC uses normal ("heavyweight") processes. Chuck indicated that the decision to implement light-weight threads initially was probably the wrong one, since it gave rise to a large number of problems, and although the heavyweight process model would give lousy performance, it would probably make it easier to develop the kernel while the light-weight processes were being debugged. There is also the possibility of building a kernel with one or the other support, so that in case of problems during development it would be possible to revert to the heavy-weight processes while searching for the bug.
Other details we discussed included:
- BSD/OS has not implemented condition variables. We didn't go into details. The opinion was expressed that they would be very useful for synchronization, but that they require coding discipline. Condition variables can live along with tsleep(). However, converting all use of tsleep() is a lot of work.
- Netgraph poses locking performance problems, since locks have to be released at multiple potential transfer points, regardless of whether Netgraph is in use. This is the same problem as SYSV STREAMS.
- Interrupts can have priority inversion problems on MP machines. However, it's a temporary inversion that just causes latency. The reason that deadlock never occurs is that as soon as a lock is missed, the interrupt stack stealing is unwound, so there is never a situation where a lock is held that can cause deadlock.
- NFS leasing causes big problems. Samba will have similar problems, potentially.
- Message queues are probably worthwhile, but they're currently not a high priority.
- There are a number of global variable updates that are not locked, and can thus result in partially updated variables (i.e. reads can get corrupt values). This requires either using a locked instruction, or using a mutex. A mutex isn't much more expensive, and is probably easier.
- We should split part of struct proc out into a fixed-size kproc for ps use. This isn't really related to the SMP work, but it would be nice to get rid of the dreaded "proc size mismatch" error message that people get when their kernel is out of sync with userland.
- We spoke about naming conventions. Some people weren't too happy with BSD/OS's macro names. Chuck agreed and said that he would adopt our naming convention if we chose a better one.
- Per-CPU variables need GET_*() and SET_*() routines to lock.
Things we need to do

There are a number of things we need to do. During the meeting we didn't get beyond deciding the first couple of things:
- First remove the Giant Kernel Spinlock and replace it with two, maybe three mutexes, one for the scheduler (schedlock), and a blocking mutex for the kernel in place of the spinlock. BSDi also have an ipending lock for posting interrupts, which we should probably implement in the short term, though it's possible that it might go again.
- In addition, implement the heavy-weight interrupt processes. These would remain in place while the light-weight threads were being debugged.
Who does what?

A surprising number of the participants were in read-only mode. Taking the list above,
- The blokes from Apple sat in the corner on the first day and said nothing. They didn't come the second day. I heard that Darwin (MacOS X) apparently uses a GKL, and they were concerned about the issues. On the other hand, the low-level part of MacOS X is Mach, so it's not clear how much use the BSD/OS code would be.
- Jeffrey Hsu was relatively active in the discussions, but didn't commit to anything. I don't know him very well; he's been in the project for a long time, but I haven't seen him produce anything significant.
- Kirk McKusick wanted to get away and think about it, a sentiment I shared--others were raring to go, apparently before they understood the issues. Typically Kirk gets involved in return for (much) money, so I don't see him doing much beyond giving direction here.
- Peter Wemm acknowledged that he wouldn't have time.
- Jayanth said very little. I'm not sure that he's in a position to do this kind of work.
- Alfred Perlstein didn't commit to anything, probably because his employer wouldn't buy in.
- David Greenman, Justin Gibbs and Mike Smith gave no reason for their non-involvement.
- David O'Brien and Ceren Ercen are not in a position to contribute. They were there mainly to meet the others.
This leaves us with the following people who have committed to anything:
- Matt Dillon will put in locking primitives and schedlock. This includes resurrecting our long-dead idle process to scan the run queue for interrupt threads. He won't have time for NFS.
- Greg Lehey will implement the heavyweight interrupt processes and lightweight interrupt threads.
- Doug Rabson will do Alpha machine-dependent work and keep in touch with i386 work. May be a good resource for NFS, but he doesn't want to commit to it.
- Paul Saab will add kdebug functionality to our DDB.
- Jonathan Lemon wants to convert network drivers.
- Chuck Paterson will be active in a passive role to support the project. He expects to spend about 50% of his time doing so.
- Jason Evans will be the project manager, the first time we've had anybody doing this job.
Of these, only Matt, Greg and Jason currently have specific tasks assigned.
Timing

We have a general agreement that it's better to do it right than do it quickly. Nevertheless, Matt Dillon was very keen to get started as quickly as possible. He expected to be able to implement the dual lock system in the first weekend, though most of us doubted both the time frame and the likelihood that it would even be possible to get it to work without the interrupt processes.

Matt did in fact spend about 5 days on this job, during which he re-implemented the mutex code, a thing that wasn't agreed upon. His current patches, including detailed documentation, are at http://apollo.backplane.com/FreeBSDSmp/

He also discovered that, as we had expected, his code wouldn't work without mine. There's a certain dissatisfaction in the community that he's gone off in his own direction, and there's a possibility of conflict there. People are already calling for rejection of the code, but I'm concerned that, if Jason is too heavy on him (not a danger I have seen so far), he might leave the project. These are, of course, project management issues, and I'd be interested to see how we manage them.

From my point of view, Matt's actions put a certain amount of pressure on me. I have Jason's full backing, however, so this is not too serious.

On the light side, we had a rather amusing experience on Friday. We wanted to order some sandwiches, but something went wrong with the order, so Paul ordered pizza instead. A bit later, the pizza boy came in and deposited the pizzas on the conference table and was about to leave when Paul introduced him. His name is David Filo.