SCO's evidence of copying between Linux and UnixWare

by Greg Lehey

Note: The opinions expressed here are my own and have no relationship with the opinions or official viewpoints of any organization with which I am associated.

Executive summary

On 18 August 2003, SCO presented “evidence” of transfer of SCO code into Linux at a conference in Las Vegas. The slides have become available on the web, and this page analyzes the allegations. In summary, SCO's allegations are incorrect. They do, however, indicate that SCO's own code base is in a much worse state than we had previously imagined, and it is very possible that SCO has misappropriated code from BSD. Specifically:

The evidence

On 20 August 2003, Heise Verlag in Germany published an update to a report with the (translated) unemotional and objective title “SCO threatens to kill Open Source”. It was the first thing I had seen that referred to a presentation of textual similarities between UnixWare and Linux. I don't have time to translate the whole thing, but here's the important bit:

Assisted by his vice president, Chris Sontag, McBride showed examples from the code of Linux 2.5 and 2.6 which should prove that source code has been taken out of Unix without change–an example shown by SCO shows code commentaries ... Identical typing mistakes in the commentaries and unusual formulations had left traitorous traces, claimed Sontag. To prove this, McBride had hired a pattern recognition team to hunt through tens of thousands of lines of code. The few code sequences near the comments were made illegible to protect SCO's copyrights.

There's not much information here, and on my first attempt I came to incorrect conclusions. Firstly, the claims are stupid. There's no code in common, just a comment which, admittedly, looks to be the same. I also don't see any “typing mistakes”. But where does it come from? On the SCO side, it includes another line (“The swap map unit is 512 bytes”). Maybe this is not correct for Linux. But people who copy comments so literally don't remove things just because they're wrong. They haven't fixed the broken indentation, for example, assuming that this really is broken indentation in the code and not a badly prepared slide, which is entirely possible.

But if two comments are the same except for an addition, which is the original? The one without the addition, obviously. I initially saw this as an indication that the code was copied from Linux to UnixWare.

In addition, the alleged code sequences near the comments which were made illegible to protect SCO's copyrights are really additional commentary. You don't have to be a C programmer to recognize that comments start with /* and end with */, and that people frequently put a single * in multiline comments for stylistic reasons, something that the person who put together this slide obviously didn't consider important.

The comment is in English written in approximate Greek letters and reads:

As part of the kernel evolution toward modular naming, the functions malloc and mfree are being renamed to rmalloc and rmfree. Compatibility will be maintained by the following assembler code: (also see mfree/rmfree below)

This comment is completely irrelevant to the Linux code to which the first half of the comment (the part written in Roman letters) has been applied. The Linux version of both of these examples comes from the file arch/ia64/sn/io/ate_utils.c. There are a number of interesting things to note about this file:

Further down in the Heise report, you can read:

In total, SCO's testers claim to have found more than 800,000 lines of duplicate code–an example from SCO

OK, let's look at this example. In fact, it's a continuation of the previous example, the function atealloc in arch/ia64/sn/io/ate_utils.c. There are a number of things to note about it:

After reading other opinions, notably those of Bruce Perens and friends (also since updated), I realized that I was wrong: the algorithm for the function atealloc is effectively the old UNIX algorithm for malloc(). SCO is incorrect in claiming that the code in question has been lifted from System V.4 without changes, but that doesn't change the fact that it obviously comes from System V.4. Here's the corresponding code in the Seventh Edition of UNIX (1978), which SCO (then called Caldera) released in early 2002:

/*
 * Allocate 'size' units from the given
 * map.  Return the base of the allocated
 * space.
 * In a map, the addresses are increasing and the
 * list is terminated by a 0 size.
 * The core map unit is 64 bytes; the swap map unit
 * is 512 bytes.
 * Algorithm is first-fit.
 */
malloc(mp, size)
struct map *mp;
{
        register unsigned int a;
        register struct map *bp;

        for(bp=mp;bp->m_size && ((bp-mp) < MAPSIZ);bp++) {
                if (bp->m_size >= size) {
                        a = bp->m_addr;
                        bp->m_addr += size;
                        if ((bp->m_size -= size) == 0) {
                                do {
                                        bp++;
                                        (bp-1)->m_addr = bp->m_addr;
                                } while ((bp-1)->m_size = bp->m_size);
                        }
                        return(a);
                }
        }
        return(0);
}
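As a reading aid, the first-fit scheme described in the comment above can be restated in present-day C. This is only a minimal sketch: the names `map_entry` and `map_alloc` are hypothetical, the `MAPSIZ` bound of the original loop is left out, and the whole entry is copied at once where the original copies the address and size separately.

```c
/* Hypothetical names for illustration; only the algorithm follows the listing above. */
struct map_entry {
    unsigned int size;  /* length of the free region; 0 terminates the list */
    unsigned int addr;  /* start of the free region */
};

/*
 * First-fit allocation: scan the map in address order and carve 'size'
 * units off the front of the first region that is large enough.
 * Returns the base of the allocated space, or 0 if nothing fits.
 */
unsigned int map_alloc(struct map_entry *map, unsigned int size)
{
    struct map_entry *bp;

    for (bp = map; bp->size != 0; bp++) {
        if (bp->size >= size) {
            unsigned int base = bp->addr;

            bp->addr += size;
            bp->size -= size;
            if (bp->size == 0) {
                /*
                 * The region is used up: shift the following entries
                 * down by one, including the terminating zero entry.
                 */
                do {
                    bp++;
                    bp[-1] = bp[0];
                } while (bp[-1].size != 0);
            }
            return base;
        }
    }
    return 0;
}
```

For example, with a map holding a free region of 4 units at address 100 and one of 16 units at address 200, a request for 8 units skips the first region and returns 200. A following request for 4 units returns 100 and removes the now empty first entry, so the map then starts with the remaining 8 units at address 208.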

Both the comments and the code are obviously related to the example SCO showed. But some things are missing, and the comments are formatted differently. In fact, the Seventh Edition code is almost identical to the oldest version of this code, which was introduced in the Third Edition of Research UNIX in January 1973, the first version of UNIX to be written in C. I've confirmed with a “reliable source” that System V code includes the following changes:

But maybe this code has come from BSD? No. Even in 1986, in 4.3BSD, malloc() had deviated significantly from the original:

/*
 * Allocate 'size' units from the given
 * map.  Return the base of the allocated space.
 * In a map, the addresses are increasing and the
 * list is terminated by a 0 size.
 *
 * Algorithm is first-fit.
 *
 * This routine knows about the interleaving of the swapmap
 * and handles that.
 */
long
rmalloc(mp, size)
    register struct map *mp;
    long size;
{
    register struct mapent *ep = (struct mapent *)(mp+1);
    register int addr;
    register struct mapent *bp;
    swblk_t first, rest;

    if (size <= 0 || mp == swapmap && size > dmmax)
        panic("rmalloc");
    /*
     * Search for a piece of the resource map which has enough
     * free space to accomodate the request.
     */
    for (bp = ep; bp->m_size; bp++) {
        if (bp->m_size >= size) {
            /*
             * If allocating from swapmap,
             * then have to respect interleaving
             * boundaries.
             */
            if (mp == swapmap && nswdev > 1 &&
                (first = dmmax - bp->m_addr%dmmax) < bp->m_size) {
                if (bp->m_size - first < size)
                    continue;
                addr = bp->m_addr + first;
                rest = bp->m_size - first - size;
                bp->m_size = first;
                if (rest)
                    rmfree(swapmap, rest, addr+size);
                return (addr);
            }
            /*
             * Allocate from the map.
             * If there is no space left of the piece
             * we allocated from, move the rest of
             * the pieces to the left.
             */
            addr = bp->m_addr;
            bp->m_addr += size;
            if ((bp->m_size -= size) == 0) {
                do {
                    bp++;
                    (bp-1)->m_addr = bp->m_addr;
                } while ((bp-1)->m_size = bp->m_size);
            }
            if (mp == swapmap && addr % CLSIZE)
                panic("rmalloc swapmap");
            return (addr);
        }
    }
    return (0);
}

The origin of this code is still clearly recognizable, but the code has evolved. If we are to believe SCO, even today, 17 years later, System V malloc(), a critical function, has not evolved to this extent. In those 17 years, BSD malloc() has been completely rewritten, while System V malloc() is essentially the same function as in the very first C language implementation of 1973.

There are a number of things to note about this code:

esr's analysis

Since I wrote the above, Eric Raymond (esr) has also done an analysis. He covers the same ground as I do, but comes to a different conclusion. I stand by my conclusion.

The main differences in esr's approach are:

All of this closely mirrors what I've shown above. In view of the litigious nature of SCO, I haven't included a diff with System V, but esr's version looks the same. None of this explains why he comes to the conclusion he does, that the Linux version was derived from 32V. The similarities between the System V and Linux versions are far too obvious. He writes:

The System V and Linux versions really differ from the common ancestor 32V only in that they both contain mutual-exclusion locking, but it is implemented in significantly different ways, using different data structures.

Well, of course they'd use different functions and different data structures: they fit into different kernels. He also doesn't mention the almost identical ASSERT statements in System V and Linux, something missing in all the other versions. Mutual exclusion locking is an understandable thing to add, and as I commented, there's almost only one way to do it. But the ASSERTs are debugging tools which tend to get added after some problem shows up too often, and which then don't go away again (one reason for not removing them is that they're usually not enabled, so they don't take up any space in the executable).

I see nothing to question my statements above.

esr also writes:

In retrospect, there was a clue in the Linux code all along that it had been copied from rather old sources: the register declarations. Those do by hand an optimization that modern C compilers do automatically, and most programmers lost the habit of inserting them in new code a good ten years ago. So the honest question is: where was Linux's atemalloc copied from?

This is baffling. In his diff, he shows that System V uses the register keyword as well, so I'm not sure what he's aiming at here. It's true that nobody uses this keyword in new code any more, but we're not talking about new code here. It's 30 years old.

esr makes other points:

Given this, there are two pieces of internal evidence that suggest the ancient code. One is that the function is split in two in SVr4 but single in ancient Unix and Linux.

This is true, but the split seems unimportant: in System V, the malloc() function is simply a wrapper for rmalloc(). We've already seen the comparison between rmalloc() and atealloc(). The function we're talking about here was called rmalloc() in System V, but since it's been renamed anyway, that's not of any significance.

A subtler indication that one change between SVr4 and Linux would remove a cast (in the second ASSERT call). It is quite unlikely that a programmer casually copying code would go to the effort to remove a cast, and a guilty copier wouldn't do it when there are ways to obscure similarities that are both easier and less likely to spawn subtle bugs. This is especially true since a more effective obscuring method would have been to remove the ASSERTs entirely; they are used for debugging rather than being neccessary to operation and could be readily dispensed with.

Agreed, the difference can't be accounted for as an attempt to obfuscate the source of the code; that's obvious enough. But removing the cast in the System V case would cause the code to fail: it's asserting that the value is less than 0x80000000. Interpreted as a 32-bit signed number, this is the smallest (most negative) possible value, so a signed comparison will always fail. The point that esr has missed here is that the Linux version is 64-bit code, where this value has no particular meaning. It's possible that the unsigned cast had to be removed to avoid a compiler warning, though I can't see why it should cause one, or for some similar reason, possibly including internal code auditing.
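The arithmetic behind this argument can be checked in isolation. The following is a minimal sketch and not the ASSERT from either kernel, which is not reproduced here. The function names are hypothetical, and the fixed-width types from `<stdint.h>` are an assumption standing in for whatever types the kernels actually use.

```c
#include <stdbool.h>
#include <stdint.h>

/* The bit pattern 0x80000000 read as a signed 32-bit value is INT32_MIN. */
static const int32_t limit_as_signed32 = INT32_MIN;

/* Signed 32-bit comparison: nothing is smaller than INT32_MIN,
 * so this returns false for every possible argument. */
bool below_limit_signed32(int32_t v)
{
    return v < limit_as_signed32;
}

/* Unsigned 32-bit comparison: true exactly when the top bit is clear. */
bool below_limit_unsigned32(uint32_t v)
{
    return v < UINT32_C(0x80000000);
}

/* Signed 64-bit comparison: 0x80000000 is simply 2^31 here,
 * an ordinary positive number in the middle of the range. */
bool below_limit_signed64(int64_t v)
{
    return v < INT64_C(0x80000000);
}
```

In the 32-bit signed case the check can never succeed, which is why the cast to unsigned matters there. In the 64-bit case the comparison is meaningful with or without a cast for any value that fits in 32 bits.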

The second example: BPF

Bruce Perens and friends have published the complete “Powerpoint” presentation from SCO. Here's a link to a PDF version. It includes a second example, code from the Berkeley Packet Filter.

Berkeley? Isn't that BSD? Well, sort of, it seems. It grew up around the BSD distributions, but it's not part of them. The license, however, is pure BSD. The code in question is indeed the same. It's quoted like this:

                        pc += (A == pc->k) ? pc->jt : pc->jf;
                        continue;

                case BPF_JMP|BPF_JSET|BPF_K:
                        pc += (A & pc->k) ? pc->jt : pc->jf;
                        continue;

                case BPF_JMP|BPF_JGT|BPF_X:
                        pc += (A > X) ? pc->jt : pc->jf;
                        continue;

                case BPF_JMP|BPF_JGE|BPF_X:
                        pc += (A >= X) ? pc->jt : pc->jf;
                        continue;

                case BPF_JMP|BPF_JEQ|BPF_X:
                        pc += (A == X) ? pc->jt : pc->jf;
                        continue;

                case BPF_JMP|BPF_JSET|BPF_X:
                        pc += (A & X) ? pc->jt : pc->jf;

Any programmer must cringe at the way this has been quoted. It should be pretty clear even to a non-programmer that this code consists of groups of three lines. The first line describes a condition, the second specifies an action to take, and the third (continue) tells the program that that's all (and not to continue to the following line). But the people who prepared the slides chopped off the first line of the first group, and the last line of the last group.
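To see why each group needs all three lines, here is a minimal sketch of the same dispatch pattern. It is not the BPF source: the opcodes, the `insn` structure, and `sample_prog` are hypothetical and much simpler than the real instruction set.

```c
#include <stdint.h>

/* Hypothetical opcodes; not the real BPF encoding. */
enum op { OP_JEQ, OP_JGT, OP_RET };

struct insn {
    enum op  code;
    uint32_t k;   /* constant operand */
    uint8_t  jt;  /* instructions to skip if the test is true */
    uint8_t  jf;  /* instructions to skip if the test is false */
};

/* Run a program against accumulator value a; return the k of the RET reached. */
uint32_t run(const struct insn *pc, uint32_t a)
{
    for (;; ++pc) {
        switch (pc->code) {
        case OP_JEQ:                               /* line 1: the condition */
            pc += (a == pc->k) ? pc->jt : pc->jf;  /* line 2: the action */
            continue;                              /* line 3: next instruction */
        case OP_JGT:
            pc += (a > pc->k) ? pc->jt : pc->jf;
            continue;
        case OP_RET:
            return pc->k;
        default:
            return 0;  /* unknown opcode: give up */
        }
    }
}

/* Returns 1 if the accumulator equals 7, otherwise 0. */
const struct insn sample_prog[] = {
    { OP_JEQ, 7, 0, 1 },
    { OP_RET, 1, 0, 0 },
    { OP_RET, 0, 0, 0 },
};
```

With this program, `run(sample_prog, 7)` returns 1 and `run(sample_prog, 3)` returns 0. Without the `continue`, control would fall through from one `case` into the next and apply a second jump, so a quoted group that lacks its `case` line or its `continue` is not a complete unit of code.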

As Perens points out, this is not System V code. It's freely available for download on the Internet. The example above comes from the file bpf-1.2a1/net/bpf_filter.c. The license at the beginning of this file reads:

/*-
 * Copyright (c) 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997
 *      The Regents of the University of California.  All rights reserved.
 *
 * This code is derived from the Stanford/CMU enet packet filter,
 * (net/enet.c) distributed as part of 4.3BSD, and code contributed
 * to Berkeley by Steven McCanne and Van Jacobson both of Lawrence
 * Berkeley Laboratory.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 * 1.  Redistributions of source code must retain the above copyright
 *    notice, this list of conditions and the following disclaimer.
 * 2.  Redistributions in binary form must reproduce the above copyright
 *    notice, this list of conditions and the following disclaimer in the
 *    documentation and/or other materials provided with the distribution.
 * 3.  All advertising materials mentioning features or use of this software
 *    must display the following acknowledgement:
 *      This product includes software developed by the University of
 *      California, Berkeley and its contributors.
 * 4.  Neither the name of the University nor the names of its contributors
 *    may be used to endorse or promote products derived from this software
 *    without specific prior written permission.
 *
 * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
 * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
 * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
 * ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
 * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
 * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
 * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
 * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
 * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
 * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
 * SUCH DAMAGE.
 *
 *      @(#)bpf.c       7.5 (Berkeley) 7/15/91
 */

How could SCO miss such a glaring indication that it wasn't their code? I think the answer is simple: it's no longer there. I suggest that SCO has removed the license conditions, in direct contravention of paragraph 1 of the license. This suggests that, far from proving any fault in Linux, it has pointed to SCO abusing the BSD license.

On 3 September 2003, during the AUUG 2003 conference, I participated in a panel discussion with Kieran O'Shaughnessy, the General Manager of SCO Australia, and Con Zymaris, an Australian open source activist. I asked Kieran this question (“How could you miss the BSD license?”), and he replied that this was not supposed to be evidence of real System V code in Linux, just a demonstration of the techniques involved. At first I thought he was just trying to worm his way out of the issue, but it seems to be the party line; I'll chase down other references when I have time. In the meantime, look at slide 15 of the briefing and decide whether you think that this was their intention. I have difficulty getting past the conclusion:

  1. This was intended to show theft of SCO intellectual property, which it is not.
  2. SCO made a mistake.
  3. The only way they could do that would be if somebody had removed the BSD license from this file.

Summary

This presentation was supposed to prove that Linux is abusing SCO intellectual property. It seems not only to completely and utterly fail in this purpose, but also to show a number of problems within SCO:



$Id: code-comparison.php,v 1.27 2011/09/26 22:49:44 grog Exp $