Introduction

This document has been written to describe the compression format used in Sonic the Hedgehog 2 for compressing various level data, graphics, and other stuff. This format has baffled many people, making it very difficult to create Sonic 2 mods and hacks, since the data structure is hard to modify without disrupting the file format. So, here, I describe the format used in the game.

The Basics

It’s pretty clear that, in level maps, graphics, and so on, there is a great deal of data redundancy. And, in the days when Sonic 2 was being produced, ROM space was at somewhat of a premium. So, being able to compress this data down and reduce some of the redundancy can be very beneficial and mean the difference between a 1 or 2 megabyte ROM. However, this format must be easily decompressed, requiring little in the way of computation, while still providing reasonable compression ratios.

In Sonic 2, the compression format chosen relies on the fact that segments in a piece of data may be repeated in other places. So, it is designed similar to a run-length encoding scheme. Essentially, during decompression, the format allows you to specify data copies from previously decompressed data. So, given an offset into the currently decompressed buffer and a length, you can exploit the data repetition by copying previously decompressed segments. This provides a great deal of compression (for example, in EHZ1, the compressed level is 1/10th the size of the original) while being easy to decompress.

Overview of the Format

The compression scheme used in Sonic 2 provides four different ways for storing data. The first method simply specifies uncompressed data. The other three are different methods of indicating the copy offset and length for decompression. These methods each have tradeoffs in terms of amount of bytes needed to represent them, the offset range, and the maximum count expressible. Thus, during compression, one could choose the best method given a particular set of data.

Now, the disadvantage of such a format is thatit does create added complexity, and requires some way of indicating the compression method for each segment. In the next section I’ll go over the details of the compression format and we’ll see how this indication is achieved.

The Details

The compressed data is composed of 16-bit bitfields stored in little-endian format, which are used to indicate the compression methods used for the next chunk of data, and the data itself. The bitfield is read starting at the right, and each bit is used to determine how to treat the next byte (or two or three) of data. There are three bit combinations:

  1. 1 - Straight byte copy (no compression)
  2. 10 - Embedded/Separate compression
  3. 00 - Inline compression

In the straight byte copy format, the data is simply not compressed. So, read it as is.

The Embedded/Separate compression formats are slightly more complex. The next two or three bytes contain data about the compression format. The format for these bytes is as follows:

OOOOOCCC OOOOOOOO [CCCCCCCC] <- optional

O - Offset bit, C - Count bit

To construct the offset from this data, take the five offset bits from the first byte and prepend them to the offset bits from the second byte. The result is a 13-bit offset.

The count is slightly more complicated. By default, take the count from the bottom three bits of the first byte and add one. However, if these bits are all zero then you must read the next byte to get the count. If, at this point, the count is zero, we’re at the end of the data buffer. If the count is a one, we do nothing.

Now, the first case (where the 3 count bits are non-zero) is what I refer to as the Embedded format. The second format is the Separate compression format.

The Inline compression format is the easiest compression format. The next byte contains data about the compressed format. This byte is used as an offset. The count is acquired by taking the next two bits from the bitfield, reversing them, and adding one. So, for example, if the next two bits are a 1 and a 0, then the count is 01 + 1 (binary), or 2.

Now, in ALL cases, the number of bytes actually copied during decompression is equal to the count PLUS 1. So, in the above example, three bytes would actually be copied from the specified offset. As well, the offsets are all stored in signed two’s complement.

Also, if, at any time during reading bits from the bitfield, you run out of bits, immediately read the next two bytes from the data buffer and use them as your bitfield.

Decompression Pseudo-Code

Read bitfield from data buffer
MainLoop:
  Shift bit from right side of bitfield
  If no bits left, read next bitfield
  If bit = 1
    Append next byte from data buffer to decompressed buffer
    Position = Position + 1
    Goto MainLoop
  Else
    Shift bit from right side of bitfield
    If no bits left, read next bitfield
    If bit = 1
      Low = Next byte from buffer
      Hi = Next byte from buffer
      Offset = (Hi << 5) | Low
      Count = Hi & 0x07
      If Count != 0
        Count = Count + 1
      Else
        Count = Next byte from buffer
        If Count = 0 terminate
        If Count = 1 goto MainLoop
    Else
      Shift bit from right side of bitfield
      If no bits left, read next bitfield
      Count |= Bit << 1
      Shift bit from right side of bitfield
      If no bits left, read next bitfield
      Count |= Bit
      Offset = Next byte from buffer
  Count = Count + 1
  while (Count > 0)
    decompressed_buffer[position] = decompressed_buffer[offset + position]
    position = position + 1

Note, this is also available as C code: sonicdec.c

Acknowledgements

Well, I definitely couldn’t have done this without the work of Dave, Joe Groff, Phil K. Hornung, and everyone else who wrote/worked on DGen. Damn, this program is sweet… :)

Damian Grove and the Sonic 2 Hacking Guide website. The information on level data locations is just what I needed to get going. :)

My wife Lenore, for being fairly understanding about me staying up ‘til 3:00am working on this damn stuff.