This page is a mirror of Tepples' nesdev forum mirror (URL TBD).
Last updated on Oct-18-2019

Compression benchmarks

Compression benchmarks
by on (#161355)
Since nobody apparently had benchmarked the usual algos on the NES, I ran some. Feel free to copy to the wiki if you want.

Code:
Algo         | Ratio | Speed (kb/s) | License, comments
--------------------------------------------------------
zlib         | 0.527 | 6.4          | zlib, requires 0.7kb RAM -> not usable without extra RAM
exomizer mem | 0.581 | 9.4          | Free to use/modify; needs about 162 bytes of temp RAM, 4 of which need to be ZP
lzo          | 0.666 | 2.8          | GPL, no RAM required
lz5hc        | 0.688 | 2.5          | BSD, no RAM required
fastlz       | 0.725 | 3.9          | MIT, no RAM required
lz4hc        | 0.731 | 5.8          | BSD, no RAM required - optimized asm is 39.5kb/s


Zlib is the asm-optimized version included in cc65; the others were compiled C, and it shows in the speed. Speed is decompression throughput in kb/s. The data used was the first 768 bytes of the GPL, i.e. English text, comparable in compressibility to typical level data.

I started with the assumption that LZO would be the best of the no-RAM algos, and indeed it was. Its speed was a surprise, however: on desktops it is extremely fast, but here it runs at less than half the speed of zlib. Being GPL it is not usable for many closed projects, though commercial licenses are available.

Might try others if I find some; LZO is the only one that advertises low RAM use.

edit: Added lz5hc.
Re: Compression benchmarks
by on (#161362)
Forgive my ignorance, but are these different methods of compressing in-game graphics?

I'm only familiar with RLE, not included here.

Do you have any links to NES code which actually use the compression methods discussed here?
Re: Compression benchmarks
by on (#161370)
dougeff wrote:
Forgive my ignorance, but are these different methods of compressing in-game graphics?

Due to the small amount of RAM on the NES, graphics are usually the only thing compressed with traditional techniques. Some games do use an extra 8KB of WRAM to store level maps, and those maps could also benefit from these traditional compression algorithms.

Quote:
I'm only familiar with RLE, not included here.

RLE is, like, the most basic compression scheme ever, but even it can be implemented in a lot of different ways that'll slightly affect the final compression ratios. Still, all of the other compression schemes mentioned here should outperform RLE by a reasonable margin, especially when compressing text.
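
For illustration, a minimal RLE decoder along these lines might look like this in C (the format, where the high bit of a control byte selects a run versus literals, is invented just for the example):

Code:
#include <stdint.h>
#include <stddef.h>

/* Minimal RLE decoder sketch. Format (made up for illustration):
   a control byte with the high bit set means "repeat the next byte
   (ctrl & 0x7F) times"; otherwise copy ctrl literal bytes.
   A control byte of 0 terminates the stream. */
size_t rle_unpack(const uint8_t *src, uint8_t *dst)
{
    size_t out = 0;
    uint8_t ctrl, len, value;

    while ((ctrl = *src++) != 0) {
        if (ctrl & 0x80) {              /* run: one byte repeated */
            len = ctrl & 0x7F;
            value = *src++;
            while (len--)
                dst[out++] = value;
        } else {                        /* literals: copy ctrl bytes verbatim */
            while (ctrl--)
                dst[out++] = *src++;
        }
    }
    return out;                         /* number of bytes written */
}

Different choices for the control byte (which run lengths are allowed, how literals are grouped) are exactly where implementations diverge and the ratios shift slightly.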

Quote:
Do you have any links to NES code which actually use the compression methods discussed here?

I believe there are 6502 implementations of decompression routines for some of these schemes, but if you want to use them for graphics you must modify all the code that deals with the output stream to use $2006/$2007 for reads and writes.
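
Roughly, redirecting the output side to the PPU looks like this (a sketch in cc65-style C, not taken from any of the libraries above; vram_seek/vram_put are made-up names, and rendering must be off):

Code:
#include <stdint.h>

/* $2006/$2007 are the NES PPU address and data ports. */
#define PPUADDR (*(volatile uint8_t *)0x2006)
#define PPUDATA (*(volatile uint8_t *)0x2007)

/* Set the VRAM address: high byte first, then low byte. */
static void vram_seek(uint16_t addr)
{
    PPUADDR = (uint8_t)(addr >> 8);
    PPUADDR = (uint8_t)(addr & 0xFF);
}

/* Wherever the decompressor would do dst[out++] = value,
   write the byte to $2007 instead; the address auto-increments. */
static void vram_put(uint8_t value)
{
    PPUDATA = value;
}

/* Back-references are the awkward part: copying a match means
   re-seeking to the source address, discarding one buffered dummy
   read from $2007, copying, then re-seeking to the write position,
   which is why LZ-style formats lose speed when targeting VRAM. */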
Re: Compression benchmarks
by on (#161377)
lz4hc seems to be the best of the four tested (considering RAM first, then time), but it would still be painfully slow: 5.8 kb/s is only about 97 bytes per NTSC field, assuming 100% CPU usage.
Re: Compression benchmarks
by on (#161378)
I'm not sure what data you are showing, but it almost certainly does not make sense in a "general purpose" way. A good compression algorithm exploits properties of the data it compresses, so it really depends on the data. This is why I wrote my own tool that does statistical analysis using many different low-resource algorithms.

The NES has 2kb RAM so an algorithm that requires 0.7kb of RAM is certainly usable without extra RAM.

ROM usage is certainly more important than RAM usage. If the decompression algorithm is large, it kills off the savings you get by compressing data.
Re: Compression benchmarks
by on (#161381)
dougeff wrote:
Forgive my ignorance, but are these different methods of compressing in-game graphics?

Do you have any links to NES code which actually use the compression methods discussed here?

They can be used for graphics with CHR RAM, but I intend to use them on level data and such. No public examples that I know of, and for the decompression I used the latest release of each.

43110 wrote:
lz4hc seems to be the best of the four tested (considering RAM first, then time), but it would still be painfully slow: 5.8 kb/s is only about 97 bytes per NTSC field, assuming 100% CPU usage.


When using it to compress the level data, which must be under 2kb, that means sub-0.5 s level loading time (2kb at 5.8 kb/s is about 0.35 s), which is quite good. Loading can happen on a black screen, so all of the CPU time can be used.

Bregalad wrote:
The NES has 2kb RAM so an algorithm that requires 0.7kb of RAM is certainly usable without extra RAM.


There usually isn't that much free. In a game I did previously, there was less than 200 bytes of free BSS, so zlib couldn't be used. To keep that much free, you pretty much have to design for it from the start.

Quote:
ROM usage is certainly more important than RAM usage. If the decompression algorithm is large, it kills off the savings you get by compressing data.


Yes. However, level data may total 10kb or more; compression that cuts it in half can cost a kb or two of ROM for the decompressor and still be a big win.
Re: Compression benchmarks
by on (#161395)
Did anyone ever make a 6502 version of APLIB? I've made ARM and Z80 versions.
Re: Compression benchmarks
by on (#161404)
Bregalad wrote:
The NES has 2kb RAM so an algorithm that requires 0.7kb of RAM is certainly usable without extra RAM.

Maybe between levels, or other times when you're not keeping track of an entire virtual world. I seriously doubt that most NES games will have 0.7KB of free RAM during gameplay.
Re: Compression benchmarks
by on (#161406)
Dwedit wrote:
Did anyone ever make a 6502 version of APLIB? I've made ARM and Z80 versions.
The version mic_ wrote in 2010 is still available, and was successfully used for the NES version of Streemerz.
Re: Compression benchmarks
by on (#161415)
Too bad aplib's compressor is closed source.
Re: Compression benchmarks
by on (#161418)
There's the LZ4 one:
http://www.brutaldeluxe.fr/products/crossdevtools/lz4/

It's for the 65816 CPU, but a 6502 version exists somewhere.

It's a fast decompressor (oriented toward real-time use), but of course it doesn't have the best ratio.
Re: Compression benchmarks
by on (#161421)
I don't know if there's a good random-access compression scheme (LZ78-style) out there. Zelda Oracles used something like that for the dungeon maps.
Re: Compression benchmarks
by on (#161423)
The Legend of Zelda: Oracle of * also had the Game Boy Color's 32K of RAM (4K fixed, 4K switched) and 16K of VRAM, which is easily randomly accessible while rendering is off, to work with. For random access with little RAM, such as on an NES without extra RAM on the cartridge, you need to use a static dictionary in ROM.

For text, try Huffword or byte pair, with pointers to each document (e.g. each page of text or each line of dialogue). I used byte pair for my robotfindskitten port, and I was going to use Huffword for an e-book reader until the project ran into serious emulator bugs.
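
For illustration, a byte pair decoder can be tiny. A rough C sketch (the code ranges and table layout are assumptions, not the actual format of the robotfindskitten port):

Code:
#include <stdint.h>
#include <stddef.h>

/* Byte pair (digram) decoding sketch. Codes 0x00-0x7F are literal
   characters; codes 0x80-0xFF index a table of pairs, and either half
   of a pair may itself be another pair code. The table contents are
   chosen by the encoder from the most frequent pairs. */
static const uint8_t pair_table[128][2] = {
    { 't', 'h' },   /* 0x80 -> "th" (example entries) */
    { 'e', ' ' },   /* 0x81 -> "e " */
    /* ... remaining entries filled in by the encoder ... */
};

size_t bpe_expand(const uint8_t *src, size_t srclen, uint8_t *dst)
{
    size_t i, out = 0;
    uint8_t stack[16];          /* pending right halves; encoder bounds the depth */
    uint8_t sp, c;

    for (i = 0; i < srclen; i++) {
        c = src[i];
        sp = 0;
        for (;;) {
            if (c & 0x80) {                         /* pair code */
                stack[sp++] = pair_table[c & 0x7F][1];
                c = pair_table[c & 0x7F][0];
            } else {
                dst[out++] = c;                     /* literal character */
                if (sp == 0)
                    break;
                c = stack[--sp];
            }
        }
    }
    return out;
}

The decoder needs no work RAM beyond the small stack, which is what makes this attractive for random-access text on the NES.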

For map data, there's either multi-layer metatiles (Mega Man; Blaster Master; Sonic the Hedgehog) or object coding (Super Mario Bros. series). The Legend of Zelda for NES is known for using metatiles that are a column of the screen.
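
As a rough illustration of the single-layer case (array names and sizes are assumptions, not any particular game's layout), expanding a row of 2x2 metatiles into nametable rows looks like this:

Code:
#include <stdint.h>

#define MAP_W 16                 /* 16 metatiles of 16x16 pixels = 256 px */

/* Each metatile expands to four 8x8 tiles: TL, TR, BL, BR. */
extern const uint8_t metatile_defs[256][4];

/* Expand one row of metatiles into two 32-tile nametable rows. */
void expand_metatile_row(const uint8_t *map_row,
                         uint8_t *nt_top, uint8_t *nt_bottom)
{
    uint8_t x;
    for (x = 0; x < MAP_W; x++) {
        const uint8_t *def = metatile_defs[map_row[x]];
        nt_top[x * 2]        = def[0];   /* top-left */
        nt_top[x * 2 + 1]    = def[1];   /* top-right */
        nt_bottom[x * 2]     = def[2];   /* bottom-left */
        nt_bottom[x * 2 + 1] = def[3];   /* bottom-right */
    }
}

Multi-layer schemes repeat the same idea: a larger block is itself a small grid of metatile indices, and the map stores block indices instead.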

Another trick I like to use that generalizes RLE in a static-dictionary-ish way is based on a Markov chain. For each symbol, store the most common symbol that follows it, and then treat a "run" as a set of symbols each followed by its most common next symbol. (RLE is the special case of this where each symbol's most common next symbol is itself.) Background maps in Haunted: Halloween '85 for NES are stored this way with vertical runs. Level maps in Wrecking Ball Boy for Pygame are horizontal rows of tiles that are then Markov-expanded vertically.
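
A decoder for that Markov scheme can be sketched in C like so (the stream layout here, a literal byte followed by a run-length byte, is my own illustration rather than the exact format of those games):

Code:
#include <stdint.h>
#include <stddef.h>

/* next_sym[s] holds the most common symbol that follows s, as measured
   by the encoder. A run of length n after literal c emits c, next_sym[c],
   next_sym[next_sym[c]], and so on. Plain RLE is the special case where
   next_sym[c] == c for every c. */
extern const uint8_t next_sym[256];

size_t markov_unpack(const uint8_t *src, size_t srclen, uint8_t *dst)
{
    size_t i = 0, out = 0;
    uint8_t c, run;

    while (i < srclen) {
        c = src[i++];                /* literal symbol */
        run = src[i++];              /* how many predicted symbols follow */
        dst[out++] = c;
        while (run--) {
            c = next_sym[c];         /* follow the chain */
            dst[out++] = c;
        }
    }
    return out;
}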

It's also useful to consider how you are encoding run lengths. The best method rate-wise depends on the distribution of run lengths and literal lengths; the distribution implied by popular RLE schemes such as PackBits may not be optimal. And on the NES, you often have to break VRAM uploads into fixed-size packets. I like to use a unary code, in which each bit of a control word can be 0 for a single literal symbol or 1 for part of previous run, to be fairly efficient on real data and easy to decode to a multiple of 8 bytes. LZSS in the Allegro 4 library and Game Boy Advance and Nintendo DS BIOS uses a variant of this unary code, where each bit of the control word is one way for a literal and the other way for a reference to previous data.
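
In C, that control-word scheme decodes roughly as follows (a sketch; the MSB-first bit order and the zero initial value are assumptions, not necessarily the exact PB8 or Allegro/GBA layout):

Code:
#include <stdint.h>

/* Each control byte covers exactly 8 output bytes: a 0 bit means "read
   a fresh literal byte", a 1 bit means "repeat the previous byte",
   i.e. the byte is part of the previous run. */
const uint8_t *unary_rle_unpack_packet(const uint8_t *src, uint8_t *dst)
{
    uint8_t ctrl = *src++;
    uint8_t last = 0;
    uint8_t i;

    for (i = 0; i < 8; i++) {
        if (!(ctrl & 0x80))          /* 0 bit: new literal */
            last = *src++;
        dst[i] = last;               /* 1 bit: reuse the previous byte */
        ctrl <<= 1;
    }
    return src;                      /* start of the next packet */
}

Because every packet decodes to exactly 8 bytes, splitting uploads into fixed-size VRAM packets falls out for free.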
Re: Compression benchmarks
by on (#161458)
It is true that RLE is another format, and there is also huffed RLE (which I have used for map data in one game). As tepples mentions, there are many other good ones too. Another thing you can do is decompress data from CHR ROM into RAM; if everything is read sequentially (parts can even be of variable length), this may even speed things up, but you would have to disable rendering to use it properly, so it might be used while loading a level or something like that (if your levels are small enough).

The kind of data you may compress might include:
  • Graphics
  • Map
  • Text
  • Music
Re: Compression benchmarks
by on (#161735)
tepples wrote:
I like to use a unary code, in which each bit of a control word can be 0 for a single literal symbol or 1 for part of previous run, to be fairly efficient on real data and easy to decode to a multiple of 8 bytes. LZSS in the Allegro 4 library and Game Boy Advance and Nintendo DS BIOS uses a variant of this unary code, where each bit of the control word is one way for a literal and the other way for a reference to previous data.

I don't think that qualifies under the definition of unary code you linked, though... (the tokens are always 0 or 1, not a string of bits). Also, what you've described applies to pretty much every LZ77-based scheme out there (LZSS's particularity is the use of a ring buffer instead of the already decoded data, allowing for more room to reuse strings as repeated strings don't go into the buffer).
Re: Compression benchmarks
by on (#161741)
Sik wrote:
tepples wrote:
I like to use a unary code, in which each bit of a control word can be 0 for a single literal symbol or 1 for part of previous run, to be fairly efficient on real data and easy to decode to a multiple of 8 bytes. LZSS in the Allegro 4 library and Game Boy Advance and Nintendo DS BIOS uses a variant of this unary code, where each bit of the control word is one way for a literal and the other way for a reference to previous data.

I don't think that qualifies under the definition of unary code you linked, though... (the tokens are always 0 or 1, not a string of bits)

That's because unary coding and run-length encoding are inverses of each other. From an RLE point of view, PB8's repeat coding is a unary code.

0: Run of 1 byte, then change byte
10: Run of 2 bytes, then change byte
110: Run of 3 bytes, then change byte
1110: Run of 4 bytes, then change byte
etc.

Quote:
Also, what you've described applies to pretty much every LZ77-based scheme out there

It models the probability of a back-reference as 50%, with the other 50% for a single literal byte, with no dependence on the previous runs. This is equivalent to runs of back-references and runs of literals each having a geometric distribution to their lengths. Unary codes and other Golomb codes are ideal for geometric distributions, with the Golomb code's M parameter depending on the distribution's p parameter. Optimally, M should be close to -log(2)/log(1 - p), with the unary code corresponding to M = 1 or p = .5. Most schemes branded as "RLE", on the other hand, treat all run lengths up to 32 or 128 or whatever as equally likely.

Another option for coding run lengths is the exp-Golomb code. It replaces the unary code for the quotient with the Elias gamma code, resulting in longer codes for values 1 and 3 (runs of length 2 or 4) in exchange for shorter codes for longer runs. Exp-Golomb is ideal for power-law (Zipf) distributions.
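
For what it's worth, an order-0 exp-Golomb encoder is only a few lines; a rough C sketch (emit_bit is a placeholder for whatever bit writer the encoder uses, not a function from any library mentioned here):

Code:
#include <stdint.h>

extern void emit_bit(int bit);       /* placeholder bit writer */

/* Order-0 exp-Golomb code for a nonnegative value v (e.g. run length
   minus one): write floor(log2(v+1)) zero bits, then v+1 in binary,
   MSB first. v=0 -> "1", v=1 -> "010", v=3 -> "00100". */
void exp_golomb_encode(uint32_t v)
{
    uint32_t x = v + 1;
    int nbits = 0;
    int i;

    while ((x >> (nbits + 1)) != 0)  /* nbits = floor(log2(x)) */
        nbits++;

    for (i = 0; i < nbits; i++)
        emit_bit(0);                 /* length prefix */

    for (i = nbits; i >= 0; i--)
        emit_bit((int)((x >> i) & 1));   /* the value itself */
}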

Quote:
(LZSS's particularity is the use of a ring buffer instead of the already decoded data, allowing for more room to reuse strings as repeated strings don't go into the buffer).

Repeated strings do go into the ring buffer, at least in Allegro 4 LZSS (which is equivalent to GBA LZSS with different bit ordering). The use of a ring buffer is to allow encoding or decoding without keeping the entire uncompressed stream in RAM at once. The difference between LZSS and the format described in Ziv and Lempel's 1977 paper is that original LZ77 assumed that all back references would be followed by exactly one literal and potentially have zero length, while LZSS allows back-references to follow one another directly.
Re: Compression benchmarks
by on (#161786)
Guys, please keep compression theory in a separate topic. This topic's intention was to have data on how existing open-source compression algorithms perform on the NES, not to argue about what is theoretically best.
Re: Compression benchmarks
by on (#161801)
If you are interested in an LZ4 decompressor, you can find a 6502 implementation here:
http://pferrie.host22.com/misc/appleii.htm
Re: Compression benchmarks
by on (#162985)
Added lz5hc results. It's a newer variant that performs worse than LZO in both metrics, but it's liberally licensed.
Re: Compression benchmarks
by on (#197611)
I've made an asm version of lz4 that will probably soon be a part of cc65. Its speed has been posted on the first page: ~7x faster than the C version.

I also made a version that unpacks to VRAM. Rough speeds for a full 1024-byte nametable:
- rle takes 0.5 frames
- uncompressed takes 1.3 frames
- lz4 takes 2.8 frames

It's slower, but far from being an issue, as unpacking a nametable happens during a black screen anyway. The compression tends to be better than the Shiru version of RLE.
Re: Compression benchmarks
by on (#197661)
Can you post your "samples" (the things you compress to run the tests)? I have some other methods I would like to compare.
Re: Compression benchmarks
by on (#197666)
As mentioned, the base tests were done with the first 768 bytes of the GPL (https://github.com/clbr/nes/blob/master ... /srcdata.h). I can't post the nametable I tested, as it's a part of an upcoming MegaCat game.
Re: Compression benchmarks
by on (#197669)
Code:
Algo         | Ratio  | Speed (kb/s) | License, comments
---------------------------------------------------------
exomizer mem | 0.5807 | 9.4          | Free to use/modify; needs about 162 bytes of temp RAM, 4 of which need to be ZP

Speed is based on 143492 cycles for 768 bytes, extrapolated to estimated NES speeds. It assumes a non-ZP, self-modifying get-next-byte routine, with all variables stored in ZP. Using an indirect vector would save RAM but cost cycles; a ZP-based get-next-byte routine would be faster still. Data was packed in reverse, with literals.
The exomizer 'mem' mode output is 446 bytes, the 'sfx' self-extracting program is 696 bytes, and the RAW decoder used for mem is 189 bytes, with literal support.
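
(As a sanity check on the speed figure above, assuming the NTSC NES CPU clock of roughly 1.79 MHz: 143492 cycles is about 0.080 s, and 768 bytes / 0.080 s is roughly 9600 bytes/s, i.e. about 9.4 kb/s, matching the table.)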
Re: Compression benchmarks
by on (#197690)
Added, thanks.