This page is a mirror of Tepples' nesdev forum mirror (URL TBD).
Last updated on Oct-18-2019 Download

I made a DTE text compressor!

I made a DTE text compressor!
by on (#51751)
EDIT: Version 1.5 released! Click the download link in this post to get it.

Here's the download link. Instructions on how to use it are in the download.

I've done some tests with it (15 books in US-ASCII format from www.gutenberg.org), and the average compressed file / original file ratio is about 54.3%. So that means you should be able to store about 1.84 times the amount of text that you would be able to store uncompressed for a given file size. For instance, if you have 8k available for text, this compressor would make it so that you could compress about 14-15k of text into that 8k.

An NES-based decompressor has now been made! It's DTE only though for the moment.

Also, if you have uncompressed files of game text in the range of 5k to 750k per game (like for Final Fantasy, Chrono Trigger, probably most other RPGs), let me know, I'd like to test them and see how well this works on them. I'd prefer files that use the game's native text format.

by on (#51753)
If you downloaded the file before I made this post, redownload it from the above link. The file I first uploaded had a bug that wouldn't allow you to compress or decompress to a file unless a file already existed with its name. It's fixed now.

by on (#51754)
Looks pretty good! Here soon I may take a shot at using it Garage Cart #2, if I have enough text.

by on (#51755)
I'm always glad when someone share its compressor tool source. I hope that it will help me one day coding my own one :)

Thanks for sharing. I didn't have the time to analyse it yet, but I surely will tonight

by on (#51765)
In addition to DTE for compression, games also sometimes use Dictionary Compression. A programmer of Secret of Evermore mentioned using it.

by on (#51778)
Ninja Gaiden 3 also uses a dictionary for compression if I remember right. Pretty smart thing to do if you have alot of text and seems well worth any effort put into making such a feature. Even better if someone shares their own and you don't have to do anything other than put it to use.

by on (#51786)
In other words: "tepples, hurry up and make huffword already!"

by on (#51787)
Great work Cart Collector. Can I use it in my project ? I was thinking text has to be compressed even if there isn't much text, because using 8-bits per letter when only about 30 combinations exists (counting ' ', '!', '?' and '.') is a waste.

by on (#51791)
I was wondering why the file was so big! now I see it's becaue of all the books.

Anyway, nice work. The NES community is in need of more compression schemes, for text and for graphics.

by on (#51801)
Thanks for all the compliments! If you want to, go ahead and use the compressor to compress text files for your programs. Be sure to give me credit for the text compressor though. But I have to ask, how many of you have actually used it? It'll work with any file that only has bytes in the range of $00-$7F (that is, bit 7 for each byte = 0). So Bregalad you could test it out on the text that you have.

Here's how it works, for those of you who don't know Java (or are too lazy to read it :P)

The compressor loads the entire text file into RAM, and then searches through each pair of bytes in the file (the 1st and the 2nd, the 2nd and the 3rd, the 3rd and the 4th, so on and so on) until it finds the most common pair of bytes. It then adds the most common pair of bytes to a dictionary that says $80 = most common pair, and then replaces the most common pair of bytes in the file stored in RAM with $80. Then it goes through with this process, using the partially compressed text file in memory, 127 more times, so that $80 through $FF all have pairs of bytes associated with them. If you run the compressor, you can see this happening. Then the compressor outputs a .dte file by adding the 256-byte dictionary of dual tiles to the front of the file and then adding the compressed text after the dictionary.

The decompressor is much simpler. First it loads the dictionary. Then it goes through the file, byte by byte. If it finds a plaintext byte ($00-$7F) it just adds that to the output text file. If it finds a compressed byte ($80-$FF) it calls a routine which returns a variable length of bytes. And yes, the length of the bytes returned has to be variable. Why? Let's say the text file in question has a lot of instances of "the ". On the first round of compression, the most common pair of bytes is "e" and " " (space). The next round's most common pair is "t" and "h". Third round's most common pair is $81 (second round) and $80 (first). So let's say the file contains a byte equal to $82, the third round. It can't add $81 and $80 to the decompressed file! It has to call itself (recursion) until it returns only plaintext bytes.

So yeah, dual tile encoding is actually a dictionary compression scheme. So Secret of Evermore and Ninja Gaiden 3 might use a form of DTE.

I will continue working on this by the way. The next thing I'm going to add to the program is the ability to have a variable dictionary size depending on the file and not just a fixed 128-compressed-byte dictionary. It will allow you to set the maximum size of your dictionary, so if you have, say, 192 characters from $00-$BF, you can set it so that it only has 64 compressed bytes from $C0-$FF, or if you have 30 characters, you can have 226 compressed bytes. It will also stop adding compressed bytes if the most common pair of bytes in the file occurs <=2 times. Why <=2?

2 pairs of uncompressed bytes: $21 $20 $21 $20, 4 bytes total
Compressed: $21 $20 (in dictionary), $AB $AB (in compressed file), 4 bytes total

So there's no benefit to adding that to the dictionary.

After that, I might add the ability to Huffman encode the file after it's been dual tile encoded. Doing so should compress the average text file down to 35-40%, around ZIP compression, but it'll make decompression slower and more complicated.

by on (#51849)
Sounds like BPE (Byte Pair Encoding)
What does DTE stand for?

by on (#51851)
DTE = Dual Tile Encoding

by on (#55750)
New version released (v1.5)!

New features:
-The minimum and maximum dictionary bytes can be set by the user instead of being fixed at $80-$FF
-The compressor can skip compressing certain bytes that you want to omit
-You can set the maximum recursive depth you want your decompressor to use (read on for more info)
-The compressor can generate a table of pointers to the beginning of each message you have based on an end of string byte and an offset value
-Contains ASM code for a 6502 DTE decompressor

Download here.

How the maxDepth setting works:
In Byte Pair Encoding, dictionary bytes might have to recursively call previous entries. Doing so takes up space on the stack, not to mention time if your chosen processor is slow at going to subroutines and returning from them.
For this reason you might want to put a limit on how deep each dictionary byte can call subroutines.
IMPORTANT NOTE: Setting this value to 0 makes the encoder encode only plaintext bytes into the dictionary. A file compressed this way can be decompressed with a DTE (dual tile encoder) decoder. The advantages of DTE are that it's faster and requires less memory and stack space to decompress, however, it doesn't compress as well as BPE does.

How do different maxDepth settings affect compression? I did tests with the included readme file using different maxDepth settings and recorded how well it compressed. Here's what I got (maxDepth setting - compression):

0 - 64.5%
1 - 55.3%
2 - 53.1%
3 - 52.8%
4+- 52.7%

Also, if anyone here has a project that you could use this with, let me know.