In our project, we have successfully compressed an "uncompressible" file at roughly a 1:2 ratio using our new idea in data compression: incorporating the most frequently occurring patterns directly into the compression utility itself. By doing this we not only make downloading our compression utility an "investment", but we also home in on patterns that occur frequently across multiple files yet only once within any single file. Our program is intended as a recompressor of zip files, and its output will carry the extension .zip.cit or just .cit.

The Lempel-Ziv technique used in programs such as PKZIP was omitted from our program for a simple reason: compressed files have little or no redundancy. This is the same reason a zip file cannot be compressed over and over again to achieve more compression: there are simply no more repeated patterns for Lempel-Ziv to remove. Our program addresses that problem while incorporating Huffman encoding into its file format. What Huffman coding does for us is provide a compact way of recording the extra bookkeeping data (how many more bytes to process in compressed or uncompressed mode) in a tolerable number of bits. It takes 8 bits or fewer to represent the values 1 through 14, and the lower the number, the fewer bits required. Since mode changes are expected to be frequent, Huffman coding becomes invaluable for representing this information and became one of the most original solutions in the project.
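To make the run-length idea concrete, here is a minimal sketch in Python of how a Huffman code can be built for the run lengths 1 through 14. The frequencies are assumptions chosen only for illustration (shorter runs are assumed to be more common); the actual code table in our utility comes from its own statistics.

    import heapq

    # Assumed (not measured) frequencies: run length 1 is taken to be the most
    # common and run length 14 the rarest, so shorter runs get shorter codes.
    freqs = {run_length: 15 - run_length for run_length in range(1, 15)}

    def build_huffman_code(freqs):
        """Return a dict mapping each run length to its prefix-free bit string."""
        # Heap entries: (total weight, tie-breaker, {symbol: code so far}).
        heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(freqs.items())]
        heapq.heapify(heap)
        tie = len(heap)
        while len(heap) > 1:
            w1, _, left = heapq.heappop(heap)
            w2, _, right = heapq.heappop(heap)
            merged = {s: "0" + c for s, c in left.items()}
            merged.update({s: "1" + c for s, c in right.items()})
            heapq.heappush(heap, (w1 + w2, tie, merged))
            tie += 1
        return heap[0][2]

    code = build_huffman_code(freqs)
    for run_length in sorted(code):
        # With these assumed frequencies every code fits in 8 bits or fewer,
        # and the most frequent (shortest) runs get the shortest codes.
        print(run_length, code[run_length])

With these assumed frequencies no code exceeds 8 bits, which is the property the file format relies on when recording how many bytes remain in the current mode.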
To get our patterns, a program was developed to collect real-world patterns into a scoring file. This scoring file was then ranked and the top 256 patterns were kept. After that it was simply a matter of incorporating those patterns into the compression and decompression utility, and the project could then be tested to measure its success: how much better it compressed the file than standard techniques.
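The following is a minimal sketch of what such a scoring pass might look like. The pattern width (4 bytes here) and the layout written to score.dat are assumptions made for illustration; the actual scoring program defines its own width and format.

    from collections import Counter
    from pathlib import Path

    PATTERN_WIDTH = 4    # assumed pattern width; the real program may differ
    TOP_PATTERNS = 256   # the 256 highest-scoring patterns are kept

    def score_file(path, counter):
        """Slide a fixed-width window over one file, counting every byte pattern."""
        data = Path(path).read_bytes()
        for i in range(len(data) - PATTERN_WIDTH + 1):
            counter[data[i:i + PATTERN_WIDTH]] += 1

    def build_pattern_table(zip_paths, out_path="score.dat"):
        """Score a set of real-world zip files and keep the top-ranked patterns."""
        counter = Counter()
        for path in zip_paths:
            score_file(path, counter)
        ranked = counter.most_common(TOP_PATTERNS)
        with open(out_path, "wb") as out:
            for pattern, score in ranked:
                out.write(pattern)   # only the patterns themselves are needed later
        return ranked

Scoring many files is what makes the table representative; the ranking step itself is cheap once the counts exist.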
We have now run the scoring program on the supercomputer, although because of the length of time it took (about 3 days, since Pi was very busy), we only got to process our large file. Still, we proceeded to see whether our compression theory would work. Our project is based on finding the most frequently occurring patterns across all real-world zip files, but showing that the concept works on a single file is enough to validate the approach. As more files are processed, the accuracy of the patterns will improve for all files that this program may encounter. This is also one of the reasons for implementing versions. We ran the ranking program on score.dat and used the ranked results in the compression program. True to our expectation that the file would not be easily compressible, most of the scores were just 1. Still, once those patterns were used in compression, we were able to achieve about a 1:2 ratio on the file. Curious how Lempel-Ziv would fare on the file (although the answer seemed obvious), we ran PKZIP on it. Surprisingly, the file actually got bigger: it went from 2,630,037 bytes to 2,630,151 bytes.
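As a rough illustration of how the ranked patterns are then applied, here is a simplified substitution pass. It is a sketch under assumptions, not the actual .cit format: the real utility interleaves the Huffman-coded run lengths described earlier, whereas this sketch uses a hypothetical escape byte and can therefore only use the top 255 patterns.

    ESCAPE = 0xFF

    def recompress(data, patterns):
        """Replace occurrences of ranked patterns with two-byte references.

        ESCAPE followed by an index (0-254) marks a pattern reference;
        ESCAPE followed by 0xFF stands for a literal 0xFF byte.
        """
        index = {p: i for i, p in enumerate(patterns[:255])}
        width = len(patterns[0])
        out = bytearray()
        i = 0
        while i < len(data):
            chunk = bytes(data[i:i + width])
            if len(chunk) == width and chunk in index:
                out += bytes([ESCAPE, index[chunk]])   # pattern reference
                i += width
            elif data[i] == ESCAPE:
                out += bytes([ESCAPE, 0xFF])           # escaped literal 0xFF
                i += 1
            else:
                out.append(data[i])                    # ordinary literal byte
                i += 1
        return bytes(out)

Each match costs two output bytes, so patterns must be wider than two bytes, and matched often, before the substitution pays off; the achieved ratio is then simply the compressed size divided by the original size.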
From these results, we can conclude that our new compression concept does work. However, we cannot conclude that the same level of compression will be achieved on all files. The file tested was a .ZIP file containing 2 WAV files. Our program is intended for use as a "re-compressor" because of its nature of compressing "uncompressible" (totally random) byte patterns. Had we decided to incorporate Lempel-Ziv, the program would have been suitable for compressing more kinds of files, but then other file types would have had to be analyzed as well, along with the added complication to our compression utility's file format.