Qualitative-Data Compression
For the qualitative-data test, we chose natural-language text for the following reasons:
- It's probably the most common example of qualitative data. In practice, text appears in many data types that require compression: databases, executable files, XML files, HTML pages, text messages and so on.
- Historically, natural language has been a focus of R&D on universal data compression, and many practical algorithms and implementations were tailored to processing text. This area is therefore relatively well researched, and constructing an adequate test is less problematic than for other types of qualitative data.
We use English-only text to minimize the chance that one participant outruns the others not because its compression engine is superior, but because it employs a certain kind of data preprocessing and/or dictionaries for a specific language. Techniques for preprocessing and restructuring English text for better lossless compression are well known, so the competition will be more equitable.
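To make this concrete, below is a minimal sketch of the kind of language-specific preprocessing referred to here: a few frequent English words are replaced with short codes before a general-purpose compressor runs. The word list, escape byte, and input file are illustrative assumptions, not any contestant's actual dictionary or pipeline.

```python
import zlib

# Sketch: substitute a few frequent English words with two-byte codes before
# general-purpose compression. The word list and escape byte are illustrative.
ESCAPE = 0x01
WORDS = [b" the ", b" and ", b" of ", b" to ", b" in "]

def preprocess(text: bytes) -> bytes:
    out = text.replace(bytes([ESCAPE]), bytes([ESCAPE, ESCAPE]))  # protect literal escape bytes
    for i, word in enumerate(WORDS):
        out = out.replace(word, bytes([ESCAPE, 0x10 + i]))
    return out

def undo_preprocess(data: bytes) -> bytes:
    out, i = bytearray(), 0
    while i < len(data):
        if data[i] == ESCAPE:
            nxt = data[i + 1]
            out += bytes([ESCAPE]) if nxt == ESCAPE else WORDS[nxt - 0x10]
            i += 2
        else:
            out.append(data[i])
            i += 1
    return bytes(out)

text = open("book.txt", "rb").read()              # any English text (hypothetical file)
assert undo_preprocess(preprocess(text)) == text  # the transform is lossless
print(len(zlib.compress(text, 9)), len(zlib.compress(preprocess(text), 9)))
```

Real text preprocessors go considerably further (word models, capitalization and punctuation transforms), which is exactly the kind of language-specific advantage the English-only rule is meant to neutralize.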
Quantitative-Data Compression
Our decision to use images for the quantitative-data test follows reasoning similar to that for the qualitative-data test:
- Images are a common quantitative-data type, constituting a large portion of all stored and transferred data. In many applications, images need lossless or near-lossless compression.
- Lossless image compression is a well-covered field of science and software engineering, comprising numerous algorithms and solutions.
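As a rough illustration of what lossless image coders exploit, the sketch below applies a simple left-neighbor prediction filter (similar in spirit to PNG's Sub filter) to 8-bit grayscale rows before a general-purpose compressor. The synthetic gradient image and the choice of zlib are assumptions made only for the example.

```python
import zlib

WIDTH, HEIGHT = 512, 512

# Synthetic 8-bit grayscale image standing in for real pixel data (illustrative).
pixels = bytes((x + y + (x * y) % 7) % 256 for y in range(HEIGHT) for x in range(WIDTH))

def delta_filter(img: bytes, width: int) -> bytes:
    """Replace each pixel with its difference from the left neighbor (mod 256)."""
    out = bytearray(len(img))
    for row in range(0, len(img), width):
        prev = 0
        for i in range(row, row + width):
            out[i] = (img[i] - prev) & 0xFF
            prev = img[i]
    return bytes(out)

# The prediction is exactly invertible (a running sum restores the pixels), so the
# scheme stays lossless; the residuals are usually much easier to compress.
print(len(zlib.compress(pixels, 9)), len(zlib.compress(delta_filter(pixels, WIDTH), 9)))
```

Near-lossless modes follow the same pattern but allow the residuals to be quantized within a small error bound before coding.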
Mixed-Data Compression
For many tasks, the input data is a mixture of unknown types that may necessitate different processing approaches. Efficient compression requires quick and reliable identification of common data types and use of an appropriate algorithm or processing mode for compressing them.
In addition, executable files are an important and widespread data class.
Since we focus on universal approaches, we exclude compressed data because processing it efficiently may require specialized decompression and recompression. We also filter the data so that the test contains roughly comparable proportions of quantitative and qualitative data.
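A rough sketch of the type identification and dispatch described above might look as follows; the byte-statistics heuristic, the two-way text/binary split, and the choice of bz2 and zlib as stand-in back ends are simplifying assumptions, not the contest's actual classifier.

```python
import bz2
import zlib

def looks_like_text(block: bytes) -> bool:
    """Crude heuristic: a block that is mostly printable ASCII is treated as text."""
    if not block:
        return False
    printable = sum(1 for b in block if 32 <= b < 127 or b in (9, 10, 13))
    return printable / len(block) > 0.95

def compress_block(block: bytes) -> tuple[str, bytes]:
    """Pick a processing mode for each block based on its detected type."""
    if looks_like_text(block):
        return "text", bz2.compress(block)     # stand-in for a text-oriented mode
    return "binary", zlib.compress(block, 9)   # stand-in for a generic binary mode

# Usage: classify and compress fixed-size chunks of a mixed input stream.
mixed = b"<html><body>Hello, world!</body></html>" * 200 + bytes(range(256)) * 200
for offset in range(0, len(mixed), 4096):
    kind, packed = compress_block(mixed[offset:offset + 4096])
    print(offset, kind, len(packed))
```

A production-quality detector would also recognize already-compressed data and executables, which is why those classes receive special treatment in the test set.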
Test Size and Small-Block-Data Compression
The 1 GB data size is large enough for a compression algorithm to accumulate sufficient statistics or a dictionary and demonstrate its strength, and it permits a speed estimate by ensuring that the compression or decompression work far outweighs initialization. Furthermore, it lets all the input data, all the output data and the compression program fit in the RAM of a typical computer.
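For instance, at this scale even a relatively expensive one-time setup barely affects the measured throughput. A minimal timing sketch, using zlib as a stand-in codec and synthetic input, and assuming a few gigabytes of free RAM:

```python
import time
import zlib

SIZE = 10**9                           # 1 GB of synthetic input (illustrative)
data = bytes(1024) * (SIZE // 1024)    # placeholder data; a real test would use the corpus

start = time.perf_counter()
codec = zlib.compressobj(level=6)      # any one-time initialization cost...
out = codec.compress(data) + codec.flush()
elapsed = time.perf_counter() - start  # ...is dwarfed by a gigabyte of actual work

print(f"{SIZE / elapsed / 1e6:.0f} MB/s, compressed to {len(out)} bytes")
```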
For many compression tasks, the amount of data requiring compression is limited, and quick random access to compressed blocks is necessary. A compression algorithm's efficiency, and its implementation, differ greatly between small blocks (files) and large blocks of many megabytes or more. A ranking on a 1 GB test set may therefore fail to predict compression characteristics for blocks of tens of kilobytes or less.
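The sketch below shows the trade-off: the same data compressed as one stream and as independent 32 KB blocks with an offset index. The block size, the zlib codec, and the input file are assumptions for the example; the index is what buys random access, usually at some cost in compression ratio.

```python
import zlib

BLOCK = 32 * 1024                          # illustrative small-block size
data = open("corpus.bin", "rb").read()     # any test file (hypothetical path)

# One large stream: the best ratio, but reading byte N means decompressing
# everything that precedes it.
whole = zlib.compress(data, 9)

# Independent small blocks plus an offset index: any block can be decompressed
# on its own.
blocks, offsets, pos = [], [], 0
for off in range(0, len(data), BLOCK):
    c = zlib.compress(data[off:off + BLOCK], 9)
    offsets.append(pos)
    blocks.append(c)
    pos += len(c)

def read_block(n: int) -> bytes:
    """Random access: decompress only the n-th block."""
    return zlib.decompress(blocks[n])

print(f"whole stream: {len(whole)} bytes; {len(blocks)} blocks: {pos} bytes")
```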
Speed Categories
Different applications favor different characteristics in a lossless data compressor. The significance and cost of the data size and processing speed may vary tremendously, but a heavyweight boxer shouldn't be set against a lightweight one. We introduced an arbitrary division into "rapid compression," "balanced compression" and "high compression ratio" to account for these variations.
Not Carved in Stone
No test set or testing approach is ideal when considering a broad spectrum of use cases. The test data and method are subject to change and improvement for subsequent competitions. We are open to suggestions!