- Participation is free.
- The submitted compressor may use any data-compression method, provided the decompressor can losslessly (bit-accurately) restore the original information using the compressed output data.
- Each participant may submit several essentially different compressors. At our sole discretion we may decline some of them if they lack any essential differences.
- Once we begin accepting submissions we run preliminary tests and publish interim leaderboards about every two weeks. Participants should use the open part of the test set (40% for each test) and our reported statistics to improve their submissions.
- After the deadline for new and updated submissions, we calculate the final leaderboards and announce the winners eligible for awards.
- For each submission, a participant may decide to enter only a subset of the categories—for example, only rapid compression for all four tests or only the three categories for one particular test.
- For each submission category, the participant must provide exactly one set of options under which the compressor meets that category’s speed requirement. Participants may therefore enter a compressor in every category, but with only one set of options per category. If the compressor fails to meet the speed requirement for a given category, we can help the participant select appropriate options, but the participant remains responsible for providing acceptable options.
- Participants may update each submission up to six times (for a total of up to seven test runs, including the initial submission), as long as the submissions meet the respective deadlines. Although we’ll attempt to test submissions beyond the sixth update, especially for the top competitors of interim leaderboards, we make no guarantees to that effect.
- If a participant provides versions of the same compressor for different operating systems (Linux and Windows), we use the characteristics of the faster one.
- After the deadline for new and updated submissions, each participant will receive the following information for cross-check before we announce results:
- Binary code of all the compressors in the competition
- All raw statistics (compressed-data size as well as compression and decompression times) for each compressor
- The entire test set
Results and Ranking
- We calculate statistics for the full test set, including both the open and closed portions of the data. Final ranks will be in accordance with the rules and formulas defined in “Ranking”.
- For each submission we use the best-ranked variant (i.e., the variant with the highest average rank for all categories) to calculate the leaderboards. Our ranking calculations employ the statistics for this variant only; for example, even if another variant has a decompressor that can bit-accurately decode compressed data faster, we decline to consider it because we don’t mix and match compressors and decompressors from different variants.
- To compute the compressed-data full size, we compress the decompressor using bzip2 v.1.0.8 with the “-9” setting. If the submitted compressor is just a single program, we use the total compressor size.
- We may, at our sole discretion, fully or partially delete from the results information about certain compressors, but only in exceptional cases (e.g., critical compressor errors or lossy compression).
- We may limit the number of entries in the leaderboards to show only Top-N compressors.
- Huawei, our sponsor, will remit prizes directly to the first- and second-place competitors in each category, as the “Scope” section specifies. Depending on the winner’s jurisdiction and other circumstances, the company may deduct applicable taxes and fees from the prize sum.
- Participants who place among the top three in any category will receive a formal certificate from Moscow State University and Huawei.
- Members of the competition committee are ineligible for awards.
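The full-size rule above (compressed data plus the bzip2-compressed decompressor) can be sketched as follows. This is a minimal illustration, not the organizers' tooling: the function name `full_size` is hypothetical, and Python's `bz2` module wraps whichever libbz2 version is installed, which may differ from v.1.0.8 and could yield slightly different sizes.

```python
import bz2


def full_size(compressed_data_size: int, decompressor_bytes: bytes) -> int:
    """Return compressed-data size plus the size of the bzip2-packed decompressor.

    Hypothetical helper: for a single-program submission, pass the bytes of
    the whole compressor instead of a separate decompressor.
    """
    # compresslevel=9 corresponds to the "-9" setting of the bzip2 command line.
    packed_decompressor = bz2.compress(decompressor_bytes, compresslevel=9)
    return compressed_data_size + len(packed_decompressor)


if __name__ == "__main__":
    # Stand-in bytes; a real check would read the decompressor binary from disk.
    fake_decompressor = b"\x7fELF" + bytes(4096)
    print(full_size(250_000_000, fake_decompressor))
```

Under this rule a smaller decompressor directly improves the reported size, which is why shipping the decompressor as a separate program can help.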
A submission is subject to disqualification for a given category if it exhibits any of the following:
- Failure to meet the speed requirements of that category.
- Use of GPU(s) or multithreading.
- Failure to decompress all the compressed data without loss, including decompression on a computer different from the one used for compression.
Disqualification from the entire competition will occur for any of the following reasons:
- Failure to pass antivirus tests.
- Attempts to access other computers remotely or to access other files on the storage device.
- Failure to meet participant or compressor requirements—in particular, submission of a compressor that’s owned by another company or individual or that otherwise violates intellectual-property rights.
For reference we may, at our sole discretion, include in the final results any statistics for disqualified compressors.
- Winners in each category agree to grant us permission to publish their binary code.
- All participants shall allow us to share their binary code with other participants.
- Each participant guarantees that the submitted compressor or compressors are that participant’s own work or, if they’re based on third-party source code, they differ essentially from that code and the participant has received all necessary permissions to modify and submit the code to this competition. Each participant must also guarantee that the submitted compressor or compressors violate no third-party intellectual-property rights.
- By entering this competition, the participant agrees with all applicable rules and conditions.
- To participate in the non-block-compression categories (Tests 1, 2 and 3), the compressor should be a command-line standalone application with batch-processing support: all the necessary options and file names must be assignable from the command line. The software can be two separate programs (a set of files with one executable): the compressor for compression (encoding) only and the decompressor for decompression (decoding) only. We recommend this format because we add the decompressor size to the compressed-data size.
- The total size of the package (the compressor and decompressor and all the files they use) must be less than 20 MiB (20,971,520 bytes).
- To participate in the block-compression test (Test 4), the compressor must be a library (.so) with the following interface. The compressor may not take any parameters for Test 4; they must be hard-coded in the library.
- Allowed operating systems are Linux and Windows.
- Multithreading is prohibited: we allow one thread only.
- GPU use is prohibited.
- We offer no guarantee that we’ll use exact versions of interpreted-language run-time environments, such as Java, Perl, or Python.
- We don’t guarantee the exact amount of available physical memory at run time. Please check the test-hardware description before estimating your software’s maximum memory consumption.
- Every compressor must be able to encode and bit-accurately decode (decompress) all four test sets.
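Before submitting, a participant might check two of the requirements above locally: the package size limit and bit-accurate restoration. The sketch below is one possible self-check, assuming the package is an unpacked directory and that the original and restored files are both on disk; all function names are hypothetical and this is not the organizers' verification procedure.

```python
import hashlib
from pathlib import Path

MAX_PACKAGE_BYTES = 20 * 1024 * 1024  # 20 MiB = 20,971,520 bytes


def package_size(root: Path) -> int:
    """Total size in bytes of every regular file under the package directory."""
    return sum(p.stat().st_size for p in root.rglob("*") if p.is_file())


def fits_size_limit(root: Path) -> bool:
    """The rules require the total to be strictly less than 20 MiB."""
    return package_size(root) < MAX_PACKAGE_BYTES


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so that 1 GB inputs need not be held in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_bit_accurate(original: Path, restored: Path) -> bool:
    """True when the restored file has the same size and SHA-256 as the original."""
    return (
        original.stat().st_size == restored.stat().st_size
        and sha256_of(original) == sha256_of(restored)
    )
```

Running the round-trip check on a second machine as well is worthwhile, because a decoder that depends on platform-specific floating-point behavior can pass locally and still fail elsewhere.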
Send your submission and any updates to firstname.lastname@example.org.
If your submission is bigger than 10 MB, please use a file-sharing service and provide a link to download it.
Your email should contain the following information in free form:
- Author name
- Preferred contact email address and, optionally, additional contact information
- Compressor name
- List of categories you wish to enter
- URL for the compressor package, or an attachment containing the package
The package shall be an archive in a common format (e.g., .bz2). For Tests 1-3 it shall contain the executable files for the compressor (or, separately, for the compressor and decompressor); for Test 4 it shall contain the library. A submission for Tests 1-3 must also contain a readme.txt file that describes the chosen options for each target category. This file should include up to 18 command lines (9 for compression and 9 for decompression), because we may try to fit the package into all 9 leaderboards for Tests 1-3.
For subsequent submissions, you may provide only new or modified information in your email.
|CPU|Intel Socket 1151 Core i7-8700K (Coffee Lake, 3.7 GHz, 6C12T, 95 W TDP)|
|RAM|2x16 GB (32 GB total) DIMM DDR4-2400 CL15|
|OS|Linux (Ubuntu 18.04), Windows 10|
All Available Test Data
Click on the link below to download the open part of the test set:
Test 1: Qualitative Data (text)
Test 1 uses a single file containing various English texts from Project Gutenberg, appended to one another and encoded in UTF-8; because the texts are English, the content is essentially ASCII. The total size is 1 GB, or 1,000,000,000 bytes.
Test 2: Quantitative Data (images)
Test 2 contains 100 images in 8-bit PNM format. The total size is 1,000,452,004 bytes, or about 1 GB. We compress data file-to-file.
Types of images:
- Photographic. These images come from RAW files and constitute about 75% of the test set by volume. We acquired the source images, with permission, from https://www.dpreview.com/.
- Astronomical. We generated these images from FITS files provided by The Sloan Digital Sky Survey; a few images are from NASA’s Photojournal.
Test 3: Mixed Data From Executable Files
Test 3 comes from an Ubuntu x64 distribution and x64 shared library files from Python packages for Linux. The test contains filtered content of executable files. Also, we filter the data to ensure the test contains a reasonably balanced mix of quantitative and qualitative data. The total size is 1 GB, or 1,000,000,000 bytes.
Test 4: Block Compression of Mixed Data
Test 4 employs a roughly 30-70 mixture of Test 1 and Test 3 data. The total size is 1,000,013,824 bytes, or approximately 1 GB.
For the qualitative-data test, we chose natural-language text for the following reasons:
- It’s probably the most common example of qualitative data. In practice, texts occur in many data types requiring compression: databases, executable files, XML files, HTML pages, text messages and so on.
- Historically, natural language was a focus of R&D with regard to universal data compression. Many practical algorithms and implementations were tailored to process texts. Therefore, this area is relatively well researched, and creation of an adequate test is less problematic than for other types of qualitative data.
We use English-only text to minimize the chance that one participant outperforms the others not through a superior compression engine but through data preprocessing and/or dictionaries tailored to a specific language. Techniques for preprocessing and restructuring English texts for better lossless compression are well known, so restricting the test to English makes the competition more equitable.
Our decision to use images for the quantitative-data test follows reasoning similar to that for the qualitative-data test:
- Images are a common quantitative-data type, constituting a large portion of all stored and transferred data. In many applications, images need lossless or near-lossless compression.
- Lossless image compression is a well-covered field of science and software engineering, comprising numerous algorithms and solutions.
For many tasks, the input data is a mixture of unknown types that may necessitate different processing approaches. Efficient compression requires quick and reliable identification of common data types and use of an appropriate algorithm or processing mode for compressing them.
In addition, executable files are an important and widespread data class.
Since we focus on universal approaches, we exclude compressed data because processing it efficiently may require specialized decompression and recompression. Also, we filter data to ensure that the test contains comparable estimated percentages of quantitative and qualitative data.
Test Size and Small-Block-Data Compression
The 1 GB data size allows for sufficient statistics or a dictionary to demonstrate the strength of a compression algorithm, and it permits a speed estimate by ensuring that the compression or decompression process far outweighs initialization. Furthermore, it permits all the input data, all the output data and the compression program to fit in the RAM of a typical computer.
For many compression tasks, the amount of data requiring compression is limited. Quick random access to compressed blocks may also be necessary. A compression algorithm’s efficiency and implementation differ greatly between small blocks (files) and large blocks of many megabytes or more. Rankings for a 1 GB test set may fail to predict compression characteristics for blocks of tens of kilobytes or less.
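The gap between whole-file and small-block compression can be observed with any general-purpose compressor. The following sketch uses Python's `zlib` purely as a stand-in (it is not a competition entry), on synthetic text built from a hypothetical 200-word vocabulary; the 4 KiB block size is an arbitrary choice for illustration.

```python
import random
import zlib


def make_text(n_words: int, seed: int = 0) -> bytes:
    """Deterministic synthetic text drawn from a small, made-up vocabulary."""
    rng = random.Random(seed)
    vocab = [("w%03d" % i).encode() for i in range(200)]
    return b" ".join(rng.choice(vocab) for _ in range(n_words))


def compressed_size_whole(data: bytes) -> int:
    """Size when the entire input is compressed as a single stream."""
    return len(zlib.compress(data, 9))


def compressed_size_blocks(data: bytes, block_size: int) -> int:
    """Total size when each block is compressed independently.

    Independent blocks allow random access, but every block starts with no
    history and carries its own stream overhead.
    """
    total = 0
    for start in range(0, len(data), block_size):
        block = data[start:start + block_size]
        packed = zlib.compress(block, 9)
        assert zlib.decompress(packed) == block  # each block must round-trip
        total += len(packed)
    return total


data = make_text(100_000)
whole = compressed_size_whole(data)
blocks = compressed_size_blocks(data, 4096)
print(f"whole: {whole} bytes, 4 KiB blocks: {blocks} bytes")
```

On this input the block-wise total comes out larger than the single-stream size, because each block must relearn the data's statistics from scratch.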
Different applications favor different characteristics in a lossless data compressor. The significance and cost of the data size and processing speed may vary tremendously, but a heavyweight boxer shouldn’t be set against a lightweight one. We introduced an arbitrary division of “rapid compression,” “balanced compression” and “high compression ratio” to account for these variations.
We believe a good ranking should consider at least compression ratio, compression speed, and decompression speed. However, the dependencies in the ranking formula and the exact coefficients depend on the use case, so finding a universal formula is problematic. We therefore use a weight function only for the rapid-compression categories, where speed is especially important.
We believe that formulas based on the monetary cost associated with compression and decompression should be linear. In this comparison we assume the decompression speed is twice as important as compression speed because of I/O-operation statistics for typical databases. We arbitrarily selected the weight for the compressed-data size: 1,000,000 bytes of compressed data (0.1% of the input data size) costs as much as 1 second of encoding time or 0.5 seconds of decoding time.
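The equivalences stated above imply a linear cost of roughly the following shape. This is only a sketch of those stated weights, expressed in "encode-second equivalents"; the function and constant names are hypothetical, and the exact ranking formula is the one defined in the “Ranking” section, which is not reproduced here.

```python
BYTES_PER_ENCODE_SECOND = 1_000_000  # 1,000,000 compressed bytes cost as much as 1 s of encoding
DECODE_WEIGHT = 2.0                  # 0.5 s of decoding costs as much as 1 s of encoding


def linear_cost(compressed_bytes: int, encode_seconds: float, decode_seconds: float) -> float:
    """Linear cost in encode-second equivalents; lower is better."""
    return (
        compressed_bytes / BYTES_PER_ENCODE_SECOND
        + encode_seconds
        + DECODE_WEIGHT * decode_seconds
    )


# Each of these three inputs costs the same under the stated weights.
print(linear_cost(1_000_000, 0.0, 0.0))  # 1.0
print(linear_cost(0, 1.0, 0.0))          # 1.0
print(linear_cost(0, 0.0, 0.5))          # 1.0
```

With weights like these, saving 1,000,000 bytes only pays off if it adds less than 1 second of encoding time or less than 0.5 seconds of decoding time.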
Not Carved in Stone
No test set or testing approach is ideal when considering a broad spectrum of use cases. The test data and method are subject to change and improvement for subsequent competitions. We are open to suggestions!
- Q1. May I use explicit or implicit data-preprocessing methods, particularly for texts?
- A1. Yes. You may use WRT, starNT and other transforms.
- Q2. Is there any restriction on the compiler?
- A2. No.
- Q3. May I store dictionaries and similar prior-knowledge data in the compressor?
- A3. Yes, as long as your compressor still meets the size limit.
- Q4. May I send two builds of my program, one for Windows and one for Linux, to see which one is faster?
- A4. Yes: this year we allow one build per operating system, but we count the two builds either as two update submissions or as one initial and one update submission.