Competition Rules

General

  • Participation is free.
  • The submitted compressor may use any data-compression method, provided the decompressor can losslessly (bit-accurately) restore the original information using the compressed output data.
  • Each participant may submit several essentially different compressors. At our sole discretion we may decline some of them if they lack any essential differences.
  • Once we begin accepting submissions we run preliminary tests and publish interim leaderboards about every two weeks. Participants should use the open part of the test set (40% for each test) and our reported statistics to improve their submissions.
  • After the deadline for new and updated submissions, we calculate the final leaderboards and announce the winners eligible for awards.

Submissions

  • For each submission, a participant may decide to enter only a subset of the categories—for example, only rapid compression for all four tests or only the three categories for one particular test.
  • For each submission category, the participant must provide exactly one set of options under which the compressor meets that category’s speed requirement. So, participants may submit a compressor in each category, but with just one set of options apiece. If it fails to meet the speed requirement for a given category, we can help the participant select the appropriate options, but that participant is still responsible for providing acceptable options.
  • Participants may update each submission up to six times (for a total of up to seven test runs, including the initial submission), as long as the submissions meet the respective deadlines. Although we’ll attempt to test submissions beyond the sixth update, especially for the top competitors of interim leaderboards, we make no guarantees to that effect.
  • If a participant provides versions of the same compressor for different operating systems (Linux and Windows), we use the characteristics of the faster one.
  • After the deadline for new and updated submissions, each participant will receive the following information for cross-check before we announce results:
    • Binary code of all the compressors in the competition
    • All raw statistics (compressed-data size as well as compression and decompression times) for each compressor
    • The entire test set

Results and Ranking

  • We calculate statistics for the full test set, including both the open and closed portions of the data. Final ranks will be in accordance with the rules and formulas defined in “Ranking”.
  • For each submission we use the best-ranked variant (i.e., the variant with the highest average rank for all categories) to calculate the leaderboards. Our ranking calculations employ the statistics for this variant only; for example, even if another variant has a decompressor that can bit-accurately decode compressed data faster, we decline to consider it because we don’t mix and match compressors and decompressors from different variants.
  • To compute the full compressed-data size, we compress the decompressor using bzip2 v.1.0.8 with the “-9” setting and add the result to the compressed-data size. If the submitted compressor is just a single program, we use the total compressor size.
  • We may, at our sole discretion, fully or partially delete from the results information about certain compressors, but only in exceptional cases (e.g., critical compressor errors or lossy compression).
  • We may limit the number of entries in the leaderboards to show only Top-N compressors.
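
The full-size rule in this section can be sketched as follows. This is a minimal illustration, not the organizers' tooling: the name `full_size` is hypothetical, and Python's `bz2` module at `compresslevel=9` stands in for the bzip2 v.1.0.8 command-line tool with the “-9” setting, so the resulting byte count may differ slightly from the official one.

```python
import bz2


def full_size(compressed_data_size: int, decompressor_bytes: bytes) -> int:
    """Return the compressed-data size plus the bzip2-compressed decompressor size.

    Assumption: for a single-program submission, `decompressor_bytes` holds
    the whole program, since the rules then use the total compressor size.
    """
    # Stand-in for "bzip2 -9"; level 9 selects the largest block size.
    packed_decompressor = bz2.compress(decompressor_bytes, compresslevel=9)
    return compressed_data_size + len(packed_decompressor)
```

A smaller decompressor lowers the full size directly, which is why the rules recommend shipping the decompressor as a separate program.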

Awards

  • Huawei, our sponsor, will remit prizes directly to the first- and second-place competitors in each category, as the “Scope” section specifies. Depending on the winner’s jurisdiction and other circumstances, the company may deduct applicable taxes and fees from the prize sum.
  • Participants who place among the top three in any category will receive a formal certificate from Moscow State University and Huawei.
  • Members of the competition committee are ineligible for awards.

Disqualification

A submission is subject to disqualification for a given category if it exhibits any of the following:

  • Failure to meet the speed requirements of that category.
  • Use of GPU(s) or multithreading.
  • Failure to decompress all the compressed data without loss, including decompression on a computer different from the one used for compression.

Disqualification from the entire competition will occur for any of the following reasons:

  • Failure to pass antivirus tests.
  • Attempts to remotely access other computers, or other files on the storage device.
  • Failure to meet participant or compressor requirements—in particular, submission of a compressor that’s owned by another company or individual or that otherwise violates intellectual-property rights.

For reference we may, at our sole discretion, include in the final results any statistics for disqualified compressors.

Participant Requirements

  • Winners in each category agree to grant us permission to publish their binary code.
  • All participants shall allow us to share their binary code with other participants.
  • Each participant guarantees that the submitted compressor or compressors are that participant’s own work or, if they’re based on third-party source code, they differ essentially from that code and the participant has received all necessary permissions to modify and submit the code to this competition. Each participant must also guarantee that the submitted compressor or compressors violate no third-party intellectual-property rights.
  • By entering this competition, the participant agrees with all applicable rules and conditions.

Compressor Requirements

  • To participate in the non-block-compression categories (Tests 1, 2 and 3), the compressor should be a command-line standalone application with batch-processing support: all the necessary options and file names must be assignable from the command line. The software can be two separate programs (a set of files with one executable): the compressor for compression (encoding) only and the decompressor for decompression (decoding) only. We recommend this format because we add the decompressor size to the compressed-data size.
  • The total size of the package (the compressor and decompressor and all the files they use) must be less than 20 MiB (20,971,520 bytes).
  • To participate in the block-compression test (Test 4), the compressor must be a library (.dll/.so) with the following interface: {to be defined}
    The compressor may not take any parameters for Test 4; they must be hard coded in the library.
  • Allowed operating systems are Linux and Windows.
  • Multithreading is prohibited: we allow one thread only.
  • GPU use is prohibited.
  • We offer no guarantee that we’ll use exact versions of interpreted-language run-time environments, such as Java, Perl, or Python.
  • We don’t guarantee the exact amount of available physical memory at run time. Please check the test-hardware description before estimating your software’s maximum memory consumption.
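
Before submitting, the package-size limit above can be checked locally with something like the sketch below. It assumes the limit applies to the sum of the sizes of the unpacked files in one directory tree, which the rules do not spell out; the names `package_size` and `within_limit` are hypothetical.

```python
import os

MAX_PACKAGE_BYTES = 20 * 1024 * 1024  # 20 MiB = 20,971,520 bytes


def package_size(root: str) -> int:
    """Sum the sizes, in bytes, of all files under `root`."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            total += os.path.getsize(os.path.join(dirpath, name))
    return total


def within_limit(root: str) -> bool:
    """The rules require the package to be strictly less than 20 MiB."""
    return package_size(root) < MAX_PACKAGE_BYTES
```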

Submission Requirements

Send your submission and any updates to globalcompetition@compression.ru.

If your submission is bigger than 10 MB, please use a file-sharing service and provide a link to download it.

Your email should contain the following information in free form:

  • Author name
  • Preferred contact email address and, optionally, additional contact information
  • Compressor name
  • List of categories you wish to enter
  • URL for the compressor package, or an attachment containing the package

The package shall be an archive in a common format (e.g., .zip, .rar, .gz or .bz2) and shall contain executable files for the compressor (or, separately, for the compressor and decompressor). It must also contain a readme.txt file that describes the chosen options for each target category. This file should ideally include 18 command lines (9 for compression and 9 for decompression), because we may evaluate the package on all four test sets in an attempt to fit it into all 12 leaderboards, and because command lines are necessary for Tests 1-3.

Every submission must be able to encode and bit-accurately decode (decompress) all four test sets.
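
A participant can verify bit-accurate restoration locally by comparing the original file with the decompressor's output byte for byte. The sketch below is one way to do that; the function name `is_bit_accurate` is hypothetical, and the check itself is not part of the official test procedure.

```python
def is_bit_accurate(original_path: str, restored_path: str,
                    chunk_size: int = 1 << 20) -> bool:
    """Return True only if both files have identical length and content."""
    with open(original_path, "rb") as original, open(restored_path, "rb") as restored:
        while True:
            a = original.read(chunk_size)
            b = restored.read(chunk_size)
            if a != b:
                return False  # content differs or one file ended early
            if not a:
                return True   # both files ended at the same point
```

Reading in chunks keeps memory use low even for 1 GB test files.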

For subsequent submissions, you may provide only new or modified information in your email.

Test Hardware

CPU: Intel Socket 1151 Core i7-8700K (Coffee Lake, 3.7 GHz, 6C12T, 95 W TDP)
RAM: 2x16 GB (32 GB total) DIMM DDR4-2400 CL15
OS: Linux (Ubuntu 18.04), Windows 10
Storage: 32 GB (Optane)

Reference speed statistics for the test hardware running Windows (in seconds):

Compressor, Options                                Operation     enwik9     Test 1
Bzip2 v.1.0.8, -9                                  encoding      62.980     63.819
Bzip2 v.1.0.8, -9                                  decoding      25.289     26.633
ppmd.exe variant J, May 10 2006, -m256 -o8 -r0     encoding     118.898    130.028
ppmd.exe variant J, May 10 2006, -m256 -o8 -r0     decoding     122.008    132.782

Compressor, Options                                Operation     Test 2
BMF v.2.01                                         encoding      63.877
BMF v.2.01                                         decoding      44.442
BMF v.2.01, -s                                     encoding     842.271
BMF v.2.01, -s                                     decoding     729.310
QLIC Ver. 2.demo                                   encoding       7.063
QLIC Ver. 2.demo                                   decoding       6.464

Test-Set Description

All Available Test Data

Click on the link below to download the open part of the test set:

Test 1: Qualitative Data (text)

Test 1 uses a single file containing various appended English texts from Project Gutenberg. The file is encoded in UTF-8, but because the texts are English, it’s essentially ASCII. The total size is 1 GB, or 1,000,000,000 bytes.

Test 2: Quantitative Data (images)

Test 2 contains 100 images in 8-bit PNM format. The total size is 1,000,452,004 bytes, or about 1 GB. We compress data file-to-file.

Types of images:

Test 3: Mixed Data From Executable Files

Test 3 comes from an Ubuntu x64 distribution and x64 shared library files from Python packages for Linux. The test contains filtered content of executable files. Also, we filter the data to ensure the test contains a reasonably balanced mix of quantitative and qualitative data. The total size is 1 GB, or 1,000,000,000 bytes.

Test 4: Block Compression of Mixed Data

Test 4 employs a roughly 30-70 mixture of Test 1 and Test 3 data. The total size is 1,000,013,824 bytes, or approximately 1 GB.

Rationale

Qualitative-Data Compression

For the qualitative-data test, we chose natural-language text for the following reasons:

  • It’s probably the most common example of qualitative data. In practice, texts occur in many data types requiring compression: databases, executable files, XML files, HTML pages, text messages and so on.
  • Historically, natural language was a focus of R&D with regard to universal data compression. Many practical algorithms and implementations were tailored to process texts. Therefore, this area is relatively well researched, and creation of an adequate test is less problematic than for other types of qualitative data.

We use English-only text to minimize the chance of one participant outrunning the others not because that participant’s compression engine is superior, but because it employs a certain kind of data preprocessing and/or dictionaries for a specific language. Preprocessing and restructuring of English texts for better lossless compression is well known, so the competition will be more equitable.

Quantitative-Data Compression

Our decision to use images for the quantitative-data test follows reasoning similar to that for the qualitative-data test:

  • Images are a common quantitative-data type, constituting a large portion of all stored and transferred data. In many applications, images need lossless or near-lossless compression.
  • Lossless image compression is a well-covered field of science and software engineering, comprising numerous algorithms and solutions.

Mixed-Data Compression

For many tasks, the input data is a mixture of unknown types that may necessitate different processing approaches. Efficient compression requires quick and reliable identification of common data types and use of an appropriate algorithm or processing mode for compressing them.

In addition, executable files are an important and widespread data class.

Since we focus on universal approaches, we exclude compressed data because processing it efficiently may require specialized decompression and recompression. Also, we filter data to ensure that the test contains comparable estimated percentages of quantitative and qualitative data.

Test Size and Small-Block-Data Compression

The 1 GB data size allows for sufficient statistics or a dictionary to demonstrate the strength of a compression algorithm, and it permits a speed estimate by ensuring that the compression or decompression process far outweighs initialization. Furthermore, it permits all the input data, all the output data and the compression program to fit in the RAM of a typical computer.

For many compression tasks, however, the amount of data requiring compression at one time is limited, and quick random access to compressed blocks is necessary. A compression algorithm’s efficiency and implementation differ greatly between small blocks (files) and large blocks of many megabytes or more. Ranking for a 1 GB test set may therefore fail to predict compression characteristics for blocks of tens of kilobytes or less.

Speed Categories

Different applications favor different characteristics in a lossless data compressor. The significance and cost of the data size and processing speed may vary tremendously, but a heavyweight boxer shouldn’t be set against a lightweight one. We introduced an arbitrary division of “rapid compression,” “balanced compression” and “high compression ratio” to account for these variations.

Ranking

We believe a good rank should consider at least compression ratio, compression speed, and decompression speed. The dependencies in the ranking formula and the exact coefficients depend on the use case, however, so finding a universal formula is problematic. We therefore use a weight function only for the rapid-compression categories, where speed is especially important.

We believe that formulas based on the monetary cost associated with compression and decompression should be linear. In this comparison we assume the decompression speed is twice as important as compression speed because of I/O-operation statistics for typical databases. We arbitrarily selected the weight for the compressed-data size: 1,000,000 bytes of compressed data (0.1% of the input data size) costs as much as 1 second of encoding time or 0.5 seconds of decoding time.
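
The stated weights imply a linear cost of the shape sketched below. This only illustrates the arithmetic in the paragraph above; it is not the official formula from the “Ranking” rules, and the function and parameter names are hypothetical.

```python
def linear_cost(compressed_bytes: int, encode_seconds: float,
                decode_seconds: float) -> float:
    """Cost in units where 1 unit equals 1 second of encoding time.

    Per the stated weights, 1,000,000 bytes of compressed data costs as much
    as 1 second of encoding or 0.5 seconds of decoding, so decoding time
    carries twice the weight of encoding time.
    """
    size_term = compressed_bytes / 1_000_000
    return size_term + encode_seconds + 2.0 * decode_seconds
```

Under this sketch, saving 1,000,000 bytes is worth exactly as much as encoding 1 second faster or decoding 0.5 seconds faster.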

Not Carved in Stone

No test set or testing approach is ideal when considering a broad spectrum of use cases. The test data and method are subject to change and improvement for subsequent competitions. We are open to suggestions!

FAQ

Q1. May I use explicit or implicit data-preprocessing methods, particularly for texts?
A1. Yes. You may use WRT, starNT and other transforms.
Q2. Is there any restriction on the compiler?
A2. No.
Q3. May I store dictionaries and similar prior-knowledge data in the compressor?
A3. Yes, as long as your compressor still meets the size limit.
Q4. May I send two builds of my program, one for Windows and one for Linux, to see which one is faster?
A4. Yes: this year we will allow one build per operating system, but we count them either as two update submissions or as one initial submission and one update submission.