Making efficient neural machine translation available to everyone with OpenNMT

OpenNMT is an open-source ecosystem for neural machine translation started in 2016 by SYSTRAN and the Harvard NLP group. The project has been used in numerous research and industry applications, including SYSTRAN Translate and SYSTRAN Model Studio.

OpenNMT’s main goal is to make neural machine translation accessible to everyone. However, neural machine translation is notoriously expensive to run as MT models often require a lot of memory and compute power. Early in this project, SYSTRAN engineers focused on improving the efficiency of OpenNMT inference to reduce cost and improve productivity.

The computational challenge of neural machine translation

Neural machine translation models are usually based on the Transformer architecture which powers many recent advances in natural language processing. A common variant known as “big Transformer” contains about 300 million parameters that are tuned during a training phase. Since the parameters are stored using 32-bit floating-point numbers, the model alone takes at least 1.2 GB on disk and in memory.
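The arithmetic behind that figure is a simple product of parameter count and parameter width; a quick sanity check (the 300 million figure is taken from the text):

```python
# Back-of-the-envelope model size: parameters x bytes per parameter.
num_parameters = 300_000_000   # "big Transformer", per the text above
bytes_per_param_fp32 = 4       # a 32-bit float is 4 bytes

size_gb = num_parameters * bytes_per_param_fp32 / 1e9
print(f"FP32 model size: {size_gb:.1f} GB")  # 1.2 GB
```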

[Figure: the Transformer architecture]
The Transformer architecture consists of several neural network building blocks that are run sequentially. Adding more or bigger blocks increases the learning capacity of the model, but makes its execution slower.

Most of these parameters are used in matrix multiplications, which are expensive to run because their cost grows faster than linearly with the matrix dimensions. This operation can be accelerated by adding more compute power: increasing the number of CPU cores, using a faster CPU, or using a GPU instead. However, there are several other ways to improve performance:

  • Doing less: avoid computations by understanding and controlling the execution in detail.
  • Doing better: optimize computations with improved parallelization and hardware-specific instructions.
  • Doing differently: replace computations with faster approximations.

SYSTRAN engineers explored all these methods as part of the OpenNMT project and released CTranslate2, one of the most complete and efficient inference engines for neural machine translation models.

CTranslate2: the OpenNMT inference engine with state-of-the-art efficiency

Released in 2019, CTranslate2 is a fast and full-featured inference engine for neural machine translation models, specifically Transformer models. Here are the key features of the project:

Optimized C++ implementation for CPU and GPU
CTranslate2 is implemented from the ground up in C++ and includes many optimizations to accelerate Transformer models and reduce their memory usage: layer fusions, in-place transformations, batch layout optimizations, etc.
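Layer fusion and in-place transformations both mean doing several consecutive operations in one pass instead of materializing intermediate arrays. A toy illustration of the idea (not CTranslate2's actual C++ kernels), fusing a bias add and a ReLU:

```python
import numpy as np

def unfused(x, w, b):
    # Three separate steps, each producing a new intermediate array.
    y = x @ w
    y = y + b
    return np.maximum(y, 0.0)

def fused(x, w, b):
    # Bias add and ReLU applied in place on the matmul output,
    # so no intermediate array is ever materialized.
    out = x @ w
    np.add(out, b, out=out)        # in-place bias add
    np.maximum(out, 0.0, out=out)  # in-place ReLU
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
w = rng.standard_normal((8, 16))
b = rng.standard_normal(16)
assert np.allclose(unfused(x, w, b), fused(x, w, b))
```

Both paths compute the same result; the fused one simply allocates and traverses less memory.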

Parallel translations with a single model instance
The engine is designed to allow multiple translations to run in parallel while keeping a single version of the model in memory.
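The pattern can be sketched in Python: one shared, read-only model object served by a pool of worker threads. This is a deliberate simplification of how CTranslate2 schedules concurrent requests, with a dictionary standing in for the model:

```python
from concurrent.futures import ThreadPoolExecutor

class ToyModel:
    """Stands in for a single model instance loaded once in memory."""
    def __init__(self, table):
        self.table = table  # read-only after construction, so thread-safe

    def translate(self, sentence):
        return [self.table.get(token, token) for token in sentence]

model = ToyModel({"hello": "hallo", "world": "welt"})
requests = [["hello", "world"], ["world"], ["hello"]]

# Several translations run in parallel, all against the same model instance.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(model.translate, requests))

print(results)  # [['hallo', 'welt'], ['welt'], ['hallo']]
```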

Dynamic memory usage
The memory usage changes dynamically depending on the request size while still meeting performance requirements by reusing previous memory buffers.
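The buffer-reuse idea can be sketched as a scratch buffer that grows to the largest request seen so far and hands the same allocation back to smaller requests. This is a simplification of a real allocator, for illustration only:

```python
class ReusableBuffer:
    """Grow-only scratch buffer: smaller requests reuse the last allocation."""
    def __init__(self):
        self._buf = bytearray(0)
        self.allocations = 0  # how many times we actually (re)allocated

    def get(self, size):
        if size > len(self._buf):
            self._buf = bytearray(size)  # grow only for a bigger request
            self.allocations += 1
        return memoryview(self._buf)[:size]  # a view, not a new allocation

buf = ReusableBuffer()
buf.get(1024)  # first request: allocate
buf.get(256)   # smaller request: reuse
buf.get(2048)  # bigger request: grow
buf.get(1000)  # reuse again
print(buf.allocations)  # 2
```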

Model quantization and reduced precision
Converting the model parameters from 32-bit numbers to 16 bits or 8 bits is an effective strategy to reduce the model size and accelerate its execution. CTranslate2 can convert parameters to 16-bit floating points (FP16), 16-bit integers (INT16), and 8-bit integers (INT8).
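The general idea behind INT8 quantization can be shown in a few lines (CTranslate2's exact scheme may differ): map each weight to an 8-bit integer with a shared scale, shrinking storage 4x at the cost of a small rounding error:

```python
import numpy as np

def quantize_int8(w):
    # Symmetric per-tensor quantization: one scale for the whole tensor.
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).standard_normal((64, 64)).astype(np.float32)
q, scale = quantize_int8(w)

assert q.nbytes * 4 == w.nbytes  # 4x smaller than FP32
err = np.abs(dequantize(q, scale) - w).max()
print(f"max rounding error: {err:.4f}")
```

The rounding error stays below one quantization step, which is why translation quality is largely preserved in practice.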

Build once, run everywhere
One binary can contain optimized code paths for multiple CPU instruction set architectures (AVX, AVX2, AVX512) and CPU families (Intel, AMD) that are resolved at runtime.
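The dispatch logic amounts to probing the CPU once at startup and picking the most specialized code path it supports. A Python sketch of that selection (the real dispatch happens in C++ over compiled kernels; the feature names here are illustrative):

```python
# Kernels ordered from most to least specialized, standing in for
# AVX512/AVX2/generic code paths compiled into a single binary.
def kernel_avx512():  return "avx512 path"
def kernel_avx2():    return "avx2 path"
def kernel_generic(): return "generic path"

KERNELS = [
    ("avx512", kernel_avx512),
    ("avx2", kernel_avx2),
    ("generic", kernel_generic),  # always-supported fallback
]

def select_kernel(cpu_features):
    # Resolved once at runtime: the first kernel the CPU supports wins.
    for name, fn in KERNELS:
        if name == "generic" or name in cpu_features:
            return fn

run = select_kernel({"sse4", "avx", "avx2"})  # e.g. an older Intel/AMD CPU
print(run())  # avx2 path
```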

Interactive decoding methods
The library offers methods to complete a partial translation (autocompletion) and return alternative words at a specific position in the translation.
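Autocompletion amounts to forcing the decoder through a user-supplied target prefix and only then letting it choose freely. A toy greedy decoder over a hard-coded bigram table (not CTranslate2's API) shows the mechanism:

```python
# Toy bigram "model": next-token lookup, standing in for the real decoder.
NEXT = {"<s>": "the", "the": "dog", "dog": "ran", "ran": "</s>",
        "cat": "sat", "sat": "</s>"}

def greedy_decode(target_prefix=(), max_len=10):
    tokens = ["<s>", *target_prefix]  # force the user's partial translation
    while tokens[-1] != "</s>" and len(tokens) < max_len:
        tokens.append(NEXT[tokens[-1]])  # then continue greedily
    return tokens[1:-1]

print(greedy_decode())                              # ['the', 'dog', 'ran']
print(greedy_decode(target_prefix=["the", "cat"]))  # ['the', 'cat', 'sat']
```

Without a prefix the decoder picks its own continuation; with one, it completes the user's partial translation instead.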

Support for multiple training frameworks

In addition to Transformer models trained with OpenNMT, the engine supports models trained with Fairseq and Marian, two other popular open-source NMT toolkits. The project includes conversion scripts that transform these different models into a unified model representation. Developers can implement their own conversion script and benefit from the CTranslate2 runtime, provided their models follow a set of specifications.
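A converter's core job is to map each framework's parameter names onto one unified layout. A minimal sketch of that pattern, with made-up parameter names that do not reflect the real specification:

```python
# Hypothetical per-framework name maps -> a unified parameter name.
FAIRSEQ_MAP = {"encoder.layers.0.fc1.weight": "encoder/layer_0/ffn/w1"}
MARIAN_MAP  = {"encoder_l1_ffn_W1":           "encoder/layer_0/ffn/w1"}

def convert(state_dict, name_map):
    """Rename a framework checkpoint into the unified representation."""
    unified = {}
    for src_name, tensor in state_dict.items():
        if src_name in name_map:  # tensors unknown to the map are dropped
            unified[name_map[src_name]] = tensor
    return unified

fairseq_ckpt = {"encoder.layers.0.fc1.weight": [1.0, 2.0]}
marian_ckpt  = {"encoder_l1_ffn_W1":           [1.0, 2.0]}

# Two different checkpoints converge to the same unified representation,
# which is all the runtime ever needs to load.
assert convert(fairseq_ckpt, FAIRSEQ_MAP) == convert(marian_ckpt, MARIAN_MAP)
```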

State-of-the-art performance

Other open-source projects share this interest in efficiency. For example, Marian, an NMT toolkit developed by Microsoft and the University of Edinburgh, is also implemented in pure C++ and includes optimizations such as model quantization and reduced precision.

To benchmark the efficiency of each system, we used the OPUS-MT English-German model that was trained with Marian and converted it to CTranslate2. The performance is thus compared for a fixed translation quality (same model and decoding parameters). The results were aggregated over multiple runs to discard outliers.

| GPU | Words per second | Max. GPU memory | Max. CPU memory | BLEU on WMT14 |
|---|---|---|---|---|
| Marian 1.11 | 2833 | 2986 MB | 1713 MB | 27.9 |
| CTranslate2 2.13 | 4709 (+66%) | 816 MB (-73%) | 560 MB (-67%) | 27.9 |

Comparison of FP16 GPU translations. Executed with CUDA 11.2 on a g4dn.xlarge Amazon EC2 instance equipped with a NVIDIA® T4 GPU (driver version: 470.82.01).
| CPU | Words per second | Max. memory | BLEU on WMT14 |
|---|---|---|---|
| Marian 1.11 | 857 | 8169 MB | 27.3 |
| CTranslate2 2.13 | 1486 (+73%) | 746 MB (-91%) | 27.7 (+0.4) |

Comparison of INT8 CPU translations. Executed with 8 threads on a c5.metal Amazon EC2 instance equipped with an Intel® Xeon® Platinum 8275CL CPU.

On both GPU and CPU, CTranslate2 is faster and uses significantly less memory for the same quality. The 8-bit model quantization applied by CTranslate2 on CPU is also able to retain more quality according to the BLEU metric.

More results and the scripts to reproduce this benchmark are available on GitHub.

Making efficient neural machine translation available to everyone

All these features and optimizations are freely available in the open source repository. The library includes a Python wrapper which makes it easy for developers to get started. It can be installed like any other Python package with a single command:

pip install ctranslate2

New Python binaries are regularly released for several platforms: Linux, Linux ARM, Windows, and macOS.

At SYSTRAN, we use these optimizations in our products to improve scalability, response time, and resource usage. All translations running on SYSTRAN Translate are powered by the CTranslate2 inference engine.