Knowledge distillation has recently been applied successfully to neural machine translation. It allows building smaller networks that retain most of the quality of the original model. Although many authors report on the benefits of knowledge distillation, few works discuss why it works, especially in the context of neural MT. In this paper, we conduct several experiments aimed at understanding why and how distillation impacts accuracy on an English-German translation task. We show that translation complexity is reduced in a distilled/synthesized bi-text compared to the reference bi-text. We further remove noisy data from the synthesized translations and merge the filtered synthesized data with the original reference, achieving additional accuracy gains.
15th International Workshop on Spoken Language Translation, October 29-30 2018, Bruges, Belgium
Corpus-based approaches to machine translation rely on the availability of clean parallel corpora. Such resources are scarce, and because of the automatic processes involved in their preparation, they are often noisy. This paper describes an unsupervised method for detecting translation divergences in parallel sentences. We rely on a neural network that computes cross-lingual sentence similarity scores, which are then used to effectively filter out divergent translations. Furthermore, similarity scores predicted by the network are used to identify and fix some partial divergences, yielding additional parallel segments. We evaluate these methods for English-French and English-German machine translation tasks, and show that using filtered/corrected corpora actually improves MT performance.
2018 Conference on Empirical Methods in Natural Language Processing, October 31 – November 4 2018, Brussels, Belgium
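The filtering step described in the abstract above can be illustrated with a minimal sketch. It assumes the similarity scores have already been computed by some model; the function name `filter_by_similarity`, the score range, and the default threshold of 0.5 are hypothetical choices for illustration and are not taken from the paper.

```python
from typing import List, Sequence, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def filter_by_similarity(
    pairs: Sequence[Pair],
    scores: Sequence[float],
    threshold: float = 0.5,  # hypothetical cut-off, not a value from the paper
) -> List[Pair]:
    """Keep only the sentence pairs whose similarity score reaches the threshold."""
    if len(pairs) != len(scores):
        raise ValueError("pairs and scores must have the same length")
    return [pair for pair, score in zip(pairs, scores) if score >= threshold]


# Illustrative data: the scores are made up, not produced by a real model.
corpus = [("the cat sleeps", "le chat dort"), ("good morning", "il pleut")]
kept = filter_by_similarity(corpus, [0.9, 0.1])
```

In practice the threshold would be tuned on held-out data rather than fixed in advance.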
This paper describes the participation of SYSTRAN in the shared task on parallel corpus filtering at the Third Conference on Machine Translation (WMT 2018). We participate for the first time using a neural sentence similarity classifier which aims at predicting the relatedness of sentence pairs in a multilingual context. The paper describes the main characteristics of our approach and discusses the results obtained on the data sets published for the shared task.
Third Conference on Machine Translation (WMT18), October 31 - November 1 2018, Brussels, Belgium
Neural Machine Translation: let's go back to the origins
Each of us has experienced or heard of deep learning in day-to-day business applications. What are the fundamentals of this new technology and what new opportunities does it offer?
Jan 31, 2017
Abstract: This paper describes the joint submission by RWTH Aachen University and SYSTRAN in the Chinese-English Patent Machine Translation Task at the 10th NTCIR Workshop. We specify the statistical systems developed by RWTH Aachen University and the hybrid machine translation systems developed by SYSTRAN. We apply RWTH Aachen’s combination techniques to create consensus hypotheses from very different systems: phrase-based and hierarchical SMT, rule-based MT (RBMT) and MT with statistical post-editing (SPE). The system combination was ranked second in BLEU and second in the human adequacy evaluation in this competition.
June 18-21, 2013, Tokyo, Japan
Abstract: This report describes SYSTRAN’s Chinese-English and English-Chinese machine translation systems that participated in the CWMT 2011 machine translation evaluation tasks. The base systems are SYSTRAN rule-based machine translation systems, augmented with various statistical techniques. Based on the translations of the rule-based systems, we performed statistical post-editing with the provided bilingual and monolingual training corpora. In this report, we describe the technology behind the systems, the training data, and finally the evaluation results in the CWMT 2011 evaluation. Our primary Chinese-English system was ranked first in BLEU in the translation tasks.
Proceedings of the 7th China Workshop on Machine Translation (CWMT), September 2011.
Abstract: We present two methods that merge ideas from statistical machine translation (SMT) and translation memories (TM). We use a TM to retrieve matches for source segments, and replace the mismatched parts with instructions to an SMT system to fill in the gap. We show that for fuzzy matches of over 70%, one method outperforms both SMT and TM baselines.
JEC, November 2010.
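As a rough illustration of the fuzzy-match idea in the abstract above, the sketch below scores a new source segment against a TM source segment and lists the mismatched spans that would be handed to an SMT system. It uses the similarity ratio from Python's standard `difflib` module as a stand-in; the paper's actual fuzzy-match metric is not given here, and the function name `fuzzy_match` is hypothetical.

```python
from difflib import SequenceMatcher
from typing import List, Tuple

Gap = Tuple[int, int, List[str]]  # (TM start, TM end, new source tokens)


def fuzzy_match(source: str, tm_source: str) -> Tuple[float, List[Gap]]:
    """Return a token-level similarity ratio and the mismatched spans.

    Each gap gives the token range in the TM segment that does not match,
    together with the tokens of the new source that should replace it.
    """
    src_tokens = source.split()
    tm_tokens = tm_source.split()
    matcher = SequenceMatcher(a=tm_tokens, b=src_tokens, autojunk=False)
    gaps = [
        (i1, i2, src_tokens[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]
    return matcher.ratio(), gaps


score, gaps = fuzzy_match("the cat sat on the mat", "the cat sat on the rug")
use_tm = score > 0.70  # the 70% figure comes from the abstract
```

Here five of the six tokens match, so the ratio is 2 * 5 / 12, and the single gap covers the last token of the TM segment.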
Abstract: We present a novel exact solution to the approximate string matching problem in the context of translation memories, where a text segment has to be matched against a large corpus, while allowing for errors. We use suffix arrays to detect exact n-gram matches, A* search heuristics to discard matches and A* parsing to validate candidate segments. The method outperforms the canonical baseline by a factor of 100, with average lookup times of 4.3–247ms for a segment in a realistic scenario.
AMTA, October 2010.
Abstract: This report describes both of SYSTRAN's Chinese-English and English-Chinese machine translation systems that participated in the CWMT2009 machine translation evaluation tasks. The base systems are SYSTRAN rule-based machine translation systems, augmented with various statistical techniques. Based on the translations of the rule-based systems, we perform statistical post-editing with the provided bilingual and monolingual training corpora. In this report, we describe the technology behind the systems, the training data, and finally the evaluation results in the CWMT2009 evaluation. Our primary systems were top-ranked in the evaluation tasks.
November 2009, CWMT
Selective addition of corpus-extracted phrasal lexical rules to a rule-based machine translation system
Abstract: In this work, we show how an existing rule-based, general-purpose machine translation system may be improved and adapted automatically to a given domain, whenever parallel corpora are available. We perform this adaptation by extracting dictionary entries from the parallel data. From this initial set, the application of these rules is tested against the baseline performance. Rules are then pruned depending on sentence-level improvements and deteriorations, as evaluated by an automatic string-based metric. Experiments using the Europarl dataset show a 3% absolute improvement in BLEU over the original rule-based system.
MT Summit, August 2009.
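The pruning step in the abstract above can be sketched as follows. This is a simplified illustration under stated assumptions: each candidate rule is associated with per-sentence changes in an automatic metric relative to the baseline, and a rule is kept when it improves more sentences than it degrades. The exact criterion used in the paper is not given in the abstract, and the names `prune_rules` and `deltas` are hypothetical.

```python
from typing import Dict, List, Sequence


def prune_rules(deltas: Dict[str, Sequence[float]]) -> List[str]:
    """Keep rules that improve more sentences than they degrade.

    `deltas` maps a rule identifier to the per-sentence metric change
    (score with the rule minus baseline score) on the affected sentences.
    """
    kept = []
    for rule, changes in deltas.items():
        improved = sum(1 for change in changes if change > 0)
        worsened = sum(1 for change in changes if change < 0)
        if improved > worsened:  # assumed criterion, for illustration only
            kept.append(rule)
    return kept


# Illustrative values only, not results from the paper.
candidate_deltas = {
    "rule_a": [0.2, 0.1, -0.05],
    "rule_b": [-0.3, 0.0, -0.1],
}
selected = prune_rules(candidate_deltas)
```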
Abstract: We describe here the two Systran/University of Edinburgh submissions for WMT2009. They involve a statistical post-editing model with a particular handling of named entities (English to French and German to English) and the extraction of phrasal rules (English to French).
Abstract: This paper describes the development of several machine translation systems for the 2009 WMT shared task evaluation. We only consider the translation between French and English. We describe a statistical system based on the Moses decoder and a statistical post-editing system using SYSTRAN’s rule-based system. We also investigated techniques to automatically extract additional bilingual texts from comparable corpora.
Abstract: This paper describes an initial version of a general purpose French/English statistical machine translation system. The main features of this system are the open-source Moses decoder, the integration of a bilingual dictionary and a continuous space target language model. We analyze the performance of this system on the test data of the WMT'08 evaluation.
Abstract: This paper describes SYSTRAN's submissions for the shared task of the third Workshop on Statistical Machine Translation at ACL. Our main contribution is a French-English statistical model trained without any human-translated parallel corpus. Instead, we translated a monolingual corpus with the SYSTRAN rule-based translation engine to produce the parallel corpus. The results are provided herein, along with an error analysis.
Abstract: XSL Transformation stylesheets are usually used to transform a document described in an XML formalism into another XML formalism, to modify an XML document, or to publish content stored in an XML document to a publishing format (XSL-FO, (X)HTML...). SYSTRAN Translation Stylesheets (STS) use XSLT to drive and control the machine translation of XML documents (native XML document formats or XML representations, such as XLIFF, of other kinds of document formats).
Abstract: SYSTRAN started the design and the development of Arabic, Farsi and Urdu to English machine translation systems in July 2002. This paper describes the methodology and implementation adopted for dictionary building and morphological analysis. SYSTRAN's IntuitiveCoding® technology (ICT) facilitates the creation, update, and maintenance of Arabic, Farsi and Urdu lexical entries, and is more modular and less costly. ICT for Arabic, Farsi, and Urdu requires the implementation of stem-based lexical entries, the authentic scripts for each language, a statistical Arabic stem-guesser, and separate declarative modules for internal and external morphology.
MT Summit IX; September 22-26, 2003.
Abstract: Customization of Machine Translation (MT) is a prerequisite for corporations to adopt the technology. It is therefore important but nonetheless challenging. Ongoing implementation proves that XML is an excellent exchange device between MT modules that efficiently enables interaction between the user and the processes to reach highly granulated structure-based customization. Accomplished through an innovative approach called the SYSTRAN Translation Stylesheet, this method is coherent with the current evolution of the “authoring process”. As a natural progression, the next stage in the customization process is the integration of MT in a multilingual tool kit designed for the "authoring process".
MT Summit IX, September 22-26, 2003.
Abstract: The SYSTRAN Review Manager (SRM) is one of the components that comprise the SYSTRAN Linguistics Platform (SLP), a comprehensive enterprise solution for managing MT customization and localization projects. The SRM is a productivity tool used for the review, quality assessment and maintenance of linguistic resources combined with a SYSTRAN solution. The SRM is used in-house by SYSTRAN’s development team and is also licensed to corporate customers as it addresses leading linguistic challenges, such as terminology and homographs, which makes it a key component of the QA process. Extremely flexible, the SRM adapts to localization and MT customization projects from small to large-scale. Its Web-based interface and multi-user architecture enable a centralized and efficient work environment for local and geographically dispersed individual users and teams. Users segment a given corpus to fluidly review and evaluate translations, as well as identify the typology of errors. Corpus metrics, terminology extraction and detailed reporting capabilities facilitate prioritizing tasks, resulting in immediate focus on those issues that significantly impact MT quality. Data and statistics are tracked throughout the customization process and are always available for regression tests and overall project management. This environment is highly conducive to increased productivity and efficient QA in the MT customization effort.
MT Summit IX; September 22-26, 2003.
Abstract: Customizing a general-purpose MT system is an effective way to improve machine translation quality for specific usages. Building a user-specific dictionary is the first and most important step in the customization process. An intuitive dictionary-coding tool was developed and is now utilized to allow the user to build user dictionaries easily and intelligently. SYSTRAN's innovative and proprietary IntuitiveCoding® technology is the engine powering this tool. It comprises various components: massive linguistic resources, a morphological analyzer, a statistical guesser, a finite-state automaton, and a context-free grammar. Methodologically, IntuitiveCoding® is also a cross-application approach for high quality dictionary building in terminology import and exchange. This paper describes the various components and the issues involved in its implementation. An evaluation frame and utilization of the technology are also presented. Plans for further advancing this technology are outlined.
MT Summit IX; September 22-26, 2003.
Abstract: SYSTRAN's SLP (SYSTRAN Linguistics Platform) is a comprehensive enterprise solution for managing a full range of translation and localization project tasks. The SLP consists of the SYSTRAN machine translation (MT) technology, linguistic resources and tools for project management, corpus analysis and quality evaluation. The underlying platform that supports the SLP is the SYSTRAN WebServer, a client/server application that can be accessed transparently through most common software applications. It supports document formats including HTML, RTF, XML, and SGML. The SYSTRAN WebServer is hosted at the customer’s site and can be integrated with internal translation workflow systems. The SYSTRAN WebServer is a robust and high-volume platform that can support an unlimited number of users, and millions of translation jobs per day.
Abstract: In this article we present the concept of "implicit transfer" rules. We will show that they represent a valid compromise between huge direct transfer terminology lists and large sets of transfer rules, which are very complex to maintain. We present a concrete, real-life application of this concept in a customization project (TOLEDO project) concerning the automatic translation of Autodesk (ADSK) support pages. In this application, the alignment is moreover combined with a graph representation substituting linear dictionaries. We show how the concept could be extended to increase coverage of traditional translation dictionaries as well as to extract terminology from large existing multilingual corpora. We also introduce the concept of "alignment dictionary" which seems promising in its ability to extend the pragmatic limits of multilingual dictionary management.
MT Summit 8, September 18-22, 2001.
Abstract: In this paper, we present the design of the new generation Systran translation systems, currently utilized in the development of English-Hungarian, English-Polish, English-Arabic, French-Arabic, Hungarian-French and Polish-French language pairs. The new design, based on the traditional Systran machine translation expertise and the existing linguistic resources, addresses the following aspects: efficiency, modularity, declarativity, reusability, and maintainability. Technically, the new systems rely on intensive use of state-of-the-art finite automaton and formal grammar implementation. The finite automata provide the essential lookup facilities and the natural capacity of factorizing intuitive linguistic sets. Linguistically, we have introduced a full monolingual description of linguistic information and the concept of implicit transfer. Finally, we present some by-products that are directly derived from the new architecture: intuitive coding tools, spell checker and syntactic tagger.
MT Summit 8, September 18-22, 2001.