Documents and Publications

SYSTRAN @ WMT 2021: Terminology Task [PDF]

ACL: https://aclanthology.org/2021.wmt-1.0/

This paper describes SYSTRAN's submissions to the WMT 2021 terminology shared task. We participate in the English-to-French translation direction with a standard Transformer neural machine translation network that we enhance with the ability to dynamically include terminology constraints, a very common industrial practice. Two state-of-the-art terminology insertion methods are evaluated, based (i) on the use of placeholders complemented with morphosyntactic annotation and (ii) on the use of target constraints injected in the source stream. Results show the suitability of the presented approaches in the evaluated scenario where terminology is used in a system trained on generic data only.

MinhQuang Pham, Antoine Senellart, Dan Berrebbi, Josep Crego, Jean Senellart

Proceedings of the Sixth Conference on Machine Translation (WMT), Online, November 10-11, 2021
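The second insertion method above, injecting target constraints into the source stream, can be illustrated with a minimal sketch. The tag names and the `inject_constraints` helper are illustrative, not the paper's actual preprocessing:

```python
def inject_constraints(source_tokens, terminology):
    """Append the desired target term after each matched source term,
    wrapped in special tags, so the encoder sees both sides."""
    out = []
    for tok in source_tokens:
        if tok.lower() in terminology:
            # <s>...</s> marks the source term, <t>...</t> the forced target term
            out += ["<s>", tok, "</s>", "<t>", terminology[tok.lower()], "</t>"]
        else:
            out.append(tok)
    return out

terms = {"laptop": "ordinateur portable"}
print(inject_constraints("my laptop is broken".split(), terms))
# → ['my', '<s>', 'laptop', '</s>', '<t>', 'ordinateur portable', '</t>', 'is', 'broken']
```

At training time the model learns to copy the tagged target term into its output, which is what makes the constraint "dynamic": no retraining is needed when the terminology changes.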

Revisiting Multi-Domain Machine Translation [PDF]

ACL: https://www.aclweb.org/anthology/2021.tacl-1.2/, https://aclanthology.org/volumes/2021.tacl-1/

When building machine translation systems, one often needs to make the best out of heterogeneous sets of parallel data in training, and to robustly handle inputs from unexpected domains in testing. This multi-domain scenario has attracted a lot of recent work that falls under the general umbrella of transfer learning. In this study, we revisit multi-domain machine translation, with the aim to formulate the motivations for developing such systems and the associated expectations with respect to performance. Our experiments with a large sample of multi-domain systems show that most of these expectations are hardly met, and suggest that further work is needed to better analyze the current behaviour of multi-domain systems and to make them fully deliver on their promises.

MinhQuang Pham, Josep Maria Crego, François Yvon

Book: Transactions of the Association for Computational Linguistics 9: 17–35, February 1, 2021

Integrating Domain Terminology into Neural Machine Translation [PDF]

ACL: https://aclanthology.org/2020.coling-main.348/

This paper extends existing work on terminology integration into Neural Machine Translation, a common industrial practice to dynamically adapt translation to a specific domain. Our method, based on the use of placeholders complemented with morphosyntactic annotation, efficiently taps into the ability of the neural network to deal with symbolic knowledge to surpass the surface generalization shown by alternative techniques. We compare our approach to state-of-the-art systems and benchmark them through a well-defined evaluation framework, focusing on actual application of terminology and not just on the overall performance. Results indicate the suitability of our method in the use-case where terminology is used in a system trained on generic data only.

Elise Michon, Josep Maria Crego, Jean Senellart

Proceedings of the 28th International Conference on Computational Linguistics, December 2020
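The placeholder mechanism described above can be sketched in a few lines: source terms are swapped for indexed placeholder tokens before translation, the network copies the placeholders through, and the target terms are restored afterwards. This is a toy illustration under assumed token names; the paper additionally complements placeholders with morphosyntactic annotation, which this sketch omits:

```python
import re

def to_placeholders(source, terminology):
    """Replace each source term with an indexed placeholder and remember
    which target term must be restored in the output."""
    mapping = {}
    for i, (src_term, tgt_term) in enumerate(terminology.items()):
        ph = f"⟨term{i}⟩"
        source, n = re.subn(rf"\b{re.escape(src_term)}\b", ph, source)
        if n:
            mapping[ph] = tgt_term
    return source, mapping

def restore(translation, mapping):
    """Swap the placeholders in the model output back for the target terms."""
    for ph, tgt_term in mapping.items():
        translation = translation.replace(ph, tgt_term)
    return translation

src, mapping = to_placeholders("the motherboard failed", {"motherboard": "carte mère"})
print(src)  # → the ⟨term0⟩ failed
hyp = "la ⟨term0⟩ est tombée en panne"   # hypothetical NMT output copying the placeholder
print(restore(hyp, mapping))  # → la carte mère est tombée en panne
```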

Priming Neural Machine Translation [PDF]

ACL: https://aclanthology.org/2020.wmt-1.63/

Priming is a well known and studied psychology phenomenon based on the prior presentation of one stimulus (cue) to influence the processing of a response. In this paper, we propose a framework to mimic the process of priming in the context of neural machine translation (NMT). We evaluate the effect of using similar translations as priming cues on the NMT network. We propose a method to inject priming cues into the NMT network and compare our framework to other mechanisms that perform micro-adaptation during inference. Overall, experiments conducted in a multi-domain setting confirm that adding priming cues in the NMT decoder can go a long way towards improving the translation accuracy. Besides, we show the suitability of our framework to gather valuable information for an NMT network from monolingual resources.

MinhQuang Pham, Jitao Xu, Josep Maria Crego, François Yvon, Jean Senellart

Proceedings of the Fifth Conference on Machine Translation, November 2020

A Study of Residual Adapters for Multi-Domain Neural Machine Translation [PDF]

ACL:

Domain adaptation is an old and vexing problem for machine translation systems. The most common and successful approach to supervised adaptation is to fine-tune a baseline system with in-domain parallel data. Standard fine-tuning, however, modifies all the network parameters, which makes this approach computationally costly and prone to overfitting. A recent, lightweight approach instead augments a baseline model with supplementary (small) adapter layers, keeping the rest of the model unchanged. This has the additional merit of leaving the baseline model intact, and adaptable to multiple domains. In this paper, we conduct a thorough analysis of the adapter model in the context of a multi-domain machine translation task. We contrast multiple implementations of this idea on two language pairs. Our main conclusions are that residual adapters provide a fast and cheap method for supervised multi-domain adaptation; our two variants prove as effective as the original adapter model, and open perspectives for also making adapted models more robust to domain label errors.

MinhQuang Pham, Josep Maria Crego, François Yvon, Jean Senellart

Proceedings of the Fifth Conference on Machine Translation, November 2020
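The residual adapter studied in this paper is a small bottleneck layer added on top of frozen hidden states: only the per-domain down/up projections are trained, while the base model stays untouched. A minimal NumPy sketch, assuming illustrative layer sizes and omitting the layer normalization that adapter variants typically include:

```python
import numpy as np

rng = np.random.default_rng(0)

def adapter(h, W_down, W_up):
    """Residual adapter: project to a small bottleneck, apply a
    non-linearity, project back up, and add the original hidden state."""
    z = np.maximum(h @ W_down, 0.0)   # down-projection + ReLU
    return h + z @ W_up               # up-projection + residual connection

d_model, d_bottleneck = 512, 64      # illustrative sizes
W_down = rng.normal(0, 0.02, (d_model, d_bottleneck))
W_up = rng.normal(0, 0.02, (d_bottleneck, d_model))

h = rng.normal(size=(10, d_model))   # 10 token states from the frozen base model
out = adapter(h, W_down, W_up)
print(out.shape)  # → (10, 512)
```

Because the adapter is residual, initializing `W_up` near zero makes it start as an identity function, so adaptation begins from the baseline model's behaviour.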

Efficient and High-Quality Neural Machine Translation with OpenNMT [PDF]

ACL: https://www.aclweb.org/anthology/2020.ngt-1.25

This paper describes the OpenNMT submissions to the WNGT 2020 efficiency shared task. We explore training and acceleration of Transformer models with various sizes that are trained in a teacher-student setup. We also present a custom and optimized C++ inference engine that enables fast CPU and GPU decoding with few dependencies. By combining additional optimizations and parallelization techniques, we create small, efficient, and high-quality neural machine translation models.

Guillaume Klein, Dakun Zhang, Clément Chouteau, Josep Crego, Jean Senellart

Book: "Proceedings of the Fourth Workshop on Neural Generation and Translation", pages 211--217, Association for Computational Linguistics, July 2020

Boosting Neural Machine Translation with Similar Translations [PDF]

ACL: https://www.aclweb.org/anthology/2020.acl-main.144

This paper explores data augmentation methods for training Neural Machine Translation to make use of similar translations, much as a human translator employs fuzzy matches. In particular, we show how we can simply present the neural model with information from both the source and target sides of the fuzzy matches; we also extend the notion of similarity to include semantically related translations retrieved using distributed sentence representations. We show that translations based on fuzzy matching provide the model with "copy" information, while translations based on embedding similarities tend to extend the translation "context". Results indicate that the effects of both kinds of similar sentences add up to further boost accuracy, combine naturally with model fine-tuning, and provide dynamic adaptation for unseen translation pairs. Tests on multiple data sets and domains show consistent accuracy improvements. To foster research around these techniques, we also release an open-source toolkit with an efficient and flexible fuzzy-match implementation.

Jitao Xu, Josep Crego, Jean Senellart

Book: "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics", pages 1580--1590, Association for Computational Linguistics, July 2020
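The fuzzy-match retrieval-and-augmentation idea can be sketched with the standard library's `difflib` as a stand-in similarity measure (the real toolkit uses a far more efficient fuzzy-match implementation, and the separator token is illustrative):

```python
import difflib

def fuzzy_match(src, tm_sources, tm_targets, threshold=0.7):
    """Return the target side of the most similar TM entry, or None."""
    best, best_score = None, threshold
    for s, t in zip(tm_sources, tm_targets):
        score = difflib.SequenceMatcher(None, src, s).ratio()
        if score >= best_score:
            best, best_score = t, score
    return best

tm_src = ["the engine is started", "press the red button"]
tm_tgt = ["le moteur est démarré", "appuyez sur le bouton rouge"]

match = fuzzy_match("the engine is stopped", tm_src, tm_tgt)
# The retrieved translation is appended to the source, separated by a tag,
# giving the model target-side material it can copy from.
augmented = "the engine is stopped ⟨sep⟩ " + match
print(augmented)
```

The augmented input carries the "copy" signal the abstract describes: most of the retrieved target can be reused verbatim, and the model only needs to fix the mismatched part.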

SYSTRAN @ WNGT 2019: DGT Task [PDF]

ACL: https://www.aclweb.org/anthology/D19-5629

This paper describes SYSTRAN's participation in the Document-level Generation and Translation (DGT) Shared Task of the 3rd Workshop on Neural Generation and Translation (WNGT 2019). We participate for the first time using a Transformer network enhanced with modified input embeddings and optimising an additional objective function that considers content selection. The network takes in structured data of basketball games and outputs a summary of the game in natural language.

Li Gong, Josep Crego, Jean Senellart

Book: "Proceedings of the 3rd Workshop on Neural Generation and Translation", pages 262--267, Association for Computational Linguistics, November 2019, Hong-Kong, China

SYSTRAN @ WAT 2019: Russian-Japanese News Commentary task [PDF]

ACL: https://www.aclweb.org/anthology/D19-5225

This paper describes SYSTRAN's submissions to the WAT 2019 Russian-Japanese News Commentary task, a challenging translation task due to the extremely low resources available and the distance of the language pair. We have used the neural Transformer architecture learned over the provided resources, and we carried out synthetic data generation experiments which aim at alleviating the data scarcity problem. Results indicate the suitability of the data augmentation experiments, enabling our systems to rank first according to automatic evaluations.

Jitao Xu, TuAnh Nguyen, MinhQuang Pham, Josep Crego, Jean Senellart

Book: "Proceedings of the 6th Workshop on Asian Translation", pages 189--194, Association for Computational Linguistics, November 2019, Hong-Kong, China

Enhanced Transformer Model for Data-to-Text Generation [PDF]

ACL: https://www.aclweb.org/anthology/D19-5615

Neural models have recently shown significant progress on data-to-text generation tasks in which descriptive texts are generated conditioned on database records. In this work, we present a new Transformer-based data-to-text generation model which learns content selection and summary generation in an end-to-end fashion. We introduce two extensions to the baseline Transformer model: first, we modify the latent representation of the input, which helps to significantly improve the content correctness of the output summary; second, we include an additional learning objective that accounts for content selection modelling. In addition, we propose two data augmentation methods that further improve the performance of the resulting generation models. Evaluation experiments show that our final model outperforms current state-of-the-art systems as measured by different metrics: BLEU, content selection precision, and content ordering. We have made the Transformer extension presented in this paper publicly available.

Li Gong, Josep Crego, Jean Senellart

Book: "Proceedings of the 3rd Workshop on Neural Generation and Translation", pages 148--156, Association for Computational Linguistics, November 2019, Hong-Kong, China

Lexical Micro-adaptation for Neural Machine Translation [PDF]

HAL: https://hal.archives-ouvertes.fr/hal-02635039

This work is inspired by a typical machine translation industry scenario in which translators make use of in-domain data for facilitating translation of similar or repeating sentences. We introduce a generic framework applied at inference in which a subset of segment pairs are first extracted from training data according to their similarity to the input sentences. These segments are then used to dynamically update the parameters of a generic NMT network, thus performing a lexical micro-adaptation. Our approach demonstrates strong adaptation performance to new and existing datasets including pseudo in-domain data. We evaluate our approach on a heterogeneous English-French training dataset showing accuracy gains on all evaluated domains when compared to strong adaptation baselines.

Jitao Xu, Josep Crego, Jean Senellart

Book: "International Workshop on Spoken Language Translation", "Proceedings of the 16th International Workshop on Spoken Language Translation (IWSLT)", November 2019, Hong-Kong, China

Generic and Specialized Word Embeddings for Multi-Domain Machine Translation [PDF]

HAL: https://hal.archives-ouvertes.fr/hal-02343215

Supervised machine translation works well when the train and test data are sampled from the same distribution. When this is not the case, adaptation techniques help ensure that the knowledge learned from out-of-domain texts generalises to in-domain sentences. We study here a related setting, multi-domain adaptation, where the number of domains is potentially large and adapting separately to each domain would waste training resources. Our proposal transposes to neural machine translation the feature expansion technique of (Daumé III, 2007): it isolates domain-agnostic from domain-specific lexical representations, while sharing most of the network across domains. Our experiments use two architectures and two language pairs: they show that our approach, while simple and computationally inexpensive, outperforms several strong baselines and delivers a multi-domain system that successfully translates texts from diverse sources.

Minh Quang Pham, Josep Crego, François Yvon, Jean Senellart

Book: "International Workshop on Spoken Language Translation", "Proceedings of the 16th International Workshop on Spoken Language Translation (IWSLT)", November 2019, Hong-Kong, China

Analyzing Knowledge Distillation in Neural Machine Translation [PDF]

IWSLT, PDF: https://workshop2018.iwslt.org/downloads/Proceedings_IWSLT_2018.pdf

Knowledge distillation has recently been successfully applied to neural machine translation. It allows for building shrunk networks whose resulting systems retain most of the quality of the original model. Although many authors report on the benefits of knowledge distillation, few works discuss the actual reasons why it works, especially in the context of neural MT. In this paper, we conduct several experiments aiming at understanding why and how distillation impacts accuracy on an English-German translation task. We show that translation complexity is actually reduced when building a distilled/synthesized bi-text when compared to the reference bi-text. We further remove noisy data from synthesized translations and merge filtered synthesized data together with the original reference, thus achieving additional accuracy gains.

Dakun Zhang, Josep Crego and Jean Senellart

15th International Workshop on Spoken Language Translation, October 29-30 2018, Bruges, Belgium
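The observation that distilled targets are "simpler" than references can be illustrated with a toy lexical-diversity proxy. This is only an illustrative stand-in, not the paper's actual complexity measure:

```python
def type_token_ratio(sentences):
    """Lexical diversity: distinct words / total words (lower = simpler)."""
    tokens = [w for s in sentences for w in s.split()]
    return len(set(tokens)) / len(tokens)

# Hypothetical reference targets vs. teacher-generated (distilled) targets:
reference = ["the vehicle halted abruptly", "the car stopped suddenly"]
distilled = ["the car stopped suddenly", "the car stopped suddenly"]

print(type_token_ratio(reference))  # → 0.875 (7 distinct / 8 tokens)
print(type_token_ratio(distilled))  # → 0.5   (4 distinct / 8 tokens)
```

The distilled corpus reuses the teacher's preferred phrasings, so the student faces a more regular, lower-entropy target distribution, which is one intuition for why distillation helps small models.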

Fixing Translation Divergences in Parallel Corpora for Neural MT [PDF]

ACL: http://aclweb.org/anthology/D18-1328

Corpus-based approaches to machine translation rely on the availability of clean parallel corpora. Such resources are scarce, and because of the automatic processes involved in their preparation, they are often noisy and may contain sentence pairs that are not as parallel as one would expect. This paper describes an unsupervised method for detecting translation divergences in parallel sentences. We rely on a neural network that computes cross-lingual sentence similarity scores, which are then used to effectively filter out divergent translations. Furthermore, similarity scores predicted by the network are used to identify and fix some partial divergences, yielding additional parallel segments. We evaluate these methods for English-French and English-German machine translation tasks, and show that using filtered/corrected corpora actually improves MT performance.

Minh Quang Pham, Josep Crego, Jean Senellart, François Yvon

2018 Conference on Empirical Methods in Natural Language Processing, October 31 – November 4 2018, Brussels, Belgium
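The filtering step described above reduces to scoring each sentence pair and keeping those above a threshold. A minimal sketch with a toy length-ratio heuristic standing in for the paper's neural cross-lingual similarity scorer (the scorer, pairs, and threshold are all illustrative):

```python
def length_ratio_score(src, tgt):
    """Toy stand-in for a neural cross-lingual similarity score:
    penalizes sentence pairs whose lengths diverge strongly."""
    ls, lt = len(src.split()), len(tgt.split())
    return min(ls, lt) / max(ls, lt)

def filter_corpus(pairs, score_fn, threshold=0.5):
    """Keep only pairs the scorer judges sufficiently parallel."""
    return [(s, t) for s, t in pairs if score_fn(s, t) >= threshold]

pairs = [
    ("the cat sleeps", "le chat dort"),
    ("the cat sleeps", "oui"),   # divergent: target drops most of the content
]
print(filter_corpus(pairs, length_ratio_score))
# → [('the cat sleeps', 'le chat dort')]
```

In the paper the scoring function is learned, which is what allows it not only to filter pairs but also to localize and fix partial divergences.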

SYSTRAN Participation to the WMT2018 Shared Task on Parallel Corpus Filtering [PDF]

ACL: http://aclweb.org/anthology/W18-6485

This paper describes SYSTRAN's participation in the shared task on parallel corpus filtering at the Third Conference on Machine Translation (WMT 2018). We participate for the first time using a neural sentence similarity classifier which aims at predicting the relatedness of sentence pairs in a multilingual context. The paper describes the main characteristics of our approach and discusses the results obtained on the data sets published for the shared task.

Minh Quang Pham, Josep Crego, Jean Senellart

Third Conference on Machine Translation (WMT18), October 31 - November 1 2018, Brussels, Belgium

Neural Network Architectures for Arabic Dialect Identification [PDF]

ACL: http://aclweb.org/anthology/W18-3914

SYSTRAN competes this year for the first time in the DSL shared task, in the Arabic Dialect Identification subtask. We participate by training several Neural Network models, showing that we can obtain competitive results despite the limited amount of training data available for learning. We report our experiments and detail the network architecture and parameters of our 3 runs: our best performing system consists of a Multi-Input CNN that learns separate embeddings for lexical, phonetic and acoustic input features (F1: 0.5289); we also built a CNN-biLSTM network aimed at capturing both spatial and sequential features directly from speech spectrograms (F1: 0.3894 at submission time, F1: 0.4235 with later found parameters); and finally a system relying on binary CNN-biLSTMs (F1: 0.4339).

Elise Michon, Minh Quang Pham, Josep Crego, Jean Senellart

Published in "Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects", Association for Computational Linguistics, pages 128–136, August 20 2018, New Mexico, USA

OpenNMT System Description for WNMT 2018: 800 words/sec on a single-core CPU [PDF]

ACL: http://aclweb.org/anthology/W18-2715

We present a system description of the OpenNMT Neural Machine Translation entry for the WNMT 2018 evaluation. In this work, we developed a heavily optimized NMT inference model targeting a high-performance CPU system. The final system uses a combination of four techniques, all of them leading to significant speed-ups in combination: (a) sequence distillation, (b) architecture modifications, (c) pre-computation, particularly of vocabulary, and (d) CPU targeted quantization. This work achieves the fastest performance of the shared task, and led to the development of new features that have been integrated to OpenNMT and made available to the community.

Jean Senellart, Dakun Zhang, Bo Wang, Guillaume Klein, J.P. Ramatchandirin, Josep Crego, Alexander M. Rush

Published in "Proceedings of the 2nd Workshop on Neural Machine Translation and Generation", pages 122–128, Association for Computational Linguistics, July 20 2018, Melbourne, Australia

Conception d'une solution de détection d'événements basée sur Twitter [PDF]

CNRS, PDF : http://taln2017.cnrs.fr/wp-content/uploads/2017/06/actes_TALN_2017-vol3-1.pdf#page=29

This article presents an alert system built on the mass of data coming from Twitter. The goal of the tool is to monitor the news across several reference domains, including sporting events and natural disasters. This monitoring is delivered to the user through a web interface showing the list of events located on a map.

Christophe Servan, Catherine Kobus, Yongchao Deng, Cyril Touffet, Jungi Kim, Inès Kapp, Djamel Mostefa, Josep Crego, Jean Senellart

24e Conférence sur le Traitement Automatique des Langues Naturelles (TALN) - Actes de TALN 2017, volume 3: demonstrations, pages 19--20, June 26-30 2017, Orléans, France

SYSTRAN Purely Neural MT Engines for WMT2017 [PDF]

ACL: http://aclweb.org/anthology/W17-4722

This paper describes SYSTRAN's systems submitted to the WMT 2017 shared news translation task for English-German, in both translation directions. Our systems are built using OpenNMT, an open-source neural machine translation system, implementing sequence-to-sequence models with LSTM encoder/decoders and attention. We experimented with automatically back-translated monolingual data. Our resulting models are further hyper-specialised with an adaptation technique that finely tunes models according to the evaluation test sentences.

Yongchao Deng, Jungi Kim, Guillaume Klein, Catherine Kobus, Natalia Segal, Christophe Servan, Bo Wang, Dakun Zhang, Josep Crego, Jean Senellart

Published in "Proceedings of the Second Conference on Machine Translation", pages 265--270, Association for Computational Linguistics, 2017, Copenhagen, Denmark

OpenNMT: Open-Source Toolkit for Neural Machine Translation [PDF]

ACL: http://www.aclweb.org/anthology/P17-4012

We describe an open-source toolkit for neural machine translation (NMT). The toolkit prioritizes efficiency, modularity, and extensibility with the goal of supporting NMT research into model architectures, feature representations, and source modalities, while maintaining competitive performance and reasonable training requirements. The toolkit consists of modeling and translation support, as well as detailed pedagogical documentation about the underlying techniques.

Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, Alexander Rush

Published in "Proceedings of ACL 2017, System Demonstrations", pages 67--72, Association for Computational Linguistics, 2017, Vancouver, Canada

Boosting Neural Machine Translation [PDF]

ACL: http://aclweb.org/anthology/I17-2046

Training efficiency is one of the main problems for Neural Machine Translation (NMT). Deep networks need very large amounts of data as well as many training iterations to achieve state-of-the-art performance. This results in very high computation cost, slowing down research and industrialisation. In this paper, we propose to alleviate this problem with several training methods based on data boosting and bootstrap, with no modifications to the neural network. Our approach imitates the learning process of humans, who typically spend more time when learning "difficult" concepts than easier ones. We experiment on an English-French translation task, showing accuracy improvements of up to 1.63 BLEU while saving 20% of training time.

Dakun Zhang, Jungi Kim, Josep Crego, Jean Senellart

Published in "Proceedings of the Eighth International Joint Conference on Natural Language Processing" (Volume 2: Short Papers), Asian Federation of Natural Language Processing, 2017, Taipei, Taiwan

Domain Control for Neural Machine Translation [PDF]

ACL: https://www.aclweb.org/anthology/R17-1049/

Machine translation systems are very sensitive to the domains they were trained on. Several domain adaptation techniques have been deeply studied. We propose a new technique for neural machine translation (NMT) that we call domain control, which is performed at runtime using a unique neural network covering multiple domains. The presented approach shows quality improvements over dedicated single-domain models when translating on any of the covered domains, and even on out-of-domain data. In addition, model parameters do not need to be re-estimated for each domain, making the approach practical for real use cases. Evaluation is carried out on English-to-French translation for two different testing scenarios. We first consider the case where an end-user performs translations on a known domain. Secondly, we consider the scenario where the domain is not known and is predicted at the sentence level before translating. Results show consistent accuracy improvements for both conditions.

Catherine Kobus, Josep Crego, Jean Senellart

Published in "Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017", INCOMA Ltd., Varna, Bulgaria, Sep 4–6 2017 - [v2] 12 Sep 2017
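Domain control via an extra token amounts to a one-line preprocessing step: a pseudo-token identifying the domain is prepended to each source sentence, and the single network learns domain-conditioned behaviour from it. A minimal sketch (the token format is illustrative; the paper also studies a word-level feature variant):

```python
def add_domain_token(sentence, domain):
    """Prepend a pseudo-token identifying the domain, as in
    tag-based multi-domain NMT."""
    return f"⟨{domain}⟩ {sentence}"

print(add_domain_token("administer the drug twice daily", "medical"))
# → ⟨medical⟩ administer the drug twice daily
```

At inference, when the domain is unknown, the tag can be set by a sentence-level domain classifier, which corresponds to the paper's second testing scenario.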

Adaptation incrémentale de modèles de traduction neuronaux [PDF]

CNRS, PDF : http://taln2017.cnrs.fr/wp-content/uploads/2017/06/actes_TALN_2017-vol2-1.pdf#page=230

Domain adaptation is a scientific challenge in machine translation. It generally encompasses terminology and style adaptation, in particular for human post-editing in computer-assisted translation workflows. With neural machine translation, we study a new domain adaptation approach that we call "specialization", which shows promising results both in learning speed and in translation scores. In this article, we propose to explore this approach.

Christophe Servan, Josep Crego, Jean Senellart

24e Conférence sur le Traitement Automatique des Langues Naturelles (TALN) - Actes de TALN 2017, volume 2: short papers, pages 218--225, June 26-30 2017, Orléans, France

SYSTRAN Pure Neural Machine Translation [PDF]

Neural Machine Translation: let's go back to the origins

Each of us has experienced or heard of deep learning in day-to-day business applications. What are the fundamentals of this new technology, and what new opportunities does it offer?

Jan 31, 2017

Neural Machine Translation from Simplified Translations [PDF]

Subjects: Computation and Language (cs.CL)

arXiv: https://arxiv.org/abs/1612.06139

Abstract: Text simplification aims at reducing the lexical, grammatical and structural complexity of a text while keeping the same meaning. In the context of machine translation, we introduce the idea of simplified translations in order to boost the learning ability of deep neural translation models. We conduct preliminary experiments showing that, for a neural machine translation (NMT) system learned on a given bi-text, translating the source side of that bi-text yields output whose complexity is reduced compared to the target reference. Building on the idea of knowledge distillation, we then train an NMT system using the simplified bi-text, and show that it outperforms the initial system that was built over the reference data set. Performance is further boosted when both reference and automatic translations are used to learn the network. We perform an elementary analysis of the translated corpus and report accuracy results of the proposed approach on English-to-French and English-to-German translation tasks.

Josep Crego, Jean Senellart

[v1] Mon, 19 Dec 2016 11:50:58 GMT

Domain specialization: a post-training domain adaptation for Neural Machine Translation [PDF]

Subjects: Computation and Language (cs.CL)

arXiv: https://arxiv.org/abs/1612.06141

Domain adaptation is a key feature in Machine Translation. It generally encompasses terminology, domain and style adaptation, especially for human post-editing workflows in Computer Assisted Translation (CAT). With Neural Machine Translation (NMT), we introduce a new notion of domain adaptation that we call "specialization", which shows promising results both in learning speed and in adaptation accuracy. In this paper, we propose to explore this approach under several perspectives.

Christophe Servan, Josep Crego, Jean Senellart

[v1] Mon, 19 Dec 2016

SYSTRAN's Pure Neural Machine Translation Systems [PDF]

Subjects: Computation and Language (cs.CL)

arXiv: https://arxiv.org/abs/1610.05540

Abstract: Since the first online demonstration of Neural Machine Translation (NMT) by LISA, NMT development has recently moved from laboratory to production systems, as demonstrated by several entities announcing the roll-out of NMT engines to replace their existing technologies. NMT systems have a large number of training configurations, and the training process of such systems is usually very long, often a few weeks, so the role of experimentation is critical and important to share. In this work, we present our approach to production-ready systems simultaneously with the release of online demonstrators covering a large variety of languages (12 languages, for 32 language pairs). We explore different practical choices: an efficient and evolutive open-source framework; data preparation; network architecture; additional implemented features; tuning for production; etc. We discuss evaluation methodology, present our first findings, and finally outline further work.
Our ultimate goal is to share our expertise to build competitive production systems for "generic" translation. We aim at contributing to set up a collaborative framework to speed up adoption of the technology, foster further research efforts, and enable the delivery and adoption by industry of use-case-specific engines integrated in real production workflows. Mastering the technology would allow us to build translation engines suited for particular needs, outperforming current simplest/uniform systems.

Josep Crego, Jungi Kim, Guillaume Klein, Anabel Rebollo, Kathy Yang, Jean Senellart, Egor Akhanov, Patrice Brunelle, Aurelien Coquard, Yongchao Deng, Satoshi Enoue, Chiyo Geiss, Joshua Johanson, Ardas Khalsa, Raoum Khiari, Byeongil Ko, Catherine Kobus, Jean Lorieux, Leidiana Martins, Dang-Chuan Nguyen, Alexandra Priori, Thomas Riccardi, Natalia Segal, Christophe Servan, Cyril Tiquet, Bo Wang, Jin Yang, Dakun Zhang, Jing Zhou, Peter Zoldan

[v1] Tue, 18 Oct 2016

System Combination RWTH Aachen - SYSTRAN for the NTCIR-10 PatentMT Evaluation 2013 [PDF]

Abstract: This paper describes the joint submission by RWTH Aachen University and SYSTRAN in the Chinese-English Patent Machine Translation Task at the 10th NTCIR Workshop. We specify the statistical systems developed by RWTH Aachen University and the hybrid machine translation systems developed by SYSTRAN. We apply RWTH Aachen’s combination techniques to create consensus hypotheses from very different systems: phrase-based and hierarchical SMT, rule-based MT (RBMT) and MT with statistical post-editing (SPE). The system combination was ranked second in BLEU and second in the human adequacy evaluation in this competition.

Minwei Feng, Markus Freitag, Hermann Ney, Bianka Buschbeck, Jean Senellart, Jin Yang

June 18-21, 2013, Tokyo, Japan

SYSTRAN Chinese-English and English-Chinese Hybrid Machine Translation Systems for CWMT 2011 [PDF]

Abstract: This report describes SYSTRAN’s Chinese-English and English-Chinese machine translation systems that participated in the CWMT 2011 machine translation evaluation tasks. The base systems are SYSTRAN rule-based machine translation systems, augmented with various statistical techniques. Based on the translations of the rule-based systems, we performed statistical post-editing with the provided bilingual and monolingual training corpora. In this report, we describe the technology behind the systems, the training data, and finally the evaluation results in the CWMT 2011 evaluation. Our primary Chinese-English system was ranked first in BLEU in the translation tasks.

Jin Yang, Satoshi Enoue, Jean Senellart

Proceedings of the 7th China Workshop on Machine Translation (CWMT), September 2011.

Convergence of Translation Memory and Statistical Machine Translation [PDF]

Abstract: We present two methods that merge ideas from statistical machine translation (SMT) and translation memories (TM). We use a TM to retrieve matches for source segments, and replace the mismatched parts with instructions to an SMT system to fill in the gap. We show that for fuzzy matches of over 70%, one method outperforms both SMT and TM baselines.

Philipp Koehn, Jean Senellart

JEC, November 2010.

Fast Approximate String Matching with Suffix Arrays and A* Parsing [PDF]

Abstract: We present a novel exact solution to the approximate string matching problem in the context of translation memories, where a text segment has to be matched against a large corpus, while allowing for errors. We use suffix arrays to detect exact n-gram matches, A* search heuristics to discard matches and A* parsing to validate candidate segments. The method outperforms the canonical baseline by a factor of 100, with average lookup times of 4.3–247ms for a segment in a realistic scenario.

Philipp Koehn, Jean Senellart

AMTA, October 2010.
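The first stage described above, detecting exact n-gram matches with a suffix array, can be sketched as follows. This is a naive illustration: the construction below is quadratic, unlike the efficient suffix-array algorithms the paper relies on, and it omits the A* search and parsing stages:

```python
import bisect

def build_suffix_array(tokens):
    """Sort all suffix start positions of the corpus lexicographically."""
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])

def find_ngram(tokens, sa, ngram):
    """Binary-search the suffix array for every corpus position where
    `ngram` occurs exactly."""
    # Prefixes of sorted suffixes are themselves in sorted order,
    # so bisect can locate the block of matching suffixes.
    prefixes = [tokens[i:i + len(ngram)] for i in sa]
    lo = bisect.bisect_left(prefixes, ngram)
    hi = bisect.bisect_right(prefixes, ngram)
    return sorted(sa[i] for i in range(lo, hi))

corpus = "a b c a b d a b c".split()
sa = build_suffix_array(corpus)
print(find_ngram(corpus, sa, ["a", "b"]))  # → [0, 3, 6]
```

These exact n-gram hits seed the candidate segments, which the A* heuristics then prune and validate against the allowed error budget.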

SYSTRAN Chinese-English and English-Chinese Hybrid Machine Translation Systems [PDF]

Abstract: This report describes SYSTRAN's Chinese-English and English-Chinese machine translation systems that participated in the CWMT2009 machine translation evaluation tasks. The base systems are SYSTRAN rule-based machine translation systems, augmented with various statistical techniques. Based on the translations of the rule-based systems, we perform statistical post-editing with the provided bilingual and monolingual training corpora. In this report, we describe the technology behind the systems, the training data, and finally the evaluation results in the CWMT2009 evaluation. Our primary systems were top-ranked in the evaluation tasks.

Jin Yang, Satoshi Enoue, Jean Senellart, Tristan Croiset

CWMT, November 2009.

Selective addition of corpus-extracted phrasal lexical rules to a rule-based machine translation system [PDF]

Abstract: In this work, we show how an existing rule-based, general-purpose machine translation system may be improved and adapted automatically to a given domain, whenever parallel corpora are available. We perform this adaptation by extracting dictionary entries from the parallel data. Starting from this initial set, the application of each rule is tested against the baseline performance. Rules are then pruned depending on sentence-level improvements and deteriorations, as evaluated by an automatic string-based metric. Experiments using the Europarl dataset show a 3% absolute improvement in BLEU over the original rule-based system.

Loic Dugast, Jean Senellart, Philipp Koehn

MT Summit, August 2009.
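The sentence-level pruning loop described in the abstract can be sketched as follows. This is a schematic illustration under invented names, not the paper's code: `translate` and `score` stand in for the rule-based engine and the automatic string-based metric, and the win/loss criterion is a simplification of the improvement/deterioration comparison:

```python
def prune_rules(rules, translate, score, sentences, references):
    """Keep a candidate phrasal rule only if applying it improves more
    sentences than it degrades under a sentence-level string metric."""
    kept = []
    for rule in rules:
        wins = losses = 0
        for src, ref in zip(sentences, references):
            base = score(translate(src), ref)                      # baseline system
            adapted = score(translate(src, extra_rules=[rule]), ref)  # with the rule
            if adapted > base:
                wins += 1
            elif adapted < base:
                losses += 1
        if wins > losses:
            kept.append(rule)
    return kept
```

In the paper the metric is an automatic string-based one (BLEU-style) computed per sentence; any `score(hypothesis, reference)` function with the same shape fits this loop.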

Statistical Post Editing and Dictionary Extraction: SYSTRAN/Edinburgh submissions for ACL-WMT2009 [PDF]

Abstract: We describe here the two Systran/University of Edinburgh submissions for WMT2009. They involve a statistical post-editing model with a particular handling of named entities (English to French and German to English) and the extraction of phrasal rules (English to French).

Loïc Dugast, Jean Senellart, Philipp Koehn

March 2009.

SMT and SPE Machine Translation Systems for WMT'09 [PDF]

Abstract: This paper describes the development of several machine translation systems for the 2009 WMT shared task evaluation. We only consider the translation between French and English. We describe a statistical system based on the Moses decoder and a statistical post-editing system using SYSTRAN’s rule-based system. We also investigated techniques to automatically extract additional bilingual texts from comparable corpora.

Holger Schwenk, Sadaf Abdul Rauf, Loic Barrault, Jean Senellart

March 2009.

First Steps towards a General Purpose French/English Statistical Machine Translation System [PDF]

Abstract: This paper describes an initial version of a general purpose French/English statistical machine translation system. The main features of this system are the open-source Moses decoder, the integration of a bilingual dictionary and a continuous space target language model. We analyze the performance of this system on the test data of the WMT'08 evaluation.

Holger Schwenk, Jean-Baptiste Fouet, Jean Senellart

June 2008.

Can we Relearn an RBMT System? [PDF]

Abstract: This paper describes SYSTRAN submissions for the shared task of the third Workshop on Statistical Machine Translation at ACL. Our main contribution consists in a French-English statistical model trained without the use of any human-translated parallel corpus. As a substitute, we translated a monolingual corpus with the SYSTRAN rule-based translation engine to produce the parallel corpus. The results are provided herein, along with an error analysis.

Loïc Dugast, Jean Senellart, Philipp Koehn

June 2008.

SYSTRAN Translation Stylesheets: Machine Translation driven by XSLT [PDF]

Abstract: XSL Transformation stylesheets are usually used to transform a document described in an XML formalism into another XML formalism, to modify an XML document, or to publish content stored into an XML document to a publishing format (XSL-FO, (X)HTML...). SYSTRAN Translation Stylesheets (STS) use XSLT to drive and control the machine translation of XML documents (native XML document formats or XML representations — such as XLIFF — of other kinds of document formats).

Pierre Senellart, Jean Senellart

September 2005.

Intuitive Coding of the Arabic Lexicon [PDF]

Abstract: SYSTRAN started the design and the development of Arabic, Farsi and Urdu to English machine translation systems in July 2002. This paper describes the methodology and implementation adopted for dictionary building and morphological analysis. SYSTRAN's IntuitiveCoding® technology (ICT) facilitates the creation, update, and maintenance of Arabic, Farsi and Urdu lexical entries, and is more modular and less costly. ICT for Arabic, Farsi, and Urdu requires the implementation of stem-based lexical entries, the authentic scripts for each language, a statistical Arabic stem-guesser, and separate declarative modules for internal and external morphology.

Ali Farghaly, Jean Senellart

MT Summit IX; September 22-26, 2003.

SYSTRAN New Generation: The XML Translation Workflow [PDF]

Abstract: Customization of Machine Translation (MT) is a prerequisite for corporations to adopt the technology; it is therefore important but nonetheless challenging. Ongoing implementation proves that XML is an excellent exchange device between MT modules that efficiently enables interaction between the user and the processes to reach highly granular, structure-based customization. Accomplished through an innovative approach called the SYSTRAN Translation Stylesheet, this method is coherent with the current evolution of the “authoring process”. As a natural progression, the next stage in the customization process is the integration of MT in a multilingual tool kit designed for the “authoring process”.

Jean Senellart, Christian Boitet, Laurent Romary

MT Summit IX, September 22-26, 2003.

SYSTRAN Review Manager [PDF]

Abstract: The SYSTRAN Review Manager (SRM) is one of the components that comprise the SYSTRAN Linguistics Platform (SLP), a comprehensive enterprise solution for managing MT customization and localization projects. The SRM is a productivity tool used for the review, quality assessment and maintenance of linguistic resources combined with a SYSTRAN solution. The SRM is used in-house by SYSTRAN’s development team and is also licensed to corporate customers as it addresses leading linguistic challenges, such as terminology and homographs, which makes it a key component of the QA process. Extremely flexible, the SRM adapts to localization and MT customization projects from small to large-scale. Its Web-based interface and multi-user architecture enable a centralized and efficient work environment for local and geographically dispersed individual users and teams. Users segment a given corpus to fluidly review and evaluate translations, as well as identify the typology of errors. Corpus metrics, terminology extraction and detailed reporting capabilities facilitate prioritizing tasks, resulting in immediate focus on those issues that significantly impact MT quality. Data and statistics are tracked throughout the customization process and are always available for regression tests and overall project management. This environment is highly conducive to increased productivity and efficient QA in the MT customization effort.

Jean-Cédric Costa, Christiane Panissod

MT Summit IX; September 22-26, 2003.

SYSTRAN Intuitive Coding Technology [PDF]

Abstract: Customizing a general-purpose MT system is an effective way to improve machine translation quality for specific usages. Building a user-specific dictionary is the first and most important step in the customization process. An intuitive dictionary-coding tool was developed and is now utilized to allow the user to build user dictionaries easily and intelligently. SYSTRAN's innovative and proprietary IntuitiveCoding® technology is the engine powering this tool. It is comprised of various components: massive linguistic resources, a morphological analyzer, a statistical guesser, a finite-state automaton, and a context-free grammar. Methodologically, IntuitiveCoding® is also a cross-application approach to high-quality dictionary building for terminology import and exchange. This paper describes the various components and the issues involved in its implementation. An evaluation frame and utilization of the technology are also presented. Plans for further advancing this technology are also outlined.

Jean Senellart, Jin Yang, Anabel Rebollo

MT Summit IX; September 22-26, 2003.

The SYSTRAN Linguistics Platform [PDF]

Abstract: SYSTRAN's SLP (SYSTRAN Linguistics Platform) is a comprehensive enterprise solution for managing a full range of translation and localization project tasks. The SLP consists of the SYSTRAN machine translation (MT) technology, linguistic resources and tools for project management, corpus analysis and quality evaluation. The underlying platform that supports the SLP is the SYSTRAN WebServer, a client/server application that can be accessed transparently through most common software applications. It supports document formats including HTML, RTF, XML, and SGML. The SYSTRAN WebServer is hosted at the customer’s site and can be integrated with internal translation workflow systems. The SYSTRAN WebServer is a robust and high-volume platform that can support an unlimited number of users, and millions of translation jobs per day.

A software solution to manage multilingual corporate knowledge

October 2002.

SYSTRAN-Autodesk: Resource Alignment and Implicit Transfer [PDF]

Abstract: In this article we present the concept of "implicit transfer" rules. We will show that they represent a valid compromise between huge direct transfer terminology lists and large sets of transfer rules, which are very complex to maintain. We present a concrete, real-life application of this concept in a customization project (TOLEDO project) concerning the automatic translation of Autodesk (ADSK) support pages. In this application, the alignment is moreover combined with a graph representation substituting linear dictionaries. We show how the concept could be extended to increase coverage of traditional translation dictionaries as well as to extract terminology from large existing multilingual corpora. We also introduce the concept of "alignment dictionary" which seems promising in its ability to extend the pragmatic limits of multilingual dictionary management.

Jean Senellart, Mirko Plitt, Christophe Bailly, Françoise Cardoso

MT Summit 8, September 18-22, 2001.

New Generation SYSTRAN Translation System [PDF]

Abstract: In this paper, we present the design of the new generation Systran translation systems, currently utilized in the development of English-Hungarian, English-Polish, English-Arabic, French-Arabic, Hungarian-French and Polish-French language pairs. The new design, based on the traditional Systran machine translation expertise and the existing linguistic resources, addresses the following aspects: efficiency, modularity, declarativity, reusability, and maintainability. Technically, the new systems rely on intensive use of state-of-the-art finite-automata and formal-grammar implementations. The finite automata provide the essential lookup facilities and a natural capacity for factorizing intuitive linguistic sets. Linguistically, we have introduced a full monolingual description of linguistic information and the concept of implicit transfer. Finally, we present some by-products that are directly derived from the new architecture: intuitive coding tools, a spell checker and a syntactic tagger.

Jean Senellart, Péter Dienes, Tamás Váradi

MT Summit 8, September 18-22, 2001.
