For decades, the translation industry has promoted the use of “similar” translations in CAT tools, allowing human translators to visualize one or several matches retrieved from a translation memory (TM) when translating new documents. A TM is a database that stores segments of text and their corresponding translations. Segments can be sentences, paragraphs or sentence-like units (headings, titles, elements in a list, etc.). While the ideal situation is to find perfect matches, these are not always available. In such cases, translators resort to matches that share sufficient content with the document to be translated. These partial matches are then slightly “repaired” to achieve correct translations.
The use of TM matches relies on the idea that repairing a given TM match requires less effort than producing a translation from scratch, leading to higher productivity and greater consistency. The following figure illustrates human translation via repairing a TM match. The English sentence How long does the flight last? is translated into French using the TM match How long does a flu last? —Quelle est la durée d’une grippe?
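The fuzzy-match retrieval described above can be sketched with a character-level similarity score. The following toy example uses Python's `difflib`; the TM contents and the 70% threshold are illustrative assumptions, not a CAT tool's actual matching algorithm.

```python
# Toy fuzzy translation-memory lookup. SequenceMatcher stands in for
# the similarity scoring that real CAT tools implement.
from difflib import SequenceMatcher

# Hypothetical TM: source segments mapped to their translations.
translation_memory = {
    "How long does a flu last?": "Quelle est la durée d'une grippe ?",
    "Where is the boarding gate?": "Où est la porte d'embarquement ?",
}

def best_fuzzy_match(source, tm, threshold=0.7):
    """Return (score, tm_source, tm_target) for the closest TM segment,
    or None if no segment reaches the threshold."""
    best = None
    for tm_source, tm_target in tm.items():
        score = SequenceMatcher(None, source.lower(), tm_source.lower()).ratio()
        if score >= threshold and (best is None or score > best[0]):
            best = (score, tm_source, tm_target)
    return best

match = best_fuzzy_match("How long does the flight last?", translation_memory)
# The translator then "repairs" match[2], the retrieved target segment.
```

With the query from the figure, the flu segment is retrieved as the closest partial match, and its French translation becomes the starting point the translator repairs.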
Glossaries usually prove helpful when welcoming a new colleague to your team. What if they were also one of the best entry points into your domain for our models?
In many workplaces, a great deal of knowledge is accumulated in lexicons, which cover a wide variety of usages, from specialized terms to brand names and business concepts.
Drawing on more than 50 years of dedicated experience, our research team presented at COLING 2020 the technique behind the User Dictionary feature, designed to polish machine translation and give it an appropriate flavor through words. The presentation was recorded and is available here.
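As an intuition for how a user dictionary can steer translation output, here is a minimal placeholder-based sketch: dictionary terms are protected before translation and the required target terms are injected afterwards. The dictionary entries and this two-pass workflow are illustrative assumptions, not SYSTRAN's actual implementation.

```python
# Toy user-dictionary enforcement via placeholders (illustrative only).
import re

# Hypothetical terminology entries the customer wants enforced.
user_dictionary = {"router": "routeur", "firmware": "micrologiciel"}

def protect_terms(source, dictionary):
    """Replace known source terms with numbered placeholders and
    remember which target term each placeholder must become."""
    mapping = {}
    for i, (src_term, tgt_term) in enumerate(dictionary.items()):
        placeholder = f"⟦{i}⟧"
        pattern = re.compile(rf"\b{re.escape(src_term)}\b", re.IGNORECASE)
        if pattern.search(source):
            source = pattern.sub(placeholder, source)
            mapping[placeholder] = tgt_term
    return source, mapping

def restore_terms(target, mapping):
    """Inject the enforced target terms back into the translation."""
    for placeholder, tgt_term in mapping.items():
        target = target.replace(placeholder, tgt_term)
    return target

protected, mapping = protect_terms("Update the router firmware.", user_dictionary)
# ... the protected sentence would go through the MT engine here ...
restored = restore_terms(protected, mapping)
```

The point of the placeholders is that the engine cannot mistranslate a term it never sees; the dictionary guarantees the terminology regardless of the model's preferences.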
For many businesses, translation is a time-consuming, labor-intensive process. Human translators can take days to fully process, translate, and proof a few thousand words. Between sending jobs, communicating time frames and service prices, and receiving the actual translated document, businesses can spend several days waiting for a complete translation.
SYSTRAN’s Neural Machine Translation (NMT) solution cuts that process down to seconds. Our OpenNMT-powered neural engine and hyper-scalable architecture can almost instantly process translation requests. For example, it can translate a double-spaced, one-page Word document in around one second. NMT frees up human translators from grunt work and allows them to tackle more impactful, growth-oriented business problems.
Today, let’s discuss some of the features that allow us to provide those one-second, industry-leading turnaround times that facilitate nearly instant translations.
As part of our webinar series, one of our latest broadcasts discussed and demonstrated the unique and innovative Language I/O + SYSTRAN solution, created in collaboration with our partner company Language I/O.
Hosted by J. Obakhan from SYSTRAN and Heather Shoemaker, CEO of Language I/O, the webinar discussed the power of integrating machine translation technology into the customer care workflow.
Our webinar “Get More From SPNS9” on May 15th, 2020 was a huge success. The webinar demonstrated six exciting new upgrades to SYSTRAN Pure Neural Server 9.6, further extending its technological capabilities. Thank you to those who joined us.
In this post, we have compiled highlights from the presentation along with answers to the questions we received afterward.
The minds behind SYSTRAN sit down for an interview regarding the complexities and the capacities of specialized neural machine translation engines.
Participants: Peter Zoldan, Senior Data Engineer / Software Engineer, Linguistic Program; Svetlana Zyrianova, Linguistic Program; Petra Bayrami, Jr. Software Engineer, Linguistic Program; Natalia Segal, R&D Engineer.
How much data is required to create a specialized engine?
The more bilingual data, the better the quality. For broad domains such as news, millions of bilingual sentences are required. However, if the domain is narrow, such as technical support documents for specific products, even a small set of around 50,000 sentences noticeably improves the quality.
In short, the amount of data required depends on how broad or narrow the domain is that you are specializing the engine for.
Language is messy. Ask any person who has ever had to learn a second language and they will tell you that the most difficult aspect isn’t learning all the rules, but understanding the exceptions to the rules — the real-world application of the language.
When it comes to protecting classified data, blackout redaction has been in use for at least a century. While it is not the only acceptable form of data sanitization, it is historically the oldest and the one most commonly used by eDiscovery firms. This is despite the fact that there are more modern, easier-to-use alternatives that save time and reduce errors. The two main data sanitization alternatives that meet legal requirements are anonymization and pseudonymization.
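To illustrate how pseudonymization differs from blackout redaction, here is a minimal sketch in which identifying values are replaced by consistent pseudonyms, so the text stays readable and the stored mapping can reverse the substitution under controlled access. The regex pattern and pseudonym format are simplified assumptions for the example.

```python
# Toy pseudonymization: each distinct sensitive value gets a stable
# pseudonym, and repeated occurrences reuse the same pseudonym.
import re

def pseudonymize(text, patterns):
    """Replace every match of each labelled pattern with a pseudonym
    like <EMAIL_1>; return the sanitized text and the mapping."""
    mapping = {}
    counters = {}

    def make_repl(label):
        def _sub(match):
            value = match.group(0)
            if value not in mapping:
                counters[label] = counters.get(label, 0) + 1
                mapping[value] = f"<{label}_{counters[label]}>"
            return mapping[value]
        return _sub

    for label, pattern in patterns.items():
        text = re.sub(pattern, make_repl(label), text)
    return text, mapping

# Simplified email pattern; real sanitization needs broader coverage.
patterns = {"EMAIL": r"[\w.+-]+@[\w-]+\.\w+"}
clean, mapping = pseudonymize(
    "Contact jane@example.com or jane@example.com again.", patterns
)
```

Unlike a blacked-out document, the pseudonymized text still shows that both mentions refer to the same person, which is often essential in eDiscovery review.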
Machine Translation users care about quality and performance. Based on our own observations and the feedback we’ve received, the quality of our Neural MT is impressive. Evaluating performance is a stickier subject, but we’d like to dig in and present our innovations and achievements and how they benefit NMT users.
By performance we mostly mean the manner in which a system performs in terms of speed and efficiency in varying production environments. It is important to note that performance and quality in Neural MT are tightly connected: it is easy to accelerate a given model by compromising on quality. Therefore, when evaluating a performance improvement, we always check that quality remains very close to optimal.
Since switching to NMT at the end of 2016, we’ve invested our R&D efforts into optimizing our engines to be more efficient, while maintaining and even improving translation accuracy. Our latest, second-generation NMT engines, available in the latest release of SYSTRAN Pure Neural® Server, implement several technical optimizations that make translation faster and more efficient.
New model architecture
The first generation of neural translation engines was based on recurrent neural networks (RNN). This architecture requires the source text to be encoded sequentially, word by word, before generating the translation.
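To make that sequential constraint concrete, here is a toy sketch of an RNN encoder loop. The dimensions, embeddings, and simplified update rule are arbitrary assumptions for illustration, not SYSTRAN's engine; the point is that each hidden state depends on the previous one, so words cannot be encoded in parallel.

```python
# Toy RNN encoder: h_t = tanh(h_{t-1} + x_t). Each step must wait for
# the previous hidden state, which is what makes RNN encoding sequential.
import math
import random

HIDDEN = 4  # arbitrary hidden-state size for the demo

def embed(word):
    """Deterministic toy word embedding (seeded by the word itself)."""
    rng = random.Random(word)
    return [rng.uniform(-1, 1) for _ in range(HIDDEN)]

def rnn_step(prev_hidden, x):
    """One recurrence step of a (heavily simplified) RNN cell."""
    return [math.tanh(h + xi) for h, xi in zip(prev_hidden, x)]

def encode(sentence):
    hidden = [0.0] * HIDDEN           # initial state
    for word in sentence.split():     # strictly one word at a time
        hidden = rnn_step(hidden, embed(word))
    return hidden                     # summary of the whole source

state = encode("How long does the flight last ?")
```

Architectures such as the Transformer remove this dependency chain, letting all source positions be encoded at once, which is one of the levers behind faster second-generation engines.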
Since 2016, there has been a sharp increase in open source machine translation projects based on neural networks or Neural Machine Translation (NMT) led by companies such as Google, Facebook and SYSTRAN. Why have machine translation and NMT-related innovations become the new Holy Grail for tech companies? And does the future of these companies rely on machine translation?
Never before has a technological field undergone so much disruption in such a short time. Invented in the 1960s, machine translation was first based on grammatical and syntactical rules until 2007. Statistical modelling (known as statistical translation or SMT), which matured particularly due to the abundance of data, then took over. Although statistical translation was introduced by IBM in the 1990s, it took 15 years for the technology to reach mass adoption. Neural Machine Translation on the other hand, only took two years to be widely adopted by the industry after being introduced by academia in 2014, showing the acceleration of innovation in this field. Machine translation is currently experiencing a golden age of technology.
From Big Data to Good Data
Not only have these successive waves of technology differed in their pace of development and adoption, but their key strengths or “core values” have also changed. In rule-based translation, value was brought by code and accumulated linguistic resources. For statistical models, the amount of data was paramount. The more data you had, the better the quality of your translation and your evaluation via the BLEU score (Bilingual Evaluation Understudy, the most widely used algorithm for measuring machine translation quality). Now, the move to machine translation based on neural networks and Deep Learning is well underway and has brought about major changes. The engines are trained to learn language as a child does, progressing step by step. The challenge is not only to process ever-growing volumes of data (Big Data) but, more importantly, to feed the engines the highest-quality data possible. Hence the interest in “Good Data.”
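Since the BLEU score comes up above, here is a toy single-sentence version for intuition: a geometric mean of modified n-gram precisions multiplied by a brevity penalty. Real evaluations (e.g. with sacreBLEU) operate on whole corpora, with smoothing and often multiple references; this unsmoothed sketch is only meant to show the mechanics.

```python
# Minimal single-sentence BLEU: geometric mean of 1- to 4-gram
# precisions, scaled by a brevity penalty for short candidates.
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        overlap = sum((cand_counts & ref_counts).values())  # clipped counts
        total = max(sum(cand_counts.values()), 1)
        if overlap == 0:
            return 0.0  # unsmoothed: one empty precision zeroes the score
        precisions.append(overlap / total)
    # Brevity penalty: penalize candidates shorter than the reference.
    brevity = min(1.0, math.exp(1 - len(ref) / len(cand)))
    return brevity * math.exp(sum(math.log(p) for p in precisions) / max_n)

perfect = bleu("the cat sat on the mat", "the cat sat on the mat")
```

A perfect match scores 1.0 and a translation sharing no words scores 0.0; everything in between rewards overlapping n-grams, which is precisely why abundant, high-quality reference data mattered so much in the statistical era.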