The Benefits of Maturity: SYSTRAN prioritizes source content engineering and knowing your customers (The LISA Newsletter: Globalization Insider. Number 3.1) 2002/07/08
LISA asked SYSTRAN’s Chief Technology Officer Pierre-Yves Foucou about the French MT supplier’s approach to customer needs, two years after he took on the job of revamping the world’s longest-lasting MT supplier.
LISA: The arrival of the web has caused a rush of interest in automatic translation solutions. John Hutchins’ “Compendium of MT systems” now contains over a hundred systems for various language pairs, and there appears to be renewed interest in MT research projects. Does SYSTRAN feel challenged or encouraged by the competition?
Pierre-Yves Foucou: SYSTRAN as a company has been in existence for 30 years now, and to our knowledge we are the only independent fully-fledged supplier of the full range of machine translation solutions around.
In MT, as in many other areas of technology services, it is a question of knowing what you can do, and doing what you know. There are plenty of systems that can be tagged onto websites or desktops to offer some kind of translation functionality, but they are nearly all afterthoughts, often trying to market their wares through their own channels, as IBM does with its WebSphere Translation Server.
In our opinion, there are almost no serious attempts to understand what the market really needs and then to engineer an MT system that responds appropriately to those needs. Many of the commercial projects turn out to be vaporware in the final analysis and require massive investments in re-engineering to deliver any punch. So we feel that in the marketplace for multi-service MT solutions, we have no real competitor. And that is largely because our core business focus is on delivering customized solutions with a long-term perspective.
LISA: So what does it take to be an industrial-strength MT supplier rather than a translation tool vendor?
Pierre-Yves Foucou: You cannot develop an MT system without a clear idea of the user. Which means that we have to spend a lot of time actually finding out what our customers do or want to do with SYSTRAN solutions. Much of my work involves trying to make our services fully accountable. This involves, for example, developing sophisticated metrics on quality and usage, so that we can build a quality development model that addresses the real issues of delivering the service the customer wants.
We therefore have to encourage our customers to take the time to develop, jointly with us, the skills to evaluate user behavior, for example. We were able to do this with Autodesk, where SYSTRAN is used to translate CRM content, because the company knew exactly what it needed SYSTRAN for. The same goes for PricewaterhouseCoopers, which uses SYSTRAN as a multilingual solution for in-company knowledge management across a highly distributed user base.
This industrial approach that we have been developing provides a clear view of costs, and allows our customers to set business agendas to evaluate ROI and upgrade the system accordingly to meet performance objectives. It also means developing an understanding of how to deliver appropriate after-sales service and other forms of support, so that the development of an MT solution is integrated into an overall vision with transparency about needs and clarity about targets.
LISA: What are the key issues that potential customers must be aware of when seeking to MT-enable a given document cluster or web function?
Pierre-Yves Foucou: First of all, there is no such thing as a “general purpose” MT system. If you are going to use MT to deliver translation into multiple target languages, then our approach will be different from a solution used to translate multiple sources into a single target, just as a text gisting solution will be different from one delivering quality output. And this all depends largely on the nature of the original content.
So the key to successful MT lies in the source material. Companies seeking to automate some of their translation functions will have to be extremely aware of the provenance of their content: Is it their own? Do they have any quality metrics on it? The MT equation changes if you find that translatable content may come from outside the company.
It is essential to develop and use quality metrics for source content, but if you look around at corporate content management policy with respect to multilingual deployment, you find that very little relevant work is carried out on industrial-strength content quality assessment; indeed, enterprises rarely have a stable in-house view of their own content. It is organizational and command issues like these that have to be addressed for a successful MT solution. In any case, this content quality mindset will have to go mainstream in companies as they begin to deploy a new generation of CMS.
It is rare, for example, for large industrial companies to have a developed sense of “terminology consistency” in their own document creation, even though there have been many attempts to move in this direction. One method of achieving some form of consistency is to use natural language processing techniques to assess the range and distribution of terminological usage, and we have developed tools to aid this process.
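As a rough illustration of what assessing the "range and distribution of terminological usage" can mean in practice, here is a minimal Python sketch. It is not SYSTRAN's tooling: the variant groups, function names and sample sentences are all hypothetical, and a real system would rely on proper linguistic analysis rather than simple string matching.

```python
import re
from collections import Counter

# Hypothetical variant groups: each concept maps to surface forms that
# writers might use interchangeably. Illustrative only.
VARIANTS = {
    "hard_disk": ["hard disk", "hard drive", "fixed disk"],
    "log_in": ["log in", "log on", "sign in"],
}

def variant_distribution(documents, variants=VARIANTS):
    """Count how often each variant of each concept occurs across documents."""
    counts = {concept: Counter() for concept in variants}
    for text in documents:
        lowered = text.lower()
        for concept, forms in variants.items():
            for form in forms:
                # Whole-word match so "log in" does not match inside "blog in".
                pattern = r"\b" + re.escape(form) + r"\b"
                counts[concept][form] += len(re.findall(pattern, lowered))
    return counts

def inconsistent_concepts(counts):
    """Return the concepts for which more than one variant is actually in use."""
    return {
        concept: {form: n for form, n in c.items() if n > 0}
        for concept, c in counts.items()
        if sum(1 for n in c.values() if n > 0) > 1
    }

docs = [
    "Insert the hard disk, then log in to the console.",
    "If the hard drive fails, log on again after replacing the hard disk.",
]
counts = variant_distribution(docs)
report = inconsistent_concepts(counts)
print(report)
```

Each concept that appears in the report is a place where authors are using competing terms, and the counts give a first indication of which variant is dominant in the document base.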
But, once again, it takes time and skill to apply such techniques effectively. Due to the rapid rise of MT as an apparent quick-fix solution, potential customers might tend to believe there is a silver bullet that allows MT to be integrated into their process. But this is an illusion. The benefits of an MT solution come through advancing slowly, step by step, against a checklist of vital operations. Some 70 to 80% of the work in building a successful MT solution goes into cleaning up and structuring the source material, the results of which can then be leveraged as a long-term investment. Any other way will lead to confusion about purpose and disappointment in results.
LISA: What sort of rules of thumb do you use to evaluate content engineering for your customers?
Pierre-Yves Foucou: An MT system such as SYSTRAN will only work successfully when very large data sets have been evaluated. This in turn means that the knowledge base of concepts, and of the terms that express them in the document base, will need to target a minimum of 100,000 entries. And the figure can rise tenfold for any highly technical subject area.
One of the problems that I’ve met in recent years in this respect is the lack of good generic tools for adding new knowledge to a corporate term system. Older approaches to MT spent a lot of time worrying about ambiguities in the source text and remained technologically highly conservative, since they kept trying to fiddle with the grammar rules to cover all the potential ambiguities in natural language as a whole.
But with the right tools to extract knowledge from the source content, much of this “ambiguity” can be sidelined. Once you have the tools to identify unknown terms or extract plausible term candidates, you can then evaluate the investment needed for the following stages and so on.
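To make the idea of identifying unknown terms and extracting plausible term candidates more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not a description of SYSTRAN's tools: the lexicon, the stopword list, the function names and the sample sentences are hypothetical, and candidates are simply adjacent word pairs that recur in the corpus.

```python
import re
from collections import Counter

# Hypothetical inputs: a tiny lexicon of already-known words and a stopword list.
KNOWN_TERMS = {"the", "a", "of", "is", "to", "and", "before", "if",
               "valve", "pressure", "check", "replace"}
STOPWORDS = {"the", "a", "of", "is", "to", "and", "before", "if"}

def tokenize(text):
    """Lowercase the text and return alphabetic (optionally hyphenated) words."""
    return re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())

def unknown_words(texts, known=KNOWN_TERMS):
    """Count the words that are missing from the known-term lexicon."""
    counts = Counter()
    for text in texts:
        counts.update(t for t in tokenize(text) if t not in known)
    return counts

def term_candidates(texts, min_count=2, stopwords=STOPWORDS):
    """Return adjacent word pairs with no stopword, seen at least min_count times."""
    counts = Counter()
    for text in texts:
        tokens = tokenize(text)
        for first, second in zip(tokens, tokens[1:]):
            if first not in stopwords and second not in stopwords:
                counts[(first, second)] += 1
    return [(" ".join(pair), n) for pair, n in counts.most_common() if n >= min_count]

corpus = [
    "Check the relief valve before the actuator is pressurized.",
    "Replace the relief valve if the actuator pressure is low.",
]
unknown = unknown_words(corpus)
candidates = term_candidates(corpus)
print(unknown.most_common())
print(candidates)
```

The size of the unknown-word list and of the candidate list gives a first estimate of how much lexicon work lies ahead, which is the kind of figure needed to evaluate the investment required for the following stages.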
At the same time, the new tools now emerging to provide semantic analyses of text tend to come from small-scale providers without the resources to scale or ensure durable development, and they often prove to be useless when confronted by industrial-strength volumes of content. There is a serious lack of really robust technology for the MT engine room, so we end up having to develop it ourselves where possible.
LISA: Do you recommend that your customers explore the possibility of using controlled language input as a method to ensure high quality source content?
Pierre-Yves Foucou: Although there is growing interest in editing source text to specific linguistic standards in certain industries, our experience is that it is relatively hard to implement among writers, and therefore tends to put the brakes on long term productivity in MT solutions.
We feel it is better to build a software grammar for a given language, and then scale that grammar up to gradually fit the textual facts, rather than to try and reduce linguistic variation at root in the source production system. In other words, language grammars are put to better use as post-production checkers, and as techniques for deriving internal representations of the meaning of sentences, than as a technique for standardizing an authoring “language” as such.
That said, it is clear that an authoring dimension such as terminology consistency is an immense asset when you plan to enable a document base for multilingual use later on.
LISA: What kind of specific solutions have you been delivering recently to your customers?
Pierre-Yves Foucou: DaimlerChrysler is an example of a company that, following its 1998 merger, needed a cost-effective, no-maintenance source of two-way English-German translation for constant intranet communications among its employees: a language engine embedded inside the company, if you like. In fact, the initiative came from the employees themselves, which may account for the easier acceptance of the system, especially when most people’s experience of MT quality in action comes from a service such as Alta Vista on the web. One of the main constraints in this type of installation that customers should be aware of is that the system must integrate seamlessly with the existing IT environment. This is a non-trivial engineering problem requiring considerable time and skills. There is no off-the-shelf solution.
A totally different recent implementation in the cultural sphere has been the integration of SYSTRAN MT functionality into a web-based catalogue for Gaumont, one of the historic French film asset suppliers. Gaumont wanted to meet the growing demand for historical footage by marketing its huge stock of now digitized newsreels, which span 90 years of current events. To do so, it needed to provide global English access to the 120,000 French-language descriptive articles associated with the film sequences in its database, a source corpus of some 14 million words.
Luckily, Gaumont had already digitized the French source material over the past nine years. And we were able to develop a system for them in about nine months, spending most of the time on cleaning up and structuring the source files and building a terminology base, until we had proof of workability. The act of Englishing this mass of data to acceptable levels of quality for international film researchers means that Gaumont now has a business asset that forms a powerful basis for further localization solutions in the future.
In a quite different register, but only too understandable in the present geopolitical climate, the French Ministry of Defense has been looking for MT solutions for various Middle Eastern languages and we shall be building a team to address their specific needs over the next six months.
And of course, SYSTRAN continues to be deployed at the European Commission, more in a helpdesk capacity. This is a good example of an implementation where we do not as yet have full visibility of usage patterns and needs in order to optimize our solution and deliver real value for money.
These examples show that if you look at MT as a sort of fancy GUI, a view which has largely been encouraged by the first generation of web translation engines, then you just end up confused. When it comes to customer needs, there is simply no one-size-fits-all MT integration process, since every type of usage requires specific engineering, and all source content requires careful evaluation. MT delivers complex services, not out-of-the-box ones. But we believe that when they are developed properly, such services deliver long-term returns on investment that are unequalled by any apparently quicker fixes.
Pierre-Yves Foucou is Chief Technology Officer at SYSTRAN in France. He has an academic background in computational linguistics, and is a member of the Open Lexicon Interchange Format (OLIF) Consortium and the International Association for Machine Translation (IAMT).
The LISA Newsletter: Globalization Insider. Number 3.1 – Copyright © 2002, SMP Marketing Sarl. All Rights Reserved.