Laboratoire d'Informatique de l'Université du Maine


Research activities in SMT at LIUM

Summary of our research activities in statistical machine translation

Two permanent researchers, three postdocs and five PhD students are currently working on various aspects of statistical machine translation (SMT). Please contact me if you are interested in working with us during a short visit or as a postdoc, or if you want to pursue a PhD. Several funding options are available.

Our main interests are described below.

The group is currently involved in several projects. Statistical methods are very compute-intensive, so we have acquired a large cluster to support our research. We have successfully participated in several international evaluations, namely NIST'08 (Ar/En), NIST'09 (Ar/En and Zh/En), WMT (Fr <-> En, since 2008) and IWSLT (English-to-French speech translation).

Statistical machine translation is now considered a serious alternative to rule-based machine translation (RBMT). While RBMT systems rely on rules and linguistic resources built for that purpose, SMT systems can be developed without any language-specific expertise: they are based only on bilingual sentence-aligned data (bitext) and large monolingual texts. However, while monolingual data is usually available in large amounts, bilingual texts are a scarce resource for most language pairs.

The ability to build an MT system from aligned bilingual texts alone is generally cited as an advantage of SMT. It can also be a handicap. For some language pairs, e.g. Japanese/Spanish, bilingual corpora simply do not exist, or the existing corpora are too small or too far out of domain to build a good SMT system. The performance of an SMT system depends heavily on the parallel corpus used for training, and more bitext generally leads to better performance. Existing parallel corpora cover few language pairs and mostly come from a single domain (proceedings of the Canadian or European Parliament, or of the United Nations). This is particularly problematic when SMT systems trained on such corpora are used for general-purpose translation, since the jargon that dominates these corpora is not appropriate for everyday text or for other domains. We are working on three complementary topics that try to tackle these problems.

First, we are interested in new representations of the language and translation models that are expected to take better advantage of the available training material and to generalize better to unseen words and phrases. The basic idea of the continuous space language model is to project the word indices onto a continuous space and to use a probability estimator operating on this space. Since the resulting probability functions are smooth functions of the word representation, better generalization to unknown events can be expected. A neural network can be used to simultaneously learn the projection of the words onto the continuous space and to estimate the n-gram probabilities. This is still an n-gram approach, but the LM probabilities are interpolated for any possible context of length n-1 instead of backing off to shorter contexts. This approach was successfully used in large vocabulary continuous speech recognition (Schwenk and Gauvain ICASSP'02; Schwenk CSL'07) and in phrase-based SMT systems (Schwenk et al. ACL'06; Schwenk PBML'10). We are currently working on applying this idea to the translation model; first results are reported in (Schwenk et al. EMNLP'07). An open source implementation of this model is available.
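To make the continuous space idea concrete, here is a minimal sketch in Python/NumPy of the forward pass of such a model. It is an illustration under stated assumptions, not the group's implementation: the function names, the single tanh hidden layer, the layer sizes and the random initialization are all hypothetical, and training (e.g. by back-propagation) is omitted.

```python
import numpy as np

def init_params(vocab_size, context_len, proj_dim, hidden_dim, seed=0):
    """Randomly initialised parameters (hypothetical sizes, no training)."""
    rng = np.random.default_rng(seed)
    return {
        # One continuous vector per word: the projection layer.
        "R": rng.normal(0.0, 0.1, (vocab_size, proj_dim)),
        "W_h": rng.normal(0.0, 0.1, (context_len * proj_dim, hidden_dim)),
        "b_h": np.zeros(hidden_dim),
        "W_o": rng.normal(0.0, 0.1, (hidden_dim, vocab_size)),
        "b_o": np.zeros(vocab_size),
    }

def next_word_probs(params, context):
    """Return P(w | context) for every word w.

    `context` holds the n-1 preceding word indices.
    """
    # Project each context word and concatenate the vectors.
    x = params["R"][context].reshape(-1)
    h = np.tanh(x @ params["W_h"] + params["b_h"])
    logits = h @ params["W_o"] + params["b_o"]
    # Numerically stable softmax over the whole vocabulary.
    logits = logits - logits.max()
    p = np.exp(logits)
    return p / p.sum()

# Toy usage: a 4-gram model (3 context words) over a 10-word vocabulary.
params = init_params(vocab_size=10, context_len=3, proj_dim=4, hidden_dim=8)
probs = next_word_probs(params, [1, 2, 3])
```

Because the output is a smooth function of the projected context vectors, every context of length n-1 receives a full distribution over the vocabulary, including contexts never seen in training; no back-off to shorter contexts is involved.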

A comparable corpus is usually defined as a collection of texts in several languages that cover the same topic but are not necessarily translations of each other. Typical examples of comparable corpora are Wikipedia or articles from international news agencies like AFP, Xinhua or BBC. We have developed an easy-to-implement and effective algorithm to automatically extract parallel sentences from comparable corpora (Rauf et al. EACL'09; Rauf et al. MT'11).

Our group was the first to apply large-scale unsupervised training to machine translation (Schwenk IWSLT'08). The idea is to use an SMT system to translate large amounts of monolingual data (up to 300M words). After some filtering, the resulting translations can be used to retrain a new SMT system. When a comparable corpus is available, we call this method lightly-supervised training, since the target language texts can be used to provide useful information in the language model. This technique is now applied in many of our systems (Schwenk et al. MT Summit'09; Schwenk TALN'10; our NIST'09 system, etc.). Recent research has shown that it is better to translate from the target to the source language (Lambert et al. WMT'11).
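The translate-filter-retrain loop described above can be sketched as follows. This is only an illustration: the text does not say how the filtering is done, so the length-normalised score threshold used here is an assumption, and `build_synthetic_bitext`, `translate` and `min_score` are hypothetical names rather than part of any existing toolkit.

```python
from typing import Callable, Iterable, List, Tuple

def build_synthetic_bitext(
    sentences: Iterable[str],
    translate: Callable[[str], Tuple[str, float]],
    min_score: float,
) -> List[Tuple[str, str]]:
    """Translate monolingual sentences and keep the confident pairs.

    `translate` stands in for an existing SMT system and is assumed to
    return (translation, log-score). The filtering rule below is an
    assumption made for this sketch.
    """
    bitext = []
    for sentence in sentences:
        hypothesis, score = translate(sentence)
        # Normalise by length so that long sentences are not penalised.
        n_words = max(len(hypothesis.split()), 1)
        if score / n_words >= min_score:
            bitext.append((sentence, hypothesis))
    return bitext
```

The returned sentence pairs would then be added to the existing bitext before retraining. Which side of the pair is machine-generated depends on the translation direction chosen.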

We are also working on several methods to adapt a generic SMT system to a particular topic. The available monolingual and parallel data often come from different sources that are more or less appropriate to the translation task. These are usually referred to as in-domain and out-of-domain data. More generally, in many practical applications parallel data is quite inhomogeneous with respect to factors such as data source, alignment quality and appropriateness to the task. We have developed a general framework to take these factors into account.

We are also interested in the use of linguistic knowledge to improve a statistical machine translation system, in particular to deal with languages like Chinese.

Finally, we are working on modalities other than text as input to our translation systems, in particular speech. In cooperation with colleagues from the speech group, we have developed a complete speech translation system for lectures. This system was ranked first in the 2011 IWSLT evaluation.

Compute cluster


Our research is supported by a large Linux cluster owned by the group:
    36 machines with 2 CPUs each (4 or 6 cores per CPU)
    a total of 360 cores and 3.8 terabytes of main memory
    100 terabytes of RAID 5 storage
©2007 Lium - Université du Maine - France