Contact
LIUM
Institut Claude Chappe
Avenue Laënnec
72085 Le Mans Cedex 9
tel.: +33 2.43.83.38.63
fax : +33 2.43.83.38.68
Summary of our research activities in statistical machine translation
Two permanent researchers, three postdocs and five PhD students are currently working on
various aspects of statistical machine translation (SMT). Please contact me if you are interested in working
with us during a short visit or a postdoc, or if you want
to get involved in a PhD. Several funding
possibilities are available.
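Our main interests are in continuous space language and translation models,
comparable corpora, unsupervised and lightly-supervised training, domain
adaptation, the use of linguistic knowledge, and speech translation.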
The group is currently involved in several
projects.
Statistical methods are very compute-intensive. We have therefore acquired a large
cluster to support our research.
We have successfully participated in several international evaluations, namely
NIST'08 (Ar/En),
NIST'09 (Ar/En and Zh/En),
WMT (Fr <-> En since 2008),
and IWSLT (English to French speech translation).
Statistical machine translation is today considered a serious alternative to
rule-based machine translation (RBMT). While RBMT systems rely on rules and
linguistic resources built for that purpose, SMT systems can be developed
without any language-specific expertise and are based only on
bilingual sentence-aligned data (bitexts) and large monolingual texts.
However, while monolingual data is usually available in large amounts,
bilingual texts are a scarce resource for most language pairs.
The possibility of developing an MT system using only aligned bilingual texts is
generally mentioned as an advantage of SMT systems. On the other hand, this
can also be a handicap for the approach. For some language pairs, e.g.
Japanese/Spanish, bilingual corpora simply do not exist, or the existing corpora are
too small or too far out of domain to build a good SMT system. The performance of an
SMT system depends heavily on the parallel corpus used for training. Generally, more
bitexts lead to better performance. Currently available parallel corpora
cover few language pairs and mostly come from one domain (proceedings of the
Canadian or European Parliament, or of the United Nations). This becomes
particularly problematic when SMT systems trained on such corpora are used for
general translation, since the jargon that dominates these corpora is
not appropriate for everyday translations or for translations in other
domains. We are working on three complementary topics that try to tackle these
problems.
First, we are interested in new representations of the
language and translation models that are expected to take better advantage of
the available training material and to generalize better to unseen words and
phrases. The basic idea of the continuous space language model is to
project the word indices onto a continuous space and to use a probability
estimator operating on this space. Since the resulting probability functions
are smooth functions of the word representation, better generalization to
unknown events can be expected. A neural network can be used to simultaneously
learn the projection of the words onto the continuous space and to estimate the
n-gram probabilities. This is still an n-gram approach, but the LM
probabilities are interpolated for any possible context of length n-1
instead of backing off to shorter contexts. This approach was successfully
used in large vocabulary continuous speech recognition (Schwenk and Gauvain
ICASSP'02; Schwenk CSL'07) and in phrase-based SMT systems (Schwenk et al.
ACL'06; Schwenk PBML'10). We are currently working on applying this
idea to the translation model. Initial work is reported in (Schwenk et al.
EMNLP'07). An open source implementation of this model is available.
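The forward pass of such a model can be sketched in a few lines. This is only
an illustration under stated assumptions, not the group's open source
implementation: the class name, the layer sizes, the random untrained weights
and the omission of bias terms are all choices made for this example. It shows
the projection of the context words onto a continuous space, a hidden layer,
and a softmax that yields a probability for every word given any context of
length n-1.

```python
import math
import random


class TinyContinuousSpaceLM:
    """Illustrative feed-forward n-gram LM (hypothetical; weights are untrained)."""

    def __init__(self, vocab, order=3, dim=4, hidden=8, seed=0):
        rng = random.Random(seed)
        self.vocab = list(vocab)
        self.index = {w: i for i, w in enumerate(self.vocab)}
        self.order = order

        def rand_matrix(rows, cols):
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)]
                    for _ in range(rows)]

        size = len(self.vocab)
        self.proj = rand_matrix(size, dim)  # one continuous vector per word
        self.w_hidden = rand_matrix(hidden, dim * (order - 1))
        self.w_out = rand_matrix(size, hidden)

    def distribution(self, context):
        """Return P(w | context) for every word w; context holds order-1 words."""
        if len(context) != self.order - 1:
            raise ValueError("context must contain order-1 words")
        # Projection layer: concatenate the vectors of the context words.
        x = [v for word in context for v in self.proj[self.index[word]]]
        # Hidden layer with a tanh non-linearity (biases omitted for brevity).
        h = [math.tanh(sum(w * xi for w, xi in zip(row, x)))
             for row in self.w_hidden]
        # Output layer, then a numerically stable softmax over the vocabulary.
        scores = [sum(w * hi for w, hi in zip(row, h)) for row in self.w_out]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        return {w: e / total for w, e in zip(self.vocab, exps)}


lm = TinyContinuousSpaceLM(["<s>", "the", "cat", "sat", "</s>"], order=3)
probs = lm.distribution(["the", "cat"])
```

Because every context is mapped to a point in the continuous space, the model
returns a full distribution even for a context never seen in training, which
is why no back-off to shorter contexts is needed. In a real system the three
weight matrices would be learned jointly from data.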
A comparable corpus is usually defined as a
collection of texts in several languages that cover the same topic but are
not necessarily translations of each other. Typical examples of comparable
corpora are Wikipedia or articles from international news agencies like AFP,
Xinhua or BBC. We have developed an easy-to-implement and effective algorithm
to automatically extract parallel sentences from comparable corpora (Rauf et al.
EACL'09; Rauf et al. MT'11).
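The text above does not describe the algorithm itself, so the following is
only a sketch of one way such an extraction could work, not necessarily the
method of the cited papers. It assumes that the source-language sentences have
already been translated automatically by some MT system, and it keeps a
candidate target sentence when its word error rate against that automatic
translation is low. The function names, the exhaustive candidate search and
the 0.3 threshold are assumptions made for illustration.

```python
def word_error_rate(hyp, ref):
    """Word-level edit distance between hyp and ref, divided by len(ref)."""
    h, r = hyp.split(), ref.split()
    prev = list(range(len(r) + 1))
    for i, hw in enumerate(h, start=1):
        cur = [i]
        for j, rw in enumerate(r, start=1):
            cur.append(min(prev[j] + 1,                # drop a hyp word
                           cur[j - 1] + 1,             # insert a ref word
                           prev[j - 1] + (hw != rw)))  # substitute or match
        prev = cur
    return prev[-1] / max(len(r), 1)


def extract_parallel(translated_source, candidates, max_wer=0.3):
    """Pair each source sentence with its closest candidate if close enough.

    translated_source: list of (source_sentence, automatic_translation).
    candidates: target-language sentences from the comparable corpus.
    """
    if not candidates:
        return []
    pairs = []
    for source, translation in translated_source:
        best = min(candidates, key=lambda c: word_error_rate(translation, c))
        if word_error_rate(translation, best) <= max_wer:
            pairs.append((source, best))
    return pairs


translated = [
    ("le chat dort sur le canapé", "the cat sleeps on the sofa"),
    ("il pleut à Paris", "it rains in Paris"),
]
candidates = [
    "the cat sleeps on the couch",
    "stock markets fell sharply today",
]
pairs = extract_parallel(translated, candidates)
```

Here only the first source sentence finds a close match; the second is
discarded because no candidate resembles its translation. Comparing every
translation with every candidate is quadratic, so a real system would first
narrow down the candidates in some way.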
Our group was the first to apply large-scale
unsupervised training to machine translation (Schwenk IWSLT'08). The
idea is to use an SMT system to translate large amounts of monolingual data (up to
300M words). After some filtering, the resulting translations can be used to
train a new SMT system. When a comparable corpus is available, we call
this method lightly-supervised training since the target language texts
can be used to provide useful information in the language model. This
technique is now applied in many of our systems (Schwenk et al. MT
Summit'09; Schwenk TALN'10; our NIST'09 system, etc.). Recent research
has shown that it is better to translate from the target to the source language
(Lambert et al., WMT'11).
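The translate, filter and retrain loop can be sketched as follows. This is a
minimal illustration under stated assumptions: the translate callable, its
confidence score, the threshold and the toy word-by-word lexicon are all
hypothetical stand-ins, since the text does not specify the filtering
criterion or the interface of the SMT system.

```python
def build_synthetic_bitext(monolingual, translate, min_score):
    """Translate monolingual sentences and keep the confident pairs.

    translate: callable returning (translation, score) for one sentence;
    it stands in for an existing SMT system (hypothetical interface).
    """
    kept = []
    for sentence in monolingual:
        translation, score = translate(sentence)
        # Filtering step: the score-based criterion is an assumption.
        if score >= min_score:
            kept.append((sentence, translation))
    return kept


# Toy stand-in for a real SMT system: word-by-word lookup, with the share
# of known words used as a crude confidence score.
TOY_LEXICON = {"le": "the", "chat": "cat", "dort": "sleeps", "chien": "dog"}


def toy_translate(sentence):
    words = sentence.split()
    known = [w for w in words if w in TOY_LEXICON]
    translation = " ".join(TOY_LEXICON.get(w, w) for w in words)
    return translation, len(known) / max(len(words), 1)


monolingual = ["le chat dort", "le chien aboie fort"]
synthetic = build_synthetic_bitext(monolingual, toy_translate, min_score=0.8)
```

The pairs that survive the filter would be added to the existing bitexts
before a new system is trained. Each pair holds the original sentence and its
automatic translation; which of the two serves as the source side depends on
the translation direction chosen.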
We are also working on several methods to adapt a
generic SMT system to a particular topic. Often we have monolingual and
parallel data that come from different sources, which are more or less
appropriate to the translation task. This is usually referred to as
in-domain and out-of-domain data. More generally, in many practical
applications parallel data is quite inhomogeneous with respect to
several factors such as data source, alignment quality, appropriateness to the
task, etc. We have developed a general framework to take these factors into
account.
We are also interested in the use of
linguistic knowledge to improve a
statistical machine translation system, in particular to deal with languages like Chinese.
Finally, we are working on input modalities other than text for
our translation systems, in particular speech. In cooperation with
colleagues from the speech group, we have developed a
complete speech translation system for
lectures. This system was ranked first in the 2011 IWSLT evaluation.
Compute cluster
Our research is supported by a large Linux cluster owned by the group:
36 machines with 2 CPUs each (4 or 6 cores)
360 cores in total, 3.8 terabytes of main memory
100 terabytes of RAID 5 storage
©2007 Lium - Université du Maine - France