Haithem Afli, Loïc Barrault, and Holger Schwenk.
Traduction automatique à partir de corpus comparables:
extraction de phrases parallèles à partir de données comparables
multimodales.
In Traitement du Langage Naturel, pages 447-454, 2012.
P. Lambert, J. Senellart, L. Romary, H. Schwenk, F. Zipser, P. Lopez, and
F. Blain.
Collaborative machine translation service for scientific
texts.
In EACL demonstration session, pages 11-15, Avignon
(France), 2012.
Patrik Lambert, Holger Schwenk, and Frédéric Blain.
Automatic translation of scientific documents in the HAL
archive.
In LREC, pages 3933-3926, 2012.
Christophe Servan, Patrik Lambert, Anthony Rousseau, Holger Schwenk, and Loïc
Barrault.
LIUM's SMT machine translation systems for WMT 2012.
In Proceedings of the Seventh Workshop on Statistical
Machine Translation, pages 369-373, 2012.
Holger Schwenk, Anthony Rousseau, and Mohammed Attik.
Large, pruned or continuous space language models on a GPU
for statistical machine translation.
In NAACL-HLT workshop on the Future of Language Modeling
for HLT, pages 11-19, 2012.
Haithem Afli, Loïc Barrault, and Holger Schwenk.
Parallel text extraction from multimodal comparable
corpora.
In 8th International Conference on Natural Language
Processing, pages 40-51. Springer, Heidelberg, 2012.
Kashif Shah, Loïc Barrault, and Holger Schwenk.
A general framework to weight heterogeneous parallel data
for model adaptation in statistical machine translation.
2012.
Frédéric Blain, Holger Schwenk, and Jean Sénellart.
Incremental adaptation using translation information and
post-editing analysis.
In International Workshop on Spoken Language Translation,
2012.
Walid Aransa, Holger Schwenk, and Loïc Barrault.
Semi-supervised transliteration mining from parallel and
comparable corpora.
In International Workshop on Spoken Language Translation,
2012.
Holger Schwenk.
Continuous space translation models for phrase-based
statistical machine translation.
In Coling, 2012.
Holger Schwenk, Patrik Lambert, Loïc Barrault, Christophe Servan, Sadaf
Abdul-Rauf, Haithem Afli, and Kashif Shah.
LIUM's SMT machine translation systems for WMT 2011.
In Proceedings of the Sixth Workshop on Statistical Machine
Translation, pages 464-469, Edinburgh, Scotland, July 2011. Association for
Computational Linguistics.
Sadaf Abdul Rauf and Holger Schwenk.
Parallel sentence generation from comparable corpora for
improved SMT.
Machine Translation, 25(4):341-375, 2011.
Patrik Lambert, Holger Schwenk, Christophe Servan, and Sadaf Abdul-Rauf.
Investigations on translation model adaptation using
monolingual data.
pages 284-293, 2011.
Frédéric Blain, Jean Senellart, Holger Schwenk, Mirko Plitt, and Johann
Roturier.
Qualitative analysis of post-editing for high quality
machine translation.
In Asia-Pacific Association for Machine Translation (AAMT),
editor, Machine Translation Summit XIII, Xiamen (China), 19-23 sept.
2011.
Kashif Shah, Loïc Barrault, and Holger Schwenk.
Parametric weighting of parallel data for statistical
machine translation.
In The 5th International Joint Conference on Natural
Language Processing, pages 1323-1331, Chiang Mai (Thailand), 2011.
Christophe Servan and Holger Schwenk.
Optimising multiple metrics with MERT.
The Prague Bulletin of Mathematical Linguistics (PBML),
(96):109-117, 2011.
A. Rousseau, F. Bougares, P. Deléglise, H. Schwenk, and Y. Estève.
LIUM's systems for the IWSLT 2011 speech translation
tasks.
In International Workshop on Spoken Language Translation,
2011.
Patrik Lambert, Sadaf Abdul-Rauf, and Holger Schwenk.
LIUM SMT machine translation system for WMT 2010.
In Proceedings of the Joint Fifth Workshop on Statistical
Machine Translation and MetricsMATR, pages 121-126, Uppsala, Sweden, July
2010.
Holger Schwenk.
Continuous space language models for statistical machine
translation.
The Prague Bulletin of Mathematical Linguistics,
(93):137-146, 2010.
Abstract:
This paper describes an open-source implementation of the so-called continuous space language model and its application to statistical machine translation. The underlying idea of this approach is to attack the data sparseness problem by performing the language model probability estimation in a continuous space. The projection of the words and the probability estimation are both performed by a multi-layer neural network. This paper describes the theoretical background of the approach, efficient algorithms to handle the computational complexity, and gives implementation details and reports experimental results on a variety of tasks.
Patrik Lambert, Sadaf Abdul-Rauf, and Holger Schwenk.
LIUM SMT machine translation system for WMT 2010.
pages 127-132, 2010.
Abstract:
This paper describes the development of French-English and English-French machine translation systems for the 2010 WMT shared task evaluation. These systems were standard phrase-based statistical systems based on the Moses decoder, trained on the provided data only. Most of our efforts were devoted to the choice and extraction of bilingual data used for training. We filtered out some bilingual corpora and pruned the phrase table. We also investigated the impact of adding two types of additional bilingual texts, extracted automatically from the available monolingual data. We first collected bilingual data by performing automatic translations of monolingual texts. The second type of bilingual text was harvested from comparable corpora with Information Retrieval techniques.
Holger Schwenk.
Adaptation d'un système de traduction automatique
statistique avec des ressources monolingues.
In Traitement du Langage Naturel, page in press, 2010.
Abstract:
The performance of a statistical machine translation system depends a lot on the quality and quantity of the available training data. Most of the existing, easily available parallel texts come from international organizations and the jargon observed in those texts is not very appropriate to build a machine translation system for other domains. In this paper, we present a technique to automatically adapt the translation model to a new domain using monolingual data in the source language only. We observe significant improvements in the BLEU score in statistical machine translation systems from Arabic to French and English respectively.
Francisco Zamora-Martínez, María José Castro-Bleda, and Holger
Schwenk.
N-gram-based machine translation enhanced with neural
networks for the French-English BTEC-IWSLT'10 task.
In International Workshop on Spoken Language Translation,
pages 45-52, 2010.
Y. Estève, P. Deléglise, S. Meignier, S. Petitrenaud, H. Schwenk, L. Barrault,
F. Bougares, R. Dufour, V. Jousse, A. Laurent, and A. Rousseau.
Some recent research work at LIUM based on the use of CMU
Sphinx.
In CMU SPUD Workshop, March 13, 2010.
Sadaf Abdul Rauf and Holger Schwenk.
On the use of comparable corpora to improve SMT
performance.
In Proceedings of the Conference of the European Chapter of
the Association for Computational Linguistics, pages 16-23, 2009.
Abstract:
We present a simple and effective method for extracting parallel sentences from comparable corpora. We employ a statistical machine translation (SMT) system built from small amounts of parallel texts to translate the source side of the non-parallel corpus. The target side texts are used, along with other corpora, in the language model of this SMT system. We then use information retrieval techniques and simple filters to create French/English parallel data from comparable news corpora. We evaluate the quality of the extracted data by showing that it significantly improves the performance of an SMT system.
Holger Schwenk, Sadaf Abdul-Rauf, Loïc Barrault, and Jean Senellart.
SMT and SPE machine translation systems for WMT'09.
In Fourth ACL Workshop on Statistical Machine Translation,
pages 130-134, 2009.
Abstract:
This paper describes the development of several machine translation systems for the 2009 WMT shared task evaluation. We only consider the translation between French and English. We describe a statistical system based on the Moses decoder and a statistical post-editing system using SYSTRAN's rule-based system. We also investigated techniques to automatically extract additional bilingual texts from comparable corpora.
Sadaf Abdul Rauf and Holger Schwenk.
Exploiting comparable corpora with TER and TERp.
In 2nd Workshop on Building and Using Comparable Corpora:
from parallel to non-parallel corpora, 2009.
Abstract:
In this paper we present an extension of a simple and effective method for extracting parallel sentences from comparable corpora, and we apply it to an Arabic/English NIST system. We also compare our approach with that of (Munteanu and Marcu, 2005) using exactly the same corpora and show the same performance gain using much less data. Our approach employs an SMT system built from small amounts of parallel texts to translate the source side of the non-parallel corpus. The target side texts are used, along with other corpora, in the language model of this SMT system. We then use information retrieval techniques and simple filters to create parallel data from comparable news corpora. We evaluate the quality of the extracted data by showing that it significantly improves the performance of an SMT system.
Holger Schwenk and Jean Senellart.
Translation model adaptation for an Arabic/French news
translation system by lightly-supervised training.
In MT Summit, 2009.
Abstract:
Most of the existing, easily available parallel texts to train a statistical machine translation system are from international organizations that use a particular jargon. In this paper, we consider the automatic adaptation of such a translation model to the news domain. The initial system was trained on more than 200M words of UN bitexts. We then explore large amounts of in-domain monolingual texts to modify the probability distribution of the phrase-table and to learn new task-specific phrase-pairs. This procedure achieved an improvement of 3.5 points BLEU on the test set in an Arabic/French statistical machine translation system. This result compares favorably with other large state-of-the-art systems for this language pair.
Holger Schwenk, Loïc Barrault, Yannick Estève, and Patrik Lambert.
LIUM's statistical machine translation systems for IWSLT
2009.
In International Workshop on Spoken Language Translation,
pages 65-70, 2009.
Abstract:
This paper describes the systems developed by the LIUM laboratory for the 2009 IWSLT evaluation. We participated in the Arabic and Chinese to English BTEC tasks. We developed three different systems: a statistical phrase-based system using the Moses toolkit, a statistical post-editing system and a hierarchical phrase-based system based on Joshua. A continuous space language model was deployed to improve the modeling of the target language. These systems are combined by a confusion network based approach.
Holger Schwenk and Philipp Koehn.
Large and diverse language models for statistical machine
translation.
In International Joint Conference on Natural Language
Processing, pages 661-666, 2008.
Abstract:
This paper presents methods to combine large language models trained from diverse text sources and applies them to state-of-the-art French-English and Arabic-English machine translation systems. We show gains of over 2 BLEU points over a strong baseline by using continuous space language models in re-ranking.
Holger Schwenk, Jean-Baptiste Fouet, and Jean Senellart.
First steps towards a general purpose French/English
statistical machine translation system.
In Third ACL Workshop on Statistical Machine Translation,
pages 119-122, 2008.
Abstract:
This paper describes an initial version of a general purpose French/English statistical machine translation system. The main features of this system are the open-source Moses decoder, the integration of a bilingual dictionary and a continuous space target language model. We analyze the performance of this system on the test data of the WMT'08 evaluation.
Holger Schwenk and Yannick Estève.
Data selection and smoothing in an open-source system for
the 2008 NIST machine translation evaluation.
In Interspeech, pages 2727-2730, 2008.
Abstract:
This paper gives a detailed description of a statistical machine translation system developed for the 2008 NIST open MT evaluation. The system is based on the open source toolkit Moses with extensions for language model rescoring in a second pass. Significant improvements were obtained with data selection methods for the language and translation model. An improvement of more than 1 point BLEU on the test set was achieved by a continuous space language model which performs the probability estimation with a neural network. The described system has achieved a very good ranking in the 2008 NIST open MT evaluation.
Holger Schwenk, Yannick Estève, and Sadaf Abdul Rauf.
The LIUM Arabic/English statistical machine
translation system for IWSLT 2008.
In International Workshop on Spoken Language Translation,
pages 63-68, 2008.
Abstract:
This paper describes the system developed by the LIUM laboratory for the 2008 IWSLT evaluation. We only participated in the Arabic/English BTEC task. We developed a statistical phrase-based system using the Moses toolkit and SYSTRAN's rule-based translation system to perform a morphological decomposition of the Arabic words. A continuous space language model was deployed to improve the modeling of the target language. Both approaches achieved significant improvements in the BLEU score. The system achieves a score of 49.4 on the test set of the 2008 IWSLT evaluation.
Holger Schwenk.
Investigations on large-scale lightly-supervised training
for statistical machine translation.
In International Workshop on Spoken Language Translation,
pages 182-189, 2008.
Abstract:
Sentence-aligned bilingual texts are a crucial resource to build statistical machine translation (SMT) systems. In this paper we propose to apply lightly-supervised training to produce additional parallel data. The idea is to translate large amounts of monolingual data (up to 275M words) with an SMT system, and to use those translations as additional training data. Results are reported for the translation from French into English. We consider two setups: first, the initial SMT system is only trained with a very limited amount of human-produced translations, and second, the case where we have more than 100 million words. In both conditions, lightly-supervised training achieves significant improvements of the BLEU score.
Hélène Bonneau-Maynard, Alexandre Allauzen, Daniel Déchelotte, and Holger
Schwenk.
Combining morphosyntactic enriched representation with
n-best reranking in statistical translation.
In HLT/NAACL workshop on Syntax and Structure in Statistical
Translation, pages 65-71, April 2007.
Abstract:
The purpose of this work is to explore the integration of morphosyntactic information into the translation model itself, by enriching words with their morphosyntactic categories. We investigate word disambiguation using morphosyntactic categories, n-best hypotheses reranking, and the combination of both methods with word or morphosyntactic n-gram language model reranking. Experiments are carried out on the English-to-Spanish translation task. Using the morphosyntactic language model alone does not result in any improvement in performance. However, combining morphosyntactic word disambiguation with a word based 4-gram language model results in an improvement in the BLEU score of 0.6% on the development set and 0.3% on the test set.
Holger Schwenk, Marta R. Costa-jussà, and José A. R. Fonollosa.
Smooth bilingual n-gram translation.
In Empirical Methods in Natural Language Processing, pages
430-438, 2007.
Abstract:
We address the problem of smoothing translation probabilities in a bilingual N-gram-based statistical machine translation system. It is proposed to project the bilingual tuples onto a continuous space and to estimate the translation probabilities in this representation. A neural network is used to perform the projection and the probability estimation. Smoothing probabilities is most important for tasks with a limited amount of training material. We consider here the BTEC task of the 2006 IWSLT evaluation. Improvements in all official automatic measures are reported when translating from Italian to English. Using a continuous space model for the translation model and the target language model, an improvement of 1.5 BLEU on the test data is observed.
Holger Schwenk, Daniel Déchelotte, Hélène Bonneau-Maynard, and Alexandre
Allauzen.
Modèles statistiques enrichis par la syntaxe pour la
traduction automatique.
In Traitement du Langage Naturel, pages 253-262, 2007.
Abstract:
Phrase-based statistical machine translation is a promising approach. In this article we present two complementary developments. The first models the target language in a continuous space. The second integrates morpho-syntactic categories into the units handled by the translation model. Both approaches are evaluated on the Tc-Star task. The most interesting results are obtained by combining the two methods.
Daniel Déchelotte, Holger Schwenk, Gilles Adda, and Jean-Luc Gauvain.
Improved machine translation of speech-to-text outputs.
In Interspeech, pages 2441-2444, 2007.
Abstract:
Combining automatic speech recognition and machine translation is frequent in current research programs. This paper first presents several pre-processing steps to limit the performance degradation observed when translating an automatic transcription (as opposed to a manual transcription). Indeed, automatically transcribed speech often differs significantly from the machine translation system's training material, with respect to casing, punctuation and word normalization. The proposed system outperforms the best system at the 2007 TC-STAR evaluation by almost 2 points BLEU. The paper then attempts to determine a criterion characterizing how well an STT system's output can be translated, but the current experiments could only confirm that lower word error rates lead to better translations.
Holger Schwenk.
Building a statistical machine translation system for
French using the Europarl corpus.
In Second ACL Workshop on Statistical Machine Translation,
pages 189-192, 2007.
Abstract:
This paper describes the development of a statistical machine translation system based on the Moses decoder for the 2007 WMT shared tasks. Several different translation strategies were explored. We also use a statistical language model that is based on a continuous representation of the words in the vocabulary. By these means we expect to take better advantage of the limited amount of training data. Finally, we have investigated the usefulness of a second reference translation of the development data.
Holger Schwenk.
Continuous space language models.
Computer Speech and Language, 21:492-518, 2007.
Abstract:
This paper describes the use of a neural network language model for large vocabulary continuous speech recognition. The underlying idea of this approach is to attack the data sparseness problem by performing the language model probability estimation in a continuous space. Very efficient learning algorithms are described that enable the use of training corpora of several hundred million words. It is also shown that this approach can be incorporated into a large vocabulary continuous speech recognizer using a lattice rescoring framework at a very low additional processing time. The neural network language model has been thoroughly evaluated in a state-of-the-art large vocabulary continuous speech recognizer for several international benchmark tasks, in particular the NIST evaluations on broadcast news and conversational speech recognition. The new approach is compared to 4-gram back-off language models trained with modified Kneser-Ney smoothing which has been often reported to be the best known smoothing method. The neural network language model achieved consistent word error rate reductions for all considered tasks and languages, ranging from 0.5% to up to 1.6% absolute.
Daniel Déchelotte, Holger Schwenk, Hélène Bonneau-Maynard, Alexandre
Allauzen, and Gilles Adda.
A state-of-the-art statistical machine translation system
based on Moses.
In MT Summit, pages 127-133, 2007.
Abstract:
This paper describes a statistical machine translation system based on freely available programs such as Moses. Several new features were added, in particular a two-pass decoding strategy using n-best lists and a continuous space language model that aims at taking better advantage of the limited training data. We also investigated lexical disambiguation methods in the translation model based on POS information. The task considered in this work is the translation of the European Parliament Plenary Sessions between English and Spanish, in the framework of the Tc-star project. The described systems performed very well in the 2007 Tc-Star evaluation.
Patrik Lambert, Marta R. Costa-jussà, Josep M. Crego, Maxim Khalilov, José
B. Mariño, Rafael E. Banchs, José A.R. Fonollosa, and Holger Schwenk.
The TALP ngram-based SMT system for IWSLT 2007.
In International Workshop on Spoken Language Translation,
pages 169-174, 2007.
Abstract:
This paper describes TALPtuples, the 2007 N-gram-based statistical machine translation system developed at the TALP Research Center of the UPC (Universitat Politècnica de Catalunya) in Barcelona. Emphasis is put on improvements and extensions of the system of previous years. Mainly, these include optimizing alignment parameters as a function of translation metric scores and rescoring with a neural network language model. Results on two translation directions are reported, namely from Arabic and Chinese into English, thoroughly explaining all language-related preprocessing and translation schemes.
Evgeny Matusov, Gregor Leusch, Rafael E. Banchs, Nicola Bertoldi, Daniel
Déchelotte, Marcello Federico, Muntsin Kolss, Young-Suk Lee, José B.
Mariño, Matthias Paulik, Salim Roukos, Holger Schwenk, and Hermann Ney.
System combination for machine translation of spoken and
written language.
IEEE Transactions on Audio, Speech, and Language
Processing, 16(7):1222-1237, 2007.
Abstract:
This article describes an approach for computing a consensus translation from the outputs of multiple machine translation (MT) systems. The consensus translation is computed by weighted majority voting on a confusion network, similarly to the well-established ROVER approach of Fiscus [11] for combining speech recognition hypotheses. To create the confusion network, pairwise word alignments of the original MT hypotheses are learned using an enhanced statistical alignment algorithm that explicitly models word reordering. The context of a whole corpus of automatic translations rather than a single sentence is taken into account in order to achieve high alignment quality. The confusion network is rescored with a special language model, and the consensus translation is extracted as the best path. The proposed system combination approach was evaluated in the framework of the TC-STAR speech translation project. Up to six state-of-the-art statistical phrase-based translation systems from different project partners were combined in the experiments. Significant improvements in translation quality from Spanish to English and English to Spanish in comparison with the best of the individual systems were achieved under official evaluation conditions.
Holger Schwenk, Marta R. Costa-jussà, and José A. R. Fonollosa.
Continuous space language models for the IWSLT 2006 task.
In International Workshop on Spoken Language Translation,
pages 166-173, November 2006.
Abstract:
The language model of the target language plays an important role in statistical machine translation systems. In this work, we propose to use a new statistical language model that is based on a continuous representation of the words in the vocabulary. A neural network is used to perform the projection and the probability estimation. This kind of approach is particularly promising for tasks where only a very limited amount of resources is available, like the BTEC corpus of tourism related questions. This language model is used in two state-of-the-art statistical machine translation systems that were developed by UPC for the 2006 IWSLT evaluation campaign: a phrase- and an n-gram-based approach. An experimental evaluation for four different language pairs is provided (translation of Mandarin, Japanese, Arabic and Italian to English). The proposed method achieved improvements in the BLEU score of up to 3 points on the development data and of almost 2 points on the official test data.
Daniel Déchelotte, Holger Schwenk, and Jean-Luc Gauvain.
Transcription et traduction de débats parlementaires.
In Reconnaissance de Formes et Intelligence Artificielle,
2006.
Abstract:
This article presents a complete system for automatic translation of unconstrained speech. A statistical approach is used for both speech recognition and translation. The models, algorithms and optimizations used in the statistical translation system are described in detail. Results are presented for the transcription and translation of European Parliament debates, from English to Spanish and vice versa. They suggest that stochastic translation models are well suited to speech translation, owing to their observed relative robustness to the errors introduced by automatic recognition.
Holger Schwenk, Daniel Déchelotte, and Jean-Luc Gauvain.
Continuous space language models for statistical machine
translation.
In Proceedings of the COLING/ACL 2006 Main Conference Poster
Sessions, pages 723-730, 2006.
Abstract:
Statistical machine translation systems are based on one or more translation models and a language model of the target language. While many different translation models and phrase extraction algorithms have been proposed, a standard word n-gram back-off language model is used in most systems. In this work, we propose to use a new statistical language model that is based on a continuous representation of the words in the vocabulary. A neural network is used to perform the projection and the probability estimation. We consider the translation of European Parliament Speeches. This task is part of an international evaluation organized by the Tc-Star project in 2006. The proposed method achieves consistent improvements in the BLEU score on the development and test data.
We also present algorithms to improve the estimation of the language model probabilities when splitting long sentences into shorter chunks.
Daniel Déchelotte, Holger Schwenk, and Jean-Luc Gauvain.
The 2006 LIMSI statistical machine translation system for
Tc-Star.
In Tc-Star Speech to Speech Translation Workshop,
Barcelona, pages 25-30, 2006.
Abstract:
This paper presents the LIMSI statistical machine translation system developed for the 2006 Tc-Star evaluation campaign. We describe an A*-decoder that generates translation lattices using a word-based translation model. A lattice is a rich and compact representation of alternative translations that includes the probability scores of all the involved sub-models. These lattices are then used in subsequent processing steps, in particular to perform sentence splitting and joining, maximum BLEU training and to use improved statistical target language models.
Lori Lamel, Jean-Luc Gauvain, Gilles Adda, Claude Barras, Eric Bilinski,
Olivier Galibert, Agusti Pujol, Holger Schwenk, and Xuan Zhu.
The LIMSI 2006 Tc-Star transcription systems.
In Tc-Star Speech to Speech Translation Workshop,
Barcelona, pages 123-128, 2006.
Abstract:
This paper describes the speech recognizers evaluated in the TC-STAR Second Evaluation Campaign held in January-February 2006. Systems were developed to transcribe parliamentary speeches in English and Spanish, as well as Broadcast news in Mandarin Chinese. The speech recognizers are state-of-the-art systems using multiple decoding passes with models (lexicon, acoustic models, language models) trained for the different transcription tasks. Compared to the LIMSI TC-STAR 2005 European Parliament Plenary Sessions (EPPS) systems, relative word error rate reductions of about 30% have been achieved on the 2006 development data. The word error rates with the LIMSI systems on the 2006 EPPS evaluation data are 8.2% for English and 7.8% for Spanish. The character error rate for Mandarin for a joint system submission with the University of Karlsruhe was 9.8%. Experiments with cross-site adaptation and system combination are also described.
Spyros Matsoukas, Jean-Luc Gauvain, Gilles Adda, Thomas Colthurst, Chia-Lin
Kao, Owen Kimball, Lori Lamel, Fabrice Lefevre, Jeff Ma, John Makhoul, Long
Nguyen, Rohit Prasad, Richard Schwartz, Holger Schwenk, and Bing Xiang.
Advances in transcription of broadcast news and
conversational telephone speech within the combined EARS BBN/LIMSI system.
IEEE Transactions on Audio, Speech, and Language
Processing, 14:1541-1556, 2006.
Abstract:
This paper describes the progress made in the transcription of Broadcast News (BN) and Conversational Telephone Speech (CTS) within the combined BBN/LIMSI system from May 2002 to September 2004. During that period, BBN and LIMSI collaborated in an effort to produce significant reductions in the word error rate (WER), as directed by the aggressive goals of the DARPA EARS (Effective, Affordable, Reusable, Speech-to-text) program. The paper focuses on general modeling techniques that led to recognition accuracy improvements, as well as engineering approaches that enabled efficient use of large amounts of training data and fast decoding architectures. Special attention is given to efforts to integrate components of the BBN and LIMSI systems, discussing the trade-off between speed and accuracy for various system combination strategies. Results on the EARS progress test sets show that the combined BBN/LIMSI system achieved relative reductions of 47% and 51% on the BN and CTS domains, respectively.
Daniel Déchelotte, Holger Schwenk, Jean-Luc Gauvain, Olivier Galibert, and
Lori Lamel.
Investigating translation of parliament speeches.
In IEEE Workshop on Automatic Speech Recognition and
Understanding, pages 116-120, 2005.
This paper reports on recent experiments for speech to text (STT) translation of European Parliamentary speeches. A Spanish speech to English text translation system has been built using data from the TC-STAR European project. The speech recognizer is a state-of-the-art multipass system trained for the Spanish EPPS task and the statistical translation system relies on the IBM-4 model. First, MT results are compared using manual transcriptions and 1-best ASR hypotheses with different word error rates. Then, an n-best interface between the ASR and MT components is investigated to improve the STT process. Derivation of the fundamental equation for machine translation suggests that the source language model is not necessary for STT. This was investigated by using weak source language models and by n-best rescoring adding the acoustic model score only. A significant loss in the BLEU score was observed suggesting that the source language model is needed given the insufficiencies of the translation model. Adding the source language model score in the n-best rescoring process recovers the loss and slightly improves the BLEU score over the 1-best ASR hypothesis. The system achieves a BLEU score of 37.3 with an ASR word error rate of 10% and a BLEU score of 40.5 using the manual transcripts.
J.-L. Gauvain, Gilles Adda, Lori Lamel, F. Lefèvre, and Holger Schwenk.
Transcription de la parole conversationnelle.
Traitement Automatique des Langues, 45(3), 2005.
This article describes the development of a conversational speech recognition system, starting from a state-of-the-art system for broadcast news transcription. We describe the main improvements made to the acoustic models, the language models, and the decoder. For acoustic modeling, our work focused on introducing speaker normalization, using adaptive and discriminative training techniques, and better accounting for pronunciation variants. For language modeling, the main difficulty is the small amount of available training data. We introduce two techniques that reduce the impact of this limitation on system performance: the selection of conversational-style texts and a model that represents words in a continuous space. The transcription is obtained by consensus decoding on a word lattice. These improvements reduced the error rate from 51% to 21%.
Yoshua Bengio, Holger Schwenk, Jean-Sébastien Senécal, Frédéric Morin, and
Jean-Luc Gauvain.
Neural probabilistic language models, 2005.
Chapter 6 of the book “Innovations in Machine Learning: Theory and Applications”, D. Holmes and L. C. Jain, editors, Springer-Verlag.
Jean-Luc Gauvain, Gilles Adda, Martine Adda-Decker, Alexandre Allauzen,
Veronique Gendner, Lori Lamel, and Holger Schwenk.
Where are we in transcribing BN French?
In Eurospeech, pages 1665-1668, 2005.
Given the highly inflected nature of the French language, transcribing French broadcast news (BN) is more challenging than English BN. This is in part due to the large number of homophones in the inflected forms. This paper describes the development of a recognition system for processing broadcast news speech in French. The resulting system was evaluated in the first French Technolangue ASR benchmark test. This system runs in about 7xRT and achieved the lowest word error rate in this evaluation, 11.9%. We also report on a 1xRT version of this system. The main differences between the English and French BN systems are: a 200k vocabulary to overcome the lower lexical coverage in French, a case sensitive language model, and the use of a POS based language model to lower the impact of homophonic gender and number disagreement.
Holger Schwenk and Jean-Luc Gauvain.
Building continuous space language models for transcribing
European languages.
In Eurospeech, pages 737-740, 2005.
Large vocabulary continuous speech recognizers for English Broadcast News today achieve word error rates below 10%. An important factor for this success is the availability of large amounts of acoustic and language modeling training data. In this paper the recognition of French Broadcast News and English and Spanish parliament speeches is addressed, tasks for which fewer resources are available. A neural network language model is applied that takes better advantage of the limited amount of training data. This approach performs the estimation of the probabilities in a continuous space, and better generalization to unknown n-grams can be expected. Word error reductions of up to 0.9% absolute are reported with respect to a carefully tuned back-off language model trained on the same data.
Lori Lamel, Holger Schwenk, Jean-Luc Gauvain, Gilles Adda, and Eric Bilinski.
Improvements in transcribing lectures and seminars.
In 2nd Joint Workshop on Multimodal Interaction and Related
Machine Learning Algorithms, 2005.
This paper describes recent research carried out in the context of the FP6 Integrated Project Chil (chil.server.de) on developing a system to automatically transcribe lectures and seminars. Widely available corpora were used to train both the acoustic and language models, since only a small amount of Chil data was available for system development. For language model training, text materials come from a variety of online conference proceedings and a neural network language model has been used to take better advantage of the limited data.
Holger Schwenk and Jean-Luc Gauvain.
Training neural network language models on very large
corpora.
In Empirical Methods in Natural Language Processing, pages
201-208, 2005.
In recent years there has been growing interest in using neural networks for language modeling. In contrast to the well known back-off n-gram language models, the neural network approach attempts to overcome the data sparseness problem by performing the estimation in a continuous space. This type of language model was mostly used for tasks for which only a very limited amount of in-domain training data is available. In this paper we present new algorithms to train a neural network language model on very large text corpora. This makes it possible to use the approach in domains where several hundred million words of text are available. The neural network language model is evaluated in a state-of-the-art real-time continuous speech recognizer for French Broadcast News. Word error reductions of 0.5% absolute are reported using only a very limited amount of additional processing time.
Holger Schwenk and Jean-Luc Gauvain.
Neural network language models for conversational speech
recognition.
In International Conference on Speech and Language
Processing, pages 1215-1218, 2004.
Recently there has been growing interest in using neural networks for language modeling. In contrast to the well known backoff n-gram language models (LM), the neural network approach tries to limit the problems caused by data sparseness by performing the estimation in a continuous space, thereby allowing smooth interpolations. This type of LM is therefore interesting for tasks for which only a very limited amount of in-domain training data is available, such as the modeling of conversational speech. In this paper we analyze the generalization behavior of the neural network LM for in-domain training corpora varying from 7M to over 21M words. In all cases, significant word error reductions were observed compared to a carefully tuned 4-gram backoff language model in a state-of-the-art conversational speech recognizer for the NIST rich transcription evaluations. We also apply ensemble learning methods and discuss their connections with LM interpolation.
Holger Schwenk and J.-L. Gauvain.
Using neural network language models for LVCSR.
In 2004 Rich Transcriptions Workshop, Palisades, NY, 2004.
In this paper we describe how to use a neural network language model for the BN and CTS tasks in the RT04 evaluation. The new approach performs the estimation of the language model probabilities in a continuous space, thereby allowing smooth interpolations. Details are given on training data selection, fast training and decoding algorithms, and parameter estimation. The neural network language model achieved word error reductions of 0.5% for the CTS task and of 0.3% for the BN task with an additional decoding cost of 0.05xRT.
Jean-Luc Gauvain, Abdel Messaoudi, and Holger Schwenk.
Language recognition using phone lattices.
In International Conference on Speech and Language
Processing, pages 1283-1286, 2004.
This paper proposes a new phone lattice based method for automatic language recognition from speech data. By using phone lattices, some approximations usually made by language identification (LID) systems relying on phonotactic constraints to simplify the training and decoding processes can be avoided. We demonstrate that the use of phone lattices both in training and testing significantly improves the accuracy of a phonotactically based LID system. Performance is further enhanced by using a neural network to combine the results of multiple phone recognizers. Using three phone recognizers with context independent phone models, the system achieves an equal error rate of 2.7% on the Eval03 NIST detection test (30s segment, primary condition) with an overall decoding process that runs faster than real-time (0.5xRT).
Holger Schwenk.
Efficient training of large neural networks for language
modeling.
In IEEE joint conference on neural networks, pages
3059-3062, 2004.
Recently there has been increasing interest in using neural networks for language modeling. In contrast to the well known backoff n-gram language models, the neural network approach tries to limit the data sparseness problem by performing the estimation in a continuous space, thereby allowing smooth interpolations. The complexity of training such a model and of calculating one n-gram probability is, however, several orders of magnitude higher than for the backoff models, making the new approach difficult to use in real applications. In this paper several techniques are presented that allow the use of a neural network language model in a large vocabulary speech recognition system, in particular very fast lattice rescoring and efficient training of large neural networks on training corpora of over 10 million words. The described approach achieves significant word error reductions with respect to a carefully tuned 4-gram backoff language model in a state-of-the-art conversational speech recognizer for the DARPA rich transcription evaluations.
Lori Lamel, Jean-Luc Gauvain, Gilles Adda, Martine Adda-Decker, Leonard
Canseco, Langzhou Chen, Olivier Galibert, Abdel Messaoudi, and Holger
Schwenk.
Speech transcription in multiple languages.
In International Conference on Acoustics, Speech, and Signal
Processing, pages III:753-756, 2004.
This paper summarizes recent work underway at LIMSI on speech-to-text transcription in multiple languages. The research has been oriented towards the processing of broadcast audio and conversational speech for information access. Broadcast news transcription systems have been developed for seven languages and it is planned to address several other languages in the near term. Research on conversational speech has mainly focused on the English language, with initial work on the French, Arabic and Spanish languages. Automatic processing must take into account the characteristics of the audio data, such as needing to deal with the continuous data stream, specificities of the language and the use of an imperfect word transcription for accessing the information content. Our experience thus far indicates that at today's word error rates, the techniques used in one language can be successfully ported to other languages, and most of the language specificities concern lexical and pronunciation modeling.
Richard Schwartz, Thomas Colthurst, Nicolae Duta, Herb Gish, Rukmini Iyer,
Chia-Lin Kao, Daben Liu, Owen Kimball, J. Ma, John Makhoul, Spyros Matsoukas,
Long Nguyen, Mohamed Noamany, Rohit Prasad, Bing Xiang, Dongxin Xu, Jean-Luc
Gauvain, Lori Lamel, Holger Schwenk, Gilles Adda, and Langzhou Chen.
Speech recognition in multiple languages and domains: The
2003 BBN/LIMSI EARS system.
In International Conference on Acoustics, Speech, and Signal
Processing, pages III:757-760, 2004.
We report on the results of the first evaluations for the BBN/LIMSI system under the new DARPA EARS Program. The evaluations were carried out for conversational telephone speech (CTS) and broadcast news (BN) for three languages: English, Mandarin, and Arabic. In addition to providing system descriptions and evaluation results, the paper highlights methods that worked well across the two domains and those that worked well on one domain but not the other. For the BN evaluations, which had to be run under 10 times real-time, we demonstrated that a joint BBN/LIMSI system with that time constraint achieved better results than either system alone.
Long Nguyen, Sherif Abdou, Mohamed Afify, John Makhoul, Spyros Matsoukas,
Richard Schwartz, Bing Xiang, Lori Lamel, Jean-Luc Gauvain, Gilles Adda,
Holger Schwenk, and Fabrice Lefevre.
The 2004 BBN/LIMSI 10xRT English
broadcast news transcription system.
In 2004 Rich Transcriptions Workshop, Palisades, NY, 2004.
This paper describes the 2004 BBN/LIMSI 10xRT English Broadcast News (BN) transcription system, which uses a tightly integrated combination of components from the BBN and LIMSI speech recognition systems. The integrated system uses both cross-site adaptation and system combination via ROVER, obtaining a word hypothesis that is better than that produced by either system alone, while remaining within the allotted time limit. The system configuration used for the evaluation has two components from each site and two ROVER combinations, and achieved a word error rate (WER) of 13.9% on the Dev04f set and 9.3% on the Dev04 set selected to match the progress set. Compared to last year's system, this is a relative WER reduction of around 30%.
R. Prasad, S. Matsoukas, C-L. Kao, J. Ma, D-X. Xu, T. Colthurst, G. Thattai,
O. Kimball, R. Schwartz, J.-L. Gauvain, Lori Lamel, Holger Schwenk, Gilles
Adda, and F. Lefevre.
The 2004 20xRT BBN/LIMSI English
conversational telephone speech system.
In 2004 Rich Transcriptions Workshop, Palisades, NY, 2004.
In this paper we describe the English Conversational Telephone Speech (CTS) recognition system jointly developed by BBN and LIMSI under the DARPA EARS program for the 2004 evaluation conducted by NIST. The 2004 BBN/LIMSI system achieved a word error rate (WER) of 13.5% at 18.3xRT (real-time as measured on a 3.4 GHz Pentium 4 Xeon processor) on the EARS progress test set. This translates into a 22.8% relative improvement in WER over the 2003 BBN/LIMSI EARS evaluation system, which was run without any time constraints. In addition to reporting on the system architecture and the evaluation results, we also highlight the significant improvements made at both sites.
Holger Schwenk and Jean-Luc Gauvain.
Using Continuous Space Language Models for Conversational
Speech Recognition.
In ISCA & IEEE Workshop on Spontaneous Speech Processing
and Recognition, pages 49-53, Tokyo, April 2003.
Jean-Luc Gauvain, Lori Lamel, Holger Schwenk, Gilles Adda, Langzhou Chen, and
Fabrice Lefèvre.
Conversational telephone speech recognition.
In International Conference on Acoustics, Speech, and Signal
Processing, pages I:212-215, 2003.
Holger Schwenk and Jean-Luc Gauvain.
Connectionist language modeling for large vocabulary
continuous speech recognition.
In International Conference on Acoustics, Speech, and Signal
Processing, pages I:765-768, 2002.
Holger Schwenk.
Language modeling in the continuous domain.
Technical Report 2001-20, LIMSI-CNRS, Orsay, France, 2001.
Holger Schwenk and Yoshua Bengio.
Boosting neural networks.
Neural Computation, 12:1869-1887, 2000.
Boosting is a general method for improving the performance of learning algorithms. A recently proposed boosting algorithm is AdaBoost. It has been applied with great success to several benchmark machine learning problems using mainly decision trees as base classifiers. In this paper we investigate whether AdaBoost also works as well with neural networks, and we discuss the advantages and drawbacks of different versions of the AdaBoost algorithm. In particular, we compare training methods based on sampling the training set and weighting the cost function. The results suggest that random resampling of the training data is not the main explanation of the success of the improvements brought by AdaBoost. This is in contrast to Bagging which directly aims at reducing variance and for which random resampling is essential to obtain the reduction in generalization error. Our system achieves about 1.4% error on a data set of online handwritten digits from more than 200 writers. A boosted multi-layer network achieved 1.5% error on the UCI Letters and 8.1% error on the UCI satellite data set, which is significantly better than boosted decision trees.
Holger Schwenk.
The diabolo classifier.
Neural Computation, 10:2175-2200, 1998.
We present a new classification architecture based on autoassociative neural networks that are used to learn discriminant models of each class. The proposed architecture has several interesting properties with respect to other model-based classifiers like nearest-neighbors or radial basis functions: it has a low computational complexity and uses a compact distributed representation of the models. The classifier is also well suited for the incorporation of a-priori knowledge by means of a problem-specific distance measure. In particular, we will show that tangent distance (Simard, LeCun and Denker, 1993) can be used to achieve transformation invariance during learning and recognition. We demonstrate the application of this classifier to Optical Character Recognition (OCR), where it has achieved state-of-the-art results on several reference databases. Relations to other models, in particular those based on principal component analysis, are also discussed.
Christophe Servan and Holger Schwenk.
Optimising multiple metrics with MERT.
The Prague Bulletin of Mathematical Linguistics (PBML),
(96):109-117, 2011.
Holger Schwenk.
Continuous space language models for statistical machine
translation.
The Prague Bulletin of Mathematical Linguistics,
(93):137-146, 2010.
This paper describes an open-source implementation of the so-called continuous space language model and its application to statistical machine translation. The underlying idea of this approach is to attack the data sparseness problem by performing the language model probability estimation in a continuous space. The projection of the words and the probability estimation are both performed by a multi-layer neural network. This paper describes the theoretical background of the approach and efficient algorithms to handle the computational complexity, gives implementation details, and reports experimental results on a variety of tasks.
Holger Schwenk.
Continuous space language models.
Computer Speech and Language, 21:492-518, 2007.
This paper describes the use of a neural network language model for large vocabulary continuous speech recognition. The underlying idea of this approach is to attack the data sparseness problem by performing the language model probability estimation in a continuous space. Very efficient learning algorithms are described that enable the use of training corpora of several hundred million words. It is also shown that this approach can be incorporated into a large vocabulary continuous speech recognizer using a lattice rescoring framework at a very low additional processing time. The neural network language model has been thoroughly evaluated in a state-of-the-art large vocabulary continuous speech recognizer for several international benchmark tasks, in particular the NIST evaluations on broadcast news and conversational speech recognition. The new approach is compared to 4-gram back-off language models trained with modified Kneser-Ney smoothing, which has often been reported to be the best known smoothing method. The neural network language model achieved consistent word error rate reductions for all considered tasks and languages, ranging from 0.5% up to 1.6% absolute.
Evgeny Matusov, Gregor Leusch, Rafael E. Banchs, Nicola Bertoldi, Daniel
Déchelotte, Marcello Federico, Muntsin Kolss, Young-Suk Lee, José B.
Mariño, Matthias Paulik, Salim Roukos, Holger Schwenk, and Hermann Ney.
System combination for machine translation of spoken and
written language.
IEEE Transactions on Audio, Speech, and Language
Processing, 16(7):1222-1237, 2007.
This article describes an approach for computing a consensus translation from the outputs of multiple machine translation (MT) systems. The consensus translation is computed by weighted majority voting on a confusion network, similarly to the well-established ROVER approach of Fiscus [11] for combining speech recognition hypotheses. To create the confusion network, pairwise word alignments of the original MT hypotheses are learned using an enhanced statistical alignment algorithm that explicitly models word reordering. The context of a whole corpus of automatic translations rather than a single sentence is taken into account in order to achieve high alignment quality. The confusion network is rescored with a special language model, and the consensus translation is extracted as the best path. The proposed system combination approach was evaluated in the framework of the TC-STAR speech translation project. Up to six state-of-the-art statistical phrase-based translation systems from different project partners were combined in the experiments. Significant improvements in translation quality from Spanish to English and English to Spanish in comparison with the best of the individual systems were achieved under official evaluation conditions.
Spyros Matsoukas, Jean-Luc Gauvain, Gilles Adda, Thomas Colthurst, Chia-Lin
Kao, Owen Kimball, Lori Lamel, Fabrice Lefevre, Jeff Ma, John Makhoul, Long
Nguyen, Rohit Prasad, Richard Schwartz, Holger Schwenk, and Bing Xiang.
Advances in transcription of broadcast news and
conversational telephone speech within the combined EARS BBN/LIMSI system.
IEEE Transactions on Audio, Speech, and Language
Processing, 14:1541-1556, 2006.
This paper describes the progress made in the transcription of Broadcast News (BN) and Conversational Telephone Speech (CTS) within the combined BBN/LIMSI system from May 2002 to September 2004. During that period, BBN and LIMSI collaborated in an effort to produce significant reductions in the word error rate (WER), as directed by the aggressive goals of the DARPA EARS (Effective, Affordable, Reusable, Speech-to-text) program. The paper focuses on general modeling techniques that led to recognition accuracy improvements, as well as engineering approaches that enabled efficient use of large amounts of training data and fast decoding architectures. Special attention is given to efforts to integrate components of the BBN and LIMSI systems, discussing the trade-off between speed and accuracy for various system combination strategies. Results on the EARS progress test sets show that the combined BBN/LIMSI system achieved relative reductions of 47% and 51% on the BN and CTS domains, respectively.
J.-L. Gauvain, Gilles Adda, Lori Lamel, F. Lefèvre, and Holger Schwenk.
Transcription de la parole conversationnelle.
Traitement Automatique des Langues, 45(3), 2005.
This article describes the development of a conversational speech recognition system, starting from a state-of-the-art system for broadcast news transcription. We describe the main improvements made to the acoustic models, the language models, and the decoder. For acoustic modeling, our work focused on introducing speaker normalization, using adaptive and discriminative training techniques, and better accounting for pronunciation variants. For language modeling, the main difficulty is the small amount of available training data. We introduce two techniques that reduce the impact of this limitation on system performance: the selection of conversational-style texts and a model that represents words in a continuous space. The transcription is obtained by consensus decoding on a word lattice. These improvements reduced the error rate from 51% to 21%.
Holger Schwenk and Yoshua Bengio.
Boosting neural networks.
Neural Computation, 12:1869-1887, 2000.
Boosting is a general method for improving the performance of learning algorithms. A recently proposed boosting algorithm is AdaBoost. It has been applied with great success to several benchmark machine learning problems using mainly decision trees as base classifiers. In this paper we investigate whether AdaBoost also works as well with neural networks, and we discuss the advantages and drawbacks of different versions of the AdaBoost algorithm. In particular, we compare training methods based on sampling the training set and weighting the cost function. The results suggest that random resampling of the training data is not the main explanation of the success of the improvements brought by AdaBoost. This is in contrast to Bagging which directly aims at reducing variance and for which random resampling is essential to obtain the reduction in generalization error. Our system achieves about 1.4% error on a data set of online handwritten digits from more than 200 writers. A boosted multi-layer network achieved 1.5% error on the UCI Letters and 8.1% error on the UCI satellite data set, which is significantly better than boosted decision trees.
Holger Schwenk.
The diabolo classifier.
Neural Computation, 10:2175-2200, 1998.
We present a new classification architecture based on autoassociative neural networks that are used to learn discriminant models of each class. The proposed architecture has several interesting properties with respect to other model-based classifiers like nearest-neighbors or radial basis functions: it has a low computational complexity and uses a compact distributed representation of the models. The classifier is also well suited for the incorporation of a-priori knowledge by means of a problem-specific distance measure. In particular, we will show that tangent distance (Simard, LeCun and Denker, 1993) can be used to achieve transformation invariance during learning and recognition. We demonstrate the application of this classifier to Optical Character Recognition (OCR), where it has achieved state-of-the-art results on several reference databases. Relations to other models, in particular those based on principal component analysis, are also discussed.
Yoshua Bengio, Holger Schwenk, Jean-Sébastien Senécal, Frédéric Morin, and
Jean-Luc Gauvain.
Neural probabilistic language models, 2005.
Chapter 6 of the book “Innovations in Machine Learning: Theory and Applications”, D. Holmes and L. C. Jain, editors, Springer-Verlag.
Frédéric Blain, Jean Senellart, Holger Schwenk, Mirko Plitt, and Johann
Roturier.
Qualitative analysis of post-editing for high quality
machine translation.
In Asia-Pacific Association for Machine Translation (AAMT),
editor, Machine Translation Summit XIII, Xiamen (China), 19-23 September
2011.
Kashif Shah, Loïc Barrault, and Holger Schwenk.
Parametric weighting of parallel data for statistical
machine translation.
In The 5th International Joint Conference on Natural
Language Processing, pages 1323-1331, Chiang Mai (Thailand), 2011.
Patrik Lambert, Sadaf Abdul-Rauf, and Holger Schwenk.
LIUM SMT machine translation system for WMT 2010.
pages 127-132, 2010.
This paper describes the development of French-English and English-French machine translation systems for the 2010 WMT shared task evaluation. These systems were standard phrase-based statistical systems based on the Moses decoder, trained on the provided data only. Most of our efforts were devoted to the choice and extraction of bilingual data used for training. We filtered out some bilingual corpora and pruned the phrase table. We also investigated the impact of adding two types of additional bilingual texts, extracted automatically from the available monolingual data. We first collected bilingual data by performing automatic translations of monolingual texts. The second type of bilingual text was harvested from comparable corpora with Information Retrieval techniques.
Holger Schwenk.
Adaptation d'un système de traduction automatique
statistique avec des ressources monolingues.
In Traitement du Langage Naturel, page in press, 2010.
The performance of a statistical machine translation system depends heavily on the quality and quantity of the available training data. Most of the existing, easily available parallel texts come from international organizations, and the jargon observed in those texts is not well suited to building a machine translation system for other domains. In this paper, we present a technique to automatically adapt the translation model to a new domain using monolingual data in the source language only. We observe significant improvements in the BLEU score in statistical machine translation systems from Arabic to French and from Arabic to English.
Sadaf Abdul Rauf and Holger Schwenk.
On the use of comparable corpora to improve SMT
performance.
In Proceedings of the Conference of the European Chapter of
the Association for Computational Linguistics, pages 16-23, 2009.
We present a simple and effective method for extracting parallel sentences from comparable corpora. We employ a statistical machine translation (SMT) system built from small amounts of parallel texts to translate the source side of the non-parallel corpus. The target side texts are used, along with other corpora, in the language model of this SMT system. We then use information retrieval techniques and simple filters to create French/English parallel data from comparable news corpora. We evaluate the quality of the extracted data by showing that it significantly improves the performance of an SMT system.
Holger Schwenk, Sadaf Abdul-Rauf, Loïc Barrault, and Jean Senellart.
SMT and SPE machine translation systems for WMT'09.
In Fourth ACL Workshop on Statistical Machine Translation,
pages 130-134, 2009.
.
[ .pdf | abstract]
This paper describes the development of several machine translation systems for the 2009 WMT shared task evaluation. We only consider the translation between French and English. We describe a statistical system based on the Moses decoder and a statistical post-editing system using SYSTRAN's rule-based system. We also investigated techniques to automatically extract additional bilingual texts from comparable corpora.
Sadaf Abdul Rauf and Holger Schwenk.
Exploiting comparable corpora with TER and TERp.
In 2nd Workshop on Building and Using Comparable Corpora:
from parallel to non-parallel corpora, 2009.
.
[ .pdf | abstract]
In this paper we present an extension of a simple and effective method for extracting parallel sentences from comparable corpora, and we apply it to an Arabic/English NIST system. We also compare our approach with that of (Munteanu and Marcu, 2005) using exactly the same corpora and show the same performance gain using much less data. Our approach employs an SMT system built from small amounts of parallel texts to translate the source side of the non-parallel corpus. The target side texts are used, along with other corpora, in the language model of this SMT system. We then use information retrieval techniques and simple filters to create parallel data from comparable news corpora. We evaluate the quality of the extracted data by showing that it significantly improves the performance of an SMT system.
Holger Schwenk and Jean Senellart.
Translation model adaptation for an Arabic/French news
translation system by lightly-supervised training.
In MT Summit, 2009.
.
[ .pdf | abstract]
Most of the existing, easily available parallel texts to train a statistical machine translation system are from international organizations that use a particular jargon. In this paper, we consider the automatic adaptation of such a translation model to the news domain. The initial system was trained on more than 200M words of UN bitexts. We then explore large amounts of in-domain monolingual texts to modify the probability distribution of the phrase-table and to learn new task-specific phrase-pairs. This procedure achieved an improvement of 3.5 points BLEU on the test set in an Arabic/French statistical machine translation system. This result compares favorably with other large state-of-the-art systems for this language pair.
Holger Schwenk, Loïc Barrault, Yannick Estève, and Patrik Lambert.
LIUM's statistical machine translation systems for IWSLT
2009.
In International Workshop on Spoken Language Translation,
pages 65-70, 2009.
.
[ .pdf | abstract]
This paper describes the systems developed by the LIUM laboratory for the 2009 IWSLT evaluation. We participated in the Arabic and Chinese to English BTEC tasks. We developed three different systems: a statistical phrase-based system using the Moses toolkit, a statistical post-editing system and a hierarchical phrase-based system based on Joshua. A continuous space language model was deployed to improve the modeling of the target language. These systems are combined by a confusion network based approach.
Holger Schwenk and Philipp Koehn.
Large and diverse language models for statistical machine
translation.
In International Joint Conference on Natural Language
Processing, pages 661-666, 2008.
.
[ .pdf | abstract]
This paper presents methods to combine large language models trained from diverse text sources and applies them to state-of-the-art French-English and Arabic-English machine translation systems. We show gains of over 2 BLEU points over a strong baseline by using continuous space language models in re-ranking.
Holger Schwenk, Jean-Baptiste Fouet, and Jean Senellart.
First steps towards a general purpose French/English
statistical machine translation system.
In Third ACL Workshop on Statistical Machine Translation,
pages 119-122, 2008.
.
[ .pdf | abstract]
This paper describes an initial version of a general purpose French/English statistical machine translation system. The main features of this system are the open-source Moses decoder, the integration of a bilingual dictionary and a continuous space target language model. We analyze the performance of this system on the test data of the WMT'08 evaluation.
Holger Schwenk and Yannick Estève.
Data selection and smoothing in an open-source system for
the 2008 NIST machine translation evaluation.
In Interspeech, pages 2727-2730, 2008.
.
[ .pdf | abstract]
This paper gives a detailed description of a statistical machine translation system developed for the 2008 NIST open MT evaluation. The system is based on the open source toolkit Moses with extensions for language model rescoring in a second pass. Significant improvements were obtained with data selection methods for the language and translation model. An improvement of more than 1 point BLEU on the test set was achieved by a continuous space language model which performs the probability estimation with a neural network. The described system has achieved a very good ranking in the 2008 NIST open MT evaluation.
Holger Schwenk, Yannick Estève, and Sadaf Abdul Rauf.
The LIUM Arabic/English statistical machine
translation system for IWSLT 2008.
In International Workshop on Spoken Language Translation,
pages 63-68, 2008.
.
[ .pdf | abstract]
This paper describes the system developed by the LIUM laboratory for the 2008 IWSLT evaluation. We only participated in the Arabic/English BTEC task. We developed a statistical phrase-based system using the Moses toolkit and SYSTRAN's rule-based translation system to perform a morphological decomposition of the Arabic words. A continuous space language model was deployed to improve the modeling of the target language. Both approaches achieved significant improvements in the BLEU score. The system achieves a score of 49.4 on the test set of the 2008 IWSLT evaluation.
Holger Schwenk, Marta R. Costa-jussà, and José A. R. Fonollosa.
Smooth bilingual n-gram translation.
In Empirical Methods in Natural Language Processing, pages
430-438, 2007.
.
[ .pdf | abstract]
We address the problem of smoothing translation probabilities in a bilingual N-gram-based statistical machine translation system. It is proposed to project the bilingual tuples onto a continuous space and to estimate the translation probabilities in this representation. A neural network is used to perform the projection and the probability estimation. Smoothing probabilities is most important for tasks with a limited amount of training material. We consider here the BTEC task of the 2006 IWSLT evaluation. Improvements in all official automatic measures are reported when translating from Italian to English. Using a continuous space model for the translation model and the target language model, an improvement of 1.5 BLEU on the test data is observed.
Daniel Déchelotte, Holger Schwenk, Gilles Adda, and Jean-Luc Gauvain.
Improved machine translation of speech-to-text outputs.
In Interspeech, pages 2441-2444, 2007.
.
[ .pdf | abstract]
Combining automatic speech recognition and machine translation is frequent in current research programs. This paper first presents several pre-processing steps to limit the performance degradation observed when translating an automatic transcription (as opposed to a manual transcription). Indeed, automatically transcribed speech often differs significantly from the machine translation system's training material, with respect to casing, punctuation and word normalization. The proposed system outperforms the best system at the 2007 TC-STAR evaluation by almost 2 points BLEU. The paper then attempts to determine a criterion characterizing how well the output of an STT system can be translated, but the current experiments could only confirm that lower word error rates lead to better translations.
Holger Schwenk.
Building a statistical machine translation system for
French using the Europarl corpus.
In Second ACL Workshop on Statistical Machine Translation,
pages 189-192, 2007.
.
[ .pdf | abstract]
This paper describes the development of a statistical machine translation system based on the Moses decoder for the 2007 WMT shared tasks. Several different translation strategies were explored. We also use a statistical language model that is based on a continuous representation of the words in the vocabulary. By these means we expect to take better advantage of the limited amount of training data. Finally, we have investigated the usefulness of a second reference translation of the development data.
Daniel Déchelotte, Holger Schwenk, Hélène Bonneau-Maynard, Alexandre
Allauzen, and Gilles Adda.
A state-of-the-art statistical machine translation system
based on Moses.
In MT Summit, pages 127-133, 2007.
.
[ .pdf | abstract]
This paper describes a statistical machine translation system based on freely available programs such as Moses. Several new features were added, in particular a two-pass decoding strategy using n-best lists and a continuous space language model that aims at taking better advantage of the limited training data. We also investigated lexical disambiguation methods in the translation model based on POS information. The task considered in this work is the translation of the European Parliament Plenary Sessions between English and Spanish, in the framework of the Tc-star project. The described systems performed very well in the 2007 Tc-Star evaluation.
Patrik Lambert, Marta R. Costa-jussà, Josep M. Crego, Maxim Khalilov, José
B. Marino, Rafael E. Banchs, José A.R. Fonollosa, and Holger Schwenk.
The TALP ngram-based SMT system for IWSLT 2007.
In International Workshop on Spoken Language Translation,
pages 169-174, 2007.
.
[ .pdf | abstract]
This paper describes TALPtuples, the 2007 N-gram-based statistical machine translation system developed at the TALP Research Center of the UPC (Universitat Politècnica de Catalunya) in Barcelona. Emphasis is put on improvements and extensions of the system of previous years. Mainly, these include optimizing alignment parameters as a function of translation metric scores and rescoring with a neural network language model. Results on two translation directions are reported, namely from Arabic and Chinese into English, thoroughly explaining all language-related preprocessing and translation schemes.
Holger Schwenk, Marta R. Costa-jussà, and José A. R. Fonollosa.
Continuous space language models for the IWSLT 2006 task.
In International Workshop on Spoken Language Translation,
pages 166-173, November 2006.
.
[ .pdf | abstract]
The language model of the target language plays an important role in statistical machine translation systems. In this work, we propose to use a new statistical language model that is based on a continuous representation of the words in the vocabulary. A neural network is used to perform the projection and the probability estimation. This kind of approach is particularly promising for tasks where only a very limited amount of resources is available, like the BTEC corpus of tourism related questions. This language model is used in two state-of-the-art statistical machine translation systems that were developed by UPC for the 2006 IWSLT evaluation campaign: a phrase-based and an n-gram-based approach. An experimental evaluation for four different language pairs is provided (translation of Mandarin, Japanese, Arabic and Italian to English). The proposed method achieved improvements in the BLEU score of up to 3 points on the development data and of almost 2 points on the official test data.
Holger Schwenk, Daniel Déchelotte, and Jean-Luc Gauvain.
Continuous space language models for statistical machine
translation.
In Proceedings of the COLING/ACL 2006 Main Conference Poster
Sessions, pages 723-730, 2006.
.
[ .pdf | abstract]
Statistical machine translation systems are based on one or more translation models and a language model of the target language. While many different translation models and phrase extraction algorithms have been proposed, a standard word n-gram back-off language model is used in most systems. In this work, we propose to use a new statistical language model that is based on a continuous representation of the words in the vocabulary. A neural network is used to perform the projection and the probability estimation. We consider the translation of European Parliament Speeches. This task is part of an international evaluation organized by the Tc-Star project in 2006. The proposed method achieves consistent improvements in the BLEU score on the development and test data.
We also present algorithms to improve the estimation of the language model probabilities when splitting long sentences into shorter chunks.
Daniel Déchelotte, Holger Schwenk, and Jean-Luc Gauvain.
The 2006 LIMSI statistical machine translation system for
Tc-Star.
In Tc-Star Speech to Speech Translation Workshop,
Barcelona, pages 25-30, 2006.
.
[ .pdf | abstract]
This paper presents the LIMSI statistical machine translation system developed for the 2006 Tc-Star evaluation campaign. We describe an A*-decoder that generates translation lattices using a word-based translation model. A lattice is a rich and compact representation of alternative translations that includes the probability scores of all the involved sub-models. These lattices are then used in subsequent processing steps, in particular to perform sentence splitting and joining, maximum BLEU training and to use improved statistical target language models.
Lori Lamel, Jean-Luc Gauvain, Gilles Adda, Claude Barras, Eric Bilinski,
Olivier Galibert, Agusti Pujol, Holger Schwenk, and Xuan Zhu.
The LIMSI 2006 Tc-Star transcription systems.
In Tc-Star Speech to Speech Translation Workshop,
Barcelona, pages 123-128, 2006.
.
[ .pdf | abstract]
This paper describes the speech recognizers evaluated in the TC-STAR Second Evaluation Campaign held in January-February 2006. Systems were developed to transcribe parliamentary speeches in English and Spanish, as well as Broadcast news in Mandarin Chinese. The speech recognizers are state-of-the-art systems using multiple decoding passes with models (lexicon, acoustic models, language models) trained for the different transcription tasks. Compared to the LIMSI TC-STAR 2005 European Parliament Plenary Sessions (EPPS) systems, relative word error rate reductions of about 30% have been achieved on the 2006 development data. The word error rates with the LIMSI systems on the 2006 EPPS evaluation data are 8.2% for English and 7.8% for Spanish. The character error rate for Mandarin for a joint system submission with the University of Karlsruhe was 9.8%. Experiments with cross-site adaptation and system combination are also described.
Jean-Luc Gauvain, Gilles Adda, Martine Adda-Decker, Alexandre Allauzen,
Veronique Gendner, Lori Lamel, and Holger Schwenk.
Where are we in transcribing BN French?
In Eurospeech, pages 1665-1668, 2005.
.
[ .pdf | abstract]
Given the highly inflectional nature of the French language, transcribing French broadcast news (BN) is more challenging than English BN. This is in part due to the large number of homophones in the inflected forms. This paper describes the development of a recognition system for processing broadcast news speech in French. The resulting system was evaluated in the first French Technolangue ASR benchmark test [?]. This system runs in about 7xRT and achieved the lowest word error rate in this evaluation, 11.9%. We also report on a 1xRT version of this system. The main differences between the English and French BN systems are: a 200k vocabulary to overcome the lower lexical coverage in French, a case sensitive language model, and the use of a POS based language model to lower the impact of homophonic gender and number disagreement.
Holger Schwenk and Jean-Luc Gauvain.
Building continuous space language models for transcribing
European languages.
In Eurospeech, pages 737-740, 2005.
.
[ .pdf | abstract]
Large vocabulary continuous speech recognizers for English Broadcast News today achieve word error rates below 10%. An important factor for this success is the availability of large amounts of acoustic and language modeling training data. In this paper the recognition of French Broadcast News and English and Spanish parliament speeches is addressed, tasks for which fewer resources are available. A neural network language model is applied that takes better advantage of the limited amount of training data. This approach performs the estimation of the probabilities in a continuous space, and better generalization to unknown n-grams can be expected. Word error reductions of up to 0.9% absolute are reported with respect to a carefully tuned back-off language model trained on the same data.
Holger Schwenk and Jean-Luc Gauvain.
Training neural network language models on very large
corpora.
In Empirical Methods in Natural Language Processing, pages
201-208, 2005.
.
[ .pdf | abstract]
In recent years there has been growing interest in using neural networks for language modeling. In contrast to the well known back-off n-gram language models, the neural network approach attempts to overcome the data sparseness problem by performing the estimation in a continuous space. This type of language model was mostly used for tasks for which only a very limited amount of in-domain training data is available. In this paper we present new algorithms to train a neural network language model on very large text corpora. This makes it possible to use the approach in domains where several hundred million words of text are available. The neural network language model is evaluated in a state-of-the-art real-time continuous speech recognizer for French Broadcast News. Word error reductions of 0.5% absolute are reported using only a very limited amount of additional processing time.
Holger Schwenk and Jean-Luc Gauvain.
Neural network language models for conversational speech
recognition.
In International Conference on Speech and Language
Processing, pages 1215-1218, 2004.
.
[ .pdf | abstract]
Recently there has been growing interest in using neural networks for language modeling. In contrast to the well known backoff n-gram language models (LM), the neural network approach tries to limit the problems caused by data sparseness by performing the estimation in a continuous space, thereby allowing smooth interpolations. This type of LM is therefore interesting for tasks for which only a very limited amount of in-domain training data is available, such as the modeling of conversational speech. In this paper we analyze the generalization behavior of the neural network LM for in-domain training corpora varying from 7M to over 21M words. In all cases, significant word error reductions were observed compared to a carefully tuned 4-gram backoff language model in a state-of-the-art conversational speech recognizer for the NIST rich transcription evaluations. We also apply ensemble learning methods and discuss their connections with LM interpolation.
Holger Schwenk and J.-L. Gauvain.
Using neural network language models for LVCSR.
In 2004 Rich Transcriptions Workshop, Palisades, NY, 2004.
.
[ .pdf | abstract]
In this paper we describe how to use a neural network language model for the BN and CTS tasks in the RT04 evaluation. The new approach performs the estimation of the language model probabilities in a continuous space, thereby allowing smooth interpolations. Details are given on training data selection, fast training and decoding algorithms and parameter estimation. The neural network language model achieved word error reductions of 0.5% for the CTS task and of 0.3% for the BN task with an additional decoding cost of 0.05xRT.
Jean-Luc Gauvain, Abdel Messaoudi, and Holger Schwenk.
Language recognition using phone lattices.
In International Conference on Speech and Language
Processing, pages 1283-1286, 2004.
.
[ .pdf | abstract]
This paper proposes a new phone lattice based method for automatic language recognition from speech data. By using phone lattices, some approximations usually made by language identification (LID) systems relying on phonotactic constraints to simplify the training and decoding processes can be avoided. We demonstrate that the use of phone lattices in both training and testing significantly improves the accuracy of a phonotactically based LID system. Performance is further enhanced by using a neural network to combine the results of multiple phone recognizers. Using three phone recognizers with context independent phone models, the system achieves an equal error rate of 2.7% on the Eval03 NIST detection test (30s segment, primary condition) with an overall decoding process that runs faster than real-time (0.5xRT).
Holger Schwenk.
Efficient training of large neural networks for language
modeling.
In IEEE joint conference on neural networks, pages
3059-3062, 2004.
.
[ .pdf | abstract]
Recently there has been increasing interest in using neural networks for language modeling. In contrast to the well known backoff n-gram language models, the neural network approach tries to limit the data sparseness problem by performing the estimation in a continuous space, thereby allowing smooth interpolations. The complexity of training such a model and of calculating one n-gram probability is, however, several orders of magnitude higher than for the backoff models, making the new approach difficult to use in real applications. In this paper several techniques are presented that allow the use of a neural network language model in a large vocabulary speech recognition system, in particular very fast lattice rescoring and efficient training of large neural networks on training corpora of over 10 million words. The described approach achieves significant word error reductions with respect to a carefully tuned 4-gram backoff language model in a state-of-the-art conversational speech recognizer for the DARPA rich transcription evaluations.
Lori Lamel, Jean-Luc Gauvain, Gilles Adda, Martine Adda-Decker, Leonard
Canseco, Langzhou Chen, Olivier Galibert, Abdel Messaoudi, and Holger
Schwenk.
Speech transcription in multiple languages.
In International Conference on Acoustics, Speech, and Signal
Processing, pages III:753-756, 2004.
.
[ .pdf | abstract]
This paper summarizes recent work underway at Limsi on speech-to-text transcription in multiple languages. The research has been oriented towards the processing of broadcast audio and conversational speech for information access. Broadcast news transcription systems have been developed for seven languages and it is planned to address several other languages in the near term. Research on conversational speech has mainly focused on the English language, with initial work on the French, Arabic and Spanish languages. Automatic processing must take into account the characteristics of the audio data, such as needing to deal with the continuous data stream, specificities of the language and the use of an imperfect word transcription for accessing the information content. Our experience thus far indicates that at today's word error rates, the techniques used in one language can be successfully ported to other languages, and most of the language specificities concern lexical and pronunciation modeling.
Richard Schwartz, Thomas Colthurst, Nicolae Duta, Herb Gish, Rukmini Iyer,
Chia-Lin Kao, Daben Liu, Owen Kimball, J. Ma, John Makhoul, Spyros Matsoukas,
Long Nguyen, Mohamed Noamany, Rohit Prasad, Bing Xiang, Dongxin Xu, Jean-Luc
Gauvain, Lori Lamel, Holger Schwenk, Gilles Adda, and Langzhou Chen.
Speech recognition in multiple languages and domains: The
2003 BBN/LIMSI EARS system.
In International Conference on Acoustics, Speech, and Signal
Processing, pages III:757-760, 2004.
.
[ .pdf | abstract]
We report on the results of the first evaluations for the BBN/LIMSI system under the new DARPA EARS Program. The evaluations were carried out for conversational telephone speech (CTS) and broadcast news (BN) for three languages: English, Mandarin, and Arabic. In addition to providing system descriptions and evaluation results, the paper highlights methods that worked well across the two domains and those that worked well on one domain but not the other. For the BN evaluations, which had to be run under 10 times real-time, we demonstrated that a joint BBN/LIMSI system with that time constraint achieved better results than either system alone.
Long Nguyen, Sherif Abdou, Mohamed Afify, John Makhoul, Spyros Matsoukas,
Richard Schwartz, Bing Xiang, Lori Lamel, Jean-Luc Gauvain, Gilles Adda,
Holger Schwenk, and Fabrice Lefevre.
The 2004 BBN/LIMSI 10xRT English
broadcast news transcription system.
In 2004 Rich Transcriptions Workshop, Palisades, NY, 2004.
.
[ .pdf | abstract]
This paper describes the 2004 BBN/LIMSI 10xRT English Broadcast News (BN) transcription system, which uses a tightly integrated combination of components from the BBN and LIMSI speech recognition systems. The integrated system uses both cross-site adaptation and system combination via ROVER, obtaining a word hypothesis that is better than that produced by either system alone, while remaining within the allotted time limit. The system configuration used for the evaluation has two components from each site and two ROVER combinations, and achieved a word error rate (WER) of 13.9% on the Dev04f set and 9.3% on the Dev04 set selected to match the progress set. Compared to last year's system, this is a relative WER reduction of around 30%.
R. Prasad, S. Matsoukas, C-L. Kao, J. Ma, D-X. Xu, T. Colthurst, G. Thattai,
O. Kimball, R. Schwartz, J.-L. Gauvain, Lori Lamel, Holger Schwenk, Gilles
Adda, and F. Lefevre.
The 2004 20xRT BBN/LIMSI English
conversational telephone speech system.
In 2004 Rich Transcriptions Workshop, Palisades, NY, 2004.
.
[ .pdf | abstract]
In this paper we describe the English Conversational Telephone Speech (CTS) recognition system jointly developed by BBN and LIMSI under the DARPA EARS program for the 2004 evaluation conducted by NIST. The 2004 BBN/LIMSI system achieved a word error rate (WER) of 13.5% at 18.3xRT (real-time as measured on a 3.4 GHz Pentium 4 Xeon processor) on the EARS progress test set. This translates into a 22.8% relative improvement in WER over the 2003 BBN/LIMSI EARS evaluation system, which was run without any time constraints. In addition to reporting on the system architecture and the evaluation results, we also highlight the significant improvements made at both sites.
Holger Schwenk, Daniel Déchelotte, Hélène Bonneau-Maynard, and Alexandre
Allauzen.
Modèles statistiques enrichis par la syntaxe pour la
traduction automatique.
In Traitement du Langage Naturel, pages 253-262, 2007.
.
[ .pdf | abstract]
Phrase-based statistical machine translation is a promising approach. In this article we present two complementary extensions. The first models the target language in a continuous space. The second integrates morpho-syntactic categories into the units handled by the translation model. Both approaches are evaluated on the Tc-Star task. The most interesting results are obtained by combining the two methods.
Daniel Déchelotte, Holger Schwenk, and Jean-Luc Gauvain.
Transcription et traduction de débats parlementaires.
In Reconnaissance de Formes et Intelligence Artificielle,
2006.
.
[ .pdf | abstract]
This article presents a complete system for the automatic translation of unconstrained speech. A statistical approach is used for both speech recognition and translation. The models, algorithms and optimizations used in the statistical translation system are described in detail. Results are presented for the transcription and translation of European Parliament debates, from English to Spanish and vice versa. They suggest that stochastic translation models are well suited to speech translation, owing to their observed relative robustness to the errors introduced by automatic recognition.
Lori Lamel, Holger Schwenk, Jean-Luc Gauvain, Gilles Adda, and Eric Bilinski.
Improvements in transcribing lectures and seminars.
In 2nd Joint Workshop on Multimodal Interaction and Related
Machine Learning Algorithms, 2005.
.
[ .pdf | abstract]
This paper describes recent research carried out in the context of the FP6 Integrated Project Chil (chil.server.de) on developing a system to automatically transcribe lectures and seminars. Widely available corpora were used to train both the acoustic and language models, since only a small amount of Chil data was available for system development. For language model training, text materials come from a variety of online conference proceedings and a neural network language model has been used to take better advantage of the limited data.
Holger Schwenk.
Continuous space language models.
Computer Speech and Language, 21:492-518, 2007.
.
[ .pdf | abstract]
This paper describes the use of a neural network language model for large vocabulary continuous speech recognition. The underlying idea of this approach is to attack the data sparseness problem by performing the language model probability estimation in a continuous space. Very efficient learning algorithms are described that enable the use of training corpora of several hundred million words. It is also shown that this approach can be incorporated into a large vocabulary continuous speech recognizer using a lattice rescoring framework at a very low additional processing time. The neural network language model has been thoroughly evaluated in a state-of-the-art large vocabulary continuous speech recognizer for several international benchmark tasks, in particular the NIST evaluations on broadcast news and conversational speech recognition. The new approach is compared to 4-gram back-off language models trained with modified Kneser-Ney smoothing, which has often been reported to be the best known smoothing method. The neural network language model achieved consistent word error rate reductions for all considered tasks and languages, ranging from 0.5% to 1.6% absolute.
Holger Schwenk, Marta R. Costa-jussà, and José A. R. Fonollosa.
Continuous space language models for the IWSLT 2006 task.
In International Workshop on Spoken Language Translation,
pages 166-173, November 2006.
.
[ .pdf | abstract]
The language model of the target language plays an important role in statistical machine translation systems. In this work, we propose to use a new statistical language model that is based on a continuous representation of the words in the vocabulary. A neural network is used to perform the projection and the probability estimation. This kind of approach is particularly promising for tasks where only a very limited amount of resources is available, like the BTEC corpus of tourism related questions. This language model is used in two state-of-the-art statistical machine translation systems that were developed by UPC for the 2006 IWSLT evaluation campaign: a phrase-based and an n-gram-based approach. An experimental evaluation for four different language pairs is provided (translation of Mandarin, Japanese, Arabic and Italian to English). The proposed method achieved improvements in the BLEU score of up to 3 points on the development data and of almost 2 points on the official test data.
Yoshua Bengio, Holger Schwenk, Jean-Sébastien Senécal, Fréderic Morin, and
Jean-Luc Gauvain.
Neural probabilistic language models, 2005.
.
[ .pdf | abstract]
Chapter 6 of the book “Innovations in Machine Learning: Theory and Applications”, D. Holmes and L. C. Jain, editors, Springer-Verlag.
Holger Schwenk and Jean-Luc Gauvain.
Building continuous space language models for transcribing
European languages.
In Eurospeech, pages 737-740, 2005.
.
[ .pdf | abstract]
Large vocabulary continuous speech recognizers for English Broadcast News today achieve word error rates below 10%. An important factor for this success is the availability of large amounts of acoustic and language modeling training data. In this paper the recognition of French Broadcast News and English and Spanish parliament speeches is addressed, tasks for which fewer resources are available. A neural network language model is applied that takes better advantage of the limited amount of training data. This approach performs the estimation of the probabilities in a continuous space, so better generalization to unknown n-grams can be expected. Word error reductions of up to 0.9% absolute are reported with respect to a carefully tuned back-off language model trained on the same data.
Lori Lamel, Holger Schwenk, Jean-Luc Gauvain, Gilles Adda, and Eric Bilinski.
Improvements in transcribing lectures and seminars.
In 2nd Joint Workshop on Multimodal Interaction and Related
Machine Learning Algorithms, 2005.
.
[ .pdf | abstract]
This paper describes recent research carried out in the context of the FP6 Integrated Project Chil (chil.server.de) on developing a system to automatically transcribe lectures and seminars. Widely available corpora were used to train both the acoustic and language models, since only a small amount of Chil data was available for system development. For language model training, text materials come from a variety of online conference proceedings and a neural network language model has been used to take better advantage of the limited data.
Holger Schwenk and Jean-Luc Gauvain.
Training neural network language models on very large
corpora.
In Empirical Methods in Natural Language Processing, pages
201-208, 2005.
.
[ .pdf | abstract]
In recent years there has been growing interest in using neural networks for language modeling. In contrast to the well known back-off n-gram language models, the neural network approach attempts to overcome the data sparseness problem by performing the estimation in a continuous space. This type of language model was mostly used for tasks for which only a very limited amount of in-domain training data is available. In this paper we present new algorithms to train a neural network language model on very large text corpora. This makes it possible to use the approach in domains where several hundred million words of text are available. The neural network language model is evaluated in a state-of-the-art real-time continuous speech recognizer for French Broadcast News. Word error reductions of 0.5% absolute are reported using only a very limited amount of additional processing time.
Holger Schwenk and Jean-Luc Gauvain.
Neural network language models for conversational speech
recognition.
In International Conference on Speech and Language
Processing, pages 1215-1218, 2004.
.
[ .pdf | abstract]
Recently there has been growing interest in using neural networks for language modeling. In contrast to the well known backoff n-gram language models (LM), the neural network approach tries to limit the data sparseness problem by performing the estimation in a continuous space, thereby allowing smooth interpolations. This type of LM is therefore interesting for tasks for which only a very limited amount of in-domain training data is available, such as the modeling of conversational speech. In this paper we analyze the generalization behavior of the neural network LM for in-domain training corpora varying from 7M to over 21M words. In all cases, significant word error reductions were observed compared to a carefully tuned 4-gram backoff language model in a state-of-the-art conversational speech recognizer for the NIST rich transcription evaluations. We also apply ensemble learning methods and discuss their connections with LM interpolation.
Holger Schwenk and J.-L. Gauvain.
Using neural network language models for LVCSR.
In 2004 Rich Transcriptions Workshop, Palisades, NY, 2004.
.
[ .pdf | abstract]
In this paper we describe how to use a neural network language model for the BN and CTS task in the RT04 evaluation. The new approach performs the estimation of the language model probabilities in a continuous space, allowing by this means smooth interpolations. Details are given on training data selection, fast training and decoding algorithms and parameter estimation. The neural network language model achieved word error reductions of 0.5% for the CTS task and of 0.3% for the BN task with an additional decoding cost of 0.05xRT.
Holger Schwenk.
Efficient training of large neural networks for language
modeling.
In IEEE joint conference on neural networks, pages
3059-3062, 2004.
.
[ .pdf | abstract]
Recently there has been increasing interest in using neural networks for language modeling. In contrast to the well known backoff n-gram language models, the neural network approach tries to limit the data sparseness problem by performing the estimation in a continuous space, thereby allowing smooth interpolations. The complexity of training such a model and of calculating one n-gram probability is, however, several orders of magnitude higher than for the backoff models, making the new approach difficult to use in real applications. In this paper several techniques are presented that allow the use of a neural network language model in a large vocabulary speech recognition system, in particular very fast lattice rescoring and efficient training of large neural networks on training corpora of over 10 million words. The described approach achieves significant word error reductions with respect to a carefully tuned 4-gram backoff language model in a state-of-the-art conversational speech recognizer for the DARPA rich transcription evaluations.
Holger Schwenk and Jean-Luc Gauvain.
Using Continuous Space Language Models for Conversational
Speech Recognition.
In ISCA & IEEE Workshop on Spontaneous Speech Processing
and Recognition, pages 49-53, Tokyo, April 2003.
.
Holger Schwenk and Jean-Luc Gauvain.
Connectionist language modeling for large vocabulary
continuous speech recognition.
In International Conference on Acoustics, Speech, and Signal
Processing, pages I: 765-768, 2002.
.
Holger Schwenk.
Language modeling in the continuous domain.
Technical Report 2001-20, LIMSI-CNRS, Orsay, France, 2001.
.
Lori Lamel, Jean-Luc Gauvain, Gilles Adda, Claude Barras, Eric Bilinski,
Olivier Galibert, Agusti Pujol, Holger Schwenk, and Xuan Zhu.
The LIMSI 2006 TC-STAR transcription systems.
In TC-STAR Speech to Speech Translation Workshop,
Barcelona, pages 123-128, 2006.
.
[ .pdf | abstract]
This paper describes the speech recognizers evaluated in the TC-STAR Second Evaluation Campaign held in January-February 2006. Systems were developed to transcribe parliamentary speeches in English and Spanish, as well as Broadcast news in Mandarin Chinese. The speech recognizers are state-of-the-art systems using multiple decoding passes with models (lexicon, acoustic models, language models) trained for the different transcription tasks. Compared to the LIMSI TC-STAR 2005 European Parliament Plenary Sessions (EPPS) systems, relative word error rate reductions of about 30% have been achieved on the 2006 development data. The word error rates with the LIMSI systems on the 2006 EPPS evaluation data are 8.2% for English and 7.8% for Spanish. The character error rate for Mandarin for a joint system submission with the University of Karlsruhe was 9.8%. Experiments with cross-site adaptation and system combination are also described.
J.-L. Gauvain, Gilles Adda, Lori Lamel, F. Lefèvre, and Holger Schwenk.
Transcription de la parole conversationnelle.
Traitement Automatique des Langages, 45(3), 2005.
.
[ .pdf | abstract]
This article describes the development of a conversational speech recognition system, starting from a state-of-the-art system for broadcast news transcription. We describe the main improvements made to the acoustic models, the language models, and the decoder. For acoustic modeling, our work focused on introducing speaker normalization, using adaptive and discriminative training techniques, and better accounting for pronunciation variants. For language modeling, the main difficulty comes from the small amount of available training data. We introduce two techniques that reduce the impact of this situation on system performance: the selection of texts of a conversational nature and a model that represents words in a continuous space. The transcription is obtained by consensus decoding on a word lattice. These improvements reduced the error rate from 51% to 21%.
Jean-Luc Gauvain, Gilles Adda, Martine Adda-Decker, Alexandre Allauzen,
Veronique Gendner, Lori Lamel, and Holger Schwenk.
Where are we in transcribing BN French?
In Eurospeech, pages 1665-1668, 2005.
.
[ .pdf | abstract]
Given the highly inflected nature of the French language, transcribing French broadcast news (BN) is more challenging than English BN. This is in part due to the large number of homophones in the inflected forms. This paper describes the development of a recognition system for processing broadcast news speech in French. The resulting system was evaluated in the first French Technolangue ASR benchmark test [?]. This system runs in about 7xRT and achieved the lowest word error rate in this evaluation, 11.9%. We also report on a 1xRT version of this system. The main differences between the English and French BN systems are: a 200k vocabulary to overcome the lower lexical coverage in French, a case-sensitive language model, and the use of a POS-based language model to lower the impact of homophonic gender and number disagreement.
Holger Schwenk and Jean-Luc Gauvain.
Building continuous space language models for transcribing
European languages.
In Eurospeech, pages 737-740, 2005.
.
[ .pdf | abstract]
Large vocabulary continuous speech recognizers for English Broadcast News today achieve word error rates below 10%. An important factor for this success is the availability of large amounts of acoustic and language modeling training data. In this paper the recognition of French Broadcast News and English and Spanish parliament speeches is addressed, tasks for which fewer resources are available. A neural network language model is applied that takes better advantage of the limited amount of training data. This approach performs the estimation of the probabilities in a continuous space, so better generalization to unknown n-grams can be expected. Word error reductions of up to 0.9% absolute are reported with respect to a carefully tuned back-off language model trained on the same data.
Lori Lamel, Holger Schwenk, Jean-Luc Gauvain, Gilles Adda, and Eric Bilinski.
Improvements in transcribing lectures and seminars.
In 2nd Joint Workshop on Multimodal Interaction and Related
Machine Learning Algorithms, 2005.
.
[ .pdf | abstract]
This paper describes recent research carried out in the context of the FP6 Integrated Project Chil (chil.server.de) on developing a system to automatically transcribe lectures and seminars. Widely available corpora were used to train both the acoustic and language models, since only a small amount of Chil data was available for system development. For language model training, text materials come from a variety of online conference proceedings and a neural network language model has been used to take better advantage of the limited data.
Holger Schwenk and Jean-Luc Gauvain.
Training neural network language models on very large
corpora.
In Empirical Methods in Natural Language Processing, pages
201-208, 2005.
.
[ .pdf | abstract]
In recent years there has been growing interest in using neural networks for language modeling. In contrast to the well known back-off n-gram language models, the neural network approach attempts to overcome the data sparseness problem by performing the estimation in a continuous space. This type of language model was mostly used for tasks for which only a very limited amount of in-domain training data is available. In this paper we present new algorithms to train a neural network language model on very large text corpora. This makes it possible to use the approach in domains where several hundred million words of text are available. The neural network language model is evaluated in a state-of-the-art real-time continuous speech recognizer for French Broadcast News. Word error reductions of 0.5% absolute are reported using only a very limited amount of additional processing time.
Holger Schwenk and Jean-Luc Gauvain.
Neural network language models for conversational speech
recognition.
In International Conference on Speech and Language
Processing, pages 1215-1218, 2004.
.
[ .pdf | abstract]
Recently there has been growing interest in using neural networks for language modeling. In contrast to the well known backoff n-gram language models (LM), the neural network approach tries to limit the data sparseness problem by performing the estimation in a continuous space, thereby allowing smooth interpolations. This type of LM is therefore interesting for tasks for which only a very limited amount of in-domain training data is available, such as the modeling of conversational speech. In this paper we analyze the generalization behavior of the neural network LM for in-domain training corpora varying from 7M to over 21M words. In all cases, significant word error reductions were observed compared to a carefully tuned 4-gram backoff language model in a state-of-the-art conversational speech recognizer for the NIST rich transcription evaluations. We also apply ensemble learning methods and discuss their connections with LM interpolation.
Holger Schwenk and J.-L. Gauvain.
Using neural network language models for LVCSR.
In 2004 Rich Transcriptions Workshop, Palisades, NY, 2004.
.
[ .pdf | abstract]
In this paper we describe how to use a neural network language model for the BN and CTS task in the RT04 evaluation. The new approach performs the estimation of the language model probabilities in a continuous space, allowing by this means smooth interpolations. Details are given on training data selection, fast training and decoding algorithms and parameter estimation. The neural network language model achieved word error reductions of 0.5% for the CTS task and of 0.3% for the BN task with an additional decoding cost of 0.05xRT.
Holger Schwenk.
Efficient training of large neural networks for language
modeling.
In IEEE joint conference on neural networks, pages
3059-3062, 2004.
.
[ .pdf | abstract]
Recently there has been increasing interest in using neural networks for language modeling. In contrast to the well known backoff n-gram language models, the neural network approach tries to limit the data sparseness problem by performing the estimation in a continuous space, thereby allowing smooth interpolations. The complexity of training such a model and of calculating one n-gram probability is, however, several orders of magnitude higher than for the backoff models, making the new approach difficult to use in real applications. In this paper several techniques are presented that allow the use of a neural network language model in a large vocabulary speech recognition system, in particular very fast lattice rescoring and efficient training of large neural networks on training corpora of over 10 million words. The described approach achieves significant word error reductions with respect to a carefully tuned 4-gram backoff language model in a state-of-the-art conversational speech recognizer for the DARPA rich transcription evaluations.
Lori Lamel, Jean-Luc Gauvain, Gilles Adda, Martine Adda-Decker, Leonard
Canseco, Langzhou Chen, Olivier Galibert, Abdel Messaoudi, and Holger
Schwenk.
Speech transcription in multiple languages.
In International Conference on Acoustics, Speech, and Signal
Processing, pages III:753-756, 2004.
.
[ .pdf | abstract]
This paper summarizes recent work underway at Limsi on speech-to-text transcription in multiple languages. The research has been oriented towards the processing of broadcast audio and conversational speech for information access. Broadcast news transcription systems have been developed for seven languages and it is planned to address several other languages in the near term. Research on conversational speech has mainly focused on the English language, with initial work on the French, Arabic and Spanish languages. Automatic processing must take into account the characteristics of the audio data, such as needing to deal with the continuous data stream, specificities of the language and the use of an imperfect word transcription for accessing the information content. Our experience thus far indicates that at today's word error rates, the techniques used in one language can be successfully ported to other languages, and most of the language specificities concern lexical and pronunciation modeling.
Richard Schwartz, Thomas Colthurst, Nicolae Duta, Herb Gish, Rukmini Iyer,
Chia-Lin Kao, Daben Liu, Owen Kimball, J. Ma, John Makhoul, Spyros Matsoukas,
Long Nguyen, Mohamed Noamany, Rohit Prasad, Bing Xiang, Dongxin Xu, Jean-Luc
Gauvain, Lori Lamel, Holger Schwenk, Gilles Adda, and Langzhou Chen.
Speech recognition in multiple languages and domains: The
2003 BBN/LIMSI EARS system.
In International Conference on Acoustics, Speech, and Signal
Processing, pages III:757-760, 2004.
.
[ .pdf | abstract]
We report on the results of the first evaluations for the BBN/LIMSI system under the new DARPA EARS Program. The evaluations were carried out for conversational telephone speech (CTS) and broadcast news (BN) for three languages: English, Mandarin, and Arabic. In addition to providing system descriptions and evaluation results, the paper highlights methods that worked well across the two domains and those that worked well on one domain but not the other. For the BN evaluations, which had to be run under 10 times real-time, we demonstrated that a joint BBN/LIMSI system with that time constraint achieved better results than either system alone.
Long Nguyen, Sherif Abdou, Mohamed Afify, John Makhoul, Spyros Matsoukas,
Richard Schwartz, Bing Xiang, Lori Lamel, Jean-Luc Gauvain, Gilles Adda,
Holger Schwenk, and Fabrice Lefevre.
The 2004 BBN/LIMSI 10xRT English
broadcast news transcription system.
In 2004 Rich Transcriptions Workshop, Palisades, NY, 2004.
.
[ .pdf | abstract]
This paper describes the 2004 BBN/LIMSI 10xRT English Broadcast News (BN) transcription system which uses a tightly integrated combination of components from the BBN and LIMSI speech recognition systems. The integrated system uses both cross-site adaptation and system combination via ROVER, obtaining a word hypothesis that is better than is produced by either system alone, while remaining within the allotted time limit. The system configuration used for the evaluation has two components from each site and two ROVER combinations, and achieved a word error rate (WER) of 13.9% on the Dev04f set and 9.3% on the Dev04 set selected to match the progress set. Compared to last year's system, there is around 30% relative reduction on the WER.
R. Prasad, S. Matsoukas, C-L. Kao, J. Ma, D-X. Xu, T. Colthurst, G. Thattai,
O. Kimball, R. Schwartz, J.-L. Gauvain, Lori Lamel, Holger Schwenk, Gilles
Adda, and F. Lefevre.
The 2004 20xRT BBN/LIMSI English
conversational telephone speech system.
In 2004 Rich Transcriptions Workshop, Palisades, NY, 2004.
.
[ .pdf | abstract]
In this paper we describe the English Conversational Telephone Speech (CTS) recognition system jointly developed by BBN and LIMSI under the DARPA EARS program for the 2004 evaluation conducted by NIST. The 2004 BBN/LIMSI system achieved a word error rate (WER) of 13.5% at 18.3xRT (real-time as measured on 3.4 GHz Pentium 4 Xeon Processor) on the EARS progress test set. This translates into a 22.8% relative improvement in WER over the 2003 BBN/LIMSI EARS evaluation system, which was run without any time constraints. In addition to reporting on the system architecture and the evaluation results, we also highlight the significant improvements made at both sites.
Holger Schwenk and Jean-Luc Gauvain.
Using Continuous Space Language Models for Conversational
Speech Recognition.
In ISCA & IEEE Workshop on Spontaneous Speech Processing
and Recognition, pages 49-53, Tokyo, April 2003.
.
Jean-Luc Gauvain, Lori Lamel, Holger Schwenk, Gilles Adda, Langzhou Chen, and
Fabrice Lefèvre.
Conversational telephone speech recognition.
In International Conference on Acoustics, Speech, and Signal
Processing, pages I:212-215, 2003.
.
Holger Schwenk and Jean-Luc Gauvain.
Connectionist language modeling for large vocabulary
continuous speech recognition.
In International Conference on Acoustics, Speech, and Signal
Processing, pages I: 765-768, 2002.
.
Holger Schwenk.
Language modeling in the continuous domain.
Technical Report 2001-20, LIMSI-CNRS, Orsay, France, 2001.
.
Christophe Servan and Holger Schwenk.
Optimising multiple metrics with MERT.
The Prague Bulletin of Mathematical Linguistics (PBML),
(96):109-117, 2011.
.
[ .pdf ]
Holger Schwenk.
Continuous space language models for statistical machine
translation.
The Prague Bulletin of Mathematical Linguistics,
(93):137-146, 2010.
.
[ .pdf | abstract]
This paper describes an open-source implementation of the so-called continuous space language model and its application to statistical machine translation. The underlying idea of this approach is to attack the data sparseness problem by performing the language model probability estimation in a continuous space. The projection of the words and the probability estimation are both performed by a multi-layer neural network. This paper describes the theoretical background of the approach, efficient algorithms to handle the computational complexity, and gives implementation details and reports experimental results on a variety of tasks.
Patrik Lambert, Sadaf Abdul-Rauf, and Holger Schwenk.
LIUM SMT machine translation system for WMT 2010.
pages 127-132, 2010.
.
[ .pdf | abstract]
This paper describes the development of French-English and English-French machine translation systems for the 2010 WMT shared task evaluation. These systems were standard phrase-based statistical systems based on the Moses decoder, trained on the provided data only. Most of our efforts were devoted to the choice and extraction of bilingual data used for training. We filtered out some bilingual corpora and pruned the phrase table. We also investigated the impact of adding two types of additional bilingual texts, extracted automatically from the available monolingual data. We first collected bilingual data by performing automatic translations of monolingual texts. The second type of bilingual text was harvested from comparable corpora with Information Retrieval techniques.
Holger Schwenk.
Adaptation d'un système de traduction automatique
statistique avec des ressources monolingues.
In Traitement du Langage Naturel, page in press, 2010.
.
[ .pdf | abstract]
The performance of a statistical machine translation system depends a lot on the quality and quantity of the available training data. Most of the existing, easily available parallel texts come from international organizations and the jargon observed in those texts is not very appropriate to build a machine translation system for other domains. In this paper, we present a technique to automatically adapt the translation model to a new domain using monolingual data in the source language only. We observe significant improvements in the BLEU score in statistical machine translation systems from Arabic to French and English respectively.
Sadaf Abdul Rauf and Holger Schwenk.
On the use of comparable corpora to improve SMT
performance.
In Proceedings of the Conference of the European Chapter of
the Association for Computational Linguistics, pages 16-23, 2009.
.
[ .pdf | abstract]
We present a simple and effective method for extracting parallel sentences from comparable corpora. We employ a statistical machine translation (SMT) system built from small amounts of parallel texts to translate the source side of the non-parallel corpus. The target side texts are used, along with other corpora, in the language model of this SMT system. We then use information retrieval techniques and simple filters to create French/English parallel data from a comparable news corpus. We evaluate the quality of the extracted data by showing that it significantly improves the performance of an SMT system.
Holger Schwenk, Sadaf Abdul-Rauf, Loïc Barrault, and Jean Senellart.
SMT and SPE machine translation systems for WMT'09.
In Fourth ACL Workshop on Statistical Machine Translation,
pages 130-134, 2009.
.
[ .pdf | abstract]
This paper describes the development of several machine translation systems for the 2009 WMT shared task evaluation. We only consider the translation between French and English. We describe a statistical system based on the Moses decoder and a statistical post-editing system using SYSTRAN's rule-based system. We also investigated techniques to automatically extract additional bilingual texts from comparable corpora.
Sadaf Abdul Rauf and Holger Schwenk.
Exploiting comparable corpora with TER and TERp.
In 2nd Workshop on Building and Using Comparable Corpora:
from parallel to non-parallel corpora, 2009.
.
[ .pdf | abstract]
In this paper we present an extension of a simple and effective method for extracting parallel sentences from comparable corpora and apply it to an Arabic/English NIST system. We also report a comparison of our approach with that of (Munteanu and Marcu, 2005) using exactly the same corpora and show the same performance gain using much less data. Our approach employs an SMT system built from small amounts of parallel texts to translate the source side of the non-parallel corpus. The target side texts are used, along with other corpora, in the language model of this SMT system. We then use information retrieval techniques and simple filters to create parallel data from a comparable news corpus. We evaluate the quality of the extracted data by showing that it significantly improves the performance of an SMT system.
Holger Schwenk and Jean Senellart.
Translation model adaptation for an Arabic/French news
translation system by lightly-supervised training.
In MT Summit, 2009.
.
[ .pdf | abstract]
Most of the existing, easily available parallel texts to train a statistical machine translation system are from international organizations that use a particular jargon. In this paper, we consider the automatic adaptation of such a translation model to the news domain. The initial system was trained on more than 200M words of UN bitexts. We then explore large amounts of in-domain monolingual texts to modify the probability distribution of the phrase-table and to learn new task-specific phrase-pairs. This procedure achieved an improvement of 3.5 points BLEU on the test set in an Arabic/French statistical machine translation system. This result compares favorably with other large state-of-the-art systems for this language pair.
Holger Schwenk, Loïc Barrault, Yannick Estève, and Patrik Lambert.
LIUM's statistical machine translation systems for IWSLT
2009.
In International Workshop on Spoken Language Translation,
pages 65-70, 2009.
.
[ .pdf | abstract]
This paper describes the systems developed by the LIUM laboratory for the 2009 IWSLT evaluation. We participated in the Arabic and Chinese to English BTEC tasks. We developed three different systems: a statistical phrase-based system using the Moses toolkit, a statistical post-editing system and a hierarchical phrase-based system based on Joshua. A continuous space language model was deployed to improve the modeling of the target language. These systems are combined by a confusion-network-based approach.
Holger Schwenk and Philipp Koehn.
Large and diverse language models for statistical machine
translation.
In International Joint Conference on Natural Language
Processing, pages 661-666, 2008.
.
[ .pdf | abstract]
This paper presents methods to combine large language models trained from diverse text sources and applies them to a state-of-the-art French-English and Arabic-English machine translation system. We show gains of over 2 BLEU points over a strong baseline by using continuous space language models in re-ranking.
Holger Schwenk, Jean-Baptiste Fouet, and Jean Senellart.
First steps towards a general purpose French/English
statistical machine translation system.
In Third ACL Workshop on Statistical Machine Translation,
pages 119-122, 2008.
.
[ .pdf | abstract]
This paper describes an initial version of a general purpose French/English statistical machine translation system. The main features of this system are the open-source Moses decoder, the integration of a bilingual dictionary and a continuous space target language model. We analyze the performance of this system on the test data of the WMT'08 evaluation.
Holger Schwenk and Yannick Estève.
Data selection and smoothing in an open-source system for
the 2008 NIST machine translation evaluation.
In Interspeech, pages 2727-2730, 2008.
.
[ .pdf | abstract]
This paper gives a detailed description of a statistical machine translation system developed for the 2008 NIST open MT evaluation. The system is based on the open source toolkit Moses with extensions for language model rescoring in a second pass. Significant improvements were obtained with data selection methods for the language and translation model. An improvement of more than 1 point BLEU on the test set was achieved by a continuous space language model which performs the probability estimation with a neural network. The described system has achieved a very good ranking in the 2008 NIST open MT evaluation.
Holger Schwenk, Yannick Estève, and Sadaf Abdul Rauf.
The LIUM Arabic/English statistical machine
translation system for IWSLT 2008.
In International Workshop on Spoken Language Translation,
pages 63-68, 2008.
.
[ .pdf | abstract]
This paper describes the system developed by the LIUM laboratory for the 2008 IWSLT evaluation. We only participated in the Arabic/English BTEC task. We developed a statistical phrase-based system using the Moses toolkit and SYSTRAN's rule-based translation system to perform a morphological decomposition of the Arabic words. A continuous space language model was deployed to improve the modeling of the target language. Both approaches achieved significant improvements in the BLEU score. The system achieves a score of 49.4 on the test set of the 2008 IWSLT evaluation.
Hélène Bonneau-Maynard, Alexandre Allauzen, Daniel Déchelotte, and Holger
Schwenk.
Combining morphosyntactic enriched representation with
n-best reranking in statistical translation.
In HLT/NAACL workshop on Syntax and Structure in Statistical
Translation, pages 65-71, April 2007.
[ .pdf | abstract]
The purpose of this work is to explore the integration of morphosyntactic information into the translation model itself, by enriching words with their morphosyntactic categories. We investigate word disambiguation using morphosyntactic categories, n-best hypotheses reranking, and the combination of both methods with word or morphosyntactic n-gram language model reranking. Experiments are carried out on the English-to-Spanish translation task. Using the morphosyntactic language model alone does not result in any improvement in performance. However, combining morphosyntactic word disambiguation with a word-based 4-gram language model results in an improvement in the BLEU score of 0.6% on the development set and 0.3% on the test set.
Holger Schwenk, Marta R. Costa-jussà, and José A. R. Fonollosa.
Smooth bilingual n-gram translation.
In Empirical Methods in Natural Language Processing, pages
430-438, 2007.
[ .pdf | abstract]
We address the problem of smoothing translation probabilities in a bilingual N-gram-based statistical machine translation system. It is proposed to project the bilingual tuples onto a continuous space and to estimate the translation probabilities in this representation. A neural network is used to perform the projection and the probability estimation. Smoothing probabilities is most important for tasks with a limited amount of training material. We consider here the BTEC task of the 2006 IWSLT evaluation. Improvements in all official automatic measures are reported when translating from Italian to English. Using a continuous space model for the translation model and the target language model, an improvement of 1.5 BLEU on the test data is observed.
Holger Schwenk, Daniel Déchelotte, Hélène Bonneau-Maynard, and Alexandre
Allauzen.
Modèles statistiques enrichis par la syntaxe pour la
traduction automatique.
In Traitement du Langage Naturel, pages 253-262, 2007.
[ .pdf | abstract]
Phrase-based statistical machine translation is a promising approach. In this paper we present two complementary developments. The first models the target language in a continuous space. The second integrates morpho-syntactic categories into the units handled by the translation model. Both approaches are evaluated on the TC-STAR task. The most interesting results are obtained by combining the two methods.
Daniel Déchelotte, Holger Schwenk, Gilles Adda, and Jean-Luc Gauvain.
Improved machine translation of speech-to-text outputs.
In Interspeech, pages 2441-2444, 2007.
[ .pdf | abstract]
Combining automatic speech recognition and machine translation is frequent in current research programs. This paper first presents several pre-processing steps to limit the performance degradation observed when translating an automatic transcription (as opposed to a manual transcription). Indeed, automatically transcribed speech often differs significantly from the machine translation system's training material, with respect to casing, punctuation and word normalization. The proposed system outperforms the best system at the 2007 TC-STAR evaluation by almost 2 BLEU points. The paper then attempts to determine a criterion characterizing how well an STT system can be translated, but the current experiments could only confirm that lower word error rates lead to better translations.
Holger Schwenk.
Building a statistical machine translation system for
French using the Europarl corpus.
In Second ACL Workshop on Statistical Machine Translation,
pages 189-192, 2007.
[ .pdf | abstract]
This paper describes the development of a statistical machine translation system based on the Moses decoder for the 2007 WMT shared tasks. Several different translation strategies were explored. We also use a statistical language model that is based on a continuous representation of the words in the vocabulary. By these means we expect to take better advantage of the limited amount of training data. Finally, we have investigated the usefulness of a second reference translation of the development data.
Daniel Déchelotte, Holger Schwenk, Hélène Bonneau-Maynard, Alexandre
Allauzen, and Gilles Adda.
A state-of-the-art statistical machine translation system
based on Moses.
In MT Summit, pages 127-133, 2007.
[ .pdf | abstract]
This paper describes a statistical machine translation system based on freely available programs such as Moses. Several new features were added, in particular a two-pass decoding strategy using n-best lists and a continuous space language model that aims at taking better advantage of the limited training data. We also investigated lexical disambiguation methods in the translation model based on POS information. The task considered in this work is the translation of the European Parliament Plenary Sessions between English and Spanish, in the framework of the TC-STAR project. The described systems performed very well in the 2007 TC-STAR evaluation.
Patrik Lambert, Marta R. Costa-jussà, Josep M. Crego, Maxim Khalilov, José
B. Mariño, Rafael E. Banchs, José A.R. Fonollosa, and Holger Schwenk.
The TALP ngram-based SMT system for IWSLT 2007.
In International Workshop on Spoken Language Translation,
pages 169-174, 2007.
[ .pdf | abstract]
This paper describes TALPtuples, the 2007 N-gram-based statistical machine translation system developed at the TALP Research Center of the UPC (Universitat Politècnica de Catalunya) in Barcelona. Emphasis is put on improvements and extensions of the system of previous years. Mainly, these include optimizing alignment parameters as a function of translation metric scores and rescoring with a neural network language model. Results on two translation directions are reported, namely from Arabic and Chinese into English, thoroughly explaining all language-related preprocessing and translation schemes.
Evgeny Matusov, Gregor Leusch, Rafael E. Banchs, Nicola Bertoldi, Daniel
Déchelotte, Marcello Federico, Muntsin Kolss, Young-Suk Lee, José B.
Mariño, Matthias Paulik, Salim Roukos, Holger Schwenk, and Hermann Ney.
System combination for machine translation of spoken and
written language.
IEEE Transactions on Audio, Speech, and Language
Processing, 16(7):1222-1237, 2007.
[ .pdf | abstract]
This article describes an approach for computing a consensus translation from the outputs of multiple machine translation (MT) systems. The consensus translation is computed by weighted majority voting on a confusion network, similarly to the well-established ROVER approach of Fiscus [11] for combining speech recognition hypotheses. To create the confusion network, pairwise word alignments of the original MT hypotheses are learned using an enhanced statistical alignment algorithm that explicitly models word reordering. The context of a whole corpus of automatic translations rather than a single sentence is taken into account in order to achieve high alignment quality. The confusion network is rescored with a special language model, and the consensus translation is extracted as the best path. The proposed system combination approach was evaluated in the framework of the TC-STAR speech translation project. Up to six state-of-the-art statistical phrase-based translation systems from different project partners were combined in the experiments. Significant improvements in translation quality from Spanish to English and English to Spanish in comparison with the best of the individual systems were achieved under official evaluation conditions.
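The voting step described in this abstract can be illustrated with a minimal sketch. It assumes the confusion network has already been built, so the word alignment with reordering and the language model rescoring are left out. The function name `consensus`, the slot representation and all weights are hypothetical and are not taken from the article.

```python
from collections import defaultdict

def consensus(confusion_network, system_weights):
    """Pick, in each slot, the word with the highest total system weight.

    confusion_network: list of slots; each slot maps a system name to the
                       word it proposes ("" marks an empty arc).
    system_weights:    system name -> non-negative weight (toy values).
    """
    output = []
    for slot in confusion_network:
        votes = defaultdict(float)
        for system, word in slot.items():
            votes[word] += system_weights[system]
        # Ties are broken alphabetically so the result is deterministic.
        best = max(sorted(votes), key=votes.get)
        if best:  # an empty arc contributes no word
            output.append(best)
    return " ".join(output)

# Three hypothetical systems, already aligned into three slots.
network = [
    {"A": "the", "B": "the", "C": "a"},
    {"A": "house", "B": "home", "C": "house"},
    {"A": "", "B": "is", "C": ""},
]
weights = {"A": 0.5, "B": 0.3, "C": 0.2}
result = consensus(network, weights)
```

With these toy weights, systems A and C outvote B in the second and third slots. Raising the weight of B changes the outcome, which is why such weights are normally tuned on development data.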
Holger Schwenk, Marta R. Costa-jussà, and José A. R. Fonollosa.
Continuous space language models for the IWSLT 2006 task.
In International Workshop on Spoken Language Translation,
pages 166-173, November 2006.
[ .pdf | abstract]
The language model of the target language plays an important role in statistical machine translation systems. In this work, we propose to use a new statistical language model that is based on a continuous representation of the words in the vocabulary. A neural network is used to perform the projection and the probability estimation. This kind of approach is particularly promising for tasks where only a very limited amount of resources is available, like the BTEC corpus of tourism-related questions. This language model is used in two state-of-the-art statistical machine translation systems that were developed by UPC for the 2006 IWSLT evaluation campaign: a phrase-based and an n-gram-based approach. An experimental evaluation for four different language pairs is provided (translation of Mandarin, Japanese, Arabic and Italian to English). The proposed method achieved improvements in the BLEU score of up to 3 points on the development data and of almost 2 points on the official test data.
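As an illustration of "projection and probability estimation" in a single network, the sketch below runs one forward pass of a generic feed-forward language model: the context words are mapped to continuous vectors, passed through a hidden layer, and turned into a distribution over the next word by a softmax. The layer sizes, the random weights, the omission of bias terms and the name `nnlm_probs` are assumptions made for the example; the configuration used in the paper is not given in the abstract.

```python
import math
import random

def nnlm_probs(context, proj, w_hidden, w_out):
    """Forward pass: context word ids -> distribution over the next word."""
    # 1. Projection: look up and concatenate the vectors of the context words.
    x = [v for word in context for v in proj[word]]
    # 2. Hidden layer with tanh activation (biases omitted for brevity).
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    # 3. Output layer followed by a numerically stable softmax.
    z = [sum(w * hi for w, hi in zip(row, h)) for row in w_out]
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def random_matrix(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
V, D, H, N = 5, 3, 4, 3  # toy vocabulary size, projection size, hidden units, context length
proj = random_matrix(V, D)          # one continuous vector per vocabulary word
w_hidden = random_matrix(H, N * D)  # hidden layer weights
w_out = random_matrix(V, H)         # output layer weights
probs = nnlm_probs([0, 3, 1], proj, w_hidden, w_out)  # P(next word | 3-word context)
```

Because every word receives a non-zero probability for any context, no back-off is needed for unseen n-grams, which is what makes this kind of model attractive when training data is scarce.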
Daniel Déchelotte, Holger Schwenk, and Jean-Luc Gauvain.
Transcription et traduction de débats parlementaires.
In Reconnaissance de Formes et Intelligence Artificielle,
2006.
[ .pdf | abstract]
This paper presents a complete system for the automatic translation of unconstrained speech. A statistical approach is used for both speech recognition and translation. The models, algorithms and optimizations used in the statistical translation system are described in detail. Results are presented for the transcription and translation of European Parliament debates, from English to Spanish and vice versa. They suggest that stochastic translation models are well suited to speech translation, owing to their observed relative robustness to the errors introduced by automatic recognition.
Daniel Déchelotte, Holger Schwenk, and Jean-Luc Gauvain.
The 2006 LIMSI statistical machine translation system for
Tc-Star.
In Tc-Star Speech to Speech Translation Workshop,
Barcelona, pages 25-30, 2006.
[ .pdf | abstract]
This paper presents the LIMSI statistical machine translation system developed for the 2006 Tc-Star evaluation campaign. We describe an A*-decoder that generates translation lattices using a word-based translation model. A lattice is a rich and compact representation of alternative translations that includes the probability scores of all the involved sub-models. These lattices are then used in subsequent processing steps, in particular to perform sentence splitting and joining, maximum BLEU training and to use improved statistical target language models.
Daniel Déchelotte, Holger Schwenk, Jean-Luc Gauvain, Olivier Galibert, and
Lori Lamel.
Investigating translation of parliament speeches.
In IEEE Workshop on Automatic Speech Recognition and
Understanding, pages 116-120, 2005.
[ .pdf | abstract]
This paper reports on recent experiments for speech to text (STT) translation of European Parliamentary speeches. A Spanish speech to English text translation system has been built using data from the TC-STAR European project. The speech recognizer is a state-of-the-art multipass system trained for the Spanish EPPS task and the statistical translation system relies on the IBM-4 model. First, MT results are compared using manual transcriptions and 1-best ASR hypotheses with different word error rates. Then, an n-best interface between the ASR and MT components is investigated to improve the STT process. Derivation of the fundamental equation for machine translation suggests that the source language model is not necessary for STT. This was investigated by using weak source language models and by n-best rescoring adding the acoustic model score only. A significant loss in the BLEU score was observed suggesting that the source language model is needed given the insufficiencies of the translation model. Adding the source language model score in the n-best rescoring process recovers the loss and slightly improves the BLEU score over the 1-best ASR hypothesis. The system achieves a BLEU score of 37.3 with an ASR word error rate of 10% and a BLEU score of 40.5 using the manual transcripts.
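The n-best rescoring contrast described in this abstract (acoustic score only versus acoustic plus source language model score) can be sketched as a weighted sum of log-scores over a list of hypotheses. The score names, the numeric values and the function `rescore` are hypothetical and serve only to show the mechanism, not the system's actual features or weights.

```python
def rescore(nbest, weights):
    """Return the hypothesis with the highest weighted sum of log-scores.

    Scores whose name is missing from `weights` are ignored (weight 0).
    """
    def total(hyp):
        return sum(weights.get(name, 0.0) * value
                   for name, value in hyp["scores"].items())
    return max(nbest, key=total)

# Two toy ASR hypotheses with made-up log-scores.
nbest = [
    {"text": "hypothesis one",
     "scores": {"acoustic": -10.0, "source_lm": -8.0, "translation": -5.0}},
    {"text": "hypothesis two",
     "scores": {"acoustic": -9.0, "source_lm": -12.0, "translation": -5.5}},
]

# Rescoring without the source language model score.
acoustic_only = rescore(nbest, {"acoustic": 1.0, "translation": 1.0})
# Rescoring with the source language model score added.
with_source_lm = rescore(nbest, {"acoustic": 1.0, "source_lm": 1.0, "translation": 1.0})
```

In this toy example the selected hypothesis changes once the source language model score is included, mirroring the kind of difference the experiments measure in BLEU.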
Holger Schwenk.
Efficient training of large neural networks for language
modeling.
In IEEE joint conference on neural networks, pages
3059-3062, 2004.
[ .pdf | abstract]
Recently there has been increasing interest in using neural networks for language modeling. In contrast to the well-known backoff n-gram language models, the neural network approach tries to limit the data sparseness problem by performing the estimation in a continuous space, thereby allowing smooth interpolations. The complexity of training such a model and of calculating one n-gram probability is, however, several orders of magnitude higher than for the backoff models, making the new approach difficult to use in real applications. In this paper several techniques are presented that allow the use of a neural network language model in a large vocabulary speech recognition system, in particular very fast lattice rescoring and efficient training of large neural networks on training corpora of over 10 million words. The described approach achieves significant word error reductions with respect to a carefully tuned 4-gram backoff language model in a state-of-the-art conversational speech recognizer for the DARPA rich transcription evaluations.