Research topics

Multilevel Information: Data Fusion

Combination of Systems:

  • In the domain of translation, our research on system combination focuses on the use of confusion networks (this work resulted in the development of the MANY tool).
  • For automatic transcription, our work on system combination is based on the DDA algorithm, developed jointly with the LIA and IRISA laboratories (as part of the ANR-ASH project).
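
As a rough illustration of the confusion-network idea mentioned above, the sketch below decodes a network by weighted voting. It assumes the system outputs have already been aligned into slots and that each system carries a fixed weight; the slot format, the weights, and the function name are hypothetical and are not taken from MANY, which relies on its own alignment and scoring.

```python
from collections import defaultdict

EPS = ""  # empty arc: the system proposes no word in this slot (assumed convention)


def decode_confusion_network(slots):
    """Pick, in each slot, the word with the highest summed system weight.

    slots: list of aligned positions; each position is a list of
    (word, system_weight) pairs, one pair per system.
    """
    output = []
    for slot in slots:
        scores = defaultdict(float)
        for word, weight in slot:
            scores[word] += weight
        best = max(scores, key=scores.get)
        if best != EPS:  # an empty arc wins: emit nothing for this slot
            output.append(best)
    return output


# Three hypothetical systems with weights 0.5, 0.3 and 0.2.
slots = [
    [("the", 0.5), ("the", 0.3), ("a", 0.2)],
    [("cat", 0.5), ("cat", 0.3), ("hat", 0.2)],
    [(EPS, 0.5), ("quickly", 0.3), (EPS, 0.2)],
    [("sat", 0.5), ("sits", 0.3), ("sat", 0.2)],
]
print(decode_confusion_network(slots))
```

In a real combination system the vote would typically be complemented by a language model score, and the hard part is the alignment step that this sketch takes as given.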

Combination of Information, applied to the detection of spontaneous speech segments: the detection relies on acoustic and linguistic cues extracted during the automatic transcription process.

Co-operation Between Transcription and Diarization: person names found in the transcription can help identify speakers. The names are automatically associated with the speaker labels provided by the diarization process.

This work requires a good transcription of proper nouns. However, proper nouns are hard to detect and transcribe. We developed an automatic phonetic transcription method that combines the results of an acoustic-phonetic decoding with the results of a more traditional grapheme-to-phoneme (G2P) system.

Integration of Language Knowledge:

  • In transcription, our objective is to improve the results of our semantic interpretation system using various levels of transcription outputs. This work is part of the development of our dialog system.
  • In translation, our work focused on the SPE (Statistical Post-Editing) technique in collaboration with the company SYSTRAN.

Automatic Translation

Statistical MT systems require a large volume of parallel texts suited to the application in order to learn a powerful translation model. However, such large corpora are not available for many language pairs. Even for languages with rich resources, it is hard to find corpora that match the scope of many applications. We tackled this problem in several ways:

Comparable corpora: a comparable corpus is a collection of texts in different languages which are not translations of each other but do share common topics. Such corpora are of course much easier to obtain than parallel texts. We are working on developing algorithms to automatically extract a parallel corpus from a comparable corpus.
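
To make the extraction task concrete, here is a minimal sketch of one generic way to mine sentence pairs from a comparable corpus: score each candidate pair by lexical overlap through a bilingual lexicon and keep the best match above a threshold. This is an illustration only, not the algorithm developed in this work; the lexicon, the threshold, and the function names are assumptions.

```python
def overlap_score(src_tokens, tgt_tokens, lexicon):
    """Fraction of source words with a lexicon translation in the target,
    normalised by the longer sentence to penalise length mismatch."""
    if not src_tokens or not tgt_tokens:
        return 0.0
    tgt_set = set(tgt_tokens)
    covered = sum(1 for w in src_tokens if lexicon.get(w, set()) & tgt_set)
    return covered / max(len(src_tokens), len(tgt_tokens))


def extract_parallel(src_sents, tgt_sents, lexicon, threshold=0.5):
    """For each source sentence, keep its best-scoring target sentence
    if the score reaches the (assumed) threshold."""
    pairs = []
    for s in src_sents:
        s_tok = s.lower().split()
        best, best_score = None, 0.0
        for t in tgt_sents:
            score = overlap_score(s_tok, t.lower().split(), lexicon)
            if score > best_score:
                best, best_score = t, score
        if best is not None and best_score >= threshold:
            pairs.append((s, best, best_score))
    return pairs


# Toy French-English lexicon and comparable "corpus" (hypothetical data).
lexicon = {
    "le": {"the"},
    "chat": {"cat"},
    "dort": {"sleeps"},
    "parlement": {"parliament"},
    "vote": {"vote", "votes"},
}
src = ["le chat dort", "le parlement vote"]
tgt = ["the parliament votes today", "the cat sleeps", "stocks fell sharply"]

pairs = extract_parallel(src, tgt, lexicon)
for s, t, score in pairs:
    print(f"{score:.2f}\t{s}\t{t}")
```

The exhaustive pairwise comparison is quadratic in the number of sentences, so a realistic pipeline would first narrow the candidates, for example by matching documents on topic or date.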

Adaptation of translation models through unsupervised training: current parallel corpora consist primarily of one kind of document: reports of the European and Canadian Parliaments or the United Nations. This is problematic when an automatic translation system trained on these data must be deployed in other fields, since the vocabulary and syntax of parliamentary reports carry over poorly to other linguistic contexts. We are developing new approaches to adapt the translation model to a new domain by using only monolingual data in the source language.

Conversational Speech Processing

The LIUM transcription system yields good results for prepared speech such as broadcast news, but the results are much less satisfactory for conversational speech, such as debates. Using our spontaneous speech detector, we are developing adaptation strategies dedicated to conversational speech processing.

Transcription/Translation Synergy

Our research in speech transcription and translation is cross-cutting: several approaches are shared between the two domains. For example, we use the same language models, and our automatic phonetic transcription method is based on translation tools.

Modeling in Continuous Space: N-gram language models play a very important role in both the transcription and the translation systems. We developed an approach based on continuous-space modeling. This technique, initially developed for speech transcription, was successfully carried over to translation. The tool is now freely available (see the LIUM tools section of this web site).
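
For readers unfamiliar with continuous-space language models, the sketch below shows the forward pass of a small feed-forward network that maps context words to embeddings and outputs a distribution over the next word. It is a generic illustration with random, untrained weights; the layer sizes, the absence of bias terms, and all names are assumptions and do not describe the LIUM tool.

```python
import math
import random


def make_model(vocab_size, embed_dim, context_size, hidden_dim, seed=0):
    """Create random (untrained) weights for a toy feed-forward LM."""
    rng = random.Random(seed)

    def matrix(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    return {
        "embed": matrix(vocab_size, embed_dim),            # one vector per word
        "hidden": matrix(hidden_dim, context_size * embed_dim),
        "output": matrix(vocab_size, hidden_dim),
    }


def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def next_word_probs(model, context_ids):
    """Return P(next word | context) over the whole vocabulary."""
    # Projection layer: concatenate the embeddings of the context words.
    x = [value for i in context_ids for value in model["embed"][i]]
    h = [math.tanh(a) for a in matvec(model["hidden"], x)]
    logits = matvec(model["output"], h)
    # Numerically stable softmax.
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


vocab = ["<s>", "the", "cat", "sat", "</s>"]
model = make_model(len(vocab), embed_dim=4, context_size=2, hidden_dim=8)
probs = next_word_probs(model, [vocab.index("<s>"), vocab.index("the")])
for word, p in zip(vocab, probs):
    print(f"{word}\t{p:.3f}")
```

Because words are represented as points in a continuous space, contexts never seen in training still receive a smooth probability estimate, which is the main appeal over count-based n-gram models.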

Speech Translation: since 2008, we have been working on the translation of speech, with translation integrated directly into the speech decoder.