Statistical Machine Translation: ongoing research

Marta Ruiz Costa-jussà
Barcelona Media Innovation Center
Seminar abstract:
In this talk, we will present our most recent ongoing research in statistical machine translation. First, we will describe a novel approach for introducing source-context information into a phrase-based statistical machine translation system. This approach introduces a feature function inspired by the well-known vector-space model, which is typically used in information retrieval and text mining applications. The feature function aims at improving translation unit selection at decoding time. Significant improvements are shown on an English-Spanish experimental corpus. Second, we will present our experiments on statistical chunking, which allows us to enrich a phrase-based system with novel segmentations. These segmentations are computed using statistical measures such as Log-likelihood, T-score, Chi-squared, Dice, Mutual Information or Gravity-Counts. Experimental results are reported on the French-to-English IWSLT 2010 task, where our system was ranked third out of nine systems. Finally, we will talk about a non-linear semantic mapping procedure implemented for cross-language text matching at the sentence level. The method relies on a non-linear space reduction technique, which is used to construct semantic embeddings of multilingual sentence collections. In the proposed method, an independent embedding is constructed for each language in the multilingual collection, and the similarities among the resulting semantic representations are used for cross-language matching. It is shown that the proposed method outperforms other conventional cross-language information retrieval methods.
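
As a rough illustration of the kind of association measures named in the abstract, the sketch below computes three of them (Dice, mutual information in its pointwise form, and an approximate T-score) for a pair of adjacent words from raw counts. The function names, the textbook formulas chosen, and the counts are assumptions made for illustration only; the abstract does not specify how the speaker's system defines or combines these measures.

```python
import math


def dice(c_xy, c_x, c_y):
    """Dice coefficient: 2 * c(x, y) / (c(x) + c(y))."""
    return 2.0 * c_xy / (c_x + c_y)


def pmi(c_xy, c_x, c_y, n):
    """Pointwise mutual information in bits: log2(p(x, y) / (p(x) * p(y)))."""
    return math.log2((c_xy * n) / (c_x * c_y))


def t_score(c_xy, c_x, c_y, n):
    """Approximate t-score: (observed - expected) / sqrt(observed)."""
    expected = c_x * c_y / n  # co-occurrence count expected under independence
    return (c_xy - expected) / math.sqrt(c_xy)


# Hypothetical counts: the pair occurs 30 times, its words 40 and 50 times,
# in a corpus of 10,000 adjacent word pairs.
c_xy, c_x, c_y, n = 30, 40, 50, 10_000
print("Dice   :", dice(c_xy, c_x, c_y))
print("PMI    :", pmi(c_xy, c_x, c_y, n))
print("T-score:", t_score(c_xy, c_x, c_y, n))
```

A word pair that scores highly on such measures is a plausible candidate for being kept together as one segment, which is one way scores of this kind could feed a chunking step.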