Workshop on Hybrid Approaches to Machine Translation (HyTra)

The main paradigms in machine translation (MT), namely the statistical and the rule-based approaches, were long seen as competitors. In recent years, however, there has been a strong trend towards, and growing interest in, combining their underlying methods, including those of example-based MT, which can be seen as a link between the two, sharing properties of both.

The advantages of statistical MT (SMT) are fast development cycles, low cost, robustness, superior lexical selection and relative fluency due to the use of language models. But (pure) statistical MT also has disadvantages: it needs huge amounts of data, which are not available for many language pairs involving under-resourced languages and are unlikely to become available in the foreseeable future. Recent advances in factored morphological models and syntax-based models in SMT indicate that non-statistical symbolic representations and processing models need to have their proper place in MT research and development, and more research is needed to understand how to develop and integrate these non-statistical models most efficiently.

The advantages of rule-based MT are that its rules and representations are geared towards human understanding and can be more easily checked, corrected and exploited for applications outside of machine translation, such as dictionaries, text understanding and dialog systems. But (pure) rule-based MT also has severe disadvantages, among them slow development cycles, high cost, a lack of robustness in the case of incorrect input, and difficulties in making correct choices with respect to ambiguous words, structures, and transfer equivalents.

The translations of statistical systems are often surprisingly good with respect to phrases and short-distance collocations, but they often fail when selectional preferences need to be based on more distant words. In contrast, the output of rule-based systems is often surprisingly good if the parser assigns the correct analysis to a sentence. However, this output usually leaves something to be desired if the correct analysis cannot be computed, or if there is not enough information for selecting the correct target words when translating ambiguous words and structures.

Given this complementarity of statistical and rule-based MT, with their very different strengths and weaknesses, an optimized MT architecture should probably comprise elements of both concepts. The question is which ones, and how to connect them. For example, different types of MT systems can work in parallel and their outputs can be combined. Alternatively, one system can be used for the pre- or post-editing of the input or output of another system. A further option is to closely integrate components from different paradigms.

Most hybrid systems described in the literature have tried to put some analytical abstraction on top of a statistical kernel. As open-source statistical MT systems such as Moses are readily available, this might have been the natural choice for many. However, there is also the alternative of doing it the other way round, i.e. integrating statistical information extracted from corpora into a rule-based system. There are also good reasons for this approach, as a well-designed rule-based system is likely to provide good interfaces allowing the incorporation of statistical knowledge in a systematic way.

The aim of this workshop is to bring together researchers who develop statistical, example-based, or rule-based translation systems and who wish to enhance their systems with elements from the other approaches, and to give them a forum for sharing ideas.

We solicit contributions including but not limited to the following topics: