Workshop on Creating Cross-language Resources for Disconnected Languages and Styles


This half-day workshop aims to develop strategies and share experiences on creating resources that reduce the linguistic gap between language pairs for which cross-language resources are scarce. Although this situation has most commonly been addressed for minority languages, which have scarce resources of their own, it is also an important issue in other situations. One is the case of majority languages that, because of their cultural, historical and/or geographical disconnection, do not have a significant amount of cross-language resources between them (Chinese and Spanish are an excellent example in this category). Another is the case of single languages in which new communication trends and styles lack cross-language resources linking them to the main formal language (for example, chat speak communications and their formal-language counterparts).

Current computational and data storage capabilities have favoured the proliferation of data-driven and statistical approaches in natural language processing and computational linguistics. Empirical evidence has demonstrated, in a large number of cases and applications, how the availability of appropriate datasets can boost the performance of processing methods and analysis techniques. In this scenario, the availability of data has come to play a fundamental role. On the other hand, both the diversity of languages and the emergence of new communication media and stylistic trends are responsible for the scarcity of resources for some specific tasks and applications. Accordingly, this workshop focuses on those specific applications or cases for which data scarcity poses a restrictive problem for data-driven approaches. This includes the following three specific situations:

Minority Languages, for which scarcity of resources is a consequence of the minority nature of the language itself. In this case, attention is focused on the development of both monolingual and cross-lingual resources. Examples in this category include Basque, Pashto and Haitian Creole, to mention just a few.

Disconnected Languages, for which large amounts of monolingual resources are available but, due to cultural, historical and/or geographical reasons, cross-language resources are scarce. Examples in this category include language pairs such as Chinese and Spanish, Russian and Portuguese, and Arabic and Japanese, to mention just a few.

New Language Styles, which represent different communication forms or emerging stylistic trends for which the resources available for the language are practically useless. This case includes the typical examples of tweets and chat speak communications, as well as other informal forms of communication, in many languages.