NLPCS 2010 Abstracts

Full Papers

Paper Nr:	12
Title:	A Language Model for Human-Machine Dialog: The Reversible Semantic Grammar
Authors:	Jérôme Lehuen and Thierry Lemeunier
Abstract:	In this paper we present algorithms for analysis and generation in a human-machine dialog context. The originality of our approach is to base these two algorithms on the same knowledge, which combines both semantic and syntactic aspects. The algorithms are based on a double principle: the correspondence between offers and expectations, and the calculation of a heuristic score. We also present some results obtained by performing an evaluation based on the MEDIA French corpus.

Paper Nr:	13
Title:	From Sentences to Scope Relations and Backward
Authors:	Márton Károly, Judit Kleiber and Gábor Alberti
Abstract:	As we strive for sophisticated machine translation and reliable information extraction, we have launched a subproject pertaining to the revelation of reference and information structure in (Hungarian) declarative sentences. The crucial part of information extraction is a procedure whose input is a sentence, and whose output is an information structure, which is practically a set of possible operator scope orders (acceptance). A similar procedure forms the first half of machine translation, too: we need the information structure of the source-language sentence. Then an opposite procedure should come (generation), whose input is an information structure, and whose output is an intoned word sequence, that is, a sentence in the target language. We can base the procedure of acceptance (in the above sense) upon that of generation, due to the reversibility of Prolog mechanisms. And as our approach to grammar is “totally lexicalist”, the lexical description of verbs is responsible for the order and intonation of words in the generated sentence.

Paper Nr:	16
Title:	Relating Production Units and Alignment Units in Translation Activity Data
Authors:	Michael Carl and Arnt Lykke Jakobsen
Abstract:	The definition and characterisation of Translation Units (TUs) in human translation is controversial and has been described in many different ways. This paper looks at TUs from a translation process perspective: we investigate the sequences of keystrokes which have been typed during translation production and re-define TUs in terms of text production units (PUs). We correlate those units with translation equivalences in the translation product, so-called alignment units (AUs), and compare the translation performance of student and professional translators on a small translation task of 160 words from English into Danish. In contrast to what has frequently been assumed, our data reveals that TUs are rather coarse, as compared to the notion of 'translation atoms', comprising several AUs, and they are particularly coarse for professional translators.

Paper Nr:	17
Title:	Context Accommodation in Human Language Processing
Authors:	Jerry Ball
Abstract:	This paper describes a model of human language processing (HLP) which is incremental and interactive, in concert with prevailing psycholinguistic evidence. To achieve this, the model combines an incremental, serial, pseudo-deterministic processing mechanism, which relies on a non-monotonic mechanism of context accommodation, with an interactive mechanism that uses all available information in parallel to select the best choice at each choice point.

Paper Nr:	22
Title:	A Generic Tool for Creating and Using Multilingual Phrasebooks
Authors:	Michael Zock
Abstract:	To speak fluently is a complex skill. If reaching this goal in one’s mother tongue is already quite a feat, to do so in a foreign language can be overwhelming. One way to overcome the expression problem when going abroad is to use a dictionary or a phrasebook. While neither of them ensures fluency, both of them are useful translation tools. Yet, neither can teach you to speak. We will show here how this can be done in the case of a phrasebook. Being interested in the learning of foreign languages, we have started to build a multilingual phrasebook (English, French, Japanese, Chinese) whose sentence elements, typically words or expressions, are clickable items. This fairly simple feature considerably extends the potential of the resource. Rather than merely learning a list of concrete instances (sentences), the user may also learn the underlying principles (patterns), that is, the generative mechanism capable of quickly producing analogous, but more or less different, sentences. A similar feature may be used to extend the resource, by mining a corpus for sentences built according to the same principle, i.e. based on the same pattern, but this is work for the future. Two of the main goals of this paper are to present a method helping learners to acquire the skill of speaking (patterns augmented with rules), and to allow experts (teachers) to add information either to extend the database or to add a new language. We started from English and Japanese, and very quickly added French and Chinese.

Paper Nr:	23
Title:	Experiments with single-class support vector data descriptions as a tool for vocabulary grounding
Authors:	Aneesh Chauhan and Luís Lopes
Abstract:	This paper explores support vectors as a tool for vocabulary acquisition in robots. The intention is to investigate the language grounding process at the single-word stage. A social language grounding scenario is designed, where a robotic agent is taught the names of objects by a human instructor. The agent grounds the names of these objects by associating them with their respective sensor-based category descriptions. The process of vocabulary acquisition is inherently incremental and open-ended. The vocabulary evolves with time as new words are learned. Therefore any system for grounding vocabulary should be incremental, adaptive and support gradual evolution. A novel learning model based on single-class support vector data descriptions (SVDD), which conforms to these properties, is presented. For robustness and flexibility, a kernel-based implementation of support vectors was used. For this purpose, a sigmoid kernel using histogram pyramid matching has been developed. The support vectors are trained based on an original approach using genetic algorithms. The model is tested over a series of semi-automated experiments and the results are reported.

Paper Nr:	24
Title:	Identifying Multidocument Relations
Authors:	Erick Galani Maziero, Maria Lucia Castro Jorge and Thiago Pardo
Abstract:	The digital world generates an incredible accumulation of information. This results in redundant, complementary, and contradictory information, which may be produced by several sources. Applications such as multidocument summarization and question answering are committed to handling this information and require the identification of relations among the various texts in order to accomplish their tasks. In this paper we first describe an effort to create and annotate a corpus of news texts with multidocument relations from the Cross-document Structure Theory (CST) and then present a machine learning experiment for the automatic identification of some of these relations. We show that our results for both tasks are satisfactory.

Paper Nr:	28
Title:	Learning Sentence Reduction Rules for Brazilian Portuguese
Authors:	Daniel Kawamoto and Thiago Pardo
Abstract:	In this paper we present a method for sentence reduction for summarization purposes. The task is modeled as a machine learning problem, relying on shallow and linguistic features, in order to automatically learn symbolic patterns/rules that produce good sentence reductions. We evaluate our results with Brazilian Portuguese texts and show that we achieve high accuracy and produce better results than the existing solution for this language.

Short Papers

Paper Nr:	8
Title:	X-plain - a Game that Collects Common Sense Propositions
Authors:	Zuzana Neverilová
Abstract:	Common sense knowledge is very important for some NLP tasks, but it is hard to extract from existing linguistic resources. Thus specialized collections of common sense propositions are created. This paper presents one way of building such a collection for the Czech language. We have created a cooperative game in which a computer program plays together with a human. The purpose of the game is to describe a word to the co-player using short sentences. While the human player is expected to use his/her common sense, the computer program uses word sketches. The paper describes the game and its background in detail and discusses the need for motivation and game policy. It also discusses the quality and coverage of the collection.

Paper Nr:	9
Title:	Robust Morphologic Analyzer for Highly Inflected Languages
Authors:	Judith Donayo, Hohendahl Andrés Tomás and Zelasco José Francisco
Abstract:	We present a multilingual robust morphologic tagger and tokenizer for highly inflected languages like Spanish, with efficient spell correction and ‘sound-like’ word inference, obtaining some semantic extraction even on para-synthetic and unknown words. This algorithm combines rules and a statistical best-affix-fit with a language estimator. A rich flag set controls the internal behavior. The system has been designed for efficiency and a low memory footprint, using data structures based on simple available affixing rules. Our system, packed with a Spanish dictionary of 83k lemmas and 5k rules, recognizes 2.2M exact words; the guessing word-space is many times larger.

Paper Nr:	26
Title:	Effects of Comparable Corpora on Cross-Language Information Retrieval
Authors:	Fatiha Sadat
Abstract:	This paper presents an approach to learning bilingual terminology from scarce resources in order to translate and expand terms from the source language to the target language and possibly retrieve documents across languages. A bilingual lexicon extracted from comparable corpora will provide a valuable resource to enrich existing bilingual dictionaries and thesauri. A linear combination involving the bilingual terminology extracted from comparable corpora, readily available bilingual dictionaries and transliteration is proposed for Cross-Language Information Retrieval. An application to the Japanese-English language pair shows that the proposed combination yields better translations and that effective information retrieval can be achieved across languages.