2005 (Volume 15)
Antonio Zampolli and computational linguistics
(Istituto di Linguistica Computazionale – CNR – Pisa – Italy)
Antonio Zampolli was one of the pioneers of Computational Linguistics (CL) and one of the most important and influential figures in our field at the international level. This is even more true in Italy, where the birth, development and affirmation of CL are mainly due to his countless initiatives.
Going through the main stages of his career, I try to highlight some of his numerous ideas and initiatives that have shaped our field and are still influential today; in doing so, I also provide some hints towards a historical overview of Computational Linguistics itself.
Massive multilingual corpus compilation: Acquis Communautaire and totale
(Department of Knowledge Technologies, Jožef Stefan Institute, Ljubljana, Slovenia)
|C. Ignat, B. Pouliquen and R. Steinberger|
(European Commission – Joint Research Centre, Italy)
Large, uniformly encoded collections of texts, or corpora, are an invaluable source of data, not only for linguists but also for Language Technology tools. Especially useful are multilingual parallel corpora, as they enable, e.g., the induction of translation knowledge in the shape of multilingual lexica or full-fledged machine translation models. But parallel corpora, especially large ones, are still scarce and have so far been difficult to acquire. Recently, however, a large new source of parallel texts has become available on the Web, containing EU law texts (the Acquis Communautaire) in all the languages of the current EU and more, i.e. parallel texts in over twenty different languages. The paper discusses the compilation of this text collection into the massively multilingual JRC-Acquis corpus, which is freely available for research use. Next, the text annotation tool "totale", which performs multilingual text tokenisation, tagging and lemmatisation, is presented. The tool implements a simple pipelined architecture which is, for the most part, fully trainable, requiring a word-level syntactically annotated text corpus and, optionally, a morphological lexicon. We describe the MULTEXT-East corpus and lexicons, which have been used to train totale for seven languages, and the application of the tool to the Slovene part of the JRC-Acquis corpus.
keywords: multilingual corpora, EU languages, multilingual linguistic analysis, tokenisation, part-of-speech tagging, lemmatisation
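The pipelined, mostly trainable architecture of totale can be illustrated with a toy sketch; the lookup tables below are invented stand-ins for the models trained on MULTEXT-East resources:

```python
import re

# Hypothetical toy lookup table; the real tool learns its tagging and
# lemmatisation models from annotated corpora such as MULTEXT-East.
TAGS = {"the": "DET", "dogs": "NOUN", "bark": "VERB", ".": "PUNCT"}

def tokenise(text):
    # Naive tokenisation: words and single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)

def tag(tokens):
    # Dictionary lookup with a noun default.
    return [(t, TAGS.get(t.lower(), "NOUN")) for t in tokens]

def lemmatise(tagged):
    # Toy rule: strip a plural "s" from nouns.
    out = []
    for w, pos in tagged:
        lemma = w.lower()
        if pos == "NOUN" and lemma.endswith("s"):
            lemma = lemma[:-1]
        out.append((w, pos, lemma))
    return out

def pipeline(text):
    # Stages are applied strictly in sequence, as in a pipelined architecture.
    return lemmatise(tag(tokenise(text)))

print(pipeline("The dogs bark."))
```

Each stage consumes the previous stage's output, so any single stage can be retrained or swapped without touching the others.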
Building multilingual spoken dialogue systems
(Interactive Systems Laboratories, Universität Karlsruhe, Germany)
Developing multilingual dialogue systems raises several challenges. Compared with monolingual dialogue systems, multiple language resources must be developed and kept compatible with one another. When developing resources for a new language, existing knowledge can be reused to improve the development process, yielding higher quality and faster results. Another challenge is to ensure compatibility between the different language-specific components during maintenance and ongoing development of the system. In this paper we describe how to build a multilingual system and relate our approach to existing work. We describe our experiences with designing multilingual dialogue systems and present methods for multilingual grammar specification, development and maintenance methods for multilingual grammars, and multilingual generation templates. We introduce grammar interfaces, similar to the interface concepts known from object-oriented languages, to improve compatibility between different grammar parts and to simplify grammar development.
keywords: multilingual spoken dialogue design, spoken dialogue management, grammar
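The grammar-interface idea can be sketched in miniature with the interface mechanism of an object-oriented language; the class and rule names below are invented for illustration:

```python
from abc import ABC, abstractmethod

class DateGrammar(ABC):
    """Interface (hypothetical): every language-specific grammar must
    export the same category, here a date-expression rule."""

    @abstractmethod
    def date_rule(self) -> str:
        ...

class EnglishDateGrammar(DateGrammar):
    def date_rule(self):
        return "date -> month day year"

class GermanDateGrammar(DateGrammar):
    def date_rule(self):
        return "date -> day '.' month year"

def check(grammars):
    # A shared dialogue component relies on the interface alone, so any
    # conforming language grammar can be plugged in.
    return all(isinstance(g, DateGrammar) and g.date_rule() for g in grammars)

print(check([EnglishDateGrammar(), GermanDateGrammar()]))
```

As with interfaces in object-oriented languages, a missing rule is caught when the grammar class is instantiated, not at run time deep inside the dialogue system.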
A conceptual ontology for machine translation from/into Polish
(Faculty of Mathematics and Computer Science, Adam Mickiewicz University, Poznań)
(Institute of Linguistics, Adam Mickiewicz University, Poznań)
The paper presents a conceptual ontology that has been developed for the purpose of machine translation from and into Polish. The ontology has been applied in Translatica, a general-domain MT system that currently translates texts between Polish and English and aims at the development of other language pairs involving Polish. The Translatica ontology, designed mainly for disambiguation purposes, contains noun terms. The ontological concepts are applied as semantic values in lexical rules for verbs, adjectives and prepositions. The ontology is based on WordNet. The paper compares the adopted approach to those used in other transfer-based and interlingua-based systems. It also points out and justifies the differences between the Translatica ontology and WordNet.
keywords: machine translation, conceptual ontology, sensus
Analyzing the effect of dimensionality reduction in document categorization for Basque
|A. Zelaia, I. Alegria, O. Arregi, B. Sierra|
(University of the Basque Country, UPV-EHU, Computer Science Faculty, Donostia, Gipuzkoa, Euskal-Herria, Spain)
This paper analyzes the effect that dimensionality reduction techniques have on the text categorization of documents written in Basque. Classification techniques such as Naïve Bayes, Winnow, SVMs and k-NN have been selected. The Singular Value Decomposition (SVD) dimensionality reduction technique, together with lemmatization and noun selection, has been used in our experiments. The results obtained show that the approach combining SVD and k-NN on a lemmatized corpus gives the best accuracy rates of all, by a remarkable margin.
keywords: text categorization, singular value decomposition (SVD), supervised classification.
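As a minimal sketch (with invented term counts, not the paper's Basque data), the winning SVD + k-NN combination can be reproduced with plain numpy:

```python
import numpy as np

# Rows are documents (term-count vectors); the first two belong to class
# "cats", the last two to class "cars". All counts are invented.
X = np.array([[2., 1., 1., 0., 0., 0.],
              [1., 2., 0., 0., 0., 0.],
              [0., 0., 0., 2., 1., 1.],
              [0., 0., 0., 1., 1., 2.]])
labels = ["cats", "cats", "cars", "cars"]

k = 2                                   # number of singular dimensions kept
U, s, Vt = np.linalg.svd(X, full_matrices=False)
docs_k = U[:, :k] * s[:k]               # documents in the reduced space

def classify(query, k_nn=1):
    q = query @ Vt[:k].T                # fold the query into the same space
    sims = docs_k @ q / (np.linalg.norm(docs_k, axis=1) * np.linalg.norm(q))
    nearest = np.argsort(-sims)[:k_nn]  # cosine k-nearest neighbours
    votes = [labels[i] for i in nearest]
    return max(set(votes), key=votes.count)

print(classify(np.array([1., 1., 0., 0., 0., 0.])))  # -> cats
```

Training documents and queries are projected with the same truncated right singular vectors, so the k-NN comparison happens entirely in the reduced space.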
A corpora assisted multilingual thematic dictionary
(Institute of Linguistics, Adam Mickiewicz University, Poznań)
Consulting documented language usage in large corpora has become a fundamental tool in lexicography. The selection and systematization of lexical units are supported by corpus tools providing frequency data and various concordances, as will be illustrated by the practice of the current project of a multilingual thematic dictionary. On-line dictionaries can also provide a richer and more up-to-date vocabulary. The dictionary in progress employs a special structure, based on pragmatic and semantic relations, that aids language learning. Its machine-readable version will be better suited to exploit this potential.
keywords: corpora, semantic frames, computational lexicography, electronic dictionary, CALL
Translation of sentences by analogy principle
(ATR Spoken Language Communication Research Laboratories, Keihanna Science City, Japan)
This paper presents a machine translation method based on the analogy principle, together with a translation experiment and its objective evaluation. We first restate and generalize the originally proposed framework called “translation by the analogy principle”. Then, we build on recent developments in linguistic resources, namely bicorpora, and report on a translation experiment using the method, its objective evaluation, and a comparison with other systems. We conclude by commenting on the characteristics of the method and by putting it into a broader linguistic perspective.
keywords: proportional analogy, machine translation
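The core idea of a proportional analogy A : B :: C : D can be shown with a toy solver (not the paper's algorithm, which handles full proportional analogies between strings); here A and B are assumed to differ only in a suffix:

```python
def solve(a, b, c):
    """Solve a : b :: c : ? under a suffix-replacement assumption."""
    # Longest common prefix of a and b.
    i = 0
    while i < min(len(a), len(b)) and a[i] == b[i]:
        i += 1
    strip, add = a[i:], b[i:]   # transformation: drop `strip`, append `add`
    if not c.endswith(strip):
        return None             # analogy not solvable under this scheme
    return c[: len(c) - len(strip)] + add

print(solve("walk", "walked", "talk"))         # -> talked
print(solve("vouloir", "voulons", "pouvoir"))  # -> pouvons
```

The second example shows why the approach is attractive for translation-like tasks: the same mechanism generalizes an inflection pattern to an unseen word.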
Perspective and aspect in narrative generation
(French Studies, Queen’s University, Kingston, Ontario, Canada)
(School of Computing, Queen’s University, Kingston, Ontario, Canada)
The generation of narrative texts poses particular difficulties with respect to the interplay of description and narration, the recounting and interpretation of events from different perspectives, and the interweaving of dialogue and narration. Starting from previous work within the VINCI natural language generation environment, we show how a model of perspective and aspect proposed in the 1960s by the linguist André Burger allows for a significant enrichment of generated narratives in these three respects.
keywords: natural language generation, French, narrative, aspect, tense
Extracting Brazilian Portuguese noun phrases from texts with TBL
(Departamento de Engenharia de Sistemas, Instituto Militar de Engenharia, Rio de Janeiro, Brazil)
|C. Nogueira dos Santos|
(Departamento de Informática, Rio de Janeiro, Brazil)
This paper describes a Transformation Based Learning (TBL) noun phrase extractor for Portuguese. We discuss the reasons for the variation in performance between experiments with Portuguese and with English, taking special notice of the linguistic differences between the two languages with respect to noun phrases. Romance languages such as Spanish, French and Italian present similar problems and could benefit from the analysis given here.
keywords: computational linguistics, natural language processing, noun phrase chunking, transformation based learning
Experiments on classification of Polish newspaper articles
(DFKI GmbH, Saarbrücken, Germany)
(Polish-Japanese Institute of Information Technology, Warsaw, Poland)
This article reports on experiments on the automatic classification of Polish newspaper articles. In particular, we explore two alternative approaches: one based on the deployment of linguistic features and a second involving purely language-independent character-level n-gram modelling. Extensive evaluation results are presented. Interestingly, the two very different methods exhibit comparable classification performance.
keywords: text classification, n-gram modelling, linguistic-based classification
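The language-independent alternative can be sketched as character-level n-gram profiles compared by cosine similarity; the training snippets and class names below are invented:

```python
from collections import Counter
from math import sqrt

def ngrams(text, n=3):
    # Bag of character trigrams, including spaces.
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine(p, q):
    dot = sum(p[g] * q[g] for g in p)
    return dot / (sqrt(sum(v * v for v in p.values())) *
                  sqrt(sum(v * v for v in q.values())))

# One n-gram profile per class, built from (invented) training text.
profiles = {
    "finance": ngrams("stock market shares rise as investors buy stock"),
    "sport": ngrams("the team scored a late goal to win the football match"),
}

def classify(text):
    grams = ngrams(text)
    return max(profiles, key=lambda c: cosine(grams, profiles[c]))

print(classify("market shares fall"))  # -> finance
```

Because the features are raw character sequences, the same code works unchanged for Polish or any other language, which is exactly the appeal of the n-gram approach.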
Between understanding and translating: a context-sensitive comprehension tool
|G. Prószéky, A. Földes|
(MorphoLogic Budapest, Hungary)
This paper introduces an English-Polish/Polish-English comprehension tool. It is, in fact, a special electronic dictionary which is sensitive to the context of the input words or expressions. The dictionary program provides translations for any piece of text displayed on a computer screen without requiring user interaction. This functionality is provided by a three-layer process: text acquisition from the screen, morpho-syntactic analysis of the context of the selected word, and dictionary lookup. By dividing dictionary entries into smaller pieces and indexing them individually, the program is able to display a restricted set of information that is as relevant to the context as possible. For this purpose, we utilize automatic and semi-automatic XML tools for processing dictionary content. The construction of such an electronic dictionary involves natural language processing at almost every point of operation. Both dictionary entries and user input require linguistic analysis and intelligent pattern-matching techniques in order to identify multi-word expressions in the context of the input. Ongoing research is making the program incorporate more sophisticated language technology: multi-word phrases and sentences are recognized, and translation hints are offered in an intelligent way by a parser/transformer module matching underspecified patterns of different degrees of abstraction.
keywords: comprehension assistant, translation support, intelligent electronic dictionary, context-sensitivity
Paraphrasing for template-based shake and bake generation
|M. Carl, E. Rascu, P. Schmidt|
(Institut für Angewandte Informationsforschung, Saarbrücken, Germany)
In this paper we propose an approach to corpus-based generation in a machine translation framework that is similar to shake & bake. A bag of words is mapped against an automatically induced target language template grammar and a sentence is generated by recursively applying rules that are extracted from the template grammar. We show how we induce a template grammar and how it can be enriched with additional paraphrasing knowledge. We suggest a framework for weighting and training the template grammars and show that the enriched template grammars produce better paraphrases.
keywords: shake and bake generation, template grammar, paraphrasing
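The shake & bake step of ordering a bag of words against a grammar can be illustrated with a toy sketch; the lexicon and the single licensed template below are invented, whereas the paper induces its template grammar from corpora:

```python
from itertools import permutations

# Hypothetical lexicon and template grammar; the paper induces these
# from a corpus rather than writing them by hand.
LEXICON = {"the": "DET", "dog": "N", "barks": "V"}
TEMPLATES = [("DET", "N", "V")]  # licensed surface orders

def generate(bag):
    """Order an unordered bag of words by testing permutations
    against the part-of-speech templates."""
    results = []
    for order in permutations(sorted(bag)):
        if tuple(LEXICON[w] for w in order) in TEMPLATES:
            results.append(" ".join(order))
    return results

print(generate({"barks", "dog", "the"}))  # -> ['the dog barks']
```

The brute-force permutation test stands in for the recursive rule application of the real system; the point is only that generation starts from a bag, not a sequence.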
Across-genres and empirical evaluation of state-of-the-art treebank-style parsers
|V. Rus, C.F. Hempelmann|
(Institute for Intelligent Systems, Department of Computer Science, Department of Psychology, The University of Memphis, Memphis, USA)
This paper evaluates a series of freely available, state-of-the-art parsers on a standard benchmark as well as with respect to a set of data important to measure text cohesion in the context of learning technology. We outline advantages and disadvantages of existing parsing technologies and make recommendations. The performance reported uses traditional measures as well as novel dimensions for parsing evaluation to develop a gold standard for narrative and expository texts. To our knowledge this is the first attempt to evaluate parsers across genres.
keywords: parsing evaluation, bracketing, learning technologies
From Raw Corpus to Word Lattices: Robust Pre-parsing Processing with SXPipe
|B. Sagot, P. Boullier|
(INRIA - Projet Atoll, Domaine de Voluceau, Le Chesnay Cedex, France)
We present a robust, full-featured architecture to preprocess text before parsing. This architecture, called SXPipe, converts raw noisy corpora into word lattices, one per sentence, that can be used as input by a parser. It sequentially includes named-entity recognition, tokenization and sentence-boundary detection, lexicon-aware named-entity recognition, spelling correction, and non-deterministic multi-word processing, re-accentuation and un-/re-capitalization. Though our system currently deals with the French language, almost all components are in fact language-independent, and the others can be straightforwardly adapted to virtually any inflectional language. The output is a sequence of word lattices in which all words are present in the lexicon. The system has been applied on a large scale during a French parsing evaluation campaign and in experiments on parsing large corpora, showing both good efficiency and very satisfying precision and recall.
keywords: raw text processing, named entities recognition, spelling correction, ambiguous tokenization
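The non-deterministic output can be pictured as a word lattice, i.e. a DAG of alternative token paths between string positions; the French multi-word example below is invented:

```python
# Edges of a toy word lattice: (start position, end position, token).
# The ambiguity between a three-token reading and a single multi-word
# unit is kept in the lattice rather than resolved during preprocessing.
EDGES = [
    (0, 1, "pomme"),
    (1, 2, "de"),
    (2, 3, "terre"),
    (0, 3, "pomme_de_terre"),  # multi-word unit spanning the same words
]

def paths(start, goal):
    """Enumerate all token sequences from `start` to `goal`."""
    if start == goal:
        return [[]]
    out = []
    for s, e, tok in EDGES:
        if s == start:
            out += [[tok] + rest for rest in paths(e, goal)]
    return out

print(paths(0, 3))  # two readings of the same surface string
```

The parser downstream then chooses among lattice paths, instead of the preprocessor committing to one tokenization too early.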
Building multilingual terminological lexicons for less widely available languages
|M. Monachini, C. Soria|
(CNR-ILC, Pisa, Italy)
The availability of Linguistic Resources for the development of Human Language Technology applications is nowadays recognized as a critical issue, with both political and economic impact and implications for cultural identity. This paper reports on the experience gained during the INTERA European project in producing multilingual terminological lexicons for less widely available languages, i.e. languages that suffer from poor representation on the net and from scarce computational resources yet are in demand on the market. It discusses the procedure followed within the project, focuses on the problems faced and their impact on the initial goals, presents the modifications these problems made necessary, evaluates the market needs as attested by various surveys, and describes the methodology proposed for the efficient production of multilingual terminological lexicons.
keywords: multilingual terminological resources, resource production methodology, less widely available languages, standards
Sentence extraction using similar words
|Y. Suzuki, F. Fukumoto|
(University of Yamanashi, Kofu, Japan)
In both written and spoken language, we sometimes use different words to describe the same thing. For instance, “candidacy” (rikkouho) and “running in an election” (shutsuba) are used with the same meaning. This makes text classification, event tracking and text summarization difficult. In this paper, we propose a method to accurately extract words which are semantically similar to each other. Using this method, we extracted similar word pairs from newspaper articles. Further, we performed sentence extraction from the newspaper articles using the extracted similar word pairs. We hypothesized that the headline carries the salient information of a newspaper article and that the presence of headline terms in the article can be used to detect salient sentences in news text. By also using words similar to those in the headline, we obtained better results than without them. The results suggest that our method is useful for text summarization.
keywords: sentence extraction, newspaper articles, headline, similar words, respective nearest neighbors (RNNs), text summarization
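The headline-based extraction idea reduces to scoring sentences by headline-term overlap, with similar-word pairs widening the match; all data below are invented:

```python
# Hypothetical similar-word pairs, standing in for the pairs the paper
# extracts automatically: a headline term counts as present if one of
# its similar words appears in the sentence.
SIMILAR = {"candidacy": {"running"}}

def score(sentence, headline):
    """Count headline terms matched in the sentence, directly or via
    a similar word."""
    words = set(sentence.lower().split())
    s = 0
    for term in headline.lower().split():
        if term in words or words & SIMILAR.get(term, set()):
            s += 1
    return s

headline = "mayor announces candidacy"
sentences = [
    "the weather was mild on tuesday",
    "the mayor confirmed he is running in the election",
]
best = max(sentences, key=lambda s: score(s, headline))
print(best)
```

Without the similar-word pair, the second sentence would match only "mayor"; the pair is what lets "running" stand in for "candidacy", which is the paper's point.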
Compressing annotated natural language text
(The Szczecin University, Institute of Information Technology in Management, Szczecin, Poland)
The paper is devoted to the description and evaluation of a new method of compressing linguistically annotated text. A semantically motivated transcoding scheme is proposed in which the text is split into three distinct streams of data. By applying the scheme it is possible to reduce the compressed text length by as much as 67% compared to the initial compression algorithm. An important advantage of the method is the feasibility of processing the text in its compressed form.
keywords: text compression, text transcoding, tagged text, POS tags
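A two-stream variant of the transcoding idea (the paper uses three streams and a dedicated scheme) can be sketched by splitting word/TAG pairs into homogeneous streams before compressing each:

```python
import zlib

tagged = "The/DT dog/NN barks/VBZ ./."

# Split the interleaved word/TAG form into two homogeneous streams;
# homogeneous streams tend to compress better than the mixed form.
pairs = [tok.rsplit("/", 1) for tok in tagged.split()]
words = " ".join(w for w, _ in pairs)
tags = " ".join(t for _, t in pairs)
packed = (zlib.compress(words.encode()), zlib.compress(tags.encode()))

# Lossless reconstruction from the two compressed streams.
w = zlib.decompress(packed[0]).decode().split(" ")
t = zlib.decompress(packed[1]).decode().split(" ")
restored = " ".join(f"{a}/{b}" for a, b in zip(w, t))
assert restored == tagged
```

Keeping the tag stream separate also hints at why compressed-form processing is feasible: a POS-level query only needs to decompress one stream.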
Context-free grammar induction with grammar-based classifier system
(Institute of Computer Engineering, Control and Robotics, Wrocław University of Technology, Wrocław, Poland)
In this paper we deal with the induction of context-free grammars from sample sentences. We present some extensions to the grammar-based classifier system (GCS), which evolves a population of classifiers represented by context-free grammar production rules in Chomsky Normal Form. GCS is a new variant of Learning Classifier Systems but differs from them in covering, matching and representation. We modify the discovery component of GCS and apply the system to inferring context-free languages, from a toy language to a grammar for a large corpus of part-of-speech-tagged natural English. For each task a set of experiments was performed.
keywords: grammar induction, machine learning, learning classifier system, context-free grammar
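Since GCS classifiers are CNF productions, checking whether an evolved grammar accepts a sentence reduces to classic CYK recognition; the hand-written grammar for a^n b^n below only illustrates that recognition step, not the evolutionary learning:

```python
# A CNF grammar for the language a^n b^n, written by hand for
# illustration (GCS would evolve such productions instead).
UNARY = {"a": {"A"}, "b": {"B"}}                 # X -> terminal
BINARY = {("A", "X"): {"S"}, ("A", "B"): {"S"},  # X -> Y Z
          ("S", "B"): {"X"}}

def cyk(sentence, start="S"):
    """CYK recognition: True iff the grammar derives the sentence."""
    n = len(sentence)
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, tok in enumerate(sentence):
        table[i][i] = set(UNARY.get(tok, set()))
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):           # split point inside the span
                for y in table[i][k]:
                    for z in table[k + 1][j]:
                        table[i][j] |= BINARY.get((y, z), set())
    return start in table[0][n - 1]

print(cyk(list("aabb")), cyk(list("abab")))  # -> True False
```

CNF is what makes the cubic-time table filling possible: every span is covered by exactly two sub-spans, matching the binary right-hand sides.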
Reinterpreting DCG for free word order languages
|Z. Vetulani, F. Graliński|
(Adam Mickiewicz University, Poznań, Poland)
DCG rules are context-free rules whose non-terminal symbols allow arbitrary terms as parameters. Because of this, DCG-like formalisms are considered particularly well suited to encoding NL grammars. This observation is, however, only partially true for languages with free (or freer) word order and the possibility of discontinuous constructions, e.g. Slavonic and (some) Germanic languages. Interestingly, a minor formal modification to the DCG formalism makes it flexible enough to cover word-order-related problems. What we explore here is the idea of reinterpreting the concept of the difference list. This implies a non-standard interpretation of DCG rules, in which the ordering of the right-hand-side symbols does not necessarily correspond to the surface (linear) ordering of the corresponding expressions, and the non-terminals may represent discontinuous categories. In this paper we propose a solution in which both the non-standard and the standard interpretations of DCG rules co-exist.
keywords: natural language processing, DCG like formalisms, free word order
A simple CF formalism and free word order
(Adam Mickiewicz University, Poznan, Poland)
The first objective of this paper is to present a simple grammatical formalism named Tree-generating Binary Grammar. The formalism is weakly equivalent to CFGs, yet it is capable of generating various kinds of syntactic trees, including dependency trees. Its strong equivalence to some other grammatical formalisms is discussed. The second objective is to show how some free word order phenomena in Polish can be captured within the proposed formalism.
keywords: grammatical formalisms, parsing, free word order, discontinuity