2005 (Volume 15)
In memoriam Maurice Gross
(Institut Gaspard-Monge (IGM), University of Marne-la-Vallée, France)
Maurice Gross (1934-2001) was both a great linguist and a pioneer in natural language processing. This article is written in homage to his memory. Maurice Gross is remembered as a vital man surrounded by a lively group, made up of his laboratory, the LADL, and of his international network, RELEX. He and his group always resolutely steered away from any kind of abstemiousness or abstinence. He selected three of his first collaborators in 1968 from among his long-standing bistrot pals. Judging by their later scientific production, the selection was made with sharp judgment. A convivial atmosphere, picnics, drinks at the lab and other revelries were the hallmarks of his group, though on other occasions he was perceived as a tyrannical father. As a linguist, Maurice Gross contributed to the revival of formal linguistics in the 1960s, and he created and implemented an efficient methodology for descriptive lexicology. As a specialist in natural language processing (NLP), he was also a pioneer of linguistics-based processing.
Chosen digital signal processing procedures for hearing aids
|Adam Dąbrowski, Tomasz Marciniak and Paweł Pawłowski|
(Institute of Control and System Engineering, Poznań University of Technology, Poland)
This paper presents selected aspects of the development of new DSP algorithms for devices for hearing-impaired people. We have concentrated on an advanced test for hearing-loss diagnosis and on real-time directional filtering, both realized using a digital signal processor (DSP). The analyzed algorithms are implemented using specially designed experimental DSP modules.
keywords: hearing aids, directional filtering, cochlea dead regions, digital signal processors.
Extracting subcategorisation information from Maurice Gross' grammar lexicon
(University Nancy 2/LORIA, France)
Maurice Gross' grammar lexicon contains rich and exhaustive information about the morphosyntactic and semantic properties of French syntactic functors (verbs, adjectives, nouns). Yet its use within natural language processing systems is hampered both by its non-standard encoding and by a structure that is partly implicit and partly underspecified. In this paper, we present a method for translating this information into a format more amenable to use by NLP systems, discuss the results obtained so far, compare our approach with related work and identify possible further uses of the reformatted information.
keywords: computational linguistics, syntactic lexicon.
Accessing language specific linguistic information for triphone model generation: feature tables in a speech recognition system
|Supphanat Kanokphara, Anja Geumann and Julie Carson-Berndsen|
(School of Computer Science and Informatics, University College, Dublin, Ireland)
This paper is concerned with a method for generating phonetic questions used in tree-based state tying for speech recognition. In order to implement a speech recognition system, language-dependent knowledge which goes beyond annotated material is usually required. The approach presented here generates phonetic questions for decision trees based on a feature table that summarizes the articulatory characteristics of each sound. On the one hand, this method allows better language-specific triphone models to be defined given only a feature table as linguistic input. On the other hand, the feature-table approach facilitates efficient definition of triphone models for other languages, since again only a feature table for the language in question is required. The approach is exemplified with speech recognition systems for English and Thai.
keywords: speech recognition, tree-based state tying, phonetic features.
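The feature-table idea above can be sketched in a few lines: every (feature, value) pair in the table induces one yes/no question over the phone set, usable as a candidate split in decision-tree state tying. The table below is toy data for illustration, not the feature tables used in the paper.

```python
# Sketch: derive phonetic questions for tree-based state tying
# from an articulatory feature table (toy data, not the paper's table).

FEATURE_TABLE = {
    "p": {"voiced": False, "manner": "stop",      "place": "bilabial"},
    "b": {"voiced": True,  "manner": "stop",      "place": "bilabial"},
    "s": {"voiced": False, "manner": "fricative", "place": "alveolar"},
    "z": {"voiced": True,  "manner": "fricative", "place": "alveolar"},
    "m": {"voiced": True,  "manner": "nasal",     "place": "bilabial"},
}

def phonetic_questions(table):
    """Group phones by each (feature, value) pair; every group
    becomes one yes/no question usable as a decision-tree split."""
    questions = {}
    for phone, feats in table.items():
        for feat, value in feats.items():
            questions.setdefault((feat, value), set()).add(phone)
    return questions

questions = phonetic_questions(FEATURE_TABLE)
# e.g. the question "is the phone voiced?" covers the set {b, z, m}
```

With a richer feature table the same loop yields the full question inventory for a new language at no extra cost, which is the portability argument made in the abstract.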
Design and implementation of a morphology-based spellchecker for Marathi, an Indian language
|Veena Dixit, Satish Dethe and Rushikesh K. Joshi|
(Department of Computer Science and Engineering, Indian Institute of Technology, India)
Morphological analysis is a core component of language technology for Indian languages. Complexities involved in spellchecking documents in Marathi, an Indian language, are described. Issues of both orthography and morphology are discussed. We have applied morphological analysis to a large number of words of different parts of speech. A spellchecker based on this analysis has been developed. The architecture of the spellchecker and the spell-checking algorithm based on morphological rules are outlined.
keywords: morphological analysis, rules of orthography, spellchecker, Indian languages, Marathi language.
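The core decision step of a morphology-based spellchecker can be sketched as follows: a word is accepted if some licensed suffix can be stripped to leave a root found in the lexicon. The roots and suffixes below are hypothetical Latin-transliterated stand-ins; the real system works on Devanagari with Marathi-specific orthographic rules.

```python
# Sketch of a morphology-based spellcheck step: a word is valid
# if stripping a known suffix leaves a root in the lexicon.
# Toy transliterated data, labeled hypothetical.

ROOTS = {"ghar", "pustak"}           # hypothetical root lexicon
SUFFIXES = ["", "at", "ala", "ane"]  # hypothetical inflectional suffixes

def is_valid(word):
    """Accept `word` iff it decomposes as lexicon root + licensed suffix."""
    return any(
        word.endswith(s) and word[:len(word) - len(s)] in ROOTS
        for s in SUFFIXES
    )
```

A full spellchecker would add orthographic normalization before lookup and suggestion generation after a rejection, but the accept/reject core is this decomposition test.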
Regular derivation and synonymy in an e-dictionary of Serbian
(Faculty of Mathematics, University of Belgrade, Serbia and Montenegro)
(Faculty of Philology, University of Belgrade, Serbia and Montenegro)
In this paper we explore the relation between derivational morphology and synonymy in connection with an electronic dictionary, inspired by the work of Maurice Gross. The characteristics of this relation are illustrated by derivation in Serbian, which produces new lemmas with predictable meaning. We call this regular derivation. We then demonstrate how this kind of derivation is handled in text processing using a morphological e-dictionary of Serbian and a collection of transducers with lexical constraints. Finally, we analyze the cases of synonymy that include regular derivation in one aligned text.
keywords: electronic dictionary, morphology, derivation, lexicography.
Lexicon management and standard formats
(Institut Gaspard-Monge (IGM), University of Marne-la-Vallée, France)
International standards for lexicon formats are in preparation. To a certain extent, the proposed formats converge with prior results of standardization projects. However, their adequacy for (i) lexicon management and (ii) lexicon-driven applications has been little debated in the past, nor is it debated as part of the present standardization effort. We examine these issues. IGM has developed XML formats compatible with the emerging international standards, and we report experimental results on large-coverage lexicons.
keywords: language resource, lexicon management, standardization, inflection, morphology.
Semantic annotation of hierarchical taxonomies
(Institute of Computing Science, Poznań University of Technology, Poland)
Hierarchical taxonomies (HTs) like Web directories or marketplace catalogs are used to organize documents in a way that helps users retrieve them. However, using HTs in tasks like automatic schema mapping or automatic service discovery requires the semantics of HTs, hidden in their structures and labels, to be made explicit. The goal of this paper is, given a hierarchical taxonomy, to return an interpretation of each node in terms of a logical formula built from word senses. In the proposed methodology we rely on a linguistic repository, WordNet, and we use Bayesian networks as a tool for word sense disambiguation.
keywords: semantic annotation, Bayesian networks, WordNet.
Mapping speech streams to conceptual structures
|Ronny Melz, Chris Biemann, Karsten Böhm, Gerhard Heyer and Fabian Schmidt|
(Institute of Computer Science, University of Leipzig, Germany)
We describe a software system that processes textual data and spoken input streams of natural language and arranges the information in a meaningful way on the screen: concepts as nodes, relations as edges. For spoken input, the software simulates conceptual awareness. A naturally spoken speech stream is converted into a word stream (speech-to-text), then the most significant concepts are extracted and associated with related concepts that have not yet been mentioned by the speaker(s). The result is displayed on a screen as a conceptual structure.
keywords: statistical language processing, co-occurrences, conceptual graphs, speech.
Making shallow look deeper: anaphora and comparisons in medical information extraction
|Agnieszka Mykowiecka, Małgorzata Marciniak and Anna Kupść|
(Institute of Computer Science, Polish Academy of Sciences, Poland)
The paper focuses on resolving natural language issues which have affected the performance of our system for processing Polish medical data. In particular, we address phenomena such as ellipsis, anaphora, comparisons, coordination and negation occurring in mammogram reports. We propose practical data-driven solutions which allow us to improve the system's performance.
keywords: information extraction, shallow parsing, anaphoric expressions.
Validation techniques for parallel feature streams: the case of phoneme identification for speech recognition
|Daniel Aioanei, Moritz Neugebauer and Julie Carson-Berndsen|
(School of Computer Science and Informatics, University College Dublin, Ireland)
This paper presents an approach to the phonetic interpretation of multilinear feature representations of speech utterances combining linguistic knowledge and efficient computational techniques. Multilinear feature representations are processed as intervals and the linguistic knowledge used by the system takes the form of feature implication rules (constraints) represented as subsumption hierarchies which are used to validate each interval. In the case of noisy or underspecified data, the linguistic constraints can be used to enrich the representations. Experiments are also presented to show that the system is logically correct and does not introduce errors in the data, and that it deals with underspecified and noisy input.
keywords: automatic speech recognition, articulatory features.
Semi-unsupervised PP attachment disambiguation for Norwegian
(Department of Linguistics and Scandinavian Studies, University of Oslo, Norway)
Determining the correct attachment site for prepositional phrases is a difficult task for NLP systems. In this work we automatically extract unambiguous PP attachments from a Norwegian corpus and use them for semi-unsupervised training of a memory-based learner on the attachment disambiguation task. The performance of the system is similar to that of a related method which has previously been applied to English, but it obtains this performance level using a simpler and more flexible approach.
keywords: prepositional phrase attachment disambiguation, memory-based learning, Norwegian.
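The memory-based approach described above can be sketched simply: unambiguous (verb, noun1, preposition, noun2) tuples harvested from the corpus are stored as labeled instances, and a new ambiguous tuple is classified by its nearest stored neighbour under a feature-overlap metric. The English examples below are toy stand-ins for the Norwegian corpus data.

```python
# Sketch of memory-based PP-attachment disambiguation: store
# unambiguous (verb, n1, prep, n2) -> attachment examples and
# classify new tuples by nearest-neighbour feature overlap.
# Toy English examples stand in for the Norwegian corpus data.

MEMORY = [
    (("eat", "pizza", "with", "fork"),    "V"),  # PP attaches to verb
    (("eat", "pizza", "with", "anchovy"), "N"),  # PP attaches to noun
    (("see", "man", "with", "telescope"), "V"),
]

def overlap(a, b):
    """Count positions where the two 4-tuples share a feature value."""
    return sum(x == y for x, y in zip(a, b))

def attach(tuple4):
    """Return the attachment label of the most similar stored example."""
    return max(MEMORY, key=lambda ex: overlap(ex[0], tuple4))[1]
```

Real memory-based learners (e.g. TiMBL-style systems) weight features by informativeness rather than treating all four positions equally, but the classification principle is this lookup.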
An extension of Świdziński's grammar of Polish
(PhD student at the Chair of Formal Linguistics, Faculty of Modern Languages, Warsaw University, Poland)
The article presents a series of proposed extensions to Świdziński's Formal Grammar of Polish which were introduced in the course of automated syntax verification of a corpus of Polish expressions.
keywords: formal grammar of Polish, formalisation of natural languages, natural language parsing.
Discourse interpretation based on dynamic constraints
(Wrocław University of Technology, Institute of Applied Informatics, Poland)
Our main objective is to construct a fully compositional representation of nominal anaphora in discourse. The proposed representation does not depend on the remote ascription (i.e. done outside the formal representation) of syntactic indexes which identify anaphoric links. A formal language of variable-free logic is introduced; it is based on the dynamic semantics paradigm and is a variant of many-sorted type logic. We also present a scope-free treatment of quantification in multiple-quantifier sentences. The interpretation of multiple quantifiers is defined by means of a construction of a polyadic Generalised Quantifier (GQ). The polyadic GQ is a constraint that should be satisfied by the denotation of a 'clausal' predicate.
keywords: anaphora, compositionality, dynamic semantics, Generalized Quantifiers, discourse representation, variable free logic, plurality, formal semantics.
Automatic grapheme-to-phoneme conversion for Italian
(Dipartimento di Elettrotecnica ed Elettronica, Politecnico di Bari, Italy)
(Dipartimento di Psicologia, Università di Bari, Italy)
This paper describes two grapheme-to-phoneme conversion systems we implemented for different application domains, namely 1) automatic phonetization and syllabification of Standard Italian pronunciation dictionaries, and 2) production of speech corpora and text-to-speech systems for regional varieties of Italian. The latter system can be considered a revised version of the former: both share the determination of the basic Standard Italian conversion rules as a common core, whereas the main difference lies in the system architecture. Its modularity allows several regional pronunciation models to be added; the present version of the system includes Bari and Naples grapheme-to-phoneme conversion modules.
keywords: automatic grapheme-to-phoneme conversion, speech systems, speech corpora, Italian spoken varieties.
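The rule-based core of such a converter can be sketched as an ordered list of context-sensitive rewrite rules applied longest-match-first. Only a few illustrative Standard Italian rules are shown; the phoneme symbols and rule inventory here are simplified assumptions, not the systems' actual rule sets.

```python
# Sketch of context-sensitive grapheme-to-phoneme conversion:
# a few illustrative Standard Italian rules, applied in order,
# longest grapheme first. Symbols are SAMPA-like simplifications.

def g2p(word):
    rules = [             # (grapheme, following-context, phoneme)
        ("ch", "",   "k"),   # "che", "chi" -> /k/
        ("c",  "ei", "tS"),  # "ce", "ci"   -> /tS/
        ("c",  "",   "k"),   # "c" elsewhere -> /k/
        ("gn", "",   "J"),   # "gn"          -> /J/ (palatal nasal)
    ]
    phones, i = [], 0
    while i < len(word):
        for graph, ctx, phone in rules:
            j = i + len(graph)
            if word.startswith(graph, i) and (
                not ctx or (j < len(word) and word[j] in ctx)
            ):
                phones.append(phone)
                i = j
                break
        else:                       # default: letter maps to itself
            phones.append(word[i])
            i += 1
    return phones
```

Regional variants can then be supported, as the abstract suggests, by swapping in a different rule module while keeping the shared Standard Italian core.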
Design and implementation of nouns in OriNet: based on the semantic word concept
|P.K. Santi, S. Mohanty and K.P. Das Adhikary|
(RC-ILTS-Oriya, Department of Comp. Sc. and Application, Utkal University, India)
|Chinmaya Kumar Swain|
(Department of Computer Science and Engineering, I.T.E.R. Bhubaneswar, Orissa, India)
OriNet (a WordNet for the Oriya language) is an on-line lexical database in which Oriya nouns, verbs, adjectives, adverbs and indeclinables are organized through semantic relations such as synonymy, antonymy, hypernymy, hyponymy, meronymy, holonymy and entailment. The semantic relations of noun concepts in OriNet have been organized following the explanations of WordNet and of Indian logic. In addition, other related information such as "English meaning", "definition", "examples", "syntactic category", "morphology" etc. is provided for each noun concept. The organization is informative and relational, involving the above semantic relations, so that any information regarding a noun concept can be obtained. The system is designed on the basis of a file management system with a user-friendly browser.
keywords: WordNet, OriNet, NNP, Synset, Unique Beginners, Hypernym & Hyponym.
Morphological analyser based on finite state transducer: a case study for Oriya language
|Chinmaya Kumar Swain|
(Department of Computer Science and Engineering, I.T.E.R. Bhubaneswar, Orissa, India)
|Prabhat Kumar Santi and Sanghamitra Mohanty|
(RC-ILTS-Oriya, Department of Comp. Sc. and Application, Utkal University, India)
This paper deals with the effective design and implementation of a morphological analyser of Oriya, a morphologically rich language derived from Sanskrit. Most of the morphemes in Oriya attach to root words in the form of suffixes. Information such as part of speech (PoS), case relation, number, person, tense, aspect and mood is conveyed through morphological attachments to the root of nominal or verbal words. This makes the morphological analysis and generation of Oriya words a challenging task, and an essential tool in areas such as machine translation, parsing, spell checking, OriNet (WordNet for Oriya) and PoS tagging. The paper elucidates a simple and efficient computational model for Oriya morphology based on finite-state transducers.
keywords: finite state transducer, morphological analyser, OriNet, morpheme, machine translation, PM,IM,DM.
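The analysis step such a transducer performs can be illustrated with a minimal stand-in: the word is split into a root found in the lexicon and a suffix licensed by the morphology, and the transducer's output tape carries the gloss plus grammatical tags. The transliterated forms and tags below are hypothetical, not the paper's actual lexicon or rules.

```python
# Minimal stand-in for the transducer lookup in an FST-based
# morphological analyser: every split of the word into lexicon
# root + licensed suffix yields one analysis on the output tape.
# Hypothetical transliterated Oriya-like data.

ROOTS = {"ghara": "house/NOUN", "padh": "read/VERB"}
SUFFIXES = {"ru": "+ABL", "re": "+LOC", "ibi": "+FUT.1SG", "": ""}

def analyse(word):
    """Return all root+suffix segmentations licensed by the lexicon."""
    results = []
    for i in range(len(word) + 1):
        root, suffix = word[:i], word[i:]
        if root in ROOTS and suffix in SUFFIXES:
            results.append(ROOTS[root] + SUFFIXES[suffix])
    return results
```

A compiled FST computes the same relation in a single left-to-right pass over the word, and running it in reverse gives the generator for free, which is why the finite-state formulation is attractive for a suffixing language.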
A formalism for the computational morphology of multi-word units
(Université François Rabelais, Laboratoire d'Informatique, IUT de Blois, France)
Multi-word units (MWUs) are linguistic objects placed on the frontier between morphology and syntax. A reliable computational treatment of their inflectional morphology requires a fine-grained grammar-based approach allowing a description of general large-coverage phenomena as well as of lexicalized counter-rules. We propose a formalism that answers these requirements. Thanks to a graph-based description and a simple unification algorithm, it allows the inflection paradigm of an MWU to be described compactly and exhaustively in terms of the morphology of its component words as well as of some regular-language patterns.
keywords: computational linguistics, multi-word units, inflection, graph-based description, unification.
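The "simple unification algorithm" mentioned above can be sketched for flat feature structures: the features requested for the whole MWU are merged with the constraints attached to each component word, and a clash rules the combination out. The example features are hypothetical illustrations, not the formalism's actual notation.

```python
# Sketch of flat feature-structure unification as used when
# inflecting a multi-word unit: the request for the whole unit
# must be compatible with each component's fixed features.

def unify(fs1, fs2):
    """Unify two flat feature structures; return None on a clash."""
    out = dict(fs1)
    for feat, val in fs2.items():
        if feat in out and out[feat] != val:
            return None                 # feature clash: no inflected form
        out[feat] = val
    return out

# e.g. pluralizing a hypothetical feminine MWU: the plural request
# unifies with a component fixed as feminine, but not with one
# lexically fixed as singular (an invariable component).
```

In the graph-based formalism each path through the inflection graph carries such constraints, so enumerating paths whose unifications succeed yields exactly the MWU's paradigm.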
Transcription-based automatic segmentation of speech
|Marcin Szymański and Stefan Grocholewski|
(Poznań University of Technology, Institute of Computing Science, Poland)
An important element of today's speech systems is a set of recorded wavefiles annotated with a sequence of phonemes and boundary time-points. The manual segmentation of speech is a very laborious task, hence the need for automatic segmentation algorithms. However, manual segmentation still outperforms automatic segmentation, and at the same time the quality of the resulting synthetic voice depends highly on the accuracy of the phonetic segmentation. This paper describes our methodology and implementation of automatic speech segmentation, emphasizing its new elements.
keywords: speech segmentation, speech synthesis, unit selection.
An efficient implementation of a large grammar of Polish
(Instytut Podstaw Informatyki, PAN, Poland)
The paper presents a parser implementing Marek Świdziński's formal grammar of the Polish language. The grammar was written by a linguist, without a computational implementation in mind, unlike most formal grammars. The aim of the work reported here is to examine what the exact set of sentences accepted by the grammar is and what structures are assigned to them. For that reason, it was crucial to remain as close to the original grammar as possible. The resulting program, named Świgra, goes far beyond a toy parser thanks to its use of a morphological analyser and the broad range of linguistic phenomena covered by Świdziński's grammar. Preliminary results of the parser's evaluation are also provided in this article.
keywords: surface syntax of Polish, morphological ambiguity, DCG parsing.
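The DCG style of parsing named in the keywords can be illustrated in miniature: each rule consumes a prefix of the input and passes the remainder on, mirroring Prolog DCG semantics. The toy English-like grammar below is only an illustration of the mechanism and has nothing like the coverage of Świdziński's grammar.

```python
# Sketch of DCG-style parsing: deriving a symbol from the input
# yields every possible remaining suffix; a sentence is accepted
# if the start symbol can consume the whole input. Toy grammar.

GRAMMAR = {
    "S":  [["NP", "VP"]],
    "NP": [["det", "noun"], ["noun"]],
    "VP": [["verb", "NP"], ["verb"]],
}
LEXICON = {"det": {"the"}, "noun": {"dog", "cat"}, "verb": {"sees"}}

def parse(sym, words):
    """Yield every suffix of `words` left after deriving `sym`."""
    if sym in LEXICON:
        if words and words[0] in LEXICON[sym]:
            yield words[1:]
        return
    for rhs in GRAMMAR[sym]:
        rests = [words]
        for part in rhs:               # thread the remainder through rhs
            rests = [r2 for r in rests for r2 in parse(part, r)]
        yield from rests

def accepts(sentence):
    return any(rest == [] for rest in parse("S", sentence.split()))
```

Świgra additionally resolves the morphological ambiguity of each word form before parsing and builds structure trees rather than a bare accept/reject answer, but the rule-threading mechanism is the same.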
Automatic recognition of signed Polish expressions
|Tomasz Kapuściński and Marian Wysocki|
(Computer and Control Engineering Chair, Rzeszow University of Technology, Poland)
The paper considers recognition of single sentences of Polish Sign Language. We use a canonical stereo system that observes the signer from a frontal view. Feature vectors take into account information about the hand shape and orientation, as well as the 3D position of the hand with respect to the face. Recognition, based on human skin detection and hidden Markov models (HMMs), is performed online. We focus on 35 sentences and a 101-word vocabulary that can be used at the doctor's and at the post office. Details of the solution and results of experiments with regular and parallel HMMs are given.
keywords: gesture recognition, sign language, computer vision, hidden Markov models.
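The decoding step at the heart of HMM-based recognition is Viterbi search for the most probable hidden state path given an observation sequence. The sketch below uses a toy discrete model (hypothetical "rest"/"move" states emitting quantized hand positions), not the vision front-end, continuous densities or parallel HMMs of the paper.

```python
# Sketch of Viterbi decoding for a discrete-observation HMM:
# toy two-state model with hypothetical states and observations.

def viterbi(obs, states, start, trans, emit):
    """Return the most probable state path for observation list `obs`."""
    # best (path, probability) ending in each state after the 1st symbol
    best = {s: ([s], start[s] * emit[s][obs[0]]) for s in states}
    for o in obs[1:]:
        best = {
            s: max(
                ((p + [s], prob * trans[p[-1]][s] * emit[s][o])
                 for p, prob in best.values()),
                key=lambda t: t[1],
            )
            for s in states
        }
    return max(best.values(), key=lambda t: t[1])[0]

STATES = ["rest", "move"]
START = {"rest": 0.8, "move": 0.2}
TRANS = {"rest": {"rest": 0.7, "move": 0.3},
         "move": {"rest": 0.3, "move": 0.7}}
EMIT = {"rest": {"low": 0.9, "high": 0.1},
        "move": {"low": 0.2, "high": 0.8}}
```

In a recognizer, one such model is trained per sign (or per sentence), and the model whose best path scores highest wins; parallel HMMs extend this by decoding several feature streams, e.g. shape and position, independently.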