View: The creative human
Information Systems
Working Group: Language Technology

Description of the group:

"Language technology incorporates broad research areas like informatics, linguistics, phonetics, signal processing, with important contributions from mathematics and statistics, among others. The basic research questions in language technology concern how to enable computers to 'understand' natural language and behave as if they do in relevant situations. Examples are automatic speech recognition, speech production, dialogue systems with natural language front ends, semantically based information extraction, and translation of text and speech between natural languages. Computer and telephony services based on language technology are increasing within the domains mentioned above; even many fundamental research questions are far from being solved. The working group will try to identify and describe directions in future information services involving natural language, from business environments to entertainment settings."

Position paper

Language Technology Towards 2020

 

Torbjørn Nordgård, Department of Language and Communication Studies, NTNU Trondheim

Torbjørn Svendsen, Department of Electronics and Telecommunications, NTNU Trondheim

Erik Harborg, SINTEF Information and Communication Technology

Knut Kvale, Telenor Research and Development

 

1         What is Language Technology?

A broad definition of language technology could be the branch of information technology which deals with natural language information. This information is typically conveyed in two forms – written or spoken. For this reason we might define two main branches of language technology: speech technology and textual language technology, although the latter is integrated in many speech applications. Today we find language technology in many products: spell checkers and grammar checkers in standard office suites, word prediction in cellular phones, dialogue systems between humans and machines (for instance automatic reservation via telephone), rough machine translation in internet browsers, and automatic dictation (conversion of speech to text). These technologies have evolved into commercial products over the last 10 to 20 years. In the years to come we expect applications like search in speech databases (e.g. radio programs) and automatic transcription of colloquial speech and TV programs, among others.

 

2         Textual Language Technology

The total amount of available text for all purposes is increasing extremely rapidly. Text production is increasingly required for documentation purposes – in businesses as well as in governmental domains, for instance within the European Union. Various aids for text production have come to the market in the last 15 years, in particular text processing tools with “enabling” facilities like spelling and grammar checkers. Automatic dictation systems (speech converted to text) are expected to improve in the years to come, due to both improved hardware and improved language resources, at least for the language communities which give priority to the use of native languages in technological environments. Thus the possibilities for producing text become better and better.

            This state of affairs leads to the well-known problem of sorting out the information relevant to some particular task from the huge amount of irrelevant information, and of discarding relevant information of bad quality. This problem is of course closely tied to the content of the documents in the search domain. Content analysis thus becomes increasingly important in the areas of information management and so-called data mining. Content extraction based on more or less simple word counts is the current state of the art: words are counted and often compared to normal word distributions, and unexpected distributions will in general give reasonably good clues to document content. In theory, precision can be much better if reliable information about documents can be used to classify them, cf. initiatives within work on the so-called semantic web. The problem for the word count approach is basically computational power when the amount of text is huge, for instance the entire World Wide Web. Given Moore's law [10] we can expect that conventional indexing can be performed on ordinary PCs some years from now, but more extensive content analysis algorithms can also be put to use, for example document indexing based on hierarchical semantic or logical structure. Resources like WordNet [11] and EuroWordNet [12] are important in this context. Entire document collections can be parsed and assigned sophisticated semantic content labels, and we can imagine search engines being able to perform logical deductions over their own updated formal models of the real world, as it manifests itself in written sources.
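
The word count approach can be illustrated with a minimal sketch in Python. It assumes that a "normal word distribution" is approximated by word frequencies in some background text, and it scores each word by the smoothed log ratio of its relative frequency in the document to its relative frequency in the background. The names `tokenize` and `keyness` and the add-one smoothing are our own illustrative choices, not a description of any particular product.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-zæøå]+", text.lower())


def keyness(document, background, top_n=3):
    """Rank document words by how unexpectedly frequent they are.

    The score is the log ratio between a word's relative frequency in the
    document and in the background text. Add-one smoothing (an assumption
    of this sketch) keeps unseen background words from dividing by zero.
    """
    doc_counts = Counter(tokenize(document))
    bg_counts = Counter(tokenize(background))
    doc_total = sum(doc_counts.values())
    bg_total = sum(bg_counts.values())
    vocab_size = len(set(doc_counts) | set(bg_counts))

    scores = {}
    for word, count in doc_counts.items():
        p_doc = (count + 1) / (doc_total + vocab_size)
        p_bg = (bg_counts[word] + 1) / (bg_total + vocab_size)
        scores[word] = math.log(p_doc / p_bg)

    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_n]
```

Words that are frequent everywhere, such as function words, receive low scores, while words that are unexpectedly frequent in the document rise to the top of the ranking. Every word of every document has to be counted, which is why the cost of this approach grows with the size of the collection.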

            There are, however, multilingual aspects involved here which are difficult to handle today. Relevant documents might exist in different languages. Queries should thus be translated to other languages if the users so desire, and relevant information in other languages should ideally be translated into the user’s preferred language, perhaps as translated summaries. These ideas are by no means novel – they have been quite clearly formulated in research projects, most notably within the EU. But results so far are poor. High quality products of this type do not exist, and the best commercially available translation system is good old Systran [13], initiated in the early sixties. Unfortunately, the languages available for automatic translation do not include e.g. Norwegian. But the emergence of statistical methods in machine translation paves the way for robust translation between Norwegian and other languages, provided that manually aligned bilingual text corpora are made available for research and industry.
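
To indicate why aligned bilingual corpora matter for statistical translation, the following Python sketch estimates word translation probabilities from sentence-aligned text with the expectation-maximization procedure of IBM Model 1. This is one classic statistical alignment model, chosen here purely for illustration. The three Norwegian-English sentence pairs in the usage note, the function names, and the omission of the usual NULL word are assumptions of the sketch.

```python
from collections import defaultdict


def train_ibm_model1(pairs, iterations=20):
    """Estimate t(target_word | source_word) from sentence-aligned pairs.

    `pairs` is a list of (source_tokens, target_tokens) tuples. This is a
    simplified IBM Model 1 without the NULL source word.
    """
    target_vocab = {word for _, target in pairs for word in target}
    uniform = 1.0 / len(target_vocab)
    t = defaultdict(lambda: uniform)  # keyed by (target_word, source_word)

    for _ in range(iterations):
        count = defaultdict(float)  # expected co-translation counts
        total = defaultdict(float)  # expected counts per source word
        # E-step: distribute each target word over the source words.
        for source, target in pairs:
            for tw in target:
                norm = sum(t[(tw, sw)] for sw in source)
                for sw in source:
                    fraction = t[(tw, sw)] / norm
                    count[(tw, sw)] += fraction
                    total[sw] += fraction
        # M-step: re-estimate the translation probabilities.
        for (tw, sw), c in count.items():
            t[(tw, sw)] = c / total[sw]
    return dict(t)


def best_translation(t, source_word):
    """Return the most probable target word for a source word."""
    candidates = {tw: p for (tw, sw), p in t.items() if sw == source_word}
    return max(candidates, key=candidates.get)
```

Trained on the toy pairs ("jeg leser", "i read"), ("jeg skriver", "i write") and ("du leser", "you read"), the sketch learns that "skriver" is most likely "write", even though "skriver" co-occurs with "i" just as often, because "i" is better explained by "jeg". No dictionary is involved: all knowledge comes from the aligned sentences, which is why the availability of such corpora is decisive.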

            Fifteen years from now we foresee that industry and consumer markets will expect textual applications like the ones hinted at above to be available with Norwegian as a naturally integrated part. Research within language technology in Norway should give priority to the development of formal and in particular statistical methods in textual analysis, with respect to both information management and automatic translation. Another extremely important aspect is the creation of language resources for research and application development, which most naturally appears to be a task for the government; see the conclusions in the report on the Norwegian Language Bank [14].

 

3         Spoken Language Technology

Twenty years ago, automatic speech recognition was basically treated as a signal processing problem, whilst text-to-speech synthesis was very much rule-oriented, based on expert linguistic and acoustic-phonetic knowledge. During the past two decades, paradigm shifts in the technology and a move to much more complex task domains have led spoken language technology to become a multi-disciplinary research area where insights from fields like linguistics, statistics, acoustics, phonetics, computer science and signal processing are required in order to adequately address the problems involved in enabling computers to understand and produce natural spoken language. Continued progress in spoken language technology will depend on exploitation of all available, relevant knowledge sources, making close cross-disciplinary collaboration a necessity.

The past two decades have witnessed a convergence of the methodologies employed for solving the various problems in speech processing, to such an extent that today data-driven statistical modeling techniques constitute the backbone both of systems for recognizing and understanding speech and of systems for producing speech from texts or concepts. This development is expected to continue, and will facilitate better interaction and utilization of knowledge between the different application areas.

The development of the core technologies of automatic speech recognition (ASR) and text-to-speech synthesis (TTS) will to a great extent govern the evolution of speech-enabled systems such as conversational user interfaces, spoken dialogue information systems and multi-modal user interfaces which integrate core speech technologies with e.g. natural language understanding, decision support, data mining etc. Although substantial improvements in the performance of the core technologies are to be expected, it is doubtful whether it will be possible to design systems that are close to human performance outside narrow application domains. HAL 9000, the intelligent computer in the movie 2001: A Space Odyssey, will be at least a few decades delayed.

In ASR, the major obstacles that need to be overcome mainly concern various aspects of robustness. On the acoustical level these include sensitivity to ambient and electrical noise and variations in voice characteristics; on the lexical level, variation in speaking rate and style and pronunciation variations (including non-native accents); and on the language model level, task domain dependency, out-of-vocabulary words and variations in phrasing. As has been witnessed over the past years, hardware development has facilitated the application of ever more powerful models and algorithms, and it is expected that this will contribute to an evolutionary development of the performance of speech recognizers. The inclusion of higher level knowledge sources, e.g. semantics, will most likely be necessary to improve the accuracy of both dictation and advanced spoken dialogue systems, but this will not obviate the need for better modeling at the lower levels. Perhaps the most challenging problem facing speech recognition is the handling of natural, spontaneous speech, where current systems typically exhibit word error rates in excess of 30% (transcription of conversational speech).
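
For readers unfamiliar with the measure, word error rate is conventionally computed as the minimum number of word substitutions, deletions and insertions needed to turn the recognizer output into the reference transcription, divided by the number of words in the reference. The Python sketch below implements this with a standard edit distance table. The function name and the example sentences are our own, and real evaluation tools usually add text normalization that is left out here.

```python
def word_error_rate(reference, hypothesis):
    """Return (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcription must not be empty")

    # dist[i][j] is the edit distance between ref[:i] and hyp[:j].
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i  # delete all remaining reference words
    for j in range(len(hyp) + 1):
        dist[0][j] = j  # insert all remaining hypothesis words

    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = dist[i - 1][j] + 1
            insertion = dist[i][j - 1] + 1
            dist[i][j] = min(substitution, deletion, insertion)

    return dist[len(ref)][len(hyp)] / len(ref)
```

With the reference "i would like a ticket to oslo tomorrow morning please" and the recognizer output "i would like ticket to also tomorrow uh morning please", there is one deletion ("a"), one substitution ("oslo" recognized as "also") and one insertion ("uh"), giving three errors in ten reference words, i.e. a word error rate of 30%. Note that the rate can exceed 100% when the output contains many insertions.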

Text-to-speech (TTS) synthesis has evolved from being rule-based to being more and more data-driven. This was initially the case for the sound generation, but statistical methods are gradually being employed also in the linguistic layers of the synthesis engine. The quality of synthetic speech is evaluated for intelligibility and naturalness, and current state-of-the-art TTS is mainly lacking in naturalness. Improved methods for sound generation and intonation modeling will over the next 15 years lead to speech synthesis engines producing highly intelligible and natural sounding synthetic speech. Among the challenges facing TTS development in the near future are the development of tools for efficient production of new synthetic voices and dialects, investigations into the characteristics and modeling of emotional speech, and methods for improving the flexibility of TTS engines, e.g. with respect to speaking rate and style.

The task of speaker recognition is to identify who is speaking. The sub-domain of speaker verification is the task of verifying a speaker’s claimed or assumed identity, usually for access control. The main issue governing the prospects of success for speaker verification systems will be reliability, in particular against impostors. Impostors can currently be handled fairly reliably, but this may be called into question if future TTS systems provide fast and inexpensive methods for generating high quality synthetic speech from a limited amount of speech data from any given individual.

Telephony-based and networked speech technology applications will be influenced by the projected adoption of the internet protocol (IP) for telecommunications [1]. Packet-oriented speech transmission by voice over IP (VoIP) will introduce artifacts due to speech coding distortion, packet loss and round-trip delay that need to be addressed in future systems. Perceived speech quality is important to user rating of interactive voice response (IVR) systems. IVR is currently the largest application area of speech technology, and annual revenues are expected to grow to 3 billion USD by 2008 [6].

Spoken dialogue systems [2] are inherently dependent on the performance of the core modules of speech recognition and synthesis. However, correct recognition of every word is not necessary for the human/machine interaction as long as key words and phrases are correctly interpreted. Well-designed dialogue strategies, including error handling, are essential for well-functioning dialogue systems. Automated response generation, i.e. the generation of spoken responses from a response concept, will improve the usability and flexibility of spoken dialogue systems.
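
The point that key words matter more than a perfect transcription can be sketched in a few lines of Python. The example below assumes a toy travel-booking domain: the city and day vocabularies, the slot names and the prompt wording are all hypothetical, and a real system would use a far richer interpretation component. The function `interpret` picks out key phrases from a possibly erroneous recognizer output, and `next_prompt` shows a very simple error-handling strategy that asks only for what is still missing.

```python
import re

# Hypothetical key words for a toy travel-booking domain.
CITIES = {"oslo", "bergen", "trondheim"}
DAYS = {"today", "tomorrow"}
SLOT_ORDER = ("origin", "destination", "day")


def interpret(hypothesis):
    """Fill slots from key words, ignoring everything else in the input."""
    words = re.findall(r"[a-zæøå]+", hypothesis.lower())
    slots = {}
    for position, word in enumerate(words):
        previous = words[position - 1] if position > 0 else ""
        if word in CITIES and previous == "from":
            slots["origin"] = word
        elif word in CITIES and previous == "to":
            slots["destination"] = word
        elif word in DAYS:
            slots["day"] = word
    return slots


def next_prompt(slots):
    """Ask for the first missing slot, or confirm when all are filled."""
    for name in SLOT_ORDER:
        if name not in slots:
            return f"Please state the {name}."
    return "Travelling from {origin} to {destination}, {day}. Is that correct?".format(**slots)
```

Given the recognizer output "uh i want two go from trondheim to oslo tomorrow peas", which contains several misrecognized words, `interpret` still recovers the origin, the destination and the day, because none of the errors affect the key phrases. Had the destination been lost, `next_prompt` would ask for it again rather than restart the dialogue.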

Multi-linguality [5] will be an important issue. English may be the lingua franca (sic!) of modern society, but the national languages will prevail. However, foreign names and phrases will need to be understood by a speech recognizer and well pronounced by a speech synthesizer. Language identification is essential for switching a foreign language speaker to the correct language module in a spoken language interface. Another important aspect is the possibility of utilizing existing foreign language resources (speech and text data, statistical models) for improving the performance of speech technology systems in another language.

Multimodal [4] user interfaces integrate different I/O modalities like speech, sound, graphics, text and user gestures such as pen/stylus on touch sensitive screens, eye movements and body movements. Such user interfaces are particularly interesting for small, mobile terminals where the terminal size precludes the use of standard I/O (keyboard, mouse, etc.). Speech recognition and synthesis will be an important part of these user interfaces. Important issues include how to effectively integrate the various modalities such that the different information channels can be synchronized and disambiguated for maximal efficiency in use.

 

4         Successful Language Technology in Norway

A successful language technology infrastructure in Norway in 2020 will provide product types like the ones mentioned above, and an industry which is able to utilize the national and international potential of language technology. Certain groups of disabled people will hopefully find help in language technology products like automatic dictation (blind people, people with dyslexia, and people with serious physical disabilities) and natural-sounding, personalized speech synthesis (people who are unable to speak properly).

But there is at least one major obstacle. Given that statistical methods will continue to dominate speech technology and at the same time will become more and more important in textual language technology, it is extremely important that Norwegian language resources are made available for research and commercial use in the years to come. If not, methods and solutions developed for other languages, in particular English, will predominate, and the pressure to use English instead of Norwegian will grow stronger. Hopefully Norwegian authorities will make sure that the relevant resources are created.

The Norwegian Research Council has intensified research on language technology, most notably via the KUNSTI program [7], which funds two major projects in the areas of machine translation [8] and dialogue modeling with spontaneous speech [9], in addition to a number of smaller projects. Unfortunately, these projects, and most other projects financed by KUNSTI, suffer from the lack of language resources. There are fundamental research issues that need to be resolved in order to improve the general performance of language technology. Some of these issues will be language specific. Continued support of language technology research, maintaining a critical mass of activity at a high international level, is a vital prerequisite for the success of Norwegian language technology. We assume that the KUNSTI program will be continued after the end of the current project period in 2006, hopefully with a better Norwegian language infrastructure.

 

References

 

[1]     R.V. Cox et al.: “Speech and Language Processing for Next-Millennium Communications Services”, Proc. IEEE, vol. 88, no. 8, pp. 1314-1337, August 2000

[2]     A.L. Gorin et al.: “Automated Natural Spoken Dialog”, IEEE Computer, pp. 51-56, April 2002

[3]     R. Rogoff: “Voice Activated GUI: The Next User Interface”, Proc. IEEE International Professional Communication Conference, 2001 (IPCC 2001), pp. 117-121, Oct. 2001

[4]     S. Oviatt: “User-Centered Modeling and Evaluation of Multimodal Interfaces”, Proc. IEEE, vol. 91, no. 9, Sept. 2003

[5]     A. Waibel et al.: “Multilinguality in Speech and Spoken Language Systems”, Proc. IEEE, vol. 88, no. 8, pp. 1297-1313, August 2000

[6]     Datamonitor: “Voice Automation – Past, Present and Future”, White Paper for Intervoice, July 2003

[7]     KUNSTI web-site, http://www.program.forskningsradet.no/kunsti/, accessed March 2004.

[8]     LOGON info at http://www.emmtee.net/, accessed March 2004.

[9]     BRAGE web-site, http://www.tele.ntnu.no/projects/brage/, accessed March 2004.

[10]  Moore's law, http://www.intel.com/research/silicon/mooreslaw.htm, accessed March 2004.

[11]  The Global WordNet Association, http://www.globalwordnet.org/, accessed March 2004.

[12]  EuroWordNet web-site, http://www.illc.uva.nl/EuroWordNet/, accessed March 2004.

[13]  SYSTRAN, http://www.systransoft.com/, accessed March 2004.

[14]  Consolidating and Increasing the Availability of Norwegian Human Language Technology Resources, report October 2002, http://www.sprakrad.no/sbank2.htm, accessed March 2004.




Members of the working group:

Professor Torbjørn Nordgård, Department of Language and Communication Studies
Torbjorn@hf.ntnu.no
Professor Torbjørn Svendsen, Department of Telematics
torbjorn.svendsen@tele.ntnu.no
Professor Wim van Dommelen, Department of Language and Communication Studies
wim.van.dommelen@hf.ntnu.no
Professor Knut Kvale, Telenor Research and Development
Professor Jon Atle Gulla, Department of Computer and Information Science
jon.atle.gulla@idi.ntnu.no
Associate Professor Arild Faxvaag, Department of Neuroscience
arild.faxvaag@medisin.ntnu.no
Researcher dr.ing. Erik Harborg, SINTEF ICT
erik.harborg@sintef.no