Joseph Norment Bell
University of Bergen
University of Helsinki
University of Bergen
Istituto Universitario Orientale, Napoli
Studies of non-European languages (NEL) have a long history at European universities and received some attention in colonial times. One of their initial purposes was to offer a working knowledge of a language and culture for personal or professional communication, often in diplomacy, commerce, and defense. In the middle of the 20th century, they addressed a wider audience, including industry and finance. At the same time, many linguists took a theoretical interest in the vast number of languages which are radically different from the European ones.
As our societies have evolved, NEL studies today serve various additional purposes, which in the past were given little thought, and their scope has expanded drastically into socio-cultural matters. This is due to globalisation, and to migration in particular. Managing our multicultural and multilingual societies has indeed become a crucial necessity. More social aspects concerning immigrant populations are now taken into account, and a few NEL courses for teachers and social workers are provided. Furthermore, development aid to non-European countries has shown that European actors need training in NEL if the inadequacies of past European interventions are to be avoided. Non-governmental organisation workers active in the medical, agricultural, and other sectors create a significant demand in this respect.
In some European countries we already find the recognised figure of the cultural mediator, a role which in some cases has evolved from voluntary work into an officially recognised position in public structures (generally public services, schools, etc.). However, the linguistic competence of the persons in these positions varies widely. It is clear that a cultural mediator should also be a linguistic mediator. Unfortunately, the work of language research and service centres is sometimes difficult to bring to the public, especially where person-to-person situations are involved. There is therefore a need for new professional profiles, in which humanities computing skills and competencies play a substantial part.
Our world has shrunk, and our cultural attitudes have changed. With the spread of information and communication technology, contacts between Europe and other continents have become much cheaper and easier. However, digital media are being made compatible with the languages of non-Western countries at a far slower rate than with European languages. This is a non-trivial factor contributing to an imbalance in international relations. Computer linguists with a special interest in NEL feel that fatalism in this matter is to be avoided and that the EU should be able to contribute towards improving the situation. Through some basic analyses, they exemplify in the remainder of this chapter what is at stake and what measures could or must be taken to shape better, sounder systems of co-operation both within Europe and with non-European universities.
Within ACO*HUM, the SOCRATES thematic network project on Advanced Computing in the Humanities, a special working group for NEL was created in 1997. This group represents the widest and most heterogeneous field in European higher education. Its members, all from different linguistic and computational backgrounds, arrived at consensus views on a number of key issues. Their programme, as NEL teachers and researchers, but also as European citizens of the world, expresses a deep awareness of the changes which are taking place, or should take place, in our understanding and praxis of:
One of the main problems with making strategic plans concerns the authority of those suggesting initiatives. The priorities of African countries are quite different from those of EU countries. The EU sees the world from its own viewpoint and is likely to ensure that its interests are not put aside. This has direct consequences on the development and position of African languages. French, English and Portuguese, all EU languages, dominate Sub-Saharan Africa as official languages, and there are good reasons to believe that EU countries are not unanimous in the policy to be pursued as regards the development and support of native African languages. This is a serious obstacle. Therefore, the universities and research institutes of EU countries should make it clear to the decision making bodies of the EU that the language policies inherited in African countries from colonial times are not acceptable and do not meet the interests of those countries.
The EU should enter into genuine dialogue with African countries and give financial support to programmes aiming at implementing local language policies. Universities and commercial companies of EU countries could have a determining role in the field of computer applications for major African languages, a field which has become central in the monitoring of language policies. Among such applications are basic language analysis tools (morphology, syntax, semantics), which could be used in various task-specific functions (spelling, hyphenating, lemmatizing, constructing lexical and terminological databases, electronic dictionaries, information retrieval, and even machine translation applications). The projects should be carried out in close co-operation with African universities and it is important that the know-how in the field is effectively communicated to all parties concerned.
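To make concrete what a basic language analysis tool of the kind mentioned above looks like at its simplest, the following sketch (our own toy example, not an existing product) segments Swahili nouns by noun-class prefix. The prefix table follows standard Bantu noun-class numbering, but a real analyser would use finite-state technology and a full lexicon.

```python
# Toy morphological analyser for Swahili noun-class prefixes.
# Illustrative only: the prefix table is a small subset, and a real
# analyser would be built as a finite-state transducer with a lexicon.
NOUN_CLASS_PREFIXES = {
    "m": 1,    # e.g. m-toto 'child'
    "wa": 2,   # e.g. wa-toto 'children'
    "mi": 4,   # e.g. mi-ti 'trees'
    "ki": 7,   # e.g. ki-tabu 'book'
    "vi": 8,   # e.g. vi-tabu 'books'
}

def analyse(word: str):
    """Return all (class, prefix, stem) segmentations licensed by the table."""
    return [(cls, prefix, word[len(prefix):])
            for prefix, cls in NOUN_CLASS_PREFIXES.items()
            if word.startswith(prefix) and len(word) > len(prefix)]

print(analyse("watoto"))  # wa- (class 2) + stem 'toto'
```

Even this toy shows why such tools matter for spelling, lemmatizing and lexical databases: the stem, not the surface form, is the unit a dictionary needs, and ambiguous segmentations must be resolved by a fuller grammar.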
Chinese, Japanese and Korean made the most of it, thanks to the economic strength and scientific expertise of the countries where they are spoken. Very little in that direction was done on the European side. The rich countries bordering the Pacific (mainly the USA and Australia) had a different attitude and have acquired a leading position in the Asian language processing industry. The EU seems hesitant to maximise its chances of interacting with this part of the world and is not yet exploiting its know-how for co-operation with Asian partners to any significant degree.
Quite a number of these languages have been studied locally for centuries. But tradition has not stifled the development of more modern approaches, and today a few of the major Asian languages rank among the leading digitised languages world-wide. On the other hand, many other Asian languages have hardly been worked on by computer linguists. Taking into account that the linguistic descriptions provided by native speakers of those languages can be accepted as authoritative (though this is debatable) and that many descriptions are already available, EU universities could initiate projects aimed at the creation of computer applications for those languages in collaboration with highly motivated Asian university partners.
The benefits for European countries could be substantial as far as teaching and research are concerned. The creation of computer tools would permit better training of European students in particular languages which are understaffed or have never been taught. European teaching tools for Chinese, Japanese and Korean would also have to be produced to meet the high quality demands of today's European training.
Research targeted at these languages could stimulate the field, allowing for fruitful knowledge transfer between European and Asian project teams. As a final result, this would bring European competence in the domain to a level of competitiveness with the US. Already it can be stressed that, since co-operation is a central element of the strategies to be implemented, larger efforts will have to be made to facilitate mobility towards non-European partner countries.
The same applies to archival resources which are in electronic form. The scarce electronic archives on Africa are accumulated and maintained by universities, commercial efforts in this field being non-existent or unknown. Obviously, companies do not see any commercial benefits. Generally, resources accumulated and maintained by universities, some of which are in Europe, are made available to researchers only. The access to resources is arranged through the Internet, whereby permission is given for using the material, but copying the source material is prohibited.
In recent years, sizeable European programmes have been initiated, a few of them associating university and industrial partners. In Germany, for example, CITAL (Centre for International Terminology and Applied Linguistics at Fachhochschule Konstanz, Baden-Württemberg) is maintaining German-Chinese technical and economic databases in co-operation with IBM Germany, Siemens AG, Volkswagen AG, and a number of Chinese universities.
But Europe still has a lot of ground to make up in this field, especially compared with the US, where resources in computer linguistics are extensive and Asian languages specialists numerous, given the importance of the Asian community living there. European NEL computing specialists had anticipated the needs of the EU in this domain, and tried to convince companies trading with Asia to invest in research and training for NEL computer applications. Their warning was not heard.
No one should be surprised, then, to see that Microsoft is extending its competitive edge to applications for Asian languages. In addition, Lernout & Hauspie Speech Products, the leading speech products company, which has a remarkable catalogue of machine translation software targeted also at Japanese, Chinese, and Korean, dramatically opened its capital to Microsoft in the first half of 1999.
The European will to improve its poor position and make up for this handicap has to be affirmed. Concrete measures have to be strengthened to raise economic actors' awareness and encourage investments in the field of computer applications for Asian languages.
With the spread of European colonial interests into the Arabic and Islamic world in the course of the nineteenth century, a corresponding increase in the teaching of the languages and cultures of the region occurred, as was also the case for the languages and cultures of other areas subjected to colonial influence or rule. Schools, academies, and societies specializing in the teaching of Oriental languages and the study of Oriental civilisations flourished. Arabic was a central subject in most of these. Not only did one need it to understand the religion of Islam, it was also a necessary element in the acquisition of formal Persian and of Ottoman Turkish, the court languages of India and of the Ottoman Empire respectively. The fairly rapid replacement of Persian by English as the language of administration in India and the slower break-up of the Ottoman Empire, culminating in the establishment shortly after World War I of the modern Turkish Republic, which attempted to cut its ties with its Arabic and Islamic past, came too late to diminish the growing importance of Arabic.
The discovery of large oil reserves in the Middle East, rapid population growth in the area, and the emergence of powerful Islamic, secular nationalist, and Marxist political formations ensured continued and growing interest in the language, although not always in Western Europe, which soon after World War II lost most of its control over the Middle East and much of its cultural influence there. By the early sixties, the US had an extensive "National Defense Foreign Language" fellowship programme in full swing. The teaching of Arabic language and culture benefited from a rather heavy importation of European scholars, particularly from Great Britain and Germany.
The long tradition of teaching and study which has succeeded reasonably well in providing Europe and America with trained scholars in Arabic and Islamic studies, as one may readily understand from the preceding brief historical introduction, accepts change only gradually. Moreover, it cannot easily be replaced by a native Arabic tradition, something on the order of the Alliance Française or the Goethe Institute. The native Arabic grammar, although one of the most sophisticated systems of linguistic analysis ever devised, was developed by scholars who lacked the concepts of consonant, vowel, and syllable and for whom subject and predicate were logical rather than grammatical categories. Western students who learn Arabic must learn a whole new science if they are to understand explanations of Arabic grammar given by teachers trained in the native tradition.
Until recently, it should be added, standard (written) Arabic was very often taught as a dead language. If students learned to speak, it was most often on their own. Aids for learning dialects were not entirely lacking, but study of a dialect was seldom required in university programmes. Little emphasis was put on speaking the written language, which is generally only used by educated Arabs in certain formal or semi-formal circumstances, although for many educated non-Arab Muslims across the predominantly Islamic areas of Africa and Asia it is commonly spoken as a lingua franca.
In effect the student should be taught three different forms of Arabic if he or she is to acquire something resembling the command of an educated native speaker: first, the highly inflected classical language (known as Modern Standard Arabic, or MSA), second, an Arabic dialect (these have lost much of their inflection, have many markedly different features, and can almost be considered as separate languages), and, third, a somewhat artificial version of standard Arabic with a loss of inflection resembling what has occurred in the dialects. Finally, after years of practice, the student should learn how to blend these three linguistic levels according to varying circumstances, as educated native speakers do.
Many commercial initiatives in computing for Arabic have been undertaken in the Middle East, but also in Europe, e.g., the production of Arabic fonts by the DecoType company in the Netherlands. While previous Arabic computer fonts lacked most of the elegant ligatures that characterize traditional Arabic (including Ottoman and Persian) typography, DecoType is the first company to produce a set of fonts that realize these features, and these fonts are now an important addition to standard Arabic desktop publishing programmes. Also worthy of note is the development of the Syriac version of Windows, in cooperation with Microsoft, by a Cambridge linguist, George Kiraz, who is now working for Lucent Technologies in the United States. In the area of linguistic analysis and language engineering, an important package of linguistic tools, not only for Arabic but also for other languages, has been produced by the Multi-Lingual Theory and Technology team at Xerox in Grenoble.
Computing is progressively being introduced into the traditional framework of NEL faculties, where increasing numbers of researchers and teachers are using its tools and methods. However, this is not yet leading to a redefinition of educational objectives. The fundamental mission of NEL departments has always been to teach the languages, literatures, cultures and history of non-European countries. Large institutes such as INALCO in Paris, SOAS in London, and IUO in Napoli have taught those disciplines for some two centuries now, and they will continue to do so, since these disciplines cater to precise societal needs. The interest of the great majority of students of NEL lies in its core subjects; to these students, computing is usually limited to what is of direct use in learning those subjects. Consequently, only a very small, though perhaps increasing, minority of students becomes involved in deeper scientific issues related to multilingual computing.
Other disciplines taught at faculties offering NEL have also started to adopt computing, often at a different pace. New technologies and methodologies are opening new perspectives in computational linguistics, as well as in other branches of linguistics such as sociolinguistics, psycholinguistics, and general linguistics. When these developments are brought to bear on NEL, a cross-fertilisation between the various knowledge domains can be observed.
In a wider societal context, increasing cultural and economic contacts between EU and non-EU countries require efficient and fast language communication competencies. Modern methods for language acquisition, text editing and information retrieval, which already exist for EU languages, cannot be easily converted or created for non-EU languages. Therefore, the need for new professional competencies manifests itself. There is an urgent need for language engineers with expertise in advanced computing adapted to non-European linguistic structures, from character sets to grammars.
Since many non-European languages are primarily used in countries which have not sufficiently mastered new technologies, it cannot be expected that these countries develop suitable language technologies on their own. European institutions may therefore have to take the lead in developing the necessary competencies, while closely cooperating with non-European partners at all stages. Indeed, European departments of linguistics and foreign language departments have already familiarised themselves with advanced computing and are beginning to adapt existing computing techniques for dealing with various languages. They have gained experience which will be a very valuable asset for larger-scale projects together with non-European partners.
While lack of proper financial resources has not prevented people from expressing their productivity and creativity, it has nevertheless severely limited the scope of application of their results or products. Indeed the field of NEL, being extremely wide, is approached today by scattered, small teams with limited resources, but these teams cannot fully cope with the magnitude of the problems. International strategies would be wise to take into consideration that future actions must be on a larger scale in order to make any impact.
The most significant success in recent times was met in the field of African studies. As far back as 1986, a European project, sponsored by the European Commission (DG XXII, ERASMUS ICP 1140/09), started its activities with a small group of four universities: the Istituto Universitario Orientale di Napoli (which acted as the central co-ordinator), Leiden University, INALCO in Paris and the Université Libre de Bruxelles. From 1987, membership increased and the application was renewed every year, receiving approval for teacher mobility, student mobility and, occasionally, Intensive Programmes.
In 1992, a first WOCAAL (Workshop On Computer Application on African Linguistics) was organised at ULB (Brussels). The event offered a clear vision of the field and gave more substance to the project. The 2nd WOCAAL event took place in Helsinki in 1994, and proved successful both with respect to the number of participants (7 teachers and 20 students) and the range of subjects (phonetics, text treatment, lexicography, etc.).
During the various meetings connected to the work of the ERASMUS programme, and on other occasions such as conferences, seminars, etc., the need was felt to introduce a systematic treatment of questions related to computer applications for African linguistics into national curricula. Consequently, the partners of ICP-1140/09 applied in 1994 for the organisation of a curriculum development project. The year 1995 saw the profiling of the CAMEEL project (Computer Applications for Modern Extra-European Languages) aimed at creating and developing a multidisciplinary and transnational curriculum to train students in computer applications to non-European languages.
From 1996, curriculum definition, course design and various feasibility studies were carried out. The creation of a European master's in CAMEEL was planned. In 1998, the master's still had to be fitted within national structures and diploma-awarding systems. At present, in 1999, we are about to see the start of the first level of the master's in CAMEEL; in its pilot phase it will focus exclusively on African languages. Meanwhile, in 1997, the partners active in the NEL area entered the ACO*HUM thematic network project and founded a working group aimed at deeper analysis and subsequent dissemination.
Some relevant research on Arabic has been done in Europe, including Eastern Europe. One of the first examples, done on a mainframe computer at Oxford University, was the corpus of early Arabic poetry established by Alan Jones. This work predates the widespread use of personal computers capable of handling Arabic script at European universities. A noteworthy effort at rendering non-Roman script languages in transliteration was made in connection with the establishment of TITUS, the Thesaurus Indogermanischer Text- und Sprachmaterialien, at the University of Frankfurt/Main. Charles University in Prague stores numerous Ugaritic and South Arabian texts, and considerable work has been done there on optical character recognition (OCR) of Ethiopic. At the University of Venice a corpus of Persian poetry in transliteration has been maintained for some years, while at the University of Bergen valuable work has been done by Knut S. Vikør on the development of Arabic and Hausa transliteration fonts, particularly for the Macintosh platform.
Also at Bergen, in cooperation with Dr. Petr Zemánek of Charles University at Prague, a cross-platform transliteration font that includes most of the signs needed for the various Middle Eastern languages and the diacritics of almost all Roman script European languages has been developed. Currently a large corpus of Arabic is being built for lexicographical use at the University of Nijmegen. But perhaps the most remarkable European contribution to Arabic language technology to date was inspired by Orientalists at the Russian Academy of Sciences, St. Petersburg Branch, who needed OCR for both Arabic and other non-European languages, as well as other pattern recognition tools, for pre-cataloging personal seal imprints and the handwriting of manuscript copyists. This work is still in progress, but the early DOS OCR program produced in St. Petersburg, now marketed for Windows by an Egyptian firm (Sakhr Software Company, see below), is perhaps the most versatile Arabic OCR program available today.
Particularly in France, often in cooperation with North African partners, work is being done in a variety of areas. In cooperation with the Institut du Monde Arabe, Rafik Belhadj-Kacem, on secondment from Bull to EPOS AS to co-ordinate and manage international projects, is directing, with financing from the Francophonie project, the production of a multilingual search engine for a bio-bibliographical collection of French and Arabic texts. The same group will be producing a program for reducing the confusing variety of European spellings of Arabic names to their standard Arabic form. Mohamed Azzedine of the CIMOS company in Paris has produced working machine translation tools for Arabic to and from French and to and from English. ATA Software, based in London, has produced an English-Arabic program. Dr. Javier Sánchez González, with others at Complutense University in Madrid and cooperating institutions, has applied computer techniques to such areas as Persian prosody and Middle Eastern musical scales. Work in these and other areas, such as speech recognition, is being done elsewhere in Europe as well, both commercially and at universities. It has been possible to mention only a few examples here. Reports of results and work in progress can often be found in the Proceedings of the International Conference and Exhibition on Multi-lingual Computing, organized every two years by Dr. Ahmad Ubaydli under the auspices of the Centre of Middle Eastern Studies at Cambridge University (six to date, the last in 1998).
With respect to the African region, the digitised Swahili text resources at Helsinki University have grown to a very large size, which makes their use in the language industries feasible. The Helsinki Language Corpus Server contains a wide variety of documents (fiction, scientific literature, newspaper articles, dictionaries, Bible and Koran translations, and quantities of transcriptions of oral materials) comprising approximately 10 million words. The exploitation of these archives is facilitated by a number of tools created specifically for the purpose, on top of the more usual ones. Researchers with access to these resources can use morphological and syntactic parsers and disambiguators to prepare their materials. Noise and gaps are thus largely eliminated, and subsequent operations can be carried out on a sounder basis.
In the above, we have focussed on NEL research going on in Europe, due to the scope of the project. However, it is important to keep in mind that much NEL research is conducted in non-European countries. Just by way of example, we mention the University of Stellenbosch in South Africa, where the Research Unit for Experimental Phonology is a vibrant centre for phonological research, including computer applications for speech recognition and speech analysis of African languages such as Xhosa, Zulu, Nguni, Sotho, etc.
With respect to research publications, there is at present hardly any appropriate forum for NEL computing. Since NEL computer linguists are few in number and scattered over many different languages, there has never been a natural occasion or motivation for them to publish a dedicated journal spanning the field. Consequently, they stand out as rare birds both in computational linguistics journals and in journals on traditional NEL, a situation which has had very negative effects in terms of knowledge transfer. European websites dedicated to NEL languages are very few, especially compared with the number of similar American websites. To this day European NEL computing is hardly represented on the Web, and related digital resources are more than scarce. Some exist, though, such as the Swahili Text Archives of Helsinki University, which represent a formidable asset for Swahili computer studies and research. Also worth mentioning is JAIS, the Journal of Arabic and Islamic Studies, published electronically at Bergen and Prague; even though the scope of JAIS is general textual and cultural studies rather than computing, it does address some issues of digitisation related to its electronic publication.
We feel that further fact finding and dissemination efforts concerning the general state of the art in NEL research should be conducted internationally, with the participation of both EU and non-European countries.
In an attempt to answer students' needs, CRIM, the Centre de Recherche en Ingénierie Multilingue at INALCO, the Institut National des Langues et Civilisations Orientales in Paris, started such training at the postgraduate level. Since 1992 its Multilingual Engineering DESS (Diplôme d'Etudes Supérieures Spécialisées) has been awarded to some 80 NEL students, half of them of non-EU origin. More than 50% of the courses are taught by people from industry, a number of them from research and development departments, for example at XRCE, EDF, IBM, and Technologies-GID. Strong emphasis is placed on corpus linguistics, terminology and lexicography. Companies show great interest in this curriculum, though some of them see its NEL dimension as secondary.
It is thus clear that special attention is needed for the problems of transferring computational linguistics methods to widely varying multilingual contexts. Awareness of these problems needs to be stimulated in European society, and the sharing of visions should be encouraged. Up to now, however, a sufficiently large forum dedicated to NEL computing has not materialised. We feel that the following worthwhile aims and objectives are still insufficiently addressed today:
Since writing systems may be based on representations of speech sounds, of meanings, or a combination of these (as in Chinese), the variation is enormous. The fact that a single character may represent a speech sound, a syllable or an entire word shows that characters must be treated at different linguistic levels. The Roman script, on which most current computer systems are based, is just one of many writing systems for the world's languages. 99% of the world's written languages use scripts which trace their ancestry back to the old Phoenician, Brahmi, Sogdian and Sinitic scripts. Chinese characters (hanzi in Chinese) still use similar shapes to the Sinitic characters used from around 1200 BC. The Japanese and Korean scripts use Chinese characters together with phonetic/syllabic ones.
Digitising NEL scripts is indeed not a simple matter, and the design of computer solutions requires strong competencies. As prerequisites to most NEL text processing, transliteration issues have up to now absorbed a lot of resources and have pulled many researchers away from other important research questions. Hopefully, some of this imbalance will soon be corrected by the adoption of international standards. The International Organization for Standardization, which has proposed standards such as ISO 639 and ISO/IEC 10646 (currently aligned with Unicode), has a subcommittee on the transliteration of written languages which is discussing standards for several non-European languages in various working groups. But even if adequate standards can be adopted at the level of character coding, some linguistic questions remain.
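The point of a universal character coding standard can be shown in a few lines (our own sketch, using Python's standard unicodedata module): under ISO/IEC 10646/Unicode, every character has a single code point and name, whatever its script.

```python
import unicodedata

# Each character has one code point and one name in ISO/IEC 10646 / Unicode,
# whether it is Arabic, Japanese, Chinese or accented Latin.
for ch in "\u0639\u304B\u4E2D\u1EC7":
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")
```

With such a standard, text in any script can be stored, exchanged and processed without the ad hoc, mutually incompatible 8-bit encodings that have dispersed NEL computing until now.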
Languages which have an alphabetic writing system but which do not mark vowels, or mark them defectively, pose particular problems. Among such languages are Arabic, Hebrew and Persian. One computational challenge concerning these languages is the automatic insertion of vowels into text consisting of consonant characters only. Arabic includes most of the problems of the other right-to-left scripts and a few more of its own. Unlike Hebrew, which has block letters like Roman script, most Arabic letters are joined to the preceding and/or following letter. Moreover, since many letters have special initial or final forms, in many cases no space need be left between words. While Persian and Urdu share these features, a particular problem of Arabic is that many texts, especially religious texts and elementary schoolbooks, must have short vowels painstakingly added. The ambiguity of vowelless script in some languages is a linguistic problem which may lead to interesting studies in its own right.
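The asymmetry of the vowelling problem can be illustrated with a small sketch of our own in Python: Arabic short vowels are separate combining characters in Unicode, so producing the unvowelled form of a word is trivial, whereas the reverse direction, automatic vowel insertion, requires genuine linguistic analysis.

```python
import unicodedata

def strip_short_vowels(text: str) -> str:
    # Arabic short vowels and related signs (fatha, damma, kasra, sukun,
    # shadda, tanwin) are combining marks, Unicode category 'Mn';
    # most printed Arabic simply omits them.
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")

vowelled = "\u0643\u064E\u062A\u064E\u0628\u064E"  # kataba 'he wrote', fully vowelled
print(strip_short_vowels(vowelled))                # the usual unvowelled spelling
```

Going the other way, from the unvowelled form back to a correctly vowelled text, is exactly the computational challenge described above, since several vowellings are usually possible for a given consonant skeleton.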
Related to the above problem is the missing or defective marking of tones in the written form of tonal languages. If tone is lexical, it is essential to mark it in the lexicon; but tone marks may be left out of written documents, because the tone pattern can be determined on the basis of context. Marking tones may be important in the context of the literacy problem in Africa, where huge segments of the population have poor reading and writing skills. In other tonal languages, tone is grammatical, and the tone pattern of a word-form varies according to phonological, grammatical, and syntactic features. Here again, a computational challenge is to insert tone patterns into text lacking tone marks and thus make the writing more accurate and more readable for a language learner.
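The lookup step at the heart of such a tone-restoration tool can be sketched as follows (all word forms below are invented for illustration; real tonal orthographies and lexica are far richer):

```python
# Toy tone-restoration lookup. The forms are hypothetical; a real system
# would consult a full lexicon and use grammatical context to choose among
# candidates when an unmarked written word has several possible tone patterns.
TONE_LEXICON = {
    "oko": ["\u00F2k\u00F2", "\u00F3k\u00F3"],  # two hypothetical tonal homographs
    "aba": ["\u00E1b\u00E0"],                   # hypothetical unambiguous entry
}

def restore_tones(word: str):
    """Return candidate tone-marked forms; ambiguity must be resolved in context."""
    return TONE_LEXICON.get(word, [word])

print(restore_tones("oko"))  # two candidates: context must decide between them
```

The interesting computational work lies precisely where this sketch stops: disambiguating among candidates using the phonological, grammatical and syntactic features mentioned above.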
For a language that basically uses the Latin alphabet, Vietnamese is undoubtedly one of the most complicated languages to write on a computer, due to its extensive use of diacritics. Vietnamese uses diacritics both to designate special consonants and vowels and to designate tone. Some characters therefore carry two diacritics. These diacritics must be placed in a particular configuration with respect to each other, but there is not just one possible configuration for each combination. For example, the tonal diacritic may in some cases be placed either to the right or to the left of another diacritic.
Although it is not possible to present one single standard for how Vietnamese is to be written, it is essential that the diacritics be represented accurately on a computer. Since each Vietnamese syllable is normally written as a separate orthographical word, regardless of whether it is morphologically a single word or not, an enormous amount of apparent homophony would result if the diacritics were left out. For a native speaker, the text may still be comprehensible, since the appropriate word can be inferred from the context it occurs in. For a language learner, however, it is essential to learn the correct vowel or consonant and the correct tone. Furthermore, it is important that keyboard typing sequences for producing letters with diacritics follow the well-established methods already in use for typing in Vietnam.
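The multiple-configuration problem has a direct counterpart in character encoding: the same Vietnamese syllable can be stored as a precomposed character or as a base letter plus combining marks. A short Python sketch shows how Unicode canonical normalization (NFC) reduces such variants to one form, so that searching and comparison work however the text was typed:

```python
import unicodedata

# "tiếng" (as in tiếng Việt): the vowel ế carries a circumflex
# (vowel quality) and an acute accent (tone).
precomposed = "ti\u1ebfng"        # U+1EBF: one precomposed code point
decomposed = "tie\u0302\u0301ng"  # e + combining circumflex + combining acute

assert precomposed != decomposed  # distinct code point sequences...
# ...but canonically equivalent: NFC recomposes the marks into U+1EBF.
assert unicodedata.normalize("NFC", decomposed) == precomposed
```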
Summing up, the problems related to the digital coding of NEL writing systems are still perceived as a major bottleneck, placing many languages at an unfair disadvantage compared to English. At last, the magnitude of the problems is receiving attention, even to the extent that European governments are becoming involved: as an example, we mention the project website Écritures du monde, presented by the French Ministry of Culture and Communication. We call for strong international support to continue the work and obtain adequate character coding standards for all non-European languages.
The many different character sets used in NEL have also had the effect that scholars are dispersed as regards computer environments. Much work was done on the Apple Macintosh, which until recently was the best environment for writing in those languages. In contrast, the UNIX environment, common in computational linguistics research (cf. chapter 4), was hardly used at all for NEL. This, it can be assumed, contributed to the dispersal of computer applications for non-European languages and to the lack of contact between NEL computing and mainstream computational linguistics.
There are, however, big differences in the level and extent of computerisation among those languages. Japanese, for example, counts among the best studied languages in this respect, and large numbers of people are working with all kinds of applications, including machine translation, speech recognition, etc. Most non-European languages, however, are still without any tools, and no research is done on them.
If Asian and Near-Eastern languages are characterised by varying writing systems, almost all languages in Africa south of the Sahara use the Latin character set in writing (with some notable exceptions, such as Amharic, Tigre, and Tigrinya, which use different alphabets). As regards computer applications, this is a major advantage for those often neglected languages; very little effort is needed for character representation, and work can be concentrated directly on the deeper levels of language processing. One can say that general-purpose tools developed within the mainstream of computational linguistics can be applied to all languages with the Latin alphabet, and the same applies to programmes dealing with spoken language. This greatly reduces the need to devise new tools. Nevertheless, work on many of those languages is still in its initial phases, and so far only Swahili has been described thoroughly enough to guarantee the production of marketable applications. The description of tonal languages, which constitute the majority of African languages, is the focus of some other research.
Work on accumulating text archives for various languages has been going on for some years, and the pace in this field is accelerating as computerised texts in various languages become available. This fast-growing field will be of capital importance for the systematisation of corpus linguistics within NEL. A small number of general-purpose tools for corpus work can already be used to some extent, but their limited adaptability to NEL characteristics prohibits the launching of large-scale projects. To remedy this serious handicap, which prevents the adequate retrieval of linguistic information, language-specific tools are clearly necessary. While much needs to be done for quite a few major non-European languages, sufficient attention should also be given to the minor languages, even to the extent that urgent measures ought to be considered for them. A number of target languages should indeed be prioritised, taking into account geo-political, economic, societal, and scientific parameters. Among the considerations is that some languages are seriously endangered; their extinction would have societal consequences as well as entailing the loss of irreplaceable scientific material (see references).
Since the survey was aimed at departments of African studies which had volunteered to participate in such a project, it was expected that the answers obtained by means of the questionnaires would show the level of use of information and communication technologies in the targeted departments to be higher than usual in NEL departments. This proved to be a correct assumption, and, despite a few dark spots here and there, the general picture was quite encouraging, revealing the high potential of the teams involved.
The institutions where African studies are situated did not, however, always respond positively to the consequences of the survey. As long as the project consisted of discussions and thinking among motivated individuals, universities and faculties had an approving attitude towards it, but the implications of its materialisation were experienced as rather disturbing. This is perhaps not surprising; innovations may provoke reactions due to uncertainty among those involved. For the CAMEEL enthusiasts, it meant that work remained to be done to inform, explain, and convince their institutions, and to find solutions regarding funding, equipment, and local co-operation.
The results of the survey reveal that, as far as the staff of African studies are concerned, motivation is high. Small teams seem to be more energetic than bigger ones, which, in a way, is quite natural, since they feel that working with other European centres and sharing resources and competencies will optimise their action with respect to student training, research, exchanges, and so on.
Among the 44 teachers and/or researchers who filled in the questionnaire, 6 respondents (i.e. 13.6%) thought of themselves as basic users of desktop software, 16 (i.e. 36.4%) had more advanced expertise in computing, and 22 (i.e. 50%) did not answer the question on this subject.
Most of the computationally expert scholars had serious expertise in research and development. The outcomes of the projects they participated in or ran are products currently used in African studies departments, whether for teaching, publishing, or research purposes. These range from parsers to speech applications:
Indeed, the merits of the NEL computing experts deserve strong emphasis. Their double competence in African languages and computational linguistics, acquired at the cost of years of extra training and self-tutoring which did not necessarily bring them career credits, can be quite extensive. LISP, PROLOG, SMALLTALK, VISUAL BASIC, C++, and JAVA are some of the programming languages they can handle, and as for African languages, they are generally experts in two or more.
Together with the other teachers interviewed, they would like to be able to use more computer tools for teaching and research. Database programs and parsers for a variety of languages are still sorely missed. Machine translation applications are non-existent. But, as scholars systematically emphasise, the absence of common standards in existing applications may discourage further developments in the field. What is awaited with some impatience is the adoption and generalisation of the Unicode standard for character encoding. Along with this harmonisation, it is expected that substantial means will be allocated for updating most of the available tools and already existing resources. It is felt that the realisation of such objectives would give NEL computing research and development a dramatic new impulse.
Some of the improvements NEL scholars would like to see are of a more concrete nature and seem amply justified, according to the data collected via the questionnaires sent for the CAMEEL survey. Firstly, computers remain too few in African studies departments, and consequently, an investment in computer systems, including multimedia workstations and software, is badly needed. Secondly, networking and electronic communications should be brought to a satisfactory level.
The frustrations caused by these severe shortages are further increased by the difficulties teachers meet when they wish to train their students in NEL computing. As a matter of fact, the number of computer-equipped rooms in Humanities faculties can be ridiculously low. Some African studies departments have found local arrangements with computer science departments, which shows their determination, but of course the number of hours they are allotted is hardly satisfactory. Very frequently, this means that only post-graduate students are entitled to computer training and that their free access to computer facilities is quite limited.
Most of all, NEL teachers are insisting on effective mechanisms for change. Good intentions, in the form of universities' declarations to move towards the information age, fall short when there are not enough trained people to make things work. Universities need to take immediate action to provide for today's basic material needs and competence requirements. The latter refers especially to the need for adequate training of teaching staff, who in our experience have expressed a great thirst for participating in retraining schemes.
Subjects related to new technologies were partly covered, of course, but were not scrutinised systematically. Useful data were, however, extracted from the 400 answers received, which give a rather clear idea of the degree of technologisation at Japanese departments, especially in relation to equipment, teaching, and students' experience in the use of basic tools.
A few structural elements which Japanese studies do not share with African studies ought to be mentioned. First, Japanese is one of the three major non-European languages taught in the EU. The number of students at higher education level is above 21,000, the number of teachers around 1,200. Second, for about 40% of the students, Japanese is not their main subject but an elective. Third, the divide between Japanese at humanities faculties and Japanese at other higher education institutions is not merely superficial. It must be noted, for example, that the teacher/student ratios differ among the various types of institutions:
From the results, it is obvious that the better equipped departments are found within science and technology institutions, whose facilities are quite extensive. Very often these institutions offer their students sufficient access to computers and also run language centres which students attend freely for self-learning purposes. Multimedia environments, Internet and e-mail communication tools are provided and used extensively. A similar situation can be observed at business and commerce institutions, though their level of equipment tends to be lower. However, the size of the staff (teachers, engineers, and technicians) responsible for the tutoring of students and running of computer facilities remains acceptable, which is an important factor for the effective and efficient integration of information technology in the students' training.
Unfortunately, and as could be expected, the situation of Japanese at humanities faculties is far from comparable. At these kinds of institutions, students of Japanese generally have limited access to computers, or no access at all. When computers are available, the time students can spend on them is much too short, which is deeply frustrating, since many students have limited skills as regards basic computer tools. Teachers, on the other hand, do not show much enthusiasm for new technologies, and only 2% of them try to include the use of computers in the training of their students. Added to that, it has to be kept in mind that, whereas science and technology curricula include systematic training of students in new technologies, humanities Japanese curricula do not even mention the existence of these technologies. The little training given to a handful of students by their rare pioneering Japanese teachers as regards new technologies is generally dedicated to word-processing, a fundamental technique for all Japanese students.
Japanese CALL CD-ROMs do not appear to be part of the tools available for Japanese at humanities faculties. CALL programs exist, though, and non-humanities Japanese departments usually have some. A majority of the CD-ROMs that are to be found deal with Kanji learning, reading, or writing; the language of the targeted learner is English. Some optimists would say this is a good start, but others would emphasise that CALL programs will not develop fast if producers cannot count on a satisfactory and expanding market. Some Japanese multimedia experts remark that the number of quality products in the field is rather low: around a dozen.
If reasons can be found to stimulate the creation of computer applications for the Japanese language and their industrial implementation, there would also be reason to train their user base, including students and, a fortiori, their teachers. But since humanities Japanese departments still have a long way to go as far as computer equipment is concerned, this will not be a quick development.
Some of the answers to the shortcomings highlighted above may be found in a reassessment of the objectives of Japanese studies at humanities faculties. It would certainly be agreed that Japanese studies should, to some extent, reflect the image of Japan's modernity, and that apart from literature and civilisation, new disciplines should be included in or associated with the field. In any case, most exchanges with Japan concern contemporary needs and are supported by highly technologised media. Linking curricular objectives more closely to societal needs would undoubtedly offer students more opportunities to find jobs once they have completed their training.
The learning and teaching of Arabic in Europe has a long tradition, which perhaps explains in part why the field has remained so conservative in its methods and in the development of instructional tools. Today, two approaches predominate in the teaching of Arabic to adult students: the traditional 'grammar and translation' method, which is based primarily on giving the student a systematic overview of the language with accompanying reinforcing exercises, but which has often neglected speaking and writing skills; and the more recently introduced 'communicative' approach, which stresses exposure rather than systematic grammar and which aims at producing 'four-skill' competence (speaking, listening, reading, and writing). The differing approaches are the object of a good deal of controversy, and at a number of institutions an attempt is made to combine the best aspects of both. For neither approach, in any event, is there a significant volume of digital aids that go beyond what has traditionally been available on audio or video tape. Simply transferring these traditional materials to CD, which is being done, can only be considered a first step.
There are a number of Arabic tutoring programs on CD or the Web, many meant for Muslim children and mostly not of European origin. Although they make use of some methods which should be incorporated into university level instruction, they are not our concern here. Somewhat more advanced programs, such as The Arabic Tutor, offer multimedia training on CDs and sometimes on-line tutoring. But they generally do not tackle the three language levels named above successfully. For the Egyptian dialect, Smiles Productions has an interactive course on CD-ROM.
Probably among the most serious efforts in this direction is the Let's Learn Arabic program copyrighted by Roger Allen and Abercrombie (1988, 1994) at the University of Pennsylvania, which incidentally itself has a long Arabic-teaching tradition, having introduced instruction in the language in 1788. The DOS-based multimedia course, which stresses a four-skill approach, is used by the University of Pennsylvania as part of its curriculum, and may be downloaded for non-commercial use. Courses in advanced syntax of Arabic and in readings in specific areas complement the initial skills-oriented program.
Training in Arabic word processing, which ideally should be a part of all university level study of the language, lags considerably. The most immediate reason is usually lack of technical staff able to support a large number of machines with Arabic software installed. Another reason is confusion regarding the choice of platform. Established university staff members very often work on the Macintosh platform, which until quite recently provided much greater multilingual flexibility than the PC. Many educational institutions have moved away from supporting two separate platforms, however, opting generally for the PC, and the Macintosh in fact no longer enjoys so many advantages over Windows as before. In the all-important shareware area the PC has long since surpassed the Macintosh, and on the Internet, Windows Arabic encoding has become the de facto standard, meaning that Mac users must download and filter Arabic texts in order to read them. At the level of the individual user, however, many solutions created for the Macintosh years ago are still lacking for Windows. The question of which platform to choose, or what percentage of each type, and on which platform to train students in word processing therefore remains a difficult one.
A major effort at devising a curriculum for computer applications for non-European languages is the EU-sponsored master's degree called CAMEEL, planned to start in the year 2000. The programme requires one and a half years' full-time study, and its major aim is to give students skills that will help them get jobs in the commercial market.
The training of NEL students in computational linguistics has been the interest of a few European NEL teachers for some time. In the last decade, some dedicated teachers with advanced competencies in computing have tried to introduce their students to computing tools and techniques, with varying degrees of success. Confronted with generally unfavourable conditions, including lack of time, machines, tools, competence resources, and institutional support, they have felt that their generous efforts could hardly be sufficient for reaching more ambitious aims, i.e. the production of post-graduate NEL specialists with advanced expertise in NEL computing.
A number of African studies teachers from various EU universities who were dissatisfied with the situation came up with the idea of working together towards the creation of a specialised curriculum leading to a European master's degree. Their project, European Master's in Computer Applications for Extra-European languages (CAMEEL), was officially started in 1995 and received funding from the European Commission as a CDA project in the DG22 ERASMUS programme.
The participants in CAMEEL had a long history of co-operation: they had worked together on various projects, under the umbrella of ERASMUS, since 1986. Today the project, co-ordinated by the Istituto Universitario Orientale (Naples), gathers 13 European institutions; 10 of them are funded and constitute the core group. The teams are geographically spread over most of Europe:
The transnational nature of the project was not solely a consequence of EU policies. In fact, institutions where African languages can be learnt are few. In most countries, only a few units with small teaching staffs offer teaching in this field. As a result, the desire to work with other teams led the pioneering CAMEEL teachers to look beyond their borders long ago.
Berlin (Humboldt Universität); Brussels (Vrije Universiteit/Université Libre); Hamburg (Universität); Helsinki (Yliopisto, i.e. University); Köln (Universität); Leiden (Universiteit); Leipzig (Universität); London (SOAS: School of Oriental and African Studies); Nice (Université Sophia-Antipolis); Paris (INALCO: Institut National des Langues et Civilisations Orientales); Pisa (ILC: Istituto di Linguistica Computazionale); Wien (Universität); Zürich (Universität).
Although common interests initially brought the participants together, there were additional reasons which helped to motivate the project. First, they needed to break out of scientific isolation on a permanent basis. Second, individually they were able neither to fully assess their needs nor even to define adequately what constitutes NEL computing. Third, despite the local help they might receive, e.g. from computer science or computational linguistics departments, they realised that their competence resources remained insufficient and could not ensure the running of a well-rounded training curriculum.
Within the first two years, the CAMEEL partners developed their project by comparing their experiences, exploring various paths, looking for suitable opportunities, and assessing realistically what their ultimate target entailed. In 1997, having clarified their objectives, and confident that the project was not only viable but necessary - whether on educational, scientific, economic, or political grounds - they went ahead with the preparation of the master's curriculum.
Designing a curriculum did not turn out to be an easy task. Refining initial suggestions into a structured, well-balanced programme adjusted to student levels, university time scales, and so on, required an immense amount of work, as CAMEEL members experienced. As a matter of fact, as the project was getting closer to implementation time in 1999, a few questions which were supposed to find answers without much difficulty proved more complicated than expected, thus leading to the conclusion that more information and thinking had to be put into the project. Details regarding ECTS (European Credit Transfer System), student recruitment, diploma equivalencies, availability of teaching resources, etc., kept coming up. With the support of a questionnaire-based inquiry and contacts with SOCRATES office experts, extra investigations were carried out, which allowed the finalisation of the master's and the preparation of its official implementation.
The first level (Level 0) offers eight preparatory modules (prerequisites to enrolment in the master's): four of them are aimed at students whose background is African languages and linguistics; the other four at students coming from computational linguistics/computer science (depending on their knowledge of African languages).
At Level 1 (the first year of the master's proper) and Level 2 (the second year), students have to complete a number of compulsory and optional modules according to their profile and interests. Within those two years, the students have to spend a minimum of three months in industry (internship = 14 ECTS). At the end of the training, a final paper or project is required of the students (worth 8 ECTS). The total number of ECTS to be acquired is 120.
After completing the CAMEEL master's, students should demonstrate high expertise in the use of computer tools and fully master the appropriate methodologies and techniques for processing African languages, whether for lexicographic research and development or for documentation, multimedia publication, CALL, and Internet-based ODL dedicated to African languages and their cultural specificities.
The professional knowledge and skills to be acquired at university and in industry (internship) will be continually reviewed so as to keep them adequate with respect to job market requirements. To this end, contacts and exchanges with private sector companies will aim at encouraging closer and deeper collaboration; research and development staff from such companies will be invited to participate in the training of the students so as to allow constant updating of the course contents. The scientific content of the courses, on the other hand, will be reassessed regularly and tuned to developments in university research laboratories at the theoretical or methodological level. As a side effect, knowledge transfer between university and industry will hopefully have a greater chance of success. Respecting the above conditions will additionally enable students to get a more accurate view of their profession and give them more credibility in the eyes of employers.
Presently, the list of the CAMEEL master's modules is as follows:
1. Structure and use of computers: basic concepts 70 hrs = 4 ECTS
(Option b) or African linguistics basics required of computer sciences/computational linguistics students)
2. Introduction to computational linguistics 70 hrs = 4 ECTS
3. Networks (e.g. Internet), multimedia and databases 70 hrs = 4 ECTS
4. Use of computer-aided applications in language learning (CALL) 70 hrs = 4 ECTS
5. Introduction to linguistics. 70 hrs = 4 ECTS
6. Phonetics, phonology, tonology and orthography 70 hrs = 4 ECTS
7. Morphology and syntax 70 hrs = 4 ECTS
Plus one module of the following (8-10):
8. Lexicography and lexical semantics 70 hrs = 4 ECTS
9. Historical and comparative linguistics 70 hrs = 4 ECTS
10. Dialectology and socio-linguistics 70 hrs = 4 ECTS
11. Programming language (1) 140 hrs = 8 ECTS
12. Design of data structures and algorithms 140 hrs = 8 ECTS
13. Use of language corpora and text archives 70 hrs = 4 ECTS
14. Use of tools for phonetic analysis 70 hrs = 4 ECTS
15. Working in Unix/Linux environment 140 hrs = 8 ECTS
Optional courses; three modules to be chosen among:
16. Programming language (2) 140 hrs = 8 ECTS
17. Preparing and encoding texts 140 hrs = 8 ECTS
18. Computer-aided language learning systems (CALL) 140 hrs = 8 ECTS
19. Open and distance learning (ODL) 140 hrs = 8 ECTS
20. Language teaching methods using computer applications 140 hrs = 8 ECTS
21. Building morphological analysers 140 hrs = 8 ECTS
22. Building syntactic and semantic analysers 140 hrs = 8 ECTS
23. Building lexical databases 140 hrs = 8 ECTS
24. Speech recognition and phonetic analysis 140 hrs = 8 ECTS
25. Tools for comparative linguistics 140 hrs = 8 ECTS
26. Additional unspecified course 140 hrs = 8 ECTS
It must be kept in mind that none of the participating institutions can provide all the above modules. Departments of African studies will have to obtain some of the missing training resources locally at other departments or universities. Some highly specialised modules will be taught via ODL; a few will be run in the form of intensive programmes (IP) followed by ODL complements. At Level 2, the teaching of the specialisation-oriented modules will take place in 2 or 3 host universities concentrating all the necessary resources (staff, equipment). Most students will thus be required to spend 3 to 6 months abroad under very intensive study conditions, where they will have the opportunity to encounter new environments, new methods, and new people, and in this way enlarge their cultural experience.
The European dimension of the master's - common prerequisites, curriculum and level of certification, shared courses (ODL, intensive programmes) and residential courses for specialisation - is well established. Another implicit aim of this curriculum is to prepare students, through its various non-traditional approaches, for the realities of Europe in the third millennium. Its transnational nature, its flexibility, its capacity for reacting to new challenges, and its new modes of collaboration towards a common goal should be a powerful asset and a guarantee of success.
At the moment of writing, the CAMEEL master's has yet to start and be put to the test. It remains to be seen whether its ambitious proposals are close enough to students' training needs, employers' expectations, and universities' thirst for complementary educational provisions.
The ALI-AKAN project, completed in 1999 and fully tested in a full-scale pilot training course, presents many innovative features. Its name stands for African Languages through the Internet, and its first phase focused on the Akan language. The project was authored by the Akan team of the Seminar of General Linguistics, under the direction of Prof. Thomas Bearth at the University of Zürich, in collaboration with the Language Laboratory under the direction of Dr. Paul Mauriac, also at the University of Zürich. The project, carried out in the framework of the European-Swiss ERASMUS Intensive Programme, offers imaginative and realistic solutions for the running of courses dedicated to neglected languages, or any other subject for which available resources are too scarce.
The object of ALI-AKAN, a programme combining ODL, multimedia CD-ROM and Intensive Programme (IP) support, is the study of Akan, a national language of Ghana spoken by well over 10 million people. In addition to being a teaching aid enabling students to acquire basic communicative skills and a first-hand working knowledge of the Akan language, it is also aimed at studying in depth some linguistic phenomena which have long been overlooked in traditional curricula. Through the exploration of the Akan sound system, as well as its specific rendering of the conceptualisation of human experience in its syntax and lexicon, students are led to re-appreciate or discover such notions as lexical tone, vowel harmony, secondary consonant articulations, and serial verbal constructions.
The use of technology in the form of a CD-ROM was dictated by several factors. First, large volumes of material (texts, sounds, and pictures or graphics) had to be made easily available at low cost (thus excluding books, cassettes, and to a certain extent the Web, because of lengthy connection times). Second, since Akan is a tonal language, sound quality had to be very high, and this could not be handled via the Internet without hurting interactive response times, due to the large size of the sound files. Today's transmission speeds on the Internet remain dramatically insufficient for multimedia applications, especially those requiring high interactivity and repetition rates. Third, a speech analysis software package linked to the programme had to be permanently available, so as to enable students to observe the original oral productions visually and compare them with their own.
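A rough calculation suggests the scale of the bandwidth problem; the figures below assume uncompressed CD-quality mono audio and a 56 kbit/s modem connection, both plausible for the period but not taken from the project documentation.

```python
# Uncompressed mono audio at CD quality: 44,100 samples/s, 16 bits each.
audio_bytes_per_second = 44_100 * 16 // 8   # 88,200 bytes per second of sound
modem_bytes_per_second = 56_000 // 8        # at best 7,000 bytes/s on a 56k modem

# One second of audio takes over twelve seconds to transfer:
ratio = audio_bytes_per_second / modem_bytes_per_second
print(round(ratio, 1))  # → 12.6
```

With a transfer time more than twelve times the playback time, drill exercises demanding instant, repeated replay of tone contrasts were simply not feasible online, which is what drove the choice of the CD-ROM.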
Given their limited budget and time, the unwieldiness of most multimedia authoring systems, and the flexibility concept of the course, the project team opted for simplicity. Standard text processing software was used around an HTML platform, which means that the course could eventually be put on the Internet. For the students, this means that no complex installation is required to run the course, since all they need on their workstations (PC or Macintosh) is an HTML browser. Nevertheless, considerable effort had to be spent solving the problem of Akan fonts, so as to make them readable and usable on both PC and Macintosh. The resulting WestAfrica7 fonts, produced by Hannes Hirzel, Zürich, can be integrated into standard word processors and e-mail software, thanks to macros linked to a PERL converter.
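Such a converter amounts to a code-point remapping, since a pre-Unicode 8-bit font reuses existing code points for special letters and the text only displays correctly with that font installed. The sketch below is in Python rather than Perl, and its mapping is hypothetical: it assumes '[' and ']' stand for the Akan letters ɛ and ɔ, and does not reproduce the actual WestAfrica7 table.

```python
# Hypothetical legacy-font-to-Unicode remapping. In a font-hack
# encoding, '[' and ']' might be displayed as the Akan letters
# ɛ and ɔ; the real WestAfrica7 table would differ.
LEGACY_TO_UNICODE = str.maketrans({
    "[": "\u025b",  # ɛ LATIN SMALL LETTER OPEN E
    "]": "\u0254",  # ɔ LATIN SMALL LETTER OPEN O
})

def to_unicode(legacy_text: str) -> str:
    """Remap legacy font code points to their Unicode equivalents."""
    return legacy_text.translate(LEGACY_TO_UNICODE)
```

Once remapped to Unicode, the text no longer depends on the special font being installed, which is exactly what integration with standard word processors and e-mail software requires.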
The ALI-AKAN course consists of 11 units. Each of these offers dialogues, exercises, notes on grammar and pronunciation, and vocabulary. Optional sections are provided dealing with additional linguistic or cultural aspects, and various types of pedagogical exposition are used so as to accommodate different learning styles. The main body of the course is complemented with a number of extra resources, accessible from within any part of a unit via appropriate links: glossaries, bibliographies, charts, maps, and links.
The residential part of the ALI-AKAN programme (which was held at Humboldt University in 1999) is fundamental for the introduction of students to the Akan language, since most of its characteristics are completely new to them. Within this introductory course the students really discover a new field and are guided by the teachers, who help them clarify fundamental notions, structure their understanding of the language, and integrate its linguistic specificities into their broader view of linguistic sciences.
A second important aspect of the study programme is oral training. Here students learn how to produce Akan sounds; the phonological production skills they acquire will be their survival kit for the rest of the course. For Europeans, indeed, such oral training is indispensable as a foundation for the study of Akan.
Third, it is essential to familiarise the students with a multimedia environment. Taking for granted that they already know everything about this technology and have developed routine skills related to its use would be a serious error. In addition, students also need to be taught about data exchange procedures, since they generally have very little or no experience in this domain.
Finally, the programme is very stimulating, bringing students from various European institutions together. The sense of belonging to a study group amplifies their motivation and builds up their confidence; it also reassures them that they will not be isolated during the rest of the course, which is taken by distance learning.
Once they have returned to their respective universities, students are expected to go through the course within 11 weeks to cover the 11 units. They keep in contact with the Zürich team by e-mail and receive weekly assignments, some of which consist of recording their own Akan oral productions.
Part of the examinations is conducted over the Internet. In addition, 5-minute telephone conversations in Akan have been tested as a way of assessing students' oral comprehension and communicative skills. Although e-mail exchanges favour interaction between teachers and students for the written aspects of communication, oral interaction with the students is thought to be indispensable, given the difficulties of the Akan language. A solution to this could be to organise, at a local level, occasional meetings with Akan native speaker residents. The Akan team have also started to design new exercises which will help improve students' self-monitoring of their pronunciation. Since their multimedia application is open, a number of enrichments suggested by students' feedback can easily be implemented.
Keeping the learner at the centre of attention has certainly ensured the success of the ALI-AKAN project. While the team's pedagogical product and methodology can be cited as an example of good practice, they also put into perspective the lack of interest Europe has shown towards West African languages in the recent past.
Based on these experiences, the wider field of African languages must be considered. European countries have had a tendency to neglect West African languages, so much so that Akan has recently been totally absent from the scene at European universities. This meant that all the knowledge accumulated in the past regarding Akan had been 'forgotten'. It also meant that Ghanaian universities hardly had any links with European linguists, which can explain why Ghanaian scholars had turned to US universities for scientific collaboration. The ALI-AKAN project has paved the way for a new appraisal of European linguistic strategy as regards the African continent and its languages. The University of Zürich is playing a leading role in this respect, co-operating for example with the University of Ghana at Accra on the one hand and sharing its knowledge and competencies with other EU universities on the other.
We have used ALI-AKAN as a prominent, though quite singular, example of the present state of affairs. We are not aware of other significant international ODL initiatives in NEL. Given the success of the ALI-AKAN project, we feel its methods can profitably be applied to other languages and contexts. We must add, however, that issues in second language learning in general are receiving attention in Europe. As an example, a new ODL course in second language learning started in January 1999 at the University of Bergen. The course was broadcast on Norwegian national TV, supported by printed materials and by personal teacher-student interaction via the Internet. In the context of this course, an introductory textbook on Vietnamese has been written (Rosén, 1999).
Many factors could help explain this lack of dynamism. Among the most salient is the fact that Asian researchers themselves very often work on English instead of their own languages. At European universities, the general competencies needed tend to be concentrated in centres dealing almost exclusively with European languages, English in particular. As for lesser technologised languages, and as a general rule, the field is practically untilled. For the most endangered languages, time is running out. If tools do not quickly become available for these languages, their survival in the information age becomes threatened as they are overrun by English on the information superhighway. And for the smallest of them, if sufficient data is not recorded soon in an appropriate format before they become extinct, we may in the future not even know what these languages were like.
The positive side of the computational linguists' preoccupation with English is that all the work carried out on English has enabled the scientific community to develop solid methodologies and high-performing tools. Taking into account the knowledge and experience acquired, it can be assumed that creating language resources for distant languages will be at least somewhat more cost-effective than was the case with English. The negative side is that methodologies for English cannot always be simply transferred to languages with radically different linguistic structures, different writing systems, and texts produced in different literary traditions and socio-political settings. For example, the development and use of parallel corpora for a number of applications seems of limited use for languages with extensive morphological variation and quite different syntactic structures. One should not, then, pursue the goal of amassing data blindly, without reflecting on what the data will be used for.
With the advent of more powerful hardware, its wider dissemination around the world, and the generalisation of electronic publishing, more and more NEL textual data is being produced in digital form. With some exceptions, however, the amounts of data available for most languages are so insignificant that they can hardly be thought of as potential resources for the creation of textual archives. As for electronic documents not meant for private use, their reusability is subject to copyright rules. Added to this restriction come technical problems due to the fact that many non-European languages exhibit significant differences in their writing systems as regards the direction of the script, the encoding of characters, the variability of characters depending on their position in a string, and the linguistic units represented by the characters. Due to lack of standardisation, many NEL documents have been produced using do-it-yourself or proprietary fonts, which does not promote compatibility across platforms.
Consequently, it will be appreciated that creating NEL textual resources is not a trivial task. The difficulties lie essentially in the accessibility and usability of the resources which are gathered, partly due to organisational and technical difficulties which are specific to NEL and all too often neglected in a European context. One of the main bottlenecks could be the problem of copyright, followed by the high homogeneity requirements of the technological formats to be adopted.
Because it is labour-intensive and expensive to create data resources that are large enough to be of real use, the creators of such resources have been unwilling to share them with others. The problem is especially acute if the resources contain material that cannot be sold because of copyright restrictions. If the copyright holders themselves create such data, it is very unlikely that they do so for any purpose other than making money. It is also doubtful whether some ways of avoiding copyright issues are a good basis for creating language data. When copyright holders prevent or restrict language data from being made public, some compilers of the data try to get around these restrictions by taking only small samples from many texts, thus making these data unusable for most literary research. Alternatively, the compilers refrain from trying to sell the data, in which case the burden of financing the work falls on the compilers alone. If this is the case, there is not much motivation for sharing the data with others.
Because language resources are created primarily for research purposes, copyright holders should be made aware that allowing inclusion of their material in language resources would benefit them in several ways. By licensing the material for testing language processing tools, they would ensure that the tools are tuned to handle the kind of material they produce. As a consequence, they themselves would benefit by obtaining appropriate tools for their own use.
At least for Swahili, experience has shown that publishing houses have in general been willing to license their material for research purposes in an environment where access to the resources is controlled and granted only by a signed agreement. In our experience, a non-commercial language resource server, with access limited to researchers only, is a better solution than freely available resources, because it facilitates the inclusion of materials under copyright restrictions. An unsolved problem is how to obtain some kind of compensation from the users of such high quality data. One solution used is to ask them to contribute research efforts to building the data bank. Although it has produced substantial results only in a few cases, this has so far proved the most practical way of making useful data available.
The accessibility of texts by itself is not sufficient for building qualitatively good text corpora. For the European languages, public bodies have contributed vastly to funding the creation of some carefully coded, reusable language resources, which clearly have been seen as assets of strategic value in this era of global multilingual information. It is an open question, however, who will be willing to fund further development of NEL resources. As for now, we can but observe that even for European languages, such resources are not made available to end users free of charge. The price to be paid for Swedish SpeechDat(II) can reach EUR 60,000 (for non-ELRA members). For the PAROLE Dutch Corpus, the price is comparatively moderate at EUR 300 (for ELRA members). If Europeans want NEL textual resources they can use and work on, they will certainly have to devise ways of making this possible. Co-operation with NEL speaking countries could then prove essential.
Drawing on the experience acquired in the treatment of European languages, NEL textual resource builders will certainly be able to progress cost-effectively. If designed in agreement with today's recognised standards, systems for building resources can be extended and made available for multilingual applications. Indeed, standards such as TEI, SGML, HTML, and Unicode (ISO/IEC 10646) are the ultimate guarantees for the viability and efficiency of future NEL projects. Although some standards may need further adjustments to take into account some NEL characteristics, researchers are convinced that the harmonisation process must continue on a global scale.
Quality control of text archives is a determining factor for the usability of these resources. Indeed, further processing of texts could become useless if the material it is applied to contains mistakes, and might even amplify those mistakes. On the other hand, good texts are not sufficient in themselves: adequate tools are needed to exploit them. Computational linguists have already produced a few tools for non-European languages which are indispensable for general purpose activities; their know-how has allowed for the production of information retrieval, concordancing, and statistical analysis systems, to start with. The need for further specialised tools for the exploitation of NEL texts and data depends on requirements to be formulated by the various humanities disciplines or by industry.
It is thus clear that the textual archives and other language data banks which have been built ought to be tuned to the needs of their end users, who will not all be computational linguists. Our concern here is the reusability of the resources, which is a critical issue given the large investments necessary for their acquisition and editing. Some researchers feel that while we wait for new standards, e.g. XML, the most effective short-term option consists of using the lowest common denominator, i.e. quasi-raw text, without any tags, in the Unicode format. This would drastically reduce the need for special but incompatible tools. This simple approach to standardisation would certainly be feasible in the short term, since non-European languages are so numerous and a lot of the fundamental work for their computer treatment remains to be done. It would also ensure a high rate of productivity in the NEL field, where people at present are permanently confronted with innumerable technical obstacles, a multiplicity of proprietary systems, and a severe lack of tools. A larger number of NEL specialists would be attracted to the use of advanced computer tools if the field did not look like another Babel to be deciphered.
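This lowest-common-denominator approach can be sketched minimally as follows (a hedged illustration assuming nothing about any particular corpus: markup is stripped, and the text is brought to one canonical Unicode code-point form so that incompatible encodings of the same character cannot defeat later searching):

```python
import re
import unicodedata

def to_plain_unicode(raw):
    """Reduce a document to quasi-raw text: tag-free, Unicode-normalised
    to NFC (one canonical code-point sequence per character), whitespace
    collapsed."""
    text = re.sub(r"<[^>]+>", " ", raw)        # strip markup tags
    text = unicodedata.normalize("NFC", text)  # canonical composition
    return re.sub(r"\s+", " ", text).strip()   # collapse whitespace

# 'e' followed by a combining acute accent (two code points) becomes the
# single precomposed code point U+00E9 after normalisation:
sample = "<p>cafe\u0301</p>"
print(to_plain_unicode(sample))  # café
```

Text prepared this way can be processed by any Unicode-aware tool, which is precisely the point of the proposal: no special but incompatible tooling is required.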
Even though more and more material is becoming accessible in electronic form, the amount of material available only in traditional paper format, or on paper with restricted access to parallel electronic files, is also increasing. This reflects in many cases the issue of rights to intellectual property, the overhead added to publications by traditional publishing houses, and the associated problems of marketing and pricing. It would be desirable to identify some of the implications of the new information situation with regard to primary and secondary textual sources for NEL studies, particularly with regard to the uniform preparation of texts to make them searchable as a single corpus. Marketing and pricing problems will also be dealt with, but only ad hoc, as they arise.
End users of text resources have many different kinds of goals. Within the humanities, the purposes of scholars include not only the study of language, but also cultural, literary, historical, anthropological, and religious approaches. In addition, there may be a variety of uses by other actors in society, for instance by financial, agricultural, medical, and political organisations. Information retrieval methods for searching texts in different ways are of paramount importance. Pure string search is an easy method, but may be unsuitable for a variety of reasons. First, search methods which are restricted to specific forms and codings of a word may be grossly insufficient for languages where words have many morphological variants. Second, simple search methods do not take into account the semantic or pragmatic value of a word, or the linguistic or cultural context in which it occurs in the text.
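The first limitation can be sketched with a small example (the Swahili-style verb forms below are purely illustrative): a pure string search for one form of a verb misses its morphological variants, whereas a search expanded over known variants of the stem finds them all.

```python
import re

def find_all(text, forms):
    """Return every occurrence of any of the given word forms,
    matched as whole words."""
    pattern = r"\b(?:" + "|".join(map(re.escape, forms)) + r")\b"
    return re.findall(pattern, text)

text = "alisoma kitabu; anasoma sasa; watasoma kesho"

# A pure string search for a single form misses the other variants:
print(text.count("anasoma"))  # 1

# Expanding the search over known variants of the stem -soma- finds all three:
print(find_all(text, ["alisoma", "anasoma", "watasoma"]))
```

A real system would of course derive the variant list from a morphological analyser (cf. Koskenniemi 1983) rather than enumerate forms by hand; the sketch only shows why plain string matching is insufficient.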
But even at a more basic level, simple string searching is problematic for many non-European languages using different scripts, since many existing search engines do not handle the codings of the scripts in practically usable ways. A few trial searches in any of the many hundreds of scholarly journals published in the now nearly standard PDF (Portable Document Format) will show how unsatisfactory this format is for the retrieval of anything but simple text in the most familiar western languages. The technology can reproduce non-Latin scripts and Latin with diacritics, but it can seldom retrieve them. Nor can it retrieve symbols from within a mathematical equation, for example, even when it can find the same symbols outside an equation. The irony is that all the text one may wish to search in Adobe Acrobat files is produced in a totally coherent (word-processing) format, and is consequently machine searchable before it is stored electronically in PDF form.
Orientalist publications tend by their nature to include a considerable variety of languages and scripts, and thus provide an ideal testing ground for solutions for multilingual and multiscript corpora. Bell (1998) proposes to use the Journal of Arabic and Islamic Studies, which is electronically published at Bergen and Prague, as a model for a large unified corpus in these areas. JAIS publishes material in accordance with strict philological principles. The presentation in electronic form, however, offers new opportunities and presents new challenges. The electronic files of an article or a text are not definitively fixed, but will evolve so as to allow the application of future technological developments. Currently, the texts are minimally tagged and use a specially designed, freely downloadable font. In a current project, the journal, seen as a corpus, will undergo various forms of syntactic and even conceptual tagging which will allow more refined searches, even cross-linguistic ones, for a variety of purposes.
As a final general comment, it has to be underlined that today most European NEL students are completely unaware of what textual archives represent. With the exception of one or two centres of excellence, no NEL departments train their students on a systematic basis in the use of textual archives or NEL corpus linguistics. If institutions offering NEL studies continue to ignore new technologies, they run the risk of producing students who cannot live up to job market requirements. The industries themselves voice their worries plainly enough when they remark that most of the postgraduate linguists they hire suffer, to some degree, from technological and methodological inadequacy. We conclude this section with a brief overview of currently available resources in some language areas.
The following list of language data available on African languages is by no means exhaustive. It shows, however, where the emphasis is.
An important Japanese actor in the field is MESSC (Ministry of Education, Science, Sports and Culture), which has initiated projects with the National Language Research Institute of Japan, with ATR (Advanced Telecommunications Research Laboratory) and with numerous universities. We also mention the RWC Text Databases: http://www.rwcp.or.jp/wswg/rwcdb/text/.
The storage of Japanese text and speech data is also a strong feature in the USA, with such products as:
With regard to commercial products on CD-ROM, two of the most significant contributions to the study of Arabic language and Arabic and Islamic culture are the work of European publishing houses, namely, the CD-ROM version of the multi-volume Encyclopaedia of Islam about to be released by E. J. Brill in Leiden and the CD-ROM version of Index Islamicus, an essential bibliography of virtually all scholarly articles on Arabic and Islamic studies in European languages from 1906 to the present, published by Bowker Saur in Britain.
For the most part, textual material in digital format is to be had from Middle Eastern companies, often encrypted so as to make it difficult to use together with other digital resources (a necessary evil in a small market). Islamic religious material is predominant. The Sakhr Software Company in Egypt, which markets an OCR program mentioned elsewhere in this volume, is also responsible for a large proportion of the text collections and databases which have recently become available. (Some Sakhr products are now marketed under the name of Harf Information Technology.)
The Koran with commentary, the nine most highly regarded collections in Sunni Islam of the traditions of the prophet Muhammad, the shorter work al-Bayan (including the prophetic traditions agreed upon by the two major medieval scholars al-Bukhari and Muslim with English, French, German, Indonesian, Malay, and Turkish translations), compendia of Islamic jurisprudence and Muslim rulings regarding business law, as well as other Islamic materials, are marketed by the Harf company. Numerous Ugaritic and South Arabian texts are collected at Charles University, Prague.
Literary texts have been slower in appearing in digital format, although multimedia versions of children's stories are being marketed. The Sakhr company has on its Internet site in searchable form the collected poetry of the major medieval poet al-Mutanabbi with extensive commentary, but it has not marketed the product as of this writing. The mainframe corpus of early Arabic poetry established by Alan Jones at Oxford before the day of the personal computer remains therefore one of our major accessible digital sources of literary material.
Journalistic text in Arabic has, for the reasons mentioned above, not been particularly easy to come by. However, the Arabic newspaper al-Hayat, based in London, has published its recent years of issues on CD-ROM (Macintosh format), and some other newspapers use text format for their Internet articles, although many, as mentioned earlier, use image files.
Students have a need for multilingual corpora in order to practise information retrieval across language barriers. One example of such a multilingual corpus including Arabic consists of the multiple translations of the collection of prophetic tradition in al-Bayan, mentioned above. The electronic version of the Journal of Arabic and Islamic Studies is also in part being prepared for this purpose. A further example consists of the French and Arabic texts included in DICAU, a digital bilingual French and Arabic bio-bibliographical dictionary of contemporary Arab authors, under development at the Institut du Monde Arabe in co-operation with Dr. Rafik Belhadj Kacem from the software producer Société EPOS, Paris.
Dictionaries are beginning to appear on CD-ROM and on-line. Sakhr has also produced a five-language (Arabic, English, French, German, and Turkish) dictionary including meanings, synonyms, and antonyms. It also has several major dictionaries on its Internet site. Meanwhile Future Publishing of Lebanon has made available on CD-ROM the voluminous medieval Arabic-Arabic dictionary Lisan al-Arab by Ibn Manzur (readable on both PC and Macintosh). Future Publishing has also produced a multimedia version of the standard one-volume Arabic encyclopedia al-Mawrid. Also in this respect we mention again TITUS, the Thesaurus Indogermanischer Text- und Sprachmaterialien, at the University of Frankfurt/Main.
The above are only a few examples of the kinds of material being made available on-line and on CD-ROM. For linguistic, textual, and cultural studies, the list is quite representative, both in scope and length. Gradually it is getting longer, but it is nowhere near long enough to allow the kinds of textual analyses scholars in European languages can take for granted.
The Section for Linguistic Studies, University of Bergen, has access to a number of syntactic workbenches, advanced computational tools which greatly support the development of grammars. Xerox has developed a number of such tools, including the XLE platform and the LFG Grammar Writer's Workbench (Kaplan and Maxwell 1996), which provide an implementation of the Lexical-Functional Grammar syntactic formalism. The implementation also includes more recent features of the theory, such as functional uncertainty and multiple projections. The LFG Workbench has been used to develop advanced grammars not only for English and other major European languages, but also for Vietnamese (cf. Rosén 1997). Specific constructions covered by these grammars include the topic-comment construction and sentences with empty pronouns in Vietnamese, and topicalisation structures in English. The syntactic analysis involves both c-structures (phrase structure trees) and f-structures (feature matrices with grammatical information presented in a manner independent of phrase structure representation). In addition, there are semantic and discourse analyses, both in the form of feature matrices. Although these grammars have extremely limited coverage, they can be valuable from a number of perspectives, including theoretical and pedagogical perspectives.
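The notion of an f-structure as a feature matrix can be sketched with a toy unification routine (a deliberately simplified illustration in Python, not the Workbench's actual implementation): two partial feature matrices, say one contributed by the verb and one by its subject NP, are merged into a single structure, and the merge fails when features clash.

```python
# Toy sketch of f-structures as feature matrices (nested dicts) with a
# minimal unification operation, in the spirit of LFG.
def unify(f1, f2):
    """Unify two feature matrices; return None on a feature clash."""
    result = dict(f1)
    for feat, val in f2.items():
        if feat not in result:
            result[feat] = val
        elif isinstance(result[feat], dict) and isinstance(val, dict):
            sub = unify(result[feat], val)  # unify embedded f-structures
            if sub is None:
                return None
            result[feat] = sub
        elif result[feat] != val:
            return None  # clash, e.g. NUM sg vs. pl

    return result

# Information about the subject contributed by the verb's agreement
# morphology and by the subject NP itself:
subj_from_verb = {"SUBJ": {"NUM": "sg", "PERS": "3"}, "TENSE": "pres"}
subj_from_np = {"SUBJ": {"NUM": "sg", "PRED": "student"}}

print(unify(subj_from_verb, subj_from_np))
# one f-structure combining the information from both contributions
```

The same mechanism rejects ungrammatical combinations: unifying a singular requirement with a plural one returns no result, which is how feature clashes rule out ill-formed sentences.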
The grammar of languages such as Vietnamese and Chinese has been treated in the West from two basic perspectives: structuralist and generative. The structuralist approach usually stresses how different these languages are from English, while the generative approach tends to make these languages appear to have the same structure as English (if not at surface structure, then at least at some deeper level). In a theory such as LFG, it is possible to write grammars for widely varying languages based on a common set of linguistic principles. The level of functional structure permits different languages to express the same grammatical distinction in different ways: for instance, one language may code object status through accusative case, another may code it through phrase structure position, and a third may code it through the use of an adjacent object particle. Students with such a workbench at their disposal may use the somewhat eclectic language descriptions in thorough but outdated structuralist grammars to implement a generative grammar, and simultaneously learn more both about generative grammar and about the grammar of the language itself.
The LFG Grammar Writer's Workbench is already being used for teaching syntax to beginning linguistics students at the Section for Linguistic Studies. They use it for writing syntactic rules for Norwegian, but there are plans to use it also in their study of a non-Indo-European language. Although the study of such a language is a popular part of the linguistics programme, students have often complained that there is too little connection between the study of this language and the linguistic theories they learn about. Actually writing syntactic rules for various grammatical constructions in the non-Indo-European language, and using the workbench to test whether these rules correctly analyse the constructions, would provide students with a much more concrete link between these two parts of the study programme.
In the course on articulatory phonetics which forms part of the preparatory programme for African linguistics at Leiden University in the Netherlands, students learn to recognize African speech sounds. This goal is well suited to computer supported approaches. In 1997, the project AFROFOON was started by Alex de Voogt at Leiden University. AFROFOON is implemented as a Hypercard application on the Macintosh. It provides students with self-paced exercises in the distinction, recognition and production of speech sounds from different languages. At the same time, students learn transcription in IPA (International Phonetic Alphabet). We feel that projects like this one could profitably be repeated in other settings.
For all but the most studied languages, there is still a lot of work to be done in creating and distributing basic language resources, such as corpora and lexicons. On the one hand, fortunately, there are some new projects, especially in corpus compilation. By way of example, we mention a new Arabic corpus project in Italy, carried out jointly by the Istituto Universitario Orientale, Napoli, and the Istituto di Linguistica Computazionale, Pisa. Another Arabic corpus project, this one oriented towards lexicography, is carried out at the University of Nijmegen, the Netherlands. On the other hand, however, the pace of development for the different languages is slow. Especially in the field of terminology, very little is being done for NEL. With the increase of specialisation in multilingual and multicultural environments, we think terminology will shift from the marginal position it now has (both in applied linguistics and in computing) to a more central, strategic position. And this, far from being just a matter of software technology, is a question of structuring knowledge. Moreover, there seems to be a real need in this field, both in training and in applied situations.
Centre de Recherche en Ingénierie Multilingue de l'Institut National
des Langues et Civilisations Orientales (CRIM-INALCO)
2 rue de Lille, F-75007 Paris, France.
e-mail: firstname.lastname@example.org, Web: http://www.inalco.fr/pub/enseignements/filieres/crim/anglais/index.htm.
Istituto Universitario Orientale (IUO), Dipartimento di Studi e Ricerche
su Africa e Paesi Arabi (DSRAPA)
Piazza San Domenico Mag., Palazzo Corigliano, I-80134 Napoli, Italia.
e-mail: email@example.com, Web: http://www.iuo.it/.
Universitetet i Bergen, Seksjon for midtøstens språk og
kultur (Section for Middle Eastern Languages and Cultures)
Hans Tanksgt. 19, N-5020 Bergen, Norway.
e-mail: firstname.lastname@example.org, Web: http://www.hf.uib.no/i/Midtspraak/Midtspraak.html.
Hurskainen, Arvi (1998). Maximizing the (re)usability of language data: http://www.hd.uib.no/AcoHum/nel/paper-hurskainen.html.
Journal of Arabic and Islamic studies (JAIS), Bell, Joseph (ed.). http://www.uib.no/jais/ (ISSN: 0806-198X).
Karangwa, Jean de Dieu, and Souillot, Jacques (1998). Report on the Cameel inquiry (on African languages at EU universities): http://www.hd.uib.no/AcoHum/nel/Cameel-report.html.
Koskenniemi, Kimmo (1983). Two-level morphology: a general computational model for word-form recognition and production. Publication No. 11. Helsinki: University of Helsinki Department of General Linguistics.
Rosén, Victoria (1997). Topics and Empty Pronouns in Vietnamese. Doctoral dissertation, University of Bergen.
Rosén, Victoria (1999). Vietnamesisk. En kontrastiv og typologisk introduksjon. Trondheim: Tapir.
Itahashi, Shuichi (1998). "On Speech and Text Database Activities in Japan", Proceedings of the First International Conference on Language Resources & Evaluation. Granada, Spain, 28-30 May 1998, pp. 355-360.
Souillot, Jacques (1998). Technologization and computer literacy in Africanistics and Japanistics. http://www.hd.uib.no/AcoHum/nel/jacques-inquiry.html.
ALI-AKAN project, co-ordinator Thomas Bearth: http://www.spw.unizh.ch/afrling/aliakan.
Seminar für Allgemeine Sprachwissenschaft. Universität Zürich. Plattenstr 54. CH-8032 Zürich, Schweiz. Tel. ++41-1-2572091, fax: ++41-1-980 54 28; e-mail: email@example.com.
Atlas on endangered languages: http://www.adelaide.edu.au/PR/atlas.html.
CAMEEL project, co-ordinator Maddalena Toscano: http://www.agora.stm.it/cila/progetti/cameel/cameelhome.htm.
Dipartimento di Studi e Ricerche su Africa e Paesi Arabi (DSRAPA) - Istituto Universitario Orientale (IUO). Piazza San Domenico Mag., Palazzo Corigliano. I-80134 Napoli. Italia. Tel. ++39-81-5517840, fax : ++39-81-5517901; e-mail: firstname.lastname@example.org.
Écritures du monde: http://www.culture.fr/edm.
Foundation for Endangered Languages: http://www.bris.ac.uk/Depts/Philosophy/CTLL/FEL/.
Minority/Endangered Languages Links: http://www.tooyoo.l.u-tokyo.ac.jp/minority.html.
PRISME, ODL course on second language learning: http://studier.uib.no/prisme/index.nsf.
The International Clearing House for Endangered Languages: http://www.tooyoo.l.u-tokyo.ac.jp/ichel.html.
Let's learn Arabic, University of Pennsylvania: gopher://ccat.sas.upenn.edu:70/11/software/dos.
Swahili Text Archives (University of Helsinki): email@example.com.
The Arabic Tutor: http://www.arabic2000.com/school/.
WestAfricaScripts, Hannes Hirzel: firstname.lastname@example.org.
Xerox LFG tools: http://www.parc.xerox.com/istl/groups/nltt/pargram/dev-env.html.
Lernout and Hauspie Speech products: http://www.lhs.com
University of Stellenbosch, Research Unit for Experimental Phonology: http://www.sun.ac.za/local/academic/arts/nefus/main.htm.