The Norwegian Newspaper Corpus
The Norwegian Newspaper Corpus is a large, self-expanding corpus of Norwegian newspaper texts. Collection of this continually growing corpus began in 1998. The corpus is updated automatically by means of w3mir, an all-purpose HTTP copying and mirroring tool. Every day, the mirroring tool retrieves recently published texts from a set of remote web sites, specifically the entire Internet editions of ten major Norwegian newspapers.

A set of tools developed in-house is used for further processing and annotation of the texts. The system automatically selects the relevant text, ignoring advertisements, navigation menus, metatext, HTML code and so on. Next, it automatically classifies each text as either bokmål or nynorsk, the two official written forms of Norwegian, and identifies and rejects texts written entirely in English. The texts are then annotated with word class and other morphosyntactic information by means of the Oslo-Bergen tagger, and both the tagged and untagged texts are added to the database.

Approximately 200,000-250,000 running words are added per day. As of April 2008, the database consists of about 640 million words, and it is by far the largest searchable corpus of Norwegian. The selection of newspapers allows for comparison across various categories: broadsheet versus tabloid formats, national versus regional newspapers, and general-content versus business and finance newspapers. The corpus is accessible via a search interface that uses the IMS Corpus Workbench.

In addition to the text collection and annotation procedure briefly described above, the corpus tools create a daily list of newly encountered word forms. Each word form in the new texts is compared against a comprehensive list of known word forms.
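The comparison of incoming word forms against a reference list can be sketched as a simple set-membership test. This is a minimal illustration, not the project's actual tooling: the tokenizer pattern, the function name `find_new_word_forms`, the case-sensitive matching, and the tiny reference set are all assumptions made for the example.

```python
import re

# Hypothetical tokenizer: runs of letters/digits, optionally joined by hyphens.
# The corpus's real tokenization rules are not described in the text.
TOKEN_RE = re.compile(r"[^\W_]+(?:-[^\W_]+)*")


def find_new_word_forms(text, known_forms):
    """Return word forms in `text` that are absent from `known_forms`.

    Forms are returned in order of first appearance, without duplicates.
    Matching is case-sensitive, which is an assumption of this sketch.
    """
    seen = set()
    new_forms = []
    for token in TOKEN_RE.findall(text):
        if token in known_forms or token in seen:
            continue
        seen.add(token)
        new_forms.append(token)
    return new_forms


# Toy reference list standing in for the full list of known word forms.
known = {"en", "ny", "avis", "kom", "ut", "i", "dag"}
print(find_new_word_forms("en ny nettavis kom ut i dag", known))  # prints ['nettavis']
```

Holding the reference list in a set keeps each lookup at roughly constant cost, which matters when every word form of a day's texts has to be checked.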
The list of known word forms is a compilation that includes all word forms gathered in connection with various corpus and lexicographical projects carried out at Aksis over the years, including the full form lexicon Bokmålsordboka and the lexical resources developed in connection with the Oslo-Bergen Corpus and tagger. The word list currently consists of about 3.5 million unique word forms. New word forms that are not found in this reference list are stored in an archive of neologisms. This archive provides a good resource for the study of word formation processes, lexical productivity, linguistic creativity and so on, and it is considered a valuable resource among Norwegian lexicographers.

A language processing tool automatically classifies the latest neologisms into morphologically distinct categories such as hyphenated or non-hyphenated compounds, names, digits, acronyms, name-lexeme combinations, and so on, and the latest newcomers in each category can be viewed on the project web site. A module for identification of anglicisms is an integral part of this classification tool. On average, about 1,500 new word forms are encountered every day.

Preliminary studies of the material (including Andersen 2004a, 2004b) indicate that the most common types of neologisms are compounds, names (including acronyms), spelling errors, and, indeed, anglicisms. Compounds are particularly prevalent and appear to account for 40-50 per cent of the neologisms. This is crucially linked to the fact that in Norwegian, as in the other North Germanic languages and German (but unlike English and French), compounds are as a rule written as one word, without a hyphen (although hyphenation is optional). Consequently, an overt result of journalistic linguistic creativity is the emergence of content-rich new compounds, often found in newspaper headlines.
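The sorting of neologisms into categories can be illustrated with a few surface-pattern rules. The sketch below is only a rough approximation under stated assumptions: the function names, the rule order, and the category label "other" are hypothetical, and the project's own classifier is not described in enough detail to reproduce. Spelling alone cannot separate non-hyphenated compounds, spelling errors, or anglicisms, which would need lexical resources, so those all fall into "other" here.

```python
import re
from collections import defaultdict


def classify_word_form(form):
    """Assign a coarse, spelling-based category to one word form.

    These rules are illustrative guesses, not the project's classifier.
    """
    # Pure numbers, optionally with '.' or ',' separators (e.g. 2008, 3,5).
    if re.fullmatch(r"\d+(?:[.,]\d+)*", form):
        return "digits"
    # All-uppercase alphabetic strings of two or more letters.
    if len(form) >= 2 and form.isalpha() and form.isupper():
        return "acronym"
    if "-" in form:
        parts = form.split("-")
        # Capitalized first part plus lowercase last part, e.g. a name
        # followed by an ordinary noun.
        if parts[0][:1].isupper() and parts[-1].islower():
            return "name-lexeme combination"
        return "hyphenated compound"
    if form[:1].isupper():
        return "name"
    # Needs a lexicon to resolve further (compound, misspelling, anglicism).
    return "other"


def group_by_category(forms):
    """Group word forms by the category from classify_word_form."""
    groups = defaultdict(list)
    for form in forms:
        groups[classify_word_form(form)].append(form)
    return dict(groups)


sample = ["2008", "NRK", "Obama-effekten", "e-postadresse", "Stoltenberg", "klimakrise"]
for category, members in group_by_category(sample).items():
    print(category, members)
```

A capitalized form is treated as a name only because sentence position is unknown in this sketch; a fuller classifier would presumably draw on the tagger output and the reference lexicon.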