1. Basic Information About the Corpus


1.1 Aim

In 1991 a group of students at Freiburg University were engaged in what at first sight must appear an almost anachronistic activity: they were keying in extracts of roughly 2,000 words from British newspapers. The sampling model was the press section of the LOB corpus (see Sand/Siemund 1992). Work on a matching new version of the Brown corpus began in 1992. The ultimate aim was to compile parallel one-million-word corpora of the early 1990s that matched the original LOB and Brown corpora as closely as possible and would thus provide linguists with an empirical basis for studying language change in progress. This aim is spelled out in some detail in Mair (1997:196). The parallel corpora were compiled to enable linguists to

  1. test at least some current hypotheses on linguistic change in present-day English;
  2. detect changes not previously noticed in the literature through the systematic comparison of lexical frequencies, particularly of closed-class items;
  3. tackle systematically one of the major methodological issues in the study of ongoing change, namely the inter-dependence of synchronic regional (in our case British vs. American) and stylistic variation on the one hand, and genuine diachronic developments on the other.

An additional advantage of the new British and American corpora is that they provide more suitable databases for a comparison with the Indian, Australian and New Zealand corpora (samples representing language use of the late 1980s) than the original LOB and Brown.


1.2 Sampling Techniques

The basic sampling principle in the compilation of Brown and LOB (see Francis/Kucera 1979, Johansson et al. 1978 and Hofland/Johansson 1982) was to select at random not only the titles from the bibliographical sources but also the particular section of a text, using a random-number table. This principle was modified either for practical reasons, e.g. the availability of material, or whenever a single text did not yield the required 2,000 words. Rather than simply including the next article, the compilers chose the next suitable article (as far as style and subject matter were concerned). "This modification of purely random sampling was used extensively in compiling the categories of newspaper prose" (Hofland/Johansson 1982:2). The press sections of Brown and LOB are therefore not representative samples in a strict statistical sense.

This applies even more to the sampling procedures employed in the compilation of FLOB. The main aim in compiling the press section of FLOB was to match the 1991 material as closely as possible with that used in LOB by sampling the same newspapers (see Sand/Siemund 1992). For the other sections, the same magazines and periodicals used in LOB were sampled whenever possible. In the sampling of monographs great care was taken to select books on equivalent topics rather than to select titles at random from bibliographical sources. The main aim was to achieve close comparability with LOB rather than statistical representativeness. (For an overview of the original composition of the corpus, see Johansson et al. 1978.)


1.3 Mark-Up Conventions

The purpose of any coding system is to produce an ASCII text that preserves as much of the information in the original text as possible. Instead of using the rather complicated coding system of the LOB corpus, we used a simplified version of the SGML-based mark-up codes that were drawn up for the coding of the International Corpus of English (ICE) subcorpora (see Nelson 1996 for details). For example, all types of typeface change (underlining, bold, italics, etc.) were subsumed under one general typeface-change code. The mark-up codes are enclosed in angle brackets. A feature that extends over more than one word is enclosed by an opening tag (e.g. <quote_>) and a closing tag (e.g. <quote/>). If a feature extends over only one word, a single tag with a vertical stroke is used (e.g. <quote|>). A list of all the mark-up codes used in the FLOB corpus is included in Chapter 2 of this manual.
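The three tag forms described above can be recognised mechanically. The following is a minimal sketch in Python, not part of the corpus distribution: the function names find_tags and unbalanced are hypothetical, the sample sentence is invented, and the pattern assumes that tag names consist of letters only, which is inferred from the examples in this section rather than from the full code list in Chapter 2.

```python
import re

# Assumed tag shape, based only on the examples given in this section:
#   <name_> opens a feature, <name/> closes it, <name|> marks a single word.
# Letters-only tag names are an assumption; see Chapter 2 for the real list.
TAG_RE = re.compile(r"<([A-Za-z]+)([_/|])>")

KINDS = {"_": "open", "/": "close", "|": "single"}


def find_tags(text):
    """Return (name, kind, offset) for every mark-up code found in text."""
    return [(m.group(1), KINDS[m.group(2)], m.start())
            for m in TAG_RE.finditer(text)]


def unbalanced(text):
    """Return names of closing tags without an opener, then unclosed openers."""
    stack = []
    problems = []
    for name, kind, _ in find_tags(text):
        if kind == "open":
            stack.append(name)
        elif kind == "close":
            if stack and stack[-1] == name:
                stack.pop()
            else:
                problems.append(name)
        # "single" tags need no partner, so they are ignored here.
    return problems + stack


# Invented sample line, not taken from the corpus.
sample = "He said <quote_>not <tf|>now<quote/> and left."
print(find_tags(sample))
print(unbalanced(sample))
```

A check of this kind may be useful before running a concordancer over the files, since an unclosed opening tag would otherwise extend a feature over the rest of the text.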

In addition to those mark-up tags that help to represent the microstructure of the original (i.e. those indicating a typeface change or the beginning of a new paragraph), the ICE mark-up includes codes that help to interpret rather than represent the original text (i.e. the marking of non-English text or transliterations of Greek or Hebrew text).

In order to ensure that the corpus text would be as ‘readable’ as possible, the use of mark-up symbols was kept to a minimum. In particular, we tried to avoid the use of double codes. If a non-English word in the original text was set in italics, it was coded only as non-English (<foreign_>word<foreign/>) and not additionally as a typeface change (<tf_><foreign_>word<foreign/><tf/>).
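Because the codes are few and uniformly shaped, they can be removed or queried with a short script. The sketch below is an illustration only: strip_markup and foreign_spans are hypothetical helper names, the sample sentence is invented, and the pattern assumes letters-only tag names ending in "_", "/" or "|", as in the examples given in this section.

```python
import re

# Assumption: every mark-up code looks like <name_>, <name/> or <name|>.
TAG_RE = re.compile(r"<[A-Za-z]+[_/|]>")

# Text coded as non-English, as in <foreign_>word<foreign/>.
FOREIGN_RE = re.compile(r"<foreign_>(.*?)<foreign/>", re.DOTALL)


def strip_markup(text):
    """Remove all mark-up codes, leaving the running text."""
    return TAG_RE.sub("", text)


def foreign_spans(text):
    """Return the stretches of text coded as non-English."""
    return FOREIGN_RE.findall(text)


# Invented sample line, not taken from the corpus.
coded = "The <foreign_>Zeitgeist<foreign/> had changed."
print(strip_markup(coded))
print(foreign_spans(coded))
```

Since an italicised non-English word carries only the <foreign_> code, a search for typeface changes alone will not find it; queries about italics may need to take this coding decision into account.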


1.4 Corpus-Related Publications

Bauer, Laurie. 1994. "Introducing the Wellington corpus of written New Zealand English." Te Reo: Journal of the Linguistic Society of New Zealand 37: 21-28.

Gloderer, Gabriele. 1993. Morphological Regularisation of Irregular Verbs: A Comparison of British and American English. Unpublished M.A. Thesis: Freiburg.

Graf, Dorothee. 1996. Relative Clauses in Their Discourse Context: A Corpus-Based Study. Unpublished M.A. Thesis: Freiburg.

Hundt, Marianne. 1997. "Has British English been catching up with American English over the past thirty years?" In Ljung, Magnus, ed. Corpus-Based Studies in English: Papers from the Seventeenth International Conference on English-Language Research Based on Computerized Corpora (ICAME 17). Amsterdam: Rodopi. pp. 135-51.

-----. 1998. New Zealand English Grammar - Fact or Fiction? A Corpus-Based Study in Morphosyntactic Variation. Amsterdam/Philadelphia: Benjamins.

-----. forthcoming. "The Press Sections of One-Million-Word Corpora." To be published in Anglistik und Englischunterricht.

-----. forthcoming. "It is Important that this study (should) be based on the analysis of parallel corpora: On the use of the mandative subjunctive in four major varieties of English." In Lindquist, Hans et al., eds. Proceedings of the 1997 MAVEN Conference, Växjö.

-----, and Christian Mair. forthcoming. "'Agile' and 'uptight' genres: The corpus-based approach to language change in progress."

Krug, Manfred. 1994. Contractions in Spoken and Written English. A Corpus-Based Study of Short-Term Developments Since 1960. Unpublished M.A. Thesis: Exeter.

-----. 1996. "Language change in progress: Contractions in journalese in 1961 and 1991/92." In McGill, S., ed. Proceedings of the 1995 Graduate Research Conference on Language and Linguistics. Exeter Working Papers in English Language Studies 1, Exeter. pp. 17-28.

-----. 1997. "The auxiliarization of want and wanna." In Blumenfeld, O., ed. From Margin to Centre (Papers from the International Conference at the University of Iasi, 31 October - 2 November 1996). Iasi: Dept. of English.

-----. forthcoming. "Korpuslinguistik als Lehrmethode im universitären Fremdsprachenunterricht." In Börner, W. and K. Vogel, eds. Sammelband der 7. Göttinger Fachtagung "Fremdsprachenausbildung an der Hochschule".

Lovejoy, James. 1995. "Prepositions in British and American English - A Computer-Aided Corpus Study." Arbeiten aus Anglistik und Amerikanistik 20: 55-74.

Mair, Christian. 1994. "Is see becoming a conjunction? The study of grammaticalisation as a meeting ground for corpus linguistics and grammatical theory." In Fries, Udo et al., eds. Creating and Using English Language Corpora: Papers from the Fourteenth International Conference on English Language Research on Computerized Corpora, Zürich 1993. Amsterdam: Rodopi, 127-137.

-----. 1995. "Changing patterns of complementation, and concomitant grammaticalisation, of the verb help in present-day British English." In Aarts, Bas and Charles F. Meyer, eds. The Verb in Contemporary English: Theory and Description. Cambridge: CUP, 258-272.

-----. 1996. "The spread of the going to-future in written English." In Hickey, Raymond and Stanislaw Puppel, eds. Language History and Linguistic Modelling: A Festschrift for Jacek Fisiak. Vol. II. Berlin: Mouton de Gruyter, 1537-1543.

-----. 1997. "Parallel corpora: A real-time approach to language change in progress." In Ljung, Magnus, ed. Corpus-Based Studies in English: Papers from the Seventeenth International Conference on English-Language Research Based on Computerized Corpora (ICAME 17). Amsterdam: Rodopi. pp. 195-209.

-----. 1998. "Corpora and the study of the major varieties of English: Issues and results." In Lindquist, Hans et al., eds. Proceedings of the 1997 MAVEN Conference, Växjö.

-----, and Marianne Hundt. 1995. "Why is the progressive becoming more frequent in English? - A corpus-based investigation of language change in progress." Zeitschrift für Anglistik und Amerikanistik 43: 111-122.

-----, and Marianne Hundt. 1997. "The corpus-based approach to language change in progress." In Böker, Uwe and Hans Sauer, eds. Anglistentag 1996 Dresden: Proceedings. Trier: Wissenschaftlicher Verlag, 71-82.

Meyer, Matthias. 1995. "Einzelaspekt: Do als Pro-Form." In Ahrens, Rüdiger et al., eds. Handbuch Englisch als Fremdsprache. Berlin: Erich Schmidt, 139-142.

Nelson, Gerald. 1996. "Markup Systems." In Greenbaum, Sidney, ed. Comparing English Worldwide - The International Corpus of English. Oxford: Clarendon. pp. 36-53.

Raab-Fischer, Roswitha (=Fischer, Roswitha). 1995. "Löst der Genetiv die Of-Phrase ab? Eine korpusgestützte Studie zum Sprachwandel im heutigen Englisch." Zeitschrift für Anglistik und Amerikanistik 43: 123-132.

Sand, Andrea, and Rainer Siemund. 1992. "LOB - 30 years on..." ICAME Journal 16: 119-122.

Sand, Andrea. 1998. Linguistic Variation in Jamaica: A Corpus-Based Analysis of Radio and Newspaper Language. Tübingen: Narr.

-----. forthcoming. "Machine-readable corpora in research and teaching: A survey." Anglistik und Englischunterricht: Media Old and New in Teaching and Research.

Siemund, Rainer. 1993. Aspects of Language Change in Progress: A Corpus-Based Study of British Newspaper English in 1961 and 1991. Unpublished M.A. Thesis: Freiburg.

-----. 1995. "For who the bell tolls - Why corpus linguistics should carry the bell in the study of language change in present day English." Arbeiten aus Anglistik und Amerikanistik 20: 351-377.

Sigley, Robert. 1997a. "Text categories and where you can stick them: A crude formality index." International Journal of Corpus Linguistics 2 (2): 1-39.

-----. 1997b. Choosing Your Relatives: Relative Clauses in New Zealand English. PhD thesis. Wellington: Victoria University of Wellington.

Skandera, Paul. 1995. Computergestützte Korpuslinguistik und Sprachwandelforschung - Grammatische Neuerungen in der amerikanischen Pressesprache seit 1961. Unpublished M.A. Thesis: Freiburg.