MANUAL OF INFORMATION

TO ACCOMPANY

THE LANCASTER-OSLO/BERGEN CORPUS

OF BRITISH ENGLISH, FOR USE WITH

DIGITAL COMPUTERS

 

BY

STIG JOHANSSON

IN COLLABORATION WITH

GEOFFREY N. LEECH

HELEN GOODLUCK

 

 

DEPARTMENT OF ENGLISH

UNIVERSITY OF OSLO

1978

PREFACE

The present corpus is the result of cooperation between the University of Lancaster, the University of Oslo, and the Norwegian Computing Centre for the Humanities at Bergen. During 1970- 1976 the project was conducted at the University of Lancaster, the Department of Linguistics and Modern English Language, under the direction of Geoffrey N. Leech and financially supported by grants from Longman Group Limited and the British Academy. In 1977 the project was moved to Norway, where it was directed by Stig Johansson, Department of English, University of Oslo, and was completed in 1978, thanks to financial support from the Norwegian Research Council for Science and the Humanities (NAVF).

We gratefully acknowledge the support of the Longman Group, the British Academy, and NAVF (as well as some financial assistance from the Jahre Fund). We also wish to thank the large number of people who have been working within the project, especially:

Rosemary Leonard (who completed a four year period on the staff at Lancaster, first as Tutorial Fellow, then as Research Assistant on the corpus project)

Norman Fairclough ( the original secretary of the project) Helen Goodluck (who was full-time Research Assistant in 1970- 71 and worked as a temporary Research Assistant in July-August, 1974)

Faith Ann Johansson (who was employed as a part time assistant on the project both at Lancaster, 1973-74, and at Oslo, 1977-78)

Norma Hainsworth, Fanny Leech, Christine Murphy, Judith Perryman, and Alison Ross (who worked on the project at various times)

Mona Flognfelt, Gro Frydenberg, John Y. Jones, Oonagh Sayce, and Mons Thyness (who were employed as part time assistants in 1977-78)

Johan Elsness and Kari Anne Rand Schmidt (who helped with the coding and proof-reading during the final stages of the project) Kari Utheim Riis (who did secretarial work in 1977-78, including the typing of this manual) Doreen Grotdal (who contributed with expert punching and editing)

Special thanks are due to our programmers, Len Wagstaff (Lancaster), and Knut Hofland (Bergen), and other members of staff at the two computing centres.

We are also grateful to Professor W. Nelson Francis, of Brown University, and Professor Bryan Higman, of Lancaster University, for their guidance in setting up the project.

In conclusion, we wish to thank the large number of copyright holders, who allowed their texts to be included, free of charge, in the corpus.

December 1978
Stig Johansson Oslo
in collaboration with:
Jostein H. Hauge, Bergen
Geoffrey N. Leech, Lancaster

CONTENTS

BASIC INFORMATION ABOUT THE CORPUS

Aim
Distribution
The Composition of the Corpus
Organization of the Material
Main Coding Key
Basic Technical Information

SOURCES AND SAMPLING TECHNIQUES
FURTHER DETAILS OF CODING
LIST OF TEXT EXTRACTS

BASIC INFORMATION ABOUT THE CORPUS

Aim

The preparation of a corpus cannot be seen in isolation from its intended uses. These dictate the selection of texts, the amount of material, the coding system, etc. The aim of our project has been to assemble a British English equivalent to the Brown University Corpus of American English. 1) Both sources of data, rather than concentrating on limited types of texts to be used for specific purposes, aim at a general representation of text types for use in research on a broad range of aspects of the language. To facilitate a combined use of the two corpora, an attempt has been made to match the British English material as closely as possible with the American corpus.

Like its American counterpart, the Lancaster-Oslo/Bergen Corpus contains 500 printed texts of about 2,000 words each, or about a million running words in all. The year of publication (1961) and the sampling principles are identical to those of the Brown Corpus, though there were necessarily some differences in text selection. The coding system differs, however, in many respects in the two corpora, the main discrepancy being the greater degree of delicacy of coding in the new corpus.2) This should not seriously affect the possibilities of using the two sources of material in combination. The extent of similarity or difference is apparent from the sections on sampling and coding.

Distribution

The corpus and accompanying manual are available at cost to bona fide researchers through the International Computer Archive of Modern English (ICAME), at the Norwegian Computing Centre for the Humanities, Bergen, Norway.

The following restrictions on the use of the material must be strictly observed:

1) See W. Nelson Francis, Manual of Information to Accompany a Standard Sample of Present-Day Edited American English, for Use with Digital Computers. Providence, Rhode Island: Department of Linguistics, Brown University, 1964.

2) The main difference is the use of a larger character set in this corpus. This is simply a reflection of the development of computer technology.

1. No copies of the corpus, or parts of the corpus, are to be distributed under any circumstances without the written permission of ICAME.

2. Print-outs of the corpus, or parts thereof, are to be used for bona fide research of a non-profit nature. Holders of copies of the corpus may not reproduce any texts, or parts of texts, for any purpose other than scholarly research without obtaining the written permission of the individual copyright holders, as listed in the manual ccompanying the corpus.

3. Commercial publishers and other non-academic organizations wishing to make use of part or all of the corpus or a print-out thereof must obtain permission from all the individual copyright holders involved.

Persons or institutions ordering copies of the material will be required to subscribe to these restrictions by signing a written contract before a copy is issued.

The Composition of the Corpus

As already mentioned, the sampling principles were identical to those used in assembling the Brown Corpus. The Brown Corpus text categories and subcategories were analysed in detail and matched with corresponding British categories. Table 1 summarizes the basic composition of the British material, as compared with the American corpus. More detailed information for particular text categories or groups of categories is given in the section on Sources and Sampling Techniques.

Organization of the Material

General
The material within the main text categories has been arranged to match the Brown Corpus as closely as possible. Note, however:

1. The samples in the British corpus are in general more consistently grouped into subject categories than those in the Brown Corpus.

Table 1 The basic composition of the British and American corpora 1)

Text categories

Number of texts in each category

 

 

American corpus

British corpus

A

Press: reportage

44

44

B

Press: editorial

27

27

C

Press: reviews

17

17

D

Religion

17

17

E

Skills, trades, and hobbies

36

38

F

Popular lore

48

44

G

Belles lettres, biography, essays

75

77

H

Miscellaneous (government documents, foundation reports, industry reports, college catalogue, industry house organ)

30

30

J

Learned and scientific writings

80

80

K

General fiction

29

29

L

Mystery and detective fiction

24

24

M

Science fiction

6

6

N

Adventure and western fiction

29

29

P

Romance and love story

29

29

R

Humour

9

9

Total

500

500

l) See the detailed discussion of the comparability of the two corpora in the section on Sources and Sampling Techniques.

2. The matching between the two corpora is in terms of the general categories only. There is obviously no one-to-one correspondence between samples, although the general arrangement of subcategories has been followed wherever possible.

The organization of the material within each text category will be briefly outlined below. For information on the individual texts, see the List of Text Extracts.

Category A (Press: reportage)

A01-06

National

daily

Political

A07-08

"

"

Sports

A09-10

"

"

Society

All-14

"

"

Spot news

A15-16

"

"

Financial

A17-19

"

"

Cultural

A20-21

National

Sunday

Political

A22-23

"

"

Sports

A24

"

"

Spot news

A25

"

"

Financial

A26

"

"

Cultural

A27-31

Provincial

Daily

Political

A32-33

"

"

Sports

A34-37

"

"

Spot news

A38

"

"

Financial

A39-40

"

"

Cultural

A41

Provincial

Weekly

Sports

A42

"

"

Society

A43

"

"

Spot news

A44

"

"

Cultural

Category B (Press: editorial)

B01-04

National

daily

Institutional editorial

B05-08

"

"

Personal editorial

B09-11

"

"

Letters to the editor

B12-13

National

Sunday

Institutional editorial

B14-15

"

"

Personal editorial

B16

"

"

Letters to the editor

B17-19

Provincial

Daily

Institutional editorial

B20-22

"

"

Personal editorial

B23-24

"

"

Letters to the editor

B25

Provincial

Weekly

Institutional editorial

B26

"

"

Personal editorial

B27

"

"

Letters to the editor

Category C (Press: reviews)

C01-06

National

daily

 

C07-11

National

Sunday

 

C12-14

National

weekly

 

C15-16

Provincial

daily

 

C17

Provincial

weekly

 

Category D (Religion)

D01-09

Books

 

 

D10-17

Periodicals and tracts

 

Category E (Skills, trades and hobbies)

E01-05

Homecraft, handiman

 

E06-10

Hobbies

 

 

E11-13

Music, dance

 

 

E14

Pets

 

 

E15-18

Sport

 

 

El9-20

Food, wine

 

 

E21-22

Travel

 

 

E23-26

Miscellaneous

 

 

E27-35

Trade, professional journals

 

E36-38

Farming

 

 

Category F (Popular lore)

F01-22

Popular politics, psychology, sociology

F23-30

Popular history

 

F31-33

Popular health, medicine

 

F34-37

"Culture"

 

 

F38-44

Miscellaneous

 

 

Category G (Belles lettres, biography, essays)

G01-35

Biography, memoirs

 

G36-41

Literary essays and criticism

 

G42-50

Arts

 

 

G51-77

General essays

 

 

Category H (Miscellaneous)

H01-24

Government documents

 

a H01-12

Reports, department publications

 

b H13-14

Acts, treaties

 

c H15-19

Proceedings, debates

 

d H20-24

Other Government documents

H25-26

Foundation reports

H27-28

Industry reports

H29

University catalogue

H30

Industry house organ

Category J (Learned and scientific writings)

J01-12

Natural sciences

J13-17

Medicine

 

 

J18-21

Mathematics

 

 

J22-35

Social, behavioral sciences

 

a J22-25

Psychology

 

 

b J26-30

Sociology

 

 

c J31

Demography

 

 

d J32-35

Linguistics

 

J36-50

Political science, law, education

 

a J36-39

Education

 

b J40-47

Politics and economics

 

c J48-50

Law

J51-68

Humanities

 

 

 

a J51-54

Philosophy

 

 

b J55-59

History

 

 

c J60-63

Literary criticism

 

d J64-67

Art

 

 

e J68

Music

 

J69-80

Technology and engineering

Category K (General fiction)

K01-20

Novels

 

 

K21-29

Short stories

 

 

Category L (Mystery and detective fiction)

L01-21

Novels

 

 

L22-24

Short stories

 

 

Category M (Science fiction)

M01-03

Novels

 

 

M04-06

Short stories

 

 

Category N (Adventure and western fiction)

N01-15

Novels

N16-29

Short stories

 

 

Category P (Romance and love story)

P01-16

Novels

 

 

P17-29

Short stories

 

 

Category R (Humour)

R01-03

Novels

 

 

R04-06

Articles from periodicals

R07-09

Articles from humorous books other than novels

Main Coding Key 1)

To echo a previous statement, the coding system cannot be seen isolated from the uses of the corpus. The text must be coded in such a way that it can be used maximally efficiently in linguistic research. As the possible uses are many and difficult to foresee, the main guiding principle has been to produce a faithful representation of the text with as little loss of information as possible. Needless to say, some loss of information is inevitable (see p.22), dictated by clarity and/or practicability of representation. It is difficult to find the right balance between these principles and faithfulness to the original text. However, the user of this corpus will probably find that there is normally too much rather than too little information.

In order to facilitate the use of the corpus for linguistic research, the coding system includes some features which serve to interpret rather than represent the original text. These are, in particular, markers of abbreviations and non-English material, separate symbols for the beginning and the end of quotations, headline codes, and sentence-initial markers.

Detailed information on the coding system is given in a later section. For the convenience of the user of the corpus, the main coding key is repeated here in somewhat abbreviated form.

Alphanumeric characters

Alphanumeric characters represent themselves, e.g.

A = A

B = B

C = C

 

a = a

b = b

c = c

 

1 = 1

2 = 2

3 = 3

etc.

Other characters

* is reserved as a prefix for a compound coding symbol.

When not preceded by *, all other characters represent

themselves, except for ^ ~ _ | " ' \ { } (see below):

, = ,

. = .

? = ?

 

/ = /

@ = @

( = (

etc.

But:

^ = new sentence

~ = included sentence

_ = begin list

| = new paragraph or new line or blank line

" = umlaut or diaeresis on preceding letter

' = apostrophe (but not single quote mark)

\ = begin non-English word

{ = begin non-English phrase or passage

} = end non-English phrase or passage

l) The basic coding system was set up while the project was being conducted at Lancaster; see Geoffrey Leech and Rosemary Leonard, "A Computer Corpus of British English", Hamburger Phonetische Beiträge 13 (1974), pp. 41-57. A number of changes were introduced during the continuation of the project.

Compound coding symbols

These have * or ** as a prefix:

*0 = begin lower case (roman)
*1 = begin italic (or underlining in non-printed text)
*2 = begin capitalisation (roman)
*3 = begin italic capitalisation
*4 = begin bold face
*5 = begin italic bold face
*6 = begin bold face capitalisation
*7 = begin italic bold capitalisation
*8 = begin script
*9 = begin gothic
*@ = ° (degree symbol)
*= = begin upper case Roman numeral
**= = begin lower case Roman numeral
*+ =
*_ = dash
*/ = * (asterisk)
*# = end of corpus text
**# = end of corpus
*? = uncoded character
*" = begin double quotes **" = end double quotes
*' = begin single quotes **' = end single quotes
*< = begin heading * > = end heading
**[ = begin comment tag ** ] = end continent tag
*; = begin subscript **; = end subscript
*: = begin superscript **: = end superscript

Basic Technical Information

As pointed out on p. 1, the corpus is available at cost to bona fide researchers through the International Computer Archive of Modern English (ICAME), at the Norwegian Computing Centre for the Humanities, Bergen, Norway. The material is available on magnetic tape. Some technical data:

  1. The tape has no label.
  2. There are 15 files on the tape, one for each text category. There is one EOF mark after each file, and two EOF marks at the end of the tape.
  3. The text is divided into 80-character lines, as follows: 

a) Reference: 8 characters

 

3 characters sample code (cf. p. 42ff.)

 

l space

 

 

3 characters line number within the sample

 

1 space

 

b) Text: 72 characters

4. Each tape record consists of 100 lines, except that the last one in each file may contain fewer than 100 lines.

5. Character code: ASCII. Note especially the following codes where the character sets of different printers may vary:

133 [

173 {

134 \

174 |

135 ]

175 }

136 ^

176 ~

137 _

 

6. The corpus is available on the following types of tape:

a) 9-track, density 1600 fpi, 1200 ft. There is one character on each tape frame (one parity bit and eight data bits).

b) 7-track, density 556 or 800 fpi, 2400 ft. There are six data bits and one parity bit on each frame. Six tape frames contain four characters (each 9 bits, the two leftmost bits always zero).

For further information on the material see ICAME JOURNAL, published by the Norwegian Computing Centre for the Humanities, Bergen, Norway.

SOURCES AND SAMPLING TECHNIQUES 1)


Like its American counterpart, the British corpus is intended to be a representative sample of the texts printed in 1961. The texts were selected by stratified random sampling to represent the categories of Table 1. The sampling was based on the following bibliographical sources:

1. For books: 

The British National Bibliography Cumulated Subject Index, 1960-1964 (B.N.B.) was used to ensure that items published during 1961 but not catalogued until 1962 or 1963 were included in the sampling. This had the practical disadvantage of slowing down the sampling process, since about 70% of the items randomly sampled had to be disregarded as not being published during 1961.

In some cases there was doubt whether a book which was sampled should be included in the corpus. The following rules were followed:

A. To include entries with the publication date mark such as:

1961 [i.e. March 1962]

which means that the book bears the date 1961 on its imprint, but was actually published in March 1962.

B. To include items published after 1961, but originally appearing in some form during 1961 from the evidence of the B.N.B. entry:

    e.g. Crabtree, Arthur Bamford.

C. To include items appearing before 1961 (in the form of lectures, broadcasts, etc.) but not actually published until 1961.

The inclusion of samples of the type B and C was based on comparison with samples in the Brown Corpus (e.g. Brown D07, H23).

D. To include second impressions of 1961 first editions published after 1961 where the B.N.B. entry states that the first edition was published in 1961, and where the first edition was not catalogued separately in the B.N.B.:

    e.g. Walsh, Lilian Alberta.

E. To exclude entries with the publication date mark, of the form:

[d. March 1961]

which means that although the book was deposited in the British Museum during 1961, neither the month nor year of publication has been supplied by the publisher, and the year of the book does not appear in the imprint.

1) The description below is based mainly on a report on sampling by Helen Goodluck.

F. To exclude yearbooks, first issues of journals, etc., which have been entered in the B.N.B. These are treated as periodicals and are included only when sampled from Willings Press Guide (cf. below).

Many of the samples included as books would be better described as pamphlets, reports, etc. The Brown Corpus also included such material in its non-periodical sections.

The method of sampling was simply to match the subject divisions of the Dewey Decimal Classification of the B.N.B. to the subject subcategories of the corpus. This proved to be a satisfactory method in that all but a very few subdivisions of the Dewey Classification were sampled. Since the Dewey Classification is based strictly on subject, and the corpus categories are based on subject within broad stylistic divisions it was Necessary to be fairly flexible in the placing of items sampled initially for one category into the other categories; for example, although the 'History' divisions of the Dewey classification (900-910, 930-994) were sampled initially for Category J. these sections yielded samples of popular, non-learned history that were placed in the appropriate subcategory of Category F. In this way many subcategories were wholly or partially filled out before specific sampling was carried out for them.

2. For periodicals:

The sampling of periodicals as well as newspapers was made on the basis of Willing's Press Guide, 1961 (W.P.G.). There were considerable problems in the sampling of periodicals. Some of the chief difficulties were:

A. The inclusion in W.P.G. of many titles that had ceased publication by 1961, or were published under different titles by 1961.

B. The Class Index of W.P.G. (an index of periodicals under subject headings) did not provide an equivalent to the Dewey Classification of the B.N.B., and it was not possible to match subject subcategories of periodicals against the divisions of this index in any very satisfactory way. This was chiefly because a periodical may be listed up to four or five times under different headings of the Index, or it may be listed under only one heading. Thus some periodicals would have a greater chance of occurrence if a simple random sampling process was applied to the Index. In addition, the classification of certain periodicals was clearly anomalous. For example, linguistics journals are listed under both the Index headings 'Education' and 'Reviews, Literary Periodicals and Political Reviews'.

C. The multiplicity of minor magazines and news-sheets (many of which proved to be unobtainable from any library source) meant that the exclusion of many important and influential periodicals would have resulted from a purely random sampling of the Class Index.

The sampling for periodicals was made by matching the corpus categories with subject divisions of the W.P.G. Class Index, but with the following procedure to offset the difficulties outlined above:

A. Where two or more Index divisions were sampled for one subcategory, any more than one entry for an item was excluded in the numbering for sampling purposes.

B. In choosing the subject divisions of the Class Index to be matched against the subcategories, the more general headings were selected, to the exclusion of some of the more specific headings, at least for the first stage of sampling. For example, the selection of periodicals for Category D was based on the Index headings 'Religious Newspapers', 'Religious Periodicals and Readings' and 'Theology' and the several headings for individual denominations were not sampled from. The multiple classification of the Index means that all the major and many of the lesser known magazines were included under the headings on which the sampling was based. The intention was to carry out supplementary sampling from the relevant Index divisions that had not been sampled if the initial sampling did not provide sufficient samples, or if the samples obtained did not represent the range of material felt to be needed in the category or subcategory (for example, if the religious periodicals sampled had not represented the wide stylistic range judged necessary for Category D (see below)). In the case of Category D such supplementary sampling was not necessary for either reason, but it was carried out for some other subcategories. For example, the sampling for subcategories E06-E10 ('Hobbies'), was initially based on the single Index heading 'Hobbies' but this was supplemented both by inclusion of items sampled initially for other subcategories and by special sampling from Index headings ('Stamp Collecting', 'Photography.' etc.) relevant to the subcategory.

This selective use of the Index divisions as a basis for sampling did of course bias the sample, but it did so in favour of the major periodicals without excluding minor magazines.

C. Where no suitable Index heading was available from which to sample as in the case of subcategory J32-J35 ('Linguistics'), the whole class Index was inspected, all suitable periodicals numbered, and random sampling carried out on the basis of the numbering. Periodicals sampled in this way were then excluded from any subsequent sampling of the Index division under which they were listed.

There was a great del of overlap between the sampling for Categories E and F, and F and G. That is, although the actual sampling was based on a limited number of Index divisions for each category, the periodicals thus obtained were freely allocated to the category to which they were judged best to belong, after the specific articles in the periodicals had been selected by further random sampling in the issue selected.

3. For newspapers:

By contrast to the difficulties in sampling for periodicals, the indexing of W.P.G. made the sampling for the newspaper categories A, B, and C fairly easy.

The Index of Daily Newspapers (pp. 381-382, W.P.G.) was sampled for both provincial and national dailies. Since all the national dailies with the exception of The Guardian were published from London and the Index is subdivided by place of publication (with separate listing of London suburban dailies) separate sampling of the national dailies was a simple matter. (The Guardian, which began publication from London rather than Manchester in November 1961, was simply removed from the listing under Manchester and added to the list of national dailies under London.) As more than one sample from each of the national dailies was needed, these newspapers were numbered and then relisted using a random number table until the required number for all three newspaper categories had been obtained.

In sampling for provincial daily newspapers no distinction was made between evening and morning newspapers.

Weekly provincial newspapers were sampled from the Counties Index to newspapers (pp. 407-426, W.P.G.). Although this Index included also provincial daily papers it was a simple matter to exclude these when they were sampled.

The Index of Sunday Newspapers (p. 383) gave a complete listing of both national and provincial Sunday newspapers. The Sunday nationals were randomly relisted in the same way as the national dailies and the Sunday provincials were included in the sampling for weekly provincial newspapers.

4. For government documents:

The sampling was based on Catalogue of Government Publications, 1961 (London: H.M.S.O., 1962). See below under Category H.

The overall method in sampling has been to randomly select titles from the bibliographical sources (using a random-number table), and then to randomly sample the particular item for the page at which to start the 2,000 word extract. For each text extract selected a check was made whether the author was British, though this could not always be established. Texts published by non-British authors were excluded. There is, however, no absolute guarantee that all the remaining material in the corpus has been produced by native speakers of British English.

In selecting text extracts an attempt was made to limit the amount of dialogue to 50% or less, though this was not always possible. In some cases purely practical considerations have resulted in modification of the random sampling. This applies especially to extracts from periodicals, where library regulations sometimes limited the material that could be used.

In cases where an article sampled was less than 2,000 words long, it has frequently been supplemented by the next comparable article (in style, subject-matter) in the 1961 issues of that periodical, rather than simply from the following article of the same issue, which was often of a completely different genre. Each 2,000 word sample thus has some internal consistency even when composed of up to four or five different articles. This modification of purely random sampling was used extensively in compiling the categories of newspaper prose.

There have been other, and more important, departures from purely random sampling. On the grounds of circulation and influence, the selection of newspapers has been weighted in favour of the national press. Similarly, major periodicals have been favoured at the expense of less important ones. In one or two cases influential periodicals have even been included by deliberate choice rather than by random sampling.

It follows then that the present corpus is not representative in a strict statistical sense. It is, however, an illusion to think that a million-word corpus of English texts selected randomly from the texts printed during a certain year can be an ideal corpus. What is relevant is not only what texts are printed but how they are circulated, by whom they are read, etc. Such factors were no doubt behind the original selection and weighting of text categories in the Brown Corpus. The true representativeness" of the present corpus arises from the deliberate attempt to include relevant categories and subcategories of texts rather than from blind statistical choice. Random sampling simply ensures that, within the stated guidelines, the selection of individual texts is free of the conscious or unconscious influence of personal taste or preference.

Categories A (Press: reportage), B (Press: editorial), C (Press: reviews)

In determining the internal structure of the categories of newspaper prose the main problem was to combine maximum comparability with the Brown categories with a reasonable representation of the British Press. The basic division in Britain between national and regional press that does not exist in America required that an additional criterion be added to those of 1) daily as against weekly publication and 2) type and subject of prose, used to subdivide the Brown Press Categories.

The solution adopted for Categories A and B was to follow the Brown Corpus subdivisions of newspaper copy, and the numerical division for daily as against weekly papers in the two categories as a whole, although not in all cases within the subject subcategories. Within this framework the weighting was 60:40 in favour of national against regional papers. The decision to weight the categories in this way was based on:

A. The greater total circulation figures of national papers.

B. The greater influence and importance of the national press in the life of the nation. 

A further subdivision was made within the national press between daily and Sunday papers. Here the proportion of 3:1 in favour of national dailies in Categories A and B was again based on circulation figures. Although the total weekly circulation figures of about 97 million (dailies) against 26 million (Sundays) might suggest that this over-represents the Sunday nationals, the proportion was felt to be justified in terms of the single day figures of 16 million (dailies) against 26 million (Sundays).

In breaking down the provincial press into daily and weekly subcategories, 'weekly' was taken to include provincial Sunday newspapers as well as weekly papers issued on a weekday, since the provincial Sundays numbered only 5, with fairly small circulation figures. In adopting the proportions of 3:1 in favour of provincial dailies as against provincial weeklies the Brown Corpus daily/weekly proportions were observed, but at the expense of giving a higher representation of provincial weeklies than the British total weekly circulation figures of 76 million (provincial dailies) and 20 million (provincial weeklies) would perhaps suggest.

Category C does not follow the Brown daily/weekly proportions; it has been structured in favour of the national press, with a high representation of 'quality' Sundays and the deliberate inclusion of the Times Literary Supplement and the Times Educational Supplement on the basis of the importance of these in review writing.

The detailed structure of Categories A, B, and C is shown in Table 2.

Table 2 Categories A- C: The British and American corpora compared

AMERICAN CORPUS

BRITISH CORPUS

A. PRESS: REPORTAGE

 

NATIONAL

PROVINCIAL 1)

 

DAILY

WEEKLY

 

DAILY

SUNDAY

DAILY

WEEKLY

 

POLITICAL

10

4

14

6

2

5

-

13

SPORTS

5

2

7

2

2

2

1

7

SOCIETY

3

-

3

2

-

-

1

3

SPOT NEWS

7

2

9

4

1

4

1

10

FINANCIAL

3

1

4

2

1

1

-

4

CULTURAL

5

2

7

3

1

2

1

7

 

TOTAL

44

 

 

TOTAL

44

 

B. PRESS: EDITORIAL

 

NATIONAL

PROVINCIAL 1)

 

DAILY

WEEKLY

 

DAILY

SUNDAY

DAILY

WEEKLY

 

INSTITUTIONAL

7

3

10

4

2

3

1

10

PERSONAL

7

3

10

4

2

3

1

10

LETTERS TO THE EDITOR

5

2

7

3

1

2

1

7

 

 

TOTAL

27

 

 

TOTAL

27

 

C. PRESS: REVIEWS

 

 

 

 

NATIONAL 2)

PROVINCIAL

 

 

DAILY

WEEKLY

 

DAILY

WEEKLY

SUNDAY

DAILY

WEEKLY

 

 

14

3

17

6

3

5

2

1

 

 

TOTAL

17

 

 

TOTAL

17

 

1) Including provincial Sunday

2) The Times Literary Supplement and The Times Educational Supplement.

Category D (Religion)

While this category did not present the same problems of subcategorisation of samples by subject as the other categories of informative prose, this in itself pointed to Category D as an exception to the Brown Category division by stylistic criteria, as well as by subject matter. Since 'religious' prose could embrace any of the stylistic characteristics of categories F, G, and J and indeed category F of the Brown Corpus included samples (F37, F44, F48) that could have been classed under a general heading of 'religion' - the solution in compiling Category D has been to make the samples stylistically as heterogeneous as possible, ranging from learned to popular writing, and where possible to select 'committed' religious writing. More general discussions of, for example, religion and society, were placed, as in the Brown Corpus, in the appropriate category, F or G.

Categories E (Skills, trades, and hobbies), F (Popular lore), G (Belles lettres, biography, essays)

These three categories were the most difficult to structure. In many cases the subject subcategories overlapped, and particularly in the case of categories F and G it was often difficult to decide, either on grounds of subject matter or style, to which category a sample should belong. In general, where the subject matter of a sample did not place it unquestionably in one category, samples were placed in Categories E, F, or G on the stylistic grounds of being 'instructional', 'informative', or 'discursive', respectively. Inevitably, however, there remain one or two samples in each of these categories that could well have been placed in another.

The number of samples within E, F and G differs slightly from the Brown Corpus. Four samples corresponding to Brown Category F have been redistributed, two into Category E and two into Category G. Our Category E thus contains subcategory E 36-38, 'Farming', of which E37 and E38 correspond to Brown Corpus samples F13 and F34. The subcategories for G 'Biography' and 'Memoirs' have each been made larger by one sample than strict correspondence with Brown Category G would demand to match F31 and F42 in the Brown Corpus, which were in fact biography, although included in 'Popular lore'. There is also a slight variation with respect to the Brown Corpus in the proportions of books and periodicals in Categories E and F.

The four samples corresponding to Brown Category F that were redistributed were all books, reducing the number of books in our Category F to 19. This has been further reduced to final proportions of 16 books and 28 periodicals by the fact that the 'miscellaneous' subcategory contains three more periodicals than the corresponding items in Brown Category F. Category E has the opposite imbalance of slightly more books than the Brown Category E (5 books and 33 periodicals). Our Category G contains 41 books and 36 periodicals.

Table 3. Categories D - J: The British and American corpora compared

 

AMERICAN CORPUS

BRITISH CORPUS

D. RELIGION

 

BOOKS

7

9

 

PERIODICALS

6

7

 

TRACTS

4

1

E. SKILLS, TRADES AND HOBBIES

 

BOOKS

2

5

 

PERIODICALS

34

33

F. POPULAR LORE

 

BOOKS

23

16

 

PERIODICALS

25

28

G. BELLES LETTRES ETC.

 

BOOKS

38

41

 

PERIODICALS

37

36

H. MISCELLANEOUS

 

GOV. DOCUMENTS

24

24

 

FOUNDATION REPORTS

2

2

 

INDUSTRY REPORTS

2

2

 

UNIV. CATALOGUE

1

1

 

IND. HOUSE ORGAN

1

1

J. LEARNED

 

NATURAL SCIENCES

12

12

 

MEDICINE

5

5

 

MATHEMATICS

4

4

 

SOC. SCIENCES

14

14

 

POL. SCIENCE, LAW, EDUCATION

15

15

 

HUMANITIES

18

18

 

TECHNOLOGY AND ENGINEERING

12

12

Both our Categories E and F include a 'miscellaneous' subcategory (E23-26, F38-44) which corresponds to the number of items in the Brown categories which could not easily be fitted into a subject subcategory. These 'miscellaneous' sections have been filled out not by an attempt to match exactly the Brown items to which they correspond, but by including in them items, initially sampled for any of the categories, which stylistically were felt to belong to Categories E and F, but which did not fit any of the other subcategories devised for E and F.

Category E has been called 'Skills, trades and hobbies' rather than simply 'Skills and hobbies', as in the Brown Corpus, because the Brown Category E when analysed into subcategories was found to contain a number of trade and professional magazines which the title of the category did not suggest. These form a subcategory in (E27-35) and are grouped together with the subcategory 'Farming' (E36-38) that was formed by adding the two items corresponding to Brown F13, F34 to Category E.

Since Categories E, F and G are neither clear-cut nor directly comparable between the corpora, it is recommended that, for comparative purposes, they should be treated as a single composite category, which we may perhaps label 'General expository writing'.

Category H (Government documents, etc.)

This category presented no problem in subcategorisation as the subcategories that were provided in the Brown manual were easily matched by material published in Britain. The subcategory 'Government documents' was broken down into further subdivisions to ensure as close a correspondence with the Brown Corpus as possible. As with the Brown Corpus this subcategory includes verbatim reports of speeches and debates.

The complete listing of the Catalogue of Government Publications, 1961 (London: H.M.S.O., 1962) was sampled until sufficient samples of each of the subcategories H01-24 had been obtained. Samples His, H16 and H17 ('Parliamentary debates'), subcategory H15-19 ('Proceedings, debates'), were randomly selected from Hansard, even though the entries for Hansard were not amongst those sampled from the Catalogue of Government Publications, in order to match the samples from proceedings in Congress included in the Brown Category H.

The remaining samples in category H (H25-30) were obtained incidentally in the sampling for other categories, or, in the case of H27, 28 ('Industry reports') by a subjective selection of suitable items (as no bibliography of this type of publication was available).

Category J (Learned and scientific writings)

The subject subdivision for Category J provided in the Brown manual was adopted and further broken down by examining the Brown samples to ensure a close correspondence. For example, the Brown subcategory 'Social and Behavioural Sciences' was analysed into four subsections, 'Psychology', 'Sociology', 'Demography' and 'Linguistics'. The exact Brown proportions of books and periodicals have been followed, except in one case (J31, where a periodical was substituted for a book, since there was only one item in the subsection 'Demography', and no suitable books were sampled).

There was a problem of overlap between the subject subcategories of both G and J 'Literary Criticism and Arts'. The solution has been to attempt to place the more discursive criticism in Category G. and the more closely text-based in Category J.

Table 4. Categories K- R: The British and American corpora compared

 

 

AMERICAN CORPUS

BRITISH CORPUS

K. GENERAL FICTION

 

NOVELS

20

20

 

SHORT STORIES

9

9

L. MYSTERY AND DETECTIVE FICTION

 

NOVELS

20

21

 

SHORT STORIES

4

3

H. SCIENCE FICTION

 

NOVELS

3

3

 

SHORT STORIES

3

3

N. ADVENTURES AND WESTERN

 

NOVELS

15

15

 

SHORT STORIES

14

14

P. ROMANCE AND LOVE STORY

 

NOVELS

14

16

 

SHORT STORIES

15

13

R. HUMOUR

 

NOVELS

3

3

 

ESSAYS, ETC.

6

6

Categories D-J: The British and American corpora compared

Using the subdivisions of the Brown Corpus manual, Table 3 summarises the internal composition of Categories D-J in the two corpora. The agreement is complete for Categories H and J. There are some differences in D, though they should not be exaggerated; it is sometimes difficult to draw the line between books/periodicals and tracts. The discrepancy is greatest with Category F. Here, as with E and G, there is, however, close matching by subject.

Categories K-R (Imaginative prose)

The fiction categories were sampled on the basis of the B.N.B. and W.P.G. As these bibliographies do not contain subdivisions which match our text categories, sampling was carried out simultaneously for them all. Texts selected were inspected and placed in the appropriate category, and sampling continued until all the categories were filled.

The problems of comparability with the Brown Corpus were smaller than for the informative-prose texts, though Category N was an exception. For obvious reasons, our Category N ('Adventure and western fiction') contains fewer westerns and a larger number of general adventure stories than does Category N in the Brown Corpus. The main subdivision recognised in the fiction categories is that between books and short stories. As shown in Table 4, the agreement between the two corpora is very close.

FURTHER DETAILS OF CODING


The following pages reproduce, with some additions and other alterations, the coding manual used during the later stages of the editing and proofreading of the texts. For reasons of space, no attempt has been made to give justifications for particular coding decisions, apart from some comments on "interpretive" codes in Sections 8-13. For the convenience of the user, the section is reproduced on coloured pages and includes a table of contents and two indices.

Coding manual-contents

1 Organisation of the material
2 Textual material included/excluded
3 Main coding key
4 Typographical shifts
5 Capitalisation
6 Spacing and ordering
7 Paragraph/line division
8 Sentence-initial marking
9 Headings
10 Quotations
11 Foreign-language material
12 Other "non-English" material
13 Abbreviations
14 Hyphen and dash
15 Mathematical expressions
16 Typographical errors
17 Comment tags
Appendix I: Uncoded character index
Appendix II: Non-English orthography
Appendix III: "Non-English" codes
Appendix IV: Comment tags
Index

1 Organisation of the material

    1.1 The corpus starts with the tag for the first text (cf. 1.2) and ends with end-of-corpus symbol (see 3.4).

    1.2 The text categories are included in the order listed in the table on p. 3 . Each corpus text starts with a tag giving its absolute number and category number (see 17.1) and ends with the end-of-text symbol (see 3.4) and a figure giving the number of words in the text.

    1.3 The texts are divided into 80-character lines, each beginning with the category number of the text and the sequence number of the line.

2 Textual material included/excluded

    2.1 The text of a sample starts with the first sentence beginning on the first page sampled and ends with the sentence containing the 2,000th word. A word is defined orthographically as a character or sequence of characters surrounded by blank spaces (and including no blank spaces).

    2.2 Headings are coded and included in the text (see 3.4 and 9).

    2.3 Sentences used as "tantalizers" at the beginning of imaginative prose periodical texts are included. They are preceded by a paragraph marker (see 7). The same applies to "story so far" summaries.

    2.4 Editorial material extraneous to the source, e.g. summaries of contents, biographical notes, is omitted and represented by a comment tag (see 17.1).

    2.5 Extra-textual material in the source, such as diagrams, maps, lists, bibliographies, is excluded and may be represented by tags (see 17.1).

    2.6 Footnotes and references to footnotes are excluded without comment.

    2.7 Long foreign quotations are excluded and represented by tags (see App. IV). 1)

    2.8 Long poetry quotations are excluded and represented by tags (see App. IV).

    2.9 As regards mathematical expressions, see 15.

3 Main coding key

      3.1 The coding system uses a 128-character set (ASCII code, 96 printable characters). In the lists below, the coding symbol is on the left, and the coded symbol is on the right. = means "represents".

      Coding symbols may be simple (i.e. consisting of a single character) or compound.

      3.2 Alphanumeric characters
      Alphanumeric characters represent themselves, e.g.

      A = A

      B = B

      C = C

       

      a = a

      b = b

      c = c

       

      1 = 1

      2 = 2

      3 = 3

      Etc.

      1)Foreign quotations consisting of 30 words or less have been retained. Slightly longer quotations have been retained if their removal would seriously impair the text.

      3.3 Other characters
      * is reserved as a prefix for a compound coding symbol. When not preceded by *, all other characters represent themselves, except for ^ ~ _ | " ' \ { } (see below):

      , = ,

      . = .

      ? = ?

       

      / = /

      @ = @

      ( = (

      etc.

      But:

      ^ - new sentence

      ~ = included sentence

      _ = begin list

      | = new paragraph or new line or blank line

      " = umlaut or diaeresis on preceding letter

      ' = apostrophe (but not single quote mark)

      \ = begin non-English word

      { a begin non-English phrase or passage

      } = end non-English phrase or passage

      The following should also be noted:

      - = hyphen OR minus (but not dash)

      . = full stop OR abbreviation point OR decimal point OR multiplication sign (see also 15.2)

      ...= ellipsis

      3.4 Compound coding symbols
      These have * or ** as a prefix:

      *0 = begin lower case (roman) 1)

      *1 = begin italic (or underlining in non-printed text)

      *2 = begin capitalisation (roman)

      *3 = begin italic capitalisation

      *4 = begin bold face

      *5 = begin italic bold face

      *6 = begin bold face capitalisation

      *7 = begin italic bold capitalisation

      *8 = begin script

      *9 = begin gothic

      *@ = ° (degree symbol)

      *=   = begin upper case Roman numeral

      **=   = begin lower case Roman numeral

      *+   =

      *-   = dash

      */   = * (asterisk)

      *#   = end of corpus text

      **#   = end of corpus

      *?   = uncoded character (see Appendix I)

      1) The special coding of lower and upper case is a remnant from a stage of the project when a smaller character set was used.

      *" = begin double quotes **" = end double quotes

      *' = begin single quotes **' = end single quotes

      *< = begin heading *> = end heading

      **[ = begin comment tag **] = end comment tag

      *; = begin subscript **; = end subscript

      *: = begin superscript **: = end superscript

      3.5 Designators and Markers
      Coding symbols are called DESIGNATORS if they actually refer to a symbol in the source text. (E.g. A, 1, *-)

      They are called MARKERS if they do not refer to a symbol, but instead indicate some aspect of the form, arrangement, or interpretation of other symbols, which precede or follow them. (E.g. *<, *0, \)

    4 Typographical shifts

      4.1 * followed by a digit indicates a TYPOGRAPHICAL SHIFT (see 3.4), i.e. the beginning of a section of text in a given type.

      In printouts for proofreading roman appears as ordinary print (with upper or lower case letters), italics as characters underlined, and bold face or bold face italics as bold face print.1)

      4.2 The shift symbol (*0, *1, etc.) occurs directly before the first character to which it applies, except that \ or *{ may follow it, e.g.

        ^*0The word *l\enfant *0is French. OR

        ^*0The word \*lenfant *0is French.

      4.3 The symbols . , - and other punctuation marks are regarded as neutral between roman, italics, and bold face.

      4.4 An introductory shift symbol always occurs before the first word of a corpus text.

      4.5 Note that typographical shift symbols may occur within a word, e.g.

        ^*6T*2HE *0crisis in Spain (as in a newspaper article)

      4.6 See further Appendix II, the end.

    5 Capitalisation

      5.1 The character set includes both upper and lower case letters. Continuing capitalisation is indicated redundantly by a typographical shift symbol (see 3.4), e.g.

        ^*2THE USA HAS ...

      5.2 Before short sequences of capitals the shift symbol may be omitted, e.g.

        ^*0The USA has ...

      5.3 Proper names are broadly identifiable as character sequences introduced by a capital letter which is not preceded by a sentence-initial marker (see 8).

      1) To prevent misunderstanding, it should be pointed out that this paragraph is probably irrelevant for the present user. However, the representation described was very efficient during the proofreading stage and can be recommended to other users.

    6 Spacing and ordering

      6.1 A single space indicates a typographic word-boundary in the source text. A space may not occur within a word.

      6.2 A space follows the punctuation marks . , ; : ? **" **' as in printed texts. In addition, contrary to printing practice, a space follows a dash (*-).

      6.3 Contrary to the practice of a few printers, no space is inserted preceding end-quote symbols (**", **') or following begin-quote symbols (*", *').

      6.4 / occurring between words is followed by a space, except in the case of and/or, which is coded as a single word. No space is inserted after / in numerical expressions (see 15.2).

      6.5 The ordering of punctuation symbol and marker, or of marker and marker, is immaterial if they apply at the same point in the text, except that:

        (a) The new-paragraph marker | precedes other markers at the same point in the text, e.g.

          |^*"*0The main thing is, **" she said

        (b) The sentence-initial marker ^ precedes other markers at the same point in the text except the new-paragraph symbol (see the example just given)

        (c) The beginning-of-headline marker *< precedes all other symbols.

        (d) See also 4.2.

      6.6 Regardless of the placement in the source text, end-quote markers (**", **') follow other punctuation marks. l)

    7 Paragraph/line division

      7.1 The beginning of a new paragraph is indicated by | and, redundantly, by indentation in the printout.

      7.2 The paragraph marker is used:

        (a) at the beginning of each paragraph, irrespective of whether the first line is indented;

        (b) to mark significant new line distinctions, e.g. in lists, the line distinctions in poetry.

      7.3 Breaks in the text (as indicated by a two-line space, asterisks, etc.) are represented by a paragraph marker.

      7.4 The paragraph marker introduces a new line.

      7.5 Headings (see 9) are placed on a separate line. The same applies to comment tags (see 17), except **[SIC**], and to the end-of-text and end-of-corpus markers.

    l) There is, however, some inconsistency in the placement of end-quote markers in the corpus.

    8 Sentence-initial marking

      8.1 The object of sentence-initial (SI) marking is (1) to define suitable contexts for the researcher who wishes to use the material, (2) to define basic units to be used in the grammatical tagging of the material, and (3) to simplify studies of sentence length. 1) As (3) has been considered of limited importance and (2) is only a possible future extension of the project, our main object in SI marking has been (1), to provide contexts which are relevant in linguistic research.

      8.2 The sentence-initial marker (SI marker) ^ normally appears where a terminal punctuation mark is followed by a capital letter, e.g.

        ^Where is he? ^I don't know, Don't you? ^No.

      8.3 SI markers are not used at the beginning of headlines. If a headline contains a sentence division, the second sentence is preceded by an SI marker.

      8.4 "Quasi-headlines" (see 9.4) are preceded by an SI marker.

      8.5 SI marking is a problem in connection with quotations. When a quotation is preceded by a reporting clause, an SI marker is used, e.g.

        ^She said, ^*"Let's go.**"

      When the reporting clause follows, no sentence division is recognised, e.g.

        ^*"Let's go,**" she said.

      Included quotations are not preceded by the SI marker. See further 8.10.

      8.6 A semi-colon is never treated as a mark separating sentences.

      8.7 A particular problem in connection with SI marking is where a colon occurs followed by capitalisation. An ^ is inserted if one or more of the elements following the colon has the character of, or includes, a complete and separate sentence (subject + verb). The marker is omitted where the stretch following the colon has the character of an enumeratic and the single elements enumerated do not form, or include, complete sentences.

      8.8 In cases of doubt, an SI marker is omitted rather than included.

      8.9 Many problems, too numerous to be dealt with here, turn up in SI marking. Separate markers were introduced to solve two particular problems (see 8.10 and 8.11). These were used on a trial basis at a late stage of the project and can be ignored by the researcher if he so wishes. Adequate marking would require further distinctions.

      8.10 The marker ~ was introduced in included quotations (cf. 8.5), e.g.

        ^She said, ~*"Let's go,**" and left immediately.

      The same marker is used with sentence-final quotations which are an integral part of the structure of the matrix sentence, e.g.

        ^The Apostle Paul said concerning some that ~*"By good words and fair speeches they deceived the heart of the simple.**"

      8.11 A "begin-list" marker _ was introduced to indicate the beginning of word sequences without syntactic structure (e.g. in recipes). Note, however, that lists were normally excluded (see 2.5).

      8.12 In order not to create unnecessary sentence divisions, "paragraph indicators" in the text such as 1. a) B. etc. are included in the following sentence.

    1) SI marking is also important in distinguishing the use of capitalisation to introduce sentences from other uses (see e.g. 5.3).

    9 Headings

      9.1 Headings are characterised by special typographical and linguistic features and should therefore be marked. They are enclosed within brackets *< *> and are placed on a separate line.

      9.2 In the source text, headings are sometimes not placed at the head of the portion of text to which they apply. (They may, for example, occur in the middle of a newspaper or magazine article, interrupting a sentence in the body of the text). In such cases, the headings are appropriately repositioned.

      9.3 ^ is not used at the beginning of a heading, even though the heading has the grammatical structure of a complete sentence (but cf. 8.3).

      9.4 The brackets *< *> are only used for a heading separated from the body of the text by being on a separate line. In other cases an SI marker is used, e.g.

        Summary. The recent approval of ...

      is coded:

        |^*lSummary. ^*0The recent approval of ...

      Such "quasi-headlines" are therefore treated as sentences. The same applies to the occurrence of "Editor" or the name of the author at the end of an article.

      9.5 Running heads at the top of the page, and other editorial headings (e.g. "continued on p. ..."; "Next week: ...") are ignored.

    10 Quotations

      10.1 There are at least two reasons why the marking of quotations is essential:

        (a) The texts of the corpus should represent British English as printed in 1961. However, there frequently occur quotations from much earlier sources, especially in text categories D, G, and J.

        (b) The fiction categories (especially) contain large sections of dialogue, which exhibit particular linguistic features.

      10.2 To make it possible to identify unambiguously sections within quotations marks, separate markers are used for begin-quote and end-quote (see 3.4).

      10.3 The beginning and end of a quoted passage not enclosed in quotation marks are tagged **[BEGIN QUOTE**] and **[END QUOTE**]. Where a text sample begins or ends in the middle of a quotation this is tagged **[MIDDLE OF QUOTE**]. In one or two fiction texts where quotation marks are consistently omitted in the dialogue there is no tagging. There are instead comments about this in the list of references.

      10.4 In many original texts where quotations extend over two or more paragraphs there are opening quotation marks at the beginning of each paragraph but closing quotation marks only at the very end of the quoted section. No special tagging has been used in these cases, as the quoted passages are unambiguously identifiable without tags.

      10.5 Quotations from earlier sources containing markedly deviant forms or structures may contain a non-English marker (see 12.2 - 3). The same applies to dialogue including non-standard features (see 12.5 - 10).

    11 Foreign-language material

      11.1 For the researcher it is important to be able to identify foreign-language material, mainly in order to be able to exclude it from the analysis.

      11.2 Foreign-language words are marked by the prefix \, e.g.

        \*lhomme = homme

      11.3 The brackets { } are used instead of \ for non-English expressions consisting of more than one word, e.g.

        {lit de parade} = lit de parade

      Note, however, that long foreign quotations are excluded (see 2.7).

      11.4 Originally, the non-English marker was followed by one or two digits indicating the language to which the word or expression belongs. These digits have been eliminated except in codings of the Greek and the Cyrillic alphabets; see Appendix II.

      11.5 Foreign words form a cline from almost complete integration to nonceuses or quotations of foreign-language material, which makes it difficult to decide whether to code or not. Besides, it is completely natural in certain types of texts to use foreign-language material (e.g. Latin in medical texts, Italian in musical contexts). The foreign-marking of individual words has normally only been used where there is an indication in the text that a word is foreign by (1) the otherwise unmotivated use of italics, boldface, or capitalisation, or (2) an explicit statement in the text that a foreign word is being used. To make the contrast less abrupt between coded and uncoded cases, the marker \6 has been introduced for an intermediate category, "foreign expression widely used", e.g.

        \6*landante = andante

      To qualify as a "foreign expression widely used", a word must, on the on hand, be singled out in the text but, on the other hand, it must be listed in non-specialised dictionaries of the English language.

      11.6 The criteria outlined in the preceding paragraph mean (1) that the same word may be treated differently depending upon the context, and (2) that even normally very foreign-sounding words may be uncoded provided that the context is so appropriate that the foreignness is not indicated or commented upon in any way in the text.

      11.7 Foreignisms consisting of more than a single word are easier to deal with. All expressions exhibiting foreign-language syntax (names are a special case; see below) have been marked. The marking {6 }, "foreign expression widely used", is employed with foreign-language expressions that appear in non-specialist dictionaries of the English language, e.g.

        {6a priori}

      11.8 Foreign titles of books, operas, etc. are marked, unless they consist simply of a name:

        {Der Rosenkavalier} BUT: Fidelio

      11.9 Foreign names, e.g. Voltaire, are not marked as foreign. However, when a lower case foreign word, e.g. de or van, occurs as part of a name, it is marked:

        President \de Gaulle

        the Lys \d'Or

      11.10 mere is no complete consistency in the treatment of foreign names. Descriptive foreign names have sometimes been marked, e.g.

        {Rue de la Paix}

        the {Theatre Francais}

      11.11 As regards foreign abbreviations, see 13.14.

    12 Other "non-English" material

      12.1 The researcher may wish to be able to identify other types of "deviance" in the text than foreign-language material. Special codes were introduced to deal with non-current English, non-standard English and other types of "deviance".

      12.2 Non-current English (in quotations from pre-20th-century sources) is tagged \1, e.g.

        \1abhominable \1publick

      12.3 A whole passage of non-current English may be marked {1 } if it pervasively contains features (whether grammatical, lexical, or orthographic) uncharacteristic of present-day English. The marking has, however, been sparingly used. 1)

      12.4 Old English and early Middle English are coded \ and { } rather than \1 and {1 }.

      12.5 Non-standard English is tagged \2. This applies mainly to the coding of impressions of dialect, represented by deviant spelling, e.g.

        \2yer ("your") \2gonna ("going to")

      12.6 No tagging is used where an apostrophe occurs in shortening a word, e.g. y'know, d'you, 'em, 'bout, talkin'.

      12.7 A whole passage of non-standard English may be enclosed within {2 } if it pervasively contains features (whether grammatical, lexical, or orthographic) uncharacteristic of standard English.

    1) It was felt to be too cumbersome to use the {1 } marking with all quotations predating the main text.

      12.8 Dialogue passages containing occasional substandard grammatical features such as double negation or lack of concord have not been marked.

      12.9 Foreigner English, i.e. non-standard English spoken by foreigners, is tagged \3, e.g.

        \3ze ("the") \3t'ing ("thing")

      12.10 A whole passage of foreigner English may be enclosed within {3 } if it pervasively contains features (whether grammatical, lexical, or orthographic) uncharacteristic of standard English.

      12.11 A small number of coinages in science-fiction texts are tagged \4, e.g.

        \4wiltmilt \4pluggyrugg

      12.12 A miscellaneous category, \5, is used for nonce-forms such as vicilisation ("civilisation"), bunkrapt ("bankrupt"). Other examples of this coding are:

        \5d-damp \5ye-es

      12.13 The non-English marking is only intended as a rough guide for the researcher. Information is also given in the list of Text Extracts for texts with notable "non-English" features.

    13 Abbreviations

      13.1 Abbreviations are coded (a) so that they can be distinguished from full vocabulary items, and (b) so that the abbreviation point can be automatically recognised as distinct from the full stop marking the end of a sentence.

      13.2 The code for abbreviations is the non-English marker (see 11) followed by zero: \0 {0 }.

      13.3 Abbreviations are coded \0 whether or not they end in an abbreviation point, e.g.

        \0Mr. = Mr.

        \0Mr = Mr

      But see 13.10 and 13.11.

      13.4 A sequence of initials or abbreviations is marked {0 } rather than \0, whether or not the sequence contains spaces or abbreviation points, e.g.

        \0Mr. {0F. A.} Parker =Mr. F. A. Parker

        {0U.S.A.} = U.S.A.

        {0U S A } = U S A

        {0USA} = USA

      13.5 Typical abbreviations are initials, such as in G. B. Shaw or E. Pound, and acronyms, such as N.A.S.A. or N.A.T.O.

      13.6 Clipped words and short forms are not marked as abbreviations, e.g. Chev (=Chevrolet), didn't. However, clippings which are not pronounced as such in speech and/or regularly appear with an abbreviation point, are marked, e.g.

        \0para ("paragraph")

        \0Capt. ("Captain")

        \0ed. ("editor")

      13.7 12th, 21st, etc. are not marked as abbreviations.

      13.8 Chemical formulae are marked \0.

      13.9 Herts, Bucks and other abbreviations of English counties are marked.

      13.10 In some cases the same word may be treated differently. This applies mainly to:

        N.A.T.O., NATO (marked) O.K., OK (marked)

          Nato (unmarked) okay (unmarked)

      The unmarked cases have here lost the features of abbreviations and seem to be treated as ordinary vocabulary items.

      13.11 There is also a difference of treatment in cases like:

        per cent. (marked) ad. (marked)

        per cent (unmarked) ad (unmarked)

      Here the use of an abbreviation point is exceptional, and the marking is only used where there is typographical justification.

      13.12 The abbreviation point is placed within the "non-English" bracket, e.g. {0U.S.A.}.

      13.13 If an abbreviation occurs at the end of a sentence, it is not clear (unless the same abbreviation occurs elsewhere in the text) whether . is to be treated as an abbreviation point (as well as a full stop) or merely as a full stop. In cases of doubt, the . is included in the non-English bracket.

      13.14 Foreign abbreviations are marked in the same way as English abbreviations (and only as abbreviations), e.g.

        \0Mme. = Mme.

        {0i.e.} = i.e.

      13.15 An abbreviation marker can occur in the middle of a word, e.g.

        8\0s. = 8s. ("8 shillings")

      13.16 Abbreviations ending in the middle of a word are marked as follows:

        10-{0yr}-old = 10-year-old

        {0MP}s = MPs

    14 Hyphen and dash

      14.1 The hyphen (-) is used within a word, and is not preceded or followed by a space.

      14.2 The dash (*-) is followed by a space.

      14.3 A line-end hyphen in the source text is not coded, except where the hyphen is part of the normal spelling of the word.

      14.4 Where spelling practice varies with regard to hyphenation, a coding decision has to be made as to whether the line-end hyphen is preserved in the coded text or not. The hyphen is preserved:

        (a) if dictionaries show that hyphenation is normal.

        (b) if the word in question is hyphenated elsewhere in the same text.

      14.5 In other cases, where doubt still remains, the line-end hyphen is included or excluded according to the judgment of the coder.

      14.6 :- as a punctuation symbol is coded :*- (colon followed by dash), and is followed by a space.

      14.7 - meaning "to" (as in 1573-1640) is coded as a hyphen.

    15 Mathematical expressions

      15.1 Mathematical characters are where possible coded as themselves, e.g. + = @ %.

      15.2 - between numerical expressions represents "minus" (but see 14).

      . between digits represents "decimal point" OR "multiplication sign".

      x in numerical expressions represents "multiplication sign".

      / in numerical expressions represents a divisor in fractions, e.g.

        1/2 represents 1 over 2

        61/2 represents 61 over 2

      Note, however, that 6 1/2 represents 6 and 1 over 2.

      15.3 Other mathematical characters are represented where necessary by entries in the "Uncoded Character Index" (see Appendix I), e.g.

        *?7 codes "single prime"

        *?22 codes "square root"

      15.4 More complex mathematical expressions and equations are represented by **[FORMULA**1. The decision of whether to use the blanket coding **[FORMULA**] is a practical one, depending on whether a fully coded representation of the expression would require additional coding apparatus not allowed for above. Examples of expressions coded **[FORMULA**] are:

    16 Typographical errors

      16.1 Obvious typographical errors are corrected, e.g.

      Form in the source:

      Corrected to:

      thetre

      theatre

      ofr

      for

      on eof

      one of

      for for

      for

      All corrections are listed under the text in question in the List of Text Extracts.

      16.2 The following types of errors are normally corrected:

        (a) Errors in spelling

        (b) Cases of incorrect spacing

        (c) Duplication of words

      For examples, see 16.1.

      16.3 If there is any doubt about the intended form, the text is left unchanged and the tag **[SIC**] is inserted in the text. This applies to all cases of aberrant syntax.

      16.4 The tag **[SIC**] is used sparingly and only where it is considered necessary for the user of the corpus. All occurrences of this tag are listed under the text in question in the List of Text Extracts.

      16.5 Errors in the representation of foreign names or in foreign-language passages have normally been left without comment.

    17 Comment tags

      17.1 **[ **] marks comment tags, Such tags are used for explanatory comments on the text, e.g.

        (a) Each corpus text is headed by a tag giving the text's absolute number and category number; e.g.

        **[001 TEXT A01**]

        **[421 TEXT L18**].

        (b) Extra-textual material in the source, such as maps, lists, bibliographies, is represented by tags where it is relevant for the interpretation of the text. E.g. **[LIST**].

        (c) Editorial material extraneous to the text is omitted and replaced by the tag **[EDITORIAL**].

        (d) The beginning and end of a passage indented away from the margin are tagged **[BEGIN INDENTATION**] and **[END INDENTATION**].

        (e) See further 10.3, 15.4, and 16 for comment tags used with quotations, mathematical expressions, and typographical errors.

      17.2 An alphabetical list of comment tags is given in Appendix IV.

      17.3 Comment tags, except **[SIC**], are placed on a separate line at the appropriate point in the text.

    Appendix I: Uncoded character index

    Appendix II: Non-English orthography
      1. ö, ä, ü are coded a", a", u" (see 3.3)

      2. æ, æ are coded oe, ae.

      3. ß in German orthography is coded ss.

      4. For representations of some other non-English characters, see further Appendix I.

      5. Hebrew quotations are omitted and represented by a comment tag.

      6. The following keys are used for transliterating Cyrillic and Greek characters:

    CODING FOR RUSSIAN CYRILLIC ALPHABET Marker: \11 OR {11 }

    CODING FOR GREEK ALPHABET Marker: \15 OR {15. }

    Appendix III: "Non-English" codes

      \, { } foreign word or expression (see 11)

      \0, {0 } abbreviation (see 13)

      \1, {1 } non-current English (see 12.2-4)

      \2, {2 } non-standard English (see 12.5-8)

      \3, {3 } foreigner English (see 12.9-10)

      \4, {4 } science fiction (see 12.11)

      \5, {5 } miscellaneous (see 12.12)

      \6, {6 } foreign word or expression widely used (see 11.5, 11.7)

      \11, {11 } Cyrillic alphabet (see App. II)

      \15, {15 } Greek alphabet (see App. II)

    Appendix IV: Comment tags

    The following is a list of the comment tags most often used in the corpus:

    **[A01 TEXT 001**] etc. (headings for corpus texts)

    **[BEGIN INDENTATION**]

    **[BEGIN QUOTE**]

    **[BIBLIOGRAPHY**]

    **[DIAGRAM**]

    **[EDITORIAL**]

    **[END INDENTATION**]

    **[END QUOTE**]

    **[FIGURE**]

    **[FORMULA**]

    **[ILLUSTRATION**]

    **[LIST**]

    **[LONG FOREIGN QUOTATION**] **[MAP**]

    **[MIDDLE OF QUOTE**]

    **[POEM**] **[SIC**]

    **[TABLE**]

    Index

    [Numbers refer to sections]

    abbreviations 13

    acronyms 13.4-5

    alphanumeric characters 3.2

    asterisk 3.4

    capitalisation 5

    chemical formulae 13.8

    comment tags 17, App. IV

    Cyrillic alphabet App. II

    designators 3.5

    editorial material 17.1

    foreign words or expressions 11, App. II-III

    formulae 15.4

    Greek alphabet App. II

    headings 9

    hyphens 14

    main coding key 3

    markers 3.5

    mathematical expressions 15

    non-current English 12.2 - 12.4

    non-standard English 12.5 - 12.8

    paragraph/line division 7

    punctuation symbols 3.3, 6.2, 6.5

    quotation marks 3.4, 6.2-3, 6.5

    quotations 8.5, 8.10, 10

    Roman numerals, 3.4

    Sentence-initial marking 8

    spacing and ordering 6

    subscript 3.4

    superscript 3.4

    typographical errors 16

    typographical shifts 3, 4

    Index to discussion of symbols

    [For key to coding symbols, see 3]

    ^   3.3, 6.5, 8

    ~   8.10

    _   8.11

    *   3.1-5, 4

    |   3.3, 6.5

    "   3.3

    '   3.3

    \,   { *}   11

    \0,   {0 *}   13

    \1,   {1 *}   12.2-3

    \2,   {2 *}   12.5-8

    \3,   {3 *}   12.9-10

    \4,   {4 *}   12.11

    \5,   {5 *}   12.12

    \6,   {6 *}   11.5, 11.7

    \11,   {11 *}   App. II

    \15   {15 *}   App. II

    .   3.3, 6.2, 13.3-4, 13.12-13, 15.2

    -   3.3, 14, 15.2

    *#   1.2, 3.4

    **#   1.1, 3.4

    *<   *> 9

    **[   3.3-4, 17

    **]   3.3-4, 17

    *-   6.2, 14

    /   6.4, 15.2

    *?   3.4, App. I

    LIST OF TEXT EXTRACTS


    A B C D E F G H J K L M N

    Page last update 6. January -98
    Anne Lindebjerg