You might want to have a look at the sample sizes proposed by Doug Biber
for various linguistic features in:
Biber, D. (1990). Methodological issues regarding corpus-based analyses of
linguistic variation. *Literary and Linguistic Computing*, *5*, 257-269.
Biber, D. (1993). Representativeness in corpus design. *Literary and
Linguistic Computing*, *8*, 243-257.
Since these articles don't include listings for many of the major
morphosyntactic and syntactic units, I used Biber's methodology to estimate
the size of corpora needed to represent the main POS categories in general
and specialist corpora, and came up with the following figures (a sketch of
the calculation follows the table):
POS | General corpus (words) | Specialist corpus (words)
Verb | 67,187 | 13,848
Noun | 74,551 | 8,555
Adj | 149,694 | 21,234
Adv | 205,206 | 68,953
Pron | 913,256 | 40,945
Num | 1,180,815 | 91,161
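For anyone who wants to reproduce or adapt these estimates, here is a
minimal sketch in Python, assuming the standard tolerated-error sample-size
formula of the kind Biber (1993) discusses, n = (z * s / d)^2, where s is
the standard deviation of a feature's frequency across 1,000-word chunks
and d is the tolerated error (here 5% of the mean, at roughly 95%
confidence). The noun counts below are invented for illustration, not
figures from the articles.

```python
import statistics

def required_sample_size(freqs_per_chunk, z=1.96, tolerance=0.05):
    """Number of 1,000-word chunks needed to estimate a feature's mean
    frequency to within `tolerance` * mean, at the confidence level
    implied by z (1.96 ~ 95%)."""
    mean = statistics.mean(freqs_per_chunk)
    sd = statistics.stdev(freqs_per_chunk)
    d = tolerance * mean            # tolerated error in absolute terms
    return (z * sd / d) ** 2        # n = (z * s / d)^2

# Hypothetical noun counts per 1,000-word chunk from a pilot sample:
noun_freqs = [198, 240, 215, 187, 260, 225, 210, 195, 232, 205]
n_chunks = required_sample_size(noun_freqs)
print(f"~{n_chunks:.0f} chunks, i.e. roughly {n_chunks * 1000:,.0f} words")
```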
The claim would be that corpora as large as or larger than those above
would be representative samples of the POS categories. The most frequent
and evenly distributed categories require the smallest samples, since the
required size grows with the square of the ratio of a category's standard
deviation to its mean. Also, the recommended sample sizes for the
specialist corpus are always much smaller, suggesting a high degree of
closure (McEnery & Wilson 1996). The general corpus figures were based on
the Brown and LOB corpora and the written component of the BNC Sampler. The
specialist corpus was an 11,761-word collection of British job application
letters.
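Closure in McEnery & Wilson's (1996) sense can be checked quite directly:
count how many new word types each successive slice of the corpus adds and
see whether the curve flattens early. A rough sketch, with an arbitrary
regex tokeniser and slice size (the file name is hypothetical):

```python
import re

def type_growth(text, slice_size=1000):
    """Cumulative count of distinct word types after each slice of
    `slice_size` tokens; a curve that flattens early suggests closure."""
    tokens = re.findall(r"[a-z']+", text.lower())
    seen, growth = set(), []
    for i in range(0, len(tokens), slice_size):
        seen.update(tokens[i:i + slice_size])
        growth.append(len(seen))
    return growth

# Usage with any plain-text corpus file:
# with open("job_letters.txt") as f:
#     print(type_growth(f.read()))
```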
Dr Tony Berber Sardinha
Catholic University of Sao Paulo, Brazil
> From: Short, Mick <firstname.lastname@example.org>
> To: 'email@example.com'
> Subject: Corpora: Sensible sizes for specialist corpora
> Date: 19 July 1999 10:17
> I have a PhD student who wants to establish for her thesis a small corpus of
> writings from serious fiction and popular fiction in order to investigate
> whether the claims made by critics about the linguistic differences between the
> genres have any basis in reality. Our current intention is (1) to build a
> corpus of two serious and two popular fiction novels (with an equalised
> division between male and female authors) for each of the five decades from the
> 1950s to the 1990s (a total of 20 novels) and (2) to take three 4,000-word
> samples from each (from the beginning, middle and end of each), thus giving a
> total of 240,000 words. This matches roughly the size of my own speech, thought
> and writing presentation corpus and Geoffrey Sampson's SUSANNE corpus.
> First questions: Is that in general terms adequate? Could it be any smaller (my
> student is very worried that she won't cope analytically, as most, if not all,
> of the corpus will have to be analysed by hand)?
> The second general issue is that it presumably takes different sizes of corpus
> to establish different sorts of claims. At present, from what we have read, it
> looks as if we will need to establish the presence or absence of
> significant contrasts for:
> sentence/clausal length and complexity
> word complexity
> particular sorts of clausal constructions/syntactico-semantic patterns
> (e.g. the ratio of passives, and of active clauses with parts of a character's
> body or abstract entities as subjects, to dynamic verbs)
> the 'upgraded' use of action and speech verbs (e.g. using 'shouted' rather
> than 'said')
> the prevalence of descriptions of characters' outward appearance
> the incidence of various sorts of speech and thought presentation
> Do you have any views on what size of sample would be needed to make safe
> judgements about these various factors?
> Would it be better to reduce the size of the samples taken from each novel and
> increase the number of novels passages are extracted from?
> Do you know of any software which might be used to analyse texts
> in these respects?
> Do you know of any electronic versions of relevant novels which we might be
> able to access?
> Any other comments or suggestions would be greatly appreciated.
> Mick Short
> Professor of English Language and Literature,
> Department of Linguistics and Modern English Language,
> Lancaster University,
> Lancaster LA1 4YT,
> Telephone: ((0)1524) 593035
> Fax: ((0)1524) 843085
> email: firstname.lastname@example.org
> World Wide Web site: http://www.ling.lancs.ac.uk/