Comparing Corpora: Last call for papers

From: Adam Kilgarriff (
Date: Tue Jul 04 2000 - 22:02:43 MET DST


                            2ND CALL FOR PAPERS

                                ACL Workshop:

                              COMPARING CORPORA

                                 October 2000

                Hong Kong University of Science and Technology


    Anyone who has worked with corpora will be all too aware of
    differences between them. Depending on the differences, it may, or
    may not, be reasonable to expect results based on one corpus to also
    be valid for another. It may, or may not, be appropriate for a
    grammar, or parser, based on one to perform well on another. It may,
    or may not, be straightforward to port an application from a domain of
    the first text type to a domain of the second. Currently,
    characterisations of corpora are mostly textual and at different
    levels of generality. A corpus is described as ``Wall Street
    Journal'' or ``transcripts of business meetings'' or ``foreign
    learners' essays (intermediate grade)''. It would be desirable to be
    able to place a new corpus in relation to existing ones, and to be
    able to quantify similarities and differences.

    Allied to corpus-similarity is corpus-homogeneity. An understanding of
    homogeneity is a prerequisite to any measure of similarity -- it makes
    little sense to compare a corpus sampled across many genres, like the
    Brown, with a corpus of weather forecasts, without first accounting
    for the one being broad, the other narrow.

    Given the centrality of corpora to contemporary language engineering,
    it is remarkable how little research there has been to date on the
    question. Biber's work, coming from sociolinguistics, has made a
    considerable impact, with various researchers in computational
    linguistics taking forward the model (Biber 1989, 1995). Studies in
    text classification, genre and sublanguage are also salient, but it is
    rarely evident how well the technologies developed in these fields are
    suited to measuring corpus similarity or homogeneity.

    The workshop will welcome contributions concerned with measuring and
    comparing corpora using quantitative methods, from any field.

    Where and when

    The workshop will last half a day and will be on either 7th or 8th
    October, the main ACL conference being 3rd-6th October. The venue will
    be the same as for the main conference.


    Submissions are limited to original, unpublished work. Papers may
    not exceed 3200 words (exclusive of title page and references).
    They must be received by July 8, 2000, in hard copy (4 copies)
    OR PostScript OR RTF format. Electronic delivery is to

    and hard copies are to be mailed to

    Compcorp submission
    University of Brighton
    Lewes Road
    Brighton BN2 4GJ
    United Kingdom

    Important Dates:
      July 8, 2000 Submission (of full-length paper)
      August 17, 2000 Acceptance notice
      September 1, 2000 Camera-ready paper received
      October 7 or 8 Workshop date

    Adam Kilgarriff - University of Brighton, UK
    Tony Berber Sardinha - Catholic University of Sao Paulo, Brazil

    Programme committee

    Douglas Biber           Northern Arizona University
    Jeremy Clear            University of Birmingham
    Ted Dunning             MusicMatch Software, Inc.
    Tomaz Erjavec           Jozef Stefan Institute, Slovenia
    Pascale Fung            University of Science and Technology, Hong Kong
    Greg Grefenstette       Xerox Research Centre Europe
    Benoit Habert           LIMSI, France
    Przemek Kaszubski       Adam Mickiewicz University, Poland
    Adam Kilgarriff         University of Brighton
    David Lee               University of Lancaster
    Oliver Mason            University of Birmingham
    Doug Oard               University of Maryland
    Tony Rose               Canon Research
    Tony Berber Sardinha    Catholic University of Sao Paulo, Brazil
    George Tambouratzis     ILSP, Athens
    Christopher Tribble     King's College, London University

