Bookmarks for Corpus-based Linguists 



Corpora, Collections, Data Archives*


* What are the differences among these terms? See below.

* Ok, so how do I actually get my hands on these corpora, & how can I search them? See below.

* For freely accessible, on-line corpora, see separate section below.

 Major English Language General Corpora

Kennedy (1998) suggests a three-way categorisation of corpora.

Pre-electronic Corpora: (biblical & literary studies, early dictionaries, etc.)

1st-generation Major Corpora:

         Brown, LOB, LLC, Kolhapur, Wellington, etc.

2nd-generation Corpora:

         Mega Corpora:

         British National Corpus (BNC);  Corpus of Contemporary American English (COCA);  COBUILD

         Not-so-mega Corpora:

         ICE-GB, American National Corpus (ANC), etc.

[Some people call many of the major "general" corpora above "balanced corpora", but I would avoid such a term. To say something is "balanced" suggests that linguists have agreed on what proportions to assign to different genres (patently untrue) in order to achieve this "balance". More apt terms are "wide-coverage" or "general".]

 

 

Other General Corpora for Written English

(excluding those already in the above lists; please also note other categories (for speech corpora, for instance) below; the same corpus may appear under more than one category, for easy access)

FLOB (Freiburg-LOB Corpus of British English)

1990s analogue to the LOB corpus (1 m wds, written British English)

FROWN (Freiburg-Brown Corpus of American English)

1990s analogue to the Brown corpus (1 m wds, written American English)

LUCY

structurally analysed written British English (drawn from the British National Corpus); a treebank sampling modern written British English of three genres (edited published prose, the writing of young adults (e.g. A-level exam scripts, 1st-year undergraduate essays), spontaneous writing by 9- to 12-year-old children).

SUSANNE (Surface & Underlying Structural Analyses of Naturalistic English)

130,000-word cross-section of written American English (based on a subset of the million-word Brown Corpus; 64 texts x 2,000 wds each from four Brown genre categories) syntactically analysed (treebanked).

Corpus of Contemporary American English

(by Mark Davies, Brigham Young University)

c. 360 million wds, including 20m for each year from 1990 to the present. Each year (& therefore overall, as well), the corpus is evenly divided between spoken, fiction, popular magazines, newspapers, & academic. In addition, the corpus will be continually updated--20m wds each year. (Because of copyright & licensing issues, the texts themselves are not available for download—they can only be searched online.)

ICE-Project

International Corpus of English

see description under 2nd-generation Mega-corpora  

ICE-East Africa

International Corpus of English: East Africa component

* For other national varieties of ICE, go to the main ICE web page here  (site includes downloadable sound files from several ICE teams, including Australia, India, Jamaica, Singapore & the Philippines)

ICE-HK

International Corpus of English: Hong Kong component

The Hong Kong component of the International Corpus of English. Possible to listen to a selection of sound files of Hong Kong speakers of English.

ICE-NZ

International Corpus of English: New Zealand component

1 m wds of spoken & written New Zealand English collected 1989 to 1994; consists of 600,000 wds of speech & 400,000 wds of written text. The Wellington Corpus of Spoken New Zealand English (WSC) & the spoken component of ICE-NZ share 9 categories. Because informal conversational data in particular was so difficult to collect, there is an overlap of 339,530 wds (173 files) between the two corpora to achieve economy in data collection.

Longman Written American Corpus

[This blurb is from their web site. Availability is unknown, as with all proprietary corpora... no comment on the use of 'corpuses'...] 

A dynamic corpus of 100 m wds from newspapers, journals, magazines, best-selling novels, technical & scientific writing, & coffee-table books..composition constantly being refined & new material added.... based on the general design principles of the Longman Lancaster English Language Corpus & the written component of the British National Corpus. Like other corpuses[sic] in the Longman Corpus Network, wds can be concordanced, wordlists created, & statistical features analysed, allowing lexicographers to compare & contrast usage in British & American English.

Reuters Corpora

(registration required to get the CDs, or get the older Reuters-21578 here.)

Reuters Corpus, Volume 1, English language, 1996-08-20 to 1997-08-19 [810,000 news stories]

Reuters Corpus, Volume 2, Multilingual Corpus, 1996-08-20 to 1997-08-19 [over 487,000 Reuters News stories in thirteen languages (Dutch, French, German, Chinese, Japanese, Russian, Portuguese, Spanish, Latin American Spanish, Italian, Danish, Norwegian, & Swedish). These stories are contemporaneous with RCV1, but some languages do not cover the entire time period.]

 * Some of the above-mentioned corpora are conveniently bundled together on the new ICAME Corpus Collection on CD-ROM (click to find out more). It comes with the concordancers WordSmith, TACT & WordCruncher.


Spoken Corpora of English

For phonemic/acoustic/articulatory databanks (mainly isolated words, phonemes, or sentences), see separate list of links here (Kiel) or the ELRA/ELDA pages or the LDC. Some people make a distinction between 'speech corpora' (suitable for acoustic/phonetic studies) & 'spoken corpora' (containing transcriptions of any type of spoken language). I use 'spoken corpora' here as an umbrella term for both types.

ANDOSL (Australian National Database of Spoken Language)

comprises spoken language as it occurs in a variety of major speaker groups in Australia (both native-born & overseas-born migrants); data was elicited either by written material which was read aloud (the "read speech" data) or by graphical material which was discussed by two speakers thereby generating spontaneous speech (the "map task" or "spontaneous" data). Speakers were rigorously selected within phonologically defined speaker groups, each group balanced for age ranges & gender. Recorded in a high quality environment at the National Acoustic Laboratories. Manual annotation at both word & phonemic levels using highly trained transcribers is being combined with automatic methods.

BASE (British Academic Spoken English)

See separate entry under "Specialised Corpora" below

CANCODE

(Cambridge & Nottingham Corpus of Discourse in English)

Not generally available for research except at specific sites (annoying!). 5 m wds of spontaneous speech collected between 1995 & 2000. We are told that a feature of CANCODE that makes it different from other spoken corpora is that all the transcripts have been coded to reflect the relationship between the speakers–whether they are intimates (living together), casual acquaintances, colleagues at work, or unknown to each other. Speech events were recorded at hundreds of locations across the British Isles, covering a wide variety of situations: casual conversation, people working together, people shopping, people finding out information, discussions, etc. [see also the Centre for Research in Applied Linguistics, University of Nottingham ]

CHRISTINE

(Spoken version of SUSANNE Corpus)

SUSANNE-meets-spoken-English; Geoffrey Sampson's project 

CSLU Speech Corpora

(Center for Spoken Language Understanding)

several free speech corpora (telephone recordings, conversations with children, pronunciations of isolated digits & alphabets, etc.)

CUCASE

(City University Corpus of Academic Spoken English)

A c. 2-m-word (multimedia) corpus currently being compiled (Jan 2008-Sept 2009, initially) by David Y.W. Lee. Will mirror the design of MICASE & BASE; will contain English spoken at a Hong Kong university (native & non-native speakers).

Diachronic Corpus of Present-day Spoken English (DCPSE)

800,000 wds (87,188 parse trees) of fully-parsed & annotated spoken British English from the 1950s to 1990s; composed of two 400,000-word samples of spoken English from the London-Lund Corpus (late 1960s-early 80s) & ICE-GB (early 1990s); fully parsed to be consistent with ICE-GB & searchable using ICECUP (Survey of English Usage, University College London).

Dialogue Diversity 'Corpus' (DDC)

See separate entry under "Specialised Corpora" below

ELISA (English Language Interview Corpus as a Second-Language Application)

60,000 wds, 28 interviews with native speakers of English; multimodal (video files available). They talk about their professional career (e.g. in tourism, politics, the media or environmental education). Free for non-commercial use. Has an on-line concordancer. University of Tuebingen.

EUSTACE (Edinburgh University Speech Timing Archive & Corpus of English)

Free for non-commercial use; esp. useful for phonetics researchers & speech technologists working on synthesis & recognition. Comprises 4608 spoken sentences spoken by six speakers of British English; sentences were designed to examine a number of durational effects in speech & are controlled for length & phonetic content. Subconstituents of key words in each sentence have been identified by labels in xlabel (ESPS) format & notes have been made about the prosodic realisation of the sentences. Example sentences available for playback. Speech waveform files are available in .wav (RIFF) format & .sd (ESPS) format.

FRED

 (Freiburg English Dialect Corpus)

A specialized corpus of British English dialects covering nine major dialect areas in Britain; 370 texts; c. 2.45 m wds; 300 hours of speech, excluding interviewer utterances (recorded between 1968 & 2000-- some recordings were taken from oral history interviews), 420 different informants (a majority are non-mobile old rural males who typically grew up before WW I.). Recordings will be made available 

HCRC Map Task Corpus  

(by the Human Communication Research Centre at Edinburgh University) See also LDC Catalog entry

a set of 8 CD-ROMs containing linked audio & transcriptions of a total of about 18 hours (roughly 150,000 word tokens) of spontaneous (task-oriented) speech that was recorded from 128 two-person conversations according to a detailed experimental design. OR Download/ftp a gzipped tar file of the entire corpus (tar [compressed] file is 10MB, whole corpus is 80MB; 2562 XML files & a dtd directory containing 15 dtd files.)

ISLE corpus  

See separate entry under "Specialised Corpora" below

IViE

(Intonational Variation in English)

created to investigate cross-varietal & stylistic variation in English intonation. Focus is on modern or mainstream dialects: nine urban varieties of English spoken in the British Isles, viz. Belfast, Bradford (bilingual Punjabi/English speakers), Cambridge, Cardiff (bilingual Welsh-English speakers), Dublin, Liverpool, Leeds, London, & Newcastle; approximately 36 hours of speech data in five different speaking styles

Lancaster/ IBM Spoken English Corpus (SEC)

52,000 wds of mostly prepared (& mostly monologic) southern British English speech (approximating to RP), collected in the period 1984-1987; orthographic & prosodic transcription & in two versions with grammatical tagging (like those for the LOB Corpus). For a detailed description, see the ICAME Corpus Collection's SEC manual, & the AMALGAM web site for the SEC tag-set. Ref: A Corpus of Formal British English Speech (1996), Knowles, Gerald, Briony Williams & Lita Taylor (eds.), London: Longman. A collection of research papers based on the SEC has also been published as Working with Speech (1996), Knowles, Gerald, Anne Wichmann & Peter Alderson (eds.), London: Longman.

LeaP (Learning the Prosody of a foreign language)

(Bielefeld)

a large corpus of foreign language learners' speech (Target Languages are English & German, Native Languages span a wide range: German, Polish, Arabic, Chinese, Spanish, etc.). A multitude of data of various types is being collected: the corpus of spoken language will consist of at least 400 recordings of between 2 & 20 minutes in length. It comprises three different speech styles: (i) read speech (a story of 268 wds); (ii) prepared speech (the re-telling of the story); (iii) free speech from an interview context. The central aim of the project is to provide a detailed description of non-native prosody. The second line of research aims to explore whether & how it is possible for learners of a foreign language to acquire the prosody of the target language without having a distinct "foreign accent". In a longitudinal study, various methods of teaching prosody will be tested.

Limerick Corpus of Irish-English (L-CIE)

one-million word spoken corpus of Irish English discourse; conversations recorded in a wide variety of mostly informal settings throughout Ireland (excluding Northern Ireland); currently (accessed: Feb 2008) has 375 transcripts totalling over 1m wds; mainly casual conversation, but also over 200K wds of professional, transactional & pedagogic Irish English; not designed to be geographically representative (does not include data from every county); speakers range in age from 14 to 78; equal representation of both male & female speakers; designed to allow inter-corpus comparisons with CANCODE

London-Lund Corpus (LLC)

 See description here

Longman Spoken American Corpus

5 m wds, demographically sampled speech from 12 regions (30 states) across the continental US; coordinated by the University of California at Santa Barbara; everyday conversations of more than 1,000 Americans of various age groups, levels of education, & ethnicity.  PDF with more information is here:

Machine-Readable Spoken English Corpus
(MARSEC)

Some notes on MARSEC version 2 here (latest) or  here (outdated).

MICASE (Michigan Corpus of Academic Spoken English)

See separate entry under "Specialised Corpora" below

Newcastle Electronic Corpus of Tyneside English (NECTE)

(Under construction)

Northern Ireland Transcribed Corpus

400,000 wds transcribed speech from 42 locations, across three age groups. Contact the Oxford Text Archive.

PROSICE Corpus

a collection of re-recorded ICE-GB texts with high technical specifications; syntactically analysed & temporally aligned. See here for more info.

Reading/Leeds Emotional Speech Corpus

prosodically & paralinguistically coded speech corpus for investigating suprasegmental & affective information in the speech signal. 4.5-hour database of machine-readable speech, of which 26 mins were transcribed using the extended ToBI system. Unfortunately, this corpus is NOT available for use by others, but you can find out more info from the people listed on the website, & also from here.

Saarbruecken Corpus of Spoken English (ScoSE)

See separate entry under "Specialised Corpora" below

Santa Barbara Corpus of Spoken American English (SBCSAE)

(university site is here)

recordings of people talking -- people from all over the United States, in all walks of life, talking about & doing all sorts of things; target of 200,000 wds. The three CD-ROM volumes in Part 1 contain 14 speech files of between fifteen & thirty minutes each. Alternative site for the data at TalkBank

Part II contains 47,000 wds (6 hrs; 16 wave format speech files)

Spoken Corpus of the Survey of English Dialects

See DRH (Digital Resources for the Humanities) Program

Switchboard Corpus (SWB)

(LDC version & (more recent) ISIP version)

a corpus of over 240 hours of recorded spontaneous (but topic-prompted) telephone conversations (2,438 conversations averaging 6 minutes in length each) recorded in the early 1990s; c. 3 m wds (3,044,734) of text, spoken by 543 unique speakers (302 males & 241 females) from most major dialect groups of American English. Info on the speakers' age, sex, education & dialect region. On average, each speaker participates in about 9 calls (but it ranges from 1 to 32).

TRAINS Spoken Dialogue Corpus on CD-ROM (University of Rochester web site)

six & a half hours' worth of human-human dialogues; includes 55,000 wds & about 5,500 speaker turns. Audio files for the dialogues are available on the CD-ROM; 

Translanguage English Database (TED)
or
the LDC equivalent

See separate entry under "Learner Corpora" below

Tyneside Linguistic Survey (TLS)

Not much info available, but some given on the NECTE page. The TLS corpus was compiled in the late 1960s, & consists of 86 loosely-structured 30-min interviews. The informants were drawn from a stratified random sample of Gateshead in North-East England, & were equally divided among various social class groupings of male & female speakers, with young, middle, & old-aged cohorts

Wellington Corpus of Spoken New Zealand English (WSC)

1 m wds of spoken New Zealand English collected from 1988 to 1994 (99% (545 out of 551 extracts) was collected between 1990 & 1994). Of the eight remaining files, four were collected in 1988 (4 oral history interviews) & four in 1989 (4 social dialect interviews). Consists of 2,000-word extracts (where possible) & comprises different proportions of formal, semi-formal & informal speech. Both monologue & dialogue categories are included & there is broadcast as well as private material collected in a range of settings. Access to recordings from the WSC is restricted to use at Victoria University of Wellington. A small number of the recordings which are shared with the ICE-NZ corpus will be made available on CD through ICE.

Wellington Language in the Workplace Project Corpus

Not generally available (?). Project aims to analyse features of interpersonal communication in a wide variety of New Zealand workplaces, with recordings done as unobtrusively as possible. Volunteers tape-record a range of their everyday work interactions over a period of time, collecting two-party & multiparty meetings, informal work-related conversations, telephone calls, & workplace small talk. Currently (2004) comprises 2000 interactions involving >500 participants, recorded in a number of government departments & commercial white-collar organizations, small businesses, & blue-collar factories. Social talk & business or task-oriented talk, ranging from short telephone calls of <1min to meetings >4 hrs long. Audio recordings are supplemented by detailed on-site ethnographic observations, written agendas & minutes, demographic & organizational info, & video recordings. Contact Janet Holmes at the Victoria University of Wellington, NZ.

West Point Company G3 American English Speech Data Corpus

During the 2000-2001 academic year, cadets, staff & faculty members at the US Military Academy volunteered to participate in a speech data collection project for American English (high-quality read speech---not spontaneous). The 185 sentences comprising the data collection script were written to elicit examples of all or most all of the possible syllables used in spoken American English. The G3 Corpus audio data comes from 53 female & 56 male volunteers, each of whom recorded approximately 104 utterances. The recordings are sampled at a 16 bit resolution, 22,050 samples per second. Total:  c.15 hours.

British National Corpus (BNC)

Naturally, the spoken component of the British National Corpus is also a rich resource (although for phonetic/prosodic research you'll need to get the audio tapes from the British Library... these are now generally available, but the matching of tapes & actual BNC files is problematic).

The LDC also contains various resources which are not 'corpora' as such, but may be of interest. Example: the LDC American English Spoken Lexicon, which is a collection of pronunciations captured in individual audio files for more than 50,000 of the most common words in English (words were extracted from newswire & telephone conversation; description & links to audio files here).

Diachronic Comparisons (recent changes in English)

Since the first major English corpora were collected in the 1960s, it is now possible to compare these earlier corpora with more contemporary (1990s) corpora. For written British English, LOB can now be compared with FLOB, while for American English, it's Brown v. Frown. For spoken British English, the Diachronic Corpus of Present-Day Spoken English (DCPSE) allows comparisons of the London-Lund Corpus (LLC, 1960s) with the British component of the International Corpus of English (ICE-GB, 1990s).
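A minimal sketch of how such a comparison is often quantified, assuming you have already obtained raw counts of some word or construction from two matched corpora (e.g. LOB & FLOB, each c. 1 m wds). The counts used below are purely hypothetical placeholders, not real figures from any corpus, & the function names are illustrative only. The sketch normalises counts to a per-million-words rate & computes the log-likelihood (G2) statistic that is commonly used for comparing frequencies across two corpora.

```python
import math


def per_million(count, corpus_size):
    """Normalise a raw count to a rate per million words."""
    return count * 1_000_000 / corpus_size


def log_likelihood(count_a, size_a, count_b, size_b):
    """Log-likelihood (G2) for one item's frequency in two corpora.

    Expected counts assume the item is equally frequent in both corpora,
    given their sizes. Zero observed counts contribute nothing to the sum.
    """
    total = count_a + count_b
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    g2 = 0.0
    for observed, expected in ((count_a, expected_a), (count_b, expected_b)):
        if observed > 0:
            g2 += observed * math.log(observed / expected)
    return 2 * g2


if __name__ == "__main__":
    # HYPOTHETICAL counts for illustration only (not real corpus data).
    earlier_count, earlier_size = 150, 1_000_000   # e.g. a 1960s corpus
    later_count, later_size = 100, 1_000_000       # e.g. a 1990s corpus

    print("earlier, per million:", per_million(earlier_count, earlier_size))
    print("later, per million:  ", per_million(later_count, later_size))
    print("G2:", round(log_likelihood(earlier_count, earlier_size,
                                      later_count, later_size), 2))
```

A G2 value above 3.84 is conventionally read as significant at p < 0.05 (1 degree of freedom), though a frequency difference between two corpora is not by itself evidence of language change: sampling & genre differences need to be ruled out too.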

 


Historical Corpora or Collections (English)

Brooklyn-Geneva-Amsterdam-Helsinki Parsed Corpus of Old English

a selection of texts from the Old English Section of the Helsinki Corpus of English Texts; contains 106,210 wds of Old Eng text; the samples from the longer texts are 5,000 to 10,000 wds in length; texts represent a range of dates of composition, authors, & genres. For a list of the texts included in the Brooklyn Corpus, click here. The texts are syntactically  & morphologically annotated, & each word is glossed. Size of the corpus: c.12 megabytes.

Complete Corpus of Old English

3,022 texts representing all extant Old English texts, compiled at the University of Toronto.

A Corpus of English Dialogues, 1560-1760 (CED)

1.2-m wds of Early Modern English speech-related texts (177 text files). The CED contains texts representative of five text types (plus a mixed bag of dialogues labelled 'Miscellaneous'), which divide into two categories: these are 'authentic dialogue', which is written records of real speech events (Trial Proceedings & Witness Depositions), &  'constructed dialogue', in which the dialogue is constructed by an author (Drama Comedy, Didactic Works, & Prose Fiction).

Early English Books On-line (EEBO)

(subscription required)

(images of original print documents, with some now searchable as texts) "From the first book published in English through the age of Spenser & Shakespeare, this incomparable collection now contains about 100,000 of over 125,000 titles listed in Pollard & Redgrave's Short-Title Catalogue (1475-1640) & Wing's Short-Title Catalogue (1641-1700) & their revised editions, as well as the Thomason Tracts (1640-1661) collection & the Early English Books Tract Supplement."

Corpus of Newsbooks

approximately 800,000 wds of running text drawn from all the newsbooks present in the Thomason Tracts that were published from December 1653 to May 1654.

York-Toronto-Helsinki Corpus of Old English Prose (YCOE)

1.5 million word syntactically-annotated corpus of Old English prose texts; sister corpus to the Penn-Helsinki Parsed Corpus of Middle English (uses the same form of annotation & is accessed by the same search engine, CorpusSearch). The corpus itself (the annotated text files) is distributed by the Oxford Text Archive. Free for non-commercial use.

York-Helsinki parsed corpus of Old English poetry

a selection of poetic texts from the Old English Section of the Helsinki Corpus of English Texts; 71,490 wds of Old English text; the samples from the longer texts are 4,000 to 17,000 wds in length. The texts represent a range of dates of composition & authors. For a list of the texts included in the York Poetry Corpus, click here. The texts are syntactically & morphologically annotated.

Helsinki Corpus of English Texts: Diachronic Part

c. 1.5 m wds; 242 files; covers the period from c. 750 to c. 1700 (Old English to Early Modern)

Innsbruck Computer-Archive of Machine-Readable English Texts (ICAMET)

(1) The Prose Corpus of ICAMET: compilation of 129 texts (March 1999) of Middle Eng prose (1100-1500), digitalized from extant editions & constantly enlarged by further files. Since it is a full-text database, it particularly aims at target groups of users who, unlike those of the Helsinki Corpus, are not so much interested in extracts of texts, but in their complete versions. Thus allows literary, historical & topical analyses of various kinds, esp. studies of cultural history. It also invites linguists to raise questions of style, rhetoric or narrative technique, for which one would want a lengthier piece of text or even the complete text.

(2) The Letter Corpus of ICAMET contains 254 complete letters, arranged diachronically, from different sources (written between 1386 & 1688). Particularly encourages pragmatic & sociolinguistic studies, & analyses concerning cultural life & lifestyle.

Penn-Helsinki Parsed Corpus of Middle English

prose text samples of Middle Eng, annotated for syntactic structure. Designed for the use of students & scholars of the history of English, especially the historical syntax of the language

Corpus of Middle English Prose & Verse (CME)

(or visit the parent site, the Middle English Compendium)

collection of Middle Eng texts assembled from works contributed by Univ of Michigan faculty & from texts provided by the Oxford Text Archive, as well as works created specifically for the Corpus (archive last updated in October 2000). All 61 texts in the archive are valid SGML documents, tagged in conformance with the TEI Guidelines, & converted to the TEI Lite DTD for wider use. Web-searchable.

Corpus of Early English Correspondence (CEEC) & the Parsed Corpus of Early English Correspondence (PCEEC),

2.7 m wds; 1410 to 1681 (CEECS = 450,000 wds); a supplement, the "Corpus of Early English Correspondence Supplement" (CEECSu; 0.44 m wds) extends the time range: 1402-1663, while the "Corpus of Early English Correspondence Extension" (CEECE; 2.2 m wds) covers the period 1681-1800. The project home page & the manual at ICAME give more details.

Corpus of Early English Medical Writing & Corpus of Middle English Medical Texts (MEMT)

a corpus of medical treatises from 1375-1800. Shorter texts are included in toto & longer treatises are represented by extracts of approximately 10-12 K wds. The medieval section contains about 500,000 wds

Century of Prose Corpus

half a m wds of literary & non-literary English; 1680-1780; 120 authors.

Corpus of Late 18c Prose

c.300,000 wds of local English letters on practical subjects, dated 1761-89, as a sample of the English language of the north-west of England in the late Modern English period. These letters, written to Richard Orford, a steward at Lyme Hall in Cheshire, are unselfconscious practical letters, often by uneducated people, on matters of business, farming, mining, & social relations. Available free for ftp download as a single text file or as three linked HTML files for maximum readability.

Corpus of Late Modern English Prose

A 100K-word corpus of informal private letters by British writers, covering the period 1861 to 1919. (Range of dates by birth-date of writer is narrower: 1837-67.) Available from the Oxford Text Archive & through the owner (David Denison).

Corpus of Late Modern English Texts (CLMET)

c.10 m wds; a principled collection of texts drawn from the Project Gutenberg & Oxford Text Archive; Ten m wds of running text, divided over three 70-year sub-periods from 1710-1920.

Corpus of Early American English

English in America from the beginning of the 17th century; compiled in Helsinki.

Helsinki Corpus of Older Scots

830,000 wds; 1450-1700, from fifteen genres.

ARCHER Corpus
(A Representative Corpus of Historical English Registers)

1.7 m wds of British & American English from written & "speech-based" genres sampled from 7 historical periods covering Early Modern English  (range: 1650-1990); 1,037 texts; 10 registers (e.g., drama, letters, science prose) representing speech-based, popular, & specialist/academic written registers. Contact Douglas Biber. Complements the Helsinki corpus. On-going collaborative research efforts are underway to extend the coverage of the corpus with the Universities of Uppsala, Helsinki, Freiburg, Heidelberg, Lancaster, Manchester & Michigan.

NEET (Network of Early Eighteenth-century English Texts)

c. 3-million-word corpus of 18th Century English registers. No more information available, but contact Douglas Biber for more details.

Newdigate Newsletters

750,000 wds; manuscript newsletters from 1674-92.

Lampeter Corpus of Early Modern English Tracts

1m wds of English pamphlet literature covering 1640-1740. Text samples are taken from each decade within this century & several genres are represented. Contains the whole text of pamphlets, rather than fragments.

Leverhulme Corpus Project

(Under construction: 15 months from October 2003)

1-million-word corpus which matches as closely as possible the LOB & FLOB corpora of written British English, except that the year of data collection is 1931, or near to that date (+/- 3 years). The immediate purpose of building this corpus is to make it possible to compare these three temporally equidistant corpora (1931, 1961, 1991): "Pre-LOB", LOB, & FLOB. This will enable tracking of grammatical change through a period of 60 years of the 20th century. Under construction & as yet unnamed (?)

TIME Magazine Corpus

100-m wds from TIME magazine, 1923-2006. Allows you to see how wds & phrases have increased or decreased in usage &/or changed meaning over time.

Women Writers Online

The Brown University Women Writers Project's main undertaking is an SGML-encoded full-text database of pre-Victorian women's writing in English (at present, it covers 1400 to 1850). This collection currently includes nearly 200 texts representing a broad cross-section of the literate culture of pre-Victorian Britain.

Zürich Corpus of English Newspapers (ZEN)

London newspapers from 1660s to the beginning of the 20th century. Contact: Udo Fries

* See also the Early Modern English Dictionaries Database (EMEDD description here)


Corpora for research on 1st language acquisition

Child Language Data Exchange System

(CHILDES)

XML database here

c.20 m wds (180m characters), 20 languages. The CHILDES system provides tools for studying conversational interactions. These tools include a database of transcripts, programs for computer analysis of transcripts, methods for linguistic coding, & systems for linking transcripts to digitized audio & video. Includes a language acquisition bibliography

Polytechnic of Wales (POW) Corpus

100,000 wds of spoken English by 120 children, aged 6-12; parsed according to Hallidayan Systemic-Functional Grammar. See the manual here. Distributed from two places: the Oxford Text Archive, organised by Lou Burnard, & ICAME in Bergen, Norway (icame@hd.uib.no), organised by Knut Hofland. The AMALGAM tagger emulates the POW tagset

Lancaster Corpus of Children's Project Writing

a digitized collection of project work produced by children aged between 9 & 11; part of a larger research program (a longitudinal study of children's writing-for-learning, based on the writing of 8-12 year old children)

 

Learner Corpora, Lingua Franca Corpora (for various languages / 2nd Lg Acquisition research)

(Language produced by  non-native speakers/writers)

* See Yukio Tono's Learner Corpora Resources web page for a more comprehensive index to learner corpora web sites (e.g. the various ICLE projects for learner English, such as SWICLE (Sweden), BRICLE (Brazil) & PICLE (Poland)), plus a useful bibliography on learner corpora

International Corpus of Learners' English (ICLE)

Over 2 m wds of writing by advanced/university learners of English (EFL, not ESL) from 19 different mother tongue backgrounds (e.g. Brazilian Portuguese, Chinese (which dialect?), Czech, Dutch, Finnish, French, German, Japanese, Polish, Spanish & Swedish). 

*NEW: The ICLE corpus is now available for purchase (CD-ROM, version 1.1.) from i6doc.com here

Cambridge Learner Corpus

Not generally available. A large collection of examples of English writing from learners of English all over the world; over 15 m wds & expanding all the time; part of the Cambridge International Corpus (CIC); comes from anonymised exam scripts written by students taking Cambridge ESOL English exams around the world; each script is coded with information about the student's first language, nationality, level of English, age, etc. Currently, it can only be used by authors & writers working for Cambridge University Press & by members of staff at Cambridge ESOL.

Chinese Learner English Corpus (CLEC)
(An associated web site with slightly outdated info is here)

1 m wds of English compositions collected from 5 different levels of Chinese learners of English, tagged according to an error tagging scheme of 61 types of error (excludes stylistic errors & error sources, which are difficult to tag objectively & consistently). CLEC consists of a book & a CD-ROM. The main body of the book has an introduction (in Chinese) which gives an account of the corpus design, the methodology used in the statistical analysis of the corpus, & the major findings, plus an Alphabetical List, a Lemmatized List, a Word-Frequency Distribution, a Summary Table of Errors, & a List of Spelling Mistakes. The CD-ROM consists of the error-tagged corpus with a simple concordancer, & all the lists & tables of the book. Another companion to CLEC, known as Analysis of Chinese Learner Errors in English, is forthcoming. Available by mail: Shanghai Foreign Language Education Press, 295 Zhong Shan Bei Yi Road, Shanghai 200083, PRC. Contact Mrs. Fan Jianying, email sflep@sflep.com.cn, fax (86) 021-55512177. List price: in PRC, ¥76.00 plus 15% postage; overseas, US$60.80 (including postage). For further information, please contact Professor Gui Shichun (itscgui@gdvnet.com)

EnglishTLC

(English Taiwan Learner Corpus)

a web-based corpus of c. 2 m wds of unrestricted running text of English written by learners in Taiwan (majority by senior high school & university students). Essentially a self-propagating corpus: EnglishTLC is integrated with the writing component of a web-based English learning platform called IWiLL. Partially annotated for errors; the annotation consists of comments made by teachers in their everyday process of correcting essays online using the IWiLL essay correction interface. These comments provide a window onto actual teacher feedback & teaching practice. The research interface provides a search function for extracting every error token marked by teachers on essays in the corpus. This function lists all comments in descending order of the number of instances marked as tokens of that error type. Each comment in this list then links to a listing of all of the sentences in EnglishTLC that have been marked as that error type. Since teachers are selective in the errors which they mark in student writing, this sort of annotation in EnglishTLC should be regarded as partial annotation. Heuristics have been devised for bootstrapping from these partially annotated texts to the extraction of further error tokens that the teachers left unmarked (see Wible et al. 2003 for details). Feedback effects are traceable: the errors that teachers have marked as feedback to the students are also indexed to any revisions the learner may have made to their essay after reading that teacher feedback. This makes it possible to uncover learners’ attentiveness to or grasp of comments given.
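
For readers who want to reproduce this kind of comment-ranked error listing on their own annotated learner data, here is a minimal sketch in Python. It is not the EnglishTLC/IWiLL implementation, whose internals are not described here; the record format (comment, sentence), the function names index_by_comment & rank_comments, & the sample sentences are all invented for illustration.

```python
from collections import defaultdict

# Hypothetical records: (teacher comment / error type, sentence marked with it).
# These examples are invented for illustration; they are not EnglishTLC data.
annotations = [
    ("subject-verb agreement", "He go to school every day."),
    ("article missing", "I bought book yesterday."),
    ("subject-verb agreement", "She like music."),
    ("wrong preposition", "I am good in math."),
    ("subject-verb agreement", "My friends is kind."),
    ("article missing", "She is teacher."),
]


def index_by_comment(records):
    """Map each comment to the list of sentences marked with it."""
    index = defaultdict(list)
    for comment, sentence in records:
        index[comment].append(sentence)
    return dict(index)


def rank_comments(index):
    """Return (comment, count) pairs, most frequently marked first.

    Ties are broken alphabetically by comment so the order is stable.
    """
    counts = ((comment, len(sentences)) for comment, sentences in index.items())
    return sorted(counts, key=lambda pair: (-pair[1], pair[0]))


index = index_by_comment(annotations)
ranking = rank_comments(index)

# Print the ranked list, then every sentence filed under each comment.
for comment, count in ranking:
    print(f"{count:3d}  {comment}")
    for sentence in index[comment]:
        print(f"       {sentence}")
```

Because teachers mark errors selectively, counts produced this way measure what was marked, not how often an error actually occurs in the texts.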

European Science Foundation Second Language Databank

a computerized archive of the spontaneous second language acquisition of forty adult immigrant workers living in Western Europe, & their communication with native speakers in the respective host countries (France, Germany, Great Britain, The Netherlands & Sweden). For each target language, two source languages were selected.

ELFA
(English as a Lingua Franca in Academic Settings)

recordings & transcripts of English used as a lingua franca in academic settings (Tampere University & Tampere Technological University). Sessions with speakers who all share an L1 are not included, nor are English language courses. Coded for speech event type/genre, discipline/domain, interaction type (dialogic/monologic), age group, gender, nationality & mother tongue.

FRIDA (French Interlanguage Database)

a corpus of French as a foreign language, with a target size of 450,000 wds.

Hong Kong University of Science & Technology (HKUST) Corpus

the biggest corpus of Chinese (Cantonese) learners of English (or, indeed, of any single group of learners of English). 25 m wds, with grammatical & discourse-feature tags. Texts consist of written undergrad assignments & 'A-level' scripts. Contact: Gregory James, Language Centre, Hong Kong University of Science & Technology, Clear Water Bay, HK. See Milton, John & K.S.T. Tong (eds.) (1991). Text Analysis in Computer-Assisted Language Learning. Hong Kong: Hong Kong University of Science & Technology. See also: AUTOLANG & WORDPILOT (corpus-based tutor; shareware)

Hungarian Learner English (JPU Corpus)

Hungarian university students' English 

ISLE database

(Interactive Spoken Language Education)

 [not really a "corpus" as such]; database of non-native English created to help train & test the ISLE automatic pronunciation tutor system; approx. 20 minutes of speech (per speaker) from 23 German & 23 Italian intermediate learners of E