Bookmarks for Corpus-based Linguists
Major English Corpora | Other Recent English Corpora | Spoken Corpora | Diachronic Comparisons | Historical Corpora | 1st language acquisition | Learner/Lingua Franca Corpora | Specialised Corpora | Text Archives & Corpus Distribution Sites | Non-English & Multilingual Corpora | The Web as a Corpus | Parsed Corpora | D-I-Y Corpora | Multimedia Corpora & texts | *Free, web-accessible Corpora
* What are the differences among these terms? See below.
* Ok, so how do I actually get my hands on these corpora, & how can I search them? See below.
* For freely accessible, on-line corpora, see separate section below.
Major English Language General Corpora |
|
Kennedy (1998) suggests a three-way categorisation of corpora. Pre-electronic Corpora: (biblical & literary studies, early dictionaries, etc.) 1st-generation Major Corpora: Brown, LOB,
LLC,
|
Other General Corpora for Written English (excluding those already in the above lists; please also note other categories (for speech corpora, for instance) below; the same corpus may appear under more than one category, for easy access) |
|
|
Corpus of Contemporary American English (by Mark Davies, |
c. 360 million wds, including 20m for each year from 1990 to the present. Each year (& therefore overall, as well), the corpus is evenly divided between spoken, fiction, popular magazines, newspapers, & academic. In addition, the corpus will be continually updated--20m wds each year. (Because of copyright & licensing issues, the texts themselves are not available for download—they can only be searched online.) |
|
FLOB (Freiburg-LOB Corpus of British English) |
1990s analogue to the LOB corpus (1 m wds, written British English) |
|
FROWN (Freiburg-Brown Corpus of American English) |
1990s analogue to the Brown corpus (1 m wds, written American English) |
|
International Corpus of English |
see description under 2nd-generation Mega-corpora. * The main ICE web site has downloadable sample sound files from several ICE teams.
Current ICE national varieties include |
|
structurally analysed written British English (drawn from the British National Corpus); a treebank sampling modern written British English of three genres (edited published prose, the writing of young adults (e.g. A-level exam scripts, 1st-year undergraduate essays), spontaneous writing by 9- to 12-year-old children). |
|
|
SUSANNE (Surface & Underlying Structural Analyses of Naturalistic English) |
130,000-word cross-section of written American English (based on a subset of the million-word Brown Corpus; 64 texts x 2,000 wds each from four Brown genre categories) syntactically analysed (treebanked). |
|
[This blurb is from their web site. Availability is unknown, as with all proprietary corpora... no comment on the use of 'corpuses'...] A dynamic corpus of 100 m wds from newspapers, journals, magazines, best-selling novels, technical & scientific writing, & coffee-table books..composition constantly being refined & new material added.... based on the general design principles of the Longman Lancaster English Language Corpus & the written component of the British National Corpus. Like other corpuses[sic] in the Longman Corpus Network, wds can be concordanced, wordlists created, & statistical features analysed, allowing lexicographers to compare & contrast usage in British & American English. |
|
|
(registration required to get the CDs, or get the older Reuters-21578 here.) |
Reuters Corpus, Volume 1, English language, 1996-08-20 to 1997-08-19 [810,000 news stories] Reuters Corpus, Volume 2, Multilingual Corpus, 1996-08-20 to 1997-08-19 [over 487,000 Reuters News stories in thirteen languages (Dutch, French, German, Chinese, Japanese, Russian, Portuguese, Spanish, Latin American Spanish, Italian, Danish, Norwegian, & Swedish). These stories are contemporaneous with RCV1, but some languages do not cover the entire time period.] |
* Some of the above-mentioned corpora are conveniently bundled together on the new ICAME Corpus Collection on CD-ROM (click to find out more). It comes with the concordancers WordSmith, TACT & WordCruncher.
Spoken Corpora of English
For phonemic/acoustic/articulatory databanks (mainly isolated words, phonemes, or sentences), see separate list of links here ( |
|
|
(Australian National Database of Spoken Language) |
comprises spoken language as it occurs in a variety of
major speaker groups in |
|
BASE (British Academic Spoken English) |
|
|
(Cambridge & Nottingham Corpus of Discourse in
English) |
Not generally available for research except at specific sites (annoying!). 5 m wds of spontaneous speech collected between 1995 & 2000. CANCODE has all the transcripts coded to reflect the relationship between the speakers–whether they are intimates (living together), casual acquaintances, colleagues at work, or unknown to each other. Speech events were recorded at hundreds of locations across the British Isles, covering a wide variety of situations: casual conversation, people working together, people shopping, people finding out information, discussions, etc. [see also the Centre for Research in Applied Linguistics, University of Nottingham ] |
|
(Spoken version of SUSANNE Corpus) |
SUSANNE-meets-spoken-English; Geoffrey Sampson's project |
|
(Center for Spoken Language Understanding) |
several free speech corpora (telephone recordings, conversations with children, pronunciations of isolated digits & alphabets, etc.) |
|
CUCASE (City University Corpus of Academic Spoken English; forthcoming) |
A multimedia corpus currently being compiled (Jan 2008-Sept
2009, initially) by David Y.W. Lee. Will mirror the design of MICASE & BASE;
contains academic lectures and student presentations in English at a |
|
800,000 wds (87,188 parse trees) of fully-parsed &
annotated spoken British English from the 1950s to 1990s; composed of two
400,000-word samples of spoken English from the London-Lund Corpus (late
1960s-early 80s) & ICE-GB (early 1990s); fully parsed to be
consistent with ICE-GB & searchable using ICECUP, (Survey of
English Usage, University College London). |
|
|
Dialogue Diversity 'Corpus' (DDC) |
|
|
ELISA (English Language Interview Corpus as a Second-Language Application) |
60,000 wds, 28 interviews with native
speakers of English; multimodal (video files available). They talk about
their professional career (e.g. in tourism, politics, the media or
environmental education). Free for non-commercial use. Has an on-line
concordancer. |
|
EUSTACE
( |
Free for non-commercial use; esp. useful for phonetics researchers & speech technologists working on synthesis & recognition. Comprises 4608 sentences spoken by six speakers of British English; sentences were designed to examine a number of durational effects in speech & are controlled for length & phonetic content. Subconstituents of key words in each sentence have been identified by labels in xlabel (ESPS) format & notes have been made about the prosodic realisation of the sentences. Example sentences available for playback. Speech waveform files are available in .wav (RIFF) format & .sd (ESPS) format. |
|
( |
A specialized corpus of British English dialects covering nine major dialect areas in Britain; 370 texts; c. 2.45 m wds; 300 hours of speech, excluding interviewer utterances (recorded between 1968 & 2000-- some recordings were taken from oral history interviews), 420 different informants (a majority are non-mobile old rural males who typically grew up before WW I.). Recordings will be made available. |
|
(by the Human Communication Research
Centre at |
a set of 8 CD-ROMs containing linked audio & transcriptions of a total of about 18 hours (roughly 150,000 word tokens) of spontaneous (task-oriented) speech that was recorded from 128 two-person conversations according to a detailed experimental design. OR Download/ftp a gzipped tar file of the entire corpus (tar [compressed] file is 10MB, whole corpus is 80MB; 2562 XML files & a dtd directory containing 15 dtd files.) |
|
created to investigate cross-varietal & stylistic variation in English intonation. Focus is on modern or mainstream dialects: nine urban varieties of English spoken in the British Isles, viz. Belfast, Bradford (bilingual Punjabi/English speakers), Cambridge, Cardiff (bilingual Welsh-English speakers), Dublin, Liverpool, Leeds, London, & Newcastle; approximately 36 hours of speech data in five different speaking styles |
|
|
52,000 wds of mostly prepared (& mostly
monologic) southern British English speech (approximating to RP),
collected in the period 1984-1987; orthographic & prosodic
transcription & in two versions with grammatical tagging (like
those for the LOB Corpus). For a detailed description of the SEC, see the ICAME Corpus
Collection's SEC manual; for the SEC Tag-set, see the
AMALGAM web site.
Ref: A Corpus of Formal British English Speech (1996), Knowles,
Gerald, Briony Williams & Lita Taylor (eds.), London: Longman. A collection
of research papers based on the SEC has also been published as Working
with Speech (1996), Knowles, Gerald, Anne Wichmann & Peter
Alderson (eds.), |
|
|
LeaP (Learning the Prosody of a foreign language) ( |
a large corpus of foreign language learners' speech (Target Languages are English & German, Native Languages span a wide range: German, Polish, Arabic, Chinese, Spanish, etc.). A multitude of data of various types is being collected: the corpus of spoken language will consist of at least 400 recordings of between 2 & 20 minutes length. It comprises three different speech styles: (i) read speech (a story of 268 wds); (ii) prepared speech (the re-telling of the story); (iii) free speech from an interview context. The central aim of the project is to provide a detailed description of non-native prosody. The second line of research aims to explore whether & how it is possible for learners of a foreign language to acquire the prosody of the target language without having a distinct "foreign accent". In a longitudinal study, various methods of teaching prosody will be tested. |
|
Limerick Corpus of
Irish-English (L-CIE) |
one-million word spoken corpus of Irish English discourse; conversations recorded in a wide variety of mostly informal settings throughout Ireland (excluding Northern Ireland); currently (accessed: Feb 2008) 375 transcripts; mainly casual conversation, but also over 200K wds of professional, transactional & pedagogic Irish English; not designed to be geographically representative (does not include data from every county); speakers range in age from 14 to 78; equal representation of both male & female speakers; designed to allow inter-corpus comparisons with CANCODE |
|
London-Lund Corpus (LLC) |
See description here |
|
5 m wds, demographically sampled speech from 12
regions (30 states) across the continental US; coordinated by the |
|
|
Machine-Readable Spoken English Corpus |
Some notes on MARSEC version 2 here (latest) or here (outdated). |
|
MICASE
( |
|
|
a corpus of spoken language containing recordings of
young male and female talkers (60 in total) from six regions of the United
States. Speech samples include isolated words, sentences, passages, and
interview speech. The purpose of the Nationwide Speech Project was to develop
a corpus of spoken language that can be used in acoustic and perceptual studies
of regional dialect variation in the |
|
|
a corpus of dialect speech from Tyneside in |
|
|
|
400,000 wds transcribed speech from 42 locations, across three age groups. Contact the Oxford Text Archive. |
|
PROSICE Corpus |
a collection of re-recorded ICE-GB texts with high technical specifications; syntactically analysed & temporally aligned. See here for more info. |
|
Reading/Leeds Emotional Speech Corpus |
prosodically & paralinguistically coded speech corpus for investigating suprasegmental & affective information in the speech signal. 4.5-hour database of machine-readable speech, of which 26 mins were transcribed using the extended ToBI system. Unfortunately, this corpus is NOT available for use by others, but you can find out more info from the people listed on the website, & also from here. |
|
Santa Barbara Corpus of Spoken American
English (SBCSAE) (University site is here) |
recordings of people talking -- people from all over the United States, in all walks of life, talking about & doing all sorts of things; 249,000 wds; 60 discourse segments of between fifteen & thirty minutes each. Transcripts & audio can be downloaded from the TalkBank site, and some can be heard & read at the same time (as multimedia presentations) through any browser from the TalkBank browser page (click on "CABank", then on "SBCSAE", then on one of the transcripts, then press the "play" button for Quicktime.) |
|
Spoken Corpus of the Survey of English Dialects |
|
|
Switchboard Corpus (SWB) |
a corpus of over 240 hours of recorded spontaneous (but topic-prompted) telephone conversations (2,438 conversations averaging 6 minutes in length each) recorded in the early 1990s; c. 3 m wds (3,044,734) of text, spoken by 543 unique speakers (302 males & 241 females) from most major dialect groups of American English. Info on the speakers' age, sex, education & dialect region. On average, each speaker participates in about 9 calls (but it ranges from 1 to 32). |
|
TRAINS
Spoken Dialogue Corpus on CD-ROM ( |
six & a half hours' worth of human-human dialogues; includes 55,000 wds & about 5,500 speaker turns. Audio files for the dialogues are available on the CD-ROM; |
|
Tyneside Linguistic Survey (TLS) |
Not much info available, but some given on the NECTE page.
The TLS corpus was compiled in the late 1960s, & consists of 86
loosely-structured 30-min interviews. The informants were drawn from a
stratified random sample of Gateshead in |
|
1 m wds of spoken New Zealand English collected from 1988 to 1994 (99% (545 out of 551 extracts) was collected between 1990 and 1994). Of the eight remaining files, four were collected in 1988 (4 oral history interviews) & four in 1989 (4 social dialect interviews). Consists of 2,000-word extracts (where possible) & comprises different proportions of formal, semi-formal & informal speech. Both monologue & dialogue categories are included & there is broadcast as well as private material collected in a range of settings. Access to recordings from the WSC is restricted to use at Victoria University of Wellington. A small number of the recordings which are shared with the ICE-NZ corpus will be made available on CD through ICE. |
|
|
|
Not generally available (?). Project aimed to analyse
socio-pragmatic norms of interpersonal communication in a wide variety of NZ
workplaces, with recordings done as unobtrusively as possible. Volunteers
tape-recorded a range of their everyday work interactions over a period of
time, collecting two-party & multiparty
meetings, informal work-related conversations, telephone calls, &
workplace small talk. Currently (2004) comprises 2000 interactions involving
>500 participants, recorded in a number of government departments &
commercial white-collar organizations, small businesses, & blue-collar
factories. Social talk & business or task-oriented talk, ranging from
short telephone calls of <1min to meetings >4 hrs long. Audio
recordings are supplemented by detailed on-site ethnographic observations,
written agendas & minutes, demographic & organizational info, &
video recordings. Contact Janet Holmes at the |
|
British National Corpus (BNC) |
Naturally, the spoken component of the British National Corpus is also a rich resource (although for phonetic/prosodic research you'll need to get the audio tapes from the British Library... these are now generally available, but the matching of tapes & actual BNC files is problematic). |
The LDC also contains various resources which
are not 'corpora' as such, but may be of interest. Example: the LDC
American English Spoken Lexicon, which is a collection of pronunciations
captured in individual audio files for more than 50,000 of the most common
words in English (words were extracted from newswire & telephone conversation; description & links to audio files here), or the West
Point Company G3 American English Speech Data, comprising 185
sentences read out by volunteers.
Diachronic Comparisons (recent changes in English) |
|
Since the first major English corpora were collected in the 1960s, it is now possible to compare these earlier corpora with more contemporary (1990s) corpora. For written British English, LOB can now be compared with FLOB, while for American English, it's Brown v. Frown. For spoken British English, the Diachronic Corpus of Present-Day Spoken English (DCPSE) allows comparisons of the London-Lund Corpus (LLC, 1960s) with the British component of the International Corpus of English (ICE-GB, 1990s). |
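For anyone who has plain-text copies of a matched pair such as LOB and FLOB, a comparison of this kind can be roughed out in a few lines of Python. What follows is only a minimal sketch under stated assumptions: the tokeniser is a crude regular expression, the two sample strings are invented toy data (not extracts from any corpus listed here), and the function names are made up for illustration. It normalises counts to frequency per million wds and computes the two-corpus log-likelihood statistic often used for this sort of comparison.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens (crude, illustrative)."""
    return re.findall(r"[a-z']+", text.lower())


def per_million(count, corpus_size):
    """Normalise a raw count to a frequency per million words."""
    return count * 1_000_000 / corpus_size


def log_likelihood(count_a, size_a, count_b, size_b):
    """Two-corpus log-likelihood for one word; a zero count contributes nothing."""
    total = count_a + count_b
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    ll = 0.0
    for observed, expected in ((count_a, expected_a), (count_b, expected_b)):
        if observed > 0:
            ll += observed * math.log(observed / expected)
    return 2 * ll


def compare(word, earlier_text, later_text):
    """Compare one word's frequency across an earlier and a later text sample."""
    earlier_tokens = tokenize(earlier_text)
    later_tokens = tokenize(later_text)
    count_a = Counter(earlier_tokens)[word]
    count_b = Counter(later_tokens)[word]
    return {
        "earlier_per_million": per_million(count_a, len(earlier_tokens)),
        "later_per_million": per_million(count_b, len(later_tokens)),
        "log_likelihood": log_likelihood(
            count_a, len(earlier_tokens), count_b, len(later_tokens)
        ),
    }


# Toy stand-ins for an earlier and a later sample (invented, NOT real corpus data).
earlier_sample = "we shall go and we shall see what we shall find"
later_sample = "we will go and we will see what we shall find"

result = compare("shall", earlier_sample, later_sample)
print(result)
```

With real corpora the two samples would be read from files, and a proper tokeniser (or the corpus's own word-class tagging) would replace the regular expression. Normalising per million wds matters less for LOB v. FLOB, which are the same size, than for pairs of unequal size.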
Historical Corpora or Collections (English) |
|
|
ARCHER
Corpus |
1.8 m wds (so far --May 2009)
of British & American English from written & "speech-based"
genres sampled from 7 historical periods covering Early Modern
English to the present (range: 1650-1990); 1,037 texts; 10 registers
(e.g., drama, letters, science prose) representing speech-based, popular,
& specialist/academic written registers. Complements the |
|
Brooklyn-Geneva-Amsterdam-Helsinki Parsed Corpus of Old English |
a selection of texts from the Old English Section of the Helsinki Corpus of English Texts; contains 106,210 wds of Old Eng text; the samples from the longer texts are 5,000 to 10,000 wds in length; texts represent a range of dates of composition, authors, & genres. For a list of the texts included in the Brooklyn Corpus, click here. The texts are syntactically & morphologically annotated, & each word is glossed. Size of the corpus: c.12 megabytes. |
|
Century of Prose Corpus |
half a m wds of literary & non-literary English; 1680-1780; 120 authors. (Not sure where the Web site is…) |
|
3,022 texts representing all extant Old English texts,
compiled at the |
|
|
1.2-m wds of Early Modern English speech-related
texts (177 text files). The CED contains texts representative of five text
types (plus a mixed bag of dialogues labelled 'Miscellaneous'), which divide
into two categories: these are 'authentic dialogue', which is written records
of real speech events (Trial Proceedings & Witness Depositions),
& 'constructed dialogue', in which
the dialogue is constructed by an author (Drama Comedy, Didactic Works, &
Prose Fiction). |
|
|
approximately 800,000 wds of running text drawn from all the newsbooks present in the Thomason Tracts that were published from December 1653 to May 1654. |
|
|
Corpus of Middle English Prose & Verse (CME) (or visit the parent site, the Middle English Compendium) |
collection of Middle Eng texts assembled from works contributed by Univ of Michigan faculty & from texts provided by the Oxford Text Archive, as well as works created specifically for the Corpus (archive last updated in October 2000). All 61 texts in the archive are valid SGML documents, tagged in conformance with the TEI Guidelines, & converted to the TEI Lite DTD for wider use. Web-searchable. |
|
Corpus
of Early English Correspondence (CEEC) & the Parsed
Corpus of Early English Correspondence (PCEEC), |
2.7 m wds; 1410 to 1681 (CEECS = 450,000 wds); a supplement, the "Corpus of Early English Correspondence Supplement" (CEECSu; 0.44 m wds) extends the time range: 1402-1663, while the "Corpus of Early English Correspondence Extension" (CEECE; 2.2 m wds) covers the period 1681-1800. The project home page & the manual at ICAME give more details. |
|
Corpus
of Early English Medical Writing & Corpus
of Middle English Medical Texts (MEMT) |
a corpus of medical treatises from 1375-1800. Shorter
texts are included in toto & longer treatises are represented by extracts
of approximately 10-12 K wds. The medieval section contains about 500,000
wds |
|
c.300,000 wds of local English letters on practical
subjects, dated 1761-89, as a sample of the English language of the
north-west of |
|
|
A 100K-word corpus of informal private letters by British writers, covering the period 1861 to 1919. (Range of dates by birth-date of writer is narrower: 1837-67.) Available from the Oxford Text Archive & through the owner (David Denison). |
|
|
c.10 m wds; a principled collection of texts drawn from the Project Gutenberg & Oxford Text Archive; Ten m wds of running text, divided over three 70-year sub-periods from 1710-1920. |
|
|
Corpus of Early American English |
English in |
|
|
830,000 wds; 1450-1700, from fifteen genres. |
|
Early English Books On-line (EEBO) (subscription required) |
(images of original print documents, with some now searchable as texts) "From the first book published in English through the age of Spenser & Shakespeare, this incomparable collection now contains about 100,000 of over 125,000 titles listed in Pollard & Redgrave's Short-Title Catalogue (1475-1640) & Wing's Short-Title Catalogue (1641-1700) & their revised editions, as well as the Thomason Tracts (1640-1661) collection & the Early English Books Tract Supplement." |
|
c. 1.5 m wds; 242 files; covers the period from c. 750 to c. 1700 (Old English to Early Modern) |
|
|
Innsbruck Computer-Archive of Machine-Readable English Texts (ICAMET) |
(1) The Prose Corpus of ICAMET: compilation of 129 texts (March 1999) of Middle Eng prose (1100-1500), digitalized from extant editions & constantly enlarged by further files. Since it is a full-text database, it particularly aims at target groups of users who, unlike those of the Helsinki Corpus, are not so much interested in extracts of texts, but in their complete versions. Thus allows literary, historical & topical analyses of various kinds, esp. studies of cultural history. It also invites linguists to raise questions of style, rhetoric or narrative technique, for which one would want a lengthier piece of text or even the complete text. (2) The Letter Corpus of ICAMET contains 254 complete letters, arranged diachronically, from different sources (written between 1386 & 1688). Particularly encourages pragmatic & sociolinguistic studies, & analyses concerning cultural life & lifestyle. |
|
NEET (Network of Early Eighteenth-century English Texts) |
c. 3-million-words, 18th Century English registers. No more information available, but contact Douglas Biber for more details. |
|
750,000 wds; manuscript newsletters from 1674-92. |
|
|
Contains the proceedings of the Old Bailey, |
|
|
1m wds of English pamphlet literature covering 1640-1740. Text samples are taken from each decade within this century & several genres are represented. Contains the whole text of pamphlets, rather than fragments. |
|
|
(Under construction: 15 months from October 2003) |
1-million-word corpus which matches as closely as possible the LOB & FLOB corpora of written British English, except that the year of data collection is 1931, or near to that date (+/- 3 years). The immediate purpose of building this corpus is to make it possible to compare these three temporally equidistant corpora (1931, 1961, 1991): "Pre-LOB", LOB, & FLOB. This will enable tracking of grammatical change through a period of 60 years of the 20th century. Under construction & as yet unnamed (?) |
|
prose text samples of Middle Eng, annotated for syntactic structure. Designed for the use of students & scholars of the history of English, especially the historical syntax of the language |
|
|
100-m wds from TIME magazine, 1923-2006. Allows you to
see how wds & phrases have increased or decreased in usage, or
changed meaning, over time. |
|
|
The Brown University Women Writers Project's main undertaking
is an SGML-encoded full-text database of pre-Victorian women's writing in
English (at present, it covers 1400 to 1850). This collection
currently includes nearly 200 texts representing a broad cross-section of the
literate culture of pre-Victorian |
|
|
1.5 million word syntactically-annotated corpus of Old English prose texts; sister corpus to the Penn-Helsinki Parsed Corpus of Middle English (uses the same form of annotation & is accessed by the same search engine, CorpusSearch). The corpus itself (the annotated text files) is distributed by the Oxford Text Archive. Free for non-commercial use. |
|
|
a selection of poetic texts from the Old English Section of the Helsinki Corpus of English Texts; 71,490 wds of Old English text; the samples from the longer texts are 4,000 to 17,000 wds in length. The texts represent a range of dates of composition & authors. For a list of the texts included in the York Poetry Corpus, click here. The texts are syntactically & morphologically annotated. |
|
|
Zürich Corpus of English Newspapers (ZEN) |
|
* See also the Early
Modern English Dictionaries Database (EMEDD description here)
Corpora for research on 1st language acquisition |
|
|
Child Language Data Exchange System (CHILDES) XML database here |
c.20 m wds (180m characters), 20 languages. The CHILDES system provides tools for studying conversational interactions. These tools include a database of transcripts, programs for computer analysis of transcripts, methods for linguistic coding, & systems for linking transcripts to digitized audio & video. Includes a language acquisition bibliography |
|
a digitized collection of project work produced by children aged between 9 & 11; part of a larger research program (a longitudinal study of children's writing-for-learning, based on the writing of 8-12 year old children) |
|
|
Polytechnic of |
100,000 wds spoken English by 120 children, aged 6-12;
parsed according to Hallidayan Systemic-Functional Grammar. See the manual here.
Distributed from two places: The Oxford
Text Archive organised by Lou Burnard, & ICAME in |
Learner Corpora, Lingua Franca Corpora (for various languages / 2nd Lg Acquisition research)
(Language produced by non-native speakers/writers)
* See Yukio Tono's Learner Corpora Resources web page for a more comprehensive index to learner corpora web sites (e.g. the various ICLE projects for learner English, such as SWICLE (Sweden), BRICLE (Brazil) & PICLE (Poland)), plus a useful bibliography on learner corpora |
|
|
International Corpus of Learners' English (ICLE) |
As of May 2009, over 3.7 m wds of writing
by advanced/university learners of English (EFL, not ESL) from 25
different mother tongue backgrounds (e.g. Arabic, Brazilian Portuguese,
Bulgarian, Chinese, Czech, Dutch, Finnish, French, German, Norwegian,
Hungarian, Italian, Japanese, Polish, Russian, Spanish, South African,
Swedish, Turkish). Two types of essay writing: (1) Argumentative essays
(untimed); using language reference tools (dictionaries, grammars, etc.) but
entirely the students' own work, i.e. no quoting, no native speaker help; (2)
Literature examination papers (no more than 25% of each national corpus).
Each Essay: between 500 and 1,000 wds long. In May 2009, there were 5,554
argumentative essays & 531 literary or 'other' essays. The ICLE corpus is now available for purchase (CD-ROM, version 1.1.) from i6doc.com here. |
|
( |
a corpus of spoken learner English from learners from 11 different language backgrounds (Bulgarian, Chinese, Dutch, French, German, Greek, Italian, Japanese, Polish, Spanish, & Swedish to date). Two types of speech: informal interviews (free talk on a given topic) and picture-prompted speech (based on a standard set of pictures). There is a comparable corpus of speech from English native speakers called LOCNEC (Louvain Corpus of Native English Conversation). |
|
Chinese Learner English Corpus (CLEC) |
1 m wds of Eng compositions collected from 5 different
levels of Chinese learners of Eng, tagged according to an error tagging
scheme of 61 types of error (excludes stylistic errors & error sources,
which are difficult to tag objectively & consistently); consists of a
book & CD-ROM. The book has an introduction (in Chinese) which gives an
account of the corpus design, the methodology used in the statistical
analysis of the corpus, and the major findings, + an Alphabetical List, a
Lemmatized List, a Word-Frequency Distribution, a Summary Table of Errors,
& a List of Spelling Mistakes. The CD-ROM consists of the error-tagged
corpus with a simple concordancer, & all the lists & tables of the
book. Another companion to CLEC known as Analysis of Chinese Learner Errors
in Eng is forthcoming. Available by mail: |
|
academic writing samples from non-native speakers
of Eng taking study Skills/EAP pre-sessional & undergrad courses. There
is also a small native speaker subcorpus that can be used for comparison.
Some sub-corpora are organised according to writing task & topic,
writer's L1, writing conditions & time at which the piece was produced;
contains more than one piece of writing from each learner, & these
comprise similar essays written by the same learner at different points in
time (e.g., before, during & after the pre-sessional course), as well as
different types of essays (e.g., descriptive, argumentative, etc.) written by
the same learner at the same or different times. A longitudinal sub-corpus of LANCAWE is
called the Hinestroza-Kim
Corpus (HKC). |
|
|
MELD
( |
English (ESL) text written by all levels of learners in |
|
ELFA |
recordings & transcripts of spoken English
used as a lingua franca in academic settings ( |
|
VOICE |
a corpus of English as a Lingua Franca (i.e.,
English as the means of communication regarded as the most convenient one by
speakers from different first-language backgrounds). The focus is on
unscripted, largely face-to-face communication among competent speakers from
a wide range of L1 backgrounds whose primary & secondary education & socialization
did not take place in |
|
a corpus of French as a foreign language, with a target size of 450,000 wds. |
|
|
(English |
c. 2 m wds of unrestricted running text written by
learners of English in |
|
|
the biggest corpus of Chinese (Cantonese) learners of
English (or, indeed, of any single group of learners of English). 25 m wds,
with grammatical & discourse-feature tags. Texts consist of written
undergrad assignments & 'A-level' scripts. Contact: Gregory James,
Language Centre, |
|
Learner Business Letters Corpus |
209,461 word tokens in 1,464 letters written by Japanese business people. Searchable through a web concordancer here. More details about the collection, constitution, etc. of the corpus can only be found by browsing through Someya's M.A. dissertation available on-line here. |
|
a computerized archive of the spontaneous second language acquisition of forty adult immigrant workers living in Western Europe, & their communication with native speakers in the respective host countries (France, Germany, Great Britain, The Netherlands & Sweden). For each target language, two source languages were selected. |
|
|
Hungarian university students' English |
|
|
(Interactive Spoken Language Education) |
[not really a "corpus" as such]; database of non-native English created to help train & test the ISLE automatic pronunciation tutor system; approx. 20 minutes of speech (per speaker) from 23 German & 23 Italian intermediate learners of English. Each speaker recorded sentences from several blocks of differing types (reading simple sentences, using minimal pairs, giving answers to multiple choice questions). The prompts were of varying perplexities. About 2/3 of the data for each speaker was annotated by one of a team of linguists. The files were corrected first at the word level, & an automatic recognizer was then used to produce phone-level annotations. The annotator then re-annotated each sentence to mark phone & stress errors (e.g., substitutions, insertions, or deletions). It comprises a total of 46 speakers (23 German & 23 Italian); 11,484 utterances; 1.92 gigabytes of WAV files; 17 hours, 54 minutes, & 44 seconds of speech data. It is distributed on 4 CD-ROMs. Contact ELRA for purchasing information. |
|
The Polish component of ICLE. This corpus, along with some comparable English (undocumented) & Polish corpora, can be searched on-line using various tools provided. |
|
|
SILS Learner Corpus of English (Waseda Univ) |
essays by students at SILS, the School of International Liberal Studies at Waseda Univ, Japan; wide variety of backgrounds (majority Japanese); can be used to look at the effects of native language and educational background on writing skills in English; will collect many essays from each individual student (longitudinal), and both 1st and 2nd drafts, with teachers' comments. |
|
Thai English Learner Corpus (TELC) No Web site I can find, but related site is here |
written corpus of 1.3 m wds (on 23/5/2002), tagged for part of speech & lemma. Comprises writing samples of Thai EFL university students, collected from 1997 onwards & still growing. 700,000 wds of written Eng taken from university entrance exams at the Institute for English Language Education (IELE, Assumption University, Thailand) & 600,000 wds from essays written by fourth year Thai EFL learners at the Institute. Searchable on-line, but limited to 100 concordance lines. For full access, contact the owners. |
|
modelled on ICLE; corpus of 200K wds of argumentative essays from advanced learners of English in institutions of higher learning in South Africa. |
|
|
a corpus of recordings
made of oral presentations given in English by non-native speakers of Eng at
Eurospeech'93 in |
|
|
Not generally available (except by arrangement with publishers). Students & teachers throughout the world sent in essays & exam scripts to help create the Longman Learners' Corpus, a 10-million word computerised database made up entirely of language written by students of English. Every nationality, every language level is represented in the corpus & this provides a unique insight into learner English. |
|
|
Not generally
available. A large collection of examples of English writing from
learners of English all over the world; over 15 m wds & expanding all the
time; part of the Cambridge International Corpus (CIC); comes from anonymised
exam scripts written by students taking Cambridge ESOL English exams around
the world; each script is coded with information about the student's first
language, nationality, level of English, age, etc. Currently, it can only be
used by authors & writers working for |
|
Specialised Corpora of EnglishMany of these are suitable for ESP teaching, learning & research |
|
|
four sub-corpora: (1) 32 examples of 'The author's acknowledgements' in published books (2) 6 examples of 'The publishers' acknowledgements' in published books (3) 5 examples of 'Acknowledgements in research articles' placed as footnotes in the papers (4) 6 examples of 'Acknowledgements in research articles' placed just before the references. The associated guide to writing acknowledgments is here. |
|
|
about 63 m wds of plain orthographic English collected by the Association for Computational Linguistics' Data Collection Initiative; consists of: the Collins English Dictionary; selections from the Wall Street Journal (40m wds); a database of scientific abstracts from the U.S. Department of Energy (23m wds); the `Penn Treebank' of skeleton-parsed data compiled by Mitch Marcus & his team at the University of Pennsylvania (Marcus & Santorini, 1992). |
|
|
70 hours of
recorded conversation between controllers & aircraft in three major
airports of the |
|
|
American Heritage Intermediate (AHI) Corpus |
5.09 m wds; based
on a 1969 survey of US schools; 10,043 samples, each 500 wds long, from
publications which were widely read among American schoolchildren aged 7 to
15 years. See: Carroll, John, Peter Davies & Barry Richman (1971) (eds.) Word
Frequency Book. |
|
Asian Newspaper English ( |
A web-based concordance derived from a corpus of 114,502 wds (13,971 types) from English-language newspapers in 18 Asian countries, dated September-November 2000, inclusive. Compiled for teaching & demonstration purposes only; it should not be seen as a representative sample, & the texts may not be re-distributed in any form. |
|
BASE
|
The British analogue to MICASE. A corpus of university lectures & seminars
developed at the Universities of |
|
BAWE |
A corpus of good-quality student assignments across disciplines, from
first year undergrad to masters level, developed at the Universities of
Warwick, Reading, Oxford Brookes & Coventry, under the directorship of
Hilary Nesi, with Paul Thompson, Sheena
Gardner & Paul
Wickens. 2,761
assignments from 627 student contributors in 33 university departments,
totalling 2896 independent texts (6,514,776 wds). Corpus development
was funded by the Economic & Social Research Council (2004-2007). The
corpus will be available to researchers from the Arts & Humanities Data
Service & the ESRC Data Archive. |
|
The collected posts of 19,320 bloggers gathered from blogger.com in August 2004. The corpus incorporates a total of 681,288 posts and over 140 million words - or approximately 35 posts and 7250 words per person. |
|
|
|
American English, task-oriented monologues, both read & spontaneous; multiple non-professional speakers who were given written instructions to perform a series of increasingly complex direction-giving tasks. No known web site, & probably not generally available. More information available in publications based on this corpus, such as this one. |
|
Business Letters Corpus |
Someya's corpus of
Business Letters ( 1,020,060 word tokens of |
|
Carnegie Mellon Communicator Corpus (details here) |
a large corpus of speech produced by callers to a Travel Planning system; around 180,605 utterances (90.9 hours) in 2002. |
|
(CHaracterizing INdividual Speakers) |
a novel speech corpus which may be of interest to those looking at diverse speaking styles, & those seeking to characterize speaker identity; features approximately 36 speakers recorded under a variety of speaking conditions, allowing comparison of the same speaker across different well-defined speech styles. Speakers read a variety of texts alone, in synchrony with a dialect-matched co-speaker, in imitation of a dialect-matched co-speaker, in a whisper, & at a fast rate. There is also an unscripted spontaneous retelling of a read fable. The bulk of the speakers were speakers of Eastern Hiberno-English. Free for research purposes. |
|
Circle Archive (old site was here) |
a collection of (mostly freely
downloadable) transcripts of tutorial sessions. [Used
mainly by researchers in education, psychology & cognitive science, I
believe. All were collected in the |
|
a collection of human-human computer-mediated dialogues in which two subjects collaborate on a simple task, buying furniture for the living & dining rooms of a house |
|
|
( |
spoken language of 13 to
17-year-old teenagers from different boroughs of |
|
a major research project of PERC (Professional English Research Consortium) currently underway that, when finished, will consist of a 100-million-word computerized database of Eng used by professionals in science, engineering, technology, law, medicine, finance & other fields. |
|
|
Corpus of
Spoken Professional American-English (CSPA) |
2-m-word part-of-speech tagged corpus consisting of transcripts of American Eng spoken in professional settings (committee meetings, faculty meetings & White House press conferences); recorded 1994-1998; consists primarily of short interchanges by approximately 400 speakers that are centered on professional activities broadly tied to academics & politics, including academic politics; seventeen files (12 MB). |
|
Mark Sebba's project. Some introductory notes about the corpus here. |
|
|
(click on "CorTec" on the left menu) |
a bilingual (English & Portuguese) comparable corpus of technical language (linked to the COMET project) in 5 areas: Cooking, Contracts, Computing, Environment & Hypertension. For copyright reasons, the corpora themselves cannot be accessed, but they can be searched with the tools provided: Concordancer, wordlist & N-gram extractor. |
|
Dialogue Diversity 'Corpus' (DDC) |
Not, technically speaking, a 'corpus' as such, but a collection of links to different dialogue texts (transcriptions &/or sound files), covering a very diverse collection of interactive situations--a data resource for studies of the breadth of coverage of particular dialogue models, & for studies that compare dialogue from different situations. Taken as a whole, this 'corpus' is irregular & not homogeneous in any way. It is generally unsuitable for drawing any conclusions about dialogue taken as a single category. |
|
collected & prepared by the CALO Project (A Cognitive Assistant that Learns & Organizes). Contains e-mails from about 150 users, mostly senior management of Enron, organized into folders. The corpus contains a total of about 0.5M messages. This data was originally made public, & posted to the web, by the Federal Energy Regulatory Commission during its investigation. Does not include attachments, & some messages have been deleted "as part of a redaction effort due to requests from affected employees", & some email addresses were anonymized. Probably the only substantial collection of "real" email that is public (because of privacy concerns). In using this dataset, please be sensitive to the privacy of the people involved (& remember that many of these people were certainly not involved in any of the actions which precipitated the investigation). |
|
|
|
1 m wds (2,000 units of
about 500 wds each) from written English texts from the physical sciences,
engineering & technology (divided into the following ten subject areas:
Computers, Metallurgy, Machine Building, Physics, Electrical Engineering,
Civil Engineering, Chemical Engineering, Naval Architecture, Atomic Energy,
Aircraft Manufacturing). Randomly selected from theses, textbooks, academic
works, popular science & science digests, published in the |
|
Hyland's Research Articles Corpus |
Ken Hyland's personal corpus of published research articles, representing written academic English. Not available to the general public, but contact owner directly for more info. Consists of 30 texts each from 8 disciplinary areas (biology, engineering, mechanical engineering, linguistics, marketing, philosophy, sociology, physics), totalling 1.3m wds. |
|
academic writing samples
from non-native speakers of Eng taking study Skills/EAP pre-sessional
& undergrad courses. There is also a small native speaker subcorpus that
can be used for comparison. Some sub-corpora are organised according to
writing task & topic, writer's L1, writing conditions & time at which
the piece was produced; contains more than one piece of writing from each
learner, & these comprise similar essays written by the same learner at
different points in time (e.g., before, during & after the pre-sessional
course), as well as different types of essays (e.g., descriptive, argumentative,
etc.) written by the same learner at the same or different times. A longitudinal sub-corpus of LANCAWE is
called the Hinestroza-Kim Corpus (HKC). |
|
|
(various languages) |
corpora in different languages using the same format & comparable sources (identical in format & similar in size & content). Randomly selected sentences. Available in sizes of 100,000 sentences, 300,000 sentences, 1 million sentences etc. The sources are either newspaper texts or texts randomly collected from the Web. All data (publicly accessible, copyrighted sources) have been processed automatically so that it is not possible to reconstruct the original source texts. Significant L1, R1 & "within sentence" collocates are computed for each word. Available as plain text files, or as MySQL database tables (ready to use with a supplied Corpus Browser) |
|
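The three kinds of collocate mentioned above (left neighbour, right neighbour, same sentence) can be counted from one-sentence-per-line text along the following lines. This is only a rough sketch: the function name `neighbour_counts` and the whitespace tokenisation are assumptions made for illustration, & it produces raw counts only. The significance test used to decide which collocates count as "significant" is not named in the description above, so none is applied here.

```python
from collections import Counter, defaultdict

def neighbour_counts(sentences):
    """Count left neighbours (L1), right neighbours (R1) and
    same-sentence co-occurrences for every word (raw counts only)."""
    left = defaultdict(Counter)    # left[w][x]: x occurs immediately before w
    right = defaultdict(Counter)   # right[w][x]: x occurs immediately after w
    within = defaultdict(Counter)  # within[w][x]: x occurs in the same sentence as w
    for sentence in sentences:
        tokens = sentence.lower().split()  # naive whitespace tokenisation (assumption)
        for i, word in enumerate(tokens):
            if i > 0:
                left[word][tokens[i - 1]] += 1
            if i + 1 < len(tokens):
                right[word][tokens[i + 1]] += 1
        # each pair of distinct word types is counted once per sentence
        types = set(tokens)
        for word in types:
            for other in types - {word}:
                within[word][other] += 1
    return left, right, within

sentences = ["the learner corpus", "the spoken corpus", "a learner essay"]
left, right, within = neighbour_counts(sentences)
print(left["corpus"].most_common())
print(right["learner"].most_common())
print(within["the"].most_common())
```

To get from raw counts to "significant" collocates you would still need an association measure & a threshold of your own choosing.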
|
compiled at the |
|
LOCNESS
( |
a corpus of native English essays made up of: British pupils' A level essays (60,209 wds), British university students' essays (95,695 wds), American university students' essays (168,400 wds). Total: 324,304 wds |
|
METER Corpus (MEasuring TExt Reuse) |
collected from British PA (Press Association) archive & 9 British national newspapers; 528,563 wds from the two journalistic domains of 'Law & Courts' & 'Show Business'; project aim was to develop techniques for detecting & measuring text reuse (mapping derived texts to their source texts, indicating the probability of derivation). One CD-ROM (free) |
|
MICASE
( |
a free (& web-accessible) spoken American Eng corpus of c.1.7 m wds (190 hours of recordings) focusing on contemporary university speech within the microcosm of the Univ of Michigan. Has a free-to-use accompanying web concordancer/search engine that can search by speaker or speech event attributes. Speakers include faculty, staff, & all levels of students (mostly native, some non-native speakers) across several speech events (incl. monologic & interactive speech) from all of the major academic divisions (with the exception of the professional schools, i.e., medical, dental, business, & law). 15 different types of speech event: small/large lecture, public interdisciplinary or departmental colloquia, discussion sections, student presentations, seminars, undergraduate lab sessions, lab group & other meetings, one-on-one tutorials, office hours, advising consultations, dissertation defenses, study groups, interviews, campus/museum tours, & service encounters. Full transcripts can be ordered for a nominal fee (XML format). Some audio recordings of the original speech events are available here (streaming Realaudio), or in other formats by special arrangement to bona fide researchers. A manual giving more detailed information about the corpus is here. |
|
MICUSP
(Michigan Corpus of Upper-level Student Papers) |
1.6 m wds; assessed genres of writing
by senior undergraduate (4th year) & graduate students in the |
|
a parallel corpus of English-German scientific medical abstracts; c. 1 million tokens for each language. Abstracts are from 41 medical journals, each of which constitutes a relatively homogeneous medical sub-domain (e.g. Neurology, Radiology, etc.). The corpus of downloaded HTML documents was normalized in various ways in order to produce a clean, plain text version consisting of a title, abstract and keywords. Additionally, the corpus was aligned on the sentence level. |
|
|
NIE Corpus of
Spoken Singapore English (NIECSSE) |
aims to provide high-quality recordings of Singaporean speakers in order to facilitate acoustic/phonetic analysis of Singapore English. To eliminate background noise & thereby facilitate acoustic/phonetic measurement, all recordings were made directly onto the computer in the NIE Phonetics Laboratory. Consists of interviews & a read text. |
|
|
|
|
|
comprises 98,538
English wds & information on the spelling, syntactic category &
number of letters for each of these as well as information on the phonetics,
syllabic count, stress patterns & various criteria affecting
comprehension. See also notes on the use of a psycholinguistic database by |
|
c. 1 m wds
composed of twenty research articles written by |
|
|
Reuters Corpus, Volume 1, English language, 1996-08-20 to 1997-08-19 |
|
|
[For Discourse Analysis, message understanding, etc.] |
a selection of |
|
Freely downloadable. Has seven parts: Part 1: Complete Conversations; Part 2: Indianapolis Interviews; Part 3: Jokes; Part 4: Drawing Experiment; Part 5: Kassel Classroom Discourse; Part 6: Stories; Part 7: London Teenage Talk. Transcripts & audio can be downloaded from the TalkBank site, and some can be heard & read at the same time (as multimedia presentations) through any browser from the TalkBank browser page (click on "CABank", then on "SCoSE", then on one of the transcripts, then press the "play" button for Quicktime). |
|
|
[ |
a multimodal corpus
database of "education discourse" in |
|
contains documents in Scottish Standard English, documents in several varieties of Scots, & everything in between. While Scottish Standard English has a standard written form, Scots does not. This means that the corpus contains a wide range of spelling variation (steps being made to offer a means of searching for all of the variant spellings automatically in a later stage of the project). SCOTS is a publicly available resource on the Internet. |
|
|
8 sociolinguistic interviews, 9 speakers. William Labov & one of his students conducted the interviews in the 1960s & 70s. These interviews represent solutions to the problems of achieving cross-cultural contact, reducing the effect of the Observer's Paradox & approximating the vernacular of everyday life. Complete interview recordings plus time-aligned verbatim transcripts for each speaker. Also included: (i) a sociolinguistic variable survey that represents an overview of the intra- & inter-speaker variation attested in the corpus, highlighting a broad range of phonological, phonetic, grammatical, lexical & stylistic variables. (ii) a number of annotation tools that allow users to listen to each interview while browsing the corresponding transcripts, & to display & hear each token identified in the variable survey. The recordings demonstrate successful interviewing techniques, the sound quality is high, & the digitization, segmentation & transcription of the data represent best practice in these areas. The variable survey highlights over 150 sociolinguistic variables attested in the corpus & suggests avenues for further research. Most importantly, the SLX Corpus provides both an example of a digital speech corpus developed specifically to support sociolinguistic research, & a stable benchmark for training in sociolinguistic data collection, digitization, segmentation, transcription, analysis & publication. 17 speech files (22050Hz, 16 bit, single-channel in the MS WAV (RIFF) format), total of 575 minutes (~ 1.5GB); DVD-ROM. |
|
|
A corpus of around 250,000 wds annotated for categories of speech, thought & writing presentation; genres included: fiction, newspaper reports, biographies/autobiographies. Corpus not generally available to the public yet. |
|
|
a large corpus of actual travel agent interactions with client callers |
|
|
Switchboard Corpus (LDC) |
See separate entry above |
|
a set of 186 news report documents annotated with the 1.1 version of the TimeML standard for temporal annotation. This release should also include a copy of the TimeML schema version 1.1. |
|
|
contemporary translational English: written texts translated into English from a variety of source languages, European & non-European. Supports a broad range of studies in two main areas: the way in which the patterning of translated text might be different from that of non-translated text in the same language, & stylistic variation across individual translators. Set up & currently managed by Mona Baker. |
|
|
"read speech" designed to provide speech data for the acquisition of acoustic-phonetic knowledge & for the development & evaluation of automatic speech recognition systems; contains broadband recordings of 630 speakers of 8 major dialects of American English, each reading 10 phonetically rich sentences |
|
|
T2K-SWAL Corpus (The TOEFL 2000 Spoken & Written Academic Language Corpus) |
[Owned by the Educational
Testing Service (ETS), |
|
Wolverhampton Business English Corpus [description & purchasing information from |
10,186,259 wds in the general domain of business, collected from 23 different web sites around the world (from six months within the period 1999-2000), covering a wide variety of categories including product descriptions, company press releases, annual financial reports, business journalism, academic research papers, political speeches & government reports. POS-tagged. Alternatively, on this site, you can see & compare frequency lists & ngrams for various subcorpora/text genres (including business texts). |
Text Archives & Corpus Distribution Sites (various languages)(see also the D-I-Y corpora section) |
|
|
Alex Catalogue of Electronic texts |
a collection of digital documents collected in the subject areas of English literature, American literature, & Western philosophy. Basic concordancing & browsable, downloadable full texts. |
|
a gateway to rich
primary source materials relating to the history & culture of the |
|
|
makes databases of spoken German accessible in a well structured form to the speech science community as well as to speech engineering |
|
|
( |
combines an on-line archive of tens of thousands of SGML & XML-encoded electronic texts & images with a library service that offers hardware & software suitable for the creation & analysis of text. SGML texts are converted to HTML when you select them in your web browser. Has texts in English (Middle & modern), German, French, Latin, Apache, Japanese, Chinese, etc. |
|
Oxford Text Archive (OTA) |
"holdings include electronic editions of works by individual authors, standard reference works such as the Bible & mono-/bilingual dictionaries, & a range of language corpora"; "electronic texts & corpora of interest not only to literary textual scholars, but also those working in linguistics, history, law, modern & ancient languages, indeed almost any humanities discipline which relies upon a close reading of texts." |
|
(or mirror site here) |
books published pre-1923, anything out of copyright; e.g. Shakespeare, Poe, Dante, Sherlock Holmes stories by Sir Arthur Conan Doyle, the Tarzan & Mars books of Edgar Rice Burroughs, Alice's adventures in Wonderland as told by Lewis Carroll, & thousands of others. String frequency reports for 5400+ books (400M wds) from Project Gutenberg available at Ronald Reck's site (but read this Corpora List message for details) |
|
ELDA (European Language Resources Distribution Agency) |
the distribution arm of ELRA (European Language Resources Association). Has a searchable catalogue covering their speech resources, written corpora & terminological resources. |
|
ICAME (International Computer Archive of Modern & Medieval English) |
Collects & distributes information on English language material available for computer processing & on linguistic research completed or in progress on the material. The ICAME CD-ROM (20 different corpora, totalling > 17 m wds) contains most of the important English Language corpora used in research. |
|
TELRI Research Archive of Computational Tools & Resources (TELRI = Trans-European Language Resources Infrastructure); Corpora in 20 languages; Parallel corpora in a variety of pairings; Software for processing corpus evidence; Lexicons & other language-information resources. |
|
|
supports language-related education, research & technology development by creating & sharing linguistic resources: data, tools & standards. Has lots of specialised corpora for many languages (most of them, however, intended for NLP). |
|
|
OLAC |
has a search facility covering the resource catalogs of LDC, ELRA & the ACL/DFKI Natural Language Software Registry, & permits single searches to be applied to all catalogs simultaneously. The OLAC cross-archive search engine now harvests 11,000+ records from 12 OLAC archives. Try it out using the query box in the top right corner of the web page OR the more advanced search facility hosted by the Linguist List. OLAC is an international
partnership of institutions & individuals who are creating a worldwide
virtual library of language resources by: (i) developing consensus on best
current practice for the digital archiving of language resources, & (ii)
developing a network of interoperating repositories & services for
housing & accessing such resources. OLAC was founded at the Workshop on Web-Based Language Documentation & Description, held
in |
|
RELATOR (European Linguistic Resources Repository Network) |
a CEC-funded initiative which addresses the vital area of linguistic resources for spoken & written language processing |
Non-English, Parallel & Multilingual Corpora (click here for separate page
|
The Web as a Corpus(Concordancing current/'live' web pages on the net) |
|
|
KWiCFinder
& WebKWiC |
KWiCFinder (Key Word in Context Finder) is a free stand-alone web search agent optimized for multilingual searches, which builds on AltaVista's support for complex Boolean searches & enlists XML technology to provide a complete range of report options in five different languages to display the Key Words of your search in Context. KWiCFinder makes your web searches more selective, carries them out automatically, & excerpts matching webpages to display your KeyWords in Context so you can efficiently evaluate the documents' relevance. In contrast, WebKWiC (Web Key Word in Context) complements Google.com's popular search engine to simplify & accelerate the task of online research. Both programs automate the process of evaluating documents matching your search terms. Each has strengths & weaknesses which reflect characteristics of the search engines they rely on & the reporting technology they implement. Both are available as free downloads from miniAPPolis.com. |
|
WebCONC |
a tool for generating KWIC-concordances based on webpages (KWIC = Keyword in context). There are two options for defining your corpus: let Google search the relevant webpages for you or specify a set of URLs yourself. |
|
Concordances the Web. You enter a word or phrase, choose options from the menus provided & then press the `Submit' button. WebCorp works 'on top of' the search engine of your choice, taking the list of URLs returned by that search engine & extracting concordance lines from each of those pages. All of the concordance lines are presented on a single results page, with links to the sites from which they came. * Also does a frequency listing of words on a web page (from your chosen URL). |
|
|
can be used to perform syntactic searches (done graphically via parse trees) on Internet data. Currently available are a three-million-sentence corpus of sentences from the Internet Archive as well as facilities to build & search corpora based around search results from AltaVista queries. |
|
|
takes the text of a web page you specify & creates a list of sentences that contain the search term. Selecting various options can also produce a concordance of all the words that appear on the page in either alphabetical or frequency order. |
|
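For anyone who wants to try the same kind of word listing on text they have already saved locally, here is a minimal sketch. The function name `word_list` & the very crude definition of a "word" (runs of letters & apostrophes) are assumptions made for illustration; this is not the code behind the tool described above.

```python
import re
from collections import Counter

def word_list(text, order="frequency"):
    """Return (word, count) pairs in alphabetical or frequency order."""
    # crude tokenisation: lower-case runs of letters and apostrophes (assumption)
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    if order == "alphabetical":
        return sorted(counts.items())
    # frequency order: most frequent first, ties broken alphabetically
    return sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))

page_text = "The cat and the dog and the bird"
for word, count in word_list(page_text):
    print(count, word)
```

Pass `order="alphabetical"` for the alphabetical listing instead.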
|
retrieves words or sequences of words from a pre-selected pool of daily newspapers (French, English, Spanish, Italian, Portuguese). If any match occurs, a concordance is sent to the user by email (this is a list of the retrieved occurrences presented in their context (by default, 40 characters to the right & 40 characters to the left) in text or HTML format). You can set up GlossaNet so that concordances are sent to you on a weekly basis. |
|
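A concordance of the kind described above (each hit shown with a fixed number of characters of context on either side, 40 by default) can be sketched in a few lines. The function name `kwic` & the sample text are made up for illustration, & the whole-word, case-insensitive matching is an assumption; the tools listed in this section may well match differently.

```python
import re

def kwic(text, term, width=40):
    """Return concordance lines with `width` characters of context
    to the left and to the right of every match of `term`."""
    text = " ".join(text.split())  # collapse line breaks so each hit stays on one line
    # \b stops 'corpus' from matching inside 'corpuses'; matching ignores case
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # right-align the left context so the keywords line up in a column
        lines.append(f"{left:>{width}} [{match.group(0)}] {right}")
    return lines

sample = ("A corpus is a collection of texts. Each corpus in this list "
          "is described briefly, and a learner corpus holds learner writing.")
for line in kwic(sample, "corpus", width=20):
    print(line)
```

A character window is simple but cuts words in half at the edges; a window counted in words is a common alternative.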
|
Search an archive of more than 35 million documents from over 3,000 sources -- a vast collection of articles from leading publications, updated daily & going back as far as 20 years. Can restrict to: (1) Documents (from Newspapers, Magazines, Journals, Transcripts & Books), (2) Images & Maps, (3) Reference books (Encyclopedias, Dictionaries & Almanacs) |
|
|
Tips on using the web as a corpus for lexical/grammatical (or lexicogrammatical) searches |
|
Treebanks/ Parsed Corpora of EnglishThis list excludes the parsed historical corpora listed above. For parsed corpora in languages other than English, please see this page |
|
|
American Printing House for the Blind Treebank (APHB) |
A skeleton-parsed corpus of a wide range of English texts. 200,000 wds. See description at the UCREL website. |
|
Anaphoric Treebank |
A subsample of the AP corpus (English), annotated to show the reference of pronouns & lexical cohesion. Approximately 100,000 wds. See description at the UCREL website. |
|
Associated Press Treebank (AP) |
A skeleton-parsed corpus of American newswire reports. 1m wds. See description at the UCREL website. |
|
Canadian Hansard Treebank |
A skeleton-parsed corpus of proceedings in the Canadian Parliament. 750,000 wds. See description at the UCREL website. |
|
800,000 wds (87,188 parse
trees) of fully-parsed & annotated spoken British English from the 1950s
to 1990s; composed of two 400,000-word samples of spoken English from the London-Lund
Corpus (late 1960s-early 80s) & ICE-GB (early 1990s); fully
parsed to be consistent with ICE-GB & searchable using ICECUP
(Survey of English Usage, University College London). |
|
|
ICE-GB, the British component of the International Corpus of English (ICE) Project, is the first of the ICE corpora to be completed. It consists of 1 m wds (83,394 parse trees, including 59,640 in the spoken part of the corpus) extracted from 200 written & 300 spoken English texts. It is fully grammatically annotated & has been fully checked. ICE-GB is distributed with the retrieval software ICECUP (International Corpus of English Corpus Utility Program), exploration software designed for parsed corpora. |
|
|
IBM Manuals Treebank |
A skeleton-parsed corpus of computer manuals. 800,000 wds. See description at the UCREL website. |
|
Lancaster-Leeds Treebank |
A manually parsed subsample of the LOB corpus of English showing the surface phrase structure of each sentence, prepared by Professor Geoffrey Sampson. Approximately 45,000 wds taken from all the genre categories of the LOB corpus. See description at the UCREL website. |
|
(manual is here) |
a parsed subcorpus of the LOB Corpus of English, parsed by computer & manually corrected by researchers (Roger Garside, Geoffrey Leech & Tamas Varadi). Available through ICAME. It is a treebank consisting of over 133,000 wds from each of the 15 categories of the LOB Corpus. Each sentence is annotated with a phrase-structure parse in the form of labelled bracketing. The labels mark the boundaries of sentence, clause, phrase & coordinated word constituents. The word tags used in the tagged version of the LOB Corpus are also part of the annotation of the Lancaster Parsed Corpus. |
|
The Penn Treebank Project annotates naturally occurring text for linguistic structure -- skeletal parses showing rough syntactic & semantic information (a bank of linguistic trees) in addition to part-of-speech tags, & for the Switchboard corpus of telephone conversations, also dysfluency annotation. The original CD-ROM contains over 1.6 m wds of hand-parsed material from the Dow Jones News Service, plus an additional 1 m wds tagged for part of speech; the first fully parsed version of the Brown Corpus, completely retagged using the Penn Treebank tag set; tagged & parsed data from Dept of Energy abstracts, IBM computer manuals, MUC-3 & ATIS. Release 2 CDROM features
the new Penn Treebank II bracketing style, & contains, among other files,
1 m wds of To search the corpus for parsed structures, try the Penn Treebank Online (you'll need to know how to use the software 'tgrep') or obtain Tgrep2 for stand-alone machines (Linux + source code for other platforms). CCGbank is a translation of the Penn Treebank into a corpus of Combinatory Categorial Grammar derivations. It pairs syntactic derivations with sets of word-word dependencies which approximate the underlying predicate-argument structure. Contains 99.44% of the sentences in the Penn Treebank, for which it corrects a number of inconsistencies & errors in the original annotation. Can also be searched with Douglas Rohde's TGrep2, version 1.15 or higher. |
|
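To give a feel for what searching a bracketed parse involves, here is a small sketch that reads one simplified bracketed sentence into a nested structure & pulls out all constituents with a given label. The example sentence, its labels & the function names (`parse_brackets`, `leaves`, `find`) are made up for illustration. It is not tgrep, & it does not handle the extra conventions found in the distributed treebank files (such as unlabelled outer brackets, traces or function tags).

```python
def parse_brackets(text):
    """Parse a string like '(S (NP (DT the) (NN cat)) (VP (VBD sat)))'
    into nested (label, children) tuples; leaf words are plain strings."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '('")
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())      # nested constituent
            else:
                children.append(tokens[pos])  # leaf word
                pos += 1
        pos += 1  # step over the closing ')'
        return (label, children)

    return node()

def leaves(tree):
    """Return the words under a node, in order."""
    words = []
    for child in tree[1]:
        words.extend(leaves(child) if isinstance(child, tuple) else [child])
    return words

def find(tree, label):
    """Yield every subtree whose label equals `label`."""
    if tree[0] == label:
        yield tree
    for child in tree[1]:
        if isinstance(child, tuple):
            yield from find(child, label)

tree = parse_brackets("(S (NP (DT the) (NN cat)) (VP (VBD sat)))")
for np in find(tree, "NP"):
    print(" ".join(leaves(np)))
```

Real query tools let you express relations between nodes (dominance, precedence & so on); matching on a single label is only the simplest case.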
|
consists of approximately 65,000 wds in 11,396 (sometimes very long) lines, each containing a parse tree. |
|
|
SUSANNE (Surface & Underlying Structural Analyses of Naturalistic English) |
130,000-word cross-section of written American English (based on a subset of the million-word Brown Corpus; 64 texts x 2,000 wds each from four Brown genre categories) syntactically analysed (treebanked). |
D-I-Y (do-it-yourself) Corpora: sources of data for building your own corpus(see also the text archive sites & the audio/visual archive sites) |
|
|
ABU: la Bibliothèque Universelle (French) |
Free access to the full text of French-language public-domain works on the Internet since 1993. To get to the texts, consult the catalogue of authors or the catalogue of texts. You can also run word searches across the whole corpus. Several dictionaries are also available. |
|
Enormously useful site covering much of the same ground as the OTA (but, refreshingly, without the considerable bother of endless copyright restrictions & legal threats). Besides plain texts of prose fiction & non-fiction, poetry & drama, the site includes: an encyclopaedia, gazetteer, world factbook, dictionary, thesaurus, style guides, books of quotations. |
|
|
more than 2000 free texts (mostly classics), study guides & reference resources (more for the literary/humanities scholar, but worth a look) |
|
|
more than 200 titles available |
|
|
42 collections on such diverse topics as contemporary art, race, Internet studies, sexuality, drama, design, multimedia, accessible publishing & current political & social issues. Also includes hypertexts, audio & video recordings. |
|
|
a digital resource which
enables you to search and download thousands of English-language university
essays and theses from |
|
|
works of fiction & about fiction. Collection of texts in the public domain, classified into: Late Antique & Medieval Texts, Renaissance & Early Modern Texts, Modern Fiction, Modern Poetry, Historical Documents, Religious Texts & Other Texts. |
|
|
searchable (with basic concordancing) & browsable texts of English classics (More for the literary/humanities scholar, but worth a look. Whole texts not downloadable in one go.) |
|
|
441 works of classical literature by 59 different authors, including user-driven commentary & "reader's choice" Web sites. Mainly Greco-Roman works (some Chinese & Persian), all in English translation. |
|
|
Movie Script sites |
Drew's Scripts-O-Rama / The Movie Script Compendium / Script Central |
|
Watch out for typos and mis-translations |
|
|
Newspaper sites for English (Sampler) (broadsheets & tabloids) (You will, of course, have your own links to hundreds of other newspapers, other varieties of English & other languages.) |
British Broadsheets: The Guardian, The Independent, The Telegraph, The Times, The Evening Standard, The Observer, The Sunday Times, The Scotsman, The Herald, The Irish Times
British Tabloids: The Mirror, The Sun, Daily/Sunday Express, News of the World, The Daily Star, The Sunday Mirror
[* More newspaper & magazine links may be found here ]
American Newspapers: |
|
Newspaper sites for Other Languages |
Try this searchable database of Newspapers, Magazines & other media (radio, TV) on the Internet (Kidon Media Link, a meta-site with listings by language & country), or try this site (maintained by IMS Stuttgart), or the selection below: French: Le Monde German: Die Zeit, Die Welt, Süddeutsche Zeitung Russian: Nezavissimaya Gazeta |
|
Renascence
Editions ( |
an online repository of works printed in English between 1477 & 1799; includes Shakespeare, Wordsworth, Bacon, Bunyan, Donne, Hume, Hobbes, Milton, Spenser |
|
a fee-based Corpus Query System incorporating word sketches, grammatical relations, & a distributional thesaurus. A word sketch is a one-page, automatic, corpus-derived summary of a word’s grammatical & collocational behaviour. A 30-day free trial account is available. Web-based service using standard browsers: no software installation required. Available Resources: (1) Pre-loaded corpora (60M-1.5B wds) for Chinese, English, French, German, Italian, Japanese, Portuguese, Spanish, & Slovene; (2) WebBootCaT (for building your own instant corpus from web pages, then extracting keywords, specialist terminology, etc.); (3) CorpusBuilder (upload & install your own corpora). |
|
|
Transcripts of Spoken News reportage, Debates, Interviews |
|
|
a directory of books that
can be freely read on the Internet. The On-Line Books Page is now hosted by
the |
|
|
Reuters Corpus, Volume 1, English language, 1996-08-20 to 1997-08-19 |
about 810,000 Reuters English Language News stories, covering the period 20 August 1996 - 19 August 1997. Format: 806,791 XML files in NewsML format; 365 zip files, one per day, over 2 CDs (2.5 GB when uncompressed); the number of stories per day is not constant, but there are on average 2,880 per day on weekdays & 480 per day on weekends. Most stories are around 6-7 paragraphs, 1000 wds. Forthcoming: news stories in other languages, covering the same time period. |
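As a rough sanity check on the figures quoted above, the following Python sketch counts the weekdays & weekend days in the stated period & multiplies them by the quoted daily averages. Only the dates, the two averages & the file count come from the description; the variable names are made up for illustration.

```python
from datetime import date, timedelta

START = date(1996, 8, 20)      # first day of the collection period
END = date(1997, 8, 19)        # last day of the collection period
WEEKDAY_AVG = 2880             # quoted average stories per weekday
WEEKEND_AVG = 480              # quoted average stories per weekend day
REPORTED_FILES = 806_791       # quoted number of XML files

# Every calendar day in the period, inclusive of both ends.
days = [START + timedelta(days=i) for i in range((END - START).days + 1)]
weekdays = sum(1 for d in days if d.weekday() < 5)   # Monday=0 ... Friday=4
weekend_days = len(days) - weekdays

estimate = weekdays * WEEKDAY_AVG + weekend_days * WEEKEND_AVG
print(len(days), weekdays, weekend_days, estimate)
print(f"difference from reported file count: {REPORTED_FILES - estimate}")
```

The period covers 365 days (matching the 365 zip files), of which 261 are weekdays & 104 fall on weekends, so the averages imply about 801,600 stories, within 1% of the reported 806,791 files.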
|
on-line minutes of meetings (Lords & Commons); bills, reports, bulletins; Hansard & other publications. |
|
|
Good source for getting parallel texts (for a limited range of topics & genres) in Arabic, English, Chinese, French, Russian & Spanish. |
|
|
Parallel texts concerning European law in several EU languages (Spanish, Danish, German, Greek, English, French, Italian, Dutch, Portuguese, Finnish & Swedish). |
|
Multimedia
Corpora & Texts with Audio/Visual accompaniments
(includes historical digital library initiatives. Not all are structured/formatted as other standardized text corpora.) |
|
|
(not strictly a corpus, but...) Audio files of notable lectures & events held at UC Berkeley: interviews & lectures by famous critics, authors & cultural historians, including Aldous Huxley, James Baldwin, Malcolm X, Michel Foucault, Noam Chomsky, Umberto Eco & Claude Lévi-Strauss. |
|
|
Index to some 400+ active links to 5000+ full text, audio & video (streaming) versions of public speeches, sermons, legal proceedings, lectures, debates, interviews, other recorded media events, & a declaration or two. |
|
|
a collection of
interviews (edited/not-faithful-to-the-original transcripts + streaming
videos (for most interviews)) with distinguished people from all over the
world about their lives & their work (diplomats, statesmen, &
soldiers; economists & political analysts; scientists & historians;
writers & foreign correspondents; activists & artists). At the heart
of each interview is a focus on individuals & ideas that make a
difference. The series is produced at the |
|
|
CUCASE (City University Corpus of Academic Spoken English; forthcoming) |
A c. 2-million-word
(multimedia) corpus currently being compiled (Jan 2008-Sept 2009, initially)
by David Y.W. Lee. Will mirror the design of MICASE & BASE; will contain English spoken at a |
|
(Genre & Multimodality) |
The GeM project ran from 1999 until 2002 & was concerned with developing the first XML annotation scheme for multilayered description of illustrated documents with complex layout. The GeM framework allows layout, rhetorical structure, content & language of different text types to be represented & interrogated. A follow-up project is in the planning stage. Output:
|
|
"The purpose of Historical Voices is to create a significant, fully searchable online database of spoken word collections spanning the 20th century - the first large-scale repository of its kind. Historical Voices will both provide storage for these digital holdings & display public galleries that cover a variety of interests & topics." Includes synchronised text-&-audio RealMedia presentations (see, for e.g., the Flint Sit-Down Strike). Transcripts are not formatted like standardised corpora, but have the advantage of being linked to sound recordings. |
|
|
Gesture Database
(Max Planck Institute in |
consists of the video
recordings (no accompanying transcript/corpus texts, as far as I know) of
speech & gestures that spontaneously accompany speech, & the annotations
regarding gesture & speech in the recording. The recordings were made in
different cultures, including the |
|
MICASE ( |
See fuller description above. Selected audio recordings of the original speech events are available here (streaming RealAudio), or, for bona fide researchers, in other formats by special arrangement. |
|
Read about this corpus of American movies created using subtitles in five languages (English, French, German, Italian & Spanish) |
|
|
a project of the Special
Collections Unit, J. Murrey Atkins Library, Univ of North Carolina at |
|
|
|
|
Dictionary Data/Lexicons |
* Freely-accessible, On-line Corpora of English
Many language teachers & learners just want to know one simple thing: where are the free, web-accessible corpora that we can search right away, without any fuss? There are not many! Here are the major ones. I have left out literary works, newspaper collections & blogs because these you can easily find yourselves & there are millions of them out there. |
|
1. British National Corpus (BNC) [100m wds; 1990s British English, spoken & written]: There are many different web sites giving free (but limited) access to the corpus--limited due to copyright: i.e. you cannot expand the concordance context to read more of the surrounding text, & you cannot read the entire source texts (only snippets).
2. Corpus of Contemporary American English (COCA): [360 m wds; c. 150,000 texts, 20 m wds each year from 1990-2007.] For each year (& therefore overall, as well), the corpus is evenly divided between the five genres of spoken, fiction, popular magazines, newspapers, & academic journals. Searchable on-line only; the texts themselves are not available for download.
3. TIME Magazine Corpus: [100 m words American English, 1923-present; More than 275,000 articles from TIME Magazine. Wide range of topics: news, sports, business, culture, health, entertainment, etc.] Nice search interface (essentially the same as that of the BYU-BNC and COCA).
4. WordbanksOnline (from the Bank of English) [modern written & spoken texts]: search a 56-million-word subset of the Bank of English (sub-dividable into 3 broad categories); also usefully allows you to specify a following word by part of speech, & gives collocations (limited to 40 hits, & the total number of hits is not reported :-( )
5. MICASE [1.7m wds of current, spoken academic American English, as produced by faculty, students & staff in formal & informal settings around the university]: fully searchable & browseable via a custom web interface (no limits), & now has selected playable sound files to accompany some transcripts. Homepage is here.
6. Word Neighbors [by
7. Business Letters Corpus [
8. LOB & Brown [1m wds each; 1960s written British English (LOB) & American English (Brown)]: The Brown Corpus is searchable & browseable via LDC On-line here (no limits). The Brown & LOB are both searchable via the Virtual Language Centre (VLC, or alternative edict site); limited to 2001 hits, without any warning of this maximum. Allows simple searches as well as searches for [word + contextual/associated word] (i.e. only instances where the search word occurs near another specified word). The VLC/edict sites also have other collections of text -- see here for a description & breakdown of these more specialized corpora.
9. Hong Kong Financial Services Corpus (HKFSC) [7.6 m wds; spoken & written texts collected with the help of professional associations & private organisations from across the financial services sector in Hong Kong: e.g., insurance/investment product descriptions; agreements; media releases; ordinances; procedures; prospectuses; rules; standards; speeches]
10. CorpusEye: Search various corpora (for many languages). The English corpora include texts culled from Wikipedia and the Enron e-mails.
11. OPUS: [Computer manuals & European parliament speeches] an open-source collection of freely searchable/downloadable parallel corpora (texts with sentence-aligned translations). Not terribly useful texts unless you're teaching Technical English or researching parliamentary speeches.
12. VOA's Special English Program Scripts (by Charles Kelly) [c.14K wds; sentence-view concordances of scripts from Voice of America's "Special English" broadcasts, which use a limited vocab of 1,500 wds (not necessarily the "easiest" English words, but most are simple)] The scripts represent a kind of "written-to-be-spoken" English; useful for less-proficient English learners.
13. CorTec: a bilingual (English & Portuguese) comparable corpus of technical language (linked to the COMET project) in 5 areas: Cooking, Contracts, Computing, Environment & Hypertension. The texts themselves cannot be downloaded, but can be searched via the web tools provided: concordancer, wordlist & N-gram extractor.
14. SACODEYL includes a small corpus of English language teenager talk. Contains structured video interviews with students 13-18 yrs old (seven European languages in total). Annotated and enriched for language learning purposes. Free multimedia access (videos).
=============
There are many other corpora which are free, but not on-line, including most of the ICE corpora (just sign a licence & download the files). If you're interested in non-native English, the PICLE Corpus (argumentative essays & literature exam scripts by Polish learners of English) is searchable on-line.
=============
* See also the section on Using the Web as a corpus (many of these web concordancing search engines allow you to restrict searches to particular countries, institutions, URLs/web sites, etc., thus reducing the amount of junk/unwanted hits), & scrutinize the above section on D-I-Y Corpora for newspapers, out-of-copyright literary texts, & Bibles in various languages. Most of these are not 'corpora' in the strict sense of being structured & formatted according to contemporary corpus standards, but are starting points if you want to have your own free texts to run concordances on.
** Have I left out something? If you know of other web-searchable corpora, do let me know. |
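If you do collect your own free texts, a concordance is easy to approximate. The Python sketch below is a bare-bones KWIC (key word in context) routine with an optional "nearby word" filter, similar in spirit to the [word + contextual/associated word] searches mentioned above. It is an illustration only, not the method used by any of the sites listed; the function name `kwic`, its parameters & the sample text are all invented.

```python
import re


def kwic(text, keyword, width=30, near=None, span=5):
    """Return KWIC lines for keyword; if near is given, keep only hits
    with that word within span tokens to the left or right."""
    text = " ".join(text.split())                  # normalise whitespace
    tokens = list(re.finditer(r"\w+(?:['-]\w+)*", text))
    words = [m.group().lower() for m in tokens]
    lines = []
    for i, m in enumerate(tokens):
        if words[i] != keyword.lower():
            continue
        if near is not None:
            window = words[max(0, i - span):i] + words[i + 1:i + 1 + span]
            if near.lower() not in window:
                continue
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        lines.append(f"{left:>{width}} [{m.group()}] {right:<{width}}")
    return lines


# Invented sample text, standing in for a file you have downloaded yourself.
SAMPLE_TEXT = (
    "The corpus is searchable on-line. A corpus of spoken English is rare. "
    "Build your own corpus from newspapers."
)

for line in kwic(SAMPLE_TEXT, "corpus"):
    print(line)

print(len(kwic(SAMPLE_TEXT, "corpus", near="spoken")))
```

Matching is case-insensitive & works on whole word forms only, so "corpora" would need its own search (or a lemmatiser, which this sketch does not attempt).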
* Question:
Ok, so how do I actually get my hands on these hundreds of corpora, &
how can I search them?
Short Answer: Depends on the
corpus. If the corpus you want is not publicly available & not searchable
on-line, then you'll have to get the necessary licence, pay any fees, & use
the appropriate concordancer/tool for the task (see "Software"
section), bearing in mind the mark-up scheme, annotation tags, etc. used in
that corpus. Some concordancers (e.g., WordSmith) can ignore mark-up.
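To see why mark-up matters, here is a small Python sketch of what "ignoring mark-up" amounts to: angle-bracket tags are removed before the text is searched. This is not how WordSmith or any other concordancer actually does it; the tag names & attributes in the sample are hypothetical, & real corpora often also use entities or inline tag formats that this one-line pattern would not handle.

```python
import re

# Anything between "<" and the next ">" is treated as a tag (a simplifying assumption).
TAG = re.compile(r"<[^>]+>")


def strip_markup(text):
    """Replace tags with spaces, then collapse runs of whitespace."""
    return " ".join(TAG.sub(" ", text).split())


# Invented, XML-style sample; the element & attribute names are hypothetical.
MARKED_UP = (
    '<s n="1"><w type="DT">The</w> <w type="NN">corpus</w> '
    '<w type="VBZ">is</w> <w type="JJ">small</w></s>'
)

plain = strip_markup(MARKED_UP)
print(plain)
print("corpus is" in MARKED_UP, "corpus is" in plain)
```

A phrase search that fails on the raw file can succeed on the stripped text, which is why it pays to know the mark-up scheme of a corpus before choosing a tool.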
· Archive: a repository of readable electronic texts not linked in any coordinated way, e.g. the Oxford Text Archive
· Electronic Text Library (or ETL, Fr. 'textothèque'): a collection of electronic texts in standardized format with certain conventions relating to content, etc., but without rigorous selectional constraints.
· Corpus: a subset of an ETL, built according to explicit design criteria for a specific purpose, e.g. the Corpus Révolutionnaire (Bibliothèque Beaubourg, Paris), the Cobuild Corpus, the Longman/Lancaster corpus, the Oxford Pilot corpus.
If you need
help with file formats for some of the downloads, [click
here]
Have you found this web site/page useful? Do let me know
if you want to encourage me to keep updating the site. If you have a new
corpus or resource (or something I've missed) for me to link to,
please drop me a line.
Last Updated:
17 June 2009 19:16:20