Bookmarks for Corpus-based Linguists
Major English Corpora | Other Recent English Corpora | Spoken Corpora | Diachronic Comparisons | Historical Corpora | 1st language acquisition | Learner/Lingua Franca Corpora | Specialised Corpora | Text Archives & Corpus Distribution Sites | Non-English & Multilingual Corpora | The Web as a Corpus | Parsed Corpora | D-I-Y Corpora | Audio/Visual 'Corpora' | *Free, web-accessible Corpora
* What are the differences among these terms? See below.
* OK, so how do I actually get my hands on these corpora, & how can I search them? See below.
* For freely accessible, on-line corpora, see separate section below.
Major English Language General Corpora |
|
Kennedy (1998) suggests a three-way categorisation of corpora. Pre-electronic Corpora: (biblical & literary studies, early dictionaries, etc.) 1st-generation Major Corpora: Brown, LOB, LLC,
|
Other General Corpora for Written English (excluding those already in the above lists; please also note the other categories below, e.g. for speech corpora; the same corpus may appear under more than one category, for easy access) |
|
|
FLOB (Freiburg-LOB Corpus of British English) |
1990s analogue to the LOB corpus (1 m wds, written British English) |
|
FROWN (Freiburg-Brown Corpus of American English) |
1990s analogue to the Brown corpus (1 m wds, written American English) |
|
structurally analysed written British English (drawn from the British National Corpus); a treebank sampling modern written British English of three genres (edited published prose, the writing of young adults (e.g. A-level exam scripts, 1st-year undergraduate essays), spontaneous writing by 9- to 12-year-old children). |
|
|
SUSANNE (Surface & Underlying Structural Analyses of Naturalistic English) |
130,000-word cross-section of written American English (based on a subset of the million-word Brown Corpus; 64 texts x 2,000 wds each from four Brown genre categories) syntactically analysed (treebanked). |
|
Corpus of Contemporary American English (by Mark Davies, |
c. 360 million wds, including 20m for each year from 1990 to the present. Each year (& therefore overall, as well), the corpus is evenly divided between spoken, fiction, popular magazines, newspapers, & academic. In addition, the corpus will be continually updated--20m wds each year. (Because of copyright & licensing issues, the texts themselves are not available for download—they can only be searched online.) |
|
International Corpus of English |
see description under 2nd-generation Mega-corpora |
|
International Corpus of English: |
* For other national varieties of ICE, go
to the main ICE web page here
(site includes downloadable sound files from several ICE teams, including |
|
International Corpus of English: |
The |
|
International Corpus of English: |
1 m wds of spoken & written New Zealand English collected 1989 to 1994; consists of 600,000 wds of speech & 400,000 wds of written text. The Wellington Corpus of Spoken New Zealand English (WSC) & the spoken component of ICE-NZ share 9 categories. Because informal conversational data in particular was so difficult to collect, there is an overlap of 339,530 wds (173 files) between the two corpora to achieve economy in data collection. |
|
[This blurb is from their web site. Availability is unknown, as with all proprietary corpora... no comment on the use of 'corpuses'...] A dynamic corpus of 100 m wds from newspapers, journals, magazines, best-selling novels, technical & scientific writing, & coffee-table books... composition constantly being refined & new material added.... based on the general design principles of the Longman Lancaster English Language Corpus & the written component of the British National Corpus. Like other corpuses [sic] in the Longman Corpus Network, wds can be concordanced, wordlists created, & statistical features analysed, allowing lexicographers to compare & contrast usage in British & American English. |
|
|
(registration required to get the CDs, or get the older Reuters-21578 here.) |
Reuters Corpus, Volume 1, English language, 1996-08-20 to 1997-08-19 [810,000 news stories]
Reuters Corpus, Volume 2, Multilingual Corpus, 1996-08-20 to 1997-08-19 [over 487,000 Reuters News stories in thirteen languages (Dutch, French, German, Chinese, Japanese, Russian, Portuguese, Spanish, Latin American Spanish, Italian, Danish, Norwegian, & Swedish). These stories are contemporaneous with RCV1, but some languages do not cover the entire time period.] |
* Some of the
above-mentioned corpora are conveniently bundled together on the new ICAME Corpus Collection
on CD-ROM (click to find out more). It comes with the concordancers
WordSmith, TACT & WordCruncher.
Spoken Corpora of English
For phonemic/acoustic/articulatory databanks (mainly isolated words, phonemes, or sentences), see separate list of links here ( |
|
|
ANDOSL (Australian National Database of Spoken Language) |
comprises spoken language as it occurs in a variety of
major speaker groups in |
|
BASE (British Academic Spoken English) |
|
|
(Cambridge & Nottingham Corpus of Discourse in English) |
Not generally available for research except at specific sites (annoying!). 5 m wds of spontaneous speech collected between 1995 & 2000. We are told that a feature of CANCODE that makes it different from other spoken corpora is that all the transcripts have been coded to reflect the relationship between the speakers: whether they are intimates (living together), casual acquaintances, colleagues at work, or unknown to each other. Speech events were recorded at hundreds of locations across the British Isles, covering a wide variety of situations: casual conversation, people working together, people shopping, people finding out information, discussions, etc. [see also the Centre for Research in Applied Linguistics, University of Nottingham] |
|
(Spoken version of SUSANNE Corpus) |
SUSANNE-meets-spoken-English; Geoffrey Sampson's project |
|
(Center for Spoken Language Understanding) |
several free speech corpora (telephone recordings, conversations with children, pronunciations of isolated digits & alphabets, etc.) |
|
CUCASE ( |
A c. 2-m-word (multimedia) corpus currently being
compiled (Jan 2008-Sept 2009, initially) by David Y.W. Lee. Will mirror the
design of MICASE
& BASE;
will contain English spoken at a |
|
800,000 wds (87,188 parse trees) of fully-parsed &
annotated spoken British English from the 1950s to 1990s; composed of two
400,000-word samples of spoken English from the London-Lund Corpus (late
1960s-early 80s) & ICE-GB (early 1990s); fully parsed to be
consistent with ICE-GB & searchable using ICECUP (Survey of
English Usage, University College London). |
|
|
Dialogue Diversity 'Corpus' (DDC) |
|
|
ELISA (English Language Interview Corpus as a Second-Language Application) |
60,000 wds, 28 interviews with
native speakers of English; multimodal (video files available). They talk
about their professional career (e.g. in tourism, politics, the media or
environmental education). Free for non-commercial use. Has an on-line
concordancer. |
|
EUSTACE
( |
Free for non-commercial use; esp. useful for phonetics researchers & speech technologists working on synthesis & recognition. Comprises 4,608 sentences spoken by six speakers of British English; sentences were designed to examine a number of durational effects in speech & are controlled for length & phonetic content. Subconstituents of key words in each sentence have been identified by labels in xlabel (ESPS) format & notes have been made about the prosodic realisation of the sentences. Example sentences available for playback. Speech waveform files are available in .wav (RIFF) format & .sd (ESPS) format. |
|
( |
A specialized corpus of British English dialects covering nine major dialect areas in Britain; 370 texts; c. 2.45 m wds; 300 hours of speech, excluding interviewer utterances (recorded between 1968 & 2000-- some recordings were taken from oral history interviews), 420 different informants (a majority are non-mobile old rural males who typically grew up before WW I.). Recordings will be made available |
|
(by the Human Communication Research
Centre at |
a set of 8 CD-ROMs containing linked audio & transcriptions of a total of about 18 hours (roughly 150,000 word tokens) of spontaneous (task-oriented) speech that was recorded from 128 two-person conversations according to a detailed experimental design. OR download/ftp a gzipped tar file of the entire corpus (the compressed tar file is 10MB, the whole corpus is 80MB; 2562 XML files & a dtd directory containing 15 dtd files). |
|
(Intonational Variation in English) |
created to investigate cross-varietal & stylistic variation in English intonation. Focus is on modern or mainstream dialects: nine urban varieties of English spoken in the British Isles, viz. Belfast, Bradford (bilingual Punjabi/English speakers), Cambridge, Cardiff (bilingual Welsh-English speakers), Dublin, Liverpool, Leeds, London, & Newcastle; approximately 36 hours of speech data in five different speaking styles |
|
52,000 wds of mostly prepared (& mostly
monologic) southern British English speech (approximating
to RP), collected in the period 1984-1987; orthographic & prosodic
transcription & in two versions with grammatical tagging (like those for
the LOB Corpus). For a detailed description, see the ICAME Corpus
Collection's SEC manual; for the SEC tag-set, see the
AMALGAM web site.
Ref: A Corpus of Formal British English Speech (1996), Knowles,
Gerald, Briony Williams & Lita Taylor (eds.), London: Longman. A
collection of research papers based on the SEC has also been published as Working
with Speech (1996), Knowles, Gerald, Anne Wichmann & Peter
Alderson (eds.), |
|
|
LeaP (Learning the Prosody of a foreign language) ( |
a large corpus of foreign language learners' speech (Target Languages are English & German, Native Languages span a wide range: German, Polish, Arabic, Chinese, Spanish, etc.). A multitude of data of various types is being collected: the corpus of spoken language will consist of at least 400 recordings of between 2 & 20 minutes in length. It comprises three different speech styles: (i) read speech (a story of 268 wds); (ii) prepared speech (the re-telling of the story); (iii) free speech from an interview context. The central aim of the project is to provide a detailed description of non-native prosody. The second line of research aims to explore whether & how it is possible for learners of a foreign language to acquire the prosody of the target language without having a distinct "foreign accent". In a longitudinal study, various methods of teaching prosody will be tested. |
|
Limerick Corpus of
Irish-English (L-CIE) |
one-million word spoken corpus of Irish English discourse; conversations recorded in a wide variety of mostly informal settings throughout Ireland (excluding Northern Ireland); currently (accessed: Feb 2008) has 375 transcripts totalling over 1m wds; mainly casual conversation, but also over 200K wds of professional, transactional & pedagogic Irish English; not designed to be geographically representative (does not include data from every county); speakers range in age from 14 to 78; equal representation of both male & female speakers; designed to allow inter-corpus comparisons with CANCODE |
|
London-Lund Corpus (LLC) |
See description here |
|
5 m wds, demographically sampled speech from 12
regions (30 states) across the continental US; coordinated by the |
|
|
Machine-Readable Spoken English Corpus |
Some notes on MARSEC version 2 here (latest) or here (outdated). |
|
MICASE
( |
|
|
(Under construction) |
|
|
|
400,000 wds transcribed speech from 42 locations, across three age groups. Contact the Oxford Text Archive. |
|
PROSICE Corpus |
a collection of re-recorded ICE-GB texts with high technical specifications; syntactically analysed & temporally aligned. See here for more info. |
|
Reading/Leeds Emotional Speech Corpus |
prosodically & paralinguistically coded speech corpus for investigating suprasegmental & affective information in the speech signal. 4.5-hour database of machine-readable speech, of which 26 mins were transcribed using the extended ToBI system. Unfortunately, this corpus is NOT available for use by others, but you can find out more info from the people listed on the website, & also from here. |
|
(university site is here) recordings of people talking -- people from all over the United States, in all walks of life, talking about & doing all sorts of things; target of 200,000 wds. The three CD-ROM volumes in Part 1 contain 14 speech files of between fifteen & thirty minutes each. Alternative site for the data at TalkBank. Part II contains 47,000 wds (6 hrs; 16 wave format speech files) |
|
|
Spoken Corpus of the Survey of English Dialects |
|
|
Switchboard Corpus (SWB) |
a corpus of over 240 hours of recorded spontaneous (but topic-prompted) telephone conversations (2,438 conversations averaging 6 minutes in length each) recorded in the early 1990s; c. 3 m wds (3,044,734) of text, spoken by 543 unique speakers (302 males & 241 females) from most major dialect groups of American English. Info on the speakers' age, sex, education & dialect region. On average, each speaker participates in about 9 calls (but it ranges from 1 to 32). |
|
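The figures quoted for Switchboard can be cross-checked against each other. The following is a minimal Python sketch, not part of any corpus distribution: the constants are copied from the description above, and the assumption that every call has exactly two participants is ours (the entry does not say so explicitly).

```python
# Figures quoted in the Switchboard description above.
HOURS_RECORDED = 240        # "over 240 hours", so a lower bound
CONVERSATIONS = 2438
WORDS = 3_044_734
MALE_SPEAKERS = 302
FEMALE_SPEAKERS = 241
SIDES_PER_CALL = 2          # assumption: two speakers per telephone call


def switchboard_summary():
    """Derive averages from the quoted totals."""
    speakers = MALE_SPEAKERS + FEMALE_SPEAKERS
    return {
        "speakers": speakers,
        # Average call length in minutes, from the lower-bound hour count.
        "avg_minutes": HOURS_RECORDED * 60 / CONVERSATIONS,
        # Average number of transcribed words per conversation.
        "avg_words": WORDS / CONVERSATIONS,
        # Each call involves SIDES_PER_CALL speakers.
        "calls_per_speaker": CONVERSATIONS * SIDES_PER_CALL / speakers,
    }


if __name__ == "__main__":
    for key, value in switchboard_summary().items():
        print(f"{key}: {value:.2f}")
```

Under these assumptions the derived values agree with the description: 543 speakers, calls of roughly 6 minutes, and about 9 calls per speaker.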
TRAINS
Spoken Dialogue Corpus on CD-ROM ( |
six & a half hours' worth of human-human dialogues; includes 55,000 wds & about 5,500 speaker turns. Audio files for the dialogues are available on the CD-ROM; |
|
Tyneside Linguistic Survey (TLS) |
Not much info available, but some given on the NECTE page.
The TLS corpus was compiled in the late 1960s, & consists of 86
loosely-structured 30-min interviews. The informants were drawn from a
stratified random sample of Gateshead in |
|
1 m wds of spoken New Zealand English collected from 1988 to 1994 (99% (545 out of 551 extracts) was collected between 1990 & 1994). Of the eight remaining files, four were collected in 1988 (4 oral history interviews) & four in 1989 (4 social dialect interviews). Made up of 2,000-word extracts (where possible), in different proportions of formal, semi-formal & informal speech. Both monologue & dialogue categories are included & there is broadcast as well as private material collected in a range of settings. Access to recordings from the WSC is restricted to use at Victoria University of Wellington. A small number of the recordings which are shared with the ICE-NZ corpus will be made available on CD through ICE. |
|
|
Not generally available (?). Project aims to analyse
features of interpersonal communication in a wide variety of |
|
|
During the 2000-2001 academic year, cadets, staff & faculty members at the US Military Academy volunteered to participate in a speech data collection project for American English (high-quality read speech---not spontaneous). The 185 sentences comprising the data collection script were written to elicit examples of all or most all of the possible syllables used in spoken American English. The G3 Corpus audio data comes from 53 female & 56 male volunteers, each of whom recorded approximately 104 utterances. The recordings are sampled at a 16 bit resolution, 22,050 samples per second. Total: c.15 hours. |
|
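For anyone planning storage for this data set, the recording specification above is enough for a rough size estimate. The sketch below is illustrative only: it assumes single-channel (mono), uncompressed PCM audio, which the description does not state, and it treats "approximately 104 utterances" and "c. 15 hours" as exact.

```python
# Figures quoted in the description above.
FEMALE_SPEAKERS = 53
MALE_SPEAKERS = 56
UTTERANCES_PER_SPEAKER = 104   # "approximately 104"
SAMPLE_RATE_HZ = 22_050
BITS_PER_SAMPLE = 16
TOTAL_HOURS = 15               # "c. 15 hours"
CHANNELS = 1                   # assumption: mono recordings


def g3_estimates():
    """Estimate utterance count, mean utterance length and raw audio size."""
    speakers = FEMALE_SPEAKERS + MALE_SPEAKERS
    utterances = speakers * UTTERANCES_PER_SPEAKER
    total_seconds = TOTAL_HOURS * 3600
    bytes_per_second = SAMPLE_RATE_HZ * (BITS_PER_SAMPLE // 8) * CHANNELS
    return {
        "speakers": speakers,
        "utterances": utterances,
        "avg_utterance_seconds": total_seconds / utterances,
        "raw_bytes": total_seconds * bytes_per_second,
    }


if __name__ == "__main__":
    est = g3_estimates()
    print(f"{est['utterances']} utterances, "
          f"{est['avg_utterance_seconds']:.1f} s each on average, "
          f"{est['raw_bytes'] / 1e9:.2f} GB uncompressed")
```

On these assumptions the corpus comes to a little under 2.4 GB of raw audio, with utterances averaging just under five seconds.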
|
British National Corpus (BNC) |
Naturally, the spoken component of the British National Corpus is also a rich resource (although for phonetic/prosodic research you'll need to get the audio tapes from the British Library... these are now generally available, but the matching of tapes & actual BNC files is problematic). |
The LDC also contains various resources which
are not 'corpora' as such, but may be of interest. Example: the LDC
American English Spoken Lexicon, which is a collection of
pronunciations captured in individual audio files for more than 50,000 of the
most common words in English (words were extracted from newswire & telephone conversation; description & links to audio files here).
Diachronic Comparisons (recent changes in English) |
|
Since the first major English corpora were collected in the 1960s, it is now possible to compare these earlier corpora with more contemporary (1990s) corpora. For written British English, LOB can now be compared with FLOB, while for American English, it's Brown v. Frown. For spoken British English, the Diachronic Corpus of Present-Day Spoken English (DCPSE) allows comparisons of the London-Lund Corpus (LLC, 1960s) with the British component of the International Corpus of English (ICE-GB, 1990s). |
Historical Corpora or Collections (English) |
|
|
Brooklyn-Geneva-Amsterdam-Helsinki Parsed Corpus of Old English |
a selection of texts from the Old English Section of the Helsinki Corpus of English Texts; contains 106,210 wds of Old Eng text; the samples from the longer texts are 5,000 to 10,000 wds in length; texts represent a range of dates of composition, authors, & genres. For a list of the texts included in the Brooklyn Corpus, click here. The texts are syntactically & morphologically annotated, & each word is glossed. Size of the corpus: c.12 megabytes. |
|
3,022 texts representing all extant Old English texts,
compiled at the |
|
|
1.2-m wds of Early Modern English speech-related
texts (177 text files). The CED contains texts representative of five text
types (plus a mixed bag of dialogues labelled 'Miscellaneous'), which divide
into two categories: these are 'authentic dialogue', which is written records
of real speech events (Trial Proceedings & Witness Depositions),
& 'constructed dialogue', in which
the dialogue is constructed by an author (Drama Comedy, Didactic Works, &
Prose Fiction). |
|
|
Early English Books On-line (EEBO) (subscription required) |
(images of original print documents, with some now searchable as texts) "From the first book published in English through the age of Spenser & Shakespeare, this incomparable collection now contains about 100,000 of over 125,000 titles listed in Pollard & Redgrave's Short-Title Catalogue (1475-1640) & Wing's Short-Title Catalogue (1641-1700) & their revised editions, as well as the Thomason Tracts (1640-1661) collection & the Early English Books Tract Supplement." |
|
approximately 800,000 wds of running text drawn from all the newsbooks present in the Thomason Tracts that were published from December 1653 to May 1654. |
|
|
1.5 million word syntactically-annotated corpus of Old English prose texts; sister corpus to the Penn-Helsinki Parsed Corpus of Middle English (uses the same form of annotation & is accessed by the same search engine, CorpusSearch). The corpus itself (the annotated text files) is distributed by the Oxford Text Archive. Free for non-commercial use. |
|
|
a selection of poetic texts from the Old English Section of the Helsinki Corpus of English Texts; 71,490 wds of Old English text; the samples from the longer texts are 4,000 to 17,000 wds in length. The texts represent a range of dates of composition & authors. For a list of the texts included in the York Poetry Corpus, click here. The texts are syntactically & morphologically annotated. |
|
|
c. 1.5 m wds; 242 files; covers the period from c. 750 to c. 1700 (Old English to Early Modern) |
|
|
Innsbruck Computer-Archive of Machine-Readable English Texts (ICAMET) |
(1) The Prose Corpus of ICAMET: compilation of 129 texts (March 1999) of Middle Eng prose (1100-1500), digitalized from extant editions & constantly enlarged by further files. Since it is a full-text database, it particularly aims at target groups of users who, unlike those of the Helsinki Corpus, are not so much interested in extracts of texts, but in their complete versions. Thus allows literary, historical & topical analyses of various kinds, esp. studies of cultural history. It also invites linguists to raise questions of style, rhetoric or narrative technique, for which one would want a lengthier piece of text or even the complete text. (2) The Letter Corpus of ICAMET contains 254 complete letters, arranged diachronically, from different sources (written between 1386 & 1688). Particularly encourages pragmatic & sociolinguistic studies, & analyses concerning cultural life & lifestyle. |
|
prose text samples of Middle Eng, annotated for syntactic structure. Designed for the use of students & scholars of the history of English, especially the historical syntax of the language |
|
|
Corpus of Middle English Prose & Verse (CME) (or visit the parent site, the Middle English Compendium) |
collection of Middle Eng texts assembled from works contributed by Univ of Michigan faculty & from texts provided by the Oxford Text Archive, as well as works created specifically for the Corpus (archive last updated in October 2000). All 61 texts in the archive are valid SGML documents, tagged in conformance with the TEI Guidelines, & converted to the TEI Lite DTD for wider use. Web-searchable. |
|
Corpus
of Early English Correspondence (CEEC) & the Parsed
Corpus of Early English Correspondence (PCEEC), |
2.7 m wds; 1410 to 1681 (CEECS = 450,000 wds); a supplement, the "Corpus of Early English Correspondence Supplement" (CEECSu; 0.44 m wds), extends the time range: 1402-1663, while the "Corpus of Early English Correspondence Extension" (CEECE; 2.2 m wds) covers the period 1681-1800. The project home page & the manual at ICAME give more details. |
|
Corpus
of Early English Medical Writing & Corpus
of Middle English Medical Texts (MEMT) |
a corpus of medical treatises from 1375-1800. Shorter
texts are included in toto & longer treatises are represented by extracts
of approximately 10-12 K wds. The medieval section contains about 500,000
wds |
|
Century of Prose Corpus |
half a million wds of literary & non-literary English; 1680-1780; 120 authors. |
|
c.300,000 wds of local English letters on practical
subjects, dated 1761-89, as a sample of the English language of the
north-west of |
|
|
A 100K-word corpus of informal private letters by British writers, covering the period 1861 to 1919. (Range of dates by birth-date of writer is narrower: 1837-67.) Available from the Oxford Text Archive & through the owner (David Denison). |
|
|
c. 10 m wds of running text; a principled collection of texts drawn from Project Gutenberg & the Oxford Text Archive, divided over three 70-year sub-periods from 1710-1920. |
|
|
Corpus of Early American English |
English in |
|
|
830,000 wds; 1450-1700, from fifteen genres. |
|
ARCHER Corpus |
1.7 m wds of British & American English from written
& "speech-based" genres sampled from 7 historical periods
covering Early Modern English (range: 1650-1990); 1,037 texts; 10
registers (e.g., drama, letters, science prose) representing speech-based,
popular, & specialist/academic written registers. Contact Douglas Biber.
Complements the |
|
NEET (Network of Early Eighteenth-century English Texts) |
c. 3-million-word corpus of 18th Century English registers. No more information available, but contact Douglas Biber for more details. |
|
750,000 wds; manuscript newsletters from 1674-92. |
|
|
1m wds of English pamphlet literature covering 1640-1740. Text samples are taken from each decade within this century & several genres are represented. Contains the whole text of pamphlets, rather than fragments. |
|
|
(Under construction: 15 months from October 2003) |
1-million-word corpus which matches as closely as possible the LOB & FLOB corpora of written British English, except that the year of data collection is 1931, or near to that date (+/- 3 years). The immediate purpose of building this corpus is to make it possible to compare these three temporally equidistant corpora (1931, 1961, 1991): "Pre-LOB", LOB, & FLOB. This will enable tracking of grammatical change through a period of 60 years of the 20th century. Under construction & as yet unnamed (?) |
|
100-m wds from TIME magazine, 1923-2006. Allows you to
see how wds & phrases have increased or decreased in usage &/or
changed meaning over time. |
|
|
The Brown University Women Writers Project's main
undertaking is an SGML-encoded full-text database of pre-Victorian women's
writing in English (at present, it covers 1400 to 1850). This collection
currently includes nearly 200 texts representing a broad cross-section of the
literate culture of pre-Victorian |
|
|
Zürich Corpus of English Newspapers (ZEN) |
|
* See also the Early
Modern English Dictionaries Database (EMEDD description here)
Corpora for research on 1st language acquisition |
|
|
Child Language Data Exchange System (CHILDES) XML database here |
c. 20 m wds (180m characters), 20 languages. The CHILDES system provides tools for studying conversational interactions. These tools include a database of transcripts, programs for computer analysis of transcripts, methods for linguistic coding, & systems for linking transcripts to digitized audio & video. Includes a language acquisition bibliography. |
|
Polytechnic of |
100,000 wds spoken English by 120 children, aged 6-12;
parsed according to Hallidayan Systemic-Functional Grammar. See the manual here.
Distributed from two places: The Oxford
Text Archive, organised by Lou Burnard, & ICAME in |
|
a digitized collection of project work produced by children aged between 9 & 11; part of a larger research program (a longitudinal study of children's writing-for-learning, based on the writing of 8-12 year old children) |
|
Learner Corpora, Lingua Franca Corpora (for various languages / 2nd Lg Acquisition research)
(Language produced by non-native speakers/writers)
* See Yukio Tono's Learner Corpora Resources web page for a more comprehensive index to learner corpora web sites (e.g. the various ICLE projects for learner English, such as SWICLE (Sweden), BRICLE (Brazil) & PICLE (Poland)), plus a useful bibliography on learner corpora |
|
|
Over 2 m wds of writing by advanced/university learners of English (EFL, not ESL) from 19 different mother tongue backgrounds (e.g. Brazilian Portuguese, Chinese (which dialect?), Czech, Dutch, Finnish, French, German, Japanese, Polish, Spanish & Swedish). *NEW: The ICLE corpus is now available for purchase (CD-ROM, version 1.1.) from i6doc.com here |
|
|
Not
generally available. A large collection of examples of English writing
from learners of English all over the world; over 15 m wds & expanding
all the time; part of the Cambridge International Corpus (CIC); comes from
anonymised exam scripts written by students taking Cambridge ESOL English
exams around the world; each script is coded with information about the
student's first language, nationality, level of English, age, etc. Currently,
it can only be used by authors & writers working for |
|
|
Chinese
Learner English Corpus (CLEC) |
one m wds of
English compositions collected from 5 different levels of Chinese learners of
English, tagged according to an error tagging scheme of 61 types of error
(excludes stylistic errors & error sources, which are difficult to tag
objectively & consistently). CLEC consists of a book & a CD-ROM. The
main body of the book has an introduction (in Chinese) which gives an account
of the corpus design, the methodology used in the statistical analysis of the
corpus, and the major findings, + an Alphabetical List, a Lemmatized List, a
Word-Frequency Distribution, a Summary Table of Errors, & a List of
Spelling Mistakes. The CD-ROM consists of the error-tagged corpus with a
simple concordancer, & all the lists & tables of the book. Another
companion to CLEC known as Analysis of Chinese Learner Errors in English is
forthcoming. Available by mail: |
|
(English
|
a web-based
corpus of c. 2 m wds of unrestricted running text of Eng written by
learners in |
|
a computerized archive of the spontaneous second language acquisition of forty adult immigrant workers living in Western Europe, & their communication with native speakers in the respective host countries (France, Germany, Great Britain, The Netherlands & Sweden). For each target language, two source languages were selected. |
|
|
ELFA |
recordings
& transcripts of English used as a lingua franca in academic
settings ( |
|
a corpus of French as a foreign language, with a target size of 450,000 wds. |
|
|
|
the biggest
corpus of Chinese (Cantonese) learners of English (or, indeed, of any single
group of learners of English). 25 m wds, with grammatical &
discourse-feature tags. Texts consist of written undergrad assignments
& 'A-level' scripts. Contact: Gregory James, Language Centre, |
|
Hungarian university students' English |
|
|
(Interactive Spoken Language Education) |
[not really a "corpus" as such]; database of non-native English created to help train & test the ISLE automatic pronunciation tutor system; approx. 20 minutes of speech (per speaker) from 23 German & 23 Italian intermediate learners of E |