Bookmarks for Corpus-based Linguists 

 

Help with File Formats

If you've downloaded a file that you don't know what to do with, here are some pointers:

Zipped/compressed files

(files ending in .ZIP or .gz or .tar)

Use 7-Zip (freeware) or WinZip (shareware) to uncompress them.

Postscript files (.ps)

These may be previewed or read on screen before printing (or instead of printing) using Ghostview together with the Ghostscript interpreter. [Get latest version of Ghostview + Ghostscript]

Adobe Acrobat (.PDF) format

Get Adobe Acrobat Reader (freeware) to view the files.

Microsoft Word format (files ending in .DOC) or PowerPoint files (.PPT)

If you don't have a particular version of Word or PowerPoint, you can get either the converters/filters or the free viewer programs from the Microsoft web site: [Get Word 97/2000 viewer] [Get PowerPoint 2000 Viewer]

 

A mini-Glossary of terms used by corpus-based linguists

Unless otherwise referenced, these are my own quick-and-dirty definitions. Please do not reproduce any of this without permission.

Corpus

(plural = corpora, or, if you want to be different, corpuses)

"a collection of pieces of language [texts] that are selected and ordered according to explicit linguistic criteria in order to be used as a [representative] sample of the language" (taken from EAGLES 1996:4)

A corpus can be synchronic (closed), presenting a snapshot of the language of a particular period, or it can be a monitor corpus (e.g., the Bank of English), where new material is added on a continual basis.

Concordance

A formatted/ordered listing of all the instances of a search item (word/phrase) in a corpus (usually in a "KWIC" format with the search item in the centre of the page or screen). Hence we talk about concordance lines (individual search 'hits') produced by concordancers (the software).

KWIC/KIIC

Keyword/Key Item in Context: a display format showing the search item (word/phrase) plus the surrounding words to the left and right of it.

Useful for examining how a word/phrase is used in real samples of language (embedded in real-life co-texts, genres, and social relationships/contexts); helps analysts discover regularities or patterns governing the usage of the item/word/phrase.
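As a rough illustration of the KWIC format described above, here is a minimal Python sketch of a concordancer. The function name `kwic`, the `width` parameter and the three-sentence sample text are all invented for this example; real concordancers work on tokenised corpora and offer sorting, which this sketch does not attempt.

```python
import re

def kwic(text, keyword, width=30):
    """Return one KWIC line per hit: left context, [keyword], right context."""
    lines = []
    # Whole-word, case-insensitive match of the search item.
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so every hit lines up in one column.
        lines.append(f"{left:>{width}} [{match.group()}] {right}")
    return lines

# Invented sample text, for illustration only.
sample = ("The Queen's silver jubilee was in 1977. "
          "Take the Jubilee line to Westminster. "
          "A golden jubilee marks fifty years.")

for line in kwic(sample, "jubilee", width=20):
    print(line)
```

Each printed line is one concordance line, with the search item in square brackets at the same horizontal position so that the left and right co-texts can be scanned quickly.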

Collocation

The phenomenon, tendency or specific instance of words/lexical items habitually co-occurring close to one another (i.e. the greater-than-chance co-selection of words revealing the language habits of native speakers). For example, if you look up the word jubilee in a large collection of (British) English texts you will tend to find the following words (the collocates) nearby: silver, diamond, golden, Queen's, and line (the 'Jubilee Line' is one of the London Underground subway routes).

In language teaching, learners are shown or taught collocations in order to help them speak and write natural-sounding ('idiomatic') language. Some nouns, for example, have very strong verb collocates: e.g. conclusions are drawn/reached but not made [ = noun-verb collocation].

The term 'collocation' is very broad and allows varying degrees of collocability (or collocational strength), which can be measured with several statistical formulae (e.g. log-likelihood, mutual information).
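To make one of these measures concrete, here is a minimal Python sketch of a mutual information (MI) score in its simplest form: log2 of (observed co-occurrences × corpus size) divided by (frequency of the node × frequency of the collocate). The function name `mi_scores`, the `span` parameter and the toy token list are assumptions made for this example; some tools also adjust the formula for the size of the span, which this sketch does not.

```python
import math
from collections import Counter

def mi_scores(tokens, node, span=2):
    """MI score for every word found within `span` tokens of `node`."""
    n = len(tokens)
    freq = Counter(tokens)
    co = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        # Words up to `span` positions to the left and right of the node.
        window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
        co.update(window)
    # MI = log2( co-occurrences * N / (f(node) * f(collocate)) )
    return {w: math.log2(c * n / (freq[node] * freq[w])) for w, c in co.items()}

# Toy data, invented for illustration; far too small for real conclusions.
tokens = "the silver jubilee and the golden jubilee and the jubilee line".split()
scores = mi_scores(tokens, "jubilee", span=1)
for word, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{word:10} {score:.2f}")
```

In this toy data 'silver' scores higher than 'the', because 'the' is frequent everywhere while 'silver' occurs only next to 'jubilee'. MI is known to favour rare words, which is one reason analysts often report it alongside log-likelihood.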

At one extreme of the scale, collocations which are totally predictable are usually analysed as 'idioms', 'cliches', 'fixed expressions', 'lexical bundles', etc. At the other extreme, items which co-occur significantly in statistical terms may not be recognised as predictable collocations by native speakers, i.e. the collocational regularity or statistical co-occurrence is there, but it may not have any psycholinguistic reality for native speakers.

Some abstract patterns of meaning resulting from collocations (whether intuitive or not) form a system called 'semantic prosody' (a systematic connotational 'colouring' of a word or phrase that arises from its collocational patterning into one or more semantic sets).

ASCII text format

"American Standard Code for Information Interchange" = a 7-bit plain text encoding with 128 codes (95 printable characters plus control codes) = plain vanilla Anglocentric text format, based on Roman/English orthography, essentially consisting of everything you can see on an ordinary US computer keyboard: letters (A-Z, a-z), digits (0-9), punctuation marks, plus a few miscellaneous symbols ($ % @ # ~ * & _ + - ( ) < > { } | \ ^ etc.). No "exotic/foreign" (non-English) characters are included, not even letters with the accent marks (diacritics) used in French and Spanish (e.g., è é ê ñ); those appear only in 8-bit extensions such as ISO 8859-1 (Latin-1), often loosely called 'extended ASCII', and in Unicode.

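A quick way to check whether a text is plain ASCII is to look for any character whose code is above 127. The Python sketch below does this; the helper name `non_ascii_chars` is made up for the example, while `str.isascii()` is a built-in string method (Python 3.7 and later).

```python
def non_ascii_chars(text):
    """Return the characters in `text` that fall outside 7-bit ASCII (codes 0-127)."""
    return [ch for ch in text if ord(ch) > 127]

print("corpus".isascii())                     # True
# \u00e9 is e-acute and \u00f1 is n-tilde: neither is part of ASCII.
print("caf\u00e9".isascii())                  # False
print(non_ascii_chars("caf\u00e9 se\u00f1or"))  # ['é', 'ñ']
```

A check like this is useful before feeding a text to older corpus tools that assume ASCII input.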
Mark-up (or markup)

versus

Annotation

(Some academics don't make a distinction between the terms, but I do because I think it's a useful one...)

 

Mark-up = tags (added character strings) used to code the structural or surface format/renditional attributes of a text (e.g., headings, sections, page breaks, sentences, bold/italics, speaker ID, speaker turns, pauses), OR non-interpreted aspects of the situated context of the discourse (e.g. bibliographical or demographic details about the author or speaker, location of speech event, genre, etc., and also gestures, laughter, voice quality, and events such as "writes on blackboard"). In HTML/SGML/XML (mark-up languages, or metalanguages), tags are enclosed in angle brackets.

 

Annotation = a subset of mark-up; tags (added character strings) used to code 'value-added' or interpreted information, derived through analysis by humans or machines; usually added for research purposes. The most common annotations are part-of-speech (POS) tags, lemmas, semantic tags, and discourse-level/pragmatic tags.

 

Marked-up/annotated texts are designed for computational tractability, and not meant to be read “raw”. They can, however, be rendered for human consumption with the right software/user interface.
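The distinction can be seen in a small XML fragment read with Python's standard `xml.etree.ElementTree` module. The element and attribute names below (`u`, `who`, `w`, `pos`, `lemma`, `pause`) and the tag values are hypothetical, invented for this sketch and not taken from any particular corpus scheme.

```python
import xml.etree.ElementTree as ET

# Hypothetical marked-up and annotated utterance, for illustration only.
sample = """
<u who="S1">
  <w pos="PRON" lemma="I">I</w>
  <w pos="VERB" lemma="GO">went</w>
  <w pos="ADV" lemma="HOME">home</w>
  <pause/>
</u>
""".strip()

root = ET.fromstring(sample)

# Mark-up: structural, non-interpreted information (speaker ID, pause).
speaker = root.get("who")
has_pause = root.find("pause") is not None

# Annotation: interpreted information added by analysis (POS tag, lemma).
tagged = [(w.text, w.get("pos"), w.get("lemma")) for w in root.iter("w")]

# Rendering for human consumption: just the words, with the tags stripped out.
raw = " ".join(w.text for w in root.iter("w"))

print(speaker, has_pause)
print(tagged)
print(raw)
```

The last step is a very small version of what a user interface does when it renders a tagged corpus as readable text.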

lemma

(plural = lemmas or, less commonly, lemmata)

An abstract lexical category (usually represented by all-capitals, e.g. BLOW) consisting of a lexeme base plus its inflected forms (regular, irregular & suppletive inflections) which share the same part of speech.

For example, the verbal lemma BLOW contains the word forms blow, blows, blew, blown and blowing, while the lemma GO encompasses go, goes, went, gone, going.

Lemmas for nouns (or 'substantives') group together singular and plural forms (e.g. wolf/wolves); adjectival lemmas group together positive, comparative and superlative forms (e.g. happy, happier, happiest; good, better, best); pronominal lemmas consist of the different 'cases' of the same pronoun (e.g. I, me, my, mine).
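To show why lemmas matter for counting, here is a minimal Python sketch that groups word forms under their lemma before computing frequencies. The lookup table `FORM_TO_LEMMA` is a hand-built toy, not a real lemmatiser, and it ignores part of speech (so it cannot tell the verb blow from the noun blow), which a proper lemmatiser would need to do.

```python
from collections import Counter

# Toy lookup table built from the examples above; a stand-in for a real lemmatiser.
FORM_TO_LEMMA = {
    "blow": "BLOW", "blows": "BLOW", "blew": "BLOW",
    "blown": "BLOW", "blowing": "BLOW",
    "go": "GO", "goes": "GO", "went": "GO", "gone": "GO", "going": "GO",
    "wolf": "WOLF", "wolves": "WOLF",
}

def lemma_frequencies(tokens):
    """Count tokens by lemma; forms not in the table are simply upper-cased."""
    return Counter(FORM_TO_LEMMA.get(t.lower(), t.upper()) for t in tokens)

# Invented sample sentence, for illustration only.
tokens = "She went home and he goes home but the wind blew".split()
freqs = lemma_frequencies(tokens)
print(freqs["GO"], freqs["BLOW"])
```

Here 'went' and 'goes' are counted together under GO, whereas a count of raw word forms would treat them as unrelated items.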

 



Last Updated: 12 July 2009 04:26:06
© David Lee