Littera Deusto

Modern Languages, Basque Studies and Humanities

What is a corpus?

May 16th, 2009 · No Comments

In linguistics, a corpus (plural: corpora) is a large, structured set of texts, now usually stored and processed electronically. Corpora are used for statistical analysis and hypothesis testing, such as counting occurrences or validating linguistic rules within a specific universe of texts. A corpus also provides multiple examples of each word in different contexts, and each example is typically labelled with a code that identifies its source, for example the release date and the name of the magazine in which the sentence was published.
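To make the idea of coded examples concrete, here is a minimal sketch in Python. The `Example` class, the three sentences, the magazine names and the dates are all invented for illustration; a real corpus would be far larger and would use its own coding scheme.

```python
import re
from dataclasses import dataclass


@dataclass
class Example:
    text: str    # the sentence itself
    source: str  # hypothetical magazine name
    date: str    # hypothetical release date


# A toy corpus: every sentence carries its source code (magazine + date).
CORPUS = [
    Example("The bank raised interest rates again.", "Magazine A", "2008-03-01"),
    Example("We walked along the river bank.", "Magazine B", "2008-07-15"),
    Example("She opened an account at the bank.", "Magazine A", "2009-01-20"),
]


def occurrences(word, corpus):
    """Return every example in which `word` appears as a whole word."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    return [ex for ex in corpus if pattern.search(ex.text)]


if __name__ == "__main__":
    for ex in occurrences("bank", CORPUS):
        print(f"[{ex.source}, {ex.date}] {ex.text}")
```

Looking up "bank" returns the word in both its financial and its riverside sense, each with the magazine and date it came from, which is the kind of evidence a linguist would use to check a hypothesis about usage.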

A corpus may contain texts in a single language (monolingual corpus) or text data in multiple languages (multilingual corpus). Multilingual corpora that have been specially formatted for side-by-side comparison are called aligned parallel corpora.

Corpora are the main knowledge base in corpus linguistics. The analysis and processing of various types of corpora are also the subject of much work in computational linguistics, speech recognition and machine translation, where they are often used to create hidden Markov models for part-of-speech (POS) tagging and other purposes. Corpora and the frequency lists derived from them are also useful for language teaching.
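A frequency list is simply a count of how often each word occurs in the corpus, sorted from most to least frequent. The sketch below shows one simple way to build one in Python; the function name `frequency_list` and the very rough tokenisation rule (lowercase letters and apostrophes only) are assumptions made for this example, not a standard method.

```python
import re
from collections import Counter


def frequency_list(texts):
    """Count word occurrences across all texts, most frequent first."""
    counts = Counter()
    for text in texts:
        # Rough tokenisation: lowercase the text, keep runs of letters/apostrophes.
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts.most_common()


if __name__ == "__main__":
    sample = ["The cat sat on the mat.", "The dog sat."]
    for word, count in frequency_list(sample):
        print(word, count)
```

A teacher could use a list like this to decide which words to introduce first, since the most frequent words are the ones learners will meet most often.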

If you want to find out more about English corpora, or find links to well-known English corpora, visit our wiki page: http://wiki.littera.deusto.es/en/index.php/Lr0809/I
