Littera Deusto

Modern Languages, Basque Studies and Humanities

Information extraction (2nd Questionnaire)

May 8th, 2009

As Jim Cowie and Yorick Wilks said in one article, “Information Extraction (IE) is the name given to any process which selectively structures and combines data which is found, explicitly stated or implied, in one or more texts”. We should add that Information Extraction is a technology based on analyzing natural language: when a fact about a topic is extracted from a document, it is automatically entered into a database. Computational linguistics techniques play an important role in IE because IE is, in a way, interested in the structure of the text, unlike Information Retrieval (IR), which treats texts as “bags of words”.
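To make the contrast with “bags of words” concrete, here is a minimal sketch in Python. The two sentences, the function names and the single "X hired Y" pattern are all invented for illustration; a real IE system would rely on proper linguistic analysis rather than one regular expression.

```python
import re
from collections import Counter

def bag_of_words(text):
    """IR-style view: count the words and discard their order."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def extract_hiring(text):
    """IE-style view (toy pattern): who hired whom, kept as a structured record."""
    match = re.search(r"(\w+) hired (\w+)", text)
    if match is None:
        return None
    return {"employer": match.group(1), "employee": match.group(2)}

first = "Alice hired Bob"
second = "Bob hired Alice"

# Both sentences contain exactly the same words, so their bags are identical.
print(bag_of_words(first) == bag_of_words(second))

# The structured records differ, because the word order carries the meaning.
print(extract_hiring(first))
print(extract_hiring(second))
```

The bag-of-words view cannot tell the two sentences apart, while the extracted records can, and that difference is what makes the result suitable for storing as a fact.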

When users enter a word or sentence, they get only the specific information they are interested in, after the text has been analyzed. So, instead of the documents that Information Retrieval offers, we get just the information we need. That information has probably been taken from a collection of documents, but it has been summarized.

IE is becoming more and more important, because the amount of information available on the internet grows every day. People can reach that information more easily when, among other things, the data is marked up with XML tags. And it is not only individuals who turn to IE: groups also use it to summarize medical documents or build medical and biomedical ontologies.

These are the most common subtasks of IE:

  • Named Entity Recognition: recognition of entity names (for people and organizations), place names, temporal expressions, and certain types of numerical expressions.
  • Coreference: identification of chains of noun phrases that refer to the same object. For example, anaphora is a type of coreference.
  • Terminology extraction: finding the relevant terms for a given corpus.
  • Relationship Extraction: identification of relations between entities.
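Two of the subtasks above, Named Entity Recognition and Relationship Extraction, can be sketched in a few lines of Python. This is only a toy under stated assumptions: the name lists, the sample sentence, the date and money patterns and the "works for" relation are all hypothetical, and coreference and terminology extraction are not covered.

```python
import re

# Hypothetical toy name lists; a real system would use much larger
# resources or a trained model.
ORGANIZATIONS = {"Acme Corp", "Globex"}
PEOPLE = {"Jane Smith"}

MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")
DATE = re.compile(r"\b(?:" + MONTHS + r") \d{1,2}, \d{4}\b")
MONEY = re.compile(r"\$\d+(?:\.\d+)? (?:million|billion)")

def recognize_entities(text):
    """Return (label, value) pairs in the order they appear in the text."""
    found = []
    for label, names in (("ORGANIZATION", ORGANIZATIONS), ("PERSON", PEOPLE)):
        for name in names:
            for match in re.finditer(re.escape(name), text):
                found.append((match.start(), label, match.group()))
    for label, pattern in (("DATE", DATE), ("MONEY", MONEY)):
        for match in pattern.finditer(text):
            found.append((match.start(), label, match.group()))
    return [(label, value) for _, label, value in sorted(found)]

def extract_works_for(text):
    """Toy relation extraction: find 'PERSON works for ORGANIZATION'."""
    relations = []
    for person in sorted(PEOPLE):
        for org in sorted(ORGANIZATIONS):
            pattern = re.escape(person) + r" works for " + re.escape(org)
            if re.search(pattern, text):
                relations.append((person, "works_for", org))
    return relations

text = ("Jane Smith works for Acme Corp. "
        "On May 8, 2009 Acme Corp paid $2.5 million to Globex.")

print(recognize_entities(text))
print(extract_works_for(text))
```

Looking names up in a list is the simplest possible approach and fails on any name that is not listed, which is one reason why real systems lean on linguistic analysis instead.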

IE hasn’t reached the market yet, but it could become a great help to industries of all kinds. Yorick Wilks and Jim Cowie give this example: “finance companies want to know facts of the following sort and on a large scale: what company take-overs happened in a given time span; they want widely scattered text information reduced to a simple data base”.
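The take-over example in that quotation can be sketched as follows, using Python's standard `sqlite3` module. The sentences, company names, table layout and the single "acquired" pattern are invented for illustration, and the dates are assumed to be already in ISO format so that they can be compared as plain text.

```python
import re
import sqlite3

# Hypothetical news sentences, each starting with an ISO date.
SENTENCES = [
    "2008-03-14: Acme Corp acquired Globex.",
    "2009-01-20: Initech acquired Umbrella Ltd.",
    "2009-02-02: Initech opened a new office.",
]

# Toy pattern: "<date>: <buyer> acquired <target>."
TAKEOVER = re.compile(r"^(\d{4}-\d{2}-\d{2}): (.+?) acquired (.+?)\.$")

def build_database(sentences):
    """Reduce the matching sentences to rows in an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE takeovers (date TEXT, buyer TEXT, target TEXT)")
    for sentence in sentences:
        match = TAKEOVER.match(sentence)
        if match:
            conn.execute("INSERT INTO takeovers VALUES (?, ?, ?)", match.groups())
    return conn

def takeovers_between(conn, start, end):
    """Answer the question: which take-overs happened in a given time span?"""
    cursor = conn.execute(
        "SELECT date, buyer, target FROM takeovers "
        "WHERE date BETWEEN ? AND ? ORDER BY date",
        (start, end),
    )
    return cursor.fetchall()

conn = build_database(SENTENCES)
print(takeovers_between(conn, "2009-01-01", "2009-12-31"))
```

Once the facts are stored as rows, the question about a time span becomes an ordinary database query, and the sentence about the new office is simply left out because it matches no pattern.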
