Littera Deusto

Modern Languages, Basque Studies and Humanities

Recognition and Synthesis of Voice (Questionnaire 2)

April 12th, 2009

Recognition and Synthesis of Voice is also known as Speech Synthesis, Telephony Speech Recognition, Spoken Language Understanding, Speech Technology, ASR, Speech Recognition, Desktop Speech Recognition, Voice Biometrics, Speech Processing, Voice Processing, Voice Synthesis, Automated Speech Recognition, and finally, Voice ID.

A definition would be the following: speech recognition is the ability of a machine or program to identify words and phrases in spoken language and convert them to a machine-readable format. Rudimentary speech recognition software has a limited vocabulary of words and phrases and may only identify these if they are spoken very clearly. More sophisticated software has the ability to accept natural speech.

Speech recognition applications include call routing, speech-to-text, voice dialing and voice search.

The terms “speech recognition” and “voice recognition” are sometimes used interchangeably. However, the two terms mean different things. Speech recognition is used to identify words in spoken language. Voice recognition is a biometric technology used to identify a particular individual’s voice.

A text-to-speech system (or “engine”) is composed of two parts: a front-end and a back-end. The front-end has two major tasks. First, it converts raw text containing symbols like numbers and abbreviations into the equivalent of written-out words. This process is often called text normalization, pre-processing, or tokenization. The front-end then assigns phonetic transcriptions to each word, and divides and marks the text into prosodic units, like phrases, clauses, and sentences.

The process of assigning phonetic transcriptions to words is called text-to-phoneme or grapheme-to-phoneme conversion. Phonetic transcriptions and prosody information together make up the symbolic linguistic representation that is output by the front-end. The back-end—often referred to as the synthesizer—then converts the symbolic linguistic representation into sound.
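The two front-end tasks described above can be illustrated with a minimal sketch. It assumes a toy abbreviation table and a toy pronunciation lexicon with ARPAbet-style symbols; the names `ABBREVIATIONS`, `LEXICON`, `normalize` and `to_phonemes` are hypothetical and do not come from any particular TTS engine. Prosodic marking and the back-end synthesizer are left out.

```python
import re

# Toy tables, invented for illustration only; a real engine ships far larger ones.
ABBREVIATIONS = {"dr.": "doctor", "st.": "street", "no.": "number"}
DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]
LEXICON = {
    "doctor": ["D", "AA", "K", "T", "ER"],
    "smith": ["S", "M", "IH", "TH"],
    "lives": ["L", "IH", "V", "Z"],
    "at": ["AE", "T"],
    "four": ["F", "AO", "R"],
    "two": ["T", "UW"],
}


def normalize(text):
    """Text normalization: expand abbreviations and read digits one by one."""
    words = []
    for token in text.lower().split():
        if token in ABBREVIATIONS:
            words.append(ABBREVIATIONS[token])
        elif token.isascii() and token.isdigit():
            words.extend(DIGITS[int(ch)] for ch in token)
        else:
            # Strip punctuation, keeping letters and apostrophes.
            words.append(re.sub(r"[^a-z']", "", token))
    return [w for w in words if w]


def to_phonemes(words):
    """Grapheme-to-phoneme step as a plain lexicon lookup.

    Unknown words are flagged with "<unk>" rather than guessed.
    """
    return [(w, LEXICON.get(w, ["<unk>"])) for w in words]


words = normalize("Dr. Smith lives at 42")
transcription = to_phonemes(words)
```

Reading "42" digit by digit is a simplification chosen to keep the sketch short; a fuller normalizer would decide between "forty-two", "four two" and other readings from context.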

The speech capabilities that can be added to an application are text-to-speech synthesis (TTS) and speech recognition (SR).

Text-to-speech synthesis involves turning a string into spoken language that is played through the computer speakers. The complexities of turning words into phonemes, adding appropriate emphasis and translating the result into digital audio are beyond the scope of this paper and are catered for by a TTS engine installed on your machine.

The end result is that the computer talks to the user, saving the user from having to read some text on the screen.

Speech recognition involves the computer taking the user’s speech and interpreting what has been said. This allows the user to control the computer (or certain aspects of it) by voice, rather than having to use the mouse and keyboard, or alternatively just to dictate the contents of a document.

The complex nature of translating the raw audio into phonemes involves a lot of signal processing and is not focused on here. These details are taken care of by an SR engine that will be installed on your machine. SR engines are often called recognisers and these days typically implement continuous speech recognition (older recognisers implemented isolated or discrete speech recognition, where pauses were required between words).
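To give a feel for why discrete recognisers needed pauses, here is a rough sketch of splitting audio into word-sized segments wherever the signal stays quiet for a few frames. It assumes the audio has already been reduced to one energy value per frame; the function name `split_on_pauses` and the `threshold` and `min_pause` values are made up for illustration and are not taken from any real engine.

```python
def split_on_pauses(frame_energies, threshold=0.1, min_pause=3):
    """Return (start, end) frame index pairs, end exclusive, one per spoken segment.

    A segment ends once at least `min_pause` consecutive frames fall below
    `threshold`; shorter dips are treated as part of the same word.
    """
    segments = []
    start = None     # index where the current segment began, if any
    silent_run = 0   # consecutive quiet frames seen inside the current segment
    for i, energy in enumerate(frame_energies):
        if energy >= threshold:
            if start is None:
                start = i
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run >= min_pause:
                # Close the segment at the first quiet frame of the pause.
                segments.append((start, i - silent_run + 1))
                start = None
                silent_run = 0
    if start is not None:
        # Audio ended mid-segment; drop any trailing quiet frames.
        segments.append((start, len(frame_energies) - silent_run))
    return segments


energies = [0.0, 0.5, 0.6, 0.0, 0.0, 0.0, 0.7, 0.8, 0.0]
word_segments = split_on_pauses(energies)
```

With continuous speech the quiet gaps between words largely disappear, so a pause detector like this would return one long segment, which is why continuous recognisers have to find word boundaries by other means.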

Speech recognition usually means one of two things. The application can understand and follow simple commands that it has been taught in advance. This is known as command and control (sometimes seen abbreviated as CnC, or simply SR).
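A minimal sketch of the command and control idea follows. It assumes the SR engine has already returned the recognised phrase as text, and the command table and action names (`COMMANDS`, `match_command`, "OPEN" and so on) are hypothetical placeholders rather than part of any real speech API.

```python
# Fixed set of phrases the application has been taught in advance (illustrative).
COMMANDS = {
    "open file": "OPEN",
    "save file": "SAVE",
    "close window": "CLOSE",
}


def match_command(recognized_text):
    """Map a recognised phrase to an action, or None if it is not a known command."""
    # Normalise case and spacing so "  Open   FILE " matches "open file".
    phrase = " ".join(recognized_text.lower().split())
    return COMMANDS.get(phrase)


action = match_command("Open file")
```

Because the list of valid phrases is small and known up front, anything outside it can simply be rejected, which is what keeps this mode much simpler than dictation.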

Alternatively, an application can support dictation (sometimes abbreviated to DSR). Dictation is more complex, as the engine has to try to identify arbitrary spoken words and decide which spelling of similar-sounding words is required. It develops context information based on the preceding and following words to help it decide. Because this context analysis is not required with command and control recognition, CnC is sometimes referred to as context-free recognition.
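The context step in dictation can be sketched as choosing between similar-sounding spellings by looking at the words on either side. The word-pair counts below are invented for illustration, and `choose_spelling` is a hypothetical helper; real dictation engines rely on much richer language models than this.

```python
# Invented counts of how often one word follows another (illustration only).
BIGRAM_COUNTS = {
    ("over", "there"): 5,
    ("there", "is"): 6,
    ("their", "house"): 4,
    ("go", "to"): 6,
    ("to", "school"): 5,
    ("two", "cats"): 4,
    ("too", "late"): 5,
}

# Groups of words that sound alike but are spelled differently.
HOMOPHONE_SETS = [{"their", "there"}, {"to", "two", "too"}]


def candidates(word):
    """Return every spelling that sounds like `word`, or just the word itself."""
    for group in HOMOPHONE_SETS:
        if word in group:
            return sorted(group)
    return [word]


def choose_spelling(prev_word, heard, next_word):
    """Pick the spelling that best fits the preceding and following words."""
    def score(candidate):
        return (BIGRAM_COUNTS.get((prev_word, candidate), 0)
                + BIGRAM_COUNTS.get((candidate, next_word), 0))
    return max(candidates(heard), key=score)


best = choose_spelling("painted", "there", "house")
```

In this sketch the engine "hears" something that could be "their" or "there", and the following word "house" tips the choice towards "their". A command and control recogniser never needs this step, since each valid phrase is fixed in advance.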
