While most research treats lip-reading as an independent process, in other cases it has been combined with a voice-recognition system in order to improve it. In one such approach, the first ten letters of the English alphabet are each treated as a single word.
During training, images of a person saying each letter several times are recorded [1]. The images are aligned with one another using a correlation measure, then matched against each of the possible letters, and the energy preserved by each match is measured. The letter whose template preserves the most energy is taken as the letter pronounced in the new sequence.
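The matching step above can be sketched as a template classifier: score a test image sequence against each letter's template and keep the letter with the highest score. This is a minimal sketch, assuming normalized cross-correlation as the "preserved energy" measure and equal-shaped, pre-aligned sequences; the function names are illustrative, not from the cited work.

```python
import numpy as np

def normalized_correlation(seq_a, seq_b):
    """Normalized cross-correlation between two equal-shape image sequences."""
    a = seq_a.ravel().astype(float)
    b = seq_b.ravel().astype(float)
    a -= a.mean()
    b -= b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def classify_letter(test_seq, templates):
    """Return the letter whose template best matches (highest correlation
    with) the test sequence -- the one that 'preserves the most energy'."""
    scores = {letter: normalized_correlation(test_seq, tpl)
              for letter, tpl in templates.items()}
    return max(scores, key=scores.get)
```

In practice the alignment step (registering the mouth images to each other) would precede this scoring, and the templates would be averages over the several recorded repetitions of each letter.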
Another very interesting case of lip-reading, given by Bregler et al. [2], is known as 'the bartender problem'. A customer chooses between four drinks, but because of the noise in the bar the bartender can only lip-read the order. The system was then trained for each of the drinks using a Hidden Markov Model (HMM). The error rate was minimal, with only one error per 22 test utterances (4.5%).
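The per-drink HMM setup can be illustrated as follows: train one model per drink, then classify an utterance by the model under which it is most likely. This is a minimal sketch, assuming discrete (quantized) visual observations and hand-set model parameters; a real system would fit the parameters with Baum-Welch, and the drink names and symbol alphabet here are hypothetical.

```python
import numpy as np

def log_likelihood(obs, pi, A, B):
    """Scaled forward algorithm: log P(obs | HMM) for a discrete HMM.
    pi: initial state probs, A: state transitions, B: emission probs."""
    alpha = pi * B[:, obs[0]]
    scale = alpha.sum()
    log_p = np.log(scale)
    alpha = alpha / scale
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
        scale = alpha.sum()
        log_p += np.log(scale)
        alpha = alpha / scale
    return log_p

def classify_utterance(obs, models):
    """Pick the drink whose HMM gives the observation the highest likelihood."""
    return max(models, key=lambda name: log_likelihood(obs, *models[name]))
```

One HMM per word (here, per drink) with maximum-likelihood decision is the standard small-vocabulary recognition scheme the cited work builds on.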
Another approach was developed by Duchnowski et al. [3]. Here the man-machine interaction was kept simple: a person sitting in front of the system makes an utterance, which is processed by a multi-state time-delay artificial neural network with three layers and 15 units in the hidden layer. With this combined audio-visual input they were able to reduce the error rate by 20-50% over acoustic processing alone, across various signal-to-noise conditions.
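The defining feature of a time-delay network is that each output frame sees a window of current and past input frames. The sketch below, a simplification under assumed shapes (it is not the cited architecture, and the weights are random stand-ins for trained ones), shows the time-delay windowing followed by a forward pass through one 15-unit hidden layer.

```python
import numpy as np

def time_delay_windows(frames, delay=2):
    """Stack each frame with its `delay` predecessors, forming the
    sliding temporal context a TDNN layer operates on."""
    T, d = frames.shape
    return np.stack([frames[t - delay:t + 1].ravel()
                     for t in range(delay, T)])

def tdnn_forward(x, w_hidden, w_out):
    """Forward pass: input -> 15-unit tanh hidden layer -> sigmoid outputs."""
    h = np.tanh(x @ w_hidden)
    return 1.0 / (1.0 + np.exp(-(h @ w_out)))
```

In the audio-visual setting, the acoustic and visual feature streams would each pass through such windowed layers before their scores are combined.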
Another technique records the facial muscle activity of the user making the utterances; this came to be known as speech recognition based on surface electromyography (EMG) signals, by Kumar et al. [4]. However, it requires electrodes to be mounted on the speaker's face.
In [6] a speech recognition method based on visual data is given, which is less intrusive and hence more practical.
This system, called an "image-input microphone" [6], analyses mouth dimensions such as width and height in order to identify the input, and computes the corresponding vocal-tract transfer function.
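The geometric measurement step can be illustrated with a short sketch. Assuming a binary lip-region mask has already been segmented from the camera image (the segmentation itself is not shown, and this stands in for, rather than reproduces, the cited system's measurement), the mouth width and height are just the extent of that region:

```python
import numpy as np

def mouth_dimensions(lip_mask):
    """Width and height (in pixels) of the mouth region in a binary mask.
    These two dimensions are the geometric features the image-input
    microphone maps to a vocal-tract transfer function."""
    ys, xs = np.nonzero(lip_mask)
    if xs.size == 0:
        return 0, 0
    return int(xs.max() - xs.min() + 1), int(ys.max() - ys.min() + 1)
```

A per-frame sequence of such (width, height) pairs then parameterizes the estimated vocal-tract shape over time.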