Visual Speech Recognition
The pipeline reads text from the lips of the person speaking in a video. By extracting facial features from the video file, it predicts the text the person is saying. The service takes a prompted text as input, compares it with the text spoken in the video, and returns the spoken text together with a similarity score.
Input Data
- Video file (MP4, MOV, AVI, etc.)
Output Data
code
- 0: the successful result.
- 1: the texts do not match.
- 2: no face is present.
description
- The description of the processed video.
result
- The text recognized from the video file.
similarity
- The similarity score between the prompted and spoken texts, based on the normalized Levenshtein distance.
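A minimal sketch of how such a similarity score can be computed, assuming the common convention that the Levenshtein edit distance is normalized by the length of the longer string and subtracted from 1 (so 1.0 means identical texts); the exact normalization used by the service is not specified here:

```python
def levenshtein_distance(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance (two-row variant).
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def similarity(prompted: str, spoken: str) -> float:
    # Normalize by the longer string's length: 1.0 for identical
    # texts, 0.0 when every character would have to change.
    if not prompted and not spoken:
        return 1.0
    dist = levenshtein_distance(prompted, spoken)
    return 1.0 - dist / max(len(prompted), len(spoken))
```

With this convention, a single-character mismatch in an 11-character phrase such as "HELLO WORLD" yields a score close to the 0.9 shown in the response example below.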
JSON Response Example
{
  "Visual Speech Recognition": {
    "code": 0,
    "description": "Successful check",
    "result": "HELLO WORLD",
    "similarity": 0.9
  }
}
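A short sketch of how a client might parse this response and interpret the result code; the field names come from the example above, while the code-to-meaning mapping follows the Output Data section:

```python
import json

# Result codes as documented in the Output Data section.
RESPONSE_CODES = {
    0: "successful result",
    1: "texts do not match",
    2: "no face is present",
}

raw = ('{"Visual Speech Recognition": {"code": 0, '
       '"description": "Successful check", '
       '"result": "HELLO WORLD", "similarity": 0.9}}')

payload = json.loads(raw)["Visual Speech Recognition"]
status = RESPONSE_CODES[payload["code"]]

print(status)                                     # "successful result"
print(payload["result"], payload["similarity"])   # spoken text and score
```

A client would typically accept the spoken text only when `code` is 0 and the similarity exceeds an application-chosen threshold.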