Active Speaker Detection

The pipeline detects whether the voice in the audio track aligns with the video of the person speaking. It analyses audio and visual features continuously across the video to predict whether the sound matches the person on screen.

The system measures the synchronisation between the audio and visual elements and flags discrepancies that may indicate deepfake content.

Input Data

  • Video file (MP4, MOV, AVI, etc.)

Output Data

  • code - Status code of the check:
    • 0 - the check completed successfully.
    • 1 - the person is not speaking.
    • 2 - no face is present.
  • description - Description of the processed video.
  • result - Percentage of video frames in which audio and video are properly synchronized (0-100%).
  • score - Confidence level indicating the probability that the video is synchronous (0-100%).
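
To make the meaning of the result field concrete, here is a minimal sketch of how a frame-level percentage of this kind could be computed. It is an illustration only, not the pipeline's actual implementation: the function name sync_percentage and the per-frame boolean flags are assumptions introduced for this example.

```python
def sync_percentage(frame_flags):
    """Return the share of frames flagged as synchronized, as a percentage.

    frame_flags is a hypothetical list of booleans, one per video frame,
    where True means audio and video were judged to be in sync.
    """
    if not frame_flags:
        raise ValueError("no frames to evaluate")
    synced = sum(1 for flag in frame_flags if flag)
    # Round to two decimals to match the precision shown in the response.
    return round(100.0 * synced / len(frame_flags), 2)


# Three of four frames in sync
print(sync_percentage([True, True, True, False]))  # 75.0
```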

JSON Response Example

"Active Speaker Detection": {
    "code": 0,
    "description": "Successful check",
    "result": 96.32,
    "score": 100
}
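
A client might read this response along the following lines. This is a sketch under stated assumptions: it assumes the fragment above arrives wrapped in an enclosing JSON object, and the names CODE_MEANINGS and summarize are hypothetical helpers, not part of the service. Because the documentation does not say whether result and score are present for non-zero codes, the sketch reads them only when code is 0.

```python
import json

# Meanings of the status codes as listed under Output Data.
CODE_MEANINGS = {
    0: "successful check",
    1: "person is not speaking",
    2: "no face is present",
}


def summarize(response_text):
    """Turn an Active Speaker Detection response into a one-line summary."""
    payload = json.loads(response_text)
    check = payload["Active Speaker Detection"]
    code = check["code"]
    if code != 0:
        # Do not assume result/score exist when the check did not succeed.
        meaning = CODE_MEANINGS.get(code, "unknown code")
        return f"Check not completed: {meaning}"
    return f"{check['result']}% of frames in sync, score {check['score']}%"


# Assumed wrapping: the documented fragment inside an outer JSON object.
example = (
    '{"Active Speaker Detection": {"code": 0, '
    '"description": "Successful check", "result": 96.32, "score": 100}}'
)
print(summarize(example))  # 96.32% of frames in sync, score 100%
```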