Transcribe Multi-lingual Video Into Subtitles With Speaker Color

Ok I want to do a few things.

I have videos that are English, Turkish or mixed.
I want a text transcript and subtitle files.
I want the subtitles colored per speaker.
I want to burn in said subtitles.

Part 1: ASR

There used to be several bad options: Sphinx, Wit.ai, and so on.
Thankfully those days are over. Whisper is the new king.

Whisper variants

I want whisper large to handle multiple languages
- Whisper large keeps adding "Altyazi MK" (a subtitle credit) to Turkish text, but we'll filter it out.
  - They really trained this thing on pirated subs eh?
Whisper.cpp works on my PC and isn't dependency hell, but no per-speaker isolation
WhisperX is a good option, but MAN did I have issues with it (see below)
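
For the Altyazi MK filter, here is a minimal sketch with sed. The exact spellings Whisper emits are an assumption (the pattern covers "Altyazi MK", "Altyazı M.K." and close variants), and strip_altyazi is just a helper name made up for illustration.

```shell
#!/bin/bash

# Hypothetical helper: remove the hallucinated subtitle credit from text on stdin.
# Assumption: the credit looks like "Altyazi MK" / "Altyazı M.K." with optional
# dots and spaces. Leading whitespace before the credit is removed as well.
strip_altyazi() {
  sed -E 's/[[:space:]]*Altyaz(ı|i)[[:space:]]*M\.?[[:space:]]*K\.?//g'
}

# Usage (file names are placeholders):
#   strip_altyazi < transcript.txt > transcript.clean.txt
```

If you run this on an .srt instead of the .txt, a cue that contained only the credit ends up with an empty text line, which most players just show as nothing.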

WhisperX

I'm using a sacrificial LXC because this is so jank
I hate python nightmares. Again: Rust, Go and even sane C++ projects are, well, sane.

Maybe non-obvious, but you will need:

python 3.11 (not 3.13)
- This is in the AUR. Don't use it, use pyenv (hence the sacrificial LXC)
downgrade your glibc
- otherwise some ctranslate2 error will pop up
- yes, it is dangerous to use a 6 month old gcc
- again, sacrificial unprivileged LXC
a huggingface account, with the terms accepted on a few models
- I iptables it later to ensure it's offline

I won't go over setup. It's hacks and jank.

Step 1 - Get the audio

Convert to Whisper's expected input: a mono, 16-bit, 16 kHz WAV file (ffmpeg writes 16-bit PCM to .wav by default).

ffmpeg -i input-video.mkv -ac 1 -ar 16000 input-audio.wav

Step 2 - run whisper

This one is annoying because you might have to try a few things.

whisperx --model large-v3-turbo --model_dir ~/.cache/whisper --output_dir out/ --task transcribe --diarize --highlight_words True --verbose True --output_format all --print_progress True --compute_type int8 --threads 24 --hf_token your-token-here input-audio.wav

I didn't find much difference between large-v2, large-v3 and large-v3-turbo. Turbo is a bit faster and newer, so I'm going with that.
You might not want highlight_words, as it can be distracting.

This generates a .txt transcript plus .srt and .vtt subtitle files, among other formats (the .srt and .vtt are nearly equivalent).

NOTE: try with a single small WAV file before you run it on *.wav!!! It batch processes, and you want one good run first to catch errors, such as with the huggingface token. Otherwise you will spend 10 hours in whisper, then fail, only to have to redo it.
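
One way to get that small test file is to cut the first minute off the audio from Step 1. This is only a sketch: make_sample and the 60 second default are made up for illustration, and the FFMPEG variable exists only so the binary can be swapped out.

```shell
#!/bin/bash

# Which ffmpeg binary to call; override via the environment if needed.
FFMPEG="${FFMPEG:-ffmpeg}"

# Hypothetical helper: copy the first N seconds of an audio file into a new file.
# -t limits the duration, -c copy avoids re-encoding the PCM audio.
make_sample() {
  local input="$1" output="$2" seconds="${3:-60}"
  "$FFMPEG" -i "$input" -t "$seconds" -c copy "$output"
}

# Usage:
#   make_sample input-audio.wav sample.wav 60
# then run the whisperx command above on sample.wav before the full batch.
```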

Part 2: Subtitles

Step 3 - color them by speaker

Now, this results in text with SPEAKER_00 and SPEAKER_01 labels. First I wanted to convert to SubStation Alpha (ASS/SSA) to do colors. Then I got bored and did .srt/.vtt directly. It's only a bit cursed.
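
Before picking colors it helps to know how many speakers the diarization actually produced. A small sketch, assuming the labels look like [SPEAKER_00] in the subtitle file; count_speakers is a hypothetical helper name.

```shell
#!/bin/bash

# Hypothetical helper: print each speaker label found in a subtitle file
# together with the number of cues it appears in.
count_speakers() {
  grep -oE '\[SPEAKER_[0-9]+\]' "$1" | sort | uniq -c | awk '{print $2, $1}'
}

# Usage:
#   count_speakers out/input-audio.srt
```

If it lists more speakers than you have colors, add more colors or expect some to be shared.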

Here is a bash script to take in the subtitle and add color tags.

NOTE: use font tags, not span as some sources say. Also, I did the SRT and not the VTT, since there seem to be some issues with tags in VTT and limits to what is supported. Idk much, man.

#!/bin/bash

# Define an array of colors for each speaker
COLORS=("yellow" "cyan" "green" "magenta" "orange" "red" "blue" "white")

# Input file (VTT or SRT)
INPUT_FILE="$1"

# Output file (with colors added)
OUTPUT_FILE="${INPUT_FILE%.*}_colored.${INPUT_FILE##*.}"

# Check if input file exists
if [[ ! -f "$INPUT_FILE" ]]; then
  echo "Error: Input file '$INPUT_FILE' not found."
  exit 1
fi

# Process the file
awk -v colors="${COLORS[*]}" '
BEGIN {
  # Split the colors into an array and remember how many there are
  n = split(colors, colorArray, " ")
}

{
  # Check if the line contains a speaker label such as [SPEAKER_00]:
  if (match($0, /\[SPEAKER_[0-9]+\]:/)) {
    # Extract only the digits: skip "[SPEAKER_" (9 chars), drop "]:" (2 chars)
    speaker_num = substr($0, RSTART + 9, RLENGTH - 11)
    # Get the corresponding color, wrapping around if speakers outnumber colors
    color = colorArray[(speaker_num % n) + 1]
    # Wrap the line in a font tag with the color
    $0 = "<font color=\"" color "\">" $0 "</font>"
  }
  # Print the (possibly modified) line
  print
}
' "$INPUT_FILE" > "$OUTPUT_FILE"

echo "Colors added successfully! Output saved to '$OUTPUT_FILE'."

Step 4 - burn into video

At this point you have an SRT file. You can embed that into the video as a subtitle track in, say, an MKV. That's how I would personally do it.
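
For the subtitle track route, a minimal sketch of muxing the SRT into an MKV without re-encoding anything. mux_subs and the file names are placeholders, and whether a given player honors the font color tags on a soft subtitle track is an assumption you should check.

```shell
#!/bin/bash

# Which ffmpeg binary to call; override via the environment if needed.
FFMPEG="${FFMPEG:-ffmpeg}"

# Hypothetical helper: add an SRT file as a subtitle track next to the
# existing streams. -map 0 keeps everything from the video, -map 1 adds
# the subtitles, -c copy avoids re-encoding.
mux_subs() {
  local video="$1" subs="$2" output="$3"
  "$FFMPEG" -i "$video" -i "$subs" -map 0 -map 1 -c copy "$output"
}

# Usage:
#   mux_subs input.mkv input_subtitles_colored.srt output.mkv
```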

But sometimes you're sending to people and don't want them to deal with subtitle tracks.

Then just burn it in and do hardsubs.

ffmpeg -i input.mkv -vf "subtitles=input_subtitles_colored.srt:force_style='OutlineColour=&H00000000,BorderStyle=3,Outline=1,Shadow=0,MarginV=20,FontSize=28'" -c:v libx264 -crf 23 -c:a aac -b:a 128k output.mp4

This uses our color tags and draws the subs at font size 28 on a black box background (BorderStyle=3 with a black OutlineColour). Adjust to your preference.

Note: I might hate h264 and the default aac encoder sounds like ass but sometimes you need a dumb mp4 file. Use AV1 and opus if you can.
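
If you can go the AV1 and opus route, the same burn-in might look like the sketch below. It assumes your ffmpeg build includes libsvtav1 and libopus; the CRF, preset and bitrate values are arbitrary starting points and not recommendations from this page, and burn_av1 is a made-up helper name.

```shell
#!/bin/bash

# Which ffmpeg binary to call; override via the environment if needed.
FFMPEG="${FFMPEG:-ffmpeg}"

# Same subtitle style as the h264 command above.
STYLE="OutlineColour=&H00000000,BorderStyle=3,Outline=1,Shadow=0,MarginV=20,FontSize=28"

# Hypothetical helper: burn subtitles in, encode video as AV1 and audio as Opus.
# Assumption: the subtitle file name has no characters that need escaping
# inside an ffmpeg filter string (colons, quotes, backslashes).
burn_av1() {
  local input="$1" subs="$2" output="$3"
  "$FFMPEG" -i "$input" \
    -vf "subtitles=${subs}:force_style='${STYLE}'" \
    -c:v libsvtav1 -crf 30 -preset 6 \
    -c:a libopus -b:a 96k \
    "$output"
}

# Usage:
#   burn_av1 input.mkv input_subtitles_colored.srt output.mkv
```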

Tony Tascioglu Wiki
