Transcribe Multi-lingual Video Into Subtitles With Speaker Color

Ok I want to do a few things.

  1. I have videos that are English, Turkish or mixed.
  2. I want a text transcript and subtitle files.
  3. I want to have subtitles colored per speaker.
  4. I want to burn in said subtitles.

  • There used to be several bad options: Sphinx, Wit.ai and so on.
  • Thankfully those days are over - Whisper is the new king
  • I want Whisper large to handle multiple languages
    • Whisper large keeps adding "Altyazi MK" to Turkish text, but we'll filter it out.
      • They really trained this thing on pirated subs, eh?
  • Whisper.cpp works on my PC and isn't dependency hell, but has no per-speaker isolation
  • WhisperX is a good option, but MAN did I have issues with it (see below)

WhisperX

  • I'm using a sacrificial LXC because this is so jank
  • I hate python nightmares. Again: Rust, Go and even sane C++ projects are, well, sane.

Maybe non-obvious, but you will need:

  • python 3.11 (not 3.13)
    • This is in AUR. Don't use it, use pyenv (hence the sacrificial container)
  • downgrade your glibc
    • some ctranslate2 error will pop up otherwise
    • yes, it is dangerous to run a 6 month old gcc
    • again, sacrificial unprivileged LXC
  • a huggingface account, and accept the terms on a few models
    • I iptables it later to ensure it's offline

I won't go over setup. It's hacks and jank.

Convert to Whisper specs: a mono, 16-bit WAV file at a 16000 Hz sample rate.

ffmpeg -i input-video.mkv -ac 1 -ar 16000 input-audio.wav
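
If you have a pile of videos, the same conversion can be wrapped in a small helper. This is only a sketch: the function names extract_audio and extract_all are made up for illustration, and it assumes the input is not already a .wav (the output would overwrite it). It spells out pcm_s16le, which is already ffmpeg's default for WAV, just to make the 16-bit part explicit.

```shell
# Convert one video to the mono, 16-bit, 16 kHz WAV that Whisper expects.
# The output lands next to the input, with the extension swapped to .wav.
extract_audio() {
  local input="$1"
  local output="${input%.*}.wav"
  # -vn drops the video stream, pcm_s16le pins the sample format to 16-bit
  ffmpeg -nostdin -y -i "$input" -vn -ac 1 -ar 16000 -c:a pcm_s16le "$output"
}

# Convert every file given as an argument, stopping at the first failure.
extract_all() {
  local f
  for f in "$@"; do
    extract_audio "$f" || return 1
  done
}
```

Source the snippet and call something like extract_all *.mkv in the folder with your videos.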

The transcription step is the annoying one, because you might have to try a few settings before it works.

whisperx --model large-v3-turbo --model_dir ~/.cache/whisper --output_dir out/ --task transcribe --diarize --highlight_words True --verbose True --output_format all --print_progress True --compute_type int8 --threads 24 --hf_token your-token-here input-audio.wav

I didn't find much difference between large-v2, large-v3 and large-v3-turbo. Turbo is a bit faster and newer, so I'm going with that.
You might not want highlight_words as it can be distracting.

With --output_format all this generates, among other formats, a .txt transcript plus .srt and .vtt subtitles (the two subtitle formats are near equivalent).
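
The generated subtitles are where the "Altyazi MK" hallucination mentioned earlier shows up, so this is a reasonable place to filter it. Below is a rough sketch, not what I actually ran: the function name strip_hallucinated_cues is made up, the spellings it matches ("Altyazı M.K.", "Altyazi MK" and similar) are an assumption about what Whisper emits, and it assumes a plain SRT with LF line endings. It removes the phrase, drops cues that end up empty (or with only a speaker label left) and renumbers the rest.

```shell
# Read an SRT on stdin, strip the hallucinated credit, write an SRT on stdout.
strip_hallucinated_cues() {
  awk -v pattern="Altyaz(ı|i) M[.]?K[.]?" '
  BEGIN { RS = ""; FS = "\n" }   # paragraph mode: one record per cue
  {
    text = ""
    # Field 1 is the cue number, field 2 the timing line, the rest is text
    for (i = 3; i <= NF; i++) {
      line = $i
      gsub(pattern, "", line)
      gsub(/^[ \t]+|[ \t]+$/, "", line)
      # A line with nothing but a speaker label left counts as empty
      if (line ~ /^\[SPEAKER_[0-9]+\]:?$/) line = ""
      if (line != "") text = text (text == "" ? "" : "\n") line
    }
    if (text == "") next
    # Renumber so the cue indexes stay consecutive
    printf "%d\n%s\n%s\n\n", ++n, $2, text
  }'
}
```

Usage would be along the lines of strip_hallucinated_cues < out/input-audio.srt > out/input-audio_clean.srt. Check the result by eye, since the pattern may need adjusting for your files.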

NOTE: try with a single small WAV file before you do *.wav!!! It will batch process, and you want one good run first to catch errors such as huggingface token problems. Otherwise you will spend 10 hours in whisper, then fail, and have to redo it all.
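
One way to get that small test file is to cut the first minute off a real recording. A minimal sketch, assuming the input is already the converted WAV; the function name make_sample and the _sample suffix are made up for illustration.

```shell
# Cut the first N seconds (default 60) of a WAV into a small test clip.
make_sample() {
  local input="$1"
  local seconds="${2:-60}"
  local output="${input%.*}_sample.wav"
  # -c copy keeps the PCM data as-is, a plain trim needs no re-encode
  ffmpeg -nostdin -y -i "$input" -t "$seconds" -c copy "$output"
}
```

Run make_sample input-audio.wav, then point the whisperx command above at input-audio_sample.wav and make sure every stage (transcribe, align, diarize) finishes before starting the big batch.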

Now, this results in text labeled with SPEAKER_00 and SPEAKER_01. First I wanted to convert to SubStation Alpha (ASS/SSA) to do colors. Then I got bored and colored the .srt directly. It's only a bit cursed.

Here is a bash script to take in the subtitle and add color tags.

NOTE: use font tags, not span as some sources say. Also, I colored the SRT and not the VTT, since VTT seems to have some issues with tags and limits on what is supported. Idk much, man.

#!/bin/bash

# Define an array of colors for each speaker
COLORS=("yellow" "cyan" "green" "magenta" "orange" "red" "blue" "white")

# Input file (VTT or SRT)
INPUT_FILE="$1"

# Output file (with colors added)
OUTPUT_FILE="${INPUT_FILE%.*}_colored.${INPUT_FILE##*.}"

# Check if input file exists
if [[ ! -f "$INPUT_FILE" ]]; then
  echo "Error: Input file '$INPUT_FILE' not found."
  exit 1
fi

# Process the file
awk -v colors="${COLORS[*]}" '
BEGIN {
  # Split the colors into an array and remember how many there are
  num_colors = split(colors, colorArray, " ")
}

{
  # Check if the line contains a speaker label such as [SPEAKER_00]:
  if (match($0, /\[SPEAKER_[0-9]+\]:/)) {
    # Extract only the digits: skip the 9 characters of "[SPEAKER_" and drop the trailing "]:"
    speaker_num = substr($0, RSTART + 9, RLENGTH - 11) + 0
    # Pick a color, wrapping around when there are more speakers than colors
    color = colorArray[(speaker_num % num_colors) + 1]
    # Wrap the line in a font tag with the color
    $0 = "<font color=\"" color "\">" $0 "</font>"
  }
  # Print the modified line
  print
}
' "$INPUT_FILE" > "$OUTPUT_FILE"

echo "Colors added successfully! Output saved to '$OUTPUT_FILE'."

At this point you have a colored SRT file. You can embed that into the video as a subtitle track in, say, an MKV. That's how I would personally do it.
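
Embedding as a soft subtitle track could look roughly like this. It is a sketch, not a command I have pinned down: the function name mux_subtitles is made up, it assumes the output is an MKV, and whether the font color tags get rendered depends on the player.

```shell
# Mux an SRT into a video as an extra subtitle track without re-encoding.
mux_subtitles() {
  local video="$1" subs="$2" output="$3"
  # -map 0 keeps every stream of the video, -map 1 adds the SRT,
  # -c copy stream-copies everything (the SRT is stored as a SubRip track)
  ffmpeg -nostdin -y -i "$video" -i "$subs" -map 0 -map 1 -c copy "$output"
}
```

For example: mux_subtitles input.mkv input_subtitles_colored.srt output.mkv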

But sometimes you're sending to people and don't want them to deal with subtitle tracks.

Then just burn it in and do hardsubs.

ffmpeg -i input.mkv -vf "subtitles=input_subtitles_colored.srt:force_style='OutlineColour=&H00000000,BorderStyle=3,Outline=1,Shadow=0,MarginV=20,FontSize=28'" -c:v libx264 -crf 23 -c:a aac -b:a 128k output.mp4

This uses our color tags and draws the subs at font size 28 on a black background box (BorderStyle=3 with a black OutlineColour). Tweak to your preference.

Note: I might hate h264 and the default aac encoder sounds like ass but sometimes you need a dumb mp4 file. Use AV1 and opus if you can.
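
For the AV1 and Opus route, the burn-in could be sketched like this. Assumptions, loudly: your ffmpeg build includes libsvtav1 and libopus, the crf, preset and bitrate values are placeholders I have not tuned, the function name burn_in_av1 is made up, and the subtitle filename has no characters (colons, quotes, backslashes) that would need escaping inside the filter string.

```shell
# Burn colored subtitles into the video, encoding to AV1 video and Opus audio.
burn_in_av1() {
  local video="$1" subs="$2" output="$3"
  # Same subtitle styling as the h264 command above
  local style="OutlineColour=&H00000000,BorderStyle=3,Outline=1,Shadow=0,MarginV=20,FontSize=28"
  ffmpeg -nostdin -y -i "$video" \
    -vf "subtitles=${subs}:force_style='${style}'" \
    -c:v libsvtav1 -crf 30 -preset 6 \
    -c:a libopus -b:a 96k \
    "$output"
}
```

For example: burn_in_av1 input.mkv input_subtitles_colored.srt output.mkv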

  • scripts/media/transcribe_video_text_subtitle_speaker_color.txt
  • Last modified: 2025-03-19 06:39
  • by Tony