Transcribe Multi-lingual Video Into Subtitles With Speaker Color
Ok I want to do a few things.
- I have videos that are English, Turkish or mixed.
- I want a text transcript and subtitle files
- I want to have subtitles colored per speaker
- I want to burn-in said subtitles.
Part 1: ASR
- There used to be several bad options: Sphinx, Wit.ai, and so on.
- Thankfully those days are all over - Whisper is the new king
Whisper variants
- I want whisper large to handle multiple languages
- Whisper large keeps adding Altyazi MK in Turkish text but we'll filter it out.
- They really trained this thing on pirated subs eh?
- Whisper.cpp works on my PC and isn't dependency hell, but no per-speaker isolation
- WhisperX is a good option, but MAN did I have issues with it (see below)
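The "Altyazi MK" filtering mentioned above could look roughly like this. It is a minimal sketch, assuming the junk text lands in its own cue in the .srt and is spelled close to the pattern below; `drop_junk_cues` is a made-up helper name, not part of Whisper or WhisperX. Cue numbers are not renumbered afterwards, which players generally seem to tolerate, but treat that as an assumption.

```bash
# drop_junk_cues FILE.srt: print the SRT without cues that contain the
# hallucinated credit line. Hypothetical helper, adjust the pattern to
# whatever your transcripts actually contain.
drop_junk_cues() {
    # RS="" is awk paragraph mode: each blank-line-separated cue is one record.
    # Records matching the pattern are skipped; everything else is printed.
    awk 'BEGIN { RS = ""; ORS = "\n\n" }
         !/[Aa]ltyaz(ı|i) M\.?K\.?/' "$1"
}

# Usage: drop_junk_cues out/input-audio.srt > out/input-audio_clean.srt
```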
WhisperX
- I'm using a sacrificial LXC because this is so jank
- I hate python nightmares. Again: Rust, Go and even sane C++ projects are, well, sane.
Maybe non-obvious things you will need:
- python 3.11 (not 3.13)
  - This is in AUR. Don't use it, use pyenv (hence sacrificial VM)
- downgrade your glibc
  - some ctranslate error will pop up
  - yes, it is dangerous to use a 6 month old gcc
  - again, sacrificial unprivileged lxc
- a huggingface account and accept terms on a few models
  - I iptables it later to ensure it's offline
I won't go over setup. It's hacks and jank.
Step 1 - Get the audio
Convert to whisper specs: a mono, 16-bit WAV file at a 16000 Hz sample rate.
ffmpeg -i input-video.mkv -ac 1 -ar 16000 input-audio.wav
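If there is more than one video, the same conversion can be wrapped in a small function. This is a sketch, not the exact command used above: `extract_audio` is a made-up name, the output is assumed to sit next to the input with a .wav extension, and `-vn` / `-c:a pcm_s16le` are added only to make "no video" and "16 bit" explicit.

```bash
# extract_audio VIDEO: write a mono, 16-bit, 16 kHz WAV next to the video.
# Hypothetical helper; the output name is the input name with .wav instead.
extract_audio() {
    local in="$1"
    local out="${in%.*}.wav"
    # -nostdin keeps ffmpeg from eating the terminal input inside loops,
    # -vn drops the video stream, pcm_s16le is 16-bit PCM.
    ffmpeg -nostdin -i "$in" -vn -ac 1 -ar 16000 -c:a pcm_s16le "$out"
}

# Usage: extract_audio input-video.mkv
# Whole folder: for f in *.mkv; do extract_audio "$f"; done
```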
Step 2 - run whisper
This one is annoying because you might have to try a few things.
whisperx --model large-v3-turbo --model_dir ~/.cache/whisper --output_dir out/ --task transcribe --diarize --highlight_words True --verbose True --output_format all --print_progress True --compute_type int8 --threads 24 --hf_token your-token-here input-audio.wav
I didn't find much difference between large-v2, large-v3 and large-v3-turbo. Turbo is a bit faster and newer, so I'm going with that.
You might not want highlight_words as it can be distracting.
This generates a .txt transcript plus .srt and .vtt subtitle files (the latter two are nearly equivalent), among other formats.
NOTE: try with a single small WAV file before you do *.wav!!! It will batch process, and you want one good run first to catch errors, such as ones with huggingface. Otherwise you will spend 10 hours in whisper, then fail and have to redo it all.
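One way to get that small test file is to cut the first minute off an existing WAV. A minimal sketch: `make_sample` is a made-up helper name and the 60 second default is arbitrary.

```bash
# make_sample IN.wav OUT.wav [SECONDS]: copy the first SECONDS seconds
# (default 60) of the audio without re-encoding. Hypothetical helper.
make_sample() {
    ffmpeg -nostdin -y -i "$1" -t "${3:-60}" -c copy "$2"
}

# Usage: make_sample input-audio.wav sample.wav 60
# Then run the same whisperx command on sample.wav before the full batch.
```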
Part 2: Subtitles
Step 3 - color them by speaker
Now, this results in text labeled with SPEAKER_00 and SPEAKER_01. First I wanted to convert to SubStation Alpha (ASS/SSA) to do colors. Then I got bored and did .srt/.vtt directly. It's only a bit cursed.
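Before picking colors it helps to see which speaker labels the diarization actually produced and how often. A small sketch, assuming the labels look like SPEAKER_00 as described above; `count_speakers` is a made-up name.

```bash
# count_speakers FILE: print each speaker label found in the subtitle
# file together with how many times it appears.
count_speakers() {
    grep -o 'SPEAKER_[0-9][0-9]*' "$1" | sort | uniq -c
}

# Usage: count_speakers out/input-audio.srt
```

If this lists more speakers than there are colors in your palette, add colors before burning anything in.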
Here is a bash script to take in the subtitle and add color tags.
NOTE: use font tags, and not span as some sources say. Also I did the SRT and not the VTT since there are some issues with tags in VTT? And limits to what is supported? Idk much man.
#!/bin/bash

# Define an array of colors for each speaker
COLORS=("yellow" "cyan" "green" "magenta" "orange" "red" "blue" "white")

# Input file (VTT or SRT)
INPUT_FILE="$1"

# Output file (with colors added)
OUTPUT_FILE="${INPUT_FILE%.*}_colored.${INPUT_FILE##*.}"

# Check if input file exists
if [[ ! -f "$INPUT_FILE" ]]; then
    echo "Error: Input file '$INPUT_FILE' not found."
    exit 1
fi

# Process the file
awk -v colors="${COLORS[*]}" '
BEGIN {
    # Split the colors into an array (awk arrays from split start at 1)
    n = split(colors, colorArray, " ")
}
{
    # Check if the line contains a speaker label such as [SPEAKER_00]:
    if (match($0, /\[SPEAKER_[0-9]+\]:/)) {
        # Extract the digits: skip the 9 characters of "[SPEAKER_", drop the trailing "]:"
        speaker_num = substr($0, RSTART + 9, RLENGTH - 11)
        # Get the corresponding color, wrapping around if speakers outnumber colors
        color = colorArray[(speaker_num % n) + 1]
        # Wrap the line in a font tag with the color
        $0 = "<font color=\"" color "\">" $0 "</font>"
    }
    # Print the modified line
    print
}
' "$INPUT_FILE" > "$OUTPUT_FILE"

echo "Colors added successfully! Output saved to '$OUTPUT_FILE'."
Step 4 - burn into video
At this point you have an SRT file. You can embed that into the video as a subtitle track in, say, an MKV. That's how I would personally do it.
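Embedding as a track could look something like this. A sketch only: `mux_subs` is a made-up name, and whether the font color tags survive in a soft subtitle track depends on the player, so check with yours.

```bash
# mux_subs VIDEO SUBS.srt OUT.mkv: add the SRT as a subtitle track.
# Video and audio streams are copied as-is, nothing is re-encoded.
mux_subs() {
    ffmpeg -nostdin -i "$1" -i "$2" -map 0 -map 1 -c copy -c:s srt "$3"
}

# Usage: mux_subs input.mkv input_subtitles_colored.srt output.mkv
```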
But sometimes you're sending to people and don't want them to deal with subtitle tracks.
Then just burn it in and do hardsubs.
ffmpeg -i input.mkv -vf "subtitles=input_subtitles_colored.srt:force_style='OutlineColour=&H00000000,BorderStyle=3,Outline=1,Shadow=0,MarginV=20,FontSize=28'" -c:v libx264 -crf 23 -c:a aac -b:a 128k output.mp4
This will use our color tags and draw the subs at font size 28 with a black background (BorderStyle=3 draws an opaque box, here in black via OutlineColour). Choose based on your preference.
Note: I might hate h264 and the default aac encoder sounds like ass but sometimes you need a dumb mp4 file. Use AV1 and opus if you can.
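For the AV1 and opus route, the same burn-in could be sketched like this. It assumes an ffmpeg build that includes libsvtav1 and libopus, and the CRF and bitrate values are placeholder guesses rather than tested recommendations; `burn_subs_av1` is a made-up name.

```bash
# burn_subs_av1 VIDEO SUBS.srt OUT.mkv: hardsub with AV1 video and Opus audio.
# Assumes ffmpeg was built with libsvtav1 and libopus. The subtitle path goes
# straight into the filter string, so avoid colons and quotes in the file name.
burn_subs_av1() {
    ffmpeg -nostdin -i "$1" \
        -vf "subtitles=$2:force_style='OutlineColour=&H00000000,BorderStyle=3,Outline=1,Shadow=0,MarginV=20,FontSize=28'" \
        -c:v libsvtav1 -crf 30 -c:a libopus -b:a 96k "$3"
}

# Usage: burn_subs_av1 input.mkv input_subtitles_colored.srt output.mkv
```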