
If you produce video content regularly and are tired of inaccurate auto-captions or paying per-minute for cloud transcription APIs, this pipeline is for you. I'll walk through how I wired up ffmpeg, OpenAI Whisper, and a folder-watching daemon on a Mac Mini cluster to go from raw video to subtitled output with zero manual steps.
The Problem: Cloud Captions Are Unreliable for Technical Korean
YouTube's auto-captions handle general conversational Korean passably. The moment you throw in technical terminology — framework names, CLI flags, model identifiers — accuracy tanks. I was manually correcting 10–20% of generated subtitles on every video, which added an hour or two of rework per episode.
The first thing I tried was just uploading to a paid transcription API. It worked, but at scale (multiple videos per week), costs compound fast, and you're dependent on rate limits and network availability. I run a Mac Mini cluster locally anyway for inference workloads, so burning cloud credits for something that could run on-device felt wrong.
Why ffmpeg + Whisper Is the Right Combination
ffmpeg handles the media side: it extracts audio from any container format, resamples to 16 kHz mono (exactly what Whisper expects), and later burns the finished subtitle track back into the video. Whisper handles the transcription side: it takes that WAV file and outputs a properly timestamped SRT file.
Neither tool alone gets you to a subtitled video. Together, they cover every step in the pipeline. ffmpeg is the prep cook, Whisper is the line cook, and the output folder is the pass.
The key Whisper flag for Korean technical content is --language ko. Without it, Whisper detects the language from the first 30 seconds of audio. On videos that mix Korean and English terms it occasionally guesses wrong, and then decodes the whole file as the wrong language.
The Core Pipeline
The full pipeline is three chained commands:
# Step 1: Extract audio — 16 kHz mono WAV, no video stream
ffmpeg -i input.mp4 -vn -ar 16000 -ac 1 audio.wav
# Step 2: Transcribe with Whisper large-v3, output SRT
whisper audio.wav --model large-v3 --language ko --output_format srt
# Step 3: Burn subtitles back into the video
ffmpeg -i input.mp4 -vf subtitles=audio.srt output_sub.mp4
Or as a single one-liner with error propagation:
ffmpeg -i input.mp4 -vn -ar 16000 -ac 1 audio.wav && \
whisper audio.wav --model large-v3 --language ko --output_format srt && \
ffmpeg -i input.mp4 -vf subtitles=audio.srt output_sub.mp4
On M2 Pro, a 10-minute video runs end-to-end in about 52 seconds. A 30-minute video finishes in roughly 2 minutes 20 seconds. These are real numbers from my setup, not benchmarks from a controlled environment.
Choosing the Right Whisper Model
I tested all three practical model sizes on the same 10-minute Korean tech-content video. Here's what I found:
| Model | Processing Time (M2 Pro) | Korean Technical Accuracy |
|---|---|---|
| tiny | 11 seconds | Low: misses jargon constantly |
| base | 19 seconds | Medium: frequent proper noun errors |
| large-v3 | 48 seconds | High: best available for Korean |
tiny is fast the way a rough draft is fast — you'll spend more time fixing errors than you saved on processing. base is the tempting middle ground, but "proper noun errors" in technical content means your viewers see garbled model names, library names, and command flags. That's worse than no subtitles.
large-v3 at ~12x real-time speed on M2 Pro hits the sweet spot. For content accuracy, there's no real alternative.
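The ~12x figure is just clip duration divided by processing time. A minimal sketch of that arithmetic follows; realtime_factor is a hypothetical helper name, and awk is used only because the shell has no floating-point division:

```shell
# realtime_factor DURATION_SECONDS PROCESSING_SECONDS
# Prints how many seconds of video are processed per second of wall-clock time.
realtime_factor() {
  awk -v d="$1" -v p="$2" 'BEGIN { printf "%.1f\n", d / p }'
}

# 10-minute clip (600 s) transcribed by large-v3 in 48 s
realtime_factor 600 48   # prints 12.5
```

The same call with the tiny model's 11 seconds gives roughly 54x, which shows how much speed large-v3 trades for accuracy.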
One extra flag worth adding for post-processing flexibility:
whisper audio.wav \
--model large-v3 \
--language ko \
--word_timestamps True \
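--output_format all
--word_timestamps True makes Whisper compute per-word timing. The SRT file itself still contains segment-level cues. The per-word data lands in the JSON output, which is why this example writes all formats. Keeping that JSON makes it possible to later split long subtitle lines without re-running transcription.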
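If the goal is simply shorter subtitle lines, recent openai-whisper releases can also wrap lines at transcription time using the word timings. The sketch below assumes your installed version supports --max_line_width and --max_line_count (check whisper --help). The function name and the default width of 20 characters are illustrative choices, not values I have tuned:

```shell
# transcribe_short_lines AUDIO_FILE [MAX_WIDTH]
# Runs Whisper with word timestamps and asks it to wrap subtitle lines.
# Assumes a whisper CLI that supports --max_line_width / --max_line_count.
transcribe_short_lines() {
  audio="$1"
  width="${2:-20}"   # illustrative default, adjust for your font size
  whisper "$audio" \
    --model large-v3 \
    --language ko \
    --word_timestamps True \
    --max_line_width "$width" \
    --max_line_count 2 \
    --output_format srt
}
```

Usage would be transcribe_short_lines audio.wav 24, which writes audio.srt to the current directory like the plain command does.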
The Real Automation: Folder-Watching Daemon
Running the pipeline manually per video is half-automation. The actual goal is zero manual steps: drop a file in a folder, come back to find a subtitled video.
I use fswatch — available via Homebrew — to monitor the input directory. Any time a new file appears, it fires the subtitle script automatically.
# Install fswatch
brew install fswatch
# Watch input folder and trigger processing script on new files
fswatch -o ~/Videos/input/ | xargs -n1 -I{} bash ~/scripts/auto_subtitle.sh
The script auto_subtitle.sh wraps the three-command pipeline and handles output path construction:
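#!/bin/bash
# auto_subtitle.sh: called by fswatch whenever the input folder changes
INPUT_DIR=~/Videos/input
OUTPUT_DIR=~/Videos/output
SCRATCH=~/Videos/scratch
PROCESSED=~/Videos/processed
# Make sure every working directory exists before the first run
mkdir -p "$OUTPUT_DIR" "$SCRATCH" "$PROCESSED"
for f in "$INPUT_DIR"/*.mp4; do
  [ -f "$f" ] || continue
  BASE=$(basename "$f" .mp4)
  # Extract audio (-y overwrites stale scratch files, -nostdin keeps ffmpeg non-interactive)
  ffmpeg -nostdin -y -i "$f" -vn -ar 16000 -ac 1 "$SCRATCH/${BASE}.wav" || continue
  # Transcribe
  whisper "$SCRATCH/${BASE}.wav" \
    --model large-v3 \
    --language ko \
    --output_format srt \
    --output_dir "$SCRATCH" || continue
  # Burn subtitles
  ffmpeg -nostdin -y -i "$f" -vf "subtitles=$SCRATCH/${BASE}.srt" "$OUTPUT_DIR/${BASE}_sub.mp4" || continue
  # Move the input out of the watch folder only after every step succeeded
  mv "$f" "$PROCESSED/"
done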
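One cheap safeguard before the burn step is to confirm that Whisper actually wrote subtitle cues, since a silent or damaged audio track can leave an empty SRT. A minimal sketch follows, where srt_cue_count is a hypothetical helper that counts the timing lines (the ones containing -->) in an SRT file:

```shell
# srt_cue_count FILE
# Prints the number of cues in an SRT file by counting its timing lines.
# Prints 0 when the file is missing or contains no cues.
srt_cue_count() {
  if [ -f "$1" ]; then
    # grep -c still prints 0 on no match but exits non-zero, hence "|| true"
    grep -c -e '-->' "$1" || true
  else
    echo 0
  fi
}

# Possible use inside the loop, before the burn step:
#   if [ "$(srt_cue_count "$SCRATCH/${BASE}.srt")" -eq 0 ]; then
#     echo "no subtitles produced for $f, skipping" >&2
#     continue
#   fi
```

This is a count of timing lines rather than a full SRT validation, which is enough to catch the empty-output case.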
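To run this persistently without keeping a terminal open, wrap the whole pipeline in nohup. Prefixing only fswatch would leave xargs exposed to the hangup signal when the terminal closes:
nohup sh -c 'fswatch -o ~/Videos/input/ | xargs -n1 -I{} bash ~/scripts/auto_subtitle.sh' > /dev/null 2>&1 &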
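Or register it as a launchd service so it survives reboots. In this variant launchd's WatchPaths key does the watching in place of fswatch. launchd starts jobs with a minimal PATH (/usr/bin:/bin:/usr/sbin:/sbin), so Homebrew-installed tools are not found unless auto_subtitle.sh calls ffmpeg and whisper by absolute path or exports PATH itself: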
<?xml version="1.0" encoding="UTF-8"?>
<!-- ~/Library/LaunchAgents/com.local.subtitle-daemon.plist -->
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
"http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
<key>Label</key>
<string>com.local.subtitle-daemon</string>
<key>ProgramArguments</key>
<array>
<string>/bin/bash</string>
<string>/Users/you/scripts/auto_subtitle.sh</string>
</array>
<key>WatchPaths</key>
<array>
<string>/Users/you/Videos/input</string>
</array>
<key>RunAtLoad</key>
<true/>
</dict>
</plist>
Load it with:
launchctl load ~/Library/LaunchAgents/com.local.subtitle-daemon.plist
Variations and Gotchas
Non-MP4 inputs: ffmpeg handles virtually any container. Swap input.mp4 with input.mkv, input.mov, or whatever you have. The audio extraction command is format-agnostic.
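The auto_subtitle.sh loop only globs *.mp4 and strips that one suffix, so other containers need the output name derived without assuming an extension. A small sketch using POSIX parameter expansion follows. sub_output_name is a hypothetical helper, and it keeps the .mp4 output container used elsewhere in this post:

```shell
# sub_output_name INPUT_PATH
# Maps any input file to the name of its subtitled output,
# e.g. /videos/talk.mkv -> talk_sub.mp4
sub_output_name() {
  file="${1##*/}"      # strip the directory part
  base="${file%.*}"    # strip the last extension, whatever it is
  printf '%s_sub.mp4\n' "$base"
}

sub_output_name "$HOME/Videos/input/talk.mkv"   # prints talk_sub.mp4
```

With this in place the loop header can list several patterns, for example "$INPUT_DIR"/*.mp4 "$INPUT_DIR"/*.mkv "$INPUT_DIR"/*.mov, because the existing [ -f "$f" ] guard already skips patterns that match nothing.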
Soft subtitles vs. burned-in: The pipeline above hard-burns subtitles into the video stream. If you want soft subtitles (selectable in a player), use -c:s mov_text instead of -vf subtitles=...:
ffmpeg -i input.mp4 -i audio.srt -c copy -c:s mov_text output_soft.mp4
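Linux: fswatch has a Linux equivalent: inotifywait from the inotify-tools package. Replace the fswatch line with the command below. It listens for close_write and moved_to rather than create, so the script fires only once a file has been fully written or moved into the folder:
inotifywait -m -e close_write -e moved_to ~/Videos/input/ | \
while read -r dir action file; do bash ~/scripts/auto_subtitle.sh; done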
Docker: If you're running Whisper in a container, mount the scratch directory as a volume. The --model large-v3 download (~3 GB) happens on first run; cache it to a persistent volume to avoid re-downloading.
File detection race condition: fswatch fires the moment the OS reports the file — which may be before the write is complete for large videos. Add a brief sleep 2 at the top of auto_subtitle.sh, or check file size stability before processing.
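The size-stability check can be sketched as a small polling function. wait_for_stable_size is a hypothetical helper, and the one-second interval and 600-poll cap are assumptions to tune. It reads the size with wc -c rather than stat because stat's flags differ between macOS and Linux:

```shell
# wait_for_stable_size FILE [INTERVAL_SECONDS] [MAX_CHECKS]
# Returns 0 once two consecutive size readings match, 1 if the file
# disappears or is still growing after MAX_CHECKS polls.
wait_for_stable_size() {
  file="$1"
  interval="${2:-1}"
  max_checks="${3:-600}"
  prev=-1
  i=0
  while [ "$i" -lt "$max_checks" ]; do
    [ -f "$file" ] || return 1
    # tr strips the padding that BSD wc adds around the number
    size=$(wc -c < "$file" | tr -d '[:space:]')
    if [ "$size" -eq "$prev" ]; then
      return 0
    fi
    prev=$size
    sleep "$interval"
    i=$((i + 1))
  done
  return 1
}
```

Inside the loop this would read wait_for_stable_size "$f" 2 || continue, placed before the audio extraction. A copy that stalls for longer than the interval will still look finished, so treat this as a heuristic rather than a guarantee.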
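GPU acceleration: On Apple Silicon, the mlx-whisper package runs Whisper on the GPU through Apple's MLX framework. It is a separate port rather than a strict drop-in replacement, so expect its command name and model identifiers to differ from openai-whisper. It cuts large-v3 processing time further on M2 Max and M2 Ultra chips. On the M2 Pro in my cluster, the standard openai-whisper package already hits acceptable speeds, so I haven't switched.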
The Takeaway
Three tools, one script, one daemon: that's the entire stack. ffmpeg + Whisper + fswatch running on a Mac Mini produces accurately subtitled videos from raw input with no cloud dependency, no per-minute cost, and no manual steps after setup.
What's next for me is feeding the generated SRT files into a local LLM (Ollama) to auto-generate chapter markers and description copy — effectively closing the loop on the entire video post-production workflow without leaving the local cluster.