
If you're paying cloud bills to transcode video, you might be solving the wrong problem. Four Mac Minis meshed together with Tailscale, with ffmpeg segments distributed across them, cut a 4K encode from 4 minutes 12 seconds to 58 seconds, with no ongoing cost beyond electricity. Here's the exact setup, the gotchas I hit, and how I added AI upscaling on top with Draw Things.
The Problem: Cloud Transcoding Adds Up Fast
AWS Elastic Transcoder and similar services are convenient, but the pricing model punishes you for volume. A 4K ProRes source file that runs 2 minutes costs real money to transcode at scale — and you're waiting on remote I/O on top of the compute time.
The first thing I tried was a single-node ffmpeg job on one Mac Mini M2 Pro. The machine is powerful, but a naive ffmpeg -i input.mov -c:v libx264 output.mp4 command only saturates a handful of cores. The 4K ProRes input took 4 minutes 12 seconds end-to-end. Not terrible, but not impressive for hardware that costs $1,300.
The deeper issue: video encoding is embarrassingly parallel at the segment level, and nothing in the default ffmpeg workflow exploits that across machines.
Section 1: Wiring the Cluster with Tailscale
Tailscale is a mesh VPN built on WireGuard. The key difference from a traditional VPN is that nodes connect peer-to-peer wherever a direct path exists (falling back to Tailscale's relay servers only when one doesn't), so there's no central tunnel to saturate. Each machine gets a stable address in the 100.64.0.0/10 range, and that address stays constant even if the machine moves networks or sits behind double NAT.
For a home lab or office setup, that means you don't touch router firewall rules. Install, authenticate, done.
# Run on each Mac Mini
brew install tailscale
sudo tailscaled &
tailscale up --authkey tskey-xxxxxxxxxxxxxxxx
# Verify all nodes see each other
tailscale status
Expected output from tailscale status once all four nodes are up:
100.64.0.1 mini0 macOS -
100.64.0.2 mini1 macOS -
100.64.0.3 mini2 macOS -
100.64.0.4 mini3 macOS -
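To script that check instead of eyeballing it, here is a minimal sketch. It assumes each peer line in the `tailscale status` output starts with the node's 100.x.y.z address, as in the listing above; `count_mesh_nodes` and `mesh_ready` are hypothetical helper names, and the check counts listed peers, which is not the same as confirming each one is online.

```shell
# Count mesh peers in `tailscale status` output read from stdin.
# Assumption: each peer line starts with its 100.x.y.z address.
count_mesh_nodes() {
  awk '$1 ~ /^100\.[0-9]+\.[0-9]+\.[0-9]+$/ { n++ } END { print n + 0 }'
}

# Succeeds only when at least the expected number of nodes is listed.
mesh_ready() {
  expected=$1
  [ "$(count_mesh_nodes)" -ge "$expected" ]
}

# Usage (on any node):
#   tailscale status | mesh_ready 4 && echo "cluster is up"
```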
Once you see all four, test SSH across the mesh:
ssh mini1@100.64.0.2 "uname -a"
If that returns without a password prompt (configure ~/.ssh/authorized_keys on each node beforehand), the cluster is ready.
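The gotcha I hit: Tailscale node keys expire by default (after 180 days), and an expired key drops the node off the mesh and takes any long-running SSH session with it. If you're leaving these machines unattended, disable key expiry for each node in the admin console, and generate the auth key as reusable so a reinstalled node can rejoin without a new key. The --accept-routes flag only controls whether a node accepts subnet routes advertised by other nodes, so it doesn't help here.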
Section 2: Distributed ffmpeg Encoding
The core idea is simple: split the source into 30-second segments, ship one segment per node, encode in parallel, then concatenate.
# Step 1 — Split the source into 30-second segments (stream copy, no re-encode)
ffmpeg -i input.mov \
-c copy \
-f segment \
-segment_time 30 \
-reset_timestamps 1 \
segment_%03d.mov
This produces segment_000.mov, segment_001.mov, etc. The -c copy flag is critical — you're not re-encoding here, just cutting at keyframes, so this step completes in seconds.
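As a quick sanity check before distributing anything, you can compute how many segments to expect. This is a small sketch using integer ceiling division; `segment_count` is a hypothetical helper, and because `-c copy` cuts on keyframes, the real count can differ slightly for long-GOP sources.

```shell
# Expected number of segments for a source of `duration_s` seconds cut into
# `segment_s`-second pieces (integer ceiling division).
segment_count() {
  duration_s=$1
  segment_s=$2
  echo $(( (duration_s + segment_s - 1) / segment_s ))
}

segment_count 120 30   # the 2-minute source from this post, 30-second segments
```

For the 2-minute source this prints 4, which is why one segment per node works out evenly on a four-node cluster.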
# Step 2 — Distribute and encode in parallel across nodes
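# (assumes ~/encode/ already exists on every node)
for n in 0 1 2 3; do
  i="0${n}"                            # two-digit suffix: segment_000 .. segment_003
  host="mini${n}@100.64.0.$((n + 1))"  # mini0..mini3 live at 100.64.0.1..4
  scp "segment_0${i}.mov" "${host}:~/encode/" &&
  ssh "$host" "ffmpeg -nostdin -y -i ~/encode/segment_0${i}.mov \
    -c:v libx264 -crf 23 -preset fast \
    ~/encode/out_0${i}.mp4" &&
  scp "${host}:~/encode/out_0${i}.mp4" . &   # copy the result back for the merge
done
wait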
# Step 3 — Merge outputs into the final file
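# Build the filelist first (expects out_000.mp4 .. out_003.mp4 in the current directory)
: > filelist.txt   # start empty so reruns don't append duplicate entries
for i in 00 01 02 03; do
  echo "file 'out_0${i}.mp4'" >> filelist.txt
done
ffmpeg -f concat -safe 0 -i filelist.txt -c copy final_output.mp4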
The trailing & in each loop iteration is what makes this parallel: bash runs each node's job in the background, and wait blocks until all four finish.
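The loop above assumes exactly one segment per node. For longer sources that produce more segments than nodes, one simple option is round-robin assignment. The sketch below reuses the host naming from this post (mini0 to mini3 at 100.64.0.1 to 100.64.0.4); `NODE_COUNT` and `host_for_segment` are hypothetical names introduced for illustration.

```shell
NODE_COUNT=4

# Map a segment index (e.g. 000, 005) to a host, wrapping around the nodes.
host_for_segment() {
  idx=${1#"${1%%[!0]*}"}   # strip leading zeros so 008 is not read as octal
  idx=${idx:-0}            # an all-zero index strips to empty; treat it as 0
  n=$(( idx % NODE_COUNT ))
  echo "mini${n}@100.64.0.$(( n + 1 ))"
}

host_for_segment 000   # mini0@100.64.0.1
host_for_segment 005   # mini1@100.64.0.2
```

Round-robin ignores how hard each segment is to encode, so it only balances well when segments are similar in complexity.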
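Why the speedup beats 4×: 252 seconds down to 58 is roughly a 4.3× speedup on four nodes. My explanation: on a single node, the CPU encoder and disk I/O compete, and distributing across four nodes spreads the I/O load too, so you're parallelizing reads from local NVMe as well as compute. One caveat: if the single-node baseline is the naive command shown earlier, it ran libx264's default preset (medium) while the cluster jobs use -preset fast, so part of the gap may come from the preset rather than from distribution.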
Measured results on a 2-minute 4K ProRes source:
| Setup | Encode Time |
|---|---|
| Single Mac Mini M2 Pro | 4m 12s |
| 4-node Tailscale cluster | 58s |
| AWS Elastic Transcoder | ~3m 40s (plus transfer time) |
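To make the first two rows concrete, here is the arithmetic as a small sketch. `speedup_report` is a hypothetical helper; the inputs are the measured times from the table converted to seconds.

```shell
# speedup = single-node time / cluster time; efficiency = speedup / nodes
speedup_report() {
  awk -v base="$1" -v cluster="$2" -v nodes="$3" 'BEGIN {
    s = base / cluster
    printf "speedup %.2fx, efficiency %.0f%%\n", s, 100 * s / nodes
  }'
}

speedup_report 252 58 4   # 4m 12s = 252 s, cluster = 58 s, 4 nodes
```

This prints `speedup 4.34x, efficiency 109%`. An efficiency above 100% means the cluster did better than a perfect four-way split of the single-node time.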
Section 3: AI Frame Upscaling with Draw Things
This is where it gets interesting. Two of the M2 Pro nodes run as dedicated upscale workers using Draw Things' img2img pipeline with Real-ESRGAN. The workflow: extract keyframes from a lower-res source, upscale each frame using the Neural Engine, then recompose the video.
# Extract scene-change keyframes (threshold 0.3 catches most cuts without over-sampling)
mkdir -p frames   # ffmpeg won't create the output directory itself
ffmpeg -i input.mp4 \
-vf "select=gt(scene\,0.3)" \
-vsync vfr \
frames/frame_%04d.png
The scene threshold of 0.3 controls sensitivity: lower values capture more frames. For talking-head video, 0.4 is usually cleaner. For action footage, drop to 0.2. On ffmpeg 5.1 and later, -vsync is deprecated in favor of -fps_mode vfr.
# Run the upscaler (wrapper script around Draw Things CLI)
python upscale_runner.py \
--input frames/ \
--output upscaled/ \
--model realesrgan-x4
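# Recompose the high-res video, pulling audio from the original
# Note: -framerate 30 treats the PNGs as a continuous 30 fps sequence, so the
# video track only matches the audio length if every source frame is present.
# -pix_fmt yuv420p is my addition: PNG input otherwise encodes as 4:4:4,
# which many players can't decode.
ffmpeg -framerate 30 \
-i upscaled/frame_%04d.png \
-i input.mp4 \
-map 0:v \
-map 1:a \
-c:v libx264 -crf 18 -preset slow \
-pix_fmt yuv420p \
final_hq.mp4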
Measured throughput on M2 Pro Neural Engine: 0.8 seconds per frame at 720p → 1440p. Because the scene filter extracts only a subset of frames from the 30 fps source, the total upscale pass on a 2-minute clip runs around 6–8 minutes with two nodes working in parallel.
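Those numbers can be cross-checked with a back-of-the-envelope estimate. The sketch below assumes the frames split evenly across nodes and ignores extraction and transfer time; `upscale_minutes` is a hypothetical helper, and the frame counts are worked backward from the figures above rather than measured.

```shell
# Estimated wall-clock minutes to upscale `frames` frames at `spf` seconds
# per frame, split evenly across `nodes` workers.
upscale_minutes() {
  awk -v frames="$1" -v spf="$2" -v nodes="$3" 'BEGIN {
    printf "%.1f\n", frames * spf / nodes / 60
  }'
}

upscale_minutes 900 0.8 2    # prints 6.0
upscale_minutes 1200 0.8 2   # prints 8.0
upscale_minutes 3600 0.8 2   # prints 24.0 (every frame of a 2-minute 30 fps clip)
```

So a 6 to 8 minute pass at 0.8 seconds per frame on two nodes corresponds to roughly 900 to 1,200 extracted frames, about a quarter to a third of the 3,600 frames in the clip.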
Variations and Gotchas
Segment boundary artifacts: -c copy cuts at keyframe boundaries, not exact timestamps. ProRes is intra-frame, so every frame is a keyframe and the cuts land where you ask. For H.264 sources with long keyframe intervals (libx264's default is 250 frames), segments can drift well away from 30 seconds. If precise segmentation matters, re-encode upstream with a shorter GOP, for example -g 30 for one keyframe per second at 30 fps.
SCP overhead on large files: For a 10GB ProRes source, scp adds measurable time. Consider mounting a shared NFS volume over the Tailscale network instead, so nodes read directly without copying. Alternatively, rsync with --partial lets an interrupted transfer resume; its -z compression rarely helps here, since video is already compressed.
Node failure handling: The current for loop has no error recovery. If one node dies mid-encode, wait still returns but you get a missing output file. Wrap the SSH call in a retry block or use a proper job scheduler (GNU parallel, or even a lightweight queue with Redis) if you need resilience.
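A retry block can be as small as the sketch below. `retry` and the `RETRY_DELAY` variable are hypothetical names, and this only covers transient failures: a node that stays down still needs its segment reassigned elsewhere.

```shell
# Run a command up to `max` times, pausing between attempts.
# Returns 0 on the first success, 1 if every attempt fails.
retry() {
  local max=$1; shift
  local attempt=1
  until "$@"; do
    if [ "$attempt" -ge "$max" ]; then
      echo "failed after ${attempt} attempts: $*" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "${RETRY_DELAY:-2}"
  done
}

# Usage inside the distribution loop (host and paths as in Step 2):
#   retry 3 ssh "$host" "ffmpeg -nostdin -y -i ~/encode/segment_000.mov ... ~/encode/out_000.mp4"
```

Pairing this with a check that every expected output file exists before the concat step catches the case where all retries fail.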
macOS vs. Linux on ffmpeg flags: -preset fast is a libx264 option, so it behaves the same on both platforms, but hardware acceleration differs. On Apple Silicon, switch the encoder to -c:v h264_videotoolbox to use the hardware H.264 encoder (-hwaccel videotoolbox only affects decoding). This can cut encode time by another 30–40%, at the cost of giving up CRF-based quality control.
# Hardware-accelerated variant for Apple Silicon nodes
ffmpeg -i ~/encode/segment_0${i}.mov \
-c:v h264_videotoolbox \
-b:v 8M \
~/encode/out_0${i}.mp4
Note that h264_videotoolbox uses bitrate mode, not CRF — you'll need to tune -b:v per your quality target.
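When tuning -b:v, it helps to know what file size a given bitrate implies. This is a rough sketch: it covers the video stream only, ignores audio and container overhead, and real rate control can land above or below the target. `estimated_size_mb` is a hypothetical helper.

```shell
# Approximate video size in megabytes: megabits per second * seconds / 8.
estimated_size_mb() {
  awk -v mbps="$1" -v seconds="$2" 'BEGIN { printf "%.0f\n", mbps * seconds / 8 }'
}

estimated_size_mb 8 30    # one 30-second segment at -b:v 8M: prints 30
estimated_size_mb 8 120   # the full 2-minute clip: prints 120
```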
The Takeaway
If you have Apple Silicon hardware sitting around, the total cost of running this cluster is electricity. The Tailscale mesh eliminates the network configuration headache that used to make self-hosted multi-machine pipelines painful. Four nodes with segment-level distribution isn't just cheaper than Elastic Transcoder — it's faster.
The natural next step is adding a lightweight job queue so the orchestrator can distribute variable-length jobs dynamically rather than statically assigning one segment per node. That unlocks better load balancing when segments vary significantly in complexity.
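As a starting point for that queue, here is a minimal pull-based sketch that needs nothing beyond the shell. It assumes the orchestrator runs one worker loop per node locally, with all segments in one local directory; `claim_next`, `worker`, the `.claimed` marker directories, and the `encode_on` command in the usage comment are all hypothetical names, not part of the setup above.

```shell
# Claim the next unclaimed segment. mkdir is atomic, so when two workers
# race for the same segment only one of them succeeds.
claim_next() {
  queue_dir=$1
  for seg in "$queue_dir"/segment_*.mov; do
    [ -e "$seg" ] || continue            # glob matched nothing
    if mkdir "${seg}.claimed" 2>/dev/null; then
      printf '%s\n' "$seg"
      return 0
    fi
  done
  return 1                               # nothing left to claim
}

# Keep pulling segments and hand each one to the given command.
worker() {
  queue_dir=$1; shift
  while seg=$(claim_next "$queue_dir"); do
    "$@" "$seg" || echo "job failed: $seg" >&2
  done
}

# Usage (encode_on would wrap the scp/ssh/scp steps for one host):
#   worker . encode_on mini0@100.64.0.1 &
#   worker . encode_on mini1@100.64.0.2 &
#   wait
```

Because each worker takes a new segment as soon as it finishes the last one, a node that draws an easy segment picks up more work instead of sitting idle.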