Distributed Video Encoding on Mac Minis: ffmpeg + Tailscale Mesh

If you're paying cloud bills to transcode video, you might be solving the wrong problem. Four Mac Minis wired together with Tailscale, with ffmpeg segments distributed across them, cut a 2-minute 4K encode from 4 minutes 12 seconds to 58 seconds, with no ongoing cost beyond electricity. Here's the exact setup, the gotchas I hit, and how I added AI upscaling on top with Draw Things.

[Figure: overall pipeline flow]


The Problem: Cloud Transcoding Adds Up Fast

AWS Elastic Transcoder and similar services are convenient, but the pricing model punishes you for volume. A 4K ProRes source file that runs 2 minutes costs real money to transcode at scale — and you're waiting on remote I/O on top of the compute time.

The first thing I tried was a single-node ffmpeg job on one Mac Mini M2 Pro. The machine is powerful, but a naive ffmpeg -i input.mov -c:v libx264 output.mp4 command only saturates a handful of cores. The 4K ProRes input took 4 minutes 12 seconds end-to-end. Not terrible, but not impressive for hardware that costs $1,300.

The deeper issue: video encoding is embarrassingly parallel at the segment level, and nothing in the default ffmpeg workflow exploits that across machines.

[Figure: single-node bottleneck vs distributed path]


Section 1: Wiring the Cluster with Tailscale

Tailscale is a mesh VPN built on WireGuard. The key difference from a traditional VPN is that nodes connect peer-to-peer wherever a direct path exists (falling back to Tailscale's DERP relays when it doesn't), so there's no central tunnel to saturate. Each machine gets a stable 100.x.x.x address in the 100.64.0.0/10 range, and that address stays constant even if the machine moves networks or lives behind double NAT.

For a home lab or office setup, that means you don't touch router firewall rules. Install, authenticate, done.

# Run on each Mac Mini
brew install tailscale
sudo brew services start tailscale   # runs tailscaled as a system daemon
sudo tailscale up --authkey tskey-xxxxxxxxxxxxxxxx

# Verify all nodes see each other
tailscale status

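Expected output from tailscale status once all four nodes are up (abridged; the sequential addresses are illustrative, since Tailscale assigns each node's 100.x.y.z address itself):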

100.64.0.1  mini0   macOS   -
100.64.0.2  mini1   macOS   -
100.64.0.3  mini2   macOS   -
100.64.0.4  mini3   macOS   -

Once you see all four, test SSH across the mesh:

ssh mini1@100.64.0.2 "uname -a"

If that returns without a password prompt (configure ~/.ssh/authorized_keys on each node beforehand), the cluster is ready.
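Before moving on, it can help to probe every node in one pass. The sketch below is a hypothetical helper: check_nodes and the NODES list are illustrative names, and the user@address pairs are assumed to match the ones shown above. It relies on ssh's BatchMode option, which makes ssh fail rather than prompt for a password.

```shell
# Hypothetical node list; substitute the user@address pairs from `tailscale status`.
NODES="mini0@100.64.0.1 mini1@100.64.0.2 mini2@100.64.0.3 mini3@100.64.0.4"

# Print every node that cannot be reached non-interactively; return 1 if any failed.
check_nodes() {
  failed=0
  for node in $NODES; do
    # BatchMode=yes makes ssh fail instead of prompting for a password.
    if ! ssh -n -o BatchMode=yes -o ConnectTimeout=5 "$node" true; then
      echo "unreachable: $node"
      failed=1
    fi
  done
  return "$failed"
}
```

Calling check_nodes before each batch gives a quick signal that key-based SSH works everywhere, so a password prompt never stalls a background job.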

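The gotcha I hit: Tailscale node keys expire by default (after 180 days), and an expired key drops the machine off the tailnet, along with any SSH session running over it, until someone re-authenticates. If you're leaving these machines unattended, disable key expiry for each node in the admin console, and generate the auth key as reusable so the same key can register all four nodes.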


Section 2: Distributed ffmpeg Encoding

The core idea is simple: split the source into 30-second segments, ship one segment per node, encode in parallel, then concatenate.

# Step 1: Split the source into 30-second segments (stream copy, no re-encode)
ffmpeg -i input.mov \
  -c copy \
  -f segment \
  -segment_time 30 \
  -reset_timestamps 1 \
  segment_%03d.mov

This produces segment_000.mov, segment_001.mov, etc. The -c copy flag is critical — you're not re-encoding here, just cutting at keyframes, so this step completes in seconds.
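To know how many nodes a clip will keep busy, the segment count can be estimated up front. This is a minimal sketch with a hypothetical helper name (segment_count). It assumes the duration has already been rounded up to whole seconds, and because stream-copy cuts land on keyframes, the real count can differ slightly.

```shell
# segment_count DURATION_S SEGMENT_S: segments needed to cover the clip (ceiling division).
# DURATION_S must be a whole number of seconds. One way to read the duration:
#   ffprobe -v error -show_entries format=duration \
#           -of default=noprint_wrappers=1:nokey=1 input.mov
segment_count() {
  echo $(( ($1 + $2 - 1) / $2 ))
}

segment_count 120 30   # 2-minute source, 30-second segments -> 4
segment_count 125 30   # a 5-second tail still needs its own segment -> 5
```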

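# Step 2: Distribute and encode in parallel across nodes
# Segment 000 goes to mini0 (100.64.0.1), segment 001 to mini1 (100.64.0.2), and so on
for i in 00 01 02 03; do
  n=$((10#$i))                         # strip the leading zero: 00 -> 0, 01 -> 1
  host="mini${n}@100.64.0.$((n + 1))"
  (
    scp "segment_0${i}.mov" "${host}:~/encode/" &&
    ssh -n "$host" \
      "ffmpeg -nostdin -y -i ~/encode/segment_0${i}.mov \
              -c:v libx264 -crf 23 -preset fast \
              ~/encode/out_0${i}.mp4" &&
    scp "${host}:~/encode/out_0${i}.mp4" .   # bring the result back for the merge
  ) &
done
wait

# Step 3: Merge outputs into the final file
# Build the file list first, starting from an empty file so reruns don't append duplicates
: > filelist.txt
for i in 00 01 02 03; do
  echo "file 'out_0${i}.mp4'" >> filelist.txt
done

ffmpeg -f concat -safe 0 -i filelist.txt -c copy final_output.mp4

The & after each subshell is what makes this parallel: bash forks one copy-encode-fetch chain per node, and wait blocks until all four finish.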
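One limitation of a bare wait is that it returns 0 no matter how the background jobs exit. A minimal sketch of tracking each job separately follows; run_parallel is a hypothetical helper name, and each argument is assumed to be a complete shell command.

```shell
# run_parallel CMD...: run each argument as a background shell command and
# return the number of commands that exited non-zero.
run_parallel() {
  pids=""
  for cmd in "$@"; do
    sh -c "$cmd" &
    pids="$pids $!"
  done
  failures=0
  for pid in $pids; do
    # `wait PID` returns that job's exit status, unlike a bare `wait`.
    wait "$pid" || failures=$((failures + 1))
  done
  return "$failures"
}
```

Passing the four per-node commands as arguments and checking the return value before the merge step turns a silently missing output file into an explicit failure.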

[Figure: segment encode timeline]

Why the speedup beats 4×: 252 seconds down to 58 seconds is roughly a 4.3× speedup on four nodes. On a single node, the CPU encoder and disk I/O compete. Distributing across four nodes spreads the I/O load too, which is where the extra efficiency comes from: you're parallelizing not just compute but also reads from local NVMe on each node.
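The arithmetic behind that claim is easy to check. The helper names below (speedup, efficiency) are illustrative, and the inputs are the two timings reported in this article converted to seconds.

```shell
# speedup BASELINE_S PARALLEL_S: print baseline / parallel to two decimals.
speedup() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

# efficiency BASELINE_S PARALLEL_S NODES: speedup divided by node count.
efficiency() {
  awk -v a="$1" -v b="$2" -v n="$3" 'BEGIN { printf "%.2f\n", a / b / n }'
}

speedup 252 58        # 4m 12s vs 58s -> 4.34
efficiency 252 58 4   # -> 1.09 (above 1.00 means better than linear scaling)
```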

Measured results on a 2-minute 4K ProRes source:

| Setup | Encode Time |
|---|---|
| Single Mac Mini M2 Pro | 4m 12s |
| 4-node Tailscale cluster | 58s |
| AWS Elastic Transcoder | ~3m 40s (plus transfer time) |

Section 3: AI Frame Upscaling with Draw Things

This is where it gets interesting. Two of the M2 Pro nodes run as dedicated upscale workers using Draw Things' img2img pipeline with Real-ESRGAN. The workflow: extract keyframes from a lower-res source, upscale each frame using the Neural Engine, then recompose the video.

# Extract scene-change keyframes (threshold 0.3 catches most cuts without over-sampling)
mkdir -p frames   # ffmpeg will not create the output directory itself
ffmpeg -i input.mp4 \
  -vf "select=gt(scene\,0.3)" \
  -vsync vfr \
  frames/frame_%04d.png

The scene filter value 0.3 is a sensitivity threshold — lower means more frames captured. For talking-head video, 0.4 is usually cleaner. For action footage, drop to 0.2.

# Run the upscaler (wrapper script around Draw Things CLI)
python upscale_runner.py \
  --input frames/ \
  --output upscaled/ \
  --model realesrgan-x4

# Recompose the high-res video, pulling audio from the original
# (-pix_fmt yuv420p keeps the PNG-sourced output playable in common players)
ffmpeg -framerate 30 \
  -i upscaled/frame_%04d.png \
  -i input.mp4 \
  -map 0:v \
  -map 1:a \
  -c:v libx264 -crf 18 -preset slow \
  -pix_fmt yuv420p \
  final_hq.mp4

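Measured throughput on M2 Pro Neural Engine: 0.8 seconds per frame at 720p → 1440p. For a 30fps source that's not extracting every frame, the total upscale pass on a 2-minute clip runs around 6–8 minutes on two nodes working in parallel, which at 0.8 seconds per frame works out to roughly 900–1,200 upscaled frames out of the clip's 3,600. Note that recomposing at -framerate 30 only stays in sync with the original audio when the frame sequence covers every source frame; a sparse selection plays back shorter than the audio track.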
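To budget an upscale pass before starting it, the per-frame figure can be turned into a wall-clock estimate. This is a rough sketch with a hypothetical helper name (upscale_minutes). It assumes frames split evenly across nodes and ignores transfer and recompose time.

```shell
# upscale_minutes FRAMES SEC_PER_FRAME NODES: estimated wall-clock minutes,
# assuming the frames are divided evenly across the nodes.
upscale_minutes() {
  awk -v f="$1" -v s="$2" -v n="$3" 'BEGIN { printf "%.1f\n", f * s / n / 60 }'
}

upscale_minutes 3600 0.8 2   # every frame of a 2-minute 30 fps clip -> 24.0
upscale_minutes 1000 0.8 2   # a sparser selection of 1,000 frames -> 6.7
```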

[Figure: upscale integration into pipeline]


Variations and Gotchas

Segment boundary artifacts: -c copy cuts at keyframe boundaries, not exact timestamps. ProRes is intra-frame, so every frame is a keyframe and cuts land where you ask. For H.264 sources with long keyframe intervals (the x264 default is 250 frames), segments might not be exactly 30 seconds. If precise segmentation matters, re-encode upstream with a short GOP such as -g 30 (one keyframe per second at 30 fps).

SCP overhead on large files: For a 10GB ProRes source, scp adds measurable time. Consider mounting a shared NFS volume over the Tailscale network instead, so nodes read directly without copying. Or use rsync, which can resume an interrupted transfer with --partial; its -z compression helps little on video that is already compressed.

Node failure handling: The current for loop has no error recovery. If one node dies mid-encode, wait still returns but you get a missing output file. Wrap the SSH call in a retry block or use a proper job scheduler (GNU parallel, or even a lightweight queue with Redis) if you need resilience.
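A retry block can be as small as the sketch below. The retry function and the RETRY_DELAY variable are hypothetical names, and a fixed delay between attempts is an assumption; a real setup might prefer backoff or a proper scheduler.

```shell
# retry MAX CMD...: rerun CMD until it succeeds or MAX attempts have been made.
# RETRY_DELAY (seconds between attempts) defaults to 2.
retry() {
  max=$1
  shift
  attempt=1
  until "$@"; do
    if [ "$attempt" -ge "$max" ]; then
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "${RETRY_DELAY:-2}"
  done
  return 0
}
```

Inside the distribution loop this would wrap the remote call, for example retry 3 ssh -n "$host" "ffmpeg ...", so a dropped connection costs one re-run of that segment instead of a missing output file.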

macOS vs. Linux on ffmpeg flags: -preset fast is available on both, but hardware acceleration differs. On Apple Silicon, swap -c:v libx264 for -c:v h264_videotoolbox in the encode command to use the hardware H.264 encoder (-hwaccel videotoolbox only accelerates decoding). This can cut encode time by another 30–40%, at the cost of giving up CRF rate control.

# Hardware-accelerated variant for Apple Silicon nodes
ffmpeg -i ~/encode/segment_0${i}.mov \
  -c:v h264_videotoolbox \
  -b:v 8M \
  ~/encode/out_0${i}.mp4

Note that h264_videotoolbox uses bitrate mode, not CRF — you'll need to tune -b:v per your quality target.
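Because bitrate mode fixes the data rate rather than the quality, the output size is predictable, which gives a starting point for tuning. The helper below is a hypothetical sketch (est_size_mb is an illustrative name). It covers the video stream only and treats the target bitrate as the average actually achieved.

```shell
# est_size_mb BITRATE_MBPS SECONDS: approximate video size in megabytes
# (megabits per second * seconds / 8 bits per byte).
est_size_mb() {
  awk -v r="$1" -v t="$2" 'BEGIN { printf "%.1f\n", r * t / 8 }'
}

est_size_mb 8 30    # one 30-second segment at -b:v 8M -> 30.0
est_size_mb 8 120   # the full 2-minute clip -> 120.0
```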


The Takeaway

If you have Apple Silicon hardware sitting around, the total cost of running this cluster is electricity. The Tailscale mesh eliminates the network configuration headache that used to make self-hosted multi-machine pipelines painful. Four nodes with segment-level distribution isn't just cheaper than Elastic Transcoder — it's faster.

The natural next step is adding a lightweight job queue so the orchestrator can distribute variable-length jobs dynamically rather than statically assigning one segment per node. That unlocks better load balancing when segments vary significantly in complexity.
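As a rough illustration of that idea, the sketch below lets each node pull the next unclaimed segment instead of being assigned one up front. It is not a full job queue: claim_next, worker, and encode_on are hypothetical names, and it assumes the segments sit in one directory on the orchestrator, using mkdir as an atomic lock.

```shell
# claim_next DIR: atomically claim one unclaimed segment in DIR and print its path.
claim_next() {
  for candidate in "$1"/segment_*.mov; do
    [ -f "$candidate" ] || continue
    # mkdir either creates the lock directory or fails, so exactly one worker wins.
    if mkdir "$candidate.lock" 2>/dev/null; then
      echo "$candidate"
      return 0
    fi
  done
  return 1
}

# worker NODE DIR: keep claiming and encoding segments until none are left.
# encode_on is hypothetical: it stands for the scp + ssh ffmpeg + fetch chain
# for a single segment on a single node.
worker() {
  while seg=$(claim_next "$2"); do
    encode_on "$1" "$seg" || echo "failed: $seg on $1" >&2
  done
}

# Usage sketch: one background worker per node, all pulling from the same directory.
# for node in mini0@100.64.0.1 mini1@100.64.0.2; do worker "$node" . & done; wait
```

A node that finishes a simple segment early claims the next one right away, which is the load-balancing behavior a static one-segment-per-node assignment lacks.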

