A few years ago I first learned about Ultimate Vocal Remover, a desktop program that works with several different ML models to split a song into instrumental and vocal tracks. I came across it while learning about voice cloning. When I saw how well the vocal remover models worked, I immediately thought they would be perfect for making karaoke videos.
The problem with many karaoke videos (those found in karaoke places) is threefold: (1) the instrumental is often poor quality and sometimes sounds nothing like the original song, (2) the pitch is sometimes off, and (3) the videos are sometimes completely unrelated to the original song or music video.
I finally got around to building something from this idea, which I’d had for a few years. It currently works as a Python script; perhaps one day I’ll build a desktop app or web app with this tech.
Core Idea
The primary goal is to take a standard video file (e.g., MP4) as input and produce a new video file with two key features:
- Karaoke Audio Mix: The audio is re-mixed into a stereo track where the left channel contains the instrumental-only version of the song, and the right channel contains the original, unprocessed audio. This allows for easy toggling between karaoke and original versions in a media player (common in karaoke places). I also added an option to use just the instrumental-only track, if that’s what we want to output.
- Synchronized Lyrics: The lyrics of the song are transcribed and embedded into the video as karaoke-style subtitles, highlighting each word as it is sung.
Technology Stack
The application is written in Python, orchestrating several key pieces of tech:
- Audio/Video Processing: ffmpeg is used as the workhorse for all audio/video manipulation, including extraction, stream copying, and final re-assembly.
- Vocal Separation: onnxruntime runs the UVR_MDXNET_Main.onnx model, a high-quality vocal separation model from the Ultimate Vocal Remover project.
- Speech-to-Text: faster-whisper, an optimized implementation of OpenAI’s Whisper model, is used for fast and accurate transcription of the isolated vocals to generate word-level timestamps.
Application Workflow
The conversion process is a four-step pipeline. The entire process is automated by a single Python script.
Step 1: Audio Extraction
The process begins by using ffmpeg to extract the full original audio track from the input video file. It is saved as an uncompressed WAV file. When the dual-channel option is used, this original audio is also kept for the right audio channel of the final video.
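As a rough illustration of this step, the sketch below builds and runs an ffmpeg command from Python. The function names, the 44.1 kHz sample rate, and the stereo 16-bit PCM settings are assumptions for the example, not details taken from the actual script.

```python
import subprocess
from pathlib import Path


def build_extract_command(video_path, wav_path):
    """Build an ffmpeg command that writes the video's audio as 16-bit PCM WAV."""
    return [
        "ffmpeg",
        "-y",                    # overwrite the output file if it already exists
        "-i", video_path,        # input video
        "-vn",                   # ignore the video stream
        "-acodec", "pcm_s16le",  # uncompressed 16-bit little-endian PCM
        "-ar", "44100",          # assumed sample rate
        "-ac", "2",              # assumed stereo output
        wav_path,
    ]


def extract_audio(video_path, wav_path):
    """Run ffmpeg (must be on PATH) and return the path of the extracted WAV."""
    subprocess.run(build_extract_command(video_path, wav_path), check=True)
    return Path(wav_path)
```

Calling extract_audio("song.mp4", "original.wav") would then leave the original audio on disk for the later steps.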
Step 2: Vocal Separation
The extracted WAV file is fed into a vocal separation module. This module uses the pre-trained UVR_MDXNET_Main.onnx model. This is an MDX-Net (Music Demixing Network) model that excels at separating music into its component stems. We configure it to produce two outputs:
- Instrumental Track: A WAV file containing only the music.
- Vocal Track: A WAV file containing only the vocals.
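How the two stems are written out is not spelled out here. One common approach, assumed in this sketch, is that the model estimates one stem (the vocals) and the other is obtained by subtracting that estimate from the original mix, sample by sample. The helper names are hypothetical, and the subtraction only makes sense when both files are 16-bit PCM with the same length, channel count, and sample rate.

```python
import array
import sys
import wave


def read_pcm16(path):
    """Return (params, samples) for a 16-bit PCM WAV; samples are interleaved."""
    with wave.open(path, "rb") as wav:
        params = wav.getparams()
        if params.sampwidth != 2:
            raise ValueError("expected 16-bit PCM audio")
        samples = array.array("h")
        samples.frombytes(wav.readframes(params.nframes))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV data is little-endian
    return params, samples


def write_pcm16(path, nchannels, framerate, samples):
    """Write interleaved int16 samples as a PCM WAV file."""
    out = array.array("h", samples)
    if sys.byteorder == "big":
        out.byteswap()
    with wave.open(path, "wb") as wav:
        wav.setnchannels(nchannels)
        wav.setsampwidth(2)
        wav.setframerate(framerate)
        wav.writeframes(out.tobytes())


def subtract_stem(mix, stem):
    """Subtract one stem from the mix, clipping results to the int16 range."""
    if len(mix) != len(stem):
        raise ValueError("mix and stem must have the same number of samples")
    return [max(-32768, min(32767, m - s)) for m, s in zip(mix, stem)]


def derive_instrumental(mix_path, vocals_path, out_path):
    """Assumed approach: instrumental = original mix minus estimated vocals."""
    params, mix = read_pcm16(mix_path)
    _, vocals = read_pcm16(vocals_path)
    write_pcm16(out_path, params.nchannels, params.framerate,
                subtract_stem(mix, vocals))
```

The model inference itself (running the ONNX file through onnxruntime) is left out, since its input and output tensor layout is not described here.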
Step 3: Lyrics Generation
The isolated vocal track is then passed to the transcription module. We use the faster-whisper library with the small model for a good balance of speed and accuracy. The library is configured to perform transcription and generate word-level timestamps. These timestamps are then formatted into an Advanced SubStation Alpha (.ass) subtitle file. The .ass format is specifically chosen because it supports karaoke-style effects, allowing us to specify the duration of each word for the “bouncing ball” highlighting effect.
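To make the timestamp-to-subtitle conversion concrete, here is a minimal sketch. It assumes one subtitle line per Whisper segment and uses the ASS \k tag, whose duration is given in centiseconds. The function names, the style values in the header, and the CPU/int8 model settings are illustrative assumptions rather than the script's actual choices.

```python
ASS_HEADER = """[Script Info]
ScriptType: v4.00+
PlayResX: 1280
PlayResY: 720

[V4+ Styles]
Format: Name, Fontname, Fontsize, PrimaryColour, SecondaryColour, OutlineColour, BackColour, Bold, Italic, Underline, StrikeOut, ScaleX, ScaleY, Spacing, Angle, BorderStyle, Outline, Shadow, Alignment, MarginL, MarginR, MarginV, Encoding
Style: Default,Arial,48,&H0000FFFF,&H00FFFFFF,&H00000000,&H00000000,0,0,0,0,100,100,0,0,1,2,0,2,10,10,40,1

[Events]
Format: Layer, Start, End, Style, Name, MarginL, MarginR, MarginV, Effect, Text
"""


def transcribe_words(vocals_path, model_size="small"):
    """Return one list of (start_sec, end_sec, text) tuples per Whisper segment."""
    # Imported lazily so the formatting helpers below work without the package.
    from faster_whisper import WhisperModel

    # device and compute_type are assumptions; adjust for your hardware.
    model = WhisperModel(model_size, device="cpu", compute_type="int8")
    segments, _info = model.transcribe(vocals_path, word_timestamps=True)
    return [
        [(w.start, w.end, w.word.strip()) for w in seg.words]
        for seg in segments
        if seg.words
    ]


def ass_time(seconds):
    """Format seconds as H:MM:SS.cc, the timestamp format used by ASS."""
    cs = round(seconds * 100)
    hours, rem = divmod(cs, 360000)
    minutes, rem = divmod(rem, 6000)
    secs, cs = divmod(rem, 100)
    return f"{hours}:{minutes:02d}:{secs:02d}.{cs:02d}"


def karaoke_line(words):
    """Build one Dialogue event with a \\k tag (in centiseconds) per word."""
    start, end = words[0][0], words[-1][1]
    cursor = round(start * 100)
    text = ""
    for w_start, w_end, word in words:
        start_cs, end_cs = round(w_start * 100), round(w_end * 100)
        if start_cs > cursor:
            # An empty \k block covers the silence before this word.
            text += f"{{\\k{start_cs - cursor}}}"
        duration = max(end_cs - max(start_cs, cursor), 0)
        text += f"{{\\k{duration}}}{word} "
        cursor = max(end_cs, cursor)
    return (f"Dialogue: 0,{ass_time(start)},{ass_time(end)},"
            f"Default,,0,0,0,,{text.rstrip()}")


def build_ass(lines):
    """Return the full .ass file contents for a list of word-timed lines."""
    events = [karaoke_line(words) for words in lines if words]
    return ASS_HEADER + "\n".join(events) + "\n"
```

With these helpers, build_ass(transcribe_words("vocals.wav")) would produce the text to save as the .ass file. In ASS karaoke, words are drawn in the style's secondary colour until their \k time arrives and in the primary colour afterwards.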
Step 4: Final Video Assembly
In the final step, ffmpeg is called again to construct the output karaoke video. It combines four distinct sources:
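- Original Video Stream: The video (without audio) is taken from the input file. A direct stream copy would avoid any re-encoding loss, but it is only possible when the subtitles are muxed as a separate track; burning them into the frames, as described below, makes ffmpeg re-encode the video.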
- Instrumental Audio (Left Channel): The instrumental track from Step 2 is mapped to the left channel of the new stereo audio track.
- Original Audio (Right Channel): The original, unprocessed audio from Step 1 is mapped to the right channel.
- Karaoke Subtitles: The .ass file from Step 3 is burned directly onto the video frames.
The result is a self-contained karaoke video file, ready for playback.
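One way the assembly could be expressed is sketched below. The function name, the libx264 and aac encoders, and the assumption that both audio inputs are stereo are choices made for the example. It relies on ffmpeg's pan, join, and ass filters; the ass filter requires an ffmpeg build with libass.

```python
import subprocess


def build_karaoke_command(video, instrumental, original, ass_file, output):
    """Build an ffmpeg command for the dual-channel karaoke video."""
    filter_graph = (
        # Downmix each (assumed stereo) source to mono so it fits one channel.
        "[1:a]pan=mono|c0=0.5*c0+0.5*c1[inst];"
        "[2:a]pan=mono|c0=0.5*c0+0.5*c1[orig];"
        # Instrumental goes to the left channel, original audio to the right.
        "[inst][orig]join=inputs=2:channel_layout=stereo:"
        "map=0.0-FL|1.0-FR[aout];"
        # Draw the subtitles onto the frames. Paths containing colons or
        # backslashes would need escaping; a simple filename is assumed.
        f"[0:v]ass={ass_file}[vout]"
    )
    return [
        "ffmpeg", "-y",
        "-i", video,
        "-i", instrumental,
        "-i", original,
        "-filter_complex", filter_graph,
        "-map", "[vout]",
        "-map", "[aout]",
        "-c:v", "libx264",  # filtered video has to be encoded (assumed codec)
        "-c:a", "aac",      # assumed audio codec for an MP4 container
        output,
    ]


def assemble(video, instrumental, original, ass_file, output):
    """Run the assembly command; ffmpeg must be on PATH."""
    subprocess.run(
        build_karaoke_command(video, instrumental, original, ass_file, output),
        check=True,
    )
```

For the instrumental-only option, the same command could drop the third input and the pan/join stages and map the instrumental track directly.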
Here is an example of what the program did:
Original music video: https://www.youtube.com/watch?v=Oextk-If8HQ
Karaoke output: https://www.youtube.com/watch?v=0dBf14_TLZc