Cloning My Voice for an Audiobook

Why local
Step 1: the local TTS stack
Step 2: record and clean a reference
Step 3: render in your voice (and the setting that matters)
Step 4: find the bad chunks without listening
Step 5: make it sound like one recording
Step 6: assemble the M4B
What to know before you commit
The piece I would not skip
Further reading

I narrated my book in my own voice without recording a single chapter. The whole thing runs locally on a Mac, costs nothing, and sounds like me. This is the build, step by step, including the failure modes nobody warns you about and the settings that fix them. You can hear the result in the audiobooks at /books/.

Why local

The options for an audiobook in your own voice are a recording studio (time and money) or a cloud voice service (a subscription, and your manuscript sitting on someone else’s server). I wanted neither. What changed in 2026 is that voice cloning runs entirely on an M-series Mac: no API key, no per-character billing, no upload. The book never leaves the laptop.

You need: an Apple-Silicon Mac, about a minute of clean audio of your own voice, ffmpeg, and a couple of hours of setup. The render itself is long but unattended.

Step 1: the local TTS stack

Two pieces. mlx-audio runs speech models on MLX, Apple’s machine-learning framework for Apple silicon, so inference runs on the Mac’s GPU. Qwen3-TTS is the model; its small “Base” variant does zero-shot voice cloning.

python3 -m venv ~/.venvs/mlx-audio
~/.venvs/mlx-audio/bin/pip install mlx-audio

The model downloads on first use (mlx-community/Qwen3-TTS-12Hz-0.6B-Base-bf16).

Step 2: record and clean a reference

The clone is zero-shot, which means there is no training step and no model file to keep. The voice is just a short reference clip plus its exact transcript; feed those on every generation and the model reconstructs your voice each time.

Record about a minute of yourself reading one passage, at the pace you want the book to have. The Base clone does not take a “read slowly” instruction; it imitates the delivery of the reference. If you read the sample fast and clipped, the whole book is fast and clipped.

The recording quality is the ceiling: a noisy room clones a noisy voice. So clean it:

ffmpeg -i voice.m4a -ac 1 -ar 24000 \
  -af "highpass=f=75,afftdn=nf=-28,loudnorm=I=-18:TP=-2:LRA=11" reference.wav

Mono, 24 kHz, a high-pass to drop rumble, light denoise, loudness normalization. Keep it gentle: aggressive noise reduction leaves artifacts the model will faithfully clone. Save the transcript verbatim in reference.txt; a mismatch between audio and transcript degrades the clone.
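Before rendering anything, it can be worth confirming that the cleaned clip really came out mono, at 24 kHz, and long enough. The check below is a minimal sketch using only the standard library; the function name and the 30-second floor are my own assumptions, not part of mlx-audio, and it assumes ffmpeg wrote ordinary 16-bit PCM (its default for .wav).

```python
import wave

def check_reference(path, want_rate=24000, min_seconds=30.0):
    """Return a list of problems found in the reference clip (empty if none).

    min_seconds is an assumed floor for illustration, not a documented limit.
    """
    with wave.open(path, "rb") as w:
        channels = w.getnchannels()
        rate = w.getframerate()
        seconds = w.getnframes() / rate
    problems = []
    if channels != 1:
        problems.append(f"expected mono, got {channels} channels")
    if rate != want_rate:
        problems.append(f"expected {want_rate} Hz, got {rate} Hz")
    if seconds < min_seconds:
        problems.append(f"only {seconds:.1f} s of audio")
    return problems
```

Calling check_reference("reference.wav") and getting an empty list back means the file matches what the ffmpeg command was supposed to produce. It says nothing about how clean the audio sounds; that still takes ears.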

Step 3: render in your voice (and the setting that matters)

Generation is one call per chunk of text:

import mlx.core as mx
from mlx_audio.tts.utils import load_model

mx.random.seed(20260628)                 # reproducibility (the API has no seed arg)
model = load_model("mlx-community/Qwen3-TTS-12Hz-0.6B-Base-bf16")

result = list(model.generate(
    text=piece,                          # a SMALL piece, see below
    ref_audio="reference.wav",
    ref_text=open("reference.txt").read(),
    temperature=0.3,                     # NOT the default 0.9
    repetition_penalty=1.3,
    max_tokens=cap,                      # capped near the piece's expected length
))

Four settings carry the whole thing, and they are the difference between a clean book and three hours of intermittent garble:

Feed small pieces. Split your text so nothing over ~300 characters goes in one call. Long inputs give the model room to lose the thread.
temperature=0.3, not the default 0.9. High randomness is fine for a sentence and a slow disaster over a long passage.
repetition_penalty=1.3. An explicit penalty on repeating tokens kills loops directly.
Cap max_tokens near the expected length (roughly chars * 0.14 * 12.5 frames, plus a margin). Even a misbehaving piece then cannot run away.
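The splitting and capping rules above can be sketched in a few lines. This is one possible implementation, not the only way to do it: the function names are hypothetical, the splitter prefers sentence ends and falls back to word boundaries, and the 1.5x margin is an assumed value for "plus a margin". The 0.14 seconds per character and 12.5 frames per second are the figures quoted above.

```python
import math
import re

FRAMES_PER_SECOND = 12.5   # frame rate quoted in the text
SECONDS_PER_CHAR = 0.14    # rough duration per character quoted in the text

def split_pieces(text, limit=300):
    """Split text into pieces of at most `limit` characters, preferring sentence ends."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    pieces, current = [], ""
    for sentence in sentences:
        # A single sentence longer than the limit is broken at word boundaries.
        while len(sentence) > limit:
            cut = sentence.rfind(" ", 0, limit + 1)
            if cut <= 0:
                cut = limit                      # no space to cut at: hard cut
            if current:
                pieces.append(current)
                current = ""
            pieces.append(sentence[:cut].strip())
            sentence = sentence[cut:].strip()
        if not sentence:
            continue
        if current and len(current) + 1 + len(sentence) > limit:
            pieces.append(current)               # adding this sentence would overflow
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        pieces.append(current)
    return pieces

def token_cap(piece, margin=1.5):
    """max_tokens for one piece: expected frames times an assumed safety margin."""
    return math.ceil(len(piece) * SECONDS_PER_CHAR * FRAMES_PER_SECOND * margin)
```

Each piece then goes into one generate call with max_tokens=token_cap(piece). A 100-character piece gets a cap of 263 tokens, so even if it loops it stops after about 21 seconds instead of running to the global ceiling.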

Here is what happens if you skip these. I first rendered the whole book on the defaults, three and a half hours of audio, mostly clean, with stretches of looping garble scattered through it. The garbled chunks all clocked in at exactly 320 seconds. That number is the tell: my token limit was 4096, which at the model’s frame rate is 320 seconds. The model was running away, generating until it hit the ceiling, looping the whole time. Long inputs (one 4,000-character glossary paragraph) and temperature 0.9 were the cause. Small pieces, low temperature, and a repetition penalty fixed it: a 467-character chunk that had been 320 seconds of nonsense came back as 33 seconds of clean narration.

So render piece by piece, cache each one to disk, and skip what already exists. Generation runs roughly three to four times slower than real time on an M-series Mac, so a 3.5-hour book takes about eleven hours, and you will want it resumable so an interruption at hour ten does not start over.
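A resumable loop can be as simple as "one file per piece, skip the ones that exist". The sketch below assumes a hypothetical render_piece(text, path) callable that wraps the generate call and writes a WAV to the given path; it is not an mlx-audio function. Writing to a temporary name and renaming afterwards means a crash mid-render never leaves a half-written file that looks finished.

```python
from pathlib import Path

def render_book(pieces, out_dir, render_piece):
    """Render each piece to out_dir/NNNNN.wav, skipping files that already exist.

    render_piece(text, path) is a hypothetical callable that writes one WAV.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, piece in enumerate(pieces):
        final = out_dir / f"{index:05d}.wav"
        if not final.exists():
            tmp = final.with_name(final.name + ".part")
            render_piece(piece, tmp)   # may take a while; safe to interrupt
            tmp.replace(final)         # rename only once the file is complete
        paths.append(final)
    return paths
```

One caveat with index-based names: if you change how the text is split, the cached files no longer line up with the pieces, so clear the cache directory before re-rendering.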

Step 4: find the bad chunks without listening

You do not want to scrub three hours hunting for glitches. Let the numbers find them. A clean chunk’s audio length is roughly proportional to its character count, so flag the outliers:

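# A chunk whose audio runs far longer than its text warrants has run away.
import subprocess

def ran_away(wav, text, sec_per_char=0.14, slack=1.8):
    """Return True if the WAV at `wav` is far longer than `text` warrants."""
    dur = float(subprocess.check_output(
        ["ffprobe", "-v", "error", "-show_entries", "format=duration",
         "-of", "csv=p=0", wav]))
    return dur > len(text) * sec_per_char * slack

# flagged = [wav for wav, text in chunks if ran_away(wav, text)]  # chunks: your (wav, text) pairs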

Re-render the flagged chunks with a different seed rather than cutting the audio to length: a hard cut lands mid-word, while a fresh seed usually comes back clean and in the same voice. Only ever trim at a silent point. (One caveat the scan cannot see: subtle mispronunciations slip through, since they do not change the duration. Listen to a few samples too.)
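The retry can be automated along these lines. This is a sketch under stated assumptions: render(text, seed) and duration_of(wav) are hypothetical callables you would supply (the first seeding the generator and rendering one chunk, the second returning its length in seconds), and the 0.14 and 1.8 factors are the same ones used in the duration scan.

```python
def rerender_until_clean(text, render, duration_of, seeds,
                         sec_per_char=0.14, slack=1.8):
    """Try each seed in turn; return (seed, wav) for the first in-bounds render.

    render and duration_of are hypothetical callables supplied by the caller.
    Returns None if every seed still produces an over-long chunk.
    """
    limit = len(text) * sec_per_char * slack
    for seed in seeds:
        wav = render(text, seed)
        if duration_of(wav) <= limit:
            return seed, wav
    return None
```

A None result is worth treating as a signal that the piece itself is the problem, for example because it is too long, rather than as a reason to keep drawing seeds.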

Step 5: make it sound like one recording

The runaway garble shows up inside a single chapter. Three quieter problems only surface once you stitch a whole book together, and every one of them is a level problem, not a model problem.

The volume drifts between chunks. Each small piece is a separate generation, and the model’s output loudness varies from one call to the next. Rendered back to back, the seams are audible: one sentence sits louder than the one before it. Level each piece to a common loudness before you stitch, using the RMS of its voiced samples so the pauses do not skew it:

import numpy as np
def level(a, target_rms=0.105, peak=0.9):
    v = a[np.abs(a) > 0.006]                    # voiced samples only
    if len(v) < 100: return a
    a = a * (target_rms / (np.sqrt(np.mean(v.astype(np.float64)**2)) + 1e-9))
    p = float(np.max(np.abs(a)))
    return (a * (peak / p) if p > peak else a).astype(np.float32)

The first words of some chapters blast. This one took me a while to trace. I ran a single loudness pass over the finished book, and every few chapters the opening words came in loud enough to make you flinch. The cause is single-pass loudnorm: it adjusts gain on the fly without seeing what is coming, so during the silence between chapters its automatic gain rides up, and the next chapter’s first words land at that inflated level. The documented symptom is exactly “raised volume in quiet parts.” The fix is to run it in two passes: measure the whole file, then apply one constant gain, with a brickwall limiter behind it so no stray transient crosses the ceiling:

# pass 1: measure
ffmpeg -i book.wav -af loudnorm=I=-19:TP=-2:LRA=11:print_format=json -f null -
# pass 2: apply the measured numbers with linear=true (one gain, no riding)
ffmpeg -i book.wav -af "loudnorm=I=-19:TP=-2:LRA=11:linear=true:measured_I=...:measured_TP=...:measured_LRA=...:measured_thresh=...:offset=...,alimiter=limit=0.9" out.wav
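Copying five measured numbers by hand is error-prone, so the hand-off between the passes can be scripted. The sketch below assumes you have captured ffmpeg's stderr from pass 1 as a string; the function names are hypothetical, while the JSON keys (input_i, input_tp, input_lra, input_thresh, target_offset) are the ones loudnorm prints with print_format=json.

```python
import json
import re

def parse_loudnorm(stderr_text):
    """Return the stats dict from the JSON block loudnorm prints after pass 1."""
    for block in reversed(re.findall(r"\{[^{}]*\}", stderr_text)):
        try:
            stats = json.loads(block)
        except json.JSONDecodeError:
            continue                    # some other brace pair in the log
        if "input_i" in stats:
            return stats
    raise ValueError("no loudnorm JSON block found")

def second_pass_filter(stats, target_i=-19, target_tp=-2, target_lra=11):
    """Build the pass-2 -af string from the pass-1 measurements."""
    return (
        f"loudnorm=I={target_i}:TP={target_tp}:LRA={target_lra}:linear=true"
        f":measured_I={stats['input_i']}"
        f":measured_TP={stats['input_tp']}"
        f":measured_LRA={stats['input_lra']}"
        f":measured_thresh={stats['input_thresh']}"
        f":offset={stats['target_offset']}"
        ",alimiter=limit=0.9"
    )
```

To capture the log, run pass 1 through subprocess.run with stderr=subprocess.PIPE and text=True, then feed its stderr to parse_loudnorm. If you also request print_format=json on pass 2, the normalization_type field in its output reports whether the linear mode was actually used.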

The joins click. Butt-splicing two clips can pop. A twelve-millisecond fade on each end of every piece, plus trimming the near-silence off the ends so the gaps stay uniform, takes it out.
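A minimal sketch of that trim-and-fade step with NumPy, assuming each piece is a mono float array at 24 kHz. The function name is hypothetical, the fade is a plain linear ramp, and the 0.006 silence threshold is borrowed from the leveling function above rather than tuned separately.

```python
import numpy as np

def trim_and_fade(a, sr=24000, fade_ms=12, silence=0.006):
    """Trim near-silence from both ends, then apply a short linear fade in and out."""
    voiced = np.flatnonzero(np.abs(a) > silence)
    if voiced.size == 0:
        return a[:0].astype(np.float32)                   # nothing but silence
    a = a[voiced[0]:voiced[-1] + 1].astype(np.float32)    # astype copies the slice
    n = min(int(sr * fade_ms / 1000), len(a) // 2)        # 288 samples at 24 kHz
    if n > 0:
        ramp = np.linspace(0.0, 1.0, n, dtype=np.float32)
        a[:n] *= ramp                                     # fade in
        a[-n:] *= ramp[::-1]                              # fade out
    return a
```

With every piece trimmed to its voiced extent, inserting a fixed stretch of zeros between pieces when concatenating gives the uniform gaps, and the fades guarantee each clip starts and ends at zero so the joins cannot pop.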

None of this is exotic audio work. It is the gap between a stack of clips and something that sounds recorded in one sitting, and you cannot hear the need for it until the book is whole.

Step 6: assemble the M4B

ffmpeg stitches the per-chapter audio into a chaptered M4B with a cover. Write an FFMETADATA file with [CHAPTER] blocks (start/end in milliseconds and a title), then:

ffmpeg -f concat -safe 0 -i chapters.txt -i chapters.ffmeta -i cover.png \
  -map 0:a -map 2:v -disposition:v:0 attached_pic -c:v copy \
  -map_metadata 1 -map_chapters 1 \
  -af "loudnorm=I=-19:TP=-2:LRA=11:linear=true:measured_I=...,alimiter=limit=0.9" \
  -c:a aac -b:a 64k -movflags +faststart book.m4b

The -af here is the second loudnorm pass from step 5, folded into the same encode. Two practical notes: open with a spoken intro chapter (title, subtitle, author, and the epigraph) rather than cold on chapter one, and keep the file under your host’s size limit. GitHub blocks any file over 100 MiB (and warns at 50), so 64k mono AAC keeps even a 3.3-hour book near 95 MB, under the ceiling with room to spare.
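The FFMETADATA chapter file is tedious to write by hand, so here is one way to generate it from per-chapter durations. It is a sketch: the function names are hypothetical and it assumes you already know each chapter's length in milliseconds (for example from ffprobe). The file layout itself, a ;FFMETADATA1 header followed by [CHAPTER] blocks with TIMEBASE, START, END and title, is ffmpeg's metadata format, which also requires backslash-escaping of =, ;, #, backslash and newline in values.

```python
def escape(value):
    """Backslash-escape the characters FFMETADATA treats as special."""
    out = []
    for ch in value:
        if ch in "=;#\\\n":
            out.append("\\")
        out.append(ch)
    return "".join(out)

def ffmetadata(chapters):
    """chapters: list of (title, duration_ms) in playback order. Returns the file text."""
    lines = [";FFMETADATA1"]
    start = 0
    for title, duration_ms in chapters:
        end = start + int(duration_ms)
        lines += [
            "[CHAPTER]",
            "TIMEBASE=1/1000",          # START and END are in milliseconds
            f"START={start}",
            f"END={end}",
            f"title={escape(title)}",
        ]
        start = end                     # next chapter begins where this one ends
    return "\n".join(lines) + "\n"
```

Write the returned string to chapters.ffmeta and pass it to the ffmpeg command as the metadata input. The chapter order must match the order of files in chapters.txt, since the start times are just running sums of the durations.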

What to know before you commit

Stated plainly, because the demos skip it:

Cloning trades reliability for “it is your voice.” A clone built from a minute of audio is stable for a sentence and drifts over hours. A purpose-built preset voice does not run away like this. If you do not need your voice, a preset is the safer choice.
Pronunciation control is thin, and trying to add it can backfire. Plain text in, audio out; there is no reliable phoneme or stress markup. The Sanskrit read well from plain transliteration, which surprised me, so I tried to improve it and made it worse. A full phonetic respelling map, hyphenated syllable by syllable, padded the words out and came back choppy; the model’s native reading was better. Leave it alone unless it genuinely mangles a specific word, and only then respell that one and check it by ear.
It is slow and unattended. Eleven hours for a book. Plan it as an overnight job.
Emphasis is the model’s call. Where a human narrator would stress a word, the clone decides. You are narrating with a capable reader, not a director.

The piece I would not skip

Store the voice as a small, fixed profile: the reference WAV, its transcript, and the render settings (temperature, repetition penalty, seed) in one place. That turns a one-off render into a reusable narrator, the same voice for the next book, no re-tuning. The voice stops being a render artifact and becomes something you own.
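One way to keep that profile is a small JSON file next to the reference clip. The class below is a sketch, not an mlx-audio feature: the name VoiceProfile and its field names are my own, and the defaults are simply the values used in this build.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path

@dataclass(frozen=True)
class VoiceProfile:
    """Everything needed to reproduce the voice: reference, transcript, settings."""
    ref_audio: str                      # path to the cleaned reference WAV
    ref_text: str                       # its verbatim transcript
    temperature: float = 0.3
    repetition_penalty: float = 1.3
    seed: int = 20260628
    model: str = "mlx-community/Qwen3-TTS-12Hz-0.6B-Base-bf16"

    def save(self, path):
        Path(path).write_text(json.dumps(asdict(self), indent=2), encoding="utf-8")

    @classmethod
    def load(cls, path):
        return cls(**json.loads(Path(path).read_text(encoding="utf-8")))
```

A render script can then load the profile once and pass its fields into every generate call, so the next book starts from the same voice and the same settings without anyone having to remember them.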

The technology is genuinely here: a believable clone of your own voice, free and offline, on hardware you already have. What is not here is the polish, and the gap between a usable audiobook and three hours of garble is a temperature value and a chunk size. Now you know which.