I narrated my book in my own voice without recording a single chapter. The whole thing runs locally on a Mac, costs nothing, and sounds like me. This is the build, step by step, including the failure mode nobody warns you about and the one-line setting that fixes it. You can hear the result on The One Who Notices.

Why local

The options for an audiobook in your own voice are a recording studio (time and money) or a cloud voice service (a subscription, and your manuscript sitting on someone else’s server). I wanted neither. What changed in 2026 is that voice cloning runs entirely on an M-series Mac: no API key, no per-character billing, no upload. The book never leaves the laptop.

You need: an Apple-Silicon Mac, about a minute of clean audio of your own voice, ffmpeg, and a couple of hours of setup. The render itself is long but unattended.

Step 1: the local TTS stack

Two pieces. mlx-audio runs speech models on MLX, Apple’s machine-learning framework for Apple Silicon, which puts the work on the Mac GPU. Qwen3-TTS is the model; its small “Base” variant does zero-shot voice cloning.

python3 -m venv ~/.venvs/mlx-audio
~/.venvs/mlx-audio/bin/pip install mlx-audio

The model downloads on first use (mlx-community/Qwen3-TTS-12Hz-0.6B-Base-bf16).

Step 2: record and clean a reference

The clone is zero-shot, which means there is no training step and no model file to keep. The voice is just a short reference clip plus its exact transcript; feed those on every generation and the model reconstructs your voice each time.

Record about a minute of yourself reading one passage, at the pace you want the book to have. The Base clone does not take a “read slowly” instruction; it imitates the delivery of the reference. If you read the sample fast and clipped, the whole book is fast and clipped.

The recording quality is the ceiling: a noisy room clones a noisy voice. So clean it:

ffmpeg -i voice.m4a -ac 1 -ar 24000 \
  -af "highpass=f=75,afftdn=nf=-28,loudnorm=I=-18:TP=-2:LRA=11" reference.wav

Mono, 24 kHz, a high-pass to drop rumble, light denoise, loudness normalization. Keep it gentle: aggressive noise reduction leaves artifacts the model will faithfully clone. Save the transcript verbatim in reference.txt; a mismatch between audio and transcript degrades the clone.
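
Before rendering anything, it can be worth confirming that the cleaned file really is mono, 24 kHz, and about the right length. The following is a minimal sketch using only Python's standard `wave` module; the function name `check_reference` and the 30 to 90 second bounds are assumptions for illustration, not part of the original workflow. It relies on ffmpeg writing plain PCM for a `.wav` output, which is its default.

```python
import wave

def check_reference(path, want_rate=24000, min_secs=30.0, max_secs=90.0):
    """Return a list of problems found in a reference WAV (empty list = looks fine)."""
    problems = []
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        rate = wav.getframerate()
        seconds = wav.getnframes() / rate
    if channels != 1:
        problems.append(f"expected mono, found {channels} channels")
    if rate != want_rate:
        problems.append(f"expected {want_rate} Hz, found {rate} Hz")
    if not min_secs <= seconds <= max_secs:
        problems.append(
            f"duration {seconds:.1f}s is outside {min_secs:.0f}-{max_secs:.0f}s")
    return problems

if __name__ == "__main__":
    import sys
    if len(sys.argv) > 1:
        for problem in check_reference(sys.argv[1]):
            print("warning:", problem)
```

This checks only the container facts. It cannot tell whether the transcript matches the audio, so that still has to be verified by reading along.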

Step 3: render in your voice (and the setting that matters)

Generation is one call per chunk of text:

import mlx.core as mx
from mlx_audio.tts.utils import load_model

mx.random.seed(20260628)                 # reproducibility (the API has no seed arg)
model = load_model("mlx-community/Qwen3-TTS-12Hz-0.6B-Base-bf16")

result = list(model.generate(
    text=piece,                          # a SMALL piece, see below
    ref_audio="reference.wav",
    ref_text=open("reference.txt").read(),
    temperature=0.3,                     # NOT the default 0.9
    repetition_penalty=1.3,
    max_tokens=cap,                      # capped near the piece's expected length
))

Four settings carry the whole thing, and they are the difference between a clean book and three hours of intermittent garble:

  1. Feed small pieces. Split your text so nothing over ~300 characters goes in one call. Long inputs give the model room to lose the thread.
  2. temperature=0.3, not the default 0.9. High randomness is fine for a sentence and a slow disaster over a long passage.
  3. repetition_penalty=1.3. An explicit penalty on repeating tokens kills loops directly.
  4. Cap max_tokens near the expected length (roughly chars * 0.14 * 12.5 frames, plus a margin). Even a misbehaving piece then cannot run away.
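
The first and last items can be sketched in a few lines of Python. This is one possible splitter, not necessarily the one used for the book: it breaks on sentence ends, hard-wraps any single sentence longer than the limit, and computes a token cap from the formula above. The names `split_text` and `token_cap` and the 1.5x margin are assumptions.

```python
import math
import re

FRAMES_PER_SECOND = 12.5   # frame rate used in the formula above
SECONDS_PER_CHAR = 0.14    # rough speaking-rate estimate from the formula above

def split_text(text, limit=300):
    """Split text into pieces of at most `limit` characters, preferring sentence ends."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    pieces, current = [], ""
    for sentence in sentences:
        # Hard-wrap a single sentence that is itself longer than the limit.
        while len(sentence) > limit:
            cut = sentence.rfind(" ", 0, limit + 1)
            if cut <= 0:
                cut = limit            # no space to break on: cut mid-word
            if current:
                pieces.append(current)
                current = ""
            pieces.append(sentence[:cut].strip())
            sentence = sentence[cut:].strip()
        if not sentence:
            continue
        if current and len(current) + 1 + len(sentence) > limit:
            pieces.append(current)     # adding this sentence would overflow
            current = sentence
        else:
            current = f"{current} {sentence}" if current else sentence
    if current:
        pieces.append(current)
    return pieces

def token_cap(piece, margin=1.5):
    """Upper bound on generated frames for a piece; the margin is an assumed value."""
    return math.ceil(len(piece) * SECONDS_PER_CHAR * FRAMES_PER_SECOND * margin)
```

Each piece then goes into the generate call as `text=piece` with `max_tokens=token_cap(piece)`. Splitting on sentence ends rather than at a fixed character count keeps the pauses at piece boundaries where a listener expects them.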

Here is what happens if you skip these. I first rendered the whole book on the defaults, three and a half hours of audio, mostly clean, with stretches of looping garble scattered through it. The garbled chunks all clocked in at exactly 320 seconds. That number is the tell: my token limit was 4096, which at the model’s frame rate is 320 seconds. The model was running away, generating until it hit the ceiling, looping the whole time. Long inputs (one 4,000-character glossary paragraph) and temperature 0.9 were the cause. Small pieces, low temperature, and a repetition penalty fixed it: a 467-character chunk that had been 320 seconds of nonsense came back as 33 seconds of clean narration.

So render piece by piece, cache each one to disk, and skip what already exists. Rendering runs roughly three to four times slower than real time on an M-series Mac, so a 3.5-hour book takes about eleven hours. Make the job resumable so an interruption at hour ten does not mean starting over.
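
A resumable loop needs very little code. The sketch below assumes a hypothetical callable `render_piece(text, path)` that wraps the generate call and writes the audio to `path`; how the audio gets written depends on your mlx-audio version, so that part is left out. Each piece is written under a temporary name and renamed only once complete, so a crash mid-write never leaves a half-finished file that looks done.

```python
from pathlib import Path

def render_all(pieces, out_dir, render_piece):
    """Render each piece to out_dir/NNNNN.wav, skipping files that already exist.

    `render_piece(text, path)` is a hypothetical callable supplied by the
    caller; it must write the finished audio for `text` to `path`.
    Returns the number of pieces rendered in this run.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    rendered = 0
    for index, text in enumerate(pieces):
        target = out / f"{index:05d}.wav"
        if target.exists():
            continue                          # finished in an earlier run
        partial = target.with_name("tmp-" + target.name)
        render_piece(text, partial)           # write under a temporary name
        partial.replace(target)               # rename only after a full write
        rendered += 1
    return rendered
```

Files are keyed by position, so re-splitting the text after a partial run invalidates the cache. Freeze the list of pieces before starting the long render.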

Step 4: find the bad chunks without listening

You do not want to scrub three hours hunting for glitches. Let the numbers find them. A clean chunk’s audio length is roughly proportional to its character count, so flag the outliers:

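# a chunk whose duration is far longer than its text warrants ran away
import subprocess

def ran_away(wav, text, secs_per_char=0.14, slack=1.8):
    """True if the rendered chunk at path `wav` is far too long for `text`."""
    out = subprocess.check_output(
        ["ffprobe", "-v", "error", "-show_entries", "format=duration",
         "-of", "csv=p=0", wav])
    return float(out) > len(text) * secs_per_char * slack

# call ran_away(wav, text) for each rendered chunk and collect the ones that return True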

Re-render the flagged chunks with the settings above. (One caveat the scan cannot see: subtle mispronunciations slip through, since they do not change the duration. Listen to a few samples too.)
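
A variation on the same idea is to measure each chunk against the book's own typical pace rather than a fixed 0.14 seconds per character. This is a sketch of that alternative, not the scan described above. It assumes most chunks are clean, so the median seconds-per-character is trustworthy, and that durations have already been collected. The name `flag_outliers` is hypothetical.

```python
from statistics import median

def flag_outliers(chunks, factor=1.8):
    """Return names of chunks whose seconds-per-character is far above the median.

    `chunks` is a list of (name, text, duration_seconds) tuples.
    """
    rates = [dur / max(len(text), 1) for _, text, dur in chunks]
    if not rates:
        return []
    typical = median(rates)
    return [name for (name, _, _), rate in zip(chunks, rates)
            if rate > typical * factor]
```

Because the threshold adapts to the reference voice's pace, a slow reader does not trip it and a fast reader does not hide short runaways beneath it.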

Step 5: assemble the M4B

ffmpeg stitches the per-chapter audio into a chaptered M4B with a cover. Write an FFMETADATA file with [CHAPTER] blocks (start/end in milliseconds and a title), then:

ffmpeg -f concat -safe 0 -i chapters.txt -i chapters.ffmeta -i cover.png \
  -map 0:a -map 2:v -disposition:v:0 attached_pic -c:v copy \
  -map_metadata 1 -map_chapters 1 -c:a aac -b:a 48k -movflags +faststart book.m4b
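
The FFMETADATA file is easy to generate once you know each chapter's duration. The following is a minimal sketch, assuming you have a list of (title, duration in seconds) pairs in play order, for instance measured with ffprobe on the per-chapter files. The function name `ffmetadata` is hypothetical. The `TIMEBASE=1/1000` line is what makes START and END count in milliseconds.

```python
def ffmetadata(chapters, book_title=None):
    """Build FFMETADATA text from (title, duration_seconds) pairs, in play order."""
    def esc(value):
        # The format requires '=', ';', '#', backslash and newline to be escaped.
        for ch in ("\\", "=", ";", "#", "\n"):
            value = value.replace(ch, "\\" + ch)
        return value

    lines = [";FFMETADATA1"]
    if book_title:
        lines.append(f"title={esc(book_title)}")
    start_ms = 0
    for title, seconds in chapters:
        end_ms = start_ms + round(seconds * 1000)
        lines += ["[CHAPTER]", "TIMEBASE=1/1000",
                  f"START={start_ms}", f"END={end_ms}", f"title={esc(title)}"]
        start_ms = end_ms                     # next chapter starts where this ends
    return "\n".join(lines) + "\n"
```

Write the returned string to chapters.ffmeta. Chapter marks computed from source durations can drift slightly from the encoded audio, so spot-check the last chapter boundary in a player.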

Two practical notes: open with a spoken intro chapter (title, subtitle, author, and the epigraph) rather than cold on chapter one, and keep the file under your host’s size limit. GitHub Pages rejects files over 100 MiB; 48k mono AAC keeps a 3-hour book around 65 MB, comfortably under it.
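
The size arithmetic is just bitrate times duration, so it is easy to check before encoding. This is a rough sketch that ignores container overhead and the cover image; the function names are hypothetical.

```python
def estimated_size_mb(hours, kbps=48):
    """Approximate audio payload in megabytes (10**6 bytes) at a constant bitrate."""
    return hours * 3600 * kbps * 1000 / 8 / 1_000_000

def max_hours_under(limit_mib=100, kbps=48):
    """Longest audio, in hours, whose payload stays under a limit given in MiB."""
    limit_bits = limit_mib * 1024 * 1024 * 8
    return limit_bits / (kbps * 1000) / 3600

if __name__ == "__main__":
    print(f"3 h at 48k: about {estimated_size_mb(3):.1f} MB")
    print(f"longest book under 100 MiB at 48k: about {max_hours_under():.1f} h")
```

If a longer book would cross the limit, lowering the bitrate or splitting the book into parts are the two obvious levers.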

What to know before you commit

Stated plainly, because the demos skip it:

  • Cloning trades reliability for “it is your voice.” A clone built from a minute of audio is stable for a sentence and drifts over hours. A purpose-built preset voice does not run away like this. If you do not need your voice, a preset is the safer choice.
  • Pronunciation control is thin. Plain text in, audio out; there is no reliable phoneme or stress markup. I got lucky that the Sanskrit read well from plain transliteration. For a word it mangles, the only lever is respelling it phonetically and checking by ear.
  • It is slow and unattended. Eleven hours for a book. Plan it as an overnight job.
  • Emphasis is the model’s call. Where a human narrator would stress a word, the clone decides. You are narrating with a capable reader, not a director.

The piece I would not skip

Store the voice as a small, fixed profile: the reference WAV, its transcript, and the render settings (temperature, repetition penalty, seed) in one place. That turns a one-off render into a reusable narrator, the same voice for the next book, no re-tuning. The voice stops being a render artifact and becomes something you own.
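
One way to keep such a profile is a small JSON file next to the reference audio. The sketch below is an assumed layout, not a format that mlx-audio defines: the class name `VoiceProfile` and its field names are hypothetical, and the defaults are the settings from Step 3.

```python
import json
from dataclasses import dataclass, asdict
from pathlib import Path

@dataclass(frozen=True)
class VoiceProfile:
    ref_audio: str                    # path to the cleaned reference WAV
    ref_text: str                     # verbatim transcript of that WAV
    temperature: float = 0.3
    repetition_penalty: float = 1.3
    seed: int = 20260628

    def save(self, path):
        """Write the profile as JSON."""
        Path(path).write_text(json.dumps(asdict(self), indent=2), encoding="utf-8")

    @classmethod
    def load(cls, path):
        """Read a profile previously written by save()."""
        return cls(**json.loads(Path(path).read_text(encoding="utf-8")))
```

Storing the transcript inside the profile, rather than in a separate file, removes one way for the audio and its text to fall out of sync between books.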

The technology is genuinely here: a believable clone of your own voice, free and offline, on hardware you already have. What is not here is the polish, and the gap between a usable audiobook and three hours of garble is a temperature value and a chunk size. Now you know which.
