How It Runs Local
whisper.cpp, Silero VAD, and the full local pipeline. Published 17 April 2026.
When you hold the hotkey and speak, six things happen in sequence: audio capture, voice activity detection, resampling, transcription, output sanitisation, and text injection. No network calls. No cloud. Everything runs in a single process on your machine. Here's how each stage works.
Murmur captures audio through cpal, a cross-platform Rust audio library. On Linux, cpal talks to PipeWire or PulseAudio, whichever your system provides. It opens the default input device and starts streaming audio frames the moment you press the hotkey.
The raw audio comes in at whatever sample rate your microphone provides, typically 44.1 kHz or 48 kHz, in whatever channel layout the device reports. Murmur collects these frames into a buffer in real time. The buffer grows as long as you hold the key or until auto-stop triggers.
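In cpal terms, the capture stage looks roughly like the sketch below. It assumes the device delivers f32 samples and uses a plain shared `Vec<f32>` as the buffer; this is an illustration of the pattern, not Murmur's actual code.

```rust
use cpal::traits::{DeviceTrait, HostTrait, StreamTrait};
use std::sync::{Arc, Mutex};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let host = cpal::default_host();
    let device = host.default_input_device().ok_or("no input device")?;
    // Use the device's native sample rate and channel layout.
    let config = device.default_input_config()?;

    let buffer: Arc<Mutex<Vec<f32>>> = Arc::new(Mutex::new(Vec::new()));
    let writer = Arc::clone(&buffer);

    // Append every incoming frame to the shared buffer.
    // (Assumes f32 samples; real code would match on config.sample_format().)
    let stream = device.build_input_stream(
        &config.config(),
        move |data: &[f32], _: &cpal::InputCallbackInfo| {
            writer.lock().unwrap().extend_from_slice(data);
        },
        |err| eprintln!("stream error: {err}"),
        None, // no timeout
    )?;

    stream.play()?; // frames start arriving on a background thread
    std::thread::sleep(std::time::Duration::from_secs(3)); // "holding the hotkey"
    drop(stream); // stop capture; `buffer` now holds the raw audio

    println!("captured {} samples", buffer.lock().unwrap().len());
    Ok(())
}
```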
Murmur uses Silero VAD (Voice Activity Detection), a small neural network that distinguishes speech from silence. It runs continuously on the incoming audio during recording.
This serves two purposes. In tap mode, it enables auto-stop: when Silero detects sustained silence after speech, Murmur stops recording automatically so you don't have to press the key again. In both modes, it trims trailing silence from the buffer before transcription. This matters because whisper.cpp has a known tendency to hallucinate words like "Thank you" or "Thanks for watching" when fed silence. Trimming the tail reduces these phantom outputs significantly.
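Once you have per-frame speech probabilities, the trailing-silence trim is plain buffer arithmetic. A sketch of the idea, where `speech_prob` stands in for a Silero VAD wrapper and the frame size and threshold are illustrative values, not Murmur's actual configuration:

```rust
/// Trim trailing silence from a mono buffer, given a per-frame
/// speech-probability function (e.g. a Silero VAD wrapper).
fn trim_trailing_silence(
    samples: &[f32],
    speech_prob: impl Fn(&[f32]) -> f32,
) -> &[f32] {
    const FRAME: usize = 512;   // Silero operates on short fixed-size frames
    const THRESHOLD: f32 = 0.5; // probability above which a frame counts as speech

    // Walk backwards frame by frame until we hit speech.
    let mut end = samples.len();
    while end >= FRAME {
        let frame = &samples[end - FRAME..end];
        if speech_prob(frame) >= THRESHOLD {
            break; // last speech frame found; keep everything up to here
        }
        end -= FRAME;
    }
    &samples[..end]
}
```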
Whisper expects 16 kHz mono PCM audio. Most microphones don't record at 16 kHz. Before transcription, Murmur resamples the captured audio: stereo is mixed down to mono, and the sample rate is converted to 16,000 Hz. This happens in memory, on the raw buffer. The resampled audio is passed directly to the whisper.cpp context.
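The downmix and rate conversion are a few lines of buffer math. Here's a simplified sketch using linear interpolation; production resamplers apply proper filtering to avoid aliasing, and Murmur's implementation may differ:

```rust
const WHISPER_SAMPLE_RATE: f64 = 16_000.0;

/// Mix interleaved multi-channel samples down to mono by averaging channels.
fn downmix(samples: &[f32], channels: usize) -> Vec<f32> {
    samples
        .chunks_exact(channels)
        .map(|frame| frame.iter().sum::<f32>() / channels as f32)
        .collect()
}

/// Naive linear-interpolation resample from `from_rate` to 16 kHz.
fn resample_to_16k(mono: &[f32], from_rate: f64) -> Vec<f32> {
    let ratio = from_rate / WHISPER_SAMPLE_RATE;
    let out_len = (mono.len() as f64 / ratio) as usize;
    (0..out_len)
        .map(|i| {
            let pos = i as f64 * ratio;
            let idx = pos as usize;
            let frac = (pos - idx as f64) as f32;
            let a = mono[idx];
            let b = mono[(idx + 1).min(mono.len() - 1)];
            a + (b - a) * frac // interpolate between neighbouring samples
        })
        .collect()
}
```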
Transcription runs through whisper.cpp, accessed via the whisper-rs 0.16 Rust bindings. whisper.cpp is a C/C++ port of OpenAI's Whisper model, optimised for CPU inference. It runs the full encoder-decoder transformer locally, with no server component.
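In whisper-rs terms, a transcription call looks roughly like this. It's a sketch against the 0.1x state-based API; exact signatures vary between whisper-rs versions, and Murmur's actual decoding parameters may differ:

```rust
use whisper_rs::{FullParams, SamplingStrategy, WhisperContext, WhisperContextParameters};

fn transcribe(
    model_path: &str,
    audio_16k_mono: &[f32],
) -> Result<String, Box<dyn std::error::Error>> {
    // Load the GGML model weights from disk.
    let ctx = WhisperContext::new_with_params(model_path, WhisperContextParameters::default())?;
    let mut state = ctx.create_state()?;

    // Greedy decoding, English-only models.
    let mut params = FullParams::new(SamplingStrategy::Greedy { best_of: 1 });
    params.set_language(Some("en"));

    // Run the full encoder-decoder pass over the 16 kHz mono buffer.
    state.full(params, audio_16k_mono)?;

    // Concatenate the decoded segments into one string.
    let mut text = String::new();
    for i in 0..state.full_n_segments()? {
        text.push_str(&state.full_get_segment_text(i)?);
    }
    Ok(text)
}
```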
Three models are available:
- tiny.en (75 MB): ~3-4 seconds for 10 seconds of speech. Good enough for casual dictation. This is the default.
- base.en (142 MB): ~8-10 seconds. Better accuracy, especially for less common words. A good middle ground.
- small.en (466 MB): ~20-30 seconds on CPU. The most accurate local option. Best for longer dictation sessions where precision matters.
Speed estimates are for CPU inference on a modern x86 processor. When Vulkan GPU acceleration is available (via whisper-rs's Vulkan backend), all models run significantly faster. Murmur detects Vulkan support at startup and uses it automatically. If Vulkan isn't available, it falls back to CPU without any user intervention.
Models download from Hugging Face on first use and are SHA256-verified. After that, they're cached in ~/.local/share/com.murmurlinux.murmur/models/ and loaded from disk on each launch.
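The verification step is the standard digest-and-compare pattern. A sketch using the sha2 and hex crates; the expected hash would come from a table of known model checksums:

```rust
use sha2::{Digest, Sha256};
use std::{fs, path::Path};

/// Return true if the file's SHA-256 digest matches the expected hex string.
fn verify_model(path: &Path, expected_hex: &str) -> std::io::Result<bool> {
    let bytes = fs::read(path)?;
    let digest = Sha256::digest(&bytes);
    // A mismatch means a corrupt or tampered download.
    Ok(hex::encode(digest) == expected_hex.to_lowercase())
}
```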
whisper.cpp sometimes returns artefacts: leading/trailing whitespace, repeated punctuation, or hallucinated tokens from silence at the edges of the audio. Murmur runs a sanitisation pass on the raw output before injecting it. This is straightforward string processing, not an LLM. It trims whitespace, collapses redundant punctuation, and removes known hallucination patterns. The goal is to give you clean text without altering what you actually said.
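A sketch of that kind of cleanup follows. The hallucination list here is illustrative; the real list lives in Murmur's source.

```rust
/// Clean up raw whisper.cpp output before injection.
fn sanitise(raw: &str) -> String {
    // Phrases whisper.cpp is known to hallucinate from silence
    // (illustrative examples only).
    const HALLUCINATIONS: &[&str] = &["Thank you.", "Thanks for watching!"];

    let text = raw.trim();

    // If the entire output is a known hallucination, discard it.
    if HALLUCINATIONS.contains(&text) {
        return String::new();
    }

    // Collapse runs of repeated punctuation ("..", "!!") to a single mark.
    let mut out = String::with_capacity(text.len());
    let mut prev = '\0';
    for c in text.chars() {
        if c.is_ascii_punctuation() && c == prev {
            continue;
        }
        out.push(c);
        prev = c;
    }
    out
}
```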
The final step: getting the transcribed text into whatever application you're using. Murmur supports two injection backends depending on your display server.
On X11, it uses xdotool to simulate keystrokes. xdotool types each character into the currently focused window, as if you had pressed the keys yourself. This works in any application that accepts keyboard input.
On Wayland, it uses wtype, which does the same thing through Wayland's input protocols.
If neither tool is installed, Murmur falls back to copying the text to your clipboard. You paste with Ctrl+V. Functional, just less seamless.
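Shelling out to those tools is straightforward. A simplified sketch of the backend selection; real code would detect which tools are actually installed rather than keying off an environment variable, and the clipboard fallback here is a placeholder:

```rust
use std::process::Command;

/// Inject `text` at the cursor: wtype on Wayland, xdotool on X11,
/// clipboard as a last resort.
fn inject(text: &str) -> std::io::Result<()> {
    let on_wayland = std::env::var("WAYLAND_DISPLAY").is_ok();

    let status = if on_wayland {
        // wtype sends the text through Wayland's input protocols.
        Command::new("wtype").arg(text).status()
    } else {
        // xdotool types the text into the focused X11 window.
        Command::new("xdotool")
            .args(["type", "--clearmodifiers", "--", text])
            .status()
    };

    match status {
        Ok(s) if s.success() => Ok(()),
        // Tool missing or failed: fall back to the clipboard,
        // after which the user pastes with Ctrl+V.
        _ => copy_to_clipboard(text),
    }
}

fn copy_to_clipboard(_text: &str) -> std::io::Result<()> {
    // Placeholder: e.g. wl-copy / xclip, or a clipboard crate.
    Ok(())
}
```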
The installed binary is ~15 MB. RAM usage at idle is ~50 MB. During transcription, memory usage rises based on the model size (the model weights are loaded into memory), then drops back down.
This is possible because the heavy lifting is done by whisper.cpp, which is C/C++ code compiled and linked statically. The Tauri 2 shell uses the system's WebView rather than bundling a browser engine. SolidJS compiles to minimal vanilla JS. There's no Electron, no Chromium, no 200 MB runtime. The result is a lightweight app that behaves like a native one.
To put it all together:
1. You press the hotkey. Murmur opens the default audio input via cpal.
2. Audio frames stream into a buffer. Silero VAD monitors for speech and silence.
3. You release the hotkey (or auto-stop triggers after sustained silence).
4. The buffer is resampled to 16 kHz mono PCM.
5. whisper.cpp transcribes the audio using the selected model.
6. The raw output is sanitised.
7. The cleaned text is injected at your cursor via xdotool, wtype, or clipboard.
Start to finish, the entire pipeline runs in a single process on your machine. No network calls, no cloud, no telemetry. The audio exists in memory for the duration of the transcription and is then discarded. You can verify all of this in the source code.