PocketWebTools · 100% local · no upload

Audio Transcription

Transcribe audio, video, or live recordings into timestamped text and subtitles. Runs entirely in your browser: no upload, no signup.

Model

One-time download. Cached in your browser so subsequent runs are instant.

Your audio never leaves your device. The Whisper model runs in your browser using WebGPU (or WebAssembly as a fallback). No upload, no signup, unlimited use, even on hour-long files.

Free audio transcription that runs entirely in your browser

PocketWebTools' audio transcriber turns spoken audio or video into clean, timestamped text using OpenAI's Whisper model. There is no upload, no signup, no per-minute fee, and no usage cap. Every second of audio is processed on your device, powered by WebGPU where available and WebAssembly everywhere else. If you have ever balked at SaaS transcription paywalls or refused to send a sensitive recording to a third-party API, this is the version that just works.

The same engine drives both privacy and price. Because the model lives in your browser, we have no inference bill to recover, no logs to keep, and nothing to upsell. Drop a podcast clip, a meeting recording, a lecture, an interview, a voice memo, an unreleased track, a press call. None of it leaves the device.

How to use it

  1. Pick a mode at the top: Upload audio for files and recordings, or Live dictation for real-time speech-to-text as you talk. File mode accepts MP3, WAV, M4A, FLAC, OPUS, OGG, MP4, MOV, and WebM via drop, file picker, or microphone capture. Live mode opens the mic and streams transcripts segment-by-segment.
  2. Optionally set the source language and choose a model. Whisper Base is the default (smaller and faster); switch to Whisper Turbo for noisier audio or non-English content.
  3. Choose Transcribe for same-language output or Translate to English if the source is in a different language.
  4. Hit Transcribe. On the first run, the model downloads (200 MB to 560 MB, depending on the model) and is cached in your browser. After that it starts instantly.
  5. Review the segmented transcript, click timestamps to jump to that part of the audio, and download as plain text, SRT, VTT, or JSON.

What it does

Whisper is a sequence-to-sequence speech model trained on 680,000 hours of multilingual and multitask audio. It does three jobs in one model: speech detection, transcription, and speech-to-English translation. We run it through transformers.js's automatic-speech-recognition pipeline, which handles long-form audio with a 30-second sliding window so you can drop a 90-minute interview and still get a coherent transcript.
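That sliding-window strategy can be sketched in a few lines of JavaScript. This is an illustrative simplification: `chunkAudio`, the 30-second window, and the 5-second overlap are assumptions for the sketch, and the real transformers.js pipeline also merges the overlapping token sequences it decodes from each window.

```javascript
// Split a mono PCM buffer into overlapping ~30 s windows, the same basic
// strategy the ASR pipeline uses internally for long-form audio.
// Window and stride values here are illustrative assumptions.
function chunkAudio(samples, sampleRate, windowSec = 30, strideSec = 5) {
  const windowLen = windowSec * sampleRate;
  const step = (windowSec - strideSec) * sampleRate; // advance per chunk
  const chunks = [];
  for (let start = 0; start < samples.length; start += step) {
    chunks.push({
      start: start / sampleRate,                          // chunk offset in seconds
      samples: samples.subarray(start, start + windowLen), // view, no copy
    });
    if (start + windowLen >= samples.length) break;       // last window reached the end
  }
  return chunks;
}

// A 10-minute recording at 16 kHz becomes 24 overlapping ~30 s windows.
const tenMinutes = new Float32Array(10 * 60 * 16000);
const chunks = chunkAudio(tenMinutes, 16000); // → 24 windows
```

Each window is transcribed independently and the overlap lets the decoder stitch the boundaries back together, which is why an hour-long file still yields one coherent transcript.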

  • Two model tiers: Whisper Base for fast everyday transcription, Whisper Turbo for higher accuracy on difficult audio.
  • 99 languages out of the box. Auto-detect by default; lock a language for slightly better accuracy if you know the source.
  • Translate-to-English mode for foreign-language podcasts, lectures, interviews, voicemails. Whisper produces an English transcript directly from the audio.
  • Per-segment timestamps with click-to-seek playback so you can verify any uncertain passage in seconds.
  • Subtitle-ready output: drop the .srt or .vtt file into YouTube Studio, Vimeo, Premiere, Final Cut, or any video player and your captions just work.

Common use cases

  • Podcasters and journalists who need transcripts for show notes, articles, or accessibility without sending interviews to a cloud API.
  • YouTubers and video editors generating .srt or .vtt subtitle files for uploads or burn-in captions.
  • Students and researchers transcribing lectures or interview tapes for note-taking or citation.
  • Anyone with sensitive recordings (legal, medical, HR, or unreleased creative work) who cannot use a cloud transcription service for compliance or confidentiality reasons.
  • Multilingual users translating foreign-language podcasts or video to English without subscribing to a separate translator tool.
  • Voice-memo users turning quick voice notes into searchable text, captured directly from the file your phone or laptop produced.

Why local AI matters

Cloud transcription services upload your audio to a server, run the model there, and send the text back. That is fine for casual use, but it means the recording touches their infrastructure (and usually their logs), they need a paid quota to recover GPU costs, and your transcript stalls the moment your network does. Running the model in your browser flips all three:

  • Privacy. The audio never leaves your device. No servers, no logs, no retention.
  • Cost. No inference bill for us means no subscription for you. Free, unlimited.
  • Offline. Once the model is cached, the tool works on a plane, in a remote office, or any time your network is flaky.

Frequently asked questions

Does my audio get uploaded anywhere?
No. The audio file stays in your browser the entire time. We run OpenAI's Whisper neural network locally with WebGPU (or WebAssembly on devices without WebGPU), so there is no server in the loop, nothing to log, and nothing for us to retain.
Why does it ask to download a model the first time?
Whisper has to run on your device for the privacy guarantee to mean anything. The first time you hit Transcribe, your browser downloads the model and caches it. Subsequent runs use the cached copy and start instantly. You can review or delete cached models at any time using the 'Local models' chip in the page header.
How big is the download?
Whisper Base is around 200 MB on WebGPU (about 80 MB on the WebAssembly fallback) and is the default. Whisper Turbo is around 560 MB and produces noticeably better results on accented English, non-English audio, and noisy recordings. Pick the one that fits your bandwidth and quality needs; both are cached after the first run.
Which languages are supported?
Both models are multilingual. Whisper recognizes 99 languages including English, Arabic, Spanish, French, German, Hindi, Chinese, Japanese, Korean, Portuguese, Russian, Indonesian, and many more. Leave the language picker on Auto-detect to let Whisper figure it out, or pick the source language to skip detection.
Can it translate to English?
Yes. Set the Task switch to 'Translate to English' and Whisper will produce an English transcript even when the audio is in another language. This works for all 99 supported languages and is genuinely useful for podcasts, lectures, and interviews you want to read rather than listen to.
What output formats do you support?
Plain text, SubRip subtitles (.srt), WebVTT subtitles (.vtt), and JSON with timestamps. SRT and VTT are the standard subtitle formats for YouTube, Vimeo, video editors, and TVs. JSON gives you the raw timestamp + text array if you want to post-process the transcript.
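The subtitle formats differ mostly in framing; SRT, for instance, is just numbered cues with `HH:MM:SS,mmm` time ranges. A minimal sketch of producing SRT from timestamped segments (the `{start, end, text}` segment shape and both function names are assumptions for illustration, not the tool's actual export code):

```javascript
// Format seconds as an SRT timestamp: HH:MM:SS,mmm (VTT uses "." instead of ",").
function srtTime(seconds) {
  const total = Math.round(seconds * 1000); // whole milliseconds
  const ms = total % 1000;
  const s = Math.floor(total / 1000) % 60;
  const m = Math.floor(total / 60000) % 60;
  const h = Math.floor(total / 3600000);
  const pad = (n, w = 2) => String(n).padStart(w, "0");
  return `${pad(h)}:${pad(m)}:${pad(s)},${pad(ms, 3)}`;
}

// Turn timestamped segments into an SRT document: numbered cues,
// a time range per cue, then the text, separated by blank lines.
function toSrt(segments) {
  return segments
    .map((seg, i) => `${i + 1}\n${srtTime(seg.start)} --> ${srtTime(seg.end)}\n${seg.text}`)
    .join("\n\n") + "\n";
}

const srt = toSrt([
  { start: 0, end: 2.5, text: "Hello and welcome." },
  { start: 2.5, end: 5.04, text: "Today we talk about local AI." },
]);
// srt begins:
// 1
// 00:00:00,000 --> 00:00:02,500
// Hello and welcome.
```

Swapping the comma for a dot and prepending a `WEBVTT` header is essentially all it takes to emit VTT instead.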
How long can the audio be?
The tool accepts files up to 500 MB. There is no hard duration cap. Whisper processes long audio in 30-second sliding windows internally, so a one-hour podcast typically transcribes in 2 to 6 minutes on WebGPU with Whisper Base.
Does this work on video files?
Yes. Drop an MP4, MOV, or WebM and we extract the audio track in the browser before transcribing. Useful for subtitling your own videos without ever uploading them to a third-party service.
Can I record straight from my microphone?
Yes. Click 'Record from microphone' under the drop zone and grant the permission prompt. We use the browser's MediaRecorder to capture audio locally, hand the resulting file to Whisper, and never touch the network. Click Stop when you're done and transcription runs on your recording.
Does it support real-time live transcription?
Yes. Switch to Live dictation at the top of the tool and click Start. Whisper Base runs in a background worker, a voice-activity detector segments your speech, and each segment's transcript appears the moment you pause. Works in 99 languages with the same translate-to-English option. WebGPU is required for live mode (Chrome, Edge, Brave, Arc on desktop; Safari 26+ on iOS).
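To illustrate what a voice-activity detector does, here is a minimal energy-based sketch. The tool's actual detector is more sophisticated; the frame size, RMS threshold, and silence hang-over below are illustrative assumptions, as is the function name.

```javascript
// Minimal energy-based voice-activity sketch: mark 20 ms frames whose RMS
// energy exceeds a threshold as speech, and close a segment once enough
// consecutive frames fall silent (~0.5 s at the defaults below).
function segmentSpeech(samples, sampleRate, { threshold = 0.02, hangFrames = 25 } = {}) {
  const frameLen = Math.round(sampleRate * 0.02); // 20 ms frames
  const segments = [];
  let segStart = null;
  let silent = 0;
  for (let i = 0; i + frameLen <= samples.length; i += frameLen) {
    let energy = 0;
    for (let j = i; j < i + frameLen; j++) energy += samples[j] * samples[j];
    const rms = Math.sqrt(energy / frameLen);
    if (rms >= threshold) {
      if (segStart === null) segStart = i / sampleRate; // speech begins
      silent = 0;
    } else if (segStart !== null && ++silent >= hangFrames) {
      // enough silence: close the segment and hand it to the transcriber
      segments.push({ start: segStart, end: (i + frameLen) / sampleRate });
      segStart = null;
      silent = 0;
    }
  }
  if (segStart !== null) segments.push({ start: segStart, end: samples.length / sampleRate });
  return segments;
}

// One second of tone, one of silence, one of tone (16 kHz):
const audio = new Float32Array(3 * 16000);
audio.fill(0.1, 0, 16000);
audio.fill(0.1, 2 * 16000);
segmentSpeech(audio, 16000);
// → [{ start: 0, end: 1.5 }, { start: 2, end: 3 }]
```

The hang-over is what makes the transcript appear "the moment you pause": the segment closes a fraction of a second after you stop talking, not at some fixed interval.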
How accurate is the transcript?
Whisper is one of the most accurate open-source speech models available. On clear English audio, error rates are usually in the 3-7% range. Accuracy drops on heavy accents, overlapping speakers, music-heavy audio, and noisy environments. Switch to Turbo if Base is missing words you can clearly hear.
Does this work offline?
Yes, after the first visit. Once the model is cached, your browser does not need a network connection to transcribe. The page itself also works offline if you have visited it before.
Why does it say WebGPU is faster?
WebGPU lets the model run on your graphics card instead of your CPU. For Whisper this is roughly 3 to 10 times faster. We auto-detect WebGPU support and fall back to WebAssembly when it is unavailable; the badge above the action button tells you which path you got.
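The detect-and-fall-back logic amounts to a capability check. A sketch under assumptions (the function name is made up, and the `nav` parameter is injected purely to keep the logic testable; a real page would pass the global `navigator`):

```javascript
// Pick the execution backend: WebGPU when the browser exposes navigator.gpu
// AND can hand back a working adapter, otherwise fall back to WebAssembly.
async function pickBackend(nav) {
  if (nav && nav.gpu) {
    const adapter = await nav.gpu.requestAdapter().catch(() => null);
    if (adapter) return "webgpu"; // GPU path: roughly 3-10x faster for Whisper
  }
  return "wasm";                  // CPU fallback: slower but universal
}

// With mocked navigator objects:
pickBackend({ gpu: { requestAdapter: async () => ({}) } }); // resolves to "webgpu"
pickBackend({});                                            // resolves to "wasm"
```

Checking for a real adapter matters because some browsers expose `navigator.gpu` but return `null` from `requestAdapter()` on unsupported hardware.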
Can I use the transcript commercially?
Yes. The transcript is yours. Whisper itself is released by OpenAI under the MIT license, and the ONNX model we load (onnx-community/whisper-base, onnx-community/whisper-large-v3-turbo) inherits that license. There are no per-minute fees and no rights claim by us.