🎙️ Free · Open-source · 99+ languages

Speech to text,
in any language.

Pick any AI model (Chinese-focused, Japanese-focused, or Whisper) behind one clean interface. If a model goes down or runs out of quota, we auto-switch to another. No shared API key, no lock-in.

Voice2Text mobile app
The vision

What we're building

A public, free voice-to-text app for web + mobile, powered by multiple open-source AI models — not just one. Users switch models freely, and the app keeps working even when an individual model fails or hits its quota. Next: React web + Flutter mobile on a Python backend.

🎙️

Record or upload

Speak in-app or upload audio. We normalize to 16 kHz mono, chunk long files, and transcribe.

🔀

Pick any model

A clean picker of models with languages, speed and "offline?" tags. Choose what fits your language.

🛟

Never breaks

If your model is down or over quota, the orchestrator falls back to the next best one — transparently.
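
The "Record or upload" step (normalize to 16 kHz mono, then chunk long files) could look roughly like the sketch below. It assumes ffmpeg is the conversion tool, and the 30-second chunk length and 1-second overlap are illustrative values; none of these choices are stated on this page.

```python
from typing import List, Tuple

TARGET_SAMPLE_RATE = 16_000  # 16 kHz, as described above
TARGET_CHANNELS = 1          # mono


def ffmpeg_normalize_args(src: str, dst: str) -> List[str]:
    """Build an ffmpeg command line that converts src to 16 kHz mono.

    Assumption: ffmpeg is installed; -ac sets channels, -ar sets sample rate.
    """
    return [
        "ffmpeg", "-y",
        "-i", src,
        "-ac", str(TARGET_CHANNELS),
        "-ar", str(TARGET_SAMPLE_RATE),
        dst,
    ]


def chunk_bounds(
    duration_s: float,
    chunk_s: float = 30.0,   # assumed chunk length
    overlap_s: float = 1.0,  # assumed overlap so words at a boundary are not cut
) -> List[Tuple[float, float]]:
    """Split [0, duration_s] into windows of at most chunk_s seconds."""
    if chunk_s <= overlap_s:
        raise ValueError("chunk_s must be larger than overlap_s")
    bounds: List[Tuple[float, float]] = []
    start = 0.0
    while start < duration_s:
        end = min(start + chunk_s, duration_s)
        bounds.append((start, end))
        if end >= duration_s:
            break
        start = end - overlap_s  # step back so consecutive chunks overlap
    return bounds
```

The argument list can be passed to `subprocess.run`, and each `(start, end)` pair can then be transcribed independently and stitched back together.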

Definitions — get these right first

Two concepts, correctly separated

These are two different things people constantly confuse. Keep them separate; they compose beautifully.

Concept A

User picks the model

A Model Picker + Engine Registry. The user chooses a Chinese model, a Japanese model, or any model that supports the language they need. ✓ Easy once every engine sits behind one interface.
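
A minimal sketch of what "every engine sits behind one interface" might mean in the Python backend. The names `Engine`, `StubEngine`, and `EngineRegistry` are hypothetical, and the language sets are illustrative subsets rather than each model's real coverage.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Protocol


class Engine(Protocol):
    """The one interface every model adapter would implement (hypothetical)."""
    name: str
    languages: FrozenSet[str]
    offline: bool

    def transcribe(self, audio_path: str, language: str) -> str: ...


@dataclass(frozen=True)
class StubEngine:
    """Placeholder adapter; a real one would wrap an actual model."""
    name: str
    languages: FrozenSet[str]
    offline: bool = False

    def transcribe(self, audio_path: str, language: str) -> str:
        return f"[{self.name}:{language}] transcript of {audio_path}"


class EngineRegistry:
    """Maps model names to engines and answers 'which models fit my language?'."""

    def __init__(self) -> None:
        self._engines: Dict[str, Engine] = {}

    def register(self, engine: Engine) -> None:
        self._engines[engine.name] = engine

    def get(self, name: str) -> Engine:
        return self._engines[name]

    def for_language(self, language: str) -> List[str]:
        # Registration order is preserved, so the picker lists models predictably.
        return [n for n, e in self._engines.items() if language in e.languages]


# Illustrative registrations (language codes are examples only).
registry = EngineRegistry()
registry.register(StubEngine("whisper", frozenset({"en", "zh", "ja"})))
registry.register(StubEngine("sensevoice", frozenset({"zh"})))
registry.register(StubEngine("kotoba", frozenset({"ja"})))
```

The Model Picker would call `registry.for_language(...)` to populate its list and `registry.get(...)` once the user has chosen.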

Concept B

AI Orchestration

"One model stops / its daily quota ends → auto-switch." An AI Gateway / Router with health checks, circuit breakers, and fallback chains. This is the same pattern that open-source LLM gateways such as LiteLLM and Portkey implement.

🧩 They compose: the user sets a preference (A); the orchestrator honors it but auto-falls-back (B) if that model is unavailable — e.g. "switched to Whisper because SenseVoice was over quota."
[Diagram] Whisper (99+ languages) · SenseVoice (🇨🇳, fast) · Qwen3-ASR (52 languages) · Kotoba (🇯🇵 JP) · Vosk (offline) · Groq (BYOK) · Moonshine (edge) → ORCHESTRATOR (single interface)
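
How the two concepts compose can be sketched as follows: try the user's preferred engine first, walk a fallback chain if it fails, and skip engines whose circuit breaker is open. This is an illustration under assumptions, not the project's orchestrator: `EngineUnavailable`, `CircuitBreaker`, and `Orchestrator` are hypothetical names, the thresholds are arbitrary, and active health checks are left out.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


class EngineUnavailable(Exception):
    """Hypothetical error an engine raises when it is down or over quota."""


@dataclass
class CircuitBreaker:
    max_failures: int = 3     # assumed threshold
    cooldown_s: float = 60.0  # assumed cooldown before a trial request
    failures: int = 0
    opened_at: Optional[float] = None

    def allows(self, now: float) -> bool:
        # Closed, or open long enough that one trial request is permitted.
        return self.opened_at is None or now - self.opened_at >= self.cooldown_s

    def record_success(self) -> None:
        self.failures, self.opened_at = 0, None

    def record_failure(self, now: float) -> None:
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = now


class Orchestrator:
    def __init__(
        self,
        engines: Dict[str, Callable[[str], str]],
        fallback_order: List[str],
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.engines = engines
        self.fallback_order = fallback_order
        self.clock = clock
        self.breakers = {name: CircuitBreaker() for name in engines}

    def transcribe(self, audio_path: str, preferred: str) -> Tuple[str, str, Optional[str]]:
        """Return (engine_used, transcript, notice); notice is None if no switch happened."""
        chain = [preferred] + [n for n in self.fallback_order if n != preferred]
        for name in chain:
            breaker = self.breakers[name]
            if not breaker.allows(self.clock()):
                continue  # breaker open: do not even try this engine
            try:
                text = self.engines[name](audio_path)
            except EngineUnavailable:
                breaker.record_failure(self.clock())
                continue
            breaker.record_success()
            notice = None if name == preferred else (
                f"switched to {name} because {preferred} was unavailable"
            )
            return name, text, notice
        raise EngineUnavailable("every engine in the fallback chain is unavailable")


# Demo with stand-in engines; the fixed clock keeps the run deterministic.
calls = {"sensevoice": 0}

def sensevoice(path: str) -> str:
    calls["sensevoice"] += 1
    raise EngineUnavailable("over quota")

def whisper(path: str) -> str:
    return f"transcript of {path}"

orchestrator = Orchestrator(
    {"sensevoice": sensevoice, "whisper": whisper},
    fallback_order=["whisper", "sensevoice"],
    clock=lambda: 0.0,
)
result = orchestrator.transcribe("note.wav", preferred="sensevoice")
```

The returned notice is what the app would surface to the user, so the switch is visible rather than silent.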
Why this is hard — in plain language

Real-world problems for end users

What actually goes wrong for the person using the app — and what each problem really means.

🚫

"Quota exceeded"

Definition: a free model allows only N requests/day. Problem: at request N+1 everyone is blocked and the transcript just fails.

💥

"Model is down"

Definition: a provider has an outage. Problem: the whole app feels broken though only one model failed.

🌐

"My language isn't supported"

Definition: a model covers only some languages. Problem: a Cantonese or Japanese user gets garbage from an English-only model.

🐢

"It's so slow"

Definition: big models on weak hardware. Problem: users wait 30s for a 10s clip and give up.

🔓

"Where did my audio go?"

Definition: cloud models send audio off-device. Problem: sensitive voice notes leave the device — a privacy issue.

💸

"Why is it suddenly paid?"

Definition: free credits run out. Problem: a free app starts demanding a card, breaking trust.

The single most important risk

The shared-API-key trap

⚠️ One API key in .env for a public app is a trap.

A single shared key fails in four ways:

  • One free quota burns for everyone in minutes.
  • One abuser kills the app.
  • The key will leak if it touches client code.
  • Most free tiers' Terms of Service forbid proxying one account to many users (→ account ban).

The clean fix for a "fully open-source + public" app — combine these:

Layer | What | Why it solves your problem
⭐ Self-host the open-source models (Whisper, SenseVoice, Vosk, Qwen3-ASR…) | Run them on your server / GPU | No API key and no per-day quota at all; you trade per-request quota for compute. Fits "fully open-source, multiple models, public."
Per-user rate limit (Redis) | Cap audio-minutes per user / IP / day | Protects your compute from abuse even with no keys.
Orchestrator | Honor the user's model choice + auto-failover | Delivers Concept A + Concept B together.
Optional: BYOK | Users paste their own Groq/HF key (encrypted) | Their quota, not yours, so there's no shared key, ever.

So: self-hosting means no shared key and no per-day quota (you are limited only by your own compute), protected by per-user caps, with BYOK as an opt-in for cloud models. Never ship one shared key.
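
The per-user cap on audio-minutes can be sketched as below. To stay self-contained, this version keeps counters in an in-memory dictionary as a stand-in for Redis; the class name `DailyAudioQuota` and the 30-minute default are assumptions for illustration only.

```python
from datetime import date
from typing import Dict, Optional, Tuple

DAILY_CAP_MINUTES = 30.0  # hypothetical default cap per user per day


class DailyAudioQuota:
    """Tracks audio-minutes per (user, day). In-memory stand-in for a Redis counter."""

    def __init__(self, cap_minutes: float = DAILY_CAP_MINUTES) -> None:
        self.cap_minutes = cap_minutes
        self._used: Dict[Tuple[str, date], float] = {}

    def remaining(self, user_id: str, today: Optional[date] = None) -> float:
        key = (user_id, today or date.today())
        return self.cap_minutes - self._used.get(key, 0.0)

    def try_consume(self, user_id: str, minutes: float, today: Optional[date] = None) -> bool:
        """Reserve minutes for this user; return False if the cap would be exceeded."""
        key = (user_id, today or date.today())
        used = self._used.get(key, 0.0)
        if used + minutes > self.cap_minutes:
            return False
        self._used[key] = used + minutes
        return True
```

Because the day is part of the key, the allowance resets naturally at midnight. With a shared store such as Redis, the check and the increment would need to be atomic so that concurrent uploads cannot both slip under the cap.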
How many solutions really exist?

4 feasible solutions — pick what fits

Four viable architectures trading off cost, speed, privacy, and complexity. Each shows its limitations, best use case, and a monthly cost band. Vote for your favourite below ↓

Cheapest

① Free-Cloud Orchestrator

$3 /mo
$0–$5 · free tiers + BYOK

Orchestrate free cloud APIs (Groq, HF, Cloudflare). Heavy users bring their own key.

  • Almost free; zero infra
  • Very fast (Groq LPU)
  • Depends on 3rd-party quotas
  • Audio leaves device
Best for: launching the MVP fast on a tiny budget.
Budget · Private

② CPU Self-Host

$12 /mo
$5–$20 · one small VPS

faster-whisper INT8 / Vosk / SenseVoice on a CPU VPS. No keys, no quotas.

  • Fully open-source; no shared key
  • Audio stays private
  • Predictable flat cost
  • Slower (CPU)
Best for: privacy-first, moderate traffic.
Fastest · Scale

④ GPU Powerhouse

$120 /mo
$50–$300 · or serverless

Dedicated/serverless GPU running Whisper large-v3 / Qwen3-ASR in real time.

  • Fastest + most accurate
  • Real-time streaming, high concurrency
  • Highest cost; ops
  • Idle GPU cost
Best for: scale & real-time apps.
Comparative analysis

Compare: pricing, performance & ranking

Scores are indicative (1–10 where noted) to aid reasoning — not hard benchmarks. The user-likes chart updates live as you vote.

[Charts] Monthly cost (USD per month, typical) · Performance & accuracy (relative score, 1–10) · Multi-factor ranking (speed, accuracy, privacy, scale, ease, cost) · User likes (live, vote below) · Overall MVP-fit score (out of 100)

Designed, not just described

App preview

Mockups of the planned mobile & desktop experience — 36 screens each, the full journey end-to-end. Slide to explore. (Concept designs; the real apps come next on a Python backend.)

At a glance

Features & how we achieve them

What the app does, what it means, how it's built, and when it lands.

Feature | What it means | How we achieve it | Stage
Multi-language | Transcribe 99+ languages, CN/JP first-class | Whisper large-v3 default + SenseVoice (CN) / Kotoba (JP) routing | MVP
Switch models | User chooses any model, any time | Model registry behind one engine interface (Concept A) | MVP
Auto-failover | Keeps working when a model dies / hits quota | Orchestrator: health checks + fallback chains (LiteLLM/Portkey) | v1
No shared key | Public-safe, no quota burn or ToS ban | Self-host models + per-user limits + optional BYOK | MVP
Offline / on-device | Works with no network; audio never leaves device | Vosk / Moonshine on mobile (Flutter/RN) | Later
Privacy | Sensitive audio stays on your infra | Self-hosted inference; consent + retention controls | v1
Export | Save transcripts in standard formats | .txt / .srt / .vtt with word-level timestamps | v1
Web + Mobile | One backend, two clients | React web + Flutter mobile + Python backend | Later

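
For the Export row, a rough sketch of turning timestamped segments into .srt text. It assumes the transcriber yields `(start_seconds, end_seconds, text)` tuples, which is a hypothetical shape, and it works at segment level rather than word level.

```python
from typing import Iterable, List, Tuple

Segment = Tuple[float, float, str]  # (start_s, end_s, text), assumed shape


def srt_timestamp(seconds: float) -> str:
    """Format seconds as HH:MM:SS,mmm (SRT uses a comma before milliseconds)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments: Iterable[Segment]) -> str:
    """Render numbered SRT cues separated by blank lines."""
    blocks: List[str] = []
    for index, (start, end, text) in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text.strip()}\n"
        )
    return "\n".join(blocks)
```

A .vtt exporter would be nearly identical: WebVTT starts with a `WEBVTT` header line and uses a period instead of a comma before the milliseconds.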
Help us decide

Which solution should we build first?

Tap to vote for the solution that's cheaper, faster, or best fits your needs. The user-likes chart above updates live.

① Free-Cloud · ② CPU Self-Host · ③ Hybrid ⭐ · ④ GPU

What users are saying

Leave a rating & comment

In demo mode, ratings and comments are stored locally. Once the waitingList-service (Supabase) endpoint is configured, they are posted to the backend.

Voice2Text app — transcript screen
🎙️ Voice2Text · be first in line
Be first

Join the waitlist

Tell us which solution you'd pay for and what result you need. We'll email you when the Voice2Text app is live.

No spam. One confirmation email. Leave any time.