Step 6
OCR / STT / TTS
30 min
Add "image → text", "speech → text", and "text → speech" — three practical bits.
1. OCR — Tesseract wasm
pnpm add tesseract.js
import Tesseract from "tesseract.js";
const { data: { text } } = await Tesseract.recognize(imageFile, "kor+eng", {
  logger: (m) => console.log(m.status, m.progress),
});
- kor+eng recognizes Korean and English together
- ~10MB wasm (lazy load)
- Downloads traineddata on first use
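Because the first run downloads the wasm core and the traineddata, the logger callback is a natural place to drive a progress indicator. A minimal sketch, assuming each logger message carries a status string and a progress value between 0 and 1; `OcrLogMessage` and `formatOcrProgress` are hypothetical names introduced here, not part of tesseract.js:

```typescript
// Shape of the messages the logger callback above receives (assumed: progress is 0..1).
interface OcrLogMessage {
  status: string;
  progress: number;
}

// Turn a logger message into a short label for a progress UI.
// The value is clamped so an unexpected progress number never renders as 120% or -5%.
function formatOcrProgress(m: OcrLogMessage): string {
  const clamped = Math.min(1, Math.max(0, m.progress));
  const pct = Math.round(clamped * 100);
  return `${m.status}: ${pct}%`;
}
```

It could be wired in as `logger: (m) => setLabel(formatOcrProgress(m))`, where `setLabel` stands for whatever updates your UI.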
2. Preprocessing
function preprocessImage(canvas: HTMLCanvasElement) {
  const ctx = canvas.getContext("2d")!;
  const img = ctx.getImageData(0, 0, canvas.width, canvas.height);
  const d = img.data; // RGBA bytes, 4 per pixel
  for (let i = 0; i < d.length; i += 4) {
    // Luma-weighted grayscale, then a fixed threshold to pure black or white.
    const gray = 0.299 * d[i] + 0.587 * d[i+1] + 0.114 * d[i+2];
    const bw = gray > 128 ? 255 : 0;
    d[i] = d[i+1] = d[i+2] = bw; // alpha (d[i+3]) is left unchanged
  }
  ctx.putImageData(img, 0, 0);
}
Converting to grayscale and applying a fixed threshold before recognition can improve OCR accuracy by roughly 10–20 percentage points.
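The same grayscale + threshold step can also be written as a pure function over the raw RGBA buffer, which makes it easy to unit test or move into a worker. This is a sketch under that assumption; `binarize` and its `threshold` parameter are names introduced here for illustration:

```typescript
// Grayscale + fixed threshold on a raw RGBA buffer (4 bytes per pixel).
// Returns a new buffer and leaves the input untouched.
function binarize(rgba: Uint8ClampedArray, threshold = 128): Uint8ClampedArray {
  const out = new Uint8ClampedArray(rgba); // copy of the input pixels
  for (let i = 0; i < out.length; i += 4) {
    // Same luma weights as the canvas version above.
    const gray = 0.299 * out[i] + 0.587 * out[i + 1] + 0.114 * out[i + 2];
    const bw = gray > threshold ? 255 : 0;
    out[i] = out[i + 1] = out[i + 2] = bw; // alpha (i + 3) is kept as-is
  }
  return out;
}
```

With a canvas, the result can be written back with `img.data.set(binarize(img.data))` before calling `putImageData`. Exposing the threshold lets you try values other than 128 on your own images.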
3. STT — Web Speech API
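// Chromium exposes the constructor under the webkit prefix.
const rec = new (window.webkitSpeechRecognition || window.SpeechRecognition)();
rec.lang = "ko-KR";
rec.continuous = false; // stop after a single utterance
rec.interimResults = true; // also deliver partial results while the user speaks
rec.onresult = (e) => {
  // Join the top alternative of every result into one transcript.
  const text = Array.from(e.results).map(r => r[0].transcript).join("");
};
rec.start();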
The Web Speech API is free, but it needs a network connection and works mainly in Chromium-based engines. For offline use, swap to Vosk or whisper.cpp.
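With interimResults enabled, each result is either final or still changing, and a UI usually wants to render the two differently. A minimal sketch that separates them, relying on the standard `isFinal` flag of each result; the `ResultLike` types and `splitTranscript` are hypothetical helpers defined here so the logic runs without a browser:

```typescript
// Minimal structural types matching what the handler reads from e.results.
interface AlternativeLike {
  transcript: string;
}
interface ResultLike {
  isFinal: boolean;
  readonly [index: number]: AlternativeLike;
}

// Split recognition results into committed text and still-changing text.
function splitTranscript(results: ArrayLike<ResultLike>): { finalText: string; interimText: string } {
  let finalText = "";
  let interimText = "";
  for (const r of Array.from(results)) {
    const best = r[0].transcript; // top alternative of this result
    if (r.isFinal) finalText += best;
    else interimText += best;
  }
  return { finalText, interimText };
}
```

Inside `rec.onresult`, calling `splitTranscript(e.results)` would let you show `finalText` in normal type and `interimText` greyed out.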
4. TTS — Web Speech API
function speak(text: string, lang = "ko-KR") {
  const u = new SpeechSynthesisUtterance(text);
  u.lang = lang; u.rate = 1.0; u.pitch = 1.0;
  window.speechSynthesis.speak(u);
}
function cancel() { window.speechSynthesis.cancel(); }
For intuitive UX, use a single button that toggles: if speech is already playing, call cancel(); otherwise call speak().
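The toggle can be sketched as one function that checks the synthesizer's `speaking` flag. To keep the example runnable outside a browser, it takes the synthesizer as a parameter; `SynthLike` and `toggleSpeech` are hypothetical names, and `window.speechSynthesis` is assumed to fit the interface:

```typescript
// The subset of the speech synthesizer that the toggle needs.
interface SynthLike<U> {
  readonly speaking: boolean;
  speak(utterance: U): void;
  cancel(): void;
}

// Stop playback if something is being spoken, otherwise start speaking.
// Returns what happened so the caller can update a button label.
function toggleSpeech<U>(synth: SynthLike<U>, makeUtterance: () => U): "started" | "stopped" {
  if (synth.speaking) {
    synth.cancel();
    return "stopped";
  }
  synth.speak(makeUtterance());
  return "started";
}

// In the browser this might be called as:
//   toggleSpeech(window.speechSynthesis, () => new SpeechSynthesisUtterance(text));
```

Building the utterance lazily means nothing is allocated when the click only cancels playback.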
5. Permissions (mobile)
Android requires the RECORD_AUDIO runtime permission for the microphone. OCR needs no permission because it only reads an image the user selects.
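When the microphone permission is missing, the recognizer reports it through its error event rather than throwing. A hedged sketch of mapping the standard Web Speech error codes to user-facing text, assuming the webview surfaces a denied permission as "not-allowed"; `describeSttError` and the message wording are illustrative only:

```typescript
// Map SpeechRecognition error codes to short messages for the user.
function describeSttError(code: string): string {
  switch (code) {
    case "not-allowed":
    case "service-not-allowed":
      return "Microphone access was denied. Enable it in the system settings.";
    case "audio-capture":
      return "No microphone was found.";
    case "network":
      return "Speech recognition needs a network connection.";
    case "no-speech":
      return "No speech was detected. Try again.";
    default:
      return `Speech recognition failed (${code}).`;
  }
}

// Possible wiring (showMessage is a hypothetical UI helper):
//   rec.onerror = (e) => showMessage(describeSttError(e.error));
```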
6. Language packs
Tesseract traineddata is ~10MB per language. Bundle vs download at runtime:
- Bundled — offline ready, bigger app
- Runtime download — smaller app, first-use delay
On mobile, bundle the language packs to respect users' data plans.
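One way to keep this choice in a single place is a small helper that builds the worker options. The sketch assumes tesseract.js's `langPath` worker option is used to point at bundled files; the `/tessdata` directory and the `ocrWorkerOptions` helper are hypothetical, and the expected file naming should be checked against the tesseract.js documentation:

```typescript
type PackStrategy = "bundled" | "runtime";

// Only the option this example cares about.
interface OcrWorkerOptions {
  langPath?: string;
}

// "bundled": load traineddata from a directory shipped with the app.
// "runtime": leave langPath unset so the library downloads from its default location.
function ocrWorkerOptions(strategy: PackStrategy, bundledDir = "/tessdata"): OcrWorkerOptions {
  return strategy === "bundled" ? { langPath: bundledDir } : {};
}
```

The returned object would be passed as the options when creating the tesseract.js worker, so switching strategy per platform is a one-line change.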
7. Gotchas
- Long first OCR delay → show a progress UI
- No Korean voice → user must install OS voice pack
- STT battery drain → cancel on onend
- Noisy OCR output → regex /\s+/g, " " cleanup
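Two of these gotchas are easy to handle in code. A minimal sketch: `cleanOcrText` applies the whitespace regex from the list (the extra trim is an addition here), and `findVoice` checks whether any installed voice matches a language so the app can prompt the user to install one. Both helper names and the `VoiceLike` type are hypothetical:

```typescript
// Collapse runs of whitespace (including newlines) into single spaces.
function cleanOcrText(raw: string): string {
  return raw.replace(/\s+/g, " ").trim();
}

// The subset of a speech synthesis voice that the lookup reads.
interface VoiceLike {
  lang: string;
  name: string;
}

// Find the first voice whose primary language subtag matches, e.g. "ko" for "ko-KR".
// Splitting on "-" or "_" tolerates platforms that report tags like "ko_KR".
function findVoice<V extends VoiceLike>(voices: readonly V[], lang: string): V | undefined {
  const wanted = lang.split(/[-_]/)[0].toLowerCase();
  return voices.find((v) => v.lang.split(/[-_]/)[0].toLowerCase() === wanted);
}
```

In the browser, `findVoice(window.speechSynthesis.getVoices(), "ko-KR")` returning undefined would be the cue to tell the user that a Korean voice pack is missing.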
Closing
OCR · STT · TTS often combine in real products (language apps, accessibility tools). Web APIs go a long way under Tauri.
Next
- 07-admob-shipping