OCR · STT · TTS — Voice and Text
Reading text out of images (OCR), transcribing human speech (STT), and the reverse, synthesizing a voice from text (TTS), run on different stacks but are often bundled together inside a single app.
1. Three areas at a glance
| Acronym | Expansion | Input → Output |
|---|---|---|
| OCR | Optical Character Recognition | image → text |
| STT | Speech-to-Text | speech → text |
| TTS | Text-to-Speech | text → speech |
None of these tasks reaches 100% accuracy. Postprocessing (spell correction, domain dictionaries, confidence display) is a natural part of the UX, not an afterthought.
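As a sketch of what confidence display can mean in practice, the helper below triages recognized words by a confidence score. The word shape loosely mirrors what engines such as Tesseract.js report per word, but the interface, the `triageWords` helper, and the threshold are all illustrative, not any library's API:

```typescript
// Hypothetical OCR word result: text plus a 0-100 confidence score.
interface OcrWord {
  text: string
  confidence: number
}

// Keep high-confidence words as-is; flag the rest for user review
// instead of silently accepting them.
function triageWords(words: OcrWord[], threshold = 80) {
  const accepted: string[] = []
  const needsReview: string[] = []
  for (const w of words) {
    if (w.confidence >= threshold) accepted.push(w.text)
    else needsReview.push(w.text)
  }
  return { accepted, needsReview }
}

const result = triageWords([
  { text: "안녕하세요", confidence: 96 },
  { text: "읾닮", confidence: 41 },
])
// result.accepted: ["안녕하세요"], result.needsReview: ["읾닮"]
```

The low-confidence bucket is what a correction UI would highlight for the user.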
2. Tesseract
Tesseract was developed at HP Labs starting around 1985, shelved in the late 1990s, and open-sourced in 2005. Google then sponsored it and drove development from 2006 through v3; today it is a community project under the Apache-2.0 license.
| Version | When | Event |
|---|---|---|
| 3.0 | 2010 | Multilingual support settled. |
| 4.0 | 2018 | LSTM-based neural engine added. |
| 5.0 | 2021 | Stable. Improved accuracy and speed. |
Tesseract.js (from the naptha project) compiles the Tesseract engine to WebAssembly (earlier versions used asm.js), so it runs in browsers and in Node.js.
```typescript
import Tesseract from "tesseract.js"

const { data: { text } } = await Tesseract.recognize(imageUrl, "kor+eng")
```
Language data (kor.traineddata, eng.traineddata) ships as separate files, in three variants: tessdata, tessdata_best, and tessdata_fast. best maximizes accuracy, fast minimizes size and latency, and tessdata is the compromise.
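One way to expose the accuracy/speed trade-off is to resolve the traineddata base URL from a variant choice. The three repository names below are the real tesseract-ocr GitHub repos, but the helper itself, the branch name, and its use as a Tesseract.js `langPath` option are assumptions of this sketch:

```typescript
// Map an accuracy/speed preference to one of the three official
// traineddata repositories. Branch name may differ per repo.
type TessdataVariant = "tessdata" | "tessdata_best" | "tessdata_fast"

function langPathFor(variant: TessdataVariant): string {
  return `https://raw.githubusercontent.com/tesseract-ocr/${variant}/main/`
}

// e.g. pass langPathFor("tessdata_fast") as the langPath option
// when creating a Tesseract.js worker.
```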
3. Cloud OCR and built-in OS
| Service | Notes |
|---|---|
| Google Cloud Vision OCR | High Korean accuracy. DOCUMENT_TEXT_DETECTION mode preserves layout. |
| Azure AI Vision (Read API) | Multilingual. Handwriting supported. |
| AWS Textract | First-class table and form structure extraction. |
| Naver Clova OCR | Korean-specialized. |
Cloud models are generally more accurate than Tesseract, but with cost and privacy trade-offs. Local-first apps usually reach for Tesseract.js or the OS's built-in OCR instead.
Built-in OS:
- Windows — Windows.Media.Ocr (OCR runtime, Windows 10, 2015+).
- macOS — Vision framework, VNRecognizeTextRequest (macOS 10.15+, 2019).
- iOS — the same Vision framework.
- Android — ML Kit Text Recognition (Google).
To call these from Tauri, OS-specific native code (a Rust crate or direct FFI) is needed.
4. Web Speech API
A browser API specified by the W3C (as a Community Group report rather than a finished standard). It is composed of two interfaces:
- SpeechRecognition (STT) — microphone input to text.
- SpeechSynthesis (TTS) — text to voice.
| Feature | Chrome/Edge | Safari | Firefox |
|---|---|---|---|
| SpeechSynthesis | O | O | O |
| SpeechRecognition | O (server-dependent) | O (16+) | X (behind a flag) |
Important: Chrome's SpeechRecognition is widely reported to send audio to Google servers for recognition rather than running fully offline. Environments with strict privacy requirements need a separate solution.
5. TTS example
```typescript
const u = new SpeechSynthesisUtterance("안녕하세요") // "Hello"
u.lang = "ko-KR"
u.rate = 1.0  // 0.1 – 10
u.pitch = 1.0 // 0 – 2
speechSynthesis.speak(u)

// available voice list (loaded asynchronously)
speechSynthesis.onvoiceschanged = () => {
  const ko = speechSynthesis.getVoices().filter(v => v.lang.startsWith("ko"))
  console.log(ko)
}
```
The voice list is provided by the OS and the browser. On Windows, Microsoft Heami and other SAPI voices fill the Korean slot; on macOS, voices such as Yuna do.
6. STT example
```typescript
const SR = (window as any).SpeechRecognition || (window as any).webkitSpeechRecognition
const sr = new SR()
sr.lang = "ko-KR"
sr.continuous = false
sr.interimResults = true
sr.onresult = (e: any) => {
  const t = e.results[e.results.length - 1][0].transcript
  console.log(t)
}
sr.start()
```
7. Cloud STT/TTS
| Service | Notes |
|---|---|
| OpenAI Whisper | 2022 OSS, MIT. Offline/on-prem possible. Korean supported. |
| Google Cloud Speech-to-Text | Multilingual. |
| Azure Speech | Real-time conversion. |
| Amazon Transcribe | Strong in US English. |
| Naver Clova Speech | Korean specialized. |
Whisper is an open-source model OpenAI released in September 2022. C++/WASM ports such as whisper.cpp (Georgi Gerganov, 2022) made local execution common. The official release lists 99 supported languages, including Korean.
Cloud TTS:
- Google Cloud TTS (WaveNet, Studio Voice).
- Azure Neural TTS (200+ voices).
- ElevenLabs, OpenAI TTS, Coqui TTS.
- Korean neural voices: Naver Clova Voice, Kakao i, Typecast.
8. Quirks of Korean OCR
Vertical writing — old documents and some designs. Tesseract handles it via page segmentation mode (PSM) options (e.g. --psm 5 for a vertically aligned text block).
Hanja mixed in — academic documents and classical texts. Specify kor together with chi_tra or chi_sim (e.g. kor+chi_tra).
Font variants — calligraphic and design fonts cause sharp accuracy drops.
Image preprocessing — binarization, contrast, deskew, and noise removal greatly affect accuracy. OpenCV or sharp is often used.
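As a minimal sketch of the binarization step only: the function below applies a fixed global threshold to a grayscale pixel buffer. Real pipelines typically use adaptive thresholding (e.g. Otsu via OpenCV) plus deskew and denoise; the fixed threshold here is purely illustrative:

```typescript
// Binarize a grayscale image (0-255 values) with a fixed global threshold.
// Pixels at or above the threshold become white (255), the rest black (0).
function binarize(pixels: Uint8Array, threshold = 128): Uint8Array {
  const out = new Uint8Array(pixels.length)
  for (let i = 0; i < pixels.length; i++) {
    out[i] = pixels[i] >= threshold ? 255 : 0
  }
  return out
}

// binarize(Uint8Array.of(30, 200)) → Uint8Array [0, 255]
```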
9. Quirks of Korean STT/TTS
STT:
- Proper nouns and neologisms — names, brands, and neologisms are frequently wrong. Domain dictionaries or candidate corrections.
- No formal/casual distinction — depending on how the model learned colloquial speech, casual register can come out awkward.
- Dialects — accuracy drops outside standard Korean.
- Number representation — "이천이십육년" (the year fully spelled out), "2026년", and "2026" all denote the same year, and STT output is ambiguous among them.
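Normalizing spelled-out numbers back to digits is one common postprocessing step for ambiguity like the above. The parser below handles Sino-Korean numerals up to the 억 (10^8) place; it assumes the numeral portion has already been isolated (e.g. the trailing "년" stripped) and is a sketch, not a production normalizer:

```typescript
// Sino-Korean numeral tables: digits, small place units, large section units.
const DIGITS: Record<string, number> = { 영: 0, 공: 0, 일: 1, 이: 2, 삼: 3, 사: 4, 오: 5, 육: 6, 칠: 7, 팔: 8, 구: 9 }
const SMALL: Record<string, number> = { 십: 10, 백: 100, 천: 1000 }
const BIG: Record<string, number> = { 만: 1e4, 억: 1e8 }

// Parse e.g. "이천이십육" → 2026. An implicit leading 일 is assumed
// before 십/백/천 ("천" alone reads as 1000).
function parseSinoKorean(s: string): number {
  let total = 0, section = 0, digit = 0
  for (const ch of s) {
    if (ch in DIGITS) digit = DIGITS[ch]
    else if (ch in SMALL) { section += (digit || 1) * SMALL[ch]; digit = 0 }
    else if (ch in BIG) { total += (section + digit) * BIG[ch]; section = 0; digit = 0 }
    else throw new Error(`unexpected character: ${ch}`)
  }
  return total + section + digit
}

// parseSinoKorean("이천이십육") → 2026
```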
TTS:
- Prosody — natural synthesis hinges on phrase-level prosody in Korean. Short SSML support helps.
- Loanword pronunciation — policy on whether to read English words with Korean pronunciation or English pronunciation.
- Numbers → Korean units — whether to read "1000" as "천" (thousand) or digit by digit as "일영영영" ("one-zero-zero-zero"). In running text the latter sounds unnatural.
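The unit-based reading can be sketched as the inverse of the STT normalization problem: render a number with 십/백/천 place units, dropping the redundant "일" before a unit ("1000" → "천", not "일천"). The range limit and the convention choices here are assumptions of this sketch:

```typescript
// Sino-Korean digit and place-unit readings (ones through thousands).
const DIGIT_KO = ["", "일", "이", "삼", "사", "오", "육", "칠", "팔", "구"]
const SMALL_KO = ["", "십", "백", "천"]

// Read a number 0..9999 in Sino-Korean with place units.
function readSinoKorean(n: number): string {
  if (n === 0) return "영"
  let out = ""
  for (let pos = 3; pos >= 0; pos--) {
    const d = Math.floor(n / 10 ** pos) % 10
    if (d === 0) continue
    // drop the leading 일 before a unit: 1000 → "천", not "일천"
    out += (d === 1 && pos > 0 ? "" : DIGIT_KO[d]) + SMALL_KO[pos]
  }
  return out
}

// readSinoKorean(1000) → "천"; readSinoKorean(2026) → "이천이십육"
```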
10. Common shape
```typescript
import Tesseract from "tesseract.js"

const worker = await Tesseract.createWorker("kor+eng", 1, {
  logger: (m) => console.log(m.status, m.progress),
})
const { data: { text, confidence } } = await worker.recognize(file)
await worker.terminate()
```
createWorker runs in a background Web Worker to avoid blocking the UI. Large model data (tens of MB) downloads on first run (cached in IndexedDB).
Forcing a TTS voice selection:
```typescript
async function speak(text: string) {
  await new Promise<void>((r) => {
    if (speechSynthesis.getVoices().length) return r()
    speechSynthesis.onvoiceschanged = () => r()
  })
  const voice = speechSynthesis.getVoices().find(v => v.lang === "ko-KR")
  const u = new SpeechSynthesisUtterance(text)
  if (voice) u.voice = voice
  speechSynthesis.speak(u)
}
```
Keeping STT start/stop state in a single source of truth:
```typescript
let recognizing = false

function start() {
  if (recognizing) return
  recognizing = true
  sr.start()
}

function stop() {
  if (!recognizing) return
  recognizing = false
  sr.stop()
}

sr.onend = () => { recognizing = false }
```
When onend and a user stop() race, state diverges. Manage in one place consistently.
11. Common pitfalls
Microphone permission — the permission dialog appears only over HTTPS within a user gesture. localhost is the exception.
SpeechRecognition cutting off — some browsers automatically end after a period. The pattern of continuous = true and restarting in onend is common.
Mobile browser compatibility — iOS Safari's SpeechRecognition is from 16+. Earlier versions need branching.
OCR PSM (Page Segmentation Mode) — auto split (--psm 3) does not fit every image. Single line is --psm 7, single word is --psm 8.
Cost of loading multiple language models — kor+eng is often required, but doubles memory and download. Get a user choice or restrict to one language matching the domain.
Whisper hallucination — there are reports of arbitrary text being produced in silent segments. Mitigations include VAD (Voice Activity Detection) preprocessing or filtering segments by timestamp and log probability.
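A sketch of the logprob-based filtering: the field names below mirror the per-segment statistics Whisper reports (avg_logprob, no_speech_prob), but the specific thresholds and the helper are illustrative, not an official recipe:

```typescript
// Shape modeled on the per-segment fields Whisper emits.
interface WhisperSegment {
  text: string
  avg_logprob: number
  no_speech_prob: number
}

// Drop segments that look like hallucinations: the model is both
// unconfident in the text AND fairly sure the audio was silence.
function dropLikelyHallucinations(
  segments: WhisperSegment[],
  minLogprob = -1.0,
  maxNoSpeech = 0.6,
): WhisperSegment[] {
  return segments.filter(
    s => !(s.avg_logprob < minLogprob && s.no_speech_prob > maxNoSpeech),
  )
}
```

Tuning both thresholds against real audio from the target domain matters more than the exact defaults.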
Closing thoughts
For OCR, STT, and TTS, expecting 100% accuracy is unrealistic. Postprocessing, confidence display, and a user-correction flow are half of the UX. Korean accuracy is often lower than English accuracy in more of these areas, so domain dictionaries help.
Next
- native-integrations
- loading-ux
We refer to Tesseract GitHub, Tesseract.js, tessdata_best, Apple Vision, W3C Web Speech API, MDN Web Speech, OpenAI Whisper, whisper.cpp, and Naver CLOVA.