OCR · STT · TTS — Voice and Text
Reading text out of images (OCR), transcribing human speech (STT), and the reverse, synthesizing a voice from text (TTS), run on different stacks but are often bundled together inside a single app.
1. Three areas at a glance
| Acronym | Expansion | Input → Output |
|---|---|---|
| OCR | Optical Character Recognition | image → text |
| STT | Speech-to-Text | speech → text |
| TTS | Text-to-Speech | text → speech |
None of these tasks reaches 100% accuracy. Postprocessing (spell correction, domain dictionaries, confidence display) is a natural part of the UX, not an afterthought.
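As a sketch of what confidence display can mean in practice, the helper below triages recognized words by a confidence score. The word shape loosely mirrors what engines such as Tesseract.js report per word, but the interface, the `triageWords` helper, and the threshold are all illustrative, not any library's API:

```typescript
// Hypothetical OCR word result: text plus a 0-100 confidence score.
interface OcrWord {
  text: string
  confidence: number
}

// Keep high-confidence words as-is; flag the rest for user review
// instead of silently accepting them.
function triageWords(words: OcrWord[], threshold = 80) {
  const accepted: string[] = []
  const needsReview: string[] = []
  for (const w of words) {
    if (w.confidence >= threshold) accepted.push(w.text)
    else needsReview.push(w.text)
  }
  return { accepted, needsReview }
}

const result = triageWords([
  { text: "안녕하세요", confidence: 96 },
  { text: "읾닮", confidence: 41 },
])
// result.accepted: ["안녕하세요"], result.needsReview: ["읾닮"]
```

The low-confidence bucket is what a correction UI would highlight for the user.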
2. Tesseract
Tesseract was developed at HP Labs starting around 1985, shelved in the late 1990s, and open-sourced in 2005. Google then sponsored it and drove development from 2006 through v3; today it is a community project under the Apache-2.0 license.
| Version | When | Event |
|---|---|---|
| 3.0 | 2010 | Multilingual support settled. |
| 4.0 | 2018 | LSTM-based neural engine added. |
| 5.0 | 2021 | Stable. Improved accuracy and speed. |
Tesseract.js (from the naptha project) compiles the Tesseract engine to WebAssembly (earlier versions used asm.js), so it runs in browsers and in Node.js.
```typescript
import Tesseract from "tesseract.js"

const { data: { text } } = await Tesseract.recognize(imageUrl, "kor+eng")
```
Language data (kor.traineddata, eng.traineddata) ships as separate files, in three variants: tessdata, tessdata_best, and tessdata_fast. best maximizes accuracy, fast minimizes size and latency, and tessdata is the compromise.
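One way to expose the accuracy/speed trade-off is to resolve the traineddata base URL from a variant choice. The three repository names below are the real tesseract-ocr GitHub repos, but the helper itself, the branch name, and its use as a Tesseract.js `langPath` option are assumptions of this sketch:

```typescript
// Map an accuracy/speed preference to one of the three official
// traineddata repositories. Branch name may differ per repo.
type TessdataVariant = "tessdata" | "tessdata_best" | "tessdata_fast"

function langPathFor(variant: TessdataVariant): string {
  return `https://raw.githubusercontent.com/tesseract-ocr/${variant}/main/`
}

// e.g. pass langPathFor("tessdata_fast") as the langPath option
// when creating a Tesseract.js worker.
```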
3. Cloud OCR and built-in OS
| Service | Notes |
|---|---|
| Google Cloud Vision OCR | High Korean accuracy. DOCUMENT_TEXT_DETECTION mode preserves layout. |
| Azure AI Vision (Read API) | Multilingual. Handwriting supported. |
| AWS Textract | First-class table and form structure extraction. |
| Naver Clova OCR | Korean-specialized. |
Cloud models are generally more accurate than Tesseract, but with cost and privacy trade-offs. Local-first apps usually reach for Tesseract.js or the OS's built-in OCR instead.
Built-in OS:
- Windows — Windows.Media.Ocr (OCR runtime, Windows 10, 2015+).
- macOS — Vision framework, VNRecognizeTextRequest (macOS 10.15+, 2019).
- iOS — the same Vision framework.
- Android — ML Kit Text Recognition (Google).
To call these from Tauri, OS-specific native code (a Rust crate or direct FFI) is needed.
4. Web Speech API
A browser API specified by the W3C (as a Community Group report rather than a finished standard). It is composed of two interfaces:
- SpeechRecognition (STT) — microphone input to text.
- SpeechSynthesis (TTS) — text to voice.
| Feature | Chrome/Edge | Safari | Firefox |
|---|---|---|---|
| SpeechSynthesis | O | O | O |
| SpeechRecognition | O (server-dependent) | O (16+) | X (behind a flag) |
Important: Chrome's SpeechRecognition is widely reported to send audio to Google servers for recognition rather than running fully offline. Environments with strict privacy requirements need a separate solution.
5. TTS example
```typescript
const u = new SpeechSynthesisUtterance("안녕하세요") // "Hello"
u.lang = "ko-KR"
u.rate = 1.0  // 0.1 – 10
u.pitch = 1.0 // 0 – 2
speechSynthesis.speak(u)

// available voice list (loaded asynchronously)
speechSynthesis.onvoiceschanged = () => {
  const ko = speechSynthesis.getVoices().filter(v => v.lang.startsWith("ko"))
  console.log(ko)
}
```
The voice list is provided by the OS and the browser. On Windows, Microsoft Heami and other SAPI voices fill the Korean slot; on macOS, voices such as Yuna do.
6. STT example
```typescript
const SR = (window as any).SpeechRecognition || (window as any).webkitSpeechRecognition
const sr = new SR()
sr.lang = "ko-KR"
sr.continuous = false
sr.interimResults = true
sr.onresult = (e: any) => {
  const t = e.results[e.results.length - 1][0].transcript
  console.log(t)
}
sr.start()
```
7. Cloud STT/TTS
| Service | Notes |
|---|---|
| OpenAI Whisper | 2022 OSS, MIT. Offline/on-prem possible. Korean supported. |
| Google Cloud Speech-to-Text | Multilingual. |
| Azure Speech | Real-time conversion. |
| Amazon Transcribe | Strong in US English. |
| Naver Clova Speech | Korean specialized. |
Whisper is an open-source model OpenAI released in September 2022. C++/WASM ports such as whisper.cpp (Georgi Gerganov, 2022) made local execution common. The official release lists 99 supported languages, including Korean.
Cloud TTS:
- Google Cloud TTS (WaveNet, Studio Voice).
- Azure Neural TTS (200+ voices).
- ElevenLabs, OpenAI TTS, Coqui TTS.
- Korean neural voices: Naver Clova Voice, Kakao i, Typecast.
8. Quirks of Korean OCR
Vertical writing — old documents and some designs. Tesseract handles it via page segmentation mode (PSM) options (e.g. --psm 5 for a vertically aligned text block).
Hanja mixed in — academic documents and classical texts. Specify kor together with chi_tra or chi_sim (e.g. kor+chi_tra).
Font variants — calligraphic and design fonts cause sharp accuracy drops.
Image preprocessing — binarization, contrast, deskew, and noise removal greatly affect accuracy. OpenCV or sharp is often used.
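As a minimal sketch of the binarization step only: the function below applies a fixed global threshold to a grayscale pixel buffer. Real pipelines typically use adaptive thresholding (e.g. Otsu via OpenCV) plus deskew and denoise; the fixed threshold here is purely illustrative:

```typescript
// Binarize a grayscale image (0-255 values) with a fixed global threshold.
// Pixels at or above the threshold become white (255), the rest black (0).
function binarize(pixels: Uint8Array, threshold = 128): Uint8Array {
  const out = new Uint8Array(pixels.length)
  for (let i = 0; i < pixels.length; i++) {
    out[i] = pixels[i] >= threshold ? 255 : 0
  }
  return out
}

// binarize(Uint8Array.of(30, 200)) → Uint8Array [0, 255]
```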
9. Quirks of Korean STT/TTS
STT:
- Proper nouns and neologisms — names, brands, and neologisms are frequently wrong. Domain dictionaries or candidate corrections.
- No formal/casual distinction — depending on how the model learned colloquial speech, casual register can come out awkward.
- Dialects — accuracy drops outside standard Korean.
- Number representation — "이천이십육년" (the year fully spelled out), "2026년", and "2026" all denote the same year, and STT output is ambiguous among them.
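Normalizing spelled-out numbers back to digits is one common postprocessing step for ambiguity like the above. The parser below handles Sino-Korean numerals up to the 억 (10^8) place; it assumes the numeral portion has already been isolated (e.g. the trailing "년" stripped) and is a sketch, not a production normalizer:

```typescript
// Sino-Korean numeral tables: digits, small place units, large section units.
const DIGITS: Record<string, number> = { 영: 0, 공: 0, 일: 1, 이: 2, 삼: 3, 사: 4, 오: 5, 육: 6, 칠: 7, 팔: 8, 구: 9 }
const SMALL: Record<string, number> = { 십: 10, 백: 100, 천: 1000 }
const BIG: Record<string, number> = { 만: 1e4, 억: 1e8 }

// Parse e.g. "이천이십육" → 2026. An implicit leading 일 is assumed
// before 십/백/천 ("천" alone reads as 1000).
function parseSinoKorean(s: string): number {
  let total = 0, section = 0, digit = 0
  for (const ch of s) {
    if (ch in DIGITS) digit = DIGITS[ch]
    else if (ch in SMALL) { section += (digit || 1) * SMALL[ch]; digit = 0 }
    else if (ch in BIG) { total += (section + digit) * BIG[ch]; section = 0; digit = 0 }
    else throw new Error(`unexpected character: ${ch}`)
  }
  return total + section + digit
}

// parseSinoKorean("이천이십육") → 2026
```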
TTS:
- Prosody — natural synthesis hinges on phrase-level prosody in Korean. Short SSML support helps.
- Loanword pronunciation — policy on whether to read English words with Korean pronunciation or English pronunciation.
- Numbers → Korean units — whether to read "1000" as "천" (thousand) or digit by digit as "일영영영" ("one-zero-zero-zero"). In running text the latter sounds unnatural.
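The unit-based reading can be sketched as the inverse of the STT normalization problem: render a number with 십/백/천 place units, dropping the redundant "일" before a unit ("1000" → "천", not "일천"). The range limit and the convention choices here are assumptions of this sketch:

```typescript
// Sino-Korean digit and place-unit readings (ones through thousands).
const DIGIT_KO = ["", "일", "이", "삼", "사", "오", "육", "칠", "팔", "구"]
const SMALL_KO = ["", "십", "백", "천"]

// Read a number 0..9999 in Sino-Korean with place units.
function readSinoKorean(n: number): string {
  if (n === 0) return "영"
  let out = ""
  for (let pos = 3; pos >= 0; pos--) {
    const d = Math.floor(n / 10 ** pos) % 10
    if (d === 0) continue
    // drop the leading 일 before a unit: 1000 → "천", not "일천"
    out += (d === 1 && pos > 0 ? "" : DIGIT_KO[d]) + SMALL_KO[pos]
  }
  return out
}

// readSinoKorean(1000) → "천"; readSinoKorean(2026) → "이천이십육"
```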
10. Common shape
```typescript
import Tesseract from "tesseract.js"

const worker = await Tesseract.createWorker("kor+eng", 1, {
  logger: (m) => console.log(m.status, m.progress),
})
const { data: { text, confidence } } = await worker.recognize(file)
await worker.terminate()
```
createWorker runs in a background Web Worker to avoid blocking the UI. Large model data (tens of MB) downloads on first run (cached in IndexedDB).
Forcing a TTS voice selection:
```typescript
async function speak(text: string) {
  await new Promise<void>((r) => {
    if (speechSynthesis.getVoices().length) return r()
    speechSynthesis.onvoiceschanged = () => r()
  })
  const voice = speechSynthesis.getVoices().find(v => v.lang === "ko-KR")
  const u = new SpeechSynthesisUtterance(text)
  if (voice) u.voice = voice
  speechSynthesis.speak(u)
}
```
Keeping STT start/stop state in a single source of truth:
```typescript
let recognizing = false

function start() {
  if (recognizing) return
  recognizing = true
  sr.start()
}

function stop() {
  if (!recognizing) return
  recognizing = false
  sr.stop()
}

sr.onend = () => { recognizing = false }
```
When onend and a user stop() race, state diverges. Manage in one place consistently.
11. Common pitfalls
Microphone permission — the permission dialog appears only over HTTPS within a user gesture. localhost is the exception.
SpeechRecognition cutting off — some browsers automatically end after a period. The pattern of continuous = true and restarting in onend is common.
Mobile browser compatibility — iOS Safari's SpeechRecognition is from 16+. Earlier versions need branching.
OCR PSM (Page Segmentation Mode) — auto split (--psm 3) does not fit every image. Single line is --psm 7, single word is --psm 8.
Cost of loading multiple language models — kor+eng is often required, but doubles memory and download. Get a user choice or restrict to one language matching the domain.
Whisper hallucination — there are reports of arbitrary text being produced in silent segments. Mitigations include VAD (Voice Activity Detection) preprocessing or filtering segments by timestamp and log probability.
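A sketch of the logprob-based filtering: the field names below mirror the per-segment statistics Whisper reports (avg_logprob, no_speech_prob), but the specific thresholds and the helper are illustrative, not an official recipe:

```typescript
// Shape modeled on the per-segment fields Whisper emits.
interface WhisperSegment {
  text: string
  avg_logprob: number
  no_speech_prob: number
}

// Drop segments that look like hallucinations: the model is both
// unconfident in the text AND fairly sure the audio was silence.
function dropLikelyHallucinations(
  segments: WhisperSegment[],
  minLogprob = -1.0,
  maxNoSpeech = 0.6,
): WhisperSegment[] {
  return segments.filter(
    s => !(s.avg_logprob < minLogprob && s.no_speech_prob > maxNoSpeech),
  )
}
```

Tuning both thresholds against real audio from the target domain matters more than the exact defaults.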
Closing thoughts
For OCR, STT, and TTS, expecting 100% accuracy is unrealistic. Postprocessing, confidence display, and a user-correction flow are half of the UX. Korean accuracy is often lower than English accuracy in more of these areas, so domain dictionaries help.
Next
- native-integrations
- loading-ux
We refer to Tesseract GitHub, Tesseract.js, tessdata_best, Apple Vision, W3C Web Speech API, MDN Web Speech, OpenAI Whisper, whisper.cpp, and Naver CLOVA.