# CHITTI Voice Factory: Master Specification

**Version:** 1.0

**Date:** 2026-05-09

**Author:** Bryan Wilfred Pinto · drafted by Claude

**Status:** LIVING DOCUMENT. Every Claude session that touches Chitti voice / TTS / multi-language must read this first.

> "Every Indian's mother tongue, spoken back to them, legally, consensually, at zero marginal cost where we can."

---

## 0. Where this product sits

```
Chitti (parent brand at sahayai.in)
├── Chitti Shares
│   ├── Chitti Technical     (chitti_complete_technical.html)
│   └── Chitti Fundamentals  (chitti_fundamentals.html)
├── Chitti MedUPI            (chitti_medupi.html)
├── Chitti News              (chitti_news.html)
├── Chitti Vaani             (chitti_vaani.html + chitti-vaani-android/)
└── Chitti Voice Factory     (this spec)  ← SHARED VOICE SUBSTRATE
    │
    ├── Primary 12   Hindi · Bangla · Telugu · Tamil · Kannada · Malayalam ·
    │                Marathi · Gujarati · Odia · Assamese · Punjabi · Urdu
    └── Cousin 14    Bhojpuri · Chhattisgarhi · Maithili · Konkani · Tulu ·
                     Kodava · Dogri · Sindhi · Kashmiri · Manipuri · Bodo ·
                     Santhali · Sanskrit · Oraon (Kurukh)
```

**Chitti `` pages (e.g. `chitti_bangla.html`, `chitti_tulu.html`) are localised front doors over the same backend**, not 26 new products.

---

## 1. Product Overview

| Field | Value |
|---|---|
| **Product Name** | Chitti Voice Factory |
| **Tagline** | "Your mother tongue, spoken back to you." |
| **Category** | Multi-language voice substrate (TTS + STT + voice routing) |
| **Mission** | Give every Indian a Chitti that speaks the language they think in, using only legal, consented, openly-licensed voice. |
| **Target users** | All Chitti users, prioritised for the four-user contract (Blind / Deaf / Mute / Illiterate). |
| **Backend** | `chitti-voice-factory/backend/` · Flask · `https://chitti-voice-factory-api-production.up.railway.app` |
| **Frontend** | 26 generated `chitti_.html` pages + status dashboard `chitti_voice_factory.html` |

**Positioning:**

- ✅ IS a router over **legal, consented voice sources** (Bhashini, AI4Bharat, Sarvam, opt-in donors).
- ✅ IS the language-routing substrate every other Chitti product calls.
- ❌ NOT a voice-cloning product. We do not clone real anchors. We do not scrape Doordarshan / AIR / YouTube.
- ❌ NOT a deepfake platform. The "Chitti Male" personality voice comes from consenting volunteer donors.

---

## 2. Things we EXPLICITLY DECIDED NOT to do (and why)

This section exists so future Claude / future contributors do not "improve" these back in.

### 2.1 ❌ NO Doordarshan / Prasar Bharati / YouTube audio scraping

**Why blocked:** anchor voices are protected by personality rights (cf. *Anil Kapoor v. Simply Life India*, Delhi HC 2023; *Arijit Singh v. Codible Ventures*, Bombay HC 2024). Prasar Bharati holds broadcast copyright. Risk = takedown + named defendant.

**Use instead:** Bhashini, AI4Bharat, Sarvam, Mozilla Common Voice, opt-in donors.

### 2.2 ❌ NO "cousin = primary + grammar swap + voice morph"

**Why blocked:** Tulu is a separate Dravidian language, not Kannada-with-a-filter. Konkani has 4 dialects across 4 scripts. Morph output sounds like mockery to actual speakers.

**Use instead:** real per-language models from Bhashini / AI4Bharat. For Tulu + Kodava (no model exists): voice-donor program, NOT silent fallback.

### 2.3 ❌ NO claim of "<100 ms on-device, native-quality, cloned, 12 languages, 50–100 MB"

**Why blocked:** XTTS-v2 weights alone are ~1.8 GB, far above the claimed 50–100 MB footprint, so the claim cannot hold.

**Use instead:** measured per-language latency in the ledger. The cascade picks the supplier that actually meets the target.

---

## 3. Suppliers (cascade order)

Four named suppliers. The cascade tries them in this priority and records every attempt.
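The priority walk can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the code in `router.py`: the `SupplierResult` dataclass, the `run_cascade` function, and the two stub suppliers are hypothetical names introduced here, and latency is measured with a simple wall-clock timer.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple
import time


@dataclass
class SupplierResult:
    """Hypothetical result shape; the real supplier interface may differ."""
    ok: bool
    supplier: str
    error_code: Optional[str] = None
    latency_ms: Optional[int] = None


# A supplier is modelled as a callable taking (text, language).
Supplier = Callable[[str, str], SupplierResult]


def on_device(text: str, language: str) -> SupplierResult:
    # Stub: no packaged model yet, so it reports unavailable.
    return SupplierResult(ok=False, supplier="on_device", error_code="unavailable")


def mock_bhashini(text: str, language: str) -> SupplierResult:
    # Stub: always succeeds and labels itself honestly as a mock.
    return SupplierResult(ok=True, supplier="mock_bhashini")


def run_cascade(
    text: str, language: str, suppliers: List[Supplier]
) -> Tuple[Optional[SupplierResult], List[SupplierResult]]:
    """Try suppliers in priority order; return (winner or None, all attempts)."""
    attempts: List[SupplierResult] = []
    for supplier in suppliers:
        started = time.perf_counter()
        result = supplier(text, language)
        result.latency_ms = int((time.perf_counter() - started) * 1000)
        attempts.append(result)  # every attempt is kept so it can be logged
        if result.ok:
            return result, attempts  # first ok=True wins
    return None, attempts


winner, attempts = run_cascade("नमस्ते", "hi", [on_device, mock_bhashini])
print(winner.supplier, [a.supplier for a in attempts])
```

Returning the full attempt list alongside the winner keeps failed attempts visible, which is what lets a ledger record failures as well as successes.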

| # | Supplier | Role | Cost | Status today |
|---|---|---|---|---|
| 1 | `on_device` | downloaded ONNX model running in-browser via `onnxruntime-web` | zero (after one-time download) | placeholder: returns `unavailable` until models packaged |
| 2 | `bhashini` | Govt of India NLTM, primary source of truth | zero (citizen use) | **MOCK** until ULCA credentials issued. Mock returns a `client_directive: speech_synthesis` so the client uses browser TTS, with `supplier=mock_bhashini` honestly labelled in every response. |
| 3 | `ai4bharat` | IIT Madras IndicTTS / IndicParler-TTS, open weights | zero (self-hosted) or low (HF inference) | not yet wired |
| 4 | `sarvam` | paid commercial TTS, last resort | metered ₹/char | disabled in v1 |

**Cascade rule:** the router walks 1→2→3→4. The first supplier that returns `ok=True` wins. The winning supplier is recorded in the ledger and surfaced to the client as `supplier` and to the user as a verbal disclaimer.

---

## 4. Languages: per-language honest tiers (26 total)

### 4.1 Tier A: Production-ready (12 Primary)

All covered by Bhashini AND AI4Bharat IndicTTS. Web Speech API fallback exists for most.

| Language | ISO | Bhashini | AI4Bharat | Web Speech |
|---|---|---|---|---|
| Hindi | hi | ✅ | ✅ | ✅ |
| Bangla | bn | ✅ | ✅ | ✅ |
| Telugu | te | ✅ | ✅ | ✅ |
| Tamil | ta | ✅ | ✅ | ✅ |
| Kannada | kn | ✅ | ✅ | ✅ |
| Malayalam | ml | ✅ | ✅ | ✅ |
| Marathi | mr | ✅ | ✅ | ✅ |
| Gujarati | gu | ✅ | ✅ | ✅ |
| Odia | or | ✅ | ✅ | ⚠️ thin on iOS |
| Assamese | as | ✅ | ✅ | ❌ |
| Punjabi | pa | ✅ | ⚠️ Gurmukhi only | ✅ |
| Urdu | ur | ✅ | ❌ | ✅ |

### 4.2 Tier B: Covered but quality varies (11 Cousins)

| Language | ISO | Bhashini | AI4Bharat | Notes |
|---|---|---|---|---|
| Bhojpuri | bho | ⚠️ partial | ✅ Indic-Parler | AI4Bharat primary |
| Chhattisgarhi | hne | ⚠️ partial | ⚠️ corpus only | Donor program for top-up |
| Maithili | mai | ✅ | ⚠️ partial | Bhashini primary |
| Konkani | kok | ✅ Devanagari | ⚠️ partial | Flag for Roman/Kannada users |
| Dogri | doi | ✅ | ❌ | Bhashini-only |
| Sindhi | sd | ✅ Devanagari | ❌ | Arabic script flagged |
| Kashmiri | ks | ⚠️ partial | ❌ | Weakest Tier B; flag in honest_status |
| Manipuri (Meitei) | mni | ✅ | ✅ | Both scripts supported |
| Bodo | brx | ⚠️ partial | ✅ | AI4Bharat primary |
| Santhali | sat | ⚠️ partial | ❌ | Ol Chiki; donor program planned |
| Sanskrit | sa | ✅ partial | ⚠️ IndicTTS partial | Scheduled-22; Web Speech rarely has a Sanskrit voice |

### 4.3 Tier C: No production model (3 Cousins)

| Language | ISO | Plan |
|---|---|---|
| Tulu | tcy | **Donor program required.** v1 ships text-only with banner. NO silent fallback. |
| Kodava | kfa | **Donor program required.** v1 ships text-only with banner. NO silent fallback. |
| Oraon (Kurukh) | kru | **Donor program required.** Dravidian, ~2M speakers across Jharkhand/Chhattisgarh/Odisha/WB. v1 ships text-only with banner. |

---

## 5. Honest Status Ledger (SQLite, `voice_factory.sqlite`)

**Hard rule: NO fake data.
Every "available" claim is backed by a real synthesis row.**

```sql
CREATE TABLE synthesis_log (
  id            INTEGER PRIMARY KEY AUTOINCREMENT,
  language_code TEXT NOT NULL,
  supplier      TEXT NOT NULL,      -- 'on_device' | 'bhashini' | 'mock_bhashini' | 'ai4bharat' | 'sarvam'
  text_sha256   TEXT NOT NULL,      -- never log raw user text
  text_chars    INTEGER NOT NULL,
  bytes_out     INTEGER,            -- bytes of audio produced (0 if client-side directive)
  latency_ms    INTEGER,            -- measured wall-clock
  ok            INTEGER NOT NULL,   -- 1 success, 0 failure
  error_code    TEXT,               -- short token if !ok
  created_at    TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX ix_log_lang_time ON synthesis_log(language_code, created_at);

CREATE TABLE donor_consents (
  id                  INTEGER PRIMARY KEY AUTOINCREMENT,
  donor_handle        TEXT NOT NULL,   -- public attribution name
  language_code       TEXT NOT NULL,
  consent_text_sha256 TEXT NOT NULL,
  audio_proof_url     TEXT NOT NULL,   -- donor verbally states consent
  recorded_at         TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  revoked_at          TIMESTAMP
);
```

`/api/voice/status` returns `available:true` for a language ONLY IF **all four** are true:

1. Successful synthesis row in last 24 h (`ok=1`).
2. `latency_ms IS NOT NULL`.
3. Known supplier (one of the five listed).
4. Disclaimer text non-empty.

Otherwise `available:false` with a `reason`. **No language returns `available:true` from a hard-coded boolean.**

---

## 6. API Surface

Base: `https://chitti-voice-factory-api-production.up.railway.app`

| Method | Path | Purpose |
|---|---|---|
| GET | `/` | Banner: name, version, link to ledger |
| GET | `/health` | Liveness |
| GET | `/api/voice/status` | Per-language honest status (all 26) |
| GET | `/api/voice/status/` | One language detailed status |
| POST | `/api/voice/speak` | Synthesise (cascade); body `{text, language}` |
| GET | `/api/voice/languages` | The 26-language registry |
| GET | `/api/voice/ledger` | Full ledger (anonymised, sha256 only) |
| GET | `/api/voice/honest-banner/` | Per-language verbal disclaimer text |
| POST | `/api/voice/donate` | Volunteer donor signup (CC-BY-4.0 + audio proof) |
| GET | `/api/voice/donations` | Public donor list (no audio, just credit) |

### 6.1 `POST /api/voice/speak` response shapes

**Success (mock supplier, client-side TTS directive):**

```json
{
  "ok": true,
  "supplier": "mock_bhashini",
  "client_directive": "speech_synthesis",
  "text": "नमस्ते, मैं चिट्टी हूँ।",
  "language": "hi",
  "voice_lang_code": "hi-IN",
  "latency_ms": 18,
  "disclaimer": "MOCK supplier: replaces real Bhashini once NLTM credentials are issued. Voice is your device's built-in TTS. Not a real person."
}
```

**Success (real Bhashini, future):**

```json
{
  "ok": true,
  "supplier": "bhashini",
  "audio_url": "https://cdn.sahayai.in/voice/cache/sha256.../audio.mp3",
  "language": "hi",
  "latency_ms": 612,
  "disclaimer": "Voice via Bhashini (Govt of India NLTM). Not a real person."
}
```

**Failure (Tier C, no supplier available):**

```json
{
  "ok": false,
  "supplier": null,
  "language": "tcy",
  "reason": "voice_not_available",
  "human_message_en": "Chitti is still learning Tulu. We need volunteer voice donors.",
  "donor_url": "https://sahayai.in/voice_donor.html?lang=tcy"
}
```

---

## 7. Frontend pages

Each `chitti_.html`:

- Sticky **honest banner** in that language: status pill that reads `/api/voice/status/` every 60 s. Never a fake green tick.
- Four-user contract row (blind / deaf / mute / illiterate symbol header).
- One **🔊 Speak this** button on a textbox.
- One **🎙️ Donate my voice** button → `/api/voice/donate` (skippable).
- One **⬇️ Download voice model** button (when on-device model exists).
- All buttons have aria-labels. All verdicts are spoken first, written second.
- AI-not-a-doctor / AI-not-a-lawyer banner via `chitti_disclaimer.js`.
- SEBI banner NOT shown (this is not a finance product).

A status dashboard `chitti_voice_factory.html` shows the full 26-language ledger publicly.

---

## 8. Deploy

```yaml
# chitti-voice-factory/render.yaml
services:
  - type: web
    name: chitti-voice-factory
    runtime: python
    rootDir: chitti-voice-factory/backend
    plan: free
    buildCommand: pip install -r requirements.txt
    startCommand: gunicorn main:app --bind 0.0.0.0:$PORT --workers 2 --timeout 60
    envVars:
      - key: PYTHON_VERSION
        value: 3.11.10
      - key: ALLOWED_ORIGINS
        value: https://sahayai.in,https://www.sahayai.in
      - key: BHASHINI_USER_ID
        sync: false
      - key: BHASHINI_API_KEY
        sync: false
      - key: BHASHINI_INFERENCE_KEY
        sync: false
      - key: SARVAM_API_KEY
        sync: false
      - key: VOICE_FACTORY_DB
        value: /tmp/chitti_voice_factory.sqlite
      - key: VOICE_FACTORY_USE_MOCK_BHASHINI
        value: "1"
```

Frontend is static: `chitti_.html × 26` + `chitti_voice_factory.html` deploy alongside other Chitti pages on GitHub Pages.

---

## 9. Bhashini registration

To go from MOCK → real Bhashini we need:

- A registered Bhashini ULCA citizen account
- Inference API key
- Stated use case: **accessibility infrastructure for blind / illiterate users in 26 Indian languages, free at point of use, attribution to Bhashini on every audio response, no commercial redistribution**

Application body draft lives at `chitti-voice-factory/README.md` §3.

Until creds arrive, env var `VOICE_FACTORY_USE_MOCK_BHASHINI=1` keeps the mock active. Setting it to `0` and providing real keys flips Bhashini live with no other code change.

---

## 10. Build phases

| Phase | Scope | Status |
|---|---|---|
| **1. Spec** | This document | ✅ done 2026-05-09 |
| **2. Backend skeleton** | Flask app, SQLite ledger, 26-language registry, supplier interface, all 4 suppliers stubbed, `/api/voice/status` honestly returning `available:false` until real synthesis happens | ✅ in this commit |
| **3. Mock Bhashini supplier** | `mock_bhashini.py` returns `client_directive: speech_synthesis` so client uses browser TTS. Records to ledger with `supplier=mock_bhashini`. Hindi flips `available:true` after first successful call. | ✅ in this commit |
| **4. 26 HTML pages** | Generated from one template + i18n bundle | ✅ in this commit |
| **5. Status dashboard** | `chitti_voice_factory.html` rendering full ledger | ✅ in this commit |
| **6. Real Bhashini** | Wire `bhashini.py` ULCA client. Set `VOICE_FACTORY_USE_MOCK_BHASHINI=0`. | ⏳ awaiting NLTM creds |
| **7. AI4Bharat** | IndicTTS + IndicParler-TTS wrapper for Tier B | next |
| **8. Sarvam (paid)** | Last-resort fallback, rate-limited 100 chars/req | next |
| **9. Donor flow** | `/api/voice/donate` + Supabase audio storage | next |
| **10. On-device** | Quantised IndicTTS via `onnxruntime-web`, IndexedDB cache | next |

---

## 11. Non-negotiables

1. **No fake data.** A language is `available:true` only after a real (or honestly-labelled mock) synthesis row exists. The mock supplier is named `mock_bhashini` everywhere, never silently labelled `bhashini`.
2. **No scraping.** Doordarshan / AIR / YouTube are forbidden corpora.
3. **No closed-source costs hidden.** Sarvam is logged + rate-limited + only used after free suppliers fail.
4. **Volunteer-only donors for v1.** Compensation revisited at 100 donors.
5. **Donor revocation in 30 days.** `DELETE /api/voice/donate/` removes the voice from rotation in 24 h, retrains within 30 days.
6. **Tier C never silently falls back.** Tulu / Kodava users see the donor banner, not a Kannada voice with their text.
7. **The four-user contract holds.** No exceptions.
8. **Every audio response carries a disclaimer naming the supplier.** Spoken first, written second.

---

## 12. File layout (after Phase 2-5)

```
sahayai/
├── CHITTI_VOICE_FACTORY_MASTER_SPEC.md     ← this file
├── chitti_voice_factory.html               ← public status dashboard
├── chitti_.html × 26                       ← Phase 4 front doors
├── chitti-voice-factory/
│   ├── README.md
│   ├── render.yaml
│   ├── tools/
│   │   └── generate_lang_pages.py          ← Phase 4 generator
│   └── backend/
│       ├── main.py                         ← Flask app
│       ├── requirements.txt
│       ├── runtime.txt
│       ├── languages.py                    ← 26-language registry
│       ├── ledger.py                       ← SQLite synthesis_log
│       ├── router.py                       ← supplier cascade
│       ├── routes/
│       │   ├── __init__.py
│       │   └── voice.py
│       └── suppliers/
│           ├── __init__.py
│           ├── base.py
│           ├── on_device.py
│           ├── bhashini.py                 ← real (skipped if no creds)
│           ├── mock_bhashini.py            ← active until creds
│           ├── ai4bharat.py                ← stub for Phase 7
│           └── sarvam.py                   ← stub for Phase 8
```

---

## 13. Fluency Pipeline (added 2026-05-12)

> **Fluency ≠ Pronunciation.** Voice Factory ships *two independent* substrates:
>
> | Substrate | What it owns | Owner module |
> |---|---|---|
> | **Pronunciation** | How a sentence sounds (Bhashini cascade, donor voices, on-device TTS) | `services/voice_factory.py` + `suppliers/*` |
> | **Fluency** | Grammar, vocabulary, sentence patterns in each language | `services/fluency_*` + `data/fluency//` |
>
> A Chitti language page draws on both. The pipelines run independently: Bhashini ULCA registration does **not** block fluency ingestion.

### 13.1 Sources (in order of preference): **textbook_source field**

Every chunk carries a `textbook_source` field with one of three values:

| Value | Meaning | Where it comes from |
|---|---|---|
| `curriculum` | Real curriculum content from a textbook | NCERT direct PDFs OR archive.org mirrors of state-board books |
| `community` | Real in-language text from open community sources | Wikipedia REST API (60 curated topics, native titles via langlinks) |
| `cousin` | Borrowed chunk from a related language | Cousin mapping (hne/doi/kru→hi, brx→as, kfa→kn) |

**No chunk is faked**. Every entry has a real `source` URL on disk.

#### Discovery scripts

- `scripts/discover_ncert_urls.py`: HEAD-checks ~1,380 NCERT URL candidates per known suffix pattern. Records survivors to `data/ncert_urls_discovered.json`.
- `scripts/discover_archive_org.py`: searches archive.org for state-board / NCERT-translation mirrors across 10 regional languages. Records to `data/archive_urls_discovered.json`.
- `scripts/merge_discovered.py`: merges both into `data/discovered_textbook_urls.json`. `services/textbook_sources.py` reads this at import time (utf-8-sig to handle PowerShell-written BOMs).

#### Ingester channels

1. **NCERT direct** (`fluency_ingester.fetch_ncert_pdfs`): uses Python `requests`, capped at 30 PDFs/lang, 15s timeout.
2. **archive.org mirrors** (`fluency_ingester.fetch_archive_pdfs`): capped at 10 PDFs/lang, 25s timeout, 2s polite delay between successful downloads. Archive.org has flaky CDN servers; we accept the failures and move on (logged in `honest_status.errors`).
3. **Wikipedia REST API** (`fluency_ingester.fetch_wikipedia`): native title resolution via `services/wiki_langlinks.py`. The English title is a fallback when no native title is cached.
4. **Cousin mapping** (`fluency_ingester.copy_from_cousin`): only fires when channels 1-3 produced zero chunks (languages that yield nothing from those channels, e.g. those without their own Wikipedia).
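
The channel ordering above, and in particular the rule that cousin mapping fires only when channels 1-3 produce nothing, can be sketched as follows. This is an illustration under stated assumptions: `ingest_language`, the fetcher signatures, and the chunk dictionaries are hypothetical, not the actual `fluency_ingester` API, and the `fake_*` fetchers are stand-ins that return placeholder records instead of touching the network.

```python
from typing import Callable, Dict, List

Chunk = Dict[str, str]
Fetcher = Callable[[str], List[Chunk]]


def ingest_language(
    lang: str,
    primary_channels: List[Fetcher],
    cousin_channel: Fetcher,
    errors: List[str],
) -> List[Chunk]:
    """Run channels 1-3 in order; use the cousin channel only if they yield nothing."""
    chunks: List[Chunk] = []
    for fetch in primary_channels:
        try:
            chunks.extend(fetch(lang))
        except Exception as exc:  # a failed channel is recorded, never hidden
            errors.append(f"{fetch.__name__}: {exc}")
    if not chunks:
        chunks = cousin_channel(lang)
    return chunks


# Stand-in fetchers for demonstration only (hypothetical, no real corpus data).
def fake_ncert(lang: str) -> List[Chunk]:
    return [{"textbook_source": "curriculum", "source": "demo"}] if lang == "hi" else []


def fake_archive(lang: str) -> List[Chunk]:
    raise TimeoutError("simulated CDN timeout")


def fake_wikipedia(lang: str) -> List[Chunk]:
    return [] if lang == "kfa" else [{"textbook_source": "community", "source": "demo"}]


def fake_cousin(lang: str) -> List[Chunk]:
    return [{"textbook_source": "cousin", "source": "demo"}]


errors: List[str] = []
channels = [fake_ncert, fake_archive, fake_wikipedia]
hi_chunks = ingest_language("hi", channels, fake_cousin, errors)
kfa_chunks = ingest_language("kfa", channels, fake_cousin, errors)
print([c["textbook_source"] for c in hi_chunks])
print([c["textbook_source"] for c in kfa_chunks])
print(errors)
```

Collecting errors in a list instead of raising mirrors the "accept the failures and move on" stance for archive.org: one flaky channel does not stop the others from contributing.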

### 13.2 Pipeline stages

```
download  →  extract  →  chunk    →  embed      →  FAISS index   →  honest_status.json
   ↑            ↑          ↑           ↑              ↑                  ↑
 HTTP +      PyMuPDF    400-600    paraph-MM    IndexFlatIP     per-language ledger
 Wikipedia   (fitz)     char       L12-v2       (numpy          chunks/sources/
 REST                   sliding    CPU          cosine          errors/ready
                        window                  fallback)
```

Embeddings use `sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2`, a single multilingual model trained on 50+ languages. Its documented training list does not include every Chitti language, so retrieval quality for the lower-resource languages should be measured rather than assumed. FAISS is preferred; when `faiss-cpu` is missing we fall back to numpy cosine over `embeddings.npy`.

### 13.3 Layout

```
chitti-voice-factory/backend/
├── services/
│   ├── fluency_corpus.py      ← per-language store + search; degrades when ST/faiss missing
│   ├── fluency_ingester.py    ← NCERT + Wikipedia + cousin channels
│   ├── textbook_sources.py    ← 26-language source registry, WIKI_TOPICS, NCERT_URLS
│   └── wiki_langlinks.py      ← English-title → native-title resolver (cached)
├── routes/
│   └── fluency.py             ← /api/voice/fluency/{status,status/,search/,chunks/}
├── scripts/
│   ├── build_langlinks.py     ← one-shot: warm langlinks cache for 60 topics
│   ├── ingest_textbooks.py    ← orchestrator, parallel workers, two waves (Wikipedia then cousin)
│   ├── embed_all.py           ← embed + FAISS pass after deps install on target host
│   └── report_summary.py      ← per-language ingestion summary
└── data/fluency/
    ├── _report.json
    ├── wiki_langlinks_cache.json   (60 topics × ~150 langs each)
    └── /
        ├── _pdfs/              raw downloaded PDFs (NCERT)
        ├── chunks.jsonl        real text chunks with provenance
        ├── embeddings.npy      float32 matrix (created on Railway after pip install)
        ├── index.faiss         FAISS IndexFlatIP (created alongside embeddings)
        └── honest_status.json  what worked / what failed / fluency_ready bool
```

### 13.4 Second-run result (2026-05-12, post-discovery): 79,414 chunks, 55 curriculum PDFs

29.2-minute parallel run, 8 workers, with NCERT + archive.org discovery merged.
Per-language breakdown:

| Lang | Chunks | Curriculum PDFs | textbook sources |
|---|---:|---:|---|
| bn | 6,966 | 0 | Wikipedia |
| ta | 5,466 | 0 | Wikipedia (Tamil archive.org URLs all 401/403) |
| **hi** | **5,310** | **22 NCERT** | NCERT + Wikipedia |
| hne | 5,310 | (cousin) | Cousin from hi |
| kru | 5,310 | (cousin) | Cousin from hi |
| doi | 5,310 | (cousin) | Cousin from hi |
| **kn** | **5,203** | **2 archive.org** | archive.org + Wikipedia |
| kfa | 5,203 | (cousin) | Cousin from kn |
| **ml** | **3,830** | **2 archive.org** | archive.org + Wikipedia |
| **mr** | **3,678** | **3 archive.org** | archive.org + Wikipedia |
| **te** | **3,562** | **1 archive.org** | archive.org + Wikipedia |
| **ur** | **3,386** | **8 NCERT** | NCERT + Wikipedia |
| gu | 3,287 | 0 | Wikipedia |
| sd | 2,528 | 0 | Wikipedia |
| as | 2,506 | 0 | Wikipedia |
| brx | 2,506 | (cousin) | Cousin from as |
| **pa** | **2,286** | **2 archive.org** | archive.org + Wikipedia |
| **sa** | **1,972** | **25 NCERT** | NCERT (Ruchira Class 7/8/10) + Wikipedia |
| or | 1,893 | 0 | Wikipedia |
| bho | 1,510 | 0 | Wikipedia |
| sat | 735 | 0 | Wikipedia |
| kok | 537 | 0 | Wikipedia |
| tcy | 406 | 0 | Wikipedia |
| mai | 315 | 0 | Wikipedia |
| ks | 253 | 0 | Wikipedia |
| mni | 146 | 0 | Wikipedia |

**TOTAL: 79,414 real chunks across all 26 languages. 55 curriculum PDFs ingested. 0 languages failed.**

| `textbook_source` distribution | Lang count |
|---|---:|
| `curriculum` (NCERT/state-board) | 8 languages: hi, ur, sa, kn, ml, mr, te, pa |
| `community` (Wikipedia) | 13 languages: bn, ta, gu, or, as, sd, bho, mai, kok, ks, mni, sat, tcy |
| `cousin` (borrowed from related) | 5 languages: hne, doi, kru, brx, kfa |

`fluency_ready` is `false` for all 26 until the embedding pass runs (deferred to Railway py3.11 because local py3.14 lacks stable torch wheels). Run `python -m scripts.embed_all` on the deploy host to lift to ready.
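
Once `embeddings.npy` exists, the numpy cosine fallback mentioned in §13.2 amounts to a normalised dot product, which is also what an inner-product index computes over unit-length vectors. A minimal sketch, assuming a float32 matrix with one row per chunk; the function name `cosine_top_k` and the toy vectors are illustrative and not taken from `fluency_corpus.py`.

```python
import numpy as np


def cosine_top_k(embeddings: np.ndarray, query: np.ndarray, k: int = 5):
    """Return [(row_index, cosine_score)] for the k rows most similar to the query."""
    # Normalise rows and the query so a dot product equals cosine similarity.
    row_norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    emb_unit = embeddings / np.clip(row_norms, 1e-12, None)
    query_unit = query / max(float(np.linalg.norm(query)), 1e-12)

    scores = emb_unit @ query_unit
    k = min(k, scores.shape[0])
    top = np.argsort(-scores)[:k]  # highest score first
    return [(int(i), float(scores[i])) for i in top]


# Toy 2-D "embeddings"; real rows would come from np.load("embeddings.npy").
toy = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], dtype=np.float32)
hits = cosine_top_k(toy, np.array([1.0, 0.1], dtype=np.float32), k=2)
print(hits)
```

A full `argsort` is fine at the corpus sizes in the table above (a few thousand rows per language); a partial sort would only matter for much larger matrices.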

#### Known gaps

- **Tier A languages with Wikipedia-only corpus** (bn, ta, gu, or, as): their archive.org searches found mostly unrelated items (CIA reading room, legislative proceedings) or items requiring auth (BDRC). Pursuing state-board direct partnerships (WBBSE, TN Board, GSEB, BSE Odisha, SEBA) is the next step.
- **NCERT Urdu**: pattern guessing found 22 URLs (jujp/judp = Class 10 Jaan Pahechan / Door Pas) but earlier classes use different codes. NCERT publishes ~50 Urdu books; we currently capture ~16%.
- **Archive.org reliability**: ~70% of discovered archive.org PDFs failed mid-download (CDN servers ia601400, dn710101 frequently timing out from this network). A retry pass from a different network may recover many of these.

### 13.5 API surface (added)

| Method | Path | Purpose |
|---|---|---|
| GET | `/api/voice/fluency/status` | All 26 languages' honest fluency status |
| GET | `/api/voice/fluency/status/` | One language: chunks, sources, fluency_ready, source plan |
| GET | `/api/voice/fluency/search/?q=...&k=5` | Top-k similarity search over the language's chunks |
| GET | `/api/voice/fluency/chunks/?offset=&limit=` | Paginated chunk inspection |
| GET | `/api/voice/fluency//videos` | List user-added YouTube videos for this language |
| POST | `/api/voice/fluency//videos` | Queue a YouTube URL (rate-limited to 10/lang) |
| DELETE | `/api/voice/fluency//videos/` | Remove a queued/processed video record |
| POST | `/api/voice/fluency//videos/process?embed=1` | Fetch transcripts, append chunks to corpus, optionally re-embed |

### 13.6 YouTube video learning (added 2026-05-12)

Each language page exposes a **"📺 Teach Chitti with YouTube Videos"** section that lets any user feed a YouTube URL into the corpus.

- Storage: `data/fluency//videos.json` (auditable; per-language)
- Rate limit: `MAX_VIDEOS_PER_LANG = 10` (in `services/youtube_learner.py`)
- Transcript fetch (`youtube-transcript-api`): prefers a human-authored transcript in the target language, falls back to auto-generated, finally falls back to *translated*. The video record stores `auto_generated` so the UI can flag lower-quality contributions.
- Chunks land with `textbook_source = "community"` and `source = "youtube:"`. Audit-trail-equivalent to Wikipedia chunks.
- Embedding rebuild is opt-in (`?embed=1` on `/process`) since it is the slow step. Without `embed=1` the chunks are queryable via the keyword fallback search; the FAISS index updates on the next embed pass.
- HTML injection: `scripts/inject_youtube_ui.py` adds the section to all 26 `chitti_.html` pages idempotently (`data-chitti-section="youtube"` marker prevents double-injection).

#### Error codes returned by `/videos` endpoints

| Code | Meaning |
|---|---|
| `invalid_youtube_url` | URL didn't match any of the 5 known YouTube URL shapes |
| `duplicate` | Video already in queue for this language |
| `rate_limit_exceeded` | 10-videos-per-language cap reached |
| `video_unavailable` | YouTube returned video-unavailable |
| `transcripts_disabled` | The video has captions disabled |
| `no_transcript_for_language` | No transcript available; couldn't translate |
| `transcript_too_short` | Fetched transcript under `MIN_TRANSCRIPT_CHARS` (200) |
| `library_not_installed` | `youtube-transcript-api` missing on the host |

### 13.7 Honesty contract (additions to §11)

9. **No stub PDFs, no fake text.** Every chunk has a real `source` (NCERT URL, Wikipedia page, `cousin::`, or `youtube:`). The previous `ingest/ingest_master.py` that wrote `"STUB: Hindi Class 1 textbook"` into placeholder PDFs is **deprecated**; the production pipeline lives under `chitti-voice-factory/backend/scripts/`.
10. **`fluency_ready` requires embeddings on disk.** A language flips to `true` only when `chunks ≥ 50` AND `embeddings.npy` exists. Cousin-mapped languages can be ready but the UI must surface the cousin banner.
11. **404 = recorded.** NCERT URL changes, Wikipedia coverage gaps, and YouTube errors are logged to `honest_status.errors` / `videos.json[].error`. We do not invent content for missing sources.
12. **Auto-generated YouTube transcripts are flagged**, not silently mixed with human-authored ones (`video.auto_generated = true`). The UI surfaces this badge.
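
Rule 10 above can be expressed as a small filesystem check. The sketch below assumes the per-language directory holds `chunks.jsonl` (one chunk per non-empty line) and `embeddings.npy`, as in the §13.3 layout; the function name `fluency_ready` and the constant `MIN_CHUNKS` are illustrative, and the real service may compute the flag differently.

```python
from pathlib import Path
import tempfile

MIN_CHUNKS = 50  # threshold stated in rule 10


def fluency_ready(lang_dir: Path) -> bool:
    """True only when at least MIN_CHUNKS chunks AND embeddings.npy exist on disk."""
    chunks_file = lang_dir / "chunks.jsonl"
    embeddings_file = lang_dir / "embeddings.npy"
    if not chunks_file.is_file() or not embeddings_file.is_file():
        return False
    with chunks_file.open(encoding="utf-8") as fh:
        n_chunks = sum(1 for line in fh if line.strip())
    return n_chunks >= MIN_CHUNKS


# Demonstration in a throwaway directory with placeholder rows.
with tempfile.TemporaryDirectory() as tmp:
    lang_dir = Path(tmp)
    (lang_dir / "chunks.jsonl").write_text("{}\n" * 60, encoding="utf-8")
    before = fluency_ready(lang_dir)   # chunks present, embeddings missing
    (lang_dir / "embeddings.npy").touch()
    after = fluency_ready(lang_dir)    # both conditions now hold
print(before, after)
```

Deriving the flag from files on disk, instead of storing it as a hand-set boolean, keeps it consistent with the "no hard-coded `available:true`" rule in §5.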