Show HN: I trained a 9M speech model to fix my Mandarin tones
The author built a specialized deep learning-based Computer-Assisted Pronunciation Training (CAPT) system to improve their own Mandarin tones. Frustrated by the limits of traditional pitch visualization and commercial pronunciation APIs, they trained a custom Conformer encoder with CTC (Connectionist Temporal Classification) loss on roughly 300 hours of transcribed speech from datasets such as AISHELL-1 and Primewords. By treating pinyin syllables and tones as distinct tokens, the system reports what the learner actually said, frame by frame, rather than auto-correcting mistakes the way standard ASR models do. The final 9M-parameter model was quantized down to 11MB, letting it run entirely on-device via onnxruntime-web without compromising accuracy. The project highlights the effectiveness of small, specialized models for language education.
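The separate-token idea can be illustrated with standard greedy CTC decoding. This is a minimal sketch, not the author's code: the vocabulary, token names, and the tiny score matrix below are hypothetical, chosen to show how a syllable token and a tone token decode independently, so a wrong tone is reported as heard instead of being corrected to the expected one.

```python
import numpy as np

# Hypothetical vocabulary: index 0 is the CTC blank; pinyin syllables
# and tone digits are independent tokens rather than fused units.
VOCAB = ["<blank>", "ma", "mo", "1", "2", "3", "4"]

def ctc_greedy_decode(logits: np.ndarray) -> list[str]:
    """Standard greedy CTC decoding: collapse repeats, drop blanks.

    logits: (frames, vocab) array of per-frame scores.
    Returns the decoded token sequence.
    """
    best = logits.argmax(axis=1)      # best token index per frame
    tokens = []
    prev = -1
    for idx in best:
        if idx != prev and idx != 0:  # skip repeated frames and blanks
            tokens.append(VOCAB[int(idx)])
        prev = idx
    return tokens

# Fabricated per-frame scores for a learner producing "ma" with tone 2:
# because the tone is its own token, the decoder emits "2" as heard,
# even if the reference text expected tone 3.
frames = np.array([
    [0.1, 0.8, 0.0, 0.0, 0.1, 0.0, 0.0],  # "ma"
    [0.1, 0.8, 0.0, 0.0, 0.1, 0.0, 0.0],  # "ma" again (collapsed)
    [0.9, 0.0, 0.0, 0.0, 0.1, 0.0, 0.0],  # blank
    [0.1, 0.0, 0.0, 0.0, 0.8, 0.1, 0.0],  # tone "2"
])

print(ctc_greedy_decode(frames))  # ['ma', '2']
```

A fused-token vocabulary ("ma2", "ma3", …) would push the model toward the most probable syllable-tone pair from training data, which is exactly the auto-correction behavior the author wanted to avoid.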