This page is a collection of notes and links related to AI speech technologies, including Text-to-Speech (TTS), Speech-to-Text (STT), voice synthesis, voice cloning, and other related frippery in the modern AI space.
Resources
Field | Category | Date | Link | Notes |
---|---|---|---|---|
Generative Audio | Models | 2023 | bark | a text-prompted generative audio model |
Speech Recognition | Libraries | 2025 | WhisperKit | a Swift package that integrates Whisper with Apple’s Core ML |
Speech Recognition | Models | 2024 | WhisperLive | a real-time speech-to-text (transcription) system based on Whisper |
Speech Recognition | Models | 2024 | moonshine | a family of models optimized for fast and accurate automatic speech recognition on resource-constrained devices |
Speech Recognition | Models | 2023 | distil-whisper | a distilled version of Whisper that is 6 times faster |
Speech Recognition | Models | 2022 | whisper.cpp | a C++ implementation of Whisper that can run on consumer hardware |
Speech Recognition | Models | 2022 | whisper | a general-purpose speech recognition model |
Speech Recognition | Tools | 2024 | audapolis | an editor for spoken-word audio with automatic transcription |
Speech Recognition | Tools | 2023 | insanely-fast-whisper | an opinionated CLI for audio transcription |
Speech Synthesis | Models | 2025 | csm | a speech generation model from Sesame that generates RVQ audio codes from text and audio inputs |
Speech Synthesis | Models | 2025 | Orpheus-TTS | an open-source text-to-speech system built on Llama-3b |
Speech Synthesis | Models | 2024 | ChatTTS | a text-to-speech model designed specifically for dialogue scenarios, with decent prosody |
Speech Synthesis | Models | 2024 | Real-Time-Voice-Cloning | a PyTorch implementation of a voice cloning model |
Speech Synthesis | Models | 2024 | WhisperSpeech | a text-to-speech system built by inverting Whisper |
Speech Synthesis | Models | 2023 | StyleTTS2 | a text-to-speech model that supports style diffusion |
Speech Synthesis | Tools | 2025 | voice-pro | a tool for speech processing and voice cloning |
Speech Synthesis | Tools | 2025 | edge-tts | a text-to-speech module that leverages the Microsoft Edge TTS API |
Speech Synthesis | Tools | 2025 | podcastfy | a tool for generating podcasts from text |
Speech Synthesis | Tools | 2024 | OpenVoice | a tool that enables accurate voice cloning with multilingual support and flexible style control |