AI Speech Technologies

This page is a collection of notes and links related to speech technologies, including Text-to-Speech (TTS), Speech-to-Text (STT), voice synthesis, voice cloning, and other related frippery in the modern space.

Resources

Field | Category | Date | Link | Notes
Generative Audio | Models | 2023 | bark | a text-prompted generative audio model
Speech Recognition | Libraries | 2025 | WhisperKit | a Swift package that integrates Whisper with Apple’s CoreML
Speech Recognition | Models | 2024 | WhisperLive | a real-time speech-to-text (transcription) system based on Whisper
Speech Recognition | Models | 2024 | moonshine | a family of models optimized for fast, accurate automatic speech recognition on resource-constrained devices
Speech Recognition | Models | 2023 | distil-whisper | a distilled version of Whisper that is 6 times faster than the original model
Speech Recognition | Models | 2022 | whisper.cpp | a C/C++ implementation of Whisper that runs on consumer hardware
Speech Recognition | Models | 2022 | whisper | a general-purpose speech recognition model from OpenAI
Speech Recognition | Tools | 2024 | audapolis | an editor for spoken-word audio with automatic transcription
Speech Recognition | Tools | 2023 | insanely-fast-whisper | an opinionated CLI for audio transcription
Speech Synthesis | Models | 2025 | csm | a speech generation model from Sesame that generates RVQ (residual vector quantization) audio codes from text and audio inputs
Speech Synthesis | Models | 2025 | Orpheus-TTS | an open-source text-to-speech system built on Llama-3b
Speech Synthesis | Models | 2024 | ChatTTS | a text-to-speech model designed specifically for dialogue scenarios, with decent prosody
Speech Synthesis | Models | 2024 | Real-Time-Voice-Cloning | a PyTorch implementation of a voice cloning model
Speech Synthesis | Models | 2024 | WhisperSpeech | a text-to-speech system built by inverting Whisper
Speech Synthesis | Models | 2023 | StyleTTS2 | a text-to-speech model that uses style diffusion
Speech Synthesis | Tools | 2025 | voice-pro | a tool for speech processing and voice cloning
Speech Synthesis | Tools | 2025 | edge-tts | a Python text-to-speech module that uses the Microsoft Edge online TTS service
Speech Synthesis | Tools | 2025 | podcastfy | a tool for generating podcasts from text
Speech Synthesis | Tools | 2024 | OpenVoice | a tool for accurate voice cloning with multilingual support and flexible style control
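The csm entry mentions RVQ (residual vector quantization) audio codes. As a rough illustration of the general idea, and not Sesame's implementation, the sketch below shows how RVQ turns one frame into a short list of codebook indices: each stage quantizes whatever error the previous stages left behind. The tiny hand-picked 2-D codebooks are hypothetical; real neural audio codecs learn their codebooks and work on high-dimensional encoder features.

```python
from typing import List, Sequence, Tuple

Vector = Tuple[float, ...]


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def rvq_encode(x: Sequence[float], codebooks: List[List[Vector]]) -> List[int]:
    """Return one index per stage; each stage quantizes the residual left by earlier stages."""
    residual = list(x)
    codes = []
    for book in codebooks:
        # Pick the codeword closest to the current residual.
        idx = min(range(len(book)), key=lambda i: sq_dist(residual, book[i]))
        codes.append(idx)
        # Subtract it so the next stage only sees the remaining error.
        residual = [r - c for r, c in zip(residual, book[idx])]
    return codes


def rvq_decode(codes: List[int], codebooks: List[List[Vector]]) -> List[float]:
    """Reconstruct a vector by summing the selected codeword from every stage."""
    dim = len(codebooks[0][0])
    out = [0.0] * dim
    for idx, book in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, book[idx])]
    return out


# Hypothetical codebooks for illustration only: a coarse first stage and a finer second stage.
CODEBOOKS: List[List[Vector]] = [
    [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)],
    [(0.0, 0.0), (0.25, 0.0), (0.0, 0.25), (-0.25, -0.25)],
]

if __name__ == "__main__":
    frame = (1.2, 0.1)
    codes = rvq_encode(frame, CODEBOOKS)
    print(codes, rvq_decode(codes, CODEBOOKS))
```

Each extra stage refines the approximation: with only the first codebook, the frame (1.2, 0.1) decodes to (1.0, 0.0); with both stages it is encoded as [1, 1] and decodes to (1.25, 0.0), which is closer to the input.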
