#voice-cloning | Anand S - Things I Learned

Mon, May 18, 2026. Surprisingly, GPT Realtime Whisper (a new model) isn't as good as the older open-source Whisper models. Also, Gemini 3 Flash Preview is as good at transcription as Gemini 3.1 Pro Preview for up to medium-length text. LLM Audio Transcription benchmark #speech-to-text #tts #voice-cloning

Wed, Mar 25, 2026. Confessional AI training for honest model behaviour. Research probing for "absence", i.e. what the model is NOT saying #ai-coding #future #voice-cloning

Sun, Jan 25, 2026. Qwen3 TTS is impressive. It voice-clones, streams, and the tone/style can be controlled via prompts. The model is small. I ran it locally without flash-attn (which I couldn't get to work) and it took ~14 seconds to generate an audio file for 10 words on my GPU machine. Environment setup: #speech-to-text #tts #voice-cloning
```
uv venv --python 3.12
UV_TORCH_BACKEND=auto uv pip install -U qwen-tts
```

Thu, Jan 1, 2026. Grok Voice Agent API tops the speech-to-speech quality benchmark and is pretty cheap at 5c/min ($3/hr). #speech-to-text #voice-cloning

Sun, Dec 21, 2025. A clever trick to prevent voice models from speaking too quickly. Use a "stay silent" function call. Ref #speech-to-text #tts #voice-cloning
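
A minimal sketch of what such a tool might look like, written in the JSON-schema style that function-calling APIs commonly accept. The name `stay_silent`, its description, the `reason` parameter, and the `should_speak` helper are all hypothetical and not taken from the referenced post.

```python
import json
from typing import Optional

# Hypothetical tool definition: the model can call this instead of speaking.
STAY_SILENT_TOOL = {
    "type": "function",
    "name": "stay_silent",
    "description": (
        "Call this instead of replying when the user is still talking, "
        "thinking aloud, or has not asked for a response."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "reason": {"type": "string", "description": "Why no reply is needed."}
        },
        "required": [],
    },
}


def should_speak(tool_call_name: Optional[str]) -> bool:
    """Return False when the model chose the (hypothetical) stay-silent tool."""
    return tool_call_name != STAY_SILENT_TOOL["name"]


print(json.dumps(STAY_SILENT_TOOL, indent=2))
```

The idea is that the app plays audio only when `should_speak(...)` is true, so the model has an explicit way to say nothing.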

Fri, Dec 19, 2025. uvx --python 3.10 --with torchcodec demucs --two-stems=vocals -n htdemucs "song.mp3" separates vocals from music. #voice-cloning

Fri, Dec 19, 2025. I updated the TTS (text-to-speech) costs across Gemini and OpenAI at https://github.com/sanand0/openai-tts-cost. My current favorite (value for money) is Gemini 2.5 Flash Preview TTS. Good emotions, low price, and a single request can deliver a multi-voice podcast. Speed: ~25 seconds per minute of audio generated. #speech-to-text #tts #voice-cloning

Sun, Dec 14, 2025. Notes from One Year With ChatGPT Pro as a First Hire #chatgpt #prompt-engineering #speech-to-text #voice-cloning
- Each day I start a new Pro chat that will run for that entire day. I treat it as a colleague. I speak or type in whatever I am thinking about, including business problems, creative questions, experiments that worked or failed and feelings about particular decisions. I wear noise canceling earbuds and often run piano technique while the model is thinking. I listen to its response using the native “Read Aloud” feature, again while practicing, and stop to make notes in a physical notebook to collect inspiration. At the end of the day I ask that Pro model to summarize everything from that chat along with the notes I give it from my notebook, and that summary becomes our first prompt of the next day.
- Standard Voice Mode (SVM) can do things that Advanced Voice Mode (AVM) cannot and vice versa. SVM feels like it wants to talk forever, while AVM feels like it wants to get off the phone.
- Projects became the container for my daily Pro chats. I pull chats, notes and other files into project folders so I can reference them as static context.
- My scheduled tasks collection today consists of weekly lessons in math, ML and DL, design, market analysis and regular assessments of the UI and UX and copy on my company’s website.
- I let memory accumulate, then once a week I pruned it manually, removing entries that were no longer useful so that new memories could form.
- Connecting the ChatGPT macOS app to my terminal, using the Working with Apps feature, lets the Pro models essentially collaborate with Codex. Practicing collaborative context between these high end models fractals outward into a myriad of productive paths. I highly recommend exploring with 5.1 Pro connected to 5.1-Codex-Max (Very High) in a terminal. Tell Codex-5.1 that you have a buddy working with you today that can offer suggestions and review the work it does as we go. Then tell 5.1 Pro that you have a buddy that is working with you today and can apply any of the code changes we decide on. This is another form of “context priming” where I “set the scene” before jumping in.

Mon, Nov 24, 2025. Here are some AI experiments I'm planning to try with our marketing team: #ai-coding #automation #code-agents #prompt-engineering #voice-cloning
- Video Generation: Create marketing videos from text scripts in minutes
- Poster Generation: AI designs high-conversion posters from brief text inputs - notably Nano Banana Pro
- Synthetic Persona A/B Testing: LLM agents simulate 100K+ user behaviors to test designs before real users
- LLM-Powered A/B Automation: AgentA/B system runs experiments with AI-simulated traffic
- Vibe Coding Landing Pages: Marketers build production-ready pages in hours vs weeks
- On-demand Landing Pages: Generate pages for automated campaigns/products without human intervention
- Brand Voice Cloning at Scale: Train on company content to ensure consistency across 1000s of pieces
- Persona-Driven Content Synthesis: Use 1B+ personas to generate diverse content perspectives
- Competitive Intelligence Briefing: Real-time monitoring across millions of data points + data storytelling
- Marketing Analytics with LLMs: AI agents analyze complex datasets for insights
- Brand Compliance Checks: Ensure all content meets brand guidelines automatically
- Autonomous Blog Squads: AI agents identify trending topics / internal content, create data stories ready for review

Sat, Nov 22, 2025. Models read pretty fast, consuming input tokens at ~4K-20K words per second. It's the "speaking" (output token rate) that is the bottleneck. So shortening input doesn't matter as much as shortening output for latency. ChatGPT #chatgpt #speech-to-text #tts #voice-cloning
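
A rough way to see this, as a sketch: model latency as reading time plus speaking time. The input rate uses the low end of the range above. The output rate of 50 words per second and the prompt/response sizes are assumptions for illustration only.

```python
def estimated_latency_seconds(
    input_words: float,
    output_words: float,
    input_words_per_sec: float = 4_000,  # low end of the ~4K-20K range in the note
    output_words_per_sec: float = 50,    # assumed; not from the note
) -> float:
    """Latency = time to read the input + time to generate the output."""
    return input_words / input_words_per_sec + output_words / output_words_per_sec


base = estimated_latency_seconds(2_000, 500)
halved_input = estimated_latency_seconds(1_000, 500)
halved_output = estimated_latency_seconds(2_000, 250)

print(f"base: {base:.2f}s")
print(f"half the input: {halved_input:.2f}s")
print(f"half the output: {halved_output:.2f}s")
```

Under these assumptions, halving the input saves a fraction of a second, while halving the output roughly halves the total.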

Mon, Nov 10, 2025. ⭐ Over 3 months, I've recorded ~180 calls. Processing each costs ~1.25 cents (GPT-5) and 1 year's conversations cost ~$9. That's incredible value for money if I hired GPT-5 / Codex as a data-driven personal coach to guide me on: #voice-cloning
- What are my blindspots? That is, feedback people share with me that I ignore?
- What are the clusters of persona that I interact with and which of these have a positive and negative influence on me?
- Where am I being unreliable? Where am I being an asshole?
- Where are my expectations high? Where are they low? Where would the opposite have helped?
- Where do I quit early? Where do I persist? Where would the opposite have helped?
- What good habits should I continue? What bad habits should I stop?
- What are the strongest opportunities to thank or praise that I missed? Is there a pattern? What triggers could I use to build this habit?
- Where have I tried to change people? Where have people tried to change me?
- Where have I spotted wrong questions? That is, rather than answering the question, I spotted the more apt question and answered that instead?
- ... and a hundred other questions that I wouldn't even know to ask.
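
The ~$9 per year figure follows from the numbers above, assuming the call rate of the last 3 months holds for the full year:

```python
calls_in_3_months = 180
cost_per_call_cents = 1.25  # GPT-5 processing cost per call, from the note

# Assumption: the same call rate continues all year (4 quarters).
calls_per_year = calls_in_3_months * 4
annual_cost_dollars = calls_per_year * cost_per_call_cents / 100

print(f"{calls_per_year} calls/year cost ~${annual_cost_dollars:.2f}")
```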

Sun, Nov 2, 2025. OpenAI TTS costs are confusing. But in short #speech-to-text #tts #voice-cloning
- TTS-1 costs $15 / MChars (max 4,096 chars per request), which ends up at ~86c / hour
- GPT-4o Mini TTS costs ~$16 / MChars (max 2K tokens which is ~7,000 chars per request), which ends up at ~88c / hour. Very similar cost, effectively
- TTS-1 HD is twice TTS-1.
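
A sketch of the per-hour conversion for per-character pricing. The `CHARS_PER_HOUR` value is not a published figure: it is back-calculated from the TTS-1 numbers above (86c per hour at $15 / MChars implies ~57,300 characters per hour of speech).

```python
# Assumption: ~57,300 characters per hour of speech, inferred from
# "$15 / MChars ends up at ~86c / hour" in the note above.
CHARS_PER_HOUR = 57_300


def hourly_cost_dollars(price_per_million_chars: float) -> float:
    """Convert a $/MChars price into an approximate $/hour of generated speech."""
    return CHARS_PER_HOUR * price_per_million_chars / 1_000_000


tts1 = hourly_cost_dollars(15)     # TTS-1
tts1_hd = hourly_cost_dollars(30)  # TTS-1 HD, twice the TTS-1 price

print(f"TTS-1: ~${tts1:.2f}/hour, TTS-1 HD: ~${tts1_hd:.2f}/hour")
```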

Thu, Oct 16, 2025. Earlier we needed humans to label data for RLHF. Now we don't since AI can simulate it. This is a pattern. Once AI learns from a human, that human skill can be automated. How GPT-5 Thinks — OpenAI VP of Research Jerry Tworek #automation #future #voice-cloning

Thu, Oct 2, 2025. My laptop's mic is much better than my phone's mic, surprisingly. When recording conversations, it's better to leave my laptop open and record than use the phone's recording app. #voice-cloning

Fri, Aug 15, 2025. Learnings from a discussion on vibe-coding between Kunal Jain, Ravi Nadimpalli and me. #ai-coding #code-agents #llm-ops #prompt-engineering #voice-cloning
- On the Vibe Coding Process & Strategy
  - The 80/20 Rule is Real: The first 80% of a project is incredibly fast, but the final 20% (debugging, custom features, production-readiness) is extremely difficult and time-consuming.
  - Validation is the New Bottleneck: Since coding is now much faster, the critical, time-consuming task has shifted to reviewing, testing, and validating the LLM's output.
  - "Spec-Locking" is Crucial: Providing the LLM with detailed, well-defined, and "thinly sliced" specifications is essential for getting good results. Vague requests lead to poor outcomes.
  - It's Not Production-Ready (Yet): The consensus is that vibe coding is excellent for prototypes, demos, and go-to-market (GTM) activities but is not yet reliable for building production-grade applications from scratch.
  - Code is Brittle & Unstable: An application that works perfectly one day can inexplicably break the next, as the underlying agent might make undocumented changes.
- Impact on Roles & The Future of Work
  - The Rise of QC/Validation: The Quality Control (QC) function will become larger and more critical to manage the new challenge of validating AI-generated work.
  - Product Managers Shift Focus: PMs can move away from tedious documentation (like flowcharts) and focus more on high-level business strategy, using vibe coding to create quick prototypes.
  - Democratization of Building: It empowers non-coders to build functional apps and helps professionals upskill faster by "conversing" with an LLM on complex topics.
  - New Forms of Cheating: The technology is creating novel ways for people to cheat in interviews, such as using tools that provide real-time subtitles of answers.
  - The "Jagged Edge" of AI: The technology excels at certain tasks (like GTM content) but fails at others, creating new upstream bottlenecks where teams must rapidly generate more of the "AI-friendly" work.
- Practical Hacks & Takeaways
  - Meta-Prompting: Use an LLM to refine and improve your prompt before giving it to the final tool. This helps fill in gaps and add necessary detail.
  - Human-First Drafting: For creative or nuanced work (like writing), it's often better to write the first draft yourself and use the LLM to polish it, rather than starting with a generic AI draft.
  - Use Structured Prompts: For predictable and clean output, providing instructions in a structured format (JSON is OK but not needed) is highly effective.
  - LLM as a Judge: Use LLMs to evaluate and grade content, code, and other outputs, dramatically speeding up the review process.
  - Automate Learning & Documentation: Use tools to transcribe conversations automatically and create personalized revision quizzes from notes and documents.
  - Voice is a Powerful Modality: Using voice-to-code allows for capturing more complex ideas faster and can be done while multitasking (e.g., walking), capitalizing on "dead time."

Wed, Aug 13, 2025. Tavus is another AI avatar platform. #voice-cloning
- Synthesia. Market leader; $2.1B valuation; enterprise trusted. Good: Realism, enterprise features, templating. But: Price, usage caps, slower avatar setup
- HeyGen. Rapidly growing; $500M valuation. Good: Avatar realism, speed, affordability. But: Basic collaboration, support, scene complexity
- Colossyan. Favored L&D focus. Good: Interactive & educational tools, good value. But: Less polished avatars, slower renders
- D-ID. Frequently cited alternative. Good: Speed, flexibility, custom avatars. But: Watermarks, fewer templates
- Elai.io. Repeats in alternatives lists. Good: Storyboarding, educational formats. But: Limited templates, render time
- Hour One. Also common in alternative lists. Good: Photoreal avatars, expression control. But: Missing advanced features like screen capture
- Others. Niche or emerging tools. Good: Varies by platform. But: Less adoption, fewer reviews

Mon, Aug 4, 2025. What happens when LLMs play Chinese Whispers / the Telephone Game? Here are learnings. ChatGPT #llm-ops #voice-cloning #learning #lesson
- Drift increases faster than linear with hops.
- Bigger models do better, but constrained prompts (“Copy the text exactly; change nothing.”) have a bigger impact.
- Low temperature improves copying fidelity.
- But even after "forgetting", LLMs reproduce rare content if they're trained on it.

Sun, Aug 3, 2025. Claude Code tips from Things that didn't work by Armin Ronacher #automation #prompt-engineering #speech-to-text #voice-cloning #ai-coding
- Speech-to-text. Cannot stress this enough but talking to the machine means you’re more likely to share more about what you want it to do.
- I maintain some basic prompts and context for copy-pasting at the end or the beginning of what I entered.
- I ended up preloading executables on the PATH that override the default ones, steering Claude toward the right tools, e.g. running python asks it to use uv.
- I use the task tool frequently for basic parallelization and context isolation.
- Simply taking time to talk to the machine and give clear instructions outperforms elaborate pre-written prompts.
- Forcing myself to evaluate the automation has another benefit: I’m less likely to just blindly assume it helps me.

Wed, Jun 4, 2025. Vision language models heavily rely on past training and miss changes they don't expect. Ref #future #voice-cloning

Tue, Jun 3, 2025. At the moment, the best speech to text for Android appears to be ChatGPT's transcription. The default Android speech to text (which I thought was good) no longer feels adequate. Gemini mis-hears and doesn't wait till I'm done. Whisper ASR has poor noise cancellation and a 30 second limit. #chatgpt #speech-to-text #tts #voice-cloning

Sat, May 24, 2025. TTS typically costs $1/hour now. Gemini 2.5 Flash Preview TTS, Gemini 2.5 Pro Preview TTS, GPT 4o TTS, and GPT 4o Mini TTS are the current best-in-class text-to-speech models from the mainstream LLM providers. Assuming ~175 words per minute and 1 token ≈ ¾ words, 1 hour of speech ~ 10,300 words/hr ~ 13,800 input tokens ~ 75,000 audio tokens, it costs: #speech-to-text #tts #voice-cloning
- Gemini 2.5 Flash Preview TTS ($0.50/1 M input, $10.00/1 M output): ~$0.8 per hour
- GPT-4o-mini-TTS ($0.60/1 M input, $12.00/1 M output): ~$0.9/hour
- Gemini 2.5 Pro Preview TTS ($1.00/1 M input, $20.00/1 M output): ~$1.5 per hour
- GPT-4o-TTS (known as gpt-4o-audio-preview, $2.50/1 M input, $80/1 M output): ~$6.0/hour
- This is comparable to the earlier OpenAI Standard TTS ($0.75), OpenAI HD TTS ($1.5), Google Neural2 ($0.8). ElevenLabs Pro costs ~$6/hr.
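
The per-hour figures above can be reproduced with a short calculation. This sketch uses only the token counts and prices stated in this note. The function and variable names are illustrative.

```python
# Token volumes for 1 hour of speech, as estimated in the note.
INPUT_TOKENS_PER_HOUR = 13_800
AUDIO_TOKENS_PER_HOUR = 75_000

# (input $ per 1M tokens, output $ per 1M tokens), as listed in the note.
PRICES = {
    "Gemini 2.5 Flash Preview TTS": (0.50, 10.00),
    "GPT-4o-mini-TTS": (0.60, 12.00),
    "Gemini 2.5 Pro Preview TTS": (1.00, 20.00),
    "GPT-4o-TTS": (2.50, 80.00),
}


def cost_per_hour(input_price: float, output_price: float) -> float:
    """Dollar cost of one hour of speech: text tokens in, audio tokens out."""
    return (
        INPUT_TOKENS_PER_HOUR * input_price
        + AUDIO_TOKENS_PER_HOUR * output_price
    ) / 1_000_000


hourly = {model: cost_per_hour(*price) for model, price in PRICES.items()}
for model, dollars in hourly.items():
    print(f"{model}: ~${dollars:.2f}/hour")
```

The output (audio) tokens dominate: the input text contributes under 4 cents per hour for every model listed.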

Fri, Apr 25, 2025. LemonSlice showcases real-time audio-video models (avatars) that are close enough to real. #future #models #voice-cloning

Mon, Apr 7, 2025. Notes from ThursdAI - Apr 03 #speech-to-text #voice-cloning
- Nomic Embed Multimodal models are the current SOTA on multi-modal embeddings. Notably, they embed PDFs natively.
- Hailuo Speech-02 is the best speech model right now beating ElevenLabs. It has excellent voice cloning. Pricing: $30/1M chars. 10% of ElevenLabs, 2X of OpenAI TTS
- PaperBench is an open testing framework from OpenAI that requires models to replicate the research work in papers. It has ~8,000 tasks evaluated by LLMs and with LLMs judging the judges as well. The code is well worth studying.
- Runway Gen 4 was released with very high character consistency and longer durations
- Dreamina creates lip-synced videos from audio + a single image. Hedra is better for animated characters, though.
- Meta shared but has not released Mocha, an open character generation model that generates new characters speaking based on audio you provide. It is not based on existing images but the quality is very good
- All Hands has a free online version where you can fix GitHub issues.

Sun, Mar 30, 2025. LLM Native Multimodal image generation experiments: #chatgpt #gpu #image-generation #voice-cloning
- Stickers
  - Sending your wife AI-generated family photos, stickers, etc. is now a thing. Both an AI use case and a ... um... "family media" (?) use case. For example, ask ChatGPT to "Create a transparent comic-style sticker of a lady chef featuring this person happily cooking salad" with a photo. Then send it as a custom sticker. Image
  - Vadivelu stickers work well but the Tamil script generation is poor. Image
- Asking ChatGPT to generate 25-year younger pictures of people produces pretty poor results if you really knew what they looked like then. If you didn't, it's fairly convincing. Yet another example of "hallucinations" - except, it does have its uses.

Thu, Mar 27, 2025. Notes from Writing with AI #speech-to-text #voice-cloning
- Personal writing with connection won't go away. AI can't give you heartbreak. But the rest of non fiction writing will vanish.
- What AI is extraordinary at is personalizing to each audience member's interest
- Outlier opinions will thrive among humans - since AI is trained on consensus.
- Managers tend to be good at working with LLMs because it's mostly about delegation.
- LLMs are perfect for things that don't have a wrong answer! -- Benedict Evans.
- 💡 Explore arguing with AI. It's a safe way to get into a confrontational emotional state (which has its own benefits.)
- 💡 Keep an LLM on in voice mode while reading and ask it any questions you have.
- What models are good for what?
  - GPT 4.5 is great for creation - has a great sense of humor but a corporate style. Still, way better than GPT 4o.
  - ChatGPT is good for voice transcription and note taking. (Increasingly we take notes for AI rather than ourselves.)
  - Claude 3.7 has the best style of writing. It's also great for drawing charts.
  - O1 Pro and Deep Research are great for consumption - research.
  - Grok is the least corporate, is able to argue with you, and has the latest knowledge cutoff.
  - ElevenLabs for editing podcasts in your voice, making corrections.

Tue, Mar 25, 2025. The new GPT-4o mini Transcribe model is a bit better than Whisper and costs half: ~18 cents per hour. It includes background noise cancellation and semantic chunking, which is useful. #speech-to-text #tts #voice-cloning

Tue, Mar 25, 2025. The new GPT-4o mini TTS is about 3-4 times cheaper than TTS-1 since it's ~$12/MTok instead of $15/Mchar. It supports emotions with streaming. #speech-to-text #tts #voice-cloning

Sun, Mar 23, 2025. Phi-4 multimodal (https://huggingface.co/microsoft/Phi-4-multimodal-instruct) processes speech better than Whisper V3 on HuggingFace OpenASR, and images better than Gemini Flash Lite #future #huggingface #speech-to-text #tts #voice-cloning

Mon, Mar 10, 2025. Notes from ThursdAI, 6 Mar 2025 #ai-coding #voice-cloning
- Google's AI overviews now use Gemini 2.0. They've introduced an AI mode that functions like a mini deep research tool, incorporating planning and search. (A Perplexity-killer). It's a fine-tuned model that is extra cautious with topics like healthcare and always verifies information.
- QwQ from Qwen competes with DeepSeek R1, but with only 32B parameters compared to R1's several hundred billion.
- AI models are becoming less restrictive. Gemini and GPT-4.5 have relaxed some constraints, shifting more responsibility onto users, similar to Grok.
- What's GPT-4.5 good for? It seems to excel in creativity, humor, education, emotional intelligence, and teaching. It follows instructions better and understands intent better. However, it's not a major leap in coding or math.
- OpenAI's Deep Research mode always uses O3, regardless of the model selected in the UI.
- Tencent has released a new video model available at https://aivideo.hunyuan.tencent.com/ and it appears to be quite good.
- Many clients now support Model Context Protocol (MCP), including Cursor, Claude Code, and Claude Desktop. The clients list is long. Some MCP uses include:
  - Interact with GitHub using the GitHub API.
  - Using Knowledge Graph memory to remember previous conversations.
  - Using the Cloudflare MCP server to perform Cloudflare actions.
  - File retrieval and custom prompts -- which MCP supports in addition to tools.
  - Calling other MCPs or LLMs (conditionally) from an MCP, enabling the creation of full-fledged workflows.
- Composio offers a hosted MCP service. Cloudflare lets you build remote MCP servers.
- Notagen is an open-source note generation engine that produces high-quality classical sheet music.
- Sesame has an open-source voice model worth exploring.
- DiffRhythm is a music generation model that appears to be quite good.

Mon, Feb 24, 2025. Real-time speech-to-text options for transcription: #speech-to-text #voice-cloning
- Deepgram has a MediaRecorder API, which is perfect.
- Whisper Streaming Web is a web app that can transcribe audio real-time from the browser. A good approach, but I wouldn't use it for meeting transcription on my mid-end laptop. Streaming takes up the bulk of my GPU, leaving little for transcription.
- whisper-live runs as a Python console app and does something similar.
- Whisper WebGPU runs on the browser (only 200MB). Cool! But slow and still takes up GPU.

Mon, Feb 24, 2025. Mini-omni is an open-source Qwen-based LLM that can hear and talk while thinking in real-time. An interesting experiment, but not for prototyping. #llm-ops #speech-to-text #voice-cloning

Fri, Feb 21, 2025. Soon, you'll be able to send an LLM to a virtual meeting on your behalf. It will talk like you. Ethan Mollick #future #llm-ops #speech-to-text #voice-cloning

Fri, Jan 17, 2025. Audio diaries are a thing. Monash University asks students to voice their learnings, share it with each other and have them give feedback. I wonder if ChatGPT diaries could become a thing, too, and LLM journalling starts helping with therapy. #future #speech-to-text #voice-cloning

Sun, Jan 12, 2025. TTS Arena is a benchmark of text-to-speech models. Kokoro-TTS is the current leader. It's just 82M, runs on Google Colab, and sounds slightly better than OpenAI TTS. #speech-to-text #tts #voice-cloning

Wed, Jan 8, 2025. whisper-flow does real-time speech transcription! #future #speech-to-text #tts #voice-cloning

Wed, Jan 8, 2025. Switchboard-1 is a labelled audio corpus with ~260 hours of speech. It has ~2,400 calls among 500+ speakers in the US. #speech-to-text #voice-cloning #5478

Fri, Jan 3, 2025. Assembly AI offers speech to text with diarization at 12c/hour. Good diarization, average transcription quality. In comparison, WhisperX (with GPU) was much slower, had slightly poorer diarization, and slightly better transcription. #speech-to-text #tts #voice-cloning
```
uvx --python 3.9 --index https://download.pytorch.org/whl/cu121 whisperx --diarize --lang en --hf_token $HUGGINGFACE_TOKEN
```

Fri, Nov 29, 2024. GPT-4o Audio supports tone control via XML tags like <cough>..., <laugh>..., etc. But at ~$15/hr of output, it's too expensive. Ref #speech-to-text #tts #voice-cloning

Tue, Nov 26, 2024. Ultravox lets you build voice agents at 5c/min = $3/hr (OpenAI is 6c input, 24c output). Or clone their repo. #speech-to-text #voice-cloning
- Idle call time is counted towards cost. So cost may be higher than OpenAI.
- Voice cloning quality is average. Very distinctive voices are just partly identifiable.
- Supports tool calls (from their server).
- Their API is simple but the docs have minor errors (e.g. a trailing comma in the JSON, which leads to an error) reducing confidence.

Fri, Nov 8, 2024. Here is a prompt for audio transcription using Gemini. Ref #speech-to-text #voice-cloning
- Transcription: Accurately transcribe the audio clip in the original language. Include all spoken words, fillers, slang, colloquialisms, and any code-switching instances. Pay attention to dialects and regional variations common among immigrant communities. Do your best to capture the speech accurately, and flag any unintelligible portions with [inaudible].
- Translation: Translate the transcription into English. Preserve the original meaning, context, idiomatic expressions, and cultural references. Ensure that nuances and subtleties are accurately conveyed.
- Capture Vocal Nuances: Note vocal cues such as tone, pitch, pacing, emphasis, and emotional expressions that may influence the message. These cues are critical for understanding intent and potential impact.

Fri, Nov 8, 2024. ChatGPT for Windows desktop supports real-time voice and a global shortcut (Alt Space). #chatgpt #speech-to-text #voice-cloning

Mon, Nov 4, 2024. Hertz-Dev is an open source realtime voice chat model. But it doesn't fit in Google Colab T4's RAM #github #tts #voice-cloning

Tue, Oct 29, 2024. F5-TTS clones voices with just 15-second samples. #future #speech-to-text #tts #voice-cloning

Sun, Oct 27, 2024. ElevenLabs lets you create voices with a prompt. No need to even clone one! #speech-to-text #tts #voice-cloning

Wed, Oct 9, 2024. Reverb ASR does diarization as well as transcription. It seems to be the state of the art right now. #speech-to-text #tts #voice-cloning

Tue, Oct 8, 2024. Revisiting text to speech models. Nothing much has changed since July 2024. #speech-to-text #tts #voice-cloning
- OpenAI TTS: $15/1M chars Ref
- Deepgram Aura: $15/1M chars Ref
- Azure AI Speech: $15/1M chars Ref
- Google TTS Neural2: $16/1M chars Ref
- AWS Polly Neural TTS: $16/1M chars Ref
- Cartesia Pro: $50/1M chars Ref
- ElevenLabs Scale: $300/1M chars Ref

Thu, Oct 3, 2024. Speak is a language learning app based on OpenAI's Realtime API. #future #speech-to-text #voice-cloning

Thu, Oct 3, 2024. ChatGPT's advanced mode includes: "...you can use various regional accents and dialects." Ref Source #chatgpt #speech-to-text #tts #voice-cloning
- But the API can "laugh, whisper, and adhere to tone direction." Ref

Sat, Sep 21, 2024. Sarvam.ai offers Indic text to speech #speech-to-text #tts #voice-cloning

Tue, Sep 17, 2024. Segmind's Hallo lets you animate a face to an audio clip #speech-to-text #voice-cloning

Fri, Sep 13, 2024. Hume provides a voice-to-voice model (EVI 2) that handles emotions at 7 cents/minute. #speech-to-text #voice-cloning

Thu, Jul 25, 2024. Speech editing in audio files is a thing. Speech Editing Toolkit and Descript #speech-to-text #voice-cloning

Mon, Jul 8, 2024. A quick check on the pricing of text to speech models #speech-to-text #tts #voice-cloning
- OpenAI TTS: $15/1M chars Ref
- Deepgram Aura: $15/1M chars Ref
- ElevenLabs Scale: $165/1M chars Ref
- Google TTS Neural2: $16/1M chars Ref
- Azure AI Speech: $15/1M chars Ref
- AWS Polly Neural TTS: $16/1M chars Ref

Wed, May 29, 2024. Some audio embedding models: unoti/voice-embeddings, retkowsky/audio_embeddings, pyannote/embedding (for speaker similarity), and more. #embeddings #speech-to-text #voice-cloning

Sun, Apr 14, 2024. fal.ai "animates" pictures, creating videos. It made one from my talk. I morphed into various somewhat similar people rapidly in a 2-second span. Very promising, and far from good. #ai-art #future #image-generation #voice-cloning #write

Sun, Apr 14, 2024. Lemur from Assembly.ai does real time call transcription and summary #future #speech-to-text #voice-cloning

Sat, Mar 30, 2024. Hume.ai offers voice emotion API and emotion-based conversational responses. An empathic AI. #future #huggingface #speech-to-text #voice-cloning

Mon, Feb 26, 2024. MetaVoice 1B offers voice cloning on American & British accents with 30s training #speech-to-text #voice-cloning

Mon, Feb 26, 2024. AI scams are growing. Deepfakes scammed $34m. But voice fake for kidnapping is scarier. #voice-cloning

Mon, Feb 26, 2024. Buildspace's demo is a great demo of how voice and actions can be used effectively. #speech-to-text #voice-cloning

Fri, Feb 16, 2024. SORA is OpenAI's video generation model, and is stunning! #voice-cloning

Mon, Jan 15, 2024. ElevenLabs speech synthesis with voice cloning is at the uncanny valley. With two 5-minute samples, the clone sounds a fair bit like my voice but is very clearly not my voice. I find stability ~30%, similarity ~80% and style ~50% gives a reasonable outcome. But the default voices (e.g. Joseph, George, Charlie) are excellent. #speech-to-text #voice-cloning
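
For reference, a sketch of how those slider values might map onto an API request body. This assumes the `voice_settings` field names `stability`, `similarity_boost`, and `style` (each on a 0-1 scale) from the ElevenLabs text-to-speech API. `build_tts_payload` is a hypothetical helper, and the field names should be checked against the current API reference.

```python
import json


def build_tts_payload(
    text: str,
    stability: float = 0.30,         # ~30% stability, from the note
    similarity_boost: float = 0.80,  # ~80% similarity, from the note
    style: float = 0.50,             # ~50% style, from the note
) -> dict:
    """Build a JSON-serializable request body (hypothetical helper)."""
    return {
        "text": text,
        "voice_settings": {
            "stability": stability,
            "similarity_boost": similarity_boost,
            "style": style,
        },
    }


payload = build_tts_payload("Hello, this is my cloned voice.")
print(json.dumps(payload, indent=2))
```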

Sun, Dec 3, 2023. Tools to explore #speech-to-text #voice-cloning
- ElevenLabs speaks in your voice
- Cutout Pro removes backgrounds and parts of images
- Vocal Remover removes vocals from songs
- CapCut video editor

Sun, Dec 3, 2023. Meta released SeamlessExpressive which preserves emotions in speech-to-speech translations #embeddings #future #speech-to-text #tts #voice-cloning

Sat, Nov 4, 2023. SeamlessM4T does language and speech inter-conversion. CC-BY-NC license #document-conversion #speech-to-text #tts #voice-cloning