#speech-to-text | Anand S - Things I Learned

Mon, Jan 26, 2026. Vercel's agent-browser seems a good CLI choice for browser automation, alongside playwright-cli. It may be worth switching from direct Playwright coding (on CDP). ChatGPT #chatgpt #code-agents #github #prompt-engineering #speech-to-text

Sun, Jan 25, 2026. Qwen3 TTS is impressive. It voice-clones, streams, and the tone/style can be controlled via prompts. The model is small. I ran it locally without flash-attn (which I couldn't get to work) and it took ~14 seconds to generate an audio file for 10 words on my GPU machine. Environment setup: #speech-to-text #tts #voice-cloning
```
uv venv --python 3.12
UV_TORCH_BACKEND=auto uv pip install -U qwen-tts
```

Thu, Jan 1, 2026. Grok Voice Agent API tops the speech-to-speech quality benchmark and is pretty cheap at 5c/min ($3/hr). #speech-to-text #voice-cloning

Sun, Dec 21, 2025. A clever trick to prevent voice models from speaking too quickly. Use a "stay silent" function call. Ref #speech-to-text #tts #voice-cloning
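
One reading of this trick: give the model a tool it can call instead of answering, so that staying quiet is an explicit option when the user has only paused. A minimal sketch under that assumption. The tool name `stay_silent`, its description, and the shape of the `model_turn` dictionary are all hypothetical and not taken from the referenced post or from any specific vendor API.

```python
# Hypothetical tool definition in the JSON-Schema style used by function-calling APIs.
# The name, description, and parameters are illustrative assumptions.
STAY_SILENT_TOOL = {
    "type": "function",
    "name": "stay_silent",
    "description": (
        "Call this instead of speaking when the user has paused "
        "but has not finished their thought."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "reason": {"type": "string", "description": "Why no reply is needed yet."}
        },
        "required": [],
    },
}


def should_play_audio(model_turn: dict) -> bool:
    """Return False when the model chose the stay-silent tool for this turn.

    `model_turn` is a hypothetical, simplified view of one model response:
    a dict that may contain a "tool_calls" list of {"name": ...} entries.
    """
    calls = model_turn.get("tool_calls", [])
    return not any(call.get("name") == STAY_SILENT_TOOL["name"] for call in calls)


print(should_play_audio({"tool_calls": [{"name": "stay_silent"}]}))
print(should_play_audio({"text": "Sure, here is the answer."}))
```

The application registers the tool with the voice session and simply plays nothing whenever the model picks it.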

Fri, Dec 19, 2025. I updated the TTS (text-to-speech) costs across Gemini and OpenAI at https://github.com/sanand0/openai-tts-cost. My current favorite (value for money) is Gemini 2.5 Flash Preview TTS. Good emotions, low price, and a single request can deliver a multi-voice podcast. Speed: ~25 seconds per minute of audio generated. #speech-to-text #tts #voice-cloning

Mon, Dec 15, 2025. I'm surprised that Edge's Read Aloud sounds more natural than ElevenReader. Read Aloud is one of the main reasons I'm using Edge, but I hadn't realized it was that good. #speech-to-text

Sun, Dec 14, 2025. Notes from One Year With ChatGPT Pro as a First Hire #chatgpt #prompt-engineering #speech-to-text #voice-cloning
- Each day I start a new Pro chat that will run for that entire day. I treat it as a colleague. I speak or type in whatever I am thinking about, including business problems, creative questions, experiments that worked or failed and feelings about particular decisions. I wear noise canceling earbuds and often run piano technique while the model is thinking. I listen to its response using the native “Read Aloud” feature, again while practicing, and stop to make notes in a physical notebook to collect inspiration. At the end of the day I ask that Pro model to summarize everything from that chat along with the notes I give it from my notebook, and that summary becomes our first prompt of the next day.
- Standard Voice Mode (SVM) can do things that Advanced Voice Mode (AVM) cannot and vice versa. SVM feels like it wants to talk forever, while AVM feels like it wants to get off the phone.
- Projects became the container for my daily Pro chats. I pull chats, notes and other files into project folders so I can reference them as static context.
- My scheduled tasks collection today consists of weekly lessons in math, ML and DL, design, market analysis and regular assessments of the UI and UX and copy on my company’s website.
- I let memory accumulate, then once a week I pruned it manually, removing entries that were no longer useful so that new memories could form.
- Connecting the ChatGPT macOS app to my terminal, using the Working with Apps feature, lets the Pro models essentially collaborate with Codex. Practicing collaborative context between these high end models fractals outward into a myriad of productive paths. I highly recommend exploring with 5.1 Pro connected to 5.1-Codex-Max (Very High) in a terminal. Tell Codex-5.1 that you have a buddy working with you today that can offer suggestions and review the work it does as we go. Then tell 5.1 Pro that you have a buddy that is working with you today and can apply any of the code changes we decide on. This is another form of “context priming” where I “set the scene” before jumping in.

Mon, Dec 1, 2025. YTScribe is yet another YouTube transcription service. #future #speech-to-text #tts

Mon, Nov 24, 2025. 1 second = 10 tokens for OpenAI Realtime APIs. 1 second = 25 tokens for Gemini Live API #pricing #speech-to-text #tts
- 39 cents / hour on GPT Realtime Mini = 36 cents audio input + 3 cents text output
- 130 cents / hour on GPT Realtime = 115 cents audio input + 15 cents text output
- 30 cents / hour on Gemini 2.5 Flash Native Audio (Live API) = 27 cents audio input + 3 cents text output
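
The audio-input part of these figures follows from the token rates. A minimal sketch, assuming audio input prices of $10/MTok (GPT Realtime Mini), $32/MTok (GPT Realtime), and $3/MTok (Gemini 2.5 Flash Native Audio). These prices are back-derived from the numbers above rather than quoted from a pricing page, so treat them as assumptions.

```python
SECONDS_PER_HOUR = 3600


def audio_input_cents_per_hour(tokens_per_second: float, usd_per_million_tokens: float) -> float:
    """Cost in cents of streaming one hour of audio input."""
    tokens = tokens_per_second * SECONDS_PER_HOUR
    return tokens * usd_per_million_tokens / 1_000_000 * 100


# model: (audio tokens per second, assumed $ per million audio input tokens)
MODELS = {
    "GPT Realtime Mini": (10, 10.0),
    "GPT Realtime": (10, 32.0),
    "Gemini 2.5 Flash Native Audio": (25, 3.0),
}

for name, (rate, price) in MODELS.items():
    print(f"{name}: {audio_input_cents_per_hour(rate, price):.0f} cents/hour of audio input")
```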

Sat, Nov 22, 2025. Models read pretty fast, consuming input tokens at ~4K-20K words per second. It's the "speaking" (output token rate) that is the bottleneck. So shortening input doesn't matter as much as shortening output for latency. ChatGPT #chatgpt #speech-to-text #tts #voice-cloning

Sun, Nov 2, 2025. OpenAI TTS costs are confusing. But in short #speech-to-text #tts #voice-cloning
- TTS-1 costs $15 / MChars (max 4,096 chars per request), which ends up at ~86c / hour
- GPT-4o Mini TTS costs ~$16 / MChars (max 2K tokens which is ~7,000 chars per request), which ends up at ~88c / hour. Very similar cost, effectively
- TTS-1 HD is twice TTS-1.

Sun, Sep 28, 2025. typst is a good LaTeX alternative. Markdown-like syntax with fast rendering. Mostly useful for researchers using LaTeX. But publishers / journals don't accept typst often. #markdown #speech-to-text

Mon, Sep 22, 2025. ChatGPT's output is too dense for me. I added this to my custom instructions: "Write in simple language. Explain non-obvious terms intuitively." #chatgpt #gpu #speech-to-text

Mon, Sep 8, 2025. Output tokens dominate latency. Decoding is sequential (one token depends on all prior tokens), so long completions are the main throttle. Shrinking returned text (e.g., send spans/tags instead of echoing paragraphs) yields a far bigger win on latency than shrinking inputs. #speech-to-text
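
A rough latency model makes the asymmetry concrete. This is a sketch, not a measurement: the prefill rate (10,000 tokens/s) and decode rate (50 tokens/s) are illustrative assumptions, and real systems add network and queueing time on top.

```python
def estimated_latency_seconds(
    input_tokens: int,
    output_tokens: int,
    prefill_tps: float = 10_000.0,  # assumed: input tokens are processed in parallel
    decode_tps: float = 50.0,       # assumed: output tokens are generated one by one
) -> float:
    """Very rough latency estimate: time to read the input plus time to write the output."""
    return input_tokens / prefill_tps + output_tokens / decode_tps


baseline = estimated_latency_seconds(2_000, 500)
half_input = estimated_latency_seconds(1_000, 500)
half_output = estimated_latency_seconds(2_000, 250)

print(f"baseline:    {baseline:.1f}s")
print(f"half input:  {half_input:.1f}s")
print(f"half output: {half_output:.1f}s")
```

Under these assumed rates, halving the input barely moves the total, while halving the output cuts it almost in half.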

Fri, Aug 22, 2025. DSPy auto-optimizes prompts based on input-output pairs or evals. Typical improvements are ~10-20%. My opinion: avoid. It's a good idea, but has too much abstraction that hides the implementation. Worth learning from but not implementing unless you (a) have evals + metrics and (b) you KNOW you need to change models and (c) it's a long-term project where the learning curve is worth it. Claude and ChatGPT #automation #code-agents #gpu #markdown #optimization #prompt-engineering #speech-to-text

Fri, Aug 15, 2025. For live transcription, Gemini 2.5 Flash Live costs 0.6c/min of audio ($3/MTok x 32 tokens/second) while GPT 4o Mini Realtime costs ~2c/min and GPT 4o Realtime costs ~8c/min. ChatGPT #chatgpt #speech-to-text #tts

Sun, Aug 3, 2025. Claude Code tips from Things that didn't work by Armin Ronacher #automation #prompt-engineering #speech-to-text #voice-cloning #ai-coding
- Speech-to-text. Cannot stress this enough but talking to the machine means you’re more likely to share more about what you want it to do.
- I maintain some basic prompts and context for copy-pasting at the end or the beginning of what I entered.
- I ended up preloading executables on the PATH that override the default ones, steering Claude toward the right tools, e.g. running python asks it to use uv.
- I use the task tool frequently for basic parallelization and context isolation.
- Simply taking time to talk to the machine and give clear instructions outperforms elaborate pre-written prompts.
- Forcing myself to evaluate the automation has another benefit: I’m less likely to just blindly assume it helps me.
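
The PATH-override tip above can be sketched as follows. This assumes a POSIX shell; the shim directory, the `write_shim` helper, and the message wording are hypothetical and not taken from the original post.

```python
import os
import stat
import tempfile
from pathlib import Path

# Hypothetical shim: refuses to run and points the agent at uv instead.
SHIM = """#!/bin/sh
echo "Do not call python directly. Use 'uv run python' instead." >&2
exit 1
"""


def write_shim(directory: Path, name: str = "python") -> Path:
    """Create an executable shim called `name` inside `directory`."""
    directory.mkdir(parents=True, exist_ok=True)
    shim = directory / name
    shim.write_text(SHIM)
    # Add execute permission for user, group, and others.
    shim.chmod(shim.stat().st_mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
    return shim


def path_with_shims_first(directory: Path, path: str) -> str:
    """Return a PATH string in which `directory` is searched before everything else."""
    return os.pathsep.join([str(directory), path])


shim_dir = Path(tempfile.mkdtemp()) / "shims"
shim = write_shim(shim_dir)
agent_path = path_with_shims_first(shim_dir, os.environ.get("PATH", ""))
print(f"shim written to {shim}")
print(f"first PATH entry for the agent: {agent_path.split(os.pathsep)[0]}")
```

Launching the agent with `agent_path` as its `PATH` means a bare `python` call hits the shim first and gets the nudge toward uv.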

Mon, Jul 28, 2025. Textual 4.0 supports Markdown streaming. Ref #markdown #speech-to-text

Tue, Jun 3, 2025. At the moment, the best speech to text for Android appears to be ChatGPT's transcription. The default Android speech to text (which I thought was good) no longer feels adequate. Gemini mis-hears and doesn't wait till I'm done. Whisper ASR has poor noise cancellation and a 30 second limit. #chatgpt #speech-to-text #tts #voice-cloning

Sat, May 24, 2025. TTS typically costs $1/hour now. Gemini 2.5 Flash Preview TTS, Gemini 2.5 Pro Preview TTS, GPT 4o TTS, and GPT 4o Mini TTS are the current best-in-class text-to-speech models from the mainstream LLM providers. Assuming ~175 words per minute and 1 token ≈ ¾ words, 1 hour of speech ~ 10,300 words/hr ~ 13,800 input tokens ~ 75,000 audio tokens, it costs: #speech-to-text #tts #voice-cloning
- Gemini 2.5 Flash Preview TTS ($0.50/1 M input, $10.00/1 M output): ~$0.8 per hour
- GPT-4o-mini-TTS ($0.60/1 M input, $12.00/1 M output): ~$0.9/hour
- Gemini 2.5 Pro Preview TTS ($1.00/1 M input, $20.00/1 M output): ~$1.5 per hour
- GPT-4o-TTS (known as gpt-4o-audio-preview, $2.50/1 M input, $80/1 M output): ~$6.0/hour
- This is comparable to the earlier OpenAI Standard TTS ($0.75), OpenAI HD TTS ($1.5), Google Neural2 ($0.8). ElevenLabs Pro costs ~$6/hr.
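
The per-hour figures in the list above can be reproduced from the stated assumptions (~13,800 input tokens and ~75,000 audio tokens per hour of speech). A minimal sketch using only the prices listed above; the function and constant names are illustrative.

```python
INPUT_TOKENS_PER_HOUR = 13_800   # text tokens for ~10,300 words
AUDIO_TOKENS_PER_HOUR = 75_000   # audio output tokens for one hour of speech

# model: ($ per 1M input tokens, $ per 1M output tokens), as listed above
PRICES = {
    "Gemini 2.5 Flash Preview TTS": (0.50, 10.00),
    "GPT-4o-mini-TTS": (0.60, 12.00),
    "Gemini 2.5 Pro Preview TTS": (1.00, 20.00),
    "GPT-4o-TTS": (2.50, 80.00),
}


def usd_per_hour(input_price: float, output_price: float) -> float:
    """Cost in dollars of generating one hour of speech."""
    total = INPUT_TOKENS_PER_HOUR * input_price + AUDIO_TOKENS_PER_HOUR * output_price
    return total / 1_000_000


for model, (input_price, output_price) in PRICES.items():
    print(f"{model}: ~${usd_per_hour(input_price, output_price):.1f} per hour")
```

Audio output tokens account for almost all of the cost; the text input is a rounding error.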

Mon, May 12, 2025. NVIDIA parakeet is a lightweight speech to text model that leads benchmarks. Installing such packages continues to be a nightmare due to PyTorch (despite uv). #gpu #speech-to-text

Thu, Apr 10, 2025. Claude 3.7 Sonnet with extended thinking has a token limit of over 64,000 tokens. Given its strong instruction-following capability, that makes it one of the most powerful models for transforming text. For example, transcription restyling, translations, XML to JSON conversions, PDF to XML, etc. #document-conversion #markdown #speech-to-text

Mon, Apr 7, 2025. Notes from ThursdAI - Apr 03 #speech-to-text #voice-cloning
- Nomic Embed Multimodal models are the current SOTA on multi-modal embeddings. Notably, they embed PDFs natively.
- Hailuo Speech-02 is the best speech model right now beating ElevenLabs. It has excellent voice cloning. Pricing: $30/1M chars. 10% of ElevenLabs, 2X of OpenAI TTS
- PaperBench is an open testing framework from OpenAI that requires models to replicate the research work in papers. It has ~8,000 tasks evaluated by LLMs and with LLMs judging the judges as well. The code is well worth studying.
- Runway Gen 4 was released with very high character consistency and longer durations
- Dreamina creates lip-synced videos from audio + a single image. Hedra is better for animated characters, though.
- Meta shared but has not released Mocha, an open character generation model that generates new characters speaking based on an audio you provide. It is not based on existing images but the quality is very good
- All Hands has a free online version where you can fix GitHub issues.

Thu, Apr 3, 2025. CSS Speech is a W3C spec that lets you control how screen readers should read pages. No browser support now, though. #html #speech-to-text #web-dev

Wed, Apr 2, 2025. No open source LLM-based tool handles live transcription and allows you to query notes so far during the transcription. The closest seems to be Meetily #llm-ops #speech-to-text

Fri, Mar 28, 2025. Gemini 2.5 Pro transcription has accurate timestamps and bounding boxes. Simon Willison #speech-to-text

Thu, Mar 27, 2025. Notes from Writing with AI #speech-to-text #voice-cloning
- Personal writing with connection won't go away. AI can't give you heartbreak. But the rest of non fiction writing will vanish.
- What AI is extraordinary at is personalizing to each audience member's interest
- Outlier opinions will thrive among humans - since AI is trained on consensus.
- Managers tend to be good at working with LLMs because it's mostly about delegation.
- LLMs are perfect for things that don't have a wrong answer! -- Benedict Evans.
- 💡 Explore arguing with AI. It's a safe way to get into a confrontational emotional state (which has its own benefits.)
- 💡 Keep an LLM on in voice mode while reading and ask it any questions you have.
- What models are good for what?
  - GPT 4.5 is great for creation - has a great sense of humor but a corporate style. Still, way better than GPT 4o.
  - ChatGPT is good for voice transcription and note taking. (Increasingly we take notes for AI rather than ourselves.)
  - Claude 3.7 has the best style of writing. It's also great for drawing charts.
  - O1 Pro and Deep Research are great for consumption - research.
  - Grok is the least corporate, able to argue with you, and the latest knowledge cutoff.
  - ElevenLabs for editing podcasts in your voice, making corrections.

Tue, Mar 25, 2025. The new GPT-4o mini Transcribe model is a bit better than Whisper and costs half: ~18 cents per hour. It includes background noise cancellation and semantic chunking, which is useful. #speech-to-text #tts #voice-cloning

Tue, Mar 25, 2025. The new GPT-4o mini TTS is about 3-4 times cheaper than TTS-1 since it's ~$12/MTok instead of $15/Mchar. It supports emotions with streaming. #speech-to-text #tts #voice-cloning

Sun, Mar 23, 2025. Phi-4 multimodal (https://huggingface.co/microsoft/Phi-4-multimodal-instruct) processes speech better than Whisper V3 on HuggingFace OpenASR, and images better than Gemini Flash Lite #future #huggingface #speech-to-text #tts #voice-cloning

Sun, Mar 2, 2025. YayText converts text to Unicode that has strikethrough, bold, italics, alternate fonts, and other interesting features. So do Unitextify, ConvertCase, and LingoJam. #speech-to-text #tts

Mon, Feb 24, 2025. Real-time speech-to-text options for transcription: #speech-to-text #voice-cloning
- Deepgram has a MediaRecorder API, which is perfect.
- Whisper Streaming Web is a web app that can transcribe audio in real time from the browser. A good approach, but I wouldn't use it for meeting transcription on my mid-range laptop. Streaming takes up the bulk of my GPU, leaving little for transcription.
- whisper-live runs as a Python console app and does something similar.
- Whisper WebGPU runs on the browser (only 200MB). Cool! But slow and still takes up GPU.

Mon, Feb 24, 2025. Mini-omni is an open-source Qwen-based LLM that can hear and talk while thinking in real-time. An interesting experiment, but not for prototyping. #llm-ops #speech-to-text #voice-cloning

Fri, Feb 21, 2025. Soon, you'll be able to send an LLM to a virtual meeting on your behalf. It will talk like you. Ethan Mollick #future #llm-ops #speech-to-text #voice-cloning

Fri, Jan 17, 2025. Audio diaries are a thing. Monash University asks students to voice their learnings, share it with each other and have them give feedback. I wonder if ChatGPT diaries could become a thing, too, and LLM journalling starts helping with therapy. #future #speech-to-text #voice-cloning

Tue, Jan 14, 2025. I switched back from Brave to Edge, mainly because Edge's native text-to-speech and speech recognition are far better. I can use them more easily on my mobile. #speech-to-text

Sun, Jan 12, 2025. TTS Arena is a benchmark of text-to-speech models. Kokoro-TTS is the current leader. It's just 82M, runs on Google Colab, and sounds slightly better than OpenAI TTS. #speech-to-text #tts #voice-cloning

Wed, Jan 8, 2025. whisper-flow does real-time speech transcription! #future #speech-to-text #tts #voice-cloning

Wed, Jan 8, 2025. Switchboard-1 is a labelled audio corpus with ~260 hours of speech. It has ~2,400 calls among 500+ speakers in the US. #speech-to-text #voice-cloning #5478

Fri, Jan 3, 2025. Assembly AI offers speech to text with diarization at 12c/hour. Good diarization, average transcription quality. In comparison, WhisperX (with GPU) was much slower, had slightly poorer diarization, and slightly better transcription. #speech-to-text #tts #voice-cloning
```
uvx --python 3.9 --index https://download.pytorch.org/whl/cu121 whisperx --diarize --lang en --hf_token $HUGGINGFACE_TOKEN
```

Wed, Dec 18, 2024. prompt. Ask video generators like SORA to generate text in videos. It is of average quality. #future #speech-to-text #hard

Wed, Dec 18, 2024. GPT 4o Mini Realtime was released. A realtime conversation will cost ~50c/hr. That is about 36c per hour of audio input and 72c per hour of audio output, which averages out to roughly 50c/hr when listening and speaking time are about equal. (I extrapolated from the 6c/min audio input cost for GPT 4o Realtime when it was $100/MTok. GPT 4o Mini Realtime is $10/MTok input and $20/MTok output.) #chatgpt #pricing #speech-to-text #tts
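
The extrapolation can be spelled out as arithmetic. A minimal sketch using the figures in the note; the even split between input and output time is an assumption about how the ~50c/hr figure arises, not something the pricing page states.

```python
# GPT 4o Realtime: 6 cents per minute of audio input at $100 per million tokens.
CENTS_PER_MINUTE = 6
USD_PER_MILLION_TOKENS = 100

# Implied token rate: dollars per minute divided by dollars per token.
tokens_per_minute = (CENTS_PER_MINUTE / 100) / (USD_PER_MILLION_TOKENS / 1_000_000)
tokens_per_hour = tokens_per_minute * 60


def cents_per_hour(usd_per_million_tokens: float) -> float:
    """Cost in cents of one hour of audio at the implied token rate."""
    return tokens_per_hour * usd_per_million_tokens / 1_000_000 * 100


input_cents = cents_per_hour(10)   # GPT 4o Mini Realtime audio input
output_cents = cents_per_hour(20)  # GPT 4o Mini Realtime audio output

# Assumption: the conversation is half listening (input) and half speaking (output).
blended_cents = 0.5 * input_cents + 0.5 * output_cents

print(f"implied rate: {tokens_per_minute:.0f} tokens/minute")
print(f"input: {input_cents:.0f}c/hr, output: {output_cents:.0f}c/hr, blended: {blended_cents:.0f}c/hr")
```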

Wed, Dec 4, 2024. Fish eye text summary is a great way to read text while summarizing context. Amelia Wattenberger #speech-to-text

Fri, Nov 29, 2024. GPT-4o Audio supports tone control via XML tags like <cough>..., <laugh>..., etc. But at ~$15/hr of output, it's too expensive. Ref #speech-to-text #tts #voice-cloning

Tue, Nov 26, 2024. Ultravox lets you build voice agents at 5c/min = $3/hr (OpenAI is 6c/min input, 24c/min output). Or clone their repo. #speech-to-text #voice-cloning
- Idle call time is counted towards cost. So cost may be higher than OpenAI.
- Voice cloning quality is average. Very distinctive voices are just partly identifiable.
- Supports tool calls (from their server).
- Their API is simple but the docs have minor errors (e.g. a trailing comma in the JSON, which leads to an error) reducing confidence.

Wed, Nov 20, 2024. Alt Text will very likely be a browser feature. It's important for the Alt text to flow as part of the content when listening to the page. Perhaps even become a part of the browser APIs like speechRecognition. #future #html #speech-to-text

Mon, Nov 11, 2024. Gemini transcription does not give accurate timestamps. Whisper does. But the quality of transcription is similar. #speech-to-text #tts

Fri, Nov 8, 2024. Here is a prompt for audio transcription using Gemini. Ref #speech-to-text #voice-cloning
- Transcription: Accurately transcribe the audio clip in the original language. Include all spoken words, fillers, slang, colloquialisms, and any code-switching instances. Pay attention to dialects and regional variations common among immigrant communities. Do your best to capture the speech accurately, and flag any unintelligible portions with [inaudible].
- Translation: Translate the transcription into English. Preserve the original meaning, context, idiomatic expressions, and cultural references. Ensure that nuances and subtleties are accurately conveyed.
- Capture Vocal Nuances: Note vocal cues such as tone, pitch, pacing, emphasis, and emotional expressions that may influence the message. These cues are critical for understanding intent and potential impact.

Fri, Nov 8, 2024. ChatGPT for Windows desktop supports real-time voice and a global shortcut (Alt Space). #chatgpt #speech-to-text #voice-cloning

Mon, Nov 4, 2024. Recraft.ai is currently SOTA in text to image. It's fairly impressive and could be a good alternative to Figma. #ai-art #future #image-generation #speech-to-text

Mon, Nov 4, 2024. Artificial Analysis has a bunch of new leaderboards and arenas. #speech-to-text #tts
- Open AI TTS leads the TTS Leaderboard. ElevenLabs is a bit behind.
- Recraft V3 > Flux 1.1 leads Text to Image Leaderboard

Tue, Oct 29, 2024. F5-TTS clones voices with just 15-second samples. #future #speech-to-text #tts #voice-cloning

Sun, Oct 27, 2024. Elevenlabs lets you create voices with a prompt. No need to even clone one! #speech-to-text #tts #voice-cloning

Wed, Oct 9, 2024. Reverb ASR does diarization as well as transcription. It seems to be the state of the art right now. #speech-to-text #tts #voice-cloning

Tue, Oct 8, 2024. Revisiting text to speech models. Nothing much has changed since July 2024. #speech-to-text #tts #voice-cloning
- OpenAI TTS: $15/1M chars Ref
- Deepgram Aura: $15/1M chars Ref
- Azure AI Speech: $15/1M chars Ref
- Google TTS Neural2: $16/1M chars Ref
- AWS Polly Neural TTS: $16/1M chars Ref
- Cartesia Pro: $50/1M chars Ref
- Elevenlabs Scale: $300/1M chars Ref

Thu, Oct 3, 2024. Speak is a language learning app based on OpenAI's Realtime API. #future #speech-to-text #voice-cloning

Thu, Oct 3, 2024. OpenAI's Realtime API can be used in a text-to-text chat mode without needing to send the entire context. If the pricing works out right, this can be far cheaper than sending the entire conversation context. Ref #speech-to-text #tts

Thu, Oct 3, 2024. ChatGPT's advanced mode includes: "...you can use various regional accents and dialects." Ref Source #chatgpt #speech-to-text #tts #voice-cloning
- But the API can "laugh, whisper, and adhere to tone direction." Ref

Sun, Sep 22, 2024. Sentient lets you control the browser via Python in natural language #prompt-engineering #python #speech-to-text

Sat, Sep 21, 2024. Sarvam.ai offers Indic text to speech #speech-to-text #tts #voice-cloning

Tue, Sep 17, 2024. Segmind's Hallo lets you animate a face to an audio clip #speech-to-text #voice-cloning

Fri, Sep 13, 2024. Hume provides a voice-to-voice model (EVI 2) that handles emotions at 7 cents/minute. #speech-to-text #voice-cloning

Thu, Sep 12, 2024. Pixtral seems quite good at OCR #document-conversion #future #speech-to-text

Tue, Aug 20, 2024. Lumentis creates docs from transcripts and text #document-conversion #future #github #speech-to-text

Thu, Jul 25, 2024. Speech editing in audio files is a thing. Speech Editing Toolkit and Descript #speech-to-text #voice-cloning

Mon, Jul 8, 2024. A quick check on the pricing of text to speech models #speech-to-text #tts #voice-cloning
- OpenAI TTS: $15/1M chars Ref
- Deepgram Aura: $15/1M chars Ref
- Elevenlabs Scale: $165/1M chars Ref
- Google TTS Neural2: $16/1M chars Ref
- Azure AI Speech: $15/1M chars Ref
- AWS Polly Neural TTS: $16/1M chars Ref

Wed, May 29, 2024. Some audio embedding models: unoti/voice-embeddings, retkowsky/audio_embeddings, pyannote/embedding (for speaker similarity), and more. #embeddings #speech-to-text #voice-cloning

Sun, Apr 14, 2024. Lemur from Assembly.ai does real time call transcription and summary #future #speech-to-text #voice-cloning

Sat, Mar 30, 2024. Hume.ai offers voice emotion API and emotion-based conversational responses. An empathic AI. #future #huggingface #speech-to-text #voice-cloning

Fri, Mar 29, 2024. pyannote-audio does speaker diarization #speech-to-text

Mon, Feb 26, 2024. MetaVoice 1B offers voice cloning on American & British accents with 30s training #speech-to-text #voice-cloning

Mon, Feb 26, 2024. Buildspace's demo is a great example of how voice and actions can be used effectively. #speech-to-text #voice-cloning

Fri, Feb 23, 2024. Teknoturf is using Gen AI to #ai-coding #automation #prompt-engineering #speech-to-text #tts
- Improve prompts when teaching prompt engineering.
- Pronounce languages better, identifying which words Tamilians and Malayalis will mis-pronounce.

Mon, Feb 19, 2024. All image-to-text models on HuggingFace #future #huggingface #speech-to-text

Sun, Feb 18, 2024. HuggingFace Chat Assistants has open source system prompts!! #future #huggingface #speech-to-text

Mon, Jan 15, 2024. ElevenLabs speech synthesis with voice cloning is in the uncanny valley. With two 5-minute samples, the clone sounds a fair bit like my voice but is very clearly not my voice. I find stability ~30%, similarity ~80% and style ~50% gives a reasonable outcome. But the default voices (e.g. Joseph, George, Charlie) are excellent. #speech-to-text #voice-cloning

Sun, Dec 17, 2023. ⭐ whisper-standalone-win provides a Windows binary for Faster-Whisper. It just needs CUDA and cuDNN installed. Then whisper-faster.exe video.mkv --language=English --model=medium generates the transcript. #speech-to-text

Fri, Dec 15, 2023. Mixtral-8x7b-Instruct "... really does seem to be equivalent in quality to ChatGPT 3.5." Ref #chatgpt #speech-to-text #tts

Sun, Dec 3, 2023. Tools to explore #speech-to-text #voice-cloning
- ElevenLabs speaks in your voice
- Cutout Pro removes backgrounds and parts of images
- Vocal Remover removes vocals from songs
- CapCut video editor

Sun, Dec 3, 2023. Meta released SeamlessExpressive which preserves emotions in speech-to-speech translations #embeddings #future #speech-to-text #tts #voice-cloning

Mon, Nov 6, 2023. ChatGPT is slightly better than GitHub Copilot or CodeWhisperer #chatgpt #code-agents #github #speech-to-text

Sat, Nov 4, 2023. SeamlessM4T does language and speech inter-conversion. CC BY-NC license #document-conversion #speech-to-text #tts #voice-cloning

Thu, Oct 26, 2023. Auxi pro is like VS Code + GitHub Copilot Chat for PowerPoint #ai-coding #github #speech-to-text

Thu, Oct 26, 2023. ChatGPT AutoExpert is a great prompt mechanism for higher accuracy, context and control. Example #automation #chatgpt #code-agents #prompt-engineering #speech-to-text #tts