| Service | $ |
|---|---|
| Cloudflare Vectorize | 0.38 |
| TurboPuffer (min $64/mo) | 1.12 |
| Pinecone (Serverless) | 1.27 |
| Supabase (pgvector Micro) | 10.00 |
| Redis Cloud Flex (~3 GB) | 15.00 |
| Elastic Serverless | 65.84 |
| Weaviate Cloud (Serverless) | 73.00 |
| Qdrant Cloud (4 CPU / 8 GB) | 107.16 |
| Azure AI Search (S1 1 SU) | 245.28 |
| AWS OpenSearch Serverless | 350.00 |
| Google Vertex AI Vector Search | 547.50 |
`uvx lida ui --port 8080 --docs` works. I had to `export TCL_LIBRARY=C:/Users/Anand/AppData/Roaming/uv/python/cpython-3.13.0-windows-x86_64-none/tcl/tcl8.6` to point it to my TCL installation for charts to work. I also chose to `export OPENAI_BASE_URL=https://llmfoundry.straive.com/openai/v1` and replace `gpt-3.5-turbo-0301` (the default model) with `gpt-4o-mini` in `lida/web/ui/component*`.
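Put together, the setup reads as one session — export the environment first, then launch. (The TCL path and proxy URL are specific to my machine; adjust for yours.)

```shell
# Tkinter needs TCL_LIBRARY pointing at uv's bundled TCL for charts to render
export TCL_LIBRARY=C:/Users/Anand/AppData/Roaming/uv/python/cpython-3.13.0-windows-x86_64-none/tcl/tcl8.6
# Route OpenAI API calls through the LLM Foundry proxy
export OPENAI_BASE_URL=https://llmfoundry.straive.com/openai/v1
# Launch the LIDA UI with the docs page enabled
uvx lida ui --port 8080 --docs
```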
`devices:` on Docker Compose lets you specify NVIDIA GPU devices. #chatgpt
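The stanza in question is a device reservation under `deploy`. A minimal sketch — the service name and CUDA image tag here are illustrative:

```yaml
services:
  gpu-test:
    image: nvidia/cuda:12.4.1-base-ubuntu22.04
    command: nvidia-smi
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
```

With the NVIDIA Container Toolkit installed, `docker compose up` should print the `nvidia-smi` table.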
`text-embedding-3-large` embeddings can be truncated. The embedding values have descending importance, so picking the first n is a good approximation. Also, `gpt-3.5-turbo-0125` is 50% cheaper. #embeddings
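A minimal sketch of the truncation trick: keep the first n values and re-normalize to unit length. The helper and toy vector below are mine, not from the note; with the real API you can also request a shorter vector directly via the `dimensions` parameter.

```python
import numpy as np

def truncate_embedding(vec, n):
    """Keep the first n dimensions and re-normalize to unit length."""
    v = np.asarray(vec, dtype=float)[:n]
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

# Toy 8-dim vector standing in for a real 3072-dim embedding
full = [0.5, 0.4, 0.3, 0.2, 0.1, 0.05, 0.02, 0.01]
short = truncate_embedding(full, 4)
print(len(short), float(np.linalg.norm(short)))
```

Re-normalizing matters because cosine similarity on truncated vectors assumes unit length.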