#document-conversion | Anand S - Things I Learned

Sat, Feb 28, 2026. unidown is a Rust CLI tool that converts Markdown to Unicode characters - useful for LinkedIn. #document-conversion #github #markdown

Fri, Feb 20, 2026. Cloudflare introduced Markdown for Agents. This converts websites from HTML to Markdown via Accept: text/markdown for any Cloudflare endpoint which has enabled this feature. This requires a Pro account. #code-agents #document-conversion #html #markdown
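
A rough sketch of what such a request could look like from Python's standard library. The helper names are made up for this note, and whether Markdown actually comes back depends on the site having enabled the feature.

```python
from urllib.request import Request, urlopen


def markdown_request(url: str) -> Request:
    """Build a GET request that asks the server for Markdown instead of HTML."""
    return Request(url, headers={"Accept": "text/markdown"})


def fetch_markdown(url: str, timeout: float = 10.0) -> str:
    """Fetch the URL and return the body as text.

    The body is Markdown only if the endpoint honors Accept: text/markdown.
    """
    with urlopen(markdown_request(url), timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset)
```

Calling `fetch_markdown(...)` with the URL of a site that has the feature enabled should then return Markdown rather than HTML.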

Fri, Feb 20, 2026. "Animated web formats are simply video codecs ... stripped of their most powerful feature." A .webm file is likely to compress much better than an animated .webp, etc. Gemini #document-conversion #web-dev

Sun, Feb 1, 2026. Microsoft's docfind generates a WASM search index for documents, enabling compact, fast, dependency-free search in the browser. #document-conversion #html #search #web-dev

Sun, Jan 25, 2026. Exposing your workflow as a software interface productizes services businesses. For example, my auditors and immigration lawyers have portals where I can fill out forms, upload documents, see my status, etc. This standardizes their delivery, and creates a "product" moat. #automation #best-practices #document-conversion

Sun, Jan 11, 2026. AVIF compresses better than WebP and may be the "next big thing". I will be switching for all future images. Squoosh remains my choice of compressor and Ezgif's AVIF maker and GIF to AVIF are handy. #document-conversion #image-generation

Sun, Dec 7, 2025. OmniDocBench 1.5 is a benchmark for parsing realistic PDFs. Gemini 3 Pro does well among the commercial LLMs on the list. PaddleOCR-VL (0.9B) tops the benchmark overall. #document-conversion #llm-ops

Mon, Nov 17, 2025. Ghostscript seems the best way to compress PDFs via the CLI. Example: gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/screen -dNOPAUSE -dQUIET -dBATCH -sOutputFile=output.pdf input.pdf #document-conversion
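
For batch use, the same command can be wrapped in a small script. This is a minimal sketch that assumes `gs` is on the PATH. The function names and the configurable `preset` argument are my own additions, while the flags are the ones from the command above.

```python
import subprocess


def gs_compress_command(src: str, dst: str, preset: str = "screen") -> list[str]:
    """Return the Ghostscript argument list from the note, with the preset configurable."""
    return [
        "gs",
        "-sDEVICE=pdfwrite",
        "-dCompatibilityLevel=1.4",
        f"-dPDFSETTINGS=/{preset}",  # e.g. screen, ebook, printer, prepress
        "-dNOPAUSE",
        "-dQUIET",
        "-dBATCH",
        f"-sOutputFile={dst}",
        src,
    ]


def compress_pdf(src: str, dst: str, preset: str = "screen") -> None:
    """Run Ghostscript; raises CalledProcessError if gs exits with an error."""
    subprocess.run(gs_compress_command(src, dst, preset), check=True)
```

`compress_pdf("input.pdf", "output.pdf")` runs the same command as the example above.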

Mon, Nov 17, 2025. Pandoc supports Lua filters, which are a powerful way to customize the document conversion process. Here is a Lua filter that converts horizontal rules in a Markdown document to page breaks and preserves them in a Word document (OpenXML format). #document-conversion #markdown
```
function HorizontalRule()
  return pandoc.RawBlock('openxml', '<w:p><w:r><w:br w:type="page"/></w:r></w:p>')
end
```
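
To apply the filter, pass it to pandoc with `--lua-filter`. A minimal sketch, assuming the filter above is saved as `pagebreak.lua` (a file name I made up) and pandoc is on the PATH:

```python
import subprocess


def pandoc_docx_command(src: str, dst: str, lua_filter: str) -> list[str]:
    """Build a pandoc command that applies a Lua filter while writing a .docx file."""
    return ["pandoc", src, f"--lua-filter={lua_filter}", "-o", dst]


def convert(src: str, dst: str, lua_filter: str = "pagebreak.lua") -> None:
    """Run pandoc; assumes pandoc is installed and the filter file exists."""
    subprocess.run(pandoc_docx_command(src, dst, lua_filter), check=True)
```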

Mon, Oct 6, 2025. Create a long-form document editing agent that can make targeted edits in long documents/reports. Effectively a Codex for Word. #code-agents #document-conversion #future #markdown #write

Sat, Sep 27, 2025. The most effective way to convert a blob (e.g. file input) to a data URL on the browser seems to be via the FileReader API. #document-conversion

const blobToDataURL = (blob) =>
  new Promise((res, rej) => {
    const r = new FileReader();
    r.onload = () => res(r.result);
    r.onerror = () => rej(r.error);
    r.readAsDataURL(blob);
  });
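
For comparison, the string that `readAsDataURL` produces is just a MIME type plus base64-encoded bytes. Outside the browser the same shape can be built by hand, as in this small Python sketch (the function name is mine):

```python
import base64


def to_data_url(data: bytes, mime_type: str = "application/octet-stream") -> str:
    """Encode raw bytes as a base64 data URL, the same shape readAsDataURL returns."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"
```

For example, `to_data_url(b"hello", "text/plain")` gives `data:text/plain;base64,aGVsbG8=`.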

Fri, Sep 26, 2025. Adding // @ts-check to a JavaScript file and documenting types via JSDoc might be the simplest way to migrate phase-wise from JS to TypeScript. #document-conversion

Mon, Sep 22, 2025. uvx markitdown https://example.com/ fetches example.com as Markdown. I learnt this when I told Codex it could use uvx markitdown to convert PDFs and it figured this part out by itself. #document-conversion #html #markdown

Wed, Sep 10, 2025. Claude.ai can natively work with Excel, PPTX, DOCX, and PDF files now. #document-conversion #markdown

Fri, Aug 29, 2025. Cloudflare has an image transformation API that also acts as a CDN. Apart from basic transformations, it can auto detect and crop faces, remove backgrounds, and more. #cloud #document-conversion #image-generation

Thu, Aug 28, 2025. Our team passed an image to an LLM for OCR (especially to identify formatting, e.g. bold, italics, etc.), then passed the output and the image to another LLM for improvement. Interestingly, the best LLM (Gemini 2.5 Pro, for this sample of 8 images) out-performed the two-stage workflow. Perhaps incorrect results confuse more than the correct results help? This needs more research. #document-conversion #image-generation #llm-ops

Sat, Aug 23, 2025. Codex and Codex CLI now support image attachments. #document-conversion

Wed, Aug 20, 2025. 20 Aug 2025 #ai-coding #code-agents #document-conversion #future #github
- Policy-as-code app. Checklist from doc. Apply checklist to data / doc inputs.
- Code similarity checker library based on TDS Project evals.

Wed, Aug 20, 2025. ⭐ Policy-as-code is an emerging theme. Allow users to create their own guardrails policy. Or, take existing policy documents and convert them into an LLM-based evaluator. Krishnakumar Menon #ai-coding #code-agents #document-conversion #future #llm-ops

Sat, Aug 16, 2025. ⭐ LLMs can hyper-personalize demos. E.g. an LLM document generator demo accepts a role, document type, and prompt. The demo-er says "Bank, LinkedIn marketing" and the LLM auto-populates the fields aptly, re-purposing the demo. #document-conversion #future #llm-ops #markdown

Tue, Aug 12, 2025. Increasing the size of an image improves OCR accuracy for LLMs (or at least Claude 4 Sonnet). Anecdotally, resizing 2x did not work on a number of examples but 2.5x - 3x did. This increases the cost to 6.25x or 9x, however, since the pixel count grows with the square of the scale. #document-conversion #image-generation
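
The cost figures follow from simple arithmetic, assuming cost is proportional to the number of pixels (an assumption that matches the numbers above, not a published pricing rule):

```python
def pixel_cost_multiplier(scale: float) -> float:
    """Scaling width and height by `scale` multiplies the pixel count by scale squared."""
    return scale * scale


for scale in (2.0, 2.5, 3.0):
    print(f"{scale}x resize -> {pixel_cost_multiplier(scale)}x pixels")
```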

Tue, Aug 5, 2025. defuddle can be used in the browser to get the main content from web pages. A replacement for Mozilla Readability. #document-conversion #html #markdown #web-dev

Fri, Aug 1, 2025. Teaching vibe coding is satisfying, too. I guided a developer to write a Python workflow by providing 2 prompts. Both of these were one-shotted by Claude 4 Sonnet. The entire process took 20 min with me guiding them over the phone. #ai-coding #code-agents #document-conversion #prompt-engineering #write
- "Write a Python script to extract a page from a PDF file and save it." Followed by "Write minimal code. Drop error handling."
- "Write a Python script to pass a PDF file to an LLM for OCR and print the result. Use this code sample... [PASTED CODE]." Followed by "Write minimal code. Drop error handling."

Mon, Jun 30, 2025. ⭐ When bringing in humans-in-the-loop, applications must make it easier to review and to edit the work. #automation #best-practices #document-conversion #future #optimization #prompt-engineering #write

Mon, Jun 23, 2025. XConvert is a convenient online app to compress .webm videos. Not great design but fairly good compression. #document-conversion #web-dev

Sun, Jun 15, 2025. Documentation can become technical debt. If LLMs can read code and understand it well enough, maybe docs become a build artifact rather than a version controlled source of truth. Refactoring Podcast: The Future of Dev Tools 🔧 — with Dennis Pilarinos 35:56 #document-conversion #github #markdown #ai-coding

Fri, Jun 13, 2025. pdfplumber seems a good way to extract PDF structure and internals. #document-conversion

Tue, May 27, 2025. When processing presentations for RAG via OCR: #document-conversion
- How to parse PDF docs for RAG is a useful OpenAI cookbook with a GPT 4o prompt

Sun, May 11, 2025. Pandoc has several options useful when converting Markdown to HTML (cat file.md | pandoc -f markdown -t html). My favorites: #document-conversion #html #markdown
- --no-highlight skips code-highlighting. --highlight-style=pygments adds Pygments styling
- --wrap=none disables line wrapping in the output, so each paragraph stays on one line
- --number-sections adds section numbering (<h2>1. Introduction</h2>)
- --shift-heading-level-by=NUM – shift all headings by NUM levels (e.g., start at <h2> instead of <h1>)
- pandoc -f markdown-auto_identifiers drops the auto-identifiers extension that generates id=... for each heading
- pandoc -f gfm uses GitHub flavored Markdown. Run pandoc --list-extensions=gfm to identify the extensions it uses.
- Pandoc's Markdown extension examples are quite extensive.
- Auto-enabled GFM extensions:
  - alerts: GitHub-style callouts (note, tip, warning) via > [!TYPE] blocks.
  - autolink_bare_uris: Turns bare URLs into links, without needing <...>.
  - emoji: Parses :smile:-style codes into Unicode emoji characters.
  - footnotes: Enables footnote syntax with [^id] and definitions at the bottom.
  - gfm_auto_identifiers: Uses GitHub’s heading-ID algorithm: spaces → dashes, lowercase, removes punctuation.
  - pipe_tables: Enables pipe-delimited tables (| a | b |).
  - raw_html: Raw HTML is unchanged.
  - strikeout: Enables strikethrough with ~~text~~.
  - task_lists: Parses - [ ] and - [x] items as checkboxes.
  - yaml_metadata_block: YAML front matter for document metadata, e.g. <title>
- GFM extensions worth enabling:
  - ascii_identifiers: Strips accents/non-Latin letters in automatically generated IDs.
  - bracketed_spans: [Warning]{.alert} becomes <span class="alert">
  - definition_lists: Term\n: Definition text becomes a definition list
  - fenced_divs: ::: {.note} block creates a <div class="note">...</div>
  - implicit_figures: Standalone images become <figure> with <figcaption>.
  - implicit_header_references: [Section] is treated as [Section](#section)
  - raw_attribute: `<b>bold</b>`{=html} is inserted as raw HTML
  - smart: Converts straight quotes to curly, -- to en-dash, --- to em-dash, ... to ellipsis.
  - subscript & superscript: E.g. H~2~O and E = mc^2^
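
Extensions are switched on or off by appending `+name` or `-name` to the format. A small sketch of composing such a format string and the matching command line follows. The helper names are mine, and the flags are the ones listed above.

```python
def pandoc_format(base: str, enable: tuple[str, ...] = (), disable: tuple[str, ...] = ()) -> str:
    """Compose a pandoc format spec such as 'gfm+smart' or 'markdown-auto_identifiers'."""
    return base + "".join(f"+{ext}" for ext in enable) + "".join(f"-{ext}" for ext in disable)


def markdown_to_html_command(fmt: str, shift: int = 0) -> list[str]:
    """Build a pandoc command that reads Markdown on stdin and writes HTML to stdout."""
    cmd = ["pandoc", "-f", fmt, "-t", "html", "--wrap=none"]
    if shift:
        cmd.append(f"--shift-heading-level-by={shift}")
    return cmd
```

For instance, `pandoc_format("gfm", enable=("smart", "definition_lists"))` gives `gfm+smart+definition_lists`, which can be passed to `-f`.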

Sat, May 10, 2025. snapdom is a fast, light, element capture alternative to html2canvas but doesn't work well with non-CORS images or iframes. #document-conversion #html

Tue, Apr 22, 2025. You can run xclip -sel clip -o | pandoc -f markdown -t html --no-highlight | xclip -sel clip -t text/html -i to convert Markdown in the clipboard to rich text. But xclip can't hold multiple formats at once, so the plain-text version is lost. ChatGPT #document-conversion #markdown

Thu, Apr 10, 2025. Claude 3.7 Sonnet with extended thinking has an output limit of over 64,000 tokens. Combined with its strong instruction-following, that makes it one of the most powerful models for transforming text, e.g. restyling transcripts, translation, XML to JSON conversion, or PDF to XML. #document-conversion #markdown #speech-to-text

Tue, Apr 8, 2025. One way to copy as Markdown: copy page contents, paste in text-html.com, copy HTML, paste in Turndown, copy Markdown. #document-conversion #html #markdown #write

Thu, Apr 3, 2025. Clipboard2Markdown is a utility that lets you paste rich text and convert it to Markdown. #document-conversion #html #markdown

Wed, Mar 19, 2025. OpenAI now supports PDFs natively in the API. (Gemini has done so for a while) #document-conversion

Sun, Mar 16, 2025. Monolith downloads web pages as a single HTML file by embedding content. #document-conversion #html #web-dev

Sat, Mar 15, 2025. There's a PDF/UA-2 standard for accessibility but there aren't enough tools to generate it. #document-conversion

Sat, Mar 15, 2025. LibreOffice is now on WASM. ZetaJS provides office in the browser. It has a CDN (that was down from our IP). The packaged binary is about 35 MB and it loads about 100 MB of in-memory file system. #document-conversion #html #markdown #web-dev
- Useful for: Document conversion, Thumbnail generation, Text extraction, Merging / splitting documents

Sat, Mar 8, 2025. "Export to prompt" can be a useful feature in apps (or even as a bookmarklet). It would let you export content in an LLM-friendly Markdown format. You can paste it into an LLM and ask questions. Here are things I would find useful: #document-conversion #github #markdown #prompt-engineering
- Copy an entire issue (with history) from GitHub, GitLab, or Jira
- Copy an entire PR (with code changes) from GitHub, GitLab, or Bitbucket
- Copy CI/CD logs from GitHub Actions, GitLab CI, Azure DevOps, etc.
- Copy an entire conversation thread from Gmail, Discourse, ServiceNow, etc.
- Copy product reviews from Amazon, Shopify, etc.
- Copy page(s) from wikis and content sites like Wikipedia, StackOverflow, etc.
- Copy survey responses from Google Forms, Typeform, etc.
- Copy all interactions with a contact (including proposal history) from HubSpot or Salesforce
- Copy transcripts from Zoom, Teams, Google Meet, etc.
- Copy as Markdown from Word, GDocs, PDF or HTML
- Copy the summary of an analysis as well as all key metrics from any dashboard
- Copy SAP invoices
- Copy JDs, CVs, and reviews from Workday, BambooHR, DarwinBox, etc.
- Copy design specs, component libraries, and style guides from Figma, Miro, etc.
- Generated with the help of ChatGPT -- link not working

Fri, Mar 7, 2025. Mistral released an impressive OCR model. #document-conversion #github #gpu #markdown
- Marker from DataLab seems comparable but is CC-BY-NC-SA.
- MinerU converts medical textbooks to Markdown well.
- Gemini Flash may be more cost-effective and better.

Sat, Feb 1, 2025. You can add any content at the end of a PDF file. It's ignored. It's an interesting way to send additional information (or just blow up the file size if you don't like them.) #document-conversion
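
A minimal sketch of the idea: append bytes after the existing end of the file. The function name is mine, and whether a particular reader ignores the extra bytes is the observation above, not something this code checks.

```python
from pathlib import Path


def append_trailing_data(pdf_path: str, payload: bytes) -> int:
    """Append arbitrary bytes after the existing end of a PDF and return the new file size."""
    path = Path(pdf_path)
    with path.open("ab") as handle:  # "ab" appends without touching existing bytes
        handle.write(b"\n" + payload)
    return path.stat().st_size
```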

Thu, Jan 2, 2025. uvx doc2docx converts Word .doc files to the new .docx format. I had several old .doc files that I converted. #document-conversion

Thu, Jan 2, 2025. Tools that convert files to prompt / Markdown suitable for LLMs: #document-conversion #github #markdown
- uvx files-to-prompt
- npx git-ingest
- ingest - written in Go, only Mac/Linux binaries
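
The core idea behind these tools is simple enough to sketch: walk a folder and concatenate text files, each under a header naming its path. This is only an illustration of the concept, not how any of the tools above are implemented. The `## path` header format and the suffix filter are my own choices.

```python
from pathlib import Path


def files_to_prompt(root: str, suffixes: tuple[str, ...] = (".md", ".py", ".txt")) -> str:
    """Concatenate matching text files under root, each preceded by its relative path."""
    base = Path(root)
    parts = []
    for path in sorted(base.rglob("*")):
        if path.is_file() and path.suffix in suffixes:
            text = path.read_text(encoding="utf-8", errors="replace")
            parts.append(f"## {path.relative_to(base).as_posix()}\n\n{text.rstrip()}\n")
    return "\n".join(parts)
```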

Mon, Dec 23, 2024. Document to Markdown Converters: #document-conversion #markdown
- PyMuPDF4LLM uses MuPDF. Requires PyTorch.
  - PYTHONUTF8=1 uv run --with pymupdf4llm python -c 'import sys, pymupdf4llm; open("pymupdf4llm.md", "w").write(pymupdf4llm.to_markdown(sys.argv[1]))' "$FILE.pdf"
- markitdown from Microsoft. PDF via PDFMiner, DOCX via Mammoth, XLSX via Pandas, PPTX via python-pptx, ZIP, etc.
  - PYTHONUTF8=1 uvx markitdown $FILE.pdf > markitdown.md
- Docling by IBM. Unable to install via pip on Windows AND on Linux.
- MegaParse uses libreoffice, pandoc, tesseract-ocr, etc. Requires OpenAI API key.

Sun, Dec 22, 2024. aspose-words is a Python library that converts documents between many formats (Word, RTF, PDF, HTML, Markdown, EPUB, etc.) #document-conversion

Tue, Nov 26, 2024. Cloudflare Workers can bundle any kind of file, including text, data, and WASM. Docs #cloud #document-conversion #markdown

Mon, Nov 25, 2024. Crawl4AI and Firecrawl are tools / libraries to convert websites into LLM-friendly Markdown and extract structured data using LLMs. #ai-coding #code-agents #document-conversion #html #llm-ops

Sat, Nov 23, 2024. A list of Markdown to Website converters on this thread: #document-conversion #github #html #markdown
- Jekyll - Ruby - 2008
- MkDocs - Python - 2014
- GitBook - JavaScript (Node.js) - 2014
- MkDocs Material - Python (MkDocs-based) - 2016
- Docsify - JavaScript - 2016
- MdBook - Rust - 2017
- Antora - JavaScript (Node.js) - 2017
- Docusaurus - JavaScript (React) - 2017
- JupyterBook - Python - 2019
- Keenwrite - Java - ~2019
- Honkit - JavaScript (GitBook fork) - 2019
- Nextra - JavaScript (Next.js) - 2020
- Astro - JavaScript/TypeScript - 2021
- Hugo Book - Go (Hugo-based) - ~2020
- Clowncar - JavaScript/Node.js - ~2021
- Quarto - R and Python - 2022
- Starlight - JavaScript/TypeScript - 2023

Wed, Nov 6, 2024. Docling by IBM converts PDF, DOCX, etc. to Markdown. Like PyMuPDF4LLM but better. #document-conversion #markdown

Sat, Sep 28, 2024. PyMuPDF4LLM can convert PDFs to Markdown. It handles tables, too. #document-conversion #llm-ops #markdown
- 04 Oct 2024. PDF-Extract-Kit does PDF layout, formula, table, and OCR extraction using various models.
- 04 Oct 2024. llmsherpa extracts PDF layout and tables but does not do OCR.

Thu, Sep 12, 2024. Pixtral seems quite good at OCR #document-conversion #future #speech-to-text

Tue, Aug 20, 2024. Lumentis creates docs from transcripts and text #document-conversion #future #github #speech-to-text

Sun, Aug 11, 2024. DocxTemplater is SlideSense but open-core and handles DOCX as well! #document-conversion #html #markdown

Sun, Apr 14, 2024. llmsherpa extracts PDFs using LLMs. It has errors but it preserves hierarchy, extracts tables well, and retains image coordinates. Via Vetrivel PS #document-conversion #llm-ops

Tue, Feb 20, 2024. Adobe Express has a forever-free video to GIF converter. #document-conversion

Wed, Jan 3, 2024. Adobe Firefly offers a "generative fill" that lets you remove or paint new objects into an image. I'm awaiting text to vector images. #document-conversion #future #image-generation

Wed, Dec 27, 2023. Lica has a fascinating demo of how a document can be converted into a video story. #document-conversion

Sun, Dec 3, 2023. Microsoft released table-transformer to extract tables from PDFs. Sample usage #document-conversion

Sun, Dec 3, 2023. Convert PDF to markdown with marker - an improvement over nougat. #document-conversion #markdown

Thu, Nov 30, 2023. CoVA scrapes web pages via OCR #ai-coding #code-agents #document-conversion

Sat, Nov 4, 2023. SeamlessM4T does language and speech inter-conversion. CC BY-NC license. #document-conversion #speech-to-text #tts #voice-cloning