Extract PDF text in your browser with LiteParse for the web
TLDR
- Simon Willison ported LlamaIndex’s LiteParse PDF-to-text CLI into a pure browser app in 59 minutes using Claude Code and Opus 4.7.
Key Takeaways
- LiteParse uses “spatial text parsing” heuristics to handle multi-column PDF layouts without any AI models, falling back to Tesseract.js OCR for image-based PDFs.
- The browser port runs entirely client-side on PDF.js and Tesseract.js; no data leaves the browser and no network requests are made during parsing.
- Willison used Claude Code with a plan-first workflow: he wrote notes.md from initial research, generated plan.md before coding, then ran “build it” and queued follow-up prompts.
- Cross-browser bugs (Safari ReadableStream failure) were caught via Playwright TDD and fixed without Willison reviewing any of the HTML or TypeScript directly.
- The app is deployed via GitHub Pages using a Vite build step configured by Claude Code; CI runs tests on every push before deploying.
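To make the “spatial text parsing” idea above concrete, here is a minimal TypeScript sketch of one such heuristic: group positioned text fragments into lines by baseline, then place each fragment on a character grid so that column gutters survive as runs of spaces. This is an illustration only, not LiteParse's actual algorithm. The `TextItem` shape, the `layoutText` function, and the `charWidth` and `lineTolerance` defaults are all assumptions made for the example.

```typescript
// Hypothetical shape for a positioned text fragment. PDF parsers expose
// similar data, but this interface is an assumption for illustration.
interface TextItem {
  text: string;
  x: number; // left edge, in page units
  y: number; // baseline; larger values are lower on the page in this sketch
}

// Group fragments into lines by baseline proximity, then lay each line out
// on a fixed-width character grid so horizontal gaps become spaces.
function layoutText(
  items: TextItem[],
  charWidth: number = 6,     // assumed average glyph width in page units
  lineTolerance: number = 2  // assumed max baseline drift within one line
): string {
  // Sort top-to-bottom, then left-to-right.
  const sorted = items.slice().sort((a, b) => a.y - b.y || a.x - b.x);

  const lines: TextItem[][] = [];
  for (const item of sorted) {
    const current = lines[lines.length - 1];
    if (current && Math.abs(current[0].y - item.y) <= lineTolerance) {
      current.push(item); // same visual line as the previous fragment
    } else {
      lines.push([item]); // start a new line
    }
  }

  return lines
    .map((line) => {
      line.sort((a, b) => a.x - b.x);
      let out = "";
      for (const item of line) {
        const col = Math.round(item.x / charWidth);
        if (out.length > 0 && col <= out.length) {
          out += " "; // fragments touch or overlap: keep one separating space
        } else {
          while (out.length < col) out += " "; // pad out to the target column
        }
        out += item.text;
      }
      return out;
    })
    .join("\n");
}

// Two-column sample: fragments on the same baseline stay on one output line.
const sample: TextItem[] = [
  { text: "Right", x: 60, y: 10.5 },
  { text: "Left", x: 0, y: 10 },
  { text: "col", x: 0, y: 22 },
  { text: "col", x: 60, y: 22 },
];
console.log(layoutText(sample));
```

A grid-based layout like this keeps the two columns visually separated in the plain-text output, which is the property a multi-column heuristic needs. A real implementation would also have to handle varying font sizes, rotated text, and reading order across columns.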
Why It Matters
- A working, privacy-safe, zero-cost browser PDF parser now exists that any developer can fork or embed without a server or API key.
- The build log is a concrete benchmark: one developer, one 59-minute Claude Code session, red/green TDD, small commits, cross-browser QA, and a deployed GitHub Pages app.
- Willison distinguishes “vibe coding” by whether the developer reviews the output, not whether AI wrote it; he argues low blast-radius static tools tolerate the tradeoff better than server-side code.
Simon Willison, Simon Willison’s Weblog · 2026-04-23