Tuesday, December 23, 2025
Show HN: Full-text search engine for Epstein docs (OCR and OpenSearch) https://ift.tt/NEIwSLP
Hi HN,

Like many people, I was frustrated that the released Epstein/Maxwell court documents were mostly scanned images (PDFs) with no text layer. This made them impossible to Ctrl+F or analyze programmatically. I built a pipeline to fix this using Python, Tesseract, and OpenSearch.

The Site: https://ift.tt/Uk38wWL

The Stack:
- Ingestion: Python workers using ocrmypdf (Tesseract) to perform parallel OCR on raw files.
- Search: OpenSearch for indexing the extracted text.
- Frontend: Next.js (SSR) for the UI.
- Infrastructure: Self-hosted Docker Swarm.

Features:
- Sub-second full-text search across all files.
- Highlights search terms directly on the PDF page.
- Deep linking to specific pages/documents.

This is a transparency tool, not a political one. I wanted to make the raw primary sources accessible to researchers and journalists. Feedback on the search relevance or indexing pipeline is welcome!

December 24, 2025 at 01:27AM
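For readers curious how the ingestion and search steps described above might fit together, here is a minimal sketch using only the Python standard library. It is not the author's code. It assumes ocrmypdf is run with its `--sidecar` option and that the sidecar text separates pages with form-feed characters; the index name `court-docs`, the field names `doc_id`, `page`, and `text`, and all function names are hypothetical. Indexing one document per page is one plausible way to support deep links to specific pages.

```python
import json
from typing import Dict, Iterator, List


def ocr_command(src: str, dst: str, sidecar: str, jobs: int = 4) -> List[str]:
    """Build an ocrmypdf command line; run it with subprocess.run(cmd, check=True).

    --skip-text leaves pages that already have a text layer untouched,
    --sidecar writes the recognized text to a separate file.
    """
    return ["ocrmypdf", "--skip-text", "--jobs", str(jobs),
            "--sidecar", sidecar, src, dst]


def page_actions(doc_id: str, sidecar_text: str,
                 index: str = "court-docs") -> Iterator[Dict]:
    """Yield bulk-API action/source pairs, one indexed document per page.

    Assumption: pages in the sidecar text are separated by form feeds.
    """
    for number, text in enumerate(sidecar_text.split("\f"), start=1):
        text = text.strip()
        if not text:
            continue  # skip blank pages but keep the original page numbering
        yield {"index": {"_index": index, "_id": f"{doc_id}-p{number}"}}
        yield {"doc_id": doc_id, "page": number, "text": text}


def to_ndjson(actions) -> str:
    """Serialize actions as newline-delimited JSON for a POST to /_bulk."""
    return "".join(json.dumps(action) + "\n" for action in actions)


def search_body(query: str, size: int = 10) -> Dict:
    """Build a full-text query that also asks for highlighted fragments."""
    return {
        "size": size,
        "query": {"match": {"text": query}},
        "highlight": {"fields": {"text": {}}},
    }
```

The NDJSON string from `to_ndjson(page_actions(...))` would be sent to the cluster's `_bulk` endpoint, and `search_body(...)` to the index's `_search` endpoint. Each hit then carries `doc_id` and `page`, which a frontend could turn into a page-level link.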