Tuesday, December 23, 2025
Show HN: Full-text search engine for Epstein docs (OCR and OpenSearch) https://ift.tt/NEIwSLP
Hi HN,

Like many people, I was frustrated that the released Epstein/Maxwell court documents were mostly scanned images (PDFs) with no text layer. This made them impossible to Ctrl+F or analyze programmatically. I built a pipeline to fix this using Python, Tesseract, and OpenSearch.

The Site: https://ift.tt/Uk38wWL

The Stack:
Ingestion: Python workers using ocrmypdf (Tesseract) to perform parallel OCR on raw files.
Search: OpenSearch for indexing the extracted text.
Frontend: Next.js (SSR) for the UI.
Infrastructure: Self-hosted Docker Swarm.

Features:
Sub-second full-text search across all files.
Highlights search terms directly on the PDF page.
Deep linking to specific pages/documents.

This is a transparency tool, not a political one. I wanted to make the raw primary sources accessible to researchers and journalists. Feedback on the search relevance or indexing pipeline is welcome!

December 24, 2025 at 01:27AM
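For readers curious how an OCR-to-OpenSearch pipeline like the one described might fit together, here is a minimal Python sketch. It is not the author's code. The index name, field names, per-page ID scheme, and helper functions are all hypothetical, and it assumes the ocrmypdf sidecar text file separates pages with form-feed characters. The sketch only builds the ocrmypdf command line and the JSON bodies that would be sent to OpenSearch; it does not run OCR or contact a cluster.

```python
import json

INDEX_NAME = "court-docs"  # hypothetical index name


def ocr_command(src_pdf, out_pdf, sidecar_txt, jobs=4):
    """Build an ocrmypdf command line (returned as a list, not executed)."""
    return [
        "ocrmypdf",
        "--skip-text",             # skip pages that already carry a text layer
        "--jobs", str(jobs),       # number of parallel OCR workers
        "--sidecar", sidecar_txt,  # also write the recognized text to a .txt file
        src_pdf,
        out_pdf,
    ]


def split_pages(sidecar_text):
    """Split sidecar text into per-page strings.

    Assumption: pages are separated by form feeds ("\f").
    """
    pages = sidecar_text.split("\f")
    if pages and not pages[-1].strip():
        pages.pop()  # drop the empty chunk left by a trailing form feed
    return [p.strip() for p in pages]


def bulk_body(doc_id, pages):
    """Build a newline-delimited JSON body for the OpenSearch _bulk endpoint.

    One document per page, so a search hit can deep-link to a page number.
    """
    lines = []
    for page_no, text in enumerate(pages, start=1):
        action = {"index": {"_index": INDEX_NAME, "_id": f"{doc_id}-p{page_no}"}}
        source = {"doc_id": doc_id, "page": page_no, "text": text}
        lines.append(json.dumps(action))
        lines.append(json.dumps(source))
    return "\n".join(lines) + "\n"  # the bulk body must end with a newline


def search_query(terms, size=10):
    """Build a full-text query that also asks for highlighted snippets."""
    return {
        "size": size,
        "query": {"match": {"text": {"query": terms, "operator": "and"}}},
        "highlight": {"fields": {"text": {}}},
    }


if __name__ == "__main__":
    print(" ".join(ocr_command("raw/doc1.pdf", "ocr/doc1.pdf", "ocr/doc1.txt")))
    pages = split_pages("first page text\fsecond page text\f")
    print(bulk_body("doc1", pages), end="")
    print(json.dumps(search_query("deposition transcript"), indent=2))
```

Indexing one document per page is only one possible design. It keeps highlights and deep links tied to a page number, at the cost of more documents in the index than a one-document-per-file layout.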