Description
Being able to search inside the PDF files uploaded to Atomic Server would be a really nice addition.
Goals:
- Make it easier to find PDF documents by searching for terms that occur inside them
- Lightweight
- Fast
- Runs in the background and is allowed to fail. It should not slow down the upload process.
- OCR for PDFs that lack a text layer would be a decent addition, but only if the other goals are met.
- Bonus points if it also converts other document types (e.g. docx) to plaintext
- Output should be plaintext or (preferably) markdown
Non-goals:
- Extract data from tables in PDFs
There are some tools that could help with this:
- pdf-extract Rust crate
- ooxml-rs, a Rust parser for Office Open XML (.docx, .xlsx, .pptx / Word, Excel, PowerPoint)
- pdf-to-markdown (JS, so should run client-side!)
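
To make the "runs in background, may fail" goal a bit more concrete, here is a minimal sketch of how the server side could be shaped. It is only an illustration, not how Atomic Server works today: `IndexOutcome` and `spawn_extraction` are hypothetical names, and the extractor is passed in as a closure so the sketch stays independent of any particular crate (in practice it might wrap something like pdf-extract).

```rust
use std::sync::mpsc::{self, Receiver};
use std::thread;

/// Outcome of one background extraction job (hypothetical type).
#[derive(Debug, PartialEq)]
pub enum IndexOutcome {
    /// Plaintext was extracted and can be handed to the search index.
    Indexed { file_name: String, text: String },
    /// Extraction failed; the upload itself is unaffected.
    Failed { file_name: String, reason: String },
}

/// Spawns a worker thread that runs `extract` on the uploaded bytes and
/// reports the outcome over a channel. The function returns immediately,
/// so the upload handler never waits for extraction to finish.
pub fn spawn_extraction<F>(
    file_name: String,
    bytes: Vec<u8>,
    extract: F,
) -> Receiver<IndexOutcome>
where
    // Assumption: the extractor maps raw file bytes to plaintext or an error message.
    F: FnOnce(&[u8]) -> Result<String, String> + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        let outcome = match extract(&bytes) {
            Ok(text) => IndexOutcome::Indexed { file_name, text },
            Err(reason) => IndexOutcome::Failed { file_name, reason },
        };
        // Best effort: if nobody is listening anymore, drop the result.
        let _ = tx.send(outcome);
    });
    rx
}
```

The upload handler would call `spawn_extraction` after the file is stored and return its response right away. Whatever consumes the receiver can add `Indexed` text to the search index and just log `Failed` results. If the extractor panics, only the worker thread dies and `recv()` on the receiver returns an error, so a bad PDF cannot take the upload down with it.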