Extract text from imported (PDF / Word / Office) files

Being able to search _inside_ the PDF files uploaded to Atomic Server would be a really useful addition.

Goals:

- Make it easier to find PDF documents by searching for terms that occur inside them
- Lightweight
- Fast
- Runs in the background and is allowed to fail. It should not slow down the upload process.
- OCR for PDFs that lack a text layer would be a decent addition, but only if the other goals are met.
- Bonus points if it also converts other document types (e.g. docx) to plaintext
- Output should be plaintext or (preferably) markdown
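
One way to meet the "runs in the background, may fail" goal is a single worker thread fed through a channel. The sketch below uses only the standard library. `ExtractJob`, `Extracted` and `spawn_extractor` are hypothetical names, not existing Atomic Server APIs, and the actual extraction is passed in as a closure so that any of the libraries could be plugged in.

```rust
use std::path::{Path, PathBuf};
use std::sync::mpsc::{self, Receiver, Sender};
use std::thread::{self, JoinHandle};

/// Hypothetical job: an uploaded file that still needs text extraction.
pub struct ExtractJob {
    pub path: PathBuf,
}

/// Hypothetical result, to be handed to whatever feeds the search index.
#[derive(Debug, PartialEq)]
pub struct Extracted {
    pub path: PathBuf,
    pub text: String,
}

/// Starts one worker thread. The upload handler only sends a job on the
/// returned `Sender`, so it never waits for extraction to finish.
pub fn spawn_extractor<F>(extract: F) -> (Sender<ExtractJob>, Receiver<Extracted>, JoinHandle<()>)
where
    F: Fn(&Path) -> Result<String, String> + Send + 'static,
{
    let (job_tx, job_rx) = mpsc::channel::<ExtractJob>();
    let (out_tx, out_rx) = mpsc::channel::<Extracted>();

    let handle = thread::spawn(move || {
        // The loop ends once every job sender has been dropped.
        for job in job_rx {
            match extract(&job.path) {
                Ok(text) => {
                    // Stop if nobody is listening for results anymore.
                    if out_tx.send(Extracted { path: job.path, text }).is_err() {
                        break;
                    }
                }
                // A failed extraction is logged and skipped; the upload itself
                // has already succeeded and is not affected.
                Err(err) => eprintln!("text extraction failed for {}: {err}", job.path.display()),
            }
        }
    });

    (job_tx, out_rx, handle)
}
```

The upload path would call `job_tx.send(ExtractJob { path })` after storing the file, and a separate consumer would read from the `Receiver` to update the search index. A thread pool or async task would work equally well; a single thread is just the smallest version of the idea.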

Non-goals:

- Extract data from tables in PDFs

There are some tools that could help with this:

- [pdf-extract](https://github.com/jrmuizel/pdf-extract) Rust crate
- [ooxml-rs](https://github.com/zitsen/ooxml-rs) Rust parser for Office Open XML (.docx, .xlsx, .pptx, i.e. Word, Excel, PowerPoint)
- [pdf-to-markdown](https://github.com/jzillmann/pdf-to-markdown) (JS, so it should be able to run client-side!)
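
If the server-side crates above were used, each uploaded file would first need to be routed to the right one. A minimal sketch of that routing, based on the file extension only, could look like this. The `Extractor` enum and `pick_extractor` function are hypothetical, and the mapping to specific crates in the comments is an assumption rather than a decision made in this issue.

```rust
use std::path::Path;

/// Which kind of extractor a file would be handed to (hypothetical).
#[derive(Debug, PartialEq)]
pub enum Extractor {
    /// PDF files, e.g. handled by the pdf-extract crate.
    Pdf,
    /// .docx / .xlsx / .pptx files, e.g. handled by ooxml-rs.
    OfficeOpenXml,
    /// Everything else is skipped, which is fine since extraction may fail.
    Unsupported,
}

/// Picks an extractor from the file extension, ignoring case.
pub fn pick_extractor(path: &Path) -> Extractor {
    let ext = path
        .extension()
        .and_then(|e| e.to_str())
        .map(|e| e.to_ascii_lowercase());

    match ext.as_deref() {
        Some("pdf") => Extractor::Pdf,
        Some("docx") | Some("xlsx") | Some("pptx") => Extractor::OfficeOpenXml,
        _ => Extractor::Unsupported,
    }
}
```

Matching on the extension is crude; checking the stored MIME type or the file's magic bytes would be more reliable, at the cost of a little extra work per upload.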


Extract text from imported (PDF / Word / Office) files #477
