Extract text from imported (PDF / Word / Office) files #477

Open
@joepio

Being able to search inside the PDF files uploaded to Atomic Server would be a really nice addition.

Goals:

  • Make it easier to find PDF documents by searching for terms that occur inside them
  • Lightweight
  • Fast
  • Runs in the background and is allowed to fail; it should not slow down the upload process
  • OCR for PDFs that lack a text layer would be a decent addition, but only if the other goals are met
  • Bonus points if it also converts other document types (e.g. docx) to plaintext
  • Output should be plaintext or (preferably) markdown
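
To make the "runs in background, may fail" goal concrete, here is a rough sketch of how extraction could be kept off the upload path. It is only an illustration in Python, not a proposal for the actual Atomic Server code: `ExtractionWorker`, `submit`, `texts`, and `errors` are hypothetical names, and the extractor is passed in as a plain function so that any PDF or docx tool could be plugged in.

```python
import queue
import threading
from typing import Callable, Dict


class ExtractionWorker:
    """Hypothetical helper: extracts text on a background thread.

    Uploads only enqueue a job, so they never wait for extraction.
    A failing extraction is recorded in `errors` and otherwise ignored.
    """

    def __init__(self, extract: Callable[[bytes], str]) -> None:
        self._extract = extract          # assumed: bytes in, plaintext/markdown out
        self._jobs = queue.Queue()       # holds (file_id, data) tuples, or None to stop
        self.texts: Dict[str, str] = {}  # file_id -> extracted text (would feed search)
        self.errors: Dict[str, str] = {} # file_id -> error message
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def submit(self, file_id: str, data: bytes) -> None:
        """Called from the upload handler; returns immediately."""
        self._jobs.put((file_id, data))

    def _run(self) -> None:
        while True:
            job = self._jobs.get()
            try:
                if job is None:          # sentinel: shut down the worker
                    return
                file_id, data = job
                try:
                    self.texts[file_id] = self._extract(data)
                except Exception as exc: # extraction may fail; the upload is unaffected
                    self.errors[file_id] = str(exc)
            finally:
                self._jobs.task_done()

    def close(self) -> None:
        """Process the remaining jobs, then stop the worker thread."""
        self._jobs.put(None)
        self._thread.join()
```

The upload handler would call `submit(file_id, data)` right after storing the file and return without waiting. Whatever ends up in `texts` could then be handed to the search index, while entries in `errors` simply mean that file stays unsearchable by content.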

Non-goals:

  • Extract data from tables in PDFs

There are some tools that could help with this:

Labels: plugin (Should probably be an Atomic Plugin)
