| Hi,
I'm looking for a simple way to convert PDFs into markdown with integrated images and tables. Tried Llamaindex, but no integrated images.
Tried Langchain, but some PDFs will have the footer being parsed before the top.
Tried to use Adobe PDF API, but have to pay $25K upfront! |
For text, unstructured seems to work quite well and does a good job of quickly processing easy documents while falling back to OCR when required. It is also quite flexible with regards to chunking and categorization, which is important when you start thinking about your embedding step. OTOH it can definitely be computationally expensive to process long documents which require OCR.
For images, we've used PyMuPDF. The main weakness we've found is that it doesn't seem to have a good story for dealing with vector images - it seems to output its own proprietary vector type. If anyone knows how to get it to output SVG that'd obviously be amazing.
For tables, we've used Camelot. Tables are pretty hard though; most libraries are totally fine for simple tables, but there are a ton of wild tables in PDFs out there which are barely human-readable to begin with.
For tables and images specifically, I'd think about what exactly you want to do with the output. Are you trying to summarize these things (using something like GPT-4 Vision?) Are you trying to present them alongside your usual RAG output? This may inform your methodology.