Published some notes on Docling, a rather nice MIT licensed Python PDF document / table extraction library from IBM https://simonwillison.net/2024/Nov/3/docling/
-
@simon How does the Markdown output from Docling compare with the HTML that you've gotten out of Gemini for PDF documents? Does Docling do a good job of recognizing headings, lists, etc.?
-
@simon Any comments on its output quality?
-
@xsc I tried it on two PDFs and it looked OK, which isn't nearly enough testing for me to say anything useful!
-
@matt I tried it on two documents so far and it looked reasonable, but I've not done a remotely robust comparison of it yet