OK, I _swear_ I didn’t do this on purpose, but PDF sucks so bad as a publishing format that the easiest way to build search traffic to a website turns out to be republishing information that’s otherwise locked up in a PDF.
-
Pondering my dashb-orb replied to Simon Willison
@simon the mrbeast thing huh?
-
Simon Willison replied to Simon Willison
Heck, I did it with a paper that had been published by Google’s own research team and my version was in the top three results on Google within 2 hours of putting it online https://simonwillison.net/2024/Aug/24/pipe-syntax-in-sql/
Don’t let friends publish useful information in PDFs!
-
Simon Willison replied to Pondering my dashb-orb
@arichtman yeah!
-
Mark T. Tomczak replied to Simon Willison
@simon I actually wonder if it's a technical challenge issue.
There are so many ways to create a PDF that's readable to humans and illegible to computers that it's much, much easier to make something search-engine-friendly in HTML format. Even in the case of Google where, I imagine, they can OCR that shit, that pipeline's gotta be more of a bottleneck than interpreting token-stripped HTML because it just costs more resources to transform images of text.
And that's before we factor in the human element: Google still gets signal on popularity from clicks, and if I see a PDF in the wild, my default response is "No thank you; I do not want this information in likely-unsearchable page-by-page form that'll be harder to consume than a plain web page."
-
@[email protected] I wonder if you have seen some LLM/Stable Diffusion service that converts a PDF to ePub? Given that most traditional services are really bad with some complex PDFs, I suspect an LLM could provide a great alternative.
-
@alvaro I've had success using Gemini 1.5 Pro to convert PDFs to both HTML and to Markdown, so I'm very confident it could output ePub (effectively HTML + assets in a zip file) given the right prompts and the right harness around it
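To make the "HTML + assets in a zip file" description concrete, here is a minimal Python sketch of the packaging half only. It assumes the HTML fragments have already been produced by some converter; `build_epub`, the `OEBPS/` folder and the `chapterN.xhtml` file names are illustrative choices, not anything from the thread, and the result has not been run through an EPUB validator.

```python
import html
import io
import uuid
import zipfile
from datetime import datetime, timezone

CONTAINER_XML = """<?xml version="1.0" encoding="UTF-8"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
"""

PAGE = """<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:epub="http://www.idpf.org/2007/ops">
<head><title>{title}</title></head>
<body>{body}</body>
</html>
"""


def build_epub(title, chapters):
    """Return EPUB bytes. `chapters` is a list of XHTML body fragments (hypothetical input)."""
    safe = html.escape(title)
    names = [f"chapter{i}.xhtml" for i in range(1, len(chapters) + 1)]
    items = "".join(
        f'<item id="c{i}" href="{n}" media-type="application/xhtml+xml"/>'
        for i, n in enumerate(names, 1)
    )
    refs = "".join(f'<itemref idref="c{i}"/>' for i in range(1, len(names) + 1))
    links = "".join(
        f'<li><a href="{n}">Section {i}</a></li>' for i, n in enumerate(names, 1)
    )
    nav = PAGE.format(title=safe, body=f'<nav epub:type="toc"><ol>{links}</ol></nav>')
    modified = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    opf = f"""<?xml version="1.0" encoding="UTF-8"?>
<package xmlns="http://www.idpf.org/2007/opf" version="3.0" unique-identifier="uid">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:identifier id="uid">urn:uuid:{uuid.uuid4()}</dc:identifier>
    <dc:title>{safe}</dc:title>
    <dc:language>en</dc:language>
    <meta property="dcterms:modified">{modified}</meta>
  </metadata>
  <manifest>
    <item id="nav" href="nav.xhtml" media-type="application/xhtml+xml" properties="nav"/>
    {items}
  </manifest>
  <spine>{refs}</spine>
</package>
"""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        # The mimetype entry must be the first file in the archive and stored uncompressed.
        zf.writestr("mimetype", "application/epub+zip", compress_type=zipfile.ZIP_STORED)
        deflate = zipfile.ZIP_DEFLATED
        zf.writestr("META-INF/container.xml", CONTAINER_XML, compress_type=deflate)
        zf.writestr("OEBPS/content.opf", opf, compress_type=deflate)
        zf.writestr("OEBPS/nav.xhtml", nav, compress_type=deflate)
        for name, body in zip(names, chapters):
            page = PAGE.format(title=safe, body=body)
            zf.writestr(f"OEBPS/{name}", page, compress_type=deflate)
    return buf.getvalue()
```

Writing `build_epub("My book", fragments)` to a `.epub` file is the whole "harness" on the output side; images and stylesheets would need extra manifest entries that this sketch leaves out.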
-
Simon Willison replied to Mark T. Tomczak
@mark I'm sure that's what's going on here - HTML is a far better format for machine-readability than PDF
-
@[email protected] just FYI my attempts to convert a PDF book to ePub sadly didn't work. I was able to generate some HTML, but definitely not the whole book :sadblob:
-
@alvaro I expect it would take quite a lot of prompt engineering, maybe even across multiple prompts, one per page of the document
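The "one prompt per page" idea could be sketched roughly like this in Python. Everything here is an assumption for illustration: the page text is taken as already extracted, `build_prompt` and `convert_document` are made-up names, and `convert_page` stands in for whatever model call you use (no real LLM API is shown).

```python
from typing import Callable, Iterable


def build_prompt(page_number: int, page_text: str) -> str:
    """Build one prompt per page (the wording is a placeholder, not a tested prompt)."""
    return (
        f"Convert page {page_number} of a book to clean HTML. "
        "Return only the HTML for this page.\n"
        f"{page_text}"
    )


def convert_document(pages: Iterable[str], convert_page: Callable[[str], str]) -> str:
    """Send each page through `convert_page` separately and stitch the results together."""
    parts = []
    for number, text in enumerate(pages, start=1):
        # One model call per page keeps each request small, so a long
        # book is not cut off partway through a single response.
        parts.append(convert_page(build_prompt(number, text)))
    return "\n".join(parts)
```

Passing the model call in as a plain function makes it easy to swap models or to test the stitching with a stub before spending anything on real requests.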