@ireneista
-
d@nny "disc@" mc²replied to d@nny "disc@" mc² last edited by
@ireneista @adrienne there are OCR methods which avoid trying to parse the graphics calls at all, used in e.g. pdfsandwich via tesseract, but that tends to be compute-intensive and less applicable to a phone
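For illustration, a minimal sketch of that rasterize-then-OCR approach (the thing pdfsandwich automates), assuming pdf2image and pytesseract are installed on top of poppler and tesseract; the filename and DPI are placeholders:

```python
# Rasterize each PDF page and OCR the bitmap, never touching the PDF's own
# graphics or text operators. This is the compute-heavy path mentioned above.
from pdf2image import convert_from_path  # wraps poppler's pdftoppm
import pytesseract

pages = convert_from_path("brief.pdf", dpi=300)  # one PIL image per page
for i, page_image in enumerate(pages, start=1):
    print(f"--- page {i} ---")
    print(pytesseract.image_to_string(page_image))
```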
-
d@nny "disc@" mc²replied to d@nny "disc@" mc² last edited by
@ireneista @adrienne please let me know if the text layer is somehow deeply linked to the pdf display capabilities (i know this is why arxiv has an experimental html5 option) but i was only proposing this because i was under the impression that could be avoided
-
just adrienne replied to d@nny "disc@" mc² last edited by
@hipsterelectron @ireneista It depends a lot on what software generated the PDFs (and with what options). IF there's an accessibility layer, reading that gets you 90% of the way there
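As a rough sketch of "reading that": pulling whatever text layer the generating software embedded, with no OCR involved. This assumes pdfminer.six; the filename is a placeholder, and a properly tagged (accessible) PDF exposes more structure than this one call recovers:

```python
# Extract the embedded text layer directly; if it's empty, there is no text
# layer and OCR is the fallback.
from pdfminer.high_level import extract_text

text = extract_text("brief.pdf")
print(text if text.strip() else "no text layer; fall back to OCR")
```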
-
d@nny "disc@" mc²replied to just adrienne last edited by
@adrienne @ireneista i was looking into whether wasm/webgpu were far enough along to do something like a fawkes facial rec poisoning on a phone browser (obv with greatly reduced settings) and even if that ends up being impossible, OCR seems much less strenuous https://circumstances.run/@hipsterelectron/113239764489703522
-
Irenes (many) replied to d@nny "disc@" mc² last edited by
@hipsterelectron @adrienne right but like OCR just gives you (character, coordinates) tuples. if you then also reconstruct bounding boxes, that gets you back to what you'd have if you'd started by looking at the draw calls, but it doesn't tell you the logical structure of the characters.
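For concreteness, tesseract's per-character output really is just a symbol plus box corners, with no words, lines, or reading order attached. A sketch assuming pytesseract, where page_image is a rasterized page as in the earlier example:

```python
# Each line of image_to_boxes() is "char x1 y1 x2 y2 page", with the origin
# at the bottom-left of the image. Nothing here says which word or line a
# character belongs to.
import pytesseract

for row in pytesseract.image_to_boxes(page_image).splitlines():
    char, x1, y1, x2, y2, _page = row.split(" ")
    print(char, (int(x1), int(y1)), (int(x2), int(y2)))
```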
-
@hipsterelectron @adrienne most tools that do this have some heuristics where they try to figure out when two things are next to each other and have the same baseline, and then also use font-specific knowledge to guess at where the spaces are (it's hard because they aren't all the same size.....)
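A toy version of that heuristic, working from character boxes like the ones above: bucket characters by (roughly) shared baseline, sort left to right, and guess a space wherever the horizontal gap looks wider than normal. The tolerance and gap factor are invented for illustration, not tuned values:

```python
from collections import defaultdict

def reconstruct_lines(boxes, baseline_tol=5, space_factor=0.4):
    """boxes: iterable of (char, x1, y1, x2, y2) with y1 as the bottom edge."""
    rows = defaultdict(list)
    for char, x1, y1, x2, y2 in boxes:
        # characters whose bottom edges fall in the same baseline_tol-pixel
        # bucket are treated as sharing a baseline
        rows[y1 // baseline_tol].append((x1, x2, char))

    lines = []
    for _, chars in sorted(rows.items(), reverse=True):  # top of page first
        chars.sort()                                      # left to right
        avg_width = sum(x2 - x1 for x1, x2, _ in chars) / len(chars)
        text, prev_right = "", None
        for x1, x2, char in chars:
            if prev_right is not None and x1 - prev_right > space_factor * avg_width:
                text += " "   # gap wider than a typical inter-letter gap
            text += char
            prev_right = x2
        lines.append(text)
    return lines
```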
-
just adrienne replied to Irenes (many) last edited by
@ireneista @hipsterelectron in fairness, the thing that prompted this was legal briefs, which are a reasonably clean subset of PDFs. they're all gonna be in one of like 4 fonts because courts are picky as shit, and the only bit where there's ever going to be multicolumn text is at the very beginning of the doc.
-
@ireneista @hipsterelectron like it's actually potentially a very usefully-limited case bc the format is so standardized! there are variations, sure, but there are also a lot of assumptions it's totally safe to make about the entire class.
-
d@nny "disc@" mc²replied to just adrienne last edited by
@adrienne @ireneista oh yes i'm not letting this discourage me from just pulling the text layer first, i'm just fascinated that i hadn't really considered the multiple levels of difficulty beyond the character recognition part of "OCR"
-
Irenes (many) replied to d@nny "disc@" mc² last edited by
@hipsterelectron @adrienne yeah as far as we can tell, neither did Adobe >< (or, more likely, they decided that was fine)
-
d@nny "disc@" mc²replied to Irenes (many) last edited by
@ireneista @adrienne making the format difficult to parse makes it harder for external tools to author and consume it, which redounds to monopolistic objectives (i also believe this may be a goal of gpg's reportedly horrific file format)
-
Irenes (many) replied to d@nny "disc@" mc² last edited by
@hipsterelectron @adrienne sigh gpg was doing backwards compatibility with pgp, so, maybe
-
@ireneista @hipsterelectron @adrienne PGP's file format was Fine until it needed 13 different extensions to keep up with developments in cryptography