The Appleton Times

Truth. Honesty. Innovation.

Technology

Why is AI so bad at reading PDFs?

By Rachel Martinez

about 14 hours ago

Artificial intelligence models struggle to parse PDFs despite rapid advancements, as illustrated by efforts to analyze Jeffrey Epstein's document trove. Specialized tools from companies like Reducto and research from institutions like the Allen Institute are making progress, but challenges remain in achieving reliable extraction for real-world use.

In the wake of the Jeffrey Epstein scandal, a group of tech enthusiasts turned to artificial intelligence to sift through millions of government-released documents, only to hit a stubborn roadblock: the PDF file format. Last November, the House Oversight Committee released 20,000 pages from Epstein's estate, followed by Department of Justice releases totaling more than three million files, all in PDF form. Luke Igel, cofounder of the AI video-editing startup Kino, and his friends, including Riley Walz, spent hours navigating garbled email threads and poor-quality scans in a clunky PDF viewer.

"There was no interface the government put out that allowed you to actually see any sort of summary of things like flights, things like calendar events, things like text messages. There was no real index. You just had to get lucky and hope that the document ID that you were looking at contains what you’re looking for," Igel said. Frustrated by the unsearchable files—despite the Justice Department's attempt at optical character recognition (OCR)—Igel envisioned building a Gmail-like tool to make the correspondence more intuitive. But extracting data from PDFs proved far more challenging than expected, highlighting a surprising weakness in even the world's most advanced AI models.

PDFs, invented by Adobe in the early 1990s, were designed to preserve a document's exact visual appearance for printing and viewing, not for machine readability. Unlike HTML, which encodes a document's logical structure, a PDF places character codes at fixed coordinates to reproduce a page's appearance. That makes the format ideal for consistency, since a file looks identical on any device, but a nightmare for AI. Edwin Chen, CEO of the data company Surge, calls PDF parsing one of AI's "unsexy failures" that limits real-world applications. Last year, Chen tested state-of-the-art models and found they often summarized PDFs instead of extracting the requested information, mixed up footnotes with body text, or hallucinated content entirely.
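
The contrast can be made concrete. Below is a hypothetical sketch, with an invented and heavily simplified content stream, of the raw drawing commands a PDF uses to place text: coordinates and strings, with no structural markup for an extractor to lean on.

```python
import re

# Hypothetical, simplified sketch of the inside of a PDF content stream.
# Each Td operator sets a text position and each Tj paints a string there;
# nothing in the stream marks paragraphs, headings, or reading order.
# (Real streams are usually compressed and far less tidy.)
stream = b"""
BT /F1 12 Tf 72 720 Td (Quarterly) Tj ET
BT /F1 12 Tf 140 720 Td (Report) Tj ET
BT /F1 10 Tf 72 700 Td (Revenue rose 4%.) Tj ET
"""

# Naive extraction: grab every (...) Tj string in stream order and join them,
# which is roughly what basic text extractors do.
chunks = re.findall(rb"\((.*?)\)\s*Tj", stream)
print(b" ".join(chunks).decode())  # -> Quarterly Report Revenue rose 4%.
```

Here the stream happens to be in reading order; on real pages the operators can appear in any order the producing software chose, which is exactly where naive extraction falls apart.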

Researcher Pierre-Carl Langlais has even joked, in a timeline of AI progress, that the announcement "PDF parsing is solved!" would arrive just before artificial general intelligence (AGI). Igel's initial attempt, using Google's Gemini on the Epstein files, worked only for the cleanest scans and would have been too costly to run across millions of documents. He then turned to Adit Abraham, a former MIT classmate who runs Reducto, a PDF-parsing AI company located in the office above his.

Reducto succeeded where general models failed, pulling data from cryptic email threads, redacted call logs, and low-quality handwritten flight manifests in the Epstein files. With the extracted information, Igel and Walz built an ecosystem of apps: Jmail, a searchable prototype of Epstein's inbox; Jflights, an interactive globe showing flight paths linked to underlying PDFs; Jamazon for searching purchases; and Jikipedia for businesses and people mentioned, all citing source PDFs. "That’s where the magic of extracting information of PDFs became real for me," Igel said. "It’s going to completely change the way a lot of jobs happen."

The core issue with PDFs lies in their structure, which ignores editorial norms that humans intuitively grasp. "The key issue is that they cannot recognize editorial structure," Langlais explained. "It’s all fine while it’s relatively simple text, but then you’ve got all these tables, you’ve got forms. A PDF is part of some kind of textual culture with norms that it needs to understand." OCR can convert images of text back into editable form, but it struggles with multi-column layouts, turning academic papers into jumbled messes. Tables, images, footnotes, and headers complicate matters further. When a PDF is fed to a system like ChatGPT, the model cycles through tools, sometimes calling a vision model for OCR, which often means delays, high compute costs, and errors.
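
The multi-column failure is easy to reproduce. This is a toy sketch, not any real OCR pipeline: word boxes from an imagined two-column page, read naively top-to-bottom versus column by column. The coordinates and the fixed column split at x=300 are invented for illustration.

```python
# Toy illustration of OCR reading order on a two-column page.
# Each word is (x, y, text); y increases down the page.

def naive_order(words):
    """Read strictly top-to-bottom: interleaves the two columns."""
    return [w[2] for w in sorted(words, key=lambda w: (w[1], w[0]))]

def column_aware_order(words, column_split=300):
    """Group words into columns first, then read each column top-to-bottom."""
    left = [w for w in words if w[0] < column_split]
    right = [w for w in words if w[0] >= column_split]
    ordered = (sorted(left, key=lambda w: (w[1], w[0]))
               + sorted(right, key=lambda w: (w[1], w[0])))
    return [w[2] for w in ordered]

words = [
    (50, 100, "The"),    (350, 100, "while"),
    (50, 120, "first"),  (350, 120, "the"),
    (50, 140, "column"), (350, 140, "second"),
]
print(" ".join(naive_order(words)))         # The while first the column second
print(" ".join(column_aware_order(words)))  # The first column while the second
```

Real layout analysis has to discover the column boundaries itself, and they shift from page to page, which is why academic papers come out scrambled so often.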

Training-data scarcity exacerbates the problem: models rarely encounter PDFs during training. But that's changing as AI developers hunger for quality data. Government reports, textbooks, and academic papers, much of this material published as PDFs, offer trillions of tokens for training, according to researchers at the Allen Institute for AI. In a 2025 paper, they announced olmOCR, a specialized model trained on about 100,000 PDFs, including public-domain books, papers, brochures, and Library of Congress documents paired with human transcriptions. It was fine-tuned for challenges like tables, learning, for example, to identify headers by font size. Luca Soldaini, a researcher at the institute who worked on olmOCR, said it became their most popular release last year, outpacing the institute's generalist models. "People are actually using it," Soldaini noted, even if it lacks the glamour of broader AI advances.

The PDF's origins underscore its enduring role. Duff Johnson, CEO of the PDF Association, recalls that the first PDF was reportedly an IRS 1040 form. In 1994, the IRS distributed CDs of PDFs to ensure consistent forms without mass printing and mailing. From there, the PDF became essential to email-era document sharing: publishers sending manuscripts, patent applicants submitting diagrams, anyone needing a fixed, uneditable file. "There’s no other technology solving the problem the PDF solves," Johnson said. Websites render differently depending on browser and CSS; Word documents shift from machine to machine; links decay. PDFs remain identical regardless of viewer, time, or device. Johnson recently opened a 1995 PDF about the format itself, and it displayed flawlessly. "I would expect no less," he added.

At Hugging Face, researchers like Hynek Kydlíček discovered PDFs' untapped potential while processing the Common Crawl web archive for a five-billion-document multilingual dataset. They found 1.3 billion PDFs lurking within: high-quality data overlooked amid mostly HTML content. "That’s how we figured out that PDFs could be actually a super big and super high-quality source we can still train on," Kydlíček said. But extraction was tough, so they classified PDFs as easy (text-based) or hard (image-heavy), routing the latter to a modified olmOCR called RolmOCR, developed with Reducto. After filtering out oddities like horse racing results, they "liberated three trillion of the finest tokens" for training.
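
The easy/hard split can be sketched as a simple heuristic. The threshold and signals below are invented for illustration, not Hugging Face's actual classifier.

```python
# Hypothetical sketch of easy/hard PDF routing; the min_chars threshold
# and the image-count signal are invented, not the real pipeline's rules.

def classify_page(extracted_text: str, num_images: int,
                  min_chars: int = 200) -> str:
    """Label a page "easy" if it carries a usable embedded text layer,
    "hard" if it is image-heavy and needs a full OCR pass."""
    if len(extracted_text.strip()) >= min_chars and num_images == 0:
        return "easy"   # born-digital text: extract directly, cheaply
    return "hard"       # scanned or image-heavy: route to an OCR model

print(classify_page("Annual report body text ... " * 20, num_images=0))  # easy
print(classify_page("", num_images=4))                                   # hard
```

The payoff of routing like this is cost: only the hard fraction pays for an expensive vision-model pass, while text-based pages go through cheap direct extraction.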

Yet parsing PDFs for training data is a looser problem than the precision demanded in fields like law and engineering. Early tests showed models inventing text on blank pages or misdescribing images. "It’s solved in like 98 percent of cases, and like in many areas you always have this problem of getting these last 2 percent," Kydlíček said. He noted heavy investment in OCR as an economic use case for vision-language models. "I’m very certain that we will improve fairly fast, but because all these language models are probabilistic, there is just no way to guarantee it will be correct."

Reducto leads in specialized parsing, drawing on its founders' experience with self-driving cars. Abraham, who cofounded the company to manage AI conversation histories, pivoted after repeated requests for PDF parsing. "One of our core intuitions was all these documents were made for humans like you and I to interpret, and there’s a lot of visual information here that we take for granted," he said, such as paragraph gaps signaling new ideas or indentation marking subpoints. Reducto's approach segments each page into elements like headers, tables, and footnotes, then routes each one to a tailored model: tables go to table parsers, charts to axis and legend extractors. A vision model then corrects the outputs, enabling accurate chart-to-spreadsheet conversions, a boon for financial clients that still eludes larger general models.
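
That segment-then-route pattern can be sketched minimally. The handlers and region shapes below are invented stand-ins, not Reducto's models.

```python
# Hypothetical sketch of segment-then-route parsing: a layout step tags each
# page region with a type, and each type goes to a specialized handler.

def parse_table(region):
    return {"kind": "table", "rows": region["cells"]}

def parse_chart(region):
    # A real pipeline would extract axes and legend before reading values.
    return {"kind": "chart", "series": region["series"]}

def parse_text(region):
    return {"kind": "text", "content": region["text"]}

HANDLERS = {"table": parse_table, "chart": parse_chart}

def route(regions):
    """Send each segmented region to the parser for its element type,
    falling back to plain text handling for anything unrecognized."""
    return [HANDLERS.get(r["type"], parse_text)(r) for r in regions]

page = [
    {"type": "text", "text": "Q3 revenue summary"},
    {"type": "table", "cells": [["Q3", "1.2M"]]},
]
results = route(page)
print([r["kind"] for r in results])  # ['text', 'table']
```

Keeping the handlers separate is what lets each one be a small model tuned to its element type, with a correction pass over the combined output afterward.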

Abraham posted about their method in early 2024, sparking interest from developers stalled by PDFs. "This wasn’t supposed to be a pivot," he said. Now Reducto maintains a growing array of small models for multi-pass parsing. But challenges persist, akin to self-driving's "long tail" of rare scenarios. "I’ve seen the most insane documents you could imagine," Abraham said: nested PDFs, legal text both underlined and crossed out, scribbled medical faxes with connecting lines. "I don’t think PDFs are a fully solved problem. I wish that were the case. We’re close, but there’s still plenty to do."

PDFs show no signs of fading. Johnson points to Google Trends, where searches for "PDF" have risen steadily year after year, reflecting the format's role in carrying high-quality content. Past challengers to the PDF have vanished into obscurity. "What’s going to happen is that all the world’s systems will instead understand and use PDF better and better," he predicted. AI firms long sidestepped the format's difficulty until they realized it holds premium data. As investment pours in, from olmOCR to Reducto's tools, the unsexy work of PDF parsing could unlock vast archives, from Epstein's files to global records, transforming how professionals access information.

For now, the Epstein apps demonstrate the potential: Jflights' globe shows crisscrossing flight paths, each clickable through to its passenger manifest, while Jmail surfaces unsettling emails. But broader adoption hinges on closing that last 2 percent gap. With governments, lawyers, and engineers relying on PDFs as immutable records, solving this puzzle isn't just a technical curiosity; it's essential to AI's practical evolution.
