You copy a table from a PDF into a spreadsheet and it turns into garbage. Columns collapse. Numbers jump rows. The formatting you saw on screen was a lie.

This happens because PDFs were built for print, not for data. A table in a PDF is not actually a table. It is text and lines placed at specific coordinates to look like one. There is no hidden grid underneath.

Why PDF tables are hard to extract

A PDF stores content as drawing instructions. "Put this character at (x, y). Draw a line from here to there." The file has no concept of a row or a column. It just knows where things sit on the page.

So any extraction tool has to reconstruct the table from scratch. It reads the text positions, guesses where columns start and end, and tries to group things into rows. That is a hard problem, and it gets worse with real-world documents.
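To make that reconstruction step concrete, here is a minimal sketch in Python. It assumes you already have words with page coordinates from some PDF library; the `Word` class, the function names, and the tolerance values are all made up for illustration. It also assumes left-aligned columns, which real documents often violate.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x: float  # left edge of the word on the page
    y: float  # vertical position, measured from the top of the page


def group_rows(words, y_tol=2.0):
    """Group words whose vertical positions are within y_tol of each other."""
    rows = []
    for w in sorted(words, key=lambda w: (w.y, w.x)):
        if rows and abs(rows[-1][0].y - w.y) <= y_tol:
            rows[-1].append(w)
        else:
            rows.append([w])
    return [sorted(r, key=lambda w: w.x) for r in rows]


def column_starts(rows, x_tol=5.0):
    """Guess column positions by clustering the left edges of all words."""
    starts = []
    for x in sorted(w.x for r in rows for w in r):
        if not starts or x - starts[-1] > x_tol:
            starts.append(x)
    return starts


def to_grid(words, y_tol=2.0, x_tol=5.0):
    """Rebuild a rows-by-columns grid from loose positioned words."""
    rows = group_rows(words, y_tol)
    starts = column_starts(rows, x_tol)
    grid = []
    for r in rows:
        cells = [""] * len(starts)
        for w in r:
            # Assign each word to the nearest guessed column start.
            col = min(range(len(starts)), key=lambda i: abs(starts[i] - w.x))
            cells[col] = (cells[col] + " " + w.text).strip()
        grid.append(cells)
    return grid


words = [
    Word("Region", 50, 100), Word("Revenue", 200, 100),
    Word("North", 50, 115), Word("1,200", 201, 115.5),
    Word("South", 50, 130), Word("950", 200, 130),
]
grid = to_grid(words)
```

Notice how much rides on the two tolerances. Set them slightly wrong for a given document and rows merge or columns split, which is exactly the guessing problem described above.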

A few things that regularly cause failures:

Different PDFs use wildly different spacing between columns. A gap that looks obvious to you can be ambiguous to software.
Merged cells break the grid assumption. A cell that spans two columns throws off alignment for everything below it.
Long tables that span multiple pages may repeat headers, or just stop and continue with bare data on the next page. Repeated headers show up as stray rows in the middle of your data; bare continuations leave you to work out which column is which.
Scanned PDFs are the worst case. The text has to go through OCR first, which adds another layer of potential errors before table detection even starts.
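The multi-page case is one you can often patch up yourself after extraction. The sketch below assumes each page has already been extracted into a list of rows, and that the first page starts with the header row. `stitch_pages` is a hypothetical helper, not part of any particular tool.

```python
def _norm(row):
    """Normalize a row for comparison: trim whitespace, ignore case."""
    return [cell.strip().casefold() for cell in row]


def stitch_pages(pages):
    """Merge per-page tables into one, keeping the header only once.

    pages: list of tables, each a list of rows (lists of strings).
    Assumes pages[0][0] is the header row.
    """
    header = pages[0][0]
    merged = [header]
    for rows in pages:
        # Drop the first row of a page only if it repeats the header.
        if rows and _norm(rows[0]) == _norm(header):
            merged.extend(rows[1:])
        else:
            merged.extend(rows)
    return merged


pages = [
    [["Region", "Revenue"], ["North", "1,200"]],
    [["Region", "Revenue"], ["South", "950"]],   # header repeated
    [["East", "700"]],                           # bare continuation
]
table = stitch_pages(pages)
```

This handles both styles of page break, but it trusts that a continuation page has the same column order as the first page. Check that before relying on it.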

The manual approach

The simplest method is select, copy, paste into a spreadsheet. For a two-column table with ten rows, this works fine. For anything more complex, you will spend longer fixing the output than it would take to just retype it.

Retyping is actually underrated for small tables. It is fast and accurate if you are careful. It just does not scale past a dozen rows or so.

Using an extraction tool

Dedicated tools analyze the visual layout of a PDF page, detect table regions, and reconstruct the data into rows and columns you can actually use. This is the approach that scales.

unPDF, for example, lets you upload a document, describe what you need in plain language ("the revenue table on page 3"), and download the result as CSV or Excel. The natural language part helps when a document has several tables and you only care about one.

No tool is perfect, though. Complex layouts with nested headers or irregular cell sizes still trip up automated extraction. Always check the output.
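One cheap check you can run on any tool's output is whether every row has the same number of cells as the header. Merged cells and misjudged column gaps tend to show up as rows that are too short or too long. A minimal sketch, with `ragged_rows` as a made-up name:

```python
def ragged_rows(table):
    """Return (row_index, cell_count) for rows that do not match the header width."""
    expected = len(table[0])
    return [(i, len(row)) for i, row in enumerate(table) if len(row) != expected]


table = [
    ["Region", "Q1", "Q2"],
    ["North", "10", "12"],
    ["South", "9"],             # a cell went missing
    ["East", "7", "8", "x"],    # a column was split in two
]
problems = ragged_rows(table)
```

A clean result here does not prove the table is right. It only rules out the most obvious structural damage.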

Tips for better results

If you can get the original file that generated the PDF (the Word doc, the Excel sheet, the LaTeX source), skip extraction entirely and use that. It will almost always be cleaner than anything recovered from the PDF.

When you do need to extract from the PDF itself:

Review the output, especially numbers and dates. A misplaced decimal point can be expensive.
Be specific about which table you want. "The table on page 5" is better than "all the tables." You get fewer errors and less noise.
Digitally created PDFs extract much better than scans or photos. If you have a choice, always work with the digital version.
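Reviewing numbers and dates by eye gets tiring fast, so it helps to let a script point at the cells worth a second look. The sketch below is one way to do it. The function names are hypothetical, and it assumes US-style amounts (comma as thousands separator, parentheses for negatives) and ISO dates. Adjust both for your documents.

```python
from datetime import datetime
from decimal import Decimal, InvalidOperation


def parse_amount(cell):
    """Parse an amount like '1,200.50' or '(300)'; return None if it is not a number."""
    cleaned = cell.strip().replace(",", "").replace("$", "")
    if cleaned.startswith("(") and cleaned.endswith(")"):
        cleaned = "-" + cleaned[1:-1]  # accounting-style negative
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None


def suspicious_cells(rows, amount_col, date_col, date_format="%Y-%m-%d"):
    """Return (row_index, col_index, value) for cells that fail to parse."""
    flagged = []
    for i, row in enumerate(rows):
        if parse_amount(row[amount_col]) is None:
            flagged.append((i, amount_col, row[amount_col]))
        try:
            datetime.strptime(row[date_col].strip(), date_format)
        except ValueError:
            flagged.append((i, date_col, row[date_col]))
    return flagged


rows = [
    ["2024-01-31", "1,200.50"],
    ["2024-02-30", "950"],      # impossible date
    ["2024-03-31", "1.2O0"],    # letter O instead of zero, a typical OCR slip
]
flagged = suspicious_cells(rows, amount_col=1, date_col=0)
```

This catches values that are malformed, not values that are wrong. A decimal point that moved one place still parses as a perfectly good number, so spot-check totals against the PDF as well.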

After extraction

Getting the raw data out is usually step one. You might still need to clean it, merge it with another dataset, or load it into a database. CSV and Excel are good intermediate formats because almost everything can import them.
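As an example of the database step, here is a small sketch that loads extracted CSV text into SQLite using only the Python standard library. The table name, column names, and the `load_csv` helper are made up for illustration, and the table and column names are assumed to come from a trusted source since they are placed directly into the SQL.

```python
import csv
import io
import sqlite3


def load_csv(conn, table_name, csv_text):
    """Create a table from the CSV header row and insert the remaining rows."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    cols = ", ".join(f'"{name}"' for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE "{table_name}" ({cols})')
    conn.executemany(
        f'INSERT INTO "{table_name}" VALUES ({placeholders})', reader
    )
    conn.commit()


csv_text = "region,revenue\nNorth,1200\nSouth,950\n"
conn = sqlite3.connect(":memory:")
load_csv(conn, "revenue", csv_text)

# Values arrive as text, so cast before doing arithmetic.
total = conn.execute(
    "SELECT SUM(CAST(revenue AS INTEGER)) FROM revenue"
).fetchone()[0]
```

Everything is stored as text here on purpose. Deciding column types is part of the cleanup, and it is easier to do once you can see what the extraction actually produced.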

The real win is going from "I can see this data but cannot touch it" to "I have it in a format I can query." How long that takes depends on the PDF. Some take seconds. Some take an afternoon of cleanup. A decent tool does not remove the cleanup, but it moves more documents toward the seconds end.


How to Extract Tables from PDF Documents