Copy-pasting text out of a PDF is a coin flip — sometimes it works, sometimes you get gibberish or the page won’t let you select anything. Feeding a confidential report into an online extractor to get clean text is also not great when the document isn’t meant to leave your hands.
In WSL, you can pull the text out locally, for free, with pdftotext from the Poppler project. It handles one PDF or a whole folder, preserves layout if you want it, and uploads nothing.
No WSL yet? See the WSL install guide.
Install the tool
pdftotext is in poppler-utils:
sudo apt update && sudo apt install -y poppler-utils
Confirm:
pdftotext -v
Extract text from a PDF
Give it the PDF and an output filename:
pdftotext input.pdf output.txt
Leave off the output name and it writes input.txt next to the PDF. To see the result straight away in the terminal, send it to standard output with -:
pdftotext input.pdf -
That prints the whole document’s text to the screen — handy for a quick look or piping into a search.
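As a rough sketch of the "piping into a search" idea, the helper below sends the extracted text straight into grep. The function name pdf_grep is made up for this example and is not part of Poppler; only pdftotext and grep are real commands here.

```shell
# pdf_grep PATTERN FILE: search one PDF for a phrase without writing a text file.
# (pdf_grep is a hypothetical helper name, not a Poppler command.)
pdf_grep() {
    # "-" makes pdftotext write to standard output instead of a file.
    # grep -i ignores case; -n prefixes each match with its line number.
    pdftotext "$2" - | grep -i -n -- "$1"
}
```

Once the function is defined in your shell, pdf_grep "invoice" input.pdf lists every line of the extracted text that mentions the word, in any capitalization.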
Fix jumbled, multi-column text
Reports, papers, and newsletters with columns often come out interleaved because the default reading order guesses wrong. The -layout option keeps the visual arrangement:
pdftotext -layout input.pdf output.txt
This usually straightens out columns and tables. If the plain extraction looks scrambled, reach for -layout first.
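If it is not obvious which mode reads better for a given file, one option is to extract it both ways and compare. The sketch below assumes a helper called compare_layout (a name invented here) and writes two files next to the PDF.

```shell
# compare_layout FILE.pdf: extract the same PDF in default order and with
# -layout, then show where the two results differ.
# (compare_layout is a hypothetical helper name, not a Poppler command.)
compare_layout() {
    local pdf="$1"
    local base="${pdf%.pdf}"
    pdftotext "$pdf" "$base.plain.txt"
    pdftotext -layout "$pdf" "$base.layout.txt"
    # diff prints the differing lines and exits non-zero when the files differ.
    diff "$base.plain.txt" "$base.layout.txt"
}
```

Running compare_layout input.pdf leaves input.plain.txt and input.layout.txt behind, so you can open whichever version looks right.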
Extract only certain pages
Use -f (first) and -l (last) to limit the range:
pdftotext -f 2 -l 5 input.pdf output.txt
That pulls text from pages 2 through 5 only. Setting both to the same number grabs a single page.
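The same-number trick can be looped to get one text file per page. This is a minimal sketch: split_pages is a name invented for the example, and it relies on pdfinfo, which ships in the same poppler-utils package, to read the page count.

```shell
# split_pages FILE.pdf: write FILE-page1.txt, FILE-page2.txt, ... one per page.
# (split_pages is a hypothetical helper name, not a Poppler command.)
split_pages() {
    local pdf="$1"
    local base="${pdf%.pdf}"
    local pages n=1
    # pdfinfo prints a "Pages:" line; the second field is the page count.
    pages=$(pdfinfo "$pdf" | awk '/^Pages:/ {print $2}')
    # Stop if the page count could not be read.
    [ -n "$pages" ] || return 1
    while [ "$n" -le "$pages" ]; do
        # Same first and last page number extracts exactly one page.
        pdftotext -f "$n" -l "$n" "$pdf" "$base-page$n.txt"
        n=$((n + 1))
    done
}
```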
pdftotext options
| Command | What it does |
|---|---|
| pdftotext in.pdf out.txt | Extract all text to a file |
| pdftotext in.pdf - | Print text to the terminal |
| pdftotext -layout in.pdf out.txt | Preserve columns and layout |
| pdftotext -f 2 -l 5 in.pdf out.txt | Only pages 2 to 5 |
| pdftotext -enc UTF-8 in.pdf out.txt | Force UTF-8 output encoding |
Batch a whole folder
Turn every PDF in a folder into a matching text file:
for f in *.pdf; do pdftotext -layout "$f" "${f%.pdf}.txt"; done
report.pdf becomes report.txt, and so on, with the PDFs left in place.
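For folders you convert often, the loop can be wrapped in a small function that also copes with an empty folder and reports failures. Treat this as a sketch: convert_folder is a name made up for the example, and the messages it prints are just one possible choice.

```shell
# convert_folder [DIR]: convert every PDF in DIR (default: current folder).
# (convert_folder is a hypothetical helper name, not a Poppler command.)
convert_folder() {
    local dir="${1:-.}"
    local f
    for f in "$dir"/*.pdf; do
        # With no PDFs present the pattern stays literal, so skip it.
        [ -e "$f" ] || continue
        if pdftotext -layout "$f" "${f%.pdf}.txt"; then
            echo "converted: $f"
        else
            # Report the failure on standard error and keep going.
            echo "failed: $f" >&2
        fi
    done
}
```

Calling convert_folder ~/reports processes that folder, and the quoting keeps filenames with spaces intact.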
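The one case this can’t handle: scanned PDFs
A scanned PDF is a picture of each page with no text layer underneath, so pdftotext has nothing to extract. Those files need OCR first.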
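One way to spot such files before a batch run is to check whether pdftotext returns anything other than whitespace. The function below is a heuristic sketch, not a definitive test, and has_text_layer is a name invented for this example.

```shell
# has_text_layer FILE.pdf: succeed if the PDF yields any visible text.
# (has_text_layer is a hypothetical helper name, not a Poppler command.)
has_text_layer() {
    # Page breaks come out as form feeds, which count as whitespace,
    # so a scan with no text layer produces no match here.
    pdftotext "$1" - 2>/dev/null | grep -q '[^[:space:]]'
}
```

A call such as has_text_layer input.pdf || echo "input.pdf probably needs OCR" flags the files worth checking by hand.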
Wrapping up
Extracting text from a PDF on Windows is one command: pdftotext input.pdf output.txt, with -layout when columns get scrambled and -f/-l to limit the pages. A short loop clears a whole folder. The only thing it can’t do is read scanned pages — those need OCR first.
It’s free and runs in WSL, so even sensitive documents stay on your machine. While you’re working with PDFs, the same Poppler package powers converting PDF pages to images.