Working with PDFs

The following tools focus on working with PDFs in your research corpus. For managing a complete archive including images and metadata, please read the recommended platforms for managing your personal archive.

Creating PDFs

You may be able to find PDFs of research materials you need from Library databases and other resources such as HathiTrust. If you need to digitize text to create your own PDFs, we have some advice for making high-quality scans that will work well with optical character recognition (OCR). Some general tips:

  • Use black or white paper in between pages to prevent bleed-through of text from other pages into your scans.
  • Scan on a dark surface. Note that this can cause auto-cropping problems with art books in which objects are reproduced on a black background; placing colorful paper behind the page to create artificial contrast can help.
  • Keep an eye on your lighting source to avoid glare, especially with glossy pages and/or a tight binding.

PDF Scanning at Home

Phone or Tablet

Turn your phone or tablet into a scanner with an app! Box Capture syncs with your free Box account to easily create multi-page PDFs. Other apps such as Genius Scan work well, too. A phone or tablet tripod can help make the scanning process more efficient and accurate.

Document Cameras

A document camera, such as one from IPEVO, can be a sound investment if you need to make a lot of PDFs at home or on the go. Some archives allow the use of document cameras in reading rooms, which may be another point in their favor.

OCR (Optical Character Recognition)

OCR, or optical character recognition, is the mechanical conversion of an image, scanned document, photograph, or PDF of text into machine-encoded text that allows users to:

  • Edit, search, or display text (CTRL+F)
  • Use text-to-speech tools
  • Engage in data mining, text mining, and text analysis
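
As a rough illustration of the first and last points, the sketch below searches OCR output for a term and counts the most frequent words. It assumes the OCR text has already been exported as plain text; the function names (`tokenize`, `find_lines`, `top_words`) and the sample sentence are hypothetical and are not part of any tool described in this guide.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into words made of letters,
    keeping internal apostrophes (e.g. "don't")."""
    return re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)*", text.lower())


def find_lines(text, term):
    """Return (line number, line) pairs whose line contains term, ignoring case."""
    term = term.lower()
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if term in line.lower()
    ]


def top_words(text, n=5):
    """Return the n most frequent words with their counts."""
    return Counter(tokenize(text)).most_common(n)


if __name__ == "__main__":
    # Hypothetical sample standing in for the text exported from an OCRed PDF.
    sample = "The archive holds letters.\nLetters from the archive were scanned."
    print(find_lines(sample, "letters"))
    print(top_words(sample, 3))
```

Real text-analysis projects usually need more careful handling of OCR errors, hyphenation, and stop words than this sketch provides.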

Software for Printed Text

ABBYY FineReader Pro

The best version of ABBYY FineReader Pro runs on PCs ($200), not Macs ($120). ABBYY supports 192 languages, although its strengths are languages that use the Latin and Cyrillic alphabets and modern Chinese, Japanese, and Korean. It does not work well with languages written in script-like fonts, such as Arabic or Tamil. For archaic versions of contemporary languages, its dictionaries may need additional training to increase the accuracy of the OCR. ABBYY's main strength, however, is that it is highly trainable and customizable.

ABBYY can import PDFs, photos, scans, and more, and export OCR content to PDF, Microsoft Word, Excel, PowerPoint, plain text, CSV, and so on, which can be used in digital humanities projects.

Tesseract

Tesseract is a free and open-source OCR engine that is currently supported by Google. It requires slightly more tech savvy than ABBYY to get up and running. It is good for batch processing large sets of files, but it must be run from the command line. It offers a robust set of export options for use in digital humanities projects.
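
To give a sense of what batch processing with Tesseract can look like, here is a minimal sketch in Python that runs the `tesseract` command on every page image in a folder. It assumes Tesseract is installed and on your PATH, that the needed language data (here `eng`) is available, and that your pages are saved as image files, since Tesseract reads images rather than PDFs. The function names and folder names are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, output_dir, lang="eng", output_format="pdf"):
    """Return the argument list for one Tesseract run.

    The command line has the shape:
        tesseract <image> <output base> -l <language> <format>
    The output base has no extension; Tesseract adds .pdf or .txt itself.
    """
    image_path = Path(image_path)
    output_base = Path(output_dir) / image_path.stem
    return ["tesseract", str(image_path), str(output_base), "-l", lang, output_format]


def ocr_folder(input_dir, output_dir, lang="eng", output_format="pdf"):
    """Run Tesseract on every page image in input_dir and return the commands run."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("Tesseract is not installed or not on PATH")
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    extensions = {".png", ".jpg", ".jpeg", ".tif", ".tiff"}
    commands = []
    for image in sorted(Path(input_dir).iterdir()):
        if image.suffix.lower() in extensions:
            command = build_tesseract_command(image, output_dir, lang, output_format)
            subprocess.run(command, check=True)  # stop if Tesseract reports an error
            commands.append(command)
    return commands


if __name__ == "__main__":
    # Hypothetical folders: page images in "scans", searchable PDFs written to "ocr_out".
    ocr_folder("scans", "ocr_out", lang="eng", output_format="pdf")
```

Passing `output_format="txt"` instead writes plain-text files, which are often easier to feed into text analysis tools.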

Adobe Acrobat DC

Adobe Acrobat Document Cloud works well for straightforward scans and for making text searchable (CTRL+F), but it is unlikely to be suitable for digital humanities analysis.

Batch editing in Acrobat

Software for Handwritten Text

Google Lens

Google Lens is an AI-powered recognition tool that can reverse-image search files and can additionally transcribe and translate text from an image. Google Lens tends to be more accurate in recognizing handwriting than Transkribus.

Transkribus

From the European cooperative READ-COOP, Transkribus is AI-powered transcription software that can be trained to work on handwritten archival documents. Sign up to get 500 free pages; after that, there is a charge.

Tropy

If you’re interested in doing your own manual transcription, Tropy’s Notes feature is a good place to transcribe or translate documents. The transcribed and/or translated notes are searchable within Tropy, but the documents themselves are not “OCRed” and are not in a format ready to use for digital humanities projects if you’re planning something more robust.

Campus Experts and Resources for OCR

VRC Staff are happy to discuss strategies for making high-quality PDFs and OCR workflows. However, we recommend talking to our expert campus colleagues about any digital humanities or text analysis projects you might be planning, as they can help with specific tools, platforms, code, and more.

Campus Experts

  • Carmen Caswell, Digital Humanities Research Liaison, Division of the Humanities
  • Jeffrey Tharsen, Computational Scientist for the Digital Humanities, Research Computing Center

Resources

Platforms for Combined Note-taking and PDF Management

Citation Management Tools

The Library provides great support for managing bibliographic citations, including this guide to Citation Management as well as workshops on specific tools.

Zotero is the citation management tool of choice for many in the Humanities and Social Sciences. It’s free and open-source, and works well with other free tools on campus such as Box.

Storage and Backup

We recommend saving a backup of your PDF library to a cloud-based and/or external storage platform. Please read our guide to Storage and Backup for more information.