RenText

RenText is my ongoing effort to use computational tools to study Renaissance English texts published between roughly 1470 and 1700.

Data Overview

The data used in this project comes from the Text Creation Partnership (TCP).

The primary data currently consists of approximately 60,000 XML files, which can be accessed on the TCP’s Dropbox.

As described on this page, all of the 60,000+ texts were hand-coded over the course of about 20 years. The EEBO (Early English Books Online) source texts are digitized microfilm of original hard copies, often quite old and occasionally of poor quality, which made and continues to make optical character recognition infeasible.

Data Updates

The latest updates from the TCP, available under “Phase II” on this page, indicate that several thousand additional TCP texts are forthcoming. The page states that this release was planned for 2020; as of October 2022, however, the texts have not been released.

Tools

I’m using Python for all my exploratory data analysis and cleaning. My primary libraries are pandas, re, xml.etree.ElementTree, and os.
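As a rough illustration of how re and xml.etree.ElementTree can work together on a single file, the sketch below pulls the running text out of an XML string and normalizes its whitespace. It is not the project's actual script: the function names are hypothetical, and it assumes the transcription sits under an element whose local name is `text`, as in TEI-style markup.

```python
import re
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop any '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def xml_to_text(xml_string):
    """Return the whitespace-normalized text found under the first <text> element.

    Assumption: the body of the work lives under an element named 'text'.
    If no such element exists, the whole document is used instead.
    """
    root = ET.fromstring(xml_string)
    body = next((el for el in root.iter() if local_name(el.tag) == "text"), root)
    # Join fragments without a separator so that words split by inline tags
    # (for example a decorated initial letter) are not broken apart.
    raw = "".join(body.itertext())
    return re.sub(r"\s+", " ", raw).strip()
```

Joining with an empty string keeps words intact, but it relies on the source XML having whitespace between block-level elements; files without it would need a per-element separator.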

For visualization, I am using graph modeling with neo4j.


Completed Work

Thus far, I have:

  • Written a script that converts all .xml files to clean, human-readable .txt files
  • Uploaded all .xml files to Amazon S3, enabling cloud-based computing
  • Cleaned errors and anomalies in the titles’ publication dates, creating a cleaned dates file that can be used as a lookup table
  • Written a script that returns basic metadata (author, title) for all titles for a given year
  • Written a script that returns a random book’s information, including author, title, publication date, and sample paragraphs/lines of poetry
  • Written a script to return instances of a word or phrase (i.e., an n-gram) for a given year
  • Written a script to build a SQLite database with all primary metadata for all books in the TCP archive
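
The metadata database and the per-year lookup described above could be sketched along the following lines using Python's built-in sqlite3 module. The table name, column names, and function names are assumptions made for illustration, not the schema the project actually uses, and the `tcp_id` values in the usage note are placeholders.

```python
import sqlite3


def build_metadata_db(records, db_path=":memory:"):
    """Create (or reuse) a 'books' table and load one row per record.

    Each record is a dict with the hypothetical keys
    'tcp_id', 'author', 'title', and 'year'.
    """
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS books ("
        "tcp_id TEXT PRIMARY KEY, author TEXT, title TEXT, year INTEGER)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO books VALUES (:tcp_id, :author, :title, :year)",
        records,
    )
    conn.commit()
    return conn


def titles_for_year(conn, year):
    """Return (author, title) pairs for every book recorded for the given year."""
    rows = conn.execute(
        "SELECT author, title FROM books WHERE year = ? ORDER BY title",
        (year,),
    )
    return rows.fetchall()
```

For example, after loading a record such as `{"tcp_id": "demo-1", "author": "Milton, John", "title": "Paradise Lost", "year": 1667}`, calling `titles_for_year(conn, 1667)` returns a list containing that author and title.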

Plans

As of October 2022, my next steps include the following:

  • Optimize processing time in existing scripts by incorporating list comprehensions and multiprocessing
  • Publish full .txt files on S3
  • Build an API to expose values in the SQLite database
  • Build a website to publish returned API responses and provide download paths for full .txt files
  • Build a Python package to simplify processing and analysis of the TCP archive
  • Build a graph database for visualizing the archive’s metadata
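
The first item above, combining list comprehensions with multiprocessing, might look something like the following sketch. The token counter is only a stand-in for whatever per-file work the real scripts do, and all names, the worker count, and the chunk size are illustrative assumptions.

```python
import re
from multiprocessing import Pool

# Simple word-like token pattern; a stand-in for real per-document processing.
TOKEN_RE = re.compile(r"[A-Za-z']+")


def count_tokens(text):
    """Count word-like tokens in one document."""
    return len(TOKEN_RE.findall(text))


def count_all_serial(texts):
    """Baseline: a list comprehension over every document in one process."""
    return [count_tokens(t) for t in texts]


def count_all_parallel(texts, workers=4):
    """Same result as count_all_serial, with the work spread across processes.

    Call this from under an `if __name__ == "__main__":` guard so that worker
    processes can import the module safely on every platform.
    """
    with Pool(processes=workers) as pool:
        # pool.map preserves input order, so results line up with `texts`.
        return pool.map(count_tokens, texts, chunksize=64)
```

Whether the parallel version is actually faster depends on how expensive the per-file work is compared with the cost of sending each text to a worker, so it is worth timing both on a sample of the archive.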

Codebase and Supporting Resources

The code for this project can be found on this GitHub repository.