RenText

RenText is my ongoing effort to use computational tools to study Renaissance English texts published roughly between 1470 and 1700.

Data Overview

The data for this project comes from the Text Creation Partnership (TCP).

The primary data currently consists of approximately 60,000 XML files, which can be accessed on the TCP’s Dropbox.

As described on this page, all of the 60,000+ texts were hand-coded over the course of about 20 years. The EEBO (Early English Books Online) source images are digitized microfilm of original hard copies, often quite old and occasionally poor in quality, which has made optical character recognition infeasible and continues to do so.

Data Updates

The latest updates from the TCP, available under “Phase II” on this page, indicate that several thousand additional texts are forthcoming. According to the page, the release was planned for 2020; as of October 2022, however, it has not appeared.

Tools

I’m using Python for all my exploratory data analysis and cleaning. My primary libraries are pandas, re, xml.etree.ElementTree, and os.
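As a rough illustration of how re and xml.etree.ElementTree can work together on this kind of data, here is a minimal sketch that pulls readable text out of an XML string. It is not the project's actual script. The function names, the sample document, and the assumption that the transcription sits under a TEI-style text element are all hypothetical and would need checking against the real TCP files.

```python
import re
import xml.etree.ElementTree as ET


def local_name(tag):
    """Return an element tag without any '{namespace}' prefix."""
    return tag.rsplit("}", 1)[-1]


def xml_to_text(xml_string, body_tag="text"):
    """Extract readable text from the first <body_tag> element.

    Assumption: the transcription lives under a TEI-style <text> element.
    If no such element exists, the whole document is used instead.
    """
    root = ET.fromstring(xml_string)
    body = next(
        (el for el in root.iter() if local_name(el.tag) == body_tag),
        root,
    )
    raw = "".join(body.itertext())
    # Collapse the runs of whitespace left behind once markup is removed.
    return re.sub(r"\s+", " ", raw).strip()


# Hypothetical sample document, for illustration only.
sample = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><title>A Sample Title</title></teiHeader>
  <text><body><p>Whan that  Aprille
  with his <hi>shoures</hi> soote</p></body></text>
</TEI>"""

print(xml_to_text(sample))
```

Matching on the local tag name keeps the sketch independent of whichever XML namespace a given file declares.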

For visualization, I am using graph modeling with Neo4j.

Completed Work

Thus far, I have:

Written a script that converts all .xml files to clean, human-readable .txt files
Uploaded all .xml files to Amazon S3, enabling cloud-based computing
Cleaned errors and anomalies in the titles’ publication dates, creating a cleaned dates file that can be used as a lookup table
Written a script that returns basic metadata (author, title) for all titles for a given year
Written a script that returns a random book’s information, including author, title, publication date, and sample paragraphs/lines of poetry
Written a script to return instances of a word or phrase (i.e., an n-gram) for a given year
Written a script to build a SQLite database with all primary metadata for all books in the TCP archive
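To make the metadata items above more concrete, here is a minimal sketch of a SQLite table holding basic metadata, plus a query that returns author and title for a given year. The table name, column names, and the placeholder records are assumptions made for illustration; they are not the schema or the data of the actual database.

```python
import sqlite3

# Placeholder records standing in for metadata parsed from the XML files.
# The IDs, authors, titles, and years are invented for this example.
records = [
    ("demo-1", "Author A", "First Placeholder Title", 1599),
    ("demo-2", "Author B", "Second Placeholder Title", 1600),
    ("demo-3", "Author C", "Another Placeholder Title", 1600),
]


def build_db(conn, rows):
    """Create the (hypothetical) books table and load the given rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS books ("
        "tcp_id TEXT PRIMARY KEY, author TEXT, title TEXT, year INTEGER)"
    )
    conn.executemany("INSERT OR REPLACE INTO books VALUES (?, ?, ?, ?)", rows)
    conn.commit()


def titles_for_year(conn, year):
    """Return (author, title) pairs for one publication year, sorted by title."""
    cur = conn.execute(
        "SELECT author, title FROM books WHERE year = ? ORDER BY title",
        (year,),
    )
    return cur.fetchall()


conn = sqlite3.connect(":memory:")  # in-memory database for the demo
build_db(conn, records)
print(titles_for_year(conn, 1600))
```

Storing the year as an integer is what makes a cleaned dates lookup table useful: the by-year query becomes a simple equality filter.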

Plans

As of October 2022, my next steps include the following:

Optimize processing time in existing scripts by incorporating list comprehensions and multiprocessing
Publish full .txt files on S3
Build an API to expose values in the SQLite database
Build a website to publish returned API responses and provide download paths for full .txt files
Build a Python package to simplify processing and analysis of the TCP archive
Build a graph database for visualizing the archive’s metadata
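The first item in this list could take a shape like the following sketch, which counts a phrase across several texts using a list comprehension and the standard library's multiprocessing.Pool. The function names and the sample strings are hypothetical, and whether parallelism actually helps depends on file sizes and I/O, so treat this as a starting point and not as a measured optimization.

```python
import re
from multiprocessing import Pool


def count_phrase(args):
    """Count case-insensitive, whole-word occurrences of a phrase in one text."""
    text, phrase = args
    pattern = r"\b" + re.escape(phrase) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


def count_phrase_parallel(texts, phrase, workers=2):
    """Apply count_phrase to every text using a pool of worker processes."""
    # A list comprehension builds the per-text argument tuples in one pass.
    jobs = [(text, phrase) for text in texts]
    with Pool(processes=workers) as pool:
        return pool.map(count_phrase, jobs)


if __name__ == "__main__":
    # Placeholder strings; in practice these would be read from the .txt files.
    texts = ["The king and the King's men.", "No monarch is named here."]
    print(count_phrase_parallel(texts, "king"))
```

The worker function takes a single tuple so that it can be passed directly to pool.map, and the main guard keeps worker processes from re-running the demo when the module is imported.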

Codebase and Supporting Resources

The code for this project can be found in this GitHub repository.
