RenText
RenText is my ongoing effort to use computational tools to study Renaissance English texts published roughly between 1470 and 1700.
Data Overview
The data used in this project comes from the Text Creation Partnership (TCP).
The primary data currently consists of approximately 60,000 XML files, which can be accessed on the TCP’s Dropbox.
As described on this page, all of the 60K+ texts were hand-coded over the course of about 20 years. The nature of the EEBO (Early English Books Online) source texts, which are digitized microfilm (often quite old and occasionally poor-quality) of original hard copies, made and continues to make optical character recognition infeasible.
Data Updates
The latest updates from the TCP, available under “Phase II” on this page, indicate that several thousand additional TCP texts are forthcoming. According to the page, this release was planned for 2020. However, as of October 2022, these updates have not been released.
Tools
I’m using Python for all my exploratory data analysis and cleaning. My primary libraries are pandas, re, xml.etree.ElementTree, and os.
For visualization, I am using graph modeling with Neo4j.
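As a rough illustration of how re and xml.etree.ElementTree can work together when turning XML into readable text, here is a minimal sketch. The sample document, the element names (p for paragraphs, l for verse lines), and the function names are assumptions made for the example. It is not the project's actual script, and real TCP files are much larger and more varied.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified stand-in for a TCP file.
SAMPLE_XML = """<TEI>
  <teiHeader><fileDesc><titleStmt>
    <title>A Sample Title</title><author>Anon.</author>
  </titleStmt></fileDesc></teiHeader>
  <text><body>
    <p>First   paragraph of
       the text.</p>
    <l>A line of verse</l>
  </body></text>
</TEI>"""


def local_name(tag):
    """Drop any '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def extract_text(xml_string, wanted=("p", "l")):
    """Return the text of each wanted element with whitespace collapsed."""
    root = ET.fromstring(xml_string)
    chunks = []
    for elem in root.iter():
        if local_name(elem.tag) in wanted:
            raw = "".join(elem.itertext())  # include text of nested tags
            chunks.append(re.sub(r"\s+", " ", raw).strip())
    return [c for c in chunks if c]


lines = extract_text(SAMPLE_XML)
print("\n".join(lines))
```

Matching on the local tag name is a deliberate simplification so the same function works whether or not a file declares an XML namespace.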
Completed Work
Thus far, I have:
- Written a script that converts all .xml files to clean, human-readable .txt files
- Uploaded all .xml files to Amazon S3, enabling cloud-based computing
- Cleaned errors and anomalies in the titles’ publication dates, creating a cleaned dates file that can be used as a look-up table
- Written a script that returns basic metadata (author, title) for all titles for a given year
- Written a script that returns a random book’s information, including author, title, publication date, and sample paragraphs/lines of poetry
- Written a script to return instances of a word or phrase (i.e., an n-gram) for a given year
- Written a script to build a SQLite database with all primary metadata for all books in the TCP archive
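The metadata database and the per-year lookup described above could look something like the following sketch, which uses Python's built-in sqlite3 module. The table name, column names, and sample records are all hypothetical and chosen only for illustration. The real database schema may differ.

```python
import sqlite3

# Hypothetical records: (id, author, title, year). Not real TCP entries.
BOOKS = [
    ("X0001", "Author One", "A First Title", 1600),
    ("X0002", "Author Two", "A Second Title", 1600),
    ("X0003", "Author Three", "A Third Title", 1651),
]


def build_db(records, path=":memory:"):
    """Create a small metadata table and load the given records."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS books ("
        "tcp_id TEXT PRIMARY KEY, author TEXT, title TEXT, year INTEGER)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO books VALUES (?, ?, ?, ?)", records
    )
    conn.commit()
    return conn


def titles_for_year(conn, year):
    """Return (author, title) pairs for one publication year."""
    cursor = conn.execute(
        "SELECT author, title FROM books WHERE year = ? ORDER BY title",
        (year,),
    )
    return cursor.fetchall()


conn = build_db(BOOKS)
print(titles_for_year(conn, 1600))
```

The parameterized query (the ? placeholder) keeps the year value separate from the SQL text, which matters once user input reaches the database through an API.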
Plans
As of October 2022, my next steps include the following:
- Optimize processing time in existing scripts by incorporating list comprehensions and multiprocessing
- Publish full .txt files on S3
- Build an API to expose values in the SQLite database
- Build a website to publish returned API responses and provide download paths for full .txt files
- Build a Python package to simplify processing and analysis of the TCP archive
- Build a graph database for visualizing the archive’s metadata
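For the first item, a minimal sketch of what combining a list comprehension with multiprocessing might look like is shown here. The clean_text function is a hypothetical stand-in for whatever per-file work the existing scripts do, and the workers parameter is an assumption made for the example.

```python
import re
from multiprocessing import Pool


def clean_text(raw):
    """Collapse runs of whitespace; a stand-in for real per-file cleaning."""
    return re.sub(r"\s+", " ", raw).strip()


def clean_all(texts, workers=1):
    """Clean every text, either serially or across worker processes."""
    if workers <= 1:
        # A list comprehension replaces an explicit append loop.
        return [clean_text(t) for t in texts]
    with Pool(processes=workers) as pool:
        # Pool.map returns results in the same order as the inputs.
        return pool.map(clean_text, texts)
```

When using more than one worker, the call to clean_all should sit under an if __name__ == "__main__": guard, since some platforms start worker processes by re-importing the main module. Whether multiprocessing actually helps depends on how much work each file needs. For very small tasks, the overhead of starting processes can outweigh the gain.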
Codebase and Supporting Resources
The code for this project can be found on this GitHub repository.