This was a week in which I tried to pull place names from a TEI-encoded IMH article, first using XSLT and then using my newly acquired Python skills. For me, the lack of straightforward looping made XSLT a no-go. Unfortunately, even after adapting Python code I found on the internet, I wasn't able to automate the process smoothly. I'm still thinking about ways to do it, but in the meantime I have moved on to creating a map with the data we do have, and have hand-generated a KML file for one article to show the difference in accuracy given our current encoding issues.
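For anyone curious what the Python side of this looks like, here is a minimal sketch of the extraction step using only the standard library. The sample document is invented for illustration; the namespace is the standard TEI one, but real IMH files may be encoded differently.

```python
import xml.etree.ElementTree as ET

# Standard TEI namespace; adjust if your documents use another.
TEI = "{http://www.tei-c.org/ns/1.0}"

def extract_place_names(xml_text):
    """Return the text of every <placeName> element, in document order."""
    root = ET.fromstring(xml_text)
    return ["".join(el.itertext()).strip() for el in root.iter(TEI + "placeName")]

# A made-up TEI fragment showing the IMH-style encoding:
sample = ('<p xmlns="http://www.tei-c.org/ns/1.0">He moved to '
          '<placeName>Chicago</placeName>, <placeName>Illinois</placeName> '
          'in 1871.</p>')

print(extract_place_names(sample))  # ['Chicago', 'Illinois']
```

Note that this faithfully reproduces the duplication problem described below: "Chicago, Illinois" comes out as two separate names.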
To that end, I read Erhard Rahm and Hong Hai Do's "Data Cleaning: Problems and Current Approaches." The first sentence that struck home was: "For instance, duplicated or missing information will produce incorrect or misleading statistics ('garbage in, garbage out')." Unfortunately, IMH data is rife with duplication at the moment. For example, "Chicago, Illinois" is currently encoded as two place names, "Chicago" and "Illinois," leading to inaccuracies in mapping: "Illinois" will yield a point in the middle of the state, and it does not even register as a double count of "Chicago."
The article goes on to discuss the wrapper used for data extraction, which is exactly where the IMH problem lies. The <placeName> wrapper has been applied to "Chicago" and "Illinois" separately, rather than enclosing them together, and two TGN numbers have been assigned. Another transformation will have to be applied, one that will never be 100% accurate, that throws out any <placeName> wrapper immediately following a comma (thus, any lists of cities, etc. would also fall prey). The authors are quick to point out that cleaning should occur in a separate "staging area" from the data warehouse; only clean data should enter the warehouse (2).
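To make the lossiness of that transformation concrete, here is one possible sketch of it, again on an invented sample. It discards any <placeName> whose immediately preceding text is just a comma, which correctly collapses "Chicago, Illinois" to "Chicago" but would also mangle a genuine comma-separated list of distinct places.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"

def drop_comma_followers(xml_text):
    """Extract every <placeName>, but throw out any one that directly
    follows a comma (the lossy heuristic described above). Sketch only:
    it assumes the placeNames are siblings in the same paragraph, as in
    the IMH-style encoding, and it will wrongly drop members of genuine
    comma-separated lists of distinct places."""
    root = ET.fromstring(xml_text)
    places = list(root.iter(TEI + "placeName"))
    kept = []
    for i, el in enumerate(places):
        # Text between the previous placeName and this one:
        prev_tail = (places[i - 1].tail or "").strip() if i > 0 else ""
        if prev_tail == ",":   # this wrapper directly follows a comma
            continue           # throw it out
        kept.append("".join(el.itertext()).strip())
    return kept

sample = ('<p xmlns="http://www.tei-c.org/ns/1.0">Born in '
          '<placeName>Chicago</placeName>, <placeName>Illinois</placeName>, '
          'he later visited <placeName>Paris</placeName>.</p>')

print(drop_comma_followers(sample))  # ['Chicago', 'Paris']
```

A merging variant ("Chicago" + "Illinois" becomes "Chicago, Illinois") would preserve more information for geocoding, but either way the rule is a heuristic, never 100% accurate.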
They next speak to the need to apply data transformations when schema-level changes are being made, which is exactly what will need to happen at the IMH. All 100+ years of the TEI-encoded issues have the same problem. Luckily, this is a single-source problem, so ensuring that cleaning happens consistently across the data is not as complex as it could be.
The most helpful part of the article comes when the authors lay out their approach to data cleaning:

1. Data analysis
2. Definition of transformation workflows and mapping rules
3. Verification (This is so important! Did you actually fix the problem?)
4. Transformation
5. Backflow of cleaned data (Another key idea: if a clean set of place names is generated for geocoding, it is also important to populate the original articles with this cleaned data so that any future extractors do not run into the same headaches.)