Strategies for Visualizing Text

My second entry focused on my obsession with Paradise Lost and using Franco Moretti’s ideas to create a literary map. While intellectually intriguing, this style of mapping does not lend itself to large datasets. Instead, I envision users of an IMH mapping program selecting the issues/articles they wish to see mapped and then having an interface do the dirty work for them. Some pre-loaded selections and filters will be available to get people started.

The Visualizing Emancipation project uses “Emancipation Event Types” and gives a drop-down menu of options. Reading through the IMH itself, dedicated issues would be a great place to begin: Native Americans, desegregation, and railroads would give the user ideas of filter topics and of what they look like when applied. Time itself is another key filter option, and ideally a SIMILE widget would sit at the bottom of mapping pages, allowing the unfolding of ideas against time to be mapped.

Finally, I read about Neatline, a new project out of the Scholars’ Lab at Virginia. This would require a re-thinking of presentation, as Neatline is Omeka-based, but its plug-ins allow for sophisticated visualizations of trends by someone curating an exhibit about the IMH over time. A great example is a Lovecraft exhibit designed by a UVa undergraduate, which connects passages in his writing to his hometown. By creating some Neatline tie-ins, the text of the IMH, which is the real star of any mapping project concerning it, would come to the fore, and the challenge of mapping ideas could be grappled with in a more satisfactory way.

Reading about Google Fusion Tables and More Advanced Mapping

It is very easy to make simple maps with Google Fusion Tables, but I wanted to dip into some of the more sophisticated offerings. To that end, I began reading about what others have done. Mary Jo Webster at The Data Mine Blog has a very simple introduction to her first attempt to create a “custom intensity map.” With options like custom intensity (think of heat maps), Google has gotten away from my least favorite part of version 2, wherein every Google map looked essentially the same–there was little the maker could do to customize the map. Her blog led me to a great tutorial by Michelle Minkoff, “How to Combine Multiple Fusion Tables into One Map,” on the JavaScript behind these custom maps.

She showed me how to create individual tables and then get their ID numbers for use on a single map. These can be passed through a JavaScript call or, if you are not comfortable with JavaScript, through the Google FusionTablesLayer Wizard. And voilà!

[Map: one article vs. the entire issue]

Google Fusion Tables!

I have discovered the magic that is Google Fusion Tables, and I can’t believe how much I can do with it. I’m still working out how to use it effectively, where geocoding comes in, and how to create intensity maps, but I’m really excited by the examples provided. (ETA: I need to use these country codes for intensity maps: http://en.wikipedia.org/wiki/ISO_3166-1_alpha-2#Officially_assigned_code_elements.)

For the IMH, it will be important to geocode ahead of time as: “There are daily limits on free geocodes, so Fusion Tables can only geocode up to that amount. If you have a very large data set, you’ll need to manually geocode repeatedly over a series of days until all your data is entirely geocoded. The ungeocoded rows will be highlighted in yellow.”

I’m still getting some strange yellow rows–for example, “Bloomington, Indiana” came up yellow when other cities did not. I’m going to look at batch-processing the geocoding later tonight and then uploading my data in lat/long format so that I can get better results–there’s also a tool, Google Refine, that I’m looking into. However, the ease with which maps can be created is astounding.
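Here is a minimal sketch of what that batch step could look like, using geopy and its Nominatim geocoder (not something I have used yet, just one freely available option); the file names are placeholders:

    import csv
    from geopy.geocoders import Nominatim
    from geopy.extra.rate_limiter import RateLimiter

    geolocator = Nominatim(user_agent="imh-mapping-sketch")
    geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)  # stay polite

    with open("imh_places.csv") as f, \
         open("imh_places_latlong.csv", "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["place", "lat", "long"])
        for row in csv.reader(f):
            if not row:
                continue
            place = row[0]
            loc = geocode(place)          # returns None if nothing matches
            if loc:
                writer.writerow([place, loc.latitude, loc.longitude])
            else:
                writer.writerow([place, "", ""])  # the equivalent of a yellow row

The resulting lat/long file could then be uploaded to Fusion Tables directly, sidestepping its daily geocoding limit.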

Trying Lots of Things that Don’t Work Week

This was a week in which I tried to pull place names from a TEI-encoded IMH article, first by using XSLTs and then by using my newly acquired Python skills. For me, the lack of looping ability made the XSLT approach a no-go. Unfortunately, even after adapting Python code I found on the internet, I wasn’t able to automate this process smoothly. I’m still thinking about ways to do it, but I have moved on to creating a map with the data we do have and have hand-generated a KML file for one article to show the difference in accuracy given the coding issues we have now.
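For the record, a minimal Python sketch of the extraction I was aiming for might look like this (assuming the articles use the standard TEI namespace; the file name is a placeholder):

    import xml.etree.ElementTree as ET

    TEI = {"tei": "http://www.tei-c.org/ns/1.0"}  # standard TEI namespace

    tree = ET.parse("imh_article.xml")            # placeholder file name
    place_names = ["".join(el.itertext()).strip()
                   for el in tree.getroot().findall(".//tei:placeName", TEI)]

    unique_places = list(dict.fromkeys(place_names))  # de-duplicate, keep order
    print(unique_places)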

To that end, I read Erhard Rahm and Hong Hai Do’s Data Cleaning: Problems and Current Approaches. The first sentence that struck home was “For instance, duplicated or missing information will produce incorrect or misleading statistics (“garbage in, garbage out”).” Unfortunately, IMH data is rife with duplication at the moment. For example, “Chicago, Illinois” is currently encoded as two place names, “Chicago” and “Illinois,” thereby leading to inaccuracies in mapping: “Illinois” will yield a point in the middle of the state rather than a second point at Chicago.

The article goes on to talk about the wrapper for data extraction, which is exactly where the IMH problem lies. The <placeName> wrapper has been applied to “Chicago” and “Illinois” separately, rather than enclosing them together, and two TGN numbers have been assigned. Another transformation is going to have to be applied–one that will never be 100% accurate–that throws out any <placeName> wrappers following a comma (thus, any lists of cities, etc. would also fall prey). The authors of the article are quick to point out that the cleaning should occur in a separate “staging area” from the data warehouse; only clean data should enter the warehouse (2).
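A sketch of that comma-based transformation, run as a regular-expression pass over the raw XML, is below. The TGN keys are placeholders, and as noted it will never be 100% accurate; it only illustrates collapsing the two wrappers into one:

    import re

    def merge_split_places(xml_text):
        # "</placeName>, <placeName ...>" becomes ", ", so the two names end up
        # inside the first element and the second wrapper (and its key) is dropped.
        return re.sub(r"</placeName>\s*,\s*<placeName[^>]*>", ", ", xml_text)

    sample = ('<placeName key="TGN1">Chicago</placeName>, '
              '<placeName key="TGN2">Illinois</placeName>')   # placeholder keys
    print(merge_split_places(sample))
    # -> <placeName key="TGN1">Chicago, Illinois</placeName>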

They next speak to the need to engage data transformations when schema-level changes are being made–which is what will need to happen at the IMH. All 100+ years of the TEI-encoded issues have the same problem. Luckily, this is a single-source problem, so ensuring that cleaning happens consistently across the data is not as complex as it could be.

The most helpful part of the article comes when the authors lay out approaches to data cleaning: 1) Data analysis, 2) Definition of transformation workflows and mapping rules, 3) Verification (This is so important! Did you fix the problem?), 4) Transformation, and 5) Backflow of cleaned data (Another key idea–if a clean set of place names is generated for geocoding, it is also important to populate the original articles with this cleaned data so that any future extractors do not run into the same headaches.).
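To make the verification step concrete, a tiny sketch: re-extract the place names after the comma-merge transformation above and compare them with the originals before anything flows back into the articles (the sample is hypothetical):

    import re

    def extract(xml_text):
        return re.findall(r"<placeName[^>]*>(.*?)</placeName>", xml_text, re.S)

    before = '<placeName>Chicago</placeName>, <placeName>Illinois</placeName>'
    after = re.sub(r"</placeName>\s*,\s*<placeName[^>]*>", ", ", before)

    print("before:", extract(before))   # ['Chicago', 'Illinois']
    print("after: ", extract(after))    # ['Chicago, Illinois']
    assert len(extract(after)) <= len(extract(before))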

 

Thoughts on “Mapping Medieval Chester”

Even though no medieval maps of Chester survive, Mapping Medieval Chester uses scholarship to create an accurate and interactive map of the city. This project imported paper maps as different layers into GIS software and then attempted to reconcile them (not easily done with older maps). They then worked on features of medieval importance: “Each of these features was digitized as a separate layer in the GIS, partly because one aim was to be able to re-present them in the final web-resource in a way that allows users to selectively turn each of them either on or off, as needed, and partly because making each of the layers – each of the topographic features – independent allows them to be depicted visually differently in the GIS in terms of line-weighting, colour, shade and so forth, which helps to communicate more effectively the different cartographic information contained in the GIS (Fig. 4).”

As occurred when I worked on my Paradise Lost map (even the literary version), spatial mapping reveals different things from prose mapping: “This exercise in mapping late-medieval Chester has also helped the project team to reflect on how medieval townscape is experienced and understood through modern maps and map-making, and how this differs through engaging with ‘textual’ mappings recorded by those contemporaries who experienced and knew the city first-hand….”

Thoughts on Visualizing Emancipation

http://dsl.richmond.edu/emancipation/

“Visualizing Emancipation organizes documentary evidence about when, where, and how slavery fell apart during the American Civil War.” The red dots below each represent an emancipation event and can be further narrowed by types such as “African Americans Helping the Union” and “African Americans Captured by Union Troops.” The map also allows for Union Troops to be mapped, as well as places of legalized slavery, and a heat map of Emancipation Events.

[Map: Emancipation Events]

While the project began with XML-encoded texts, the team then hired an outside firm to develop a mapping application: “Azavea, a Philadelphia-based company specializing in the creation of geographic web software worked closely with the project directors to develop the mapping application. The application uses non-proprietary applications and technologies, including GeoServer, OpenLayers, and javascript to display information in a data-rich, interactive environment. The map employs ESRI’s light gray canvas basemap.”

As I have been unsatisfied with Google Maps’ appearance, I looked into ESRI (although there is much to emulate about this project’s approach, with the ability to toggle different map keys and types of events), and they offer a free personal ArcGIS account that I plan to explore. It may also be feasible to purchase a basemap for the IMH so that the canvas (and hopefully its markers) are less generic than Google Maps currently allows.

Notes on “Automating the Production of Map Interfaces for Digital Collections Using Google APIs”

http://www.dlib.org/dlib/september11/neatrour/09neatrour.html

At the University of Utah J. Willard Marriott Library, librarians wanted to enhance the metadata of many of their collections with geodata. They knew that Google had APIs that could ingest their metadata and come back with latitudes and longitudes that could then be reinserted into the XML files:

“In an effort to create more robust geographic data for the collection, we developed a three step process:

1) Use the Google Geocoding API to return latitude and longitude data based on existing place names in the metadata.

2) Create a table and scripting program to add the new latitude and longitude values to the core metadata XML file within CONTENTdm.

3) Upload links to the digital collection items with the newly compatible latitude and longitude data to GoogleMaps.”

Using PHP, they extracted a list of unique place names: “For digital libraries using software that supports import and export of collection data in XML files, the locations can be extracted easily with PHP’s preg_match function, which is a regular expression matcher used to look for the applicable xml tag, in our case ‘covspa.’ (They use Dublin Core.)” Unfortunately, the IMH place name data is not properly encoded, and preg_match will yield dirty data until the encoding problem is fixed.
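A rough Python equivalent of that extraction step would look like the sketch below; the tag follows the article’s Dublin Core example, the export file name is a placeholder, and for the IMH the tag would be <placeName> (with the dirty-data caveat above):

    import re

    with open("collection_export.xml") as f:     # placeholder export file
        xml_text = f.read()

    # The Python counterpart of preg_match-ing the applicable XML tag.
    locations = re.findall(r"<covspa>(.*?)</covspa>", xml_text, re.S)
    print(sorted(set(locations)))                # unique place names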

Step 2: “The Google Geocoding API Lookup script searches for all occurrences of <covspa>, reads the metadata, and breaks it up into distinct locations based on semicolons as a separator. Each location is put into an associative array that is later output into a comma-separated values (CSV) file. This spreadsheet is then manually reviewed for errors in the metadata.” Once I had hand-extracted place names from one article, using the Google Geocoding API became possible.

Step 3: “The second part of our script iterates through the location list, sending locations to the Google Geocoding API one at a time. This is done with the cURL library in PHP, which provides a mechanism for the API to transmit data using a variety of protocols, including automated HTTP requests. Google sends coordinates back if it finds a match. The coordinates are saved and then used to create a table populated with both the place names for the collection and their applicable geographic coordinates.” (This is the part I’ll need help with!)
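As a starting point for that help, here is a minimal Python stand-in for Steps 2 and 3: split the metadata on semicolons, send each distinct location to the Google Geocoding API, and write the place-name/coordinate table out for review. The API key, sample metadata, and file name are placeholders.

    import csv, json, urllib.parse, urllib.request

    API_KEY = "YOUR_KEY_HERE"                     # placeholder key

    def geocode(place):
        url = ("https://maps.googleapis.com/maps/api/geocode/json?"
               + urllib.parse.urlencode({"address": place, "key": API_KEY}))
        with urllib.request.urlopen(url) as resp:
            data = json.load(resp)
        if data["status"] != "OK":
            return None
        loc = data["results"][0]["geometry"]["location"]
        return loc["lat"], loc["lng"]

    # Step 2's "associative array": distinct locations split on semicolons.
    raw = "Indianapolis, Indiana; Bloomington, Indiana; Chicago, Illinois"
    locations = {place.strip(): None for place in raw.split(";")}

    for place in locations:
        locations[place] = geocode(place)

    with open("coordinate_table.csv", "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["place", "lat", "lng"])
        for place, coords in locations.items():
            writer.writerow([place, *(coords or ("", ""))])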

Step 4, Multiple Place Names (how they dealt with this could be very helpful to Brianna): “The metadata librarian ranked the place names and coordinates, so we were able to assign the most specific latitude and longitude coordinates to items with multiple place names in their metadata. This ranking system is necessary to get the subsequent script to update the item with the most local and accurate coordinate data. Since we have multiple place names in records separated by semicolons, the scripting program populates the latitude and longitude fields with the most specific information first. This process would not be necessary for other library collections where items have only one place name assigned. See Appendix Item 1 for the coordinate ranking system.”
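The ranking itself is in their appendix, so the sketch below only illustrates the Step 4 idea with invented ranks and approximate coordinates: given several place names on one item, keep the most specific one.

    SPECIFICITY = {"city": 1, "county": 2, "state": 3, "country": 4}  # invented ranks

    def most_specific(places):
        """places: list of (name, level, (lat, lng)) tuples."""
        return min(places, key=lambda p: SPECIFICITY[p[1]])

    item = [
        ("Indiana", "state", (40.27, -86.13)),
        ("Monroe County", "county", (39.16, -86.52)),
        ("Bloomington", "city", (39.17, -86.53)),
    ]
    print(most_specific(item))   # -> the Bloomington entry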

Step 5: “The second script is an XML Modification script which takes the table of coordinate pairs and collection place names returned by the Google Geocoding API lookup script and inserts them into the core descriptive metadata file for the collection.”

Step 6: “Once the new latitude and longitude coordinates are in the metadata for the collection, the next step is to use the updated metadata to generate a KML file that can be used in GoogleMaps applications … Google MyMaps has size limits that restrict KML file rendering.”
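Step 6 is straightforward to sketch in Python: turn the coordinate table into a bare-bones KML file that Google Maps or Google Earth can load. The sample rows and file name are placeholders, and real records would also carry the link back to the item.

    import csv
    from xml.sax.saxutils import escape

    rows = [("Bloomington, Indiana", 39.1653, -86.5264),    # placeholder rows
            ("Indianapolis, Indiana", 39.7684, -86.1581)]

    placemarks = "\n".join(
        "  <Placemark>\n"
        f"    <name>{escape(name)}</name>\n"
        f"    <Point><coordinates>{lng},{lat},0</coordinates></Point>\n"
        "  </Placemark>"
        for name, lat, lng in rows
    )

    kml = (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<kml xmlns="http://www.opengis.net/kml/2.2">\n'
        f"<Document>\n{placemarks}\n</Document>\n</kml>\n"
    )

    with open("imh_places.kml", "w") as f:
        f.write(kml)

Note that KML wants coordinates in longitude, latitude order, the reverse of the lat/long pairs the geocoder returns.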

Step 7: “To generate thumbnails we add columns of script that, with exported item identification metadata, execute a command to generate hyperlinked thumbnails. In the formula we include additional descriptive metadata (place name, recording description) and add the persistent URL for the object in CONTENTdm. A final step involves adding a blank second row with a command to allow the Description column to exceed 256 characters.”

There are some more details to explore here, but this sounds like a great starting point for Brianna’s image collections!

Notes on “Integrating Interactive Maps into a Digital Library”

When surveying other systems, McIntosh starts with GYPSY, a system with “automatic geographical indexing of text documents that could then be searched with a spatial interface.” GYPSY creates a polygon mesh on the x, y, and z axes that allows multiple mentions to push up through the z axis. In a given example about Nevada, if Las Vegas is mentioned often, it becomes a spike in the mesh, or “This created a skyline where it was clear which geographical areas the document(s) focused upon by the height the mesh rose out of the surface at different points on the mesh.”

NewsExplorer’s precursor is more interested in ranking places according to importance as it tags them: “The first technique used the concept of the “importance” of a place. Each place in their database is given a value from 1 to 6 that denotes how “important” that particular place is. For example, 1 means that this place is the capital city of a country and 6 means that the place is a small town or village.” A second version of this system, NewsExplorer, improves upon the visualizations, including by using Google Earth’s KML.

The Informedia Digital Video Library has done some amazing work:

To utilise this information they began the development of a system that could automatically extract this information from the narrative of the video (obtained through the use of the Carnegie Mellon University Sphinx speech-recognition engine [HAHR93]). The system also extracted any words that had been shown on the screen through the use of OCR and checked these for place names. This information could then be used to provide spatial video searching and display map footage relative to a video in sync with the content… (14)

So far it has become clear that each project used different methods to disambiguate place names, something we will have to think about with the IMH if the TGN is not sufficiently detailed for nuanced locales in Indiana. McIntosh observed the following strategies in his survey:

There were several methods of disambiguation used by the different systems, these included: methods based on linguistic rules (e.g., understanding “Cambridge, England” to mean Cambridge in England); methods based on other heuristics such as minimum geographic distance, population comparison, the importance of a place (i.e., a capital city is more important than other cities), examining the local context (i.e., other surrounding place names) and score-based methods. (24)

The approach that GYPSY ultimately takes, limiting its gazetteer so that there are fewer chances of overlapping place names, is clearly not an option for us, as I expect many of the articles to be heavily focused on a small geographical area. However, by using the weighting system employed by NewsExplorer and more heavily weighting Indiana instances of a name, perhaps we can overcome false positives for places like Princeton–a town name in several states. I will have to read more of the articles to be sure whether this kind of skewing is appropriate or whether the geographical bent is not as Indiana-specific as I am expecting.
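As a toy version of that weighting, the sketch below scores candidate matches for an ambiguous name and boosts candidates in Indiana; the candidate list, populations, and weights are all invented for illustration.

    def best_candidate(name, candidates, home_state="Indiana", home_bonus=5.0):
        def score(c):
            s = c["population"] / 100_000          # rough "importance" proxy
            if c["state"] == home_state:
                s += home_bonus                    # weight Indiana instances
            return s
        return max(candidates, key=score)

    princetons = [
        {"name": "Princeton", "state": "New Jersey", "population": 30000},
        {"name": "Princeton", "state": "Indiana",    "population": 8500},
        {"name": "Princeton", "state": "Kentucky",   "population": 6300},
    ]
    print(best_candidate("Princeton", princetons))  # -> the Indiana candidate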

McIntosh also details why the project chose Google Maps (a good list to contrast against MAF, although I still dislike the way in which frequency of hits is dealt with by Google):

• GWT has a Java API for Google Maps which allows for easy integration with the rest of the web application.

• Included in the Google Maps API is the functionality to call Google’s geocoder. The geocoder is a powerful tool that takes an address (e.g. “Hamilton, Waikato, New Zealand”) and attempts to return the latitude and longitude of that place.

• It has a well designed user interface with a large number of useful features.

• Google Maps is free to use for non-commercial purposes.

• Google Maps is an interface that many users will already be familiar with due to its wide spread use around the world. (41)

Much of the rest of this thesis deals with acquiring place names and rendering maps on the fly–a stage that I do not think we are quite ready to contemplate.

Thoughts on “Geocoding in the LCSH Biodiversity Library”


In Geocoding in the LCSH Biodiversity Library, the authors detail how they have taken MARC record locations and ingested them through the Google Maps API to generate a geocoded version of the LCSH.

Most interesting to me was the section on limitations.

“Other limitations in the display relate to how Google Maps geocodes values and how it then displays those Placemarks. By definition a Placemark is a single point on a map, which works well with traditional uses of Google Maps, such as displaying points for street addresses. However, geocoding returns a somewhat different result for less granular place names like ‘Missouri.’ The point associated with ‘Missouri’ is the centroid, or center point, of the polygon defined by the boundaries of the state of Missouri.”

Given that early encoding of the IMH may use just these sorts of generic names in some places, a preponderance of markers at the center of a state should be evaluated critically. Even later encoding may reveal more state or country level encoding than expected–we should be aware of this possibility.
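One way to evaluate those markers critically, sketched below, is to look at the feature type the Google Geocoding API reports back: state- and country-level matches (a bare “Missouri” or “Indiana”) can be flagged for review rather than silently plotted at a centroid. The API key is a placeholder.

    import json, urllib.parse, urllib.request

    API_KEY = "YOUR_KEY_HERE"                     # placeholder key
    TOO_GENERIC = {"administrative_area_level_1", "country"}

    def is_generic(place):
        url = ("https://maps.googleapis.com/maps/api/geocode/json?"
               + urllib.parse.urlencode({"address": place, "key": API_KEY}))
        with urllib.request.urlopen(url) as resp:
            data = json.load(resp)
        if data["status"] != "OK":
            return None
        # The first result's "types" list names the kind of feature matched.
        return bool(TOO_GENERIC & set(data["results"][0]["types"]))

    for place in ["Missouri", "Bloomington, Indiana"]:
        print(place, "generic?", is_generic(place))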

Another consideration is the lack of density representation available in the API:

“A further complication with the single Placemark paradigm relates to the inability to visually represent areas of density within Google Maps. Rather than viewing single Placemarks at each country’s centroid, a more compelling display would be to view a map with shading to represent countries associated with more digitized content, similar in display to a population density map. It is possible to cluster Placemarks such that multiple points are represented by a single Placemark when zoomed out on the map, allowing developers to streamline maps with many points in close proximity to one another. However, clustering still doesn’t allow for visual ranking or weighting of results.”

This is where the Maps of American Fiction project’s use of MONK yields much more nuanced maps. We are also investigating OpenLayers, but I should look at other projects to see if there are ways to build maps in Google that reveal the kind of detail we’d like.

Post-lapsarian Place and Space in Paradise Lost


I have constructed a geographical map that conveys the sheer vastness of the world that Adam and Eve must now inhabit and annotated it to show the diversity of places included so that the reader may understand why many of these names are included. However, Moretti is looking for more than just geography when constructing literary maps. He believes in extracting this information from the narrative to construct patterns: “Each pattern is a clue—a fingerprint of history, almost” (57). He has a rubric for making literary maps:

You choose a unit—walks, lawsuits, luxury goods, whatever—find its occurrences and place them in a space . . . . or in other words: you reduce the text to a few elements, and abstract them from the narrative flow, and construct a new artificial object like I have been discussing. And with a little luck, these maps will be more than the sum of their parts: they will possess ‘emerging’ qualities, which were not visible at the lower level. (53)

Thus, I have followed his example and rendered a map of the world with Eden at its center, surrounded by concentric rings of pagan places:

When looking at the concentric rings around Eden, it is clear that the four nearest cities are all strongholds of Islam and have important roles in trade—however, that trade is in luxury goods, not humans. On the next circle out, false idolatry in the form of the Taj Mahal and the Shalimar Gardens crop up, as do more sinister topics such as colonization, piracy, Catholicism, and slavery. Things have gone from merely worldly concerns to some of the worst aspects of humanity. Finally, on the most remote circle, with the exception of El Dorado (which one could argue represents the pinnacle of obsession with wealth), strife in the form of war dominates. Whether reminding the reader of the Moguls and Genghis Khan, the Ming Dynasty, the instability of Moscow, the bloodshed that occurred when the Spanish encountered Montezuma, or the use of the humans taken for slavery in the prior circle, the widest reaches of this map are so foreign to Eden and Christian values that there is little framework to judge them against. Notably, western Europe and America are absent from this map. Milton seems to have felt they were not needed to illustrate the lesson Michael is intent on giving Adam—namely that the world is a large and unwholesome place, and that the farther one gets from Eden, literally the farther one gets from God’s presence on Earth, the worse things get. This is the kind of “literary sociology” that Moretti hopes will happen when elements are extracted from a text (57).

Reference:

Moretti, Franco. Graphs, Maps, Trees: Abstract Models for a Literary History. London: Verso, 2007.
