A Google Books Cautionary Tale

This one made the rounds on Twitter earlier today thanks to Jo Guldi. This month Wired Magazine tells a cautionary tale for those following the progress of Google Books. Entitled “Google’s Abandoned Library of 700 Million Titles,” the article reminds readers of Google’s 2001 acquisition of a Usenet archive of more than 700 million articles from more than 35,000 newsgroups. The archive is now part of Google Groups, but the Wired article contends that the Usenet material is poorly indexed and hardly searchable, rendering much of it practically inaccessible. The article concludes, “In the end, then, the rusting shell of Google Groups is a reminder that Google is an advertising company — not a modern-day Library of Alexandria.” Something to remember when considering the Google Books settlement and its implications.

Searching History

This week Yahoo! Buzz provides a telling glimpse into the popular historical mind with a list of last Sunday’s top twenty “history” searches. Perhaps predictably the list leans towards the geeky (“History of the Computer” and “History of the Internet”), which probably reflects the technological and scientific biases of internet users, and towards the recreational (“History of Badminton” and “History of Swimming”), which probably reflects the fact that the sample was taken on the weekend. But there are some oddballs in there as well, for instance “History of Mortgage Rates” at #2 and “Titanic History” at #10. Interestingly, “American History” ranks below both “Philippine History” and “Pakistan History”.

Here’s the whole list:

  1. History of the Computer
  2. History of Mortgage Rates
  3. Philippine History
  4. History of Badminton
  5. Pakistan History
  6. History of Psychology
  7. History of the Internet
  8. History of Physical Education
  9. American History
  10. Titanic History
  11. History of Volleyball
  12. History of Statistics
  13. History of Biology
  14. History of Table Tennis
  15. History of Measurement
  16. History of Gymnastics
  17. History of Swimming
  18. History of Mathematics
  19. History of China
  20. History of Chemistry

Google Timelines

On Monday Dave Lester pointed to the release of Google’s new timeline view of search results. Found History has often commented on the importance of timelines in public understanding of history and amateur historical practice, so this seems like it could be a big development in that space.

Google points out that the timeline view works best for people, places, and other similar searches and suggests Thomas Jefferson as a good example. As Dave points out, the software groups frequencies of events (search results) by decade, and at first glance the results seem pretty encouraging. Jefferson’s timeline, for example, has peaks in the 1770s and 1800s—just where you’d expect them:

Thomas Jefferson timeline

Being somewhat suspicious of the representativeness of Google’s hand-picked example, however, I tried a more obscure search for George Sarton, founder of the history of science in America:

George Sarton timeline

To my pleasant surprise, Sarton’s timeline turned out to be nearly as good as Jefferson’s, peaking in the 1910s when Sarton was getting married, finishing his dissertation, founding his journal Isis, and relocating to America.

Nevertheless there are a few pretty significant problems with Google’s new timelines. To start, the timelines only display the frequency with which certain dates appear in connection with a given search term rather than tying these dates to actual events in the life of the search subject. We don’t see what Sarton did in the 1910s. We only see that he did something (or more precisely that lots of people on the web have pointed out that he did something). Viewing this kind of timeline (i.e. one without named events) seems to me a little like watching an old-time silent movie with your eyes closed. You know something is happening because the music gets louder or quicker, but you don’t know what.
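To make the point concrete, here is a minimal sketch of the kind of counting described above: pull four-digit years out of snippets of text and tally them by decade. This is not Google's actual method, which I haven't seen. The function name `decade_counts` and the sample snippets are my own stand-ins for real search results.

```python
import re
from collections import Counter

# Matches four-digit years from 1000 to 2099 (an assumed, deliberately crude rule).
YEAR = re.compile(r"\b(1[0-9]{3}|20[0-9]{2})\b")

def decade_counts(snippets):
    """Count the years mentioned in the snippets, grouped by decade."""
    counts = Counter()
    for text in snippets:
        for match in YEAR.findall(text):
            decade = int(match) // 10 * 10
            counts[decade] += 1
    return counts

# Hypothetical snippets standing in for search results about Jefferson.
snippets = [
    "Thomas Jefferson was born in 1743.",
    "Jefferson drafted the Declaration of Independence in 1776.",
    "In 1779 Jefferson became governor of Virginia.",
    "Jefferson was elected president in 1800 and took office in 1801.",
    "The Louisiana Purchase was completed in 1803.",
]

counts = decade_counts(snippets)

# Print a crude text histogram, one row per decade.
for decade in sorted(counts):
    print(f"{decade}s {'#' * counts[decade]}")
```

Notice what the tally throws away: the sentences say what happened, but the histogram only says that something did.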

In this regard a Google timeline is at best an activity map of a given historical actor’s life, demonstrating for instance that Jefferson did more, and more important, things in the 1770s than in the 1740s and 50s. If that’s the case, Google’s new visualization falls victim to Dan Cohen’s “big whoop” (my term, not Dan’s) criticism. Pointing to a visualization of the full text of the New Testament which showed (surprise!) that Jesus sits at the center of the narrative, Dan lamented that too many digital humanities visualizations “merely use computational methods to reveal the obvious in fancy ways.” Google’s Jefferson timeline is another case in point. Is anyone surprised that Jefferson did more in his 20s, 30s, and 40s than as an eight-year-old?

Despite these problems, however, I do see something new here. Google’s is one of the first projects I can think of that attempts to move beyond using computational means to answer factual historical questions (our H-Bot software has been doing this for a couple of years now) and actually tries to provide something approximating historical interpretation, that tries to put factual information into a narrative framework, even if that narrative tells us little more than that Sarton was busier in 1910 than in 1940 or that Jefferson did something in 1776 and nothing in 1752. If we all agree that the real work of history isn’t about names and dates, then the real work of digital history has to be more than that as well. Digital historians need to think about computational methods for producing historical narratives, not just historical facts, and for producing historical knowledge rather than merely uncovering historical information. I guess what I’m talking about is artificial historical intelligence, and although Google’s timelines aren’t that, they’re certainly a gesture in that direction. They’re certainly something to watch.

Place Names / Time

Yesterday software engineer Matthew Gray from Inside Google Book Search posted a mashup/geo-visualization demonstrating how place name frequency changed over the course of 19th-century publishing history. Gray’s four maps (one each from the 1800s, 1830s, 1860s, and 1890s) clearly point to a growing publishing industry and to broader shifts in its center of gravity from Europe to North America and from East Coast to West Coast.

While Gray’s results are convincing and the whole project a good example of how digital tools are creating new avenues for amateur historical inquiry, we should also admit that it reinforces Dan Cohen’s recent point that “too many visualizations … merely use computational methods to reveal the obvious in fancy ways.” The question Dan wants us to ask is whether these visualizations teach us anything new. It’s a good question. Are we surprised that Denver is mentioned more frequently in print in 1890 than in 1830? Probably not. But another question we should ask is whether these visualizations can teach our students and publics anything new. I wonder if the obvious truths told by these maps, charts, and diagrams aren’t so obvious to people who don’t identify themselves as historians. I’m struck by the fact that both this example and the one Dan points to were produced by and for non-professionals. I suspect the answer to Dan’s concern is that the best place for these things is not in research, but in teaching and public understanding.

More from UWO

Bill Turkel has a fantastic post about the ways people search for history online. Using search data released by AOL and some statistical methods, Bill has been able to tell us a lot about how ordinary Internet users think about history and what topics interest them most. Clearly this is very important stuff for Found History, and I hope he takes it further. I’d be particularly interested in how the history searches of AOL users compare to those of Google and Yahoo! users, but I suppose (thankfully) that Google and Yahoo! have more respect for their users’ privacy and that this won’t happen anytime soon.

One thing Bill notices is how many searches for “history” relate not to the study of the past, but to the web browser’s cache and how to delete it. Though Bill’s methods are statistical and mine are anecdotal, this is something I have noticed as well. I do a lot of searching around the web for the pieces of found history I post in this blog, and I often find myself sifting through lots of web pages and blog posts about clearing Internet Explorer’s history files on my way to finding a truly historical nugget.
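For what it's worth, here is a rough sketch of how one might separate the two kinds of "history" searches before doing any analysis. It is a simple keyword heuristic of my own, not Bill's method. The cue words and the sample queries are hypothetical, and a real study would want to derive the cues from the data.

```python
# Hypothetical cue words suggesting the query is about browser history.
BROWSER_CUES = {
    "clear", "delete", "erase", "remove",
    "browser", "cache", "cookies", "explorer",
}

def is_browser_history_query(query):
    """Return True if the query mentions 'history' alongside a browser cue word."""
    words = set(query.lower().split())
    return "history" in words and bool(words & BROWSER_CUES)

# Made-up sample queries for illustration only.
queries = [
    "how to delete history",
    "history of badminton",
    "clear internet explorer history",
    "american history",
]

browser = [q for q in queries if is_browser_history_query(q)]
past = [q for q in queries if not is_browser_history_query(q)]

print("Browser history:", browser)
print("Study of the past:", past)
```

A word list like this will miss plenty of cases, but it is probably enough to get the worst of the noise out of the way.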

This suggests a converse research question to the one Bill has asked of his data set. It would be interesting to compare the kinds of history people are searching for with the kinds of history they’re posting about. I suppose you could do this by pulling three months’ worth of feeds for blog posts containing the word “history” (easily done through Bloglines or blogsearch.google.com) and running some similar text mining operations on them. Analyzing how “history” is used in titles could be particularly enlightening in that titles and search terms share a similar descriptive intent. And you could easily ask the same kinds of information distance questions of both.
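As a very rough sketch of what such a comparison might look like, the snippet below uses normalized compression distance, one common stand-in for information distance. I'm assuming here, not asserting, that this is close to the kind of measure Bill has in mind. The function names and sample strings are mine, and the standard zlib compressor is only one possible choice.

```python
import zlib

def compressed_size(text):
    """Size in bytes of the zlib-compressed UTF-8 encoding of text."""
    return len(zlib.compress(text.encode("utf-8"), 9))

def ncd(a, b):
    """Normalized compression distance: lower values suggest more shared content."""
    size_a = compressed_size(a)
    size_b = compressed_size(b)
    size_ab = compressed_size(a + " " + b)
    return (size_ab - min(size_a, size_b)) / max(size_a, size_b)

# Hypothetical data: search terms joined into one string, blog post titles into another.
search_terms = "history of the computer history of the internet history of badminton"
post_titles = "a short history of the internet a history of computer games"

print(round(ncd(search_terms, post_titles), 2))
```

Compression-based measures get noisy on short strings, so in practice you would want to compare whole batches of titles and search terms rather than individual ones.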

Obviously this has me thinking. Many thanks to Bill.