Teaching and Learning with Primary Sources in the age of Generative AI

The following is a (more or less verbatim) transcript of a keynote address I gave earlier today to the Dartmouth College Teaching with Primary Sources Symposium. My thanks to Morgan Swan and Laura Barrett of the Dartmouth College Library for hosting me and giving me the opportunity to gather some initial thoughts about this thoroughly disorienting new development in the history of information.

Thank you, Morgan, and thank you all for being here this morning. I was going to talk about our Sourcery project today, which is an application to streamline remote access to archival materials for both researchers and archivists, but at the last minute I’ve decided to bow to the inevitable and talk about ChatGPT instead.

Dartmouth College Green on a beautiful early-spring day

I can almost feel the inner groan emanating from those of you who are exhausted and perhaps dismayed by the 24/7 coverage of “Generative AI.” I’m talking about things like ChatGPT, DALL-E, MidJourney, Jasper, Stable Diffusion, and Google’s just-released Bard. Indeed, the coverage has been wall to wall, the hype has at times been breathless, and it’s reasonable to be skeptical of “the next big thing” from Silicon Valley. After all, we’ve just seen the Silicon Valley hype machine very nearly bring down the banking system. In just the past year, we’ve seen the spectacular fall of the last “next big thing,” so-called “crypto,” which promised to revolutionize everything from finance to art. And we’ve just lived through a decade in which the social media giants have created a veritable dystopia of teen suicide, election interference, and resurgent white nationalism.

So, when the tech industry tells you that its latest whatever is “going to change everything,” it makes sense to be wary. I’m wary myself. But with a healthy dose of skepticism, and more than a little cynicism, I’m here to tell you today, as a 25-year veteran of the digital humanities and a historian of science and technology, as someone who teaches the history of digital culture, that Generative AI is the biggest change in the information landscape since at least 1994 and the launch of the Netscape web browser, which brought the Internet to billions. It’s surely bigger than the rise of search with Google in the early 2000s or the rise of social media in the early 2010s. And it’s moving at a speed that makes it extremely difficult to say where it’s headed. But let’s just say that if we all had an inkling that the robots were coming 100 or 50 or 25 years into the future, it’s now clear to me that they’ll be here in a matter of just a few years—if not a few months.

It’s hard to overstate just how fast this is happening. Let me give you an example. Here is the text of a talk entitled (coincidentally!) “Teaching with primary sources in the next digital age.” This text was generated by ChatGPT—or GPT-3.5—the version which was made available to the public last fall, and which really kicked off this wall-to-wall media frenzy over Generative AI.

You can see that it does a plausible job of producing a three-to-five paragraph essay on the topic of my talk today that would not be an embarrassment if it was written by your ninth-grade son or daughter. It covers a range of relevant topics, provides a cogent, if simplistic, explanation of those topics, and it does so in correct and readable English prose.

Now here’s the same talk generated by GPT-4 which came out just last week. It’s significantly more convincing than the text produced by version 3.5. It demonstrates a much greater fluency with the language of libraries and archives. It correctly identifies many if not most of the most salient issues facing teaching in archives today and provides much greater detail and nuance. It’s even a little trendy, using some of the edu-speak and library lingo that you’d hear at a conference of educators or librarians in 2023.

Now here’s the outline for a slide deck of this talk that I asked GPT-4 to compose, complete with suggestions for relevant images. Below that is the text of speaker notes for just one of the bullets in this talk that I asked the bot to write.

Now—if I had generated speaker notes for each of the bullets in this outline and asked GPT’s stablemate and image generator, DALL-E, to create accompanying images—all of which would have taken the systems about 5 minutes—and then delivered this talk more or less verbatim to this highly educated, highly accomplished, Ivy League audience, I’m guessing the reaction would have been: “OK, seems a little basic for this kind of thing” and “wow, that talk was a big piece of milquetoast.” It would have been completely uninspiring, and there would have been plenty to criticize—but neither would I have seemed completely out of place at this podium. After all, how many crappy, uninspiring, worn out PowerPoints have you sat through in your career? But the important point to stress here is that in less than six months, the technology has gone from writing at a ninth-grade level to writing at a college level and maybe even more.

Much of the discourse among journalists and in the academic blogs and social media has revolved around picking out the mistakes these technologies make. For example, my good friend at Middlebury, Jason Mittell, along with many others, has pointed out that ChatGPT tends to invent citations: plausible-looking references, attributed to real authors and real journals, to articles that do not, in fact, exist. Australian literary scholar Andrew Dean has pointed out how ChatGPT spectacularly misunderstands some metaphors in poetry. And it’s true. Generative AIs make lots of extremely weird mistakes, and they wrap those mistakes in extremely convincing-sounding prose, which often makes them hard to catch. And as Matt Kirschenbaum has pointed out: they’re going to flood the Internet with this stuff. Undoubtedly there are issues here.

But don’t mistake the fact that ChatGPT is lousy at some things for the reality that it’ll be good enough for lots, and lots, and lots of things. And based on the current trajectory of improvement, do we really think these problems won’t be fixed?

Let me give another couple of examples. Look at this chart, which shows GPT-3.5’s performance on a range of real-world tests. Now look at this chart, which shows GPT-4’s improvement. If these robots have gone from writing decent five-paragraph high school essays to passing the Bar Exam (in the 90th percentile!!) in six months, do we really think they won’t figure out citations in the next year, or two, or five? Keep in mind that GPT-4 is a general-purpose model that’s engineered to do everything pretty well. It wasn’t even engineered to take the Bar Exam. Google CEO Sundar Pichai tells us that AI computing power is doubling every six months. If today it can kill the Bar Exam, do we really think it won’t be able to produce a plausible article for a mid-tier peer reviewed scholarly journal in a minor sub-discipline of the humanities in a year or two? Are we confident that there will be any way for us to tell that machine-written article from one written by a human?

(And just so our friends in the STEM fields don’t start feeling too smug, GPT can write code too. Not perfectly of course, but it wasn’t trained for that either. It just figured it out. Do we really think it’s that long until an AI can build yet another delivery app for yet another fast-food chain? Indeed, Ubisoft and Roblox are starting to use AI to design games. Our students’ parents are going to have to start getting their heads around the fact that “learning to code” isn’t going to be the bulletproof job-market armor they thought it was. I’m particularly worried for my digital media students who have invested blood, sweat, and tears learning the procedural ins and outs of the Adobe suite.)

There are some big philosophical issues at play here. One is around meaning. The way GPT-4 and other generative AIs produce text is by predicting the next word in a sentence statistically, based on a model drawn from an unimaginably large (and frankly unknowable) corpus of text the size of the whole Internet—a “large language model” or LLM—not by understanding the topic they’re given. In this way the prose they produce is totally devoid of meaning. Drawing on philosopher Harry Frankfurt’s definition of “bullshit” as “speech intended to persuade without regard for truth,” Princeton computer scientists Arvind Narayanan and Sayash Kapoor suggest that these LLMs are merely “bullshit generators.” But if something meaningless is indistinguishable from something meaningful—if it holds meaning for us, but not the machine—is it really meaningless? If we can’t tell the simulation from the real, does it matter? These are crucial philosophical, even moral, questions. But I’m not a philosopher or an ethicist, and I’m not going to pretend to be able to think through them with any authority.

What I know is: here we are.

As a purely practical matter, then, we need to start preparing our students to live in a world of sometimes bogus, often very useful, generative AI. The first-year students arriving in the fall may very well graduate into a world that has no way of knowing machine-generated from human-generated work. Whatever we think about them, however we feel about them (and I feel a mixture of disorientation, disgust, and exhaustion), these technologies are going to drastically change what those Silicon Valley types might call “the value proposition” of human creativity and knowledge creation. Framing it in these terms is ugly, but that’s the reality our students will face. And there’s an urgency to it that we must face.

So, let’s get down to brass tacks. What does all this mean for what we’re here to talk about today, that is, “Teaching with Primary Sources”?

One way to start to answer this question is to take the value proposition framing seriously and ask ourselves, “What kinds of human textual production will continue to be of value in this new future and what kinds will not?” One thing I think we can say pretty much for sure is that writing based on research that can be done entirely online is in trouble. More precisely, writing about things about which there’s already a lot online is in trouble. Let’s call this “synthetic writing” for short. Writing that synthesizes existing writing is almost certainly going to be done better by robots. This means that what has passed as “journalism” for the past 20 years since Google revolutionized the ad business—those BuzzFeed-style “listicles” (“The 20 best places in Dallas for tacos!”) that flood the internet and are designed for nothing more than to sell search ads against—that’s dead.

But it’s not only that. Other kinds of synthetic writing—for example, student essays that compare and contrast two texts or (more relevant to us today) place a primary source in the context drawn from secondary source reading—those are dead too. Omeka exhibits that synthesize narrative threads among a group of primary sources chosen from our digitized collections? Not yet, but soon.

And it’s not just that these kinds of assignments will be obsolete because AI will make it too easy for students to cheat. It’s that there’s little point in teaching students to do something they’ll never be asked to do again outside of school. This has always been a problem with college essays that were only ever destined for a file cabinet in the professor’s desk. But at least we could tell ourselves that we were doing something that simulated the kind of knowledge work they would do as lawyers and teachers and businesspeople out in the real world. But now?

(Incidentally, I also fear that synthetic scholarly writing is in trouble, for instance, a Marxist analysis of Don Quixote. When there’s a lot of text about Marx and a lot of text about Don Quixote out there on the Internet, chances are the AI will do a better—certainly a much faster—job of weaving the two together. Revisionist and theoretical takes on known narratives are in trouble.)

We have to start looking for the things we have to offer that are (at least for now) AI-proof, so to speak. We have to start thinking about the skills that students will need to navigate an AI world. Those are the things that will be of real value to them. So, I’m going to use the rest of my time to start exploring with you (because I certainly don’t have any hard and fast answers) some of the shifts we might want to start to make to accommodate ourselves and our students to this new world.

I’m going to quickly run through eight things.

  1. The most obvious thing we can do is to refocus on the physical. GPT and its competitors are trained on digitized sources. At least for now, they can only be as smart as what’s already on the Internet. They can’t know anything about anything that’s not online. That’s going to mean that physical archives (and material culture in general) will take on a much greater prominence as the things that AI doesn’t know about and can’t say anything about. In an age of AI, there will be much greater demand for the undigitized stuff. Being able to work with undigitized materials is going to be a big “value add” for humans in the age of these LLMs. And our students do not know how to access it. Most of us were trained on card catalogs, on sorting through library stacks, and on traveling to different archives to sift through boxes of sources. Having been born into the age of Google, our students are much less practiced at this, and they’re going to need to get better. Moreover, they’re going to need better ways of getting at these physical sources that don’t always involve tons of travel, with all its risks of climate harm and contagion. Archivists, meanwhile, will need new tools to deal with the increased demand. We launched our Sourcery app, which is designed to provide better connections between researchers and archivists and to provide improved access to remote undigitized sources, before these LLMs hit the papers. But tools like Sourcery are going to be increasingly important in an age when the kind of access that real humans need isn’t the digital kind, but the physical kind.
  2. Moreover, we should start rethinking our digitization programs. The copyright issues around LLMs are (let’s say) complex, but currently OpenAI, Google, Microsoft, Meta, and the others are rolling right ahead, sucking up anything they can get their hands on and processing those materials through their AIs. This includes all of the open access materials we have so earnestly spent 30 years producing for the greater good. Maybe we want to start asking ourselves whether we really want to continue providing completely open, barrier-free access to these materials. We’ve assumed that more open meant more humane. But when it’s a robot taking advantage of that openness? We need a gut check.
  3. AIs will in general just be better at the Internet than us. They’ll find, sort, sift, and synthesize things faster. They’ll conduct multi-step online operations—like booking a trip or editing a podcast—faster than us. This hits a generation that’s extremely invested in being good at the Internet, and, unfortunately, increasingly bad at working in the real world. Our current undergraduates have been deeply marked by the experience of the pandemic. I’m sure many of you have seen a drastic increase in class absences and a drastic decrease in class participation since the pandemic. We know from data that more and more of our students struggle with depression and anxiety. Students have difficulty forming friendships in the real world. There are a growing number of students who choose to take all online classes even though they’re living in the dorms. This attachment to the virtual may not serve them well in a world where the virtual is dominated by robots who are better than us at doing things in the digital world. We need to get our students re-accustomed to human-to-human connections.
  4. At the same time, we need to encourage students to know themselves better. We need to help them cultivate authentic, personal interests. This is a generation that has been trained to write to the test. But AIs will be able to write to the test much better than we can. AIs will be able to ascertain much better than we can what they (whoever “they” is: the school board, the college board, the boss, the search algorithm) want. But what the AI can’t really do is tell us what we want, what we like, what we’re interested in and how to get it. We need to cultivate our students’ sense of themselves and help them work with the new AIs to get it. Otherwise, the AI will just tell them what they’re interested in, in ways that are much more sophisticated and convincing than the Instagram and TikTok algorithms that are currently shoving content at them. For those of us teaching with primary sources, this means exposing them to the different, the out of the ordinary, the inscrutable. It means helping them become good “pickers”—helping them select the primary sources that truly hold meaning for them. As educators of all sorts, it means building up their personalities, celebrating their uniqueness, and supporting their difference.
  5. I think we also need to return to teaching names-and-dates history. That’s an unfashionable statement. The conventional wisdom of at least the last 30 years is that names, dates, and places aren’t that important to memorize because the real stuff of history is the themes and theories—and anyway, Google can always give us the names and dates. Moreover, names-and-dates history is boring, and with the humanities in perpetual crisis and on the chopping block in the neoliberal university, we want to do everything we can to make our disciplines more attractive. But memorized names, and dates, and places are the things that allow historians to make the creative leaps that constitute new ideas. The biggest gap I see between students of all stripes, including graduate students, and the privileged few like me who make it into university teaching positions (besides white male privilege) is a fluency with names, dates, and places. The historians who impress most are the ones who can take two apparently disconnected happenings and draw a meaningful connection between them. Most often the thing that suggests that connection to them is a connected name, date, place, source, event, or institution that they have readily at hand. Those connections are where new historical ideas are born. Not where they end, for sure, but where they are born. AI is going to be very good at synthesizing existing ideas. But it may be less good at making new ones. We need students who can birth new ideas.
  6. Related to this is the way we teach students to read. In the last 20 years, largely in response to the demands of testing, but also in response to the prioritization of “critical thinking” as a career skill, we’ve taught students not to read for immersion, for distraction, for imagination, but for analysis. Kids read tactically. They don’t just read. In many cases, this means they don’t read at all unless they have to. Yet, this is exactly how the AI reads. Tactically. Purely for analysis. Purely to answer the question. And they’ll ultimately be able to do this way better than us. But humans can read in another way. To be inspired. To be moved. We need to get back to this. The imaginative mode of reading will set us apart.
  7. More practically, we need to start working with these models to get better at asking them the right questions. If you’ve spent any time with them, you’ll know that what you put in is very important in determining what you get out. Here’s an example. In this chat, I asked GPT-3.5, “How can I teach with primary sources?” OK. Not bad. But then in another chat I asked, “Give me a step-by-step plan for using primary sources in the classroom to teach students to make use of historical evidence in their writing,” and I followed it up with a few more questions: “Can you elaborate?” and “Are there other steps I should take?” and then “Can you suggest an assignment that will assess these skills?” You’ll see that it gets better and better as it goes along. I’m no expert at this. But I’m planning on becoming one because I want to be able to show our students how to use it well. Because, don’t fool yourselves, they’re going to use it.
  8. Finally, then, perhaps the most immediate thing we can do is to inculcate good practice around students’ use of AI generated content. We need to establish citation practices, and indeed the MLA has just suggested some guidance for citing generative AI content. Stanford, and other universities, are beginning to issue policies and teaching guidance. So far, these policies are pretty weak. Stanford’s policy basically boils down to, “Students: Don’t cheat. Faculty: Figure it out for yourselves.” It’s a busy time of year and all, but we need urgently to work with administration to make these things better.
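The iterative questioning described above—opening with a detailed prompt, then following up in the same chat—works because each new question is sent along with the entire prior exchange. A minimal Python sketch of that mechanism (the `build_conversation` helper is my own illustrative construction, and the model name in the commented-out call is an assumption):

```python
# A toy sketch of iterative prompting: each follow-up rides on top of
# the full prior exchange, so the model refines its answer in context.

def build_conversation(initial_prompt, exchanges):
    """Build a chat-style message list from an opening prompt and a
    series of (model_reply, follow_up_question) pairs."""
    messages = [{"role": "user", "content": initial_prompt}]
    for reply, follow_up in exchanges:
        messages.append({"role": "assistant", "content": reply})
        messages.append({"role": "user", "content": follow_up})
    return messages

messages = build_conversation(
    "Give me a step-by-step plan for using primary sources in the "
    "classroom to teach students to make use of historical evidence "
    "in their writing.",
    [
        ("<model's first answer>", "Can you elaborate?"),
        ("<model's second answer>", "Are there other steps I should take?"),
        ("<model's third answer>", "Can you suggest an assignment that will assess these skills?"),
    ],
)

# Each successive request would send the whole history, e.g. (untested):
# from openai import OpenAI
# client = OpenAI()
# response = client.chat.completions.create(model="gpt-4", messages=messages)
```

The point is simply that the quality of the final answer depends on the whole chain of questions, not on any single prompt—which is why prompting is a skill worth practicing.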

I’m nearly out of time, and I really, really want to leave time for conversation, so I’ll leave it there. These are just a couple of thoughts that I’ve pulled together in my few weeks of following these developments. As I’ve said, I’m no expert in computer science, or philosophy, or business, but I think I can fairly call myself an expert in digital humanities and the history of science and technology, and I’m convinced this new world is right around the corner. I don’t have to like it. You don’t have to like it. If we want to stop it, or slow it down, we should advocate for that. But we need to understand it. We need to prepare our students for it.

At the same time, if you look at my list of things we should be doing to prepare for the AI revolution, they are, in fact, things we should have been (and in many cases have been) doing all along. Paying more attention to the undigitized materials in our collections? I’m guessing that’s something you already want to do. Helping students have meaningful, in-person, human connections? Ditto. Paying more attention to what we put online to be indexed, manipulated, sold against search advertising? Ditto. Encouraging students to have greater fluency with names, dates, and places? Helping them format more sophisticated search queries? Promoting better citation practice for born-digital materials and greater academic integrity? Ditto. Ditto. Ditto.

AI is going to change the way we do things. Make no mistake. But like all other technological revolutions, the changes it demands will just require us to be better teachers, better archivists, better humans.

Thank you.

Collaboration and Emergent Knowledge at Greenhouse Studios

mud cracks

Crossposted from Greenhouse Studios

Since the 1970s, scholars in fields as varied as sedimentology, ornithology, sociology, and philosophy have come to understand the importance of self-organizing systems, of how higher-order complexity can “emerge” from independent lower-order elements. Emergence describes how millions of tiny mud cracks at the bottom of a dry lake bed form large scale geometries when viewed at a distance, or how water molecules, each responding simply to a change in temperature, come to form the complex crystalline patterns of a snowflake. Emergence describes how hundreds of birds, each following its own, relatively simple rules of behavior, self-organize into a flock that displays its own complex behaviors, behaviors that none of the individual birds themselves would display. In the words of writer Steven Johnson, emergence describes how those birds, without a master plan or executive leadership, go from being a “they” to being an “it.” In other words, emergence describes a becoming.

We, too, form emergent systems. Emergence describes how a crowd of pedestrians self-organizes to form complex traffic flows on a busy sidewalk. Viewed close-up, each pedestrian is just trying to get to his or her destination without getting trampled, reacting to what’s in front of him or her according to a set of relatively simple behavioral rules—one foot in front of the other. Viewed from above, however, we see a structured flow, a river of humanity. Acting without direction, the crowd spontaneously orders itself into a complex system for maximizing pedestrian traffic. The mass of individual actors has, without someone in charge, gone from being an uncoordinated “they” to an organized “it.”
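The flocking idea can be made concrete in a few lines of code. Here is a toy one-dimensional sketch of my own devising (loosely inspired by Craig Reynolds’ “boids” rules, and not a model of any real system): each “bird” follows a single local rule—drift toward the average position of its nearby neighbors—and tight clusters emerge with no central coordination.

```python
import random

def step(positions, neighborhood=5.0, pull=0.05):
    """One tick: every bird nudges toward the mean position of the
    birds within its neighborhood -- a purely local rule."""
    updated = []
    for x in positions:
        neighbors = [y for y in positions if abs(y - x) < neighborhood]
        local_center = sum(neighbors) / len(neighbors)  # always includes x itself
        updated.append(x + pull * (local_center - x))
    return updated

def variance(xs):
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)

random.seed(42)
flock = [random.uniform(0, 100) for _ in range(50)]  # a scattered "they"
before = variance(flock)
for _ in range(300):
    flock = step(flock)
after = variance(flock)
# The birds have pulled into clusters -- an "it" -- even though no bird
# knows anything beyond its immediate neighbors, and the spread shrinks.
```

No bird in this simulation has a plan, yet run it and the flock contracts into ordered clusters: a becoming, in miniature.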

Emergent approaches to scholarly communication have long been an interest of mine, although I’ve only recently come to think of them this way. My first experiment in the emergent possibilities of radical collaboration took the form of THATCamp—The Humanities and Technology Camp—an “unconference” that colleagues at the Roy Rosenzweig Center for History and New Media and I launched in 2008. Instead of a pre-arranged, centrally-planned conference program, THATCampers set their own agendas on the first morning of the event, organizing around the topics that happen to be of most interest to most campers on that day. Another example is Hacking the Academy, a collaboration with Dan Cohen, which posed an open call for submissions to the community of digital humanists on a seven-day deadline. From the patterns that emerged from the more than 300 submissions we received—everything from tweets to blog posts to fully formed essays—we assembled and published an edited volume with University of Michigan Press. A final experiment with this emergent approach was a project called One Week | One Tool. This Institute for Advanced Topics in the Digital Humanities brought together a diverse collection of scholars, students, programmers, designers, librarians, and administrators to conceive, build, and launch an entirely new software tool for humanities scholarship. Participants arrived without an idea of what they would build, only the knowledge that the assembled team would possess the necessary range of talent for the undertaking. They began by brainstorming ideas for a digital project and proceeded to establish project roles, iteratively design a feature set, implement their design, and finally launch their product on day seven.

The Greenhouse Studios design process similarly provides a space for emergent knowledge making. Greenhouse Studios is interested in what new knowledge might emerge when we allow academic communities to self-organize. We are asking what kinds of higher-order complexities arise when teams of humanists, artists, librarians, faculty, students, and staff are given permission to set and follow their own simple rules of collaboration. This mode of work stands in strong rebuke to what I would call the “additive” model of collaboration that draws resources and people together to serve faculty member-driven projects. Instead, Greenhouse Studios provides its teams with the conditions for collaboration—diversity and depth of thought and experience, time apart, creative tools and spaces—and lets them set their own projects and own roles. At Greenhouse Studios, we’re running an experiment in radical collaboration, exploring what happens when you remove the labor hierarchies and predetermined workplans that normally structure collaborative scholarly projects, and instead embrace the emergent qualities of collaboration itself.

Innovation, Use, and Sustainability

Revised notes for remarks I delivered on the topic of “Tools: Encouraging Innovation” at the Institute of Museum and Library Services (IMLS) National Digital Platform summit last month at the New York Public Library.

What do we mean when we talk about innovation? To me innovation implies not just the “new” but the “useful.” And not just the “useful” but the “implemented” and the “used.” Used, that is, by others.

If a tool stays in house, in just the one place where it was developed, it may be new and it may be interesting—let’s say “inventive”—but it is not “innovative.” Other terms we use in this context—“ground breaking” and “cutting edge,” for example—share this meaning. Ground is broken for others to build upon. The cutting edge precedes the rest of the blade.

The IMLS program that has been charged and most generously endowed with encouraging innovation in the digital realm is the National Leadership Grants: Advancing Digital Resources program. The idea that innovation is tied to use is implicit in the title of the program: the word “leadership” implies a “following.” It implies that the digital resources that the program advances will be examples to the field to be followed widely, that the people who receive the grants will become leaders and gain followers, that the projects supported by the program will be implemented and used.

This is going to be difficult to say in present company, because I am a huge admirer of the NLG program and its staff of program officers. I am also an extremely grateful recipient of its funds. Nevertheless, in my estimation as an observer of the program, a panelist, and an awardee, the program has too often fallen short in this regard: it has supported a multitude of new and incredibly inventive work, but that work has too rarely been taken up by colleagues outside of the originating institution. The projects the NLG program has spawned have been creative, exciting, and new, but they have too rarely been truly innovative. This is to say that the ratio of “leaders” to “followers” is out of whack. A model that’s not taken up by others is no model at all.

I would suggest two related remedies for the Leadership Grants’ lack of followers:

  1. More emphasis on platforms. Surely the NLG program has produced some widely used digital library and museum platforms, including the ones I have worked on. But I think it bears emphasizing that the limited funds available for grants would generate better returns if they went to enabling technologies rather than end products, to platforms rather than projects. Funding platforms doesn’t just mean funding software—there are also social and institutional platforms, like standards and convening bodies—but IMLS should be funding tools that allow lots of people to do good work, not the good work itself of just a few.
  2. More emphasis on outreach. Big business doesn’t launch new products without a sales force. If we want people to use our products, we shouldn’t launch them without people on staff who are dedicated to encouraging their use. This should be reflected in our budgets, a much bigger chunk of which should go to outreach. That also means more flexibility in the guidelines and among panelists and program officers to support travel, advertising, and other marketing costs.

Sustainability is a red herring

These are anecdotal impressions, but it is my belief that the NLG program could be usefully reformed by a more laser-like focus on these and other uptake and go-to-market strategies in the guidelines and evaluation criteria for proposals. In recent years, a higher and higher premium has been placed on sustainability in the guidelines. I believe the effort we require applicants to spend crafting sustainability plans and grantees to spend implementing them would be better spent on outreach—on sales. The greatest guarantor of sustainability is use. When things are used, they are sustained. When things become so widely implemented that the field can’t do without them, they are sustained. Like the banks, tools and platforms that become too big to fail are sustained. Sustainability is very simply a function of use, and we should recognize this in allocating scarce energies and resources.

Looks Like the Internet: Digital Humanities and Cultural Heritage Projects Succeed When They Look Like the Network

A rough transcript of my talk at the 2013 ACRL/NY Symposium last week. The symposium’s theme was “The Library as Knowledge Laboratory.” Many thanks to Anice Mills and the entire program committee for inviting me to such an engaging event.

[Image: cat]

When Bill Gates and Paul Allen set out in 1975 to put “a computer on every desk and in every home, all running Microsoft software,” it was absurdly audacious. Not only were the two practically teenagers; practically no one owned a computer. When Tim Berners-Lee called the protocols he proposed primarily for internal sharing of research documents among his laboratory colleagues at CERN “the World Wide Web,” it was equally audacious. Berners-Lee was just one of hundreds of physicists working in relative anonymity in the laboratory. His supervisor approved his proposal, allowing him six months to work on the idea with the brief handwritten comment, “vague, but exciting.”

In hindsight, we now know that both projects proved their audacious claims. More or less every desk and every home now has a computer, more or less all of them running some kind of Microsoft software. The World Wide Web is indeed a world-wide web. But what is it that these visionaries saw that their contemporaries didn’t? Gates and Allen, like Berners-Lee, saw the potential of distributed systems.

In stark contrast to the model of mainframe computing dominant at the time, Gates and Allen (and a few peers such as Steve Jobs and Steve Wozniak and other members of the Homebrew Computing Club) saw that computing would achieve its greatest reach if computing power were placed in the hands of users. They saw that the personal computer, by moving computing power from the center (the mainframe) to the nodes (the end user terminal) of the system, would kick-start a virtuous cycle of experimentation and innovation that would ultimately lead to everyone owning a computer.

Tim Berners-Lee saw (as indeed did his predecessors who built the Internet atop which the Web sits) that placing content creation, linking, indexing, and other application-specific functions at the fringes of the network and allowing the network simply to handle data transfers, would enable greater ease of information sharing, a flourishing of connections between and among users and their documents, and thus a free-flowing of creativity. This distributed system of Internet+Web was in stark contrast to the centralized, managed computer networks that dominated the 1980s and early 1990s, networks like Compuserve and Prodigy, which managed all content and functional applications from their central servers.

This design principle, called the “end-to-end principle,” states that most features of a network should be left to users to invent and implement, that the network should be as simple as possible, and that complexity should be developed at its end points not at its core. That the network should be dumb and the terminals should be smart. This is precisely how the Internet works. The Internet itself doesn’t care whether the data being transmitted is a sophisticated Flash interactive or a plain text document. The complexity of Flash is handled at the end points and the Internet just transmits the data.
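The end-to-end principle can be sketched in a few lines of Python. This is a toy illustration of the architecture, not a real protocol stack: the "network" function moves opaque bytes and knows nothing about applications, while all of the intelligence lives at the endpoints.

```python
# Toy illustration of the end-to-end principle: a dumb core, smart endpoints.
import json


def network_transfer(payload: bytes) -> bytes:
    """The dumb core: moves bytes from one end to the other, unexamined."""
    return payload


def smart_sender(document: dict) -> bytes:
    """One endpoint decides how to encode its content (here, JSON)."""
    return json.dumps(document).encode("utf-8")


def smart_receiver(payload: bytes) -> dict:
    """The other endpoint decides how to interpret what it receives."""
    return json.loads(payload.decode("utf-8"))


doc = {"title": "Information Management: A Proposal", "author": "Berners-Lee"}
received = smart_receiver(network_transfer(smart_sender(doc)))
print(received["title"])  # → Information Management: A Proposal
```

Swapping JSON for Flash, plain text, or anything else requires changing only the endpoints; the core never changes. That is the design choice that let the Internet absorb applications its builders never imagined.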

[Image: map of the Internet]

In my experience digital cultural heritage and digital humanities projects function best when they adhere to this design principle, technically, structurally, and administratively. Digital cultural heritage and digital humanities projects work best when content is created and functional applications are designed, that is, when the real work is performed at the nodes and when the management functions of the system are limited to establishing communication protocols and keeping open the pathways along which work can take place, along which ideas, content, collections, and code can flow. That is, digital cultural heritage and digital humanities projects work best when they are structured like the Internet itself, the very network upon which they operate and thrive. The success of THATCamp in recent years demonstrates the truth of this proposition.

Begun in 2008 by my colleagues and me at the Roy Rosenzweig Center for History and New Media as an unfunded gathering of digitally-minded humanities scholars, students, librarians, museum professionals, and others, THATCamp has in five years grown to more than 100 events in 20 countries around the globe.

How did we do this? Well, we didn’t really do it at all. Shortly after the second THATCamp event in 2009, one of the attendees, Ben Brumfield, asked if he could reproduce the gathering and use the name with colleagues attending the Society of American Archivists meeting in Austin. Shortly after that, other attendees organized THATCamp Pacific Northwest and THATCamp Southern California. By early 2010 THATCamp seemed to be “going viral,” and we worked with the Mellon Foundation to secure funding to help coordinate what was now something of a movement.

But that money wasn’t directed at funding individual THATCamps or organizing them from CHNM. Mellon funding for THATCamp paid for information, documentation, and a “coordinator,” Amanda French, who would be available to answer questions and make connections between THATCamp organizers. To this day, each THATCamp remains independently organized, planned, funded, and carried out. The functional application of THATCamp takes place completely at the nodes. All that’s provided centrally at CHNM are the protocols—the branding, the ground rules, the architecture, the governance, and some advice—by which these local applications can perform smoothly and connect to one another to form a broader THATCamp community.

As I see it, looking and acting like the Internet—adopting and adapting its network architecture to structure our own work—gives us the best chance of succeeding as digital humanists and librarians. What does this mean for the future? Well, I’m at once hopeful and fearful for the future.

On the side of fear, I see much of the thrust of new technology today pointing in the opposite direction, towards a re-aggregation of innovation from the nodes to the center, centers dominated by proprietary interests. This is best represented by the App Store, which answers first and foremost to the priorities of Apple, but also by “apps” themselves, which centralize users’ interactions within walled gardens not dissimilar to those built by Compuserve and Prodigy in the pre-web era. The Facebook App is designed to keep you in Facebook. Cloud computing is a more complicated case, but it too removes much of the computing power that in the PC era used to be located at the nodes to a central “cloud.”

On the other hand, on the side of hope, are developments coming out of this very community, developments like the Digital Public Library of America, which is structured very much according to the end-to-end principle. DPLA executive director Dan Cohen has described DPLA’s content aggregation model as ponds feeding lakes feeding an ocean.

As cultural heritage professionals, it is our duty to empower end users—or as I like to call them, “people.” Doing this means keeping our efforts, regardless of which direction the latest trends in mobile and cloud computing seem to point, looking like the Internet.

[Image credits: Flickr user didbygraham and Wikipedia.]

No Holds Barred

About six months ago, I was asked by the executive director of a prestigious but somewhat hidebound—I guess “venerable” would be the word—cultural heritage institution to join the next meeting of the board and provide an assessment of the organization’s digital programs. I was told not to pull any punches. This is what I said.

  1. You don’t have a mobile strategy. This is by far your most pressing need. According to the Pew Internet and American Life Project, already more than 45% of Americans own a smartphone. That number rises to 66% among 18-29 year olds and 68% among families with incomes of more than $75,000. These are people on the go. You are in the travel and tourism business. If you are only reaching these people when they’re at their desks at work—as opposed to in their cars, on their lunch breaks, while they’re chasing the kids around on Saturday morning—you aren’t reaching them in a way that will translate into visits. This isn’t something for the future. Unfortunately, it’s something for two years ago.
  2. You don’t have an integrated social strategy. I could critique your website, and of course it needs work. But a redesign is a relatively straightforward thing these days. The more important thing to realize is that you shouldn’t expect more than a fraction of your digital audience to interact directly with your website. Rather, most potential audience members will want to interact with you and your content on their chosen turf, and these days that means Facebook, Twitter, Pinterest, Tumblr, and Wikipedia, depending on the demographic. You have to be ready to go all in with social media and dedicate at least as much thought and resources to your social media presence as to your web presence.
  3. Your current set of researcher tools and resources aren’t well-matched to what we know about researcher needs and expectations. Ithaka S+R, a respected think tank that studies higher education and the humanities, recently released a report entitled “Supporting the Changing Research Practices of Historians” (I’d encourage everyone here to give it a good read; it has a ton of recommendations for organizations like this one grappling with the changing information landscape as it relates to history). One of its key findings is that Google is now firmly established as researchers’ first (and sometimes last) stop for research. Lament all you want, but it means that if you want to serve researchers better, your best bet isn’t to make your own online catalog better but instead to make sure your stuff shows up in Google. As the Library of Congress’s Trevor Owens puts it: “the next time someone tells you that they want to make a ‘gateway’ a ‘portal’ or a ‘registry’ of some set of historical materials you can probably stop reading. It already exists and it’s Google.” This speaks to a more general point, which is related closely to my previous point. Researchers come to your collection with a set of digital research practices and tools that they want to use, first and foremost among these being Google. Increasingly, researchers are looking to interact with your collections outside of your website. They are looking to pull collection items into personal reference management tools like Zotero. More sophisticated digital researchers are looking for ways to dump large data sets into an Excel spreadsheet for manipulation, analysis, and presentation. The most sophisticated digital historians are looking for direct connections to your database through open APIs. The lesson here is that whatever resources you have to dedicate to online research collections should go towards minimizing the time people spend on your website.
We tend to evaluate the success of our web pages with metrics like numbers of page views, time spent per page, and bounce rate. But when it comes to search the metrics are reversed: We don’t want people looking at lots of pages or spending a lot of time on our websites. We want our research infrastructure to be essentially invisible, or at least to be visible for only a very short period of time. What we really want with search is to allow researchers to get in and get out as quickly as possible with just what they were looking for.
  4. You aren’t making good use of the organization’s most valuable—and I mean that in terms of its share of the annual budget—resource: its staff expertise. Few things are certain when it comes to Internet strategy. The Internet is an incredibly complex ecosystem, and it changes extremely quickly. What works for one project or organization may not work for another organization six months from now. However, one ironclad rule of the Internet is that content drives traffic. Fresh, substantive content improves page rank, raises social media visibility, and brings people to the website. Your website should be changing and growing every day. The way to do that is to allow and encourage (even insist) that every staff member, down to the interns and docents, contribute something to the website. Everybody here should be blogging. Everyone should be experimenting. The web is the perfect platform for letting staff experiment: the web allows us to FAIL QUICKLY.
  5. You aren’t going to make any money. Digital is not a revenue center, it’s an operating cost like the reading room, or the permanent galleries, or the education department. You shouldn’t expect increased revenues from a website redesign any more than you should from a new coat of paint for the front door. But just like the reading room, the education programs, and the fresh coat of paint, digital media is vital to the organization’s mission in the 21st century. There are grants for special programs and possibly for initial capital expenditures (start-up costs), but on the whole, cultural organizations should consider digital as a cost of doing business. This means reconfiguring existing resources to meet the digital challenge. One important thing to remember about digital work is that its costs are almost entirely human (these days the necessary technology, software, equipment, bandwidth is cheap and getting cheaper). That means organizations should be able to afford a healthy digital strategy if they begin thinking about digital work as an integral part of the duties of existing staff in the ways I described earlier. You probably need a head of digital programs and possibly a technical assistant, but beyond that, you can achieve great success through rethinking/retraining existing human resources.
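The advice in point 3 above, making sure your stuff shows up in Google rather than polishing your own catalog, has a very concrete first step: publishing a sitemap that lists every collection item page for search engines to crawl. Here is a minimal sketch; the base URL and item identifiers are hypothetical placeholders, not any real institution's.

```python
# Hedged sketch: generate a sitemap.xml so collection item pages are
# discoverable by search engines. BASE_URL and the item IDs are invented.
from xml.sax.saxutils import escape

BASE_URL = "https://example.org/collections/item"  # hypothetical


def build_sitemap(item_ids):
    """Return sitemap XML listing one <url> entry per collection item."""
    urls = "\n".join(
        f"  <url><loc>{escape(f'{BASE_URL}/{item_id}')}</loc></url>"
        for item_id in item_ids
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
        f"{urls}\n"
        "</urlset>"
    )


print(build_sitemap(["1842-diary", "wwi-poster-07"]))
```

In practice a script like this would run against the full collections database and the resulting file would be registered with search engines, but the principle is the same at any scale: meet researchers where they already search.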

I’m happy to say that, aside from a few chilly looks (mainly from the staff members, rather than the board members, in the room), my no-holds-barred advice was graciously received. Time will tell if it was well received.

Nobody cares about the library: How digital technology makes the library invisible (and visible) to scholars

There is a scene from the first season of the television spy drama, Chuck, that takes place in a library. In the scene, our hero and unlikely spy, Chuck, has returned to his alma mater, Stanford, to find a book his former roommate, Bryce, has hidden in the stacks as a clue. All Chuck has to go on is a call number scribbled on a scrap of paper.

When he arrives in the stacks, he finds the book is missing and assumes the bad guys have beat him to it. Suddenly, however, Chuck remembers back to his undergraduate days of playing tag in the stacks with Bryce with plastic dart guns. Bryce had lost his weapon and Chuck had cornered him. Just then, Bryce reached beneath a shelf where he had hidden an extra gun, and finished Chuck off. Remembering this scene, Chuck reaches beneath the shelf where the book should have been shelved and finds that this time around Bryce has stashed a computer disk.

I like this clip because it illustrates how I think most people—scholars, students, geeks like Chuck—use the library. I don’t mean as the setting for covert intelligence operations or even undergraduate dart gun games. Rather, I think it shows that patrons take what the library offers and then use those offerings in ways librarians never intended. Chuck and his team (and the bad guys) enter the library thinking they are looking for a book with a given call number only to realize that Bryce has repurposed the Library of Congress Classification system to hide his disk. It reinforces the point when, at the end of the scene, the writers play a joke at the expense of a hapless librarian, who, while the action is unfolding, is trying to nail Chuck for some unpaid late fees. When the librarian catches up with Chuck, and Chuck’s partner Sarah shouts “Run!” she is not, as the librarian thinks, worried about late fees but about the bad guys with guns standing behind him. Chuck and his friends don’t care about the library. They use the library’s resources and tools in their own ways, to their own ends, and the concerns of the librarians are a distant second to the concerns that really motivate them.

In some ways, this disconnect between librarians (and their needs, ways of working, and ways of thinking) and patrons (and their needs and ways of working) is only exacerbated by digital technology. In the age of Google Books, JSTOR, Wikipedia, and ever expanding digital archives, librarians may rightly worry about becoming invisible to scholars, students, and other patrons—that “nobody cares about the library.” Indeed, many faculty and students may wonder just what goes on in that big building across the quad. Digital technology has reconfigured the relationship between librarians and researchers. In many cases, this relationship has grown more distant, causing considerable consternation about the future of libraries. Yet, while it is certainly true that digital technology has made libraries and librarians invisible to scholars in some ways, it is also true that, in some areas, digital technology has made librarians increasingly visible, increasingly important.

To try to understand the new invisibility/visibility of the library in the digital age let’s consider a few examples on both sides.

The invisible library

Does it matter that Chuck couldn’t care less about call numbers and late fees or about controlled vocabularies, metadata schemas, circulation policies, or theories of collections stewardship? I’m here to argue that it doesn’t. Don’t get me wrong. I’m not arguing that these things don’t matter or that the library should be anything but central to the university experience. But to play that central role doesn’t mean the library has to be uppermost in everyone’s mind. In the digital age, in most cases, the library is doing its job best when it is invisible to its patrons.

What do I mean by that? Let me offer three instances where the library should strive for invisibility, three examples of “good” invisibility:

Search: We tend to evaluate the success of our web pages with metrics like numbers of page views, time spent per page, and bounce rate. But with search the metrics are reversed: We don’t want people looking at lots of pages or spending a lot of time on our websites. We want the library web infrastructure to be essentially invisible, or at least to be visible for only a very short period of time. What we really want with search is to allow patrons to get in and get out as quickly as possible with just what they were looking for.

APIs and 3rd party mashups: In fact, we may not want people visiting library websites at all. What would be even better would be to provide direct computational access to collections databases so people could take the data directly and use it in their own applications elsewhere. Providing rich APIs (Application Programming Interfaces) would make the library even more invisible. People wouldn’t even come to our websites to access content, but they would get from us what they need where they need it.
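What this kind of API-enabled invisibility looks like in practice can be sketched in a few lines. Here a third-party script reshapes machine-readable collection records into a spreadsheet-ready CSV, never touching the library's website at all. The endpoint response shape and the records are hypothetical, not any real library's API.

```python
# Hedged sketch: a third party consumes a hypothetical /api/items JSON
# response and reshapes it for its own purposes (a researcher's spreadsheet).
import csv
import io

# What a JSON response from a hypothetical collections API might contain.
api_response = [
    {"id": 1, "title": "Whaling logbook", "date": "1847", "subject": "maritime"},
    {"id": 2, "title": "Mill payroll ledger", "date": "1891", "subject": "labor"},
]


def records_to_csv(records):
    """Flatten API records into CSV a researcher can open in a spreadsheet."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["id", "title", "date", "subject"])
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


print(records_to_csv(api_response))
```

The library's infrastructure is invisible here in exactly the good sense: the data flows out to where the researcher actually works.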

Social media: Another way in which we may want to discourage people from coming to library websites is by actively placing content on other websites. To the extent that a small or medium-sized library wants to reach general audiences, it has a better chance of doing so in places where that audience already is. Flickr Commons is one good example of this third brand of invisibility. Commentors on Flickr Commons may never travel back to the originating library’s website, but they may have had a richer interaction with that library’s content because of it.

The visible library

The experience of the digital humanities shows that the digital can also bring scholars into ever closer and more substantive collaboration with librarians. It is no accident that many if not most successful digital humanities centers are based in university libraries. Much of digital humanities is database driven, but an empty database is a useless database. Librarians have the stuff to fill digital humanists’ databases and the expertise to do so intelligently.

Those library-based digital humanities centers tend to skew towards larger universities. How can librarians at medium-sized or even small university libraries help the digital humanities? Our friend Wally Grotophorst, Associate University Librarian for Digital Programs and Systems at Mason, provides some answers in his brief but idea-rich post, What Happens To The Mid-Major Library? I’ll point to just three of Wally’s suggestions:

Focus on special collections, that is, anything people can’t get from shared sources like Google Books, JSTOR, LexisNexis, or HathiTrust. Not only do special collections differentiate you from other institutions online, they provide unique opportunities for researchers on campus.

Start supporting data-driven research in addition to the bibliographic-driven kind that has been the traditional bread and butter of libraries. Here I’d suggest tools and training for database creation, social network analysis, and simple text mining.
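Even the "simple text mining" end of that support can be demonstrated in a workshop with a few lines of code: word-frequency counts over a text, with common stopwords filtered out. This is a minimal sketch of the idea; real training would build up to fuller toolkits.

```python
# Minimal sketch of simple text mining: word frequencies with a stoplist.
from collections import Counter
import re

STOPWORDS = {"the", "of", "and", "a", "to", "in"}


def word_frequencies(text):
    """Count words in a text, ignoring case and common stopwords."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in STOPWORDS)


corpus = ("The whaling voyage of the Charles W. Morgan "
          "began in the port of New Bedford.")
print(word_frequencies(corpus).most_common(3))
```

A librarian who can walk a history graduate student through this in an afternoon has opened the door to the database creation and network analysis that come next.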

Start supporting new modes of scholarly communication—financially, technically, and institutionally. Financial support for open access publishing of the sort prescribed by the Compact for Open-Access Publishing Equity is one ready model. Hosting, supporting, and publicizing scholarly and student blogs as an alternative or supplement to existing learning management systems (e.g. Blackboard) is another. University Library/University Press collaboration, like the University of Michigan’s MPublishing reorganization, is a third.

Conclusion

In an information landscape increasingly dominated by networked resources, both sides of the librarian-scholar/student relationship must come to terms with a new reality that is in some ways more distant and in others closer than ever before. Librarians must learn to accept invisibility where digital realities demand it. Scholars must come to understand the centrality of library expertise and accept librarians as equal partners as more and more scholarship becomes born digital and the digital humanities goes from being a fringe sub-discipline to a mainstream pursuit. Librarians in turn must expand those services like special collections, support for data-driven research, and access to new modes of publication that play to their strengths and will best serve scholars. We all have to find new ways, better ways to work together.

So, where does that leave Chuck? Despite not caring about our work, Chuck actually remembers the library fondly as a place of play. Now maybe we don’t want people playing dart guns in the stacks. But applied correctly, digital technology allows our users and our staff to play, to be creative, and in their own way to make the most of the library’s rich resources.

Maybe the Chucks of the world do care about the library after all.

[This post is based on a talk I delivered at American University Library’s Digital Futures Forum. Thanks to @bill_mayer for his kind invitation. In memory of my dear friend Bob Griffith, who did too much to come and hear this lousy talk.]

Connecticut Forum on Digital Initiatives

Today, I’ll be speaking at the Connecticut Forum on Digital Initiatives at the Connecticut State Library under the catch-all title, “The Roy Rosenzweig Center for History and New Media: New initiatives, oldies but goodies, and partnership opportunities with ‘CHNM North’.” The long and short of it is that the institutional realities of being a grant-funded organization and the imperatives of the Web have meant that CHNM has from the beginning been a dynamic and entrepreneurial organization that’s always, always looking for new opportunities, new partners, new collaborations.

Among the projects I’ll point to are:

Partners wanted.

Omeka and Its Peers

As an open source, not-for-profit, warm-and-fuzzy, community service oriented project, we don’t normally like to talk about market rivals or competitive products when we talk about Omeka. Nevertheless, we are often asked to compare Omeka with other products. “Who’s Omeka’s competition?” is a fairly frequent question. Like many FAQs, there is an easy answer and a more complicated one.

The easy answer is that there is no competition. 😉 Omeka’s mix of ease of use, focus on presentation and narrative exhibition, adherence to standards, accommodation for library, museum, and academic users, open source license, open code flexibility, and low ($0) price tag really make it one of a kind. If you are a librarian, archivist, museum professional, or scholar who wants a free, open, relatively simple platform for building a compelling online exhibition, there really isn’t any alternative.

digital_amherst

[Figure 1. Digital Amherst, an award-winning Omeka powered project of the Jones Library in Amherst, MA.]

The more complicated answer is that there are lots of products on the market that do one or some of the things Omeka does. The emergence of the web has brought scholars and librarians, archivists, and museum professionals into ever closer contact and conversation, as humanists are required to think differently and more deeply about the nature of information and librarians are required to play an ever more public role online. Yet these groups’ respective tool sets have remained largely separate. Library and archives professionals operate in a world of institutional repositories (Fedora, DSpace), integrated library systems (Evergreen, Ex Libris), and digital collections systems (CONTENTdm, Greenstone). Museum professionals operate in a world of collections management systems (TMS, KE Emu, PastPerfect) and online exhibition packages (Pachyderm, eMuseum). The humanist or interpretive professional’s online tool set is usually based around an off-the-rack web content management system such as WordPress (for blogs), MediaWiki (for wikis), or Drupal (for community sites). Alas, even today too much of this front-facing work is still being done in Microsoft Publisher.

The collections professional’s tools are excellent for preserving digital collections, maintaining standardized metadata, and providing discovery services. They are less effective when it comes to exhibiting collections or providing the rich visual and interpretive context today’s web users expect. They are also often difficult to deploy and expensive to maintain. The blogs, wikis, and off-the-rack content management systems of the humanist (and, indeed, of the public programs staff within collecting institutions, especially museums) are the opposite: bad at handling collections and standardized metadata, good at building engaging experiences, and relatively simple and inexpensive to deploy and maintain.

Omeka aims to fill this gap by providing a collections-focused web publishing platform that offers both rigorous adherence to standards and interoperability with the collections professional’s toolkit and the design flexibility, interpretive opportunities, and ease of use of popular web authoring tools.
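The gap-filling move described above, one standards-based record feeding both machine interchange and human-readable exhibition, can be sketched abstractly. The record below is invented for illustration, though the field names follow the real Dublin Core element set that Omeka-style platforms build on.

```python
# Hedged sketch: one standardized metadata record (Dublin Core fields)
# serving both interchange and exhibit display. The item is invented.
item = {
    "dc:title": "Emily Dickinson herbarium page",
    "dc:creator": "Dickinson, Emily",
    "dc:date": "circa 1840",
    "dc:subject": "Botany",
}


def to_exhibit_caption(record):
    """Render the same standardized record as narrative exhibit text."""
    return (f"{record['dc:title']} ({record['dc:date']}), "
            f"created by {record['dc:creator']}.")


print(to_exhibit_caption(item))
```

The point is the single source of truth: the collections professional maintains one rigorous record, and the interpretive professional draws display text from it rather than retyping it into a brochure or a Publisher file.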

omeka_tech_ecosystem

[Figure 2. Omeka Technology Ecosystem]

By combining these functions, Omeka helps advance collaboration of many sorts: between collections professionals and interpretive professionals, between collecting institutions and scholars, between a “back of the house” and “front of the house” staff, and so on.

omeka_user_ecosystem

[Figure 3. Omeka User Ecosystem]

In doing so, Omeka also helps advance the convergence and communication between librarians, archivists, museum professionals, and scholars that the digital age has sparked, allowing LAM professionals to participate more fully in the scholarship of the humanities and humanists to bring sophisticated information management techniques to their scholarship.

Which brings us back to the short answer. There really is no competition.

Rethinking Access

[This week and next I’ll be facilitating the discussion of “Learning & Information” at the IMLS UpNext: Future of Museums and Libraries wiki. The following is adapted from the first open thread. Please leave any comments at UpNext to join in the wider discussion!]

In addition to the questions posted on the main page for this theme—I will be starting threads for each of those over the course of the next two weeks—something that has been on my mind lately is the question, “What is access?”

Over the past ten or fifteen years, libraries and museums have made great strides in putting collections online. That is an achievement in itself. But beyond a good search and usable interfaces, what responsibilities do museums and libraries have to their online visitors to contextualize those materials, to interpret them, to scaffold them appropriately for scholarly, classroom, and general use?

My personal feeling is that our definition of what constitutes “access” has been too narrow, that real access has to mean more than the broad availability of digitized collections. Rather, in my vision, true access to library and museum resources must include access to the expertise and expert knowledge that undergirds and defines our collections. This is not to say that museum and library websites don’t provide that broader kind of access; they often do. It’s just to say that the two functions are usually performed separately: first comes database access to collections material, then comes (sometimes yes, sometimes no, often depending on available funding) contextual and interpretive access.

What I’d like to see in the future—funders take note!—is a more inclusive definition of access that incorporates both things (what I’m calling database access and contextual access) from the beginning. So, in my brave new world, as a matter of course, every “access” project funded by agencies like IMLS would include support both for mounting collections online and for interpretive exhibits and other contextual and teaching resources. In this future, funding access equals funding interpretation and education.

Is this already happening? If so, how are museums and libraries treating access more broadly? If not, what problems do you see with my vision?

[Please leave comments at UpNext.]

Benchmarking Open Source: Measuring Success by "Low End" Adoption

In an article about Kuali adoption, the Chronicle of Higher Education quotes Campus Computing Project director Kenneth C. Green as saying,

With due respect to the elites that are at the core of Sakai and also Kuali, the real issue is not the deployment of Kuali or Sakai at MIT, at Michigan, at Indiana, or at Stanford. It’s really what happens at other institutions, the non-elites.

Indeed, all government- and charity (read, “foundation”)-funded open source projects should measure their success by adoption at the “low end.” That goes for library and museum technology as well; we could easily replace MIT, Michigan, Indiana, and Stanford in Mr. Green’s quote with Beinecke, Huntington, MoMA, and Getty. Though we still have a long way to go—the launch of Omeka.net will help a lot—Omeka aims at just that target.