in praise of humanities data
DESCRIPTION
A presentation on "datum love," or, how to engage with large humanities datasets non-statistically.
TRANSCRIPT
This is my song in praise of humanities data, of primary sources and their digital surrogates.
“Humanities” and “data” are two terms that sit uneasily beside one another, because in the humanities we deliberately and on purpose do not practice the scientific method (unless we do). What is “the humanities method”? There isn’t one. We reserve the right to change our method with our mood, as befits a human studying humans in a human way.
1
If there is a humanities method, it could conceivably consist of this: a person, alone, reading. Not conducting experiments or studies: just reading. And then, writing. Philosophy, history, and the study of any language’s literature all see this as their archetypal method, I think, though not archaeology (which, yes, was classified as belonging to the humanities by no less than the National Endowment for the Humanities in 1965, the year of its founding). That is why it is a truism and cliché to say that the library is the humanities laboratory.
And what we read and how and how much and how quickly is changing, changing utterly. We live in an age where answers are as easy to come by as parking spaces. It wasn’t always so.
2
On Saturday, November 17, 1860, the masthead of the semi-scholarly London periodical Notes and Queries described itself, as it had for the last thirteen years, as “A Medium of Inter-Communication for Literary Men, Artists, Antiquaries, Genealogists, Etc.”
3
Below the masthead were advertisements of recently published or soon-to-be-published books, such as Carthage and Its Remains.
4
An ad for the London Library boasted of 80,000 volumes, a reading room “furnished with the principal Periodicals, English, French, German,” and a catalogue that could be purchased for only nine shillings and sixpence. “This EXTENSIVE LENDING LIBRARY, the only one of its kind in London,” was open from 10 to 6.
5
As usual, the journal was an active bulletin board of questions, answers, and miscellaneous contributions to knowledge, such as the correction to Forster’s Lives of Eminent Statesmen concerning the mistaken identity of one Lord Wentworth.
6
As for queries, “A Constant Reader and Subscriber” asked for “an authentic account of Sawney Bene, the Scotch cannibal,”
7
while X. Y. wondered if anyone could tell him who wrote the 1830 tragedy called “Wismar,” and Saxon asked, “Can you, or one of your correspondents, inform me by whom the term ‘God's Acre,’ as applied to a churchyard, was first used in English literature? It appears in the writings of Longfellow, who seems to have adopted it from the German; but I have some doubts whether it had not been previously used by one of our early writers -- George Herbert for instance.”
8
Most of these queries are easily answered today, by means other than this clever one of asking people. There’s a Wikipedia page for Sawney Bean, of course, which avers that he probably never existed (although the legend “is part of the Edinburgh tourism industry”).
9
The Oxford English Dictionary suggests that George Herbert never used the term “God’s Acre,” though it had appeared a few times at least in the 17th century,
10
and a Google Book Search generally confirms this, although the term also turns up in the 1828 Harvard Register -- a possible source for New Englander Longfellow -- as well as in a few other interesting sources.
11
The question of who wrote the 1830 drama “Wismar: A Tragedy” is a bit harder, however, and it may forever remain unanswered.
12
Notes and Queries is still published today, by Oxford Journals, but it has changed, as you can see by the table of contents.
13
The notes in themselves have changed somewhat, as well: they read more stiffly to me, they sound more professional, more academic, less personal, even considering the more formal Victorian diction in which an 1860 author explained that a mistake he had made took place in “a time of great domestic anxiety.”
I sometimes wonder whether Victorian humanists are staring longingly down at us from heaven, longing, just longing, to get their hands on our research tools. Of course, then, as now, not all humanities questions – perhaps not even most – could be answered with data, information, facts, research, and then, as now, there are scrupulous researchers and not-so-scrupulous researchers, which makes a big difference no matter what tools you have at your disposal.
22
The Victorian translator, critic, poet, librarian, and honorary M.A. Edmund Gosse, for instance, was a notoriously bad researcher. In the fall of 1886, in what Gosse’s biographer Ann Thwaite calls “the central episode of Edmund Gosse’s literary career,” the critic John Churton Collins attacked Gosse’s literary history From Shakespeare to Pope for its unscholarliness (277). “We have even refrained from discussing matters of opinion,” wrote Collins in a widely-read Quarterly Review piece. “We have confined ourselves entirely to matters of fact–to gross and palpable blunders, to unfounded and reckless assertions, to such absurdities in criticism and such vices of style as will in the eyes of discerning readers carry with them their own condemnation” (qtd. in Thwaite 282). Gosse was just about to take up a faculty position as Clark Lecturer at Cambridge when the denunciation appeared.
In the comments of Gosse’s biographer on the episode, we get a portrait of another kind of Victorian researcher. Thwaite writes, “There is no question that Collins was a fanatic and a pedant. Later in life he would search the registers of forty-two Norwich churches, trying to pin down the elusive birth-date of Robert Greene for an edition he was editing. But, as far as Gosse’s book was concerned, Collins happened to be right…From Shakespeare to Pope is full of extraordinary mistakes” (278). Gosse’s career did survive the blow–he took up his position as scheduled–but his reputation as a scholar was never the same. During the scandal, Henry James remarked in a letter that Gosse “has [emphasis original] a genius for inaccuracy which makes it difficult to dress his wounds” (qtd. in Thwaite 339).
23
Gosse’s research ineptitude or carelessness, however, probably contributed to the existence of some great poetry, for instance Dylan Thomas’s “Do not go gentle into that good night.” How, you ask?
24
“Do not go gentle into that good night” is a villanelle, a 19-line 6-stanza alternating-refrain poetic form with only two rhymes that, with a lot of help from Gosse, had for over a century the reputation of being an ancient French poetic form. In 1877, Gosse published an article in the Cornhill Magazine titled “A Plea for Certain Exotic Forms of Verse,” in which he explained the rules of six ancient (or “ancient”) French forms and gave examples, writing them himself when necessary.
In the article, Gosse reprinted a 16th-century poem titled “J’ay perdu ma tourterelle” by the French poet and professor of Latin Eloquence Jean Passerat, little realizing that his “example” was in fact the only early poem in that form. Gosse did write (in a slightly puzzled tone), “I do not find that much has been recorded of [the villanelle's] history, but it dates back at least as far as the fifteenth century” (64). Admittedly, at this point Gosse had done little worse than rely on a mistaken source, Théodore de Banville’s Petit traité de poésie of 1872, but he was later to repeat his error, with less excuse.
25
In 1879, two years after “A Plea for Certain Exotic Forms of Verse,” a Parisian bibliophile and poet named Joseph Boulmier published a book of villanelles in French all modeled after Passerat’s poem. It is likely that Boulmier owned a copy of the 1606 work in which Passerat’s “J’ay perdu ma Tourterelle” first appeared; he certainly seems to have been the first nineteenth-century admirer of the villanelle to consult it. But Boulmier the book collector did more than consult that single volume, a volume Gosse couldn’t have gotten hold of: he searched through everything he had, and came to the correct conclusion:
26
“One fine day, after having spoken successively of the rondeau, of the triolet, of the ballade, of the lai, of the virelai, of the chant royal, the author of I no longer know which treatise on versification, bungled to hell like they almost always are, finally tackled the villanelle, having the idea, or perhaps the luck, to cite as a model of this last genre–and after all he wasn’t wrong–a certain naïve masterpiece escaped, God knows how, from the pen of the scholar Passerat….The turtledove of Passerat once launched into circulation, what happened to it? All the treatises on versification that succeeded one another and copied one another in single file, accompanying this or that grammar, this or that rhyming dictionary, did not fail to drag it back on the scene, and especially to present it as a type from which it was absolutely forbidden to depart….Well, I say it without fear: you can, as I have done myself, page through all the essays on versification from the fifteenth and sixteenth century, one after another; you will not find there the least trace of Passerat’s turtledove, which is to say nothing that resembles this lovely form.”
27
In his entry on the villanelle for the monumental 1911 Encyclopaedia Britannica, Gosse (by then supposedly a wise elder) hardly retreated from the assertions he had made over thirty years earlier. Citing Boulmier, Gosse conceded that there were no schematic double-refrain villanelles before Passerat, yet (like Boulmier himself) he did not conclude that it was he and his contemporaries who were responsible for defining the modern form of the villanelle in the nineteenth century:
“VILLANELLE, a form of verse, originally loose in construction, but since the 16th century bound in exact limits of an arbitrary kind. . . . It appears, indeed, to have been by an accident that the special and rigorously defined form of the villanelle was invented. In the posthumous poems of Jean Passerat (1534-1602), which were printed in 1606, several villanelles were discovered, in different forms. One of these became, and has remained, so deservedly popular, that it has given its exact character to the subsequent history of the villanelle.”
Gosse’s plea, you see, had been successful, and because of his influence there had been a small villanelle vogue in England among the Parnassians at the end of the nineteenth century. James Joyce, eighteen years old in 1900, played along, and later reprinted a piece of his poetic juvenilia in 1914’s Portrait of the Artist as a Young Man. From there, and helped along by poetry handbooks quoting one another in single file, the villanelle became entrenched in English poetry with a reputation as an ancient French form, leading not only to “Do not go gentle into that good night” but also to Elizabeth Bishop’s “One Art” (recited in the Cameron Diaz flick In Her Shoes – “the art of losing isn’t hard to master.”)
28
When I first conducted the research on the villanelle that led to this tale of good (but ignored) and bungled (but influential) research, I took it upon myself to find the text or texts that caused Banville (Gosse’s 1872 source) to believe that the villanelle was an ancient French form. Banville had begun writing villanelles himself in 1845, so I began to search for any and all French poetry handbooks and anthologies published between 1606 and 1845, with special attention to early nineteenth-century works that Banville would probably have had to hand. I worried that I might have to search for works in other languages, as well, but it was surely best to begin with works in French.
My chief resource in compiling the list of titles was WorldCat, which I regularly plied with various "poe*" strings. From my carrel in the stacks of Alderman Library at the University of Virginia, I began to make forays into the stacks from which I would return with armfuls of books that I would then page through, just like Boulmier (how much had changed since 1879?), looking for mentions of the villanelle form or of Passerat or of "J'ay perdu ma Tourterelle," and looking for other poetry books to gather or to order from Interlibrary Loan. Whenever I visited the shelves, I would also scan the proximate volumes and, more often than not, scoop them up to take back to my carrel -- often, I'm sorry to say, without checking them out. I remember that it was a week or so into this process that I discovered a 1986 Slatkine reprint of an 1844 work by an author named Wilhelm Ténint. Standing at the shelf, I paged through until I found an entry that both cited Passerat and claimed that the villanelle was an old fixed form. Siegel also mentioned that Banville himself had made marginal notes on the manuscript of Ténint’s Prosodie.
Remember, now, the year was 2003. I had not only the well-stocked stacks of an excellent research library at my disposal, but also Google, and also the WorldCat database. Google -- the regular search engine, mind you, not Google Books -- gave me a few particularly good leads at other points in my research. After the Ténint discovery, I continued to look for other mentions of the villanelle form in early 19th-century French texts, but I found very little, almost nothing.
Flash forward five years, only five years, and imagine me now, if you will, engaged in co-writing a new entry on the villanelle for the forthcoming revised edition of the Princeton Encyclopedia of Poetry and Poetics, edited by Stephen Cushman, my dissertation advisor. This, obviously, was our big chance to correct the record about the villanelle in the gold standard of poetry handbooks. And so I revisited my search for mentions of the villanelle and of "J'ay perdu ma Tourterelle" between 1606 and 1845, and this time I used Google Books.
29
Using Google Book Search, I found 38 texts that might have influenced Ténint, 38 more sources in that trail of textual transmission, more evidence of what was known and thought about the villanelle in that fragile time when a mistake that would engrave itself in the record for more than a century was just beginning to flap its delicate butterfly wings. I didn't find anything that directly contradicted my claim that the Ténint work can be considered the chief entry point of the villanelle error, but what I did find were numerous texts that smoothed its way. To satisfy my conscience, I included two of the more popular dictionaries and encyclopedias that Google Book Search turned up for me in the PEPP entry.
I’m not sure I can convey properly through this somewhat procedural narrative the thrill I felt at finding the Ténint source by sifting through dozens of books with my bare hands, and the dismay I felt at finding (just a little too late) thirty-eight additional sources by sifting through millions of books with Google Book Search.
So much more data, so suddenly.
31
That “data dismay” is something researchers have always felt, of course. Witness Virginia Woolf’s description of a trip to the Library of the British Museum, feeling as though she would “need claws of steel and beak of brass even to penetrate the husk” of all her data, as though she were some kind of steampunk clockwork woodpecker.
32
One of the chief aims of digital humanists since the 90s has always been simply to get more stuff online, preferably in a scholarly way. We’re just now, especially but not exclusively with text, beginning to say, Okay, we’ve put a lot of stuff online. Our primary sources, our data, are now digital. Google has put a lot online, Twitter has put a lot online, humanity has put a lot online. Now what do we do with it?
In 2009, the National Endowment for the Humanities’s Office of Digital Humanities put that question to researchers, but almost as a dare. What can you do with all that data, they asked. Show us. The Digging into the Enlightenment project, for instance, will look at 53,000 18th-century letters.
33
The Digging into Data project is only part of a larger trend, sometimes called “distant reading,” in a term taken from Franco Moretti’s Graphs, Maps, Trees, shown here on the social reading site GoodReads.
34
Examples of distant reading include some of the work done with text mining, analysis, and visualization tools such as the MONK project, described in the 2008 article “How Not to Read a Million Books.”
35
Tanya Clement’s work with Gertrude Stein’s The Making of Americans is interesting not only for its conclusion, which is that the text has a decided structure and pattern that is not apparent to a human reader, but for its stated premise: that the work is unreadable by humans. (A neutral observation, not an aesthetic judgment.)
36
But “distant reading” need not, perhaps, entail statistics and machines. Speaking with a journalist at the New York Times about his book How to Talk About Books You Haven’t Read, Pierre Bayard described some very human and qualitative and incomplete and yet still valuable modes of distant reading:
37
“We are taught only one way of reading,” he said. “Students are told to read the book, then to fill out a form detailing everything they have read. It’s a linear approach that serves to enshrine books. People now come up to me to describe the cultural wounds they suffered at school. ‘You have to read all of Proust.’ They were traumatized.”
“They see culture as a huge wall, as a terrifying specter of ‘knowledge,’” he went on. “But we intellectuals, who are avid readers, know there are many ways of reading a book. You can skim it, you can start and not finish it, you can look at the index. You learn to live with a book.”
38
I think perhaps that large sets of humanities data, like books, can be read in the way Bayard describes: not comprehensively, but by living with them. Their sheer size suggests but need not entail statistical analysis and visual display. We can browse very large humanities datasets, skim them, live with them, instead of reading them in a linear fashion with computers. After all, how likely is it that the database itself is comprehensive? Isn’t it very likely itself simply a sample? The Reading Experience Database, for instance, itself a compelling example of a very large humanities dataset, admits quite charmingly that it can never be comprehensive:
39
“While RED may never be the comprehensive database that would allow us to make rigorously statistical arguments for reading habits in given places or time periods, it can function as a source of compelling examples. The more entries that go in it, the more it can approach the ideal, but it can never hope to be a comprehensive database of every archive, every annotated page, every diary manuscript, in the British World, 1450-1945, much and all as we may want it to!”
40
Like our datasets, our methods need not be comprehensive. Lately I’ve been interested in the possibilities of what I think of as “datum love”: the selection (random, serendipitous, affectionate) of compelling examples. In the Reading Experience Database, for instance, some idling through the byways turns up the interesting fact that Dickens was once at least read by “a revolutionary Russian rag merchant.” Isn’t that tidbit a spur to further inquiry? In my experience, faced with the “Fordist, functionalist” imperative to write that Kathleen mentioned yesterday, humanities scholars of any rank generally begin with a text, a topic, a theory, or a text and a topic and a theory, and we proceed on the assumption that we must produce an original interpretation or argument. What I wonder is whether instead we can begin with the data, or with a datum, and simply watch for what it may tell us, even if what it tells us is simply a story. What I hope is that all our data will bring forth a new age of humanistic induction, induction that can but need not necessarily rely on statistics and visualizations.
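For anyone who wants to picture that kind of random, serendipitous selection in practice, here is a minimal Python sketch. It is only an illustration under stated assumptions: the records are hypothetical placeholders (apart from the Dickens example quoted above), and the names `pick_datum` and `browse` are invented here rather than taken from the Reading Experience Database or any real tool.

```python
import random

# Hypothetical stand-ins for database entries; only the first echoes the talk.
records = [
    {"reader": "a revolutionary Russian rag merchant", "read": "Dickens"},
    {"reader": "placeholder reader B", "read": "placeholder text B"},
    {"reader": "placeholder reader C", "read": "placeholder text C"},
    {"reader": "placeholder reader D", "read": "placeholder text D"},
]


def pick_datum(entries, seed=None):
    """Return one entry chosen at random: a single datum to start from."""
    rng = random.Random(seed)
    return rng.choice(entries)


def browse(entries, how_many=3, seed=None):
    """Return a few distinct entries to skim, never more than exist."""
    rng = random.Random(seed)
    return rng.sample(entries, min(how_many, len(entries)))


if __name__ == "__main__":
    datum = pick_datum(records)
    print(f"{datum['read']} was read by {datum['reader']}")
```

Passing a seed makes a selection repeatable, which may help when an example needs to be found again later; leaving it out keeps the serendipity.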
41
And what I hope, too, is that more compilers of databases will recognize that they are at least as well-fitted as anyone to tell us what the data can tell us. Archivists and librarians, especially, know the data, because they feed and groom it. Tim Sherratt, for instance, is an archivist, historian, and programmer in Australia who has recently begun a project called Invisible Australians. When he worked for the National Archives of Australia, Sherratt noticed that there were a great many print records that could be converted into structured data.
42
One such type of print record is the “Certificate Exempting from Dictation Test,” or CEDT. The CEDT was a bureaucratic outgrowth of the White Australia Policy, which restricted non-white immigration to Australia from 1901 to 1973: it was a form that enabled existing non-white residents of Australia to leave and re-enter the country without being mistaken for immigrants. Over 50,000 paper CEDTs reside in the National Archives of Australia, and these forms have a great deal to tell history about some of the people on the margins of history. Obviously, they could tell more if their data were digital: enter the Invisible Australians project.
43
Choosing a CEDT subject surely almost at random, Sherratt narrates some of the life of Charlie Allen, a half-Chinese man:
“Charlie was born in Sydney in 1896. His mother was Frances Allen (sometime sweet shop owner and brothel keeper), his father Charlie Gum (a buyer for Wing On company). Charlie was raised by his mother, but in 1909, at the age of thirteen, he was taken to China by his father. His father returned to Sydney, leaving Charlie in China. He lived with relatives in the town of Shekki (inland from Hong Kong) for six years. Charlie was homesick, but had no means of getting back to Australia. His mother attempted to enlist government help but to no avail. Charlie finally returned in 1915. The following year he enlisted in First AIF (well, actually he enlisted three times, and was discharged as medically unfit each time). Charlie married in Sydney in 1917 and had two daughters soon after. He returned to China in 1922 for seven months. Charlie Allen died in 1938 as the result of an industrial accident. He was forty-one.”
To my mind, Sherratt is nearly the ideal digital humanist, not only because he is a builder of databases, but because his instinct, once he has built a database, is to use it to tell stories. Few or no graphs, maps, and trees for him.
44