Dutch book trade 1660-1750: using the STCN to gain insight into publishers’ strategies
DESCRIPTION
Despite stagnating domestic demand near the end of the seventeenth century, Dutch book producers managed to maintain their international market position. In a so-called embedded research project, the Short Title Catalogue, Netherlands (STCN) was used to gain insight into the strategies and decisions of these publishers. The STCN is a retrospective bibliography of publications from 1540-1800, containing information on title, author, book producer, language, subject and collation. Historians and computer scientists collaborated to open up the STCN and to connect it to other relevant datasets. To explore the possibilities of, and difficulties in, opening up and linking the bibliography, attention was turned to one particular strategy: publishing scandalous books. Besides explaining the process of converting and querying the STCN data, the presentation will deal with differences in handling data and the advantages of an Open Data approach in humanities research.
TRANSCRIPT
e-Humanities Group Research Meeting: STCN
2013/10/10 Wouter Beek
Albert Meroño Peñuela Rinke Hoekstra
Fernie Maas Inger Leemans
‘OPENING’ THE STCN
LINKING THE STCN
Open data
Linked Open Data
• Connect to existing datasets
• Connect to services
• Queries/inferences run across datasets
– The Picarta topic hierarchy allows us to infer that certain publications cover related topics.
– GeoNames gives the latitude of publishing houses, allowing publishing decisions to be correlated to historical events.
– Lexvo / ISO standards allow translations to be traced via related languages (e.g. language families).
• Easy to create mashups / new applications.
died in
Biografisch portaal
same as
Taking the STCN to the Semantic Web
• 139,817 publications (4M facts)
• 23,543 authors (120K facts)
• 9,959 printers (55K facts)
• 37K enriched concepts (DBpedia, Yago, Heidelberg Diglit, …)
• 105 topics (1K facts)
• Relate to international standards (GGC/OCLC/ISO/RFC/IANA)
• Making the schema explicit (vocabulary)
Relational DB domain knowledge
RDF files
Text files ambiguous
XML files depends on structure
domain knowledge
Link to external sources (linksets) domain knowledge needed
Domain-independent data conversions fully automated
Simple RDF
Domain-dependent data conversions domain knowledge needed
Connect to services (e.g. query interface, maps)
high level of reuse
Fixing bad data origin inconsistencies
& inaccuracies
FROM THE LIBRARY TO THE LAB
“How many publications by Arminius?”
“How many publications by Gomarus?”
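Once the catalogue is available as facts, questions like the two above become simple counting queries. A minimal sketch in plain Python, assuming a hypothetical (subject, predicate, object) layout; the predicate name `author`, the `pub:` identifiers and the sample facts are invented for illustration and are not taken from the STCN:

```python
from collections import Counter

# Hypothetical facts in (subject, predicate, object) form; all values are invented.
TRIPLES = [
    ("pub:1", "author", "Arminius"),
    ("pub:2", "author", "Gomarus"),
    ("pub:3", "author", "Arminius"),
    ("pub:3", "language", "Latin"),
    ("pub:3", "author", "Arminius"),  # duplicate fact, must not be counted twice
]

def publications_per_author(triples):
    """Count distinct publications per author."""
    # A set removes duplicate (publication, author) pairs before counting.
    pairs = {(s, o) for s, p, o in triples if p == "author"}
    return Counter(author for _, author in pairs)

counts = publications_per_author(TRIPLES)
print(counts["Arminius"], counts["Gomarus"])
```

The same counts could be obtained with a query language over the linked data; the plain-Python version only shows the shape of the question.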
What happens to the average publication format after 1619?
Measured in terms of the number of folds:
• Works by Arminius: 5.6 5.7
• Works by Gomarus: 6.8 4.9
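One way to arrive at figures of this kind is to average a numeric format measure per author on either side of 1619. The sketch below assumes that reading of the question; the record layout, the `split_year` parameter and the sample values are hypothetical and do not reproduce the numbers on the slide:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (author, year of publication, numeric format measure).
RECORDS = [
    ("Arminius", 1610, 4),
    ("Arminius", 1612, 8),
    ("Arminius", 1625, 4),
    ("Gomarus", 1609, 8),
    ("Gomarus", 1630, 4),
    ("Gomarus", 1641, 4),
]

def mean_format_by_period(records, split_year=1619):
    """Average the format measure per (author, period) pair."""
    groups = defaultdict(list)
    for author, year, fmt in records:
        # Publications dated in the split year itself count as "up to".
        period = "after" if year > split_year else "up to"
        groups[(author, period)].append(fmt)
    return {key: mean(values) for key, values in groups.items()}

for key, value in sorted(mean_format_by_period(RECORDS).items()):
    print(key, value)
```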
Distant reading!
Methodological implications
From searching for resources (librarian) to validating/refuting hypotheses (scientist)
humR
humanities + R (statistics processing software)
A WEB SERVICE FOR RESEARCH INVOLVING DISTANT READING
Open issues 0: institutional hurdles
• The products of publicly funded research should be publicly available (papers and datasets).
– Not everybody makes their data publicly available.
• Distant reading research is often restricted by the user interface.
Open issues 1: meaning
A large percentage of the data has no/unknown meaning:
• “before 1808”
• “This book was published between the Big Bang and 1808.”
Context-dependent:
• “The first dinosaur walked the earth before 300M years BC.”
• “Einstein came up with the idea of general relativity before 1937.”
Fuzziness:
• “James Joyce’s Ulysses was published before 1925.”
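The context dependence can be made concrete by treating “before 1808” as a year range whose lower bound has to be supplied from outside the statement. A minimal sketch, assuming dates arrive as plain strings of the form `before YYYY`; the `YearRange` type and the `parse_before` helper are hypothetical, and the lower bound 1540 is only the start of the period the STCN covers:

```python
from typing import NamedTuple, Optional

class YearRange(NamedTuple):
    earliest: Optional[int]  # None means: no lower bound is known
    latest: int

def parse_before(text: str, context_earliest: Optional[int] = None) -> YearRange:
    """Read 'before YYYY' as a range that ends in the year before YYYY."""
    prefix = "before "
    if not text.startswith(prefix):
        raise ValueError(f"not a 'before YYYY' statement: {text!r}")
    year = int(text[len(prefix):])
    # The statement itself gives no lower bound; it has to come from context.
    return YearRange(context_earliest, year - 1)

print(parse_before("before 1808"))                         # lower bound unknown
print(parse_before("before 1808", context_earliest=1540))  # bounded by catalogue scope
```

This does not address fuzziness: a range of 1540 to 1807 is still treated as equally likely across all its years, which is rarely what the cataloguer meant.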
Open issues 2: statistics
• Which query results are statistically relevant?
• How to detect whether a statistically significant difference reflects reality and not the way in which the dataset was constructed?
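For the first question, one common option is a permutation test on the difference between two group means. The sketch below is an illustration of that technique, not the method used in the project; the function name, the number of rounds and the sample values are assumptions. It says nothing about the second question, since a bias introduced when the dataset was constructed survives any reshuffling of that same data.

```python
import random
from statistics import mean

def permutation_p_value(a, b, rounds=2000, seed=0):
    """Estimate how often a random regrouping gives a mean difference
    at least as large as the observed one (two-sided)."""
    rng = random.Random(seed)  # fixed seed keeps the estimate reproducible
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(rounds):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    return hits / rounds

# Invented format values for two groups of publications.
group_a = [2, 2, 2, 2, 2, 2]
group_b = [8, 8, 8, 8, 8, 8]
print(permutation_p_value(group_a, group_b))
```

A small value suggests the difference is unlikely to arise from random grouping alone; it does not show that the catalogue is a representative sample of what was printed.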