Copyright and Mass-Digitization: The strategic importance of data-mining
Presentation Details
Matthew Sag
Professor of Law, Loyola University of Chicago
Abbreviated Time Line
2004: Google library project begins
2005: Class action suit filed by Authors Guild (among others)
2008 & 2009: Settlement proposed, objections follow, settlement revised
2011 (March): Settlement rejected
2011 (September): Authors Guild v. HathiTrust filed
2012 (August): Oral argument in Authors Guild v. HathiTrust
2012 (October): Judge Baer rules against the plaintiffs in Authors Guild v. HathiTrust. Library digitization (ADA + data) is fair use.
2013 (July): Second Circuit tells Judge Chin: no class certification without addressing the fair use issue
2013 (September): Oral argument on fair use in Authors Guild v. Google
The strategic importance of text-mining
Different kinds of digitization program raise different legal issues and bring in different stakeholders.
The Many Faces of Library/Archive Digitization
Preservation
Data production and analysis*
– Searching books, testing search algorithms, computational linguistics, automated translation, natural language processing, macro-analysis of text
A platform for display and distribution of individual works
– Disabled access*
– Scholarly access
– General access
Strategic Considerations
Library digitization for data production and analysis
– Significant academic and commercial constituency (not just Google!)
– Strong normative appeal
– Obvious orphan works problem
– Justifies digitizing entire collections
– Even if some other uses are ‘too much’, no all-copyright owner class action possible
The Legal Argument #1
Metadata – facts about the work – does not infringe the rights of the copyright owner.
– This is not usually contested, but it’s important to make sure everyone understands the reasons why metadata can’t infringe. Those reasons are …
– Idea-expression distinction
– Merger doctrine
– Metadata is not substantially similar to the underlying text
– Facts about the work don’t originate with the author
Whale v. Dinosaur
[Figure: word-frequency chart, axis running from 0 to 1200. Legible word labels include whale(s), old, boat(s), sea, such, hand(s), head, men, Captain, good, might, Starbuck, water, far, cried, world, crew, air, night.]
Legal Argument #2
A copying process that only produces metadata does not infringe.
– Intermediate non-expressive use is either (a) not copying in the relevant sense or (b) fair use.
– The distinction between expressive and nonexpressive parts of works is well recognized (no copyright in a phone book, etc.). The same distinction should be made in relation to potential acts of infringement.
– Intermediate non-expressive uses don’t communicate the author’s original expression to the public. No expressive substitution, no infringement.
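To make the idea of a metadata-only process concrete, here is a minimal Python sketch. It is an illustration, not a tool described in the presentation: the function name `word_counts` and the sample sentence are invented for this example. The program reads a text in full but outputs only facts about it (word frequencies), not the text itself.

```python
import re
from collections import Counter


def word_counts(text: str, top_n: int = 5) -> list[tuple[str, int]]:
    """Return the top_n most frequent words with their counts.

    The full text is read as an intermediate step, but only
    word-frequency metadata leaves this function.
    """
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(words).most_common(top_n)


# Made-up sample sentence for illustration (not a quotation from any book).
sample = "The whale, the whale! A boat upon the sea; the whale and the sea."
counts = word_counts(sample, top_n=3)
print(counts)
```

Nothing in the output allows a reader to reconstruct or substitute for the original sentence, which is the sense in which the use is non-expressive.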
Application to Fair Use
Sect. 107 factors:
(1) Purpose and character: Like transformative uses, a nonexpressive use poses no risk of expressive substitution.
(2) Nature of the work: … “not much use”
(3) Amount and substantiality: Like transformative uses, because there is no expressive substitution in a nonexpressive use, the amount of copying is qualitatively insignificant.
(4) Market effect: Like transformative uses, a nonexpressive use poses no risk of expressive substitution, thus no cognizable market effect.
Legal Argument #3
Non-expressive use does not harm copyright owners and has great social value
“The United States is” versus “The United States are” 1780 –1900
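A comparison like this one comes down to counting phrase occurrences per year across a dated corpus. The following Python sketch shows that counting step under stated assumptions: the function `phrase_counts_by_year` and the three-document toy corpus are invented for illustration and are not the data or method behind the slide's chart.

```python
import re
from collections import defaultdict


def phrase_counts_by_year(corpus, phrases):
    """Count occurrences of each phrase per year.

    corpus: iterable of (year, text) pairs.
    phrases: phrases to count, matched case-insensitively on word boundaries.
    """
    counts = defaultdict(lambda: dict.fromkeys(phrases, 0))
    for year, text in corpus:
        lowered = text.lower()
        for phrase in phrases:
            pattern = r"\b" + re.escape(phrase.lower()) + r"\b"
            counts[year][phrase] += len(re.findall(pattern, lowered))
    return dict(counts)


# Toy corpus with invented sentences; a real study would use digitized books.
toy_corpus = [
    (1800, "The United States are resolved. The United States are free."),
    (1900, "The United States is a nation. The United States is one."),
    (1900, "Some say the United States are many."),
]
result = phrase_counts_by_year(
    toy_corpus, ["the united states is", "the united states are"]
)
print(result)
```

Running the same count over a large digitized collection requires copying every book as an intermediate step, yet the output is only a table of numbers per year.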
American Slavery in American, English, and Irish Literature, 1800-1899. Matthew Jockers, Macroanalysis: Digital Methods for Literary History (2013)
Proportion of Irish Literature with a topic of ‘slavery’ spikes ~ 1860-65
Importance of the Digital Humanities Brief
Focused attention on digitization for the sake of data
Demonstrated importance
Disentangled it from other issues
Not just a Google issue, Not just an internet issue, Not just a research/scholarship issue
Powerful examples tied directly to the understanding of literature
» In case making the Internet work through caching and search was not enough for you!
Quotes from HathiTrust judgment …
– “I cannot imagine a definition of fair use that would not encompass the transformative uses made by Defendants' MDP and would require that I terminate this invaluable contribution to the progress of science and cultivation of the arts that at the same time effectuates the ideals espoused by the ADA.”
– “The search capabilities of the HDL have already given rise to new methods of academic inquiry such as text mining.” (brief cited)
– … metadata and text mining, which “could actually enhance the market for the underlying work, by causing researchers to revisit the original work and reexamine it in more detail” (brief quoted)
Impact of the Digital Humanities Amicus Brief
Three for the price of one:
– Authors Guild v. HathiTrust (district court)
– Authors Guild v. Google (district court)
– Authors Guild v. HathiTrust (court of appeals)
Over 100 signatories!
Discussed with approval in HathiTrust
The United States is/are example made its way into the judgment in HathiTrust last year and into oral argument in Google Books this week!
Some Concluding Thoughts
Specific legal issues vary by jurisdiction
– Fair use, fair dealing, legislative reform
Underlying policy questions are global
– Idea-expression distinction
– The promise of big data and the problem of orphan works
Challenge for libraries and archives is making courts/decision makers understand the broader consequences
Action Items
Commercial and non-commercial digitizers need to work together and defend everyone’s right to non-expressive use
– Digital Humanities, Linguistics, Comp. Sci., Libraries
– Search providers, plagiarism and copyright infringement detection tools, music identification tools, reverse engineering
Advantage of flexible limitations and exceptions
– Without reform, other nations cede ground to the U.S. as the data engine of the world.
Abbreviated Issues Summary
Issue | Status | Case | Notes
Preservation | Still open, but court unconvinced | v. HathiTrust |
Orphan works display | Still open, not ripe | v. HathiTrust | Trove (Australia); best practices
Disability access | Digitization ok | v. HathiTrust | On appeal
Data mining | Digitization ok | v. HathiTrust | All but given up in v. Google
Library copies as quid pro quo | Still open | v. Google | Easier now underlying use is fair use
Making/retaining excessive copies | Still open | v. Google |
Snippet display | Still open | v. Google |
Standing, remedies, class action … | Mixed | v. HathiTrust, v. Google |
Further Reading
Matthew Jockers, Matthew Sag & Jason Schultz, Digital Archives: Don’t Let Copyright Block Data Mining, 490 NATURE 29-30 (October 4, 2012)
Matthew Sag, Orphan Works as Grist for the Data Mill, 27 BERKELEY TECHNOLOGY LAW JOURNAL 1503 – 1550 (2012)
Matthew Sag, Copyright and Copy-Reliant Technology, 103 NORTHWESTERN UNIVERSITY LAW REVIEW 1607–1682 (2009)