![Page 1: Scientific data curation and processing with Apache Tika](https://reader036.vdocuments.us/reader036/viewer/2022062423/568146bd550346895db3f439/html5/thumbnails/1.jpg)
Scientific data curation and processing with Apache Tika
Chris A. Mattmann
Senior Computer Scientist, NASA Jet Propulsion Laboratory
Adjunct Assistant Professor, Univ. of Southern California
Member, Apache Software Foundation
![Page 2: Scientific data curation and processing with Apache Tika](https://reader036.vdocuments.us/reader036/viewer/2022062423/568146bd550346895db3f439/html5/thumbnails/2.jpg)
Roadmap
• 1st part of the talk
  – Why Tika?
  – What is Tika?
  – What are the current versions of Tika?
  – What can it do?
• 2nd part of the talk
  – NASA Earth Science Data Systems
  – Data System Needs and Requirements
  – How does Tika help?
![Page 3: Scientific data curation and processing with Apache Tika](https://reader036.vdocuments.us/reader036/viewer/2022062423/568146bd550346895db3f439/html5/thumbnails/3.jpg)
And you are?
• Apache Member involved in
  – Tika (VP, PMC), Nutch (PMC), Incubator (PMC), OODT (Mentor), SIS (Mentor), Lucy (Mentor) and Gora (Champion)
• Architect/Developer at NASA JPL in Pasadena, CA
• Software Architecture/Engineering Prof at USC
![Page 4: Scientific data curation and processing with Apache Tika](https://reader036.vdocuments.us/reader036/viewer/2022062423/568146bd550346895db3f439/html5/thumbnails/4.jpg)
The Information Landscape
![Page 5: Scientific data curation and processing with Apache Tika](https://reader036.vdocuments.us/reader036/viewer/2022062423/568146bd550346895db3f439/html5/thumbnails/5.jpg)
Proliferation of content types available
• By some accounts, 16K to 51K content types*
• What to do with content types?
  – Parse them
    • How?
    • Extract their text and structure
  – Index their metadata
    • In an indexing technology like Lucene, Solr, or in Google Appliance
  – Identify what language they belong to
    • Ngrams
*http://filext.com/
![Page 6: Scientific data curation and processing with Apache Tika](https://reader036.vdocuments.us/reader036/viewer/2022062423/568146bd550346895db3f439/html5/thumbnails/6.jpg)
Importance of content types
![Page 7: Scientific data curation and processing with Apache Tika](https://reader036.vdocuments.us/reader036/viewer/2022062423/568146bd550346895db3f439/html5/thumbnails/7.jpg)
Importance of content type detection
![Page 8: Scientific data curation and processing with Apache Tika](https://reader036.vdocuments.us/reader036/viewer/2022062423/568146bd550346895db3f439/html5/thumbnails/8.jpg)
Search Engine Architecture
![Page 9: Scientific data curation and processing with Apache Tika](https://reader036.vdocuments.us/reader036/viewer/2022062423/568146bd550346895db3f439/html5/thumbnails/9.jpg)
Goals
• Identify and classify file types
  – MIME detection
    • Glob pattern
      – *.txt
      – *.pdf
    • URL
      – http://…pdf
      – ftp://myfile.txt
    • Magic bytes
    • Combination of the above means
• Classification means reaction can be targeted
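The detection strategies listed on this slide can be sketched in a few lines of plain Java. This is only an illustration of the idea, not Tika's implementation: the class name `MimeSketch`, its two small lookup tables, and the rule "content signature first, then name, then a default" are all assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Illustrative only: a tiny glob-plus-magic detector, not Tika's implementation. */
final class MimeSketch {
    // Hypothetical registry: file name suffix (glob "*.ext") -> MIME type.
    private static final Map<String, String> GLOBS = new LinkedHashMap<>();
    // Hypothetical registry: leading "magic" bytes (as ASCII text) -> MIME type.
    private static final Map<String, String> MAGIC = new LinkedHashMap<>();

    static {
        GLOBS.put(".txt", "text/plain");
        GLOBS.put(".pdf", "application/pdf");
        MAGIC.put("%PDF-", "application/pdf");
    }

    /** Glob detection: match the file name (or URL) against known suffixes. */
    static String detectByName(String name) {
        if (name == null) return null;
        String lower = name.toLowerCase(Locale.ROOT);
        for (Map.Entry<String, String> e : GLOBS.entrySet()) {
            if (lower.endsWith(e.getKey())) return e.getValue();
        }
        return null;
    }

    /** Magic detection: compare the first bytes of the content to known signatures. */
    static String detectByMagic(byte[] prefix) {
        if (prefix == null) return null;
        for (Map.Entry<String, String> e : MAGIC.entrySet()) {
            byte[] sig = e.getKey().getBytes(StandardCharsets.US_ASCII);
            if (startsWith(prefix, sig)) return e.getValue();
        }
        return null;
    }

    /** Combination: trust the content first, then the name, then a generic default. */
    static String detect(byte[] prefix, String name) {
        String byMagic = detectByMagic(prefix);
        if (byMagic != null) return byMagic;
        String byName = detectByName(name);
        return byName != null ? byName : "application/octet-stream";
    }

    private static boolean startsWith(byte[] data, byte[] sig) {
        if (data.length < sig.length) return false;
        for (int i = 0; i < sig.length; i++) {
            if (data[i] != sig[i]) return false;
        }
        return true;
    }
}
```

Checking the content before the name means a PDF that was saved as `report.txt` is still classified as a PDF, which is the kind of case where combining the methods pays off.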
![Page 10: Scientific data curation and processing with Apache Tika](https://reader036.vdocuments.us/reader036/viewer/2022062423/568146bd550346895db3f439/html5/thumbnails/10.jpg)
Tika is…
• A content analysis and detection toolkit
• A set of Java APIs providing MIME type detection, language identification, integration of various parsing libraries
• A rich Metadata API for representing different Metadata models
• A command line interface to the underlying Java code
• A GUI interface to the Java code
![Page 11: Scientific data curation and processing with Apache Tika](https://reader036.vdocuments.us/reader036/viewer/2022062423/568146bd550346895db3f439/html5/thumbnails/11.jpg)
Tika’s (Brief) History
• Original idea for Tika came from Chris Mattmann and Jerome Charron in 2006
• Proposed as Lucene sub-project
  – Others interested, didn’t gain much traction
• Went the Incubator route in 2007 when Jukka Zitting found that there was a need for Tika capabilities in Apache Jackrabbit
  – A Content Management System
• Graduated from the Incubator to Lucene sub-project in 2008
• Graduated to Apache TLP in April 2010
• Over 90 issues shipping in latest release (0.8)
![Page 12: Scientific data curation and processing with Apache Tika](https://reader036.vdocuments.us/reader036/viewer/2022062423/568146bd550346895db3f439/html5/thumbnails/12.jpg)
Community
• Mailing lists
  – User: 153 peeps
  – Dev: 114 peeps
• Committers/PMC
  – 10 peeps
  – Probably 5-6 active
• Releases
  – 7 releases so far
  – Working on 0.8
Credit: svnsearch.org
![Page 13: Scientific data curation and processing with Apache Tika](https://reader036.vdocuments.us/reader036/viewer/2022062423/568146bd550346895db3f439/html5/thumbnails/13.jpg)
Getting started rapidly…like now!
• Download Tika from:
  – http://tika.apache.org/download.html
• Grab tika-app-0.7.jar
• alias tika "java -jar tika-app-0.7.jar"
• tika < somefile.doc > extracted-text.xhtml
• tika -m < somefile.doc > extracted.met
• Works on Windows too (alias only on UNIX)
![Page 14: Scientific data curation and processing with Apache Tika](https://reader036.vdocuments.us/reader036/viewer/2022062423/568146bd550346895db3f439/html5/thumbnails/14.jpg)
Detecting MIME types from Java
• String type = new Tika().detect(…)
  – java.io.InputStream
  – java.io.File
  – java.net.URL
  – java.lang.String
![Page 15: Scientific data curation and processing with Apache Tika](https://reader036.vdocuments.us/reader036/viewer/2022062423/568146bd550346895db3f439/html5/thumbnails/15.jpg)
Adding new MIME types
• Got XML?
• Based on freedesktop.org spec (loosely)
![Page 16: Scientific data curation and processing with Apache Tika](https://reader036.vdocuments.us/reader036/viewer/2022062423/568146bd550346895db3f439/html5/thumbnails/16.jpg)
Many custom applications and tools
• You need this: to read this:
![Page 17: Scientific data curation and processing with Apache Tika](https://reader036.vdocuments.us/reader036/viewer/2022062423/568146bd550346895db3f439/html5/thumbnails/17.jpg)
Third-party parsing libraries
• Most of the custom applications come with software libraries and tools to read/write these files
  – Rather than re-invent the wheel, figure out a way to take advantage of them
• Parsing text and structure is a difficult problem
  – Not all libraries parse text in equivalent manners
  – Some are faster than others
  – Some are more reliable than others
![Page 18: Scientific data curation and processing with Apache Tika](https://reader036.vdocuments.us/reader036/viewer/2022062423/568146bd550346895db3f439/html5/thumbnails/18.jpg)
Parsing
• String content = new Tika().parseToString(…)
  – InputStream
  – File
  – URL
![Page 19: Scientific data curation and processing with Apache Tika](https://reader036.vdocuments.us/reader036/viewer/2022062423/568146bd550346895db3f439/html5/thumbnails/19.jpg)
Streaming Parsing
• Reader reader = new Tika().parse(…)
  – InputStream
  – File
  – URL
![Page 20: Scientific data curation and processing with Apache Tika](https://reader036.vdocuments.us/reader036/viewer/2022062423/568146bd550346895db3f439/html5/thumbnails/20.jpg)
Extraction of Metadata
• Important to follow common Metadata models
  – Dublin Core – any electronic resource
  – XMP – also general like Dublin Core
  – Word Metadata – specific to .doc, .ppt, etc.
  – EXIF – image related
• Lots of standards and models out there
  – The use and extraction of common models allows for content intercomparison
  – All standardize mechanisms for searching
  – You always know for X file type that field Y is there and of type String or Int or Date
![Page 21: Scientific data curation and processing with Apache Tika](https://reader036.vdocuments.us/reader036/viewer/2022062423/568146bd550346895db3f439/html5/thumbnails/21.jpg)
Cancer Research Example
![Page 22: Scientific data curation and processing with Apache Tika](https://reader036.vdocuments.us/reader036/viewer/2022062423/568146bd550346895db3f439/html5/thumbnails/22.jpg)
Cancer Research Example
Attributes
Relationships
![Page 23: Scientific data curation and processing with Apache Tika](https://reader036.vdocuments.us/reader036/viewer/2022062423/568146bd550346895db3f439/html5/thumbnails/23.jpg)
Metadata
• Metadata met = new Metadata();
  // Dublin Core
  met.set(Metadata.FORMAT, "text/html");
  // multi-valued: add() appends, whereas a second set() would replace the first value
  met.add(Metadata.FORMAT, "text/plain");
  System.out.println(Arrays.toString(met.getValues(Metadata.FORMAT)));
• Other met models supported (HTTP Headers, Word, Creative Commons, Climate Forecast, etc.)
  – New in Tika 0.8! run: tika --list-met-models
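For multi-valued fields it matters whether a value is replaced or appended. The toy container below is a sketch of that distinction in plain Java; it is not Tika's `Metadata` class, and the name `MetSketch` and the field name `"format"` are made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: a toy multi-valued metadata container, not Tika's Metadata class. */
final class MetSketch {
    private final Map<String, List<String>> values = new HashMap<>();

    /** Replace any existing values for the field with a single value. */
    void set(String name, String value) {
        List<String> list = new ArrayList<>();
        list.add(value);
        values.put(name, list);
    }

    /** Append a value, keeping the ones already stored for the field. */
    void add(String name, String value) {
        values.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    /** Return all values for the field, or an empty array if none are stored. */
    String[] getValues(String name) {
        List<String> list = values.get(name);
        return list == null ? new String[0] : list.toArray(new String[0]);
    }

    public static void main(String[] args) {
        MetSketch met = new MetSketch();
        met.set("format", "text/html");
        met.add("format", "text/plain");
        // Prints: [text/html, text/plain]
        System.out.println(Arrays.toString(met.getValues("format")));
    }
}
```

Printing through `Arrays.toString` shows the stored values; printing the array directly would only show an object reference.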
![Page 24: Scientific data curation and processing with Apache Tika](https://reader036.vdocuments.us/reader036/viewer/2022062423/568146bd550346895db3f439/html5/thumbnails/24.jpg)
Methods for language identification
• N-grams
  – Method of detecting next character or set of characters in a sequence
  – Useful in determining whether small snippets of text come from a particular language or character set
• Non-computational approaches
  – Tagging
  – Looking for common words or characters
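To make the N-gram idea concrete, here is a minimal sketch that builds character-trigram counts for a text and picks the closest reference profile by cosine similarity. It is an assumption-laden illustration rather than Tika's algorithm: the class `NgramSketch`, the choice of trigrams, and the cosine measure are all picked for brevity, and real reference profiles would be built from far more text than a sentence or two.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Illustrative only: character-trigram language guessing, not Tika's LanguageIdentifier. */
final class NgramSketch {
    /** Count overlapping character trigrams in lower-cased text, with word boundaries as spaces. */
    static Map<String, Integer> profile(String text) {
        String s = " " + text.toLowerCase(Locale.ROOT).replaceAll("[^\\p{L}]+", " ").trim() + " ";
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + 3 <= s.length(); i++) {
            counts.merge(s.substring(i, i + 3), 1, Integer::sum);
        }
        return counts;
    }

    /** Cosine similarity between two trigram profiles, in the range [0, 1]. */
    static double similarity(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Integer other = b.get(e.getKey());
            if (other != null) dot += e.getValue() * other;
        }
        for (int v : b.values()) normB += v * v;
        return (normA == 0 || normB == 0) ? 0 : dot / Math.sqrt(normA * normB);
    }

    /** Return the key of the reference profile most similar to the text. */
    static String guess(String text, Map<String, Map<String, Integer>> references) {
        Map<String, Integer> p = profile(text);
        String best = null;
        double bestScore = -1;
        for (Map.Entry<String, Map<String, Integer>> e : references.entrySet()) {
            double score = similarity(p, e.getValue());
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }
}
```

A caller would build one profile per language from sample text, store them in a map keyed by language code, and pass that map to `guess` along with the snippet to classify.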
![Page 25: Scientific data curation and processing with Apache Tika](https://reader036.vdocuments.us/reader036/viewer/2022062423/568146bd550346895db3f439/html5/thumbnails/25.jpg)
Language Detection
• LanguageIdentifier lang = new LanguageIdentifier(new LanguageProfile(FileUtils.readFileToString(new File(filename))));
• System.out.println(lang.getLanguage());
• Uses Ngram analysis included with Tika
  – Originating from Nutch
  – Can be improved
![Page 26: Scientific data curation and processing with Apache Tika](https://reader036.vdocuments.us/reader036/viewer/2022062423/568146bd550346895db3f439/html5/thumbnails/26.jpg)
Running Tika in GUI form
• tika --gui
<html xmlns:html=“…”><body>…</body></html>
![Page 27: Scientific data curation and processing with Apache Tika](https://reader036.vdocuments.us/reader036/viewer/2022062423/568146bd550346895db3f439/html5/thumbnails/27.jpg)
Integrating Tika into your App
• Maven
• Ant
• Eclipse
• It’s just a set of jars
  – tika-core
  – tika-parsers
  – tika-app
  – tika-bundle
![Page 28: Scientific data curation and processing with Apache Tika](https://reader036.vdocuments.us/reader036/viewer/2022062423/568146bd550346895db3f439/html5/thumbnails/28.jpg)
Some really great stuff in 0.8
• Container aware detection and MIME improvements
• “Drop in” Parsers
  – Compressed RTF / TNEF / LZFU parsing available via external plugin at GitHub
• New Parsers
  – RSS
  – Scientific files: NetCDF, HDF
![Page 29: Scientific data curation and processing with Apache Tika](https://reader036.vdocuments.us/reader036/viewer/2022062423/568146bd550346895db3f439/html5/thumbnails/29.jpg)
Improvements to Tika
• Adding more parsers for content types
  – Omnigraffle?
• Expanding ability to handle random access file parsing
  – Scientific data file formats, some work on this
• Improving language and charset detection
![Page 30: Scientific data curation and processing with Apache Tika](https://reader036.vdocuments.us/reader036/viewer/2022062423/568146bd550346895db3f439/html5/thumbnails/30.jpg)
Part 2
Science Data Systems at NASA
![Page 31: Scientific data curation and processing with Apache Tika](https://reader036.vdocuments.us/reader036/viewer/2022062423/568146bd550346895db3f439/html5/thumbnails/31.jpg)
NASA Ground Data Systems
Credit: D. Woollard
![Page 32: Scientific data curation and processing with Apache Tika](https://reader036.vdocuments.us/reader036/viewer/2022062423/568146bd550346895db3f439/html5/thumbnails/32.jpg)
Context
• NASA develops science data processing systems for multiple earth science missions
• These systems convert the instrument telemetry delivered to earth from space into useful data for scientific research
• Typical characteristics
  – Remote sensing instruments that orbit the Earth multiple times daily
  – Data are acquired constantly
  – Complex algorithms convert instrument measurements to geophysical quantities
![Page 33: Scientific data curation and processing with Apache Tika](https://reader036.vdocuments.us/reader036/viewer/2022062423/568146bd550346895db3f439/html5/thumbnails/33.jpg)
The Square Kilometer Array
• 1 sq. km of antennas
• Never-before seen resolution looking into the sky
• 700 TB
  – Per second!
![Page 34: Scientific data curation and processing with Apache Tika](https://reader036.vdocuments.us/reader036/viewer/2022062423/568146bd550346895db3f439/html5/thumbnails/34.jpg)
NASA DESDynI Mission
• 16 TB/day
• Geographically distributed
• 10s of 1000s of jobs per day
• Tier 1 Earth Science Decadal Mission
![Page 35: Scientific data curation and processing with Apache Tika](https://reader036.vdocuments.us/reader036/viewer/2022062423/568146bd550346895db3f439/html5/thumbnails/35.jpg)
Some Considerations
• Scale
  – Data throughput rates
  – # of data types
  – # of metadata types
  – # of users to send the data to
• Federation
  – Must leave the data where it is
  – Socio/Economic/Political
• Heterogeneity
  – Technology, data formats, skills!
![Page 36: Scientific data curation and processing with Apache Tika](https://reader036.vdocuments.us/reader036/viewer/2022062423/568146bd550346895db3f439/html5/thumbnails/36.jpg)
Apache OODT
• We’ve got some components to deal with these issues
![Page 37: Scientific data curation and processing with Apache Tika](https://reader036.vdocuments.us/reader036/viewer/2022062423/568146bd550346895db3f439/html5/thumbnails/37.jpg)
How are we building these systems now?
- Allow for push/pull of data over arbitrary protocols
- Ingestion builds std catalog and archive
- Deliver product metadata to search, portal or GIS
- Plug in arbitrary met extractors
![Page 38: Scientific data curation and processing with Apache Tika](https://reader036.vdocuments.us/reader036/viewer/2022062423/568146bd550346895db3f439/html5/thumbnails/38.jpg)
How are we building these systems now?
- Separation of file management from workflow management
- Allow for heterogeneous computing resources
- Easily integrate PGEs
- Leverages same ingestion crawler
![Page 39: Scientific data curation and processing with Apache Tika](https://reader036.vdocuments.us/reader036/viewer/2022062423/568146bd550346895db3f439/html5/thumbnails/39.jpg)
What does this have to do with Tika?
Metadata Ext: TIKA!
Metadata Ext: TIKA!
MIME identification: TIKA!
MIME identification: TIKA!
![Page 40: Scientific data curation and processing with Apache Tika](https://reader036.vdocuments.us/reader036/viewer/2022062423/568146bd550346895db3f439/html5/thumbnails/40.jpg)
What does this have to do with Tika?
Metadata Ext: TIKA!
MIME identification: TIKA!
MIME identification: TIKA!
![Page 41: Scientific data curation and processing with Apache Tika](https://reader036.vdocuments.us/reader036/viewer/2022062423/568146bd550346895db3f439/html5/thumbnails/41.jpg)
Science Data File Formats
• Hierarchical Data Format (HDF)
  – http://www.hdfgroup.org
  – Versions 4 and 5
  – Lots of NASA data is in 4, newer NASA data in 5
  – Encapsulates
    • Observation (Scalars, Vectors, Matrices, NxMxZ…)
    • Metadata (Summary info, date/time ranges, spatial ranges)
  – Custom readers/writers/APIs in many languages
    • C/C++, Python, Java
![Page 42: Scientific data curation and processing with Apache Tika](https://reader036.vdocuments.us/reader036/viewer/2022062423/568146bd550346895db3f439/html5/thumbnails/42.jpg)
Science Data File Formats
• network Common Data Form (netCDF)
  – www.unidata.ucar.edu/software/netcdf/
  – Versions 3 and 4
  – Heavily used in DOE, NOAA, etc.
  – Encapsulates
    • Observation (Scalars, Vectors, Matrices, NxMxZ…)
    • Metadata (Summary info, date/time ranges, spatial ranges)
  – Custom readers/writers/APIs in many languages
    • C/C++, Python, Java
  – Not Hierarchical representation: all flat
![Page 43: Scientific data curation and processing with Apache Tika](https://reader036.vdocuments.us/reader036/viewer/2022062423/568146bd550346895db3f439/html5/thumbnails/43.jpg)
So how does it work?
• Ingestion
  – Science data files, ancillary information from other missions, etc., arrive in NetCDF or HDF format
  – Need to extract their met, catalog and archive them, etc.
• Can now use Tika to do this! TIKA-399 and TIKA-400 added this capability into the Apache trunk
• Processing
  – Processors (PGEs) generate NetCDF and HDF, must extract met, catalog and archive
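The ingest step described above (identify the format, run the matching met extractor, hand the result to the catalog) can be sketched as a small registry. Everything here is hypothetical and stdlib-only: `IngestSketch`, `register`, and `ingest` are names invented for the example and are neither Tika nor OODT APIs, and the MIME type strings are assumed labels. Only the file signatures come from the format specifications: HDF5 files begin with the bytes `\x89HDF\r\n\x1a\n`, and classic netCDF files begin with `CDF`.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Illustrative only: routes a file to a met extractor by format; all names are hypothetical. */
final class IngestSketch {
    // Hypothetical registry: MIME type -> function turning raw bytes into metadata fields.
    private final Map<String, Function<byte[], Map<String, String>>> extractors = new HashMap<>();

    /** Plug in a met extractor for one MIME type. */
    void register(String mimeType, Function<byte[], Map<String, String>> extractor) {
        extractors.put(mimeType, extractor);
    }

    /** Identify HDF5 and classic netCDF content by its leading signature bytes. */
    static String detect(byte[] head) {
        byte[] hdf5 = {(byte) 0x89, 'H', 'D', 'F', '\r', '\n', 0x1a, '\n'};
        byte[] netcdf = {'C', 'D', 'F'};
        // netCDF-4 files use HDF5 as their container, so they match the first test.
        if (startsWith(head, hdf5)) return "application/x-hdf";
        if (startsWith(head, netcdf)) return "application/x-netcdf";
        return "application/octet-stream";
    }

    /** Detect the type, run the matching extractor if one is registered, and return the met. */
    Map<String, String> ingest(byte[] content) {
        String type = detect(content);
        Map<String, String> met = new HashMap<>();
        Function<byte[], Map<String, String>> extractor = extractors.get(type);
        if (extractor != null) met.putAll(extractor.apply(content));
        met.put("Content-Type", type);
        return met;
    }

    private static boolean startsWith(byte[] data, byte[] sig) {
        if (data == null || data.length < sig.length) return false;
        for (int i = 0; i < sig.length; i++) {
            if (data[i] != sig[i]) return false;
        }
        return true;
    }
}
```

Keeping detection separate from extraction is the design point: a new science format needs one signature test and one registered extractor, and the same `ingest` path can serve both incoming files and products generated by PGEs.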
![Page 44: Scientific data curation and processing with Apache Tika](https://reader036.vdocuments.us/reader036/viewer/2022062423/568146bd550346895db3f439/html5/thumbnails/44.jpg)
Tool support
• Entire stacks of tools written around these formats
  – OPeNDAP, LAS, readers, writers, custom NASA mission toolkits
  – OGC
    • WMS, WCS, etc.
  – Unique, one of a kind software built around these data file formats
• Apache can contribute strongly in this area!
![Page 45: Scientific data curation and processing with Apache Tika](https://reader036.vdocuments.us/reader036/viewer/2022062423/568146bd550346895db3f439/html5/thumbnails/45.jpg)
Besides processing science files
• …Tika also helps with
• MIME identification
  – Useful in remote file acquisition
  – Useful in classification (catalog/archive) of existing content
  – Useful in crawling (see my Nutch talk)
• Language identification
  – Can be useful when data is coming from around the world, but need to quickly identify whether or not we can process it
![Page 46: Scientific data curation and processing with Apache Tika](https://reader036.vdocuments.us/reader036/viewer/2022062423/568146bd550346895db3f439/html5/thumbnails/46.jpg)
Big Goal
• More closely link OODT and Tika
  – Add new parser to Tika
  – Easily get OODT met extractor based on it
• Contribute back some features still baking in OODT
  – Configuration aspects of parsing
  – File types and extensions for science data files
• Spatial
  – Some work done in my CS572 class on spatial parser for Tika – would be great to integrate with Tika, OODT, SIS, and Solr
![Page 47: Scientific data curation and processing with Apache Tika](https://reader036.vdocuments.us/reader036/viewer/2022062423/568146bd550346895db3f439/html5/thumbnails/47.jpg)
NASA Geo Challenges
• Sometimes the data isn’t annotated with lat and lon
  – How to discover this?
• Even when the data is annotated with spatial information, computation of, e.g., a bounding box around the poles is difficult
• Efficiency and speed are difficult since data is at scale
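One reason bounding boxes are hard can be shown with a related case, assumed here for illustration: points that straddle the ±180° meridian. A naive min/max over longitudes reports a box nearly 360° wide, while a wrap-aware calculation finds the narrow extent. A swath that circles a pole touches every longitude, so even the wrap-aware span approaches 360° and the box stops being informative. The class and method names below are made up for this sketch.

```java
import java.util.Arrays;

/** Illustrative only: longitude extent of a non-empty set of points, naive vs wrap-aware. */
final class BBoxSketch {
    /** Naive extent: max minus min, which breaks when points straddle the +/-180 meridian. */
    static double naiveLonSpan(double[] lons) {
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        for (double lon : lons) {
            min = Math.min(min, lon);
            max = Math.max(max, lon);
        }
        return max - min;
    }

    /** Wrap-aware extent: 360 minus the largest empty gap between neighbouring longitudes. */
    static double wrappedLonSpan(double[] lons) {
        double[] sorted = lons.clone();
        Arrays.sort(sorted);
        // Gap that crosses the +/-180 meridian, from the largest longitude round to the smallest.
        double largestGap = sorted[0] + 360.0 - sorted[sorted.length - 1];
        for (int i = 1; i < sorted.length; i++) {
            largestGap = Math.max(largestGap, sorted[i] - sorted[i - 1]);
        }
        return 360.0 - largestGap;
    }
}
```

For two points at 179° and -179°, the naive span is 358° while the wrap-aware span is 2°; for points that do not cross the meridian the two methods agree.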
![Page 48: Scientific data curation and processing with Apache Tika](https://reader036.vdocuments.us/reader036/viewer/2022062423/568146bd550346895db3f439/html5/thumbnails/48.jpg)
Alright, I’ll shut up now
• Any questions?
• THANK YOU!
  – [email protected]
  – @chrismattmann on Twitter
![Page 49: Scientific data curation and processing with Apache Tika](https://reader036.vdocuments.us/reader036/viewer/2022062423/568146bd550346895db3f439/html5/thumbnails/49.jpg)
Acknowledgements
• Some Tika material inspired by Jukka Zitting’s talks
  – http://www.slideshare.net/jukka/text-and-metadata-extraction-with-apache-tika
  – http://www.slideshare.net/jukka/text-and-metadata-extraction-with-apache-tika-4427630
• NASA Jet Propulsion Laboratory
  – OODT Team
![Page 50: Scientific data curation and processing with Apache Tika](https://reader036.vdocuments.us/reader036/viewer/2022062423/568146bd550346895db3f439/html5/thumbnails/50.jpg)
Book
• Jukka and I are writing a book on Tika
  – Working on Chapters 8 and 9 of 15
• Early Access available through MEAP program
• http://manning.com/mattmann/