SEASR Analytics
Loretta Auvil
Automated Learning Group
Data-Intensive Technologies and Applications,
National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign
The SEASR project and its Meandre infrastructure are sponsored by The Andrew W. Mellon Foundation
SEASR Overview
SEASR Focus
• Project’s focus:
– Developing
– Integrating
– Deploying
– Sustaining a set of reusable and expandable software components and a supporting framework
• SEASR can benefit a broad set of data mining applications for scholars in the humanities
SEASR Goals
• The key goals are:
– Support the development of a state-of-the-art software environment for unstructured data management and analysis of digital libraries, repositories and archives
– Develop user interfaces, a data-flow engine and the data-flows that support data management, analysis and visualization
– Support education and training through workshops to promote its usage among scholars
Workshop Objective
The objective of the workshop is to:
• Introduce SEASR
• Learn what analytics SEASR can do
The SEASR Picture
SEASR Enables Scholarly Research
Discovery
– What are the words used in the corpus?
– What named entities (people, locations, dates) can be extracted?
– What hypotheses or rules can be generated from the “features” of the corpus?
– What “features” or language of the corpus best describe the corpus?
– What are the “similarities” among elements, documents, or corpora?
– What patterns can be identified?
Enables Scholars to Ask…
Pattern identification using automated learning
– Which patterns are characteristic of the English language?
– Which patterns are characteristic of a particular author, work, topic, or time?
– Which patterns based on words, phrases, sentences, etc. can be extracted from literary bodies?
– Which patterns are identified based on grammar or plot constructs?
– When are correlated patterns meaningful?
– Can they be categorized based on specific criteria?
– Can an author’s intent be identified given an extracted pattern?
Tag Cloud
• Counts tokens
• Several different filtering options supported
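The counting and filtering behind a tag cloud can be sketched in a few lines of Python. This is only an illustration, not the SEASR component itself: the function name `token_counts`, the tiny stop-word list, and the `min_length` and `top_n` filters are all assumptions chosen for the example.

```python
import re
from collections import Counter

# Illustrative stop-word list (an assumption, not SEASR's actual list).
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "it", "was"}

def token_counts(text, stop_words=STOP_WORDS, min_length=1, top_n=None):
    """Count tokens in text after applying simple filters.

    Returns (token, count) pairs sorted by descending count.
    """
    # Lowercase and split into runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    # Filtering options: drop stop words and very short tokens.
    kept = [t for t in tokens if t not in stop_words and len(t) >= min_length]
    return Counter(kept).most_common(top_n)

if __name__ == "__main__":
    sample = "It was the best of times, it was the worst of times"
    for token, count in token_counts(sample):
        print(token, count)
```

In a tag cloud, each count would then be mapped to a font size.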
Flesch-Kincaid Readability Test
• Results show scores for each item selected
– Designed to indicate comprehension difficulty when reading a passage of contemporary academic English
– Flesch Reading Ease: higher scores indicate material that is easier to read; lower numbers mark passages that are more difficult to read
– Flesch–Kincaid Grade Level: result is a number that corresponds with a grade level
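Both scores come from the standard published formulas, which depend only on the number of words, sentences, and syllables in the passage. The sketch below takes those counts as inputs. How the SEASR flow tokenizes text and counts syllables is not stated in the slides, so that step is left out, and the function names are illustrative.

```python
def flesch_reading_ease(words, sentences, syllables):
    """Flesch Reading Ease: higher scores mean easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

def flesch_kincaid_grade(words, sentences, syllables):
    """Flesch-Kincaid Grade Level: approximate U.S. school grade."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

if __name__ == "__main__":
    # Hypothetical passage: 100 words, 5 sentences, 150 syllables.
    print(round(flesch_reading_ease(100, 5, 150), 3))
    print(round(flesch_kincaid_grade(100, 5, 150), 2))
```

Longer sentences and more syllables per word lower the Reading Ease score and raise the Grade Level.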
Dunning Loglikelihood
• Feature comparison of tokens
• Specify an analysis document/collection
• Specify a reference document/collection
• Perform statistical comparison using Dunning Loglikelihood
Example showing over-represented tokens
Analysis Set: The Project Gutenberg EBook of A Tale of Two Cities, by Charles Dickens
Reference Set: The Project Gutenberg EBook of Great Expectations, by Charles Dickens
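For a single token, the comparison can be sketched with the commonly used two-term form of the log-likelihood statistic (G²). This is a sketch under assumptions: the slides do not say which variant the SEASR flow uses, and the function names and the counts in the usage example are hypothetical.

```python
import math

def log_likelihood(a, b, c, d):
    """Dunning log-likelihood (G^2) for one token.

    a: token count in the analysis set,  c: total tokens in the analysis set
    b: token count in the reference set, d: total tokens in the reference set
    """
    # Expected counts if the token were equally frequent in both sets.
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)
    g2 = 0.0
    if a > 0:
        g2 += a * math.log(a / e1)
    if b > 0:  # a zero count contributes nothing to the sum
        g2 += b * math.log(b / e2)
    return 2.0 * g2

def is_over_represented(a, b, c, d):
    """True if the token is relatively more frequent in the analysis set."""
    return a / c > b / d

if __name__ == "__main__":
    # Hypothetical counts: 40 of 10,000 tokens vs. 10 of 12,000 tokens.
    print(log_likelihood(40, 10, 10_000, 12_000))
    print(is_over_represented(40, 10, 10_000, 12_000))
```

The score measures how surprising the difference is. The relative frequencies tell you whether the token is over- or under-represented in the analysis set.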
Date Entities to Simile Timeline
• Entity Extraction with OpenNLP
• Dates viewed on Simile Timeline
Frequent Patterns
• Given: Set of documents
• Find frequent patterns such that
– Common word patterns used in the collection
• Evaluation: What are good patterns?
• Results: 1060 patterns discovered
322: Lincoln, 147: Abe, 117: man, 100: Mr., 100: time, 98: Lincoln Abe, 91: father, 85: Lincoln Mr., 85: Lincoln man, 75: day, 70: Abraham,
70: President, 68: boy, 67: Lincoln time, 65: Lincoln Abraham, 65: life, 63: Lincoln father, 57: men, 57: work, 52: Lincoln day, …
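The idea of counting word patterns that occur together in at least a minimum number of documents can be sketched by brute force. This naive enumeration only illustrates what a "pattern with support" is. It is not the algorithm the SEASR flow runs, and the function name, the `max_size` limit, and the toy documents are assumptions.

```python
from collections import Counter
from itertools import combinations

def frequent_patterns(documents, min_support, max_size=2):
    """Return word sets that occur in at least min_support documents.

    documents: list of token lists. Patterns are sorted tuples of
    distinct words, up to max_size words long.
    """
    support = Counter()
    for tokens in documents:
        items = sorted(set(tokens))  # each word counts once per document
        for size in range(1, max_size + 1):
            for pattern in combinations(items, size):
                support[pattern] += 1
    return {p: n for p, n in support.items() if n >= min_support}

if __name__ == "__main__":
    docs = [
        ["lincoln", "abe"],
        ["lincoln", "man"],
        ["lincoln", "abe", "man"],
    ]
    for pattern, n in sorted(frequent_patterns(docs, 2).items(),
                             key=lambda kv: -kv[1]):
        print(n, " ".join(pattern))
```

Brute-force enumeration grows quickly with vocabulary size. Dedicated frequent-pattern algorithms avoid generating every candidate.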
HITS Summarizer
• Find the top sentences and tokens from all items submitted
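One way to rank sentences and tokens together is to run HITS (hubs and authorities) on a graph that links each sentence to the tokens it contains. The sketch below treats sentences as hubs and tokens as authorities. That graph construction is an assumption made for illustration, not a description of how the SEASR component is built.

```python
import math

def hits_rank(sentences, iterations=50):
    """Score sentences (hubs) and tokens (authorities) with HITS.

    sentences: list of token lists.
    Returns (sentence_scores, token_scores).
    """
    token_sets = [set(s) for s in sentences]
    vocab = sorted(set().union(*token_sets)) if token_sets else []
    sent_scores = [1.0] * len(sentences)
    token_scores = {t: 1.0 for t in vocab}
    for _ in range(iterations):
        # Authority update: a token is scored by the sentences using it.
        token_scores = {t: 0.0 for t in vocab}
        for score, tokens in zip(sent_scores, token_sets):
            for t in tokens:
                token_scores[t] += score
        norm = math.sqrt(sum(v * v for v in token_scores.values())) or 1.0
        token_scores = {t: v / norm for t, v in token_scores.items()}
        # Hub update: a sentence is scored by the tokens it contains.
        sent_scores = [sum(token_scores[t] for t in tokens)
                       for tokens in token_sets]
        norm = math.sqrt(sum(v * v for v in sent_scores)) or 1.0
        sent_scores = [v / norm for v in sent_scores]
    return sent_scores, token_scores

if __name__ == "__main__":
    sents = [["a", "b", "c"], ["a", "b"], ["d"]]
    s_scores, t_scores = hits_rank(sents)
    print(s_scores)
    print(sorted(t_scores, key=t_scores.get, reverse=True))
```

Sorting by score gives the top sentences and tokens. A real summarizer would also tokenize, filter stop words, and handle ties.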
Text Clustering
• Clustering of Text by token counts
• Filtering options for stop words, Part of Speech
• Dendrogram Visualization
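A dendrogram records the order in which items are merged by hierarchical clustering. The sketch below clusters documents by token counts using cosine distance and single linkage, and returns the merge history. The distance measure, the linkage rule, and the function names are assumptions for the example, since the slides do not specify them.

```python
import math
from collections import Counter

def cosine_distance(a, b):
    """1 - cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)

def agglomerate(docs):
    """Single-linkage agglomerative clustering of token lists.

    Returns a list of merges (cluster_a, cluster_b, distance), where each
    cluster is a tuple of document indices. This is the information a
    dendrogram draws.
    """
    vectors = [Counter(d) for d in docs]
    clusters = [(i,) for i in range(len(docs))]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance of the closest pair of members.
                dist = min(cosine_distance(vectors[x], vectors[y])
                           for x in clusters[i] for y in clusters[j])
                if best is None or dist < best[0]:
                    best = (dist, i, j)
        dist, i, j = best
        merges.append((clusters[i], clusters[j], dist))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

if __name__ == "__main__":
    docs = [["whale", "sea", "ship"],
            ["whale", "sea", "sea"],
            ["court", "law", "judge"]]
    for left, right, dist in agglomerate(docs):
        print(left, right, round(dist, 3))
```

Stop-word or part-of-speech filtering would be applied to the token lists before counting.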
Audio Analysis
• NEMA: Executes a SEASR flow for each run
– Loads audio data
– Extracts features for every 10 sec moving window of audio
– Loads and applies the models
– Sends results back to the WebUI
• NESTER: Annotation of Audio via Spectral Analysis
Emotion Tracking
• Goal: a visualization of this type that tracks emotions across a text document (leveraging flare.prefuse.org)
Future: Application for Meme
“MemeTracker builds maps of the daily news cycle by analyzing around 900,000 news stories and blog posts per day from 1 million online sources, ranging from mass media to personal blogs”
Where Can I Run SEASR Analysis?
• Services that can be executed from
– SEASR website
– Zotero
– MONK
– VUE
SEASR Community Hub
• Explore existing flows to find others of interest
– Keyword Cloud
– Connections
• Find related flows
• Execute flow
• Comments
What is Zotero? (from Zotero Quick Start Guide)
• A citation manager. It is designed to store, manage, and cite bibliographic references, such as books and articles. In Zotero, each of these references constitutes an item.
• An extension for the Firefox web-browser by the Center for History and New Media at George Mason University.
• Installed by visiting zotero.org and clicking the download button on the page.
SEASR Analytics for Zotero
• An extension for the Firefox web-browser by the SEASR Team
• Uses your Zotero Collections
• Performs analysis using SEASR Services
The Value Add for SEASR & Zotero
• Analytical Results are saved as Zotero items (View Snapshot)
– Includes metadata
– Item naming strategy identifies the item or collection processed
– Creator indicates the Menu Label of the SEASR Analysis
• Related Tab links to the items processed in the Analysis
• No need to install the analysis; it runs as a web service
MONK
Executes flows for each analysis requested
– Predictive modeling using Naïve Bayes
– Predictive modeling using Support Vector Machines (SVM)
– Feature comparisons
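As an illustration of the first item, a multinomial Naïve Bayes text classifier with add-one smoothing fits in a short Python sketch. It is a generic textbook version, not MONK's or SEASR's implementation, and the function names, class labels, and training tokens are made up for the example.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labeled_docs):
    """Collect per-class document and token counts.

    labeled_docs: iterable of (token_list, label) pairs.
    """
    class_docs = Counter()
    token_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in labeled_docs:
        class_docs[label] += 1
        token_counts[label].update(tokens)
        vocab.update(tokens)
    return class_docs, token_counts, vocab

def classify(model, tokens):
    """Return the label with the highest log posterior for tokens."""
    class_docs, token_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_docs.items():
        score = math.log(n_docs / total_docs)  # log prior
        class_total = sum(token_counts[label].values())
        for t in tokens:
            if t not in vocab:
                continue  # ignore tokens never seen in training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += math.log((token_counts[label][t] + 1)
                              / (class_total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

if __name__ == "__main__":
    # Hypothetical labels and tokens, for illustration only.
    training = [
        (["love", "heart", "tears"], "sentimental"),
        (["sigh", "heart", "weep"], "sentimental"),
        (["ship", "whale", "harpoon"], "adventure"),
        (["sea", "ship", "storm"], "adventure"),
    ]
    model = train_naive_bayes(training)
    print(classify(model, ["whale", "sea"]))
```

The class with the highest score is the predicted label for the new text.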
SEASR Support in VUE
• Goal: Provide functionality in VUE to use SEASR flows
• Implementations:
– Add content to map
– Get metadata for content
– Get information about content
Meandre Workbench
• Web-based UI
• Components and flows are retrieved from server
• Additional locations of components and flows can be added to server
• Create flow using a graphical drag and drop interface
• Change property values
• Execute the flow
Extensible to Analysis that You Create
• You can run the flows we have on your own server or ask your university to host this analysis
• You can modify these flows and redeploy
• You can create new flows
– Perhaps you want to see only nouns or verbs
– Perhaps you want to see a list of extracted entities
• You can share these flows back to the community
Repository Search & Browse
Web Service
Interactive Web Application
Zotero Upload to Repository
Zotero to SEASR : Fedora
JSTOR Data for Research: SEASR Accesses APIs
• Access JSTOR API in SEASR components
• Use the output of these components with existing SEASR components
SEASR Central
• Sharing and finding flows and components
[Screenshot: SEASR Central page with featured components (e.g. Word Counter) and featured flows (e.g. FPGrowth), plus a browse list by type (Component, Flows) and category (Image, JSTOR, Zotero)]
Discussion Questions
• What kinds of data assets are you interested in?
• What analyses would you like to run against this data?