

Post on 14-Dec-2015


Page 1: SEASR Analytics Loretta Auvil lauvil@illinois.edu Automated Learning Group Data-Intensive Technologies and Applications, National Center for Supercomputing

SEASR Analytics

Loretta Auvil

lauvil@illinois.edu

Automated Learning Group

Data-Intensive Technologies and Applications,

National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign

The SEASR project and its Meandre infrastructure are sponsored by The Andrew W. Mellon Foundation

Page 2:

SEASR Overview

Page 3:

SEASR Focus

• Project’s focus:

– Developing

– Integrating

– Deploying

– Sustaining a set of

• Reusable and

• Expandable software components and

• Supporting framework

• SEASR can benefit a broad set of data mining applications for scholars in the humanities

Page 4:

SEASR Goals

• The key goals are:

– Support the development of a state-of-the-art software environment for unstructured data management and analysis of digital libraries, repositories and archives

– Develop user interfaces, a data-flow engine and the data flows for data management, analysis and visualization

– Support education and training through workshops to promote its usage among scholars

Page 5:

Workshop Objective

The objective of the workshop is to:

• Introduce SEASR

• Learn what analytics SEASR can do

Page 6:

The SEASR Picture

Page 7:

SEASR Enables Scholarly Research

Discovery

– What are the words used in the corpus?

– What named entities (people, locations, dates) can be extracted?

– What hypotheses or rules can be generated from the “features” of the corpus?

– What “features” or language of the corpus best describe the corpus?

– What are the “similarities” among elements, documents, or corpora?

– What patterns can be identified?

Page 8:

Enables Scholars to Ask…

Pattern identification using automated learning

– Which patterns are characteristic of the English language?

– Which patterns are characteristic of a particular author, work, topic, or time?

– Which patterns based on words, phrases, sentences, etc. can be extracted from literary bodies?

– Which patterns are identified based on grammar or plot constructs?

– When are correlated patterns meaningful?

– Can they be categorized based on specific criteria?

– Can an author’s intent be identified given an extracted pattern?

Page 9:

Tag Cloud

• Counts tokens

• Several different filtering options supported
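
A minimal sketch of the token counting and filtering behind a tag cloud. The stop-word list, the minimum token length and the font-size range are illustrative assumptions, not SEASR's actual filtering options.

```python
import re
from collections import Counter

# Hypothetical stop-word list for illustration; not SEASR's actual filter.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it"}

def token_counts(text, stop_words=STOP_WORDS, min_length=2):
    """Count lowercase word tokens, skipping stop words and very short tokens."""
    tokens = re.findall(r"[a-z']+", text.lower())
    kept = [t for t in tokens if t not in stop_words and len(t) >= min_length]
    return Counter(kept)

def cloud_weights(counts, top_n=10, min_size=10, max_size=40):
    """Map the top_n counts linearly onto font sizes from min_size to max_size."""
    top = counts.most_common(top_n)
    if not top:
        return {}
    highest, lowest = top[0][1], top[-1][1]
    span = (highest - lowest) or 1  # avoid dividing by zero when all counts match
    return {
        token: min_size + (n - lowest) * (max_size - min_size) / span
        for token, n in top
    }
```

Calling `cloud_weights(token_counts(text))` returns one font size per token, which a front end could use to draw the cloud.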

Page 10:

Flesch-Kincaid Readability Test

• Results show scores for each item selected

– Designed to indicate comprehension difficulty when reading a passage of contemporary academic English

– Flesch Reading Ease: higher scores indicate material that is easier to read; lower numbers mark passages that are more difficult to read

– Flesch–Kincaid Grade Level: result is a number that corresponds with a grade level
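
The two scores can be sketched as follows, using the standard published Flesch Reading Ease and Flesch-Kincaid Grade Level formulas. The syllable counter is a rough vowel-group heuristic and the sentence splitter is a simple regular expression; both are assumptions for illustration, not the SEASR component's method.

```python
import re

def count_syllables(word):
    """Rough syllable estimate: count vowel groups, with a silent-'e' adjustment."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)

def flesch_scores(words, sentences, syllables):
    """Return (reading_ease, grade_level) from raw counts."""
    words_per_sentence = words / sentences
    syllables_per_word = syllables / words
    reading_ease = 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word
    grade_level = 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
    return reading_ease, grade_level

def score_text(text):
    """Score a passage; raises ValueError when it contains no words."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not words or not sentences:
        raise ValueError("text must contain at least one word")
    syllables = sum(count_syllables(w) for w in words)
    return flesch_scores(len(words), len(sentences), syllables)
```

A passage with short sentences and mostly one-syllable words yields a high Reading Ease and a low Grade Level; longer sentences and longer words push both scores the other way.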

Page 11:

Dunning Loglikelihood

• Feature comparison of tokens

• Specify an analysis document/collection

• Specify a reference document/collection

• Perform statistical comparison using Dunning log-likelihood

Example showing over-represented tokens:

Analysis Set: The Project Gutenberg EBook of A Tale of Two Cities, by Charles Dickens

Reference Set: The Project Gutenberg EBook of Great Expectations, by Charles Dickens
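
As a rough illustration of the comparison, the sketch below computes the log-likelihood ratio statistic (G2) for each token from a 2x2 table of observed against expected counts and lists the tokens that are over-represented in the analysis set. The function names and the `top_n` parameter are hypothetical, and both token lists are assumed to be non-empty; the SEASR flow may differ in tokenization and filtering.

```python
import math
from collections import Counter

def log_likelihood(a, b, total_a, total_b):
    """G2 for one token.

    a, b: token count in the analysis and reference sets.
    total_a, total_b: total token counts of the two sets.
    """
    n = total_a + total_b
    token_total = a + b
    other_total = n - token_total
    observed = [a, b, total_a - a, total_b - b]
    expected = [
        total_a * token_total / n,
        total_b * token_total / n,
        total_a * other_total / n,
        total_b * other_total / n,
    ]
    # Cells with an observed count of zero contribute nothing to the sum.
    g2 = sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)
    return 2.0 * g2

def over_represented(analysis_tokens, reference_tokens, top_n=10):
    """Tokens relatively more frequent in the analysis set, highest G2 first."""
    a_counts, r_counts = Counter(analysis_tokens), Counter(reference_tokens)
    total_a, total_b = sum(a_counts.values()), sum(r_counts.values())
    scored = []
    for token, a in a_counts.items():
        b = r_counts.get(token, 0)
        if a / total_a > b / total_b:
            scored.append((log_likelihood(a, b, total_a, total_b), token))
    return [token for _, token in sorted(scored, reverse=True)[:top_n]]
```

A token with the same relative frequency in both sets scores 0; the score grows as the two frequencies diverge.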

Page 12:

Date Entities to Simile Timeline

• Entity Extraction with OpenNLP

• Dates viewed on Simile Timeline

Page 13:

Frequent Patterns

• Given: Set of documents

• Find: Frequent patterns, that is, word patterns common in the collection

• Evaluation: What are good patterns?

• Results: 1060 patterns discovered

322: Lincoln
147: Abe
117: man
100: Mr.
100: time
98: Lincoln Abe
91: father
85: Lincoln Mr.
85: Lincoln man
75: day
70: Abraham
70: President
68: boy
67: Lincoln time
65: Lincoln Abraham
65: life
63: Lincoln father
57: men
57: work
52: Lincoln day
…
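
The idea of counting word patterns by support can be sketched with a brute-force counter over single words and word pairs. This is not FP-Growth or the SEASR flow itself; the `min_support` and `max_size` parameters are illustrative assumptions, and exhaustive enumeration like this only scales to small inputs.

```python
from collections import Counter
from itertools import combinations

def frequent_patterns(documents, min_support=2, max_size=2):
    """Return {pattern: support} for word sets of up to max_size words.

    documents: list of token lists. Support is the number of documents
    that contain every word in the pattern.
    """
    support = Counter()
    for doc in documents:
        words = sorted(set(doc))  # each word counts once per document
        for size in range(1, max_size + 1):
            for combo in combinations(words, size):
                support[combo] += 1
    return {p: n for p, n in support.items() if n >= min_support}

def ranked(patterns):
    """Sort patterns by descending support, then alphabetically."""
    return sorted(patterns.items(), key=lambda kv: (-kv[1], kv[0]))
```

Sorting with `ranked(frequent_patterns(docs))` gives a support-ordered listing in the same spirit as the results above.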

Page 14:

HITS Summarizer

• Find the top sentences and tokens from all items submitted
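
One way to picture a HITS-style summarizer is to treat sentences as hubs and tokens as authorities, then let the two sets of scores reinforce each other. The sketch below makes that assumption; the actual SEASR component may build its graph and weights differently, and `hits_rank` is a hypothetical name.

```python
import math
import re

def hits_rank(sentences, iterations=50):
    """Return (sentence_scores, token_scores) from HITS on a sentence-token graph."""
    token_sets = [set(re.findall(r"[a-z']+", s.lower())) for s in sentences]
    vocab = set().union(*token_sets)
    hub = [1.0] * len(sentences)
    auth = {t: 1.0 for t in vocab}
    for _ in range(iterations):
        # A token's authority is the sum of the hub scores of its sentences.
        auth = {t: 0.0 for t in vocab}
        for h, tokens in zip(hub, token_sets):
            for t in tokens:
                auth[t] += h
        norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {t: v / norm for t, v in auth.items()}
        # A sentence's hub score is the sum of the authorities of its tokens.
        hub = [sum(auth[t] for t in tokens) for tokens in token_sets]
        norm = math.sqrt(sum(h * h for h in hub)) or 1.0
        hub = [h / norm for h in hub]
    return hub, auth
```

The highest-scoring sentences and tokens would then be reported as the summary. In this sketch, sentences that share vocabulary with many other sentences rise to the top.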

Page 15:

Text Clustering

• Clustering of Text by token counts

• Filtering options for stop words, Part of Speech

• Dendrogram Visualization
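
A small sketch of hierarchical clustering on token counts, whose merge list is the information a dendrogram draws. Cosine distance and average linkage are assumptions chosen for illustration; the slides do not say which distance or linkage SEASR uses.

```python
import math
from collections import Counter

def cosine_distance(c1, c2):
    """1 minus the cosine similarity of two token-count vectors."""
    dot = sum(c1[t] * c2[t] for t in c1 if t in c2)
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    if n1 == 0 or n2 == 0:
        return 1.0
    return 1.0 - dot / (n1 * n2)

def agglomerate(docs):
    """Cluster documents bottom-up.

    docs: {name: list of tokens}.
    Returns the merges in order as (cluster_a, cluster_b, distance).
    """
    vectors = {name: Counter(tokens) for name, tokens in docs.items()}
    clusters = [(name,) for name in docs]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Average linkage: mean pairwise distance between members.
                dists = [cosine_distance(vectors[a], vectors[b])
                         for a in clusters[i] for b in clusters[j]]
                d = sum(dists) / len(dists)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges
```

Stop-word or part-of-speech filtering would be applied to the token lists before they are passed to `agglomerate`.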

Page 16:

Audio Analysis

• NEMA: Executes a SEASR flow for each run

– Loads audio data

– Extracts features for every 10-second moving window of audio

– Loads and applies the models

– Sends results back to the WebUI

• NESTER: Annotation of Audio via Spectral Analysis

Page 17:

Emotion Tracking

The goal is to use this type of visualization to track emotions across a text document (leveraging flare.prefuse.org)

Page 18:

Future: Application for Meme

“MemeTracker builds maps of the daily news cycle by analyzing around 900,000 news stories and blog posts per day from 1 million online sources, ranging from mass media to personal blogs”

Page 19:

Where Can I Run SEASR Analysis?

• Services that can be executed from

– SEASR website

– Zotero

– MONK

– VUE

Page 20:

SEASR Community Hub

• Explore existing flows to find others of interest

– Keyword Cloud

– Connections

• Find related flows

• Execute flow

• Comments

Page 21:

What is Zotero? (from Zotero Quick Start Guide)

• A citation manager. It is designed to store, manage, and cite bibliographic references, such as books and articles. In Zotero, each of these references constitutes an item.

• An extension for the Firefox web-browser by the Center for History and New Media at George Mason University.

• Installed by visiting zotero.org and clicking the download button on the page.

Page 22:

SEASR Analytics for Zotero

• An extension for the Firefox web-browser by the SEASR Team

• Uses your Zotero Collections

• Performs analysis using SEASR Services

Page 23:

The Value Add for SEASR & Zotero

• Analytical Results are saved as Zotero items (View Snapshot)

– Includes metadata

– Item naming strategy identifies the item or collection processed

– Creator indicates the Menu Label of the SEASR Analysis

• Related Tab links to the items processed in the Analysis

• No need to install the analysis; it runs as a web service

Page 24:

MONK

Executes flows for each analysis requested

– Predictive modeling using Naïve Bayes

– Predictive modeling using Support Vector Machines (SVM)

– Feature comparisons

Page 25:

SEASR Support in VUE

• Goal: Provide functionality in VUE to use SEASR flows

• Implementations:

– Add content to map

– Get metadata for content

– Get information about content

Page 26:

Meandre Workbench

• Web-based UI

• Components and flows are retrieved from server

• Additional locations of components and flows can be added to server

• Create flow using a graphical drag and drop interface

• Change property values

• Execute the flow


Page 27:

Extensible to Analysis that You Create

• You can run our flows on your own server or ask your university to host the analysis

• You can modify these flows and redeploy

• You can create new flows

– Perhaps you want to see only nouns or verbs

– Perhaps you want to see a list of extracted entities

• You can share these flows back to the community

Page 28:

Zotero to SEASR: Fedora

[Diagram labels: Zotero Upload to Repository; Repository Search & Browse; Web Service; Interactive Web Application]

Page 29:

JSTOR Data for Research: SEASR Accesses APIs

• Access JSTOR API in SEASR components

• Use the output of these components with existing SEASR components

Page 30:

SEASR Central

• Sharing and finding flows and components

[Screenshot: SEASR Central mock-up showing site navigation (Categories, Recently Added, Top 50, Submit, About, RSS), a featured component ("Word Counter" by Jane Doe, which counts the different words in a text stream; rights: NCSA/UofI open source license), a featured flow ("FPGrowth" by Joe Does), and a browse list by type (Component, Flows), category (Image, JSTOR, Zotero) and name (Author Centrality, Readability, Upload Fedora)]

Page 31:

Discussion Questions

• What kinds of data assets are you interested in?

• What analyses would you like to run on this data?