Computer-Based Text- and Data Analysis Technologies and Applications Mark Cieliebak 9.6.2015

Computer-Based Text- and Data Analysis: Technologies and Applications

Mark Cieliebak, 9.6.2015

Mark Cieliebak, May 2015, ZHAW

[Diagram: Data Scientist, Library, Data; arrows labeled "analyze" and "use"]

About Me

Mark Cieliebak

+ Software Engineer & Data Scientist
+ PhD in Computer Science
+ CIO at Netbreeze (now Microsoft)
+ >30 scientific publications

Lecturer at ZHAW

CEO of SpinningBytes AG

Classical Data Analysis

Peptide Sequencing

Uses IT to retrieve, store, search, count, calculate, compare, and visualize data

Computer-Based Data Analytics

• Started with Artificial Intelligence in the 1960s
• Analyzes huge amounts of data
• Uses Machine Learning for
  • Pattern Detection
  • Image and Text Classification
  • Predictive Analysis
  • Data Clustering
  • and many more

Applications of Data Analytics

[Diagram: "Data Analytics" in the center, surrounded by the following applications]

• Deep Blue (beats Kasparov, 1997)
• IBM Watson (wins Jeopardy! in 2011)
• Machine Translation
• Self-driving Cars
• Spelling Correction
• Internet Search
• Recommendation Systems
• Email Spam Detection
• Voice Recognition

Foundation of Data Analytics

                  1960                  2015
Memory            64000 Bytes           8589934592 Bytes
Performance       500000 FLOPS          33863000000000000 FLOPS
Cost per GFLOP    $4278000000 (1984)    $0.08
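The gap between the 1960 and 2015 columns is easier to grasp as a growth factor. The following is a minimal sketch, assuming that the value 33863000000000000 is the 2015 performance figure in FLOPS (the extracted slide leaves that pairing ambiguous). The variable names are illustrative and not part of the talk. The cost figures are left out because their formatting did not survive extraction.

```python
# Figures as they appear on the slide. Pairing the large value with
# "Performance 2015" is an assumption, not something the slide text confirms.
memory_1960_bytes = 64_000
memory_2015_bytes = 8_589_934_592
flops_1960 = 500_000
flops_2015 = 33_863_000_000_000_000

# Growth factors: how many times larger the 2015 value is.
memory_factor = memory_2015_bytes / memory_1960_bytes
performance_factor = flops_2015 // flops_1960

print(f"Memory grew by a factor of about {memory_factor:,.0f}")
print(f"Performance grew by a factor of about {performance_factor:,}")
```

Under these assumptions, memory grew by roughly five orders of magnitude and performance by roughly ten.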

Foundation of Data Analytics

Foundations of Data Analytics

Faster Computers + More Data + Better Algorithms

Deep Learning

Machine Learning in a Nutshell

[Diagram: two phases, "Training" and "Application"; the output is a "Predicted Label"]
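The two phases named on this slide, training and application, can be illustrated with a very small text classifier. This is only a sketch: the talk does not say which learning algorithm the diagram refers to, so a word-count Naive Bayes model with add-one smoothing is assumed here, and the function names and example texts are made up for illustration.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """Training phase: count words per label from (text, label) pairs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocabulary = {word for counts in word_counts.values() for word in counts}
    return word_counts, label_counts, vocabulary


def predict(model, text):
    """Application phase: return the most probable label for unseen text."""
    word_counts, label_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        # Log prior plus log likelihood of each word (add-one smoothing).
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical labeled training data.
training_data = [
    ("i love this phone", "positive"),
    ("what a great day", "positive"),
    ("i hate this weather", "negative"),
    ("what a terrible service", "negative"),
]

model = train(training_data)
print(predict(model, "i love this great day"))
```

The model is built once from labeled examples and then applied to texts it has never seen, which is the split the diagram describes.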

Application: Social Media Analysis

• Sentiment Analysis
• Topic Extraction
• Trend Detection
• Alerting

Twitter Facts

Data Access

• Free Access with Search API
• Commercial Access to Firehose Stream (all or 10%)

Sentiment Analysis on Twitter

Example tweet: "StackOverflow names Apple Swift the world's most loved programming language http://www.bloomberg.com/…"

Sentiment Analysis: Performance of Commercial Tools (F1-Score)

[Bar chart, F1-score axis from 0 to 0.7. Series: Average of All Tools, Best Tool per Corpus, Overall Best Tool (Sentigem)]

Experimental Setup

9 commercial sentiment analysis tools evaluated on 7 public corpora with 28653 short texts

Take-Home Lesson

Sentiment tools on short texts achieve on average an F1-score of 51%
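For readers unfamiliar with the metric: the F1-score is the harmonic mean of precision and recall. The sketch below computes it per class and then averages over classes (macro average). The talk does not state how scores were averaged in the evaluation, so the macro average is an assumption, and the labels and predictions are invented for illustration.

```python
def f1_score(gold, predicted, label):
    """F1 for one class: harmonic mean of precision and recall."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == label and p == label)
    fp = sum(1 for g, p in pairs if g != label and p == label)
    fn = sum(1 for g, p in pairs if g == label and p != label)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(gold, predicted):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(gold))
    return sum(f1_score(gold, predicted, label) for label in labels) / len(labels)


# Hypothetical gold labels and tool output for six short texts.
gold = ["pos", "pos", "neg", "neg", "neu", "pos"]
predicted = ["pos", "neg", "neg", "neg", "neu", "neu"]

for label in sorted(set(gold)):
    print(label, round(f1_score(gold, predicted, label), 3))
print("macro", round(macro_f1(gold, predicted), 3))
```

A tool that labels four of six texts correctly, as in this toy example, can still score quite differently per class, which is why an averaged F1 is reported rather than plain accuracy.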

Application: Newspaper Segmentation

Application: Sales Prediction

More Applications

• Expert Match
• Foundation Register
• Speaker Detection
• Face/Image Recognition

What do Data Scientists Need?

Data Sources

• Experimental Data
• Reference Datasets
• Dictionaries, Ontologies, Thesauri
• Scientific Papers
• Social Media
• Media Archives: Newspapers, Magazines
• Websites
• Live Streams: Twitter, Online News, Product Reviews
• Videos/Movies/Pictures
• Books: Fiction, Technical Literature
• Wikipedia
• Hand-crafted Datasets
• etc.

Data Provisioning

Data Collections

• Linguistic Data Consortium
• European Language Resources Association
• NIST
• TIMIT
• etc.

Access Licensing

Open Data

• Open Government Data (OGD)
• Open Research Data (ORD)
• Linked Open Data

Participation Guidelines

[Diagram: Data Scientist, Library, Data; arrows labeled "analyze", "support" and "use"]

Digitizing

Make Information Accessible

• Extract Text, Images, Charts, Tables
• Categorize
• Search
• Browse
• (Summarize)

Data Integration

Combining data from different sources is very time-consuming.

SODES: Automatic Data Integration

Data Intake, Automatic Integration, Search, Preview, Download

Data enters SODES via
• Linking to e.g. CKAN
• API
• Crawling
• User upload

Semantic Context Comprehension
• Matching of columns
• Data Quality Improvements

Content-based search on
• Full text of data
• Full text of metadata
• Descriptions of data sets

Easy data exploration enabled by
• All integrable search results in one table
• Statistical standard plots & measures

• Export to various standard formats
• Enables analytics in specialized tools of choice
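The "Matching of columns" step can be illustrated with a simple name-similarity heuristic. This is a sketch only and not how SODES works internally, which the slides do not describe: it matches column names from two data sets using `difflib.get_close_matches` from the Python standard library. The column names, the `normalize` helper and the cutoff of 0.6 are assumptions chosen for illustration.

```python
import difflib


def normalize(name):
    """Lower-case a column name and treat underscores as spaces."""
    return name.lower().replace("_", " ").strip()


def match_columns(source_columns, target_columns, cutoff=0.6):
    """Map each source column to its most similar target column, or None."""
    normalized_targets = {normalize(t): t for t in target_columns}
    mapping = {}
    for column in source_columns:
        candidates = difflib.get_close_matches(
            normalize(column), list(normalized_targets), n=1, cutoff=cutoff
        )
        mapping[column] = normalized_targets[candidates[0]] if candidates else None
    return mapping


# Hypothetical column names from two data sets that should be integrated.
source = ["municipality", "population", "year", "canton"]
target = ["Municipality_Name", "Population_Total", "Year", "Area_km2"]

for src, tgt in match_columns(source, target).items():
    print(f"{src} -> {tgt}")
```

Matching on names alone is fragile. A system that aims at semantic context comprehension would presumably also look at the column contents, which this sketch does not attempt.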

Talk in Short

• Text and Data Analytics is successful
• Researchers need access to data
• Data Scientists have powerful tools

Thank You

Mark Cieliebak

ZHAW – Institute of Applied Information Technology (InIT)

Winterthur, Switzerland

Email: ciel@zhaw.ch
Website: www.zhaw.ch/~ciel

Page 7: Computer-Based Text- and Data Analysis€¦ · IBM Watson (wins Jeopardy in 2011) Machine Translation Selfdriving Cars Spelling Correction Internet Search Recommendation Systems Email

7Mark Cieliebak May 2015ZHAW

Foundation of Data Analytics

Memory 64000 Bytes

8589934592 Bytes

Performance 500000 FLOPS

FLOPS

Cost per GFLOP (1984) $4278000000

$008

33863000000000000

1960 2015

8Mark Cieliebak May 2015ZHAW

Foundation of Data Analytics

9Mark Cieliebak May 2015ZHAW

Foundations of Data Analytics

Faster Computers + More Data + Better Algorithms

Deep Learning

10Mark Cieliebak May 2015ZHAW

Predicted

Label

Machine Learning in a NutshellT

rain

ing

Ap

pli

ca

tio

n

11Mark Cieliebak May 2015ZHAW

Application Social Media Analysis

Sentiment Analysis

Topic Extraction

Trend Detection

Alerting

12Mark Cieliebak May 2015ZHAW

Twitter Facts

Data Access

bull Free Access with Search API

bull Commercial Access to Firehose Stream (all or 10)

13Mark Cieliebak May 2015ZHAW

Sentiment Analysis on Twitter

StackOverflow names Apple Swift

the worlds most loved programming

language httpwwwbloombergcomhellip

14Mark Cieliebak May 2015ZHAW

Sentiment Analysis Performance of

Commercial Tools (F1-Score)

0

01

02

03

04

05

06

07

Average of All Tools

Best Tool per Corpus

Overall Best Tool (Sentigem)

Experimantal Setup

9 commercial sentiment analysis tools evaluated on 7 public corpora with 28653 short texts

15Mark Cieliebak May 2015ZHAW

Take-Home Lesson

Sentiment Tools on short texts

achieve on average an

F1-Score of 51

16Mark Cieliebak May 2015ZHAW

Application Newspaper Segmentation

17Mark Cieliebak May 2015ZHAW

Application Sales Prediction

18Mark Cieliebak May 2015ZHAW

More Applications

Expert Match

Foundation Register Speaker Detection

FaceImage Recognition

19Mark Cieliebak May 2015ZHAW

What do Data Scientists Need

20Mark Cieliebak May 2015ZHAW

Data Sources

bull Experimental Data

bull Reference Datasets

bull Dictionaires Ontologies Thesaurus

bull Scientific Papers

bull Social Media

bull Media Archives Newspapers Magazines

bull Websites

bull Live Streams Twitter Online News Product Reviews

bull VideosMoviesPictures

bull Books Belletristic Technical Literature

bull Wikipedia

bull Hand-crafted Datasets

bull etc etc

21Mark Cieliebak May 2015ZHAW

Data Provisioning

Data Collections

bull Linguistic Data Consortium

bull European Language Ressource Association

bull NISt

bull TIMIT

bull etc

Access Licensing

Open Data

bull Open Governmental Data OGD

bull Open Research Data ORD

bull Linked Open Data

Participation Guidelines

22Mark Cieliebak May 2015ZHAW

Data Scientist

Library

Data

analyze

support

use

23Mark Cieliebak May 2015ZHAW

Digitizing

Make Information Accessible

bull Extract Text Images Charts Tables

bull Categorize

bull Search

bull Browse

bull (Summarize)

24Mark Cieliebak May 2015ZHAW

Data Integration

Combining data from different sources is very time consuming

SODES: Automatic Data Integration

Data Intake
Data enters SODES via
• Linking to e.g. CKAN
• API
• Crawling
• User upload

Automatic Integration
Semantic Context Comprehension
• Matching of columns
• Data Quality Improvements

Search
Content-based search on
• Full text of data
• Full text of metadata
• Descriptions of data sets

Preview
Easy data exploration enabled by
• All integrable search results in one table
• Statistical standard plots & measures

Download
• Export to various standard formats
• Enables analytics in specialized tools of choice
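One simple way to picture the "matching of columns" step is fuzzy matching of column names between two tables. The sketch below uses Python's standard `difflib` module. It is an illustrative assumption, not a description of how SODES works: the function `match_columns`, the similarity cutoff, and the sample column names are all hypothetical.

```python
from difflib import get_close_matches


def match_columns(source_cols, target_cols, cutoff=0.6):
    """Map each source column to the most similar target column name, or None.

    Hypothetical helper: compares lower-cased names only, not the column contents.
    """
    lowered = {t.lower(): t for t in target_cols}
    mapping = {}
    for col in source_cols:
        hits = get_close_matches(col.lower(), list(lowered), n=1, cutoff=cutoff)
        mapping[col] = lowered[hits[0]] if hits else None
    return mapping


# Hypothetical column names from two data sets that should be integrated.
mapping = match_columns(
    ["Municipality", "population_2014"],
    ["municipality_name", "Population", "area_km2"],
)
```

Name similarity alone is a weak signal; a real integration step would presumably also compare data types and value distributions before merging columns.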

Talk in Short

• Text and Data Analytics is successful
• Researchers need access to data
• Data Scientists have powerful tools

Thank You

Mark Cieliebak

ZHAW – Institute of Applied Information Technology (InIT)

Winterthur, Switzerland

Email: ciel@zhaw.ch  Website: www.zhaw.ch/~ciel

Foundations of Data Analytics

Faster Computers + More Data + Better Algorithms

Deep Learning

Machine Learning in a Nutshell

[Diagram labels: Training, Application, Predicted Label]

Application: Social Media Analysis

Sentiment Analysis

Topic Extraction

Trend Detection

Alerting

Twitter Facts

Data Access
• Free Access with Search API
• Commercial Access to Firehose Stream (all or 10%)

Page 17: Computer-Based Text- and Data Analysis€¦ · IBM Watson (wins Jeopardy in 2011) Machine Translation Selfdriving Cars Spelling Correction Internet Search Recommendation Systems Email

17Mark Cieliebak May 2015ZHAW

Application Sales Prediction

18Mark Cieliebak May 2015ZHAW

More Applications

Expert Match

Foundation Register Speaker Detection

FaceImage Recognition

19Mark Cieliebak May 2015ZHAW

What do Data Scientists Need

20Mark Cieliebak May 2015ZHAW

Data Sources

bull Experimental Data

bull Reference Datasets

bull Dictionaires Ontologies Thesaurus

bull Scientific Papers

bull Social Media

bull Media Archives Newspapers Magazines

bull Websites

bull Live Streams Twitter Online News Product Reviews

bull VideosMoviesPictures

bull Books Belletristic Technical Literature

bull Wikipedia

bull Hand-crafted Datasets

bull etc etc

21Mark Cieliebak May 2015ZHAW

Data Provisioning

Data Collections

bull Linguistic Data Consortium

bull European Language Ressource Association

bull NISt

bull TIMIT

bull etc

Access Licensing

Open Data

bull Open Governmental Data OGD

bull Open Research Data ORD

bull Linked Open Data

Participation Guidelines

22Mark Cieliebak May 2015ZHAW

Data Scientist

Library

Data

analyze

support

use

23Mark Cieliebak May 2015ZHAW

Digitizing

Make Information Accessible

bull Extract Text Images Charts Tables

bull Categorize

bull Search

bull Browse

bull (Summarize)

24Mark Cieliebak May 2015ZHAW

Data Integration

Combining data from different sources is very time consuming

25Mark Cieliebak May 2015ZHAW

SODES: Automatic Data Integration

Search, Automatic Integration

Data Intake, Preview, Download

Data enters SODES via

• Linking to e.g. CKAN

• API

• Crawling

• User upload

Semantic Context Comprehension

• Matching of columns

• Data Quality Improvements

Content-based search on

• Full text of data

• Full text of metadata

• Descriptions of data sets

Easy data exploration enabled by

• All integrable search results in one table

• Statistical standard plots & measures

• Export to various standard formats

• Enables analytics in specialized tools of choice
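
To make the "Matching of columns" step more concrete, here is a minimal sketch that pairs column headers from two tables by plain string similarity. The slides do not say how SODES does this, so this is not its method. The function names `normalize` and `match_columns`, the 0.7 threshold, and the sample headers are all hypothetical.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lower-case a column name and keep only letters and digits."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def match_columns(left, right, threshold=0.7):
    """Pair each column in `left` with its most similar column in `right`.

    Returns a dict mapping left names to right names. A column whose best
    similarity score falls below `threshold` (an assumed cut-off) is left
    unmatched.
    """
    matches = {}
    for l_name in left:
        best_name, best_score = None, 0.0
        for r_name in right:
            score = SequenceMatcher(
                None, normalize(l_name), normalize(r_name)
            ).ratio()
            if score > best_score:
                best_name, best_score = r_name, score
        if best_name is not None and best_score >= threshold:
            matches[l_name] = best_name
    return matches


# Hypothetical column headers from two data sets (not taken from the talk)
sales_a = ["Product_ID", "Sale Date", "Amount (CHF)"]
sales_b = ["productid", "date_of_sale", "amount_chf", "region"]

print(match_columns(sales_a, sales_b))
```

With these sample headers, `Product_ID` and `Amount (CHF)` each find a partner. `Sale Date` and `date_of_sale` stay unmatched, because plain string similarity is sensitive to word order. Cases like this are where semantic matching is needed.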

Talk in Short

• Text and Data Analytics is successful

• Researchers need access to data

• Data Scientists have powerful tools

Thank You

Mark Cieliebak

ZHAW – Institute of Applied Information Technology (InIT)

Winterthur, Switzerland

Email: ciel@zhaw.ch Website: www.zhaw.ch/~ciel
