a data warehouse for canadian literature

29
Eduardo Gutarra

Upload: shanae

Post on 02-Feb-2016

18 views

Category:

Documents


0 download

DESCRIPTION

A Data Warehouse for Canadian Literature. Eduardo Gutarra. Overview. Introduction and Motivation Background The ETL Process The multidimensional model and star schema Issues with my star schema design Sample MDX queries for my cube. Introduction. - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: A Data Warehouse for  Canadian Literature

Eduardo Gutarra

Page 2: A Data Warehouse for  Canadian Literature

OverviewIntroduction and Motivation BackgroundThe ETL ProcessThe multidimensional model and star schemaIssues with my star schema designSample MDX queries for my cube

Page 3: A Data Warehouse for  Canadian Literature

IntroductionData warehouses are often used as

one of the main components of Decision Support Systems.

Data warehouses can be used to perform analyses on different fields as long as there is a lot of data.

We want to build a data warehouse on places mentioned in books.

Gutenberg Canada Website provides books in Text Files and other formats, free of charge.

Barcelona

Saint JohnMontreal

Page 4: A Data Warehouse for  Canadian Literature

MotivationProject is inspired from the LitOLAP projectSeeks to apply data warehousing techniques

in the domain of literary text processing.Allows a literary researcher answering

questions over an author’s style, or particularities about book among others.

Facilitates the analysis of literary texts to a domain expert.

Page 5: A Data Warehouse for  Canadian Literature

OverviewIntroduction and Motivation BackgroundThe ETL ProcessThe multidimensional model and star schemaIssues with my star schema designSample MDX queries for my cube

Page 6: A Data Warehouse for  Canadian Literature

Data warehouseA data warehouse is a database specifically

used for reporting.Populating a data warehouse (DW) involves

an ETL process where the data is:Extracted from data sourcesTransformed to conform the schema of your

DW.Loaded onto the data warehouse.

Once the DW is populated, Online Analytical Processing (OLAP) can be performed on it.

Page 7: A Data Warehouse for  Canadian Literature

Data warehouse

Sales in Store 1

Sales in Store 2

FlatFiles

ETL Process

Datawarehouse

OLAPCube

OLAPCube

Tend to beorders ofmagnitudelarger

Query responseTime is more important

Transactional throughput is More important

SummarizeThe data

Page 8: A Data Warehouse for  Canadian Literature

OverviewIntroduction and Motivation BackgroundThe ETL ProcessThe multidimensional model and star schemaIssues with my star schema designSample MDX queries for my cube

Page 9: A Data Warehouse for  Canadian Literature

ETL ProcessAccording to Kimball, about 70% of the effort

is spent in the ETL ProcessMy project has a Single Data SourceObtain the metadata, and the books

separately<html>:

</html>

</body>

<body>:

GutenbergCanada. (index.html)

Author Title

… Year

Page 10: A Data Warehouse for  Canadian Literature

MySQL

English?

No

Transform to Table Form

AnnotatedXML File

StructuredTable

AnnotatedXML File

Yes

Page 11: A Data Warehouse for  Canadian Literature

Book1.xml21: <sentence> I have lived in <place>Saint John</place>. </sentence>22: <sentence> This sentence has no place mentioned.</sentence>...

Book1.txt 21: I have lived in Saint John. 22: This sentece has no place mentioned. ...

Natural Language ProcessingGATE -- Open-source software for text processing.Gazetteer to determine what words or phrases

are a location.Annotates sentences and locationsProduces XML file

Page 12: A Data Warehouse for  Canadian Literature

MySQL

English?

No

Transform to Table Form

AnnotatedXML File

StructuredTable

AnnotatedXML File

Yes

Page 13: A Data Warehouse for  Canadian Literature

Book1.xml21: <sentence> I have lived in <place>Saint John</place>.</sentence>22: <sentence> This sentence has no place mentioned.</sentence>...

Book2.xml31: <sentence> This sentence mentions <place>Fredericton</place> and<place> Halifax </place>.</sentence>32: <sentence> This sentence mentions <place> Saint John </place>.</sentence>...

Once the XML file is written we have a process to transformInto a single denormalized table.

Book Place Sentence

Frequency

Book1 Saint John 21 1

Book2 Fredericton

31 1

Book2 Halifax 32 1Book Place Senten

ceFrequency

Book1 Saint John 21 1

Book1 NONE 22 1

Book2 Fredericton

31 1

Book2 Halifax 32 1

Page 14: A Data Warehouse for  Canadian Literature

MySQL

English?

No

Transform to Table Form

AnnotatedXML File

StructuredTable

AnnotatedXML File

Yes

Page 15: A Data Warehouse for  Canadian Literature

OverviewIntroduction and Motivation BackgroundThe ETL ProcessThe multidimensional model and star schemaIssues with my star schema designSample MDX queries for my cube

Page 16: A Data Warehouse for  Canadian Literature

The Multidimensional ModelWe use the multidimensional model to design

the way the data is structuredMultidimensional model divides the data in

measures and context.Measures: Numerical data being trackedContext for the facts: Data used for to

describe the circumstances for which a given measure was obtained.

Page 17: A Data Warehouse for  Canadian Literature

Units Sold

Profit

Measures

20$45

Time

Product

Location

Dimensions

Page 18: A Data Warehouse for  Canadian Literature

The Star Schema

• When we store a multidimensional model in a relational database it is called a Star Schema.

ProductID LocationID MonthID Units Sold Profit2 1 2 20 45.. .. .. .. .... .. .. .. ..

ProductID Product1 Sardines2 Anchovies3 Herring4 Pilchards

LocationID Location1 Boston2 Benson3 Seattle4 Wichita

MonthID Month1 April2 May3 June4 July

Fact TableDimension Table Dimension Table

Dimension Table

20$45

2NF

2NF

3NF2NF

Page 19: A Data Warehouse for  Canadian Literature

Attributes• Attributes are abstract items for convenient

qualification or summarization of data.• Attributes often form hierarchies.

TimeID Month Quarter Year1 January Q1 20102 February Q1 20103 March Q1 20104 April Q2 20105 May Q2 20106 June Q2 20107 July Q3 20108 August Q3 20109 September Q3 2010

10 October Q4 201011 November Q4 201012 December Q4 201013 January Q1 2011

Finest Coarsest

Q2

33

20

45

Q2 x Anchovies x Boston 98

Page 20: A Data Warehouse for  Canadian Literature

SentenceID x PlaceID Frequency

Place ID City CountryContinen

t

Sentence ID

Place IDFrequenc

y

Sentence ID

TextSentenc

e #Boo

kAuthor Occupation

Place

Sentence

Page 21: A Data Warehouse for  Canadian Literature

OverviewIntroduction and Motivation BackgroundThe ETL ProcessThe multidimensional model and star schemaIssues with my star schema designSample MDX queries for my cube

Page 22: A Data Warehouse for  Canadian Literature

Issues with the Design

PlaceID City Country Continent

40 Unspecified Canada North America

41 Unspecified Unspecified North America

42 Unspecified Unspecified South America

What if the place is a country?What if the place is a continent?Dummy value “unspecified” can fill in the

missing values

<sentence> I live in <place>Canada</place>. </sentence>

<sentence> I live in <place>North America</place>. </sentence>

Page 23: A Data Warehouse for  Canadian Literature

Issues with the DesignLondon in England, or London in Ontario?

Context required to resolve ambiguityAllocation to partially fix the issue

<sentence> I live in <place>London</place>. </sentence>

PlaceID City Country Continent

33 London England Europe

: : : :

45 London Canada North America

BookID SentenceID PlaceID Frequency

28 10 33 4/5

28 10 45 1/5

Fact Table

Dimension Table

Page 24: A Data Warehouse for  Canadian Literature

Issues with the DesignMany to Many relationship between Authors and BooksMany to Many relationships are tricky. They can lead to double-counting and other problems.

Author Title SentenceID Frequency

?The Knight of the Burning

Pestle 1 1Fletcher, John A Story 2 1

? A Tale of The Big Mountain 3 1

Author Title SentenceID FrequencyBeaumont, Francis The Knight of the Burning Pestle 1 ½

Fletcher, John The Knight of the Burning Pestle 1 ½Fletcher, John A Story 2 1

Beaumont, Francis A Tale of The Big Mountain 3 ½Fletcher, John A Tale of The Big Mountain 3 ½

Author_1 Author_2 Title SentenceID FrequencyBeaumont, Francis Fletcher, John The Knight of the Burning Pestle 1 1

Fletcher, John NULL A Story 2 1Beaumont, Francis Fletcher, John A Tale of The Big Mountain 3 1

Additional Attribute

Allocation

Beaumont, FrancisFletcher, John

Beaumont, FrancisFletcher, John

Page 25: A Data Warehouse for  Canadian Literature

Place ID City CountryContinen

t

Sentence ID

Place IDFrequen

cy

AuthorGID

AuthorID

AuthorID

AuthorName

Sentence ID

TextSentenc

e #Boo

kAuthorGI

DOccupatio

n

Dimension Table

Bridge Table

Outtriger Table Add two tablesTo the Star Schema

Page 26: A Data Warehouse for  Canadian Literature

AuthorID x SentenceID x PlaceID Frequency

Text

Authors

Sentences

Places

Sentence ID

Book Name

Sentence #

Place ID City CountryContinen

t

Author ID

Author Name

Occupation

DOB DOD

Sentence ID

Place IDFrequen

cyAuthor

ID

2

Page 27: A Data Warehouse for  Canadian Literature

OverviewIntroduction and Motivation BackgroundThe ETL ProcessThe multidimensional model and star schemaIssues with my star schema designSample MDX queries for my cube

Page 28: A Data Warehouse for  Canadian Literature

MySQL

English?

No

Transform to Table Form

AnnotatedXML File

StructuredTable

AnnotatedXML File

Yes

Page 29: A Data Warehouse for  Canadian Literature

OLAP SchemaThe OLAP Schema

file indicates where the fact table and dimension tables are in MySQL.

Mondrian creates the OLAP cube from the MySQL back-end.

JPivot provides the UI for the OLAP cube

OLAPSchema

File

MySQL