Paper Presentation

Authors: Xiaofang Li (Changzhou Institute of Technology, Changzhou, China), Yingchi Mao (Hohai University, Nanjing, China)

Guided by: Prof. Meiliu Lu

Presenting: Pranavi Appana, Neelam Baviskar, Pallavi Vardhamane



Agenda

● Background and Problem Statement

● Challenges and Problem Solution

● Improvised ETL Framework

● Dynamic Mirror Replication Technology

● Performance Evaluation

● Opinion

● Project Proposal

Background & Problem Definition

Why use a real-time data warehouse?

A traditional data warehouse has a fixed, long load cycle, so it cannot respond promptly to rapid data changes. A real-time data warehouse, by contrast, captures rapid data changes and processes the data in real time.

Problem statements:

1. Provide real-time data access from the real-time data warehouse without processing delay.

2. Avoid query contention between OLAP queries and OLTP updates.

Challenges and Problem Solutions

Challenges

- Enabling real-time ETL

- Data aggregation operations are not synchronized with the real-time data

Solutions

- Improvised ETL framework

- Dynamic mirror replication technology

Improvised ETL framework

Fig 1: The pre-processing framework for real-time data warehouse

Dynamic mirror replication technology

Dynamic mirror creation and allocation

- Create the mirror files and initiate the bucket link

Dynamic mirror release

- Load the data into the warehouse and release the DSA

The procedure of query processing

- Retrieve the data image from the dynamic data storage using the obtained data_id, then process it
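The three steps above (create a mirror, answer queries from it by data_id, release it into the warehouse) can be illustrated with a minimal in-memory sketch. This is not the paper's algorithm: the class name `DynamicStorageArea`, its `capacity` limit, and the list standing in for the warehouse are all assumptions made for illustration, and the bucket link is not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class DynamicStorageArea:
    """Hypothetical in-memory stand-in for the DSA that holds mirror images."""
    capacity: int                                  # assumed limit on live mirrors
    mirrors: dict = field(default_factory=dict)    # data_id -> copied rows

    def create_mirror(self, data_id, rows):
        """Create a mirror: a private copy of freshly arrived rows."""
        if len(self.mirrors) >= self.capacity:
            raise RuntimeError("DSA is full; release a mirror first")
        self.mirrors[data_id] = [dict(r) for r in rows]

    def query(self, data_id, predicate):
        """Answer a query from the mirror image identified by data_id."""
        return [r for r in self.mirrors[data_id] if predicate(r)]

    def release(self, data_id, warehouse):
        """Load the mirrored rows into the warehouse, then free the slot."""
        warehouse.extend(self.mirrors.pop(data_id))


# Toy usage with invented rows.
warehouse = []
dsa = DynamicStorageArea(capacity=2)
incoming = [{"dept": "water", "amount": 120}, {"dept": "safety", "amount": 80}]
dsa.create_mirror("batch-1", incoming)

incoming[0]["amount"] = 999    # a later update to the source rows
hits = dsa.query("batch-1", lambda r: r["amount"] > 100)

dsa.release("batch-1", warehouse)
```

Because the query reads the copied image rather than the rows being updated, the later update does not interfere with it, which is the contention-avoidance idea in miniature.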

Performance Evaluation

Experiment Settings

- OLAP query response time at different update intervals

Experimental Results

- OLAP query response time for different DSA sizes

Query response time at different update intervals

Our opinion on the research paper

We agree with the solutions the authors propose for enabling real-time ETL and for resolving the data/query contention problem.

We suggest an additional solution: using MetaMatrix with DataMigrator. This would help solve the above problems and improve query efficiency.

References

Xiaofang Li, Yingchi Mao. Real-Time Data ETL Framework for Big Real-Time Data Analysis. 2015 IEEE International Conference on Information and Automation, pp. 1289-1294, August 2015.

http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=7279485

Team: Pranavi Appana, Neelam Baviskar, Pallavi Vardhamane

Spring 2016

Agenda

Background and Motivation

Purpose and Scope

Queries

Objectives

Resources

Schedule

References

Background and Motivation

Dataset :

https://bythenumbers.sco.ca.gov/browse?utf8=%E2%9C%93&page=1

The Government of California provides a list of relevant financial reports in the above dataset. The dataset details the expenditures, revenues, and state income of all departments, with income generated in the form of fees, penalties, and taxes.

Purpose and Scope: DW and DM

Purpose:

Develop a tool/web application, for state employees and for public use, that gives important financial information on the government's funding and income.

Scope:

Multiple relevant datasets are available, containing billions of data records. We limit the scope to customer-specific requirements such as city, county, department, and yearly datasets for financial data.

Queries

1. What is the state income by county and by city?

2. Under which business category (taxes, fees) was this state income generated?

3. Which sub-departments are responsible for the largest income collection?

4. Determine the expenditures for a particular department. Example: sewage, water, safety, public departments.

5. Estimate and determine which cities/counties are in profit and which are running in debt due to high expenses.
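As a rough illustration of how queries 1 and 5 might look once the data is loaded, here is a sketch using Python's built-in sqlite3 module. The `finance` table, its columns, and the rows are invented toy data, not the actual layout or figures of the State Controller's datasets, and SQLite stands in for the MySQL/MariaDB database named under Resources.

```python
import sqlite3

# Hypothetical single-table schema; the real datasets are organized differently.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE finance (
        county TEXT, city TEXT, department TEXT,
        category TEXT,   -- e.g. 'tax', 'fee', 'penalty'
        kind TEXT,       -- 'income' or 'expenditure'
        amount REAL
    )
""")
toy_rows = [  # invented values for illustration only
    ("Sacramento", "Sacramento", "water", "fee", "income", 500.0),
    ("Sacramento", "Sacramento", "water", "operations", "expenditure", 300.0),
    ("Sacramento", "Folsom", "safety", "tax", "income", 200.0),
    ("Yolo", "Davis", "sewage", "fee", "income", 100.0),
    ("Yolo", "Davis", "sewage", "operations", "expenditure", 250.0),
]
conn.executemany("INSERT INTO finance VALUES (?, ?, ?, ?, ?, ?)", toy_rows)


def income_by_county(conn):
    """Query 1 (county part): total income per county."""
    return conn.execute(
        "SELECT county, SUM(amount) FROM finance "
        "WHERE kind = 'income' GROUP BY county ORDER BY county"
    ).fetchall()


def net_balance_by_county(conn):
    """Query 5: income minus expenditure per county (negative means debt)."""
    return conn.execute(
        "SELECT county, "
        "SUM(CASE WHEN kind = 'income' THEN amount ELSE -amount END) "
        "FROM finance GROUP BY county ORDER BY county"
    ).fetchall()


print(income_by_county(conn))
print(net_balance_by_county(conn))
```

Grouping by `city` instead of `county` gives the city-wise variant; the other queries follow the same GROUP BY pattern over `category` or `department`.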

Objectives

Analyze, clean and prune the data.

Create sample data marts for different generic purposes to solve the problems.

Design database schema.

Design a data warehouse application.

Load data to warehouse and perform user queries.

Resources

Data visualization

- OffVis

Database Development

- MySQL, MariaDB

Data warehouse

- PHP, HTML5, CSS

Data mining

- RapidMiner, WEKA

Schedule

Week 1: Data analysis, cleaning and pruning. Designing Schema.

Week 2: Creating Data mart samples. Designing warehouse application.

Week 3: Applying data mining. Applying Query processing.

Week 4: Testing and Documentation.

References

California State Controller's Office. Government Financial Reports, Datasets. https://bythenumbers.sco.ca.gov/browse?utf8=%E2%9C%93&page=1

This website gives consolidated information about government expenditures in the state of California.

Xiaofang Li, Yingchi Mao. Real-Time Data ETL Framework for Big Real-Time Data Analysis. 2015 IEEE International Conference on Information and Automation, pp. 1289-1294, August 2015.

http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=7279485

Thank You!