Authors:
Xiaofang Li, Changzhou Institute of Technology, Changzhou, China
Yingchi Mao, Hohai University, Nanjing, China
Guided by: Prof. Meiliu Lu
Presenting: Pranavi Appana, Neelam Baviskar, Pallavi Vardhamane
Paper Presentation
Agenda
● Background and Problem Statement
● Challenges and Problem Solution
● Improved ETL Framework
● Dynamic Mirror Replication Technology
● Performance Evaluation
● Opinion
● Project Proposal
Background & Problem Definition
Why use a real-time data warehouse?
The load cycle of a traditional data warehouse is fixed and long, so it cannot respond in time to rapid data changes. A real-time data warehouse, by contrast, can capture rapid data changes and process real-time data.
Problem statements:
1. Provide real-time data access through the real-time data warehouse without processing delay
2. Avoid query contention between OLAP queries and OLTP updates
Challenges and Problem Solutions
Challenges
- Enabling real-time ETL
- Data aggregation operation not synchronized with the real-time data
Solutions
- Improved ETL framework
- Dynamic mirror replication technology
Dynamic mirror replication technology
Dynamic mirror creation and allocation
- Create the mirror files and initiate the bucket link
Dynamic mirror release
- Load the data into the warehouse and release the DSA
The procedure of query processing
- Retrieve the data image from the dynamic data storage using the obtained data_id, then process it
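The three steps above can be sketched roughly as follows. This is a minimal illustration under assumptions the slides do not state: the class and method names (`MirrorStore`, `create_mirror`, `query`, `release`) are hypothetical, the bucket link is modeled as a plain dictionary from data_id to a mirror, and the warehouse is a Python list. It shows the general idea of querying a stable image while the source keeps changing, not the paper's actual implementation.

```python
import copy


class MirrorStore:
    """Toy model of a dynamic storage area (all names are hypothetical)."""

    def __init__(self):
        self.warehouse = []   # stands in for the data warehouse
        self.mirrors = {}     # data_id -> mirror (the "bucket link" in this sketch)
        self.next_id = 0

    def create_mirror(self, rows):
        """Create a mirror holding a copy of the incoming rows, linked by data_id."""
        data_id = self.next_id
        self.next_id += 1
        self.mirrors[data_id] = copy.deepcopy(rows)
        return data_id

    def query(self, data_id, predicate):
        """Retrieve the data image for data_id and filter it."""
        image = self.mirrors[data_id]
        return [row for row in image if predicate(row)]

    def release(self, data_id):
        """Load the mirrored rows into the warehouse and free the mirror."""
        rows = self.mirrors.pop(data_id)
        self.warehouse.extend(rows)
        return len(rows)


store = MirrorStore()
incoming = [{"dept": "water", "amount": 120}, {"dept": "safety", "amount": 80}]
data_id = store.create_mirror(incoming)

# A later update to the source rows does not change the mirrored image.
incoming[0]["amount"] = 999
large = store.query(data_id, lambda row: row["amount"] > 100)

loaded = store.release(data_id)
```

Because the mirror is a deep copy, the query sees the values as they were when the mirror was created, which is the property that keeps analytical reads from contending with updates in this toy model.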
Performance Evaluation
Experiment Settings
- The OLAP query response time at different update intervals
Experimental Results
- The OLAP query response time for different DSA sizes
- The query response time at different update intervals
Our opinion on the research paper
We agree with the solutions the authors propose for enabling real-time ETL and for the data/query contention problem.
We suggest an additional solution: using MetaMatrix with DataMigrator. This would help solve the above problems and improve query efficiency.
References
Xiaofang Li, Yingchi Mao. Real-Time Data ETL Framework for Big Real-Time Data Analysis. 2015 IEEE International Conference on Information and Automation, pp. 1289-1294, August 2015.
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=7279485
Background and Motivation
Dataset:
https://bythenumbers.sco.ca.gov/browse?utf8=%E2%9C%93&page=1
The Government of California provides a list of relevant financial reports
in the above dataset. The dataset details the expenditures, revenues and
state income of all departments, with income generated in the form of fees,
penalties and taxes.
Purpose and Scope: DW and DM
Purpose:
Develop a tool/web application, for use by state employees and the public,
that gives important financial information on the government’s funding and
income.
Scope:
Multiple relevant datasets are available, containing billions of records. We
limit the scope to customer-specific requirements such as city, county,
department and yearly financial data.
Queries
1. What is the state income by county and by city?
2. Under which business category (taxes, fees) was this state income
generated?
3. Which sub-departments are responsible for the largest income
collection?
4. Determine the expenditures for a particular department.
Example: sewage, water, safety, public departments
5. Estimate and determine which cities/counties are in profit and which are
running in debt due to high expenses.
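Queries 1 and 5 could be expressed along these lines. This is only a sketch: the table name `finance`, its columns, and the sample rows are all made up for illustration and do not come from the State Controller's dataset. An in-memory SQLite database is used in place of the MySQL/MariaDB setup so the example is self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE finance (
           county TEXT, city TEXT, department TEXT,
           category TEXT, revenue REAL, expenditure REAL)"""
)
# Made-up rows for illustration only; not taken from the dataset.
conn.executemany(
    "INSERT INTO finance VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("County A", "City X", "water", "fee", 100.0, 60.0),
        ("County A", "City X", "safety", "tax", 50.0, 120.0),
        ("County B", "City Y", "sewage", "penalty", 200.0, 90.0),
    ],
)

# Query 1: state income grouped by county and city.
income_by_place = conn.execute(
    """SELECT county, city, SUM(revenue)
       FROM finance GROUP BY county, city ORDER BY county, city"""
).fetchall()

# Query 5: net balance per city; a negative value means expenses exceed income.
balance_by_city = conn.execute(
    """SELECT city, SUM(revenue) - SUM(expenditure) AS net
       FROM finance GROUP BY city ORDER BY city"""
).fetchall()
in_debt = [city for city, net in balance_by_city if net < 0]
```

With the sample rows, `in_debt` contains only "City X", whose expenditures (180) exceed its revenue (150).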
Objectives
Analyze, clean and prune the data.
Create sample data marts for different generic purposes to solve the
problems.
Design database schema.
Design a data warehouse application.
Load data to warehouse and perform user queries.
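The "clean and prune" objective might look something like the sketch below. The column names (`county`, `department`, `amount`) and the inline sample text are assumptions made for illustration; the real dataset's layout is not described on the slides. The sketch trims whitespace, drops rows whose amount is missing or not numeric, and removes exact duplicates.

```python
import csv
import io

# Hypothetical raw extract; the real files would be read from disk instead.
RAW = """county,department,amount
 Alpha ,water,100
Alpha,water,100
Beta,safety,
Beta,sewage,n/a
Gamma,safety,75.5
"""


def clean(text):
    """Return normalized rows with bad amounts and duplicates removed."""
    seen = set()
    rows = []
    for rec in csv.DictReader(io.StringIO(text)):
        county = rec["county"].strip()
        dept = rec["department"].strip().lower()
        try:
            amount = float(rec["amount"])
        except (TypeError, ValueError):
            continue  # prune rows with a missing or non-numeric amount
        key = (county, dept, amount)
        if key in seen:
            continue  # prune exact duplicates
        seen.add(key)
        rows.append({"county": county, "department": dept, "amount": amount})
    return rows


cleaned = clean(RAW)
```

Of the five sample rows, two survive: one for Alpha/water and one for Gamma/safety.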
Resources
Data visualization
- OffVis
Database Development
- MySQL, MariaDB
Data warehouse
- PHP, HTML5, CSS
Data mining
- RapidMiner, WEKA
Schedule
Week 1: Data analysis, cleaning and pruning. Designing Schema.
Week 2: Creating Data mart samples. Designing warehouse application
Week 3: Applying data mining. Applying Query processing.
Week 4: Testing and Documentation.
References
California State Controller’s Office, Government Financial Reports, Datasets,
https://bythenumbers.sco.ca.gov/browse?utf8=%E2%9C%93&page=1
This website gives consolidated information about government
expenditures in the state of California.
Xiaofang Li, Yingchi Mao. Real-Time Data ETL Framework for Big Real-Time Data Analysis. 2015 IEEE International Conference on Information and Automation, pp. 1289-1294, August 2015.
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=7279485