TRANSCRIPT
Blending Big Data and Cloud - Epilepsy Global Data Research and Information
System
BITS ZG629T: Thesis
by
Anup Singh
2012HZ12707
Thesis work carried out at
Tata Consultancy Services Limited, LCH.Clearnet Limited,
Investec Bank Plc London, Birmingham Cancer Research Institute, United Kingdom
Submitted in fulfillment of M.S. by Research - Software Systems
Under the Supervision of
Sandeep Patil, Researcher at NASA, Arlington University, Ex-BARC Sr. Scientist
Kalwar Shivram, Project Manager, Tata Consultancy Services Limited, San Jose, United States
Professor B.M. Deshpande, [email protected]
BIRLA INSTITUTE OF TECHNOLOGY & SCIENCE
PILANI (RAJASTHAN)
April, 2014
ABSTRACT
Epilepsy is the most common neurological disorder affecting 65 million people
worldwide. While medications and other treatments help many people of all ages who
live with epilepsy, more than a million people continue to have seizures that can
severely limit their school achievements, employment prospects and participation in all
of life's experiences. It strikes most often among the very young and the very old,
although anyone can develop epilepsy at any age. Its prevalence is greater than autism
spectrum disorder, cerebral palsy, multiple sclerosis and Parkinson's disease combined.
Despite its prevalence and major advances in diagnosis and treatment, epilepsy is
among the least understood of major chronic medical conditions, even though one in
three adults knows someone with the disorder. The Epilepsy Global Data Research and
Information System aims to leverage Big Data, Cloud Computing and data warehouse
features to build a global system that helps doctors and neurosurgeons use shared
information and methodologies to treat children and adults worldwide.
Objectives
• Build a federated database of medical information and services that serves as the
platform for medical research into neurological cases of epilepsy.
• Provide access to very large data sets on patients with different neurological
disorders, helping researchers, doctors and surgeons make efficient decisions and
share their experiences.
• Enable the best treatment to be given to children and other people all over the world.
• Enrich and enhance the system's knowledge base so as to stimulate new questions
about epilepsy and its symptoms, ultimately leading to fruitful answers on its
treatment.
• Harness supercomputer power and the capabilities of Big Data and Cloud Computing.
Broad Academic Area of Work: Cloud Computing, Big Data, Data Warehousing.
Key words: Hadoop, Twitter Apps, Spring XD, HBASE, HDFS, MapReduce, Hue, Hive,
Pig, HCatalog, JSON Serde, Flume.
ACKNOWLEDGEMENTS
I would like to express my sincere gratitude and deep regards to my supervisor and
additional examiner for their constant motivation, monitoring and guidance throughout
the course of this dissertation work. This is indeed a new beginning for professionals
like us to extend technology beyond boundaries in healthcare. Their blessings, guidance
and help have enabled me to begin this journey.
My prime motivation behind this dissertation is my loving nephew Aakash, who has been
treated for epilepsy for the past seven years, and all children around the world. My
sincere regards and appreciation are extended to Dr. Vrajesh Udani and the hospital
staff at Hinduja Hospital, Mumbai, and to Dr. Neeta Ajit Naik, Sion, Mumbai, who are
pioneers in treating children with epilepsy in India.
I would also like to thank my family for motivating me to build this. It would not have
been possible without their constant support and help.
Indeed, we have a long way to go beyond this.
Anup Singh
TABLE OF CONTENTS
1. Introduction: Understanding the power of Big Data, Cloud features
2. Feasibility Study and Analysis of Algorithms, Application Methodologies
3. Architecture Design of the System
4. Cloud Design of the Epilepsy Global Data Centre
5. Data Storage Structure and Query Processing in HDFS and HBASE
6. Use Cases Overview
7. Conclusion and Recommendations
BIRLA INSTITUTE OF TECHNOLOGY & SCIENCE, PILANI
WORK-INTEGRATED LEARNING PROGRAMMES DIVISION
Second Semester 2013-2014
Introduction: Understanding the power of Big Data, Cloud features
Large-volume data analysis in fields such as epilepsy, cardiac disease, genetics and
neuroimaging, performed on groups of individuals with shared and variable
characteristics, remains poorly approached and poorly understood. The very significant
challenges of storing, accessing and accurately implementing complex computations on
such data cannot be met with traditional data warehouse methods.
Globally as well as locally, many families across rural and urban geographies, along
with modern sophisticated hospitals, are unaware of different types of diseases,
symptoms, medicines and healthcare solutions. Sharing a structured and unstructured
knowledge base amongst researchers, neurologists, doctors, associates and parents is a
must. A specific scientific environment and automated software applications, along with
cost reduction, are needed to complement these scenarios. The gap in treating epileptic
disease among children needs to be bridged by leveraging the technological revolution
to predict and find new, improved ways of cure.
Mature methodologies like Kimball's approach, the Enterprise-Wide Data Warehouse (EDW),
traditional RDBMS and ETL/ELT approaches are insufficient for the huge amounts of
epileptic data involved. Over the years, terabytes to petabytes to zettabytes of unused
data have accumulated that can be transformed, utilised and re-engineered to devise new
findings to cure epilepsy. We need better data access, data storage and data structure
techniques.
Big Data environments create the opportunity to ease some of the rigidity of ETL-driven
data integration processes. The nature of big data requires that the infrastructure for
this process can scale cost-effectively. Hadoop* and MongoDB have emerged as standard
solutions for managing big data. Big Data refers to the large amounts, at least
terabytes, of poly-structured data that flows continuously through and around
organizations, including video, text, sensor logs, and transactional records.
Rapidly ingesting, storing, and processing big data requires a cost-effective
infrastructure that can scale with the amount of data and the scope of analysis. Hadoop
has rapidly emerged as the de facto standard for managing large volumes of
unstructured data.
Hadoop is an open source distributed software platform for storing and processing data.
Written in Java, it runs on a cluster of industry-standard servers configured with direct-
attached storage. Using Hadoop, you can store petabytes of data reliably on tens of
thousands of servers while scaling performance cost-effectively by merely adding
inexpensive nodes to the cluster.
Cloud computing has emerged as a viable alternative to the acquisition and
management of physical or software resources. Scientific applications are being ported
to clouds to build on their inherent elasticity and scalability. Such applications need
to run in parallel on a large set of resources in order to achieve reasonable execution
times. Cloud platforms such as Amazon Web Services, Azure and Cloudera are an
interesting option for tackling this problem. They provide high-performance cloud
computing infrastructure for handling the variability of epileptic "Big Data" and offer
eased as well as optimized deployment configurations.
We will be using Amazon Web Services (AWS) to blend the features of Big Data and
Cloud Computing.
Feasibility Study and Analysis of Algorithms, Application Methodologies
Assumptions: Representing all the features of Big Data and Cloud is out of scope
and can be taken up as separate research in epilepsy and other healthcare problems.
We will use Amazon EMR with the Hortonworks Distribution for Hadoop.
Amazon EMR makes it easy to provision and manage Hadoop in the AWS Cloud. Hadoop is
available in multiple distributions, and Amazon EMR gives you the option of using the
Amazon Distribution or the Hortonworks Distribution for Hadoop. Hortonworks delivers on
the promise of Hadoop with a proven, enterprise-grade platform that supports a broad
set of mission-critical and real-time production uses. Hortonworks brings dependability,
ease of use and speed to Hadoop, NoSQL, database and streaming applications in one
unified Big Data platform, and is used across financial services, retail, media,
healthcare, manufacturing, telecommunications and government organizations.
Hadoop for Big Data and Cloud
As described in the introduction, Hadoop is an open source distributed software
platform that stores and processes data reliably and cost-effectively on clusters of
industry-standard servers. Central to the scalability of Hadoop is the distributed
processing framework known as MapReduce.
MapReduce, the programming paradigm implemented by Hadoop, breaks up a batch job
into many smaller tasks for parallel processing on a distributed system, while HDFS,
the distributed file system, stores the data reliably.
MapReduce helps programmers solve data-parallel problems for which the data set can
be sub-divided into small parts and processed independently. MapReduce is an
important advance because it allows ordinary developers, not just those skilled in high-
performance computing, to use parallel programming constructs without worrying about
the complex details of intra-cluster communication, task monitoring, and failure
handling. MapReduce simplifies all that. The system splits the input data-set into
multiple chunks, each of which is assigned a map task that can process the data in
parallel.
Each map task reads the input as a set of (key, value) pairs and produces a transformed
set of (key, value) pairs as the output. The framework shuffles and sorts outputs of the
map tasks, sending the intermediate (key, value) pairs to the reduce tasks, which group
them into final results. MapReduce uses JobTracker and TaskTracker mechanisms to
schedule tasks, monitor them, and restart any that fail. The Hadoop platform also
includes the Hadoop Distributed File System (HDFS), which is designed for scalability
and fault tolerance. HDFS stores large files by dividing them into blocks (usually 64 or
128 MB) and replicating the blocks on three or more servers. HDFS provides APIs for
MapReduce applications to read and write data in parallel. Capacity and performance can
be scaled by adding Data Nodes, and a single NameNode mechanism manages data
placement and monitors server availability. HDFS clusters in production use today
reliably hold petabytes of data on thousands of nodes.
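To make the (key, value) flow concrete, below is a minimal MapReduce sketch in Java. It
assumes a hypothetical comma-separated epilepsy case file in HDFS whose fourth field is
the patient's country, and it counts the cases recorded per country; the paths, field
positions and class names are illustrative only, not part of the system as built.

package org.myorg;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CasesPerCountry {

    // Map step: emit (country, 1) for every case record
    public static class CaseMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text country = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split(",");
            if (fields.length > 3) {            // skip malformed lines
                country.set(fields[3].trim());  // assumed country column
                context.write(country, ONE);
            }
        }
    }

    // Reduce step: sum the counts emitted for each country
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "epilepsy cases per country");
        job.setJarByClass(CasesPerCountry.class);
        job.setMapperClass(CaseMapper.class);
        job.setCombinerClass(SumReducer.class);   // combine map output locally
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/hdfs/epilepsycases"));
        FileOutputFormat.setOutputPath(job, new Path("/hdfs/cases_per_country"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged into a jar, the job is submitted with the hadoop jar command, and the
per-country totals appear as (key, value) pairs in the output directory.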
In addition to MapReduce and HDFS, Hadoop includes many other components, some of
which are very useful for ETL.
• Flume* is a distributed system for collecting, aggregating, and moving large amounts
of data from multiple sources into HDFS or another central data store. Enterprises
typically collect log files on application servers or other systems and archive them
in order to comply with regulations. Being able to ingest and analyze that unstructured
or semi-structured data in Hadoop can turn this passive resource into a valuable asset.
Spring XD is a system similar to Flume.
• Sqoop* is a tool for transferring data between Hadoop and relational databases. You
can use Sqoop to import data from a MySQL or Oracle database into HDFS, run
MapReduce on the data, and then export the data back into an RDBMS. Sqoop
automates these processes, using MapReduce to import and export the data in parallel
with fault-tolerance.
• Hive* and Pig* provide high-level languages that simplify development of applications
employing the MapReduce framework. HiveQL is a dialect of SQL that supports a subset of
its syntax. Although queries can be slow, Hive is being actively enhanced by the
developer community to enable low-latency queries on HBase* and HDFS. Pig Latin is a
procedural programming language that provides high-level abstractions for MapReduce.
You can extend it with User Defined Functions written in Java, Python, and other
languages.
• ODBC/JDBC Connectors for HBase and Hive are often proprietary components included
in Hadoop distributions. They provide connectivity with SQL applications by translating
standard SQL queries into HiveQL commands that can be executed on the data in HDFS or
HBase; a sketch of Hive access over JDBC follows this list.
• YARN provides cluster resource management capabilities, enabling multiple data
processing engines and workloads to run across a single clustered environment.
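As a sketch of the JDBC connectivity described in the list above, the Java snippet
below queries Hive through the HiveServer2 JDBC driver. The host, port and credentials
are illustrative assumptions, and the tbl_doctor table is the one built in a later
use case.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcQuery {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver (org.apache.hive:hive-jdbc)
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection con = DriverManager.getConnection(
                     "jdbc:hive2://sandbox:10000/default", "hue", "");
             Statement stmt = con.createStatement();
             // A standard SQL query, translated and executed as HiveQL on HDFS data
             ResultSet rs = stmt.executeQuery(
                     "SELECT country, COUNT(*) FROM tbl_doctor GROUP BY country")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}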
Thus Hadoop is a powerful platform for big data storage and processing.
Architecture Design of the System
Hadoop receives structured and unstructured input data from different sources
(hospitals, healthcare vaccine data, social media and information documents) into its
various platform components. The features listed in the feasibility section form the
core, with HDFS nodes that can be scaled for storage.
The output is the set of application layers derived from the collated epileptic data:
audio, video, documents, research publications and collaboration-forum information from
social media. Data science can also be applied to find new research areas, make
predictions and produce analytical reporting.
[Figure: Architecture Design of the System. Data sources (hospitals and epileptic
patients' data; files of epileptic cases and scenarios; social media; worldwide
healthcare data on epileptic vaccines and instruments; other information) feed through
ETL into the HDFS data nodes, which support the Epilepsy Information and Knowledge
Sharing layer and Advanced Analytics.]
Cloud Design of the Epilepsy Global Data Centre
[Figure: Epilepsy Global Data Centres leveraging Cloud Computing features, with centres
shown in Pakistan, the UK, India, the US, Malaysia and Sri Lanka.]
The Cloud is core to providing Infrastructure as a Service (IaaS) to the Epilepsy
Global Data Centres across the world. With huge volume, variety and velocity, we can
scale the system automatically based on our data needs. The overhead of maintenance,
upgrades, version management and services (Hadoop, mail services, reporting) sits at
the Cloud provider's end. Information sharing on epilepsy across different countries
becomes achievable, and we can create customised "Epilepsy Data as a Service" offerings
for clinical research, hospitals, doctors, neuroscientists and social media. Very large
data volumes, growing from terabytes to petabytes and beyond, can be stored. However,
the Cloud framework, network portability, components, and legal matters and laws across
different countries will hold the key. The cloud can also provide extra capacity for an
existing cluster or for testing Hadoop applications. Moreover, Hortonworks Data Platform
(HDP) 2.0 features NameNode High Availability, which automates failover and ensures the
availability of the full HDP stack. The Cloud also supports multiple database platforms,
whether MySQL, Oracle, SQL Server or other databases, and provides reporting tools such
as Jasper, SAP Business Objects, MicroStrategy and QlikView to interface with Hadoop.
The Cloud is certainly a multi-use platform when coupled with Big Data. Hadoop in the
cloud makes a great deal of sense: the elastic resource allocation on which cloud
computing is premised works well for cluster-based data processing infrastructure used
on varying analyses and data sets of indeterminate size.
Data Storage Structure and Query Processing in HDFS and HBASE
[Figure: Data Storage Structure and Query Processing Flow in Hadoop Distributed File
System (HDFS) and HBASE]
HDFS is a distributed file system well suited to the storage of large files. Data in
HDFS is organized into files and directories, but it is stored in HDFS's own block
format: we cannot browse the data as in normal practice using dir or explorer
commands. Files are divided into uniformly sized blocks and distributed across cluster
nodes, and blocks are replicated to handle hardware failure. HDFS keeps checksums of
data for corruption detection and recovery. Depending upon the configuration, files are
broken into blocks of, for example, 128 MB, and the block size can be configured per
file. The NameNode manages the file namespace, authorisation and authentication. It
collects block reports from DataNodes on block locations and re-replicates missing
blocks when DataNodes fail. Each DataNode handles the storage of thousands of blocks,
storing them as files in the underlying OS's file system. Clients access blocks
directly from DataNodes based on the metadata read from the NameNode. MapReduce uses
the FileSystem interface, so it can run on multiple file systems. The HDFS storage
structure is depicted below.
[Figure: Hadoop Distributed File System Storage Structure. The NameNode holds the file
system metadata; DataNodes store the replicated blocks, which clients read directly.]
Sample Java code to read the case files stored in HDFS:
package org.myorg;

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Lists the files under /hdfs/epilepsycases and prints each file's status and contents
public class Cat {
    public static void main(String[] args) throws Exception {
        try {
            FileSystem fs = FileSystem.get(new Configuration());
            FileStatus[] status = fs.listStatus(new Path("/hdfs/epilepsycases"));
            for (int i = 0; i < status.length; i++) {
                // File status: path, block size, replication factor, permissions
                System.out.println(status[i]);
                BufferedReader br = new BufferedReader(
                        new InputStreamReader(fs.open(status[i].getPath())));
                String line = br.readLine();
                while (line != null) {
                    System.out.println(line);
                    line = br.readLine();
                }
                br.close();
            }
        } catch (Exception e) {
            System.out.println("Unable to read /hdfs/epilepsycases: " + e.getMessage());
        }
    }
}
[root@sandbox /]# hadoop jar epilepsy_case_files.jar org.myorg.Cat > epilepsy_case_files.txt
In the printed file status we can see the path held by the NameNode, the block size,
the replication factor and the permissions.
HBase is designed as a column store, a more advanced form of key-value database in
which the keys and values become composite. Think of it as a hash map crossed with a
multidimensional array. It is well suited to semi-structured data, since MapReduce is
very often used on such data. Columns are grouped into column families, the row key is
naturally indexed, and the design is good for scaling out horizontally: imagine the
difference between an RDBMS table having a hundred columns and an HBASE table having
around 500 columns. However, it is unsuited to complex data reads. HBase is built on
top of HDFS and provides fast record lookups (and updates) for large tables, which can
sometimes be a point of conceptual confusion; internally, HBase puts your data in
indexed "StoreFiles" that exist on HDFS for high-speed lookups. A sample HBASE storage
structure, contrasted with a SQL RDBMS table, is depicted below.
[Figure: HBASE storage structure using key-value pairs, contrasted with a SQL RDBMS
storage structure. In HBASE the row key is the composite PatientID_Country, and the
column family CF_Data holds key-value columns such as Firstname_lastname,
Doctorname_hospitalname and Evaluation_date_Observations. In the SQL table the primary
key is PatientID, with columns FirstName, LastName, DoctorName, HospitalName,
SurgicalEvaluationDate, Evaluation/Observations and Country.]
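As a sketch of how this layout is used from code, the following Java snippet writes
and reads back one patient row using the HBase 1.x client API. The table name, row key
and cell values follow the figure but are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class PatientStore {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("epilepsy_patients"))) {
            // Row key is the composite PatientID_Country, as in the figure
            Put put = new Put(Bytes.toBytes("P1001_INDIA"));
            put.addColumn(Bytes.toBytes("CF_Data"), Bytes.toBytes("Firstname_lastname"),
                    Bytes.toBytes("Aakash_Singh"));
            put.addColumn(Bytes.toBytes("CF_Data"), Bytes.toBytes("Doctorname_hospitalname"),
                    Bytes.toBytes("VrajeshUdani_Hinduja"));
            table.put(put);

            // Fast record lookup by row key, served from the indexed StoreFiles
            Result result = table.get(new Get(Bytes.toBytes("P1001_INDIA")));
            System.out.println(Bytes.toString(result.getValue(
                    Bytes.toBytes("CF_Data"), Bytes.toBytes("Firstname_lastname"))));
        }
    }
}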
Use Cases
Pool social media data and analyse the information on epilepsy. This is aimed at
self-support care globally. In today's fast-changing world there is a huge population
on Twitter, Facebook and LinkedIn, with a common synergy and a huge exchange of shared
information.
[Figure: Streaming and Analysing Social Media Data Flow in Hadoop. The Twitter app
"Epilepsy Social Media" streams app data into the Spring XD engine, which ingests it
into Hadoop HDFS, where the unstructured JSON data is parsed for analytics.]
Scenario
This scenario focuses on streaming unstructured data in real time from the twitter app
"Epilepsy Social Media" and transforming it into useful information.
Step 1:
Create a collaboration forum app "Epilepsy Social Media" on Twitter at
https://dev.twitter.com/
Note down the API key, API secret, access token and access token secret; these keys
are needed in order to stream information from Twitter. Once we have the keys, we
configure the XD engine installed on the Hadoop server.
Step 2:
Log in to the Spring XD shell, separate from Hadoop, and test whether HDFS is
accessible:
hadoop fs -ls /
It should display some files and directories.
Step 3
Create the tweet stream for the collaboration forum in Spring XD:
stream create --name epilepsytweets --definition "twitterstream --track='epilepsysociety, epilepsy society' | hdfs"
Step 4
Check whether the stream is writing files into XD's HDFS directory:
hadoop fs -ls /xd/epilepsytweets
The tweets that were posted are listed in the captured stream files.
JSON Data Format
Sample tweets captured in the stream (reflowed here so that lines break on JSON field
boundaries):

{"created_at":"Wed Mar 19 19:33:25 +0000 2014","id":446368866097065984,"id_str":"446368866097065984",
"text":"@epilepsysociety Hi we should build some ideas and come together to create awareness on epilepsy many countries mothers and fathers dont knw",
"source":"web","truncated":false,
"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,
"in_reply_to_user_id":87454049,"in_reply_to_user_id_str":"87454049","in_reply_to_screen_name":"epilepsysociety",
"user":{"id":2387686938,"id_str":"2387686938","name":"AnupSingh","screen_name":"anupsingh4u",
 "location":"","url":null,"description":null,"protected":false,
 "followers_count":4,"friends_count":8,"listed_count":0,
 "created_at":"Thu Mar 13 19:00:48 +0000 2014","favourites_count":0,
 "utc_offset":null,"time_zone":null,"geo_enabled":false,"verified":false,"statuses_count":8,"lang":"en-gb",
 "contributors_enabled":false,"is_translator":false,"is_translation_enabled":false,
 "profile_background_color":"C0DEED",
 "profile_background_image_url":"http:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png",
 "profile_background_image_url_https":"https:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png",
 "profile_background_tile":false,
 "profile_image_url":"http:\/\/abs.twimg.com\/sticky\/default_profile_images\/default_profile_0_normal.png",
 "profile_image_url_https":"https:\/\/abs.twimg.com\/sticky\/default_profile_images\/default_profile_0_normal.png",
 "profile_link_color":"0084B4","profile_sidebar_border_color":"C0DEED",
 "profile_sidebar_fill_color":"DDEEF6","profile_text_color":"333333",
 "profile_use_background_image":true,"default_profile":true,"default_profile_image":true,
 "following":null,"follow_request_sent":null,"notifications":null},
"geo":null,"coordinates":null,"place":null,"contributors":null,
"retweet_count":0,"favorite_count":0,
"entities":{"hashtags":[],"symbols":[],"urls":[],
 "user_mentions":[{"screen_name":"epilepsysociety","name":"epilepsy society","id":87454049,"id_str":"87454049","indices":[0,16]}]},
"favorited":false,"retweeted":false,"filter_level":"medium","lang":"en"}

{"created_at":"Wed Mar 19 20:07:31 +0000 2014","id":446377448163143680,"id_str":"446377448163143680",
"text":"I'm fundraising for Epilepsy Society & I'd love your support! Text HERB49 \u00a32 to 70070 to sponsor me today. Thanks. http:\/\/t.co\/C74muxXk9P",
"source":"\u003ca href=\"http:\/\/twitter.com\/tweetbutton\" rel=\"nofollow\"\u003eTweet Button\u003c\/a\u003e",
"truncated":false,
"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,
"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,
"user":{"id":98352324,"id_str":"98352324","name":"Steven Herbert","screen_name":"sherbie40","location":"chepstow","url":null,
 "description":"Play the guitar til your fingers bleed, quoted by Ted Nugent..\n\nLifes to short get on with it...",
 "protected":false,"followers_count":43,"friends_count":107,"listed_count":1,
 "created_at":"Mon Dec 21 11:14:17 +0000 2009","favourites_count":1,
 "utc_offset":0,"time_zone":"London","geo_enabled":true,"verified":false,"statuses_count":119,"lang":"en",
 "contributors_enabled":false,"is_translator":false,"is_translation_enabled":false,
 "profile_background_color":"C0DEED",
 "profile_background_image_url":"http:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png",
 "profile_background_image_url_https":"https:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png",
 "profile_background_tile":false,
 "profile_image_url":"http:\/\/pbs.twimg.com\/profile_images\/442675380076703744\/Oje9Ifzk_normal.jpeg",
 "profile_image_url_https":"https:\/\/pbs.twimg.com\/profile_images\/442675380076703744\/Oje9Ifzk_normal.jpeg",
 "profile_banner_url":"https:\/\/pbs.twimg.com\/profile_banners\/98352324\/1394377010",
 "profile_link_color":"0084B4","profile_sidebar_border_color":"C0DEED",
 "profile_sidebar_fill_color":"DDEEF6","profile_text_color":"333333",
 "profile_use_background_image":true,"default_profile":true,"default_profile_image":false,
 "following":null,"follow_request_sent":null,"notifications":null},
"geo":null,"coordinates":null,"place":null,"contributors":null,
"retweet_count":0,"favorite_count":0,
"entities":{"hashtags":[],"symbols":[],
 "urls":[{"url":"http:\/\/t.co\/C74muxXk9P","expanded_url":"http:\/\/www.justgiving.com\/Steven-Herbert","display_url":"justgiving.com\/Steven-Herbert","indices":[119,141]}],
 "user_mentions":[]},
"favorited":false,"retweeted":false,"possibly_sensitive":false,"filter_level":"medium","lang":"en"}
Step 5
Stop or undeploy the stream after collecting some data.
stream undeploy --name epilepsytweets
Step 6
Refine the data using Hive.
Create tables based on the streamed data collected in HDFS; a minimal sketch follows.
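The exact table design depends on which tweet fields are needed. As a minimal sketch,
using Hive's built-in get_json_object function (rather than a full JSON SerDe) over the
stream directory created in Step 3, the raw JSON can be refined into a structured
table; the table and column names below are illustrative:

-- External table laid over the raw tweet files streamed by Spring XD
CREATE EXTERNAL TABLE raw_tweets (json_body string)
LOCATION '/xd/epilepsytweets';

-- Structured table holding only the fields needed for reporting
CREATE TABLE tweets_refined (created_at string, user_name string, tweet_text string, lang string);

INSERT OVERWRITE TABLE tweets_refined
SELECT get_json_object(json_body, '$.created_at'),
       get_json_object(json_body, '$.user.name'),
       get_json_object(json_body, '$.text'),
       get_json_object(json_body, '$.lang')
FROM raw_tweets;

A query such as SELECT lang, COUNT(*) FROM tweets_refined GROUP BY lang can then feed
the reports.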
In the Hadoop interface we can see that the tweets have been brought into a structured
format. A report can be built on top of this.
Use Cases
Collect and represent information on epilepsy types, symptoms and medicines, with the
pros and cons of each. Collect and represent information on neurosurgeons, the
successful scenarios they have handled, and their publications.
Scenario: Collect doctors' data from different hospitals and research centres
The Hive ETL script below loads the list of doctors into the warehouse in Hadoop.
-- Staging table holding each CSV line as a single string
create table temp_doctor (col_value string);

LOAD DATA INPATH '/user/hue/Doctors_List.csv' OVERWRITE INTO TABLE temp_doctor;

create table tbl_doctor ( id string, name string, age int, hospitalname string,
expertise string, publications_link string, profile_info string, country string,
city string);

-- Parse the comma-separated fields out of each staged line
insert overwrite table tbl_doctor
SELECT
regexp_extract(col_value, '^(?:([^,]*)\,?){1}', 1) doctor_id,
regexp_extract(col_value, '^(?:([^,]*)\,?){2}', 1) fullname,
cast(regexp_extract(col_value, '^(?:([^,]*)\,?){10}', 1) as int) age,
regexp_extract(col_value, '^(?:([^,]*)\,?){3}', 1) organisation,
regexp_extract(col_value, '^(?:([^,]*)\,?){11}', 1) specialisation,
regexp_extract(col_value, '^(?:([^,]*)\,?){8}', 1) articles_cited,
regexp_extract(col_value, '^(?:([^,]*)\,?){13}', 1) wiki_profile,
regexp_extract(col_value, '^(?:([^,]*)\,?){4}', 1) Country,
regexp_extract(col_value, '^(?:([^,]*)\,?){5}', 1) City
from temp_doctor;
We can customise the script based on the information received from hospitals and
research centres; column positions can be adjusted. For example, if the specialisation
field in the list of doctors from Hinduja Hospital is at position 11, we use the script
as written above; if the specialisation field from Fortis Hospital's list is at
position 14, we modify the corresponding statement to "regexp_extract(col_value,
'^(?:([^,]*)\,?){14}', 1) specialisation".
Scenario:
Build a catalog of epilepsy types and epilepsy medicines.
HCatalog provides an easy interface to upload files in different formats and set up
the data.
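As a sketch, such a catalog can be held in a simple delimited Hive table, which
HCatalog then also exposes to Pig and MapReduce jobs; the column layout and file path
below are illustrative assumptions:

CREATE TABLE tbl_epilepsy_catalog (
  epilepsy_type string,
  symptoms string,
  medicine string,
  pros string,
  cons string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

LOAD DATA INPATH '/user/hue/Epilepsy_Catalog.csv' OVERWRITE INTO TABLE tbl_epilepsy_catalog;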
Scenario
Collect patients' data on presurgical evaluation, medical history, physical
examination and lab tests; the related tables are represented in the same way. We can
have customised ETL jobs based on each hospital's data, and this process can be
automated once we have the list of files. However, it will be essential to encrypt or
mask the stored data rather than revealing individual names; this will be subject to
the healthcare laws of different nations. This scenario can be complemented by writing
Pig scripts to compare data on epileptic patients across different states or countries.
Scenario
Information about events can be shared easily by email to increase awareness.
Design the job in the Oozie Editor/Dashboard.
Conclusion and Recommendations
The aim of this blended approach is to increase networking amongst hospitals, doctors,
people and children, thus improving healthcare systems. We can have a proper
Kimball-model data warehouse as well as a federated data warehouse in Hadoop, and Big
Data is feasible for structured as well as unstructured data.
Since data across different testing methods and research is already available, we can
carry out data mining and make predictions on epileptic data. This will also aid in
recognising the difference between normal and abnormal patterns in epileptic sufferers.
Cognitive features based on neural networks can be aimed at reading the machine output
of tests carried out on epilepsy patients. Test data and their scenarios can be known
upfront based on the parameters, and algorithms can be developed to make the system
precise and agnostic.
We can aim to build a language-interpreter app which can deliver the epilepsy data in
different languages to target audiences across different countries. This will help
bridge the language barrier between the different languages spoken around the world.
Document stores for CT scans, MRI and EEG recordings can be explored in MongoDB to
optimize audio and video data.
Interfacing with SAP HANA, SAP Business Objects, MicroStrategy, Jasper, QlikView and
other reporting tools can be carried out so that we have graphs and data representing
normal behaviour and deviated behaviour in seizures.
List of Abbreviations
AWS - Amazon Web Services
EMR - Elastic MapReduce
HDP - Hortonworks Data Platform
EDW - Enterprise-Wide Data Warehouse
HDFS - Hadoop Distributed File System
IaaS - Infrastructure as a Service
List of Figures
Hadoop Architecture
Architecture Design of the System
Epilepsy Global Data Centres Leveraging Cloud Computing Features
Data Storage Structure and Query Processing Flow in HDFS and HBASE
HDFS Storage Structure
HBASE Storage Structure
Literature References
[1] http://www.epilepsyfoundation.org
[2] Dinkar Sitaram and Geetha Manjunath. Moving To The Cloud: Developing Apps in the
New World of Cloud Computing.
[3] http://bigdatauniversity.com
[4] http://www.mongodb.com/learn/big-data
[5] http://ocw.mit.edu/courses/brain-and-cognitive-sciences/
[6] http://aws.amazon.com/
[7] Amit Konar. Artificial Intelligence and Soft Computing: Behavioral and Cognitive
Modeling of the Human Brain, Volume 1.
[8] Amit Konar. Computational Intelligence: Principles, Techniques and Applications.
[9] http://hortonworks.com/
[10] http://hadoop.apache.org/
[11] http://projects.spring.io/spring-xd/
[12] http://guidance.nice.org.uk/
[13] https://www.hemr.org/wiki/Category:Epilepsy_syndromes
[14] Dr. Vrajesh Udani. http://www.hindujahospital.com/communityportal/doctors/doctor-details.aspx?did=140&name=dr-vrajesh-udani&cid=36&cname=
[15] https://twitter.com/epilepsysociety
[16] Jayapandian CP, Chen CH, Bozorgi A, Lhatoo SD, Zhang GQ, Sahoo SS.
Electrophysiological Signal Analysis and Visualization using Cloudwave for Epilepsy
Clinical Research. The 14th World Congress on Medical and Health Informatics
(MedInfo), 2013. http://www.ncbi.nlm.nih.gov/pubmed/23920671
[17] Hadoop Architecture. http://www.intel.co.uk/content/www/xa/en/big-data/big-data-analytics-turning-big-data-into-intelligence.html