TRANSCRIPT
Blending Big Data and Cloud - Epilepsy Global Data Research and Information
System
BITS ZG629T: Thesis
by
Anup Singh
2012HZ12707
Thesis work carried out at
Tata Consultancy Services Limited, LCH.Clearnet Limited,
Investec Bank Plc London, Birmingham Cancer Research Institute, United Kingdom
Submitted in fulfillment of M.S. by Research - Software Systems
Under the Supervision of
Sandeep Patil, Researcher at NASA, Arlington University, Ex-BARC Sr. Scientist
Kalwar Shivram, Project Manager, Tata Consultancy Services Limited, San Jose, United States
Professor B.M. Deshpande, [email protected]
BIRLA INSTITUTE OF TECHNOLOGY & SCIENCE
PILANI (RAJASTHAN)
April, 2014
ABSTRACT
Epilepsy is the most common neurological disorder affecting 65 million people
worldwide. While medications and other treatments help many people of all ages who
live with epilepsy, more than a million people continue to have seizures that can
severely limit their school achievements, employment prospects and participation in all
of life's experiences. It strikes most often among the very young and the very old,
although anyone can develop epilepsy at any age. Its prevalence is greater than autism
spectrum disorder, cerebral palsy, multiple sclerosis and Parkinson's disease combined.
Despite its prevalence and major advances in diagnosis and treatment, epilepsy is
among the least understood of major chronic medical conditions, even though one in
three adults knows someone with the disorder. The Epilepsy Global Data Research and
Information System aims to leverage Big Data, Cloud Computing and data warehouse
features to build a global system that helps doctors and neurosurgeons use shared
information and methodologies to treat children and adults worldwide.
Objectives
• Build a federated database of medical information and services that serves as the
platform for medical research into neurological cases of epilepsy.
• Provide access to very large data sets on patients with different neurological
disorders, helping researchers, doctors and surgeons make efficient decisions and
share their experiences.
• Enable the best treatment to be given to children and other people all over the world.
• Enrich and enhance the system's knowledge base so as to stimulate new questions
about epilepsy and its symptoms, ultimately leading to fruitful answers on its
treatment.
• Harness supercomputer power and the capabilities of Big Data and Cloud Computing.
Broad Academic Area of Work: Cloud Computing, Big Data, Data Warehousing.
Key words: Hadoop, Twitter Apps, Spring XD, HBASE, HDFS, MapReduce, Hue, Hive,
Pig, HCatalog, JSON Serde, Flume.
ACKNOWLEDGEMENTS
I would like to express my sincere gratitude and deep regards to my supervisor and
additional examiner for their constant motivation, monitoring and guidance throughout
the course of this dissertation work. This is indeed a new beginning for professionals
like us to extend technology beyond boundaries in healthcare. Their blessings, guidance
and help have enabled me to begin this journey.
My prime motivation behind this dissertation is my loving nephew Aakash, who has been
treated for epilepsy for the past seven years, and all children around the world. My
sincere regards and appreciation are extended to Dr. Vrajesh Udani and the hospital
staff at Hinduja Hospital, Mumbai, and to Dr. Neeta Ajit Naik, Sion, Mumbai, who are
pioneers in treating children with epilepsy in India.
I would also like to thank my family for motivating me to build this. It would not have
been possible without their constant support and help.
Indeed, we have a long way to go beyond this.
Anup Singh
TABLE OF CONTENTS
1. Introduction: Understanding the power of Big Data, Cloud features
2. Feasibility Study and Analysis of Algorithms, Application Methodologies
3. Architecture Design of the System
4. Cloud Design of the Epilepsy Global Data Centre
5. Data Storage Structure and Query Processing in HDFS and HBASE
6. Use Cases Overview
7. Conclusion and Recommendations
BIRLA INSTITUTE OF TECHNOLOGY & SCIENCE, PILANI
WORK-INTEGRATED LEARNING PROGRAMMES DIVISION
Second Semester 2013-2014
Introduction: Understanding the power of Big Data, Cloud features
Large-volume data analysis in fields such as epilepsy, cardiac disease, genetics and
neuroimaging, performed on groups of individuals with shared and variable
characteristics, remains poorly approached and poorly understood. The very significant
challenges of storing, accessing and accurately implementing complex computations on
such data cannot be met with traditional data warehouse methods.
Globally as well as locally, many families across rural and urban geographies, along
with modern sophisticated hospitals, are unaware of different types of diseases,
symptoms, medicines and healthcare solutions. Sharing a structured and unstructured
knowledge base amongst researchers, neurologists, doctors, associates and parents is a
must. A specific scientific environment and automated software applications, along with
cost reduction, are needed to complement these scenarios. The gap in treating epileptic
disease among children needs to be bridged by leveraging the technological revolution
to predict and find new, improved ways of cure.
Mature methodologies like Kimball's approach, the Enterprise-Wide Data Warehouse (EDW),
traditional RDBMS and ETL/ELT approaches are insufficient for the huge amounts of
epileptic data involved. Over the years, terabytes to petabytes to zettabytes of unused
data have accumulated that can be transformed, utilised and re-engineered to devise new
findings to cure epilepsy. We need better data access, data storage and data structure
techniques.
Big Data environments create the opportunity to ease some of the rigidity of ETL-driven
data integration processes. The nature of big data requires that the infrastructure for
this process can scale cost-effectively. Hadoop* and MongoDB have emerged as standard
solutions for managing big data. Big Data refers to the large amounts, at least
terabytes, of poly-structured data that flows continuously through and around
organizations, including video, text, sensor logs, and transactional records.
Rapidly ingesting, storing, and processing big data requires a cost-effective
infrastructure that can scale with the amount of data and the scope of analysis. Hadoop
has rapidly emerged as the de facto standard for managing large volumes of
unstructured data.
Hadoop is an open source distributed software platform for storing and processing data.
Written in Java, it runs on a cluster of industry-standard servers configured with direct-
attached storage. Using Hadoop, you can store petabytes of data reliably on tens of
thousands of servers while scaling performance cost-effectively by merely adding
inexpensive nodes to the cluster.
Cloud computing has emerged as a viable alternative to the acquisition and
management of physical or software resources. Scientific applications are being ported
to clouds to build on their inherent elasticity and scalability. Such applications need
to run in parallel on a large set of resources in order to achieve reasonable execution
times. Cloud platforms such as Amazon Web Services, Azure and Cloudera are an
interesting option for tackling this problem. They provide high-performance cloud
computing infrastructure for handling the variability of epileptic "Big Data" and offer
eased as well as optimized deployment configurations.
We will be using Amazon Web Services (AWS) to blend the features of Big Data and
Cloud Computing.
Feasibility Study and Analysis of Algorithms, Application Methodologies
Assumptions: Representing all the features of Big Data and Cloud is out of scope
and can be taken up as separate research in epilepsy and other healthcare problems.
We will use Amazon EMR with the Hortonworks Distribution for Hadoop.
Amazon EMR makes it easy to provision and manage Hadoop in the AWS Cloud. Hadoop is
available in multiple distributions, and Amazon EMR gives you the option of using the
Amazon Distribution or the Hortonworks Distribution for Hadoop. Hortonworks delivers on
the promise of Hadoop with a proven, enterprise-grade platform that supports a broad
set of mission-critical and real-time production uses. Hortonworks brings dependability,
ease of use and speed to Hadoop, NoSQL, database and streaming applications in one
unified Big Data platform, and is used across financial services, retail, media,
healthcare, manufacturing, telecommunications and government organizations.
Hadoop for Big Data and Cloud
As described in the introduction, Hadoop is an open source distributed software
platform that stores and processes data reliably and cost-effectively on clusters of
industry-standard servers. Central to the scalability of Hadoop is the distributed
processing framework known as MapReduce.
MapReduce, the programming paradigm implemented by Hadoop, breaks up a batch job
into many smaller tasks for parallel processing on a distributed system, while HDFS,
the distributed file system, stores the data reliably.
MapReduce helps programmers solve data-parallel problems for which the data set can
be sub-divided into small parts and processed independently. MapReduce is an
important advance because it allows ordinary developers, not just those skilled in high-
performance computing, to use parallel programming constructs without worrying about
the complex details of intra-cluster communication, task monitoring, and failure
handling. MapReduce simplifies all that. The system splits the input data-set into
multiple chunks, each of which is assigned a map task that can process the data in
parallel.
Each map task reads the input as a set of (key, value) pairs and produces a transformed
set of (key, value) pairs as the output. The framework shuffles and sorts outputs of the
map tasks, sending the intermediate (key, value) pairs to the reduce tasks, which group
them into final results. MapReduce uses JobTracker and TaskTracker mechanisms to
schedule tasks, monitor them, and restart any that fail. The Hadoop platform also
includes the Hadoop Distributed File System (HDFS), which is designed for scalability
and fault tolerance. HDFS stores large files by dividing them into blocks (usually 64 or
128 MB) and replicating the blocks on three or more servers. HDFS provides APIs for
MapReduce applications to read and write data in parallel. Capacity and performance can
be scaled by adding Data Nodes, and a single NameNode mechanism manages data
placement and monitors server availability. HDFS clusters in production use today
reliably hold petabytes of data on thousands of nodes.
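To make the (key, value) flow concrete, below is a minimal MapReduce sketch in Java. It
assumes a hypothetical comma-separated epilepsy case file in HDFS whose fourth field is
the patient's country, and it counts the cases recorded per country; the paths, field
positions and class names are illustrative only, not part of the system as built.

package org.myorg;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CasesPerCountry {

    // Map step: emit (country, 1) for every case record
    public static class CaseMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text country = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split(",");
            if (fields.length > 3) {            // skip malformed lines
                country.set(fields[3].trim());  // assumed country column
                context.write(country, ONE);
            }
        }
    }

    // Reduce step: sum the counts emitted for each country
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "epilepsy cases per country");
        job.setJarByClass(CasesPerCountry.class);
        job.setMapperClass(CaseMapper.class);
        job.setCombinerClass(SumReducer.class);   // combine map output locally
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/hdfs/epilepsycases"));
        FileOutputFormat.setOutputPath(job, new Path("/hdfs/cases_per_country"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged into a jar, the job is submitted with the hadoop jar command, and the
per-country totals appear as (key, value) pairs in the output directory.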
In addition to MapReduce and HDFS, Hadoop includes many other components, some of
which are very useful for ETL.
• Flume* is a distributed system for collecting, aggregating, and moving large amounts
of data from multiple sources into HDFS or another central data store. Enterprises
typically collect log files on application servers or other systems and archive them
in order to comply with regulations. Being able to ingest and analyze that unstructured
or semi-structured data in Hadoop can turn this passive resource into a valuable asset.
Spring XD is a system similar to Flume.
• Sqoop* is a tool for transferring data between Hadoop and relational databases. You
can use Sqoop to import data from a MySQL or Oracle database into HDFS, run
MapReduce on the data, and then export the data back into an RDBMS. Sqoop
automates these processes, using MapReduce to import and export the data in parallel
with fault-tolerance.
• Hive* and Pig* provide high-level languages that simplify development of applications
employing the MapReduce framework. HiveQL is a dialect of SQL that supports a subset of
its syntax. Although queries can be slow, Hive is being actively enhanced by the
developer community to enable low-latency queries on HBase* and HDFS. Pig Latin is a
procedural programming language that provides high-level abstractions for MapReduce.
You can extend it with User Defined Functions written in Java, Python, and other
languages.
• ODBC/JDBC Connectors for HBase and Hive are often proprietary components included
in Hadoop distributions. They provide connectivity with SQL applications by translating
standard SQL queries into HiveQL commands that can be executed on the data in HDFS or
HBase; a sketch of Hive access over JDBC follows this list.
• YARN provides cluster resource management capabilities, enabling multiple data
processing engines and workloads to run across a single clustered environment.
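As a sketch of the JDBC connectivity described in the list above, the Java snippet
below queries Hive through the HiveServer2 JDBC driver. The host, port and credentials
are illustrative assumptions, and the tbl_doctor table is the one built in a later
use case.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcQuery {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver (org.apache.hive:hive-jdbc)
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection con = DriverManager.getConnection(
                     "jdbc:hive2://sandbox:10000/default", "hue", "");
             Statement stmt = con.createStatement();
             // A standard SQL query, translated and executed as HiveQL on HDFS data
             ResultSet rs = stmt.executeQuery(
                     "SELECT country, COUNT(*) FROM tbl_doctor GROUP BY country")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}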
Thus Hadoop is a powerful platform for big data storage and processing.
Architecture Design of the System
Hadoop receives structured and unstructured input data from different sources
(hospitals, healthcare vaccine data, social media and information documents) into its
various platform components. The features listed in the feasibility section form the
core, with HDFS nodes that can be scaled for storage.
The output is the set of application layers derived from the collated epileptic data:
audio, video, documents, research publications and collaboration-forum information from
social media. Data science can also be applied to find new research areas, make
predictions and produce analytical reporting.
[Figure: Architecture Design of the System. Data sources (hospitals and epileptic
patients' data; files of epileptic cases and scenarios; social media; worldwide
healthcare data on epileptic vaccines and instruments; other information) feed through
ETL into the HDFS data nodes, which support the Epilepsy Information and Knowledge
Sharing layer and Advanced Analytics.]
Cloud Design of the Epilepsy Global Data Centre
[Figure: Epilepsy Global Data Centres leveraging Cloud Computing features, with centres
shown in Pakistan, the UK, India, the US, Malaysia and Sri Lanka.]
The Cloud is core to providing Infrastructure as a Service (IaaS) to the Epilepsy
Global Data Centres across the world. With huge volume, variety and velocity, we can
scale the system automatically based on our data needs. The overhead of maintenance,
upgrades, version management and services (Hadoop, mail services, reporting) sits at
the Cloud provider's end. Information sharing on epilepsy across different countries
becomes achievable, and we can create customised "Epilepsy Data as a Service" offerings
for clinical research, hospitals, doctors, neuroscientists and social media. Very large
data volumes, growing from terabytes to petabytes and beyond, can be stored. However,
the Cloud framework, network portability, components, and legal matters and laws across
different countries will hold the key. The cloud can also provide extra capacity for an
existing cluster or for testing Hadoop applications. Moreover, Hortonworks Data Platform
(HDP) 2.0 features NameNode High Availability, which automates failover and ensures the
availability of the full HDP stack. The Cloud also supports multiple database platforms,
whether MySQL, Oracle, SQL Server or other databases, and provides reporting tools such
as Jasper, SAP Business Objects, MicroStrategy and QlikView to interface with Hadoop.
The Cloud is certainly a multi-use platform when coupled with Big Data. Hadoop in the
cloud makes a great deal of sense: the elastic resource allocation on which cloud
computing is premised works well for cluster-based data processing infrastructure used
on varying analyses and data sets of indeterminate size.
Data Storage Structure and Query Processing in HDFS and HBASE
[Figure: Data Storage Structure and Query Processing Flow in Hadoop Distributed File
System (HDFS) and HBASE]
HDFS is a distributed file system well suited to the storage of large files. Data in
HDFS is organized into files and directories, but it is stored in HDFS's own block
format: we cannot browse the data as in normal practice using dir or explorer
commands. Files are divided into uniformly sized blocks and distributed across cluster
nodes, and blocks are replicated to handle hardware failure. HDFS keeps checksums of
data for corruption detection and recovery. Depending upon the configuration, files are
broken into blocks of, for example, 128 MB, and the block size can be configured per
file. The NameNode manages the file namespace, authorisation and authentication. It
collects block reports from DataNodes on block locations and re-replicates missing
blocks when DataNodes fail. Each DataNode handles the storage of thousands of blocks,
storing them as files in the underlying OS's file system. Clients access blocks
directly from DataNodes based on the metadata read from the NameNode. MapReduce uses
the FileSystem interface, so it can run on multiple file systems. The HDFS storage
structure is depicted below.
[Figure: Hadoop Distributed File System Storage Structure. The NameNode holds the file
system metadata; DataNodes store the replicated blocks, which clients read directly.]
Sample Java code to read the case files stored in HDFS:
package org.myorg;

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Lists the files under /hdfs/epilepsycases and prints each file's status and contents
public class Cat {
    public static void main(String[] args) throws Exception {
        try {
            FileSystem fs = FileSystem.get(new Configuration());
            FileStatus[] status = fs.listStatus(new Path("/hdfs/epilepsycases"));
            for (int i = 0; i < status.length; i++) {
                // File status: path, block size, replication factor, permissions
                System.out.println(status[i]);
                BufferedReader br = new BufferedReader(
                        new InputStreamReader(fs.open(status[i].getPath())));
                String line = br.readLine();
                while (line != null) {
                    System.out.println(line);
                    line = br.readLine();
                }
                br.close();
            }
        } catch (Exception e) {
            System.out.println("Unable to read /hdfs/epilepsycases: " + e.getMessage());
        }
    }
}
[root@sandbox /]# hadoop jar epilepsy_case_files.jar org.myorg.Cat > epilepsy_case_files.txt
In the printed file status we can see the path held by the NameNode, the block size,
the replication factor and the permissions.
HBase is designed as a column store, a more advanced form of key-value database in
which the keys and values become composite. Think of it as a hash map crossed with a
multidimensional array. It is well suited to semi-structured data, since MapReduce is
very often used on such data. Columns are grouped into column families, the row key is
naturally indexed, and the design is good for scaling out horizontally: imagine the
difference between an RDBMS table having a hundred columns and an HBASE table having
around 500 columns. However, it is unsuited to complex data reads. HBase is built on
top of HDFS and provides fast record lookups (and updates) for large tables, which can
sometimes be a point of conceptual confusion; internally, HBase puts your data in
indexed "StoreFiles" that exist on HDFS for high-speed lookups. A sample HBASE storage
structure, contrasted with a SQL RDBMS table, is depicted below.
[Figure: HBASE storage structure using key-value pairs, contrasted with a SQL RDBMS
storage structure. In HBASE the row key is the composite PatientID_Country, and the
column family CF_Data holds key-value columns such as Firstname_lastname,
Doctorname_hospitalname and Evaluation_date_Observations. In the SQL table the primary
key is PatientID, with columns FirstName, LastName, DoctorName, HospitalName,
SurgicalEvaluationDate, Evaluation/Observations and Country.]
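As a sketch of how this layout is used from code, the following Java snippet writes
and reads back one patient row using the HBase 1.x client API. The table name, row key
and cell values follow the figure but are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class PatientStore {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("epilepsy_patients"))) {
            // Row key is the composite PatientID_Country, as in the figure
            Put put = new Put(Bytes.toBytes("P1001_INDIA"));
            put.addColumn(Bytes.toBytes("CF_Data"), Bytes.toBytes("Firstname_lastname"),
                    Bytes.toBytes("Aakash_Singh"));
            put.addColumn(Bytes.toBytes("CF_Data"), Bytes.toBytes("Doctorname_hospitalname"),
                    Bytes.toBytes("VrajeshUdani_Hinduja"));
            table.put(put);

            // Fast record lookup by row key, served from the indexed StoreFiles
            Result result = table.get(new Get(Bytes.toBytes("P1001_INDIA")));
            System.out.println(Bytes.toString(result.getValue(
                    Bytes.toBytes("CF_Data"), Bytes.toBytes("Firstname_lastname"))));
        }
    }
}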
Use Cases
Pool social media data and analyse the information on epilepsy. This is aimed at
self-support care globally. In today's fast-changing world there is a huge population
on Twitter, Facebook and LinkedIn, with a common synergy and a huge exchange of shared
information.
[Figure: Streaming and Analysing Social Media Data Flow in Hadoop. The Twitter app
"Epilepsy Social Media" streams app data into the Spring XD engine, which ingests it
into Hadoop HDFS, where the unstructured JSON data is parsed for analytics.]
Scenario
This scenario focuses on streaming unstructured data in real time from the twitter app
"Epilepsy Social Media" and transforming it into useful information.
Step 1:
Create a collaboration forum app "Epilepsy Social Media" on Twitter at
https://dev.twitter.com/
Note down the API key, API secret, access token and access token secret; these keys
are needed in order to stream information from Twitter. Once we have the keys, we
configure the XD engine installed on the Hadoop server.
Step 2:
Log in to the Spring XD shell, separate from Hadoop, and test whether HDFS is
accessible:
hadoop fs -ls /
It should display some files and directories.
Step 3
Create the tweet stream for the collaboration forum in Spring XD:
stream create --name epilepsytweets --definition "twitterstream --track='epilepsysociety, epilepsy society' | hdfs"
Step 4
Check whether the stream is writing files into XD's HDFS directory:
hadoop fs -ls /xd/epilepsytweets
The tweets that were posted are listed in the captured stream files.
JSON Data Format
Sample tweets captured in the stream (reflowed here so that lines break on JSON field
boundaries):

{"created_at":"Wed Mar 19 19:33:25 +0000 2014","id":446368866097065984,"id_str":"446368866097065984",
"text":"@epilepsysociety Hi we should build some ideas and come together to create awareness on epilepsy many countries mothers and fathers dont knw",
"source":"web","truncated":false,
"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,
"in_reply_to_user_id":87454049,"in_reply_to_user_id_str":"87454049","in_reply_to_screen_name":"epilepsysociety",
"user":{"id":2387686938,"id_str":"2387686938","name":"AnupSingh","screen_name":"anupsingh4u",
 "location":"","url":null,"description":null,"protected":false,
 "followers_count":4,"friends_count":8,"listed_count":0,
 "created_at":"Thu Mar 13 19:00:48 +0000 2014","favourites_count":0,
 "utc_offset":null,"time_zone":null,"geo_enabled":false,"verified":false,"statuses_count":8,"lang":"en-gb",
 "contributors_enabled":false,"is_translator":false,"is_translation_enabled":false,
 "profile_background_color":"C0DEED",
 "profile_background_image_url":"http:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png",
 "profile_background_image_url_https":"https:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png",
 "profile_background_tile":false,
 "profile_image_url":"http:\/\/abs.twimg.com\/sticky\/default_profile_images\/default_profile_0_normal.png",
 "profile_image_url_https":"https:\/\/abs.twimg.com\/sticky\/default_profile_images\/default_profile_0_normal.png",
 "profile_link_color":"0084B4","profile_sidebar_border_color":"C0DEED",
 "profile_sidebar_fill_color":"DDEEF6","profile_text_color":"333333",
 "profile_use_background_image":true,"default_profile":true,"default_profile_image":true,
 "following":null,"follow_request_sent":null,"notifications":null},
"geo":null,"coordinates":null,"place":null,"contributors":null,
"retweet_count":0,"favorite_count":0,
"entities":{"hashtags":[],"symbols":[],"urls":[],
 "user_mentions":[{"screen_name":"epilepsysociety","name":"epilepsy society","id":87454049,"id_str":"87454049","indices":[0,16]}]},
"favorited":false,"retweeted":false,"filter_level":"medium","lang":"en"}

{"created_at":"Wed Mar 19 20:07:31 +0000 2014","id":446377448163143680,"id_str":"446377448163143680",
"text":"I'm fundraising for Epilepsy Society & I'd love your support! Text HERB49 \u00a32 to 70070 to sponsor me today. Thanks. http:\/\/t.co\/C74muxXk9P",
"source":"\u003ca href=\"http:\/\/twitter.com\/tweetbutton\" rel=\"nofollow\"\u003eTweet Button\u003c\/a\u003e",
"truncated":false,
"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,
"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,
"user":{"id":98352324,"id_str":"98352324","name":"Steven Herbert","screen_name":"sherbie40","location":"chepstow","url":null,
 "description":"Play the guitar til your fingers bleed, quoted by Ted Nugent..\n\nLifes to short get on with it...",
 "protected":false,"followers_count":43,"friends_count":107,"listed_count":1,
 "created_at":"Mon Dec 21 11:14:17 +0000 2009","favourites_count":1,
 "utc_offset":0,"time_zone":"London","geo_enabled":true,"verified":false,"statuses_count":119,"lang":"en",
 "contributors_enabled":false,"is_translator":false,"is_translation_enabled":false,
 "profile_background_color":"C0DEED",
 "profile_background_image_url":"http:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png",
 "profile_background_image_url_https":"https:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png",
 "profile_background_tile":false,
 "profile_image_url":"http:\/\/pbs.twimg.com\/profile_images\/442675380076703744\/Oje9Ifzk_normal.jpeg",
 "profile_image_url_https":"https:\/\/pbs.twimg.com\/profile_images\/442675380076703744\/Oje9Ifzk_normal.jpeg",
 "profile_banner_url":"https:\/\/pbs.twimg.com\/profile_banners\/98352324\/1394377010",
 "profile_link_color":"0084B4","profile_sidebar_border_color":"C0DEED",
 "profile_sidebar_fill_color":"DDEEF6","profile_text_color":"333333",
 "profile_use_background_image":true,"default_profile":true,"default_profile_image":false,
 "following":null,"follow_request_sent":null,"notifications":null},
"geo":null,"coordinates":null,"place":null,"contributors":null,
"retweet_count":0,"favorite_count":0,
"entities":{"hashtags":[],"symbols":[],
 "urls":[{"url":"http:\/\/t.co\/C74muxXk9P","expanded_url":"http:\/\/www.justgiving.com\/Steven-Herbert","display_url":"justgiving.com\/Steven-Herbert","indices":[119,141]}],
 "user_mentions":[]},
"favorited":false,"retweeted":false,"possibly_sensitive":false,"filter_level":"medium","lang":"en"}
Step 5
Stop or undeploy the stream after collecting some data.
stream undeploy --name epilepsytweets
Step 6
Refine the data using Hive.
Create tables based on the streamed data collected in HDFS; a minimal sketch follows.
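The exact table design depends on which tweet fields are needed. As a minimal sketch,
using Hive's built-in get_json_object function (rather than a full JSON SerDe) over the
stream directory created in Step 3, the raw JSON can be refined into a structured
table; the table and column names below are illustrative:

-- External table laid over the raw tweet files streamed by Spring XD
CREATE EXTERNAL TABLE raw_tweets (json_body string)
LOCATION '/xd/epilepsytweets';

-- Structured table holding only the fields needed for reporting
CREATE TABLE tweets_refined (created_at string, user_name string, tweet_text string, lang string);

INSERT OVERWRITE TABLE tweets_refined
SELECT get_json_object(json_body, '$.created_at'),
       get_json_object(json_body, '$.user.name'),
       get_json_object(json_body, '$.text'),
       get_json_object(json_body, '$.lang')
FROM raw_tweets;

A query such as SELECT lang, COUNT(*) FROM tweets_refined GROUP BY lang can then feed
the reports.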
In the Hadoop interface we can see that the tweets have been brought into a structured
format. A report can be built on top of this.
Use Cases
Collect and represent information on epilepsy types, symptoms and medicines, with the
pros and cons of each. Collect and represent information on neurosurgeons, the
successful scenarios they have handled, and their publications.
Scenario: Collect doctors' data from different hospitals and research centres
The Hive ETL script below loads the list of doctors into the warehouse in Hadoop.
-- Staging table holding each CSV line as a single string
create table temp_doctor (col_value string);

LOAD DATA INPATH '/user/hue/Doctors_List.csv' OVERWRITE INTO TABLE temp_doctor;

create table tbl_doctor ( id string, name string, age int, hospitalname string,
expertise string, publications_link string, profile_info string, country string,
city string);

-- Parse the comma-separated fields out of each staged line
insert overwrite table tbl_doctor
SELECT
regexp_extract(col_value, '^(?:([^,]*)\,?){1}', 1) doctor_id,
regexp_extract(col_value, '^(?:([^,]*)\,?){2}', 1) fullname,
cast(regexp_extract(col_value, '^(?:([^,]*)\,?){10}', 1) as int) age,
regexp_extract(col_value, '^(?:([^,]*)\,?){3}', 1) organisation,
regexp_extract(col_value, '^(?:([^,]*)\,?){11}', 1) specialisation,
regexp_extract(col_value, '^(?:([^,]*)\,?){8}', 1) articles_cited,
regexp_extract(col_value, '^(?:([^,]*)\,?){13}', 1) wiki_profile,
regexp_extract(col_value, '^(?:([^,]*)\,?){4}', 1) Country,
regexp_extract(col_value, '^(?:([^,]*)\,?){5}', 1) City
from temp_doctor;
We can customise the script based on the information received from hospitals and
research centres; column positions can be adjusted. For example, if the specialisation
field in the list of doctors from Hinduja Hospital is at position 11, we use the script
as written above; if the specialisation field from Fortis Hospital's list is at
position 14, we modify the corresponding statement to "regexp_extract(col_value,
'^(?:([^,]*)\,?){14}', 1) specialisation".
Scenario:
Build a catalog of epilepsy types and epilepsy medicines.
HCatalog provides an easy interface to upload files in different formats and set up
the data.
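As a sketch, such a catalog can be held in a simple delimited Hive table, which
HCatalog then also exposes to Pig and MapReduce jobs; the column layout and file path
below are illustrative assumptions:

CREATE TABLE tbl_epilepsy_catalog (
  epilepsy_type string,
  symptoms string,
  medicine string,
  pros string,
  cons string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

LOAD DATA INPATH '/user/hue/Epilepsy_Catalog.csv' OVERWRITE INTO TABLE tbl_epilepsy_catalog;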
Scenario
Collect patients' data on presurgical evaluation, medical history, physical
examination and lab tests; the related tables are represented in the same way. We can
have customised ETL jobs based on each hospital's data, and this process can be
automated once we have the list of files. However, it will be essential to encrypt or
mask the stored data rather than revealing individual names; this will be subject to
the healthcare laws of different nations. This scenario can be complemented by writing
Pig scripts to compare data on epileptic patients across different states or countries.
Scenario
Information about events can be shared easily by email to increase awareness.
Design the job in the Oozie Editor/Dashboard.
Conclusion and Recommendations
The aim of this blended approach is to increase networking amongst hospitals, doctors,
people and children, thus improving healthcare systems. We can have a proper
Kimball-model data warehouse as well as a federated data warehouse in Hadoop, and Big
Data is feasible for structured as well as unstructured data.
Since data across different testing methods and research is already available, we can
carry out data mining and make predictions on epileptic data. This will also aid in
recognising the difference between normal and abnormal patterns in epileptic sufferers.
Cognitive features based on neural networks can be aimed at reading the machine output
of tests carried out on epilepsy patients. Test data and their scenarios can be known
upfront based on the parameters, and algorithms can be developed to make the system
precise and agnostic.
We can aim to build a language-interpreter app which can deliver the epilepsy data in
different languages to target audiences across different countries. This will help
bridge the language barrier between the different languages spoken around the world.
Document stores for CT scans, MRI and EEG recordings can be explored in MongoDB to
optimize audio and video data.
Interfacing with SAP HANA, SAP Business Objects, MicroStrategy, Jasper, QlikView and
other reporting tools can be carried out so that we have graphs and data representing
normal behaviour and deviated behaviour in seizures.
List of Abbreviations
AWS - Amazon Web Services
EMR - Elastic MapReduce
HDP - Hortonworks Data Platform
EDW - Enterprise-Wide Data Warehouse
HDFS - Hadoop Distributed File System
IaaS - Infrastructure as a Service
List of Figures
Hadoop Architecture
Architecture Design of the System
Epilepsy Global Data Centres Leveraging Cloud Computing Features
Data Storage Structure and Query Processing Flow in HDFS and HBASE
HDFS Storage Structure
HBASE Storage Structure
Literature References
[1] http://www.epilepsyfoundation.org
[2] Dinkar Sitaram and Geetha Manjunath. Moving To The Cloud: Developing Apps in the
New World of Cloud Computing.
[3] http://bigdatauniversity.com
[4] http://www.mongodb.com/learn/big-data
[5] http://ocw.mit.edu/courses/brain-and-cognitive-sciences/
[6] http://aws.amazon.com/
[7] Amit Konar. Artificial Intelligence and Soft Computing: Behavioral and Cognitive
Modeling of the Human Brain, Volume 1.
[8] Amit Konar. Computational Intelligence: Principles, Techniques and Applications.
[9] http://hortonworks.com/
[10] http://hadoop.apache.org/
[11] http://projects.spring.io/spring-xd/
[12] http://guidance.nice.org.uk/
[13] https://www.hemr.org/wiki/Category:Epilepsy_syndromes
[14] Dr. Vrajesh Udani. http://www.hindujahospital.com/communityportal/doctors/doctor-details.aspx?did=140&name=dr-vrajesh-udani&cid=36&cname=
[15] https://twitter.com/epilepsysociety
[16] Jayapandian CP, Chen CH, Bozorgi A, Lhatoo SD, Zhang GQ, Sahoo SS.
Electrophysiological Signal Analysis and Visualization using Cloudwave for Epilepsy
Clinical Research. The 14th World Congress on Medical and Health Informatics
(MedInfo), 2013. http://www.ncbi.nlm.nih.gov/pubmed/23920671
[17] Hadoop Architecture. http://www.intel.co.uk/content/www/xa/en/big-data/big-data-analytics-turning-big-data-into-intelligence.html