expose google app engine as tasktracker nodes and data nodes
INTRODUCTION
MOTIVATION
IMPLEMENTATION
  Core Logic (Map-Reduce Framework)
  Job Scheduling
  Load Balancing
HADOOP & GOOGLE APP ENGINE
CHALLENGES & ISSUES
PERFORMANCE ANALYSIS & RESULTS
QUESTIONS
GOOGLE APP ENGINE?
PaaS (Platform as a Service)
A platform for hosting web applications
Virtualizes applications across multiple servers and Google-managed data centers
PROJECT DESCRIPTION
Distribute computation across multiple servers and share the load among them
Use multiple accounts on App Engine
A Task Tracker runs on each account
The Job Tracker runs on a stand-alone machine
WHY GOOGLE APP ENGINE?
WRITE THE CORE LOGIC OF THE APP AND DEPLOY IT
NO NEED TO WORRY ABOUT DATA CENTERS
AUTOMATIC SCALING
FREE UP TO A CERTAIN LIMIT
PAY AS YOU GO BEYOND THAT LIMIT
WHAT WE DID?
BUILT APPLICATIONS (INVERTED INDEX, WORDCOUNT, MOVIE RATINGS)
BUILT MAP-REDUCE FUNCTIONS FOR THESE APPLICATIONS
DEPLOYED THESE MAP/REDUCE FUNCTIONS ON TASK TRACKERS
A JOB TRACKER, ACTING AS A MASTER, DISTRIBUTES DATA THROUGH URLFETCH
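The map/reduce split for the WordCount application can be sketched as follows. This is a minimal local illustration, not the deployed code: the function names are hypothetical, and each task tracker is modelled as a plain callable where the real system would issue a URLFetch request to a remote endpoint.

```python
from collections import defaultdict


def map_wordcount(chunk):
    """Map step: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def reduce_wordcount(pairs):
    """Reduce step: sum the counts emitted for each word."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


def run_job(chunks, task_trackers):
    """Job tracker: hand chunks to task trackers round-robin, then reduce.

    Each entry in task_trackers is a callable standing in for a remote map
    endpoint (an assumption made so the sketch runs locally).
    """
    emitted = []
    for i, chunk in enumerate(chunks):
        tracker = task_trackers[i % len(task_trackers)]
        emitted.extend(tracker(chunk))
    return reduce_wordcount(emitted)


counts = run_job(["the cat sat", "the dog sat"], [map_wordcount, map_wordcount])
```

Here `counts` maps each word to its total across both chunks, regardless of which tracker processed which chunk.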
PROVIDED A UI THAT LETS THE USER UPLOAD INPUT DATA TO GOOGLE’S PERSISTENT STORAGE (BIGTABLE)
LIBRARIES USED TO CONNECT TO THE PERSISTENT STORAGE : JDO/JPA
USER CAN CHOOSE THE APPLICATION TO BE RUN
THE JOB IS SUBMITTED TO THE JOB TRACKER
THE JOB TRACKER MAINTAINS A QUEUE OF JOBS
SCHEDULER
PRIORITY SCHEDULER
THE USER CAN SPECIFY A PRIORITY FOR THE JOB
THE JOB IS INSERTED INTO THE QUEUE BASED ON THAT PRIORITY
USED WHEN THE USER SPECIFIES A PRIORITY
FIFO SCHEDULER
THE SUBMITTED JOB IS INSERTED AT THE BACK OF THE QUEUE
A JOB IS PICKED FROM THE FRONT, SO JOBS RUN IN FIFO ORDER
DEFAULT SCHEDULER
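The two schedulers can be sketched with standard-library queues. The class and method names are hypothetical, and the sketch assumes that a larger number means a higher priority and that jobs with equal priority run in submission order; the slides do not specify either point.

```python
import heapq
from collections import deque
from itertools import count


class FifoScheduler:
    """Default scheduler: jobs run in the order they were submitted."""

    def __init__(self):
        self._queue = deque()

    def submit(self, job):
        self._queue.append(job)  # insert at the back of the queue

    def next_job(self):
        return self._queue.popleft() if self._queue else None  # pick from the front


class PriorityScheduler:
    """Used when the user gives a priority; higher numbers run first (assumed)."""

    def __init__(self):
        self._heap = []
        self._order = count()  # tie-breaker that keeps equal priorities in FIFO order

    def submit(self, job, priority):
        # heapq is a min-heap, so the priority is negated to pop the largest first.
        heapq.heappush(self._heap, (-priority, next(self._order), job))

    def next_job(self):
        return heapq.heappop(self._heap)[2] if self._heap else None
```

A job tracker would pick `PriorityScheduler` when the submission carries a priority and fall back to `FifoScheduler` otherwise.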
RESOURCE           | DAILY LIMIT (FREE) | MAX RATE (FREE)    | DAILY LIMIT (BILLED)         | MAX RATE (BILLED)
REQUESTS           | 1,300,000 REQUESTS | 7,400 REQUESTS/MIN | 43,000,000 REQUESTS          | 30,000 REQUESTS/MIN
OUTGOING BANDWIDTH | 1 GB               | 56 MB/MIN          | 1 GB FREE; 1046 GB MAX       | 740 MB/MIN
INCOMING BANDWIDTH | 1 GB               | 56 MB/MIN          | 1 GB FREE; 1046 GB MAX       | 740 MB/MIN
CPU TIME           | 6.5 CPU HOURS      | 15 CPU-MIN/MIN     | 6.5 CPU HOURS FREE; 1729 MAX | 72 CPU-MIN/MIN
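As a rough illustration of what the free quotas imply, the following sketch estimates how many accounts a data set needs and the shortest possible upload time. The two constants come from the incoming-bandwidth figures listed above; the function names are hypothetical, and the sketch assumes 1 GB = 1024 MB and ignores the request and CPU quotas.

```python
import math

FREE_DAILY_INCOMING_MB = 1024   # 1 GB per account per day (assuming 1 GB = 1024 MB)
FREE_MAX_RATE_MB_PER_MIN = 56   # maximum incoming rate per account


def accounts_needed(dataset_mb):
    """Fewest accounts whose combined daily incoming quota covers the data set."""
    return max(1, math.ceil(dataset_mb / FREE_DAILY_INCOMING_MB))


def min_transfer_minutes(dataset_mb, accounts):
    """Lower bound on upload time if every account receives at its maximum rate."""
    return dataset_mb / (accounts * FREE_MAX_RATE_MB_PER_MIN)
```

For example, a 2,500 MB data set would need at least three free accounts on the bandwidth quota alone.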
WHY?
EVERY ACCOUNT HAS A FIXED QUOTA
DATA IS DISTRIBUTED ACROSS MULTIPLE TASK TRACKERS TO STAY WITHIN THE QUOTA
COST MODEL FOR LOAD BALANCING
COST IS PROPORTIONAL TO THE AMOUNT OF DATA PROCESSED BY A TASK TRACKER
DATA IS DIVIDED INTO EQUAL-SIZED CHUNKS AND SENT TO THE TASK TRACKERS’ MAP FUNCTIONS
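One way to realise this cost model is a greedy assignment that always gives the next chunk to the task tracker that has processed the least data so far. This is only a sketch under assumptions: the slides do not name the assignment rule, the function names are hypothetical, and size is measured in characters rather than in real quota units.

```python
def split_into_chunks(data, chunk_size):
    """Cut data into equal-sized chunks; only the last chunk may be shorter."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def assign_chunks(chunks, tracker_names, quota):
    """Give each chunk to the tracker with the lowest cost (data assigned so far).

    quota is the maximum amount of data one tracker may receive. If even the
    least-loaded tracker cannot take the chunk, no tracker can, so we raise.
    """
    load = {name: 0 for name in tracker_names}
    plan = {name: [] for name in tracker_names}
    for chunk in chunks:
        name = min(tracker_names, key=lambda n: load[n])
        if load[name] + len(chunk) > quota:
            raise RuntimeError("every task tracker would exceed its quota")
        load[name] += len(chunk)
        plan[name].append(chunk)
    return plan, load


chunks = split_into_chunks("abcdefghij", 4)
plan, load = assign_chunks(chunks, ["t1", "t2"], quota=6)
```

With equal-sized chunks this behaves much like round-robin; the cost bookkeeping matters once chunk sizes differ or trackers start with unequal remaining quota.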
HANDLING HUGE DATA SETS
DATA IS DIVIDED INTO CHUNKS
WHAT IF THE CHUNK SIZE IS HUGE?
AT LEAST ONE OF THE TASK TRACKERS WILL FAIL, NO MATTER WHICH LOAD BALANCING ALGORITHM IS USED
SOLUTION: DYNAMICALLY INCREASE THE NUMBER OF TASK TRACKERS IF ONE OF THEM FAILS AFTER A FIXED NUMBER OF TRIALS
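The retry-then-grow idea can be sketched as a loop that re-splits the data across one more task tracker whenever a chunk keeps failing. All names and default values are hypothetical, `process_chunk` stands in for the remote call to a task tracker, and restarting the whole job after a failure is an assumption made to keep the sketch short.

```python
def try_chunk(process_chunk, chunk, max_trials):
    """Call process_chunk up to max_trials times; return (succeeded, result)."""
    for _ in range(max_trials):
        try:
            return True, process_chunk(chunk)
        except Exception:
            continue  # failed trial: try again until the trials run out
    return False, None


def run_with_growing_trackers(data, process_chunk, trackers=1,
                              max_trials=3, max_trackers=16):
    """Split data into at most one chunk per tracker; add a tracker on failure."""
    while trackers <= max_trackers:
        size = max(1, -(-len(data) // trackers))  # ceiling division
        chunks = [data[i:i + size] for i in range(0, len(data), size)]
        results = []
        for chunk in chunks:
            ok, result = try_chunk(process_chunk, chunk, max_trials)
            if not ok:
                break  # this chunk is too large: retry the job with more trackers
            results.append(result)
        else:
            return trackers, results
        trackers += 1
    raise RuntimeError("job failed even with the maximum number of task trackers")
```

More trackers means smaller chunks, so the loop ends as soon as every chunk fits within what a single tracker can handle.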
LIMITED CONTROL ON GOOGLE APP ENGINE
NO SPAWNING OF THREADS
INABILITY TO WRITE TO THE FILESYSTEM OF GOOGLE’S SERVERS
NO CONTROL OVER DATA LOCALITY
THE MACHINE ON WHICH DATA IS STORED IS DYNAMICALLY ALLOCATED BY GOOGLE
HADOOP ALLOWS THREADS AND FILE I/O
IMPLEMENTING HADOOP USING GOOGLE APP ENGINE IS THEREFORE DIFFICULT
DATA RETRIEVAL IS NOT IN THE SAME ORDER AS DATA STORAGE BECAUSE OF GOOGLE’S STORAGE ARCHITECTURE
NO CONTROL ON USAGE OF NETWORK BANDWIDTH BETWEEN THE JOB TRACKER AND TASK TRACKERS
JOIN AND UNION OPERATIONS ARE EXPENSIVE WHEN THE NUMBER OF TABLES INVOLVED IS LARGE
THE RESULT IS THE SAME AS WHEN RUNNING THE APPLICATION ON HADOOP
TESTED THE WORDCOUNT APPLICATION ON A DATA SET OF 10,000 WORDS USING 3 TASK TRACKERS
NETWORK BANDWIDTH IS A BOTTLENECK IN THE APPLICATION’S RUNTIME, AS DATA HAS TO BE TRANSFERRED FROM THE TASK TRACKERS TO THE JOB TRACKER AND VICE VERSA