theanatomyofthegooglearchitecturefinalv1-1-091210035101-phpapp02

Upload: charlesyan

Post on 06-Apr-2018

213 views

Category:

Documents


0 download

TRANSCRIPT

  • 8/3/2019 theanatomyofthegooglearchitecturefinalv1-1-091210035101-phpapp02

    1/51

    1

    GOOGLE TALK

    Ed Austin 12-09-09

  • 8/3/2019 theanatomyofthegooglearchitecturefinalv1-1-091210035101-phpapp02

    2/51

    2

    Pre PresentationThe Google Philosophy (according to ed)

    Jedis build their own lightsabres (the MS Eat your own Dog Food)

    Parallelize Everything

    Distribute Everything (to atomic level if possible) Compress Everything (CPU cheaper than bandwidth)

    Secure Everything (you can never be too paranoid)

    Cache (almost) Everything

    Redundantize Everything (in triplicate usually)

    Latency is VERY evil

  • 8/3/2019 theanatomyofthegooglearchitecturefinalv1-1-091210035101-phpapp02

    3/51

    The Anatomy of the Google

    ArchitectureThe unofficial Version

    V1.0 November 2009

    Ed Austin{ed, edik} @i-dot.com

  • 8/3/2019 theanatomyofthegooglearchitecturefinalv1-1-091210035101-phpapp02

    4/51

    4

    Section I The Basic Glue

    SERVER HARDWARE

    RHEL 2.6.X PAE

    RACK

    INTERIOR NETWORK IPv6

    GFS / GFS II

    BigTable MapreduceBigTable

    Chubby Lock

    GOOGLE APP

    ENGINE

    Python, Java, C++,Sawzall, other

    DC

    GOOGLE APPSSEARCHINDEX

    CRAWLGMAIL...

    Architecture

    Python. Java.C++

    Exterior Network

    GWQ

    1. Exterior Network (Perimeter Architecture)

    2. Data Centre

    3. Rack Characteristics

    4. Core Server Hardware

    5. Operating System Implementation

    6. Interior Network Architecture

  • 8/3/2019 theanatomyofthegooglearchitecturefinalv1-1-091210035101-phpapp02

    5/51

    5

    THE PERIMETER

    How does your data enter the Google empire?

  • 8/3/2019 theanatomyofthegooglearchitecturefinalv1-1-091210035101-phpapp02

    6/51

    6

    Perimeter Network Security (as known)

    DNS Load Balanced splits traffic (country, .com multiple DNS, other X1)to FW

    Firewall filters traffic (http/s, smtp,pop etc)

    Netscalar Load Balancers take Request from FW blocks DOS attacks,ping floods (DOS) blocks non IPv4/6 and none 80/443 ports and http

    multiplexes (limited caching capability) User Request forwarded to Squid (Reverse Proxy) probably HUGE

    cache (Petabytes?)

    If not in Cache forwarded to GWS (Custom C++ Web Server) now notusing Custom apache?

    GWS sends the Request to appropriate internal (Cell) servers

    Request is processed

    exterior https via thawte certs

    Dedicated Crawler Architecture se arate from other infrastructure

    SERVER HARDWARE

    RHEL 2.6.X PAE

    RACK

    INTERIOR NETWORK IPv6

    GFS / GFS II

    BigTableMapreduce

    BigTable

    Chubby Lock

    GOOGLE APPENGINE

    Python, Java, C++,Sawzall, other

    DC

    GOOGLE APPS

    SEARCHINDEX

    CRAWL

    GMAIL...

    Architecture

    Python. Java.C++

    Exterior Network

    GWQ

    CellInterior Network

    GFS II etc

    Firewall80/443

    DNSLoad Balanced (.COM = 3, UK only one)

    [ed@d800 ~]$ dig google.com.....

    ;; ANSWER SECTION:google.com. 223 IN A 74.125.45.100google.com. 223 IN A 74.125.53.100

    google.com. 223 IN A 74.125.67.100

    [ed@d800 ~]$

    NetScalarhttp multiplexing

    SquidReverse Proxy

    GWSWeb Server Farm

    FirewallDMZPerimeter

    Client Browser80/443

    Possible Search Traffic PathBased upon Known Technologies employededge routing not shown/instances not shown

  • 8/3/2019 theanatomyofthegooglearchitecturefinalv1-1-091210035101-phpapp02

    7/51

    7

    PERIMETER NETWORK CACHING

    -Uses Squid Reverse Proxy

    -Perimeter Cache hit rates 30-60% = Huge!

    - Dependent on search complexity/user preferences/traffic type

    - All Image Thumbnails caches, much Multimedia cached- Expensive common queries cached (common words i.e. Obama,

    edinburgh) as they require significant back-end processing.

    - On cache flush/update big latency spike and capacity drop- Index servers need to do significant work to rebuild cache

    SERVER HARDWARE

    RHEL 2.6.X PAE

    RACK

    INTERIOR NETWORK IPv6

    GFS / GFS II

    BigTable MapreduceBigTableChubby Lock

    GOOGLE APPENGINE

    Python, Java, C++,Sawzall, other

    DC

    GOOGLE APPS

    SEARCHINDEX

    CRAWLGMAIL...

    Architecture

    Python. Java.C++

    Exterior Network

    GWQSquid

    Reverse Proxy

    80/443 80/443

  • 8/3/2019 theanatomyofthegooglearchitecturefinalv1-1-091210035101-phpapp02

    8/51

    8

    THE DATA CENTRE

    Where do they store all that Data?

  • 8/3/2019 theanatomyofthegooglearchitecturefinalv1-1-091210035101-phpapp02

    9/51

    9

    Worldwide Data Centres

    Where is Google Located?

    Last estimated were 36 Data Centers, 300+ GFSII Clusters andupwards of 800K machines.

    US (#1) Europe (#2) Asia (#3) South America/Russia (#4)

    Australia on Hold

    Future: Taiwan, Malaysia, Lithuania, and Blythewood, South Carolina.

    SERVER HARDWARE

    RHEL 2.6.X PAE

    RACK

    INTERIOR NETWORK IPv6

    GFS / GFS II

    BigTable Mapreduce

    BigTableChubby Lock

    GOOGLE APP

    ENGINE

    Python, Java, C++,Sawzall, other

    DC

    GOOGLE APPSSEARCH

    INDEXCRAWLGMAIL...

    Architecture

    Python. Java.C++

    Exterior Network

    GWQ

  • 8/3/2019 theanatomyofthegooglearchitecturefinalv1-1-091210035101-phpapp02

    10/51

    10

    The Modular Data Centre

    Standard Google Modular DC (Cell) holds 1160 Servers / 250KWPower Consumption in 30 racks (40U).

    This is the Atomic Data Centre Building Block of Google.

    A Data Centre would consist of 100s of Modular Cells.DC architecture then being the aggregation of smaller Cell level infrastructures intheir own container some being pure GFS, other BT, other Map, some mixed etc.

    MDCs can also be deployed autonomously at the Perimeter(stand alone).

    SERVER HARDWARE

    RHEL 2.6.X PAE

    RACK

    INTERIOR NETWORK IPv6

    GFS / GFS II

    BigTable Mapreduce

    BigTableChubby Lock

    GOOGLE APP

    ENGINE

    Python, Java, C++,Sawzall, other

    DC

    GOOGLE APPSSEARCH

    INDEXCRAWLGMAIL...

    Architecture

    Python. Java.C++

    Exterior Network

    GWQ

  • 8/3/2019 theanatomyofthegooglearchitecturefinalv1-1-091210035101-phpapp02

    11/51

    11

    THE RACK

    How is a server stored in the Data Centre?

  • 8/3/2019 theanatomyofthegooglearchitecturefinalv1-1-091210035101-phpapp02

    12/51

    12

    Google Rack (GOOG rack)

    Why interesting?

    The rack Implementation!

    EVERYTHING custom!

    Mini Server Size

    Old Servers are Custom 1U

    New Servers are 2U...

    again a custom design

    seem 1/3 width of a normal 2U Server

    40U/80U Custom Racks (50% each side)

    DesignHuge Heating and Power Issues

    Optimized Motherboards

    Work closely with HW MB developers

    Have their own HW builds

    specified to component level

    Servers expected to be expendable

    build redundancy on top of failure

    Motherboard directly mounted into Rack

    servers have no casing - just bare boards

    assist with heat dispersal issues

    SERVER HARDWARE

    RHEL 2.6.X PAE

    RACK

    INTERIOR NETWORK IPv6

    GFS / GFS II

    BigTable

    Mapreduce

    BigTableChubby Lock

    GOOGLE APPENGINE

    Python, Java, C++,Sawzall, other

    DC

    GOOGLE APPSSEARCH

    INDEXCRAWLGMAIL...

    Architecture

    Python. Java.C++

    Exterior Network

    GWQ

  • 8/3/2019 theanatomyofthegooglearchitecturefinalv1-1-091210035101-phpapp02

    13/51

    13

    THE HARDWARE

    Millions of exactly what?

  • 8/3/2019 theanatomyofthegooglearchitecturefinalv1-1-091210035101-phpapp02

    14/51

    14

    Server Hardware

    2U Low-Cost (but not slow) Commodity Servers

    2009 Currently 2-Way, Dual Core/16GB/1-2TB +- Standard

    Both Intel/AMD Chipsets 1 NIC 2 USB Looks like they RAID1/mirror the disks for better I/O - read performance

    SATA 7.2K/10K/15K drives? 8 x 2GB DDR3 ECC

    Standard HW Build (Several HW Build Versions at any one time) Currently at 7Gen Build (1G 2005 was probably Dual Core/SMP)

    Each Server 12V Battery Backup and can run autonomously without external power (lasts 20-30s?)

    Work closely with chip manufacturers to improve design/reduce power custom Intel chips that canwithstand higher heat factors than generic versions

    YEAR Average Server Specification

    1999/2000 PII/PIII 128MB+

    2003/2004 Celeron 533, PIII 1.4 SMP, 2-4GB DRAM, Dual XEON 2.0/1-4GB/40-160GB IDE - SATA Disks via Silicon Images SATA 3114/SATA 3124

    2006 Dual Opteron/Working Set DRAM(4GB+)/2x400GB IDE (RAID0?)

    2009 2-Way/Dual Core/16GB/1-2TB SATA

    SERVER HARDWARE

    RHEL 2.6.X PAE

    RACK

    INTERIOR NETWORK IPv6

    GFS / GFS II

    BigTable Mapreduce

    BigTableChubby Lock

    GOOGLE APPENGINE

    Python, Java, C++,Sawzall, other

    DC

    GOOGLE APPSSEARCH

    INDEXCRAWLGMAIL...

    Architecture

    Python. Java.C++

    Exterior Network

    GWQ

  • 8/3/2019 theanatomyofthegooglearchitecturefinalv1-1-091210035101-phpapp02

    15/51

    15

    THE OPERATING SYSTEM

    The Core Software on each of those servers

  • 8/3/2019 theanatomyofthegooglearchitecturefinalv1-1-091210035101-phpapp02

    16/51

    16

    OPERATING SYSTEM

    -100% Redhat Linux Based since 1998 inception

    - RHEL (Why not CentOS?)

    - 2.6.X Kernel- PAE- Custom glibc.. rpc... ipvs...- Custom FS (GFS II)- Custom Kerberos- Custom NFS

    - Custom CUPS- Custom gPXE bootloader- Custom EVERYTHING.....

    Kernel/Subsystem Modificationstcmallocreplaces glibc 2.3 mallocmuchfaster! works very well with threads...rpcthe rpc layer extensively modified to provide > perf increase < latency

    (52%/40%)

    Significantly modified Kernel and Subsystems all IPv6 enabled

    Use Python as the primary scripting languageDeploy Ubuntu internally (likely for the Desktop) also Chrome OS base

    Easily the Worlds largest installed Linux base

    SERVER HARDWARE

    RHEL 2.6.X PAE

    RACK

    INTERIOR NETWORK IPv6

    GFS / GFS II

    BigTable Mapreduce

    BigTableChubby Lock

    GOOGLE APP

    ENGINE

    Python, Java, C++,Sawzall, other

    DC

    GOOGLE APPSSEARCH

    INDEXCRAWLGMAIL...

    Architecture

    Python. Java.C++

    Exterior Network

    GWQ

  • 8/3/2019 theanatomyofthegooglearchitecturefinalv1-1-091210035101-phpapp02

    17/51

    17

    THE INTERIOR NETWORK

    How does your datatravel around the Google empire?

  • 8/3/2019 theanatomyofthegooglearchitecturefinalv1-1-091210035101-phpapp02

    18/51

    18

    INTERIOR NETWORK

    ROUTING PROTOCOL

    Internal network is IPv6 (exterior machines can bereached using IPv6)

    Heavily Modified Version of OSPF as the IRP

    Intra-rack network is 100baseT

    Inter-rack network is 1000baseTInter-DC network pipes unknown but veryfast

    Technology:

    Juniper, Cisco, Foundry, HP, routers and

    switches

    Software:

    ipvs (ip virtual server)

    SERVER HARDWARE

    RHEL 2.6.X PAE

    RACK

    INTERIOR NETWORK IPv6

    GFS / GFS II

    BigTable Mapreduce

    BigTableChubby Lock

    GOOGLE APPENGINE

    Python, Java, C++,Sawzall, other

    DC

    GOOGLE APPSSEARCH

    INDEXCRAWLGMAIL...

    Architecture

    Python. Java.C++

    Exterior Network

    GWQ

  • 8/3/2019 theanatomyofthegooglearchitecturefinalv1-1-091210035101-phpapp02

    19/51

    19

    THE MAJOR GLUE

    The three foundation blocks of GooglesSecret Sauce

  • 8/3/2019 theanatomyofthegooglearchitecturefinalv1-1-091210035101-phpapp02

    20/51

    20

    Section II Googles Major Glue

    SERVER HARDWARE

    RHEL 2.6.X PAE

    RACK

    INTERIOR NETWORK IPv6

    GFS / GFS II

    BigTable MapreduceBigTable

    Chubby Lock

    GOOGLE APPENGINE

    Python, Java, C++,Sawzall, other

    DC

    GOOGLE APPSSEARCHINDEX

    CRAWLGMAIL...

    Architecture

    Python. Java.C++

    Exterior Network

    GWQ 1. Google File System Architecture GFS II

    2. Google Database - Bigtable

    3. Google Computation - Mapreduce

    4. Google Scheduling - GWQ

  • 8/3/2019 theanatomyofthegooglearchitecturefinalv1-1-091210035101-phpapp02

    21/51

    21

    GOOGLE FILE SYSTEM

    Manages the underlying Data on behalf of the upperlayers and ultimately the applications

  • 8/3/2019 theanatomyofthegooglearchitecturefinalv1-1-091210035101-phpapp02

    22/51

    22

    FILE SYSTEM I GFS v1

    SERVER HARDWARE

    RHEL 2.6.X PAE

    RACK

    INTERIOR NETWORK IPv6

    GFS / GFS II

    BigTableMapreduce

    BigTableChubby Lock

    GOOGLE APPENGINE

    Python, Java, C++,Sawzall, other

    DC

    GOOGLE APPSSEARCH

    INDEXCRAWLGMAIL...

    Architecture

    Python. Java.C++

    Exterior Network

    GWQ

    The GFS II cell is Googles fundamental building block everything can belayered on top of this

    Consists of (Highly distributed Linux based) Master Servers and Chunk Servers

    Chunk Servers serve the Data in 64MB Chunks to the client directly via Master arbitration

    DATA REDUNDANCY/FAULT TOLERANCE?Triplicate Copies of Chunks are kept often in other clusters / DCChunks can be pulled from outside the DC! Expensive.... And try not to do!However apps built on top of GFS/BT do this on an ad-hoc basis (i.e. Gmail)

    On Chunk loss the Master handles the Recovery by sourcing a chunk copy

    Data is compressed using BMDiff/Zippy

    Chunk Server Fault-Tolerance achieved by Heart-beat to the Master (I am alive..)

    Master Failure was problematic for Google (finally down from 2 minutes to 10 seconds)

  • 8/3/2019 theanatomyofthegooglearchitecturefinalv1-1-091210035101-phpapp02

    23/51

    23

    FILE SYSTEM I GFS II

    SERVER HARDWARE

    RHEL 2.6.X PAE

    RACK

    INTERIOR NETWORK IPv6

    GFS / GFS II

    BigTableMapreduce

    BigTableChubby Lock

    GOOGLE APPENGINE

    Python, Java, C++,Sawzall, other

    DC

    GOOGLE APPSSEARCH

    INDEXCRAWLGMAIL...

    Architecture

    Python. Java.C++

    Exterior Network

    GWQ

    GFS II Colossus Version 2 improves in many ways (is a complete

    rewrite)

    Elegant Master Failover (no more 2s delays...)

    Chunk Size is now 1MB likely to improve latency for serving data other than Indexing fo

    example GMail this was the rationale behind the change

    Master can store more Chunk Metadata (therefore more chunks addressable up to 100million) = also more Chunk Servers

    However according to Google Engineer they have only ever lost one 64MB chunk (in GFS I)during its entire production deployment (2004 2008?) so assumed extremely reliable

  • 8/3/2019 theanatomyofthegooglearchitecturefinalv1-1-091210035101-phpapp02

    24/51

    24

    GOOGLE DATABASE

    Accesses the underlying Data on behalf of the upperlayers and ultimately the applications

  • 8/3/2019 theanatomyofthegooglearchitecturefinalv1-1-091210035101-phpapp02

    25/51

    25

    Bigtable I - Introduction

    What is it?

    Googles Database Implementation since 1994

    Used internally for all large scale (Search, Indexing, GMail etc)

    Similar to a sharded Database implemention

    GOALS

    Huge Scalability to many PBs (Web Database currently around 40 Billion URLs)

    Tight Latency

    Highly efficient scans over Textual Data

    Fault Tolerant

    Load Balancable

    Eliminate Googles dependency on an external provider

    SERVER HARDWARE

    RHEL 2.6.X PAE

    RACK

    INTERIOR NETWORK IPv6

    GFS / GFS II

    BigTable

    Mapreduce

    BigTableChubby Lock

    GOOGLE APPENGINE

    Python, Java, C++,Sawzall, other

    DC

    GOOGLE APPSSEARCH

    INDEXCRAWLGMAIL...

    Architecture

    Python. Java.C++

    Exterior Network

    GWQ

  • 8/3/2019 theanatomyofthegooglearchitecturefinalv1-1-091210035101-phpapp02

    26/51

    26

    Bigtable II

    How is Data Referenced?

    Distributed Multi-Dimensional Sparse Map

    Simple addressing model using a triple:

    (row, column, {timestamp}) -> cell contents

    ROWS

    - Rows (arbitrary length usually 10-100 Bytes Max

  • 8/3/2019 theanatomyofthegooglearchitecturefinalv1-1-091210035101-phpapp02

    27/51

    27

    Bigtable III Table StructureStudying contents: columnshows three versions ofcontents of a page (current,cached and ?)presumably all othercolumns are timestampedso could be used in acomparitive way (such asanchor increase/decrease)OTF in SERPS alg mustuse a combo of TimeSt diffbetween n(=3 rest garbagecollected) page Versionsand crawled anchors -what else does table hold?Possibly PR (or OTF) andother search related

    weightings

    Google keeps much moreinfo for ranking purposesthan it did in 1999

    Webtable hinted at 100columns+!

    How do page units affect

    the URL reversal of theURL bigtable?-Does a Tables TabletsCross a Clustersnamespace (yes if unifiedelse no?)

    ENG uk.co.bbc.news

    language:ROW

    10-100 Bytes

  • 8/3/2019 theanatomyofthegooglearchitecturefinalv1-1-091210035101-phpapp02

    28/51

    28

    Bigtable IV

    How tables are broken down in storage ?

    For example Webtable is billions of pages!

    -Large Tables broken (split) into tablets at row boundaries

    -Tablets discontiguous (assists in fault-tolerance) spread over DC but try tokeep one copy in same rack

    -Tablet Size is approximately 100-200MB of compressed Data

    -Load Balanced migrate tablets from heavily loaded machines to lightly loadedones

    - Heavily used tablets probably stay in working set (cached)

    SERVER HARDWARE

    RHEL 2.6.X PAE

    RACK

    INTERIOR NETWORK IPv6

    GFS / GFS II

    BigTableMapreduce

    BigTableChubby Lock

    GOOGLE APPENGINE

    Python, Java, C++,Sawzall, other

    DC

    GOOGLE APPS

    SEARCHINDEXCRAWLGMAIL...

    Architecture

    Python. Java.C++

    Exterior Network

    GWQ

  • 8/3/2019 theanatomyofthegooglearchitecturefinalv1-1-091210035101-phpapp02

    29/51

    29

    GOOGLE MAPREDUCE

    Computes the underlying Data on behalf of theapplications

  • 8/3/2019 theanatomyofthegooglearchitecturefinalv1-1-091210035101-phpapp02

    30/51

    30

    Mapreduce I

    Map Reduction can be seen as a way to exploit massive parallelismby breaking a task down into constituent parts and executing onmultiple processors

    The Major Functions are MAP & REDUCE (with a number of intermediatsteps)

    MAP Break task down into parallel stepsREDUCE Combine results into final output

    Shown is a 2-pipeline Map Reduction (There are 24 Map Reductions in the indexing pipeline)Mappers & Reducers usually run on separate processors (90% loss of reducers job still completed!)

    SERVER HARDWARE

    RHEL 2.6.X PAE

    RACK

    INTERIOR NETWORK IPv6

    GFS / GFS II

    BigTable Mapreduce

    BigTableChubby Lock

    GOOGLE APP

    ENGINE

    Python, Java, C++,Sawzall, other

    DC

    GOOGLE APPSSEARCH

    INDEXCRAWLGMAIL...

    Architecture

    Python. Java.C++

    Exterior Network

    GWQ

  • 8/3/2019 theanatomyofthegooglearchitecturefinalv1-1-091210035101-phpapp02

    31/51

    31

    Mapreduce II

    LANGUAGE BINDINGS

    C++, Java, Python, Sawzall

    DEPLOYED

    Implemented 2004 before this MySQL?

    STATISTICS

    -In September 2009 Google ran 3,467,000 MR Jobs with an average 475 sec completion timeaveraging 488 machines per MR and utilising 25.5K Machine years

    -Technique extensively used by Yahoo with Hadoop (similar architecture to Google) and Facebook(since 06 multiple Hadoop clusters, one being 2500CPU/1PB with HBase).

    SERVER HARDWARE

    RHEL 2.6.X PAE

    RACK

    INTERIOR NETWORK IPv6

    GFS / GFS II

    BigTable Mapreduce

    BigTableChubby Lock

    GOOGLE APP

    ENGINE

    Python, Java, C++,Sawzall, other

    DC

    GOOGLE APPSSEARCH

    INDEXCRAWLGMAIL...

    Architecture

    Python. Java.C++

    Exterior Network

    GWQ

  • 8/3/2019 theanatomyofthegooglearchitecturefinalv1-1-091210035101-phpapp02

    32/51

    32

    Chubby Lock

    Googles Distributed File Locking Service for Bigtable

    -Provides Mutex Support for Data Access (atomic access to column data)

    - Used to synchronize access to shared resources

    - Consists of a Master and Slaves (designated by election)

    - Failover consists of a Slave replacing the functionality of a Master

    -- Also servers as an ultra-fast high availability File Server for small fines (100s bytes)

    - Provides an ACL for tablet authentication (row and column data)

    SERVER HARDWARE

    RHEL 2.6.X PAE

    RACK

    INTERIOR NETWORK IPv6

    GFS / GFS II

    BigTableMapreduceBigTable

    Chubby Lock

    GOOGLE APP

    ENGINE

    Python, Java, C++,Sawzall, other

    DC

    GOOGLE APPSSEARCH

    INDEXCRAWLGMAIL...

    Architecture

    Python. Java.C++

    Exterior Network

    GWQ

  • 8/3/2019 theanatomyofthegooglearchitecturefinalv1-1-091210035101-phpapp02

    33/51

    33

    GOOGLE WORKQUEUE

    Provides Resource Management for the ComputationalJobs

  • 8/3/2019 theanatomyofthegooglearchitecturefinalv1-1-091210035101-phpapp02

    34/51

    34

    GWQ Google Workqueue

    Batch Submission/Scheduler System

    -Software to submit Mapreduce Jobs to a Cell/Cluster

    -Arbitrates (process priorities) Schedules, Allocates Resources, process failover,Reports status, collects results

    - Often Workqueue overlaid on a GFS Cluster- i.e. GFS cluster not computational bound jobs also

    seems to match co-locate tasks near data = just diskI/O not Network I/O (on the Chunk Server?)

    - Workqueue can manage many tens of thousands of machines

    Launched via API or command line (sawzall example shown)

    saw --program code.szl --workqueue testing--input_files /gfs/cluster1/2005-02-0[1-7]/submits.* \--destination /gfs/cluster2/$USER/output@100

    SERVER HARDWARE

    RHEL 2.6.X PAE

    RACK

    INTERIOR NETWORK IPv6

    GFS / GFS II

    BigTable Mapreduce

    BigTableChubby Lock

    GOOGLE APP

    ENGINE

    Python, Java, C++,Sawzall, other

    DC

    GOOGLE APPSSEARCH

    INDEXCRAWLGMAIL...

    Architecture

    Python. Java.C++

    Exterior Network

    GWQ

  • 8/3/2019 theanatomyofthegooglearchitecturefinalv1-1-091210035101-phpapp02

    35/51

    35

    Section III Some more Glue

    SERVER HARDWARE

    RHEL 2.6.X PAE

    RACK

    INTERIOR NETWORK IPv6

    GFS / GFS II

    BigTable MapreduceBigTable

    Chubby Lock

    GOOGLE APPENGINE

    Python, Java, C++,Sawzall, other

    DC

    GOOGLE APPSSEARCHINDEX

    CRAWLGMAIL...

    Architecture

    Python. Java.C++

    Exterior Network

    GWQ

    1. Languages employed

    2. Development Environment

    3. Google App Engine

    4. Network Security

    5. Future Google Architecture Advances

    6. Odds n Sods

    7. DIY Google

  • 8/3/2019 theanatomyofthegooglearchitecturefinalv1-1-091210035101-phpapp02

    36/51

    36

    DEVELOPMENT LANGUAGES

    - Initially Python, Java, C++

    Usual Suspects

    - Sawzall (since 2006)

    - equivalent to Hadoops Pig Latin- written in C++ - interpreted bytecode output JITd

    An internal Procedural language employed to solve map reduction problems. Thefew published Google papers employ Sawzall in the algorithm examples. Runs in theMap phase, Aggregators run in the Reduce phase (from each Sawzall Map instance)to get the final output.

    - Transparent Parallelization no specialist Distrib Sys KnowledgeRequired (Good for developer)- Simple Datatypes 64-bit signed int, float, string, byte and a few unique such astime- Much STR regexp support

    - Compound Types arrays, tuples- typesafed (and declarations) similar to Pascal (Probably an LL(1) lang?)- similar to Algol, C Syntax (no pointers though!)- No Processing of exceptions (no exception handlers)- Shorter than corresponding C++ code by a factor of 10

    Early versions could not write into Bigtable. Now implemented?

    Output sometimes pipelined into MySQL for further analysis

    SERVER HARDWARE

    RHEL 2.6.X PAE

    RACK

    INTERIOR NETWORK IPv6

    GFS / GFS II

    BigTable MapreduceBigTableChubby Lock

    GOOGLE APPENGINE

    Python, Java, C++,Sawzall, other

    DC

    GOOGLE APPS

    SEARCHINDEXCRAWLGMAIL...

    Architecture

    Python. Java.C++

    Exterior Network

    GWQ

  • 8/3/2019 theanatomyofthegooglearchitecturefinalv1-1-091210035101-phpapp02

    37/51

    37

    GOOGLE APP ENGINEUsing Application Platform technology stack

    Allows a developer to leverage components of Google Technology (butnot necessarily primary Infrastructure i.e. The usual business resources)

    -Supports Python, Java

    - Bigtable support (via GQL)

    -Uses GFS as underlying FS usual Fault-tolerance/Load-balancing

    -Task Queue similar to GWQ?

    -Code exposed to Google

    - No support for subprocess spawning more importantly none of the

    google mapreduce library made available- isolates computational aspects to single servers but the I/O is probably thegoogle standard implementation underneath- therefore computationally intensive tasks more problematic= keeping your resource usage under controlSERVER HARDWARE

    RHEL 2.6.X PAE

    RACK

    INTERIOR NETWORK IPv6

    GFS / GFS II

    BigTableMapreduceBigTable

    Chubby Lock

    GOOGLE APPENGINE

    Python, Java, C++,Sawzall, other

    DC

    GOOGLE APPSSEARCH

    INDEXCRAWLGMAIL...

    Architecture

    Python. Java.C++

    Exterior Network

    GWQ

  • 8/3/2019 theanatomyofthegooglearchitecturefinalv1-1-091210035101-phpapp02

    38/51

    38

    Security

    Rack Board Level (possible scenario)

    gPXE on the board goes through DHCP/tftp sequence to pull over an encryptedimage (this is not expensive as is done once per boot and boots are not usual)

    Image is pulled from a Secure Image Distribution Server (and held encrypted onthese)

    Once at the board end the image is OTF decrypted and booted as normal RHEL

    02/09 Google Engineer didnt dispute this and seemed to concur adding that in-core encryption might be a possibility (R/T decryption might not be thatexpensive) this possibily means cryptology is used throughout the lifetime ofthe image including components outside the working-set but sensitive parts ofthe in-core OS (OTF decrypted)

    Enterprise

    Kerberos is used throughout the enterprise

    They have an Automated issuance system for SSL certificates, used by internal

    (secure) infrastructure to validate https/TLS and generic SSL connections.

    Complete internal network encryption unlikely due to latency introduced?

    Likely that one of the reasons failover between DCs problematic is the latency

    introduced due to the expense of Wide Area Encryption (essential)

    SERVER HARDWARE

    RHEL 2.6.X PAE

    RACK

    INTERIOR NETWORK IPv6

    GFS / GFS II

    BigTable Mapreduce

    BigTableChubby Lock

    GOOGLE APP

    ENGINE

    Python, Java, C++,Sawzall, other

    DC

    GOOGLE APPSSEARCH

    INDEXCRAWLGMAIL...

    Architecture

    Python. Java.C++

    Exterior Network

    GWQ

  • 8/3/2019 theanatomyofthegooglearchitecturefinalv1-1-091210035101-phpapp02

    39/51

    39

    Google Future Architecture

    - 99%ile latency for all data

  • 8/3/2019 theanatomyofthegooglearchitecturefinalv1-1-091210035101-phpapp02

    40/51

    40

    Odds n Sods

    borg google technology/architecture (is a cluster..)Borg: a hybrid protocol for scalable application-level multicast in peer-to-peer networks(WAN multimedia steaming)

    data cube google technology

    Have a global loadbalancer assume load balances across a unified namespaceprobably worldwide

    gmail designers implemented application level failover to move your session toan alternate DC in a seamless fashion to the end user.Probably all Google Apps will be able to migrate to an alternate DC cell (the application,and its GFS data if need be)

    MySQL isused for back-end sys admin stuff (high availability master-slaveimplementations) and post Bigtable processing

    Remote employee access is via VPN

    Sys Admins maintain 5 and 30 minute SLAs so on the ball

    Has its own internal archive.org equiv.

    SERVER HARDWARE

    RHEL 2.6.X PAE

    RACK

    INTERIOR NETWORK IPv6

    GFS / GFS II

    BigTable Mapreduce

    BigTableChubby Lock

    GOOGLE APP

    ENGINE

    Python, Java, C++,Sawzall, other

    DC

    GOOGLE APPSSEARCH

    INDEXCRAWLGMAIL...

    Architecture

    Python. Java.C++

    Exterior Network

    GWQ

  • 8/3/2019 theanatomyofthegooglearchitecturefinalv1-1-091210035101-phpapp02

    41/51

    41

    BUILD YOUR OWN GOOGLE

    The Basic Open Source Tools

    The Google Stack (vs Yahooish/Open Source)

  • 8/3/2019 theanatomyofthegooglearchitecturefinalv1-1-091210035101-phpapp02

    42/51

    42

    The Google Stack (vs Yahooish/Open Source)

    SERVER HARDWARE SERVER HARDWARE

    RHEL 2.6.X PAE CentOS 2.6.X PAE

    RACK RACK

    INTERIOR NETWORK IPv6 INTERIOR NETWORK IPv6

    GFS / GFS II HDFS (hadoop)

    Hadoop Framework

    MapreduceHbase (Bigtable equiv.)

    Mapreduce

    BigTableChubby Lock

    Pig Latin, Python, PHP, Java ....anything

    Python, Java, C++,Sawzall, other

    CLIENT APPLICATION

    DC DC

    GOOGLE APPSSEARCH

    INDEXCRAWLGMAIL...

    Conceptual Overview

    Google vs. Open Source

    Architecture

    Open Source(Yahooish)Architecture

    Exterior Network Exterior Network

    GWQ Job Tracker

    Googles

    Secret Sauce

    Hadoop

    Open Source(Other Tools such as crawlers, indexers readily available)

    BigTable

    Python, Java,C++,

    APP ENGINE

    Task Queue

  • 8/3/2019 theanatomyofthegooglearchitecturefinalv1-1-091210035101-phpapp02

    43/51

    43

    END

    (Thankyou)

  • 8/3/2019 theanatomyofthegooglearchitecturefinalv1-1-091210035101-phpapp02

    44/51

    44

    DIY GOOGLE

    What you require:Preferably 2 Machines + 100BT

    CentOS/RHEL(squid)ApacheHadoop (HDFS, Mapreduce, Pig, HBase)HDFS

    bmdiff/zippy compression libraryGoogle glibc/tcmalloc perftools

    Supporting stuff JRE etc

    Browser with Search Box

    pig mr call to scan a few filesprint results

    SERVER HARDWARE

    CentOS 2.6.X PAE

    RACK

    INTERIOR NETWORK IPv6

    HDFS (hadoop)

    Hadoop FrameworkMapreduce

    Hbase (Bigtable equiv.)

    Python, PHP, Java .... anything

    CLIENT APPLICATION

    DC

    Open Source(Yahooish)Architecture

    Exterior Network

    Job Tracker (Work Queue equiv.)

  • 8/3/2019 theanatomyofthegooglearchitecturefinalv1-1-091210035101-phpapp02

    45/51

    45

    DIY GOOGLE

    Install Hadoop and Pig on ClusterInstall eclipse and dependenciesInstall PigPen for eclipse and configure to cluster (NFS)

    SERVER HARDWARE

    CentOS 2.6.X PAE

    RACK

    INTERIOR NETWORK IPv6

    HDFS (hadoop)

    Hadoop FrameworkMapreduce

    Hbase (Bigtable equiv.)

    Python, PHP, Java .... anything

    CLIENT APPLICATION

    DC

    Open Source(Yahooish)Architecture

    Exterior Network

    Job Tracker (Work Queue equiv.)

  • 8/3/2019 theanatomyofthegooglearchitecturefinalv1-1-091210035101-phpapp02

    46/51

    46

    TEMPLATE

    - IPv6 enablement started 2008 (2009 finished?)

    - IRP OSPFGoogle authored RFC points towards OSPF

    SERVER HARDWARE

    RHEL 2.6.X PAE

    RACK

    INTERIOR NETWORK IPv6

    GFS / GFS II

    BigTableMapreduceBigTable

    Chubby Lock

    GOOGLE APPENGINE

    Python, Java, C++,Sawzall, other

    DC

    GOOGLE APPSSEARCH

    INDEXCRAWLGMAIL...

    Architecture

    Python. Java.C++

    Exterior Network

    GWQ

  • 8/3/2019 theanatomyofthegooglearchitecturefinalv1-1-091210035101-phpapp02

    47/51

    47

    DEVELOPMENT ENVIRONMENT bits&bobs

    A rare shot of some concrete google internal stuff (this of a GFS Master Server code executionfound as a perftools profiling example)

    Agile Methodologies Used (development iterations, teamwork, collaboration, and process adaptabilitythroughout the life-cycle of the project)

    Libraries are the predominant way of building programs

    An infrastructure handles versioning of applications so they can be release without a fear of breakingthings = roll out with minimal QA

    - Internal Code uses replacement libraries- Google as youd expect rewrites everything!- Hungarian Notation?- Work in small teams 3-5 peoplelikely few scutters know the big picture

    SERVER HARDWARE

    RHEL 2.6.X PAE

    RACK

    INTERIOR NETWORK IPv6

    GFS / GFS II

    BigTable

    Mapreduce

    BigTableChubby Lock

    GOOGLE APPENGINE

    Python, Java, C++,Sawzall, other

    DC

    GOOGLE APPSSEARCH

    INDEXCRAWLGMAIL...

    Architecture

    Python. Java.C++

    Exterior Network

    GWQ

  • 8/3/2019 theanatomyofthegooglearchitecturefinalv1-1-091210035101-phpapp02

    48/51

    48

    Internal Linux development and deployment

    Served as technical lead of team responsible for customizing and deploying Linux to internal systems and workstations.

    Fixed bugs and added enterprise features to several Linux components, including NFS, Kerberos, CUPS. All relevant patches werepushed to upstream maintainers, and most are in current released distributions.

    Developed and maintained systems to automate installation, updates, and upgrades of Linux systems.

    Developed IPv6 support for Linux load-balancing (ipvs).

    Managed several interns and contractors.

    loadbalancing user accounts within a datacenter, and coordinating with the global loadbalancer, which uses linear programming tooptimally allocate users. In particular, this avoids "shared fate" risks and reduces latency and costs incurred due to excessive transatlantic

    data traffic. Learned Sketchup so as to document the four dimensional data structures effectively The testing, evalulation, deployment, operations, and maintenance of Netscaler load balancers.

    automated Apache configuration reloader

    gPXE open-source network booting software

    GWS custom C++ webserver = not apache?

    Google 02/09 talk example was a Cluster is 30 racks (I believe this refers to Google). At a 40U rack 40Ux30racks = 1200 =approximately a MDC can assume each MDC is a Cluster/cell at architectural level

    Google engineer stated a DC is a collection of Modular Units(MDCs?) the picture (not above) illustrated suggested this.

  • 8/3/2019 theanatomyofthegooglearchitecturefinalv1-1-091210035101-phpapp02

    49/51

    49

    Some Pre Presentation Information

    1 Million GB = 1000TB = 1 PB (x 1000 = 1 EXABYTE) Internet Archive is around 3PB (2009)

    CLEAN UP BEFORE all the poorly sourced stuff

    Add lock service to bt to all slides Google rack server on rack page SSTable

    Google PROFITS US $16M A DAY

  • 8/3/2019 theanatomyofthegooglearchitecturefinalv1-1-091210035101-phpapp02

    50/51

    50

    Pre Presentation Disclaimer

    Put together in a week from knowing zero about Google

    I am not associated with Google

    Numbers are approximate but certainly are ball-park Googleoften delivers contradictory figures and uses many terms for

    some items - cell/cluster scheduler/workqueue (obfuscation?)

    Googles philosophy/paranoia of tell as little as possible (pausingpresenters and sideways answers) makes it hard to fill in some(significant) gapsinferences are sometimes drawn (in red)

    Google seem to design absolutely EVERYTHING themselvesfrom HW MB build, Racks, Switches(?), Software... So its hard tofind sources of information beyond broad concepts

  • 8/3/2019 theanatomyofthegooglearchitecturefinalv1-1-091210035101-phpapp02

    51/51

    51

    Bigtable VI

    -

    Latest (or at least since 2006..)

    -Increased Scalability (across Namespace/Datacenters)- i.e. Tablets spread over DCs for a table but expensive

    (both computationally and financially!)

    -Service Clusters (?)

    -Multiple Bigtable Clusters replicated throughout DC

    Current Status

    - Many Hundreds may be thousands of Bigtable Cells- Late 2009 stated 500 Bigtable clusters

    - At minimum scaled to many thousands of machinesper cell in production

    - Cells manage Managing 3-figure TB data (0.X PB)

    SERVER HARDWARE

    RHEL 2.6.X PAE

    RACK

    INTERIOR NETWORK IPv6

    GFS / GFS II

    BigTableMapreduce

    BigTableChubby Lock

    GOOGLE APP

    ENGINE

    Python, Java, C++,Sawzall, other

    DC

    GOOGLE APPSSEARCH

    INDEXCRAWLGMAIL...

    Architecture

    Python. Java.C++

    GWQ