Database Expert Q&A from 2600hz and Cloudant


Upload: joshua-goldbard

Post on 02-Dec-2014

DESCRIPTION

This is the Expert Q&A from 2600hz and Cloudant on Database in Telecom. If you are a service provider, MSP or anyone running a VoIP switch, you should definitely check this out.

TRANSCRIPT

Page 1: Database Expert Q&A from 2600hz and Cloudant

Powerful, Distributed, API Communications

Call-in Number: 513.386.0101 Pin: 705-705-141

Expert Q&A: Database Edition

May 31st, 2013

Page 2: Database Expert Q&A from 2600hz and Cloudant

Welcome

Page 3: Database Expert Q&A from 2600hz and Cloudant

Our Panelists

Joshua Goldbard

Marketing Ninja, 2600hz, Moderator

Darren Schreiber

Founder, 2600hz

Sam Bisbee

Cloudant

Page 4: Database Expert Q&A from 2600hz and Cloudant

Database: It’s all good until it isn’t

Page 5: Database Expert Q&A from 2600hz and Cloudant

Some background…

Page 6: Database Expert Q&A from 2600hz and Cloudant

What is Database?

• A Record of things Remembered or Forgotten

• Used to be Unbelievably hard, now it’s just hard sometimes

• Modern Databases are amazingly resilient

• Failure Mode still requires lots of attention

• In Distributed Environments…

• Database is inextricably linked to the network

• The network is always unreliable if public

Page 7: Database Expert Q&A from 2600hz and Cloudant

Masters and Slaves

• Databases have to Replicate

• Most Databases use a form of Master-Slave Relationship to manage replication and dedupe

• Masters are where new data is entered

• Then it’s mirrored out to the Slaves for storage

• If you lose access to the original Master, you can convert a Slave into a Master and restore operation

Durability

Page 8: Database Expert Q&A from 2600hz and Cloudant

Other Replication Strategies

• Other strategies exist, such as…

• Master-Master (What 2600hz Uses)

• Tokenized Exchange

• Time-delimited

• The most popular methods tend to be Master-Slave or Master-Master

Each Database has its advantages and tradeoffs. Once again, there is no Magic Bullet.

Page 9: Database Expert Q&A from 2600hz and Cloudant

Failure and Quorum

• When a Database needs to elect a new master…

• There are many different strategies

• Most involve the concept of quorum (figuring out where the greatest number of copies reside)

• Once Quorum is established, a new master is elected and (hopefully) operation can resume

• Quorum is different in Master-Master (Explain)
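The quorum idea on this slide can be sketched as a simple majority check. This is a minimal illustration under stated assumptions, not how any particular database runs its elections: the function names `has_quorum` and `elect_master`, the node names, and the "lowest name wins" rule are all hypothetical.

```python
def has_quorum(reachable_nodes, cluster_size):
    """Return True when the reachable nodes form a strict majority of the cluster."""
    return len(set(reachable_nodes)) > cluster_size // 2


def elect_master(partitions, cluster_size):
    """Return a master from the partition that holds quorum, or None if no side does.

    Picking the lowest node name is an arbitrary deterministic rule used only
    for illustration (an assumption, not a real election protocol).
    """
    for nodes in partitions:
        if has_quorum(nodes, cluster_size):
            return min(nodes)
    return None


# A 5-node cluster split 3/2: only the 3-node side may elect a master.
print(elect_master([{"db1", "db2", "db3"}, {"db4", "db5"}], cluster_size=5))

# A 4-node cluster split 2/2: neither side has a majority, so nobody is elected.
print(elect_master([{"db1", "db2"}, {"db3", "db4"}], cluster_size=4))
```

Requiring a strict majority means that at most one side of a split can ever elect a master, which is why an even split leaves the cluster without one.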

Page 10: Database Expert Q&A from 2600hz and Cloudant

CAP Theorem

Databases can have (at most) 2 out of 3 of the following:

• Consistency

• Availability

• Partition Tolerance

Modern Database Management is balancing between Consistency and Availability because all modern networks are unreliable
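One common way this balance between Consistency and Availability is expressed in Dynamo-style stores is with three numbers: N copies of each document, W acknowledgements required for a write, and R replies required for a read. The sketch below only shows the arithmetic; the function names are hypothetical and the values are examples, not recommendations from the panel.

```python
def overlap_guaranteed(n, r, w):
    """True when every read set must share at least one replica with every write set."""
    return r + w > n


def tolerated_failures(n, r, w):
    """How many of the n replicas can be down while reads and writes both still succeed."""
    return n - max(r, w)


for n, r, w in [(3, 2, 2), (3, 1, 1), (3, 3, 3)]:
    print(
        f"N={n} R={r} W={w}: "
        f"overlap={overlap_guaranteed(n, r, w)}, "
        f"can lose {tolerated_failures(n, r, w)} replica(s)"
    )
```

Raising R and W pushes toward consistency (reads are more likely to see the latest write) at the cost of availability; lowering them does the reverse.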

Page 11: Database Expert Q&A from 2600hz and Cloudant

Examples of Databases

Page 12: Database Expert Q&A from 2600hz and Cloudant

What is Important in a Database?

• Reliable Storage of Data?

• Fast Retrieval of Data?

• Fast Saving of Data?

• Resilience during failures?

• <other>

Page 13: Database Expert Q&A from 2600hz and Cloudant

Examples

• Buying tickets from Ticketmaster

• What’s important and why?

• Withdrawing money from a bank?

• Storing Call Forwarding Settings?

• Storing a List of Favorite Stocks?

Each Scenario has a different set of requirements and constraints. There is no silver bullet; if you could write one database for all these scenarios, you’d be rich.

Page 14: Database Expert Q&A from 2600hz and Cloudant

Which Database is Better?

• STUPID QUESTION

• But I thought there were no stupid questions?

• This is the only stupid question.

• The fight of which database is better is almost always silly

• Databases are a tool, to get a job done

• Like the previous examples, each job is different

• Each database stresses different pros/cons

Page 15: Database Expert Q&A from 2600hz and Cloudant

Let’s Get Technical!

Page 16: Database Expert Q&A from 2600hz and Cloudant

Trouble With Databases

• HUGE TOPIC (We’re only going to cover a little)

• Network Partitions

• Layer 1 disasters

• Flapping Internet (Special Class of Network Partitions)

Page 17: Database Expert Q&A from 2600hz and Cloudant

Network Partitions

• Common in Distributed Databases

• When Databases lose contact with each other they can partition

• Caused by unreliable or faulty network connections

• Databases can behave very weirdly when in partitions

Arguably, most of what a database admin does is prepare for network partitions and how to resolve them.

Page 18: Database Expert Q&A from 2600hz and Cloudant

Network without Partitions

Page 19: Database Expert Q&A from 2600hz and Cloudant

Network with Partitions

Page 20: Database Expert Q&A from 2600hz and Cloudant

Split-Brain

• During a partition, some databases will elect N masters, one for each partition in the network.

• When the partition is fixed, unless there is a pre-defined restoral procedure, there will be conflicts

• Databases have all kinds of strategies for handling WAN Split-brain failure, but you should understand them

Key Takeaway: No Database is perfect. Understand the automation but also understand the manual intervention procedure.
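To make the conflict problem concrete, here is a minimal sketch of what has to happen when a partition heals: compare what each side wrote against the last state both sides agreed on. The function name `merge_after_partition` and the sample settings are hypothetical, and real databases track revisions in more detail than this; it assumes stored values are never `None`, since `None` stands for "key absent".

```python
def merge_after_partition(base, side_a, side_b):
    """Three-way merge of key/value snapshots taken before and after a partition.

    Returns (merged, conflicts). A key lands in `conflicts` when both sides
    changed it to different values, which needs a policy or a human to resolve.
    """
    merged, conflicts = {}, {}
    for key in sorted(set(base) | set(side_a) | set(side_b)):
        original = base.get(key)
        a, b = side_a.get(key), side_b.get(key)
        if a == b:
            value = a            # both sides agree (or both deleted the key)
        elif a == original:
            value = b            # only side B changed it
        elif b == original:
            value = a            # only side A changed it
        else:
            conflicts[key] = (a, b)
            continue
        if value is not None:
            merged[key] = value
    return merged, conflicts


base = {"call_forward": "100", "voicemail": "on"}
side_a = {"call_forward": "200", "voicemail": "on"}
side_b = {"call_forward": "300", "voicemail": "off"}

merged, conflicts = merge_after_partition(base, side_a, side_b)
print(merged)     # settings that merged cleanly
print(conflicts)  # settings both masters changed differently
```

The automated part handles keys only one side touched; the `conflicts` dictionary is the part that needs the pre-defined restoral procedure.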

Page 21: Database Expert Q&A from 2600hz and Cloudant

Layer 1 Failures

Page 22: Database Expert Q&A from 2600hz and Cloudant

Layer 1 Failures

• Rut Roh

• Actual Physical Disaster

• No easy way out except…

• Don’t be in a Datacenter that’s hit by a disaster

OR

• Be Nimble enough to Evade Disaster

Page 23: Database Expert Q&A from 2600hz and Cloudant

Evading Disaster

• We’re not Magicians, we can’t simply predict disasters

• The next best thing is being able to move and move fast

• Kazoo requires one line of code to move

• Kazoo moves fast

• Moving the Database fast is awesome (Thanks BigCouch!)

During Hurricane Sandy, we cut our Datacenters away from Downtown New York to a Datacenter above the 100 year flood plain on the East Coast. Result: No Downtime.

Page 24: Database Expert Q&A from 2600hz and Cloudant

No Silver Bullets

• Layer 1 disasters are a humbling experience

• Don’t rely on DataCenters in the Path of a Storm

• Flooding will brick datacenters that have generators below ground

• To avoid being powerless in a disaster…

• Plan, Test, Analyze, Repeat

• Check out Netflix Simian Army for examples of tests

Page 25: Database Expert Q&A from 2600hz and Cloudant

Flapping

• Is it up? Is it Down? Around and Around it Goes, where it stops nobody knows…

• Flapping Internet is a special case of network partition or lost connectivity

• Flapping connections lose contact with other servers and then appear to come back online before going off again

Why is this bad?

Page 26: Database Expert Q&A from 2600hz and Cloudant

Fixing Flapping

• I’m trying to fix a partition

• The Network keeps going up and down

• As I repair my cluster, it keeps starting to repair and failing (by attempting to reintegrate the unreliable nodes)

Flapping nodes make everything awful
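One general way to keep a flapping node from derailing a repair is to require it to stay healthy for several checks in a row before letting it rejoin. The sketch below is an assumption-laden illustration of that idea, not a feature of BigCouch or Kazoo; the class name `FlapGuard` and the threshold of three checks are hypothetical.

```python
class FlapGuard:
    """Readmit a node only after `required` consecutive successful health checks."""

    def __init__(self, required=3):
        self.required = required
        self.streak = 0  # consecutive successful checks seen so far

    def record(self, healthy):
        """Record one health-check result; return True if the node may rejoin."""
        self.streak = self.streak + 1 if healthy else 0
        return self.streak >= self.required


guard = FlapGuard(required=3)
checks = [True, True, False, True, True, True]  # node flaps once, then stabilizes
for healthy in checks:
    print(healthy, "->", "rejoin" if guard.record(healthy) else "hold")
```

Any single failed check resets the streak, so a node that keeps bouncing never reaches the threshold and the rest of the cluster can finish repairing without it.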

Page 27: Database Expert Q&A from 2600hz and Cloudant

Why is the Network Difficult?

“Detecting network failures is hard. Since our only knowledge of the other nodes passes through the network, delays are indistinguishable from failure. This is the fundamental problem of the network partition: latency high enough to be considered a failure. When partitions arise, we have no way to determine what happened on the other nodes: are they alive? Dead? Did they receive our message? Did they try to respond? Literally no one knows. When the network finally heals, we'll have to re-establish the connection and try to work out what happened–perhaps recovering from an inconsistent state.”

-Kyle Kingsbury, Aphyr.com
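The point that delays are indistinguishable from failure can be shown with a timeout-based check, which is a common way to guess at node health. This is a minimal sketch; the function name `classify`, the node names, and the 5-second timeout are hypothetical.

```python
def classify(last_heartbeat, now, timeout):
    """Map each node to 'ok' or 'suspect' based on seconds since its last heartbeat."""
    return {
        node: "suspect" if (now - seen) > timeout else "ok"
        for node, seen in last_heartbeat.items()
    }


# "slow" is alive but its heartbeat is delayed by congestion; "dead" has crashed.
last_heartbeat = {"fine": 99.0, "slow": 94.0, "dead": 10.0}
print(classify(last_heartbeat, now=100.0, timeout=5.0))
```

From the observer's side, the slow node and the dead node get the same label: the only evidence available is that nothing arrived in time.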

Page 30: Database Expert Q&A from 2600hz and Cloudant

What does 2600hz use?

• Cloudant BigCouch

• NoSQL Database

• Master-Master

• Very sensibly designed for our use case

Page 31: Database Expert Q&A from 2600hz and Cloudant

Why BigCouch?

DEMANDS

1. On the Fly Schema Changes

2. Scale in a distributed fashion

3. Configuration changes will happen as we grow

4. Has to be equipment agnostic

5. Accessible Raw Data View

6. Simple to Install and Keep up

7. It can’t fail, ergo Fault-Tolerance

8. Multi-Master writes

9. Simple (to cluster, to backup, to replicate, to split)

TRADEOFFS

1. Eventual Consistency is OK

2. Nodes going offline randomly

3. Multi-server only

Why are we ok with these tradeoffs? They suit our use case.

Page 32: Database Expert Q&A from 2600hz and Cloudant

Let’s take some time to pontificate about Database at scale…

What are the first things you think of when you get errors reported from the Database? What’s your Thought Process?

Page 33: Database Expert Q&A from 2600hz and Cloudant

Recap

• Database is where you put stuff

• You want your Database not to die

• 2600hz uses BigCouch because it’s really awesome technology

• Great for our Use Case

• Easy to Administrate

• Resilient and quick-to-restore

Page 34: Database Expert Q&A from 2600hz and Cloudant

QUESTIONS???