cloud computing and hadoop introduction


Upload: christianperez

Post on 10-Nov-2014


DESCRIPTION

Presentation given by Roman Valls at the PRBB Computational Genomics Technical Seminars in Barcelona

TRANSCRIPT

Page 1: Cloud computing and Hadoop introduction

BioCloud

Random large-scale tools that you can use

Page 2: Cloud computing and Hadoop introduction

Disclaimer

I work in computer security research, with no biology background anywhere in my field, not even computer viruses ;) While working, I came across Hadoop as a tool for scalable web spidering.

I'm not a bioinformatician (yet)... but I saw a powerful tool that could be useful in your research field(s): 

"biodatacrunching" ?

Page 3: Cloud computing and Hadoop introduction

Glossary

• Cluster (Beowulf)
• Grid
• Cloud

Page 4: Cloud computing and Hadoop introduction

Biology and computer science

• Increasingly resource-hungry applications
o Nowadays, they can be approached by "brute force"
o More data means more "iron" to crunch it

• Neither the local IT team nor the budget can keep up with this pace
o €€€ spent on new hardware
o €€€ spent on IT personnel
o Isn't it wiser to scale one machine at a time ?

• Developers get angry or frustrated over
o Delays in software installation and configuration
o Unscheduled downtime
o Delays caused by insufficient computing power

Page 5: Cloud computing and Hadoop introduction

What is cloud computing ?

In plain English: http://www.youtube.com/watch?v=XdBd14rjcs0

Page 6: Cloud computing and Hadoop introduction

Infrastructure layer

Page 7: Cloud computing and Hadoop introduction

Cloud niche

Page 8: Cloud computing and Hadoop introduction

Infrastructure

• Amazon
o EC2
o S3
o AMI
  Recently added bioinformatics appliances
  Public data sets

• Eucalyptus
o Open source server-side implementation of EC2 + AMI
o We run it for our internal projects

• Enomalism
• RightScale & Service Cloud
o Tools/consultants for the upcoming cloud issues

Page 9: Cloud computing and Hadoop introduction

Application layer

• Technologies for parallelizing applications

Page 10: Cloud computing and Hadoop introduction

Application layer

• Hadoop
o Open source MapReduce implementation
o Java based, but any language can be used

• Cloudburst-bio
o MapReduce implementation fine-tuned for bio (XXX)
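
To illustrate the "any language can be used" point: Hadoop Streaming runs any executable as a mapper or reducer, feeding it records on stdin and reading tab-separated key/value lines from stdout. The sketch below is not from the original slides; the base-counting task and the names `mapper` and `reducer` are assumptions chosen only for illustration.

```python
import sys
from itertools import groupby


def mapper(lines):
    """Map step: emit one "base<TAB>1" line per nucleotide seen."""
    for line in lines:
        for base in line.strip().upper():
            if base in "ACGT":  # skip N and any other symbols
                yield f"{base}\t1"


def reducer(sorted_lines):
    """Reduce step: sum the counts per key.

    Assumes the input is already sorted by key, which is what the
    framework guarantees between the map and reduce phases.
    """
    pairs = (line.rstrip("\n").split("\t") for line in sorted_lines)
    for base, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{base}\t{sum(int(value) for _, value in group)}"


if __name__ == "__main__":
    # Hypothetical command-line convention: "map" or "reduce" as the only argument.
    if len(sys.argv) == 2 and sys.argv[1] in ("map", "reduce"):
        step = mapper if sys.argv[1] == "map" else reducer
        for out in step(sys.stdin):
            print(out)
```

Locally, the same flow can be imitated without a cluster by sorting the mapper output before handing it to the reducer, e.g. `list(reducer(sorted(mapper(["ATGTTAG"]))))`.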

Page 11: Cloud computing and Hadoop introduction

Easy mapreduce

 

Page 12: Cloud computing and Hadoop introduction

What is Hadoop ?

Quote from the official web page:

"Hadoop is a software platform that lets one easily write and run applications that process vast amounts of data."

"vast amounts of data (ATGTTAG...)" + "easily" = sounds good, doesn't it ?

Or is it vaporware ?

Page 13: Cloud computing and Hadoop introduction

What is it used for ?

• Attack problems that involve several GB, TB or even PB of data
• The programmer does not need to care about job management
o The focus is on data transformation, piping (useful work)
• Not intended for real-time processing
• Suitable for offloading long batch jobs from databases

Page 14: Cloud computing and Hadoop introduction

What is MapReduce

Joel on Software explanation

Useful to crunch *tons* of data, parallelized by design
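
A minimal, single-machine sketch of the MapReduce idea (not from the original slides): a map step emits key/value pairs, a shuffle step groups them by key, and a reduce step collapses each group. The k-mer counting task and every function name here are assumptions made for illustration; a real framework would run the map and reduce calls on many machines in parallel.

```python
from collections import defaultdict
from itertools import chain


def map_kmers(read, k=3):
    """Map step: emit a (k-mer, 1) pair for every k-length window in one read."""
    return [(read[i:i + k], 1) for i in range(len(read) - k + 1)]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce step: collapse the values of one key into a total."""
    return key, sum(values)


def map_reduce(reads, k=3):
    """Run map, shuffle and reduce in sequence on an in-memory list of reads."""
    mapped = chain.from_iterable(map_kmers(read, k) for read in reads)
    grouped = shuffle(mapped)
    return dict(reduce_counts(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(map_reduce(["ATGTTAG", "TTAGATG"]))
```

Each call to `map_kmers` depends only on its own read, and each call to `reduce_counts` only on its own key, which is why the work is "parallelized by design".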

Page 15: Cloud computing and Hadoop introduction

HDFS: Hadoop Distributed FileSystem

Page 16: Cloud computing and Hadoop introduction

What about job control ?

Page 17: Cloud computing and Hadoop introduction

Who is using it ?

• Google
o Lots of internal projects (proprietary MapReduce)
  Gmail spam machine learning
  Google Maps
  ...

• Yahoo
o Internal web graph (powers search engine)
o Pig (SQL-like abstraction)
o Sort 1 terabyte of data in 209 seconds

• Facebook
o Large user graph, used for data mining (Hive)
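
To put the Yahoo sort figure in perspective, here is a quick back-of-the-envelope calculation (not from the original slides). It assumes a terabyte means 10^12 bytes; the helper name `throughput_gb_per_s` is made up for this sketch.

```python
def throughput_gb_per_s(total_bytes, seconds):
    """Aggregate throughput in gigabytes (10^9 bytes) per second."""
    return total_bytes / seconds / 1e9


TERABYTE = 10**12  # assumption: decimal terabyte

if __name__ == "__main__":
    rate = throughput_gb_per_s(TERABYTE, 209)
    print(f"{rate:.2f} GB/s aggregate")  # prints "4.78 GB/s aggregate"
```

That is the rate of the whole cluster working together, far more than a single disk can deliver, which is the point of spreading the data across many machines.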

Page 18: Cloud computing and Hadoop introduction

Hadoop has (lots of) new friends

• Nutch
• Mahout
• HBase
• Hama
• Pig
• ZooKeeper
• SmartFrog
• ...

Page 19: Cloud computing and Hadoop introduction

Next steps ?

Identify resource-hungry applications (batch vs interactive)

Migrate apps to the cloud
1) Allocate a fixed amount of money
2) Give Amazon EC2 a try
3) Optional: Build a (local) Rocks cluster with Eucalyptus cloud

Test, deploy, automate, automate and automate ... Puppet ?

Page 20: Cloud computing and Hadoop introduction

(a few) References

http://www.cloudera.com/hadoop-training-thinking-at-scale
http://www.slideshare.net/tag/hadoop
http://sourceforge.net/projects/cloudburst-bio/
http://hadoop.apache.org/core/
http://people.apache.org/~rdonkin/hadoop-talk/hadoop.html