Cloud computing and Hadoop introduction
Presentation given by Roman Valls at the PRBB Computational Genomics Technical Seminars in Barcelona.
BioCloud
Random large-scale tools that you can use
Disclaimer
I work in computer security research... no biology background anywhere in my field, not even in computer viruses ;) While working, I stumbled across Hadoop as a tool for scalable web spidering.
I'm not a bioinformatician (yet)... but I saw a powerful tool that could be useful in your research field(s):
"biodatacrunching" ?
Glossary
• Cluster (Beowulf)
• Grid
• Cloud
Biology and computer science
• Increasingly resource-hungry applications
  o Nowadays, they can be approached by "brute force"
  o More data means more "iron" to crunch it
• Neither the local IT team nor the budget can keep up with this pace
  o €€€ spent on new hardware
  o €€€ spent on IT personnel
  o Isn't it wiser to scale one machine at a time ?
• Developers get angry or frustrated over
  o Delays in software installation and configuration
  o Unscheduled downtime
  o Delays caused by insufficient computing power
What is cloud computing ?
In plain English: http://www.youtube.com/watch?v=XdBd14rjcs0
Infrastructure layer
Cloud niche
Infrastructure
• Amazon
  o EC2
  o S3
  o AMI
      Recently added bioinformatics appliances
      Public data sets
• Eucalyptus
  o Open source, server-side implementation of EC2 + AMI
  o We run it for our internal projects
• Enomalism
• Rightscale & Service Cloud
  o Tools/consultants for the upcoming cloud issues
Application layer
• Technologies for parallelizing applications
Application layer
• Hadoop
  o Open source MapReduce implementation
  o Java-based, but any language can be used
• Cloudburst-bio
  o Fine-tuned MapReduce implementation for bio (XXX)
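To make "any language can be used" concrete: Hadoop Streaming pipes records through any executable via stdin/stdout, with tab-separated key/value lines, and sorts them by key between the map and reduce steps. Below is a minimal Python sketch of that style. The nucleotide-counting task and the function names `mapper` and `reducer` are illustrative assumptions, not something from the talk.

```python
import sys
from itertools import groupby


def mapper(lines):
    """Emit one 'base<TAB>1' line for every nucleotide in every input line."""
    for line in lines:
        for base in line.strip().upper():
            yield f"{base}\t1"


def reducer(sorted_lines):
    """Sum the counts per key. Input must already be sorted by key,
    which is what the framework does between map and reduce."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for key, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(value) for _, value in group)}"


if __name__ == "__main__":
    # Hypothetical command-line roles: "map" or "reduce" read from stdin;
    # anything else runs a tiny local demo instead.
    role = sys.argv[1] if len(sys.argv) > 1 else "local"
    if role == "map":
        output = mapper(sys.stdin)
    elif role == "reduce":
        output = reducer(sys.stdin)
    else:
        # Local stand-in for: cat input | map | sort | reduce
        output = reducer(sorted(mapper(["ATGTTAG"])))
    for out_line in output:
        print(out_line)
```

The same script can be tried without a cluster by chaining it through a shell pipe with `sort` in the middle, which mimics the sorting the framework performs.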
Easy mapreduce
What is Hadoop ?
Quotation from official web page:
"Hadoop is a software platform that lets one easily write and run applications that process vast amounts of data."
"vast amounts of data (ATGTTAG...)" + "easily" = sounds good
doesn't it ? Or is it vaporware ?
What is it used for ?
• Tackle problems that involve several GB, TB or even PB of data
• The programmer does not have to care about job management
  o The focus is on data transformation, piping (useful work)
• Not intended for realtime processing
• Suitable for offloading long batch jobs from databases
What is MapReduce ?
Joel on Software explanation
Useful to crunch *tons* of data, parallelized by design
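The model can be sketched in a few lines of plain Python: a map step turns each input record into key/value pairs, a shuffle step groups the values by key, and a reduce step folds each group into a result. This is only an in-memory illustration under assumed names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) with k-mer counting as an assumed example task. A real framework runs the map and reduce calls on many machines, which is possible because each call is independent of the others.

```python
from collections import defaultdict
from itertools import chain


def map_phase(read, k=3):
    """Map: emit a (k-mer, 1) pair for every k-mer in one read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k], 1


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values of one key into a single count."""
    return key, sum(values)


def run_job(reads, k=3):
    """Run map, shuffle and reduce sequentially over a list of reads."""
    mapped = chain.from_iterable(map_phase(read, k) for read in reads)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(run_job(["ATGTTAG", "TTAGATG"]))
```

Because `map_phase` only looks at one read and `reduce_phase` only looks at one key, neither needs to know how the work is distributed. That is the sense in which the approach is parallelized by design.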
HDFS: Hadoop Distributed FileSystem
What about job control ?
Who is using it ?
• Google
  o Lots of internal projects (proprietary MapReduce)
      Gmail spam machine learning
      Google Maps
      ...
• Yahoo
  o Internal web graph (powers search engine)
  o Pig (SQL-like abstraction)
  o Sort 1 terabyte of data in 209 seconds
  o Big user graph, used for data mining (Hive)
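To put the sort figure in perspective, here is a back-of-the-envelope calculation of the aggregate throughput it implies. It assumes a decimal terabyte (10^12 bytes); the slide does not say which definition was used.

```python
# Assumption: 1 terabyte = 10**12 bytes (decimal definition).
TERABYTE_BYTES = 10**12
SORT_SECONDS = 209  # figure quoted on the slide

# Aggregate throughput across the whole cluster, in gigabytes per second.
throughput_gb_s = TERABYTE_BYTES / SORT_SECONDS / 10**9

print(f"{throughput_gb_s:.2f} GB/s aggregate")
```

That is roughly 4.8 GB/s sustained across the cluster, a rate reached by spreading the data over many machines.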
Hadoop has (lots of) new friends
• Nutch
• Mahout
• HBase
• Hama
• Pig
• ZooKeeper
• SmartFrog
• ...
Next steps ?
Identify resource-hungry applications (batch vs. interactive)
Migrate apps to the cloud
1) Allocate a fixed amount of money
2) Give Amazon EC2 a try
3) Optional: build a (local) Rocks cluster with a Eucalyptus cloud
Test, deploy, automate, automate and automate ... Puppet ?
(a few) References
http://www.cloudera.com/hadoop-training-thinking-at-scale
http://www.slideshare.net/tag/hadoop
http://sourceforge.net/projects/cloudburst-bio/
http://hadoop.apache.org/core/
http://people.apache.org/~rdonkin/hadoop-talk/hadoop.html