cloud computing and hadoop introduction

BioCloud Random large-scale tools that you can use

Upload: christianperez

Post on 10-Nov-2014

DESCRIPTION

Presentation done by Roman Valls at PRBB Computational Genomics Technical Seminars in Barcelona

TRANSCRIPT

Page 1: Cloud computing and Hadoop introduction

BioCloud

Random large-scale tools that you can use

Page 2: Cloud computing and Hadoop introduction

Disclaimer

I work in computer security research... there is no biology background anywhere in my field, not even in computer viruses ;) While working, I stumbled across Hadoop for scalable web spidering.

I'm not a bioinformatician (yet)... but I saw a powerful tool that could be useful in your research field(s):

"biodatacrunching" ?

Page 3: Cloud computing and Hadoop introduction

Glossary

• Cluster (Beowulf)
• Grid
• Cloud

Page 4: Cloud computing and Hadoop introduction

Biology and computer science

• Increasingly resource-hungry applications
  o Nowadays, they can be approached by "brute force"
  o More data means more "iron" to crunch it

• Neither the local IT team nor the budget can keep up with this pace
  o €€€ spent on new hardware
  o €€€ spent on IT personnel
  o Isn't it wiser to scale one machine at a time?

• Developers get angry or frustrated over
  o Delays in software installation and configuration
  o Unscheduled downtime
  o Delays caused by insufficient computing power

Page 5: Cloud computing and Hadoop introduction

What is cloud computing ?

In plain English: http://www.youtube.com/watch?v=XdBd14rjcs0

Page 6: Cloud computing and Hadoop introduction

Infrastructure layer

Page 7: Cloud computing and Hadoop introduction

Cloud niche

Page 8: Cloud computing and Hadoop introduction

Infrastructure

• Amazon
  o EC2
  o S3
  o AMI
    - Recently added bioinformatics appliances
    - Public data sets

• Eucalyptus
  o Open source, server-side implementation of EC2 + AMI
  o We run it for our internal projects

• Enomalism
• RightScale & Service Cloud
  o Tools/consultants for the upcoming cloud issues

Page 9: Cloud computing and Hadoop introduction

Application layer

• Technologies for parallelizing applications

Page 10: Cloud computing and Hadoop introduction

Application layer

• Hadoop
  o Open source MapReduce implementation
  o Java based, but any language can be used

• Cloudburst-bio
  o MapReduce implementation fine-tuned for bio (XXX)
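The "any language can be used" point usually refers to Hadoop Streaming, where the mapper and the reducer are ordinary programs that read lines from standard input and write tab-separated key/value lines to standard output. A minimal sketch in Python, assuming that convention: the function names and the sample reads are hypothetical, and the local sort only stands in for the shuffle that Hadoop performs between the two phases.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'base<TAB>1' line per nucleotide, as a streaming mapper would print."""
    for line in lines:
        for base in line.strip().upper():
            yield f"{base}\t1"


def reducer(sorted_lines):
    """Sum the counts of consecutive lines that share a key.

    Relies on the input being sorted by key, which is what the framework
    guarantees to a streaming reducer.
    """
    pairs = (line.split("\t") for line in sorted_lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(count) for _, count in group)}"


def run_locally(lines):
    """Simulate map -> sort -> reduce in a single process (no Hadoop involved)."""
    return list(reducer(sorted(mapper(lines))))


if __name__ == "__main__":
    reads = ["ATGTTAG", "GATTACA"]  # hypothetical sample input
    for out in run_locally(reads):
        print(out)
```

On a real cluster the two functions would live in separate scripts handed to the Hadoop Streaming jar through its -mapper and -reducer options; the jar location and the input/output paths depend on the installation.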

Page 11: Cloud computing and Hadoop introduction

Easy mapreduce

Page 12: Cloud computing and Hadoop introduction

What is Hadoop?

Quotation from the official web page:

"Hadoop is a software platform that lets one easily write and run applications that process vast amounts of data."

"vast amounts of data (ATGTTAG...)" + "easily" = sounds good, doesn't it? Or is it vaporware?

Page 13: Cloud computing and Hadoop introduction

What is it used for?

• Attack problems that involve several GB, TB or even PB of data
• The programmer does not have to care about job management
  o The focus is on data transformation and piping (the useful work)
• Not intended for real-time processing
• Suitable for offloading long batch jobs from databases

Page 14: Cloud computing and Hadoop introduction

What is MapReduce?

Joel on Software explanation

Useful to crunch *tons* of data; parallelized by design
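To make "parallelized by design" concrete: a MapReduce job is a map function applied independently to each input record, a shuffle that groups the emitted values by key, and a reduce function applied independently to each group. Below is a minimal single-process sketch in Python. The k-mer counting task, the function names and the sample sequences are illustrative assumptions, not taken from the talk.

```python
from collections import defaultdict


def map_kmers(sequence, k=3):
    """Map step: emit (k-mer, 1) for every length-k window of one sequence."""
    return [(sequence[i:i + k], 1) for i in range(len(sequence) - k + 1)]


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce step: collapse the values of one key into a single total."""
    return key, sum(values)


def map_reduce(sequences, k=3):
    # Each map call only sees its own record, so the calls are independent.
    emitted = [pair for seq in sequences for pair in map_kmers(seq, k)]
    grouped = shuffle(emitted)
    # Each reduce call only sees one key's values, so these are independent too.
    return dict(reduce_counts(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    counts = map_reduce(["ATGTTAG", "TTAGATG"])  # hypothetical sample input
    print(counts)
```

Because no map call depends on another and no reduce call depends on another, a framework is free to spread them over as many machines as are available. Here everything runs in one process only to keep the sketch short.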

Page 15: Cloud computing and Hadoop introduction

HDFS: Hadoop Distributed FileSystem

Page 16: Cloud computing and Hadoop introduction

What about job control?

Page 17: Cloud computing and Hadoop introduction

Who is using it ?

• Google
  o Lots of internal projects (proprietary MapReduce)
    - GMail spam machine learning
    - Google Maps
    - ...

• Yahoo
  o Internal web graph (powers search engine)
  o Pig (SQL-like abstraction)
  o Sort 1 terabyte of data in 209 seconds

• Facebook
  o Big user graph, used for data mining (Hive)

Page 18: Cloud computing and Hadoop introduction

Hadoop has (lots of) new friends

• Nutch
• Mahout
• HBase
• Hama
• Pig
• ZooKeeper
• SmartFrog
• ...

Page 19: Cloud computing and Hadoop introduction

Next steps ?

Identify resource-hungry applications (batch vs interactive)

Migrate apps to the cloud
1) Allocate a fixed amount of money
2) Give Amazon EC2 a try
3) Optional: build a (local) Rocks cluster with a Eucalyptus cloud

Test, deploy, automate, automate and automate ... Puppet?

Page 20: Cloud computing and Hadoop introduction

(a few) References

http://www.cloudera.com/hadoop-training-thinking-at-scale
http://www.slideshare.net/tag/hadoop
http://sourceforge.net/projects/cloudburst-bio/
http://hadoop.apache.org/core/
http://people.apache.org/~rdonkin/hadoop-talk/hadoop.html
