Apache Hadoop HDFS
What is it?
What is it for?
Architecture
Resilience
Administration
Data access
Future changes?
HDFS What is it?
HDFS = Hadoop Distributed File System
It is a distributed file system
Runs on low cost hardware
It is open source
Written in Java
Fault tolerant
Designed for very large data sets
Tuned for high throughput
HDFS What is it for?
Designed for batch processing
Streaming access to data
Large data sizes, e.g. terabytes
Highly reliable using data replication
Supports very large node clusters
Supports large files
Supports file counts into the millions
HDFS Architecture
Has a master / slave architecture
A master NameNode
Controls file system operations
Maps data blocks to DataNodes
Logs all changes
Slave DataNodes
Store file blocks
Store replicated data
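The block-to-DataNode mapping kept by the NameNode can be pictured with a small toy model. This is only a sketch, not HDFS code: the block size, the replication factor, the node names and the function `plan_blocks` are all assumptions made for illustration, and the round-robin placement stands in for whatever placement policy a real cluster uses.

```python
import math

BLOCK_SIZE = 64 * 1024 * 1024   # assumed block size in bytes (illustrative only)
REPLICATION = 3                 # assumed replication factor (illustrative only)


def plan_blocks(file_size, datanodes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return {block_index: [datanode, ...]} for a file of file_size bytes.

    Hypothetical helper: simple round-robin placement, not the real HDFS policy.
    """
    if replication > len(datanodes):
        raise ValueError("replication factor exceeds number of DataNodes")
    n_blocks = max(1, math.ceil(file_size / block_size))
    block_map = {}
    for b in range(n_blocks):
        # Choose `replication` distinct nodes, starting at a rotating offset.
        block_map[b] = [datanodes[(b + r) % len(datanodes)] for r in range(replication)]
    return block_map


nodes = ["dn1", "dn2", "dn3", "dn4"]          # hypothetical DataNode names
block_map = plan_blocks(200 * 1024 * 1024, nodes)
for block, holders in block_map.items():
    print(block, holders)
```

A 200 MB file with the assumed 64 MB block size needs four blocks, and each block ends up on three different nodes, so losing any single node still leaves two copies of every block.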
HDFS Resilience
Data is replicated across DataNodes
Nodes may fail but data is still available
DataNodes report their state to the NameNode via periodic heartbeats
The master NameNode is a single point of failure
Data integrity is verified via checksums
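How heartbeats and checksums work together on a read can be sketched as follows. This is a toy model under stated assumptions, not the HDFS implementation: the timeout value, the node names and the helpers `live_nodes` and `read_block` are hypothetical, and CRC32 from Python's `zlib` module is used here simply as a convenient checksum.

```python
import zlib

HEARTBEAT_TIMEOUT = 30.0  # seconds; hypothetical value for this sketch


def live_nodes(last_heartbeat, now, timeout=HEARTBEAT_TIMEOUT):
    """Return the nodes whose last heartbeat is recent enough to count as alive."""
    return sorted(n for n, t in last_heartbeat.items() if now - t <= timeout)


def read_block(replicas, expected_crc, alive):
    """Return (node, data) from the first live replica whose CRC32 matches."""
    for node, data in replicas.items():
        if node in alive and zlib.crc32(data) == expected_crc:
            return node, data
    raise IOError("no live replica with a valid checksum")


payload = b"hello hdfs"
crc = zlib.crc32(payload)   # checksum recorded when the block was written

# dn1 has stopped sending heartbeats, dn2 holds a corrupted copy, dn3 is healthy.
heartbeats = {"dn1": 0.0, "dn2": 95.0, "dn3": 98.0}
replicas = {"dn1": payload, "dn2": b"hellO hdfs", "dn3": payload}

alive = live_nodes(heartbeats, now=100.0)
node, data = read_block(replicas, crc, alive)
print(alive, node)
```

The read skips the silent node and the corrupted copy and is served from the remaining healthy replica, which is the point of combining replication with heartbeats and checksums.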
HDFS Administration
Access via Java API
FS shell command language
HTTP browser
C wrapper for Java API
Space reclamation
Via control of replication factor
Deleted files sent to trash folder
Trash folder cleaned after configurable time
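The trash behaviour above can be sketched with a small in-memory model. Everything here is an assumption for illustration: the class `ToyTrash`, the one-hour retention interval and the file paths are hypothetical and do not reflect how HDFS stores or names its trash.

```python
TRASH_INTERVAL = 60 * 60  # seconds a deleted file stays recoverable; hypothetical value


class ToyTrash:
    """Toy file store with a trash folder that is purged after a set interval."""

    def __init__(self, interval=TRASH_INTERVAL):
        self.interval = interval
        self.files = {}   # path -> data
        self.trash = {}   # path -> (data, deletion_time)

    def delete(self, path, now):
        # Deleting moves the file to trash instead of freeing its space at once.
        self.trash[path] = (self.files.pop(path), now)

    def restore(self, path):
        # A file can be recovered for as long as it is still in the trash.
        data, _ = self.trash.pop(path)
        self.files[path] = data

    def purge(self, now):
        # Space is reclaimed only for entries older than the interval.
        expired = [p for p, (_, t) in self.trash.items() if now - t >= self.interval]
        for p in expired:
            del self.trash[p]
        return expired


fs = ToyTrash()
fs.files = {"/logs/a.log": b"a", "/logs/b.log": b"b"}
fs.delete("/logs/a.log", now=0)
fs.delete("/logs/b.log", now=3000)
fs.restore("/logs/b.log")
purged = fs.purge(now=3600)
print(purged, sorted(fs.files))
```

In this sketch the first file is purged once the interval has passed, while the second survives because it was restored before the purge ran.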
HDFS Future changes
Things they might consider for HDFS
File append
User quotas
File links
Standby nodes
Other Areas
Want to know about?
Big Data
Nutch
Solr
See my other presentations
Contact Us
Feel free to contact us at
www.semtech-solutions.co.nz
We offer IT project consultancy
We are happy to hear about your problems
You pay only for the hours you need to solve your problems