
White Paper: Hadoop for Intelligence Analysis

CTOlabs.com

July 2011

Inside:
• Apache Hadoop and Cloudera
• Intelligence Community Use Cases
• Context You Can Use Today

A White Paper providing context, tips and use cases on the topic of analysis over large quantities of data.


Hadoop and related capabilities bring new advantages to intelligence analysts

Executive Summary

Intelligence analysis is all about dealing with “Big Data,” massive collections of unstructured information. Already, the Intelligence Community works with far more data than it can process, and it continues to collect more through new and evolving sensors, open-source intelligence, better information sharing, and continued human information gathering. More information is always better, but to make use of it, analysis needs to keep pace through innovations in data management.

The Apache Hadoop Technology

Apache Hadoop is a project operating under the auspices of the Apache Software Foundation (ASF). The Hadoop project develops open source software for reliable, scalable, distributed computing. Hadoop is an exciting technology that can help analysts and agencies make the most of their data.

Hadoop can inexpensively store any type of information from any source on commodity hardware and allow for fast, distributed analysis run in parallel on multiple servers in a Hadoop cluster. Hadoop is reliable, managing and healing itself; it scales linearly, working as well with one terabyte of data across three nodes as with petabytes of data across thousands; it is affordable, costing much less per terabyte to store and process data than traditional alternatives; and it is agile, applying evolving schemas to data as it is read into the system.

Cloudera is the leading provider of Hadoop-based software and services and provides Cloudera’s Distribution including Apache Hadoop (CDH), which is the most popular way to implement Hadoop. CDH is an open system assembled from the most useful projects in the Hadoop ecosystem, bundled together and simplified for use. As CDH is available for free download, it’s a great place to start when implementing Hadoop, and Cloudera also offers support, management applications, and training to make the most of Hadoop.



Intelligence Community Use Cases

Given its work with Big Data, implementing Hadoop is a natural choice for the Intelligence Community. Most broadly, using Hadoop to manage data can lead to substantial savings. The Hadoop Distributed File System (HDFS) stores information for several cents per gigabyte per month, as opposed to traditional methods that cost dollars. HDFS achieves this by pooling commodity servers into a single hierarchical namespace, which works well with large files that are written once and read many times. With organizations everywhere on the lookout for efficient ways to modernize IT, this is an attractive feature: Hadoop can provide opportunities to manage the same data for a fraction of the cost.

The unpredictable nature of intelligence developments could also make good use of Hadoop’s scalability. Nodes can easily and seamlessly be added to or removed from a Hadoop cluster, and performance will scale linearly, providing the agility to rapidly mobilize resources. Consider how fast Intelligence Community missions must shift: just a month after Bin Laden’s death, the community was boresighted on operations in Libya. Who knows where the next focus area will be? Analysts need the ability to easily shift computational power from one mission to another. Hadoop also provides the agility to deal with whatever form or source of information a project requires. Most of the data critical to intelligence is unstructured, such as text, video, and images, and Hadoop can take data straight from the file store without any special transformation.

As this complex data can then be stored and analyzed together, Hadoop can help intelligence agencies overcome information overload. Current technology and intelligence gathering methods produce expansive amounts of data, far more than analysts can reasonably hope to monitor on their own. For example, at the end of 2010, General Cartwright noted that it took 19 analysts to process the information gathered by a Predator drone, and that with new “dense data sensor” technology under development, meshing together video feeds that cover a city while simultaneously intercepting cell phone calls and e-mails, it would take 2,000 analysts to manually process the data gathered from a single drone.

Algorithms are currently being developed to pull out and present to analysts what really matters from all of those video feeds and intercepts. Sorting through such an expanse of complex data is precisely the sort of challenge Hadoop was designed to tackle. The data also becomes more valuable when combined with information such as coordinates, times, and subjects identified in videos. Hadoop could then effectively sort and search through the new multi-level and multi-media intelligence almost instantly, despite the amount and type of material generated.

Hadoop’s low cost and speed open up other intelligence capabilities. Hadoop clusters have often been used as data “sandboxes” to test new analytics cheaply and quickly and see whether they yield results and can be widely implemented. For analysts, this means they could test theories and algorithms that have a low probability of success without wasting much time or resources, allowing them to be more creative and thorough. This in turn helps prevent the “failures of imagination” blamed for misreading the intelligence before the September 11 attacks and several subsequent plots.

Hadoop is also well suited for evolving analysis techniques such as Social Network Analysis and textual analysis, which are both being aggressively developed by intelligence agencies and contractors. Social Network Analysis uses human interactions, such as phone calls, text messages, meetings, and emails, to construct and decipher a social network, identifying leaders and the key nodes for linking members, linking groups, getting exposure, and contacting other important members.

Rarely are these members the figureheads in the media or even the stated leadership of terrorist and criminal organizations. Social Network Analysis is helpful for identifying high value targets to exploit or eliminate, but for sizable organizations it involves a tremendous amount of data, thousands of interactions of varying types among thousands of members and associates, making Hadoop an excellent platform. Some projects, such as Klout, which applies Social Network Analysis to social media to determine user influence, style, and role, already run on Hadoop.

Hadoop has also been proven as a platform for textual analysis. Large text repositories such as chat rooms, newspapers, or email inboxes are Big Data, expansive and unstructured, and hence well suited for analysis using Hadoop. IBM’s Watson, which beat human contestants on Jeopardy!, has been the most prominent example of the power of textual analysis, and ran on Hadoop. Watson was able to look through and interpret libraries of text to form the right question for the answers presented in the game, from history to science to pop culture, faster and more accurately than the human champions it went against. Textual analysis has value beyond game shows, however, as it can be used on forums and correspondence to analyze sentiment and find hidden connections to people, places, and topics.

For Further Reference

Many terms are used by technologists to describe the detailed features and functions provided in CDH. The following list may help you decipher the language of Big Data:

CDH: Cloudera’s Distribution including Apache Hadoop. It contains HDFS, Hadoop MapReduce, Hive, Pig, HBase, Sqoop, Flume, Oozie, Zookeeper and Hue. When most people say Hadoop they mean CDH.

HDFS: The Hadoop Distributed File System. This is a scalable means of distributing data that takes advantage of commodity hardware. HDFS ensures all data is replicated in a location-aware manner so as to lessen internal datacenter network load, which frees the network for more complicated transactions.
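As a rough illustration of that write-once, read-many pattern, the sketch below writes a small file into HDFS and reads it back through the Hadoop FileSystem Java API; the namenode URI and file path are assumptions made for the example, not details from this paper.

// Minimal sketch (illustrative, not from the paper): write-once / read-many access
// to HDFS through the Hadoop FileSystem Java API. The namenode URI and path are assumed.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsSketch {
    public static void main(String[] args) throws Exception {
        // Connect to the cluster; HDFS handles block placement and replication.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), new Configuration());

        // Write the file once; its blocks are replicated across datanodes automatically.
        Path path = new Path("/reports/report-001.txt");
        FSDataOutputStream out = fs.create(path);
        out.writeBytes("raw field report text\n");
        out.close();

        // Read it back; many readers anywhere in the cluster can do the same.
        BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(path)));
        System.out.println(in.readLine());
        in.close();
        fs.close();
    }
}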

Hadoop MapReduce: This process breaks down jobs across the Hadoop datanodes and then reassembles the results from each into a coherent answer.
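The canonical illustration is the word-count job: each mapper emits a count of one for every word in its slice of the input, and the reducers sum those counts per word. The sketch below follows the standard Hadoop MapReduce Java API; the class names and the command-line input and output paths are illustrative.

// Minimal word-count sketch using the standard Hadoop MapReduce Java API.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Map: emit (word, 1) for every token in the input split handled by this node.
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) { word.set(token); ctx.write(word, ONE); }
            }
        }
    }

    // Reduce: sum the counts for each word across all mappers.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            ctx.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory on HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory on HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Submitted with hadoop jar, the map tasks run in parallel on the nodes holding the data, and the summed counts are written back to the output directory on HDFS.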

Hive: A data warehouse infrastructure that leverages the power of Hadoop. Hive provides tools that make it easy to summarize and query data. Hive puts structure on the data and gives users the ability to query it using familiar methods (like SQL). Hive also allows MapReduce programmers to extend queries with their own custom logic.
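As a hedged sketch of what that looks like in practice, the snippet below submits a SQL-style query to Hive over JDBC; it assumes a HiveServer2 endpoint, the hive-jdbc driver on the classpath, and a hypothetical table named messages.

// Illustrative sketch: SQL-style querying of Hive over JDBC.
// Assumes a HiveServer2 endpoint at hive-host:10000 and a "messages" table (both hypothetical).
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuerySketch {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        Connection conn = DriverManager.getConnection(
                "jdbc:hive2://hive-host:10000/default", "analyst", "");
        Statement stmt = conn.createStatement();
        // Familiar SQL syntax; Hive translates the query into Hadoop jobs under the covers.
        ResultSet rs = stmt.executeQuery(
                "SELECT sender, COUNT(*) AS msgs FROM messages GROUP BY sender");
        while (rs.next()) {
            System.out.println(rs.getString("sender") + "\t" + rs.getLong("msgs"));
        }
        rs.close();
        stmt.close();
        conn.close();
    }
}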

Pig: A high-level data-flow language that enables advanced parallel computation. Pig makes parallel programming much easier.
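For a sense of the data-flow style, the hypothetical sketch below embeds a short Pig Latin script in Java via the PigServer API, grouping tab-separated (user, url) log records and counting hits per user; the paths and field names are assumptions for the example.

// Hypothetical sketch: a short Pig Latin data-flow run from Java via PigServer.
// Assumes "/data/logs" on HDFS holds tab-separated (user, url) records.
import java.io.IOException;
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigSketch {
    public static void main(String[] args) throws IOException {
        PigServer pig = new PigServer(ExecType.MAPREDUCE);
        // Load the raw records, group them by user, and count hits per user.
        pig.registerQuery("logs = LOAD '/data/logs' AS (user:chararray, url:chararray);");
        pig.registerQuery("by_user = GROUP logs BY user;");
        pig.registerQuery("hits = FOREACH by_user GENERATE group, COUNT(logs);");
        // Write the result back to HDFS; Pig compiles the data flow into parallel jobs.
        pig.store("hits", "/data/hits_per_user");
    }
}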

HBase: A scalable, distributed database that supports structured data storage for large tables. It is used when you need random, realtime read/write access to your Big Data. It enables hosting of very large tables, billions of rows by millions of columns, atop commodity hardware. It is a column-oriented store modeled after Google’s BigTable and is optimized for realtime data. HBase has replaced Cassandra at Facebook.
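To illustrate that random, realtime access, the sketch below writes and reads a single cell using the HBase Java client; note that this uses a client API newer than the 2011 paper would have shown, and the table name, column family, and row key are invented for the example.

// Hypothetical sketch: random, realtime read/write against an HBase table.
// Table "sightings", column family "loc", and the row key are illustrative assumptions.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("sightings"))) {

            // Write one cell: row key, column family "loc", qualifier "coords".
            Put put = new Put(Bytes.toBytes("subject-42#2011-07-01"));
            put.addColumn(Bytes.toBytes("loc"), Bytes.toBytes("coords"),
                          Bytes.toBytes("38.8977,-77.0365"));
            table.put(put);

            // Read it back by row key: a realtime lookup, not a batch job.
            Result result = table.get(new Get(Bytes.toBytes("subject-42#2011-07-01")));
            System.out.println(Bytes.toString(
                    result.getValue(Bytes.toBytes("loc"), Bytes.toBytes("coords"))));
        }
    }
}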



Sqoop: A tool enabling the import of data from SQL databases into Hadoop and the export of data back out.

Flume: A distributed, reliable and available service for efficiently collecting, aggregating and moving large amounts of streaming data.

Oozie: A workflow engine to enhance management of data processing jobs for Hadoop. It manages dependencies among HDFS, Pig and MapReduce jobs.

Zookeeper: A very high performance coordination service for distributed applications.

Hue: A browser-based desktop interface for interacting with Hadoop. It supports a file browser, job tracker interface, cluster health monitor and many other easy-to-use features.

More Reading

For more use cases for Hadoop in the intelligence community visit:

CTOvision.com - a blog for enterprise technologists with a special focus on Big Data.

CTOlabs.com - the repository for our research and reporting on all IT issues.

Cloudera.com - providing enterprise solutions around CDH plus training, services and support.


About the Author

Alexander Olesker is a technology research analyst at Crucial Point LLC, focusing on disruptive technologies of interest to enterprise technologists. He writes at http://ctovision.com.

Alex is a graduate of the Edmund A. Walsh School of Foreign Service at Georgetown University with a degree in Science, Technology, and International Affairs. He researches and writes on developments in technology and government best practices for CTOvision.com and CTOlabs.com, and has written numerous whitepapers on these subjects.

Alex has worked or interned in early childhood education, private intelligence, law enforcement, and academia, contributing to numerous publications on technology, international affairs, and security, and has lectured at Georgetown and in the Netherlands. Alex is also the founder of, and primary contributor to, an international security blog that has been quoted and featured by numerous pundits and by the War Studies blog of King’s College, London. Alex is a fluent Russian speaker and is proficient in French.

Contact Alex at [email protected]



For More Information

If you have questions or would like to discuss this report, please contact me. As an advocate for better IT in government, I am committed to keeping the dialogue open on technologies, processes and best practices that will keep us moving forward.

Contact: Bob Gourley, [email protected]

All information/data ©2011 CTOLabs.com.