BA
• Big Data are the new types of data that let go of the limitations we had to impose decades ago due to the state of hardware and software back then
• The main challenge is therefore unlearning said limitations, and learning to incorporate Big Data capabilities and agility into [policy] work
• Traditional reporting and BI work with “known knowns”. Big Data allows working with “known unknowns”, “unknown knowns” and “unknown unknowns”.
• Several distinct types of technology fall under the “Big Data” moniker, each with its own unique capabilities: Hadoop, NoSQL, Semantic, Graph
© Copyright Business Abstraction Pty Ltd 2014-2015
• Consists of tables tightly packed with data, specific type per row
• Tables identified and created in advance
• Tables populated from human input
• Tables used by filtering, grouping by rows, as well as performing a limited number of joins, for reports, OLAP etc
• Text data are supposed to be read by people
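The filter, group and join pattern described above can be sketched with Python's built-in sqlite3 module. This is a minimal illustration only: the customer and purchase tables and their rows are hypothetical, invented for the example.

```python
import sqlite3

# Hypothetical schema and rows, invented for illustration only.
# Note that both tables are identified and created before any data arrive.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, region TEXT);
    CREATE TABLE purchase (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
""")
conn.executemany(
    "INSERT INTO customer VALUES (?, ?, ?)",
    [(1, "Ann", "North"), (2, "Bob", "South"), (3, "Cy", "North")],
)
conn.executemany(
    "INSERT INTO purchase VALUES (?, ?, ?)",
    [(1, 1, 10.0), (2, 1, 5.0), (3, 2, 7.5), (4, 3, 2.5)],
)


def totals_by_region(connection):
    """Join the two tables, then group rows by region and sum the amounts."""
    rows = connection.execute("""
        SELECT c.region, SUM(p.amount)
        FROM purchase p JOIN customer c ON c.id = p.customer_id
        GROUP BY c.region
        ORDER BY c.region
    """).fetchall()
    return dict(rows)


report = totals_by_region(conn)
```

Every question asked of the data here has to be expressible against columns that were designed in advance, which is the limitation the following slides contrast with.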
• Data coming from all over the Internet
• Data from Internet of Things.
• Human circumstances
• XML structures
• Data come from someplace, designed by someone else
• Machine learning
• Clustering
• Graph algorithms
• Traditional for IT
• Fully defined data
• Traditional Database
• Master Data Model
• Data Warehouse
• New generation of ideas and technologies
• Presumes only part of information is known
• Internet
• Information across multiple enterprises
• Information extracted from texts
New generations of tools, often coming from Internet companies, designed for “New Data”
• Hadoop Distributed File System (HDFS)
• NoSQL: Cassandra, MarkLogic, Couchbase, DynamoDB
• Column-store RDB
• Semantic DBs
• Graph DBs
• Map/Reduce of different flavours
• XQuery
• SPARQL
• Gremlin
• Write anything associated with a primary key (akin to a file path)
• Distributed over commodity servers
• Highly concurrent write and read
• Everything is cheap – hardware, “design” etc
• However, small records have to be stored in SequenceFiles or MapFiles
• Anything at scale – can store files in Petabytes
• Designed for Map/Reduce batch work, data lakes
• Anything interactive requires massive hardware
The term “NoSQL” means “not relational”, and as such covers a lot of different models. Some of them are suited to the complexity of generic data storage. They are called “semi-structured” because, although individual data items are structures, the structures are not necessarily defined in advance.
NoSQL platforms combine Hadoop’s “store anything” capability with indexing and
• store and index XML or JSON documents (“trees”).
• A deep row store can be seen as a document database where the depth of the trees is limited to 2.
• Tables with named fields per row
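A minimal sketch of the "store anything, then index it" idea, assuming JSON documents. The DocumentStore class, its path syntax and the sample documents are hypothetical and do not reflect the API of any product named in these slides.

```python
import json
from collections import defaultdict


def flatten(node, prefix=""):
    """Yield (path, value) pairs for every leaf of a JSON-like tree."""
    if isinstance(node, dict):
        for key, child in node.items():
            yield from flatten(child, f"{prefix}/{key}")
    elif isinstance(node, list):
        for child in node:
            yield from flatten(child, prefix)
    else:
        yield prefix, node


class DocumentStore:
    """Hypothetical in-memory store: no schema is declared before writing."""

    def __init__(self):
        self.docs = {}
        self.index = defaultdict(set)  # (path, value) -> ids of matching documents

    def put(self, doc_id, text):
        doc = json.loads(text)
        self.docs[doc_id] = doc
        # Index every leaf, whatever shape the document happens to have.
        for path, value in flatten(doc):
            self.index[(path, value)].add(doc_id)

    def find(self, path, value):
        return sorted(self.index.get((path, value), set()))


store = DocumentStore()
store.put("a", '{"name": "Ann", "address": {"city": "Perth"}}')
store.put("b", '{"name": "Bob", "tags": ["vip", "new"], "address": {"city": "Perth"}}')
in_perth = store.find("/address/city", "Perth")
vips = store.find("/tags", "vip")
```

The two documents have different structures, yet both become searchable as soon as they are written. A deep row store would correspond to documents whose paths never go more than two levels deep.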
• “Interactive Hadoop”
• Low-granularity Hadoop
• Data Lake
• Operations DBs with complex data
• Data consolidation
• Dynamic Data Warehouse
• Operational Data Warehouse
• Data are presumed to be “forests” of “trees”; connected data are not handled as well
• A touch more expensive than Hadoop
Provide traditional RDB interface in the new world
• Different internal structure
• Less suitable for OLTP
• Suitable for sparse data – empty fields don’t take space or penalise reads
• Much faster for analytics, especially if only selected fields are used
• Analytics when schema is known
• Cannot do schema-on-read
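Why empty fields cost nothing and single-field analytics are fast can be seen in a toy column layout. This is a sketch under simplifying assumptions (in-memory dictionaries, no compression); the ColumnStore class and the sample rows are hypothetical.

```python
from collections import defaultdict


class ColumnStore:
    """Hypothetical column-oriented layout: one mapping per field."""

    def __init__(self):
        self.columns = defaultdict(dict)  # field -> {row_id: value}
        self.row_count = 0

    def insert(self, row):
        row_id = self.row_count
        self.row_count += 1
        for field, value in row.items():
            if value is not None:  # empty fields are simply not stored
                self.columns[field][row_id] = value
        return row_id

    def total(self, field):
        # Only the requested column is read; other fields are never touched.
        return sum(self.columns.get(field, {}).values())

    def stored_cells(self):
        return sum(len(column) for column in self.columns.values())


table = ColumnStore()
table.insert({"id": 1, "amount": 10.0, "note": None})
table.insert({"id": 2, "amount": None, "note": "late"})
table.insert({"id": 3, "amount": 2.5})
amount_total = table.total("amount")
cells = table.stored_cells()
```

Three rows with three possible fields would occupy nine cells in a row layout; here only the six non-empty values are stored, and summing "amount" reads two of them.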
Semantic DBs support the Resource Description Framework (RDF), originally created for Semantic Web metadata. They store information in Subject-Predicate-Object “triples”, the most flexible representation possible. Queries are written in SPARQL.
• Graph patterns
• Metadata for Hadoop/NoSQL. Lack of internal schema requires external metadata
• Do not scale as much
• Hype-contaminated: people who understand enterprise and understand Semantic Tech are rare
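The triple model and the graph-pattern style of query can be sketched in plain Python. This is not SPARQL and not a real triple store: the match and query helpers, the "?" variable convention and the sample triples are assumptions made for illustration.

```python
def match(triples, pattern):
    """Return one binding per triple matching the pattern; variables start with '?'."""
    results = []
    for triple in triples:
        binding = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A variable used twice in one pattern must bind to the same value.
                if binding.get(term, value) != value:
                    break
                binding[term] = value
            elif term != value:
                break
        else:
            results.append(binding)
    return results


def query(triples, patterns):
    """Join several patterns on shared variables, like a basic graph pattern."""
    solutions = [{}]
    for pattern in patterns:
        next_solutions = []
        for solution in solutions:
            # Substitute variables that earlier patterns have already bound.
            bound = tuple(solution.get(term, term) for term in pattern)
            for binding in match(triples, bound):
                next_solutions.append({**solution, **binding})
        solutions = next_solutions
    return solutions


# Hypothetical facts, stored as (subject, predicate, object).
triples = [
    ("ann", "worksFor", "acme"),
    ("bob", "worksFor", "acme"),
    ("acme", "locatedIn", "perth"),
    ("cy", "worksFor", "initech"),
    ("initech", "locatedIn", "sydney"),
]
perth_staff = query(triples, [("?p", "worksFor", "?org"), ("?org", "locatedIn", "perth")])
```

Adding a new kind of fact means appending one more triple; no table has to be altered, which is what makes the representation so flexible.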
Graph Databases see data as one huge graph. They are optimised for navigating the edges of the graph. Use Gremlin.
• Implementing Graph Analytics
• Bespoke graph logic
• Backend for general apps (if BASE jumping is too boring)
• Not as scalable as NoSQL
• Lack the declarative data type, pattern and rule definitions of Semantic DBs
• Depend on ability to build and maintain a graph
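"Navigating the edges" can be illustrated with a small adjacency structure and a breadth-first search. This is a Python sketch, not Gremlin; the Graph class and the sample people are hypothetical.

```python
from collections import defaultdict, deque


class Graph:
    """Hypothetical labelled directed graph stored as adjacency lists."""

    def __init__(self):
        self.out_edges = defaultdict(list)  # vertex -> [(label, target), ...]

    def add_edge(self, source, label, target):
        self.out_edges[source].append((label, target))

    def neighbours(self, vertex, label=None):
        """Follow outgoing edges, optionally only those with a given label."""
        return [t for l, t in self.out_edges.get(vertex, []) if label is None or l == label]

    def shortest_path(self, start, goal):
        """Breadth-first search: returns a path with the fewest hops, or None."""
        queue = deque([[start]])
        seen = {start}
        while queue:
            path = queue.popleft()
            if path[-1] == goal:
                return path
            for nxt in self.neighbours(path[-1]):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(path + [nxt])
        return None


g = Graph()
g.add_edge("ann", "knows", "bob")
g.add_edge("bob", "knows", "cy")
g.add_edge("ann", "knows", "dee")
g.add_edge("dee", "worksWith", "cy")
g.add_edge("cy", "knows", "eve")
path = g.shortest_path("ann", "eve")
```

Each hop is a direct lookup from a vertex to its edges, with no join over tables. The logic itself is bespoke code, which matches the point above about lacking the declarative definitions of Semantic DBs.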
Platform for massively parallel computation that enables effective sharing of workload between commodity servers.
• MapReduce
• YARN
• Apache Spark
• Batch jobs over massive data
• On-demand queries where some lag is acceptable
• Implementations have powerful Analytics/Machine Learning libraries
• Latency
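The map, shuffle and reduce steps can be sketched on a single machine. In this illustration a thread pool stands in for the commodity servers, which is an assumption made to keep the example self-contained; the function names are hypothetical and this is not the Hadoop or Spark API.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_words(chunk):
    """Map step: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_chunks):
    """Shuffle step: bring all values for the same key together."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_counts(item):
    """Reduce step: collapse the values of one key into a single count."""
    key, values = item
    return key, sum(values)


def word_count(chunks, workers=2):
    # Each chunk is mapped independently, so the work can be shared out.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_words, chunks))
        reduced = list(pool.map(reduce_counts, shuffle(mapped).items()))
    return dict(reduced)


counts = word_count(["big data big", "data lake", "big lake"])
```

Because no map call depends on another, adding workers shortens the batch. The shuffle between the two phases is one reason such jobs carry the latency noted above.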
• Ensure datasets are identifiable
• Capture metadata
• Ensure your data are not lost
• Profile data across field names, structures etc
• Locate data as needed
• As you learn more about data, build up your metadata
• Hadoop
• NoSQL
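Profiling data across field names can start very simply. The sketch below assumes records arrive as Python dictionaries (for example, parsed JSON); the profile function and the sample records are hypothetical.

```python
from collections import Counter


def profile(records):
    """Report, per field name, how many records carry it and with which value types."""
    field_counts = Counter()
    field_types = {}
    for record in records:
        for field, value in record.items():
            field_counts[field] += 1
            field_types.setdefault(field, set()).add(type(value).__name__)
    total = len(records)
    return {
        field: {"coverage": count / total, "types": sorted(field_types[field])}
        for field, count in field_counts.items()
    }


# Hypothetical records from two sources that disagree on structure.
records = [
    {"id": 1, "name": "Ann"},
    {"id": "2", "name": "Bob", "email": "bob@example.org"},
]
summary = profile(records)
```

The result is itself metadata: it shows that "id" is present everywhere but has inconsistent types, and that "email" appears in only half the records. It can be stored and extended as more is learned about the data.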
• A server in $1,000-$10,000 range
• 0.5TB – 25TB per server
• A lot of them if needed
• Doubling the number of servers reduces the time to execute the task by a factor of 2.
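The scaling rule above amounts to simple arithmetic: with ideal linear scaling, the time on n servers is the single-server time divided by n. The sketch below adds an optional fixed overhead term, which is an assumption of this example rather than something stated in the slides; the function name and figures are hypothetical.

```python
def estimated_hours(single_server_hours, servers, fixed_overhead_hours=0.0):
    """Ideal linear scaling: the parallel part of the work divides evenly across servers."""
    if servers < 1:
        raise ValueError("at least one server is required")
    # fixed_overhead_hours is an assumed non-parallel cost (job start-up, for example).
    return fixed_overhead_hours + single_server_hours / servers


four = estimated_hours(64, 4)
eight = estimated_hours(64, 8)
eight_with_overhead = estimated_hours(64, 8, fixed_overhead_hours=1.0)
```

With no overhead, going from 4 to 8 servers halves the estimate exactly. Any fixed cost makes the improvement somewhat smaller than a factor of 2.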
Perhaps more complex than learning
• There are a lot of data you do not know about which are available and can be used
• For many types of objects, it is natural to have uncommon attributes
• Data storage is cheap. It doesn’t cost much to store everything remotely related
• No massive pre-work.
• Ask everything
• Traditional reporting, BI
• Predictive analytics
• Data consolidation, Semantic Integration, Object-based Intelligence
• Clustering
• Straightforward operation, no design upfront
• Can take immensely complex metadata, like UML & BPMN models
• Apply OWL for classification
• SWRL builds complex linkages
• Refer to Classes defined by lower-level Ontologies rather than data
• Words to be converted to tags (URLs)
• Some words have multiple meanings
• Ontology provides possible tags for nouns
• Software tries to resolve expected predicates
• The tag that can find necessary relations (predicates) wins
• Use ontology to restrict search
• Much more flexible than “foreign key”
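The tagging steps above can be sketched as a small scoring routine: each candidate tag for a word is checked against the predicates the surrounding text expects, and the tag that supports the most of them wins. The ontology content, the example.org URLs and the resolve function are all hypothetical.

```python
# Hypothetical ontology: candidate tags (URLs) for an ambiguous word...
CANDIDATE_TAGS = {
    "jaguar": [
        "http://example.org/animal/Jaguar",
        "http://example.org/car/Jaguar",
    ],
}

# ...and the predicates each tag is allowed to take part in.
ALLOWED_PREDICATES = {
    "http://example.org/animal/Jaguar": {"eats", "livesIn"},
    "http://example.org/car/Jaguar": {"drivenBy", "manufacturedBy"},
}


def resolve(word, expected_predicates):
    """Pick the tag whose allowed predicates cover most of the expected ones."""
    best_tag, best_score = None, 0
    for tag in CANDIDATE_TAGS.get(word.lower(), []):
        score = len(ALLOWED_PREDICATES.get(tag, set()) & set(expected_predicates))
        if score > best_score:
            best_tag, best_score = tag, score
    return best_tag  # None when no candidate supports any expected predicate


car = resolve("Jaguar", ["drivenBy"])
animal = resolve("jaguar", ["eats", "livesIn"])
```

Unlike a foreign key, the link is not fixed in advance: the same word resolves to different tags depending on the relations found around it.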
• Information about data
• Traditional metadata was stored in the form of a data schema
• With schema-less storage, metadata should be stored separately
• Incremental discovery process requires Open World Assumption – you don’t know what other data are there.
• Reasoning to handle complexity
• Relationships as first-class citizens and the basis for classification
• Work in progress – not there yet
• Different data paradigms mandate different views
• SQL-like view of Big Data (Apache Hive, Apache Pig etc)
• Excel import
• Analytic visualisation frontends
• 30+ JavaScript libraries
• Presume development
• Mahout & other libraries
• Writing code in Scala, Java, Python, Groovy
• Better picture of the current state
• What-if prediction
• Researching impact
• Increasing the number of categories, by several orders of magnitude if necessary
• Common, meaningful view of individual, organisation etc
• Prevention of undesirable effects on insights, complex events and prediction
• Description Logic, while using First-Order Predicate Logic terminology
• Reduced for practical purposes
• Is not necessary to be productive
• Can be applied to anything
• Class can be derived depending on values
• A State is a Class
• New triples can be derived from existing
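Deriving a class from values, and new triples from existing ones, can be illustrated with plain forward chaining. This is a deliberately simplified stand-in, not an OWL or Description Logic reasoner; the rules, predicate names and the 30-day threshold are hypothetical.

```python
def overdue_rule(known):
    """Class derived from a value: accounts more than 30 days late are OverdueAccount."""
    return {(s, "type", "OverdueAccount") for s, p, o in known if p == "daysLate" and o > 30}


def subclass_rule(known):
    """Members of a class are also members of its superclasses."""
    subclass = {(s, o) for s, p, o in known if p == "subClassOf"}
    return {
        (x, "type", sup)
        for x, p, c in known if p == "type"
        for sub, sup in subclass if sub == c
    }


def infer(triples, rules):
    """Apply every rule repeatedly until no new triples appear."""
    known = set(triples)
    while True:
        derived = set()
        for rule in rules:
            derived |= rule(known)
        if derived <= known:
            return known
        known |= derived


# Hypothetical starting facts.
facts = {
    ("acct1", "daysLate", 45),
    ("acct2", "daysLate", 3),
    ("OverdueAccount", "subClassOf", "RiskyAccount"),
}
closure = infer(facts, [overdue_rule, subclass_rule])
```

Here "overdue" is a state of an account expressed as class membership, in line with "A State is a Class", and the RiskyAccount membership is a triple nobody entered by hand.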