TRANSCRIPT
How to build your own Delve: combining machine learning, big data and SharePoint
#SPSBE11
Joris Poelmans
April 18th, 2015
Agenda
Introduction to Delve
Office Graph
Big Data and Machine Learning
Building your own Delve - architectural concept
• Stay in the know: discover new information tailored to you from your network
• Find what you need: find just the right results from any source and take action
• Discover new connections: connect with the right experts and learn more about their content
Delve – Search and Discovery Across O365
Powered by Office Graph
What is The Office Graph?
Manager
Direct report
Works with
Shared with me
Viewed by me
Trending around me
Presented to me
Liked by me
Signals sent from Delve, Exchange, O365, …
Click person
Modify/Save
Elevate
Share
Follow
Like
Comments
Ignore
Presented to
Shown document
Open document
Shown board
++
Content and signals across O365 auto-populating the Office Graph insights
Insights derived with machine learning for proactive and intelligent experiences
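The model described on this slide (people and documents connected by typed edges, populated by signals) can be sketched as a small in-memory graph. This is an illustration only, not the Office Graph API: the `Edge` and `Graph` names, the edge labels and the sample data are all hypothetical, and "trending around me" is reduced to a simple view count among colleagues.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    """A directed, typed edge such as ("ann", "Viewed", "budget.xlsx")."""
    actor: str
    action: str
    target: str


class Graph:
    """Hypothetical in-memory stand-in for a signal-populated graph."""

    def __init__(self):
        self._edges = defaultdict(list)  # actor -> list of Edge

    def add_signal(self, actor, action, target):
        # Each signal (view, modify, like, ...) becomes one edge.
        self._edges[actor].append(Edge(actor, action, target))

    def targets(self, actor, action):
        # All targets reachable from `actor` over edges of type `action`.
        return [e.target for e in self._edges[actor] if e.action == action]

    def trending_around(self, actor):
        # Naive "trending around me": count documents viewed by the
        # people `actor` works with, most viewed first.
        counts = defaultdict(int)
        for colleague in self.targets(actor, "WorksWith"):
            for doc in self.targets(colleague, "Viewed"):
                counts[doc] += 1
        return sorted(counts, key=lambda d: (-counts[d], d))


graph = Graph()
graph.add_signal("ann", "WorksWith", "bob")
graph.add_signal("ann", "WorksWith", "cho")
graph.add_signal("bob", "Viewed", "budget.xlsx")
graph.add_signal("cho", "Viewed", "budget.xlsx")
graph.add_signal("cho", "Viewed", "roadmap.pptx")
trending = graph.trending_around("ann")
```

Here `trending` lists the documents viewed most often by Ann's colleagues first. A real implementation would also weight signals by type and recency.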
Big data is what happened when the cost of storing user data became cheaper than making the decision to throw it away
Transactions + Interactions + Observations = Big Data
(Figure: data sources grouped by tier, with volume growing from megabytes through gigabytes and terabytes to petabytes as data variety and complexity increase)
• ERP: purchase detail, purchase record, payment record
• CRM: segmentation, offer details, customer touches, support contacts
• Web: web logs, offer history, A/B testing, dynamic pricing, affiliate networks, search marketing, behavioral targeting, dynamic funnels
• Big Data: user generated content, mobile web, SMS/MMS, sentiment, external demographics, HD video, audio, images, speech to text, product/service logs, social interactions & feeds, business data feeds, user click stream, sensors / RFID / devices, spatial & GPS coordinates
Big Data Core Technology landscape
• Hadoop
  – Hadoop/HDFS + MapReduce
  – Key Big Data technology
• MPP
  – Appliances
  – Teradata
  – Microsoft PDW/APS
  – Oracle BDA X4-2
• NoSQL
  – New paradigm for storing data
  – 100+ Non-SQL DB’s and growing
• NewSQL
  – Support SQL querying
  – Internal architecture different from classic DBs
Modern Data Architecture
• Apache Hadoop is an open source framework that supports data-intensive distributed applications
  – Uses HDFS storage to enable applications to work with 1000s of nodes and petabytes of data using a scale-out model
  – Uses MapReduce to process data
• Inspired by Google
  – MapReduce
  – Google File System
• Related projects: HBase, Hive, Mahout, Pig, Sqoop, Ambari, Storm, Zookeeper, ... and many more
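To make the MapReduce processing model concrete, here is a minimal single-process sketch of the classic word count. It only imitates the map, shuffle and reduce phases in plain Python; it does not use Hadoop, and the function names are chosen for illustration.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    # Map: emit a (word, 1) pair for every word in one input record.
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    # Shuffle: group all emitted values by key, as the framework would
    # do between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    # Reduce: collapse all values for one key into a single result.
    return key, sum(values)


def word_count(lines):
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


counts = word_count(["big data and SharePoint", "Big Data and machine learning"])
```

On a cluster the map calls run in parallel on the nodes holding the data blocks, and the shuffle moves each key to the node that reduces it. The scale-out model comes from the fact that neither phase needs shared state.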
Hadoop components
(Diagram, grouped by layer)
• Hadoop core: Distributed Storage (HDFS), Distributed Processing (MapReduce)
• Data access: Hive, Pig, HBase, HCatalog
• Data integration: ODBC / Sqoop / REST / Flume
• Operations: Oozie, Ambari, Zookeeper
• Also shown: Mahout, Pegasus, RHadoop, Storm, Kafka
http://jopx.blogspot.be/2015/03/overview-of-apache-hadoop-components-in.html
Microsoft Azure HDInsight
• Hadoop in Azure
• Supports HBase as NoSQL columnar database on Azure Blobs
• Supports Storm as stream processing
• Able to leverage Azure Blob Storage
• Pay per use model
• Based on Hortonworks Data Platform
(Diagram: a Name Node and Job Tracker in front of four Data Nodes, each with a Task Tracker; for HBase, an HMaster and a coordination service in front of four Region Servers)
Hive
• Hadoop feature to perform data warehouse operations
• HiveQL
  – High-level, SQL-like language, abstraction over MapReduce
  – Supports equi-joins
  – Schema on read NOT schema on write
  – Automatically invokes MapReduce jobs
  – Much simpler than using MapReduce directly
• Metadata store
  – Contains descriptions of tables
• Acts as a bridge to many BI products which expect tabular data
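"Schema on read" means the raw data is stored untouched and a table description is applied only when a query runs. The sketch below imitates that idea in plain Python; it is not Hive, and the log contents, column names and the `hits_per_source` helper are hypothetical.

```python
import csv
import io

# Raw data is stored as-is; nothing validates it at write time.
RAW_LOG = (
    "2015-04-18,delve,open,12\n"
    "2015-04-18,search,click,7\n"
    "2015-04-19,delve,open,5\n"
)

# The schema (column names and types) lives apart from the data, much
# like table descriptions in a metadata store. Names are hypothetical.
SCHEMA = [("day", str), ("source", str), ("action", str), ("hits", int)]


def read_with_schema(raw, schema):
    # Apply the schema only while reading: parse and cast each field.
    for fields in csv.reader(io.StringIO(raw)):
        yield {name: cast(value) for (name, cast), value in zip(schema, fields)}


def hits_per_source(raw, schema):
    # Rough equivalent of: SELECT source, SUM(hits) ... GROUP BY source
    totals = {}
    for row in read_with_schema(raw, schema):
        totals[row["source"]] = totals.get(row["source"], 0) + row["hits"]
    return totals


totals = hits_per_source(RAW_LOG, SCHEMA)
```

Because the schema is separate from the stored bytes, the same file can be read later with a different schema without rewriting it, which is the practical difference from schema on write.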
Machine learning: finding the needle in the haystack
• Formal definition: “A computer program is said to learn from
experience E with respect to some class of tasks T and performance
measure P, if its performance at tasks in T, as measured by P,
improves with experience E” - Tom M. Mitchell
• Another definition: “The goal of machine learning is to program
computers to use example data or past experience to solve a given
problem.” – Introduction to Machine Learning, 2nd Edition, MIT Press
• ML often involves two primary techniques:
  – Supervised Learning: finding the mapping between inputs and outputs, using known correct output values to “train” a model
  – Unsupervised Learning: finding patterns in the input data without correct outputs (similar to density estimation in statistics)
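The difference between the two techniques can be shown with two tiny pure-Python examples. This is a sketch under simplifying assumptions: the supervised case is a one-variable least-squares line fit, the unsupervised case is k-means with k fixed at 2 on one-dimensional data, and the sample numbers are made up.

```python
def fit_line(xs, ys):
    # Supervised: learn slope and intercept from inputs xs and known
    # correct outputs ys, using ordinary least squares.
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x


def two_means(values, iterations=10):
    # Unsupervised: no labels; split 1-D values into two groups by
    # repeatedly assigning each value to the nearest centre (k-means, k=2).
    # Assumes the values are not all identical.
    low, high = min(values), max(values)
    for _ in range(iterations):
        near_low = [v for v in values if abs(v - low) <= abs(v - high)]
        near_high = [v for v in values if abs(v - low) > abs(v - high)]
        low = sum(near_low) / len(near_low)
        high = sum(near_high) / len(near_high)
    return low, high


# Training pairs generated from y = 2x + 1.
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
# Unlabelled values that happen to form two groups.
centres = two_means([1.0, 1.2, 0.8, 9.0, 9.5, 8.5])
```

`fit_line` needs the correct outputs to learn from, while `two_means` discovers the two groups from the inputs alone.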
Vision Analytics
Recommendation engines
Advertising analysis
Weather forecasting for business planning
Social network analysis
Legal discovery and document archiving
Pricing analysis
Fraud detection
Churn analysis
Equipment monitoring
Location-based tracking and services
Personalized Insurance
Typical machine learning algorithms
• Clustering (k-means, orthogonal partitioning, …)
• Association rule learning (Apriori)
• Regression (linear/logistic)
• Recommendation engines
• Classification (C4.5, decision trees, SVM, Naïve Bayes, AdaBoost, Random Forest, …)
• Similarity matching
• Neural networks
• Bayesian networks
• Genetic algorithms
• Ensembles
See http://machinelearningmastery.com/a-tour-of-machine-learning-algorithms/
and http://www.cs.umd.edu/~samir/498/10Algorithms-08.pdf
and http://www.quora.com/What-are-the-top-10-data-mining-or-machine-learning-algorithms
Doing recommendations – some approaches
• Collaborative filtering
• Feature based recommendations
• K-nearest neighbours
Collaborative filtering
• A set of items (books, beers, blogposts, …)
• Ratings from users
• Recommended items based on your ratings and other people’s ratings
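A minimal sketch of the idea, assuming a simple agreement measure between users: the predicted rating for an item is the average of other people's ratings for it, weighted by how closely each of them agrees with you on the items you both rated. The data, the `similarity` measure and the function names are hypothetical; production systems use more robust measures.

```python
# Hypothetical ratings: user -> {item: rating on a 1-5 scale}.
RATINGS = {
    "ann": {"book": 5, "beer": 1, "blogpost": 5},
    "bob": {"book": 4, "beer": 2, "blogpost": 4},
    "cho": {"book": 1, "beer": 5, "blogpost": 1},
    "dan": {"book": 5, "beer": 1},
    "eve": {"book": 1, "beer": 5},
}


def similarity(a, b):
    # Agreement between two users on the items both have rated:
    # 1.0 for identical ratings, approaching 0 as they diverge.
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    mean_abs_diff = sum(abs(a[i] - b[i]) for i in shared) / len(shared)
    return 1.0 / (1.0 + mean_abs_diff)


def predict(ratings, user, item):
    # Weighted average of other users' ratings for `item`, weighted by
    # how closely each of them agrees with `user`.
    num = den = 0.0
    for other, theirs in ratings.items():
        if other == user or item not in theirs:
            continue
        weight = similarity(ratings[user], theirs)
        num += weight * theirs[item]
        den += weight
    return num / den if den else None


score_dan = predict(RATINGS, "dan", "blogpost")
score_eve = predict(RATINGS, "eve", "blogpost")
```

Dan rates like Ann and Bob, so his predicted score for the blogpost is high; Eve rates like Cho, so hers is low. Neither has rated the blogpost, and no item metadata is needed.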
Feature based recommendations
• Use user’s ratings of items
  – Create an algorithm to define which features (metadata) of items the user likes
• Requires detailed information about items - content based
  – An item can be a person as well – see “People you may know”
• Most approaches combine “feature based” and “collaborative filtering”
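One simple way to learn which features a user likes is to build a per-user feature profile from rated items and score unseen items against it. The sketch below assumes items are described by sets of tags; the tags, document names and scoring rule are hypothetical choices for illustration.

```python
# Hypothetical item metadata: item -> set of feature tags.
FEATURES = {
    "doc-a": {"sharepoint", "search"},
    "doc-b": {"sharepoint", "governance"},
    "doc-c": {"hadoop", "hive"},
    "doc-d": {"search", "hadoop"},
    "doc-e": {"governance"},
}


def build_profile(user_ratings, features):
    # Learn a weight per feature: ratings above the user's mean push a
    # feature up, ratings below it push the feature down.
    mean = sum(user_ratings.values()) / len(user_ratings)
    profile = {}
    for item, rating in user_ratings.items():
        for tag in features[item]:
            profile[tag] = profile.get(tag, 0.0) + (rating - mean)
    return profile


def score(profile, item, features):
    # Average the learned weights of the item's features.
    tags = features[item]
    return sum(profile.get(t, 0.0) for t in tags) / len(tags)


def recommend(user_ratings, features):
    # Rank the items the user has not rated yet, best match first.
    profile = build_profile(user_ratings, features)
    unseen = [i for i in features if i not in user_ratings]
    return sorted(unseen, key=lambda i: score(profile, i, features), reverse=True)


ranked = recommend({"doc-a": 5, "doc-b": 4, "doc-c": 1}, FEATURES)
```

Unlike collaborative filtering, this needs no other users' ratings, but it depends entirely on the quality of the item metadata.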
K-Nearest Neighbours (Classification approach)
• Find ratings from people similar to you and see what they liked
  – Use similarity functions (Minkowski distance, RMSE, Pearson Correlation Coefficient, …)
• Take the average ratings of the k people most similar to you
  – Display the items with the highest averages
• Conclusion – requires solid background in Math and Statistics
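The steps on this slide can be sketched directly: compute a similarity to every other user, keep the k most similar, average their ratings for items you have not rated, and show the highest averages. The example uses the Pearson correlation coefficient as the similarity function; the ratings and the default `k=2` are made-up values.

```python
from math import sqrt

# Hypothetical ratings: user -> {item: rating on a 1-5 scale}.
RATINGS = {
    "you": {"a": 5, "b": 3, "c": 1},
    "ann": {"a": 5, "b": 4, "c": 1, "d": 5, "e": 2},
    "bob": {"a": 4, "b": 3, "c": 2, "d": 4, "e": 1},
    "cho": {"a": 1, "b": 3, "c": 5, "d": 1, "e": 5},
}


def pearson(a, b):
    # Pearson correlation over the items both users have rated.
    shared = sorted(set(a) & set(b))
    n = len(shared)
    if n < 2:
        return 0.0
    mean_a = sum(a[i] for i in shared) / n
    mean_b = sum(b[i] for i in shared) / n
    cov = sum((a[i] - mean_a) * (b[i] - mean_b) for i in shared)
    var_a = sum((a[i] - mean_a) ** 2 for i in shared)
    var_b = sum((b[i] - mean_b) ** 2 for i in shared)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / sqrt(var_a * var_b)


def recommend(ratings, user, k=2, top=1):
    mine = ratings[user]
    # The k users whose ratings correlate most strongly with ours.
    neighbours = sorted((u for u in ratings if u != user),
                        key=lambda u: pearson(mine, ratings[u]),
                        reverse=True)[:k]
    # Average the neighbours' ratings for items we have not rated yet.
    averages = {}
    for item in {i for u in neighbours for i in ratings[u]} - set(mine):
        votes = [ratings[u][item] for u in neighbours if item in ratings[u]]
        averages[item] = sum(votes) / len(votes)
    ranked = sorted(averages, key=lambda i: (-averages[i], i))
    return neighbours, ranked[:top]


neighbours, picks = recommend(RATINGS, "you")
```

Cho's ratings are the opposite of yours, so Cho is left out of the neighbourhood and only Ann's and Bob's ratings feed the averages.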
Machine Learning and Data Scientists
Developing predictive analytics and machine learning must be simpler; today it requires specialized skills:
• Data management
• Data exploration
• Math & statistics
• Domain expertise
• Machine learning
• Software development
• Data visualization
65% of enterprises feel they have a strategic shortage of data scientists, a role many did not know existed 12 months ago …
Microsoft Azure Machine Learning (Ctd.)
• Personalized Workspace
  – Combine R modules with Microsoft’s best in class algorithms running Xbox and Bing
  – Work with anyone, anywhere by simply sharing the workspace
• Easy Access to All Data
  – Drop in desktop data sets into the built-in storage space
  – Bring in cloud data with the ease of a drop down
• Deploy Models as Web Services
  – Operationalize in minutes and refine models at the speed of the market
• Partner Tools
  – ML partners enjoy SDK access for robust solutions
Microsoft Azure Machine Learning Studio
Microsoft Azure Machine Learning API service
Microsoft Azure Machine Learning SDK
Building your own Delve - high level architecture
(Diagram: pipeline stages, from left to right)
• Event producers: web logs, documents & metadata
• Transform
• Long-term storage: Azure SQL Database & Azure Storage
• Predictive Analytics: Azure Machine Learning
• Presentation and action
• On premise
Building your own Delve – remarks
• Graph technology left out for simplicity
  – Take a look at Neo4J or Pegasus on Hadoop if you are interested
• Not very realistic to rebuild Delve but possible to define point solutions
• If you still go ahead
  – Think about the end-to-end data pipeline
  – Fast track with Recommendation API in datamarket http://datamarket.azure.com/dataset/amla/recommendations
  – Cache recommendations for performance and cost optimization
  – Learn R or Python to extend AzureML capabilities
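The caching remark can be sketched as a small time-based cache placed in front of whatever service produces the recommendations. This is an assumption-laden illustration: `RecommendationCache` and `fake_fetch` are hypothetical names, the one-hour default lifetime is arbitrary, and the real service call is replaced by a stub.

```python
import time


class RecommendationCache:
    """Tiny time-based cache in front of a costly recommendation call."""

    def __init__(self, fetch, ttl_seconds=3600, clock=time.monotonic):
        self._fetch = fetch          # e.g. a paid web service call
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so it can be tested
        self._entries = {}           # user -> (expiry time, recommendations)

    def get(self, user):
        now = self._clock()
        entry = self._entries.get(user)
        if entry and entry[0] > now:
            return entry[1]          # still fresh: no remote call
        recommendations = self._fetch(user)
        self._entries[user] = (now + self._ttl, recommendations)
        return recommendations


calls = []


def fake_fetch(user):
    # Hypothetical stand-in for the real recommendation service.
    calls.append(user)
    return [f"doc-for-{user}-{len(calls)}"]


cache = RecommendationCache(fake_fetch, ttl_seconds=60)
first = cache.get("ann")
second = cache.get("ann")   # served from the cache, no second fetch
```

Recommendations rarely change from minute to minute, so a lifetime of minutes or hours trades a little freshness for fewer billable calls and faster page loads.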
Online Resources
• www.coursera.org (MOOC)
• Microsoft Virtual Academy
  – http://www.microsoftvirtualacademy.com/training-courses/getting-started-with-microsoft-azure-machine-learning
  – http://www.microsoftvirtualacademy.com/training-courses/implementing-big-data-analysis
• Cloud Data Science process - http://azure.microsoft.com/en-us/documentation/articles/machine-learning-data-science-how-to-create-machine-learning-service/
• Blogs
  – http://blogs.msdn.com/b/benjguin/
  – http://hortonworks.com/blog/
  – http://blogs.msdn.com/b/bigdatasupport/
  – http://blogs.msdn.com/b/big_data_france/
  – http://blogs.msdn.com/b/brian_swan/
  – http://blogs.msdn.com/b/mwinkle/
  – http://blogs.msdn.com/b/avkashchauhan/
  – http://blogs.msdn.com/b/carlnol/
  – http://blogs.technet.com/b/machinelearning/