PowerPoint presentation: download.microsoft.com/download/d/e/7/de7ae181-ee05-4699... Open source software...
US flight data for 20 years
Linear Regression on Arrival Delay
Run on a 4-core laptop with 16 GB RAM and a 500 GB SSD
R Open vs. Microsoft R Server
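A regression over 20 years of flight records does not have to hold every row in memory at once. The sketch below illustrates the general external-memory idea in Python: read the data in chunks, accumulate a handful of running sums, and solve for the coefficients at the end. It is only an illustration under stated assumptions, not how Microsoft R Server is implemented; the function name `fit_simple_regression_chunked` and the toy rows are hypothetical, and a single predictor is used to keep the arithmetic visible.

```python
from typing import Iterable, Sequence, Tuple

Chunk = Sequence[Tuple[float, float]]  # rows of (x, y)


def fit_simple_regression_chunked(chunks: Iterable[Chunk]) -> Tuple[float, float]:
    """Fit y = intercept + slope * x in one pass over row chunks.

    Only five running sums are kept, so memory use does not grow
    with the number of rows.
    """
    n = 0
    sum_x = sum_y = sum_xx = sum_xy = 0.0
    for chunk in chunks:
        for x, y in chunk:
            n += 1
            sum_x += x
            sum_y += y
            sum_xx += x * x
            sum_xy += x * y

    denom = n * sum_xx - sum_x * sum_x
    if n < 2 or denom == 0:
        raise ValueError("need at least two rows with distinct x values")

    # Closed-form least-squares solution from the accumulated sums.
    slope = (n * sum_xy - sum_x * sum_y) / denom
    intercept = (sum_y - slope * sum_x) / n
    return intercept, slope


# Toy rows (hypothetical), split into two chunks; they follow y = 1 + 2x.
chunks = [
    [(0.0, 1.0), (10.0, 21.0)],
    [(20.0, 41.0), (30.0, 61.0)],
]
intercept, slope = fit_simple_regression_chunked(chunks)
```

With several predictors the same pattern applies, except that the running sums become the cross-product matrix of the variables.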
DeployR
DevelopR
ConnectR
• High-speed & direct connectors
Available for:
• High-performance XDF
• SAS, SPSS, delimited & fixed format text data files
• Hadoop HDFS (text & XDF)
• Teradata Database & Aster
• EDWs and ADWs
• ODBC
ScaleR
• Ready-to-use high-performance big data big analytics
• Fully-parallelized analytics
• Data prep & data distillation
• Descriptive statistics & statistical tests
• Range of predictive functions
• User tools for distributing customized R algorithms across nodes
• Wide data sets supported – thousands of variables
DistributedR
• Distributed computing framework
• Delivers cross-platform portability
R+CRAN
• Open source R interpreter
• R 3.1.2
• Freely-available huge range of R algorithms
• Algorithms callable by RevoR
• Embeddable in R scripts
• 100% compatible with existing R scripts, functions and packages
RevoR
• Performance-enhanced R interpreter
• Based on open source R
• Adds high-performance math library to speed up linear algebra functions
Custom parallelization
PEMA-R API
rxDataStep
rxExec
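The custom parallelization tools listed above follow a common pattern: each worker processes one chunk into a small partial result, and the partial results are then combined into the final answer. The Python sketch below shows that pattern for the mean and sample variance. It is not the PEMA-R API, `rxDataStep` or `rxExec`; the names `process_chunk`, `combine` and `chunked_mean_variance` are hypothetical, every chunk is assumed to be non-empty, and a thread pool is used only to show the structure.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
from typing import List, Sequence, Tuple

# Partial result for one chunk: (count, mean, sum of squared deviations).
Partial = Tuple[int, float, float]


def process_chunk(chunk: Sequence[float]) -> Partial:
    """Summarise a single non-empty chunk independently of all others."""
    n = len(chunk)
    mean = sum(chunk) / n
    m2 = sum((v - mean) ** 2 for v in chunk)
    return n, mean, m2


def combine(a: Partial, b: Partial) -> Partial:
    """Merge two partial results without revisiting the raw rows."""
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta * delta * n_a * n_b / n
    return n, mean, m2


def chunked_mean_variance(
    chunks: List[Sequence[float]], workers: int = 4
) -> Tuple[float, float]:
    """Return (mean, sample variance) computed chunk by chunk."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(process_chunk, chunks))
    n, mean, m2 = reduce(combine, partials)
    return mean, m2 / (n - 1)


chunks = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0, 7.0, 8.0, 9.0]]
mean, variance = chunked_mean_variance(chunks)
```

Because `combine` only needs the partial results, the chunks could equally live on different cores or nodes; only three numbers per chunk have to be moved.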
Data step
Data import – Delimited, fixed, SAS, SPSS, ODBC
Variable creation & transformation
Recode variables
Factor variables
Missing value handling
Sort, merge, split
Aggregate by category (means, sums)
Descriptive statistics
Min/max, mean, median (approx.)
Quantiles (approx.)
Standard deviation
Variance
Correlation
Covariance
Sum of squares (cross-product matrix for set variables)
Pairwise cross tabs
Risk ratio & odds ratio
Cross-tabulation of data (standard tables & long form)
Marginal summaries of cross tabulations
Statistical tests
Chi Square Test
Kendall Rank Correlation
Fisher’s Exact Test
Student’s t-Test
Sampling
Subsample (observations & variables)
Random sampling
Predictive models
Sum of squares (cross-product matrix for set variables)
Multiple linear regression
Generalized linear models (GLM). Exponential family distributions: binomial, Gaussian, inverse Gaussian, Poisson, Tweedie. Standard link functions: cauchit, identity, log, logit, probit. User-defined distributions & link functions.
Covariance & correlation matrices
Logistic regression
Classification & regression trees
Predictions/scoring for models
Residuals for all models
Simulation
Simulation (e.g., Monte Carlo)
Parallel random number generation
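A Monte Carlo simulation splits naturally across workers as long as each worker draws from its own random stream. The Python sketch below estimates pi that way. It is an illustration only: the function names and the seeding scheme (one distinct seed per worker) are assumptions, and libraries built for parallel simulation typically use dedicated parallel generators rather than simply offsetting the seed.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from typing import Tuple


def count_hits(task: Tuple[int, int]) -> int:
    """Count random points in the unit square that fall inside the quarter circle."""
    seed, draws = task
    rng = random.Random(seed)  # private generator, not shared between workers
    hits = 0
    for _ in range(draws):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits


def estimate_pi(
    workers: int = 4, draws_per_worker: int = 50_000, base_seed: int = 12345
) -> float:
    """Estimate pi from independent per-worker simulations."""
    # One (seed, draws) task per worker; distinct seeds give distinct streams.
    tasks = [(base_seed + i, draws_per_worker) for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        hits = sum(pool.map(count_hits, tasks))
    return 4.0 * hits / (workers * draws_per_worker)


pi_estimate = estimate_pi()
```

Fixing the seeds makes the run reproducible, which is useful when simulation results need to be audited or compared across machines.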
Cluster analysis
K-Means
Classification
Decision trees
Decision forests
Gradient-boosted decision trees
Naïve Bayes
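Data scientists: analyze data interactively
SQL developers/DBAs: manage and analyze data
Extend to R
Examples: sales forecasting
Inventory optimization
Predictive maintenance
Credit card transaction protection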
Relational data
Analytics library
T-SQL interface
R integration
Built into SQL Server 2016
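Analyze transactional data in real time without moving the data
R combined with SQL Server's in-memory database
Data scientists: interactive, direct access to data; publish algorithms
Run R over the entire data set
Execution and testing both happen in the database
Deploy into the local database
SQL developers: manage and analyze data together more easily
Easily call R scripts or models
Call R code from T-SQL
DBA
Manage data more conveniently
Unified performance management
Securely manage R execution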
-- List the extended events exposed by the SQLSatellite package
SELECT o.name, o.description
FROM sys.dm_xe_objects AS o
JOIN sys.dm_xe_packages AS p
    ON o.package_guid = p.guid
WHERE o.object_type = 'event'
  AND p.name = 'SQLSatellite'
ORDER BY o.name;
Sensors
Machines
Data Suppliers
Legacy Sources
Data Sources
EDW ERP/MRP
SQL Server
Azure Data Platform
Business Analysts
Power Analysts (R Studio, DevelopR, etc.)
Line of Business users (Analytic Apps, Rules Engines, etc.)
Analytics Consumers
Math Servers and Clusters
Data
Models
Execution
Ingest
Scored Data
Structured Data
Events Stream
Processing
Models
Edge Computing
Scores
Visualization
Big Data
• Transformation
• Aggregation
• Exploration
• Modeling
• Model Evaluation
• Data Scoring
https://catalog.imagine.microsoft.com/en-us/Catalog/Product/105
https://msdn.microsoft.com/en-us/library/mt591993.aspx
https://blogs.msdn.microsoft.com/business-intelligence