![Page 1: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a90b760da3da068b6a5c/html5/thumbnails/1.jpg)
Reactive Feature Generation with Spark and MLlib
Jeff Smith x.ai
![Page 2: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a90b760da3da068b6a5c/html5/thumbnails/2.jpg)
x.ai is a personal assistant who schedules meetings for you
![Page 3: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a90b760da3da068b6a5c/html5/thumbnails/3.jpg)
M A N N I N G
Jeff Smith
![Page 4: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a90b760da3da068b6a5c/html5/thumbnails/4.jpg)
REACTIVE MACHINE LEARNING
![Page 5: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a90b760da3da068b6a5c/html5/thumbnails/5.jpg)
Reactive Machine Learning
![Page 6: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a90b760da3da068b6a5c/html5/thumbnails/6.jpg)
Machine Learning Systems
![Page 7: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a90b760da3da068b6a5c/html5/thumbnails/7.jpg)
Traits of Reactive Systems
![Page 8: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a90b760da3da068b6a5c/html5/thumbnails/8.jpg)
Responsive
![Page 9: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a90b760da3da068b6a5c/html5/thumbnails/9.jpg)
Resilient
![Page 10: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a90b760da3da068b6a5c/html5/thumbnails/10.jpg)
Elastic
![Page 11: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a90b760da3da068b6a5c/html5/thumbnails/11.jpg)
Message-Driven
![Page 12: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a90b760da3da068b6a5c/html5/thumbnails/12.jpg)
Reactive Strategies
![Page 13: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a90b760da3da068b6a5c/html5/thumbnails/13.jpg)
Machine Learning Data
![Page 14: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a90b760da3da068b6a5c/html5/thumbnails/14.jpg)
Machine Learning Data
![Page 15: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a90b760da3da068b6a5c/html5/thumbnails/15.jpg)
Machine Learning Systems
![Page 16: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a90b760da3da068b6a5c/html5/thumbnails/16.jpg)
INTRODUCING FEATURES
![Page 17: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a90b760da3da068b6a5c/html5/thumbnails/17.jpg)
Microblogging DataSquawks Squawkers Super
![Page 18: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a90b760da3da068b6a5c/html5/thumbnails/18.jpg)
Feature Generation
Raw Data FeaturesFeature Generation Pipeline
![Page 19: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a90b760da3da068b6a5c/html5/thumbnails/19.jpg)
FEATURE TRANSFORMS
![Page 20: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a90b760da3da068b6a5c/html5/thumbnails/20.jpg)
Feature Generation
Raw Data FeaturesFeature Generation Pipeline
Extract Transform
![Page 21: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a90b760da3da068b6a5c/html5/thumbnails/21.jpg)
Features
trait FeatureType[V] { val name: String}
trait Feature[V] extends FeatureType[V] { val value: V}
![Page 22: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a90b760da3da068b6a5c/html5/thumbnails/22.jpg)
Features
case class IntFeature(name: String, value: Int) extends Feature[Int]
case class BooleanFeature(name: String, value: Boolean) extends Feature[Boolean]
![Page 23: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a90b760da3da068b6a5c/html5/thumbnails/23.jpg)
Named Transforms
def binarize(feature: IntFeature, threshold: Double) = { BooleanFeature("binarized-" + feature.name, feature.value > threshold)}
![Page 24: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a90b760da3da068b6a5c/html5/thumbnails/24.jpg)
Non-Trivial Transformsdef categorize(thresholds: List[Double]) = { (rawFeature: DoubleFeature) => { IntFeature("categorized-" + rawFeature.name, thresholds.sorted .zipWithIndex .find { case (threshold, i) => rawFeature.value < threshold }.getOrElse((None, -1)) ._2) }}
![Page 25: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a90b760da3da068b6a5c/html5/thumbnails/25.jpg)
Standardizing Namingtrait Named { def name(inputFeature: Feature[Any]) : String = { inputFeature.name + "-" + Thread.currentThread.getStackTrace()(3).getMethodName }}
object Binarizer extends Named { def binarize(feature: IntFeature, threshold: Double) = { BooleanFeature(name(feature), feature.value > threshold) }}
![Page 26: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a90b760da3da068b6a5c/html5/thumbnails/26.jpg)
Lineages"cleaned-normalized-categorized-interactions"
interactions categorized normalized cleaned
![Page 27: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a90b760da3da068b6a5c/html5/thumbnails/27.jpg)
PIPELINES
![Page 28: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a90b760da3da068b6a5c/html5/thumbnails/28.jpg)
Feature Generation
Raw Data FeaturesFeature Generation Pipeline
![Page 29: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a90b760da3da068b6a5c/html5/thumbnails/29.jpg)
Multi-Stage Generation
Raw Data ExtractedExtract Transform Features
![Page 30: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a90b760da3da068b6a5c/html5/thumbnails/30.jpg)
Pipeline Compositiondef extract(rawSquawks: RDD[JsonDocument]): RDD[IntFeature] = { ???} def transform(inputFeatures: RDD[Feature[Int]]): RDD[BooleanFeature] = { ???} val trainableFeatures = transform(extract(rawSquawks))
![Page 31: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a90b760da3da068b6a5c/html5/thumbnails/31.jpg)
Feature Generation
Raw Data FeaturesFeature Generation Pipeline
![Page 32: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a90b760da3da068b6a5c/html5/thumbnails/32.jpg)
Pipelines
Don’t orchestrate when you can compose
![Page 33: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a90b760da3da068b6a5c/html5/thumbnails/33.jpg)
Pipeline Failure
Raw Data FeaturesFeature Generation Pipeline
Raw Data FeaturesFeature Generation Pipeline
![Page 34: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a90b760da3da068b6a5c/html5/thumbnails/34.jpg)
Supervising Feature Generation
Raw Data FeaturesFeature Generation Pipeline
Supervision
![Page 35: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a90b760da3da068b6a5c/html5/thumbnails/35.jpg)
Supervising Feature Generation
Raw Data FeaturesFeature Generation Pipeline
Reactive DB Drivers Cluster Managers Feature Validation
![Page 36: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a90b760da3da068b6a5c/html5/thumbnails/36.jpg)
Reactive Database DriversCouchbase MongoDB Cassandra
![Page 37: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a90b760da3da068b6a5c/html5/thumbnails/37.jpg)
Reactive Database Drivers
val rawSquawks: RDD[JsonDocument] = sc.couchbaseView( ViewQuery.from("squawks", "by_squawk_id")) .map(_.id) .couchbaseGet[JsonDocument]()
![Page 38: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a90b760da3da068b6a5c/html5/thumbnails/38.jpg)
Cluster ManagersSpark Standalone Mesos YARN
![Page 39: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a90b760da3da068b6a5c/html5/thumbnails/39.jpg)
Supervising Feature Generation
Raw Data FeaturesFeature Generation Pipeline
Reactive DB Drivers Cluster Managers Feature Validation
![Page 40: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a90b760da3da068b6a5c/html5/thumbnails/40.jpg)
FEATURE COLLECTIONS
![Page 41: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a90b760da3da068b6a5c/html5/thumbnails/41.jpg)
Original Features
object SquawkLength extends FeatureType[Int]
object Super extends LabelType[Boolean]
val originalFeatures: Set[FeatureType] = Set(SquawkLength)val label = Super
![Page 42: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a90b760da3da068b6a5c/html5/thumbnails/42.jpg)
Basic Features
object PastSquawks extends FeatureType[Int]
val basicFeatures = originalFeatures + PastSquawks
![Page 43: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a90b760da3da068b6a5c/html5/thumbnails/43.jpg)
More Features
object MobileSquawker extends FeatureType[Boolean]
val moreFeatures = basicFeatures + MobileSquawker
![Page 44: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a90b760da3da068b6a5c/html5/thumbnails/44.jpg)
Feature Collections
case class FeatureCollection(id: Int, createdAt: DateTime, features: Set[_ <: FeatureType[_]], label: LabelType[_])
![Page 45: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a90b760da3da068b6a5c/html5/thumbnails/45.jpg)
Feature Collectionsval earlierCollection = FeatureCollection(101, earlier, basicFeatures, label)
val latestCollection = FeatureCollection(202, now, moreFeatures, label)
val featureCollections = sc.parallelize( Seq(earlierCollection, latestCollection))
![Page 46: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a90b760da3da068b6a5c/html5/thumbnails/46.jpg)
Feature Generation
Raw Data FeaturesFeature Generation Pipeline
![Page 47: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a90b760da3da068b6a5c/html5/thumbnails/47.jpg)
Fallback Collections
val FallbackCollection = FeatureCollection(404, beginningOfTime, originalFeatures, label)
![Page 48: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a90b760da3da068b6a5c/html5/thumbnails/48.jpg)
Fallback Collectionsdef validCollection(collections: RDD[FeatureCollection], invalidFeatures: Set[FeatureType[_]]) = { val validCollections = collections.filter( fc => !fc.features .exists(invalidFeatures.contains)) .sortBy(collection => collection.id) if (validCollections.count() > 0) { validCollections.first() } else FallbackCollection}
![Page 49: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a90b760da3da068b6a5c/html5/thumbnails/49.jpg)
VALIDATING FEATURES
![Page 50: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a90b760da3da068b6a5c/html5/thumbnails/50.jpg)
Supervising Feature Generation
Raw Data FeaturesFeature Generation Pipeline
Reactive DB Drivers Cluster Managers Feature Validation
![Page 51: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a90b760da3da068b6a5c/html5/thumbnails/51.jpg)
Predicting Super Squawkers
![Page 52: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a90b760da3da068b6a5c/html5/thumbnails/52.jpg)
Training Instances
val instances = Seq( (123, Vectors.dense(0.2, 0.3, 16.2, 1.1), 0.0), (456, Vectors.dense(0.1, 1.3, 11.3, 1.2), 1.0), (789, Vectors.dense(1.2, 0.8, 14.5, 0.5), 0.0))
![Page 53: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a90b760da3da068b6a5c/html5/thumbnails/53.jpg)
Selection Setup
val featuresName = "features"val labelName = "isSuper"
val instancesDF = sqlContext.createDataFrame(instances) .toDF("id", featuresName, labelName)
val K = 2
![Page 54: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a90b760da3da068b6a5c/html5/thumbnails/54.jpg)
Feature Selection
val selector = new ChiSqSelector() .setNumTopFeatures(K) .setFeaturesCol(featuresName) .setLabelCol(labelName) .setOutputCol("selectedFeatures")
![Page 55: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a90b760da3da068b6a5c/html5/thumbnails/55.jpg)
Feature Selection
val selector = new ChiSqSelector() .setNumTopFeatures(K) .setFeaturesCol(featuresName) .setLabelCol(labelName) .setOutputCol("selectedFeatures")
val selectedFeatures = selector.fit(instancesDF) .transform(instancesDF)
![Page 56: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a90b760da3da068b6a5c/html5/thumbnails/56.jpg)
Back to RDDs
val labeledPoints = sc.parallelize(instances.map({ case (id, features, label) => LabeledPoint(label = label, features = features)}))
![Page 57: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a90b760da3da068b6a5c/html5/thumbnails/57.jpg)
Validating Features
def validateSelection(labeledPoints: RDD[LabeledPoint], topK: Int, cutoff: Double) = { val pValues = Statistics.chiSqTest(labeledPoints) .map(result => result.pValue) .sorted pValues(topK) < cutoff}
![Page 58: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a90b760da3da068b6a5c/html5/thumbnails/58.jpg)
Persisting Validationscase class ValidatedFeatureCollection(id: Int, createdAt: DateTime, features: Set[_ <: FeatureType[_]], label: LabelType[_], passedValidation: Boolean, cutoff: Double)
![Page 59: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a90b760da3da068b6a5c/html5/thumbnails/59.jpg)
SUMMARY
![Page 60: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a90b760da3da068b6a5c/html5/thumbnails/60.jpg)
Machine Learning Systems
![Page 61: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a90b760da3da068b6a5c/html5/thumbnails/61.jpg)
Traits of Reactive Systems
![Page 62: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a90b760da3da068b6a5c/html5/thumbnails/62.jpg)
Reactive Strategies
![Page 63: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a90b760da3da068b6a5c/html5/thumbnails/63.jpg)
Machine Learning Data
![Page 64: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a90b760da3da068b6a5c/html5/thumbnails/64.jpg)
Feature Generation
Raw Data FeaturesFeature Generation Pipeline
![Page 65: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a90b760da3da068b6a5c/html5/thumbnails/65.jpg)
Reactive Feature Generation with Spark and MLlib
reactivemachinelearning.com @jeffksmithjr @xdotai