ML pipelines talk from DataFest 3 (in Russian)


Posted on 20-Mar-2017




TRANSCRIPT

ML Pipelines: Spark 2.0 vs Scikit-Learn
MTS, 2016 (dmitri.babaev@gmail.com)

Pipelines in scikit-learn and pipelines in Spark ML.

An ML pipeline is built from two kinds of components:

- Transformer: fit + transform
- Estimator: fit + predict

A pipeline follows the composite software design pattern: a chain of transformers ending in an estimator again behaves as a single estimator.

[Diagram: a pipeline in which Transformers 1-5 feed a feature union, and Estimators 1 and 2 are combined by a stacking estimator.]

Show me the code (scikit-learn):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, Imputer

ds = pd.read_csv('titanic.csv')
features = ds.drop(['survived', 'alive'], axis=1)

empty_space = FunctionTransformer(
    lambda x: x.replace(r'\s+', np.nan, regex=True), validate=False)
df2dict = FunctionTransformer(
    lambda x: x.to_dict(orient='records'), validate=False)

pl = Pipeline([
    ('empty_space', empty_space),
    ('to_dict', df2dict),
    ('dv', DictVectorizer(sparse=False)),
    ('na', Imputer(strategy='most_frequent')),
    ('gbt', GradientBoostingRegressor(
        n_estimators=100, learning_rate=0.02,
        random_state=1, max_depth=3)),
])

cv = cross_val_score(pl, features, ds.survived, cv=3, scoring='roc_auc')
cv.mean(), cv.std()
```

Spark ML

Spark ships two ML APIs: the older RDD-based one, and the pipelines API (Spark 1.6+) built on Spark DataFrames, where every stage consumes a DataFrame and produces a DataFrame. Its Transformer and Estimator abstractions mirror scikit-learn's. The examples below use Spark 2.0.

Titanic Spark ML pipeline:

```python
from pyspark.ml import Pipeline
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.sql import SparkSession

ss = SparkSession.builder.getOrCreate()
sdf = ss.read.csv('titanic.csv', header=True)

numCols = ['pclass', 'age', 'sibsp', 'parch', 'fare', 'alone']
for col in numCols:
    sdf = sdf.withColumn(col, sdf[col].astype('decimal'))
sdf = sdf.withColumn('survived', sdf['survived'].astype('int'))

categoricalCols = ['sex', 'embarked', 'class', 'deck', 'who', 'embark_town']
indexers = [
    StringIndexer(inputCol=col, outputCol=col + 'Idx', handleInvalid='skip')
    for col in categoricalCols]
idxCols = [col + 'Idx' for col in categoricalCols]

assembler = VectorAssembler(inputCols=idxCols + numCols, outputCol="features")
cl = GBTClassifier(labelCol="survived", maxIter=100, maxDepth=3, stepSize=0.02)
pl = Pipeline(stages=indexers + [assembler, cl])

sdf_fna = sdf.fillna(0).replace('', 'NA')
train_df, test_df = sdf_fna.randomSplit([0.7, 0.3])
m = pl.fit(train_df)
predictions = m.transform(test_df)

evaluator = BinaryClassificationEvaluator(
    labelCol="survived", rawPredictionCol="prediction",
    metricName="areaUnderROC")
evaluator.evaluate(predictions)
```

In Spark ML pipelines, fitting an Estimator produces a fitted Transformer named {estimator}Model; StringIndexer and OneHotEncoder are typical feature stages.

StringIndexer:

```python
si = StringIndexer(inputCol='in', outputCol='out')
rows = [
    {'in': 'm'}, {'in': 'm'}, {'in': 'f'}, {'in': 'f'}, {'in': 'm'},
]
df = ss.createDataFrame(rows)
si.fit(df).transform(df).toPandas()
```

```
in   out
m    0.0
m    0.0
f    1.0
f    1.0
m    0.0
```

Spark ML: not Big Data but Big Computations, e.g. Boruta-style feature selection, which is expensive on a single machine with scikit-learn but can be distributed with Spark.

```python
from pyspark.ml.feature import PolynomialExpansion
from pyspark.ml.linalg import Vectors

pe = PolynomialExpansion(degree=2, inputCol='in', outputCol='out')
rows = [{'in': Vectors.dense([2, 10, 20])}]
df = ss.createDataFrame(rows)
pe.transform(df).collect()[0].out.toArray()
# array([  2.,   4.,  10.,  20., 100.,  20.,  40., 200., 400.])
```

Polynomial features over the Titanic data:

```python
from pyspark.ml.classification import LogisticRegression

# only 2-category features can be used without binarization
categoricalCols = ['sex']  # , 'embarked', 'class', 'deck', 'who', 'embark_town'
indexers = [
    StringIndexer(inputCol=col, outputCol=col + 'Idx', handleInvalid='skip')
    for col in categoricalCols]
idxCols = [col + 'Idx' for col in categoricalCols]
numCols = ['pclass', 'age', 'sibsp', 'parch', 'fare', 'alone']
assembler = VectorAssembler(inputCols=idxCols + numCols, outputCol="features")
pe = PolynomialExpansion(degree=2, inputCol='features', outputCol='features_p')
cl = LogisticRegression(
    featuresCol='features_p', labelCol="survived", maxIter=10, regParam=0.1)
pl = Pipeline(stages=indexers + [assembler, pe, cl])
m = pl.fit(sdf.fillna(0).replace('', 'NA'))
```

Naming the expanded features and ranking them by absolute weight:

```python
fnames = idxCols + numCols
pnames = [
    n + '*' + n2
    for i, n in zip(range(len(fnames)), fnames)
    for n2 in (['1'] + fnames)[:i + 2]]
weights = m.stages[-1].coefficients.array
pd.DataFrame({
    'weights': weights,
    'importance': np.abs(weights),
    'names': pnames,
}).sort_values('importance', ascending=False)[:10]
```

```
importance  names          weights
0.746541    sexIdx*1        0.746541
0.746541    sexIdx*sexIdx   0.746541
0.190673    pclass*1       -0.190673
0.164902    parch*1         0.164902
0.109082    pclass*sexIdx   0.109082
0.080076    parch*sexIdx   -0.080076
0.075500    sibsp*sexIdx   -0.075500
0.067250    pclass*pclass  -0.067250
0.040980    parch*sibsp    -0.040980
0.031343    sibsp*pclass   -0.031343
```

Hyperparameter optimization:

- Tree-structured Parzen Estimator (Hyperopt)
- Gaussian process regression (Spearmint)
- Random forest regression (SMAC)

The End!
MTS, 2016 (dmitri.babaev@gmail.com)
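The Transformer (fit + transform) / Estimator (fit + predict) split described in the talk is the composite design pattern: a chain of transformers ending in an estimator itself exposes fit + predict. A minimal pure-Python sketch of that idea; the classes AddConstant, MeanThreshold, and SimplePipeline are hypothetical illustrations, not the scikit-learn or Spark API:

```python
class AddConstant:
    """Transformer: fit learns nothing, transform shifts every value."""
    def __init__(self, c):
        self.c = c

    def fit(self, X):
        return self

    def transform(self, X):
        return [x + self.c for x in X]


class MeanThreshold:
    """Estimator: fit learns the mean, predict labels values above it."""
    def fit(self, X):
        self.mean_ = sum(X) / len(X)
        return self

    def predict(self, X):
        return [1 if x > self.mean_ else 0 for x in X]


class SimplePipeline:
    """Composite: transformers followed by a final estimator,
    itself exposing fit + predict like any estimator."""
    def __init__(self, steps):
        self.steps = steps

    def fit(self, X):
        for step in self.steps[:-1]:
            X = step.fit(X).transform(X)
        self.steps[-1].fit(X)
        return self

    def predict(self, X):
        for step in self.steps[:-1]:
            X = step.transform(X)
        return self.steps[-1].predict(X)


pl = SimplePipeline([AddConstant(1), MeanThreshold()])
pl.fit([0, 1, 2, 3])
print(pl.predict([0, 3]))  # [0, 1]
```

Because the composite has the same interface as its parts, pipelines nest: a SimplePipeline could itself be a step inside another SimplePipeline.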
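The StringIndexer output shown in the transcript (m -> 0.0, f -> 1.0) follows from indexing labels by descending frequency: the most frequent label gets index 0.0. A small pure-Python sketch of that fit logic; the string_index helper is hypothetical, not part of Spark:

```python
from collections import Counter

def string_index(values):
    """Sketch of StringIndexer: assign each label an index by
    descending frequency (most frequent label -> 0.0)."""
    order = [v for v, _ in Counter(values).most_common()]
    mapping = {v: float(i) for i, v in enumerate(order)}
    return [mapping[v] for v in values]

print(string_index(['m', 'm', 'f', 'f', 'm']))  # [0.0, 0.0, 1.0, 1.0, 0.0]
```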
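The PolynomialExpansion result in the transcript, array([2, 4, 10, 20, 100, 20, 40, 200, 400]) for input [2, 10, 20], reflects Spark's term order at degree 2: for each feature it emits the feature itself, then its products with every earlier feature and with itself. This is the same order the talk's pnames comprehension reproduces when naming coefficients. A pure-Python sketch; poly_expand_2 is a hypothetical helper:

```python
def poly_expand_2(values, names):
    """Degree-2 polynomial expansion in Spark ML's term order:
    for each feature i, emit x_i (named 'x_i*1'), then x_i * x_j
    for every j <= i."""
    out_vals, out_names = [], []
    for i, (v, n) in enumerate(zip(values, names)):
        out_vals.append(v)
        out_names.append(n + '*1')
        for j in range(i + 1):
            out_vals.append(v * values[j])
            out_names.append(n + '*' + names[j])
    return out_vals, out_names

vals, names = poly_expand_2([2, 10, 20], ['a', 'b', 'c'])
print(vals)   # [2, 4, 10, 20, 100, 20, 40, 200, 400]
print(names)  # ['a*1', 'a*a', 'b*1', 'b*a', 'b*b', 'c*1', 'c*a', 'c*b', 'c*c']
```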
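The hyperparameter optimizers listed on the closing slide, TPE (Hyperopt), Gaussian process regression (Spearmint), and random forest regression (SMAC), all share one sequential model-based loop: fit a cheap surrogate model to past trials, then use it to pick the next point to evaluate. A toy sketch of that loop with a deliberately crude nearest-neighbour surrogate; the smbo function and its parameters are hypothetical, for illustration only:

```python
import random

def smbo(objective, candidates, n_iter=10, n_init=3, seed=1):
    """Toy sequential model-based optimization: evaluate a few random
    points, then repeatedly score the remaining candidates with a
    surrogate fitted to past trials and evaluate the best-looking one.
    Real tools use TPE, Gaussian processes, or random forests here."""
    rng = random.Random(seed)
    trials = [(x, objective(x)) for x in rng.sample(candidates, n_init)]
    for _ in range(n_iter - n_init):
        tried = {x for x, _ in trials}
        pool = [x for x in candidates if x not in tried]
        if not pool:
            break

        # crude surrogate: predict a candidate's loss as the loss of
        # its nearest already-evaluated point
        def surrogate(x):
            return min(trials, key=lambda t: abs(t[0] - x))[1]

        x = min(pool, key=surrogate)
        trials.append((x, objective(x)))
    return min(trials, key=lambda t: t[1])

best_x, best_y = smbo(lambda x: (x - 7) ** 2, list(range(20)))
```

The point of the surrogate is to spend expensive objective evaluations (here, a trivial quadratic; in practice, a full cross-validated model fit) only where the model of past trials suggests they will pay off.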