
Machine Learning in the Cloud: From a Local Scope to a Scalable Reality

Javier Fernandez

Abstract—While software engineering solutions are taking advantage of the scalability, efficiency and flexibility of cloud computing, machine learning is still in an incipient process of migration to the cloud. Academia and industry still present many colliding approaches with respect to data mining as a whole, from data analysis and processing to model building and deployment, although the number of cloud solutions for data-related tasks increases every day. Many tools have been developed to tackle the scalability, performance and complexity of algorithms, yet many challenges must be overcome to achieve a successful shift from local computers to the cloud. We present the current state of machine learning in the cloud, covering tools and methodologies, focusing on the main obstacles to be surmounted in order to make the shift from academia to industry while keeping a coordinated balance, and describing the most representative tools available together with detailed examples of how they are being applied in industry today.

Index Terms—Machine Learning, Cloud Computing, Distributed Systems, Object Recognition, Data Mining.


1 INTRODUCTION

Machine Learning has brought novel ways to approach problems where useful information and knowledge can be extracted in an automated fashion, quickly surpassing their manual counterparts. In the last decades we have seen great advances in tasks such as computer vision, speech recognition and natural language processing, propelled by the exponential advance of computer hardware and software technology. Beyond the more technical aspects of building machine learning algorithms, the whole methodology involving the analysis, preparation and modelling of data, as well as the deployment of final functional solutions, typically referred to as data mining, has undergone substantial improvements. However, while research in artificial intelligence is currently winning every contest there is in computer vision and speech recognition with the rise of deep learning, automated scoring of credit risk is now commonplace in the banking industry, search engines are getting increasingly better through user behavior data, recommender systems have become a standard in the e-commerce industry, and many other fields are eagerly looking to apply machine learning to everyday tasks, there is still no clear path and foundation to effectively take machine learning to the cloud while providing accuracy, scalability, accessibility and maintainability of the solutions.

Meanwhile, cloud computing has stopped being a utopia and has become a solid reality; it is increasingly common to find software engineering solutions deployed with cloud infrastructure providers such as Amazon Web Services and Google Cloud Platform, using platform-specific software maintained by a third party, as is typical in Platform as a Service models, or even buying already deployed and configurable software, as in Software as a Service. The cloud provides the advantages of scalability of solutions without hardware maintenance, ease of use, validated software and faster time-to-market for business. While it introduces differences in the computation, storage and networking models, it also provides economic benefits such as elasticity, savings in power, cooling and physical plant costs, lower operation costs, per-zone availability and disaster recovery [Armbrust et al., 2009]. As stated by Tim O'Reilly, the future belongs to services that respond in real time to information provided either by their users or by nonhuman sensors [Armbrust et al., 2009], and the cloud has much to do with that future.

• Javier Fernandez is an MSc student in Artificial Intelligence at UPC, UB and URV, Barcelona, Spain.
  E-mail: [email protected]

However, the incorporation of data mining as a whole into the cloud is still in its beginnings. Many efforts have been made to take the academic approach to data analysis, including tools and methodologies, into a scalable, maintainable and service-like schema of solutions. These cover optimizations for parallelizing computation, intermediate languages that implement the same algorithms with an orientation to hardware efficiency, improvements in data storage and streaming for handling massive amounts of data, and the popularization of common methods for data processing, visualization and modelling. This has also come at a cost in flexibility of implementations, choice of algorithms, programming languages, algorithm customization and the amount of data that can be used. Still, there is no single tool that provides a homogeneous pipeline without integrating multiple stacks of software.

In this work we present the current state of the incorporation of machine learning and data mining as a whole into the state of the art of cloud computing. The first section describes the main differences between academic and industrial approaches to machine learning, presenting common paths that are or can be followed, and the main challenges to deal with in order to achieve an integration of data mining into the cloud that satisfies both research and industrial interests. Then, some of the most important tools developed for each necessity of making this shift are described, in order to understand the approaches from a methodological, algorithmic and performance point of view. Finally, some concluding remarks are presented about the current and future scenario of cloud computing for data-related work.

2 FROM ACADEMIA TO INDUSTRY: MAIN CHALLENGES

Developments in artificial intelligence in the last two decades have brought increasing interest from a wide range of industries, given the business opportunities that arise from this technology. In particular, the capacity to extract information successfully from raw input in automated ways, together with the increasing availability of data, has produced a growing demand for scalable machine learning solutions. These expectations, however, collided with the typical approaches of machine learning researchers, who often prefer to code solutions in statistical computing or high-level languages such as R, Matlab and Python, which allow quicker implementations but often lead to ad-hoc, non-robust and non-scalable solutions [Sparks et al., 2013].

This contrast between industry's scalability needs and academia's need for high-level, quick implementations has given rise to multiple intermediate solutions that attempt to benefit both worlds, such as graph-based interfaces [Gonzalez et al., 2012], [Low et al., 2010], intermediate-level systems integrated with parallel computing technology like Mahout or Oryx on Hadoop [Owen et al., 2011], [White, 2009], [Owen, ], or more sophisticated systems that rely on optimizers to map high-level implementations onto distributed systems [Sparks et al., 2013]. Currently, there are also cloud services that provide different machine learning capabilities intended to be scalable and easy to use, with different offering scopes in the context of Infrastructure as a Service (IaaS), Platform as a Service (PaaS) and Software as a Service (SaaS), providing a new frontier that could be defined as Machine Learning as a Service. Some prominent examples are Amazon Machine Learning, Microsoft Azure Machine Learning, Google Prediction API, and IBM Watson [Almeida, ].

Despite the different approaches available, there is still no single solution that can capture the full analytics pipeline for machine learning [Weimer et al., 2011]. Moreover, most solutions involve critical tradeoffs between the availability of algorithms, the ease of implementation, the actual scalability of the system, the ease of understanding and transferring the implementation and knowledge, and the need for a strong background in distributed systems and low-level primitives for researchers, among many others. Below we detail the main aspects to be considered and further developed to reach scalable and widely viable machine learning in the cloud.

2.1 Scalability

Cloud computing has provided a fast-paced democratization and availability of hardware and software resources, reducing costs, providing infrastructure flexibility and multiplying exponentially the reach of the software business [Zhang et al., 2010]. These infrastructure capacities, the evolution of programming frameworks that can exploit parallelism, the increasing availability of datasets and information sources, and the abundance of platforms for building distributed software have generated increasing interest in scaling up machine learning [Bekkerman et al., 2011].

Beyond the physical capacity of actually scaling machine learning, there is a set of considerations that transform this possibility into a growing need, as covered by [Bekkerman et al., 2011]. One is the increasingly large number of data instances available, which makes it unfeasible or even impossible for a single computer to process them, for training and even testing purposes. This comes along with an increasing number of features, which in some cases can be handled in a distributed way by partitioning the computation across a set of machines. A typical example is the recent development in computer vision where individual pixels of images are used directly as features within large datasets of images, such as in Deep Convolutional Networks [Krizhevsky et al., 2012], or even as the initial input in larger image processing pipelines as in classic computer vision methods.
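
To make this feature-wise partitioning concrete, the following minimal sketch (our own illustration in NumPy, not taken from the cited works) splits a feature matrix column-wise across a handful of workers and sums their partial contributions; in a real cluster each slice would live on, and be processed by, a different machine.

```python
import numpy as np

def partial_dot(X_part, w_part):
    """Each worker computes the contribution of its feature slice."""
    return X_part @ w_part

def feature_partitioned_predict(X, w, n_workers=4):
    """Linear scores computed by summing per-partition contributions.

    X: (n_samples, n_features) data matrix, w: (n_features,) weights.
    In a cluster, each slice would be stored and processed remotely.
    """
    col_slices = np.array_split(np.arange(X.shape[1]), n_workers)
    partials = [partial_dot(X[:, cols], w[cols]) for cols in col_slices]
    return np.sum(partials, axis=0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X, w = rng.normal(size=(8, 10)), rng.normal(size=10)
    # The partitioned result matches the single-machine computation.
    assert np.allclose(feature_partitioned_predict(X, w), X @ w)
```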

Another typical issue is model and algorithm complexity, where computationally expensive processes are carried out to find highly accurate solutions, such as in Neural Networks (shallow or deep), Genetic Algorithms, optimization algorithms, and most methods that build non-linear models. Beyond the complexity of offline model building, there is an increasing need for online learning systems that can learn incrementally from data without requiring a full dataset re-training. This comes along with the need for some solutions to have low response times on new queries that can keep the pace of typical mobile and web solutions; such is the case of recommender systems commonly used in e-commerce platforms such as Amazon and eBay, as well as in more specific (Netflix) or general (Google Search) solutions. All these reasons multiply when taking into account that most of the best machine learning algorithms require intensive parameter selection and training procedures, as well as the repetitions and variations of model training required for statistically sound results, all implying heavier and more complex computation that impacts scalability.
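
A minimal sketch of the online-learning idea mentioned above, assuming scikit-learn's SGDClassifier as a stand-in (the text does not prescribe any particular library): the model is updated batch by batch with partial_fit instead of being retrained on the full dataset, and can answer new queries with low latency at any point.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Online learner: weights are updated batch by batch, no full re-training.
model = SGDClassifier(random_state=0)
classes = np.array([0, 1])  # must be declared on the first partial_fit call

rng = np.random.default_rng(0)
for step in range(100):
    # Stand-in for a mini-batch arriving from a stream or message queue.
    X_batch = rng.normal(size=(32, 5))
    y_batch = (X_batch[:, 0] + 0.1 * rng.normal(size=32) > 0).astype(int)
    model.partial_fit(X_batch, y_batch, classes=classes)

# Low-latency scoring of a new query once the model is warm.
print(model.predict(rng.normal(size=(1, 5))))
```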

2.2 Data Mining Methodology

The decades of development of data mining as a whole have led to the definition of a set of activities and processes that come together in the complete work of making sense of data. In this matter, both academia and industry have reached structured methodologies for carrying out organized, statistically correct and verifiable management and analysis of data that can lead to meaningful results.


Some of these are CRISP-DM, KDD, and SEMMA [Wirth, 2000], [Azevedo and Santos, 2008], as well as many combinations, modifications and custom methodologies based on them. CRISP-DM defines a series of cyclical phases that can be used as a reference to describe the current state of development of data mining and machine learning solutions in the cloud with respect to structured methodologies. The defined phases are: business understanding, data understanding, data preparation, modeling, evaluation and deployment.
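
As a toy illustration of how the data preparation, modeling and evaluation phases become concrete in code (a sketch assuming scikit-learn; CRISP-DM itself is tool-agnostic):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Data preparation (scaling) and modeling chained into one pipeline,
# evaluated with cross-validation; deployment would wrap the fitted
# pipeline behind a service endpoint.
X, y = load_breast_cancer(return_X_y=True)
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipeline, X, y, cv=5)
print("CV accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```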

Currently, there is no single platform or solution in the cloud that can capture the entire pipeline of activities for data mining [Weimer et al., 2011] with a high level of maturity and flexibility in each phase. Although it is possible that this will never be entirely achievable, given the wide range of algorithms, methods and practices being developed every day, it is easy to observe that while some platforms advance rapidly in providing standardized, portable and scalable solutions, they may lack, for instance, a sufficiently wide variety of machine learning algorithms or support for higher-level languages to implement them. A similar idea applies to evaluating and assessing models, or to analyzing, visualizing and transforming the initial raw data.

Most PaaS and SaaS solutions for data mining and machine learning are centered on providing infrastructures that require little to no administration from users and a limited set of training algorithms, data processing, evaluation and visualization methods, surrounded by web service APIs for integrability. Some of them provide graphical interfaces for building processing pipelines, which sets a base for the future addition of new functionality, such as Amazon Machine Learning, Dotplot, Microsoft Azure Machine Learning, BigML and RapidMiner. Just a few, such as Microsoft Azure Machine Learning, allow adding custom code to a given process in order to provide higher flexibility. The general concept behind the marketing of these solutions is the ease of building machine-learning-based applications and their integrability with everyday devices. When comparing this offering with the structured, detailed and highly specialized knowledge required to analyze data and make sense of it, it becomes easy to think that machine learning is being underestimated or simplified too quickly in the eyes of the global community. Possibly the current state of these solutions sets the first step towards getting to know and validating the real-world applications that could be built through artificial intelligence; however, it is still far from providing a stable and comprehensive set of tools that allows following a proper data mining methodology, in order to encourage the migration of researchers to these tools.

2.3 Security

A recurrent and still incipient subject regarding anything related to data manipulation is achieving data security as a whole, and guaranteeing the confidentiality and protection of private data. In many industries, such as finance and medicine, data privacy is a highly relevant requirement from the start. However, most of the available frameworks, PaaS and SaaS offerings, libraries, and software providing data mining capabilities deployable in the cloud do not highlight the security advantages of their software or services, particularly considering the large amount of data to be transferred during distributed processing.

This issue becomes particularly important when the data to be used is stored and available in different places or regions that might not be accessible by a user. Moreover, researchers could take advantage of being able to access large amounts of data for a particular study (e.g. brain scans) while keeping the handled data private and secure. A recent study [Vamsi Potluru, 2014] presents a distributed service to access resources transparently, matching a researcher's requirements while preserving data privacy. More than finding specific datasets, the researcher is able to obtain results from the application of specific machine learning and data processing methods to selected datasets. Computation is distributed, and results can be used directly by the researcher while data privacy is preserved.

In academic research on machine learning, data security is not an issue that is commonly addressed. Most papers emphasize the scientific method that has been followed, the quality of the obtained results and, sometimes, the hardware and software requirements for computing the models. Security of data, however, is typically assumed. It is clear that in the leap from academia to industry there is still much to be developed.

2.4 Performance tradeoffs

The performance of data mining processes, in terms of time and resources, is a decisive element for both academia and industry. There are many tradeoffs regarding the choice of platforms, algorithms, underlying software and frameworks in order to achieve solutions that are accurate enough while being able to be built with feasible infrastructures and response times. Some of the most important elements that come into play when taking machine learning to the cloud are parallelism, data scale and distribution, offline vs. online execution, programming paradigms and the degree of algorithm customization [Bekkerman et al., 2011].

Parallelism provides highly efficient computation of algorithms through the use of GPUs and FPGAs, which are increasingly affordable, largely due to Moore's law and the popularization of algorithms such as deep learning. However, using this capability implies re-implementing the whole processing pipeline to take advantage of it, something that is not always possible or feasible for some algorithms. Moreover, regarding scalability, distributing a computation that is already parallelized involves complex tasks and further low-level computing knowledge. An additional element is data scale and distribution, which refers to the issue of data being stored in multiple locations, requiring efficient algorithms for obtaining and processing data without critically impacting data throughput. A critical issue is also the capacity to develop machine learning algorithms using programming paradigms that can take full advantage of parallel and distributed processing while still being able to successfully implement the algorithm itself, which is not always feasible. This is directly related to the capacity for algorithm customization, which often gets truncated due to the strict and cumbersome requirements of hardware-centered implementations.
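
As a single-machine sketch of the parallelism point above (standard-library Python only; the chunk sizes and toy model are our own assumptions): scoring is split into independent chunks evaluated by a process pool, which only works because the pipeline was written with chunk-wise independence in mind, exactly the kind of re-implementation effort discussed here. A distributed setting would replace the local pool with cluster workers.

```python
from concurrent.futures import ProcessPoolExecutor
import math

def score_chunk(rows):
    """Score one independent chunk of inputs with a toy model."""
    # Placeholder 'model': a fixed logistic function of the row sum.
    return [1.0 / (1.0 + math.exp(-sum(r))) for r in rows]

def parallel_score(rows, n_workers=4):
    """Split rows into chunks and score them in parallel processes."""
    size = max(1, len(rows) // n_workers)
    chunks = [rows[i:i + size] for i in range(0, len(rows), size)]
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        results = pool.map(score_chunk, chunks)
    return [s for chunk in results for s in chunk]

if __name__ == "__main__":
    data = [[0.1 * i, -0.05 * i, 0.2] for i in range(1000)]
    print(parallel_score(data)[:3])
```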

Beyond cloud computing, supercomputers are currently being used as a much more powerful mechanism for implementing machine learning algorithms that demand intensive computation and that process large and ever-growing amounts of data. Such is the case of the latest advances in deep learning in industry, where Baidu, the Chinese search-engine giant, has built a supercomputer for deep convolutional networks for image processing and object recognition [Harris, 2015].

2.5 Machine Learning Status-Quo

A critical question that could arise when discussing the latest developments in data mining and machine learning is 'is it really relevant?'. The growing number of machine learning solutions in the cloud, and the efforts to provide an environment that improves the distributability and efficiency of algorithms, must be sustained by a stronger connection to meaningful problems of science and society. A recent study titled Machine Learning that Matters [Wagstaff, 2012] discusses the current approaches to machine learning, pointing out that the datasets, metrics and processes being used might be taking research efforts away from the right path to provide successful solutions and comprehensive knowledge transfer. It states that research efforts in machine learning might be too centered on adding value to very specific studies within the machine learning world, without attempting to express research questions that could make results actionable in the real world. Some of the highlighted issues are the excessive focus on benchmark datasets that are universally accepted as good for comparing results, such as the UCI datasets; the hyper-focus on abstract metrics that might represent a result accurately in mathematical terms but may not be sufficient for comparison with similar results in other contexts (e.g. 75% accuracy in financial authentication might be considered low while being remarkable in object recognition); and the capacity to reproduce and transfer experiments and results, which is a typical and ongoing problem.

Beyond academia, it can be added that industrial applications might suffer from the opposite phenomenon: the underestimation of proper scientific methodology for performing data mining, while popularizing black-box tools for complex machine learning domains that are still in their beginnings, such as natural language processing, object recognition and speech recognition. The coordination of research and industry to achieve meaningful developments in the portability of data mining and machine learning to the cloud will also require preserving and promoting unified methodologies, enhancing the adoption of robust data security, and providing mechanisms for higher transferability of knowledge.

3 HIGH ORDER MACHINE LEARNING IN THE CLOUD

The last decade has seen the introduction of many new types of machine learning solutions that focus on providing functional capabilities that can be deployed in the cloud. As described before, efforts have been directed towards improving one or more of the processes involved in data mining, and the accessibility of the produced models to generate value in real-world solutions. A recent study has categorized the different solutions for machine learning and data mining in the cloud into five classes [Pop, 2012]: machine learning environments from the cloud, corresponding to cloud-served workstations with typical statistical tools; plugins for machine learning tools, which extend statistical tools to provide Hadoop clustering support; distributed machine learning libraries, referring to sets of parallelized implementations of machine learning algorithms that can run in distributed environments; complex machine learning systems, comprising a longer stack of capabilities for full data analysis, model training and high-performance deployment; and software as a service providers for machine learning, covering some PaaS and SaaS solutions.

In this section we describe some of the most relevant solutions addressing the most important aspects contributing to the overall area of data mining in cloud computing.

3.1 Distributing the load: MapReduce and Hadoop

The successful parallelization framework MapReduce, as implemented by Hadoop, provided a direct and useful approach to adapt machine learning algorithms in order to gain scalability and better performance, considering the heavy processing loads required by most algorithms and the abundance of lengthy processes. Hadoop enabled the possibility of dealing with ever-increasing buckets of data in feasible times, allowing large-scale parallel scoring and becoming ideal for computationally intensive data preprocessing and model building. Despite its success, implementing mappers and reducers became a low-level and labor-intensive task that led to the appearance of higher-level frameworks on top of Hadoop, such as Hive, which provided a new high-level query language and system that compiles queries into MapReduce jobs executed using Hadoop [Thusoo et al., 2009]. Furthermore, many machine learning frameworks appeared that allow simpler and higher-level implementations of machine-learning-based systems on top of these technologies, such as Apache Mahout [Owen et al., 2011], Cloudera Oryx [Owen, ], Spark MLlib and SystemML [Ghoting et al., 2011].
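
To make the mapper/reducer idea concrete, here is a minimal, framework-free sketch (assumed helper names; not Hadoop's actual Java API) of how a simple preprocessing step, computing per-feature means, decomposes into a map phase and a reduce phase; Hadoop Streaming would run equivalent scripts over distributed file splits.

```python
from collections import defaultdict

def map_phase(record):
    """Emit (feature_index, (value, count)) pairs for one input record."""
    for idx, value in enumerate(record):
        yield idx, (value, 1)

def reduce_phase(key, values):
    """Combine all partial (value, count) pairs for one feature index."""
    total = sum(v for v, _ in values)
    count = sum(c for _, c in values)
    return key, total / count  # per-feature mean

def run_mapreduce(records):
    """Sequential stand-in for the shuffle/sort step of a real cluster."""
    grouped = defaultdict(list)
    for record in records:
        for key, value in map_phase(record):
            grouped[key].append(value)
    return dict(reduce_phase(k, vs) for k, vs in grouped.items())

if __name__ == "__main__":
    data = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0], [5.0, 6.0, 7.0]]
    print(run_mapreduce(data))  # {0: 3.0, 1: 4.0, 2: 5.0}
```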

3.2 A new language for Machine Learning: MLbase, Pig Latin and ScalOps

As mentioned before, the increasing need to take advantage of the benefits of Hadoop for machine learning led to the appearance of a new group of solutions proposing an intermediate way of programming machine learning algorithms while being easily integrated with Hadoop. SystemML defined a declarative, higher-level language specifically designed for machine learning in order to reduce the complexity of implementing custom MapReduce jobs. Similarly, MLbase provided a way of expressing machine learning tasks declaratively, with high-level operators that provide flexibility while maintaining efficiency in the implementation of new algorithms [Kraska et al., 2013]. MLbase based its language on Pig Latin, which defined an intermediate language between the style of SQL and the low-level procedural style of MapReduce, generating instructions that also translate directly into Hadoop jobs [Olston et al., 2008]. A similar declarative language and translation system, ScalOps, claims to cover a wider range of big data applications in terms of performance [Weimer et al., 2011].
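
For comparison, the sketch below (assuming a local PySpark installation; it does not reproduce SystemML, MLbase or Pig Latin) shows the same spirit of declarative, high-level model specification, where the engine rather than the user plans the distributed execution.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

# High-level specification: the user declares the pipeline steps and the
# engine decides how to schedule them across executors.
spark = SparkSession.builder.appName("declarative-ml-sketch").getOrCreate()
df = spark.createDataFrame(
    [(0.0, 1.0, 0), (1.0, 0.0, 1), (0.5, 0.5, 1), (0.1, 0.9, 0)],
    ["f1", "f2", "label"],
)
features = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
model = LogisticRegression(featuresCol="features", labelCol="label") \
    .fit(features.transform(df))
model.transform(features.transform(df)).select("label", "prediction").show()
spark.stop()
```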

3.3 Graph-based: GraphLab and PowerGraph

Similar issues regarding the lack of expressiveness and low maintainability of machine learning algorithm implementations, pointed out in previous sections, have been addressed by a different paradigm that relies on data graphs to encode computational structures and data dependencies. Such is the case of GraphLab, which is presented as a parallel abstraction achieving usability, expressiveness and performance [Low et al., 2010]. The tool gives an expert user the ability to tune the algorithm's behavior through high-level functions supported by a sophisticated scheduling mechanism, showing high performance on a set of complex machine learning problems, particularly those involving inference and probabilistic graphical models. Additional research, such as the PowerGraph abstraction [Gonzalez et al., 2012], has introduced several optimizations and effective implementation proposals that push past the performance and scalability barriers associated with the original model.

This particular formulation and computation model has shown great advances for natural language processing and for managing problems related to complex networks, using real-world data. This approach adds to a wide variety of computational, abstraction and practical options that have been developed to tackle the immense variety of data-related problems and their varying complexity.
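
A rough, single-machine sketch of the vertex-centric style these systems popularized, gather contributions from neighbors and then apply a per-vertex update, using PageRank as the classic example (this is plain Python, not GraphLab's actual API):

```python
def pagerank(edges, damping=0.85, iterations=30):
    """Vertex-centric PageRank: each vertex gathers neighbor
    contributions, then applies its update, iterated to convergence."""
    vertices = {v for e in edges for v in e}
    out_degree = {v: 0 for v in vertices}
    in_neighbors = {v: [] for v in vertices}
    for src, dst in edges:
        out_degree[src] += 1
        in_neighbors[dst].append(src)

    rank = {v: 1.0 / len(vertices) for v in vertices}
    for _ in range(iterations):
        new_rank = {}
        for v in vertices:
            # Gather: sum the rank shipped along incoming edges.
            gathered = sum(rank[u] / out_degree[u] for u in in_neighbors[v])
            # Apply: standard PageRank update for this vertex.
            new_rank[v] = (1 - damping) / len(vertices) + damping * gathered
        rank = new_rank
    return rank

if __name__ == "__main__":
    graph = [("a", "b"), ("b", "c"), ("c", "a"), ("a", "c")]
    print(pagerank(graph))
```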

3.4 PaaS, SaaS and Distributed Libraries

The last five years have seen the incorporation of the giants Google, Amazon, Microsoft and IBM into the provision of ready-to-use and somewhat configurable data mining software to tackle modern, real-world data-related problems. These approaches have been mentioned in some detail throughout this work; however, it is necessary to note that while each of them offers an attractive guarantee of stability and scalability, many of the main challenges of the shift of machine learning to the cloud mentioned before remain unresolved. The limited range of available algorithms is a common issue among them. Also, only Microsoft Azure ML allows adding code manually to the processing pipeline, which otherwise forces the scientist to depend solely on the available features. Regarding data sources, data formats, dataset types and data types, the variations and differences are considerable: Azure accepts SQL, textual, HiveQL, Azure Storage and even URL data sources, while Amazon centers on AWS-based storage services, and Google follows a more technical, software-engineering approach with REST API calls, plain HTTP requests or Google storage services. Many other differences can be pointed out directly, showing the lack of standardization of needs for data mining in the cloud [Almeida, ]. However, the addition of these software and technology leaders to the world of cloud data mining allows us to expect considerable improvements in the following years, as the area gains maturity.

Another notable block of solutions is the one that fits under the previous category of distributed libraries. Apache Mahout [Owen et al., 2011] started as an attempt to provide a fully functional machine learning toolbox for specific problems such as recommendation, classification and prediction, through the implementation of specific algorithms running on a pre-built and production-ready system. Further development of the system led to Myrrix, a similar platform developed by Mahout's project leader Sean Owen, later renamed Oryx and acquired by big data giant Cloudera. Oryx [Owen, ] also offers a fully operative system that can be deployed on premises and is surrounded by a series of components, such as a serving layer (a web services API), an input layer for data streaming supported by Apache Kafka, and several in-depth optimizations for model computation on top of Apache Spark, allowing the distributed construction of models and online learning capabilities. Such a complete approach is available as open source and improved by increasingly large communities of developers that will help shape the future of machine learning. Similar open solutions, such as the recently released Google TensorFlow, Theano and Torch, which form the development base for deep learning solutions, are already adopting from the beginning most of the more important requirements for sustainable, scalable and effective machine learning.
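
As an illustration of what consuming such a serving layer typically looks like from a client (the endpoint URL, payload and field names below are hypothetical, not Oryx's or any provider's actual API), a thin HTTP client might post feature values and receive a score back:

```python
import json
import urllib.request

def score(features, endpoint="http://localhost:8080/predict"):
    """POST a feature payload to a hypothetical model-serving endpoint
    and return the parsed JSON response (e.g. {"score": 0.87})."""
    payload = json.dumps({"features": features}).encode("utf-8")
    request = urllib.request.Request(
        endpoint,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return json.loads(response.read().decode("utf-8"))

if __name__ == "__main__":
    # Assumes a serving layer is listening locally; adjust the URL as needed.
    print(score({"f1": 0.4, "f2": 1.2}))
```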

4 CONCLUDING REMARKS

Data mining has started an accelerated shift from purely academia-centered research to functional and production-ready solutions in the cloud. Along with the advance of distributed systems, parallel computation and the increasing availability of hardware resources at marginal costs, novel machine learning tools have been adopting different approaches to reach solutions that are both scalable and maintainable. Still, there is a long path to travel, and many of the greatest challenges for this shift are yet to be resolved effectively, such as the availability of a wider range of algorithms, the adaptation of processing flows while keeping efficiency and without harming readability or adding complexity, robust schemes for dealing with security and data privacy without restricting the availability of data around the world, and the capacity to customize algorithms, among others. The current status of research in machine learning also prompts us to analyze whether the main interests of academia and industry are actually relevant for tackling the world's meaningful problems.


Many different approaches have been presented that attempt to improve the most complex and desirable features of machine learning, especially regarding its shift to cloud computing. New paradigms for efficient parallelized and distributed computing, such as Hadoop, have changed the way algorithms are being developed, even generating intermediate solutions between readability and efficiency such as declarative and SQL-like languages. Different fields of artificial intelligence have also taken advantage of specific implementations for big data problems, such as graph-based approaches for natural language processing and complex network problems. Alongside these, many other tools try to push the limits and fit within classical and novel niches of research and industry to propel the ever-increasing development of a field that has already taken its first steps towards achieving solutions of worldwide relevance, with contributions from academia, industry and open source communities.

REFERENCES

[Almeida, ] Almeida, I. Machine learning as a service. https://blog.onliquid.com/machine-learning-service-benchmark.

[Armbrust et al., 2009] Armbrust, M., Fox, A., Griffith, R., Joseph, A. D., Katz, R. H., Konwinski, A., Lee, G., Patterson, D. A., Rabkin, A., and Zaharia, M. (2009). Above the clouds: A Berkeley view of cloud computing. Technical report.

[Azevedo and Santos, 2008] Azevedo, A. and Santos, M. F. (2008). KDD, SEMMA and CRISP-DM: a parallel overview. In Abraham, A., editor, IADIS European Conf. Data Mining, pages 182–185. IADIS.

[Bekkerman et al., 2011] Bekkerman, R., Bilenko, M., and Langford, J. (2011). Scaling up machine learning: Parallel and distributed approaches. In Proceedings of the 17th ACM SIGKDD International Conference Tutorials, KDD '11 Tutorials, pages 4:1–4:1, New York, NY, USA. ACM.

[Ghoting et al., 2011] Ghoting, A., Krishnamurthy, R., Pednault, E., Reinwald, B., Sindhwani, V., Tatikonda, S., Tian, Y., and Vaithyanathan, S. (2011). SystemML: Declarative machine learning on MapReduce. In Proceedings of the 2011 IEEE 27th International Conference on Data Engineering, ICDE '11, pages 231–242, Washington, DC, USA. IEEE Computer Society.

[Gonzalez et al., 2012] Gonzalez, J. E., Low, Y., Gu, H., Bickson, D., and Guestrin, C. (2012). PowerGraph: Distributed graph-parallel computation on natural graphs. In Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation, OSDI '12, pages 17–30, Berkeley, CA, USA. USENIX Association.

[Harris, 2015] Harris, D. (2015). Baidu built a supercomputer for deep learning. https://gigaom.com/2015/01/14/baidu-has-built-a-supercomputer-for-deep-learning/.

[Kraska et al., 2013] Kraska, T., Talwalkar, A., Duchi, J. C., Griffith, R., Franklin, M. J., and Jordan, M. I. (2013). MLbase: A distributed machine-learning system. In CIDR. www.cidrdb.org.

[Krizhevsky et al., 2012] Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Bartlett, P., Pereira, F., Burges, C., Bottou, L., and Weinberger, K., editors, Advances in Neural Information Processing Systems 25, pages 1106–1114.

[Low et al., 2010] Low, Y., Gonzalez, J., Kyrola, A., Bickson, D., Guestrin, C., and Hellerstein, J. M. (2010). GraphLab: A new parallel framework for machine learning. In Conference on Uncertainty in Artificial Intelligence (UAI), Catalina Island, California.

[Olston et al., 2008] Olston, C., Reed, B., Srivastava, U., Kumar, R., and Tomkins, A. (2008). Pig Latin: A not-so-foreign language for data processing. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, SIGMOD '08, pages 1099–1110, New York, NY, USA. ACM.

[Owen, ] Owen, S. Oryx 2: Lambda architecture on Spark, Kafka for real-time large scale machine learning. https://github.com/OryxProject/oryx.

[Owen et al., 2011] Owen, S., Anil, R., Dunning, T., and Friedman, E. (2011). Mahout in Action. Manning Publications Co., Greenwich, CT, USA.

[Pop, 2012] Pop, D. (2012). Machine learning and cloud computing: Survey of distributed and SaaS solutions. Institute e-Austria Timisoara.

[Sparks et al., 2013] Sparks, E. R., Talwalkar, A., Smith, V., Kottalam, J., Pan, X., Gonzalez, J. E., Franklin, M. J., Jordan, M. I., and Kraska, T. (2013). MLI: An API for distributed machine learning. CoRR, abs/1310.5426.

[Thusoo et al., 2009] Thusoo, A., Sarma, J. S., Jain, N., Shao, Z., Chakka, P., Anthony, S., Liu, H., Wyckoff, P., and Murthy, R. (2009). Hive: a warehousing solution over a map-reduce framework. In Proceedings of the VLDB Endowment, pages 1626–1629.

[Vamsi Potluru, 2014] Vamsi Potluru, Javier Diaz-Montes, et al. (2014). CometCloudCare (C3): Distributed machine learning platform-as-a-service with privacy preservation. In NIPS Workshop: Distributed Machine Learning and Matrix Computations.

[Wagstaff, 2012] Wagstaff, K. (2012). Machine learning that matters. CoRR, abs/1206.4656.

[Weimer et al., 2011] Weimer, M., Condie, T., and Ramakrishnan, R. (2011). Machine learning in ScalOps, a higher order cloud computing language. In NIPS 2011 Workshop on Parallel and Large-Scale Machine Learning (BigLearn).

[White, 2009] White, T. (2009). Hadoop: The Definitive Guide. O'Reilly Media, Inc., 1st edition.

[Wirth, 2000] Wirth, R. (2000). CRISP-DM: Towards a standard process model for data mining. In Proceedings of the Fourth International Conference on the Practical Application of Knowledge Discovery and Data Mining, pages 29–39.

[Zhang et al., 2010] Zhang, Q., Cheng, L., and Boutaba, R. (2010). Cloud computing: state-of-the-art and research challenges. Journal of Internet Services and Applications, 1(1):7–18.