TRANSCRIPT
INTRODUCTION TO PYTHON

Gregor von Laszewski

(c) Gregor von Laszewski, 2018, 2019
1 PREFACE
1.1 Disclaimer
1.1.1 Acknowledgment
1.1.2 Extensions
2 INTRODUCTION
2.1 Introduction to Python
2.1.1 References
3 INSTALLATION
3.1 Python 3.7.4 Installation
3.1.1 Hardware
3.1.2 Prerequisites Ubuntu 19.04
3.1.3 Prerequisites macOS
3.1.3.1 Installation from Apple App Store
3.1.3.2 Installation from python.org
3.1.3.3 Installation from Homebrew
3.1.4 Prerequisites Ubuntu 18.04
3.1.5 Prerequisite Windows 10
3.1.5.1 Linux Subsystem Install
3.1.6 Prerequisite venv
3.1.7 Install Python 3.7 via Anaconda
3.1.7.1 Download conda installer
3.1.7.2 Install conda
3.1.7.3 Install Python 3.7.4 via conda
3.2 Multi-Version Python Installation
3.2.1 Disabling wrong python installs
3.2.2 Managing 2.7 and 3.7 Python Versions without Pyenv
3.2.3 Managing Multiple Python Versions with Pyenv
3.2.3.1 Installation pyenv via Homebrew
3.2.3.2 Install pyenv on Ubuntu 18.04
3.2.3.3 Using pyenv
3.2.3.3.1 Using pyenv to Install Different Python Versions
3.2.3.3.2 Switching Environments
3.2.3.4 Updating Python Version List
3.2.3.4.1 Updating to a new version of Python with pyenv
3.2.4 Anaconda and Miniconda and Conda
3.2.4.1 Miniconda
3.2.4.2 Anaconda
3.2.5 Exercises
4 FIRST STEPS
4.1 Interactive Python
4.1.1 REPL (Read Eval Print Loop)
4.1.2 Interpreter
4.1.3 Python 3 Features in Python 2
4.2 Editors
4.2.1 PyCharm
4.2.2 Python in 45 minutes
4.3 Google Colab
4.3.1 Introduction to Google Colab
4.3.2 Programming in Google Colab
4.3.3 Benchmarking in Google Colab with Cloudmesh
5 LANGUAGE
5.1 Language
5.1.1 Statements and Strings
5.1.2 Comments
5.1.3 Variables
5.1.4 Data Types
5.1.4.1 Booleans
5.1.4.2 Numbers
5.1.5 Module Management
5.1.5.1 Import Statement
5.1.5.2 The from … import Statement
5.1.6 Date Time in Python
5.1.7 Control Statements
5.1.7.1 Comparison
5.1.7.2 Iteration
5.1.8 Datatypes
5.1.8.1 Lists
5.1.8.2 Sets
5.1.8.3 Removal and Testing for Membership in Sets
5.1.8.4 Dictionaries
5.1.8.5 Dictionary Keys and Values
5.1.8.6 Counting with Dictionaries
5.1.9 Functions
5.1.10 Classes
5.1.11 Modules
5.1.12 Lambda Expressions
5.1.12.1 map
5.1.12.2 dictionary
5.1.13 Iterators
5.1.14 Generators
5.1.14.1 Generators with function
5.1.14.2 Generators using for loop
5.1.14.3 Generators with List Comprehension
5.1.14.4 Why to use Generators?
6 CLOUDMESH
6.1 Introduction
6.2 Installation
6.2.1 Prerequisite
6.2.2 Basic Install
6.3 Output
6.3.1 Console
6.3.2 Banner
6.3.3 Heading
6.3.4 VERBOSE
6.3.5 Using print and pprint
6.4 Dictionaries
6.4.1 Dotdict
6.4.2 FlatDict
6.4.3 Printing Dicts
6.5 Shell
6.6 StopWatch
6.7 Cloudmesh Command Shell
6.7.1 CMD5
6.7.1.1 Resources
6.7.1.2 Installation from source
6.7.1.3 Execution
6.7.1.4 Create your own Extension
6.7.1.5 Bug: Quotes
6.8 Exercises
6.8.1 Cloudmesh Common
6.8.2 Cloudmesh Shell
7 LIBRARIES
7.1 Python Modules
7.1.1 Updating Pip
7.1.2 Using pip to Install Packages
7.1.3 GUI
7.1.3.1 GUIZero
7.1.3.2 Kivy
7.1.4 Formatting and Checking Python Code
7.1.5 Using autopep8
7.1.6 Writing Python 3 Compatible Code
7.1.7 Using Python on FutureSystems
7.1.8 Ecosystem
7.1.8.1 pypi
7.1.8.2 Alternative Installations
7.1.9 Resources
7.1.9.1 Jupyter Notebook Tutorials
7.1.10 Exercises
7.2 Data Management
7.2.1 Formats
7.2.1.1 Pickle
7.2.1.2 Text Files
7.2.1.3 CSV Files
7.2.1.4 Excel spreadsheets
7.2.1.5 YAML
7.2.1.6 JSON
7.2.1.7 XML
7.2.1.8 RDF
7.2.1.9 PDF
7.2.1.10 HTML
7.2.1.11 ConfigParser
7.2.1.12 ConfigDict
7.2.2 Encryption
7.2.3 Database Access
7.2.4 SQLite
7.2.4.1 Exercises
7.3 Plotting with matplotlib
7.4 DocOpts
7.5 OpenCV
7.5.1 Overview
7.5.2 Installation
7.5.3 A Simple Example
7.5.3.1 Loading an image
7.5.3.2 Displaying the image
7.5.3.3 Scaling and Rotation
7.5.3.4 Gray-scaling
7.5.3.5 Image Thresholding
7.5.3.6 Edge Detection
7.5.4 Additional Features
7.6 Secchi Disk
7.6.1 Setup for OSX
7.6.2 Step 1: Record the video
7.6.3 Step 2: Analyse the images from the Video
7.6.3.1 Image Thresholding
7.6.3.2 Edge Detection
7.6.3.3 Black and white
8 DATA
8.1 Data Formats
8.1.1 YAML
8.1.2 JSON
8.1.3 XML
9 MONGO
9.1 MongoDB in Python
9.1.1 Cloudmesh MongoDB Usage Quickstart
9.1.2 MongoDB
9.1.2.1 Installation
9.1.2.1.1 Installation procedure
9.1.2.2 Collections and Documents
9.1.2.2.1 Collection example
9.1.2.2.2 Document structure
9.1.2.2.3 Collection Operations
9.1.2.3 MongoDB Querying
9.1.2.3.1 Mongo Queries examples
9.1.2.4 MongoDB Basic Functions
9.1.2.4.1 Import/Export functions examples
9.1.2.5 Security Features
9.1.2.5.1 Collection based access control example
9.1.2.6 MongoDB Cloud Service
9.1.3 PyMongo
9.1.3.1 Installation
9.1.3.2 Dependencies
9.1.3.3 Running PyMongo with Mongo Daemon
9.1.3.4 Connecting to a database using MongoClient
9.1.3.5 Accessing Databases
9.1.3.6 Creating a Database
9.1.3.7 Inserting and Retrieving Documents (Querying)
9.1.3.8 Limiting Results
9.1.3.9 Updating Collection
9.1.3.10 Counting Documents
9.1.3.11 Indexing
9.1.3.12 Sorting
9.1.3.13 Aggregation
9.1.3.14 Deleting Documents from a Collection
9.1.3.15 Copying a Database
9.1.3.16 PyMongo Strengths
9.1.4 MongoEngine
9.1.4.1 Installation
9.1.4.2 Connecting to a database using MongoEngine
9.1.4.3 Querying using MongoEngine
9.1.5 Flask-PyMongo
9.1.5.1 Installation
9.1.5.2 Configuration
9.1.5.3 Connection to multiple databases/servers
9.1.5.4 Flask-PyMongo Methods
9.1.5.5 Additional Libraries
9.1.5.6 Classes and Wrappers
9.2 Mongoengine
9.2.1 Introduction
9.2.2 Install and connect
9.2.3 Basics
10 OTHER
10.1 Word Count with Parallel Python
10.1.1 Generating a Document Collection
10.1.2 Serial Implementation
10.1.3 Serial Implementation Using map and reduce
10.1.4 Parallel Implementation
10.1.5 Benchmarking
10.1.6 Exercises
10.1.7 References
10.2 NumPy
10.2.1 Installing NumPy
10.2.2 NumPy Basics
10.2.3 Data Types: The Basic Building Blocks
10.2.4 Arrays: Stringing Things Together
10.2.5 Matrices: An Array of Arrays
10.2.6 Slicing Arrays and Matrices
10.2.7 Useful Functions
10.2.8 Linear Algebra
10.2.9 NumPy Resources
10.3 Scipy
10.3.1 Introduction
10.3.2 References
10.4 Scikit-learn
10.4.1 Introduction to Scikit-learn
10.4.2 Installation
10.4.3 Supervised Learning
10.4.4 Unsupervised Learning
10.4.5 Building an end to end pipeline for Supervised machine learning using Scikit-learn
10.4.6 Steps for developing a machine learning model
10.4.7 Exploratory Data Analysis
10.4.7.1 Bar plot
10.4.7.2 Correlation between attributes
10.4.7.3 Histogram Analysis of dataset attributes
10.4.7.4 Box plot Analysis
10.4.7.5 Scatter plot Analysis
10.4.8 Data Cleansing - Removing Outliers
10.4.9 Pipeline Creation
10.4.9.1 Defining DataFrameSelector to separate Numerical and Categorical attributes
10.4.9.2 Feature Creation / Additional Feature Engineering
10.4.10 Creating Training and Testing datasets
10.4.11 Creating pipeline for numerical and categorical attributes
10.4.12 Selecting the algorithm to be applied
10.4.12.1 Linear Regression
10.4.12.2 Logistic Regression
10.4.12.3 Decision trees
10.4.12.4 KMeans
10.4.12.5 Support Vector Machines
10.4.12.6 Naive Bayes
10.4.12.7 Random Forest
10.4.12.8 Neural networks
10.4.12.9 Deep Learning using Keras
10.4.12.10 XGBoost
10.4.13 Scikit Cheat Sheet
10.4.14 Parameter Optimization
10.4.14.1 Hyperparameter optimization/tuning algorithms
10.4.15 Experiments with Keras (deep learning), XGBoost, and SVM (SVC) compared to Logistic Regression (Baseline)
10.4.15.1 Creating a parameter grid
10.4.15.2 Implementing Grid search with models and also creating metrics from each of the models
10.4.15.3 Results table from the Model evaluation with metrics
10.4.15.4 ROC AUC Score
10.4.16 K-means in scikit-learn
10.4.16.1 Import
10.4.17 K-means Algorithm
10.4.17.1 Import
10.4.17.2 Create samples
10.4.17.3 Create samples
10.4.17.4 Visualize
10.4.17.5 Visualize
10.5 Dask - Random Forest Feature Detection
10.5.1 Setup
10.5.2 Dataset
10.5.3 Detecting Features
10.5.3.1 Data Preparation
10.5.4 Random Forest
10.5.5 Acknowledgement
10.6 Parallel Computing in Python
10.6.1 Multi-threading in Python
10.6.1.1 Thread vs Threading
10.6.1.2 Locks
10.6.2 Multi-processing in Python
10.6.2.1 Process
10.6.2.2 Pool
10.6.2.2.1 Synchronous Pool.map()
10.6.2.2.2 Asynchronous Pool.map_async()
10.6.2.3 Locks
10.6.2.4 Process Communication
10.6.2.4.1 Value
10.7 Dask
10.7.1 How Dask Works
10.7.2 Dask Bag
10.7.3 Concurrency Features
10.7.4 Dask Array
10.7.5 Dask DataFrame
10.7.6 Dask DataFrame Storage
10.7.7 Links
11 APPLICATIONS
11.1 Fingerprint Matching
11.1.1 Overview
11.1.2 Objectives
11.1.3 Prerequisites
11.1.4 Implementation
11.1.5 Utility functions
11.1.6 Dataset
11.1.7 Data Model
11.1.7.1 Utilities
11.1.7.1.1 Checksum
11.1.7.1.2 Path
11.1.7.1.3 Image
11.1.7.2 Mindtct
11.1.7.3 Bozorth3
11.1.7.3.1 Running Bozorth3
11.1.7.3.1.1 One-to-one
11.1.7.3.1.2 One-to-many
11.1.8 Plotting
11.1.9 Putting it all Together
11.2 NIST Pedestrian and Face Detection
11.2.0.1 Introduction
11.2.0.1.1 INRIA Person Dataset
11.2.0.1.2 HOG with SVM model
11.2.0.1.3 Ansible Automation Tool
11.2.0.2 Deployment by Ansible
11.2.0.3 Cloudmesh for Provisioning
11.2.0.4 Roles Explained for Installation
11.2.0.4.1 Server groups for Masters/Slaves by Ansible inventory
11.2.0.5 Instructions for Deployment
11.2.0.5.1 Cloning Pedestrian Detection Repository from Github
11.2.0.5.2 Ansible Playbook
11.2.0.6 OpenCV in Python
11.2.0.6.1 Import cv2
11.2.0.6.2 Image Detection
11.2.0.7 Human and Face Detection in OpenCV
11.2.0.7.1 INRIA Person Dataset
11.2.0.7.2 Face Detection using Haar Cascades
11.2.0.7.3 Face Detection Python Code Snippet
11.2.0.8 Pedestrian Detection using HOG Descriptor
11.2.0.8.1 Python Code Snippet
11.2.0.9 Processing by Apache Spark
11.2.0.9.1 Parallelize in SparkContext
11.2.0.9.2 Map Function (apply_batch)
11.2.0.9.3 Collect Function
11.2.0.10 Results for 100+ images by Spark Cluster
12 REFERENCES
1 PREFACE
Sat Nov 23 05:25:16 EST 2019
1.1 DISCLAIMER

This book has been generated with Cyberaide Bookmanager.
Bookmanager is a tool to create a publication from a number of sources on the internet. It is especially useful to create customized books, lecture notes, or handouts. Content is best integrated in markdown format as it is very fast to produce the output.
Bookmanager has been developed based on our experience over the last 3 years with a more sophisticated approach. Bookmanager takes the lessons from this approach and distributes a tool that can easily be used by others.
The followingshieldsprovide some informationabout it.Feel free toclickonthem.
[Badges: pypi v0.2.28, License Apache 2.0, python 3.7, format wheel, status stable, build unknown]
1.1.1 Acknowledgment
If you use bookmanager to produce a document you must include the following acknowledgement:
“This document was produced with Cyberaide Bookmanager developed by Gregor von Laszewski available at https://pypi.python.org/pypi/cyberaide-bookmanager. It is the responsibility of the user to make sure an author acknowledgement section is included in your document. Copyright verification of content included in a book is the responsibility of the book editor.”
The bibtex entry is:

@Misc{www-cyberaide-bookmanager,
  author = {Gregor von Laszewski},
  title = {{Cyberaide Book Manager}},
  howpublished = {pypi},
  month = apr,
  year = 2019,
  url = {https://pypi.org/project/cyberaide-bookmanager/}
}

1.1.2 Extensions

We are happy to discuss with you bugs, issues and ideas for enhancements. Please use the convenient github issues at

https://github.com/cyberaide/bookmanager/issues

Please do not file with us issues that relate to an editor's book. They will provide you with their own mechanism on how to correct their content.
2 INTRODUCTION

2.1 INTRODUCTION TO PYTHON

Learning Objectives
- Learn Python quickly under the assumption that you know a programming language
- Work with modules
- Understand docopts and cmd
- Conduct some python examples to refresh your python knowledge
- Learn about the map function in Python
- Learn how to start subprocesses and redirect their output
- Learn more advanced constructs such as multiprocessing and Queues
- Understand why we do not use anaconda
- Get familiar with pyenv
Portions of this lesson have been adapted from the official Python Tutorial, copyright Python Software Foundation.
Python is an easy to learn programming language. It has efficient high-level data structures and a simple but effective approach to object-oriented programming. Python's simple syntax and dynamic typing, together with its interpreted nature, make it an ideal language for scripting and rapid application development in many areas on most platforms. The Python interpreter and the extensive standard library are freely available in source or binary form for all major platforms from the Python Web site, https://www.python.org/, and may be freely distributed. The same site also contains distributions of and pointers to many free third party Python modules, programs and tools, and additional documentation. The Python interpreter can be extended with new functions and data types implemented in C or C++ (or other languages callable from C). Python is also suitable as an extension language for customizable applications.
Python is an interpreted, dynamic, high-level programming language suitable for a wide range of applications.
The philosophy of python is summarized in The Zen of Python as follows:

- Explicit is better than implicit
- Simple is better than complex
- Complex is better than complicated
- Readability counts
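You can print the complete Zen of Python in any interpreter with import this. As a small aside, the standard this module also carries the text in ROT13-encoded form, so it can be decoded programmatically; this.s and this.d below are attributes of that module:

```python
import this  # importing the module prints the full Zen of Python

# Decode the ROT13-encoded text (this.s) via the letter map (this.d)
# to work with the poem as an ordinary string:
zen = "".join(this.d.get(c, c) for c in this.s)
assert "Explicit is better than implicit." in zen
```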
The main features of Python are:

- Use of indentation whitespace to indicate blocks
- Object oriented paradigm
- Dynamic typing
- Interpreted runtime
- Garbage collected memory management
- a large standard library
- a large repository of third-party libraries
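Two of these features, indentation-delimited blocks and dynamic typing, can be seen together in a tiny sketch:

```python
x = 1          # x currently refers to an int
if x > 0:      # the indented line below forms the body of the if-block
    x = "one"  # the same name may now refer to a str (dynamic typing)
assert isinstance(x, str)
```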
Python is used by many companies and is applied for web development, scientific computing, embedded applications, artificial intelligence, software development, and information security, to name a few.
The material collected here introduces the reader to the basic concepts and features of the Python language and system. After you have worked through the material you will be able to:
- use Python
- use the interactive Python interface
- understand the basic syntax of Python
- write and run Python programs
- have an overview of the standard library
- install Python libraries using pyenv for multi-python interpreter development
We do not attempt to be comprehensive and cover every single feature, or even every commonly used feature. Instead, this introduction covers many of Python's most noteworthy features, and will give you a good idea of the language's flavor and style. After reading it, you will be able to read and write Python modules and programs, and you will be ready to learn more about the various Python library modules.
In order to conduct this lesson you need:
- A computer with Python 2.7.16 or 3.7.4
- Familiarity with command line usage
- A text editor such as PyCharm, emacs, vi, or others. You should identify which works best for you and set it up.
2.1.1 References
Some important additional information can be found on the following Web pages:
- Python
- Pip
- Virtualenv
- NumPy
- SciPy
- Matplotlib
- Pandas
- pyenv
- PyCharm
Python Module of the Week is a Web site that provides a number of short examples on how to use some elementary python modules. Not all modules are equally useful, and you should decide if there are better alternatives. However, for beginners this site provides a number of good examples:
- Python 2: https://pymotw.com/2/
- Python 3: https://pymotw.com/3/
3 INSTALLATION

3.1 PYTHON 3.7.4 INSTALLATION
Learning Objectives
- Learn how to install python.
- Find additional information about Python.
- Make sure your computer supports Python.
In this section we explain how to install python 3.7.4 on a computer. Likely much of the code will work with earlier versions, but we do the development in Python on the newest version of python available at https://www.python.org/downloads.
3.1.1 Hardware
Python does not require any special hardware. We have installed Python not only on PCs and laptops, but also on Raspberry Pis and Lego Mindstorms.
However, there are some things to consider. If you use many programs on your desktop and run them all at the same time, you will find that even in up-to-date operating systems you can quickly run out of memory. This is especially true if you use editors such as PyCharm, which we highly recommend. Furthermore, as you likely have lots of disk access, make sure to use a fast HDD or better an SSD.
A typical modern developer PC or laptop has 16GB RAM and an SSD. You can certainly do python on a $35 Raspberry Pi, but you probably will not be able to run PyCharm. There are many alternative editors with a smaller memory footprint available.
3.1.2 Prerequisites Ubuntu 19.04

Python 3.7 is installed in ubuntu 19.04. Therefore, it already fulfills the prerequisites. However, we recommend that you update to the newest version of python and pip. Please visit: https://www.python.org/downloads
3.1.3 Prerequisites macOS

3.1.3.1 Installation from Apple App Store

You want a number of useful tools on your macOS. They are not installed by default, but are available via Xcode. First you need to install Xcode from

https://apps.apple.com/us/app/xcode/id497799835

Next you need to install the macOS Xcode command line tools:

$ xcode-select --install

3.1.3.2 Installation from python.org

The easiest installation of Python for cloudmesh is to use the installation from https://www.python.org/downloads. Please visit the page and follow the instructions. After this install you have python3 available from the command line.

3.1.3.3 Installation from Homebrew

An alternative installation is provided by Homebrew. To use this install method, you need to install Homebrew first, using the instructions on their web page:

$ /usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"

Then you should be able to install Python 3.7.4 using:

$ brew install python

3.1.4 Prerequisites Ubuntu 18.04

We recommend you update your ubuntu version to 19.04 and follow the instructions for that version instead, as it is significantly easier. If you however are not able to do so, the following instructions may be helpful.
We first need to make sure that the correct version of Python 3 is installed. The default version of Python on Ubuntu 18.04 is 3.6. You can get the version with:

$ python3 --version

If the version is not 3.7.4 or newer, you can update it as follows:

$ sudo apt-get update
$ sudo apt install software-properties-common
$ sudo add-apt-repository ppa:deadsnakes/ppa
$ sudo apt-get install python3.7 python3-dev python3.7-dev

You can then check the installed version using python3.7 --version which should be 3.7.4.

Now we will create a new virtual environment:

$ python3.7 -m venv --without-pip ~/ENV3

Then edit the ~/.bashrc file and add the following lines at the end:

alias ENV3="source ~/ENV3/bin/activate"
ENV3

Now activate the virtual environment using:

$ source ~/.bashrc

Now you can install the pip for the virtual environment without conflicting with the native pip:

$ curl "https://bootstrap.pypa.io/get-pip.py" -o "get-pip.py"
$ python get-pip.py
$ rm get-pip.py

3.1.5 Prerequisite Windows 10

Python 3.7 can be installed on Windows 10 using: https://www.python.org/downloads

For 3.7.4 you can go to the download page and download one of the different files for Windows.
Let us assume you chose the Web based installer. Then you click on the file in the edge browser (make sure the account you use has administrative privileges). Follow the instructions that the installer gives. Important is that you select at one point "[x] Add to Path". There will be an empty checkmark for this that you will click on.
Once it is installed, choose a terminal and execute:

python --version

However, if you have installed conda for some reason, you need to read up on how to install python 3.7.4 in conda, or identify how to run the conda and python.org installs at the same time. We often see others giving the wrong installation instructions.

An alternative is to use python from within the Linux Subsystem. But that has some limitations, and you will need to explore how to access the file system in the subsystem to have a smooth integration with your Windows host so you can, for example, use PyCharm.

3.1.5.1 Linux Subsystem Install

To activate the Linux Subsystem, please follow the instructions at

https://docs.microsoft.com/en-us/windows/wsl/install-win10

A suitable distribution would be

https://www.microsoft.com/en-us/p/ubuntu-1804-lts/9n9tngvndl3q?activetab=pivot:overviewtab

However, as it uses an older version of python, you will have to update it.

3.1.6 Prerequisite venv

This step is highly recommended if you have not already installed a venv for python, to make sure you are not interfering with your system python. Not using a venv could have catastrophic consequences and destroy your operating system tools if they rely on Python. The use of venv is simple. For our purposes we assume that you use the directory:

~/ENV3

Follow these steps first. First cd to your home directory. Then execute:

$ cd ~
$ python3 -m venv ~/ENV3
$ source ~/ENV3/bin/activate

You can add at the end of your .bashrc (ubuntu) or .bash_profile (macOS) file the line

source ~/ENV3/bin/activate

so the environment is always loaded. Now you are ready to install cloudmesh.

Check if you have the right version of python installed with

$ python --version

To make sure you have an up to date version of pip, issue the command

$ pip install pip -U

3.1.7 Install Python 3.7 via Anaconda

3.1.7.1 Download conda installer

Miniconda is recommended here. Download an installer for Windows, macOS, and Linux from this page: https://docs.conda.io/en/latest/miniconda.html

3.1.7.2 Install conda

Follow the instructions to install conda for your operating system:

Windows: https://conda.io/projects/conda/en/latest/user-guide/install/windows.html
macOS: https://conda.io/projects/conda/en/latest/user-guide/install/macos.html
Linux: https://conda.io/projects/conda/en/latest/user-guide/install/linux.html

3.1.7.3 Install Python 3.7.4 via conda

$ conda create -n ENV3 python=3.7.4
$ conda activate ENV3
$ conda install -c anaconda pip
$ conda deactivate ENV3

It is very important to make sure you have a newer version of pip installed. After you installed and created the ENV3, you need to activate it. This can be done with

$ conda activate ENV3

If you like to activate it when you start a new terminal, please add this line to your .bashrc or .bash_profile. If you use zsh, please add it to .zprofile instead.

3.2 MULTI-VERSION PYTHON INSTALLATION

Learning Objectives

- Understand why we need to worry about python 3.7 and 2.7
- Use pyenv to support both versions
- Understand the limitations of anaconda/conda for developers

We are living at an interesting junction point in the development of Python. As of January 2019, Python developers are encouraged to switch from python version 2.7 to python version 3.7.

However, there may be the requirement that you still need to develop code not only in python 3.7 but also in python 2.7. To facilitate this multi-python-version development, the best tool we know of capable of doing so is pyenv. We will explain in this section how to install both versions with the help of pyenv.

Python is easy to install, and very good instructions for most platforms can be found on the python.org Web page. We see two different versions:

- Python 2.7.16
- Python 3.7.4

To manage python modules, it is useful to have the pip package installation tool on your system.

We assume that you have a computer with python installed. The version of python, however, may not be the newest version. Please check with

$ python --version

which version of python you run. If it is not the newest version, we use pyenv to install a newer version so you do not affect the default version of python of your system.

3.2.1 Disabling wrong python installs

While working with students we have seen at times that they take other classes, either at universities or online, that teach them how to program in python. Unfortunately, these classes often seem to ignore teaching how to properly install Python. We recently had a student that had installed python 7 different times on his macOS machine, while another student had 3 different installations, all of which conflicted with each other as they were not set up properly, and the students did not even realize that they were using Python incorrectly on their computers due to setup issues and conflicting libraries.

We recommend that you inspect if you have files such as ~/.bashrc or ~/.bash_profile in your home directory and identify if they activate various versions of python on your computer. If so, you could try to deactivate them by commenting out the various versions with the # character at the beginning of the line, start a new terminal, and see if the terminal shell still works. Then you can follow our instructions here while using an install of pyenv.

3.2.2 Managing 2.7 and 3.7 Python Versions without Pyenv

If you need to have more than one python version installed and do not want to or cannot use pyenv, we recommend you download and install python 2.7.16 and 3.7.4 from python.org (https://www.python.org/downloads/).

You can then use either python2 or python3 to invoke the python interpreter.
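Whichever install you use, you can confirm from inside Python which interpreter and version are active; sys.executable and sys.version_info are standard-library attributes:

```python
import sys

# Path of the interpreter binary that is currently running:
print(sys.executable)
# Version as a tuple, e.g. (3, 7, 4):
print(sys.version_info[:3])
```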
3.2.3 Managing Multiple Python Versions with Pyenv

Python has several versions that are used by the community. This includes Python 2 and Python 3, but also different management of the python libraries. As each OS may have its own version of python installed, it is recommended that you do not modify that version. Instead, you may want to create a localized python installation that you as a user can modify. To do that we recommend pyenv. Pyenv allows users to switch between multiple versions of Python (https://github.com/yyuu/pyenv). To summarize, it allows:

- users to change the global Python version on a per-user basis;
- users to enable support for per-project Python versions;
- easy version changes without complex environment variable management;
- searching installed commands across different python versions;
- integration with tox (https://tox.readthedocs.io/).

To install pyenv on your system you can use the command:

$ curl https://pyenv.run | bash

Now you can install different python versions on your system such as python 2.7 and 3.7 with a few commands:

$ pyenv install 3.7.4
$ pyenv install 2.7.16
$ pyenv virtualenv 3.7.4 ENV3
$ pyenv virtualenv 2.7.16 ENV2

To automatically access them from your shell, we integrate them into bash by editing the bash configuration files. Make sure that on Linux you add to the ~/.bashrc file and on macOS to the file ~/.bash_profile or .zprofile:

export PYENV_ROOT="$HOME/.pyenv"
export PATH="$PYENV_ROOT/bin:$PATH"
export PYENV_VIRTUALENV_DISABLE_PROMPT=1
eval "$(pyenv init -)"
eval "$(pyenv virtualenv-init -)"

__pyenv_version_ps1() {
  local ret=$?;
  output=$(pyenv version-name)
  if [[ ! -z $output ]]; then
    echo -n "($output)"
  fi
  return $ret;
}

PS1="\$(__pyenv_version_ps1)${PS1}"

We recommend that you do this towards the end of your file. Then add our convenience aliases to activate the environments:

alias ENV2="pyenv activate ENV2"
alias ENV3="pyenv activate ENV3"
ENV3

Next we recommend to update pip in both environments:

$ ENV2
$ pip install pip -U
$ ENV3
$ pip install pip -U

3.2.3.1 Installation pyenv via Homebrew

On macOS you can install pyenv also via Homebrew. Before installing anything on your computer, make sure you have enough space. Use in the terminal the command:

$ df -h

which gives you an overview of your file system. If you do not have enough space, please make sure you free up unused files from your drive.

On many occasions it is beneficial to use readline, as it provides nice editing features for the terminal, and xz for compression. First, make sure you have xcode installed:

$ xcode-select --install

On Mojave you will get an error that zlib is not installed. This is due to the header files not being properly installed. To fix this you can say:

$ sudo installer -pkg /Library/Developer/CommandLineTools/Packages/macOS_SDK_headers_for_macOS_10.14.pkg -target /

Next install homebrew. Additionally install readline and some compression tools:

$ /usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
$ brew update
$ brew install readline xz

To install pyenv, pyenv-virtualenv, and pyenv-virtualenvwrapper with homebrew, execute in the terminal:

brew install pyenv pyenv-virtualenv pyenv-virtualenvwrapper

3.2.3.2 Install pyenv on Ubuntu 18.04

The following steps will install pyenv in a new ubuntu 18.04 distribution.

Start up a terminal and execute in the terminal the following commands. We recommend that you do it one command at a time so you can observe if the command succeeds:

$ sudo apt-get update
$ sudo apt-get install git python-pip make build-essential libssl-dev
$ sudo apt-get install zlib1g-dev libbz2-dev libreadline-dev libsqlite3-dev
$ sudo pip install virtualenvwrapper
$ git clone https://github.com/yyuu/pyenv.git ~/.pyenv
$ git clone https://github.com/pyenv/pyenv-virtualenv.git ~/.pyenv/plugins/pyenv-virtualenv
$ git clone https://github.com/yyuu/pyenv-virtualenvwrapper.git ~/.pyenv/plugins/pyenv-virtualenvwrapper
$ echo 'export PYENV_ROOT="$HOME/.pyenv"' >> ~/.bashrc
$ echo 'export PATH="$PYENV_ROOT/bin:$PATH"' >> ~/.bashrc

You can also install pyenv using the curl command in the following way:

$ curl -L https://raw.githubusercontent.com/yyuu/pyenv-installer/master/bin/pyenv-installer | bash

Then install its dependencies:

$ sudo apt-get update && sudo apt-get upgrade
$ sudo apt-get install -y make build-essential libssl-dev
$ sudo apt-get install -y zlib1g-dev libbz2-dev libreadline-dev libsqlite3-dev
$ sudo apt-get install -y wget curl llvm libncurses5-dev git

Now that you have installed pyenv, it is not yet activated in your current terminal. The easiest thing to do is to start a new terminal and type in:

$ which pyenv

If you see a response, pyenv is installed and you can proceed with the next steps.

Please remember that whenever you modify .bashrc, .bash_profile, or .zprofile you need to start a new terminal.

3.2.3.3 Using pyenv

3.2.3.3.1 Using pyenv to Install Different Python Versions

Pyenv provides a large list of different python versions. To see the entire list, please use the command:

$ pyenv install -l

However, for us we only need to worry about python 2.7.16 and python 3.7.4. You can now install different versions of python into your local environment with the following commands:

$ pyenv update
$ pyenv install 2.7.16
$ pyenv install 3.7.4

You can set the global python default version with:

$ pyenv global 3.7.4

Type the following to determine which version you activated:

$ pyenv version

Type the following to determine which versions you have available:

$ pyenv versions

To associate a specific environment name with a certain python version, use the following commands:

$ pyenv virtualenv 2.7.16 ENV2
$ pyenv virtualenv 3.7.4 ENV3

In the example, ENV2 would represent python 2.7.16 while ENV3 would represent python 3.7.4. Often it is easier to type the alias rather than the explicit version.

3.2.3.3.2 Switching Environments

After setting up the different environments, switching between them is now easy. Simply use the following commands:

(2.7.16) $ pyenv activate ENV2
(ENV2) $ pyenv activate ENV3
(ENV3) $ pyenv activate ENV2
(ENV2) $ pyenv deactivate ENV2
(2.7.16) $

To make it even easier, you can add the following lines to your .bash_profile or .zprofile file:

alias ENV2="pyenv activate ENV2"
alias ENV3="pyenv activate ENV3"

If you start a new terminal, you can switch between the different versions of python simply by typing:

$ ENV2
$ ENV3

3.2.3.4 Updating Python Version List

Pyenv maintains locally a list of available python versions. To see the updated list, use the commands:

$ pyenv update
$ pyenv install -l

You will see the updated list.

3.2.3.4.1 Updating to a new version of Python with pyenv

Naturally, python itself evolves and new versions will become available via pyenv. To facilitate such a new version you need to first install it into pyenv. Let us assume you had an old version of python installed onto the ENV3 environment. Then you need to execute the following steps:

$ pyenv deactivate
$ pyenv uninstall ENV3
$ pyenv install 3.7.4
$ pyenv virtualenv 3.7.4 ENV3
$ ENV3
$ pip install pip -U

With the pip install command, we make sure we have the newest version of pip. In case you get an error, you may have to update xcode as follows and try again:

xcode-select --install

After you installed it, you can activate it by typing ENV3. Naturally this requires that you added it to your bash environment as discussed earlier.

3.2.4 Anaconda and Miniconda and Conda
While others on the internet or in your classes may have taught you to use anaconda, we will avoid it as it has several disadvantages for developers. The reason for this is that it installs many packages that you are likely not to use. In fact, installing anaconda on your VM will waste space and time, and you should look into other installs.

We do not recommend that you use anaconda or miniconda as it may interfere with your default python interpreters and setup.

Please note that beginners to python should always use anaconda or miniconda only after they have installed pyenv and use it. For this class neither anaconda nor miniconda is required. In fact, we do not recommend it. We keep this section as we know that other classes at IU may use anaconda. We are not aware if these classes teach you the right way to install it, with pyenv.

3.2.4.1 Miniconda

This section about miniconda is experimental and has not been tested. We are looking for contributors that help complete it. If you use anaconda or miniconda we recommend to manage it via pyenv.

To install miniconda you can use the following commands:

$ mkdir ana
$ cd ana
$ pyenv install miniconda3-latest
$ pyenv local miniconda3-latest
$ pyenv activate miniconda3-latest
$ conda create -n ana anaconda

To activate use:

$ source activate ana

To deactivate use:

$ source deactivate

3.2.4.2 Anaconda

This section about anaconda is experimental and has not been tested. We are looking for contributors that help complete it.

You can add anaconda to your pyenv with the following commands:

pyenv install anaconda3-4.3.1

To switch more easily, we recommend that you use the following in your .bash_profile or .zprofile file:

alias ANA="pyenv activate anaconda3-4.3.1"

Once you have done this you can easily switch to anaconda with the command:

$ ANA

Terminology in anaconda could lead to confusion. Thus we like to point out that the version number of anaconda is unrelated to the python version. Furthermore, anaconda uses the term root not for the root user, but for the originating directory in which the anaconda program is installed.

In case you like to build your own conda packages at a later time, we recommend that you install the conda-build package:

$ conda install conda-build

When executing:

$ pyenv versions

you will see after the install completed the anaconda versions installed:

  system
  2.7.16
  2.7.16/envs/ENV2
  3.7.4
  3.7.4/envs/ENV3
  ENV2
  ENV3
* anaconda3-4.3.1 (set by PYENV_VERSION environment variable)

Let us now create a virtualenv for anaconda:

$ pyenv virtualenv anaconda3-4.3.1 ANA

To activate it you can now use:

$ pyenv ANA

However, anaconda may modify your .bashrc, .bash_profile, or .zprofile files and may result in incompatibilities with other python versions. For this reason we recommend not to use it. If you find ways to get it to work reliably with other versions, please let us know and we will update this tutorial.
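Whichever environment manager you settle on, you can check from inside Python whether a venv-style virtual environment is active. A minimal sketch, assuming Python 3.3+ where sys.base_prefix exists (conda environments work differently and are not detected this way):

```python
import sys

def in_virtualenv():
    # In a venv/pyenv-virtualenv, sys.prefix points into the environment,
    # while sys.base_prefix still points to the base installation.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)

print(in_virtualenv())
```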
3.2.5 Exercises

E.Python.Install.1:

Install Python 3.7.4.

E.Python.Install.2:

Write installation instructions for an operating system of your choice and add them to this documentation.

E.Python.Install.3:

Replicate the steps to install pyenv, so you can type ENV2 and ENV3 in your terminals to switch between python 2 and 3.

E.Python.Install.4:

Why do you generally not want to use anaconda for cloud computing? When is it OK to use anaconda?
4 FIRST STEPS

4.1 INTERACTIVE PYTHON ☁

Python can be used interactively. You can enter the interactive mode by executing the command:

$ python

You will see something like the following:

$ python
Python 3.7.1 (default, Nov 24 2018, 14:27:15)
[Clang 10.0.0 (clang-1000.11.45.5)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>>

The >>> is the prompt used by the interpreter. This is similar to bash, where commonly $ is used.

Sometimes it is convenient to show the prompt when illustrating an example. This provides some context for what we are doing. If you are following along, you will not need to type in the prompt.

This interactive Python process does the following:

- read your input commands
- evaluate your command
- print the result of the evaluation
- loop back to the beginning

This is why you may see the interactive loop referred to as a REPL: Read-Evaluate-Print-Loop.

4.1.1 REPL (Read Eval Print Loop)

There are many different types beyond what we have seen so far, such as dictionaries, lists, and sets. One handy way of using interactive Python is to get the type of a value using type():

>>> type(42)
<type 'int'>
>>> type('hello')
<type 'str'>
>>> type(3.14)
<type 'float'>

You can also ask for help about something using help():

>>> help(int)
>>> help(list)
>>> help(str)

Using help() opens up a help message within a pager. To navigate, you can use the space bar to go down a page, w to go up a page, the arrow keys to go up/down line-by-line, or q to exit.

4.1.2 Interpreter

Although the interactive mode provides a convenient tool to test things out, you will quickly see that for our class we want to use the Python interpreter from the command line. Let us assume the program is called prg.py. Once you have written it in that file, you can simply call it with

$ python prg.py

It is important to give programs meaningful names.

4.1.3 Python 3 Features in Python 2

In this course we want to be able to seamlessly switch between Python 2 and Python 3. Thus it is convenient from the start to use Python 3 syntax when it is also supported in Python 2. One of the most used functions is print, which in Python 3 has parentheses. To enable it in Python 2 you just need to import this function:

from __future__ import print_function, division

The first of these imports allows us to use the print function to output text to the screen, instead of the print statement, which Python 2 uses. This is simply a design decision that better reflects Python's underlying philosophy.

Other functions such as division also behave differently. Thus we use

from __future__ import division

This import makes sure that the division operator behaves in a way a newcomer to the language might find more intuitive. In Python 2, division / is floor division when the arguments are integers, meaning that

(5 / 2 == 2) is True

In Python 3, division / is floating point division, thus

(5 / 2 == 2.5) is True

4.2 EDITORS ☁

This section is meant to give an overview of the Python editing tools needed for this course. There are many other alternatives; however, we do recommend using PyCharm.

4.2.1 Pycharm

PyCharm is an Integrated Development Environment (IDE) used for programming in Python. It provides code analysis, a graphical debugger, an integrated unit tester, and integration with git.

Video: Pycharm (8:56)

4.2.2 Python in 45 minutes

An additional community video about the Python programming language that we found on the internet. Naturally there are many alternatives to this video, but it is probably a good start. It also uses PyCharm, which we recommend.

Video: Python (43:16, PyCharm)

How much of Python you want to understand is actually a bit up to you. While it is good to know classes and inheritance, you may be able to get away without using them for this class. However, we do recommend that you learn them.

PyCharm Installation:

Method 1: PyCharm installation on Ubuntu using umake
sudo add-apt-repository ppa:ubuntu-desktop/ubuntu-make
sudo apt-get update
sudo apt-get install ubuntu-make

Once umake is installed, use the next command to install the PyCharm community edition:

umake ide pycharm

If you want to remove PyCharm installed using the umake command, use this:

umake -r ide pycharm

Method 2: PyCharm installation on Ubuntu using a PPA

sudo add-apt-repository ppa:mystic-mirage/pycharm
sudo apt-get update
sudo apt-get install pycharm-community

PyCharm also has a Professional (paid) version, which can be installed using the following command:

sudo apt-get install pycharm

Once installed, go to your VM dashboard and search for PyCharm.

4.3 GOOGLE COLAB ☁

In this section we are going to introduce you to Google Colab and how to use it to run deep learning models.

4.3.1 Introduction to Google Colab

This video contains the introduction to Google Colab. In this section we will be learning how to start a Google Colab project.

4.3.2 Programming in Google Colab

In this video we will learn how to create a simple Colab notebook.

Required Installations:

pip install numpy

4.3.3 Benchmarking in Google Colab with Cloudmesh

In this video we learn how to do a basic benchmark with Cloudmesh tools. Cloudmesh StopWatch will be used in this tutorial.

Required Installations:

pip install numpy
pip install cloudmesh-installer
pip install cloudmesh-common
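The timing pattern that Cloudmesh StopWatch provides, i.e. starting and stopping named timers around a piece of work, can be sketched with the standard library alone. The following stand-in is our own illustration (the timers dictionary and the start/stop helpers are not the Cloudmesh API):

```python
import time

# Hypothetical stand-in for named timers, using only the standard library.
timers = {}

def start(name):
    # record the start time under the given name
    timers[name] = time.perf_counter()

def stop(name):
    # replace the start time with the elapsed duration
    timers[name] = time.perf_counter() - timers[name]

start("workload")
total = sum(x * x for x in range(100000))  # stand-in workload to time
stop("workload")
print("workload took %.4f s" % timers["workload"])
```

In a real benchmark you would wrap the code you care about, such as a numpy computation, between the start and stop calls.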
5 LANGUAGE

5.1 LANGUAGE ☁

5.1.1 Statements and Strings
Let us explore the syntax of Python while starting with a print statement:

print("Hello world from Python!")

This will print on the terminal:

Hello world from Python!

The print function was given a string to process. A string is a sequence of characters. A character can be alphabetic (A through Z, lower and upper case), numeric (any of the digits), white space (spaces, tabs, newlines, etc.), a syntactic directive (comma, colon, quotation, exclamation, etc.), and so forth. A string is just a sequence of characters and is typically indicated by surrounding the characters in double quotes.

Standard output is discussed in the Section Linux.

So, what happened when you pressed Enter? The interactive Python program read the line print("Hello world from Python!"), split it into the print statement and the "Hello world from Python!" string, and then executed the line, showing you the output.

5.1.2 Comments

Comments in Python are preceded by a #:

# This is a comment

5.1.3 Variables

You can store data in a variable to access it later. For instance:

hello = 'Hello world from Python!'
print(hello)

This will print again:

Hello world from Python!

5.1.4 Data Types

5.1.4.1 Booleans

A boolean is a value that can have the values True or False. You can combine booleans with boolean operators such as and and or:

print(True and True)    # True
print(True and False)   # False
print(False and False)  # False
print(True or True)     # True
print(True or False)    # True
print(False or False)   # False

5.1.4.2 Numbers

The interactive interpreter can also be used as a calculator. For instance, say we wanted to compute a multiple of 21:

print(21 * 2)  # 42

We saw here the print statement again. We passed in the result of the operation 21 * 2. An integer (or int) in Python is a numeric value without a fractional component (those are called floating point numbers, or float for short).

The mathematical operators compute the related mathematical operation on the provided numbers. Some operators are:

Operator  Function
*         multiplication
/         division
+         addition
-         subtraction
**        exponent

Exponentiation x ** y is x to the yth power:

print(2 ** 3)  # 8

You can combine floats and ints:

print(3.14 * 42 / 11 + 4 - 2)  # 13.9890909091
print(1 + 2 * 3 - 4 / 5.0)     # 6.2

Note that operator precedence is important. Using parentheses to affect the order of operations gives different results, as expected:

print(3.14 * (42 / 11) + 4 - 2)  # 11.42
print((1 + 2) * (3 - 4) / 5.0)   # -0.6

5.1.5 Module Management

A module allows you to logically organize your Python code. Grouping related code into a module makes the code easier to understand and use. A module is a Python object with arbitrarily named attributes that you can bind and reference. A module is a file consisting of Python code. A module can define functions, classes, and variables. A module can also include runnable code.

5.1.5.1 Import Statement

You can use any Python source file as a module by executing an import statement in some other Python source file. When the interpreter encounters an import statement, it imports the module if the module is present in the search path. A search path is a list of directories that the interpreter searches before importing a module. It is preferred to use a separate line for each import, such as:

import numpy
import matplotlib

5.1.5.2 The from … import Statement

Python's from statement lets you import specific attributes from a module into the current namespace. The from … import statement has the following syntax:

from datetime import datetime
5.1.6 DateTime in Python

The datetime module supplies classes for manipulating dates and times in both simple and complex ways. While date and time arithmetic is supported, the focus of the implementation is on efficient attribute extraction for output formatting and manipulation. For related functionality, see also the time and calendar modules.

The dateutil module offers a generic date/time string parser which is able to parse most known formats representing a date and/or time.

pandas is an open source Python library for data analysis that needs to be imported.

from datetime import datetime
from dateutil.parser import parse
import pandas as pd

Create a string variable with the class start time:

fall_start = '08-21-2017'

Convert the string to datetime format:

datetime.strptime(fall_start, '%m-%d-%Y')
# datetime.datetime(2017, 8, 21, 0, 0)

Create a list of strings as dates:

class_dates = [
    '8/25/2017',
    '9/1/2017',
    '9/8/2017',
    '9/15/2017',
    '9/22/2017',
    '9/29/2017']

Convert the class_dates strings into datetime format and save the list into variable a:

a = [datetime.strptime(x, '%m/%d/%Y') for x in class_dates]

Use parse() to attempt to auto-convert common string formats. The argument to the parser must be a string or character stream, not a list:

parse(fall_start)  # datetime.datetime(2017, 8, 21, 0, 0)

Use parse() on every element of the class_dates list:

[parse(x) for x in class_dates]
# [datetime.datetime(2017, 8, 25, 0, 0),
#  datetime.datetime(2017, 9, 1, 0, 0),
#  datetime.datetime(2017, 9, 8, 0, 0),
#  datetime.datetime(2017, 9, 15, 0, 0),
#  datetime.datetime(2017, 9, 22, 0, 0),
#  datetime.datetime(2017, 9, 29, 0, 0)]

Use parse, but designate that the day is first:

parse(fall_start, dayfirst=True)
# datetime.datetime(2017, 8, 21, 0, 0)

Create a dataframe. A DataFrame is a tabular data structure comprised of rows and columns, akin to a spreadsheet or database table. Think of a DataFrame as a group of Series objects that share an index (the column names).

import pandas as pd
data = {
    'dates': [
        '8/25/2017 18:47:05.069722',
        '9/1/2017 18:47:05.119994',
        '9/8/2017 18:47:05.178768',
        '9/15/2017 18:47:05.230071',
        '9/22/2017 18:47:05.230071',
        '9/29/2017 18:47:05.280592'],
    'complete': [1, 0, 1, 1, 0, 1]}
df = pd.DataFrame(
    data,
    columns=['dates', 'complete'])
print(df)
#                        dates  complete
# 0  8/25/2017 18:47:05.069722         1
# 1   9/1/2017 18:47:05.119994         0
# 2   9/8/2017 18:47:05.178768         1
# 3  9/15/2017 18:47:05.230071         1
# 4  9/22/2017 18:47:05.230071         0
# 5  9/29/2017 18:47:05.280592         1

Convert df['dates'] from string to datetime:

pd.to_datetime(df['dates'])
# 0   2017-08-25 18:47:05.069722
# 1   2017-09-01 18:47:05.119994
# 2   2017-09-08 18:47:05.178768
# 3   2017-09-15 18:47:05.230071
# 4   2017-09-22 18:47:05.230071
# 5   2017-09-29 18:47:05.280592
# Name: dates, dtype: datetime64[ns]

5.1.7 Control Statements

5.1.7.1 Comparison
Computer programs do not only execute instructions. Occasionally, a choice needs to be made. Such a choice is based on a condition. Python has several conditional operators:

Operator  Function
>         greater than
<         smaller than
==        equals
!=        is not

Conditions are always combined with variables. A program can make a choice using the if keyword. For example:

x = int(input("Guess x:"))
if x == 4:
    print('Correct!')

In this example, Correct! will only be printed if the variable x equals four. Python can also execute multiple conditions using the elif and else keywords:

x = int(input("Guess x:"))
if x == 4:
    print('Correct!')
elif abs(4 - x) == 1:
    print('Wrong, but close!')
else:
    print('Wrong, way off!')

5.1.7.2 Iteration

To repeat code, the for keyword can be used. For example, to display the numbers from 1 to 10, we could write something like this:

for i in range(1, 11):
    print(i)

The second argument to range, 11, is not inclusive, meaning that the loop will only get to 10 before it finishes. Python itself starts counting from 0, so this code will also work:

for i in range(0, 10):
    print(i + 1)

In fact, the range function defaults to a starting value of 0, so it is equivalent to:

for i in range(10):
    print(i + 1)

We can also nest loops inside each other:

for i in range(0, 10):
    for j in range(0, 10):
        print(i, ' ', j)

In this case we have two nested loops. The code will iterate over the entire coordinate range (0,0) to (9,9).

5.1.8 Datatypes

5.1.8.1 Lists

see: https://www.tutorialspoint.com/python/python_lists.htm

Lists in Python are ordered sequences of elements, where each element can be accessed using a 0-based index.

To define a list, you simply list its elements between square brackets [ ]:
names = [
    'Albert',
    'Jane',
    'Liz',
    'John',
    'Abby']
# access the first element of the list
names[0]
# 'Albert'
# access the third element of the list
names[2]
# 'Liz'

You can also use a negative index if you want to start counting elements from the end of the list. Thus, the last element has index -1, the second before last element has index -2, and so on:

# access the last element of the list
names[-1]
# 'Abby'
# access the second last element of the list
names[-2]
# 'John'

Python also allows you to take whole slices of the list by specifying a beginning and end of the slice separated by a colon:

# the middle elements, excluding first and last
names[1:-1]
# ['Jane', 'Liz', 'John']

As you can see from the example, the starting index in the slice is inclusive and the ending one is exclusive.

Python provides a variety of methods for manipulating the members of a list.

You can add elements with append:

names.append('Liz')
names
# ['Albert', 'Jane', 'Liz',
#  'John', 'Abby', 'Liz']

As you can see, the elements in a list need not be unique.

Merge two lists with extend:

names.extend(['Lindsay', 'Connor'])
names
# ['Albert', 'Jane', 'Liz', 'John',
#  'Abby', 'Liz', 'Lindsay', 'Connor']

Find the index of the first occurrence of an element with index:

names.index('Liz')  # 2

Remove elements by value with remove:

names.remove('Abby')
names
# ['Albert', 'Jane', 'Liz', 'John',
#  'Liz', 'Lindsay', 'Connor']

Remove elements by index with pop:

names.pop(1)
# 'Jane'
names
# ['Albert', 'Liz', 'John',
#  'Liz', 'Lindsay', 'Connor']

Notice that pop returns the element being removed, while remove does not.

If you are familiar with stacks from other programming languages, you can use insert and pop:

names.insert(0, 'Lincoln')
names
# ['Lincoln', 'Albert', 'Liz',
#  'John', 'Liz', 'Lindsay', 'Connor']
names.pop()
# 'Connor'
names
# ['Lincoln', 'Albert', 'Liz',
#  'John', 'Liz', 'Lindsay']

The Python documentation contains a full list of list operations.

To go back to the range function you used earlier: it simply creates a sequence of numbers. In Python 2 it returns a list directly; in Python 3 it returns a lazy range object that you can convert to a list:

list(range(10))
# [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
list(range(2, 10, 2))
# [2, 4, 6, 8]

5.1.8.2 Sets

Python lists can contain duplicates, as you saw previously:

names = ['Albert', 'Jane', 'Liz',
         'John', 'Abby', 'Liz']

When we do not want this to be the case, we can use a set:

unique_names = set(names)
unique_names
# set(['Jane', 'John', 'Albert', 'Liz', 'Abby'])

Keep in mind that a set is an unordered collection of objects; thus we cannot access them by index:

unique_names[0]
# Traceback (most recent call last):
#   File "<stdin>", line 1, in <module>
# TypeError: 'set' object does not support indexing

However, we can convert a set to a list easily:

unique_names = list(unique_names)
unique_names
# ['Jane', 'John', 'Albert', 'Liz', 'Abby']
unique_names[0]
# 'Jane'

Notice that in this case the order of elements in the new list matches the order in which the elements were displayed when we created the set. We had set(['Jane', 'John', 'Albert', 'Liz', 'Abby']) and now we have ['Jane', 'John', 'Albert', 'Liz', 'Abby'].

You should not assume this is the case in general. That is, do not make any assumptions about the order of elements in a set when it is converted to any type of sequential data structure.

You can change a set's contents using the add, remove and update methods, which correspond to the append, remove and extend methods in a list. In addition to these, set objects support the operations you may be familiar with from mathematical sets: union, intersection, and difference, as well as operations to check containment. You can read about this in the Python documentation for sets.
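These mathematical set operations can be written with operators; the sets below are illustrative examples of our own:

```python
odds = {1, 3, 5, 7, 9}
primes = {2, 3, 5, 7}

print(odds | primes)  # union: {1, 2, 3, 5, 7, 9}
print(odds & primes)  # intersection: {3, 5, 7}
print(odds - primes)  # difference: {1, 9}
print(3 in primes)    # containment check: True

# add, remove, and update modify the set in place
primes.add(11)
primes.remove(2)
primes.update({13, 17})
print(primes)  # contains 3, 5, 7, 11, 13, 17
```

The operators also exist as methods (union(), intersection(), difference()), which additionally accept any iterable, not just sets.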
5.1.8.3 Removal and Testing for Membership in Sets

One important advantage of a set over a list is that access to elements is fast. If you are familiar with different data structures from a Computer Science class: the Python list is implemented by an array, while the set is implemented by a hash table.

We will demonstrate this with an example. Let us say we have a list and a set of the same number of elements (approximately 100 thousand):

import sys, random, timeit
nums_set = set([random.randint(0, sys.maxsize) for _ in range(10**5)])
nums_list = list(nums_set)
len(nums_set)
# 100000

We will use the timeit Python module to time 100 operations that test for the existence of a member in either the list or the set:

timeit.timeit('random.randint(0, sys.maxsize) in nums',
    setup='import random, sys; nums=%s' % str(nums_set), number=100)
# 0.0004038810729980469
timeit.timeit('random.randint(0, sys.maxsize) in nums',
    setup='import random, sys; nums=%s' % str(nums_list), number=100)
# 0.398054122924804

The exact duration of the operations on your system will be different, but the takeaway will be the same: searching for an element in a set is orders of magnitude faster than in a list. This is important to keep in mind when you work with large amounts of data.

5.1.8.4 Dictionaries
One of the very important data structures in Python is the dictionary, also referred to as dict.

A dictionary represents a key-value store:

person = {
    'Name': 'Albert',
    'Age': 100,
    'Class': 'Scientist'
}
print("person['Name']:", person['Name'])
# person['Name']: Albert
print("person['Age']:", person['Age'])
# person['Age']: 100

A convenient form to print by named attributes is:

print("{Name} {Age}".format(**person))
# Albert 100

This form of printing with the format statement and a reference to the data increases the readability of the print statements.

You can delete elements with the following commands:

del person['Name']  # remove entry with key 'Name'
# person
# {'Age': 100, 'Class': 'Scientist'}
person.clear()      # remove all entries in dict
# person
# {}
del person          # delete entire dictionary
# person
# Traceback (most recent call last):
#   File "<stdin>", line 1, in <module>
# NameError: name 'person' is not defined

You can iterate over a dict:

person = {
    'Name': 'Albert',
    'Age': 100,
    'Class': 'Scientist'
}
for item in person:
    print(item, person[item])
# Age 100
# Name Albert
# Class Scientist

5.1.8.5 Dictionary Keys and Values

You can retrieve both the keys and values of a dictionary using the keys() and values() methods of the dictionary, respectively:

person.keys()    # ['Age', 'Name', 'Class']
person.values()  # [100, 'Albert', 'Scientist']

Both methods return lists. Notice, however, that the order in which the elements appear in the returned lists (Age, Name, Class) is different from the order in which we listed the elements when we declared the dictionary initially (Name, Age, Class). It is important to keep this in mind:

You cannot make any assumptions about the order in which the elements of a dictionary will be returned by the keys() and values() methods.

However, you can assume that if you call keys() and values() in sequence, the order of elements will at least correspond in both methods. In the example, Age corresponds to 100, Name to Albert, and Class to Scientist, and you will observe the same correspondence in general as long as keys() and values() are called one right after the other.
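A safer way to keep keys and values paired is to retrieve them together with items(). A minimal example follows; note that in Python 3 these methods return view objects rather than lists, and since Python 3.7 dictionaries preserve insertion order:

```python
person = {
    'Name': 'Albert',
    'Age': 100,
    'Class': 'Scientist'
}

# items() yields (key, value) pairs together,
# so the pairing can never get out of sync.
for key, value in person.items():
    print(key, value)

# In Python 3, keys(), values(), and items() return view objects;
# wrap them in list() if you need an actual list.
print(list(person.keys()))  # ['Name', 'Age', 'Class']
```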
5.1.8.6 Counting with Dictionaries

One application of dictionaries that frequently comes up is counting the elements in a sequence. For example, say we have a sequence of coin flips:

import random
coin_flips = [
    random.choice(['heads', 'tails']) for _ in range(10)
]
# coin_flips
# ['heads', 'tails', 'heads',
#  'tails', 'heads', 'heads',
#  'tails', 'heads', 'heads', 'heads']

The actual list coin_flips will likely be different when you execute this on your computer, since the outcomes of the flips are random.

To compute the probabilities of heads and tails, we could count how many heads and tails we have in the list:

counts = {'heads': 0, 'tails': 0}
for outcome in coin_flips:
    assert outcome in counts
    counts[outcome] += 1
print('Probability of heads: %.2f' % (counts['heads'] / len(coin_flips)))
# Probability of heads: 0.70
print('Probability of tails: %.2f' % (counts['tails'] / sum(counts.values())))
# Probability of tails: 0.30

In addition to how we use the dictionary counts to count the elements of coin_flips, notice a couple of things about this example:

1. We used the assert outcome in counts statement. The assert statement in Python allows you to easily insert debugging statements in your code to help you discover errors more quickly. assert statements are executed whenever the internal Python __debug__ variable is set to True, which is always the case unless you start Python with the -O option, which runs optimized Python.

2. When we computed the probability of tails, we used the built-in sum function, which allowed us to quickly find the total number of coin flips. sum is one of many built-in functions you can read about in the Python documentation.
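The standard library also offers a shortcut for this counting pattern: collections.Counter builds the tally in one step, and missing keys default to 0:

```python
import random
from collections import Counter

coin_flips = [random.choice(['heads', 'tails']) for _ in range(10)]

# Counter counts every distinct element of the sequence at once
counts = Counter(coin_flips)
print(counts['heads'], counts['tails'])
print('Probability of heads: %.2f' % (counts['heads'] / len(coin_flips)))
```

Counter also provides conveniences such as most_common(), which returns the elements sorted by count.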
5.1.9 Functions

You can reuse code by putting it inside a function that you can call in other parts of your programs. Functions are also a good way of grouping code that logically belongs together in one coherent whole. A function has a unique name in the program. Once you call a function, it will execute its body, which consists of one or more lines of code:

def check_triangle(a, b, c):
    return \
        a < b + c and a > abs(b - c) and \
        b < a + c and b > abs(a - c) and \
        c < a + b and c > abs(a - b)

print(check_triangle(4, 5, 6))

The def keyword tells Python we are defining a function. As part of the definition, we have the function name, check_triangle, and the parameters of the function – variables that will be populated when the function is called.

We call the function with arguments 4, 5 and 6, which are passed in order into the parameters a, b and c. A function can be called several times with varying parameters. There is no limit to the number of function calls.

It is also possible to store the output of a function in a variable, so it can be reused:

def check_triangle(a, b, c):
    return \
        a < b + c and a > abs(b - c) and \
        b < a + c and b > abs(a - c) and \
        c < a + b and c > abs(a - b)

result = check_triangle(4, 5, 6)
print(result)

5.1.10 Classes

A class is an encapsulation of data and the processes that work on the data. The data is represented in member variables, and the processes are defined in the methods of the class (methods are functions inside the class). For example, let us see how to define a Triangle class:

class Triangle(object):
    def __init__(self, length, width,
                 height, angle1, angle2, angle3):
        if not self._sides_ok(length, width, height):
            print('The sides of the triangle are invalid.')
        elif not self._angles_ok(angle1, angle2, angle3):
            print('The angles of the triangle are invalid.')
        self._length = length
        self._width = width
        self._height = height
        self._angle1 = angle1
        self._angle2 = angle2
        self._angle3 = angle3

    def _sides_ok(self, a, b, c):
        return \
            a < b + c and a > abs(b - c) and \
            b < a + c and b > abs(a - c) and \
            c < a + b and c > abs(a - b)

    def _angles_ok(self, a, b, c):
        return a + b + c == 180

triangle = Triangle(4, 5, 6, 35, 65, 80)

Python has full object-oriented programming (OOP) capabilities; however, we cannot cover all of them in this section, so if you need more information please refer to the Python docs on classes and OOP.

5.1.11 Modules

Now write this simple program and save it:
print("Hello world!")
As a check, make sure the file contains the expected contents on the command line:

$ cat hello.py
print("Hello world!")

To execute your program, pass the file as a parameter to the python command:

$ python hello.py
Hello world!

Files in which Python code is stored are called modules. You can execute a Python module from the command line like you just did, or you can import it in other Python code using the import statement.

Let us write a more involved Python program that will receive as input the lengths of the three sides of a triangle, and will output whether they define a valid triangle. A triangle is valid if the length of each side is less than the sum of the lengths of the other two sides and greater than the difference of the lengths of the other two sides:

"""Usage: check_triangle.py [-h] LENGTH WIDTH HEIGHT

Check if a triangle is valid.

Arguments:
  LENGTH  The length of the triangle.
  WIDTH   The width of the triangle.
  HEIGHT  The height of the triangle.

Options:
  -h --help
"""
from __future__ import print_function, division
from docopt import docopt

if __name__ == '__main__':
    arguments = docopt(__doc__)
    a, b, c = int(arguments['LENGTH']), \
        int(arguments['WIDTH']), \
        int(arguments['HEIGHT'])
    valid_triangle = \
        a < b + c and a > abs(b - c) and \
        b < a + c and b > abs(a - c) and \
        c < a + b and c > abs(a - b)
    print('Triangle with sides %d, %d and %d is valid: %r' % (
        a, b, c, valid_triangle
    ))

Assuming we save the program in a file called check_triangle.py, we can run it like so:

$ python check_triangle.py 4 5 6
Triangle with sides 4, 5 and 6 is valid: True

Let us break this down a bit.

1. We are importing the print_function and division features from Python 3 like we did earlier in this section. It is a good idea to always include these in your programs.

2. We have defined a boolean expression that tells us if the sides that were input define a valid triangle. The result of the expression is stored in the valid_triangle variable; it is True if the sides are valid, and False otherwise.

3. We have used the backslash symbol \ to format our code nicely. The backslash simply indicates that the current line is being continued on the next line.

4. When we run the program, we do the check if __name__ == '__main__'. __name__ is an internal Python variable that allows us to tell whether the current file is being run from the command line (value '__main__'), or is being imported by a module (then the value will be the name of the module). Thus, with this statement we are just making sure the program is being run from the command line.

5. We are using the docopt module to handle command line arguments. The advantage of using this module is that it generates a usage help statement for the program and enforces command line arguments automatically. All of this is done by parsing the docstring at the top of the file.

6. In the print function, we are using Python's string formatting capabilities to insert values into the string we are displaying.
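The %-style formatting used in point 6 is the oldest of Python's formatting mechanisms; the same line can also be written with str.format() or, since Python 3.6, with an f-string:

```python
# sample values standing in for the parsed command line arguments
a, b, c, valid_triangle = 4, 5, 6, True

# %-style, as used in check_triangle.py
print('Triangle with sides %d, %d and %d is valid: %r'
      % (a, b, c, valid_triangle))

# str.format()
print('Triangle with sides {}, {} and {} is valid: {!r}'
      .format(a, b, c, valid_triangle))

# f-string (Python 3.6+): expressions are embedded directly
print(f'Triangle with sides {a}, {b} and {c} is valid: {valid_triangle!r}')
```

All three lines produce identical output; f-strings are generally the most readable choice in modern Python 3 code.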
5.1.12 Lambda Expressions

As opposed to normal functions in Python, which are defined using the def keyword, lambda functions in Python are anonymous functions that do not have a name and are defined using the lambda keyword. The generic syntax of a lambda function is of the form lambda arguments: expression, as shown in the following example:

greeter = lambda x: print('Hello %s!' % x)
greeter('Albert')

As you could probably guess, the result is:

Hello Albert!

Now consider the following example:

power2 = lambda x: x ** 2

The power2 function defined in the expression is equivalent to the following definition:

def power2(x):
    return x ** 2

Lambda functions are useful when you need a function for a short period of time. Note that they can also be very useful when passed as an argument to built-in functions that take a function as an argument, e.g. filter() and map(). In the next example we show how a lambda function can be combined with the filter function. Consider the array all_names, which contains five words that rhyme. We want to filter out the words that contain the word name. To achieve this, we pass the function lambda x: 'name' in x as the first argument. This lambda function returns True if the word name exists as a substring in the string x. The second argument of the filter function is the array of names, i.e. all_names.

all_names = ['surname', 'rename', 'nickname', 'acclaims', 'defame']
filtered_names = list(filter(lambda x: 'name' in x, all_names))
print(filtered_names)
# ['surname', 'rename', 'nickname']

As you can see, the names are successfully filtered as we expected.

In Python 3, the filter function returns a filter object, an iterator that gets lazily evaluated. This means we can neither access the elements of the filter object with an index nor use len() to find its length.

list_a = [1, 2, 3, 4, 5]
filter_obj = filter(lambda x: x % 2 == 0, list_a)
# Convert the filter object to a list
even_num = list(filter_obj)
print(even_num)
# Output: [2, 4]

To summarize: in Python we can have a small, usually single-line, anonymous function called a lambda function, which can have any number of arguments just like a normal function, but with only one expression and no return statement. The result of this expression can be applied to a value.

Basic syntax:

lambda arguments: expression

For an example, consider a function in Python:

def multiply(a, b):
    return a * b

# call the function
multiply(3, 5)  # outputs: 15

The same function can be written as a lambda function. This function, named multiply, has 2 arguments and returns their multiplication.
The lambda equivalent for this function would be:

multiply = lambda a, b: a * b
print(multiply(3, 5))
# outputs: 15

Here a and b are the 2 arguments and a * b is the expression whose value is returned as an output.

Also, we do not need to assign a lambda function to a variable:

(lambda a, b: a * b)(3, 5)

Lambda functions are mostly passed as a parameter to a function that expects a function object, like map or filter.

5.1.12.1 map

The basic syntax of the map function is:

map(function_object, iterable1, iterable2, ...)

The map function expects a function object and any number of iterables, like a list or dictionary. It executes the function_object for each element in the sequence and returns a list of the elements modified by the function object.

Example:

def multiply(x):
    return x * 2

map(multiply, [2, 4, 6, 8])
# Output [4, 8, 12, 16]

If we want to write the same function using a lambda:

map(lambda x: x * 2, [2, 4, 6, 8])
# Output [4, 8, 12, 16]

5.1.12.2 dictionary

Now, let us see how we can iterate over a dictionary using map and lambda. Let us say we have a dictionary object:

dict_movies = [
    {'movie': 'avengers', 'comic': 'marvel'},
    {'movie': 'superman', 'comic': 'dc'}]

We can iterate over this dictionary and read its elements using map and lambda functions in the following way:

map(lambda x: x['movie'], dict_movies)  # Output: ['avengers', 'superman']
map(lambda x: x['comic'], dict_movies)  # Output: ['marvel', 'dc']
map(lambda x: x['movie'] == "avengers", dict_movies)
# Output: [True, False]

In Python 3, the map function returns an iterator or map object that gets lazily evaluated: we can neither access its elements by index nor use len() to find its length. We can force-convert the map output, i.e. the map object, to a list as shown next:

map_output = map(lambda x: x * 2, [1, 2, 3, 4])
print(map_output)
# Output: map object: <map object at 0x04D6BAB0>
list_map_output = list(map_output)
print(list_map_output)  # Output: [2, 4, 6, 8]
5.1.13 Iterators

In Python, the iterator protocol is defined using two methods: __iter__() and __next__(). The former returns the iterator object and the latter returns the next element of a sequence. Some advantages of iterators are as follows:

- Readability
- Support for sequences of infinite length
- Saving resources

There are several built-in objects in Python which implement the iterator protocol, e.g. string, list, dictionary. In the following example, we create a new class that follows the iterator protocol. We then use the class to generate log2 of numbers:

from math import log2

class LogTwo:
    "Implements an iterator of log two"

    def __init__(self, last=0):
        self.last = last

    def __iter__(self):
        self.current_num = 1
        return self

    def __next__(self):
        if self.current_num <= self.last:
            result = log2(self.current_num)
            self.current_num += 1
            return result
        else:
            raise StopIteration

L = LogTwo(5)
i = iter(L)
print(next(i))
print(next(i))
print(next(i))
print(next(i))

As you can see, we first create an instance of the class and obtain an iterator from it with iter(), assigning it to a variable called i. Then, by calling the next() function four times, we get the following output:

$ python iterator.py
0.0
1.0
1.584962500721156
2.0

As you probably noticed, the lines are log2() of 1, 2, 3, 4 respectively.
defmultiplyBy10(numbers):
result=[]
foriinnumbers:
result.append(i*10)
returnresult
new_numbers=multiplyBy10([1,2,3,4,5])
printnew_numbers#Output:[10,20,30,40,50]
defmultiplyBy10(numbers):
foriinnumbers:
yield(i*10)
new_numbers=multiplyBy10([1,2,3,4,5])
InGenerators,weuseyield() function inplaceof return().Sowhenwe try toprintnew_numberslistnow,itjustprintsGeneratorsobject.ThereasonforthisisbecauseGeneratorsdontholdanyvalue inmemory, ityieldsone resultatatime.Soessentiallyitisjustwaitingforustoaskforthenextresult.Toprintthenextresultwecanjustsayprintnext(new_numbers),sohowitisworkingisitsreadingthefirstvalueandsquaringitandyieldingoutvalue1.Alsointhiscasewecanjustprintnext(new_numbers)5timestoprintallnumbersandifwedoitfor6thtimethenwewillgetanerrorStopIterationwhichmeannsGeneratorshasexausteditslimitandithasno6thelementtoprint.
5.1.14.2Generatorsusingforloop
Ifwenowwanttoprintthecompletelistofsquaredvaluesthenwecanjustdo:
Theoutputwillbe:
5.1.14.3GeneratorswithListComprehension
Python has something called List Comprehension, ifwe use this thenwe canreplacethecompletefunctiondefwithjust:
Here the point to note is square brackets [] in line 1 is very important. Ifwechangeitto()thenagainwewillstartgettingGeneratorsobject.
printnew_numbers#Output:Generatorsobject
printnext(new_numbers)#Output:1
def multiplyBy10(numbers):
    for i in numbers:
        yield i * 10

new_numbers = multiplyBy10([1, 2, 3, 4, 5])

for num in new_numbers:
    print(num)
10
20
30
40
50
new_numbers = [x * 10 for x in [1, 2, 3, 4, 5]]
print(new_numbers)  # Output: [10, 20, 30, 40, 50]

new_numbers = (x * 10 for x in [1, 2, 3, 4, 5])
print(new_numbers)  # Output: generator object
We can get the individual elements again from a generator if we do a for loop over new_numbers, like we did previously. Alternatively, we can convert it into a list and then print it.
However, if we convert it into a list, we lose the performance benefit, which we will see next.
5.1.14.4 Why use Generators?
Generators perform better because they do not hold all the values in memory. With the small examples we provide here it is not a big deal, since we are dealing with a small amount of data. But consider a scenario where the data set contains millions of records. If we try to convert millions of data elements into a list, that will definitely make an impact on memory and performance, because everything will be held in memory.
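A quick way to see the memory difference is sys.getsizeof; this small sketch (the sizes in the comments are approximate) compares a list with an equivalent generator:

```python
import sys

# The list stores all one million results at once ...
numbers_list = [i * 10 for i in range(1000000)]

# ... while the equivalent generator only stores its current iteration state.
numbers_generator = (i * 10 for i in range(1000000))

print(sys.getsizeof(numbers_list))       # several megabytes
print(sys.getsizeof(numbers_generator))  # on the order of a hundred bytes
```

Note that sys.getsizeof reports only the container, not the integers it references, but the difference in scale is already obvious.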
Let us see an example of how generators help with performance. First, without generators, a normal function takes one million records and returns the result (people) for one million entries.
new_numbers = (x * 10 for x in [1, 2, 3, 4, 5])
print(list(new_numbers))  # Output: [10, 20, 30, 40, 50]
import random
import time

names = ['John', 'Jack', 'Adam', 'Steve', 'Rick']
majors = ['Math',
          'CompScience',
          'Arts',
          'Business',
          'Economics']

# prints the memory before we run the function
memory = mem_profile.memory_usage_resource()
print(f'Memory (Before): {memory} Mb')

def people_list(people):
    result = []
    for i in range(people):
        person = {
            'id': i,
            'name': random.choice(names),
            'major': random.choice(majors)
        }
        result.append(person)
    return result

t1 = time.clock()
people = people_list(10000000)
t2 = time.clock()

# prints the memory after we run the function
memory = mem_profile.memory_usage_resource()
print(f'Memory (After): {memory} Mb')
print('Took {time} seconds'.format(time=t2 - t1))
# Output
Memory (Before): 15 Mb
These are just approximate values for comparison with the next execution, but if we run this code we will see a serious consumption of memory, and a good amount of time taken.
Now, after running the same code using a generator, we will see a significant performance boost, with almost 0 seconds. The reason behind this is that in the case of a generator we do not keep anything in memory, so the system just reads one record at a time and yields it.
Memory (After): 318 Mb
Took 1.2 seconds
import random
import time

names = ['John', 'Jack', 'Adam', 'Steve', 'Rick']
majors = ['Math',
          'CompScience',
          'Arts',
          'Business',
          'Economics']

# prints the memory before we run the function
memory = mem_profile.memory_usage_resource()
print(f'Memory (Before): {memory} Mb')

def people_generator(people):
    for i in range(people):
        person = {
            'id': i,
            'name': random.choice(names),
            'major': random.choice(majors)
        }
        yield person

t1 = time.clock()
people = people_generator(10000000)
t2 = time.clock()

# prints the memory after we run the function
memory = mem_profile.memory_usage_resource()
print(f'Memory (After): {memory} Mb')
print('Took {time} seconds'.format(time=t2 - t1))
# Output
Memory (Before): 15 Mb
Memory (After): 15 Mb
Took 0.01 seconds
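Because memory_usage_resource is not part of the standard library, the same comparison can also be reproduced in a self-contained way with the tracemalloc module. The sketch below uses a smaller record count and a reduced person dict so it runs quickly; the exact numbers will differ from the output above:

```python
import random
import tracemalloc

names = ['John', 'Jack', 'Adam', 'Steve', 'Rick']

def people_list(people):
    return [{'id': i, 'name': random.choice(names)} for i in range(people)]

def people_generator(people):
    for i in range(people):
        yield {'id': i, 'name': random.choice(names)}

# Peak memory while building the full list.
tracemalloc.start()
people = people_list(100000)
list_peak = tracemalloc.get_traced_memory()[1]
tracemalloc.stop()

# Peak memory for merely creating the generator (nothing is produced yet).
tracemalloc.start()
people = people_generator(100000)
generator_peak = tracemalloc.get_traced_memory()[1]
tracemalloc.stop()

print(f'list peak:      {list_peak} bytes')
print(f'generator peak: {generator_peak} bytes')
```

The list version allocates memory for every record at once, while the generator version allocates almost nothing until values are actually requested.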
6 CLOUDMESH
6.1 INTRODUCTION ☁
Learning Objectives
Introduction to the cloudmesh API
Using cmd5 via cms
Introduction to the cloudmesh convenience API for output, dotdict, shell, stopwatch, benchmark management
Creating your own cms commands
Cloudmesh configuration file
Cloudmesh inventory
In this chapter we like to introduce you to cloudmesh, which provides you with a number of convenient methods to interface with the local system, but also with cloud services. We will start by focusing on some simple APIs and then gradually introduce the cloudmesh shell, which not only provides a shell, but also a command line interface, so you can use cloudmesh from a terminal. This dual ability is quite useful, as we can write cloudmesh scripts, but can also invoke the functionality from the terminal. This is an important distinction from other tools that only allow command line interfaces.
Moreover, we also show you that it is easy to create new commands and add them dynamically to the cloudmesh shell via simple pip installs.
Cloudmesh is an evolving project and you have the opportunity to improve it if you see some features missing.
The manual of cloudmesh can be found at
https://cloudmesh.github.io/cloudmesh-manual
The API documentation is located at
https://cloudmesh.github.io/cloudmesh-manual/api/index.html#cloudmesh-api
We will initially focus on a subset of this functionality.
6.2 INSTALLATION ☁
The installation of cloudmesh is simple and can technically be done via pip by a user. However, you are not a user, you are a developer. Cloudmesh is distributed in different topical repositories, and in order for developers to easily interact with them we have written a convenient cloudmesh-installer program.
As a developer you must also use a python virtual environment to avoid affecting your system-wide python installation. This can be achieved while using Python 3 from python.org or via conda. We do recommend that you use python.org, as this is the vanilla python that most developers in the world use. Conda is often used by users of python if they do not need bleeding-edge but older prepackaged python tools and libraries.
6.2.1 Prerequisite
We require you to create a python virtual environment and activate it. How to do this was discussed in Section 3.1. Please create the ENV3 environment and activate it.
6.2.2 Basic Install
Cloudmesh can install for developers a number of bundles. A bundle is a set of git repositories that are needed for a particular install. For us, we are mostly interested in the bundles cms, cloud, and storage. We will introduce you to other bundles throughout this documentation.
If you like to find out more about the details, you can look at cloudmesh-installer, which will be regularly updated.
To make use of the bundles and the easy installation for developers, please install cloudmesh-installer via pip, but make sure you do this in a python virtual env as discussed previously. If not, you may impact your system negatively. Please note that we are not responsible for fixing your computer. Naturally, you can also use a virtual machine if you prefer. It is also important that we create a uniform development environment. In our case we create an empty directory called cm in which we place the bundles.
To see the bundles you can use
We will start with the basic cloudmesh functionality at this time and only install the shell and some common APIs.
These commands download and install the cloudmesh shell into your environment. It is important that you use the -e flag.
To see if it works you can use the command
You will see an output. If this does not work for you, and you cannot figure out the issue, please contact us so we can identify what went wrong.
For more information, please visit our Installation Instructions for Developers.
6.3 OUTPUT ☁
Cloudmesh provides a number of convenient APIs to make output easier or more fanciful.
These APIs include
Console
Banner
Heading
$ mkdir cm
$ cd cm
$ pip install cloudmesh-installer
$ cloudmesh-installer bundles
$ cloudmesh-installer git clone cms
$ cloudmesh-installer install cms -e
$ cms help
VERBOSE
6.3.1 Console
Print is the usual function to output to the terminal. However, often we like to have colored output that helps us in the notification to the user. For this reason we have a simple Console class that has several built-in features. You can even switch and define your own color schemes.
In the case of the error message we also have convenient flags that allow us to include the traceback in the output.
The prefix can be switched on and off with the prefix flag, while the traceflag switches on and off whether the trace should be included.
The verbosity of the output is controlled via variables that are stored in the ~/.cloudmesh directory.
For more features, see API: Console
6.3.2 Banner
In case you need a banner, you can do this with
For more features, see API: Banner
from cloudmesh.common.console import Console

msg = "my message"

Console.ok(msg)     # prints a green message
Console.error(msg)  # prints a red message preceded by ERROR
Console.msg(msg)    # prints a regular black message

Console.error(msg, prefix=True, traceflag=True)
from cloudmesh.common.variables import Variables

variables = Variables()
variables['debug'] = True
variables['trace'] = True
variables['verbose'] = 10

from cloudmesh.common.util import banner

banner("my text")
6.3.3 Heading
A particularly useful function is HEADING(), which prints the method name.
The invocation of the HEADING() function in doit prints a banner with the name information. The reason we did not implement it as a decorator is that you can place the HEADING() function at an arbitrary location in the method body.
For more features, see API: Heading
6.3.4 VERBOSE
VERBOSE is a very useful method allowing you to print a dictionary. Not only will it print the dict, but it will also provide you with the information in which file it is used and at which line number. It will even print the name of the dict that you use in your code.
To use this you will have to enable the debugging methods for cloudmesh as discussed in Section 6.3.1.
For more features, please see VERBOSE
6.3.5 Using print and pprint
In many cases it may be sufficient to use print and pprint for debugging. However, as the code gets big and you may forget where you placed print statements, or the print statements may have been added by others, we recommend that you use the VERBOSE function. If you use print or pprint we recommend using a unique prefix, such as:
from cloudmesh.common.util import HEADING

class Example(object):

    def doit(self):
        HEADING()
        print("Hello")

from cloudmesh.common.debug import VERBOSE

m = {"key": "value"}
VERBOSE(m)

from pprint import pprint
6.4 DICTIONARIES ☁
6.4.1 Dotdict
For simple dictionaries we sometimes like to simplify the notation with a . instead of using the []:
You can achieve this with dotdict.
Now you can either call
or
This is especially useful in if conditions, as it may be easier to read and write
and is the same as
For more features, see API: dotdict
6.4.2 FlatDict
d = {"sample": "value"}

print("MYDEBUG:")
pprint(d)
# or with print
print("MYDEBUG:", d)
from cloudmesh.common.dotdict import dotdict

data = {
    "name": "Gregor"
}

data = dotdict(data)

data["name"]
data.name
if data.name == "Gregor":
    print("this is quite readable")

if data["name"] == "Gregor":
    print("this is quite readable")
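If you are curious how such attribute access can work, a minimal sketch of the same idea is a small dict subclass (this is an illustration, not the cloudmesh implementation):

```python
class DotDict(dict):
    """A dict whose keys can also be read as attributes."""

    def __getattr__(self, key):
        # Called only when normal attribute lookup fails,
        # so regular dict methods keep working.
        try:
            return self[key]
        except KeyError:
            raise AttributeError(key)

data = DotDict({"name": "Gregor"})

print(data["name"])  # regular key access: Gregor
print(data.name)     # attribute access:   Gregor
```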
In some cases it is useful to be able to flatten out dictionaries that contain dicts within dicts. For this we can use FlatDict.
This will be converted to a dict with the following structure.
With sep you can change the separator between the nested dict attributes. For more features, see API: dotdict
6.4.3 Printing Dicts
In case we want to print dicts and lists of dicts in various formats, we have included a simple Printer that can print a dict in yaml, json, table, and csv format.
The function can even guess from the passed parameters what the input format is and use the appropriate internal function.
A common example is
from cloudmesh.common.Flatdict import FlatDict

data = {
    "name": "Gregor",
    "address": {
        "city": "Bloomington",
        "state": "IN"
    }
}

flat = FlatDict(data, sep=".")

flat = {
    "name": "Gregor",
    "address.city": "Bloomington",
    "address.state": "IN"
}
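The flattening itself is easy to sketch in plain Python. The recursive helper below illustrates the idea behind FlatDict; it is not the cloudmesh implementation:

```python
def flatten(d, sep=".", prefix=""):
    """Flatten nested dicts into a single-level dict with joined keys."""
    flat = {}
    for key, value in d.items():
        name = prefix + sep + key if prefix else key
        if isinstance(value, dict):
            # Recurse into nested dicts, extending the key path.
            flat.update(flatten(value, sep=sep, prefix=name))
        else:
            flat[name] = value
    return flat

data = {"name": "Gregor",
        "address": {"city": "Bloomington", "state": "IN"}}

print(flatten(data))
```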
from pprint import pprint
from cloudmesh.common.Printer import Printer

data = [
    {
        "name": "Gregor",
        "address": {
            "street": "Funny Lane 11",
            "city": "Cloudville"
        }
    },
    {
        "name": "Albert",
        "address": {
            "street": "Memory Lane 1901",
            "city": "Cloudnine"
        }
    }
]
For more features, see API: Printer
More examples are available in the source code as tests.
6.5 SHELL ☁
Python provides a sophisticated method for starting background processes. However, in many cases it is quite complex to interact with it. It also does not provide convenient wrappers that we can use to start processes in a pythonic fashion. For this reason we have written a primitive Shell class that provides just enough functionality to be useful in many cases.
Let us review some examples where result is set to the output of the command being executed.
For many common commands, we provide built-in functions. For example:
The list includes the following functions (naturally, the commands must be available on your OS). If a shell command is not available on your OS, please help us improve the code to either provide functions that work on your OS, or develop with us platform-independent functionality for a subset of the shell command functionality
pprint(data)

table = Printer.flatwrite(data,
                          sort_keys=["name"],
                          order=["name", "address.street", "address.city"],
                          header=["Name", "Street", "City"],
                          output='table')
print(table)
from cloudmesh.common.Shell import Shell

result = Shell.execute('pwd')
print(result)

result = Shell.execute('ls', ["-l", "-a"])
print(result)

result = Shell.execute('ls', "-l -a")
print(result)

result = Shell.ls("-aux")
print(result)

result = Shell.ls("-a", "-u", "-x")
print(result)

result = Shell.pwd()
print(result)
that we may benefit from.
VBoxManage(cls,*args)
bash(cls,*args)
blockdiag(cls,*args)
brew(cls,*args)
cat(cls,*args)
check_output(cls,*args,**kwargs)
check_python(cls)
cm(cls,*args)
cms(cls,*args)
command_exists(cls,name)
dialog(cls,*args)
edit(filename)
execute(cls,*args)
fgrep(cls,*args)
find_cygwin_executables(cls)
find_lines_with(cls,lines,what)
get_python(cls)
git(cls,*args)
grep(cls,*args)
head(cls,*args)
install(cls,name)
keystone(cls,*args)
kill(cls,*args)
live(cls,command,cwd=None)
ls(cls,*args)
mkdir(cls,directory)
mongod(cls,*args)
nosetests(cls,*args)
nova(cls,*args)
operating_system(cls)
pandoc(cls,*args)
ping(cls,host=None,count=1)
pip(cls,*args)
ps(cls,*args)
pwd(cls,*args)
rackdiag(cls,*args)
remove_line_with(cls,lines,what)
rm(cls,*args)
rsync(cls,*args)
scp(cls,*args)
sh(cls,*args)
sort(cls,*args)
ssh(cls,*args)
sudo(cls,*args)
tail(cls,*args)
terminal(cls,command='pwd')
terminal_type(cls)
unzip(cls,source_filename,dest_dir)
vagrant(cls,*args)
version(cls,name)
which(cls,command)
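Under the hood such wrappers start a child process. With only the standard library, a similar effect can be sketched with subprocess (this is an illustration, not the cloudmesh code):

```python
import subprocess
import sys

# Run a command and capture its output as a string,
# similar in spirit to Shell.execute. We run the python
# interpreter itself so the example works on any OS.
result = subprocess.run([sys.executable, "-c", "print('hello')"],
                        capture_output=True, text=True, check=True)

print(result.stdout.strip())  # hello
```

check=True raises CalledProcessError if the command fails, which is a convenient way to surface errors from the child process.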
For more features, please see Shell
6.6 STOPWATCH ☁
Often you find yourself in a situation where you like to measure the time between two events. We provide a simple StopWatch that allows you not only to measure a number of times, but also to print them out in a convenient format.
To print them, you can also use:
For more features, please see StopWatch
from cloudmesh.common.StopWatch import StopWatch
from time import sleep

StopWatch.start("test")
sleep(1)
StopWatch.stop("test")

print(StopWatch.get("test"))

StopWatch.benchmark()
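If you want the core idea without the cloudmesh dependency, a minimal sketch using time.perf_counter could look as follows (this is not the cloudmesh implementation):

```python
import time

class SimpleStopWatch:
    """Named timers backed by time.perf_counter."""
    timers = {}

    @classmethod
    def start(cls, name):
        # Remember the start time under the given name.
        cls.timers[name] = time.perf_counter()

    @classmethod
    def stop(cls, name):
        # Replace the start time with the elapsed duration.
        cls.timers[name] = time.perf_counter() - cls.timers[name]

    @classmethod
    def get(cls, name):
        return cls.timers[name]

SimpleStopWatch.start("test")
time.sleep(0.1)
SimpleStopWatch.stop("test")
print(SimpleStopWatch.get("test"))
```

time.perf_counter is a monotonic clock with high resolution, which makes it the preferred choice for measuring short durations.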
6.7 CLOUDMESH COMMAND SHELL ☁
6.7.1 CMD5
Python's CMD (https://docs.python.org/2/library/cmd.html) is a very useful package to create command line shells. However, it does not allow the dynamic integration of newly defined commands. Furthermore, additions to CMD need to be done within the same source tree. To simplify developing commands by a number of people and to have a dynamic plugin mechanism, we developed cmd5. It is a rewrite of our earlier efforts in cloudmesh client and cmd3.
6.7.1.1 Resources
The source code for cmd5 is located in github:
https://github.com/cloudmesh/cmd5
We have discussed in Section 6.2 how to install cloudmesh as a developer and have access to the source code in a directory called cm. As you read this document we assume you are a developer and can skip the next section.
6.7.1.2 Installation from source
WARNING: DO NOT EXECUTE THIS IF YOU ARE A DEVELOPER OR YOUR ENVIRONMENT WILL NOT PROPERLY WORK.
However, if you are a user of cloudmesh you can install it with
6.7.1.3 Execution
To run the shell you can activate it with the cms command. cms stands for cloudmesh shell:
It will print the banner and enter the shell:
$ pip install cloudmesh-cmd5
(ENV3) $ cms
To see the list of commands you can say:
To see the manual page for a specific command, please use:
6.7.1.4 Create your own Extension
One of the most important features of CMD5 is its ability to be extended with new commands. This is done via packaged namespaces. We recommend you name it cloudmesh-mycommand, where mycommand is the name of the command that you like to create. This can easily be done while using the sys* cloudmesh command (we suggest you use a different name than gregor, maybe your first name):
It will download a template from cloudmesh called cloudmesh-bar and generate a new directory cloudmesh-gregor with all the needed files to create your own command and register it dynamically with cloudmesh. All you have to do is to cd into the directory and install the code:
Adding your own command is easy. It is important that all objects are defined in the command itself and that no global variables be used, in order to allow each shell command to stand alone. Naturally, you should develop API libraries outside of the cloudmesh shell command and reuse them in order to keep the command code as small as possible. We place the command in:
+-------------------------------------------------------+
|                 Cloudmesh CMD5 Shell                  |
+-------------------------------------------------------+

cms>
cms> help
help COMMANDNAME

$ cms sys command generate gregor
$ cd cloudmesh-gregor
$ python setup.py install
# pip install .

cloudmesh/mycommand/command/gregor.py
Now you can go ahead and modify your command in that directory. It will look similar to this (if you used the command name gregor):
An important difference to other CMD solutions is that our commands can leverage (besides the standard definition) docopt as a way to define the manual page. This allows us to use the arguments as a dict and use simple if conditions to interpret the command. Using docopt has the advantage that contributors are forced to think about the command and its options and document them from the start. Previously we used argparse and click. However, we noticed that for our contributors both systems led to commands that were either not properly documented, or the developers delivered ambiguous commands that resulted in confusion and wrong usage by subsequent users. Hence, we do recommend that you use docopt for documenting cmd5 commands. The transformation is enabled by the @command decorator that generates a manual page and creates a proper help message for the shell automatically. Thus there is no need to introduce a separate help method as would normally be needed in CMD, while reducing the effort it takes to contribute new commands in a dynamic fashion.
6.7.1.5 Bug: Quotes
We have one bug in cmd5 that relates to the use of quotes on the command line.
For example, you need to say
from __future__ import print_function
from cloudmesh.shell.command import command
from cloudmesh.shell.command import PluginCommand

class GregorCommand(PluginCommand):

    @command
    def do_gregor(self, args, arguments):
        """
        ::

          Usage:
            gregor -f FILE
            gregor list

          This command does some useful things.

          Arguments:
              FILE   a file name

          Options:
              -f     specify the file
        """
        print(arguments)
        if arguments.FILE:
            print("You have used file:", arguments.FILE)
        return ""
$ cms gregor -f \"filename with spaces\"
If you like to help us fix this, that would be great. It requires the use of shlex. Unfortunately, we did not yet have time to fix this "feature".
6.8 EXERCISES ☁
When doing your assignment, make sure you label the programs appropriately with comments that clearly identify the assignment. Place all assignments in a folder on github named "cloudmesh-exercises".
For example, name the program solving E.Cloudmesh.Common.1 e-cloudmesh-1.py, and so on. For more complex assignments you can name them as you like, as long as in the file you have a comment such as # fa19-516-000 E.Cloudmesh.Common.1 at the beginning of the file. Please do not store in your git repository any screenshots of your working program.
6.8.1 Cloudmesh Common
E.Cloudmesh.Common.1
Develop a program that demonstrates the use of banner, HEADING, and VERBOSE.
E.Cloudmesh.Common.2
Develop a program that demonstrates the use of dotdict.
E.Cloudmesh.Common.3
Develop a program that demonstrates the use of FlatDict.
E.Cloudmesh.Common.4
Develop a program that demonstrates the use of cloudmesh.common.Shell.
E.Cloudmesh.Common.5
Develop a program that demonstrates the use of cloudmesh.common.StopWatch.
6.8.2 Cloudmesh Shell
E.Cloudmesh.Shell.1
Install cmd5 and the command cms on your computer.
E.Cloudmesh.Shell.2
Write a new command with your first name as the command name.
E.Cloudmesh.Shell.3
Write a new command and experiment with the docopt syntax and the interpretation of the arguments dict with if conditions.
E.Cloudmesh.Shell.4
If you have useful extensions that you would like us to add by default, please work with us.
E.Cloudmesh.Shell.5
At this time one needs to quote the " in some commands on the shell command line. Develop and test code that fixes this.
7 LIBRARIES
7.1 PYTHON MODULES ☁
Often you may need functionality that is not present in Python's standard library. In this case you have two options:
implement the features yourself
use a third-party library that has the desired features
Often you can find a previous implementation of what you need. Since this is a common situation, there is a service supporting it: the Python Package Index (or PyPI for short).
Our task here is to install the autopep8 tool from PyPI. This will allow us to illustrate the use of virtual environments using the pyenv or virtualenv command, and installing and uninstalling PyPI packages using pip.
7.1.1 Updating Pip
It is important that you have the newest version of pip installed for your version of python. Let us assume your python is registered with python and you use pyenv; then you can update pip with
without interfering with a potentially system-wide installed version of pip that may be needed by the system default version of python. See the section about pyenv for more details.
7.1.2 Using pip to Install Packages
Let us now look at another important tool for Python development: the Python Package Index, or PyPI for short. PyPI provides a large set of third-party python packages. If you want to do something in python, first check PyPI, as odds are someone already ran into the problem and created a package solving it.
pip install -U pip
In order to install a package from PyPI, use the pip command. We can search PyPI for packages:
It appears that the top two results are what we want, so install them:
This will cause pip to download the packages from PyPI, extract them, check their dependencies and install those as needed, then install the requested packages.
You can skip the --trusted-host pypi.python.org option if you have patched urllib3 on Python 2.7.9.
7.1.3 GUI
7.1.3.1 GUIZero
Install guizero with the following command:
For a comprehensive tutorial on guizero, click here.
7.1.3.2 Kivy
You can install Kivy on macOS as follows:
A hello world program for kivy is included in the cloudmesh.robot repository, which you can find here
https://github.com/cloudmesh/cloudmesh.robot/tree/master/projects/kivy
To run the program, please download it or execute it in cloudmesh.robot as
$ pip search --trusted-host pypi.python.org autopep8 pylint
$ pip install --trusted-host pypi.python.org autopep8 pylint

sudo pip install guizero

brew install pkg-config sdl2 sdl2_image sdl2_ttf sdl2_mixer gstreamer
pip install -U Cython
pip install kivy
pip install pygame
follows:
To create standalone packages with kivy, please see:
7.1.4 Formatting and Checking Python Code
First, get the bad code:
Examine the code:
As you can see, this is very dense and hard to read. Cleaning it up by hand would be a time-consuming and error-prone process. Luckily, this is a common problem, so there exist a couple of packages to help in this situation.
7.1.5 Using autopep8
We can now run the bad code through autopep8 to fix formatting problems:
Let us look at the result. This is considerably better than before. It is easy to tell what the example1 and example2 functions are doing.
It is a good idea to develop a habit of using autopep8 in your python development workflow. For instance: use autopep8 to check a file, and if it passes, make any changes in place using the -i flag:
If you use PyCharm you have the ability to use a similar function by pressing Inspect Code.
7.1.6 Writing Python 3 Compatible Code
cd cloudmesh.robot/projects/kivy
python swim.py

- https://kivy.org/docs/guide/packaging-osx.html

$ wget --no-check-certificate http://git.io/pXqb -O bad_code_example.py
$ emacs bad_code_example.py
$ autopep8 bad_code_example.py > code_example_autopep8.py
$ autopep8 file.py    # check the output to see if it passes
$ autopep8 -i file.py # update in place
To write python 2 and 3 compatible code we recommend that you take a look at: http://python-future.org/compatible_idioms.html
7.1.7 Using Python on FutureSystems
This is only important if you use FutureSystems resources.
In order to use Python you must log in to your FutureSystems account. Then at the shell prompt execute the following command:
This will make the python and virtualenv commands available to you.
The details of what the module load command does are described in the future lesson modules.
7.1.8 Ecosystem
7.1.8.1 pypi
The Python Package Index is a large repository of software for the Python programming language containing a large number of packages, many of which can be found on PyPI. The nice thing about PyPI is that many packages can be installed with the program pip.
To do so you have to locate the <package_name>, for example with the search function in PyPI, and say on the command line:
where package_name is the string name of the package. An example would be the package called cloudmesh_client, which you can install with:
If all goes well the package will be installed.
7.1.8.2 Alternative Installations

$ module load python
$ pip install <package_name>
$ pip install cloudmesh_client
The basic installation of python is provided by python.org. However, others claim to have alternative environments that allow you to install python. These include
Canopy
Anaconda
IronPython
Typically they include not only the python compiler but also several useful packages. It is fine to use such environments for the class, but it should be noted that not every python library may be available for install in a given environment. For example, if you need to use cloudmesh client, it may not be available as a conda or Canopy package. This is also the case for many other cloud-related and useful python libraries. Hence, we do recommend, if you are new to python, to use the distribution from python.org, and use pip and virtualenv.
Additionally, some python versions have platform-specific libraries or dependencies; Cocoa libraries, .NET, or other frameworks are examples. For the assignments and the projects such platform-dependent libraries are not to be used.
If, however, you can write platform-independent code that works on Linux, macOS, and Windows while using the python.org version, but develop it with any of the other tools, that is just fine. However, it is up to you to guarantee that this independence is maintained and implemented. You do have to write requirements.txt files that will install the necessary python libraries in a platform-independent fashion. The homework assignment PRG1 even has a requirement to do so.
In order to provide platform independence we have given in the class a minimal python version that we have tested with hundreds of students: python.org. If you use any other version, that is your decision. Additionally, some students not only use python.org but have used iPython, which is fine too. However, this class is not only about python, but also about how to have your code run on any platform. The homework is designed so that you can identify a setup that works for you.
However, we have concerns if you, for example, wanted to use chameleon cloud, which we require you to access with cloudmesh. Cloudmesh is not available as a conda, canopy, or other framework package. Cloudmesh client is available from pypi, which is standard and should be supported by the frameworks. We have not tested cloudmesh on any other python version than python.org, which is the open source community standard. None of the other versions are standard.
In fact, we had students over the summer using canopy on their machines, and they got confused as they now had multiple python versions and did not know how to switch between them and activate the correct version. Certainly, if you know how to do that, feel free to use canopy; that is up to you. However, the homework and projects require you to make your program portable to python.org. If you know how to do that even if you use canopy, anaconda, or any other python version, that is fine. Graders will test your programs on a python.org installation, and not on canopy, anaconda, or ironpython, while using virtualenv. It is obvious why: every time they test a program they need to create a new virtualenv and run vanilla python in it. If we were to run two installs in the same system, this would not work, as we do not know if one student would cause a side effect for another. Thus we as instructors do not just have to look at your code, but at the code of hundreds of students with different setups. This is a non-scalable solution, as every time we tested code from a student we would have to wipe out the OS, install it fresh, install a new version of whatever python you have elected, become familiar with that version, and so on. This is the reason why the open source community is using python.org. We follow best practices. Using other versions is not a community best practice, but may work for an individual.
We do, however, have additional bonus projects in regard to using other python versions, such as:
deploy, run, and document cloudmesh on ironpython
deploy, run, and document cloudmesh on anaconda; develop a script to generate a conda package from github
deploy, run, and document cloudmesh on canopy; develop a script to generate a conda package from github
other documentation that would be useful
7.1.9 Resources
If you are unfamiliar with programming in Python, we also refer you to some of the numerous online resources. You may wish to start with Learn Python or the book Learn Python the Hard Way. Other options include Tutorials Point or Code Academy, and the Python wiki page contains a long list of references for learning as well. Additional resources include:
https://virtualenvwrapper.readthedocs.io
https://github.com/yyuu/pyenv
https://amaral.northwestern.edu/resources/guides/pyenv-tutorial
https://godjango.com/96-django-and-python-3-how-to-setup-pyenv-for-multiple-pythons/
https://www.accelebrate.com/blog/the-many-faces-of-python-and-how-to-manage-them/
http://ivory.idyll.org/articles/advanced-swc/
http://python.net/~goodger/projects/pycon/2007/idiomatic/handout.html
http://www.youtube.com/watch?v=0vJJlVBVTFg
http://www.korokithakis.net/tutorials/python/
http://www.afterhoursprogramming.com/tutorial/Python/Introduction/
http://www.greenteapress.com/thinkpython/thinkCSpy.pdf
https://docs.python.org/3.3/tutorial/modules.html
https://www.learnpython.org/en/Modules_and_Packages
https://docs.python.org/2/library/datetime.html
https://chrisalbon.com/python/strings_to_datetime.html
A very long list of useful information is also available from
https://github.com/vinta/awesome-python
https://github.com/rasbt/python_reference
This list may be useful as it also contains links to data visualization and manipulation libraries, and AI tools and libraries. Please note that for this class you can reuse such libraries if not otherwise stated.
7.1.9.1 Jupyter Notebook Tutorials
A Short Introduction to Jupyter Notebooks and NumPy: to view the notebook, open this link in a background tab https://nbviewer.jupyter.org/ and copy and paste the following link in the URL input area https://cloudmesh.github.io/classes/lesson/prg/Jupyter-NumPy-tutorial-I523-F2017.ipynb Then hit Go.
7.1.10 Exercises
E.Python.Lib.1:
Write a python program called iterate.py that accepts an integer n from the command line. Pass this integer to a function called iterate.
The iterate function should then iterate from 1 to n. If the i-th number is a multiple of three, print multiple of 3; if a multiple of 5, print multiple of 5; if a multiple of both, print multiple of 3 and 5; else print the value.
E.Python.Lib.2:
1. Create a pyenv or virtualenv ~/ENV
2. Modify your ~/.bashrc shell file to activate your environment upon login.
3. Install the docopt python package using pip.
4. Write a program that uses docopt to define a commandline program. Hint: modify the iterate program.
5. Demonstrate that the program works.
7.2 DATA MANAGEMENT ☁
Obviously when dealing with big data we may not only be dealing with data in one format but in many different formats. It is important that you are able to master such formats and seamlessly integrate them in your analysis. Thus we provide some simple examples of which different data formats exist and how to use them.
7.2.1 Formats
7.2.1.1 Pickle
Python pickle allows you to save data in a python-native format into a file that can later be read in by other programs. However, the data format may not be portable among different python versions, thus the format is often not suitable to store information. Instead we recommend for standard data to use either json or yaml.
To read it back in, use
7.2.1.2 Text Files
To read text files into a variable called content you can use
You can also use the following code with the convenient with statement:
To split up the lines of the file into an array you can do
This can also be done with the built-in readlines function
In case the file is too big you will want to read the file line by line:
import pickle

flavor = {
    "small": 100,
    "medium": 1000,
    "large": 10000
}

pickle.dump(flavor, open("data.p", "wb"))

flavor = pickle.load(open("data.p", "rb"))

content = open('filename.txt', 'r').read()

with open('filename.txt', 'r') as file:
    content = file.read()

with open('filename.txt', 'r') as file:
    lines = file.read().splitlines()

lines = open('filename.txt', 'r').readlines()
7.2.1.3 CSV Files
Often data is contained as comma separated values (CSV) within a file. To read such files you can use the csv package.
Using pandas you can read them as follows.
There are many other modules and libraries that include CSV read functions. In case you need to split a single line by comma, you may also use the split function. However, remember it will split at every comma, including those contained in quotes. So this method, although it looks convenient at first, has limitations.
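The limitation is easy to demonstrate with a line that contains a quoted comma (the data here is made up):

```python
import csv
import io

line = '"Doe, John",Math'

# Naive splitting breaks the quoted field apart.
print(line.split(','))  # ['"Doe', ' John"', 'Math']

# The csv module keeps the quoted field intact.
row = next(csv.reader(io.StringIO(line)))
print(row)  # ['Doe, John', 'Math']
```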
7.2.1.4 Excel spreadsheets
Pandas contains a method to read Excel files
7.2.1.5 YAML
YAML is a very important format as it allows you to easily structure data in hierarchical fields. It is frequently used to coordinate programs, using yaml as the specification for configuration files, but also for data files. To read in a yaml file, the following code can be used
The nice part is that this code can also be used to verify if a file is valid yaml. To write data out we can use
with open('filename.txt', 'r') as file:
    for line in file:
        print(line)
import csv

with open('data.csv', 'r') as f:
    contents = csv.reader(f)
    for row in contents:
        print(row)
import pandas as pd

df = pd.read_csv("example.csv")

import pandas as pd

filename = 'data.xlsx'
data = pd.ExcelFile(filename)
df = data.parse('Sheet1')
importyaml
withopen('data.yaml','r')asf:
content=yaml.load(f)
The flow style set to false formats the data in a nicely readable fashion with indentations.
7.2.1.6 JSON
7.2.1.7 XML
XML format is extensively used to transport data across the web. It has a hierarchical data format, and can be represented in the form of a tree.
A sample XML data set looks like:
Python provides the ElementTree XML API to parse and create XML data.
Importing XML data from a file:
Reading XML data from a string directly:
Iterating over child nodes in a root:
Modifying XML data using ElementTree:
Modifying text within a tag of an element using the .text attribute:
with open('data.yml', 'w') as f:
    yaml.dump(data, f, default_flow_style=False)
import json
with open('strings.json') as f:
    content = json.load(f)
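Writing JSON out works analogously with json.dump; a small sketch with illustrative data:

```python
import json

content = {"name": "Albert", "group": ["AI", "Machine Learning"]}

# Write the dictionary as JSON; indent produces a readable file.
with open("strings.json", "w") as f:
    json.dump(content, f, indent=2)

# json.dumps returns the JSON document as a string instead of writing a file.
text = json.dumps(content)
print(text)
```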
<data>
    <items>
        <item name="item-1"></item>
        <item name="item-2"></item>
        <item name="item-3"></item>
    </items>
</data>
import xml.etree.ElementTree as ET
tree = ET.parse('data.xml')
root = tree.getroot()
root = ET.fromstring(data_as_string)
for child in root:
    print(child.tag, child.attrib)
tag.text = new_data
tree.write('output.xml')
Adding/modifying an attribute using the .set() method:
Other Python modules used for parsing XML data include:
minidom: https://docs.python.org/3/library/xml.dom.minidom.html
BeautifulSoup: https://www.crummy.com/software/BeautifulSoup/
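The ElementTree snippets above can be combined into one self-contained sketch that parses the sample document from a string, iterates over the items, and modifies an attribute:

```python
import xml.etree.ElementTree as ET

data_as_string = """
<data>
    <items>
        <item name="item-1"></item>
        <item name="item-2"></item>
        <item name="item-3"></item>
    </items>
</data>
"""

root = ET.fromstring(data_as_string)

# Iterate over all item elements and print their attributes.
for item in root.iter('item'):
    print(item.tag, item.attrib)

# Modify the first item's name attribute and serialize back to text.
first = root.find('items/item')
first.set('name', 'renamed-item')
print(ET.tostring(root, encoding='unicode'))
```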
7.2.1.8 RDF
To read RDF files you will need to install RDFLib with
This will then allow you to read RDF files
Good examples on using RDF are provided on the RDFLib Web page at https://github.com/RDFLib/rdflib
From the Web page we showcase also how to directly process RDF data from the Web.
7.2.1.9 PDF
The Portable Document Format (PDF) has been made available by Adobe Inc. royalty free. This has enabled PDF to become a worldwide adopted format that also has been standardized in 2008 (ISO/IEC 32000-1:2008, https://www.iso.org/standard/51502.html). A lot of research is published in papers, making PDF one of the de-facto standards for publishing. However, PDF is difficult to parse and is focused on high quality output instead of data representation. Nevertheless, tools to manipulate PDF exist:
tag.set('key', 'value')
tree.write('output.xml')
$ pip install rdflib
from rdflib.graph import Graph
g = Graph()
g.parse("filename.rdf", format="format")
for entry in g:
    print(entry)
import rdflib
g = rdflib.Graph()
g.load('http://dbpedia.org/resource/Semantic_Web')
for s, p, o in g:
    print(s, p, o)
PDFMiner
https://pypi.python.org/pypi/pdfminer/ allows the simple translation of PDF into text that then can be further mined. The manual page helps to demonstrate some examples: http://euske.github.io/pdfminer/index.html.
pdf-parser.py
https://blog.didierstevens.com/programs/pdf-tools/ parses PDF documents and identifies some structural elements that can then be further processed.
If you know about other tools, let us know.
7.2.1.10 HTML
A very powerful library to parse HTML Web pages is provided with https://www.crummy.com/software/BeautifulSoup/
More details about it are provided in the documentation page https://www.crummy.com/software/BeautifulSoup/bs4/doc/
TODO: Students can contribute a section
BeautifulSoup is a Python library to parse, process and edit HTML documents.
To install BeautifulSoup, use the pip command as follows:
In order to process HTML documents, a parser is required. BeautifulSoup supports the HTML parser included in Python's standard library, but it also supports a number of third-party Python parsers like the lxml parser, which is commonly used [1].
The following command can be used to install the lxml parser:
To begin with, we import the package and instantiate an object as follows for an HTML document html_handle:
$ pip install beautifulsoup4
$ pip install lxml
Now, we will discuss a few functions, attributes and methods of BeautifulSoup.
prettify function
The prettify() method will turn a BeautifulSoup parse tree into a nicely formatted Unicode string, with a separate line for each HTML/XML tag and string. It is analogous to the pprint() function. The object created above can be viewed by printing the prettified version of the document as follows:
tag object
A tag object refers to tags in the HTML document. It is possible to go down to the inner levels of the DOM tree. To access a tag div under the tag body, it can be done as follows:
The attrs attribute of the tag object returns a dictionary of all the defined attributes of the HTML tag as keys.
has_attr() method
To check if a tag object has a specific attribute, the has_attr() method can be used.
tag object attributes
name - This attribute returns the name of the selected tag.
attrs - This attribute returns a dictionary of all the defined attributes of the HTML tag as keys.
contents - This attribute returns a list of contents enclosed within the HTML tag.
string - This attribute returns the text enclosed within the HTML tag. It returns None if there are multiple children.
strings - This overcomes the limitation of string and returns a generator of all
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_handle, 'lxml')
print(soup.prettify())
body_div = soup.body.div
print(body_div.prettify())
if body_div.has_attr('p'):
    print('The value of \'p\' attribute is:', body_div['p'])
strings enclosed within the given tag.
The following code showcases usage of the discussed attributes:
Searching the Tree
The find() function takes a filter expression as argument and returns the first match found; the find_all() function returns a list of all the matching elements.
The select() function can be used to search the tree using CSS selectors.
7.2.1.11 ConfigParser
TODO: Students can contribute a section
https://pymotw.com/2/ConfigParser/
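As a starting point for this section, the standard library's configparser module (named ConfigParser in Python 2, as in the link above) reads and writes INI-style configuration files; a minimal sketch with an illustrative section and keys:

```python
import configparser

# Write a small INI-style configuration file.
config = configparser.ConfigParser()
config['mongo'] = {'host': '127.0.0.1', 'port': '27017'}
with open('example.ini', 'w') as f:
    config.write(f)

# Read it back; all values are returned as strings.
config = configparser.ConfigParser()
config.read('example.ini')
print(config['mongo']['host'], config['mongo']['port'])
```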
7.2.1.12 ConfigDict
https://github.com/cloudmesh/cloudmesh-common/blob/master/cloudmesh/common/ConfigDict.py
7.2.2 Encryption
body_tag = soup.body
print('Name of the tag:', body_tag.name)
attrs = body_tag.attrs
print('The attributes defined for body tag are:', attrs)
print('The contents of \'body\' tag are:\n', body_tag.contents)
print('The string value enclosed in \'body\' tag is:', body_tag.string)
for s in body_tag.strings:
    print(repr(s))
search_elem = soup.find('a')
print(search_elem.prettify())
search_elems = soup.find_all("a", class_="sample")
print(search_elems)
# Select `a` tag with class `sample`
a_tag_elems = soup.select('a.sample')
print(a_tag_elems)
Often we need to protect the information stored in a file. This is achieved with encryption. There are many methods of supporting encryption, and even if a file is encrypted it may be the target of attacks. Thus it is not only important to encrypt data that you do not want others to see, but also to make sure that the system on which the data is hosted is secure. This is especially important if we talk about big data, which has a potentially large effect if it gets into the wrong hands.
To illustrate one type of encryption that is non trivial, we have chosen to demonstrate how to encrypt a file with an ssh key. In case you have openssl installed on your system, this can be achieved as follows.
Most important here are Step 4 that encrypts the file and Step 5 that decrypts the file. Using the Python os module it is straightforward to implement this. However, we are providing in cloudmesh a convenient class that makes the use in Python very simple.
In our class we initialize it with the locations of the file that is to be encrypted and decrypted. To initiate these actions just call the methods encrypt and decrypt.
7.2.3 Database Access
TODO: Students: define conventional database access section
see: https://www.tutorialspoint.com/python/python_database_access.htm
7.2.4 SQLite
#!/bin/sh
# Step 1. Creating a file with data
echo "Big Data is the future." > file.txt
# Step 2. Create the pem
openssl rsa -in ~/.ssh/id_rsa -pubout > ~/.ssh/id_rsa.pub.pem
# Step 3. look at the pem file to illustrate how it looks (optional)
cat ~/.ssh/id_rsa.pub.pem
# Step 4. encrypt the file into secret.txt
openssl rsautl -encrypt -pubin -inkey ~/.ssh/id_rsa.pub.pem -in file.txt -out secret.txt
# Step 5. decrypt the file and print the contents to stdout
openssl rsautl -decrypt -inkey ~/.ssh/id_rsa -in secret.txt
from cloudmesh.common.ssh.encrypt import EncryptFile
e = EncryptFile('file.txt', 'secret.txt')
e.encrypt()
e.decrypt()
TODO: Students can contribute to this section
https://www.sqlite.org/index.html
https://docs.python.org/3/library/sqlite3.html
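As a starting point for this section, Python's built-in sqlite3 module needs no server process; a minimal sketch using an in-memory database (use a filename such as 'data.db' instead of ':memory:' to persist it):

```python
import sqlite3

con = sqlite3.connect(':memory:')
cur = con.cursor()

# Create a table and insert a few rows.
cur.execute('CREATE TABLE flavor (name TEXT, size INTEGER)')
cur.executemany('INSERT INTO flavor VALUES (?, ?)',
                [('small', 100), ('medium', 1000), ('large', 10000)])
con.commit()

# Query the rows back, ordered by size.
for name, size in cur.execute('SELECT name, size FROM flavor ORDER BY size'):
    print(name, size)

con.close()
```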
7.2.4.1 Exercises
E:Encryption.1:
Test the shell script to replicate how this example works.
E:Encryption.2:
Test the cloudmesh encryption class.
E:Encryption.3:
What other encryption methods exist? Can you provide an example and contribute to the section?
E:Encryption.4:
What is the issue of encryption that makes it challenging for Big Data?
E:Encryption.5:
Given a test data set with many text files, how long will it take to encrypt and decrypt them on various machines? Write a benchmark that you test. Develop this benchmark as a group, and test out the time it takes to execute it on a variety of platforms.
7.3 PLOTTING WITH MATPLOTLIB ☁
A brief overview of plotting with matplotlib along with examples is provided. First matplotlib must be installed, which can be accomplished with pip install as follows:
$ pip install matplotlib
We will start by plotting a simple line graph using built-in numpy functions for sine and cosine. The first step is to import the proper libraries, shown next.
Next we will define the values for the x axis; we do this with the linspace option in numpy. The first two parameters are the starting and ending points; these must be scalars. The third parameter is optional and defines the number of samples to be generated between the starting and ending points; this value must be an integer. Additional parameters for the linspace utility can be found here:
Now we will use the sine and cosine functions in order to generate y values; for this we will use the values of x for the argument of both our sine and cosine functions, i.e. cos(x).
You can display the values of the three parameters we have defined by typing them in a Python shell.
Having defined x and y values we can generate a line plot, and since we imported matplotlib.pyplot as plt we simply use plt.plot.
We can display the plot using plt.show(), which will pop up a figure displaying the plot defined.
Additionally we can add the sine line to our line graph by entering the following.
Invoking plt.show() now will show a figure with both sine and cosine lines displayed. Now that we have a figure generated it would be useful to label the x
import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(-np.pi, np.pi, 16)
cos = np.cos(x)
sin = np.sin(x)
x
array([-3.14159265, -2.72271363, -2.30383461, -1.88495559, -1.46607657,
       -1.04719755, -0.62831853, -0.20943951,  0.20943951,  0.62831853,
        1.04719755,  1.46607657,  1.88495559,  2.30383461,  2.72271363,
        3.14159265])
plt.plot(x, cos)
plt.show()
plt.plot(x, sin)
and y axis and provide a title. This is done by the following three commands:
Along with axis labels and a title, another useful figure feature may be a legend. In order to create a legend you must first designate a label for the line; this label will be what shows up in the legend. The label is defined in the initial plt.plot(x,y) instance; next is an example.
Then in order to display the legend the following command is issued:
The location is specified by using upper or lower and left or right. Naturally all these commands can be combined and put in a file with the .py extension and run from the command line.
An example of a bar chart is presented next using data from [T:fast-cars].
plt.xlabel("X-label (units)")
plt.ylabel("Y-label (units)")
plt.title("A clever Title for your Figure")
plt.plot(x, cos, label="cosine")
plt.legend(loc='upper right')
import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(-np.pi, np.pi, 16)
cos = np.cos(x)
sin = np.sin(x)
plt.plot(x, cos, label="cosine")
plt.plot(x, sin, label="sine")
plt.xlabel("X-label (units)")
plt.ylabel("Y-label (units)")
plt.title("A clever Title for your Figure")
plt.legend(loc='upper right')
plt.show()
import matplotlib.pyplot as plt
x = ['Toyota Prius',
     'Tesla Roadster',
     'Bugatti Veyron',
     'Honda Civic',
     'Lamborghini Aventador']
horse_power = [120, 288, 1200, 158, 695]
x_pos = [i for i, _ in enumerate(x)]
plt.bar(x_pos, horse_power, color='green')
plt.xlabel("Car Model")
plt.ylabel("Horse Power (Hp)")
plt.title("Horse Power for Selected Cars")
You can customize plots further by using plt.style.use() in Python 3. If you provide the following command inside a Python command shell you will see a list of available styles.
An example of using a predefined style is shown next.
Up to this point we have only showcased how to display figures through Python output; however, web browsers are a popular way to display figures. One example is Bokeh; the following lines can be entered in a Python shell and the figure is output to a browser.
7.4 DOCOPTS ☁
When we want to design command line arguments for Python programs we have many options. However, as our approach is to create documentation first, docopt also provides a good approach for Python. The code for it is located at
https://github.com/docopt/docopt
It can be installed with
Sample programs are located at
https://github.com/docopt/docopt/blob/master/examples/options_example.py
A sample program of using docopt for our purposes looks as follows.
plt.xticks(x_pos, x)
plt.show()
print(plt.style.available)
plt.style.use('seaborn')
from bokeh.io import show
from bokeh.plotting import figure
x_values = [1, 2, 3, 4, 5]
y_values = [6, 7, 2, 3, 6]
p = figure()
p.circle(x=x_values, y=y_values)
show(p)
$ pip install docopt
Another good feature of using docopt is that we can use the same verbal description in other programming languages, as showcased in this book.
7.5 OPENCV ☁
Learning Objectives
Provide some simple calculations so we can test cloud services.
Showcase some elementary OpenCV functions.
Show an environmental image analysis application using Secchi disks.
OpenCV (Open Source Computer Vision Library) is a library of thousands of algorithms for various applications in computer vision and machine learning. It has C++, C, Python, Java and MATLAB interfaces and supports Windows, Linux, Android and MacOS. In this section, we will explain basic features of this library, including the implementation of a simple example.
7.5.1 Overview
OpenCV has countless functions for image and video processing. The pipeline starts with reading the images, low-level operations on pixel values, preprocessing e.g. denoising, and then multiple steps of higher-level operations
"""CloudmeshVMmanagement
Usage:
cm-govmstartNAME[--cloud=CLOUD]
cm-govmstopNAME[--cloud=CLOUD]
cm-goset--cloud=CLOUD
cm-go-h|--help
cm-go--version
Options:
-h--helpShowthisscreen.
--versionShowversion.
--cloud=CLOUDThenameofthecloud.
--mooredMoored(anchored)mine.
--driftingDriftingmine.
ARGUMENTS:
NAMEThenameoftheVM`
"""
fromdocoptimportdocopt
if__name__=='__main__':
arguments=docopt(__doc__,version='1.0.0rc2')
print(arguments)
which vary depending on the application. OpenCV covers the whole pipeline, especially providing a large set of library functions for high-level operations. A simpler library for image processing in Python is Scipy's multi-dimensional image processing package (scipy.ndimage).
7.5.2 Installation
OpenCV for Python can be installed on Linux in multiple ways, namely PyPI (Python Package Index), the Linux package manager (apt-get for Ubuntu), the Conda package manager, and also by building from source. You are recommended to use PyPI. Here is the command that you need to run:
This was tested on Ubuntu 16.04 with a fresh Python 3.6 virtual environment. In order to test, import the module in the Python command line:
If it does not raise an error, it is installed correctly. Otherwise, try to solve the error.
For installation on Windows, see:
https://docs.opencv.org/3.0-beta/doc/py_tutorials/py_setup/py_setup_in_windows/py_setup_in_windows.html#install-opencv-python-in-windows
Note that building from source can take a long time and may not be feasible for deploying to limited platforms such as Raspberry Pi.
7.5.3 A Simple Example
In this example, an image is loaded. A simple processing is performed, and the result is written to a new image.
7.5.3.1 Loading an image
$ pip install opencv-python
import cv2
%matplotlib inline
import cv2
The image was downloaded from the USC standard database:
http://sipi.usc.edu/database/database.php?volume=misc&image=9
7.5.3.2 Displaying the image
The image is saved in a numpy array. Each pixel is represented with 3 values (R,G,B). This provides you with access to manipulate the image at the level of single pixels. You can display the image using OpenCV's imshow function as well as Matplotlib's imshow function.
You can display the image using the imshow function:
or you can use Matplotlib. If you have not installed Matplotlib before, install it using:
Now you can use:
which results in Figure 1
Figure 1: Image display
img = cv2.imread('images/opencv/4.2.01.tiff')
cv2.imshow('Original', img)
cv2.waitKey(0)
cv2.destroyAllWindows()
$ pip install matplotlib
import matplotlib.pyplot as plt
plt.imshow(img)
7.5.3.3 Scaling and Rotation
Scaling (resizing) the image relative to different axes
which results in Figure 2
Figure 2: Scaling and rotation
Rotation of the image by an angle of t
which results in Figure 3
res = cv2.resize(img,
                 None,
                 fx=1.2,
                 fy=0.7,
                 interpolation=cv2.INTER_CUBIC)
plt.imshow(res)
rows, cols, _ = img.shape
t = 45
M = cv2.getRotationMatrix2D((cols/2, rows/2), t, 1)
dst = cv2.warpAffine(img, M, (cols, rows))
plt.imshow(dst)
Figure 3: Rotated image
7.5.3.4 Gray-scaling
which results in Figure 4
Figure 4: Gray-scaling
7.5.3.5 Image Thresholding
which results in Figure 5
img2 = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
plt.imshow(img2, cmap='gray')
ret, thresh = cv2.threshold(img2, 127, 255, cv2.THRESH_BINARY)
plt.subplot(1, 2, 1), plt.imshow(img2, cmap='gray')
plt.subplot(1, 2, 2), plt.imshow(thresh, cmap='gray')
Figure 5: Image Thresholding
7.5.3.6 Edge Detection
Edge detection using the Canny edge detection algorithm
which results in Figure 6
Figure 6: Edge detection
7.5.4 Additional Features
OpenCV has implementations of many machine learning techniques such as KMeans and Support Vector Machines that can be put into use with only a few lines of code. It also has functions especially for video analysis, feature detection, object recognition and many more. You can find out more about them on their website.
[OpenCV](https://docs.opencv.org/3.0-beta/index.html) was initially developed
edges = cv2.Canny(img2, 100, 200)
plt.subplot(121), plt.imshow(img2, cmap='gray')
plt.subplot(122), plt.imshow(edges, cmap='gray')
for C++ and still has a focus on that language, but it is still one of the most valuable image processing libraries in Python.
7.6 SECCHI DISK ☁
We are developing an autonomous robot boat that you can be part of developing within this class. The robot boat is actually measuring turbidity or water clarity. Traditionally this has been done with a Secchi disk. The use of the Secchi disk is as follows:
1. Lower the Secchi disk into the water.
2. Measure the point when you can no longer see it.
3. Record the depth at various levels and plot in a geographical 3D map.
One of the things we can do is take a video of the measurement instead of a human recording it. Then we can analyze the video automatically to see how deep the disk was lowered. This is a classical image analysis program. You are encouraged to identify algorithms that can identify the depth. The simplest seems to be to do a histogram at a variety of depth steps, and measure when the histogram no longer changes significantly. The depth at that image will be the measurement we look for.
Thus if we analyze the images we need to look at the image and identify the numbers on the measuring tape, as well as the visibility of the disk.
To showcase how such a disk looks, we refer to the image showcasing different Secchi disks. For our purpose the black-white contrast Secchi disk works well. See Figure 7.
Figure 7: Secchi disk types. A marine style on the left and the freshwater version on the right (Wikipedia).
More information about Secchi disks can be found at:
https://en.wikipedia.org/wiki/Secchi_disk
We have included next a couple of examples using some obviously useful OpenCV methods. Surprisingly, the use of edge detection, which comes to mind first to identify if we can still see the disk, seems too complicated to use for analysis. We at this time believe the histogram will be sufficient.
Please inspect our examples.
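To make the histogram idea concrete without any imaging library, the following toy sketch (made-up pixel values and hypothetical helper names) compares gray-value histograms of successive frames; the change is large while the disk is still visible and drops toward zero once only uniform water remains. With OpenCV one would compute the histograms with cv2.calcHist instead.

```python
def histogram(pixels, bins=4, maxval=256):
    # Count how many pixel values fall into each of the bins.
    counts = [0] * bins
    for p in pixels:
        counts[p * bins // maxval] += 1
    return counts

def histogram_change(a, b):
    # Sum of absolute differences between two histograms.
    return sum(abs(x - y) for x, y in zip(histogram(a), histogram(b)))

# Made-up "frames" at increasing depth: first the dark/bright disk,
# then nearly uniform water once the disk has vanished.
frames = [
    [0, 0, 255, 255, 128, 128],
    [0, 64, 200, 255, 128, 128],
    [90, 100, 110, 120, 110, 100],
    [95, 105, 110, 115, 105, 100],
]

for i in range(1, len(frames)):
    print(i, histogram_change(frames[i - 1], frames[i]))
```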
7.6.1 Setup for OSX
First let's set up the OpenCV environment for OSX. Naturally you will have to update the versions based on your versions of Python. When we tried the install of OpenCV on MacOS, the setup was slightly more complex than for other packages. This may have changed by now; if you have improved instructions, please let us know. However, we do not want to install it via Anaconda for the obvious reason that Anaconda installs too many other things.
import os, sys
from os.path import expanduser
7.6.2 Step 1: Record the video
Record the video on the robot.
We have actually done this for you and will provide you with images and videos if you are interested in analyzing them. See Figure 8.
7.6.3 Step 2: Analyze the images from the Video
For now we just selected 4 images from the video.
os.path
home = expanduser("~")
sys.path.append('/usr/local/Cellar/opencv/3.3.1_1/lib/python3.6/site-packages/')
sys.path.append(home + '/.pyenv/versions/OPENCV/lib/python3.6/site-packages/')
import cv2
cv2.__version__
! pip install numpy > tmp.log
! pip install matplotlib >> tmp.log
%matplotlib inline
import cv2
import matplotlib.pyplot as plt
img1 = cv2.imread('secchi/secchi1.png')
img2 = cv2.imread('secchi/secchi2.png')
img3 = cv2.imread('secchi/secchi3.png')
img4 = cv2.imread('secchi/secchi4.png')
figures = []
fig = plt.figure(figsize=(18, 16))
for i in range(1, 13):
    figures.append(fig.add_subplot(4, 3, i))
count = 0
for img in [img1, img2, img3, img4]:
    figures[count].imshow(img)
    color = ('b', 'g', 'r')
    for i, col in enumerate(color):
        histr = cv2.calcHist([img], [i], None, [256], [0, 256])
        figures[count + 1].plot(histr, color=col)
    figures[count + 2].hist(img.ravel(), 256, [0, 256])
    count += 3
print("Legend")
print("First column = image of Secchi disk")
print("Second column = histogram of colors in image")
print("Third column = histogram of all values")
plt.show()
Figure 8: Histogram
7.6.3.1 Image Thresholding
See Figure 9, Figure 10, Figure 11, Figure 12.
def threshold(img):
    ret, thresh = cv2.threshold(img, 150, 255, cv2.THRESH_BINARY)
    plt.subplot(1, 2, 1), plt.imshow(img, cmap='gray')
    plt.subplot(1, 2, 2), plt.imshow(thresh, cmap='gray')
threshold(img1)
Figure 9: Threshold 1
Figure 10: Threshold 2
Figure 11: Threshold 3
Figure 12: Threshold 4
7.6.3.2 Edge Detection
threshold(img2)
threshold(img3)
threshold(img4)
See Figure 13, Figure 14, Figure 15, Figure 16, Figure 17. Edge detection using the Canny edge detection algorithm.
Figure 13: Edge Detection 1
Figure 14: Edge Detection 2
Figure 15: Edge Detection 3
def find_edge(img):
    edges = cv2.Canny(img, 50, 200)
    plt.subplot(121), plt.imshow(img, cmap='gray')
    plt.subplot(122), plt.imshow(edges, cmap='gray')
find_edge(img1)
find_edge(img2)
find_edge(img3)
find_edge(img4)
Figure 16: Edge Detection 4
7.6.3.3 Black and white
Figure 17: Black-and-white conversion
bw1 = cv2.cvtColor(img1, cv2.COLOR_BGR2GRAY)
plt.imshow(bw1, cmap='gray')
8 DATA
8.1 DATA FORMATS ☁
8.1.1 YAML
The term YAML stands for "YAML Ain't Markup Language". According to the Web page at
http://yaml.org/
"YAML is a human friendly data serialization standard for all programming languages." There are multiple versions of YAML in existence and one needs to take care that your software supports the right version. The current version is YAML 1.2.
YAML is often used for configuration and in many cases can also be used as an XML replacement. Important is that YAML, in contrast to XML, removes the tags while replacing them with indentation. This has naturally the advantage that it is more easily readable; however, the format is strict and needs to adhere to proper indentation. Thus it is important that you check your YAML files for correctness, either by writing for example a Python program that reads your YAML file, or an online YAML checker such as provided at
http://www.yamllint.com/
An example on how to use YAML in Python is provided in our next example. Please note that YAML is a superset of JSON. Originally YAML was designed as a markup language. However, as it is not document oriented but data oriented, it has been recast and it no longer classifies itself as a markup language.
import os
import sys
import yaml

try:
    yamlFilename = os.sys.argv[1]
    yamlFile = open(yamlFilename, "r")
except:
    print("filename does not exist")
    sys.exit()
try:
Resources:
http://yaml.org/
https://en.wikipedia.org/wiki/YAML
http://www.yamllint.com/
8.1.2 JSON
The term JSON stands for JavaScript Object Notation. It is targeted as an open-standard file format that emphasizes the integration of human-readable text to transmit data objects. The data objects contain attribute-value pairs. Although it originates from JavaScript, the format itself is language independent. It uses brackets to allow organization of the data. Please note that YAML is a superset of JSON and not all YAML documents can be converted to JSON. Furthermore, JSON does not support comments. For these reasons we often prefer to use YAML instead of JSON. However, JSON data can easily be translated to YAML as well as XML.
Resources:
https://en.wikipedia.org/wiki/JSON
https://www.json.org/
8.1.3 XML
XML stands for Extensible Markup Language. XML allows one to define documents with the help of a set of rules in order to make them machine readable. The emphasis here is on machine readable, as documents in XML can become quickly complex and difficult to understand for humans. XML is used for documents as well as data structures.
A tutorial about XML is available at
https://www.w3schools.com/xml/default.asp
Resources:
    yaml.load(yamlFile.read(), Loader=yaml.SafeLoader)
except:
    print("YAML file is not valid.")
https://en.wikipedia.org/wiki/XML
9 MONGO
9.1 MONGODB IN PYTHON ☁
Learning Objectives
Introduction to basic MongoDB knowledge
Use of MongoDB via PyMongo
Use of MongoEngine
MongoEngine and Object-Document mapper
Use of Flask-Mongo
In today's era, NoSQL databases have developed an enormous potential to process unstructured data efficiently. Modern information is complex, extensive, and may not have pre-existing relationships. With the advent of advanced search engines, machine learning, and Artificial Intelligence, technology expectations to process, store, and analyze such data have grown tremendously [2]. NoSQL database engines such as MongoDB, Redis, and Cassandra have successfully overcome traditional relational database challenges such as scalability, performance, unstructured data growth, agile sprint cycles, and growing needs of processing data in real-time with minimal hardware processing power [3]. The NoSQL databases are a new generation of engines that do not necessarily require the SQL language and are sometimes also called Not Only SQL databases. However, most of them support various third-party open connectivity drivers that can map NoSQL queries to SQL. It would be safe to say that although NoSQL databases are still far from replacing relational databases, they are adding immense value when used in hybrid IT environments in conjunction with relational databases, based on application-specific needs [3]. We will be covering the MongoDB technology, its driver PyMongo, its object-document mapper MongoEngine, and the Flask-PyMongo micro-web framework that make MongoDB more attractive and user-friendly.
9.1.1 Cloudmesh MongoDB Usage Quickstart
Before you read on we would like you to read this quickstart. The easiest way for many of the activities we do to interact with MongoDB is to use our cloudmesh functionality. This prelude section is not intended to describe all the details, but to get you started quickly while leveraging cloudmesh.
This is done via cloudmesh cmd5 and the cloudmesh_community/cm code:
https://cloudmesh-community.github.io/cm/
To install mongo on, for example, macOS you can use
To start, stop and see the status of mongo you can use
To add an object to Mongo, you simply have to define a dict with predefined values for kind and cloud. In the future such attributes can be passed to the function to determine the MongoDB collection.
When you invoke the function it will automatically store the information into MongoDB. Naturally this requires that the ~/.cloudmesh/cloudmesh.yaml file is properly configured.
9.1.2 MongoDB
Today MongoDB is one of the leading NoSQL databases, fully capable of handling dynamic changes, processing large volumes of complex and unstructured data, easily using object-oriented programming features, as well as distributed system challenges [4]. At its core, MongoDB is an open source, cross-platform, document database mainly written in the C++ language.
$ cms admin mongo install
$ cms admin mongo start
$ cms admin mongo stop
$ cms admin mongo status
from cloudmesh.mongo.DataBaseDecorator import DatabaseUpdate

@DatabaseUpdate
def test():
    data = {
        "kind": "test",
        "cloud": "testcloud",
        "value": "hello"
    }
    return data

result = test()
9.1.2.1 Installation
MongoDB can be installed on various Unix platforms, including Linux, Ubuntu, Amazon Linux, etc. [5]. This section focuses on installing MongoDB on Ubuntu 18.04 Bionic Beaver, used as a standard OS for a virtual machine used as a part of the Big Data Application Class during the 2018 Fall semester.
9.1.2.1.1 Installation procedure
Before installing, it is recommended to configure a non-root user and provide administrative privileges to it, in order to be able to perform general MongoDB admin tasks. This can be accomplished by logging in as the root user in the following manner [6].
When logged in as a regular user, one can perform actions with superuser privileges by typing sudo before each command [6].
Once the user setup is completed, one can log in as a regular user (mongoadmin) and use the following instructions to install MongoDB.
To update the Ubuntu packages to the most recent versions, use the next command:
To install the MongoDB package:
To check the service and database status:
Verifying the status of a successful MongoDB installation can be confirmed with an output similar to this:
$ adduser mongoadmin
$ usermod -aG sudo mongoadmin
$ sudo apt update
$ sudo apt install -y mongodb
$ sudo systemctl status mongodb
mongodb.service - An object/document-oriented database
   Loaded: loaded (/lib/systemd/system/mongodb.service; enabled; vendor preset: enabled)
   Active: active (running) since Sat 2018-11-15 07:48:04 UTC; 2min 17s ago
     Docs: man:mongod(1)
 Main PID: 2312 (mongod)
    Tasks: 23 (limit: 1153)
To verify the configuration, more specifically the installed version, server, and port, use the following command:
Similarly, to restart MongoDB, use the following:
To allow access to MongoDB from an outside hosted server one can use the following command, which opens the firewall connections [5].
Status can be verified by using:
Other MongoDB configurations can be edited through the /etc/mongodb.conf file, such as port, hostnames, and file paths.
Also, to complete this step, the server's IP address must be added to the bindIP value [5].
MongoDB is now listening for a remote connection that can be accessed by anyone with appropriate credentials [5].
9.1.2.2 Collections and Documents
Each database within the Mongo environment contains collections which in turn contain documents. Collections and documents are analogous to tables and rows respectively in relational databases. The document structure is in a key-value form which allows storing of complex data types composed out of field and value pairs. Documents are objects which correspond to native data types in many programming languages; hence a well defined, embedded document can
   CGroup: /system.slice/mongodb.service
           └─2312 /usr/bin/mongod --unixSocketPrefix=/run/mongodb --config /etc/mongodb.conf
$ mongo --eval 'db.runCommand({ connectionStatus: 1 })'
$ sudo systemctl restart mongodb
$ sudo ufw allow from your_other_server_ip/32 to any port 27017
$ sudo ufw status
$ sudo nano /etc/mongodb.conf
logappend=true
bind_ip = 127.0.0.1,your_server_ip
port = 27017
help reduce expensive joins and improve query performance. The _id field helps to identify each document uniquely [3].
MongoDB offers flexibility to write records that are not restricted by column types. The data storage approach is flexible as it allows one to store data as it grows and to fulfill varying needs of applications and/or users. It supports a JSON-like binary format known as BSON in which data can be stored without specifying the type of data. Moreover, it can be distributed to multiple machines at high speed. It includes a sharding feature that partitions and spreads the data out across various servers. This makes MongoDB an excellent choice for cloud data processing. Its utilities can load high volumes of data at high speed, which ultimately provides greater flexibility and availability in a cloud-based environment [2].
The dynamic schema structure within MongoDB allows easy testing of the small sprints in Agile project management life cycles and research projects that require frequent changes to the data structure with minimal downtime. Contrary to this flexible process, modifying the data structure of relational databases can be a very tedious process [2].
9.1.2.2.1 Collection example
The following collection example for a person named Albert includes additional information such as age, status, and group [7].
9.1.2.2.2 Document structure
9.1.2.2.3 Collection Operations
If a collection does not exist, the MongoDB database will create the collection by
{
    name: "Albert",
    age: "21",
    status: "Open",
    group: ["AI", "Machine Learning"]
}
{
    field1: value1,
    field2: value2,
    field3: value3,
    ...
    fieldN: valueN
}
default.
9.1.2.3 MongoDB Querying
The data retrieval patterns and the frequency of data manipulation statements such as inserts, updates, and deletes may demand the use of indexes or incorporating the sharding feature to improve query performance and efficiency of the MongoDB environment [3]. One of the significant differences between relational databases and NoSQL databases is joins. In a relational database, one can combine results from two or more tables using a common column, often called a key. The native table contains the primary key column while the referenced table contains a foreign key. This mechanism allows one to make changes in a single row instead of changing all rows in the referenced table. This action is referred to as normalization. MongoDB is a document database and mainly contains denormalized data, which means the data is repeated instead of indexed over a specific key. If the same data is required in more than one table, it needs to be repeated. This constraint has been eliminated in MongoDB's version 3.2. The new release introduced a $lookup feature which works much like a left outer join. Lookups are restricted to aggregate functions, which means that data usually needs some type of filtering and grouping operations to be conducted beforehand. For this reason, joins in MongoDB require more complicated querying compared to traditional relational database joins. Although at this time lookups are still very far from replacing joins, this is a prominent feature that can resolve some of the relational data challenges for MongoDB [8]. MongoDB queries support regular expressions as well as range queries for specific fields, which eliminates the need of returning entire documents [3]. MongoDB collections do not enforce document structure like SQL databases, which is a compelling feature. However, it is essential to keep in mind the needs of the applications [2].
9.1.2.3.1MongoQueriesexamples
ThequeriescanbeexecutedfromMongoshellaswellasthroughscripts.
To query the data from a MongoDB collection, one would use MongoDB’s
>db.myNewCollection1.insertOne({x:1})
>db.myNewCollection2.createIndex({y:1})
find()method.
Theoutputcanbeformattedbyusingthepretty()command.
TheMongoDBinsertstatementscanbeperformedinthefollowingmanner:
“The$lookup command performs a left-outer-join to an unshardedcollectioninthesamedatabasetofilterindocumentsfromthejoinedcollectionforprocessing”[9].
ThisoperationisequivalenttothefollowingSQLoperation:
ToperformaLikeMatch(Regex),onewouldusethefollowingcommand:
9.1.2.4MongoDBBasicFunctions
WhenitcomestothetechnicalelementsofMongoDB,itpossesarichinterfacefor importing and storage of external data in various formats. By using theMongoImport/Exporttool,onecaneasilytransfercontentsfromJSON,CSV,orTSV files into a database. MongoDB supports CRUD (create, read, update,delete) operations efficiently and has detailed documentation available on theproductwebsite.Itcanalsoquerythegeospatialdata,anditiscapableofstoringgeospatial data in GeoJSON objects. The aggregation operation of theMongoDB process data records and returns computed results. MongoDB
>db.COLLECTION_NAME.find()
>db.mycol.find().pretty()
>db.COLLECTION_NAME.insert(document)
${
$lookup:
{
from:<collectiontojoin>,
localField:<fieldfromtheinputdocuments>,
foreignField:<fieldfromthedocumentsofthe"from"collection>,
as:<outputarrayfield>
}
}
$SELECT*,<outputarrayfield>
FROMcollection
WHERE<outputarrayfield>IN(SELECT*
FROM<collectiontojoin>
WHERE<foreignField>=<collection.localField>);`
>db.products.find({sku:{$regex:/789$/}})
aggregationframeworkismodeledontheconceptofdatapipelines[10].
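To make the left-outer-join semantics of $lookup concrete, here is a pure-Python sketch over lists of dictionaries. This is illustrative only (the collection and field names are made up), not MongoDB's implementation:

```python
def lookup(input_docs, from_docs, local_field, foreign_field, as_field):
    """Emulate $lookup: attach to each input document an array of all
    documents in from_docs whose foreign_field equals its local_field."""
    joined = []
    for doc in input_docs:
        matches = [f for f in from_docs
                   if f.get(foreign_field) == doc.get(local_field)]
        # Like a left outer join, unmatched documents keep an empty array.
        joined.append({**doc, as_field: matches})
    return joined

orders = [{"item": "ball", "sku": 1}, {"item": "pen", "sku": 2}]
inventory = [{"sku": 1, "qty": 5}]

result = lookup(orders, inventory, "sku", "sku", "stock")
```

The first order gains one matching inventory document in its "stock" array, while the second gets an empty array rather than being dropped, which is exactly the left-outer-join behavior described above.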
9.1.2.4.1 Import/Export Function Examples

To import JSON documents, one would use the following command:

$ mongoimport --db users --collection contacts --file contacts.json

The CSV import uses the input file name to import a collection; hence, the collection name is optional [10]:

$ mongoimport --db users --type csv --headerline --file /opt/backups/contacts.csv

"Mongoexport is a utility that produces a JSON or CSV export of data stored in a MongoDB instance" [10]:

$ mongoexport --db test --collection traffic --out traffic.json

9.1.2.5 Security Features

Data security is a crucial aspect of enterprise infrastructure management and is the reason why MongoDB provides various security features such as role-based access control, numerous authentication options, and encryption. It supports mechanisms such as SCRAM, LDAP, and Kerberos authentication. The administrator can create role/collection-based access control; also, roles can be predefined or custom. MongoDB can audit activities such as DDL, CRUD statements, and authentication and authorization operations [11].

9.1.2.5.1 Collection-Based Access Control Example

A user-defined role can contain the following privileges [11]:

privileges: [
  { resource: { db: "products", collection: "inventory" }, actions: [ "find", "update" ] },
  { resource: { db: "products", collection: "orders" }, actions: [ "find" ] }
]

9.1.2.6 MongoDB Cloud Service

In regards to cloud technologies, MongoDB also offers a fully automated cloud service called Atlas with competitive pricing options. The MongoDB Atlas cloud interface offers an interactive GUI for managing cloud resources and deploying applications quickly. The service is equipped with geographically distributed instances to ensure no single point of failure. Also, a well-rounded performance monitoring interface allows users to promptly detect anomalies and generate index suggestions to optimize the performance and reliability of the database. Global technology leaders such as Google, Facebook, eBay, and Nokia are leveraging MongoDB and Atlas cloud services, making MongoDB one of the most popular choices among NoSQL databases [12].
9.1.3 PyMongo

PyMongo is the official Python driver, or distribution, that allows one to work with the NoSQL database MongoDB [13]. The first version of the driver was developed in 2009 [14], only two years after the development of MongoDB itself started. This driver allows developers to combine both Python's versatility and MongoDB's flexible schema nature into successful applications. Currently, the driver supports MongoDB versions 2.6, 3.0, 3.2, 3.4, 3.6, and 4.0 [15]. MongoDB and Python represent a compatible fit considering that BSON (binary JSON), used in this NoSQL database, is very similar to Python dictionaries, which makes the collaboration between the two even more appealing [16]. For this reason, dictionaries are the recommended tools to be used in PyMongo when representing documents [17].
9.1.3.1 Installation

Prior to being able to exploit the benefits of Python and MongoDB simultaneously, the PyMongo distribution must be installed using pip. To install it on all platforms, the following command should be used [18]:

$ python -m pip install pymongo

Specific versions of PyMongo can be installed with command lines such as in our example, where the 3.5.1 version is installed [18]:

$ python -m pip install pymongo==3.5.1

A single line of code can be used to upgrade the driver as well [18]:

$ python -m pip install --upgrade pymongo

Furthermore, the installation process can be completed with the help of the easy_install tool, which requires users to use the following command [18]:

$ python -m easy_install pymongo

To do an upgrade of the driver using this tool, the following command is recommended [18]:

$ python -m easy_install -U pymongo

There are many other ways of installing PyMongo directly from the source; however, they require the C extension dependencies to be installed prior to the driver installation step, as these installs skim through the sources on GitHub and use the most up-to-date links to install the driver [18].

To check if the installation was completed accurately, the following command is used in the Python console [19]:

import pymongo

If the command returns zero exceptions within the Python shell, one can consider the PyMongo installation to have been completed successfully.

9.1.3.2 Dependencies

The PyMongo driver has a few dependencies that should be taken into consideration prior to its usage. Currently, it supports CPython 2.7, 3.4+, PyPy, and PyPy3.5+ interpreters [15]. An optional dependency that requires some additional components to be installed is GSSAPI authentication [15]. For Unix-based machines, it requires pykerberos, while for Windows machines WinKerberos is needed to fulfill this requirement [15]. The automatic installation of this dependency can be done simultaneously with the driver installation, in the following manner:

$ python -m pip install pymongo[gssapi]

Other third-party dependencies such as ipaddress, certifi, or wincertstore are necessary for connections with the help of TLS/SSL and can also be installed simultaneously along with the driver installation [15].
9.1.3.3 Running PyMongo with the Mongo Daemon

Once PyMongo is installed, the Mongo daemon can be run with a very simple command in a new terminal window [19]:

$ mongod

9.1.3.4 Connecting to a Database Using MongoClient

In order to be able to establish a connection with a database, the MongoClient class needs to be imported, which subsequently allows the MongoClient object to communicate with the database [19]:

from pymongo import MongoClient
client = MongoClient()

This command allows a connection with the default localhost through port 27017; however, depending on the programming requirements, one can also specify those by listing them in the client instance or use the same information via the Mongo URI format [19].

9.1.3.5 Accessing Databases

Since MongoClient plays a server role, it can be used to access any desired databases in an easy way. To do that, one can use two different approaches. The first approach would be doing this via the attribute method, where the name of the desired database is listed as an attribute, and the second approach would include dictionary-style access [19]. For example, to access a database called cloudmesh_community, one would use the following commands for the attribute and for the dictionary method, respectively:

db = client.cloudmesh_community
db = client['cloudmesh_community']

9.1.3.6 Creating a Database

Creating a database is a straightforward process. First, one must create a MongoClient object and specify the connection (IP address) as well as the name of the database they are trying to create [20]. An example of this command is presented in the following section.
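The attribute style and the dictionary style shown above reach the same underlying database object. The following minimal pure-Python sketch (a toy class, not the real PyMongo MongoClient) shows one way both access styles can be routed to the same lookup, which is essentially what the driver does:

```python
class FakeClient:
    """Toy stand-in for MongoClient, for illustration only."""

    def __init__(self):
        self._dbs = {}

    def __getitem__(self, name):
        # Dictionary-style access: client['cloudmesh_community']
        return self._dbs.setdefault(name, {})

    def __getattr__(self, name):
        # Attribute-style access: client.cloudmesh_community
        # Delegates to __getitem__, so both styles share one lookup.
        return self[name]

client = FakeClient()
same = client.cloudmesh_community is client['cloudmesh_community']
```

Here `same` is True: whichever spelling is used, the same database object comes back, and the database springs into existence lazily on first access.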
9.1.3.7 Inserting and Retrieving Documents (Querying)

Creating documents and storing data using PyMongo is equally easy as accessing and creating databases. In order to add new data, a collection must be specified first. In this example, a decision is made to use the cloudmesh group of documents:

import pymongo
client = pymongo.MongoClient('mongodb://localhost:27017/')
db = client['cloudmesh']
cloudmesh = db.cloudmesh

Once this step is completed, data may be inserted using the insert_one() method, which means that only one document is being created. Of course, insertion of multiple documents at the same time is possible as well with the use of the insert_many() method [19]. An example of the insert_one() method is as follows:

course_info = {
    'course': 'Big Data Applications and Analytics',
    'instructor': 'Gregor von Laszewski',
    'chapter': 'technologies'
}
result = cloudmesh.insert_one(course_info)

Another example of this method would be to create a collection. If we wanted to create a collection of students in the cloudmesh_community, we would do it in the following manner:

student = [{'name': 'John', 'st_id': 52642},
           {'name': 'Mercedes', 'st_id': 5717},
           {'name': 'Anna', 'st_id': 5654},
           {'name': 'Greg', 'st_id': 5423},
           {'name': 'Amaya', 'st_id': 3540},
           {'name': 'Cameron', 'st_id': 2343},
           {'name': 'Bozer', 'st_id': 4143},
           {'name': 'Cody', 'st_id': 2165}]

client = MongoClient('mongodb://localhost:27017/')
with client:
    db = client.cloudmesh
    db.students.insert_many(student)

Retrieving documents is equally simple as creating them. The find_one() method can be used to retrieve one document [19]. An implementation of this method is given in the following example:

gregors_course = cloudmesh.find_one({'instructor': 'Gregor von Laszewski'})

Similarly, to retrieve multiple documents, one would use the find() method instead of find_one(). For example, to find all courses taught by professor von Laszewski, one would use the following command:

gregors_courses = cloudmesh.find({'instructor': 'Gregor von Laszewski'})

One thing that users should be cognizant of when using the find() method is that it does not return results in an array format but as a cursor object, which is a combination of methods that work together to help with data querying [19]. In order to return individual documents, iteration over the result must be completed [19].

9.1.3.8 Limiting Results

When it comes to working with large databases, it is always useful to limit the number of query results. PyMongo supports this option with its limit() method [20]. This method takes in one parameter, which specifies the number of documents to be returned [20]. For example, if we had a collection with a large number of cloud technologies as individual documents, one could modify the query results to return only the top 10 technologies. To do this, the following example could be utilized:

client = pymongo.MongoClient('mongodb://localhost:27017/')
db = client['cloudmesh']
col = db['technologies']
topten = col.find().limit(10)

9.1.3.9 Updating a Collection

Updating documents is very similar to inserting and retrieving them. Depending on the number of documents to be updated, one would use the update_one() or update_many() method [20]. Two parameters need to be passed into the update_one() method for it to execute successfully. The first argument is the query object that specifies the document to be changed, and the second argument is the object that specifies the new value in the document. An example of the update_one() method in action is the following:

myquery = {'course': 'Big Data Applications and Analytics'}
newvalues = {'$set': {'course': 'Cloud Computing'}}
cloudmesh.update_one(myquery, newvalues)

Updating all documents that fall under the same criteria can be done with the update_many() method [20]. For example, to update all documents in which the course title starts with the letter B with different instructor information, we would do the following:

client = pymongo.MongoClient('mongodb://localhost:27017/')
db = client['cloudmesh']
col = db['courses']
query = {'course': {'$regex': '^B'}}
newvalues = {'$set': {'instructor': 'Gregor von Laszewski'}}
edited = col.update_many(query, newvalues)

9.1.3.10 Counting Documents

Counting documents can be done with one simple operation called count_documents() instead of using a full query [21]. For example, we can count the documents in the cloudmesh_community by using the following command:

count = cloudmesh.count_documents({})

To create a more specific count, one would use a command similar to this:

count = cloudmesh.count_documents({'author': 'von Laszewski'})

This technology supports some more advanced querying options as well. Those advanced queries allow one to add certain constraints and narrow down the results even more. For example, to get the courses taught by professor von Laszewski before a certain date, one would use the following command:

import datetime
import pprint

d = datetime.datetime(2017, 11, 12, 12)
for course in cloudmesh.find({'date': {'$lt': d}}).sort('author'):
    pprint.pprint(course)

9.1.3.11 Indexing

Indexing is a very important part of querying. It can greatly improve query performance but also add functionality and aid in storing documents [21].

"To create a unique index on a key that rejects documents whose value for that key already exists in the index" [21],

we need to first create the index in the following manner:

result = db.profiles.create_index([('user_id', pymongo.ASCENDING)],
                                  unique=True)
sorted(list(db.profiles.index_information()))

This command actually creates two different indexes. The first one is the _id index, created by MongoDB automatically, and the second one is the user_id index, created by the user.

The purpose of those indexes is to cleverly prevent future additions of invalid user_ids into a collection.
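The effect of a unique index can be illustrated with a small pure-Python sketch: an insert whose key value has already been seen is rejected, just as the index above rejects duplicate user_id values. The class and names here are invented for illustration; this is not PyMongo's implementation:

```python
class UniqueIndexError(Exception):
    """Raised when an insert would violate the unique index."""


class IndexedCollection:
    """Toy collection enforcing a unique index on a single key."""

    def __init__(self, key):
        self.key = key
        self.docs = []
        self.index = set()  # the "index": all key values seen so far

    def insert(self, doc):
        value = doc[self.key]
        if value in self.index:
            raise UniqueIndexError(f"duplicate {self.key}: {value!r}")
        self.index.add(value)
        self.docs.append(doc)


profiles = IndexedCollection("user_id")
profiles.insert({"user_id": 211, "name": "Luke"})

rejected = False
try:
    profiles.insert({"user_id": 211, "name": "Ziltoid"})
except UniqueIndexError:
    rejected = True  # the duplicate user_id was refused
```

The set lookup also hints at why indexed queries are fast: membership is checked against a small index structure instead of scanning every document.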
9.1.3.12 Sorting

Sorting on the server side is also available via MongoDB. The PyMongo sort() method is equivalent to the SQL order by statement, and sorting can be performed as pymongo.ASCENDING and pymongo.DESCENDING [22]. This method is much more efficient as it is completed on the server side, compared to sorting completed on the client side. For example, to return all users with the first name Gregor sorted in descending order by birthdate, we would use a command such as this:

users = cloudmesh.users.find({'firstname': 'Gregor'}).sort('dateofbirth', pymongo.DESCENDING)
for user in users:
    print(user.get('email'))

9.1.3.13 Aggregation

Aggregation operations are used to process given data and produce summarized results. Aggregation operations collect data from a number of documents and provide collective results by grouping data. PyMongo in its documentation offers a separate framework that supports data aggregation. This aggregation framework can be used to

"provide projection capabilities to reshape the returned data" [23].

In the aggregation pipeline, documents pass through multiple pipeline stages, which convert documents into result data. The basic pipeline stages include filters. Those filters act like document transformations by helping change the document output form. Other pipeline stages help group or sort documents with specific fields. By using native operations from MongoDB, the pipeline operators are efficient in aggregating results.

The addFields stage is used to add new fields into documents. It reshapes each document in the stream, similarly to the project stage. The output document will contain the existing fields from the input documents as well as the newly added fields [24]. The following example shows how to add student details into a document:

db.cloudmesh_community.aggregate([
   {
      $addFields: {
         "document.StudentDetails": {
            $concat: ['$document.student.FirstName', '$document.student.LastName']
         }
      }
   }])

The bucket stage is used to categorize incoming documents into groups based on specified expressions. Those groups are called buckets [24]. The following example shows the bucket stage in action:

db.user.aggregate([
  {"$group": {
    "_id": {
      "city": "$city",
      "age": {
        "$let": {
          "vars": {
            "age": {"$subtract": [{"$year": new Date()}, {"$year": "$birthDay"}]}},
          "in": {
            "$switch": {
              "branches": [
                {"case": {"$lt": ["$$age", 20]}, "then": 0},
                {"case": {"$lt": ["$$age", 30]}, "then": 20},
                {"case": {"$lt": ["$$age", 40]}, "then": 30},
                {"case": {"$lt": ["$$age", 50]}, "then": 40},
                {"case": {"$lt": ["$$age", 200]}, "then": 50}
              ]}}}}},
    "count": {"$sum": 1}}}
])

In the bucketAuto stage, the boundaries are automatically determined in an attempt to evenly distribute documents into a specified number of buckets. In the following operation, input documents are grouped into four buckets according to the values in the price field [24]:

db.artwork.aggregate([
   {
     $bucketAuto: {
         groupBy: "$price",
         buckets: 4
     }
   }
])

The collStats stage returns statistics regarding a collection or view [24]:

db.matrices.aggregate([{$collStats: {latencyStats: {histograms: true}}}])

The count stage passes to the next stage a document that contains the number of documents that were input to the stage [24]:

db.scores.aggregate([
   {$match: {score: {$gt: 80}}},
   {$count: "passing_scores"}])

The facet stage helps process multiple aggregation pipelines in a single stage [24]:

db.artwork.aggregate([{
  $facet: {
    "categorizedByTags": [
      {$unwind: "$tags"},
      {$sortByCount: "$tags"}],
    "categorizedByPrice": [
      // Filter out documents without a price e.g., _id: 7
      {$match: {price: {$exists: 1}}},
      {$bucket: {
        groupBy: "$price",
        boundaries: [0, 150, 200, 300, 400],
        default: "Other",
        output: {
          "count": {$sum: 1},
          "titles": {$push: "$title"}
        }}}],
    "categorizedByYears(Auto)": [
      {$bucketAuto: {groupBy: "$year", buckets: 4}}]
  }}])

The geoNear stage returns an ordered stream of documents based on the proximity to a geospatial point. The output documents include an additional distance field and can include a location identifier field [24]:

db.places.aggregate([
   {$geoNear: {
      near: {type: "Point", coordinates: [-73.99279, 40.719296]},
      distanceField: "dist.calculated",
      maxDistance: 2,
      query: {type: "public"},
      includeLocs: "dist.location",
      num: 5,
      spherical: true
   }}])

The graphLookup stage performs a recursive search on a collection. To each output document, it adds a new array field that contains the traversal results of the recursive search for that document [24]:

db.travelers.aggregate([
   {
      $graphLookup: {
         from: "airports",
         startWith: "$nearestAirport",
         connectFromField: "connects",
         connectToField: "airport",
         maxDepth: 2,
         depthField: "numConnections",
         as: "destinations"
      }
   }
])

The group stage consumes the document data per each distinct group. It has a RAM limit of 100 MB. If the stage exceeds this limit, $group produces an error [24]:

db.sales.aggregate(
   [
     {
       $group: {
         _id: {month: {$month: "$date"}, day: {$dayOfMonth: "$date"},
               year: {$year: "$date"}},
         totalPrice: {$sum: {$multiply: ["$price", "$quantity"]}},
         averageQuantity: {$avg: "$quantity"},
         count: {$sum: 1}
       }
     }
   ]
)

The indexStats stage returns statistics regarding the use of each index for a collection [24]:

db.orders.aggregate([{$indexStats: {}}])

The limit stage is used for controlling the number of documents passed to the next stage in the pipeline [24]:

db.article.aggregate(
    {$limit: 5}
)

The listLocalSessions stage gives the session information currently connected to the mongos or mongod instance [24]:

db.aggregate([{$listLocalSessions: {allUsers: true}}])

The listSessions stage lists out all sessions that have been active long enough to propagate to the system.sessions collection [24]:

use config
db.system.sessions.aggregate([{$listSessions: {allUsers: true}}])

The lookup stage is useful for performing outer joins to other collections in the same database [24]:

{
   $lookup:
     {
       from: <collection to join>,
       localField: <field from the input documents>,
       foreignField: <field from the documents of the "from" collection>,
       as: <output array field>
     }
}

The match stage is used to filter the document stream. Only matching documents pass to the next stage [24]:

db.articles.aggregate(
    [{$match: {author: "dave"}}]
)

The project stage is used to reshape the documents by adding or deleting fields [24]:

db.books.aggregate([{$project: {title: 1, author: 1}}])

The redact stage reshapes stream documents by restricting information using information stored in the documents themselves [24]:

db.accounts.aggregate(
  [
    {$match: {status: "A"}},
    {
      $redact: {
        $cond: {
          if: {$eq: ["$level", 5]},
          then: "$$PRUNE",
          else: "$$DESCEND"
        }}}]);

The replaceRoot stage is used to replace a document with a specified embedded document [24]:

db.produce.aggregate([
   {
     $replaceRoot: {newRoot: "$in_stock"}
   }
])

The sample stage is used to sample data by randomly selecting a number of documents from the input [24]:

db.users.aggregate(
   [{$sample: {size: 3}}]
)

The skip stage skips a specified initial number of documents and passes the remaining documents to the pipeline [24]:

db.article.aggregate(
    {$skip: 5}
);

The sort stage is useful for reordering the document stream by a specified sort key [24]:

db.users.aggregate(
   [
     {$sort: {age: -1, posts: 1}}
   ]
)

The sortByCount stage groups the incoming documents based on a specified expression value and counts the documents in each distinct group [24]:

db.exhibits.aggregate(
   [{$unwind: "$tags"}, {$sortByCount: "$tags"}])

The unwind stage deconstructs an array field from the input documents to output a document for each element [24]:

db.inventory.aggregate([{$unwind: "$sizes"}])
db.inventory.aggregate([{$unwind: {path: "$sizes"}}])

The out stage is used to write aggregation pipeline results into a collection. This stage should be the last stage of a pipeline [24]:

db.books.aggregate([
    {$group: {_id: "$author", books: {$push: "$title"}}},
    {$out: "authors"}
])

Another option among the aggregation operations is the Map/Reduce framework, which essentially includes two different functions, map and reduce. The first one provides the key-value pair for each tag in the array, while the latter

"sums over all of the emitted values for a given key" [23].

The last step in the Map/Reduce process is to call the map_reduce() function and iterate over the results [23]. The Map/Reduce operation provides result data in a collection or returns results in-line. One can perform subsequent operations with the same input collection if the output of the same is written to a collection [25]. An operation that produces results in an in-line form must provide results within the BSON document size limit. The current limit for a BSON document is 16 MB. These types of operations are not supported by views [25]. PyMongo's API supports all features of MongoDB's Map/Reduce engine [26]. Moreover, Map/Reduce has the ability to get more detailed results by passing the full_response=True argument to the map_reduce() function [26].

9.1.3.14 Deleting Documents from a Collection

The deletion of documents with PyMongo is fairly straightforward. To do so, one would use the remove() method of the PyMongo Collection object [22]. Similarly to reads and updates, specification of the documents to be removed is a must. For example, removal of all documents with a score of 1 would require one to use the following command:

cloudmesh.users.remove({'score': 1}, safe=True)

The safe parameter set to True ensures the operation was completed [22].
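Conceptually, removing all documents that match a query behaves like filtering a list of dictionaries. The following pure-Python sketch (with invented sample data) mirrors the remove-by-score example above without needing a server:

```python
users = [
    {"name": "Albert", "score": 1},
    {"name": "Mercedes", "score": 5},
    {"name": "John", "score": 1},
]

# Keep only the documents that do NOT match the query {'score': 1};
# everything matching the query is deleted.
users = [u for u in users if u.get("score") != 1]
```

After the operation only the non-matching document remains, just as only non-matching documents survive a remove-by-query on a real collection.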
9.1.3.15 Copying a Database

Copying databases within the same mongod instance or between different mongod servers is made possible with the command() method after connecting to the desired mongod instance [27]. For example, to copy the cloudmesh database and name the new database cloudmesh_copy, one would use the command() method in the following manner:

client.admin.command('copydb',
                     fromdb='cloudmesh',
                     todb='cloudmesh_copy')

There are two ways to copy a database between servers. If a server is not password-protected, one would not need to pass in the credentials, nor to authenticate to the admin database [27]. In that case, to copy a database one would use the following command:

client.admin.command('copydb',
                     fromdb='cloudmesh',
                     todb='cloudmesh_copy',
                     fromhost='source.example.com')

On the other hand, if the server where we are copying the database to is protected, one would use this command instead:

client = MongoClient('target.example.com',
                     username='administrator',
                     password='pwd')
client.admin.command('copydb',
                     fromdb='cloudmesh',
                     todb='cloudmesh_copy',
                     fromhost='source.example.com')

9.1.3.16 PyMongo Strengths

One of PyMongo's strengths is that it allows document creation and querying natively

"through the use of existing language features such as nested dictionaries and lists" [22].

For moderately experienced Python developers, it is very easy to learn and quickly feel comfortable with.

"For these reasons, MongoDB and Python make a powerful combination for rapid, iterative development of horizontally scalable backend applications" [22].

According to [22], MongoDB is very applicable to modern applications, which makes PyMongo equally valuable.
9.1.4 MongoEngine

"MongoEngine is an Object-Document Mapper, written in Python for working with MongoDB" [28].

It is a library that allows a more advanced communication with MongoDB compared to PyMongo. As MongoEngine is technically considered to be an object-document mapper (ODM), it can also be considered to be

"equivalent to a SQL-based object relational mapper (ORM)" [19].

The primary reason why one would use an ODM is data conversion between computer systems that are not compatible with each other [29]. For the purpose of converting data to the appropriate form, a virtual object database must be created within the utilized programming language [29]. This library is also used to define schemata for documents within MongoDB, which ultimately helps with minimizing coding errors as well as defining methods on existing fields [30]. It is also very beneficial to the overall workflow as it tracks changes made to documents and aids in the document saving process [31].
9.1.4.1 Installation

The installation process for this technology is fairly simple as it is considered to be a library. To install it, one would use the following command [32]:

$ pip install mongoengine

A bleeding-edge version of MongoEngine can be installed directly from GitHub by first cloning the repository on the local machine, virtual machine, or cloud.

9.1.4.2 Connecting to a Database Using MongoEngine

Once installed, MongoEngine needs to be connected to an instance of the mongod, similarly to PyMongo [33]. The connect() function must be used to successfully complete this step, and the argument that must be used in this function is the name of the desired database [33]. Prior to using this function, the function name needs to be imported from the MongoEngine library:

from mongoengine import connect
connect('cloudmesh_community')

Similarly to the MongoClient, MongoEngine uses the localhost and port 27017 by default; however, the connect() function also allows specifying other host and port arguments as well [33]:

connect('cloudmesh_community', host='196.185.1.62', port=16758)

Other types of connections are also supported (i.e., URI), and they can be completed by providing the URI in the connect() function [33].

9.1.4.3 Querying Using MongoEngine

To query MongoDB using MongoEngine, an objects attribute is used, which is, technically, a part of the document class [34]. This attribute is called the QuerySetManager, which in return

"creates a new QuerySet object on access" [34].

To be able to access individual documents from a database, this object needs to be iterated over. For example, to return/print all students in the cloudmesh_community object (database), the following command would be used:

for user in cloudmesh_community.objects:
    print(user)

MongoEngine also has a capability of query filtering, which means that a keyword can be used within the called QuerySet object to retrieve specific information [34]. Let us say one would like to iterate over cloudmesh_community students that are natives of Indiana. To achieve this, one would use the following command:

indy_students = cloudmesh_community.objects(state='IN')
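The keyword filtering shown above can be sketched in plain Python to make its semantics concrete. This is an illustrative stand-in, not MongoEngine's internals: each keyword argument becomes an equality condition that every returned document must satisfy:

```python
def filter_objects(docs, **conditions):
    """Return all documents whose fields equal every keyword condition,
    mimicking QuerySet-style calls such as objects(state='IN')."""
    return [d for d in docs
            if all(d.get(field) == value for field, value in conditions.items())]

# Hypothetical sample data standing in for the students collection.
students = [
    {"name": "Anna", "state": "IN"},
    {"name": "Greg", "state": "NY"},
    {"name": "Cody", "state": "IN"},
]

indy_students = filter_objects(students, state="IN")
```

Multiple keywords simply AND together, e.g. `filter_objects(students, state="IN", name="Anna")` narrows the result further.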
This library also allows the use of all operators except for the equality operator in its queries, and moreover, it has the capability of handling string queries, geo queries, list querying, and querying of the raw PyMongo queries [34].

The string queries are useful in performing text operations in conditional queries. A query to find a document exactly matching the state ACTIVE can be performed in the following manner:

db.cloudmesh_community.find(State.exact("ACTIVE"))

The query to retrieve document data for names that start with the case-sensitive AL can be written as:

db.cloudmesh_community.find(Name.startswith("AL"))

To perform the same query for the case-insensitive AL, one would use the following command:

db.cloudmesh_community.find(Name.istartswith("AL"))

MongoEngine allows data extraction of geographical locations by using Geo queries. The geo_within operator checks if a geometry is within a polygon:

cloudmesh_community.objects(
    point__geo_within=[[[40, 5], [40, 6], [41, 6], [40, 5]]])
cloudmesh_community.objects(
    point__geo_within={"type": "Polygon",
                       "coordinates": [[[40, 5], [40, 6], [41, 6], [40, 5]]]})

The list query looks up the documents where the specified field matches exactly the given value. To match all pages that have the word coding as an item in the tags list, one would use the following query:

class Page(Document):
    tags = ListField(StringField())

Page.objects(tags='coding')

Overall, it would be safe to say that MongoEngine has good compatibility with Python. It provides different functions to utilize Python easily with MongoDB, which makes this pair even more attractive to application developers.

9.1.5 Flask-PyMongo

"Flask is a micro-web framework written in Python" [35].

It was developed after Django, and it is very pythonic in nature, which implies that it is explicitly targeting the Python user community. It is lightweight, as it does not require additional tools or libraries, and hence it is classified as a micro-web framework. It is often used with MongoDB using the PyMongo connector, and it treats data within MongoDB as searchable Python dictionaries. Applications such as Pinterest, LinkedIn, and the community web page for Flask are using the Flask framework. Moreover, it supports various features such as RESTful request dispatching, secure cookies, Google App Engine compatibility, and integrated support for unit testing, etc. [35]. When it comes to connecting to a database, the connection details for MongoDB can be passed as a variable or configured in the PyMongo constructor with additional arguments such as username and password, if required. It is important that the versions of both Flask and MongoDB are compatible with each other to avoid functionality breaks [36].
9.1.5.1 Installation

Flask-PyMongo can be installed with an easy command such as this:

$ pip install Flask-PyMongo

PyMongo can be added in the following manner:

from flask import Flask
from flask_pymongo import PyMongo

app = Flask(__name__)
app.config["MONGO_URI"] = "mongodb://localhost:27017/cloudmesh_community"
mongo = PyMongo(app)

9.1.5.2 Configuration

There are two ways to configure Flask-PyMongo. The first way would be to pass a MongoDB URI to the PyMongo constructor, while the second way would be to

"assign it to the MONGO_URI Flask configuration variable" [36].

9.1.5.3 Connection to Multiple Databases/Servers

Multiple PyMongo instances can be used to connect to multiple databases or database servers. To achieve this, one would use a command similar to the following:

app = Flask(__name__)
mongo1 = PyMongo(app, uri="mongodb://localhost:27017/cloudmesh_community_one")
mongo2 = PyMongo(app, uri="mongodb://localhost:27017/cloudmesh_community_two")
mongo3 = PyMongo(app, uri="mongodb://another.host:27017/cloudmesh_community_Three")

9.1.5.4 Flask-PyMongo Methods

Flask-PyMongo provides helpers for some common tasks. One of them is the Collection.find_one_or_404 method, shown in the following example:

@app.route("/user/<username>")
def user_profile(username):
    user = mongo.db.cloudmesh_community.find_one_or_404({"_id": username})
    return render_template("user.html", user=user)

This method is very similar to MongoDB's find_one() method; however, instead of returning None it causes a 404 Not Found HTTP status [36].

Similarly, the PyMongo.send_file and PyMongo.save_file methods work on file-like objects and save them to GridFS using the given filename [36].

9.1.5.5 Additional Libraries

Flask-MongoAlchemy and Flask-MongoEngine are additional libraries that can be used to connect to a MongoDB database while using enhanced features with the Flask app. Flask-MongoAlchemy is used as a proxy between Python and MongoDB to connect. It provides options such as server- or database-based authentication to connect to MongoDB. While the default is set to server-based, to use database-based authentication, the config value MONGOALCHEMY_SERVER_AUTH parameter must be set to False [37].

Flask-MongoEngine is the Flask extension that provides integration with MongoEngine. It handles connection management for the apps. It can be installed through pip and set up very easily as well. The default configuration is set to the local host and port 27017. For a custom port, and in cases where MongoDB is running on another server, the host and port must be explicitly specified in connect strings within the MONGODB_SETTINGS dictionary with app.config, along with the database username and password, in cases where database authentication is enabled. The URI-style connections are also supported; supply the URI as the host in the MONGODB_SETTINGS dictionary with app.config. There are various custom QuerySets available within Flask-MongoEngine that are attached to MongoEngine's default QuerySet [38].
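The behavior of the find_one_or_404 helper described above can be sketched without Flask: return the first matching document, or raise an error that plays the role of the 404 response. The names and sample data here are hypothetical, for illustration only:

```python
class NotFoundError(Exception):
    """Stands in for Flask's 404 Not Found response in this sketch."""


def find_one_or_404(docs, query):
    """Return the first document matching every field in query,
    or raise NotFoundError instead of returning None."""
    for doc in docs:
        if all(doc.get(k) == v for k, v in query.items()):
            return doc
    raise NotFoundError(404)


users = [{"_id": "gregor", "role": "instructor"}]

found = find_one_or_404(users, {"_id": "gregor"})

missing = False
try:
    find_one_or_404(users, {"_id": "unknown"})
except NotFoundError:
    missing = True  # in Flask this would abort the request with a 404 page
```

Raising instead of returning None is the design point: the web handler never has to remember the None check, because a failed lookup short-circuits straight to an error response.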
9.1.5.6 Classes and Wrappers

Attributes such as cx and db in the PyMongo objects are the ones that help provide access to the MongoDB server [36]. To achieve this, one must pass the Flask app to the constructor or call init_app() [36].

“Flask-PyMongo wraps PyMongo's MongoClient, Database, and Collection classes, and overrides their attribute and item accessors” [36].

This type of wrapping allows Flask-PyMongo to add methods to Collection while at the same time allowing MongoDB-style dotted expressions in the code [36].

Flask-PyMongo creates connectivity between Python and Flask using a MongoDB database and supports

“extensions that can add application features as if they were implemented in Flask itself” [39],

hence, it can be used as additional Flask functionality in Python code. The extensions are there for the purpose of supporting form validations, authentication technologies, object-relational mappers, and framework-related tools, which ultimately add a lot of strength to this micro web framework [39]. One of the main reasons and benefits why it is frequently used with MongoDB is its capability of adding more control over databases and history [39].
type(mongo.cx)
type(mongo.db)
type(mongo.db.cloudmesh_community)

9.2 MONGOENGINE ☁

9.2.1 Introduction

MongoEngine is a document mapper for working with MongoDB from Python. To be able to use MongoEngine, MongoDB should already be installed and running.
9.2.2 Install and connect

Mongoengine can be installed by running:

This will install six, pymongo, and mongoengine.

To connect to MongoDB, use the connect() function and specify the database name. You do not need to go to the mongo shell; this can be done from the Unix shell or the command line. In this case we are connecting to a database named student_db.

If MongoDB is running on a port different from the default port, the port number and host need to be specified. If MongoDB requires authentication, a username and password need to be specified.
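A minimal sketch of such a connection follows; the host, port, and credentials here are placeholders, not values from the book:

```python
from mongoengine import connect

# connect to student_db on a non-default host and port,
# authenticating with placeholder credentials
connect('student_db',
        host='mongo.example.com',
        port=12345,
        username='student',
        password='secret')
```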
9.2.3 Basics

MongoDB does not enforce schemas. Compared to an RDBMS, a row in MongoDB is called a “document” and a table corresponds to a collection. Defining a schema is helpful as it minimizes coding errors. To define a schema we create a class that inherits from Document.
Fields are not mandatory, but if needed, set the required keyword argument to True. There are multiple field types available, and each field can be customized with keyword arguments. If each student is sending text messages to the university's central database, these can be stored using MongoDB. Each text can have different data types; some might have images and some might have URLs. So we can create a class Text and link it to Student by using a ReferenceField (similar
$ pip install mongoengine

from mongoengine import *
connect('student_db')

from mongoengine import *

class Student(Document):
    first_name = StringField(max_length=50)
    last_name = StringField(max_length=50)
to a foreign key in an RDBMS).
MongoDB supports adding tags to individual texts rather than storing them separately and then having them referenced. Similarly, comments can also be stored directly in a Text.

For accessing data, for example if we need to get all titles:

Searching texts with tags:
class Text(Document):
    title = StringField(max_length=120, required=True)
    author = ReferenceField(Student)
    meta = {'allow_inheritance': True}

class OnlyText(Text):
    content = StringField()

class ImagePost(Text):
    image_path = StringField()

class LinkPost(Text):
    link_url = StringField()

class Text(Document):
    title = StringField(max_length=120, required=True)
    author = ReferenceField(Student)
    tags = ListField(StringField(max_length=30))
    comments = ListField(EmbeddedDocumentField(Comment))

for text in OnlyText.objects:
    print(text.title)

for text in Text.objects(tags='mongodb'):
    print(text.title)
10 OTHER

10.1 WORD COUNT WITH PARALLEL PYTHON ☁

We will demonstrate Python's multiprocessing API for parallel computation by writing a program that counts how many times each word in a collection of documents appears.

10.1.1 Generating a Document Collection

Before we begin, let us write a script that will generate document collections by specifying the number of documents and the number of words per document. This will make benchmarking straightforward.

To keep it simple, the vocabulary of the document collection will consist of random numbers rather than the words of an actual language:

'''Usage: generate_nums.py [-h] NUM_LISTS INTS_PER_LIST MIN_INT MAX_INT DEST_DIR
Generate random lists of integers and save them
as 1.txt, 2.txt, etc.

Arguments:
  NUM_LISTS      The number of lists to create.
  INTS_PER_LIST  The number of integers in each list.
  MIN_INT        Each generated integer will be >= MIN_INT.
  MAX_INT        Each generated integer will be <= MAX_INT.
  DEST_DIR       A directory where the generated numbers will be stored.

Options:
  -h --help
'''
from __future__ import print_function
import os, random, logging
from docopt import docopt

def generate_random_lists(num_lists,
                          ints_per_list, min_int, max_int):
    return [[random.randint(min_int, max_int)
             for i in range(ints_per_list)] for i in range(num_lists)]

if __name__ == '__main__':
    args = docopt(__doc__)
    num_lists, ints_per_list, min_int, max_int, dest_dir = [
        int(args['NUM_LISTS']),
        int(args['INTS_PER_LIST']),
        int(args['MIN_INT']),
        int(args['MAX_INT']),
        args['DEST_DIR']
    ]
    if not os.path.exists(dest_dir):
        os.makedirs(dest_dir)

    lists = generate_random_lists(num_lists,
                                  ints_per_list,
                                  min_int,
                                  max_int)
    curr_list = 1
    for lst in lists:
        with open(os.path.join(dest_dir, '%d.txt' % curr_list), 'w') as f:
            f.write(os.linesep.join(map(str, lst)))
        curr_list += 1
    logging.debug('Numbers written.')

Notice that we are using the docopt module that you should be familiar with from the Section [Python DocOpts](#s-python-docopts) to make the script easy to run from the command line.

You can generate a document collection with this script as follows:

python generate_nums.py 1000 10000 0 100 docs-1000-10000

10.1.2 Serial Implementation

A first serial implementation of word count is straightforward:
'''Usage: wordcount.py [-h] DATA_DIR

Read a collection of .txt documents and count how many times each word
appears in the collection.

Arguments:
  DATA_DIR  A directory with documents (.txt files).

Options:
  -h --help
'''
from __future__ import division, print_function
import os, glob, logging
from docopt import docopt

logging.basicConfig(level=logging.DEBUG)

def wordcount(files):
    counts = {}
    for filepath in files:
        with open(filepath, 'r') as f:
            words = [word.strip() for word in f.read().split()]
        for word in words:
            if word not in counts:
                counts[word] = 0
            counts[word] += 1
    return counts

if __name__ == '__main__':
    args = docopt(__doc__)
    if not os.path.exists(args['DATA_DIR']):
        raise ValueError('Invalid data directory: %s' % args['DATA_DIR'])
    counts = wordcount(glob.glob(os.path.join(args['DATA_DIR'], '*.txt')))
    logging.debug(counts)
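As an aside, the counting loop above can also be written with the standard library's collections.Counter, which implements exactly this pattern of dictionary-based counting. A small sketch, not part of the original script:

```python
from collections import Counter

def wordcount_counter(texts):
    # Counter.update() adds the occurrences of each word to the tally
    counts = Counter()
    for text in texts:
        counts.update(word.strip() for word in text.split())
    return counts

# counting words across two tiny "documents"
print(wordcount_counter(["a b a", "b c"]))
```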
10.1.3 Serial Implementation Using map and reduce

We can improve the serial implementation in anticipation of parallelizing the program by making use of Python's map and reduce functions.

In short, you can use map to apply the same function to the members of a collection. For example, to convert a list of numbers to strings, you could do:

We can use reduce to apply the same function cumulatively to the items of a sequence. For example, to find the total of the numbers in our list, we could use reduce as follows:

We can simplify this even more by using a lambda function:

You can read more about Python's lambda function in the docs.

With this in mind, we can reimplement the word count example as follows:
import random
nums = [random.randint(1, 2) for _ in range(10)]
print(nums)

[2, 1, 1, 1, 2, 2, 2, 2, 2, 2]

print(map(str, nums))

['2', '1', '1', '1', '2', '2', '2', '2', '2', '2']

def add(x, y):
    return x + y

print(reduce(add, nums))

17

print(reduce(lambda x, y: x + y, nums))

17
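Note that the output shown here is from Python 2. Under Python 3, map() returns a lazy iterator rather than a list, and reduce() has moved to the functools module, so the equivalent Python 3 code is:

```python
from functools import reduce  # reduce is no longer a builtin in Python 3

nums = [2, 1, 1, 1, 2, 2, 2, 2, 2, 2]

# map() returns an iterator in Python 3; wrap it in list() to materialize it
print(list(map(str, nums)))

# reduce applies the function cumulatively to the items of the sequence
print(reduce(lambda x, y: x + y, nums))
```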
'''Usage: wordcount_mapreduce.py [-h] DATA_DIR

Read a collection of .txt documents and count how many times each word
appears in the collection.

Arguments:
  DATA_DIR  A directory with documents (.txt files).

Options:
  -h --help
'''
from __future__ import division, print_function
import os, glob, logging
from docopt import docopt
from functools import reduce  # needed in Python 3, where reduce is no longer a builtin

logging.basicConfig(level=logging.DEBUG)

def count_words(filepath):
    counts = {}
    with open(filepath, 'r') as f:
        words = [word.strip() for word in f.read().split()]
    for word in words:
        if word not in counts:
            counts[word] = 0
        counts[word] += 1
    return counts

def merge_counts(counts1, counts2):
    for word, count in counts2.items():
        if word not in counts1:
            counts1[word] = 0
        counts1[word] += counts2[word]
    return counts1

if __name__ == '__main__':
    args = docopt(__doc__)
    if not os.path.exists(args['DATA_DIR']):
        raise ValueError('Invalid data directory: %s' % args['DATA_DIR'])
    per_doc_counts = list(map(count_words,
                              glob.glob(os.path.join(args['DATA_DIR'],
                                                     '*.txt'))))
    counts = reduce(merge_counts, [{}] + per_doc_counts)
    logging.debug(counts)

10.1.4 Parallel Implementation

Drawing on the previous implementation using map and reduce, we can parallelize the implementation using Python's multiprocessing API:

'''Usage: wordcount_mapreduce_parallel.py [-h] DATA_DIR NUM_PROCESSES

Read a collection of .txt documents and count, in parallel, how many
times each word appears in the collection.

Arguments:
  DATA_DIR       A directory with documents (.txt files).
  NUM_PROCESSES  The number of parallel processes to use.

Options:
  -h --help
'''
from __future__ import division, print_function
import os, glob, logging
from docopt import docopt
from wordcount_mapreduce import count_words, merge_counts
from multiprocessing import Pool
from functools import reduce  # needed in Python 3

logging.basicConfig(level=logging.DEBUG)

if __name__ == '__main__':
    args = docopt(__doc__)
    if not os.path.exists(args['DATA_DIR']):
        raise ValueError('Invalid data directory: %s' % args['DATA_DIR'])
    num_processes = int(args['NUM_PROCESSES'])
    pool = Pool(processes=num_processes)
    per_doc_counts = pool.map(count_words,
                              glob.glob(os.path.join(args['DATA_DIR'],
                                                     '*.txt')))
    counts = reduce(merge_counts, [{}] + per_doc_counts)
    logging.debug(counts)

10.1.5 Benchmarking
To time each of the examples, enter it into its own Python file and use Linux's time command:

The output contains the real run time and the user run time. real is wall clock time - the time from start to finish of the call. user is the amount of CPU time spent in user-mode code (outside the kernel) within the process, that is, only the actual CPU time used in executing the process.
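The same distinction can be observed from within Python itself. In this sketch, time.perf_counter() tracks wall-clock (real) time while time.process_time() tracks CPU time, so a sleeping process accumulates real time but almost no user time:

```python
import time

start_wall = time.perf_counter()
start_cpu = time.process_time()

time.sleep(0.2)  # sleeping consumes wall-clock time, but (almost) no CPU time

wall = time.perf_counter() - start_wall
cpu = time.process_time() - start_cpu
print('real (wall) time:', wall)
print('user (CPU) time: ', cpu)
```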
10.1.6 Exercises

E.python.wordcount.1:

Run the three different programs (serial, serial w/ map and reduce, parallel) and answer the following questions:

1. Is there any performance difference between the different versions of the program?
2. Does user time significantly differ from real time for any of the versions of the program?
3. Experiment with different numbers of processes for the parallel example, starting with 1. What is the performance gain when you go from 1 to 2 processes? From 2 to 3? When do you stop seeing improvement? (this will depend on your machine architecture)

10.1.7 References

Map, Filter and Reduce
multiprocessing API
10.2 NUMPY ☁

NumPy is a popular library that is used by many other Python packages such as Pandas, SciPy, and scikit-learn. It provides a fast, simple-to-use way of interacting with numerical data organized in vectors and matrices. In this section, we will provide a short introduction to NumPy.
$ time python wordcount.py docs-1000-10000
10.2.1 Installing NumPy

The most common way of installing NumPy, if it wasn't included with your Python installation, is to install it via pip:

If NumPy has already been installed, you can update to the most recent version using:

You can verify that NumPy is installed by trying to use it in a Python program:

Note that, by convention, we import NumPy using the alias 'np' - whenever you see 'np' sprinkled in example Python code, it's a good bet that it is using NumPy.

10.2.2 NumPy Basics

At its core, NumPy is a container for n-dimensional data. Typically, 1-dimensional data is called an array and 2-dimensional data is called a matrix. Beyond 2 dimensions would be considered a multidimensional array. Examples where you'll encounter these dimensions may include:

1 Dimensional: time series data such as audio, stock prices, or a single observation in a dataset.
2 Dimensional: connectivity data between network nodes, user-product recommendations, and database tables.
3+ Dimensional: network latency between nodes over time, video (RGB+time), and version controlled datasets.
All of these data can be placed into NumPy's array object, just with varying dimensions.
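Each of the dimensionalities listed above maps onto the same ndarray type. A small sketch with made-up data:

```python
import numpy as np

series = np.array([1.0, 2.0, 3.0])    # 1-D: e.g. a short time series
matrix = np.array([[1, 2], [3, 4]])   # 2-D: e.g. a small table
video = np.zeros((10, 32, 32, 3))     # 4-D: 10 RGB frames of 32x32 pixels

print(series.ndim, matrix.ndim, video.ndim)
```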
10.2.3 Data Types: The Basic Building Blocks

Before we delve into arrays and matrices, we will start off with the most basic
$ pip install numpy

$ pip install -U numpy

import numpy as np
element of those: a single value. NumPy can represent data utilizing many different standard datatypes such as uint8 (an 8-bit unsigned integer), float64 (a 64-bit float), or str (a string). An exhaustive listing can be found at:
https://docs.scipy.org/doc/numpy-1.15.0/user/basics.types.html
Before moving on, it is important to know about the tradeoff made when using different datatypes. For example, a uint8 can only contain values between 0 and 255. This, however, contrasts with float64, which can express any value from +/- 1.80e+308. So why wouldn't we just always use float64s? Though they allow us to be more expressive in terms of numbers, they also consume more memory. If we were working with a 12 megapixel grayscale image, for example, storing that image using uint8 values would require 3000 * 4000 * 8 = 96 million bits, or about 11.44 MB of memory. If we were to store the same image utilizing float64, our image would consume 8 times as much memory: 768 million bits, or about 91.55 MB. It's important to use the right datatype for the job to avoid consuming unnecessary resources or slowing down processing.
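NumPy can report this memory use directly via the nbytes attribute, so the tradeoff is easy to check. A sketch using a 12-megapixel grayscale image of zeros:

```python
import numpy as np

img8 = np.zeros((3000, 4000), dtype=np.uint8)     # one byte per pixel
img64 = np.zeros((3000, 4000), dtype=np.float64)  # eight bytes per pixel

print(img8.nbytes)   # 12,000,000 bytes, about 11.44 MB
print(img64.nbytes)  # 96,000,000 bytes, about 91.55 MB
```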
Finally, while NumPy will conveniently convert between datatypes, one must be aware of overflows when using smaller datatypes. For example:

In this example, it makes sense that 6+7=13. But how does 13+245=2? Put simply, the object type (uint8) simply ran out of space to store the value and wrapped back around to the beginning. An 8-bit number is only capable of storing 2^8, or 256, unique values. An operation that results in a value above that range will 'overflow' and cause the value to wrap back around to zero. Likewise, anything below that range will 'underflow' and wrap back around to the end. In our example, 13+245 became 258, which was too large to store in 8 bits; it wrapped back around to 0 and ended up at 2.
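The wrap-around can be reproduced with plain Python, since unsigned 8-bit arithmetic is simply arithmetic modulo 256 (a sketch):

```python
# an 8-bit unsigned integer holds 2**8 = 256 distinct values (0..255),
# so additions behave like arithmetic modulo 256
print((6 + 7) % 256)     # 13, no overflow
print((13 + 245) % 256)  # 258 overflows and wraps around to 2
```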
NumPy will, generally, try to avoid this situation by dynamically retyping to whatever datatype will support the result:
a = np.array([6], dtype=np.uint8)
print(a)
>>> [6]
a = a + np.array([7], dtype=np.uint8)
print(a)
>>> [13]
a = a + np.array([245], dtype=np.uint8)
print(a)
>>> [2]
a = a + 260
Here, our addition caused our array, 'a', to be upscaled to use uint16 instead of uint8. Finally, NumPy offers convenience functions akin to Python's range() function to create arrays of sequential numbers:

We can use this function to also generate parameter spaces that can be iterated on:

10.2.4 Arrays: Stringing Things Together

With our knowledge of datatypes in hand, we can begin to explore arrays. Simply put, arrays can be thought of as a sequence of values (not necessarily numbers). Arrays are 1 dimensional and can be created and accessed simply:

Arrays (and, later, matrices) are zero-indexed. This makes it convenient when, for example, using Python's range() function to iterate through an array:

Arrays are, also, mutable and can be changed easily:

NumPy also includes incredibly powerful broadcasting features. This makes it very simple to perform mathematical operations on arrays in a way that also makes
print(a)
>>> [262]
X = np.arange(0.2, 1, .1)
print(X)
>>> array([0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9], dtype=float32)
P = 10.0**np.arange(-7, 1, 1)
print(P)
for x, p in zip(X, P):
    print('%f, %f' % (x, p))
a = np.array([1, 2, 3])
print(type(a))
>>> <class 'numpy.ndarray'>
print(a)
>>> [1 2 3]
print(a.shape)
>>> (3,)
a[0]
>>> 1

for i in range(3):
    print(a[i])
>>> 1
>>> 2
>>> 3

a[0] = 42
print(a)
>>> array([42, 2, 3])
intuitive sense:

Arrays can also interact with other arrays:

In this example, the result of multiplying two arrays together is the element-wise product, while multiplying by a constant will multiply each element in the array by that constant. NumPy supports all of the basic mathematical operations: addition, subtraction, multiplication, division, and powers. It also includes an extensive suite of mathematical functions, such as log() and max(), which are covered later.
10.2.5 Matrices: An Array of Arrays

Matrices can be thought of as an extension of arrays - rather than having one dimension, matrices have 2 (or more). Much like arrays, matrices can be created easily within NumPy:

Accessing individual elements is similar to how we did it for arrays. We simply need to pass in a number of arguments equal to the number of dimensions:

In this example, our first index selected the row and the second selected the column - giving us our result of 3. Matrices can be extended out to any number of dimensions by simply using more indices to access specific elements (though use-cases beyond 4 may be somewhat rare).
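As a quick sketch of indexing beyond two dimensions, a 3-dimensional array simply takes a third index:

```python
import numpy as np

# a 2 x 3 x 4 array filled with the values 0..23
t = np.arange(24).reshape(2, 3, 4)

# one index per dimension: block 1, row 2, column 3
print(t[1, 2, 3])
```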
Matrices support all of the normal mathematical functions such as +, -, *, and /. A special note: the * operator will result in an element-wise multiplication. Use @ or np.matmul() for matrix multiplication:
a * 3
>>> array([3, 6, 9])
a**2
>>> array([1, 4, 9], dtype=int32)

b = np.array([2, 3, 4])
print(a * b)
>>> array([2, 6, 12])

m = np.array([[1, 2], [3, 4]])
print(m)
>>> [[1 2]
>>>  [3 4]]
m[1][0]
>>> 3
More complex mathematical functions can typically be found within the NumPy library itself:

A full listing can be found at: https://docs.scipy.org/doc/numpy/reference/routines.math.html

10.2.6 Slicing Arrays and Matrices

As one can imagine, accessing elements one-at-a-time is both slow and can potentially require many lines of code to iterate over every dimension in the matrix. Thankfully, NumPy incorporates a very powerful slicing engine that allows us to access ranges of elements easily:

The ':' value tells NumPy to select all elements in the given dimension. Here, we've requested all elements in the second row. We can also use indexing to request elements within a given range:

Here, we asked NumPy to give us elements 4 through 7 (ranges in Python are inclusive at the start and non-inclusive at the end). We can even go backwards:

In the previous example, the negative value is asking NumPy to return the last 5 elements of the array. Had the argument been ':-5', NumPy would've returned everything BUT the last five elements:

Becoming more familiar with NumPy's accessor conventions will allow you to write more efficient, clearer code, as it is easier to read a simple one-line accessor
print(m - m)
print(m * m)
print(m / m)

print(np.sin(x))
print(np.sum(x))

m[1, :]
>>> array([3, 4])

a = np.arange(0, 10, 1)
print(a)
>>> [0 1 2 3 4 5 6 7 8 9]
a[4:8]
>>> array([4, 5, 6, 7])

a[-5:]
>>> array([5, 6, 7, 8, 9])
a[:-5]
>>> array([0, 1, 2, 3, 4])
than it is a multi-line, nested loop when extracting values from an array or matrix.

10.2.7 Useful Functions

The NumPy library provides several convenient mathematical functions that users can use. These functions provide several advantages over code written by users:

They are open source and typically have multiple contributors checking for errors.
Many of them utilize a C interface and will run much faster than native Python code.
They're written to be very flexible.

NumPy arrays and matrices contain many useful aggregating functions such as max(), min(), mean(), etc. These functions are usually able to run an order of magnitude faster than looping through the object, so it's important to understand what functions are available to avoid 'reinventing the wheel.' In addition, many of the functions are able to sum or average across axes, which makes them extremely useful if your data has inherent grouping. To return to a previous example:

In this example, we created a 2x2 matrix containing the numbers 1 through 4. The sum of the matrix returned the sum of all elements in the matrix. Summing across axis 0 collapses the rows, returning the column-wise sums [4, 6]. Likewise, summing across axis 1 collapses the columns, returning the row-wise sums [3, 7].
10.2.8 Linear Algebra

Perhaps one of the most important uses for NumPy is its robust support for
m = np.array([[1, 2], [3, 4]])
print(m)
>>> [[1 2]
>>>  [3 4]]
m.sum()
>>> 10
m.sum(axis=1)
>>> [3, 7]
m.sum(axis=0)
>>> [4, 6]
Linear Algebra functions. Like the aggregation functions described in the previous section, these functions are optimized to be much faster than user implementations and can utilize processor-level features to provide very quick computations. These functions can be accessed very easily from the NumPy package:

Included within np.linalg are functions for calculating the Eigendecomposition of square matrices and symmetric matrices. Finally, to give a quick example of how easy it is to implement algorithms in NumPy, we can easily use it to calculate the cost and gradient when using simple Mean-Squared-Error (MSE):

Finally, more advanced functions are easily available to users via the linalg library of NumPy as:
10.2.9 NumPy Resources

https://docs.scipy.org/doc/numpy
http://cs231n.github.io/python-numpy-tutorial/#numpy
https://docs.scipy.org/doc/numpy-1.15.1/reference/routines.linalg.html
https://en.wikipedia.org/wiki/Mean_squared_error

10.3 SCIPY ☁

SciPy is a library built around NumPy and has a number of off-the-shelf algorithms and operations implemented. These include algorithms from calculus (such as integration), statistics, linear algebra, image-processing, signal
a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])
print(np.matmul(a, b))
>>> [[19 22]
     [43 50]]
cost = np.power(Y - np.matmul(X, weights), 2).mean(axis=1)
gradient = np.matmul(X.T, np.matmul(X, weights) - Y)
from numpy import linalg

A = np.diag((1, 2, 3))
w, v = linalg.eig(A)
print('w =', w)
print('v =', v)
processing, and machine learning.

To achieve this, SciPy bundles a number of useful open-source software packages for mathematics, science, and engineering. It includes the following packages:
NumPy, for managing N-dimensional arrays
SciPy library, to access fundamental scientific computing capabilities
Matplotlib, to conduct 2D plotting
IPython, for an interactive console (see jupyter)
Sympy, for symbolic mathematics
pandas, for providing data structures and analysis
10.3.1 Introduction

First we add the usual scientific computing modules with the typical abbreviations, including sp for scipy. We could invoke scipy's statistical package as sp.stats, but for the sake of laziness we abbreviate that too.

Now we create some random data to play with. We generate 100 samples from a
import numpy as np                    # import numpy
import scipy as sp                    # import scipy
from scipy import stats               # refer directly to stats rather than sp.stats
import matplotlib as mpl              # for visualization
from matplotlib import pyplot as plt  # refer directly to pyplot
                                      # rather than mpl.pyplot
Gaussian distribution centered at zero.

How many elements are in the set?

What is the mean (average) of the set?

What is the minimum of the set?

What is the maximum of the set?

We can use the scipy functions too. What's the median?

What about the standard deviation and variance?

Isn't the variance the square of the standard deviation?

How close are the measures? The difference is small, as the following calculation shows.

How does this look as a histogram? See Figure 18, Figure 19, Figure 20.
s = sp.randn(100)

print('There are', len(s), 'elements in the set')
print('The mean of the set is', s.mean())
print('The minimum of the set is', s.min())
print('The maximum of the set is', s.max())
print('The median of the set is', sp.median(s))
print('The standard deviation is', sp.std(s),
      'and the variance is', sp.var(s))
print('The square of the standard deviation is', sp.std(s)**2)
print('The difference is', abs(sp.std(s)**2 - sp.var(s)))
print('And in decimal form, the difference is %0.16f' %
      (abs(sp.std(s)**2 - sp.var(s))))

plt.hist(s)  # yes, one line of code for a histogram
plt.show()
Figure 18: Histogram 1

Let us add some titles.

Figure 19: Histogram 2

Typically we do not include titles when we prepare images for inclusion in LaTeX. There we use the caption to describe what the figure is about.
plt.clf()  # clear out the previous plot
plt.hist(s)
plt.title("Histogram Example")
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.show()

plt.clf()  # clear out the previous plot
plt.hist(s)
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.show()

Figure 20: Histogram 3

Let us try out some linear regression, or curve fitting. See Figure 21.
import random

def F(x):
    return 2*x - 2

def add_noise(x):
    return x + random.uniform(-1, 1)

X = range(0, 10, 1)

Y = []
for i in range(len(X)):
    Y.append(add_noise(X[i]))

plt.clf()  # clear out the old figure
plt.plot(X, Y, '.')
plt.show()
Figure 21: Result 1

Now let's try linear regression to fit the curve.

What is the slope and y-intercept of the fitted curve?

Now let's see how well the curve fits the data. We'll call the fitted curve F'.

To save images into a PDF file for inclusion into LaTeX documents you can save the images as follows. Other formats such as png are also possible, but the quality is naturally not sufficient for inclusion in papers and documents. For that
m, b, r, p, est_std_err = stats.linregress(X, Y)
print('The slope is', m, 'and the y-intercept is', b)

def Fprime(x):  # the fitted curve
    return m*x + b

X = range(0, 10, 1)
Yprime = []
for i in range(len(X)):
    Yprime.append(Fprime(X[i]))

plt.clf()  # clear out the old figure
# the observed points, blue dots
plt.plot(X, Y, '.', label='observed points')
# the interpolated curve, connected red line
plt.plot(X, Yprime, 'r-', label='estimated points')
plt.title("Linear Regression Example")  # title
plt.xlabel("x")                         # horizontal axis title
plt.ylabel("y")                         # vertical axis title
# legend labels to plot
plt.legend(['observed points', 'estimated points'])
# comment out so that you can save the figure
# plt.show()
you certainly want to use PDF. The save of the figure has to occur before you use the show() command. See Figure 22.

Figure 22: Result 2

10.3.2 References

For more information about SciPy we recommend that you visit the following link:

https://www.scipy.org/getting-started.html#learning-to-work-with-scipy

Additional material and inspiration for this section are from:
“Getting Started guide,” https://www.scipy.org/getting-started.html

Prasanth. “Simple statistics with SciPy.” Comfort at 1 AU. February 28, 2011. https://oneau.wordpress.com/2011/02/28/simple-statistics-with-scipy/

SciPy Cookbook. Last updated: 2015. http://scipy-cookbook.readthedocs.io/
plt.savefig("regression.pdf", bbox_inches='tight')
plt.savefig('regression.png')
plt.show()
10.4 SCIKIT-LEARN ☁

Learning Objectives

Exploratory data analysis
Pipeline to prepare data
Full learning pipeline
Fine tune the model
Significance tests
10.4.1 Introduction to Scikit-learn

Scikit-learn is a machine learning library for Python. The library can be used for data mining and analysis. It is built on top of NumPy, matplotlib, and SciPy. Scikit-learn features dimensionality reduction, clustering, regression, and classification algorithms. It also features model selection using grid search, cross validation, and metrics.

Scikit-learn also enables users to preprocess the data, which can then be used for machine learning, using modules like preprocessing and feature extraction.

In this section we demonstrate how simple it is to use k-means in scikit-learn.

10.4.2 Installation

If you already have a working installation of numpy and scipy, the easiest way to install scikit-learn is using pip:
$ pip install numpy
$ pip install scipy -U
$ pip install -U scikit-learn

10.4.3 Supervised Learning
Supervised learning is used in machine learning when we already know a set of output predictions based on input characteristics, and based on that we need to predict the target for a new input. Training data is used to train the model, which then can be used to predict the output from a bounded set.

Problems can be of two types:

1. Classification: Training data belongs to a finite set of classes/categories, and based on the labels we want to predict the class/category for unlabeled data.
2. Regression: Training data consists of inputs with corresponding continuous target values, and the goal is to predict a continuous value for a new input.
10.4.4 Unsupervised Learning

Unsupervised learning is used in machine learning when we have the training set available but without any corresponding targets. The outcome of the problem is to discover groups within the provided input. It can be done in many ways.

A few of them are listed here:

1. Clustering: Discover groups of similar characteristics.
2. Density Estimation: Finding the distribution of data within the provided input, or changing the data from a high dimensional space to two or three dimensions.
10.4.5 Building an end to end pipeline for supervised machine learning using Scikit-learn

A data pipeline is a set of processing components that are sequenced to produce meaningful data. Pipelines are commonly used in machine learning, since there is a lot of data transformation and manipulation that needs to be applied to make data useful for machine learning. All components are sequenced in a way that the output of one component becomes the input for the next, and each of the components is self contained. Components interact with each other using data.

Even if a component breaks, the downstream components can run normally using the last output. Sklearn provides the ability to build pipelines through which data can be transformed and modeled for machine learning.
10.4.6 Steps for developing a machine learning model

1. Explore the domain space
2. Extract the problem definition
3. Get the data that can be used to make the system learn to solve the problem definition.
4. Discover and visualize the data to gain insights
5. Feature engineering and prepare the data
6. Fine tune your model
7. Evaluate your solution using metrics
8. Once proven, launch and maintain the model.
10.4.7 Exploratory Data Analysis

Example project: Fraud detection system

The first step is to load the data into a dataframe in order for a proper analysis to be done on the attributes.

Perform a basic analysis on the data shape and null value information.

Here are examples of a few of the visual data analysis methods.

10.4.7.1 Bar plot

A bar chart or graph is a graph with rectangular bars or bins that are used to plot categorical values. Each bar in the graph represents a categorical variable and the height of the bar is proportional to the value represented by it.
data = pd.read_csv('dataset/data_file.csv')
data.head()

print(data.shape)
print(data.info())
data.isnull().values.any()
Bar graphs are used:

To make comparisons between variables
To visualize any trend in the data, i.e., they show the dependence of one variable on another
To estimate values of a variable

Figure 23: Example of scikit-learn bar plots
10.4.7.2 Correlation between attributes

Attributes in a dataset can be related based on different aspects.

Examples include attributes dependent on one another, or loosely or tightly coupled attributes. Another example is two variables being associated with a third one.

In order to understand the relationship between attributes, correlation represents the best visual way to get an insight. Positive correlation means both attributes move in the same direction. Negative correlation refers to opposite directions: an increase in one attribute's values results in a decrease in the other's. Zero
plt.ylabel('Transactions')
plt.xlabel('Type')
data.type.value_counts().plot.bar()
correlation is when the attributes are unrelated.

Figure 24: scikit-learn correlation array

10.4.7.3 Histogram Analysis of dataset attributes

A histogram consists of a set of counts that represent the number of times some
# compute the correlation matrix
corr = data.corr()

# generate a mask for the lower triangle
mask = np.zeros_like(corr, dtype=np.bool)
mask[np.triu_indices_from(mask)] = True

# set up the matplotlib figure
f, ax = plt.subplots(figsize=(18, 18))

# generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)

# draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3,
            square=True,
            linewidths=.5, cbar_kws={"shrink": .5}, ax=ax);
event occurred.

Figure 25: scikit-learn

10.4.7.4 Box plot Analysis

Box plot analysis is useful in detecting whether a distribution is skewed and in detecting outliers in the data.
%matplotlib inline
data.hist(bins=30, figsize=(20, 15))
plt.show()

fig, axs = plt.subplots(2, 2, figsize=(10, 10))
tmp = data.loc[(data.type == 'TRANSFER'), :]
a = sns.boxplot(x='isFlaggedFraud', y='amount', data=tmp, ax=axs[0][0])
axs[0][0].set_yscale('log')
b = sns.boxplot(x='isFlaggedFraud', y='oldbalanceDest', data=tmp, ax=axs[0][1])
axs[0][1].set(ylim=(0, 0.5e8))
c = sns.boxplot(x='isFlaggedFraud', y='oldbalanceOrg', data=tmp, ax=axs[1][0])
axs[1][0].set(ylim=(0, 3e7))
d = sns.regplot(x='oldbalanceOrg', y='amount', data=tmp.loc[(tmp.isFlaggedFraud == 1), :], ax=axs[1][1])
plt.show()
Figure 26: scikit-learn

10.4.7.5 Scatter plot Analysis

The scatter plot displays values of two numerical variables as Cartesian coordinates.

plt.figure(figsize=(12, 8))
sns.pairplot(data[['amount', 'oldbalanceOrg', 'oldbalanceDest', 'isFraud']], hue='isFraud')

Figure 27: scikit-learn scatter plots

10.4.8 Data Cleansing - Removing Outliers
If the transaction amount is lower than 5 percent of all the transactions AND does not exceed USD 3000, we will exclude it from our analysis to reduce Type 1 costs.

If the transaction amount is higher than 95 percent of all the transactions AND exceeds USD 500000, we will exclude it from our analysis, and use a blanket review process for such transactions (similar to the isFlaggedFraud column in the original dataset) to reduce Type 2 costs.

low_exclude = np.round(np.minimum(fin_samp_data.amount.quantile(0.05), 3000), 2)
high_exclude = np.round(np.maximum(fin_samp_data.amount.quantile(0.95), 500000), 2)

### Updating Data to exclude records prone to Type 1 and Type 2 costs
low_data = fin_samp_data[fin_samp_data.amount > low_exclude]
data = low_data[low_data.amount < high_exclude]
10.4.9 Pipeline Creation

A machine learning pipeline is used to help automate machine learning workflows. It operates by enabling a sequence of data transformations to be chained together into a model that can be tested and evaluated to achieve an outcome, whether positive or negative.

10.4.9.1 Defining a DataFrameSelector to separate Numerical and Categorical attributes

A sample class to separate out numerical and categorical attributes.

10.4.9.2 Feature Creation / Additional Feature Engineering

During EDA we identified that there are transactions where the balances do not tally after the transaction is completed. We believe this could potentially be cases where fraud is occurring. To account for this error in the transactions, we define two new features, “errorBalanceOrig” and “errorBalanceDest”, calculated by adjusting the amount with the before and after balances for the Originator and Destination accounts.

Below, we create a transformer class that allows us to create these features in a pipeline.
from sklearn.base import BaseEstimator, TransformerMixin

# Create a class to select numerical or categorical columns
# since Scikit-Learn doesn't handle DataFrames yet
class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names].values
from sklearn.base import BaseEstimator, TransformerMixin

# column index
amount_ix, oldbalanceOrg_ix, newbalanceOrig_ix, oldbalanceDest_ix, newbalanceDest_ix = 0, 1, 2, 3, 4

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self):  # no *args or **kargs
        pass
    def fit(self, X, y=None):
        return self  # nothing else to do
    def transform(self, X, y=None):
        errorBalanceOrig = X[:, newbalanceOrig_ix] + X[:, amount_ix] - X[:, oldbalanceOrg_ix]
        errorBalanceDest = X[:, oldbalanceDest_ix] + X[:, amount_ix] - X[:, newbalanceDest_ix]
        return np.c_[X, errorBalanceOrig, errorBalanceDest]
10.4.10 Creating Training and Testing datasets

The training set includes the set of input examples that the model will be fit to, or trained on, by adjusting its parameters. The testing dataset is critical for testing the generalizability of the model. By using this set, we can get the working accuracy of our model.

The testing set should not be exposed to the model until model training has been completed. This way the results from testing will be more reliable.
10.4.11 Creating pipelines for numerical and categorical attributes

Identifying columns with numerical and categorical characteristics.

10.4.12 Selecting the algorithm to be applied

Algorithm selection primarily depends on the objective you are trying to solve and what kind of dataset is available. There are different types of algorithms which can be applied and we will look into a few of them here.
10.4.12.1 Linear Regression

This algorithm can be applied when you want to compute some continuous value. To predict some future value of a process which is currently running, you
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42, stratify=y)

X_train_num = X_train[["amount", "oldbalanceOrg", "newbalanceOrig", "oldbalanceDest", "newbalanceDest"]]
X_train_cat = X_train[["type"]]
X_model_col = ["amount", "oldbalanceOrg", "newbalanceOrig", "oldbalanceDest", "newbalanceDest", "type"]

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import Imputer

num_attribs = list(X_train_num)
cat_attribs = list(X_train_cat)

num_pipeline = Pipeline([
    ('selector', DataFrameSelector(num_attribs)),
    ('attribs_adder', CombinedAttributesAdder()),
    ('std_scaler', StandardScaler())
])

cat_pipeline = Pipeline([
    ('selector', DataFrameSelector(cat_attribs)),
    ('cat_encoder', CategoricalEncoder(encoding="onehot-dense"))
])
can go with a regression algorithm.

Examples where linear regression can be used are:

1. Predict the time taken to go from one place to another
2. Predict the sales for a future month
3. Predict sales data and improve yearly projections
10.4.12.2 Logistic Regression

This algorithm can be used to perform binary classification. It can be used if you want a probabilistic framework, or in case you expect to receive more training data in the future that you want to be able to quickly incorporate into your model.

1. Customer churn prediction
2. Credit scoring & fraud detection, which is the example problem we are trying to solve in this chapter
3. Calculating the effectiveness of marketing campaigns
10.4.12.3 Decision trees

Decision trees handle feature interactions and they are non-parametric. They do not
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
import time

scl = StandardScaler()
X_train_std = scl.fit_transform(X_train)
X_test_std = scl.transform(X_test)

start = time.time()
lin_reg = LinearRegression()
lin_reg.fit(X_train_std, y_train)  # SKLearn's linear regression
y_train_pred = lin_reg.predict(X_train_std)
train_time = time.time() - start
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X_train, _, y_train, _ = train_test_split(X_train, y_train, stratify=y_train, train_size=subsample_rate, random_state=42)
X_test, _, y_test, _ = train_test_split(X_test, y_test, stratify=y_test, train_size=subsample_rate, random_state=42)

model_lr_sklearn = LogisticRegression(multi_class="multinomial", C=1e6, solver="sag", max_iter=15)
model_lr_sklearn.fit(X_train, y_train)

y_pred_test = model_lr_sklearn.predict(X_test)
acc = accuracy_score(y_test, y_pred_test)
results.loc[len(results)] = ["LR Sklearn", np.round(acc, 3)]
results
support online learning, and the entire tree needs to be rebuilt when a new training dataset comes in. Memory consumption is very high.

They can be used for the following cases:

1. Investment decisions
2. Customer churn
3. Bank loan defaulters
4. Build vs Buy decisions
5. Sales lead qualifications
10.4.12.4 KMeans

This algorithm is used when we are not aware of the labels and one needs to be created based on the features of objects. An example would be to divide a group of people into different subgroups based on a common theme or attribute.

The main disadvantage of K-means is that you need to know exactly the number of clusters or groups which is required. It takes a lot of iterations to come up with the best K.
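Because the best K has to be found iteratively, a common heuristic is the elbow method: fit K-means for several values of K and watch where the inertia (within-cluster sum of squared distances) stops dropping sharply. A minimal sketch on synthetic data from make_blobs, invented here for illustration:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# synthetic data with 4 known clusters
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# fit K-means for a range of K and record the inertia for each
inertias = {}
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias[k] = km.inertia_  # sum of squared distances to the closest centroid

for k in sorted(inertias):
    print(k, round(inertias[k], 1))
```

The inertia drops steeply up to the true number of clusters and flattens afterwards; the "elbow" of that curve is the usual choice for K.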
10.4.12.5 Support Vector Machines
from sklearn.tree import DecisionTreeRegressor

dt = DecisionTreeRegressor()

start = time.time()
dt.fit(X_train_std, y_train)
y_train_pred = dt.predict(X_train_std)
train_time = time.time() - start

start = time.time()
y_test_pred = dt.predict(X_test_std)
test_time = time.time() - start
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, GridSearchCV, PredefinedSplit
from sklearn.metrics import accuracy_score

X_train, _, y_train, _ = train_test_split(X_train, y_train, stratify=y_train, train_size=subsample_rate, random_state=42)
X_test, _, y_test, _ = train_test_split(X_test, y_test, stratify=y_test, train_size=subsample_rate, random_state=42)

model_knn_sklearn = KNeighborsClassifier(n_jobs=-1)
model_knn_sklearn.fit(X_train, y_train)

y_pred_test = model_knn_sklearn.predict(X_test)
acc = accuracy_score(y_test, y_pred_test)
results.loc[len(results)] = ["KNN Arbitrary Sklearn", np.round(acc, 3)]
results
SVM is a supervised ML technique used for pattern recognition and classification problems when your data has exactly two classes. It is popular in text classification problems.

A few cases where SVM can be used are:

1. Detecting persons with common diseases
2. Hand-written character recognition
3. Text categorization
4. Stock market price prediction
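In the grid-search experiments later in this section, SVC appears only inside a pipeline. A minimal standalone sketch on synthetic two-class data (invented for illustration, not the fraud dataset) looks like this:

```python
from sklearn.svm import SVC
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# synthetic two-class data
X, y = make_classification(n_samples=200, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

clf = SVC(kernel='rbf', C=1.0)     # RBF kernel, the common default
clf.fit(X_train, y_train)
score = clf.score(X_test, y_test)  # accuracy on held-out data
print(round(score, 3))
```

C and gamma are the hyperparameters that the grid search below tunes for this same estimator.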
10.4.12.6 Naive Bayes

Naive Bayes is used for large datasets. This algorithm works well even when we have limited CPU and memory available. It works by calculating a bunch of counts. It requires less training data. The algorithm cannot learn interactions between features.

Naive Bayes can be used in real-world applications such as:

1. Sentiment analysis and text classification
2. Recommendation systems like Netflix, Amazon
3. To mark an email as spam or not spam
4. Face recognition
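The "bunch of counts" remark can be made concrete with a tiny spam/ham sketch in scikit-learn; the four-document corpus below is invented for illustration:

```python
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer

# hypothetical miniature corpus
train_texts = ["win money now", "cheap money offer",
               "meeting at noon", "lunch at noon"]
train_labels = ["spam", "spam", "ham", "ham"]

vec = CountVectorizer()
X = vec.fit_transform(train_texts)          # word-count matrix
clf = MultinomialNB().fit(X, train_labels)  # per-class word counts + priors

pred_spam = clf.predict(vec.transform(["cheap money now"]))
pred_ham = clf.predict(vec.transform(["meeting at lunch"]))
print(pred_spam[0], pred_ham[0])
```

Training amounts to counting how often each word occurs per class; prediction multiplies the resulting per-word probabilities, which is why the method is so cheap on CPU and memory.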
10.4.12.7 Random Forest

Random forest is similar to a Decision tree. It can be used for both regression and classification problems with large datasets.

A few cases where it can be applied:

1. Predict patients for high risks
2. Predict parts failures in manufacturing
3. Predict loan defaulters
from sklearn.ensemble import RandomForestRegressor

forest = RandomForestRegressor(n_estimators=400, criterion='mse', random_state=1, n_jobs=-1)

start = time.time()
forest.fit(X_train_std, y_train)
10.4.12.8 Neural networks

A neural network works based on the weights of the connections between neurons. The weights are trained, and based on them the neural network can be utilized to predict the class or a quantity. Neural networks are resource and memory intensive.

A few cases where they can be applied:

1. Unsupervised learning tasks, such as feature extraction
2. Extracting features from raw images or speech with much less human intervention
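The experiments later in this section use Keras for deep learning; as a small stand-in that needs only scikit-learn, MLPClassifier trains exactly such connection weights by backpropagation. The data here is synthetic, invented for illustration:

```python
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_classification

# synthetic two-class data
X, y = make_classification(n_samples=300, n_features=10, random_state=42)
X = StandardScaler().fit_transform(X)  # scaled inputs help the optimizer

# one hidden layer of 16 neurons; the connection weights are what gets trained
net = MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=42)
net.fit(X, y)
train_acc = net.score(X, y)
print(round(train_acc, 3))
```

Even this tiny network has hundreds of weights (10x16 plus 16x1 plus biases), which hints at why larger networks are resource and memory intensive.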
10.4.12.9 Deep Learning using Keras

Keras is one of the most powerful and easy-to-use Python libraries for developing and evaluating deep learning models. It wraps the efficient numerical computation libraries Theano and TensorFlow.
10.4.12.10 XGBoost

XGBoost stands for eXtreme Gradient Boosting. XGBoost is an implementation of gradient boosted decision trees designed for speed and performance. It is engineered for efficiency of compute time and memory resources.
10.4.13 Scikit Cheat Sheet

The scikit-learn project has put together a very in-depth and well explained flowchart to help you choose the right algorithm, which we find very handy.
y_train_pred = forest.predict(X_train_std)
train_time = time.time() - start

start = time.time()
y_test_pred = forest.predict(X_test_std)
test_time = time.time() - start

Figure 28: scikit-learn
10.4.14 Parameter Optimization

Machine learning models are parameterized so that their behavior can be tuned for a given problem. These models can have many parameters, and finding the best combination of parameters can be treated as a search problem.

A parameter is a configuration that is part of the model, and its values can be derived from the given data. Parameters are:

1. Required by the model when making predictions
2. Values that define the skill of the model on your problem
3. Estimated or learned from data
4. Often not set manually by the practitioner
5. Often saved as part of the learned model
10.4.14.1 Hyperparameter optimization/tuning algorithms

Grid search is an approach to hyperparameter tuning that will methodically build and evaluate a model for each combination of algorithm parameters specified in a grid.

Random search provides a statistical distribution for each hyperparameter from which values may be randomly sampled.
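With scikit-learn, random search looks roughly like the following sketch; the synthetic data, the distributions, and n_iter=5 are arbitrary choices for illustration:

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# synthetic data standing in for the real problem
X, y = make_classification(n_samples=200, random_state=42)

# a distribution per hyperparameter instead of a fixed grid
param_dist = {
    'n_estimators': randint(10, 100),  # sampled uniformly from [10, 100)
    'max_depth': randint(2, 10),
}

search = RandomizedSearchCV(RandomForestClassifier(random_state=42),
                            param_dist, n_iter=5, cv=3, random_state=42)
search.fit(X, y)  # evaluates only 5 sampled combinations
print(search.best_params_)
```

Unlike the exhaustive grid search used below, the cost here is fixed by n_iter rather than by the size of the grid, which is why random search scales better to many hyperparameters.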
10.4.15 Experiments with Keras (deep learning), XGBoost, and SVM (SVC) compared to Logistic Regression (Baseline)

10.4.15.1 Creating a parameter grid

10.4.15.2 Implementing Grid search with the models and also creating metrics from each of the models
grid_param = [
    [{  # LogisticRegression
        'model__penalty': ['l1', 'l2'],
        'model__C': [0.01, 1.0, 100]
    }],
    [{  # keras
        'model__optimizer': optimizer,
        'model__loss': loss
    }],
    [{  # SVM
        'model__C': [0.01, 1.0, 100],
        'model__gamma': [0.5, 1],
        'model__max_iter': [-1]
    }],
    [{  # XGBClassifier
        'model__min_child_weight': [1, 3, 5],
        'model__gamma': [0.5],
        'model__subsample': [0.6, 0.8],
        'model__colsample_bytree': [0.6],
        'model__max_depth': [3]
    }]
]
Pipeline(memory=None,
    steps=[('preparation', FeatureUnion(n_jobs=None,
        transformer_list=[('num_pipeline', Pipeline(memory=None,
            steps=[('selector', DataFrameSelector(attribute_names=['amount', 'oldbalanceOrg', 'newbalanceOrig', 'oldbalanceDest'
    tol=0.0001, verbose=0, warm_start=False))])
from sklearn.metrics import mean_squared_error
from sklearn.metrics import classification_report
from sklearn.metrics import f1_score
from xgboost.sklearn import XGBClassifier
from sklearn.svm import SVC

test_scores = []

# Machine Learning Algorithm (MLA) Selection and Initialization
MLA = [
    linear_model.LogisticRegression(),
    keras_model,
    SVC(),
    XGBClassifier()
]

10.4.15.3 Results table from the Model evaluation with metrics.
# create table to compare MLA metrics
MLA_columns = ['Name', 'Score', 'Accuracy_Score', 'ROC_AUC_score', 'final_rmse', 'Classification_error', 'Recall_Score', 'Precision_Score']
MLA_compare = pd.DataFrame(columns=MLA_columns)
Model_Scores = pd.DataFrame(columns=['Name', 'Score'])

row_index = 0
for alg in MLA:
    # set name and parameters
    MLA_name = alg.__class__.__name__
    MLA_compare.loc[row_index, 'Name'] = MLA_name
    # MLA_compare.loc[row_index, 'Parameters'] = str(alg.get_params())

    full_pipeline_with_predictor = Pipeline([
        ("preparation", full_pipeline),  # combination of numerical and categorical pipelines
        ("model", alg)
    ])

    grid_search = GridSearchCV(full_pipeline_with_predictor, grid_param[row_index], cv=4, verbose=2, scoring='f1', return_train_score=True)
    grid_search.fit(X_train[X_model_col], y_train)
    y_pred = grid_search.predict(X_test)

    MLA_compare.loc[row_index, 'Accuracy_Score'] = np.round(accuracy_score(y_pred, y_test), 3)
    MLA_compare.loc[row_index, 'ROC_AUC_score'] = np.round(metrics.roc_auc_score(y_test, y_pred), 3)
    MLA_compare.loc[row_index, 'Score'] = np.round(grid_search.score(X_test, y_test), 3)

    negative_mse = grid_search.best_score_
    scores = np.sqrt(-negative_mse)
    final_mse = mean_squared_error(y_test, y_pred)
    final_rmse = np.sqrt(final_mse)
    MLA_compare.loc[row_index, 'final_rmse'] = final_rmse

    confusion_matrix_var = confusion_matrix(y_test, y_pred)
    TP = confusion_matrix_var[1, 1]
    TN = confusion_matrix_var[0, 0]
    FP = confusion_matrix_var[0, 1]
    FN = confusion_matrix_var[1, 0]
    MLA_compare.loc[row_index, 'Classification_error'] = np.round(((FP + FN) / float(TP + TN + FP + FN)), 5)
    MLA_compare.loc[row_index, 'Recall_Score'] = np.round(metrics.recall_score(y_test, y_pred), 5)
    MLA_compare.loc[row_index, 'Precision_Score'] = np.round(metrics.precision_score(y_test, y_pred), 5)
    MLA_compare.loc[row_index, 'F1_Score'] = np.round(f1_score(y_test, y_pred), 5)

    MLA_compare.loc[row_index, 'mean_test_score'] = grid_search.cv_results_['mean_test_score'].mean()
    MLA_compare.loc[row_index, 'mean_fit_time'] = grid_search.cv_results_['mean_fit_time'].mean()

    Model_Scores.loc[row_index, 'MLAName'] = MLA_name
    Model_Scores.loc[row_index, 'MLScore'] = np.round(metrics.roc_auc_score(y_test, y_pred), 3)

    # Collect mean test scores for statistical significance test
    test_scores.append(grid_search.cv_results_['mean_test_score'])

    row_index += 1
Figure 29: scikit-learn

10.4.15.4 ROC AUC Score

The AUC-ROC curve is a performance measurement for classification problems at various threshold settings. ROC is a probability curve and AUC represents the degree or measure of separability. It tells how capable the model is of distinguishing between classes. The higher the AUC, the better the model is at predicting 0s as 0s and 1s as 1s.
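A quick sketch with scikit-learn's roc_auc_score and hand-made labels shows the two extremes and the chance level:

```python
from sklearn.metrics import roc_auc_score

# hand-made labels and three sets of predicted scores, for illustration only
y_true = [0, 0, 1, 1]

auc_perfect = roc_auc_score(y_true, [0.1, 0.2, 0.8, 0.9])   # positives ranked highest
auc_inverted = roc_auc_score(y_true, [0.9, 0.8, 0.2, 0.1])  # positives ranked lowest
auc_chance = roc_auc_score(y_true, [0.5, 0.5, 0.5, 0.5])    # no separation at all
print(auc_perfect, auc_inverted, auc_chance)  # 1.0, 0.0, 0.5
```

Because AUC depends only on how the scores rank the two classes, it is threshold-free, which is why it suits the comparison of the four models above.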
Figure 30: scikit-learn

Figure 31: scikit-learn
10.4.17 K-means Algorithm

In this section we demonstrate how simple it is to use k-means in scikit-learn.

10.4.17.1 Import
from time import time
import numpy as np
import matplotlib.pyplot as plt

from sklearn import metrics
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

np.random.seed(42)
digits = load_digits()
data = scale(digits.data)
10.4.17.3 Create samples
n_samples, n_features = data.shape
n_digits = len(np.unique(digits.target))
labels = digits.target

sample_size = 300
print("n_digits:%d,\tn_samples%d,\tn_features%d"%(n_digits,n_samples,n_features))
print(79*'_')
print('%9s'%'init''timeinertiahomocomplv-measARIAMIsilhouette')
print("n_digits:%d,\tn_samples%d,\tn_features%d"
%(n_digits,n_samples,n_features))
print(79*'_')
print('%9s'%'init'
'timeinertiahomocomplv-measARIAMIsilhouette')
def bench_k_means(estimator, name, data):
    t0 = time()
    estimator.fit(data)
    print('%9s  %.2fs  %i  %.3f  %.3f  %.3f  %.3f  %.3f  %.3f'
          % (name, (time() - t0), estimator.inertia_,
             metrics.homogeneity_score(labels, estimator.labels_),
             metrics.completeness_score(labels, estimator.labels_),
             metrics.v_measure_score(labels, estimator.labels_),
             metrics.adjusted_rand_score(labels, estimator.labels_),
             metrics.adjusted_mutual_info_score(labels, estimator.labels_),
             metrics.silhouette_score(data, estimator.labels_,
                                      metric='euclidean',
                                      sample_size=sample_size)))

bench_k_means(KMeans(init='k-means++', n_clusters=n_digits, n_init=10),
              name="k-means++", data=data)
bench_k_means(KMeans(init='random', n_clusters=n_digits, n_init=10),
              name="random", data=data)
# in this case the seeding of the centers is deterministic, hence we run the
# kmeans algorithm only once with n_init=1
pca = PCA(n_components=n_digits).fit(data)
bench_k_means(KMeans(init=pca.components_, n_clusters=n_digits, n_init=1),
              name="PCA-based", data=data)

print(79 * '_')
10.4.17.5 Visualize

See Figure 32.

Figure 32: Result
reduced_data = PCA(n_components=2).fit_transform(data)
kmeans = KMeans(init='k-means++', n_clusters=n_digits, n_init=10)
kmeans.fit(reduced_data)

# Step size of the mesh. Decrease to increase the quality of the VQ.
h = .02  # point in the mesh [x_min, x_max] x [y_min, y_max].

# Plot the decision boundary. For that, we will assign a color to each
x_min, x_max = reduced_data[:, 0].min() - 1, reduced_data[:, 0].max() + 1
y_min, y_max = reduced_data[:, 1].min() - 1, reduced_data[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

# Obtain labels for each point in mesh. Use last trained model.
Z = kmeans.predict(np.c_[xx.ravel(), yy.ravel()])

# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.figure(1)
plt.clf()
plt.imshow(Z, interpolation='nearest',
           extent=(xx.min(), xx.max(), yy.min(), yy.max()),
           cmap=plt.cm.Paired,
           aspect='auto', origin='lower')

plt.plot(reduced_data[:, 0], reduced_data[:, 1], 'k.', markersize=2)

# Plot the centroids as a white X
centroids = kmeans.cluster_centers_
plt.scatter(centroids[:, 0], centroids[:, 1],
            marker='x', s=169, linewidths=3,
            color='w', zorder=10)

plt.title('K-means clustering on the digits dataset (PCA-reduced data)\n'
          'Centroids are marked with white cross')
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.xticks(())
plt.yticks(())
plt.show()
10.5 DASK - RANDOM FOREST FEATURE DETECTION ☁

10.5.1 Setup

First we need our tools. pandas gives us the DataFrame, very similar to R's DataFrames. The DataFrame is a structure that allows us to work with our data more easily. It has nice features for slicing and transformation of data, and easy ways to do basic statistics.

numpy has some very handy functions that work on DataFrames.
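A minimal sketch of that slicing and statistics, with a hypothetical four-row stand-in for the wine data loaded below:

```python
import pandas as pd
import numpy as np

# hypothetical miniature of the wine data used in this section
df = pd.DataFrame({'alcohol': [9.4, 9.8, 10.0, 11.2],
                   'quality': [5, 5, 6, 6]})

mean_alcohol = df['alcohol'].mean()    # basic statistics per column
high_quality = df[df['quality'] == 6]  # boolean slicing of rows
log_alcohol = np.log(df['alcohol'])    # numpy functions apply element-wise

print(mean_alcohol)       # 10.1
print(len(high_quality))  # 2
```

The same patterns, column statistics, boolean row selection, and element-wise numpy functions, are all the machinery the feature-detection code below relies on.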
10.5.2 Dataset

We are using the wine quality dataset, archived at UCI's Machine Learning Repository (http://archive.ics.uci.edu/ml/index.php).

Now we will load our data. pandas makes it easy!

Like in R, there is a .describe() method that gives basic statistics for every column in the dataset.
import pandas as pd
import numpy as np

# red wine quality data, packed in a DataFrame
red_df = pd.read_csv('winequality-red.csv', sep=';', header=0, index_col=False)

# white wine quality data, packed in a DataFrame
white_df = pd.read_csv('winequality-white.csv', sep=';', header=0, index_col=False)

# rose? other fruit wines? plum wine? :(

# for red wines
red_df.describe()

       fixed acidity  volatile acidity  citric acid  residual sugar    chlorides
count    1599.000000       1599.000000  1599.000000     1599.000000  1599.000000
mean        8.319637          0.527821     0.270976        2.538806     0.087467
std         1.741096          0.179060     0.194801        1.409928     0.047065
min         4.600000          0.120000     0.000000        0.900000     0.012000
25%         7.100000          0.390000     0.090000        1.900000     0.070000
50%         7.900000          0.520000     0.260000        2.200000     0.079000
75%         9.200000          0.640000     0.420000        2.600000     0.090000
max        15.900000          1.580000     1.000000       15.500000     0.611000

# for white wines
white_df.describe()

       fixed acidity  volatile acidity  citric acid  residual sugar    chlorides
count    4898.000000       4898.000000  4898.000000     4898.000000  4898.000000
mean        6.854788          0.278241     0.334192        6.391415     0.045772
std         0.843868          0.100795     0.121020        5.072058     0.021848
min         3.800000          0.080000     0.000000        0.600000     0.009000
25%         6.300000          0.210000     0.270000        1.700000     0.036000
50%         6.800000          0.260000     0.320000        5.200000     0.043000
75%         7.300000          0.320000     0.390000        9.900000     0.050000
max        14.200000          1.100000     1.660000       65.800000     0.346000

Sometimes it is easier to understand the data visually. A histogram of the white wine quality data citric acid samples is shown next. You can of course visualize other columns' data or other datasets. Just replace the DataFrame and column name (see Figure 33).
import matplotlib.pyplot as plt

def extract_col(df, col_name):
    return list(df[col_name])

col = extract_col(white_df, 'citric acid')  # can replace with another dataframe or column
plt.hist(col)

# TODO: add axes and such to set a good example
plt.show()
Figure 33: Histogram
10.5.3 Detecting Features

Let us try out some elementary machine learning models. These models are not always for prediction. They are also useful to find what features are most predictive of a variable of interest. Depending on the classifier you use, you may need to transform the data pertaining to that variable.

10.5.3.1 Data Preparation

Let us assume we want to study what features are most correlated with pH. pH of course is real-valued and continuous. The classifiers we want to use usually need labeled or integer data. Hence, we will transform the pH data, assigning wines with pH higher than average as hi (more basic or alkaline) and wines with pH lower than average as lo (more acidic).

# refresh to make Jupyter happy
red_df = pd.read_csv('winequality-red.csv', sep=';', header=0, index_col=False)
white_df = pd.read_csv('winequality-white.csv', sep=';', header=0, index_col=False)

# TODO: data cleansing functions here, e.g. replacement of NaN

# if the variable you want to predict is continuous, you can map ranges of values
# to integer/binary/string labels

# for example, map the pH data to 'hi' and 'lo' if a pH value is more than or
# less than the mean pH, respectively

M = np.mean(list(red_df['pH']))  # expect inelegant code in these mappings
Lf = lambda p: int(p < M) * 'lo' + int(p >= M) * 'hi'  # some C-style hackery

# create the new classifiable variable (wrap map() in list() for Python 3)
red_df['pH-hi-lo'] = list(map(Lf, list(red_df['pH'])))

# and remove the predecessor
del red_df['pH']
Now we specify which dataset and variable we want to predict by assigning values to SELECTED_DF and TARGET_VAR, respectively.

We like to keep a parameter file where we specify data sources and such. This lets us create generic analytics code that is easy to reuse.

After we have specified what dataset we want to study, we split the training and test datasets. We then scale (normalize) the data, which makes most classifiers run better.

Now we pick a classifier. As you can see, there are many to try out, and even more in scikit-learn's documentation, along with many examples and tutorials. Random Forests are data science workhorses. They are the go-to method for most data scientists. Be careful relying on them though, as they tend to overfit. We try to avoid overfitting by separating the training and test datasets.
10.5.4 Random Forest

Now we will test it out with the default parameters.

Note that this code is boilerplate. You can use it interchangeably for most scikit-
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn import metrics

# make selections here without digging in code
SELECTED_DF = red_df     # selected dataset
TARGET_VAR = 'pH-hi-lo'  # the predicted variable

# generate nameless data structures
df = SELECTED_DF
target = np.array(df[TARGET_VAR]).ravel()
del df[TARGET_VAR]  # no cheating

# TODO: data cleansing function calls here

# split datasets for training and testing
X_train, X_test, y_train, y_test = train_test_split(df, target, test_size=0.2)

# set up the scaler
scaler = StandardScaler()
scaler.fit(X_train)

# apply the scaler
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

# pick a classifier
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor, ExtraTreeClassifier, ExtraTreeRegressor
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier

clf = RandomForestClassifier()
learn models.

Now output the results. For Random Forests, we get a feature ranking. Relative importances usually decay exponentially; the first few highly-ranked features are usually the most important.

Feature ranking:

fixed acidity           0.269778
citric acid             0.171337
density                 0.089660
volatile acidity        0.088965
chlorides               0.082945
alcohol                 0.080437
total sulfur dioxide    0.067832
sulphates               0.047786
free sulfur dioxide     0.042727
residual sugar          0.037459
quality                 0.021075

Sometimes it is easier to visualize. We will use a bar chart. See Figure 34.
# test it out
model = clf.fit(X_train, y_train)
pred = clf.predict(X_test)
conf_matrix = metrics.confusion_matrix(y_test, pred)
var_score = clf.score(X_test, y_test)

# the results
importances = clf.feature_importances_
indices = np.argsort(importances)[::-1]

# for the sake of clarity (wrap map() in list() for Python 3)
num_features = X_train.shape[1]
features = list(map(lambda x: df.columns[x], indices))
feature_importances = list(map(lambda x: importances[x], indices))

print('Feature ranking:\n')

for i in range(num_features):
    feature_name = features[i]
    feature_importance = feature_importances[i]
    print('%s%f' % (feature_name.ljust(30), feature_importance))

plt.clf()
plt.bar(range(num_features), feature_importances)
plt.xticks(range(num_features), features, rotation=90)
plt.ylabel('relative importance (a.u.)')
plt.title('Relative importances of most predictive features')
plt.show()
Figure 34: Result

10.5.5 Acknowledgement

This notebook was developed by Juliette Zerick and Gregor von Laszewski.
import dask.dataframe as dd

red_df = dd.read_csv('winequality-red.csv', sep=';', header=0)
white_df = dd.read_csv('winequality-white.csv', sep=';', header=0)

10.6 PARALLEL COMPUTING IN PYTHON ☁

In this module we will review the available Python modules that can be used for parallel computing. Parallel computing can be in the form of either multi-threading or multi-processing. In the multi-threading approach, the threads run in the same shared memory heap, whereas in the case of multi-processing, the memory heaps of the processes are separate and independent; therefore the communication between the processes is a little bit more complex.

10.6.1 Multi-threading in Python

Threading in Python is perfect for I/O operations where the process is expected to be idle regularly, e.g. web scraping. This is a very useful feature because several applications and scripts might spend the majority of their runtime waiting for network or data I/O. In several cases, e.g. web scraping, the resources, i.e. downloading from different websites, are most of the time independent. Therefore the processor can download in parallel and join the results at the end.
10.6.1.1 Thread vs Threading

There are two built-in modules in Python that are related to threading, namely thread and threading. The former module was deprecated for some time in Python 2, and in Python 3 it was renamed to _thread. The _thread module provides a low-level threading API for multi-threading in Python, whereas the threading module builds a high-level threading interface on top of it.
Thread() is the main class of the threading module; its two important arguments are target, for specifying the callable object, and args, to pass the arguments for the target callable. We illustrate these in the following example:
This is the output of the previous example:

In case you are not familiar with the if __name__ == '__main__': statement, what it does is basically make sure that the code nested under this condition will be run only if you run your module as a program; it will not run if your module is imported into another file.
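A minimal sketch of that guard; the file name mymodule.py is hypothetical:

```python
# mymodule.py -- a hypothetical file name for this sketch
def greet():
    return "Hello from the module"

if __name__ == '__main__':
    # reached only when the file is executed directly, e.g. `python mymodule.py`;
    # `import mymodule` elsewhere defines greet() but skips this block
    print(greet())
```

Running `python mymodule.py` prints the greeting, while `import mymodule` only makes greet() available, which is exactly why the threading examples in this section put their start-up code under this condition.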
10.6.1.2 Locks

As mentioned prior, the memory space is shared between the threads. This is at
import threading

def hello_thread(thread_num):
    print("Hello from Thread ", thread_num)

if __name__ == '__main__':
    for thread_num in range(5):
        t = threading.Thread(target=hello_thread, args=(thread_num,))
        t.start()
In [1]: %run threading.py
Hello from Thread 0
Hello from Thread 1
Hello from Thread 2
Hello from Thread 3
Hello from Thread 4
the same time beneficial and problematic: it is beneficial in the sense that the communication between the threads becomes easy; however, you might experience strange outcomes if you let several threads change the same variable without caution, e.g. thread 2 changes variable x while thread 1 is working with it. This is when a lock comes into play. Using a lock, you can allow only one thread to work with a variable. In other words, only a single thread can hold the lock. If the other threads need to work with that variable, they have to wait until the other thread is done and the variable is "unlocked".

We illustrate this with a simple example:

Suppose we want to print multiples of 3 between 1 and 12, i.e. 3, 6, 9 and 12. For the sake of argument, we try to do this using 2 threads and a nested for loop. We create a global variable called counter and we initialize it with 0. Then whenever each of the incrementer1 or incrementer2 functions is called, the counter is incremented by 3 twice (the counter is incremented by 6 in each function call). If you run the previous code, you should be really lucky to get the following as part of your output:
The reason is the conflict that happens between the threads while incrementing the
import threading

global counter
counter = 0

def incrementer1():
    global counter
    for j in range(2):
        for i in range(3):
            counter += 1
            print("Greeter 1 incremented the counter by 1")
        print("Counter is %d" % counter)

def incrementer2():
    global counter
    for j in range(2):
        for i in range(3):
            counter += 1
            print("Greeter 2 incremented the counter by 1")
        print("Counter is now %d" % counter)

if __name__ == '__main__':
    t1 = threading.Thread(target=incrementer1)
    t2 = threading.Thread(target=incrementer2)
    t1.start()
    t2.start()
Counter is now 3
Counter is now 6
Counter is now 9
Counter is now 12

counter in the nested for loop. As you probably noticed, the first-level for loop is equivalent to adding 3 to the counter, and the conflict that might happen is not effective on that level but in the nested for loop. Accordingly, the output of the previous code is different in every run. This is an example output:
$ python3 lock_example.py
Greeter 1 incremented the counter by 1
Greeter 1 incremented the counter by 1
Greeter 1 incremented the counter by 1
Counter is 4
Greeter 2 incremented the counter by 1
Greeter 2 incremented the counter by 1
Greeter 1 incremented the counter by 1
Greeter 2 incremented the counter by 1
Greeter 1 incremented the counter by 1
Counter is 8
Greeter 1 incremented the counter by 1
Greeter 2 incremented the counter by 1
Counter is 10
Greeter 2 incremented the counter by 1
Greeter 2 incremented the counter by 1
Counter is 12

We can fix this issue using a lock: whenever one of the functions is going to increment the value by 3, it will acquire() the lock, and when it is done the function will release() the lock. This mechanism is illustrated in the following code:

No matter how many times you run this code, the output will always be in the correct order:
import threading

increment_by_3_lock = threading.Lock()

global counter
counter = 0

def incrementer1():
    global counter
    for j in range(2):
        increment_by_3_lock.acquire(True)
        for i in range(3):
            counter += 1
            print("Greeter 1 incremented the counter by 1")
        print("Counter is %d" % counter)
        increment_by_3_lock.release()

def incrementer2():
    global counter
    for j in range(2):
        increment_by_3_lock.acquire(True)
        for i in range(3):
            counter += 1
            print("Greeter 2 incremented the counter by 1")
        print("Counter is %d" % counter)
        increment_by_3_lock.release()

if __name__ == '__main__':
    t1 = threading.Thread(target=incrementer1)
    t2 = threading.Thread(target=incrementer2)
    t1.start()
    t2.start()
$ python3 lock_example.py
Greeter 1 incremented the counter by 1
Greeter 1 incremented the counter by 1
Greeter 1 incremented the counter by 1
Counter is 3
Greeter 1 incremented the counter by 1
Greeter 1 incremented the counter by 1
Greeter 1 incremented the counter by 1
Counter is 6
Greeter 2 incremented the counter by 1
Greeter 2 incremented the counter by 1
Greeter 2 incremented the counter by 1
Counter is 9
Greeter 2 incremented the counter by 1
Greeter 2 incremented the counter by 1
Greeter 2 incremented the counter by 1
Counter is 12

Using the threading module increases both the overhead associated with thread management and the complexity of the program, and that is why in many situations employing the multiprocessing module might be a better approach.

10.6.2 Multi-processing in Python

We already mentioned that multi-threading might not be sufficient in many applications and we might need to use multiprocessing sometimes, or better to say most of the time. That is why we are dedicating this subsection to this particular module. This module provides you with an API for spawning processes the way you spawn threads using the threading module. Moreover, there are some functionalities that are not even available in the threading module, e.g. the Pool class, which allows you to run a batch of jobs using a pool of worker processes.

10.6.2.1 Process

Similar to the threading module, which was employing thread (aka _thread) under the hood, multiprocessing employs the Process class. Consider the following example:
from multiprocessing import Process
import os

def greeter(name):
    proc_idx = os.getpid()
    print("Process {0}: Hello {1}!".format(proc_idx, name))

if __name__ == '__main__':
    name_list = ['Harry', 'George', 'Dirk', 'David']
    process_list = []
    for name_idx, name in enumerate(name_list):
        current_process = Process(target=greeter, args=(name,))
        process_list.append(current_process)
        current_process.start()
    for process in process_list:
        process.join()
In this example, after importing the Process class, we create a greeter() function that takes a name and greets that person. It also prints the pid (process identifier) of the process that is running it. Note that we used the os module to get the pid. At the bottom of the code, after checking the __name__ == '__main__' condition, we create a series of Processes and start them. Finally, in the last for loop and using the join method, we tell Python to wait for the processes to terminate. This is one of the possible outputs of the code:
$ python3 process_example.py
Process 23451: Hello Harry!
Process 23452: Hello George!
Process 23453: Hello Dirk!
Process 23454: Hello David!

10.6.2.2 Pool

Consider the Pool class as a pool of worker processes. There are several ways of assigning jobs to the Pool class, and we will introduce the most important ones in this section. These methods are categorized as blocking or non-blocking. The former means that after calling the API, it blocks the thread/process until it has the result or answer ready, and the control returns only when the call completes. In the non-blocking case, on the other hand, the control returns immediately.

10.6.2.2.1 Synchronous Pool.map()

We illustrate the Pool.map method by re-implementing our previous greeter example using Pool.map:

As you can see, we have seven names here, but we do not want to dedicate each greeting to a separate process. Instead we do the whole job of "greeting seven people" using three processes. We create a pool of 3 processes with the Pool(processes=3) syntax and then we map an iterable called names to the greeter function using pool.map(greeter, names). As we expected, the greetings in the output will be printed from three different processes:
from multiprocessing import Pool
import os

def greeter(name):
    pid = os.getpid()
    print("Process {0}: Hello {1}!".format(pid, name))

if __name__ == '__main__':
    names = ['Jenna', 'David', 'Marry', 'Ted', 'Jerry', 'Tom', 'Justin']
    pool = Pool(processes=3)
    sync_map = pool.map(greeter, names)
    print("Done!")
Note that Pool.map() is in the blocking category and does not return control to your script until it is done calculating the results. That is why Done! is printed after all of the greetings are over.
10.6.2.2.2 Asynchronous Pool.map_async()
As the name implies, you can use the map_async method when you want to assign many function calls to a pool of worker processes asynchronously. Note that unlike map, the order of the results is not guaranteed and control is returned immediately. We now implement the previous example using map_async:
As you probably noticed, the only difference (apart from the map_async method name) is calling the wait() method in the last line. The wait() method tells your script to wait for the result of map_async before terminating:
Note that the order of the results is not preserved. Moreover, Done! is printed before any of the results, meaning that if we did not use the wait() method, you probably would not see the results at all.
$ python poolmap_example.py
Process 30585: Hello Jenna!
Process 30586: Hello David!
Process 30587: Hello Marry!
Process 30585: Hello Ted!
Process 30585: Hello Jerry!
Process 30587: Hello Tom!
Process 30585: Hello Justin!
Done!
from multiprocessing import Pool
import os

def greeter(name):
    pid = os.getpid()
    print("Process {0}: Hello {1}!".format(pid, name))

if __name__ == '__main__':
    names = ['Jenna', 'David', 'Marry', 'Ted', 'Jerry', 'Tom', 'Justin']
    pool = Pool(processes=3)
    async_map = pool.map_async(greeter, names)
    print("Done!")
    async_map.wait()
$ python poolmap_example.py
Done!
Process 30740: Hello Jenna!
Process 30741: Hello David!
Process 30740: Hello Ted!
Process 30742: Hello Marry!
Process 30740: Hello Jerry!
Process 30741: Hello Tom!
Process 30742: Hello Justin!
10.6.2.3 Locks

The way the multiprocessing module implements locks is almost identical to the way the threading module does. After importing Lock from multiprocessing, all you need to do is acquire it, do some computation, and then release the lock. We will clarify the use of Lock with an example in the next section about process communication.
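As a minimal sketch of this acquire/release pattern (the function and messages here are our own, not part of the upcoming example), a Lock can serialize access to a shared resource such as standard output:

```python
from multiprocessing import Lock, Process

def exclusive_print(lock, message):
    # only one process at a time may hold the lock and print
    lock.acquire()
    try:
        print(message)
    finally:
        lock.release()

if __name__ == '__main__':
    lock = Lock()
    procs = [Process(target=exclusive_print, args=(lock, 'message %d' % i))
             for i in range(3)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```

The try/finally (or, equivalently, a `with lock:` block) guarantees the lock is released even if the protected code raises an exception.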
10.6.2.4 Process Communication

Process communication in multiprocessing is one of the most important, yet complicated, features for better use of this module. As opposed to threading, Process objects do not have access to any shared variable by default, i.e., there is no shared memory space between the processes by default. This effect is illustrated in the following example:
Probably you already noticed that this is almost identical to our example inthreadingsection.Now,takealookatthestrangeoutput:
Asyoucansee,itisasiftheprocessesdoesnotseeeachother.Insteadofhavingtwoprocessesonecounting to6and theothercountingfrom6 to12,wehave
from multiprocessing import Process, Lock, Value
import time

global counter
counter = 0

def incrementer1():
    global counter
    for j in range(2):
        for i in range(3):
            counter += 1
        print("Greeter1: Counter is %d" % counter)

def incrementer2():
    global counter
    for j in range(2):
        for i in range(3):
            counter += 1
        print("Greeter2: Counter is %d" % counter)

if __name__ == '__main__':
    t1 = Process(target=incrementer1)
    t2 = Process(target=incrementer2)
    t1.start()
    t2.start()
$ python communication_example.py
Greeter1: Counter is 3
Greeter1: Counter is 6
Greeter2: Counter is 3
Greeter2: Counter is 6
two processes each counting to 6.
Nevertheless, there are several ways that Processes from multiprocessing can communicate with each other, including Pipe, Queue, Value, Array, and Manager. Pipe and Queue are appropriate for inter-process message passing. To be more specific, Pipe is useful for process-to-process scenarios while Queue is more appropriate for processes-to-processes ones. Value and Array are both used to provide synchronized access to shared data (much like shared memory), and Manager can be used on different data types. In the following subsections, we cover both Value and Array since they are both lightweight, yet useful, approaches.
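For instance, a Queue can pass results from a worker process back to the parent. This is a small sketch with our own function and variable names, not part of the book's examples:

```python
from multiprocessing import Process, Queue

def square_worker(numbers, queue):
    # each result is placed on the shared queue
    for n in numbers:
        queue.put(n * n)

if __name__ == '__main__':
    queue = Queue()
    worker = Process(target=square_worker, args=([1, 2, 3], queue))
    worker.start()
    # get() blocks until an item is available
    results = [queue.get() for _ in range(3)]
    worker.join()
    print(results)  # [1, 4, 9]
```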
10.6.2.4.1 Value

The following example re-implements the broken example from the previous section. We fix the strange output by using both Lock and Value:
The usage of the Lock object in this example is identical to the example in the threading section. The usage of counter is, on the other hand, the novel part. First, note that counter is not a global variable anymore; instead it is a Value, which returns a ctypes object allocated from a shared memory between the processes. The first argument 'i' indicates a signed integer, and the second argument defines the
from multiprocessing import Process, Lock, Value
import time

increment_by_3_lock = Lock()

def incrementer1(counter):
    for j in range(3):
        increment_by_3_lock.acquire(True)
        for i in range(3):
            counter.value += 1
            time.sleep(0.1)
        print("Greeter1: Counter is %d" % counter.value)
        increment_by_3_lock.release()

def incrementer2(counter):
    for j in range(3):
        increment_by_3_lock.acquire(True)
        for i in range(3):
            counter.value += 1
            time.sleep(0.05)
        print("Greeter2: Counter is %d" % counter.value)
        increment_by_3_lock.release()

if __name__ == '__main__':
    counter = Value('i', 0)
    t1 = Process(target=incrementer1, args=(counter,))
    t2 = Process(target=incrementer2, args=(counter,))
    t2.start()
    t1.start()
initialization value. In this case we are assigning a signed integer in the shared memory, initialized to 0, to the counter variable. We then modified our two functions to pass this shared variable as an argument. Finally, we changed the way we increment the counter, since counter is not a Python integer anymore but a ctypes signed integer, where we can access its value using the value attribute. The output of the code is now as we expected:
The last example related to parallel processing illustrates the use of both Value and Array, as well as a technique for passing multiple arguments to a function by packing them into a single tuple. This packing technique is especially useful with map or map_async, which pass only a single argument to the target function:
$ python mp_lock_example.py
Greeter2: Counter is 3
Greeter2: Counter is 6
Greeter1: Counter is 9
Greeter1: Counter is 12
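In isolation, and stripped of locks and processes, a Value behaves as follows (a small illustrative snippet of our own):

```python
from multiprocessing import Value

counter = Value('i', 0)   # 'i' -> ctypes signed int, initialized to 0
counter.value += 5        # read and write through the .value attribute
print(counter.value)      # 5
```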
from multiprocessing import Process, Lock, Value, Array
import time
from ctypes import c_char_p

increment_by_3_lock = Lock()

def incrementer1(counter_and_names):
    counter = counter_and_names[0]
    names = counter_and_names[1]
    for j in range(2):
        increment_by_3_lock.acquire(True)
        for i in range(3):
            counter.value += 1
            time.sleep(0.1)
        name_idx = counter.value // 3 - 1
        print("Greeter1: Greeting {0}! Counter is {1}".format(names.value[name_idx], counter.value))
        increment_by_3_lock.release()

def incrementer2(counter_and_names):
    counter = counter_and_names[0]
    names = counter_and_names[1]
    for j in range(2):
        increment_by_3_lock.acquire(True)
        for i in range(3):
            counter.value += 1
            time.sleep(0.05)
        name_idx = counter.value // 3 - 1
        print("Greeter2: Greeting {0}! Counter is {1}".format(names.value[name_idx], counter.value))
        increment_by_3_lock.release()

if __name__ == '__main__':
    counter = Value('i', 0)
    names = Array(c_char_p, 4)
    names.value = ['James', 'Tom', 'Sam', 'Larry']
    t1 = Process(target=incrementer1, args=((counter, names),))
    t2 = Process(target=incrementer2, args=((counter, names),))
    t2.start()
    t1.start()
In this example we created a multiprocessing.Array() object and assigned it to a variable called names. As we mentioned before, the first argument is the ctypes data type, and since we want to create an array of strings with a length of 4 (the second argument), we imported c_char_p and passed it as the first argument.
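As a simpler illustration of Array than the string version above, an integer Array can be created with the 'i' type code (our own snippet, not from the book's example):

```python
from multiprocessing import Array

# an array of four signed integers allocated in shared memory
numbers = Array('i', [10, 20, 30, 40])
numbers[0] += 1           # elements are accessed like a normal sequence
print(numbers[:])         # [11, 20, 30, 40]
```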
Instead of passing the arguments separately, we merged both the Value and Array objects into a tuple and passed the tuple to the functions. We then modified the functions to unpack the objects in the first two lines of both functions. Finally, we changed the print statement so that each process greets a particular name. The output of the example is:
10.7 DASK ☁

Dask is a Python-based parallel computing library for analytics. Parallel computing is a type of computation in which many calculations or the execution of processes are carried out simultaneously. Large problems can often be divided into smaller ones, which can then be solved concurrently.
Dask is composed of two components:

1. Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
2. Big Data collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers.
Dask emphasizes the following virtues:

Familiar: Provides parallelized NumPy array and Pandas DataFrame objects.
Flexible: Provides a task scheduling interface for more custom workloads and integration with other projects.
$ python3 mp_lock_example.py
Greeter2: Greeting James! Counter is 3
Greeter2: Greeting Tom! Counter is 6
Greeter1: Greeting Sam! Counter is 9
Greeter1: Greeting Larry! Counter is 12
Native: Enables distributed computing in pure Python with access to the PyData stack.
Fast: Operates with low overhead, low latency, and the minimal serialization necessary for fast numerical algorithms.
Scales up: Runs resiliently on clusters with thousands of cores.
Scales down: Trivial to set up and run on a laptop in a single process.
Responsive: Designed with interactive computing in mind, it provides rapid feedback and diagnostics to aid humans.
The section is structured in a number of subsections addressing the following topics:

Foundations:

an explanation of what Dask is, how it works, and how to use lower-level primitives to set up computations. Casual users may wish to skip this section, although we consider it useful knowledge for all users.

Distributed Features:

information on running Dask on the distributed scheduler, which enables scale-up to distributed settings and enhanced monitoring of task operations. The distributed scheduler is now generally the recommended engine for executing task work, even on single workstations or laptops.

Collections:

convenient abstractions giving a familiar feel to big data.

Bags:

Python iterators with a functional paradigm, such as found in func/iter-tools and toolz; they generalize lists/generators to big data. This will seem very familiar to users of PySpark's RDD.

Array:

massive multi-dimensional numerical data, with NumPy functionality.

Dataframe:

massive tabular data, with Pandas functionality.
10.7.1 How Dask Works

Dask is a computation tool for larger-than-memory datasets, parallel execution, or delayed/background execution.
We can summarize the basics of Dask as follows:

process data that does not fit into memory by breaking it into blocks and specifying task chains
parallelize execution of tasks across cores and even nodes of a cluster
move computation to the data rather than the other way around, to minimize communication overhead
We use for-loops to build basic tasks, Python iterators, and the NumPy (array) and Pandas (dataframe) functions for multi-dimensional or tabular data, respectively.

Dask allows us to construct a prescription for the calculation we want to carry out. A module named dask.delayed lets us parallelize custom code. It is useful whenever our problem does not quite fit a high-level parallel object like dask.array or dask.dataframe but could still benefit from parallelism. dask.delayed works by delaying our function evaluations and putting them into a dask graph. Here is a small example:

Here we have used the delayed annotation to show that we want these functions to operate lazily; that is, to save the set of inputs and execute only on demand.
10.7.2 Dask Bag
from dask import delayed

@delayed
def inc(x):
    return x + 1

@delayed
def add(x, y):
    return x + y
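Calling these decorated functions builds a task graph rather than computing anything. Assuming dask is installed, executing the graph might look like this (the usage below is our own sketch):

```python
from dask import delayed

@delayed
def inc(x):
    return x + 1

@delayed
def add(x, y):
    return x + y

# building the expression is instantaneous; no work happens yet
total = add(inc(1), inc(2))

# compute() executes the whole graph, potentially in parallel
print(total.compute())  # 5
```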
Dask-bag excels in processing data that can be represented as a sequence of arbitrary inputs. We will refer to this as "messy" data, because it can contain complex nested structures, missing fields, mixtures of data types, etc. The functional programming style fits very nicely with standard Python iteration, such as can be found in the itertools module.

Messy data is often encountered at the beginning of data processing pipelines when large volumes of raw data are first consumed. The initial set of data might be JSON, CSV, XML, or any other format that does not enforce strict structure and data types. For this reason, the initial data massaging and processing is often done with Python lists, dicts, and sets.

These core data structures are optimized for general-purpose storage and processing. Adding streaming computation with iterators/generator expressions or libraries like itertools or toolz lets us process large volumes in a small space. If we combine this with parallel processing then we can churn through a fair amount of data.
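For example, generator expressions combined with itertools process items one at a time, so even an unbounded stream uses constant memory (our own illustration):

```python
import itertools

# an infinite stream of even numbers; nothing is materialized
evens = (n for n in itertools.count() if n % 2 == 0)

# islice pulls only the first five items from the stream
first_five = list(itertools.islice(evens, 5))
print(first_five)  # [0, 2, 4, 6, 8]
```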
Dask.bag is a high-level Dask collection to automate common workloads of this form. In a nutshell:

You can create a Bag from a Python sequence, from files, from data on S3, etc.

Bag objects hold the standard functional API found in projects like the Python standard library, toolz, or pyspark, including map, filter, groupby, etc.

As with Array and DataFrame objects, operations on Bag objects create new bags. Call the .compute() method to trigger execution.

dask.bag = map, filter, toolz + parallel execution
# each element is an integer
import dask.bag as db
b = db.from_sequence([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

# each element is a text file of JSON lines
import os
b = db.read_text(os.path.join('data', 'accounts.*.json.gz'))

# requires the `s3fs` library
# each element is a remote CSV text file
b = db.read_text('s3://dask-data/nyc-taxi/2015/yellow_tripdata_2015-01.csv')
def is_even(n):
    return n % 2 == 0

b = db.from_sequence([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
For more details on Dask Bag check https://dask.pydata.org/en/latest/bag.html
10.7.3 Concurrency Features

Dask supports a real-time task framework that extends Python's concurrent.futures interface. This interface is good for arbitrary task scheduling, like dask.delayed, but is immediate rather than lazy, which provides more flexibility in situations where the computations may evolve over time. These features depend on the second-generation task scheduler found in dask.distributed (which, despite its name, runs very well on a single machine).

Dask allows us to simply construct graphs of tasks with dependencies. Graphs can also be created automatically for us using functional, NumPy, or Pandas syntax on data collections. None of this would be very useful if there were not also a way to execute these graphs in a parallel and memory-aware way. Dask comes with four available schedulers:

dask.threaded.get: a scheduler backed by a thread pool
dask.multiprocessing.get: a scheduler backed by a process pool
dask.async.get_sync: a synchronous scheduler, good for debugging
distributed.Client.get: a distributed scheduler for executing graphs on multiple machines
Here is a simple program using the dask.distributed library:

For more details on concurrency features in Dask check https://dask.pydata.org/en/latest/futures.html
10.7.4 Dask Array
c = b.filter(is_even).map(lambda x: x ** 2)
c

# blocking form: wait for completion (which is very fast in this case)
c.compute()
from dask.distributed import Client
client = Client('scheduler:port')

futures = []
for fn in filenames:
    future = client.submit(load, fn)
    futures.append(future)

summary = client.submit(summarize, futures)
summary.result()
Dask arrays implement a subset of the NumPy interface on large arrays using blocked algorithms and task scheduling. They behave like numpy arrays, but break a massive job into tasks that are then executed by a scheduler. The default scheduler uses threading, but you can also use multiprocessing or distributed or even serial processing (mainly for debugging). You can tell the dask array how to break the data into chunks for processing.

For more details on Dask Array check https://dask.pydata.org/en/latest/array.html
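A self-contained sketch that needs no HDF5 file (assuming dask is installed): build a chunked array of ones and reduce it.

```python
import dask.array as da

# a 4000x4000 array of ones, split into 1000x1000 chunks (16 tasks)
x = da.ones((4000, 4000), chunks=(1000, 1000))

# sum() builds a task graph over the chunks; compute() executes it
total = x.sum().compute()
print(total)  # 16000000.0
```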
10.7.5 Dask DataFrame

A Dask DataFrame is a large parallel dataframe composed of many smaller Pandas dataframes, split along the index. These pandas dataframes may live on disk for larger-than-memory computing on a single machine, or on many different machines in a cluster. Dask.dataframe implements a commonly used subset of the Pandas interface, including elementwise operations, reductions, grouping operations, joins, time series algorithms, and more. It copies the Pandas interface for these operations exactly and so should be very familiar to Pandas users. Because Dask.dataframe operations merely coordinate Pandas operations, they usually exhibit performance characteristics similar to those found in Pandas. To run the following code, save the 'student.csv' file on your machine.
import h5py
import dask.array as da

f = h5py.File('myfile.hdf5')
x = da.from_array(f['/big-data'], chunks=(1000, 1000))
x - x.mean(axis=1).compute()
import pandas as pd

df = pd.read_csv('student.csv')
d = df.groupby(df.HID).Serial_No.mean()
print(d)
HID
101     1
102     2
104     3
105     4
106     5
107     6
109     7
111     8
201     9
202    10
Name: Serial_No, dtype: int64
import dask.dataframe as dd

df = dd.read_csv('student.csv')
dt = df.groupby(df.HID).Serial_No.mean().compute()
print(dt)
HID
101     1.0
For more details on Dask DataFrame check https://dask.pydata.org/en/latest/dataframe.html

10.7.6 Dask DataFrame Storage

Efficient storage can dramatically improve performance, particularly when operating repeatedly from disk.

Decompressing text and parsing CSV files is expensive. One of the most effective strategies with medium data is to use a binary storage format like HDF5.

Create data if we do not have any.

First we read our csv data as before.

CSV and other text-based file formats are the most common storage for data from many sources, because they require minimal pre-processing, can be written line-by-line, and are human-readable. Since Pandas' read_csv is well-optimized, CSVs are a reasonable input, but far from optimal, since reading requires extensive text parsing.

HDF5 and netCDF are binary array formats very commonly used in the scientific realm.
102     2.0
104     3.0
105     4.0
106     5.0
107     6.0
109     7.0
111     8.0
201     9.0
202    10.0
Name: Serial_No, dtype: float64
# be sure to shut down other kernels running distributed clients
from dask.distributed import Client
client = Client()

from prep import accounts_csvs
accounts_csvs(3, 1000000, 500)

import os
filename = os.path.join('data', 'accounts.*.csv')
filename

import dask.dataframe as dd
df_csv = dd.read_csv(filename)
df_csv.head()
Pandas contains a specialized HDF5 format, HDFStore. The dd.DataFrame.to_hdf method works exactly like the pd.DataFrame.to_hdf method.

For more information on Dask DataFrame storage, see http://dask.pydata.org/en/latest/dataframe-create.html
10.7.7 Links

https://dask.pydata.org/en/latest/
http://matthewrocklin.com/blog/work/2017/10/16/streaming-dataframes-1
http://people.duke.edu/~ccc14/sta-663-2017/18A_Dask.html
https://www.kdnuggets.com/2016/09/introducing-dask-parallel-programming.html
https://pypi.python.org/pypi/dask/
https://www.hdfgroup.org/2015/03/hdf5-as-a-zero-configuration-ad-hoc-scientific-database-for-python/
https://github.com/dask/dask-tutorial
target = os.path.join('data', 'accounts.h5')
target

%time df_csv.to_hdf(target, '/data')

df_hdf = dd.read_hdf(target, '/data')
df_hdf.head()
11 APPLICATIONS

11.1 FINGERPRINT MATCHING ☁

Please note that NIST has temporarily removed the Fingerprint dataset. We unfortunately do not have a copy of the dataset. If you have one, please notify us.

Python is a flexible and popular language for running data analysis pipelines. In this section we will implement a solution for fingerprint matching.
11.1.1 Overview

Fingerprint recognition refers to the automated method for verifying a match between two fingerprints, which is used to identify individuals and verify their identity. Fingerprints (Figure 35) are the most widely used form of biometric identification.

Figure 35: Fingerprints

Automated fingerprint matching generally requires the detection of different fingerprint features (aggregate characteristics of ridges, and minutia points) and then the use of a fingerprint matching algorithm, which can do both one-to-one and one-to-many matching operations. Based on the number of matches, a proximity score (distance or similarity) can be calculated.

We use the following NIST dataset for the study:

Special Database 14 - NIST Mated Fingerprint Card Pairs 2. (http://www.nist.gov/itl/iad/ig/special_dbases.cfm)
11.1.2 Objectives

Match the fingerprint images from a probe set to a gallery set and report the match scores.

11.1.3 Prerequisites

For this work we will use the following algorithms:

MINDTCT: The NIST minutiae detector, which automatically locates and records ridge endings and bifurcations in a fingerprint image. (http://www.nist.gov/itl/iad/ig/nbis.cfm)
BOZORTH3: A NIST fingerprint matching algorithm, which is a minutiae-based fingerprint-matching algorithm. It can do both one-to-one and one-to-many matching operations. (http://www.nist.gov/itl/iad/ig/nbis.cfm)

In order to follow along, you must have the NBIS tools, which provide mindtct and bozorth3, installed. If you are on Ubuntu 16.04 Xenial, the following steps will accomplish this:
11.1.4 Implementation

1. Fetch the fingerprint images from the web
2. Call out to external programs to prepare and compute the match scores
3. Store the results in a database
4. Generate a plot to identify likely matches.
$ sudo apt-get update -qq
$ sudo apt-get install -y build-essential cmake unzip
$ wget "http://nigos.nist.gov:8080/nist/nbis/nbis_v5_0_0.zip"
$ unzip -d nbis nbis_v5_0_0.zip
$ cd nbis/Rel_5.0.0
$ ./setup.sh /usr/local --without-X11
$ sudo make
We will be interacting with the operating system and manipulating files and their pathnames.

Some generally useful utilities:

Using the attrs library provides some nice shortcuts for defining objects.

We will be randomly dividing the entire dataset, based on user input, into the probe and gallery sets.

We will need to call out to the NBIS software. We will also be using multiple processes to take advantage of all the cores on our machine.

As for plotting, we will use matplotlib, though there are many alternatives.

Finally, we will write the results to a database.

11.1.5 Utility functions

Next, we will define some utility functions:
from __future__ import print_function
import urllib
import zipfile
import hashlib
import os.path
import os
import sys
import shutil
import tempfile
import itertools
import functools
import types
from pprint import pprint

import attr
import random
import subprocess
import multiprocessing

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import sqlite3
11.1.6 Dataset

We will now define some global parameters.

First, the fingerprint dataset:
def take(n, iterable):
    "Returns a generator of the first **n** elements of an iterable"
    return itertools.islice(iterable, n)

def zipWith(function, *iterables):
    "Zip a set of **iterables** together and apply **function** to each tuple"
    for group in itertools.izip(*iterables):
        yield function(*group)

def uncurry(function):
    "Transforms an N-ary **function** so that it accepts a single parameter of an N-tuple"
    @functools.wraps(function)
    def wrapper(args):
        return function(*args)
    return wrapper
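To illustrate take and uncurry, here is a usage snippet of our own (with the definitions repeated so it runs standalone; it works under both Python 2 and 3):

```python
import itertools
import functools

def take(n, iterable):
    "Returns a generator of the first **n** elements of an iterable"
    return itertools.islice(iterable, n)

def uncurry(function):
    "Transforms an N-ary **function** so that it accepts a single N-tuple"
    @functools.wraps(function)
    def wrapper(args):
        return function(*args)
    return wrapper

# take the first three squares from an infinite generator
print(list(take(3, (i * i for i in itertools.count(1)))))  # [1, 4, 9]

# uncurry lets map apply a binary function to tuples
add = uncurry(lambda x, y: x + y)
print(list(map(add, [(1, 2), (3, 4)])))  # [3, 7]
```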
def fetch_url(url, sha256, prefix='.', checksum_blocksize=2**20, dryRun=False):
    """Download a url.

    :param url: the url to the file on the web
    :param sha256: the SHA-256 checksum. Used to determine if the file was previously downloaded.
    :param prefix: directory to save the file
    :param checksum_blocksize: blocksize to use when computing the checksum
    :param dryRun: boolean indicating that calling this function should do nothing
    :returns: the local path to the downloaded file
    :rtype: str
    """
    if not os.path.exists(prefix):
        os.makedirs(prefix)

    local = os.path.join(prefix, os.path.basename(url))

    if dryRun: return local

    if os.path.exists(local):
        print('Verifying checksum')
        chk = hashlib.sha256()
        with open(local, 'rb') as fd:
            while True:
                bits = fd.read(checksum_blocksize)
                if not bits: break
                chk.update(bits)
        if sha256 == chk.hexdigest():
            return local

    print('Downloading', url)

    def report(sofar, blocksize, totalsize):
        msg = '{}%\r'.format(100 * sofar * blocksize / totalsize, 100)
        sys.stderr.write(msg)

    urllib.urlretrieve(url, local, report)
    return local
DATASET_URL = 'https://s3.amazonaws.com/nist-srd/SD4/NISTSpecialDatabase4GrayScaleImagesofFIGS.zip'
DATASET_SHA256 = '4db6a8f3f9dc14c504180cbf67cdf35167a109280f121c901be37a80ac13c449'
We'll define how to download the dataset. This function is general enough that it could be used to retrieve most files, but we will default it to use the values defined previously.
11.1.7 Data Model

We will define some classes so we have a nice API for working with the dataflow. We set slots=True so that the resulting objects will be more space-efficient.

11.1.7.1 Utilities

11.1.7.1.1 Checksum

The checksum consists of the actual hash value (value) as well as a string representing the hashing algorithm. The validator enforces that the algorithm can only be one of the listed acceptable methods.
def prepare_dataset(url=None, sha256=None, prefix='.', skip=False):
    url = url or DATASET_URL
    sha256 = sha256 or DATASET_SHA256
    local = fetch_url(url, sha256=sha256, prefix=prefix, dryRun=skip)

    if not skip:
        print('Extracting', local, 'to', prefix)
        with zipfile.ZipFile(local, 'r') as zip:
            zip.extractall(prefix)

    name, _ = os.path.splitext(local)
    return name
def locate_paths(path_md5list, prefix):
    with open(path_md5list) as fd:
        for line in itertools.imap(str.strip, fd):
            parts = line.split()
            if not len(parts) == 2: continue
            md5sum, path = parts
            chksum = Checksum(value=md5sum, kind='md5')
            filepath = os.path.join(prefix, path)
            yield Path(checksum=chksum, filepath=filepath)

def locate_images(paths):

    def predicate(path):
        _, ext = os.path.splitext(path.filepath)
        return ext in ['.png']

    for path in itertools.ifilter(predicate, paths):
        yield image(id=path.checksum.value, path=path)
@attr.s(slots=True)
class Checksum(object):
    value = attr.ib()
    kind = attr.ib(validator=lambda o, a, v: v in 'md5 sha1 sha224 sha256 sha384 sha512'.split())
11.1.7.1.2 Path

Paths refer to an image's file path and associated Checksum. We get the checksum "for free" since the MD5 hash is provided for each image in the dataset.

11.1.7.1.3 Image

The start of the data pipeline is the image. An image has an id (the md5 hash) and the path to the image.
11.1.7.2 Mindtct

The next step in the pipeline is to apply the mindtct program from NBIS. A mindtct object therefore represents the results of applying mindtct to an image. The xyt output is needed for the next step, and the image attribute represents the image id.

We need a way to construct a mindtct object from an image object. A straightforward way of doing this would be to have a from_image @staticmethod or @classmethod, but that does not work well with multiprocessing: top-level functions work best because they need to be serialized.
@attr.s(slots=True)
class Path(object):
    checksum = attr.ib()
    filepath = attr.ib()

@attr.s(slots=True)
class image(object):
    id = attr.ib()
    path = attr.ib()

@attr.s(slots=True)
class mindtct(object):
    image = attr.ib()
    xyt = attr.ib()

    def pretty(self):
        d = dict(id=self.image.id, path=self.image.path)
        return pprint(d)
def mindtct_from_image(image):
    imgpath = os.path.abspath(image.path.filepath)
    tempdir = tempfile.mkdtemp()
    oroot = os.path.join(tempdir, 'result')

    cmd = ['mindtct', imgpath, oroot]

    try:
        subprocess.check_call(cmd)

        with open(oroot + '.xyt') as fd:
            xyt = fd.read()

        result = mindtct(image=image.id, xyt=xyt)
11.1.7.3 Bozorth3

The final step in the pipeline is running bozorth3 from NBIS. The bozorth3 class represents the match being done: tracking the ids of the probe and gallery images as well as the match score.

Since we will be writing these instances out to a database, we provide some static methods for SQL statements. While there are many Object-Relational-Model (ORM) libraries available for Python, this approach keeps the current implementation simple.

In order to work well with multiprocessing, we define a class representing the input parameters to bozorth3 and a helper function to run bozorth3. This way the pipeline definition can be kept simple: a map to create the input and then a map to run the program.

As NBIS bozorth3 can be called to compare one-to-one or one-to-many, we will also dynamically choose between these approaches depending on whether the gallery attribute is a list or a single object.
        return result
    finally:
        shutil.rmtree(tempdir)
@attr.s(slots=True)
class bozorth3(object):
    probe = attr.ib()
    gallery = attr.ib()
    score = attr.ib()

    @staticmethod
    def sql_stmt_create_table():
        return 'CREATE TABLE IF NOT EXISTS bozorth3' \
             + ' (probe TEXT, gallery TEXT, score NUMERIC)'

    @staticmethod
    def sql_prepared_stmt_insert():
        return 'INSERT INTO bozorth3 VALUES (?, ?, ?)'

    def sql_prepared_stmt_insert_values(self):
        return self.probe, self.gallery, self.score
@attr.s(slots=True)
class bozorth3_input(object):
    probe = attr.ib()
    gallery = attr.ib()

    def run(self):
        if isinstance(self.gallery, mindtct):
            return bozorth3_from_one_to_one(self.probe, self.gallery)
        elif isinstance(self.gallery, types.ListType):
            return bozorth3_from_one_to_many(self.probe, self.gallery)
        else:
            raise ValueError('Unhandled type for gallery: {}'.format(type(self.gallery)))
Next is the top-level function for running bozorth3. It accepts an instance of bozorth3_input. It is implemented as a simple top-level wrapper so that it can be easily passed to the multiprocessing library.
11.1.7.3.1 Running Bozorth3

There are two cases to handle:

1. One-to-one probe to gallery sets
2. One-to-many probe to gallery sets

Both approaches are implemented next. The implementations follow the same pattern:

1. Create a temporary directory within which to work
2. Write the probe and gallery images to files in the temporary directory
3. Call the bozorth3 executable
4. The match score is written to stdout, which is captured and then parsed
5. Return a bozorth3 instance for each match
6. Make sure to clean up the temporary directory
11.1.7.3.1.1 One-to-one

11.1.7.3.1.2 One-to-many
def run_bozorth3(input):
    return input.run()
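The general pattern (an input object bundling the parameters plus a picklable top-level wrapper) can be sketched generically; JobInput and run_job below are hypothetical stand-ins of our own, not part of the fingerprint code:

```python
from multiprocessing import Pool

class JobInput(object):
    """Bundles the parameters for one unit of work."""
    def __init__(self, x):
        self.x = x

    def run(self):
        return self.x * self.x

def run_job(job):
    # top-level function: picklable, so Pool.map can ship it to workers
    return job.run()

if __name__ == '__main__':
    pool = Pool(processes=2)
    results = pool.map(run_job, [JobInput(i) for i in range(4)])
    pool.close()
    pool.join()
    print(results)  # [0, 1, 4, 9]
```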
def bozorth3_from_one_to_one(probe, gallery):
    tempdir = tempfile.mkdtemp()
    probeFile = os.path.join(tempdir, 'probe.xyt')
    galleryFile = os.path.join(tempdir, 'gallery.xyt')

    with open(probeFile, 'wb') as fd: fd.write(probe.xyt)
    with open(galleryFile, 'wb') as fd: fd.write(gallery.xyt)

    cmd = ['bozorth3', probeFile, galleryFile]

    try:
        result = subprocess.check_output(cmd)
        score = int(result.strip())
        return bozorth3(probe=probe.image, gallery=gallery.image, score=score)
    finally:
        shutil.rmtree(tempdir)
def bozorth3_from_one_to_many(probe, galleryset):
    tempdir = tempfile.mkdtemp()
    probeFile = os.path.join(tempdir, 'probe.xyt')
    galleryFiles = [os.path.join(tempdir, 'gallery%d.xyt' % i)
                    for i, _ in enumerate(galleryset)]

    with open(probeFile, 'wb') as fd: fd.write(probe.xyt)
    for galleryFile, gallery in itertools.izip(galleryFiles, galleryset):
        with open(galleryFile, 'wb') as fd: fd.write(gallery.xyt)

    cmd = ['bozorth3', '-p', probeFile] + galleryFiles

    try:
        result = subprocess.check_output(cmd).strip()
        scores = map(int, result.split('\n'))
11.1.8 Plotting

For plotting we will operate only on the database. We will select a small number of probe images and plot the score between them and the rest of the gallery images.

The mk_short_labels helper function will be defined next.

The image ids are long hash strings. In order to minimize the amount of space the labels occupy on the figure, we provide a helper function to create a short label that still uniquely identifies each probe image in the selected sample.

11.1.9 Putting it all Together

First, set up a temporary directory in which to work:

Next we download and extract the fingerprint images from NIST:
        return [bozorth3(probe=probe.image, gallery=gallery.image, score=score)
                for score, gallery in zip(scores, galleryset)]
    finally:
        shutil.rmtree(tempdir)
def plot(dbfile, nprobes=10):
    conn = sqlite3.connect(dbfile)
    results = pd.read_sql(
        "SELECT DISTINCT probe FROM bozorth3 ORDER BY score LIMIT '%s'" % nprobes,
        con=conn
    )
    shortlabels = mk_short_labels(results.probe)
    plt.figure()

    for i, probe in results.probe.iteritems():
        stmt = 'SELECT gallery, score FROM bozorth3 WHERE probe = ? ORDER BY gallery DESC'
        matches = pd.read_sql(stmt, params=(probe,), con=conn)
        xs = np.arange(len(matches), dtype=np.int)
        plt.plot(xs, matches.score, label='probe %s' % shortlabels[i])

    plt.ylabel('Score')
    plt.xlabel('Gallery')
    plt.legend(bbox_to_anchor=(0, 0, 1, -0.2))
    plt.show()
def mk_short_labels(series, start=7):
    for size in xrange(start, len(series[0])):
        if len(series) == len(set(map(lambda s: s[:size], series))):
            break
    return map(lambda s: s[:size], series)
pool = multiprocessing.Pool()
prefix = '/tmp/fingerprint_example/'
if not os.path.exists(prefix):
    os.makedirs(prefix)

%%time
dataprefix = prepare_dataset(prefix=prefix)
Next we will configure the location of the MD5 checksum file that comes with the download.

Load the images from the downloaded files to start the analysis pipeline.

We can examine one of the loaded images. Note that the image attribute refers to the MD5 checksum that came with the image and the xyt attribute represents the raw image data.

For example purposes we will only use a small, randomly selected percentage of the database for our probe and gallery datasets.

We can now compute the matching scores between the probe and gallery sets. This will use all cores available on this workstation.
Verifying checksum
Extracting /tmp/fingerprint_example/NISTSpecialDatabase4GrayScaleImagesofFIGS.zip to /tmp/fingerprint_example/
CPU times: user 3.34 s, sys: 645 ms, total: 3.99 s
Wall time: 4.01 s
md5listpath = os.path.join(prefix, 'NISTSpecialDatabase4GrayScaleImagesofFIGS/sd04/sd04_md5.lst')
%%time
print('Loading images')
paths = locate_paths(md5listpath, dataprefix)
images = locate_images(paths)
mindtcts = pool.map(mindtct_from_image, images)
print('Done')
Loading images
Done
CPU times: user 187 ms, sys: 17 ms, total: 204 ms
Wall time: 1min 21s
print(mindtcts[0].image)
print(mindtcts[0].xyt[:50])
98b15d56330cb17f1982ae79348f711d
141462146252382237255118020
30332214
perc_probe = 0.001
perc_gallery = 0.1

%%time
print('Generating samples')
probes = random.sample(mindtcts, int(perc_probe * len(mindtcts)))
gallery = random.sample(mindtcts, int(perc_gallery * len(mindtcts)))
print('|Probes| =', len(probes))
print('|Gallery| =', len(gallery))
Generating samples
|Probes| = 4
|Gallery| = 400
CPU times: user 2 ms, sys: 0 ns, total: 2 ms
Wall time: 993 µs
%%time
print('Matching')
input = [bozorth3_input(probe=probe, gallery=gallery)
         for probe in probes]
bozorth3s = pool.map(run_bozorth3, input)
bozorth3s is now a list of lists of bozorth3 instances.
Now add the results to the database.
We now plot the results. See Figure 36.
Matching
CPU times: user 19 ms, sys: 1 ms, total: 20 ms
Wall time: 1.07 s
print('|Probes| =', len(bozorth3s))
print('|Gallery| =', len(bozorth3s[0]))
print('Result:', bozorth3s[0][0])
|Probes| = 4
|Gallery| = 400
Result: bozorth3(probe='caf9143b268701416fbed6a9eb2eb4cf',
gallery='22fa0f24998eaea39dea152e4a73f267', score=4)
dbfile = os.path.join(prefix, 'scores.db')
conn = sqlite3.connect(dbfile)
cursor = conn.cursor()
cursor.execute(bozorth3.sql_stmt_create_table())
<sqlite3.Cursor at 0x7f8a2f677490>
%%time
for group in bozorth3s:
    vals = map(bozorth3.sql_prepared_stmt_insert_values, group)
    cursor.executemany(bozorth3.sql_prepared_stmt_insert(), vals)
    conn.commit()
    print('Inserted results for probe', group[0].probe)
Inserted results for probe caf9143b268701416fbed6a9eb2eb4cf
Inserted results for probe 55ac57f711eba081b9302eab74dea88e
Inserted results for probe 4ed2d53db3b5ab7d6b216ea0314beb4f
Inserted results for probe 20f68849ee2dad02b8fb33ecd3ece507
CPU times: user 2 ms, sys: 3 ms, total: 5 ms
Wall time: 3.57 ms
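The insert pattern above (a prepared statement plus executemany) can be tried stand-alone with an in-memory SQLite database; the table layout below is a made-up stand-in for what bozorth3.sql_stmt_create_table() produces:

```python
import sqlite3

conn = sqlite3.connect(':memory:')          # throwaway database for the demo
cursor = conn.cursor()
cursor.execute('CREATE TABLE scores (probe TEXT, gallery TEXT, score INTEGER)')

# executemany binds each tuple to the ? placeholders in turn
rows = [('p1', 'g1', 4), ('p1', 'g2', 17)]
cursor.executemany('INSERT INTO scores VALUES (?, ?, ?)', rows)
conn.commit()

count = cursor.execute('SELECT COUNT(*) FROM scores').fetchone()[0]
print(count)  # 2
```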
plot(dbfile, nprobes=len(probes))
Figure 36: Result
11.2 NIST PEDESTRIAN AND FACE DETECTION ☁
Pedestrian and Face Detection uses OpenCV to identify people standing in a picture or a video, and the NIST use case in this document is built with Apache Spark and Mesos clusters on multiple compute nodes.
The example in this tutorial deploys software packages on OpenStack using Ansible with its roles. See Figure 37, Figure 38, Figure 39, Figure 40.
cursor.close()
Figure 37: Original
Figure 38: Pedestrian Detected
Figure 39: Original
Figure 40: Pedestrian and Face/eyes Detected
11.2.0.1 Introduction
Human (pedestrian) detection and face detection have been studied during the last several years and models for them have improved along with Histograms of Oriented Gradients (HOG) for Human Detection [1]. OpenCV is a Computer Vision library including the SVM classifier and the HOG object detector for pedestrian detection, and the INRIA Person Dataset [2] is one of the popular samples for both training and testing purposes. In this document, we deploy Apache Spark on Mesos clusters to train and apply detection models from OpenCV using the Python API.
11.2.0.1.1 INRIA Person Dataset

This dataset contains positive and negative images for training and test purposes with annotation files for upright persons in each image. 288 positive test images, 453 negative test images, 614 positive training images and 1218 negative training images are included, along with normalized 64x128 pixel formats. The 970 MB dataset is available for download [3].
11.2.0.1.2 HOG with SVM model

Histogram of Oriented Gradients (HOG) and Support Vector Machine (SVM) are used as object detectors and classifiers, and the built-in Python libraries from OpenCV provide these models for human detection.
11.2.0.1.3 Ansible Automation Tool

Ansible is a Python tool to install/configure/manage software on multiple machines with files in which the system descriptions are defined. There are several reasons why we use Ansible:
- Expandable: leverages Python (default) but modules can be written in any language
- Agentless: no setup required on the managed node
- Security: allows deployment from user space; uses ssh for authentication
- Flexibility: only requires ssh access to a privileged user
- Transparency: YAML-based script files express the steps of installing and configuring software
- Modularity: a single Ansible Role (should) contain all required commands and variables to deploy a software package independently
- Sharing and portability: roles are available from source (github, bitbucket, gitlab, etc.) or the Ansible Galaxy portal
We use Ansible roles to install the software packages for Human and Face Detection, which requires running OpenCV Python libraries on Apache Mesos with a cluster configuration. The dataset is also downloaded from the web using an Ansible role.
11.2.0.2 Deployment by Ansible

Ansible deploys applications and builds clusters for batch-processing large datasets on target machines, e.g. VM instances on OpenStack, and we use Ansible roles with the include directive to organize layers of big data software stacks (BDSS). Ansible provides abstraction by Playbook Roles and reusability by include statements. We define application X in Ansible Role X, for example, and use include statements to combine it with other applications, e.g. Y or Z. The layers live in sub directories (see next) to add modularity to your Ansible deployment. For example, there are five roles used in this example: Apache Mesos in a scheduler layer, Apache Spark in a processing layer, the OpenCV library in an application layer, the INRIA Person Dataset in a dataset layer, and a python script for human and face detection in an analytics layer. If you have an additional software package to add, you can simply add a new role to the main ansible playbook with an include directive. With this, your Ansible playbook stays simple but flexible enough to add more roles, instead of becoming one large file that gets difficult to read as it deploys more applications on multiple layers. The main Ansible playbook runs the Ansible roles in order and looks like:

```
include: sched/00-mesos.yml
include: proc/01-spark.yml
include: apps/02-opencv.yml
include: data/03-inria-dataset.yml
include: anlys/04-human-face-detection.yml
```

Directory names, e.g. sched, proc, data, or anlys, indicate BDSS layers:

- sched: scheduler layer
- proc: data processing layer
- apps: application layer
- data: dataset layer
- anlys: analytics layer

The two digits in the file name indicate the order in which the roles are run.

11.2.0.3 Cloudmesh for Provisioning

It is assumed that the virtual machines are created by cloudmesh, the cloud management software. For example on OpenStack, the cm cluster create -N=6 command starts a set of virtual machine instances. The number of machines and the groups for clusters, e.g. namenodes and datanodes, are defined in the Ansible inventory file, a list of target machines with groups, which is generated once the machines are ready to use by cloudmesh. Ansible roles install the software and the dataset on the virtual clusters after that stage.

11.2.0.4 Roles Explained for Installation

The Mesos role is installed first as a scheduler layer for masters and slaves, where mesos-master runs on the masters group and mesos-slave runs on the slaves group. Apache Zookeeper is included in the mesos role, therefore the mesos slaves find an elected mesos leader for coordination. Spark, as a data processing layer, provides two options for distributed job processing: batch job processing via a cluster mode and real-time processing via a client mode. The Mesos dispatcher runs on the masters group to accept a batch job submission, and the Spark interactive shell, which is the client mode, provides real-time processing on any node in the cluster. Either way, Spark is installed later so it can detect the master (leader) host for a job submission. The roles for OpenCV, the INRIA Person Dataset and the Human and Face Detection Python application follow.
The following software is expected in the stacks according to the github:

- mesos cluster (master, worker)
- spark (with dispatcher for mesos cluster mode)
- openCV
- zookeeper
- INRIA Person Dataset
- Detection Analytics in Python
[1] Dalal, Navneet, and Bill Triggs. "Histograms of oriented gradients for human detection." 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05). Vol. 1. IEEE, 2005. [pdf]
[2] http://pascal.inrialpes.fr/data/human/
[3] ftp://ftp.inrialpes.fr/pub/lear/douze/data/INRIAPerson.tar
[4] https://docs.python.org/2/library/configparser.html
11.2.0.4.1 Server groups for Masters/Slaves by Ansible inventory

We may separate the compute nodes into two groups, masters and workers: Mesos masters and zookeeper quorums manage job requests and leaders, while workers run the actual tasks. Ansible needs the group definitions in its inventory so that the software installation associated with the proper part can be completed.

Example of an Ansible inventory file (inventory.txt):
11.2.0.5 Instructions for Deployment

The following commands complete the NIST Pedestrian and Face Detection deployment on OpenStack.

11.2.0.5.1 Cloning the Pedestrian Detection Repository from Github

The roles are included as submodules, which requires the --recursive option to check them all out.

Change the following variable to the actual IP addresses:

Create an inventory.txt file with the variable in your local directory.

Add an ansible.cfg file with options for ssh host key checking and the login name.

Check accessibility with ansible ping like:
[masters]
10.0.5.67
10.0.5.68
10.0.5.69
[slaves]
10.0.5.70
10.0.5.71
10.0.5.72
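Such an inventory can also be generated from Python lists of addresses rather than typed by hand; a minimal sketch using the group names and example IPs from the inventory above:

```python
# build an INI-style Ansible inventory from lists of host addresses
masters = ['10.0.5.67', '10.0.5.68', '10.0.5.69']
slaves = ['10.0.5.70', '10.0.5.71', '10.0.5.72']

inventory = ('[masters]\n' + '\n'.join(masters) +
             '\n[slaves]\n' + '\n'.join(slaves) + '\n')

with open('inventory.txt', 'w') as f:
    f.write(inventory)

print(inventory.splitlines()[0])  # [masters]
```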
$ git clone --recursive https://github.com/futuresystems/pedestrian-and-face-detection.git
sample_inventory = """[masters]
10.0.5.67
10.0.5.68
10.0.5.69
[slaves]
10.0.5.70
10.0.5.71
10.0.5.72"""
!printf "$sample_inventory" > inventory.txt
!cat inventory.txt
ansible_config = """[defaults]
host_key_checking=false
remote_user=ubuntu"""
!printf "$ansible_config" > ansible.cfg
!cat ansible.cfg
!ansible -m ping -i inventory.txt all
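As an aside, the ansible.cfg INI file can equally be generated with the standard library configparser module [4] instead of printf; a minimal sketch:

```python
import configparser

# build the same [defaults] section as the printf version above
config = configparser.ConfigParser()
config['defaults'] = {'host_key_checking': 'false', 'remote_user': 'ubuntu'}

with open('ansible.cfg', 'w') as f:
    config.write(f)

print(open('ansible.cfg').read())
```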
Make sure that you have the correct ssh key in your account, otherwise you may encounter 'FAILURE' in the previous ping test.
11.2.0.5.2 Ansible Playbook

We use a main ansible playbook to deploy the software packages for NIST Pedestrian and Face detection, which includes:

- mesos
- spark
- zookeeper
- opencv
- INRIA Person dataset
- Python script for the detection

The installation may take 30 minutes to an hour to complete.
11.2.0.6 OpenCV in Python

Before we run our code for this project, let's try OpenCV first to see how it works.

11.2.0.6.1 Import cv2

Let us import the opencv python module, and we will use images from the online database image-net.org to test OpenCV image recognition. See Figure 41, Figure 42.

Let us download a mailbox image with a red color to see if opencv identifies the shape with a color. The example file in this tutorial is:
!cd pedestrian-and-face-detection/ && ansible-playbook -i ../inventory.txt site.yml
import cv2
$ curl http://farm4.static.flickr.com/3061/2739199963_ee78af76ef.jpg > mailbox.jpg
%matplotlib inline
from IPython.display import Image
mailbox_image = "mailbox.jpg"
Image(filename=mailbox_image)
Figure 41: Mailbox image
You can try other images. Check out image-net.org for mailbox images: http://image-net.org/synset?wnid=n03710193
11.2.0.6.2 Image Detection

Just for a test, let's try to detect a red colored mailbox using opencv python functions.

These are the key functions that we use:

- cvtColor: to convert the color space of an image
- inRange: to detect a mailbox based on the range of red color pixel values
- np.array: to define the range of red color using the Numpy library for better calculation
- findContours: to find an outline of the object
- bitwise_and: to black-out the area outside the contours found

import numpy as np
import matplotlib.pyplot as plt

# imread for loading an image
img = cv2.imread(mailbox_image)
# cvtColor for color conversion
hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)
# define range of red color in hsv
lower_red1 = np.array([0, 50, 50])
upper_red1 = np.array([10, 255, 255])
lower_red2 = np.array([170, 50, 50])
upper_red2 = np.array([180, 255, 255])
# threshold the hsv image to get only red colors
mask1 = cv2.inRange(hsv, lower_red1, upper_red1)
mask2 = cv2.inRange(hsv, lower_red2, upper_red2)
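What cv2.inRange computes is a per-pixel range test: a pixel is kept only if every channel falls inside [lower, upper]. The same logic in plain NumPy on a tiny synthetic HSV array (the pixel values here are made up) looks like:

```python
import numpy as np

# a 1x2 "image" in HSV: first pixel is in the red range, second is not
hsv = np.array([[[5, 200, 200], [100, 50, 50]]], dtype=np.uint8)
lower = np.array([0, 50, 50])
upper = np.array([10, 255, 255])

# all three channels must be inside the range for a pixel to pass
mask = np.all((hsv >= lower) & (hsv <= upper), axis=-1).astype(np.uint8) * 255
print(mask)  # 255 for the "red" pixel, 0 for the other
```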
Figure 42: Masked image

The red color mailbox is left alone in the image, which is what we wanted to find in this example with opencv functions. You can try other images with different colors to detect different shapes of objects using findContours and inRange from opencv.

For more information, see the following useful links.
contour features: http://docs.opencv.org/3.1.0/dd/d49/tutorial_py_contour_features.html
contours: http://docs.opencv.org/3.1.0/d4/d73/tutorial_py_contours_begin.html
mask = mask1 + mask2
# find a red color mailbox from the image
# note: OpenCV 3 findContours returns (image, contours, hierarchy);
# OpenCV 4 returns only (contours, hierarchy)
im2, contours, hierarchy = cv2.findContours(mask, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)
# bitwise_and to remove other areas in the image except the detected object
res = cv2.bitwise_and(img, img, mask=mask)
# turn off the x, y axis bar
plt.axis("off")
# text for the masked image
cv2.putText(res, "masked image", (20, 300), cv2.FONT_HERSHEY_SIMPLEX, 2, (255, 255, 255))
# display
plt.imshow(cv2.cvtColor(res, cv2.COLOR_BGR2RGB))
plt.show()
red color in hsv: http://stackoverflow.com/questions/30331944/finding-red-color-using-python-opencv
inrange: http://docs.opencv.org/master/da/d97/tutorial_threshold_inRange.html
inrange: http://docs.opencv.org/3.0-beta/doc/py_tutorials/py_imgproc/py_colorspaces/py_colorspaces.html
numpy: http://docs.opencv.org/3.0-beta/doc/py_tutorials/py_core/py_basic_ops/py_basic_ops.html
11.2.0.7 Human and Face Detection in OpenCV

11.2.0.7.1 INRIA Person Dataset

We use the INRIA Person dataset to detect upright people and faces in images in this example. Let us download it first.
11.2.0.7.2 Face Detection using Haar Cascades

This section is prepared based on the opencv-python tutorial: http://docs.opencv.org/3.1.0/d7/d8b/tutorial_py_face_detection.html#gsc.tab=0

There is a pre-trained classifier for face detection; download it from here:

This classifier XML file will be used to detect faces in images. If you would like to create a new classifier, find more information about training here: http://docs.opencv.org/3.1.0/dc/d88/tutorial_traincascade.html
$ curl ftp://ftp.inrialpes.fr/pub/lear/douze/data/INRIAPerson.tar > INRIAPerson.tar
$ tar xvf INRIAPerson.tar > logfile && tail logfile
$ curl https://raw.githubusercontent.com/opencv/opencv/master/data/haarcascades/haarcascade_frontalface_default.xml > haarcascade_frontalface_default.xml
11.2.0.7.3 Face Detection Python Code Snippet

Now we detect faces in the first five images using the classifier. See Figure 43, Figure 44, Figure 45, Figure 46, Figure 47, Figure 48, Figure 49, Figure 50, Figure 51, Figure 52, Figure 53.

# import the necessary packages
from __future__ import print_function
import numpy as np
import cv2
from os import listdir
from os.path import isfile, join
import matplotlib.pyplot as plt

mypath = "INRIAPerson/Test/pos/"
face_cascade = cv2.CascadeClassifier('haarcascade_frontalface_default.xml')
onlyfiles = [join(mypath, f) for f in listdir(mypath) if isfile(join(mypath, f))]

cnt = 0
for filename in onlyfiles:
    image = cv2.imread(filename)
    image_grayscale = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(image_grayscale, 1.3, 5)
    if len(faces) == 0:
        continue
    cnt_faces = 1
    for (x, y, w, h) in faces:
        cv2.rectangle(image, (x, y), (x + w, y + h), (255, 0, 0), 2)
        cv2.putText(image, "face" + str(cnt_faces), (x, y - 10), cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 0, 0), 2)
        plt.figure()
        plt.axis("off")
        plt.imshow(cv2.cvtColor(image[y:y + h, x:x + w], cv2.COLOR_BGR2RGB))
        cnt_faces += 1
    plt.figure()
    plt.axis("off")
    plt.imshow(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
    cnt = cnt + 1
    if cnt == 5:
        break
Figure 43: Example
Figure 44: Example
Figure 45: Example
Figure 46: Example
Figure 47: Example
Figure 48: Example
Figure 49: Example
Figure 50: Example
Figure 51: Example
Figure 52: Example
Figure 53: Example
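The crop image[y:y+h, x:x+w] used in the detection loop above is ordinary NumPy slicing: rows (y) come first, columns (x) second. A small synthetic check (array shape and rectangle made up):

```python
import numpy as np

# a blank 128-pixel-tall, 64-pixel-wide BGR "image"
image = np.zeros((128, 64, 3), dtype=np.uint8)

# a detection rectangle as returned by detectMultiScale: (x, y, w, h)
x, y, w, h = 10, 20, 30, 40

# slice rows first (height), then columns (width)
face = image[y:y + h, x:x + w]
print(face.shape)  # (40, 30, 3): h rows, w columns, 3 channels
```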
11.2.0.8 Pedestrian Detection using HOG Descriptor

We will use the Histogram of Oriented Gradients (HOG) to detect an upright person in images. See Figure 54, Figure 55, Figure 56, Figure 57, Figure 58, Figure 59, Figure 60, Figure 61, Figure 62, Figure 63.

11.2.0.8.1 Python Code Snippet
# initialize the HOG descriptor/person detector
hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())
cnt = 0
for filename in onlyfiles:
    img = cv2.imread(filename)
    orig = img.copy()
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # detect people in the image
    (rects, weights) = hog.detectMultiScale(img, winStride=(8, 8),
                                            padding=(16, 16), scale=1.05)
    # draw the final bounding boxes
    for (x, y, w, h) in rects:
        cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)
    plt.figure()
    plt.axis("off")
    plt.imshow(cv2.cvtColor(orig, cv2.COLOR_BGR2RGB))
    plt.figure()
    plt.axis("off")
    plt.imshow(cv2.cvtColor(img, cv2.COLOR_BGR2RGB))
    cnt = cnt + 1
    if cnt == 5:
        break

Figure 54: Example
Figure 55: Example
Figure 56: Example
Figure 57: Example
Figure 58: Example
Figure 59: Example
Figure 60: Example
Figure 61: Example
Figure 62: Example
Figure 63: Example
11.2.0.9 Processing by Apache Spark

The INRIA Person dataset provides 100+ images and Spark can be used for image processing in parallel. We load 288 images from the "Test/pos" directory.

Spark provides a special object 'sc' to connect a spark cluster to the functions in python code. Therefore, we can run python functions in parallel to detect objects in this example.

- The map function is used to process pedestrian and face detection per image from the parallelize() function of the 'sc' spark context.
- The collect function merges the results into an array.
def apply_batch(imagePath):
    import cv2
    import numpy as np
    # initialize the HOG descriptor/person detector
    hog = cv2.HOGDescriptor()
    hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())
    image = cv2.imread(imagePath)
    # detect people in the image
    (rects, weights) = hog.detectMultiScale(image, winStride=(8, 8),
                                            padding=(16, 16), scale=1.05)
    # draw the final bounding boxes
    for (x, y, w, h) in rects:
        cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)
    return image
11.2.0.9.1 Parallelize in Spark Context

The list of image files is given to parallelize.

11.2.0.9.2 Map Function (apply_batch)

The apply_batch function that we created previously is given to the map function to be processed in a spark cluster.

11.2.0.9.3 Collect Function

The result of each map process is merged into an array.

11.2.0.10 Results for 100+ images by Spark Cluster
pd = sc.parallelize(onlyfiles)
pdc = pd.map(apply_batch)
result = pdc.collect()
for image in result:
    plt.figure()
    plt.axis("off")
    plt.imshow(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
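Without a Spark cluster at hand, the parallelize/map/collect data flow can be emulated with plain Python; the stub below (a made-up stand-in for apply_batch, with made-up paths) shows the shape of the pipeline:

```python
# stand-in for the real detector: just tags each path
def apply_batch_stub(image_path):
    return 'detected:' + image_path

onlyfiles = ['pos/img1.png', 'pos/img2.png']

# sc.parallelize(onlyfiles).map(apply_batch).collect() has the same
# semantics, except that Spark distributes the map step over the cluster
result = list(map(apply_batch_stub, onlyfiles))
print(result)  # ['detected:pos/img1.png', 'detected:pos/img2.png']
```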