Data Wrangling Lab
University of Arkansas at Little Rock
Sept 26-29, 2016 (c) 2016 iCDO@UALR 1
David /WEI DAI
CDO-1 Certificate Program: Foundations for Chief Data Officers
Agenda
• Basic Python Program
• MongoDB Lab
• Clean Data Lab
A Tutorial on the Python Programming Language
Why do we choose Python?
• C or C++
• Java
• Perl
• Scheme
• Fortran
• Python
• Matlab
• Modern, interpreted, object-oriented, full-featured, high-level programming language
• Portable (Unix/Linux, Mac OS X, Windows)
• Open source; intellectual property rights held by the Python Software Foundation
• Python versions: 2.x and 3.x
• 3.x is not backwards compatible with 2.x
• This course uses the 3.x version
• Fast program development
• Simple syntax
• Easy to write readable code
• Large standard library
• Lots of third-party libraries: NumPy, SciPy, Biopython, Matplotlib
Python Program Platform
• Open a browser and access the website:
• https://teslae.host.ualr.edu:8888
• Password: python
Hello World
• At the prompt, type "hello world!"
The print Function and Strings
>>> print('hello')
hello
>>> print('hello', 'David')
hello David
• Elements separated by commas print with a space between them
• Strings are immutable
• "+" is overloaded to do concatenation
>>> x = 'hello'
>>> x = x + ' America'
>>> print(x)
hello America
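The two bullets on immutability and "+" can be seen working together in a short sketch. The variable name `original` is only illustrative; the point is that concatenation creates a new string and leaves the old one untouched.

```python
# Strings are immutable: "+" builds a new string object instead of
# changing the existing one in place.
x = 'hello'
original = x            # a second name for the same string object
x = x + ' America'      # x is rebound to a brand-new string

print(x)                # hello America
print(original)         # hello

# Assigning to a single character is rejected with a TypeError.
try:
    original[0] = 'J'
except TypeError as err:
    print('TypeError:', err)
```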
Substrings and Methods
>>> s = '012345'
>>> print(s[3])
3
>>> print(s[1:4])
123
>>> print(s[2:])
2345
>>> print(s[:4])
0123
>>> print(s[-2])
4
• len(String) – returns the number of characters in the String
• str(Object) – returns a String representation of the Object
>>> print(len(s))
6
>>> print(str(10.3))
10.3
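A small sketch of how slice bounds behave may help here: the start index is included and the end index is excluded, so s[a:b] has b - a characters. The step form s[::k] is not covered on the slide and is shown only as an extra.

```python
s = '012345'

middle = s[1:4]       # indexes 1, 2, 3 (index 4 is excluded)
last_two = s[-2:]     # negative indexes count from the end
every_other = s[::2]  # optional third value is the step
backwards = s[::-1]   # a step of -1 walks the string in reverse

print(middle)         # 123
print(last_two)       # 45
print(every_other)    # 024
print(backwards)      # 543210
print(len(middle))    # 3
```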
• Relational operators
==    equal
!=    not equal (the Python 2 form <> is not valid in 3.x)
>     greater than
>=    greater than or equal
<     less than
<=    less than or equal
• Logical operators
and   and
or    or
not   not
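The operators above can be tried out as follows. The values of `a` and `b` are arbitrary; the chained comparison on the last line is a Python convenience that the slide does not mention.

```python
a, b = 3, 7

print(a == b)             # False
print(a != b)             # True
print(a <= b)             # True

# Logical operators combine the results of comparisons.
both = a < b and b < 10
either = a > b or b == 7
negated = not a == b      # "not" applies to the whole comparison
print(both, either, negated)   # True True True

# Comparisons can be chained: this means 0 < a and a < 5.
in_range = 0 < a < 5
print(in_range)           # True
```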
Variables
• Are not declared, just assigned
• The variable is created the first time you assign it a value
• Assignment is = and comparison is ==
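As a rough illustration of these points, the sketch below creates a variable by assignment, rebinds it to a value of another type, and shows what happens when a name is read before it has ever been assigned. The names `count` and `never_assigned` are made up for the example.

```python
count = 10                      # created by its first assignment
print(type(count).__name__)     # int

count = 'ten'                   # the same name can later hold another type
print(type(count).__name__)     # str

print(count == 'ten')           # True: == compares, = assigns

# Reading a name that was never assigned raises NameError.
try:
    print(never_assigned)
except NameError as err:
    print('NameError:', err)
```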
Lists
• Ordered collection of data
• Data can be of different types
• Lists are mutable
• Issues with shared references and mutability
• Same subset operations as Strings
>>> x = [1, 'hello', (3 + 2j)]
>>> print(x)
[1, 'hello', (3+2j)]
>>> print(x[2])
(3+2j)
>>> print(x[0:2])
[1, 'hello']
Lists: Modifying Content
• x[i] = a reassigns the ith element to the value a
• Since x and y point to the same list object, both are changed
• The method append also modifies the list
>>> x = [1, 2, 3]
>>> y = x
>>> x[1] = 15
>>> print(x)
[1, 15, 3]
>>> print(y)
[1, 15, 3]
>>> x.append(12)
>>> print(y)
[1, 15, 3, 12]
Lists: Modifying Contents
• The method append modifies the list and returns None
• List addition (+) returns a new list
>>> x = [1, 2, 3]
>>> y = x
>>> z = x.append(12)
>>> print(z == None)
True
>>> print(y)
[1, 2, 3, 12]
>>> x = x + [9, 10]
>>> print(x)
[1, 2, 3, 12, 9, 10]
>>> print(y)
[1, 2, 3, 12]
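When the shared-reference behavior shown on the list slides is not wanted, one common approach is to make a copy before modifying. The sketch below contrasts an alias with a shallow copy; the names `alias` and `snapshot` are illustrative.

```python
x = [1, 2, 3]
alias = x          # second name for the same list object
snapshot = x[:]    # shallow copy; list(x) or x.copy() would also work

x.append(12)       # in-place change, visible through every alias
x[1] = 15

print(alias)       # [1, 15, 3, 12]
print(snapshot)    # [1, 2, 3]
print(alias is x)      # True
print(snapshot is x)   # False
```

A shallow copy duplicates only the outer list, so nested lists inside it would still be shared.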
if/else Statements
if expression:
    statement(s)
else:
    statement(s)
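A concrete instance of the if/else template might look like the sketch below. The function `describe` and the three-hour threshold are invented for illustration; note that indentation, not braces, marks which statements belong to each branch.

```python
def describe(hours):
    # The indented lines under "if" and "else" form the two branches.
    if hours >= 3:
        return 'full course'
    else:
        return 'short course'

print(describe(3))   # full course
print(describe(1))   # short course
```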
For Loops
• Similar to Perl for loops, iterating through a list of values
forloop1.py
for x in [1, 6, 12, 3]:
    print(x)
Output:
1
6
12
3
forloop2.py
for x in range(4):
    print(x)
Output:
0
1
2
3
• range(n) generates the numbers 0, 1, …, n-1 (in Python 3 this is a lazy range object rather than a list)
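Since this course uses Python 3, it may help to see what range actually returns. The sketch below shows that range(4) is its own sequence type and that list() turns it into a list; the start/stop/step form is an extra that the slide does not cover.

```python
r = range(4)

print(r)            # range(0, 4)
print(list(r))      # [0, 1, 2, 3]
print(len(r))       # 4

# Optional start, stop, and step arguments; stop is excluded.
stepped = list(range(2, 10, 3))
print(stepped)      # [2, 5, 8]
```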
Functions are first class objects
• Can be assigned to a variable
• Can be passed as a parameter
• Can be returned from a function
• Functions are treated like any other value in Python; the def statement simply assigns a function object to a name
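The three properties listed above can be sketched in a few lines. All function names here (`square`, `apply_twice`, `make_adder`) are made up for the example.

```python
def square(n):
    return n * n

def apply_twice(func, value):
    # A function received as a parameter is called like any other.
    return func(func(value))

def make_adder(k):
    # A function can build and return another function.
    def add(n):
        return n + k
    return add

f = square                        # assigned to a variable
print(f(4))                       # 16
print(apply_twice(square, 3))     # 81
add_five = make_adder(5)          # returned from a function
print(add_five(10))               # 15
```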
Function Basics
functionbasics.py
def maximum(x, y):
    # Return the larger of the two arguments
    if x > y:
        return x
    else:
        return y
>>> maximum(2, 5)
5
Python for Graphs
• Matplotlib is a Python 2D plotting library which produces high-quality figures
• Ready-made demos are in the plot_demo.ipy file.
MongoDB Lab
MongoDB Express User Interface
• http://teslae.host.ualr.edu:8081
• Username: mongotest
• Password: mongotest
MongoDB Express
• MongoDB Express is a web-based MongoDB admin interface
• You can create, review, export, and delete data through the platform
MongoDB Express Lab
• Export cities.json
• Add a new city of your choice to MongoDB
• Query or find the new city name
• Delete the new city name
Clean Data Lab
Courses Data in MongoDB
Connect to MongoDB
CRUD Operations for MongoDB
Basic Python-MongoDB Lab
• Write code to add a new course
{"courseid": "71XX",                   <-- Change XX
 "subject": "information science",
 "title": "data quality algorithm",    <-- Change course name
 "hours": 3                            <-- Change hours
}
• Write code to search for your course
query = {"title": "data quality algorithm"}    <-- Change title name
projection = {"hours": 3}                      <-- Change hours
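One possible shape for this lab with PyMongo is sketched below. It is not the lab's official solution: the helper names, the connection URI, and the database name "school" are placeholders, and the projection uses the conventional 1/0 flags (1 includes a field, 0 excludes it) instead of the value shown on the slide. Only `MongoClient`, `insert_one`, and `find` are PyMongo calls.

```python
def add_course(collection, course):
    """Insert one course document and return the generated _id."""
    return collection.insert_one(course).inserted_id

def find_courses(collection, query, projection=None):
    """Return the documents matching the query as a list."""
    return list(collection.find(query, projection))

new_course = {
    "courseid": "71XX",                   # change XX
    "subject": "information science",
    "title": "data quality algorithm",    # change course name
    "hours": 3,                           # change hours
}
query = {"title": "data quality algorithm"}
projection = {"_id": 0, "title": 1, "hours": 1}

def run_against_server():
    # Assumptions: pymongo is installed, and the URI, database name,
    # and collection name below match your lab environment.
    from pymongo import MongoClient
    client = MongoClient("mongodb://localhost:27017/")
    courses = client["school"]["coursesinfo"]
    print(add_course(courses, new_course))
    for doc in find_courses(courses, query, projection):
        print(doc)
```

Calling run_against_server() performs the insert and the search once the placeholders have been adjusted.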
Basic Python-MongoDB Lab (cont.)
• A challenge project
• Write code to add your name to the teachers' list
Clean Data Lab (cont.)
• Teachers, Courses, and Students are master data management (MDM) data, so they are accurate and trusted.
• student_course_report and teacher_course_report contain incorrect data, but teacherid, studentid, and courseid are correct.
[Screenshots: Teachersinfo, teacher_course_report]
Clean Data Lab (cont.)
[Screenshot: teacher_course_report]
Clean Data Lab (cont.)
• Write code to clean student_course_report
• Tips:
[Screenshots: coursesinfo, studentsinfo, student_course_report]
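Because the IDs in the report are correct and the master data is trusted, one way to approach the cleaning step is to look up each row's IDs in the master data and overwrite the descriptive fields. The sketch below works on plain Python lists of dictionaries. The field names "studentname" and "title" and the sample records are assumptions for illustration, since the real schema appears only in the screenshots.

```python
def clean_report(report_rows, students, courses):
    """Replace descriptive fields in each row with master-data values."""
    # Index the trusted master data by ID for fast lookup.
    students_by_id = {s["studentid"]: s for s in students}
    courses_by_id = {c["courseid"]: c for c in courses}

    cleaned = []
    for row in report_rows:
        fixed = dict(row)  # copy, so the input rows are not modified
        fixed["studentname"] = students_by_id[row["studentid"]]["studentname"]
        fixed["title"] = courses_by_id[row["courseid"]]["title"]
        cleaned.append(fixed)
    return cleaned

# Made-up sample data standing in for the MongoDB collections.
studentsinfo = [{"studentid": "S01", "studentname": "Ada Lovelace"}]
coursesinfo = [{"courseid": "7101", "title": "data quality algorithm"}]
student_course_report = [
    {"studentid": "S01", "courseid": "7101",
     "studentname": "ada lovelace ", "title": "Data Qualty"},
]

cleaned = clean_report(student_course_report, studentsinfo, coursesinfo)
print(cleaned[0])
```

With data stored in MongoDB, the same idea applies after reading each collection into a list, for example with list(collection.find()).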
Clean Data Lab (cont.)
• A challenge project
• Write code to clean t_s_c_report.
[Screenshots: coursesinfo, studentsinfo, Teachersinfo]
THANK YOU