CDS 130 - 003, Fall 2010
Computing for Scientists
Data Analysis (Nov. 11, 2010 – Nov. 23, 2010)
Jie Zhang Copyright ©
Where are We?
1. Introduction
2. Computer Fundamentals
3. Measurements
4. Scientific Simulation – Project 1
5. Visualization
6. Data Analysis
7. Ethics
8. Communication – Project 2 with 10-minute in-class presentation
Data Analysis
Analysis of data is a process of inspecting, cleaning, transforming, and modeling data with the goal of highlighting useful information, suggesting conclusions, and supporting decision making.
--http://en.wikipedia.org/wiki/Data_analysis
Example: my research on sunspots
• Coordinates
• Areas
• Fluxes (positive, negative, unsigned)
Number of fragments: analogous to the number of sunspots.
Butterfly Diagram
Automated MDI AR from 1996 to 2008
Example: my research on sunspots
Your Project
Analyze global temperature data.
What can you do with the data?
http://www.globalwarmingart.com/images/a/aa/Annual_Average_Temperature_Map.jpg
Content
1. Statistical Measures
2. Histogram method
3. Regression method
Matlab Overview of Data Analysis Tools
Watch video:
http://www.mathworks.com/demos/matlab/analyzing-data-overview-matlab-video-demonstration.html
Statistical Measures
For a given distribution, whether the data are 1-D or 2-D, one can find:
• Minimum
• Maximum
• Mean
• Variance
• Standard Deviation
Minimum & Maximum
Example: a=[1,2,3;4,5,6;7,8,9]
>a=[1,2,3;4,5,6;7,8,9]
a =
   1   2   3
   4   5   6
   7   8   9
>min(a)
ans =
   1   2   3
>min(a(:))
ans = 1
>a(:) %what is this?
Why does “min(a)” return 3 different numbers, not the single minimum value of 1?
Answer: the built-in “min” function returns the minimum value of each column
“a(:)” converts the 2-D 3X3 array “a” into 1-D data, so “min(a(:))” returns one number
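The column-wise behavior of “min” and the effect of “a(:)” can be checked with a short sketch. The variable names below (col_min, row_min, flat, overall) are chosen here for illustration only; the calls themselves are standard Matlab/Octave.

```matlab
a = [1,2,3; 4,5,6; 7,8,9];

col_min = min(a);         % minimum of each column: [1 2 3]
row_min = min(a, [], 2);  % minimum of each row (along dimension 2): [1; 4; 7]

flat = a(:);              % 9x1 column, read column by column: 1,4,7,2,5,8,3,6,9
overall = min(flat);      % single minimum of the whole array: 1
```

The same pattern applies to “max”: “max(a(:))” gives the largest value of the whole array.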
Minimum & Maximum
Example: a=[1,2,3;4,5,6;7,8,9]
>a=[1,2,3;4,5,6;7,8,9]
a =
   1   2   3
   4   5   6
   7   8   9
>max(a)
ans =
   7   8   9
>max(a(:))
ans = 9
“max” is a built-in Matlab function
Mean
The mean is the average value of a distribution
$$\text{mean} = \mu = \frac{1}{N}\sum_{i=1}^{N} x_i = \frac{x_1 + x_2 + \dots + x_N}{N}$$
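Mean
Example: a=[1,2,3;4,5,6;7,8,9]
>a=[1,2,3;4,5,6;7,8,9]
%find the sum of the values of the data array
%(the variables are named “total” and “avg” so that the built-in “sum” and “mean” are not overwritten)
>total=0
>for i=[1:9]
>total = total + a(i);
>end
>avg=total/9
avg = 5
>mean(a(:))
ans = 5
“mean” is a built-in Matlab function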
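Like “min” and “max”, “mean” works column by column on a 2-D array. A minimal sketch, with variable names (col_mean, row_mean, all_mean) picked here for illustration:

```matlab
a = [1,2,3; 4,5,6; 7,8,9];

col_mean = mean(a);               % mean of each column: [4 5 6]
row_mean = mean(a, 2);            % mean of each row: [2; 5; 8]
all_mean = sum(a(:)) / numel(a);  % mean of all 9 values: 5
```

Here “numel(a)” returns the number of elements, so the last line is the loop above written in one step.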
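Variance and Standard Deviation
Both are measures of how spread out a distribution is. In other words, they are measures of variability.
Variance is the average squared deviation of each number from the mean $\mu$ (the sum is divided by N-1):
$$\text{variance} = \sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N - 1}$$
Standard deviation is the square root of the variance:
$$\text{Standard Deviation} = \sigma = \sqrt{\text{variance}} = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N - 1}}$$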
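The division by N-1 in the variance formula can be compared with a division by N in a short sketch. The names ss, v_sample, and v_pop are illustrative; “var(x,1)” is the built-in form that normalizes by N instead of N-1.

```matlab
x = [1,2,3,4,5,6,7,8,9];
N = numel(x);

mu = sum(x) / N;          % mean: 5
ss = sum((x - mu).^2);    % sum of squared deviations: 60

v_sample = ss / (N - 1);  % divide by N-1, as in the formula: 7.5
v_pop    = ss / N;        % divide by N instead: about 6.6667

sd_sample = sqrt(v_sample);  % about 2.7386
```

For large N the two versions are nearly equal; for small data sets such as this one the difference is visible.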
Variance and Standard Deviation
Calculate the variance and standard deviation ourselves
Example: a=[1,2,3;4,5,6;7,8,9]
>a=[1,2,3;4,5,6;7,8,9]
>avg=mean(a(:)) %mean of all the values
avg = 5
%find the sum of the squared deviations from the mean
>variance=0
>for i=[1:9]
>variance = variance + power((a(i)-avg),2);
>end
>variance=variance/8 %variance: divide by N-1 = 8
variance = 7.5000
>sd=sqrt(variance) %standard deviation
sd = 2.7386
Variance and Standard Deviation
Using built-in Matlab functions:
“var”: for variance
“std”: for standard deviation
Example: a=[1,2,3;4,5,6;7,8,9]
>a=[1,2,3;4,5,6;7,8,9]
>var(a(:))
ans = 7.5000
>std(a(:))
ans = 2.7386
Statistical Measures
Group Exercise:
What are the statistical measures of the following 1-D data array:
a=[1.2, 0.5, 0.9, 2.4, 1.8, 3.1, 2.2, 1.0]
“temperature.dat”
% Hourly temperature predicted in Fairfax VA (22030)
% on Nov. 10, 2010 (Wednesday) for 24 hours
% from 1 AM to midnight
1 45
2 44
3 43
4 42
5 41
6 40
7 40
8 43
9 46
10 49
11 53
12 55
13 56
14 57
15 57
16 56
17 53
18 49
19 46
20 44
21 43
22 42
23 41
24 41
Input Data in Matlab
Watch Online Video:
http://www.mathworks.com/demos/matlab/importing-data-from-files-matlab-video-tutorial.html
Input Data in Matlab
“dlmread”: built-in Matlab file input function
• Reads an ASCII-delimited file of numeric data into a matrix
• Syntax: M = dlmread(filename, delimiter, R, C)
• delimiter: e.g., comma ',', space ' ', colon ':'
• R, C: specify the row and column where the upper left corner of the data lies in the file; both are counted from zero, so R=3 skips the first three lines
>data=dlmread('temperature.dat',' ',3,0)
>plot(data(:,1),'*')
>plot(data(:,2),'*')
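The effect of the R and C offsets can be tried without downloading anything. The sketch below writes a tiny file with three header lines and then reads it back; the file name demo_temperature.dat and the variable names are made up for this illustration, and the three data rows are the first three rows of “temperature.dat”.

```matlab
fname = 'demo_temperature.dat';   % hypothetical scratch file

% write three comment lines followed by three rows of "hour temperature"
fid = fopen(fname, 'w');
fprintf(fid, '%% header line 1\n%% header line 2\n%% header line 3\n');
fprintf(fid, '1 45\n2 44\n3 43\n');
fclose(fid);

% R=3 skips the three header lines; C=0 starts at the first column
data  = dlmread(fname, ' ', 3, 0);
hours = data(:,1);                % first column: hour
temps = data(:,2);                % second column: temperature

delete(fname);                    % remove the scratch file
```

Changing C to 1 would return only the temperature column.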
Input Data in Matlab
Statistical Measures
Group Exercise:
1. Download the temperature data from: “http://solar.gmu.edu/teaching/2010_CDS130/notes/data_analysis/temperature.dat”
2. Store the data into your working directory
3. Read the data into Matlab/Octave using “dlmread.m”
4. Calculate the statistical measures of the hourly temperature data
Histogram
Histogram
A histogram is a summary graph showing the frequency distribution of data over a set of data ranges
• Bin: the data range over which data points are counted, also called a “group”
• Frequency: the number of data points in each bin
• A histogram is a kind of bar plot, since each data bin is represented by one bar in the graph
Histogram: Example
Histogram of scores of Prof. Zhang’s 2007 ASTRONOMY 111 – 003 Class.
What do you learn?
Bin size: 10
Histogram: Example
Raw data: “ASTR111_003_2007_grad.txt” in text file
ASTR111-003, Astronomy, 2007, Instructor: Prof. Jie Zhang
ID Grade
10000001 12
10000002 81
10000003 80
10000004 60
10000005 74
10000006 19
……
10000128 68
10000129 50
10000130 64
10000131 83
10000132 86
Histogram: Example
Read in the raw data and count the frequency
• Obtain the data
• Count the frequency in each data bin
%column 1: student ID; column 2: student grade
a=dlmread('ASTR111_003_2007_grad.txt','',2,0);
%obtain the grade from the second column
grad=a(:,2);
%count the frequency
freq=[0,0,0,0,0,0,0,0,0,0]; %initialize the frequency distribution
for i=[1:132]
  %bin 1: grade from 0 to 10, bin 2: grade from 10 to 20, and so on
  if ((grad(i) >= 0) && (grad(i) < 10)), freq(1)=freq(1)+1; end
  if ((grad(i) >= 10) && (grad(i) < 20)), freq(2)=freq(2)+1; end
  if ((grad(i) >= 20) && (grad(i) < 30)), freq(3)=freq(3)+1; end
  if ((grad(i) >= 30) && (grad(i) < 40)), freq(4)=freq(4)+1; end
  if ((grad(i) >= 40) && (grad(i) < 50)), freq(5)=freq(5)+1; end
  if ((grad(i) >= 50) && (grad(i) < 60)), freq(6)=freq(6)+1; end
  if ((grad(i) >= 60) && (grad(i) < 70)), freq(7)=freq(7)+1; end
  if ((grad(i) >= 70) && (grad(i) < 80)), freq(8)=freq(8)+1; end
  if ((grad(i) >= 80) && (grad(i) < 90)), freq(9)=freq(9)+1; end
  %the last bin includes the grade of 100
  if ((grad(i) >= 90) && (grad(i) <= 100)), freq(10)=freq(10)+1; end
end
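The ten “if” tests can also be replaced by computing the bin index directly. This is only a sketch of an alternative, not the method used on the slide: the grades are the eleven values listed in the raw data above plus one made-up grade of 100 to show the last bin, and the index formula assumes grades between 0 and 100 with a bin size of 10.

```matlab
% eleven grades listed in the raw data, plus a hypothetical grade of 100
grad = [12, 81, 80, 60, 74, 19, 68, 50, 64, 83, 86, 100];

freq = zeros(1, 10);                     % frequency of each of the 10 bins
for i = 1:numel(grad)
  k = min(floor(grad(i) / 10) + 1, 10);  % bin index; a grade of 100 goes into bin 10
  freq(k) = freq(k) + 1;
end
```

A grade of 74 gives floor(7.4)+1 = 8, which is the same bin (70 to 80) that the “if” statements select.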
Histogram: Example
Frequency Table
Bin Frequency
0-10 0
10-20 4
20-30 3
30-40 4
40-50 5
50-60 14
60-70 31
70-80 33
80-90 33
90-100 5
“hist”
“hist” is a built-in Matlab function
>hist(grad) %default: divide the data into 10 bins
>hist(grad, 20) %specify the number of bins, e.g., 20 bins
>bin=[0:10:100] %specify the bin size and data range of counting
>hist(grad, bin) %bin here is a 1-D data array
                 %its values indicate the centers of the bins
>bin=[5:10:95] %centers at [5, 15, 25, ...]
>hist(grad, bin) %start points of bars: [0, 10, 20, ...]
                 %end points of bars: [10, 20, 30, ...]
>a=hist(grad) %“a”: returns the values of the frequency distribution
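To see that the second argument of “hist” gives bin centers, the counts can be returned instead of plotted. A small sketch, using eight of the grades listed in the raw data (values that fall exactly on a bin boundary are left out on purpose, since their assignment depends on the implementation); the names centers, counts, and xc are illustrative.

```matlab
grad = [12, 81, 74, 19, 68, 64, 83, 86];  % subset of the listed grades

centers = 5:10:95;                   % bin centers 5, 15, ..., 95
[counts, xc] = hist(grad, centers);  % counts per bin and the centers used
```

Here counts(2) is 2 because both 12 and 19 fall in the bin centered at 15, which covers 10 to 20.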
Histogram: example
Group activity: show histogram of random numbers
(1) Generate 10000 random numbers from 0 to 1
(2) Plot histogram with 10 bins
(3) How many numbers are between 0.4 and 0.5?
(4) Plot histogram with 50 bins
(5) Run “hist(rand(10,1),10)” multiple times
(6) Run “hist(rand(10000,1),10)” multiple times
(7) Does the histogram make sense regarding the distribution of the random numbers?
Histogram: Example
Histogram of Fairfax Temperature data.
>temp=dlmread('temperature.dat','',3,1) %obtain the temperature data
>hist(temp)
How do we interpret the data?
Matlab “if” statement
Syntax: if expression, statements, end
• “expression”: consists of variables joined by relational operators, and returns a logic value
• Logic value: 1 – true; 0 – false
• If “true”, the statement is executed
• If “false”, the statement is not executed
• Relational operators: “<”, “>”, “==”, “<=”, “>=”
• Logic operators: “&&” – and; “||” – or
Matlab “if” statement
>3>2
ans = 1
>3<2
ans = 0
>3==3
ans = 1
>3==2
ans = 0
>score = 74
>if (score >= 70), freq=1, end
>if (score >=70 && score <80), freq=1, end
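The one-line form above can also be written over several lines, with “elseif” and “else” branches for the other cases. A minimal sketch; the variable name label and its string values are made up for illustration.

```matlab
score = 74;

if (score >= 70 && score < 80)
  label = '70-80';    % executed, since both conditions are true for 74
elseif (score >= 80 && score < 90)
  label = '80-90';    % checked only when the first expression is false
else
  label = 'other';    % executed when no expression above is true
end
```

Only the first branch whose expression is true is executed; the remaining branches are skipped.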
Regression
Regression
Regression, or correlation, refers to the data analysis method used to find the relationship that might exist between two quantities
For example: one measures the height and weight for a group of people.
What are the data obtained? X, Y
What kind of useful analysis can you do with the data?
What can you do with the methods you’ve learned already?
What else can you do?
Example
http://www.upa.pdx.edu/IOA/newsom/pa551/lectur19.htm
The Objectives
• The input data is a pair of quantities
• x=[x1,x2,x3,x4,x5…]
• y=[y1,y2,y3,y4,y5…]
• To quantify the relation between the two quantities
• Objective one is to find the regression line that best fits the data -> the equation of the line
• Objective two is to determine how well the line fits the data -> correlation coefficient R
Exp: Income and Education
Fifteen people are surveyed. The following table shows the data
Participant Income ($) Education (years)
1 125000 19
2 100000 20
3 90000 16
4 75000 16
5 100000 18
6 29000 12
7 35000 14
8 24000 12
9 50000 16
10 60000 17
11 30000 13
12 60000 15
13 56000 14
14 35000 12
15 90000 18
Exp: Income and Education
Load the data
>pwd %path of my current file folder
ans = C:\Octave\3.2.4_gcc-4.4.0\bin
>dir %directory content
>cd('~/matlab') %change my working directory
                %“~” points to my default home directory
>data = dlmread('education_income.dat','',3,0) %load data from computer hard drive to memory
>data %make sure that you have the right data
>size(data) %find the dimension and size of data
ans =
   15    3
%parse the data into an x array and a y array
>x=data(:,3) %education, all rows and 3rd column
>y=data(:,2) %income, all rows and 2nd column
Exp: Income and Education
Make an appropriate scatter plot with labels
>plot(x,y,'*') %okay but not good. Some points are on the axes
>axis([10,22,10000,150000]) %change the plotting range
>xlabel('Education') %okay. But font size is small
>xlabel('Education','FontSize',20) %change to a larger font size
>ylabel('Income','FontSize',20)
>title('Income Education Analysis')
>print -dpng income_no_fit.png
Exp: Income and Education
Mathematical Principle
The least-squares principle is used in the fitting
We fit the scattered data points to a linear equation
$$y = ax + b$$
a: slope
b: y-intercept
Minimize the following “sum square” parameter to obtain the fitting coefficients “a” and “b”
$$SS = \sum_{i=1}^{N} \left( y_{\mathrm{fit}}(i) - y_{\mathrm{obs}}(i) \right)^2$$
$$y_{\mathrm{fit}}(i) = a\,x(i) + b$$
$y_{\mathrm{obs}}(i)$: from the input data
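For a straight line, minimizing SS has a well-known closed-form solution: the slope is a = ss_xy / ss_xx and the intercept is b = mean(y) - a*mean(x), where ss_xx and ss_xy are sums of deviations from the means. The sketch below applies these formulas to the income and education table and compares the result with the built-in “polyfit”; the variable names are illustrative.

```matlab
% education (years) and income ($) for the 15 participants in the table
x = [19,20,16,16,18,12,14,12,16,17,13,15,14,12,18];
y = [125000,100000,90000,75000,100000,29000,35000,24000, ...
     50000,60000,30000,60000,56000,35000,90000];

xm = mean(x);
ym = mean(y);
ss_xx = sum((x - xm).^2);             % spread of x about its mean
ss_xy = sum((x - xm) .* (y - ym));    % co-variation of x and y

a = ss_xy / ss_xx;                    % slope that minimizes SS
b = ym - a * xm;                      % y-intercept that minimizes SS

SS = sum((a*x + b - y).^2);           % the minimized "sum square" value
p = polyfit(x, y, 1);                 % built-in fit for comparison: [slope, intercept]
```

Any other choice of a and b, for example a slightly larger slope, gives a larger SS.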
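Matlab: “polyfit”
“polyfit.m”: built-in Matlab function that fits a polynomial curve to a pair of data arrays
>p=polyfit(x,y,1)
$$p(x) = p_0 + p_1 x + p_2 x^2 + \dots + p_n x^n$$
n: degree of fitting
n=1: linear fitting
n=2: quadratic fitting
n=3: cubic fitting
Note that “polyfit” returns the coefficients in descending power, i.e., the coefficient of the highest power comes first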
Linear Regression
>p = polyfit(x,y,1) %x: x data
                    %y: y data
                    %1: degree of fitting, n=1 for linear
                    %p: fitting coefficients in descending power
p =
   1.0889e+004  -1.0449e+005
%p(1)=10889: the slope, the “a” we are looking for
%p(2)=-104490: the y-intercept, the “b” we are looking for
It means that the linear fitting function is:
$$y = ax + b$$
$$y = 10889x - 104490$$
Linear Regression: plot
>p(1) %verify the data
>p(2) %verify the data
>x_fit=[10:24] %define the x values of the fitted line
>y_fit=p(1)*x_fit+p(2) %obtain the fitted y-values
>plot(x_fit,y_fit) %plot the line
%add the data points
>hold on
>plot(x,y,'*')
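The fitted y-values can also be obtained with the built-in “polyval”, which evaluates a polynomial given in the same descending-power form that “polyfit” returns. In this sketch the coefficients are the rounded values quoted on the slide, entered by hand rather than taken from a new fit; y16 is an illustrative name.

```matlab
p = [10889, -104490];        % [slope, y-intercept], rounded values from the fit

x_fit = 10:24;               % x values of the fitted line
y_fit = polyval(p, x_fit);   % same as p(1)*x_fit + p(2)

y16 = polyval(p, 16);        % fitted income for 16 years of education: 69734
```

“polyval” becomes more convenient for n=2 or n=3, where writing out every power by hand is error-prone.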
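Correlation Coefficient
The correlation coefficient, the R value, characterizes how well the data can be fitted by a linear function
R = 0: no correlation at all
R = 1.0: perfect correlation
$$R = \frac{ss_{xy}}{\sqrt{ss_{xx}\, ss_{yy}}}$$
$$ss_{xx} = \sum_i (x_i - \bar{x})^2$$
$$ss_{yy} = \sum_i (y_i - \bar{y})^2$$
$$ss_{xy} = \sum_i (x_i - \bar{x})(y_i - \bar{y})$$
where $\bar{x}$ and $\bar{y}$ are the mean values of x and y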
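Correlation Coefficient
“corrcoef.m”: the built-in function for the correlation coefficient
>R=corrcoef(x,y)
R = 0.90891
%in Matlab, corrcoef(x,y) returns a 2x2 matrix; the coefficient is the off-diagonal element R(1,2)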
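The R value reported by “corrcoef” can be reproduced from the ss_xx, ss_yy, and ss_xy sums. The sketch below uses the income and education table; the names R_manual and R_builtin are illustrative, and the indexing C(1,end) is used so that the code works whether “corrcoef” returns a single number or a 2x2 matrix.

```matlab
% education (years) and income ($) for the 15 participants in the table
x = [19,20,16,16,18,12,14,12,16,17,13,15,14,12,18];
y = [125000,100000,90000,75000,100000,29000,35000,24000, ...
     50000,60000,30000,60000,56000,35000,90000];

ss_xx = sum((x - mean(x)).^2);
ss_yy = sum((y - mean(y)).^2);
ss_xy = sum((x - mean(x)) .* (y - mean(y)));

R_manual = ss_xy / sqrt(ss_xx * ss_yy);  % correlation coefficient from the sums

C = corrcoef(x, y);                      % scalar or 2x2 matrix, depending on version
R_builtin = C(1, end);                   % picks the coefficient in either case
```

Both values agree with the 0.90891 shown above, which indicates a strong linear relation between education and income in this sample.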
Group Exercise
Make the linear regression analysis on the temperature-latitude data: find the linear equation and R, and plot the line along with the data points
Latitude (degree) Temperature (Celsius)
0 32
10 26
20 20
30 12
40 5
50 -1
60 -9
70 -14
80 -20
90 -25
Group Exercise
2-D data array. Generate a random 2-D array of size 180 X 90
Find the average value along row 20
Find the average value along column 40
Find the value at row 50, column 30
The End