Applications of Computational Science: Data-Intensive Computing for Student Projects


Copublished by the IEEE CS and the AIP, 1521-9615/12/$31.00 © 2012 IEEE, Computing in Science & Engineering

EDUCATION

Editors: Steven F. Barrett, [email protected]

Rubin Landau, [email protected]

By Jessica Howard, Omar Padron, Patricia Morreale, and David Joiner

Computational science offers undergraduates a wide range of tools and methods for problem solving. Undergraduate curricula in computational science are designed to provide students with exposure to tools and methods [1], but often without an opportunity to use the tools on large, data-intensive problems of their own design. A similar problem exists in computer science. While students completing their second year, and the standard course in data structures, have a wide range of techniques at their disposal, the opportunity to consider a problem, select a solution from the range of possible solutions, implement their solution, and then evaluate and reiterate the process (if needed) often isn't available.

We therefore developed a research course providing computational science and computer science students with the opportunity to select a solution, implement the solution, and see the results of their work visually presented. Using a publicly available dataset, the National Oceanic and Atmospheric Administration (NOAA) Integrated Surface Dataset (ISD; see www.ncdc.noaa.gov/oa/climate/isd/index.php), students select appropriate computational methods, use them on the dataset, and present the results visually in a graphic display for further discussion and understanding.

Dataset Identification

Earlier work in sensor data collection and data mining has shown that large volumes of data can be captured, archived, and later examined [2,3], but identifying trends or patterns in the data, or even which data is meaningful, is a significant task. Our prior research work [4] involved sensor data collection on campus, which we archived into a local dataset for analysis and presentation. We distributed the sensors on campus and gathered a range of environmental data. However, the gathered dataset's size and range wasn't large enough for data-intensive computing, especially for pattern and motif identification.

For this course, we selected environmental datasets, as many datasets are available to students and researchers from agencies such as NASA, NOAA, and the Environmental Protection Agency (EPA). For example, NASA and the Goddard Institute for Space Studies maintain a number of different datasets at the NASA GISS websites (http://data.giss.nasa.gov). NOAA datasets capture data from many different reporting sites, and support a wide range of variables and data file formats. The EPA has developed a data-finder page (www.epa.gov/datafinder) that assists researchers and students in finding their way through the EPA's numerical data sources. Information on air quality, air pollution, climate change, water contamination, and other environmental measures is available.

Climate change is an engaging problem that students understand, and it has applications to many other areas of science and mathematics, making it appealing to a wide audience. Initially, we considered all three government agencies (NASA, NOAA, and EPA) and their associated publicly available datasets. As the students began to identify the variables that they were interested in, this narrowed the number of datasets under consideration. Finally, we selected the NOAA ISD dataset, because it had the data values, collection years, and data formats that would be most useful. The ISD dataset consists of global, hourly observations gathered from many different sources.

The NOAA ISD dataset’s size was significant, with the primary data table having more than 120 million rows (with the earliest readings going back to 1929), and varied depending on the number of reporting stations that would be taken into account by any specific analysis. The data files in the ISD are derived from surface ob-servational data and are stored in an ASCII character format. The data are

A research course for juniors and seniors has been designed to offer students a chance to work with the tools of computational science for large, data-intensive computations on publicly available datasets.


Additionally, the NOAA data was pertinent to current events, particularly discussions of global warming, and thus relevant to a large number of students.

The purpose of the student work was to identify and extract data patterns in the NOAA environmental dataset. Environmental data mining can help predict threats to public safety and health, such as air pollution, extreme temperatures, and flooding. The students, drawn from both computational science and computer science backgrounds, proposed to create several modules to permit users to gather information from the datasets and support data mining. The modules discussed here can fit a linear model to a dataset and locate local optima within the dataset. We provide these illustrations as examples of what the students found to be a useful approach when provided with the very large NOAA dataset.

Modules

The students were initially overwhelmed by the dataset and didn't know how to begin analyzing the data using computational methods. A general discussion of the variables available led to students identifying the most interesting or most significant variables for data mining. The dataset wasn't as easily accessible as some classroom projects are. Rather, students had to download the dataset from the NOAA site, parse it, and enter it into a MySQL relational database on a local server. We developed a series of scripts that automatically fetched the daily data updates from NOAA and updated the local database, keeping the project supplied with current data, as well as historical data. We did this dataset-preparation work before the semester started or, if the students were familiar with database design, in the first weeks of class.

Data Sequence

The data in the NOAA dataset selected for this effort was sequenced in the order shown in Table 1 [5].

Each data record is of variable length and includes both control and mandatory data. We gathered this information from the NOAA ISD dataset and moved it into a local relational database for student use.
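For readers implementing something similar, here's a minimal sketch of parsing fixed-width ASCII records in R. The file name and field widths below are placeholders, not the real ISD layout, which is specified in the format documentation [5].

# Sketch: parse fixed-width ISD-style records in R.
# The widths are HYPOTHETICAL placeholders, one per Table 1 field;
# consult the NOAA ISD format document for the actual layout.
obs <- read.fwf(
  "isd_sample.txt",                          # hypothetical extract of ISD records
  widths = c(6, 8, 4, 6, 7, 5, 5),           # placeholder field widths
  col.names = c("station", "date", "time", "lat",
                "lon", "surface_report_type", "report_type_code")
)
head(obs)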

Linfit

We used the module linfit to fit a linear model to a dataset of (x, y) points (see Figure 1). This module uses linear regression, which attempts to model the relationship between two variables by fitting a linear equation to observed data [6]. The fit is determined to be the one that minimizes the sum of squares, that is, the sum of the squared differences between the observed points and the points predicted by the fit [7]:

$$SS = \sum_{i=1}^{n} \bigl(y_i - f(x_i)\bigr)^2$$

For a line, f has the form f (x) = mx + b, with the slope m and the vertical offset b as the two parameters to be determined.

$$SS = \sum_{i=1}^{n} (y_i - mx_i - b)^2$$

Table 1. National Oceanic and Atmospheric Administration Integrated Surface Data (NOAA ISD) sequence.

1. Fixed-weather-station identifier
2. Geophysical-point-observation date
3. Geophysical-point-observation time
4. Geophysical-point-observation latitude coordinate
5. Geophysical-point-observation longitude coordinate
6. Geophysical-point-observation type surface report code
7. Geophysical-report-type code

Figure 1. Sample data fitted to a line using linfit. This module uses linear regression, which attempts to model the relationship between two variables by fitting a linear equation to observed data.



Differentiating along each parameter and setting equal to 0, we arrive at two simultaneous equations to be solved:

$$\frac{\partial SS}{\partial m} = -2 \sum_{i=1}^{n} x_i \left( y_i - m x_i - b \right) = 0$$

$$\frac{\partial SS}{\partial b} = -2 \sum_{i=1}^{n} \left( y_i - m x_i - b \right) = 0$$

This module was written in R and uses R's built-in function, lm(), which solves the simultaneous equations by representing them as a matrix equation and reducing the resulting matrix.
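To make the matrix formulation concrete, here's a small sketch (not taken from the students' module) that solves the normal equations directly in base R; the sample data are made up.

# Sketch: solve the least-squares normal equations as a matrix equation.
x <- c(0.1, 0.25, 0.4, 0.6, 0.8, 1.0)    # made-up sample points
y <- c(0.9, 1.3, 1.6, 2.0, 2.5, 2.9)

X <- cbind(x, 1)                         # design matrix: one column for m, one for b
beta <- solve(t(X) %*% X, t(X) %*% y)    # solves (X'X) beta = X'y
m <- beta[1]
b <- beta[2]

Calling lm(y ~ x) returns the same m and b, along with diagnostics such as the coefficient of determination.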

This module takes (x, y) data points provided by the user and stores them as a data table, and then uses lm() to fit the given points to a line. Input for linfit must have the (x, y) points in a two-column format, meaning the points must be on separate lines and separated by a space. The module will print out the line's slope, the intercept, and the coefficient of determination R² of the line determined by lm(). R² is a measure of the model's global fit [7,8]:

$$R^2 = 1 - \frac{\sum_{i=1}^{n} \bigl(y_i - f(x_i)\bigr)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2},$$

where $\bar{y}$ is the average of the data points' y values. R² ranges from 0, indicating no linear relationship, to 1, which indicates that the determined model is a perfect fit for the data points and that all of the dependent variable's variability is explained [9].
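A minimal linfit-style script in R might look like the following sketch; the input file name is hypothetical, but the two-column format and the printed quantities match the description above.

# Sketch: read two-column (x, y) points, fit a line with lm(),
# and print the slope, intercept, and R^2.
points <- read.table("points.txt", col.names = c("x", "y"))  # hypothetical input file
fit <- lm(y ~ x, data = points)

cat("slope:", coef(fit)["x"], "\n")
cat("intercept:", coef(fit)["(Intercept)"], "\n")
cat("R^2:", summary(fit)$r.squared, "\n")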

An interesting aspect of this technique is that it generalizes quite naturally to more fitting parameters as well as other classes of functions (such as polynomials and exponentials), with the only requirements being that the functions' partial derivatives with respect to the parameters can be computed (or at least well approximated), and the quality of the fit varies linearly with each parameter [10].

Gfit

For occasions where either of these requirements isn't met, heuristic approaches are commonplace. With nonlinear models, for example, one might linearize around an initial guess, solve the linearized problem, and repeat to iteratively improve the guess.

This can work quite well given certain characteristics of the objective function, but in general, this doesn't guarantee that a returned solution is globally optimal. Indeed, this "greedy" approach is particularly prone to getting stuck in a local optimum [11]. For this reason, we explored the use of other heuristic approaches.

The generic fitting module, gfit, fits a model to a set of points, assuming nothing about its relationship with its parameters. Arbitrary complexity and nonlinearity are addressed with simulated annealing (SA; see Figure 2), a stochastic optimization algorithm based on Monte Carlo simulations. SA likens a vector of input parameters to a particle in a hyperspace that exhibits Brownian motion. This motion lets us accept objectively inferior solutions for the sake of possibly arriving at one that's globally superior. This quality helps the optimization avoid local optima, a common issue with deterministic (sometimes called "greedy") algorithms. A more comprehensive review of SA-class algorithms is provided elsewhere [11].
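The gfit module itself isn't reproduced here, but base R's optim() with method = "SANN" gives a flavor of the simulated-annealing approach; the sine model and the synthetic temperature-like data below are assumptions for illustration.

# Sketch: fit y = a*sin(b*x + c) + d by simulated annealing via optim().
set.seed(1)
x <- seq(0, 10, length.out = 200)
y <- 3 * sin(0.8 * x + 1) + 5 + rnorm(200, sd = 0.3)   # synthetic observations

sse <- function(p) {                  # objective: sum of squared residuals
  sum((y - (p[1] * sin(p[2] * x + p[3]) + p[4]))^2)
}

fit <- optim(par = c(1, 1, 0, mean(y)), fn = sse,
             method = "SANN", control = list(maxit = 20000))
fit$par                               # estimated amplitude, frequency, phase, offset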

Find_Zeros

Find_Zeros is a module that locates the x-intercepts of a function (see Figure 3). This module uses a vector of x points and a vector of y points, both in ascending order and of equal length. The x-intercepts are found by first searching through the vector of y values given by the user for the position where the y values change signs. When that condition is found, the module takes the (x, y) points on either side of the sign change.

Figure 2. Sample temperature data (measured in degrees Fahrenheit, or F) with a sine wave fitted using simulated annealing (SA). The resulting fit suggests that the data depicts a cold winter followed by a mild spring.


Figure 3. The x-intercepts identified in the plot of sin(x) using Find_Zeros. The results of Find_Zeros were then used by Local_Optima.



It substitutes these two points into a point-slope form equation and solves for x given y = 0:

$$x = x_1 - \frac{y_1}{m}, \quad \text{where} \quad m = \frac{y_2 - y_1}{x_2 - x_1}.$$

The module outputs the x-intercepts found and the index of the y1 value used in the point-slope form equation. The y1 index is included in the output to indicate where each x-intercept would fall if it were placed in the vector of x values.
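An R sketch of this sign-change search follows; the students' actual module is in Octave, so the translation and the function name are assumptions. It implements the interpolation formula above and reports the index of each y1 used.

# Sketch: locate x-intercepts by sign changes, with linear interpolation.
find_zeros <- function(x, y) {
  i <- which(y[-length(y)] * y[-1] < 0)         # positions where y changes sign
  m <- (y[i + 1] - y[i]) / (x[i + 1] - x[i])    # slope between the two points
  data.frame(intercept = x[i] - y[i] / m,       # solve the point-slope form for y = 0
             index = i)                          # index of the y1 value used
}

xs <- seq(-5, 5, by = 0.1)
find_zeros(xs, sin(xs))   # x-intercepts near -pi, 0, and pi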

This module also handles cases where a function has values of y approaching zero but never touching the x-axis. Such cases require another method for capturing x-intercepts. This second method looks for y values within the tolerance value provided by the user. The tolerance value is an optional input argument that lets the module know what values are small enough to be considered zero. If no tolerance is provided, the module sets the tolerance to $10^{-3}$. Using three (x, y) points at a time, the method finds the equation of the curve that passes through the three points, and then determines that curve's apex. This is accomplished in the module by first solving a set of equations for three variables, a, b, and c, using the x components of the three points [12]:

$$ax_1^2 + bx_1 + c = 0$$
$$ax_2^2 + bx_2 + c = 0$$
$$ax_3^2 + bx_3 + c = 0$$

To obtain a function’s apex, the first derivative must be taken and set to

equal zero. The derivative of a qua-dratic equation is

ax b2 0.+ =

The module finds the apex’s x value by substituting the values of a and b into this equation.

For this project, Find_Zeros wasn't used on its own; instead, it was utilized by another module, Local_Optima, which used Find_Zeros as part of its method for locating minimums and maximums.

Local_Optima

Local_Optima is a minimum, maximum, and saddle-point detection module (see Figure 4). The input for this module is a vector of x points and a vector of y points, both in ascending order and of equal length. Optima are found by first taking the derivative of the (x, y) points using Octave's built-in utility function, gradient, and then using the Find_Zeros module to find the x-intercepts of the derivative. These x-intercepts are the optima's x values. Then, two of Octave's built-in functions, polyfit and polyval, are used to obtain the optima's y values on the curve. Polyfit returns the coefficients of a polynomial p(x) of degree n. Polyval evaluates the polynomial at given x values by computing the following equation [9,12]:

$$y = p_1 x^n + p_2 x^{n-1} + \cdots + p_n x + p_{n+1}.$$

Now that it knows the optima's (x, y) points, the module goes on to determine whether each point is a minimum, maximum, or saddle point.

The module determines the saddle points by comparing the values within the vector of the original function's first derivatives to the vector of its second derivatives. If the difference between the first and second derivatives at any point is less than $10^{-5}$, that point will be identified as a saddle point. Checking for a difference that's less than $10^{-5}$ instead of checking whether both values are equal to zero compensates for situations where small values are being processed and computations that should result in an output of zero are actually yielding output that's relatively small.

Next, the module determines minimums and maximums by fitting the given function's second derivative using polyfit and then using polyval to evaluate the second derivative at the optima points' x values. Polyval's output is used to determine whether each optimum is a minimum or maximum. If the output is positive, the point is identified as a minimum. If the output is negative, the point is identified as a maximum.

In the output, a 0 is appended to saddle points, a 1 is appended to minimums, and a −1 is appended to maximums.
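Putting the pieces together in R (again a translation of the Octave design, so an assumption): the derivative here uses a simple central difference in place of Octave's gradient, and saddle detection is simplified to a near-zero second derivative.

# Sketch: find and classify local optima of sampled (x, y) data.
central_diff <- function(x, y) {   # rough stand-in for Octave's gradient
  n <- length(y)
  d <- numeric(n)
  d[2:(n - 1)] <- (y[3:n] - y[1:(n - 2)]) / (x[3:n] - x[1:(n - 2)])
  d[1] <- (y[2] - y[1]) / (x[2] - x[1])
  d[n] <- (y[n] - y[n - 1]) / (x[n] - x[n - 1])
  d
}

x <- seq(-2, 2, by = 0.01)
y <- sin(x^3)
dy <- central_diff(x, y)                  # first derivative
zeros <- find_zeros(x, dy)                # candidate optima (find_zeros sketch above)
d2 <- central_diff(x, dy)[zeros$index]    # second derivative at each candidate
# classify: 0 = saddle, 1 = minimum, -1 = maximum
labels <- ifelse(abs(d2) < 1e-5, 0, ifelse(d2 > 0, 1, -1))
cbind(x_opt = zeros$intercept, label = labels)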

Further Analysis

By using computational tools and modules, patterns and trends in large datasets can be identified.

Figure 4. Local optima of sample data identified using Local_Optima. The saddle point is at 0 on both axes, while 1 designates minimums and −1 maximums on the sin(x³) axis.



Many data collections need additional methods for data fitting and smoothing to provide appropriate data analysis for a wide variety of fields.

Data Fitting and Smoothing

The modules that we've outlined are suitable for extracting features from a dataset and summarizing trends. However, most numerical methods rely on well-behaved data, which wasn't always found in the NOAA ISD dataset or in comparable "real" datasets. The absence of reported data for a time, poor or invalid data, as well as any number of other conditions can all require the use of a fitting and smoothing process, which helps provide data that's better behaved and suitable for further processing.

The students applied the simple and exponential moving-averages (sma/ema) module to the NOAA ISD data (see Figure 5), resulting in a better-behaved overall dataset. The students clearly saw how fitting and smoothing enhanced the data. We also used the generic fitting module (gfit).
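As a sketch of the exponential side of that module (the real sma/ema interface isn't shown here, and the weighting-factor name alpha is an assumption):

# Sketch: exponential moving average with weighting factor alpha in (0, 1].
ema <- function(y, alpha) {          # smaller alpha smooths more heavily
  out <- numeric(length(y))
  out[1] <- y[1]
  for (i in 2:length(y)) {
    out[i] <- alpha * y[i] + (1 - alpha) * out[i - 1]
  }
  out
}

temps <- c(31, 30, 35, 52, 48, 60, 75, 72)   # made-up temperature readings (F)
ema(temps, alpha = 0.3)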

Utility Modules

In addition to the use of sma/ema and gfit, we created additional modules to facilitate direct data manipulation. For example, two modules, mux and demux, act as a data multiplexer and demultiplexer. Upon receiving a signal as input and an integer command line argument N, mux will produce as output multiple columns, each of length N, where the first column contains the first N values in the input, the second the next N values, and so on. Then filt acts as a data filter, which we can use to filter an input signal according to a Boolean expression given on the command line. Finally, stamp2int takes any input text and replaces text following the pattern of a time stamp with Unix's equivalent measure in milliseconds, the Unix epoch being the time of origin. Time stamps were assumed to be midnight, proleptic Greenwich mean time (GMT).
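For a sense of the mux idea, R's matrix() performs the same column-splitting; this sketch, with a made-up signal, isn't the module's actual command line interface.

# Sketch: mux-like splitting of a signal into columns of length N.
signal <- 1:12
N <- 4                              # column length, per the command line argument
cols <- matrix(signal, nrow = N)    # column 1 holds the first N values, and so on
cols
as.vector(cols)                     # a demux-like inverse restores the signal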

The results from the work with the NOAA ISD dataset have been extremely encouraging. The students are also continuing their work and are identifying other datasets to use with their modules. Additional modules are being developed. Tested modules will be placed in a public repository, with information about the datasets on which they can be used.

The modules developed by the students to identify and investigate patterns and trends in the very large NOAA ISD dataset provided the students with a strong understanding of the needs and limits of computational science. The software is available at www.mmnetlab.com, via the software link on the left-hand navigation guide. The identification of significant data variables, the data ordering that took place before computational science toolkits could be used, and the data's integrity were all discussed and demonstrated visually.

For those who are interested in implementing this course, "Visualization of Computational Science" could be conducted according to the schedule in Table 2. The outlined curriculum provides an opportunity for computational science, engineering, mathematics, and computer science students to understand the merits of very large datasets available in fields as varied as bioinformatics, environmental sensing, and sensor data collection, while providing an awareness of which tools to use to begin analysis.

References

1. A.B. Shiflet and G.W. Shiflet, Introduction to Computational Science, Princeton Univ. Press, 2006.
2. M. Kantardzic, Data Mining: Concepts, Models, and Algorithms, John Wiley & Sons, 2003.
3. U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth, "From Data Mining to Knowledge Discovery in Databases," 1996; www.kdnuggets.com/gpspubs/aimag-kdd-overview-1996-Fayyad.pdf.
4. P. Morreale et al., "Real-Time Environmental Monitoring and Notification for Public Safety," IEEE MultiMedia, vol. 17, no. 2, 2010, pp. 4–11.
5. Federal Climate Complex Data Documentation for Integrated Surface Data, tech. report, Nat'l Climatic Data Center, Air Force Climatology Center, 15 Jan. 2010; ftp://ftp.ncdc.noaa.gov/pub/data/noaa/ish-format-document-old.pdf.
6. S. Chatterjee and A. Hadi, "Simple Linear Regression," Regression Analysis by Example, 4th ed., John Wiley & Sons, 2006, pp. 21–50.
7. B.S. Everitt, Cambridge Dictionary of Statistics, 2nd ed., Cambridge Univ. Press, 2002.

Figure 5. Sample temperature data with exponential moving averages and decreasing weighting factors. Darker lines correspond to smaller weighting factors.



8. J.W. Eaton, GNU Octave Manual, Network Theory, 2002.
9. R Development Core Team, "R: A Language and Environment for Statistical Computing," R Foundation for Statistical Computing, 2008; www.R-project.org.
10. R.H. Landau, M.J. Paez, and C.C. Bordeianu, A Survey of Computational Physics, Princeton Univ. Press, 2008.
11. S. Kirkpatrick, C.D. Gelatt, and M.P. Vecchi, "Optimization by Simulated Annealing," Science, vol. 220, no. 4598, 1983, pp. 671–680.
12. A. Robbins, The GNU Awk User's Guide, Free Software Foundation, 2011; www.gnu.org/software/gawk/manual/gawk.html.

Jessica Howard is a graduate student at the New Jersey Center for Science, Technology, and Mathematics at Kean University. Her research interests include 2D/3D visualization, modeling, and physics. Howard has a BS in computational mathematics from Kean University. Contact her at [email protected].

Omar Padron is a graduate student in the Department of Computer and Information Sciences at the University of Delaware. His research interests include parallel algorithms for distributed and shared architectures, high-end visualizations of large datasets, and developing numerical software for application in a variety of scientific disciplines. Padron has an MS in computational mathematics from Kean University. Contact him at [email protected].

Patricia Morreale is a professor in the Department of Computer Science at Kean University. Her research interests include network management, data mining, multimedia service delivery, and data visualization. Morreale has a PhD in computer science from the Illinois Institute of Technology. Contact her at [email protected].

David Joiner is the Kenneth L. Estabrook Professor of Science, Technology, and Mathematics Education at the New Jersey Center for Science, Technology, and Mathematics at Kean University. His research interests are in the areas of educational technology and computational science. Joiner has a PhD in physics from the Rensselaer Polytechnic Institute. Contact him at [email protected].

Selected articles and columns from IEEE Computer Society publications are also available for free at http://ComputingNow.computer.org.

Table 2. Outline of semester weeks and activities for the "Visualization of Computational Science" course.

Week 1: Select which public dataset to use.
Week 2: Initially navigate the dataset; identify variables to be considered; discuss the arrangement of the dataset's data sequences.
Week 3: Prepare dataset for computing.
Week 4: Prepare dataset for computing.
Week 5: Linfit module.
Week 6: Find_Zeros module.
Week 7: Local_Optima module.
Week 8: Midterm exam.
Week 9: Data fitting and smoothing.
Week 10: Utility modules.
Week 11: Implement and test additional selected modules using small teams in class.
Week 12: Analyze and further test.
Week 13: Present results.
Week 14: Summarize and assess the methods' utility.
Week 15: Final exam.
