data entry and assembly. data acquisition best practices for creating data data entry options ...

47
Data Entry and Assembly Module 5 Data Acquisition

Upload: silvia-reed

Post on 18-Jan-2016

234 views

Category:

Documents


2 download

TRANSCRIPT

Data Entry and Assembly

Module 5

Data Acquisition

DataAcquisition

Best Practices for Creating Data Data Entry Options Data Manipulation Options Gathering Existing Data

Module Topics

DataAcquisition

After completing this lesson, the participant will be able to: ◦ Describe the characteristics of an easily understood and

manipulated dataset◦ Identify data entry tools and validation measures that can be

done as data is entered◦ Define what a relational database is and why it is useful◦ Describe documentation associated with using existing data

Learning Objectives

DataAcquisition

Collect

Assure

Describe

Deposit

Preserve

Discover

Integrate

Analyze

The Data Life Cycle

DataAcquisition

Whether recording data on paper forms or entering data digitally, accuracy and usability are important

Have a plan for storing your data before you collect it

The quality of your data records should be defensible

Following proper procedures when creating data results in digital datasets that are:◦ Valid◦Organized◦ Easy to understand◦ Easy to subset

Goals of Data Entry and Recording

DataAcquisition

Example: Poor Data Organization

• Inconsistency between data collection events (& more than one activity type)

– Date in title or as a data element– Column names vary– Order of columns differs

• Other Issues– Inconsistent Date formats– Different Site spellings– Typos in Site spellings– Poor column names (ex: Acult)– Inappropriate column use (‘got

away’, J, N in Acult column)– Mean1 Label is in Species column– Text and numbers in same column

DataAcquisition

Same Data, Using Best Practices

• Descriptive File Name• Column data types are consistent:

– only numbers, dates, or text• Consistent Names, Codes, Formats (date) used in each column• Similar data (separate samples) are all stored in one table with a single format –

computing is simpler using a single table than multiple, unrelated small tables that require a lot of individual human manipulation and synchronization

DataAcquisition

Create simple, consistently formatted file names. Consider using a date suffix in the name to help with version control. MyFieldData_20110910a.xls

Avoid using -,+,*,^, /, and other higher ASCII characters in column names. Software may interpret them as operators or reserved characters.

Create readable, descriptive column names without spaces or special characters◦ Soil T30 Soil_Temp_30cm◦ Species-Code Species_Code

Best Practice: Use Naming Conventions

DataAcquisition

Date Time NO3_N_Conc NO3_N_Conc_Flag20081011 1300 0.01320081011 1330 0.01620081011 1400 M

20081011 1430 0.018

20081011 1500 0.001 Est

Best Practice: Handling Missing Data◦Whenever possible, leave the value empty (NULL = no value)◦Use a Data_Flag column to qualify missing values (and other

issues) Ex: NA, Est, Null (no qualification)◦ For numeric fields, you might need to provide a distinct unlikely

value such as 9999 to indicate a missing value◦ For text fields, “NA” may be appropriate ( for “Not Applicable”

or “Not Available”)

M = missing; no sample collected

Est = estimated from grab sample

DataAcquisition

Enter complete lines of data, don’t assume a person or program will know to ‘fill in below’ for cells in a spreadsheet

Best Practice: Create Complete Records

Sorting can remove the best of orderly intentions!

Also avoid creating columns that are rarely used

DataAcquisition

Document the (authoritative) data source Obtain and keep all metadata about the format, content,

quality, and limitations of the data Document any data-sharing agreements or data-use

constraints Before use, evaluate the data for its fitness to your

needs Track the provenance of the data in derived data

products Consider informing the data source of the use of their

data

When Using Existing Data…

DataAcquisition

Spreadsheets Databases Googledocs Forms

Using Data Entry Tools

DataAcquisition

Excel: Pick-list validation

Control data entry by using standardized pick lists

Also look into using Excel’s automatic ‘data form’ feature

DataAcquisition

Excel: Range Validation

20

Avoid entry errors by trapping values - by range, type, etc.

Let the spreadsheet do the work of warning you!

DataAcquisition

Googledocs: Example Form

DataAcquisition

DataAcquisition

Great for Charts, Graphs, Calculations

Poor choice for large or complex datasets

Difficult to subset Flexible about cell content

type (no type enforcement) Lack record integrity (can

independently sort columns relative to other columns)

Easy to use – but harder to maintain as complexity and size of data grows (lots of repeats in records)

Spreadsheet vs. Database Not a math or graphics tool,

but can provide data to those tools

Work well with lots of data Easy to query and subset data Data fields are typed – only

integers in integer fields Columns cannot be sorted

independently of each other (a good thing!!)

Slight learning curve compared to a spreadsheet

Normalization reduces complexity and entry effort

DataAcquisition

Why should you use a relational database?

*siteIDsite_name

latitudelongitude

description

Site

• Each data value is stored in only one place• Tables contain attributes describing a single thing (Site)• Primary Keys (marked with *) establish row identities, and are shared

between tables to link them• Based on set theory, the query language provides a concise way of

grouping, sorting, filtering, & summarizing

*speciesIDspecies_name

common_namefamilyorder

Species

*sampleIDsiteID

sample_datespeciesID

heightflowering

flagcomments

SpeciesSample

• sets of well-defined tables

• formal relationships• a query language for

manipulating data

DataAcquisition

Database Features: Explicit control over data types

Date Site Species Height Flowering<dates only> <text only> <from code

list>< real numbers only>

< ‘y’ and ‘n’ only>

Advantages• quality control• performance• organization

DataAcquisition

Date Site Species Flowering?

2/13/2010 A BOGR2 y

2/13/2010 B HODR y

4/15/2010 B BOER4 y

4/15/2010 C PLJA n

Database Features: Relationships are defined between tables

Site Latitude Longitude

A 34.1 -109.3

B 35.2 -108.6

C 32.6 -107.5

Date Site Species Flowering? Latitude Longitude2/13/2010 A BOGR2 y 34.1 -109.3

2/13/2010 B HODR y 35.2 -108.6

4/15/2010 B BOER4 y 35.2 -108.6

4/15/2010 C PLJA n 32.6 -107.5

Join tables and their data using queries

DataAcquisition

Date Plot Treatment SensorDepth SoilTemperature

2010-02-01 C R 30 12.8

2010-02-01 B C 10 13.2

2010-02-02 A N 0 15.1

Database Features: Powerful Query Language

OR Select * from SoilTemp where Treatment = ‘N’ and SensorDepth = ‘0’

OR Select Max(Soil_Temperature) from SoilTemp

Select Date, Plot, Treatment, SensorDepth, Soil Temperature from SoilTemp where Date > ‘2010-01-01’

DataAcquisition

SAS, R, SPSS Good for calculations, data analysis, subsetting data Also can be used for quality assurance

Statistical Software

DataAcquisition

Be aware of Best Practices when designing data file structures

Make sure you adequately document data that you obtain from others, and track the use of those data in your derived data products

Choose a data entry method that allows some validation of data as it is entered

Consider investing time in learning how to use a database if datasets are large or complex

Conclusion

DataAcquisition

Best Practices for Preparing Environmental Data Sets to Share and Archive. September 2010. Les A. Hook, Suresh K. Santhana Vannan, Tammy W. Beaty, Robert B. Cook, and Bruce E. Wilson. http://daac.ornl.gov/PI/BestPractices-2010.pdf

References

What did you learn?

START QUIZ

DataAcquisition

1. Which of the following datasets is entered according to best practices?

date site sp wght1/2/2010 1 1 0.22/5/2010 1 2 lost3/2/2010 1 3 0.5

date site species weight_gm1/2/2010 1 1 0.22/5/2010 1 2 Missing – eaten?3/2/2010 1 3 0.5

date site species weight_gm flag comment1/2/2010 1 Acacia constricta 0.2 2/5/2010 1 Acacia greggii Missing from

plotEaten?

3/2/2010 1 Acacia redolens 0.5

date site sp wt flag comment1/2/2010 1 Ac 0.2 2/5/2010 1 Ag Missing from plot Eaten?3/2/2010 1 Ar 0.5

DataAcquisition

You might want to review this section again.

Think about this …

Return

DataAcquisition

Proceed to the next question

Excellent!

Next

DataAcquisition

consistent

2. It is considered best practice if you indicate any _______ data.

missing

different

special characters in the

DataAcquisition

You might want to review this section again.

Think about this …

Return

DataAcquisition

Proceed to the next question

Excellent!

Next

DataAcquisition

“value missing”

3. In numeric fields use a value such as_______ for missing data.

*

-

9999

DataAcquisition

You might want to review this section again

Think about this …

Return

DataAcquisition

Proceed to the next question

Excellent!

Next

DataAcquisition

valid or void

4. It is considered best practice to insert into text fields with missing data __________or_____________ .

null or NA

missing or void

missing or NA

DataAcquisition

You might want to review this section again

Think about this …

Return

DataAcquisition

Proceed to the next question

Excellent!

Next

DataAcquisition

periods

5. In a separate column use_____ to indicate empty cells.

data flags

dashes

brackets

DataAcquisition

You might want to review this section again

Think about this …

Return

DataAcquisition

Proceed to the next question

Excellent!

Next

DataAcquisition

numbers or special characters

6. A best practice in data entry is to create descriptive column names without _____ or ________.

spaces or special characters

dashes or lengthy descriptions

errors or spaces

DataAcquisition

You might want to review this section again

Think about this …

Return

DataAcquisition

Proceed to the next question

Excellent!

Next

DataAcquisition

Bird population monitoring in a river valley that will go on for several years

7. In which case should you use a database rather than spreadsheets for your data?

Greenhouse plant growth study with a 3x3 factorial design and a one time destructive sampling of biomass

Nitrogen concentrations in leaves from three different tree species in spring of 2010

DataAcquisition

You might want to review this section again

Think about this …

Return

DataAcquisition

Proceed to the next question

Excellent!

Next

DataAcquisition

We want to hear from you! CLICK the arrow to take our short survey.

Before you go . . .