data entry, and manipulation - github pages · 2020-01-06 · data manipulation options lesson...

36
Data Entry, and Manipulation DataONE Community Engagement & Outreach Working Group

Upload: others

Post on 18-Mar-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Data Entry, and Manipulation - GitHub Pages · 2020-01-06 · Data Manipulation Options Lesson Topics. Recognize and plan for inconsistencies that can make a dataset difficult to

DataEntry,andManipulationDataONECommunityEngagement&OutreachWorking

Group

Page 2: Data Entry, and Manipulation - GitHub Pages · 2020-01-06 · Data Manipulation Options Lesson Topics. Recognize and plan for inconsistencies that can make a dataset difficult to

BestPracticesforCreatingDataFilesDataEntryOptionsDataIntegrationBestPracticesDataManipulationOptions

LessonTopics

Page 3: Data Entry, and Manipulation - GitHub Pages · 2020-01-06 · Data Manipulation Options Lesson Topics. Recognize and plan for inconsistencies that can make a dataset difficult to

Recognizeandplanforinconsistenciesthatcanmakeadatasetdifficulttounderstandand/ormanipulateDescribecharacteristicsofstabledataformatsandlistreasonsforusingtheseformatsIdentifydataentrytoolsIdentifyvalidationmeasuresthatcanbeperformedasdataisenteredReviewbestpracticesfordataintegrationDescribethebasiccomponentsofarelationaldatabase

LearningObjectives

Page 4: Data Entry, and Manipulation - GitHub Pages · 2020-01-06 · Data Manipulation Options Lesson Topics. Recognize and plan for inconsistencies that can make a dataset difficult to

Createqualitydatasetsthatare:ValidOrganizedtosupporteaseofuseandreuse

GoalsofDataEntry

Page 5: Data Entry, and Manipulation - GitHub Pages · 2020-01-06 · Data Manipulation Options Lesson Topics. Recognize and plan for inconsistencies that can make a dataset difficult to

Example:PoorDataEntry

Page 6: Data Entry, and Manipulation - GitHub Pages · 2020-01-06 · Data Manipulation Options Lesson Topics. Recognize and plan for inconsistencies that can make a dataset difficult to

Example:PoorDataEntry,continued

Page 7: Data Entry, and Manipulation - GitHub Pages · 2020-01-06 · Data Manipulation Options Lesson Topics. Recognize and plan for inconsistencies that can make a dataset difficult to

Columnsofdataareconsistent:onlynumbers,dates,ortextConsistentNames,Codes,Formats(date)usedineachcolumnDataareallinonetable,whichismucheasierforastatisticalprogramtoworkwiththanmultiplesmalltableswhicheachrequirehumanintervention

RecommendedPractices

Page 8: Data Entry, and Manipulation - GitHub Pages · 2020-01-06 · Data Manipulation Options Lesson Topics. Recognize and plan for inconsistencies that can make a dataset difficult to

CreatedescriptivecolumnnameswithoutspacesorspecialcharactersSoilT30toSoil_Temp_30cmSpecies-Codeto"Species_Code(Avoidusing-,+,*,^incolumnnames.Somesoftwaremayinterpretthesesymbolsasanoperator)"

Useadescriptivefilename.Forinstance,afilenamedSEV_SmallMammalData_v.5.25.2010.csvindicatestheprojectthedataisassociatedwith(SEV),thethemeofthedata(SmallMammalData)andalsowhenthisversionofthedatawascreated(v.5.25.2010).Thisnameismuchmorehelpfulthanafilenamedmydata.xls.

RecommendedPractices,continued

Page 9: Data Entry, and Manipulation - GitHub Pages · 2020-01-06 · Data Manipulation Options Lesson Topics. Recognize and plan for inconsistencies that can make a dataset difficult to

MissingdataPreferablyleavefieldempty(NULL=novalue)Innumericfields,useadistinctvaluesuchas9999toindicateamissingvalueIntextfields,useNA(“NotApplicable”or“NotAvailable”)UseDataflagsinaseparatecolumntoqualifymissingvalue

RecommendedPractices,continued

Page 10: Data Entry, and Manipulation - GitHub Pages · 2020-01-06 · Data Manipulation Options Lesson Topics. Recognize and plan for inconsistencies that can make a dataset difficult to

Entercompletelinesofdata

RecommendedPractices,continued

Page 11: Data Entry, and Manipulation - GitHub Pages · 2020-01-06 · Data Manipulation Options Lesson Topics. Recognize and plan for inconsistencies that can make a dataset difficult to

Forthelongterm,storedatainaconsistentformatthatcanbereadwellintothefutureandthatcanbeusedbyanyapplicationnoworinthefutureAppropriatefiletypesinclude:

Non-proprietary:useanopen,documentedstandardCommonusagebyresearchcommunity:Standardrepresentation(ASCII,Unicode)UnencryptedUncompressed

ASCIIformattedfilesarelikelytobereadableintothefutureUseASCII(comma-separated)fortabulardata

BestPractices

Page 12: Data Entry, and Manipulation - GitHub Pages · 2020-01-06 · Data Manipulation Options Lesson Topics. Recognize and plan for inconsistencies that can make a dataset difficult to

BestPracticesforPreparingEnvironmentalDataSetstoShareandArchive.September2010.LesA.Hook,SureshK.SanthanaVannan,TammyW.Beaty,RobertB.Cook,andBruceE.Wilson.https://daac.ornl.gov/PI/BestPractices-2010.pdfPreparingDataforSharing.2015.LibbieStephenson.https://doi:10.7910/DVN/BJNXVQ

Resources

Page 13: Data Entry, and Manipulation - GitHub Pages · 2020-01-06 · Data Manipulation Options Lesson Topics. Recognize and plan for inconsistencies that can make a dataset difficult to

Twocommontools:GoogleDocs,Excel

DataEntryTools

Page 14: Data Entry, and Manipulation - GitHub Pages · 2020-01-06 · Data Manipulation Options Lesson Topics. Recognize and plan for inconsistencies that can make a dataset difficult to

GoogleDocsForms

Page 15: Data Entry, and Manipulation - GitHub Pages · 2020-01-06 · Data Manipulation Options Lesson Topics. Recognize and plan for inconsistencies that can make a dataset difficult to

GoogleDocsSpreadsheet

Page 16: Data Entry, and Manipulation - GitHub Pages · 2020-01-06 · Data Manipulation Options Lesson Topics. Recognize and plan for inconsistencies that can make a dataset difficult to

Excel

Page 17: Data Entry, and Manipulation - GitHub Pages · 2020-01-06 · Data Manipulation Options Lesson Topics. Recognize and plan for inconsistencies that can make a dataset difficult to

Excel:DataValidation

Page 18: Data Entry, and Manipulation - GitHub Pages · 2020-01-06 · Data Manipulation Options Lesson Topics. Recognize and plan for inconsistencies that can make a dataset difficult to

Greatforcharts,graphs,calculationsFlexibleaboutcellcontenttype—cellsinsamecolumncancontainnumbersortext**Easytouse–buthardertomaintainascomplexityandsizeofdatagrows

EasytoquerytoselectportionsofdataDatafieldsaretyped–Forexample,onlyintegersareallowedinintegerfieldsColumnscannotbesortedindependentlyofeachotherSteeperlearningcurvethanaspreadsheet

SpreadsheetversusRelationalDatabase

Page 19: Data Entry, and Manipulation - GitHub Pages · 2020-01-06 · Data Manipulation Options Lesson Topics. Recognize and plan for inconsistencies that can make a dataset difficult to

Whatisarelationaldatabase?

Page 20: Data Entry, and Manipulation - GitHub Pages · 2020-01-06 · Data Manipulation Options Lesson Topics. Recognize and plan for inconsistencies that can make a dataset difficult to

DatabaseFeatures:Explicitcontroloverdatatypes

Page 21: Data Entry, and Manipulation - GitHub Pages · 2020-01-06 · Data Manipulation Options Lesson Topics. Recognize and plan for inconsistencies that can make a dataset difficult to

Relationshipsaredefinedbetweentables

Page 22: Data Entry, and Manipulation - GitHub Pages · 2020-01-06 · Data Manipulation Options Lesson Topics. Recognize and plan for inconsistencies that can make a dataset difficult to

UsingStructuredQueryLanguage(SQL)

Page 23: Data Entry, and Manipulation - GitHub Pages · 2020-01-06 · Data Manipulation Options Lesson Topics. Recognize and plan for inconsistencies that can make a dataset difficult to

Dataentryusingadatabase

Page 24: Data Entry, and Manipulation - GitHub Pages · 2020-01-06 · Data Manipulation Options Lesson Topics. Recognize and plan for inconsistencies that can make a dataset difficult to

BeawareofBestPracticesinyourdomainwhendesigningdatafilestructuresChooseadataentrymethodthatallowssomevalidationofdataasitisenteredConsiderinvestingtimeinlearninghowtouseadatabaseifdatasetsarelargeorcomplex

Review:PlanningforDataEntry

class:center

Page 25: Data Entry, and Manipulation - GitHub Pages · 2020-01-06 · Data Manipulation Options Lesson Topics. Recognize and plan for inconsistencies that can make a dataset difficult to

Considertryingoneofthese:

Personal,single-userdatabasescanbedevelopedinMSAccess,whichisstoredasafileontheuser’scomputer.MSAccesscomeswitheasyGUItoolstocreatedatabases,runqueries,andwritereports.Amorerobustdatabasethatisfree,accommodatesmultipleusersandwillrunonWindowsorLinuxisMySQL.GUIinterfacesforMySQLincludephpMyadmin(free)andNavicat(inexpensive).

Ifyouwanttotryadatabase:

Page 26: Data Entry, and Manipulation - GitHub Pages · 2020-01-06 · Data Manipulation Options Lesson Topics. Recognize and plan for inconsistencies that can make a dataset difficult to

DatabaseDesignforMereMortals:AHands-OnGuidetoRelationalDatabaseDesign(2ndEdition).MichaelJ.Hernandez.Addison-Wesley,2003.FundamentalsofRelationalDatabaseDesign.PaulLitwin.http://r937.com/relational.html.(AccessedMay12,2016).

Tolearnmoreaboutdesigningarelationaldatabase:

Page 27: Data Entry, and Manipulation - GitHub Pages · 2020-01-06 · Data Manipulation Options Lesson Topics. Recognize and plan for inconsistencies that can make a dataset difficult to

MaintaindatasetprovenanceDocumenttransformationsBewareofaccidentalduplication

Reviewmetadataforcompatibilityofcontext,methods,andmeaningForwhatpurposewasthedatacollected?Howwasthedatacollected?Isitsensibletocombinethesedatasets?

DataIntegrationBestPractices

Page 28: Data Entry, and Manipulation - GitHub Pages · 2020-01-06 · Data Manipulation Options Lesson Topics. Recognize and plan for inconsistencies that can make a dataset difficult to

EnsurecompatibilityConverttocommonunitsChooseappropriatenumericprecisionEvaluateandstandardizemissingvaluecodes

DocumentallassumptionsWhatassumptionsunderlietheoriginaldatasets?Whatassumptionsdidyoumakeincombiningthedatasets?

DataIntegrationBestPractices

Page 29: Data Entry, and Manipulation - GitHub Pages · 2020-01-06 · Data Manipulation Options Lesson Topics. Recognize and plan for inconsistencies that can make a dataset difficult to

RecognizethatyouarecreatinganewdatasetRevisitthedatalifecycletoensurethenewdatasetisproperlydocumented,validated,andpreserved

UsereproducibleworkflowsEnabletransparencyandreproducibilityintheintegrationprocessEnsureothersunderstandandcanevaluateyourdecisionmakingprocess.Automatetheintegrationasmuchaspossible,especiallywhenintegratingmanyorlargedatasets

DataIntegrationBestPractices

Page 30: Data Entry, and Manipulation - GitHub Pages · 2020-01-06 · Data Manipulation Options Lesson Topics. Recognize and plan for inconsistencies that can make a dataset difficult to

EnsureattributionoforiginaldatasetownersandrespectdatausageagreementsExampleresource:

ExamplecitationtotherelateddatasetfromtheDryadrepository:

DataIntegrationBestPractices

Page 31: Data Entry, and Manipulation - GitHub Pages · 2020-01-06 · Data Manipulation Options Lesson Topics. Recognize and plan for inconsistencies that can make a dataset difficult to

Usefulforanalyzing,subsettingandtransformingdataCanbeusedtocheckandassurequalitydataOptionsincludeSAS,SPSS,R,andMatlab(notfree)

SAS:HascomprehensivesupportSPSS:Hasauser-friendlyGUIMatlab:AnalysisandVisualizationplatformthathas“toolboxes”availablefordifferentdisciplines,suchasmodelingorgenomicanalyses

DataManipulation

Page 32: Data Entry, and Manipulation - GitHub Pages · 2020-01-06 · Data Manipulation Options Lesson Topics. Recognize and plan for inconsistencies that can make a dataset difficult to

Free(http://www.r-project.org/index.html)ProducespublicationqualitygraphicsLotsofforumsfromwhichtogethelpSoftware(suchasKeplerfordevelopingworkflows)willintegrateanalyticalcomponentswritteninR

UsingR

Page 33: Data Entry, and Manipulation - GitHub Pages · 2020-01-06 · Data Manipulation Options Lesson Topics. Recognize and plan for inconsistencies that can make a dataset difficult to

Toolssuchas(butnotlimitedto)spreadsheettoolssuchasMSExcelandrelationaldatabases(MSAccess,MySQL,andmore)canprovidestructure,flexibilityandpotentialforworkingmoreeasilywithdatasetsbutalsorequireplanningSelectionofadatabaseorspreadsheettooldependsontherelationshipsbetweenthedata,andhowitwillbeused,aswellasotherconsiderationsre:time,resources,output.

Review:Selectingtoolsfordatastorageanduse

Page 34: Data Entry, and Manipulation - GitHub Pages · 2020-01-06 · Data Manipulation Options Lesson Topics. Recognize and plan for inconsistencies that can make a dataset difficult to

Maintainingprovenance(atrailofcustodyanddecisions)isimportantwhenintegratingmorethanonedatasetDocumentingandunderstandingcontextandrelationships,aswellaschangesiscrucialwhencreatinganewdataset(anytimeyoucombinetwoormoredisparatedatasets)Createatransparent,reproducibleworkflowMakesuretoprovideproperattributionandcitationtoallresources,includingtheoriginaldataset.ToolssuchasR,Matlab,andotherscanbeusefulinestablishingworkflowsandaccessingdatasets

Review:DataIntegration&Manipulation

Page 35: Data Entry, and Manipulation - GitHub Pages · 2020-01-06 · Data Manipulation Options Lesson Topics. Recognize and plan for inconsistencies that can make a dataset difficult to

AboutParticipateinourGitHubrepo:https://dataoneorg.github.io/dataone_lessons/

Suggestedcitation:DataONEEducationModule:DataManagement.DataONE.RetrievedNovember12,2016.Fromhttps://dataoneorg.github.io/dataone_lessons/

Copyrightlicenseinformation:Norightsreserved;youmayenhanceandreuseforyourownpurposes.WedoaskthatyouprovideappropriatecitationandattributiontoDataONE.

Page 36: Data Entry, and Manipulation - GitHub Pages · 2020-01-06 · Data Manipulation Options Lesson Topics. Recognize and plan for inconsistencies that can make a dataset difficult to