data preprocessing part 1 - health...
Post on 13-Sep-2020
HAP 780 Data Mining in Health Care   Copyright © Janusz Wojtusiak, 2016
Data Preprocessing Part 1
Janusz Wojtusiak, PhD, George Mason University
Fall 2016
“The world is full of obvious things which nobody by any chance ever observes.”
- Sherlock Holmes
Why Preprocessing?
• Multiple sources
• Multiple formats
• Multiple representations
• Errors, noise
• Missing values
• Unnecessary attributes
• Non-representative data
• ... and many more!
Two Types of Preprocessing
• Before loading to database/software
– How to get data from multiple sources into a database, data warehouse, or other format on which DM tools can be used.
• After loading to database/software
– This is what is typically covered by data preprocessing: data cleaning, transformation, reduction, discretization, normalization, ...
Sources of Data for Data Mining
• EHR systems
• Billing
• Surveys
• Reports
• Web
• Excel spreadsheets
• Sensors
• Sometimes we mine together data from multiple sources. Simply speaking, we want to be able to mine any data and all available data.
Forms of Data
• One-dimensional
– Signals from sensors (EKG, accelerometer, etc.)
• Two-dimensional
– Images
• Multidimensional
– Flat data tables (attribute-value pairs)
– Relational databases
• Multimedia
Formats of Data
• Structured
– Tables
– Relational databases
– Non-relational/NoSQL databases
– Text files (comma-separated, special formats)
– XML
– Excel files
– SAS data files
– ...
Formats of Data
• Unstructured
– Text files
– Websites
– Text fields in databases/structured data
– Speech
– Multimedia
– ...
Dirty Data
• Noise
• Incompleteness
• Inconsistency
Dirty Data
PTID DOB Age Sex ProvID Dx1 Dx2 Dx3 Dx4 Dx5
1 1/2/70 48 M 345 250.0
2 30 N 010.0 Patient is suffering formberculosis …
473.0
2 1/1/80 33 3456 487
34 487
5 9/8/60 F 327.0 327.2
The following records are imported after January
6 8/8/54 M 320 250.0 487 296.7 361.0 E858
7 Unknown
M John Smith
8 25 F 377 150 151 038.9
How many problems are in this dataset?
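One way to start answering the question above is to run a few mechanical checks. The sketch below is illustrative only: the rows are a hand transcription of four of the records shown, the column placement of blank cells is an assumption (the flattened table does not preserve it), and the helper name `find_problems` and the set of valid sex codes are hypothetical.

```python
from collections import Counter

# Hand-transcribed subset of the records above; blank cells are None.
# Which column each blank belongs to is an assumption.
RECORDS = [
    {"PTID": "1", "DOB": "1/2/70", "Age": "48", "Sex": "M", "ProvID": "345", "Dx1": "250.0"},
    {"PTID": "2", "DOB": None, "Age": "30", "Sex": "N", "ProvID": None, "Dx1": "010.0"},
    {"PTID": "2", "DOB": "1/1/80", "Age": "33", "Sex": None, "ProvID": "3456", "Dx1": "487"},
    {"PTID": "5", "DOB": "9/8/60", "Age": None, "Sex": "F", "ProvID": None, "Dx1": "327.0"},
]

VALID_SEX = {"M", "F"}  # assumed coding scheme


def find_problems(records):
    """Return a list of (row_index, description) for simple, mechanical issues."""
    problems = []
    id_counts = Counter(r["PTID"] for r in records)
    for i, r in enumerate(records):
        # The same patient ID appearing on more than one row.
        if id_counts[r["PTID"]] > 1:
            problems.append((i, "duplicate PTID " + r["PTID"]))
        # Any empty cell.
        for field, value in r.items():
            if value is None:
                problems.append((i, "missing " + field))
        # A value outside the expected code set.
        if r["Sex"] is not None and r["Sex"] not in VALID_SEX:
            problems.append((i, "unexpected Sex code " + r["Sex"]))
    return problems


for row, issue in find_problems(RECORDS):
    print(row, issue)
```

Checks like these catch only mechanical issues. Semantic problems, such as an age that does not match the date of birth or free text in a diagnosis column, need domain-specific rules.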
Dealing with Dirty Data
• Load data to database
– Data types
– Obvious problems in data files
• Data cleaning & transformation
– Inconsistencies, missing values, sampling, attribute selection, discretization, ...
http://www.prosoftsolutions.net/blog/bid/146041/Dirty-Data-What-is-it-how-does-it-cause-problems-and-what-is-the-solution
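As a minimal sketch of two of the cleaning and transformation steps listed above, the following fills in missing ages and then discretizes age into bands. The function names, the choice of median imputation, the cut points, and the sample ages are assumptions made for illustration, not methods prescribed by the slides.

```python
from statistics import median


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def discretize_age(age, cut_points=(18, 40, 65)):
    """Map a numeric age to an interval label using assumed cut points."""
    labels = ["<18", "18-39", "40-64", "65+"]
    for cut, label in zip(cut_points, labels):
        if age < cut:
            return label
    return labels[-1]


ages = [48, 30, 33, None, 25]  # hypothetical values, one missing
filled = impute_median(ages)
bands = [discretize_age(a) for a in filled]
print(filled)
print(bands)
```

Median imputation is only one option; dropping incomplete records or flagging them for manual review may be more appropriate depending on why the value is missing.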
Data Types
• Different names for the same thing
– Field: used in databases
– Attribute: used in data mining and machine learning
– Variable: used in statistics
– Feature: used in machine learning (usually means binary attribute)
• Database Attribute Types
• Analytic Attribute Types
Fundamental Concepts
• Symbol: a physical entity, its state, or its behavior that conveys a choice from a predefined set of choices. The choices may refer to any entities (physical or abstract objects), to their properties, or their actions. The choice indicated by a symbol is called its meaning.
• Data: a recorded set of symbols characterizing a set of entities
• Information: interpreted data; data whose symbols have been assigned meaning
• Knowledge: information that is verified to be true or true to some degree, which can be obtained by direct observation or by inference
• Belief: hypothetical knowledge; knowledge that has not been validated, but is characterized by some measure of its relationship to the reality it describes.
[Diagram: Symbols, Data, Information, Knowledge, Belief]
Fundamental Concepts
• Concept: a set of entities considered as a unit, and typically given a name
• Language: a system of symbols and rules for creating expressions from these symbols for the purpose of communicating information
• Description: an expression in some language that conveys information about a set of entities. The set being described is called the reference set. A concept description describes all entities belonging to the concept (concept instances).
• Generalization: a process of extending the reference set of a description, or its result
• Abstraction: a process of reducing information about a reference set, or its result
Database Attribute Types
• System specific
• For example, in SQL Server 2012:
Numeric and Date
Strings and Other
Analytic Data Types
• Symbolic
– Symbols used to represent entities
• Numeric
– Numbers, usually used for calculations
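The distinction matters in practice because some symbolic values look like numbers. The sketch below uses diagnosis codes of the kind shown in the dirty-data example and illustrates what is lost when a code is parsed as a number. Treating these values as code strings is an assumption about the data, and `as_number` is a hypothetical helper.

```python
codes = ["250.0", "010.0", "487", "E858"]


def as_number(text):
    """Try to parse a value as a float; return None when it is not numeric."""
    try:
        return float(text)
    except ValueError:
        return None


for code in codes:
    number = as_number(code)
    if number is None:
        print(code, "-> not numeric at all")
    else:
        # str(number) shows what would come back if the column were numeric.
        print(code, "->", number, "| round-trips:", str(number) == code)
```

Even when the parse succeeds, arithmetic on such codes (an average, for instance) would be meaningless, which is why they belong with symbolic attributes.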
Analytic Attribute Types
Extract, Transform, Load
• ETL is almost always used in the context of data warehouses, but is also applied to data mining
– Extract data from external sources (often many)
– Transform into uniform representation
– Load into the target system (DW, DM)
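The three steps can be sketched as follows, assuming two hypothetical sources that encode dates and sex differently. The source contents, column names, and the table name `patient` are invented for illustration, and an in-memory SQLite database stands in for the target system.

```python
import csv
import io
import sqlite3
from datetime import datetime

# Hypothetical extracts from two sources with different representations.
EMR_CSV = "ptid,dob,sex\n1,1970-01-02,M\n5,1960-09-08,F\n"
BILLING_CSV = "patient_id;birth_date;gender\n6;08/08/1954;male\n"


def extract(text, delimiter):
    """Extract: read rows from a delimited text source."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


def transform_billing(row):
    """Transform: map the billing layout onto the EMR layout."""
    dob = datetime.strptime(row["birth_date"], "%m/%d/%Y").date().isoformat()
    sex = {"male": "M", "female": "F"}[row["gender"]]
    return {"ptid": row["patient_id"], "dob": dob, "sex": sex}


def load(conn, rows):
    """Load: insert uniform rows into the target table."""
    conn.execute("CREATE TABLE IF NOT EXISTS patient (ptid INTEGER, dob TEXT, sex TEXT)")
    conn.executemany("INSERT INTO patient VALUES (:ptid, :dob, :sex)", rows)


conn = sqlite3.connect(":memory:")
rows = extract(EMR_CSV, ",") + [transform_billing(r) for r in extract(BILLING_CSV, ";")]
load(conn, rows)
result = conn.execute("SELECT ptid, dob, sex FROM patient ORDER BY ptid").fetchall()
print(result)
```

Keeping the transform step as a separate function per source makes it easier to add another source later without touching the load logic.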
ETL in Context
[Diagram: EMR, Rx, Billing, PACS, Flat Files -> Extract, Transform, Load -> Data Warehouse -> Reporting, Data Mining, Analysis]
ETL in Context
[Diagram: EMR, Rx, Billing, PACS, Flat Files -> Extract, Transform, Load -> Flat file ready for Data Mining -> Data Mining]
Tools to Have
• File viewer
• Text file editor: EditPad Pro, Notepad++
– Not a word processor!
• Processing very large text files
– awk, sed, grep, ...
• File converters, built into software ... or not
– lots of free ones ...
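Where command-line tools such as grep are not available, the same streaming idea can be sketched in Python: read one line at a time so the file never has to fit in memory. The function name `grep_lines` and the sample file contents are hypothetical.

```python
import os
import re
import tempfile


def grep_lines(path, pattern):
    """Yield (line_number, line) for lines matching a regular expression.

    The file is read lazily, one line at a time.
    """
    regex = re.compile(pattern)
    with open(path, encoding="utf-8", errors="replace") as handle:
        for number, line in enumerate(handle, start=1):
            if regex.search(line):
                yield number, line.rstrip("\n")


# Small demonstration file; a real input could be many gigabytes.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("1,250.0\n2,487\n3,E858\n4,487\n")
    sample_path = tmp.name

matches = list(grep_lines(sample_path, r",487$"))
print(matches)
os.remove(sample_path)
```

Because `grep_lines` is a generator, a caller can stop after the first few matches without reading the rest of the file.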
HAP 780
Janusz Wojtusiak, PhD, George Mason University
jwojtusi@gmu.edu