CompSci 516: Data Intensive Computing Systems
Lecture 1: Introduction and Data Models
Instructor: Sudeepa Roy
Duke CS, Fall 2016
Course Website
• http://www.cs.duke.edu/courses/fall16/compsci516/
• Please check frequently for updates
Instructor
• Sudeepa Roy
  – sudeepa@cs.duke.edu
  – https://users.cs.duke.edu/~sudeepa/
  – office hour: Mondays 1:30-2:30 pm, LSRC D325
• About myself
  – Assistant Professor in CS
  – PhD: UPenn, Postdoc: Univ. of Washington
  – Joined Duke CS in Fall 2015
  – Research interests:
    • Databases (theory and applications)
    • Data analysis, causality, explaining answers
    • Uncertain data, data provenance, crowdsourcing
TA
• Junghoon Kang
  – jungkang@cs.duke.edu
  – office hour: Thursdays 1-2 pm, North N303B
• No office hour this week and on 09/05, Mon (Labor Day)
• Additional office hour by Jung next week, Tuesday 1-2 pm, North N303B
Logistics
• Homework submission: Sakai
  – All enrolled students are already there
• Discussion forum: Piazza
  – All enrolled students are already there
  – Send me an email if you have not received a welcome email from Piazza
• Lecture slides will be uploaded before the class
  – but will be updated after the class
Grading
• Three homeworks: 30%
• Project: 20%
• Midterm: 20%
• Final: 30%
Grading Strategy
• Relative grading
  – The top student in the class gets an A+ irrespective of the numeric score, and all and only "above expectation" performances get an A+
  – No lowest grade or fixed distribution, so everyone can get very good grades by working hard!
  – The actual grade distribution at the end will depend on the performance of the entire class on all the components
  – If you are a borderline case between two grades, your class participation and effort throughout the semester will put you in the higher grade
• Actively participate in the class!
• Ask questions in class and on Piazza
• Answer each other's questions on Piazza
• Send (anonymous or not) feedback, suggestions, or concerns on Piazza
Homework
• Due ~3 weeks after they are posted or after the previous hw is due
  – 2 weeks should be enough
  – Start early
• No late days
  – Contact the instructor if you have a *valid* reason to be late
  – Another exam, project, or hw is not a valid reason; we will always be fair to all
  – Computer crash / sudden interview trips / medical issues (following official procedures) may count as valid reasons
  – No guarantee that your request will be granted; again, start early!
• To be done individually
Homework Overview
• You will learn how to use traditional and new database systems in the homework
  – Have to learn them mostly on your own, following tutorials available online and with some help from the TA
• HW1 and HW2 are already on Sakai! Have fun!
  – Work on them at your own pace
• HW1: Traditional DBMS
  – SQL and Postgres
  – Due on 09/16 (Fri)
• HW2: Distributed data processing
  – Spark and AWS
  – Will receive instructions for AWS (second part of the hw)
  – Due on 10/12 (Wed)
• HW3: NoSQL
  – e.g. MongoDB or DynamoDB
  – Will be posted later (after Transactions)
Exams
• Midterm: Oct 5 (Wed)
• Final: TBD (by univ schedule)
• In class
• Closed book, closed notes, no electronic devices
• Total weight: 20% + 30% = 50%
• Exams will test your understanding of the material
Projects
• 20% weight
• In groups of at most 3
  – You can look for group members through Piazza by announcing your general area of interest, or a problem if you have one in mind
  – Each group member should do approx. equal work
• Work done should be (at least) equivalent to two HWs, i.e.
  – the work for one hw × 2 × #group members
• Take it very seriously!
  – Show your creativity and researcher side
• There is an ACM SIGMOD Student Research Competition
  – With publication and $ as reward, and winners go to the ACM-wide competition
  – Separate categories for undergrads and grads
  – Deadline: November 18, 2016 (just the abstract)
  – http://sigmod2017.org/student-research-competition/
Project Topics
• Anything related to "Data"
  – Data management / processing / cleaning
  – Data visualization
  – Data exploration or analysis
  – Applications of data (to any field)
  – Theoretical findings with data
  – New tool for data analysis
• Choose a project according to your research interest
• You can check out major database conferences for ideas, e.g.
  – Demonstrations (build a prototype solving a problem or improving UI)
    • SIGMOD'16: http://sigmod2016.org/sigmod_demo_list.shtml
    • VLDB'16: http://vldb2016.persistent.com/demonstrations.php
  – Research papers (solve a problem, do experiments with data)
    • SIGMOD'16: http://sigmod2016.org/sigmod_research_list.shtml
    • VLDB'16: http://www.vldb.org/pvldb/vol9.html
  – You can check out previous years too, and conferences from your own research area
Project Deliverables
1. Project proposal (due: 9/21, 1-3 pages)
  – Problem selection is part of the project
  – 3 weeks from now, but start asap: look for problems, do a related work study, find an interesting question, do a few iterations with me (try to meet me once), all by the deadline
2. Midterm progress report (due: 10/21, 3-5 pages)
3. Final project report (due: 11/28, 4-8 pages)
4. A final ~10 min project presentation and/or demonstration (in the last 1-2 classes)
• The same document will be updated
  – Template and high-level ideas will be posted on Sakai soon
Project Evaluation Criteria
Scale of 100:
1. Well-motivated? 10
2. Novel? 10
3. Comprehensive related work survey? 10
4. Quality of writing? 10
  – should reflect all other factors too, except class presentation
5. Class presentation/demo? 15
  – should reflect all other factors too, except writing
6. Technical contributions? 45
  – Problem formulation / Algorithms / Experiments / Theory / System / User interface / Efficiency / Usability / Dataset exploration etc.
Reading Material
• Will mostly follow the "cow book" by Ramakrishnan-Gehrke
  – The chapter numbers will be posted
• You do not have to buy the books, but it will be good to consult them from time to time
• You should be prepared to do quite a bit of reading from various books and papers
What is this course about?
• This is a graduate-level database course in CS
• We will cover principles, internals, and applications of database systems in depth
• We will also have an introduction to a few advanced research topics in databases (later in the course)
A Quick Survey
• Have you taken an undergrad database course earlier?
  – CS 316 / equivalent?
• Are you familiar with
  – SQL?
  – RA? (σ, π, ×, ⨝, ρ, ∪, ∩, −)
  – Keys, foreign keys?
  – Index in databases?
  – Logic: ∧, ∨, ∀, ∃, ¬, ∈, ⇒
  – Transactions?
  – Map-reduce / Spark?
• Have you ever worked with a dataset?
  – relational database, text, csv, XML
• Have you ever used a database system?
  – Postgres, MySQL, SQL Server, SQL Azure
What will be covered?
• Database concepts
  – Data Models, SQL, Views, Constraints, RA, Normalization
• Principles and internals of database management systems (DBMS)
  – Indexing, Query Execution-Algorithms-Optimization, Transactions, Parallel and Distributed Query Processing, MapReduce
• Advanced and research topics in databases
  – e.g. Datalog, NoSQL, Data mining, Data warehouse
  – More will be added in the "TBD" lectures
• We will go fast for some basic topics in databases
  – Data model, SQL, RA
Background
• You should have some understanding (at the CS undergraduate level) of
  – data structures, discrete maths, algorithms
  – databases
  – or have to learn these yourself as necessary
• Need to pick up new coding frameworks and programming languages on your own
  – and how to process data using them
  – Homework assignments will mostly be self-taught
  – … with help from the TA
• Will involve some mathematical and analytical reasoning too
Why should we care about databases?
• We are in a data-driven world
• "Big Data" is supposed to change the mode of operation for almost every single field
  – Science, Technology, Healthcare, Business, Manufacturing, Journalism, Government, Education, …
• We must know how to collect, store, process, and analyze such data
Why should we care about databases?
Science
• From the "Big Data" wiki: "The Large Hadron Collider experiments represent about 150 million sensors delivering data 40 million times per second. There are nearly 600 million collisions per second. If all sensor data were recorded in LHC, …. this is equivalent to 500 quintillion (5×10^20) bytes per day, almost 200 times more than all the other sources combined in the world."
Why should we care about databases?
Technology
• From the "Big Data" wiki:
  – eBay.com uses two data warehouses at 7.5 PB and 40 PB (1 PB = 10^15 bytes), as well as a 40 PB Hadoop cluster for search, consumer recommendations, and merchandising
  – Facebook handles 50 billion photos from its user base
  – As of August 2012, Google was handling roughly 100 billion searches per month
Why should we care about databases?
Healthcare, Media, Manufacturing, Sports, …
• From the "Big Data" wiki:
  – Healthcare: digitization of patients' data, prescriptive analytics
  – Media: tailor articles and advertisements that reach targeted people, validate claims
    • "Computational Journalism" project in the Duke DB group
  – Manufacturing: supply planning
  – Sports: improve training, understand competitors
Why should we care about databases?
• Simply storing such large datasets in a flat file stops working at some point
  – Need an efficient model, storage, and processing
• A DBMS takes care of such issues
  – the user only has to run queries to process such datasets
  – much simpler than writing low-level code
Today
• DBMS
• Data Models
• [RG] 1.1, 1.3-1.5
What is a Database?
• A database is a collection of data
  – typically related and describing activities of an organization
• A database may contain information about
  – Entities
    • students, faculty, courses, classrooms
  – Relationships between entities
    • students' enrollment, faculty teaching courses, rooms for courses
Why use a DBMS?
• i.e. why not use a file system and a programming language?
• Suppose a company has a large collection of data on employees, departments, products, sales etc.
• Requirements:
  – Quickly answer questions on data
    • Note that all the data may not fit in main memory
  – Concurrent access: apply changes consistently
  – Restricted access (e.g. salary)
Why use a DBMS?
• A DBMS is a piece of software (i.e. a big program written by someone else) that makes these tasks easier
  – Quick access
  – Robust access
  – Safe access
  – Simpler access
• Next: some nice properties of a DBMS
Why use a DBMS?
1. Data Independence
  – Application programs should not be exposed to the data representation and storage
  – DBMS provides an abstract view of the data
2. Efficient Data Access
  – A DBMS utilizes a variety of sophisticated techniques to store and retrieve data (from disk) efficiently
Why use a DBMS?
3. Data Integrity and Security
  – DBMS enforces "integrity constraints", e.g. check whether total salary is less than the budget
  – DBMS enforces "access controls", e.g. whether salary information can be accessed by a particular user
4. Data Administration
  – Centralized professional data administration by experienced users can manage data access, organize data representation to minimize redundancy, and fine-tune the storage
Why use a DBMS?
5. Concurrent Access and Crash Recovery
  – DBMS schedules concurrent accesses to the data such that the users think that the data is being accessed by only one user at a time
  – DBMS protects data from system failures
6. Reduced Application Development Time
  – Supports many functions that are common to a number of applications accessing data
  – Provides a high-level interface
  – Facilitates quick and robust application development
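One small piece of point 5 can be sketched with Python's built-in sqlite3 module: a group of changes is applied all-or-nothing. This is only an illustration under stated assumptions. The Accounts table and its values are hypothetical, and the example triggers a constraint violation rather than a real crash or concurrent users.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table; balances are never allowed to go negative.
conn.execute(
    "CREATE TABLE Accounts (name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO Accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 50)])
conn.commit()

try:
    # "with conn" commits if the block succeeds and rolls back if it raises.
    with conn:
        conn.execute(
            "UPDATE Accounts SET balance = balance + 200 WHERE name = 'bob'")
        # This second step violates the CHECK constraint and raises.
        conn.execute(
            "UPDATE Accounts SET balance = balance - 200 WHERE name = 'alice'")
except sqlite3.IntegrityError:
    pass

# The first update was undone together with the failed one.
balances = dict(conn.execute("SELECT name, balance FROM Accounts"))
```

After the failed transfer both balances are unchanged, so the application never sees a half-applied update.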
When NOT to use a DBMS?
• A DBMS is optimized for certain kinds of workloads and manipulations
• There may be applications with tight real-time constraints or a few well-defined critical operations
• The abstract view of the data provided by the DBMS may not suffice
• To run complex statistical/ML analytics on large datasets
Data Model
• Applications need to model some real-world units
• Entities:
  – Students, Departments, Courses, Faculty, Organization, Employee, …
• Relationships:
  – Course enrollments by students, product sales by an organization
• A data model is a collection of high-level data description constructs that hide many low-level storage details
Data Model Can Specify:
1. Structure of the data
  – like arrays or structs in a programming language
  – but at a higher level (conceptual model)
2. Operations on the data
  – unlike a programming language, not any operation can be performed
  – allow limited sets of queries and modifications
  – a strength, not a weakness!
3. Constraints on the data
  – what the data can be
  – e.g. a movie has exactly one title
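The three ingredients above can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, not part of the lecture: the Movies table, its columns, and the sample row are assumptions chosen to mirror the "a movie has exactly one title" example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# 1. Structure: a relation with named, typed columns.
# 3. Constraints: one row per movie id, and the title must be present.
conn.execute("""
    CREATE TABLE Movies (
        mid   INTEGER PRIMARY KEY,
        title TEXT NOT NULL,
        year  INTEGER
    )
""")

# 2. Operations: a limited, declarative set (insert, update, delete, query).
conn.execute("INSERT INTO Movies VALUES (1, 'Casablanca', 1942)")

# A movie without a title violates the constraint and is rejected.
try:
    conn.execute("INSERT INTO Movies VALUES (2, NULL, 1950)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

titles = [row[0] for row in conn.execute("SELECT title FROM Movies")]
```

The constraint is checked by the database system itself, so every application using the table gets the same guarantee without re-implementing the check.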
Important Data Models
• Structured Data
• Semi-structured Data
• Unstructured Data
What are these?
Important Data Models
• Structured Data
  – All elements have a fixed format
  – Relational Model (table)
• Semi-structured Data
  – Some structure but not fixed
  – Hierarchically nested tagged elements in a tree structure
  – XML
• Unstructured Data
  – No structure
  – text, image, audio, video
Relational Data Model
• Proposed by Edgar (Ted) Codd in 1970
  – won the Turing Award for it!
• Motivation:
  – Simplicity
  – Better logical and physical data independence
Relational Data Model
• The data description construct is a Relation
  – Represented as a "table"
  – Basically a "set" of records (set semantic)
    • order does not matter
    • and all records are distinct
• However, this is true for the relational model, not for standard DBMSs
  – they allow duplicate rows (bag semantic)
  – unless restricted by key constraints. Why?

Students
sid    name     login          age  gpa
53666  Jones    jones@cs       18   3.4
53688  Smith    smith@ee       18   3.2
53650  Smith    smith1@math    19   3.8
53831  Madayan  madayan@music  11   1.8
53832  Guldu    guldu@music    12   2.0
Bag vs. Set
• Bag: {1, 1, 2, 2, 3, 2, 1, 5, 6, 1}   Set: {1, 2, 3, 5, 6}
• Why "bag semantic" and not "set semantic" in standard DBMSs?
  – Primarily performance reasons
  – Duplicate elimination is expensive (typically requires sorting or hashing)
  – Some operations like projection are much more efficient on bags than on sets
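The difference can be seen directly in a small sketch using Python's collections.Counter and the built-in sqlite3 module (both choices are assumptions for illustration, as is loading the Students rows from the earlier slide). Projecting the name column keeps duplicates by default, and only DISTINCT pays for duplicate elimination.

```python
import sqlite3
from collections import Counter

# The bag and set from the slide.
bag = [1, 1, 2, 2, 3, 2, 1, 5, 6, 1]
as_bag = Counter(bag)   # keeps multiplicities
as_set = set(bag)       # drops duplicates

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Students (sid TEXT, name TEXT, login TEXT, "
    "age INTEGER, gpa REAL)"
)
conn.executemany("INSERT INTO Students VALUES (?, ?, ?, ?, ?)", [
    ("53666", "Jones", "jones@cs", 18, 3.4),
    ("53688", "Smith", "smith@ee", 18, 3.2),
    ("53650", "Smith", "smith1@math", 19, 3.8),
    ("53831", "Madayan", "madayan@music", 11, 1.8),
    ("53832", "Guldu", "guldu@music", 12, 2.0),
])

# Projection under bag semantics: 'Smith' appears twice.
bag_names = [r[0] for r in conn.execute(
    "SELECT name FROM Students ORDER BY name")]
# Projection under set semantics: duplicates must be eliminated.
set_names = [r[0] for r in conn.execute(
    "SELECT DISTINCT name FROM Students ORDER BY name")]
```

The ORDER BY is there only to make the output deterministic; it is not part of the bag-versus-set distinction.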
Relational Data Model

Students
sid    name     login          age  gpa
53666  Jones    jones@cs       18   3.4
53688  Smith    smith@ee       18   3.2
53650  Smith    smith1@math    19   3.8
53831  Madayan  madayan@music  11   1.8
53832  Guldu    guldu@music    12   2.0

• Attribute / Column / Field: one column of the table
• Tuple / Row / Record: one row of the table
• Value: a single entry in a row
What is a poorly chosen attribute in this relation?
• Relational database = a set of relations
• A Relation: made up of two parts
  1. Schema
  2. Instance
Schema and Instance
• One schema can have multiple instances
• Schema:
  – A template for describing an entity/relationship (e.g. students)
  – specifies the name of the relation + the name and type of each column
  – e.g. Students(sid: string, name: string, login: string, age: integer, gpa: real)
• Instance:
  – When we fill in actual data values in a schema
  – a table, has rows and columns
  – each row/tuple follows the schema and domain constraints
  – #Rows = cardinality, #fields = degree/arity
  – example below

Cardinality = 3, degree = 5
sid    name   login        age  gpa
53666  Jones  jones@cs     18   3.4
53688  Smith  smith@ee     18   3.2
53650  Smith  smith1@math  19   3.8
Relational Database: Definitions
• Relational database = a set of relations
• Relations: made up of 2 parts:
  – Schema
    • A template for describing an entity/relationship (e.g. students)
    • specifies the name of the relation + the name and type of each column
    • Students(sid: string, name: string, login: string, age: integer, gpa: real)
  – Instance: a table, has rows and columns
    • each row/tuple follows the schema and domain constraints
    • #Rows = cardinality, #fields = degree/arity
• Can think of a relation as a set of rows or tuples, i.e., all rows are distinct
  – however, this is true for the relational model, not for standard DBMSs, which allow duplicate rows. Why?
• For the three-row instance of Students: cardinality = 3, degree = 5, all rows distinct
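As a minimal sketch of these definitions, the schema and the three-row instance can be created with Python's built-in sqlite3 module (an assumption for illustration; the SQL type names TEXT, INTEGER, and REAL stand in for string, integer, and real). The degree is read off the schema and the cardinality off the instance.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Schema: relation name plus the name and type of each column.
conn.execute(
    "CREATE TABLE Students (sid TEXT, name TEXT, login TEXT, "
    "age INTEGER, gpa REAL)"
)

# Instance: actual rows that follow the schema.
conn.executemany("INSERT INTO Students VALUES (?, ?, ?, ?, ?)", [
    ("53666", "Jones", "jones@cs", 18, 3.4),
    ("53688", "Smith", "smith@ee", 18, 3.2),
    ("53650", "Smith", "smith1@math", 19, 3.8),
])

cur = conn.execute("SELECT * FROM Students")
columns = [d[0] for d in cur.description]  # column names from the schema
rows = cur.fetchall()

degree = len(columns)                      # number of fields
cardinality = len(rows)                    # number of rows
all_distinct = len(set(rows)) == len(rows)
```

Inserting or deleting rows changes the instance and its cardinality, while the schema and degree stay the same.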
Levels of Abstractions in a DBMS
• Physical schema
  – Storage as files, row vs. column store, indexes
  – will discuss these in later lectures
[Figure: layers from bottom to top: Disk, Physical Schema, Logical Schema, and three External Schemas]
Levels of Abstractions in a DBMS
• Logical/Conceptual schema
  – describes the stored data in the physical schema
  – e.g. Students(sid: string, name: string, login: string, age: integer, gpa: real)
• Decided by conceptual schema design
  – e.g. ER Diagram
    • not covered in this course
  – Normalization
    • will be covered
Levels of Abstractions in a DBMS
• External schema
  – different "views" of the database to different users
  – will discuss views later
• One physical and one logical schema, but there can be multiple external schemas
Data Independence
• Application programs are insulated from changes in the way the data is structured and stored
• A very important property of a DBMS
• Logical and Physical
Logical Data Independence
• Users can be shielded from changes in the logical structure of data
• e.g. Students:
  Students(sid: string, name: string, login: string, age: integer, gpa: real)
• Divide into two relations:
  Students_public(sid: string, name: string, login: string)
  Students_private(sid: string, age: integer, gpa: real)
• Still, a "view" Students can be obtained using the above new relations
  – by "joining" them on sid
• A user who queries this view Students will get the same answer as before
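The split-and-view idea can be sketched with Python's built-in sqlite3 module (an assumption for illustration, with the three sample rows taken from the earlier Students instance). The base relation is stored as two tables, and a view named Students joins them on sid so that old queries still work.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Students_public  (sid TEXT PRIMARY KEY, name TEXT, login TEXT);
    CREATE TABLE Students_private (sid TEXT PRIMARY KEY, age INTEGER, gpa REAL);

    -- The view reconstructs the original relation by joining on sid.
    CREATE VIEW Students AS
        SELECT p.sid, p.name, p.login, q.age, q.gpa
        FROM Students_public p
        JOIN Students_private q ON p.sid = q.sid;
""")
conn.executemany("INSERT INTO Students_public VALUES (?, ?, ?)", [
    ("53666", "Jones", "jones@cs"),
    ("53688", "Smith", "smith@ee"),
    ("53650", "Smith", "smith1@math"),
])
conn.executemany("INSERT INTO Students_private VALUES (?, ?, ?)", [
    ("53666", 18, 3.4),
    ("53688", 18, 3.2),
    ("53650", 19, 3.8),
])

# A query written against the original Students schema runs unchanged.
rows = conn.execute(
    "SELECT name, gpa FROM Students WHERE gpa > 3.3 ORDER BY sid"
).fetchall()
```

The query text never mentions the two underlying tables, which is the sense in which the user is shielded from the logical change.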
Physical Data Independence
• The logical/conceptual schema insulates users from changes in physical storage details
  – how the data is stored on disk
  – the file structure
  – the choice of indexes
• The application remains unaltered
  – But the performance may be affected by such changes
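A small sketch with Python's built-in sqlite3 module (an assumption for illustration; the index name idx_students_age is hypothetical) shows the point: adding an index is a physical change, and the same query text returns the same answer before and after.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Students (sid TEXT, name TEXT, login TEXT, "
    "age INTEGER, gpa REAL)"
)
conn.executemany("INSERT INTO Students VALUES (?, ?, ?, ?, ?)", [
    ("53666", "Jones", "jones@cs", 18, 3.4),
    ("53688", "Smith", "smith@ee", 18, 3.2),
    ("53650", "Smith", "smith1@math", 19, 3.8),
])

query = "SELECT name FROM Students WHERE age = 18 ORDER BY sid"

before = conn.execute(query).fetchall()

# Physical change only: the logical schema and the query are untouched.
conn.execute("CREATE INDEX idx_students_age ON Students(age)")

after = conn.execute(query).fetchall()
```

On a table this small the index makes no measurable difference; on large tables it can change how fast the query runs, but never what it returns.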
Semi-structured Data and XML
• XML: Extensible Markup Language
• Will not be covered in detail in class, but many datasets available to download are in this form
  – You will download the DBLP dataset in XML format and transform it into relational form (in HW1)
• Data does not have a fixed schema
  – "Attributes" are part of the data
  – The data is "self-describing"
  – Tree-structured
XML: Example
<article mdate="2011-01-11" key="journals/acta/Saxena96">
  <author>Sanjeev Saxena</author>
  <title>Parallel Integer Sorting and Simulation Amongst CRCW Models.</title>
  <pages>607-619</pages>
  <year>1996</year>
  <volume>33</volume>
  <journal>Acta Inf.</journal>
  <number>7</number>
  <url>db/journals/acta/acta33.html#Saxena96</url>
  <ee>http://dx.doi.org/10.1007/BF03036466</ee>
</article>
• Attributes: mdate, key (inside the opening tag)
• Elements: article, author, title, pages, year, volume, journal, number, url, ee
Attribute vs. Elements
• Elements can be repeated and nested
• Attributes are unique and atomic
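The distinction shows up directly when the record is parsed. The sketch below uses Python's standard xml.etree.ElementTree module (an assumption for illustration) on a shortened copy of the DBLP article from the example slide.

```python
import xml.etree.ElementTree as ET

doc = """<article mdate="2011-01-11" key="journals/acta/Saxena96">
  <author>Sanjeev Saxena</author>
  <title>Parallel Integer Sorting and Simulation Amongst CRCW Models.</title>
  <pages>607-619</pages>
  <year>1996</year>
  <volume>33</volume>
  <journal>Acta Inf.</journal>
  <number>7</number>
</article>"""

article = ET.fromstring(doc)

# Attributes: a flat mapping from name to a single string value.
attributes = article.attrib

# Elements: child nodes, which could repeat or contain further children.
element_tags = [child.tag for child in article]
year = article.findtext("year")
```

Because attributes form a mapping, each name can occur only once per element, whereas the list of child tags may contain the same tag several times.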
Why XML?
+ Serves as a model suitable for integration of databases containing similar data with different schemas
  – e.g. try to integrate two student databases: S1(sid, name, gpa) and S2(sid, dept, year)
  – Many nulls if done in the relational model, very easy in XML
+ Flexible: easy to change the schema and data
- Makes query processing more difficult
Which one is easier?
• XML (semi-structured) to relational (structured), or
• relational (structured) to XML (semi-structured)?
XML to Relational Model
• Problem 1: Repeated attributes
<book>
  <author>Ramakrishnan</author>
  <author>Gehrke</author>
  <title>Database Management Systems</title>
  <publisher>McGraw Hill</publisher>
</book>
What is a good relational schema?
XML to Relational Model
• Problem 1: Repeated attributes
• First attempt for the two-author book: one column per author
Title  Publisher  Author1  Author2
XML to Relational Model
• Problem 1: Repeated attributes
<book>
  <author>Garcia-Molina</author>
  <author>Ullman</author>
  <author>Widom</author>
  <title>Database Systems – The Complete Book</title>
  <publisher>Prentice Hall</publisher>
</book>
Title  Publisher  Author1  Author2
Does not work: this book has three authors, but the schema has only two author columns
XML to Relational Model

Book
BookId  Title                                 Publisher
b1      Database Management Systems           McGraw Hill
b2      Database Systems – The Complete Book  Prentice Hall

BookAuthoredBy
BookId  Author
b1      Ramakrishnan
b1      Gehrke
b2      Garcia-Molina
b2      Ullman
b2      Widom
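The two-table design can be produced mechanically from the XML. The sketch below uses Python's standard xml.etree.ElementTree module; the wrapping <books> root element and the generated ids b1, b2 are assumptions made so that the two records form one well-formed document.

```python
import xml.etree.ElementTree as ET

doc = """<books>
  <book>
    <author>Ramakrishnan</author>
    <author>Gehrke</author>
    <title>Database Management Systems</title>
    <publisher>McGraw Hill</publisher>
  </book>
  <book>
    <author>Garcia-Molina</author>
    <author>Ullman</author>
    <author>Widom</author>
    <title>Database Systems – The Complete Book</title>
    <publisher>Prentice Hall</publisher>
  </book>
</books>"""

book_rows = []          # rows of Book(BookId, Title, Publisher)
authored_by_rows = []   # rows of BookAuthoredBy(BookId, Author)

for i, book in enumerate(ET.fromstring(doc).findall("book"), start=1):
    book_id = f"b{i}"
    book_rows.append(
        (book_id, book.findtext("title"), book.findtext("publisher")))
    # One row per repeated <author> element, however many there are.
    for author in book.findall("author"):
        authored_by_rows.append((book_id, author.text))
```

Moving the repeated element into its own relation keyed by BookId means a book with any number of authors fits without changing the schema.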
XML to Relational Model
• Problem 2: Missing attributes
<book>
  <author>Ramakrishnan</author>
  <author>Gehrke</author>
  <title>Database Management Systems</title>
  <publisher>McGraw Hill</publisher>
  <edition>Third</edition>
</book>
<book>
  <author>Garcia-Molina</author>
  <author>Ullman</author>
  <author>Widom</author>
  <title>Database Systems – The Complete Book</title>
  <publisher>Prentice Hall</publisher>
</book>

BookId  Title                                 Publisher      Edition
b1      Database Management Systems           McGraw Hill    Third
b2      Database Systems – The Complete Book  Prentice Hall  null
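A minimal sketch of how the missing element becomes a null, using Python's standard xml.etree.ElementTree and sqlite3 modules (the <books> wrapper and the Book table definition are assumptions for illustration). findtext returns None for an absent element, and sqlite3 stores None as SQL NULL.

```python
import sqlite3
import xml.etree.ElementTree as ET

doc = """<books>
  <book>
    <title>Database Management Systems</title>
    <publisher>McGraw Hill</publisher>
    <edition>Third</edition>
  </book>
  <book>
    <title>Database Systems – The Complete Book</title>
    <publisher>Prentice Hall</publisher>
  </book>
</books>"""

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Book (BookId TEXT PRIMARY KEY, Title TEXT, "
    "Publisher TEXT, Edition TEXT)"
)

for i, book in enumerate(ET.fromstring(doc).findall("book"), start=1):
    conn.execute(
        "INSERT INTO Book VALUES (?, ?, ?, ?)",
        (f"b{i}",
         book.findtext("title"),
         book.findtext("publisher"),
         book.findtext("edition")),   # None when <edition> is absent
    )

editions = conn.execute(
    "SELECT BookId, Edition FROM Book ORDER BY BookId").fetchall()
missing = conn.execute(
    "SELECT BookId FROM Book WHERE Edition IS NULL").fetchall()
```

Every optional element adds a column that is null for most rows, which is the cost the fixed relational schema pays for data that XML simply omits.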
Summary
• The relational data model is the most standard model for database management
  – the semi-structured model / XML is also used in practice
  – you will use them in hw assignments
  – unstructured data (text/photo/video) is unavoidable, but won't be covered in this class
• A DBMS provides data independence and insulates the application programmer from many low-level details
• We will learn about those low-level details as well as high-level data management in this course
Very important
Understand the Course Policy
See "what is allowed / not allowed"
You will be reminded in every hw assignment too