unit 5 networks

Upload: sujy-cau

Post on 04-Jun-2018

214 views

Category:

Documents


0 download

TRANSCRIPT

  • 8/13/2019 Unit 5 networks

    1/67

    Database System Concepts, 6 th Ed . Silberschatz, Korth and Sudarshan

    See www.db-book.com for conditions on re-use

    UNIT- 5

    http://www.db-book.com/http://www.db-book.com/http://www.db-book.com/http://www.db-book.com/
  • 8/13/2019 Unit 5 networks

    2/67

    Silberschatz, Korth and Sudarshan22.2Database System Concepts - 6 th Edition

    Centralized Systems

    Run on a single computer system and do not interact with other

    computer systems.General-purpose computer system: one to a few CPUs and a numberof device controllers that are connected through a common bus thatprovides access to shared memory.

    Single-user system (e.g., personal computer or workstation): desk-top

    unit, single user, usually has only one CPU and one or two harddisks; the OS may support only one user.

    Multi-user system: more disks, more memory, multiple CPUs, and amulti-user OS. Serve a large number of users who are connected tothe system vie terminals. Often called server systems.

  • 8/13/2019 Unit 5 networks

    3/67

    Silberschatz, Korth and Sudarshan22.3Database System Concepts - 6 th Edition

    A Centralized Computer System

  • 8/13/2019 Unit 5 networks

    4/67

    Silberschatz, Korth and Sudarshan22.4Database System Concepts - 6 th Edition

    Client-Server Systems

    Server systems satisfy requests generated at m client systems, whose

    general structure is shown below:

  • 8/13/2019 Unit 5 networks

    5/67

    Silberschatz, Korth and Sudarshan22.5Database System Concepts - 6 th Edition

    Client-Server Systems (Cont.)

    Database functionality can be divided into:

    Back-end : manages access structures, query evaluation andoptimization, concurrency control and recovery.

    Front-end : consists of tools such as forms , report-writers , andgraphical user interface facilities.

    The interface between the front-end and the back-end is through SQL orthrough an application program interface.

  • 8/13/2019 Unit 5 networks

    6/67

    Silberschatz, Korth and Sudarshan22.6Database System Concepts - 6 th Edition

    Client-Server Systems (Cont.)

    Advantages of replacing mainframes with networks of workstations or

    personal computers connected to back-end server machines:better functionality for the cost

    flexibility in locating resources and expanding facilities

    better user interfaces

    easier maintenance

  • 8/13/2019 Unit 5 networks

    7/67 Silberschatz, Korth and Sudarshan22.7Database System Concepts - 6 th Edition

    Server System Architecture

    Server systems can be broadly categorized into two kinds:

    transaction servers which are widely used in relational databasesystems, and

    data servers , used in object-oriented database systems

  • 8/13/2019 Unit 5 networks

    8/67 Silberschatz, Korth and Sudarshan22.8Database System Concepts - 6 th Edition

    Transaction Servers

    Also called query server systems or SQL server systems

    Clients send requests to the serverTransactions are executed at the server

    Results are shipped back to the client.

    Requests are specified in SQL, and communicated to the serverthrough a remote procedure call (RPC) mechanism.

    Transactional RPC allows many RPC calls to form a transaction.

    Open Database Connectivity (ODBC) is a C language applicationprogram interface standard from Microsoft for connecting to a server,sending SQL requests, and receiving results.

    JDBC standard is similar to ODBC, for Java

  • 8/13/2019 Unit 5 networks

    9/67 Silberschatz, Korth and Sudarshan22.9Database System Concepts - 6 th Edition

    Transaction Server Process Structure

    A typical transaction server consists of multiple processes accessing

    data in shared memory.Server processes

    These receive user queries (transactions), execute them and sendresults back

    Processes may be multithreaded , allowing a single process to

    execute several user queries concurrentlyTypically multiple multithreaded server processes

    Lock manager process

    More on this later

    Database writer process

    Output modified buffer blocks to disks continually

  • 8/13/2019 Unit 5 networks

    10/67 Silberschatz, Korth and Sudarshan22.10Database System Concepts - 6 th Edition

    Transaction Server Processes (Cont.)

    Log writer process

    Server processes simply add log records to log record bufferLog writer process outputs log records to stable storage.

    Checkpoint process

    Performs periodic checkpoints

    Process monitor processMonitors other processes, and takes recovery actions if any ofthe other processes fail

    E.g., aborting any transactions being executed by a serverprocess and restarting it

  • 8/13/2019 Unit 5 networks

    11/67 Silberschatz, Korth and Sudarshan22.11Database System Concepts - 6 th Edition

    Transaction System Processes (Cont.)

  • 8/13/2019 Unit 5 networks

    12/67 Silberschatz, Korth and Sudarshan22.12Database System Concepts - 6 th Edition

    Transaction System Processes (Cont.)

    Shared memory contains shared data

    Buffer poolLock tableLog bufferCached query plans (reused if same query submitted again)

    All database processes can access shared memory

    To ensure that no two processes are accessing the same data structureat the same time, databases systems implement mutual exclusion using either

    Operating system semaphoresAtomic instructions such as test-and-set

    To avoid overhead of interprocess communication for lockrequest/grant, each database process operates directly on the locktable

    instead of sending requests to lock manager processLock manager process still used for deadlock detection

  • 8/13/2019 Unit 5 networks

    13/67 Silberschatz, Korth and Sudarshan22.13Database System Concepts - 6 th Edition

    Data Servers

    Used in high-speed LANs, in cases where

    The clients are comparable in processing power to the serverThe tasks to be executed are compute intensive.

    Data are shipped to clients where processing is performed, and thenshipped results back to the server.

    This architecture requires full back-end functionality at the clients.

    Used in many object-oriented database systems

    Issues:

    Page-Shipping versus Item-Shipping

    Locking

    Data CachingLock Caching

  • 8/13/2019 Unit 5 networks

    14/67 Silberschatz, Korth and Sudarshan22.14Database System Concepts - 6 th Edition

    Data Servers (Cont.)

    Page-shipping versus item-shipping

    Smaller unit of shipping more messagesWorth prefetching related items along with requested itemPage shipping can be thought of as a form of prefetching

    LockingOverhead of requesting and getting locks from server is high dueto message delaysCan grant locks on requested and prefetched items; with pageshipping, transaction is granted lock on whole page.Locks on a prefetched item can be P{called back} by the server,and returned by client transaction if the prefetched item has notbeen used.Locks on the page can be de escalated to locks on items in thepage when there are lock conflicts. Locks on unused items canthen be returned to server.

  • 8/13/2019 Unit 5 networks

    15/67 Silberschatz, Korth and Sudarshan22.15Database System Concepts - 6 th Edition

    Data Servers (Cont.)

    Data Caching

    Data can be cached at client even in between transactionsBut check that data is up-to-date before it is used ( cachecoherency )

    Check can be done when requesting lock on data item

    Lock Caching

    Locks can be retained by client system even in betweentransactions

    Transactions can acquire cached locks locally, withoutcontacting server

    Servercalls back

    locks from clients when it receives conflictinglock request. Client returns lock once no local transaction isusing it.

    Similar to deescalation, but across transactions.

  • 8/13/2019 Unit 5 networks

    16/67 Silberschatz, Korth and Sudarshan22.16Database System Concepts - 6 th Edition

    Parallel Systems

    Parallel database systems consist of multiple processors and multiple

    disks connected by a fast interconnection network.A coarse-grain parallel machine consists of a small number ofpowerful processors

    A massively parallel or fine grain parallel machine utilizesthousands of smaller processors.

    Two main performance measures:throughput --- the number of tasks that can be completed in agiven time interval

    response time --- the amount of time it takes to complete a singletask from the time it is submitted

  • 8/13/2019 Unit 5 networks

    17/67 Silberschatz, Korth and Sudarshan22.17Database System Concepts - 6 th Edition

    Speed-Up and Scale-Up

    Speedup : a fixed-sized problem executing on a small system is given

    to a system which is N-times larger.Measured by:

    speedup = small system elapsed time

    large system elapsed time

    Speedup is linear if equation equals N.

    Scaleup : increase the size of both the problem and the system

    N-times larger system used to perform N-times larger job

    Measured by:

    scaleup = small system small problem elapsed time

    big system big problem elapsed timeScale up is linear if equation equals 1.

  • 8/13/2019 Unit 5 networks

    18/67 Silberschatz, Korth and Sudarshan22.18Database System Concepts - 6 th Edition

    Speedup

  • 8/13/2019 Unit 5 networks

    19/67 Silberschatz, Korth and Sudarshan22.19Database System Concepts - 6 th Edition

    Scaleup

  • 8/13/2019 Unit 5 networks

    20/67 Silberschatz, Korth and Sudarshan22.20Database System Concepts - 6 th Edition

    Batch and Transaction Scaleup

    Batch scaleup :

    A single large job; typical of most decision support queries andscientific simulation.

    Use an N-times larger computer on N-times larger problem.

    Transaction scaleup :

    Numerous small queries submitted by independent users to ashared database; typical transaction processing and timesharingsystems.

    N-times as many users submitting requests (hence, N-times asmany requests) to an N-times larger database, on an N-timeslarger computer.

    Well-suited to parallel execution.

  • 8/13/2019 Unit 5 networks

    21/67 Silberschatz, Korth and Sudarshan22.21Database System Concepts - 6 th Edition

    Factors Limiting Speedup and Scaleup

    Speedup and scaleup are often sublinear due to:

    Startup costs : Cost of starting up multiple processes may dominatecomputation time, if the degree of parallelism is high.

    Interference : Processes accessing shared resources (e.g., systembus, disks, or locks) compete with each other, thus spending timewaiting on other processes, rather than performing useful work.

    Skew : Increasing the degree of parallelism increases the variance inservice times of parallely executing tasks. Overall execution timedetermined by slowest of parallely executing tasks.

  • 8/13/2019 Unit 5 networks

    22/67 Silberschatz, Korth and Sudarshan22.22Database System Concepts - 6 th Edition

    Parallel Database Architectures

    Shared memory -- processors share a common memory

    Shared disk -- processors share a common diskShared nothing -- processors share neither a common memory norcommon disk

    Hierarchical -- hybrid of the above architectures

  • 8/13/2019 Unit 5 networks

    23/67 Silberschatz, Korth and Sudarshan22.23Database System Concepts - 6 th Edition

    Parallel Database Architectures

  • 8/13/2019 Unit 5 networks

    24/67

    Silberschatz, Korth and Sudarshan22.24Database System Concepts - 6 th Edition

    Shared Memory

    Processors and disks have access to a common memory, typically via

    a bus or through an interconnection network.Extremely efficient communication between processors data inshared memory can be accessed by any processor without having tomove it using software.

    Downside architecture is not scalable beyond 32 or 64 processorssince the bus or the interconnection network becomes a bottleneck

    Widely used for lower degrees of parallelism (4 to 8).

  • 8/13/2019 Unit 5 networks

    25/67

    Silberschatz, Korth and Sudarshan22.25Database System Concepts - 6 th Edition

    Shared Disk

    All processors can directly access all disks via an interconnection

    network, but the processors have private memories.The memory bus is not a bottleneck

    Architecture provides a degree of fault-tolerance if aprocessor fails, the other processors can take over its taskssince the database is resident on disks that are accessible from

    all processors.Examples: IBM Sysplex and DEC clusters (now part of Compaq)running Rdb (now Oracle Rdb) were early commercial users

    Downside: bottleneck now occurs at interconnection to the disksubsystem.

    Shared-disk systems can scale to a somewhat larger number ofprocessors, but communication between processors is slower.

  • 8/13/2019 Unit 5 networks

    26/67

    Silberschatz, Korth and Sudarshan22.26Database System Concepts - 6 th Edition

    Shared Nothing

    Node consists of a processor, memory, and one or more disks.

    Processors at one node communicate with another processor atanother node using an interconnection network. A node functions asthe server for the data on the disk or disks the node owns.

    Examples: Teradata, Tandem, Oracle-n CUBE

    Data accessed from local disks (and local memory accesses) do notpass through interconnection network, thereby minimizing theinterference of resource sharing.

    Shared-nothing multiprocessors can be scaled up to thousands ofprocessors without interference.

    Main drawback: cost of communication and non-local disk access;sending data involves software interaction at both ends.

  • 8/13/2019 Unit 5 networks

    27/67

    Silberschatz, Korth and Sudarshan22.27Database System Concepts - 6 th Edition

    Hierarchical

    Combines characteristics of shared-memory, shared-disk, and shared-

    nothing architectures.Top level is a shared-nothing architecture nodes connected by aninterconnection network, and do not share disks or memory with eachother.

    Each node of the system could be a shared-memory system with afew processors.

    Alternatively, each node could be a shared-disk system, and each ofthe systems sharing a set of disks could be a shared-memory system.

    Reduce the complexity of programming such systems by distributedvirtual-memory architectures

    Also called non-uniform memory architecture (NUMA)

  • 8/13/2019 Unit 5 networks

    28/67

    Silberschatz, Korth and Sudarshan22.28Database System Concepts - 6 th Edition

    Hybrid architecture

    hybrid architecture includes:

    Non-Uniform Memory Architecture (NUMA), which involves the Non-Uniform Memory Access .

    Cluster (shared nothing + shared disk: SAN/NAS), which is formed bya group of connected computers.

    Non-Uniform Memory Access (NUMA) is a computer memory design used in multiprocessing , where the memory access timedepends on the memory location relative to a processor. UnderNUMA, a processor can access its own local memory faster than non-local memory (memory local to another processor or memory sharedbetween processors).

    NUMA architectures logically follow in scaling from symmetricmultiprocessing (SMP) architectures.

    http://en.wikipedia.org/wiki/Non-Uniform_Memory_Accesshttp://en.wikipedia.org/wiki/Non-Uniform_Memory_Accesshttp://en.wikipedia.org/wiki/Computer_storagehttp://en.wikipedia.org/wiki/Multiprocessinghttp://en.wikipedia.org/wiki/Symmetric_multiprocessinghttp://en.wikipedia.org/wiki/Symmetric_multiprocessinghttp://en.wikipedia.org/wiki/Symmetric_multiprocessinghttp://en.wikipedia.org/wiki/Symmetric_multiprocessinghttp://en.wikipedia.org/wiki/Multiprocessinghttp://en.wikipedia.org/wiki/Computer_storagehttp://en.wikipedia.org/wiki/Non-Uniform_Memory_Accesshttp://en.wikipedia.org/wiki/Non-Uniform_Memory_Accesshttp://en.wikipedia.org/wiki/Non-Uniform_Memory_Access
  • 8/13/2019 Unit 5 networks

    29/67

    Silberschatz, Korth and Sudarshan22.29Database System Concepts - 6 th Edition

    XML

    Using an example explain the distinction between attribute and a sub

    element. Explain the purpose and use of namespaces

    Give the DTD for an XML representation of the following nested-relational schema.

    Emp = (ename, ChildrenSet setof(Children), SkillsSet setof(Skills))

    Children = (name, Birthday)Birthday = (day, month, year)Skills = (type, ExamsSet setof(Exams))Exams = (year, city)

    Explain the limitations of DTD. Describe the alternative to overcomethis limitation.

  • 8/13/2019 Unit 5 networks

    30/67

    Silberschatz, Korth and Sudarshan22.30Database System Concepts - 6 th Edition

    Introduction

    XML: Extensible Markup Language

    Defined by the WWW Consortium (W3C)Derived from SGML (Standard Generalized Markup Language), butsimpler to use than SGML

    Documents have tags giving extra information about sections of thedocument

    E.g. XML Introduction

    Extensible , unlike HTML

    Users can add new tags, and separately specify how the tag should behandled for display

  • 8/13/2019 Unit 5 networks

    31/67

    Silberschatz, Korth and Sudarshan22.31Database System Concepts - 6 th Edition

    Comparison with Relational Data

    Inefficient: tags, which in effect represent schema information, are

    repeatedBetter than relational tuples as a data-exchange format

    Unlike relational tuples, XML data is self-documenting due topresence of tags

    Non-rigid format: tags can be added

    Allows nested structuresWide acceptance, not only in database systems, but also inbrowsers, tools, and applications

  • 8/13/2019 Unit 5 networks

    32/67

    Silberschatz, Korth and Sudarshan22.32Database System Concepts - 6 th Edition

    Structure of XML Data

    Tag : label for a section of data

    Element : section of data beginning with < tagname > and ending withmatching

    Elements must be properly nested

    Proper nesting

    .

    Improper nesting

    .

    Formally: every start tag must have a unique matching end tag,that is in the context of the same parent element.

    Every document must have a single top-level element

  • 8/13/2019 Unit 5 networks

    33/67

    Silberschatz, Korth and Sudarshan22.33Database System Concepts - 6 th Edition

    Structure of XML Data (Cont.)

    Mixture of text with sub-elements is legal in XML.

    Example:

    This course is being offered for the first time in 2009. BIO-399 Computational Biology Biology

    3 Useful for document markup, but discouraged for datarepresentation

  • 8/13/2019 Unit 5 networks

    34/67

    Silberschatz, Korth and Sudarshan22.34Database System Concepts - 6 th Edition

    Attributes

    Elements can have attributes

    Intro. to Computer Science Comp. Sci. 4

    Attributes are specified by name=value pairs inside the starting tag of anelementAn element may have several attributes, but each attribute name canonly occur once

  • 8/13/2019 Unit 5 networks

    35/67

    Silberschatz, Korth and Sudarshan22.35Database System Concepts - 6 th Edition

    Attributes vs. Subelements

    Distinction between subelement and attribute

    In the context of documents, attributes are part of markup, whilesubelement contents are part of the basic document contents

    In the context of data representation, the difference is unclear andmay be confusing

    Same information can be represented in two ways

    CS- 101

    Suggestion: use attributes for identifiers of elements, and use

    subelements for contents

  • 8/13/2019 Unit 5 networks

    36/67

    Silberschatz, Korth and Sudarshan22.36Database System Concepts - 6 th Edition

    Namespaces

    XML data has to be exchanged between organizations

    Same tag name may have different meaning in different organizations,causing confusion on exchanged documents

    Specifying a unique string as an element name avoids confusion

    Better solution: use unique-name:element-name

    Avoid using long unique names all over document by using XMLNamespaces

    CS-101 Intro. to Computer Science Comp. Sci. 4

    http://www.yale.edu/http://www.yale.edu/
  • 8/13/2019 Unit 5 networks

    37/67

    Silberschatz, Korth and Sudarshan22.37Database System Concepts - 6 th Edition

    XML Document Schema

    Database schemas constrain what information can be stored, and the

    data types of stored valuesXML documents are not required to have an associated schema

    However, schemas are very important for XML data exchange

    Otherwise, a site cannot automatically interpret data received fromanother site

    Two mechanisms for specifying XML schemaDocument Type Definition (DTD)

    Widely used

    XML Schema

    Newer, increasing use

  • 8/13/2019 Unit 5 networks

    38/67

    Silberschatz, Korth and Sudarshan22.38Database System Concepts - 6 th Edition

    Document Type Definition (DTD)

    The type of an XML document can be specified using a DTD

    DTD constraints structure of XML dataWhat elements can occur

    What attributes can/must an element have

    What subelements can/must occur inside each element, and howmany times.

    DTD does not constrain data types

    All values represented as strings in XML

    DTD syntax

  • 8/13/2019 Unit 5 networks

    39/67

    Silberschatz, Korth and Sudarshan22.39Database System Concepts - 6 th Edition

    Element Specification in DTD

    Subelements can be specified as

    names of elements, or#PCDATA (parsed character data), i.e., character strings

    EMPTY (no subelements) or ANY (anything can be a subelement)

    Example

    Subelement specification may have regular expressions

    Notation: | - alternatives +- 1 or more occurrences *- 0 or more occurrences

  • 8/13/2019 Unit 5 networks

    40/67

    Silberschatz, Korth and Sudarshan22.40Database System Concepts - 6 th Edition

    University DTD

    ]>

  • 8/13/2019 Unit 5 networks

    41/67

    Silberschatz, Korth and Sudarshan22.41Database System Concepts - 6 th Edition

    Attribute Specification in DTD

    Attribute specification : for each attribute

    NameType of attribute

    CDATAID (identifier) or IDREF (ID reference) or IDREFS (multiple IDREFs)

    more on this later

    Whethermandatory (#REQUIRED)has a default value (value),or neither (#IMPLIED)

    Examples, or

    course_id ID #REQUIREDdept_name IDREF #REQUIREDinstructors IDREFS #IMPLIED >

  • 8/13/2019 Unit 5 networks

    42/67

    Silberschatz, Korth and Sudarshan22.42Database System Concepts - 6 th Edition

    IDs and IDREFs

    An element can have at most one attribute of type ID

    The ID attribute value of each element in an XML document must bedistinct

    Thus the ID attribute value is an object identifier

    An attribute of type IDREF must contain the ID value of an element inthe same document

    An attribute of type IDREFS contains a set of (0 or more) ID values.Each ID value must contain the ID value of an element in the samedocument

    h b

  • 8/13/2019 Unit 5 networks

    43/67

    Silberschatz, Korth and Sudarshan22.43Database System Concepts - 6 th Edition

    University DTD with Attributes

    University DTD with ID and IDREF attribute types.

    dept_name ID #REQUIRED >

    IID ID #REQUIREDdept_name IDREF #REQUIRED > declarations for title, credits, building,

    budget, name and salary ]>

    Li i i f DTD

  • 8/13/2019 Unit 5 networks

    44/67

    Silberschatz, Korth and Sudarshan22.44Database System Concepts - 6 th Edition

    Limitations of DTDs

    No typing of text elements and attributes

    All values are strings, no integers, reals, etc.Difficult to specify unordered sets of subelements

    Order is usually irrelevant in databases (unlike in the document-layout environment from which XML evolved)

    (A | B)* allows specification of an unordered set, but

    Cannot ensure that each of A and B occurs only once

    IDs and IDREFs are untyped

    The instructors attribute of an course may contain a reference toanother course, which is meaningless

    instructors attribute should ideally be constrained to refer toinstructor elements

    XML S h

  • 8/13/2019 Unit 5 networks

    45/67

    Silberschatz, Korth and Sudarshan22.45Database System Concepts - 6 th Edition

    XML Schema

    XML Schema is a more sophisticated schema language whichaddresses the drawbacks of DTDs . Supports

    Typing of values

    E.g. integer, string, etc

    Also, constraints on min/max values

    User-defined, comlex types

    Many more features, including

    uniqueness and foreign key constraints, inheritance

    XML Schema is itself specified in XML syntax, unlike DTDs

    More-standard representation, but verbose

    XML Scheme is integrated with namespacesBUT: XML Schema is significantly more complicated than DTDs.

    D i i S S

  • 8/13/2019 Unit 5 networks

    46/67

    Silberschatz, Korth and Sudarshan22.46Database System Concepts - 6 th Edition

    Decision Support Systems

    Decision-support systems are used to make business decisions,often based on data collected by on-line transaction-processingsystems.

    Examples of business decisions:

    What items to stock?

    What insurance premium to change?To whom to send advertisements?

    Examples of data used for making decisions

    Retail sales transaction details

    Customer profiles (income, age, gender, etc.)

    D i i S t S t O i

  • 8/13/2019 Unit 5 networks

    47/67

    Silberschatz, Korth and Sudarshan22.47Database System Concepts - 6 th Edition

    Decision-Support Systems: OverviewData analysis tasks are simplified by specialized tools and SQLextensions

    Example tasksFor each product category and each region, what were the totalsales in the last quarter and how do they compare with thesame quarter last yearAs above, for each product category and each customercategory

    Statistical analysis packages (e.g., : S++) can be interfaced withdatabases

    Statistical analysis is a large field, but not covered hereData mining seeks to discover knowledge automatically in the form ofstatistical rules and patterns from large databases.

    A data warehouse archives information gathered from multiplesources, and stores it under a unified schema, at a single site.

    Important for large businesses that generate data from multipledivisions, possibly at multiple sitesData may also be purchased externally

    D t W h i

  • 8/13/2019 Unit 5 networks

    48/67

    Silberschatz, Korth and Sudarshan22.48Database System Concepts - 6 th Edition

    Data Warehousing

    Data sources often store only current data, not historical data

    Corporate decision making requires a unified view of all organizationaldata, including historical data

    A data warehouse is a repository (archive) of information gatheredfrom multiple sources, stored under a unified schema, at a single site

    Greatly simplifies querying, permits study of historical trends

    Shifts decision support query load away from transactionprocessing systems

    D W h i

  • 8/13/2019 Unit 5 networks

    49/67

    Silberschatz, Korth and Sudarshan22.49Database System Concepts - 6 th Edition

    Data Warehousing

    D i I

  • 8/13/2019 Unit 5 networks

    50/67

    Silberschatz, Korth and Sudarshan22.50Database System Concepts - 6 th Edition

    Design Issues

    When and how to gather data

    Source driven architecture : data sources transmit newinformation to warehouse, either continuously or periodically(e.g., at night)

    Destination driven architecture : warehouse periodicallyrequests new information from data sources

    Keeping warehouse exactly synchronized with data sources(e.g., using two-phase commit) is too expensive

    Usually OK to have slightly out-of-date data at warehouse

    Data/updates are periodically downloaded form onlinetransaction processing (OLTP) systems.

    What schema to useSchema integration

  • 8/13/2019 Unit 5 networks

    51/67

    Silberschatz, Korth and Sudarshan22.51Database System Concepts - 6 th Edition

    More Warehouse Design Issues

    Data cleansing

    E.g., correct mistakes in addresses (misspellings, zip codeerrors)

    Merge address lists from different sources and purge duplicates

    How to propagate updates

    Warehouse schema may be a (materialized) view of schema

    from data sourcesWhat data to summarize

    Raw data may be too large to store on-line

    Aggregate values (totals/subtotals) often suffice

    Queries on raw data can often be transformed by queryoptimizer to use aggregate values

  • 8/13/2019 Unit 5 networks

    52/67

    Silberschatz, Korth and Sudarshan22.52Database System Concepts - 6 th Edition52

    Why Data Mining?

    The Explosive Growth of Data

    Data collection and data availability

    Automated data collection tools, database systems, Web, computerized

    societyMajor sources of abundant data

    Business: Web, e- commerce, transactions, stocks,

    Science: Remote sensing, bioinformatics, scientific simulation,

    Society and everyone: news, digital cameras,

    We are drowning in data, but starving for knowledge!

    Necessity is the mother of invention Data mining Automated analysis ofmassive data sets

  • 8/13/2019 Unit 5 networks

    53/67

    Silberschatz, Korth and Sudarshan22.53Database System Concepts - 6 th Edition53

    Why Data Mining? Potential Applications

    Data analysis and decision support

    Market analysis and management

    Target marketing, customer relationship management (CRM), marketbasket analysis, cross selling, market segmentation

    Risk analysis and management

    Forecasting, customer retention, improved underwriting, qualitycontrol, competitive analysis

    Fraud detection and detection of unusual patterns (outliers)

    Other Applications

    Text mining (news group, email, documents) and Web mining

    Stream data mining

    Bioinformatics and bio-data analysis

  • 8/13/2019 Unit 5 networks

    54/67

    Silberschatz, Korth and Sudarshan22.54Database System Concepts - 6 th Edition

    Data Mining: A KDD Process

    Data mining: the core ofknowledge discoveryprocess.

    Data CleaningData Integration

    Databases

    Data Warehouse

    Task-relevant Data Data SelectionData Preprocessing

    Data Mining

    Pattern Evaluation

  • 8/13/2019 Unit 5 networks

    55/67

    Silberschatz, Korth and Sudarshan22.55Database System Concepts - 6 th Edition

    Steps of a KDD Process

    Learning the application domain:relevant prior knowledge and goals of application

    Creating a target data set: data selectionData cleaning and preprocessing: (may take 60% of effort!)

    Data reduction and transformation :Find useful features, dimensionality/variable reduction, invariantrepresentation.

    Choosing functions of data mining summarization, classification, regression, association, clustering.

    Choosing the mining algorithm(s)Data mining : search for patterns of interestPattern evaluation and knowledge presentation

    visualization, transformation, removing redundant patterns, etc.

    Use of discovered knowledge

  • 8/13/2019 Unit 5 networks

    56/67

    Silberschatz, Korth and Sudarshan22.56Database System Concepts - 6 th Edition56

    Data Mining Functionalities

    General functionality

    Descriptive data mining

    Predictive data mining

    Different views lead to different classifications

    Data view: Kinds of data to be mined

    Knowledge view: Kinds of knowledge to be discovered

    Method view: Kinds of techniques utilized

    Application view: Kinds of applications adapted

  • 8/13/2019 Unit 5 networks

    57/67

    Silberschatz, Korth and Sudarshan22.57Database System Concepts - 6 th Edition57

    Data Mining Functionalities

    Multidimensional concept description : Characterization and discrimination

    Generalize, summarize, and contrast data characteristics, e.g., dry vs.wet regions

    association analysis

    Diaper Beer [0.5%, 75%] (Correlation or causality?)

    Classification and prediction

    Construct models (functions) that describe and distinguish classes orconcepts for future prediction

    E.g., classify countries based on (climate), or classify cars based on(gas mileage)

    Predict some unknown or missing numerical values

  • 8/13/2019 Unit 5 networks

    58/67

    Silberschatz, Korth and Sudarshan22.58Database System Concepts - 6 th Edition

    58

    Data Mining Functionalities (2)Cluster analysis

    Class label is unknown: Group data to form new classes, e.g., clusterhouses to find distribution patterns

    Maximizing intra-class similarity & minimizing interclass similarity

    Outlier analysis

    Outlier: Data object that does not comply with the general behavior of thedata

    Noise or exception? Useful in fraud detection, rare events analysis

    Trend and evolution analysis

    Trend and deviation: e.g., regression analysis

    Periodicity analysisSimilarity-based analysis

    Other pattern-directed or statistical analyses

  • 8/13/2019 Unit 5 networks

    59/67

    Silberschatz, Korth and Sudarshan22.59Database System Concepts - 6 th EditionDecember 5, 2013 Data Mining: Concepts and 59

    Data Cleaning

    ImportanceData cleaning is one of the three biggest problemsin data warehousing Ralph KimballData cleaning is the number one problem in data

    warehousing DCI surveyData cleaning tasks

    Fill in missing values

    Identify outliers and smooth out noisy data

    Correct inconsistent data

    Resolve redundancy caused by data integration

  • 8/13/2019 Unit 5 networks

    60/67

    Silberschatz, Korth and Sudarshan22.60Database System Concepts - 6 th EditionDecember 5, 2013 Data Mining: Concepts and 60

    Missing Data

    Data is not always available

    E.g., many tuples have no recorded value for severalattributes, such as customer income in sales data

    Missing data may be due to

    equipment malfunction

    inconsistent with other recorded data and thus deleted

    data not entered due to misunderstanding

    certain data may not be considered important at the time ofentry

    not register history or changes of the data

    Missing data may need to be inferred.

  • 8/13/2019 Unit 5 networks

    61/67

    Silberschatz, Korth and Sudarshan22.61Database System Concepts - 6 th EditionDecember 5, 2013 Data Mining: Concepts and 61

    How to Handle Missing Data?

    Ignore the tuple: usually done when class label is missing (assuming

    the tasks in classification not effective when the percentage of

    missing values per attribute varies considerably.

    Fill in the missing value manually: tedious + infeasible?

    Fill in it automatically with

    a global constant : e.g., unknown, a new class?!

    the attribute mean

    the attribute mean for all samples belonging to the same class:

    smarter

    the most probable value: inference-based such as Bayesian

    formula or decision tree

  • 8/13/2019 Unit 5 networks

    62/67

    Silberschatz, Korth and Sudarshan22.62Database System Concepts - 6 th EditionDecember 5, 2013 Data Mining: Concepts and 62

    Noisy Data

    Noise: random error or variance in a measured variableIncorrect attribute values may due to

    faulty data collection instruments

    data entry problems

    data transmission problems

    technology limitation

    inconsistency in naming convention

    Other data problems which requires data cleaning

    duplicate records

    incomplete data

    inconsistent data

  • 8/13/2019 Unit 5 networks

    63/67

    Silberschatz, Korth and Sudarshan22.63Database System Concepts - 6 th EditionDecember 5, 2013 Data Mining: Concepts and 63

    How to Handle Noisy Data?

    Binningfirst sort data and partition into (equal-frequency) bins

    then one can smooth by bin means, smooth by binmedian, smooth by bin boundaries , etc.

    Regression

    smooth by fitting the data into regression functions

    Clustering

    detect and remove outliersCombined computer and human inspection

    detect suspicious values and check by human (e.g.,deal with possible outliers)

    Simple Discretization Methods: Binning

  • 8/13/2019 Unit 5 networks

    64/67

    Silberschatz, Korth and Sudarshan22.64Database System Concepts - 6 th EditionDecember 5, 2013 Data Mining: Concepts and 64

    Simple Discretization Methods: Binning

    Equal-width (distance) partitioningDivides the range into N intervals of equal size: uniform grid

    if A and B are the lowest and highest values of the attribute, the

    width of intervals will be: W = (B A)/ N.

    The most straightforward, but outliers may dominate presentation

    Skewed data is not handled well

    Equal-depth (frequency) partitioning

    Divides the range into N intervals, each containing approximatelysame number of samples

    Good data scaling

    Managing categorical attributes can be tricky

    Binning Methods for Data Smoothing

  • 8/13/2019 Unit 5 networks

    65/67

    Silberschatz, Korth and Sudarshan22.65Database System Concepts - 6 th EditionDecember 5, 2013 Data Mining: Concepts and 65

    Binning Methods for Data Smoothing

    Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28,29, 34

    * Partition into equal-frequency (equi-depth) bins:

    - Bin 1: 4, 8, 9, 15

    - Bin 2: 21, 21, 24, 25

    - Bin 3: 26, 28, 29, 34

    * Smoothing by bin means:

    - Bin 1: 9, 9, 9, 9

    - Bin 2: 23, 23, 23, 23

    - Bin 3: 29, 29, 29, 29

    * Smoothing by bin boundaries:

    - Bin 1: 4, 4, 4, 15

    - Bin 2: 21, 21, 25, 25-

  • 8/13/2019 Unit 5 networks

    66/67

    Silberschatz, Korth and Sudarshan22.66Database System Concepts - 6 th EditionDecember 5, 2013 Data Mining: Concepts and 66

    Regression

    x

    y

    y = x + 1

    X1

    Y1

    Y1

  • 8/13/2019 Unit 5 networks

    67/67

    Cluster Analysis