Digital Preservation and Archiving at the Institute for Advanced Computer Studies, University of Maryland, College Park. Mike Smorul, Saurabh Channan

Post on 15-Jan-2016

TRANSCRIPT

  • Digital Preservation and Archiving at the Institute for Advanced Computer Studies, University of Maryland, College Park

    Mike Smorul, Saurabh Channan

  • Overview

    Digital Preservation Research

    ADAPT Project and Components

    Pilot Persistent Archive

    Digital Library and Production Data Distribution

    Global Land Cover Facility

    Conclusion

  • A Digital Approach to Preservation Technology (ADAPT)

    Premise:

    Preservation of digital entities into self-describing objects

    OAIS Information Package model as a framework

    Separation of management into three layers: bitstream, semantic, and access/discovery

    Distributed and Secure Infrastructure

    Automatic ingestion and replication

    Policy-Driven Management of Preservation Processes

    Global Format Registry

    Separate Peer-to-Peer Deep Archive

  • ADAPT Architecture

    [Architecture diagram. Labels: Data Management; Metadata Management (Descriptive Metadata, Preservation Metadata, Administrative Metadata); Deep Archive; Data Grid; Conventional Archive; PAWN; Management of Preservation Processes; CAN; Metadata; Data]

  • ADAPT Components

    Ingestion: Producer-Archive Workflow Network (PAWN)

    Management of Preservation Processes: Lightweight Preservation Environment (LPE)

    Access and Discovery: Grid Retrieval and Search Platform (GRASP), EAP Collection browser

  • Overall Principles (PAWN)

    Distributed, secure ingestion

    OAIS-based Information Package creation

    Use of web/grid technologies; platform independent

    Minimal client-side requirements

    Ease of integration with archive and data grid systems

    Designed to satisfy data integrity requirements of scientific collections and digital preservation

  • Distributed Ingestion (PAWN)

  • Ingestion Workflow (PAWN)

    Negotiate Submission Agreement.

    Workflow initialization and Submission Information Package (SIP) creation.

    Transfer of SIPs to Data Grid site.

    Validation of SIP transfer.

    Organization of data into collections and transfer into Data Grid.
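
    The "validation of SIP transfer" step can be illustrated with a minimal sketch. It assumes the producer records a checksum manifest when the SIP is created and the receiving site recomputes it after transfer. The slides do not say how PAWN validates transfers, so the SHA-256 digest and the names `build_manifest` and `validate_transfer` are hypothetical choices made only for this example.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(sip_dir: Path) -> dict[str, str]:
    """Producer side (assumed): map each relative file path to its digest."""
    return {
        p.relative_to(sip_dir).as_posix(): file_digest(p)
        for p in sorted(sip_dir.rglob("*"))
        if p.is_file()
    }


def validate_transfer(received_dir: Path, manifest: dict[str, str]) -> list[str]:
    """Receiving side (assumed): list files that are missing or altered."""
    problems = []
    for rel_path, expected in manifest.items():
        target = received_dir / rel_path
        if not target.is_file() or file_digest(target) != expected:
            problems.append(rel_path)
    return problems


def demo() -> tuple[list[str], list[str]]:
    """Simulate one clean and one corrupted transfer in a temp directory."""
    with tempfile.TemporaryDirectory() as tmp:
        producer = Path(tmp) / "producer"
        producer.mkdir()
        (producer / "scene.tif").write_bytes(b"pixel data")
        (producer / "scene.xml").write_text("<metadata/>")
        manifest = build_manifest(producer)

        received = Path(tmp) / "received"
        shutil.copytree(producer, received)
        clean = validate_transfer(received, manifest)

        # Overwrite one file to mimic corruption during transfer.
        (received / "scene.tif").write_bytes(b"corrupted")
        damaged = validate_transfer(received, manifest)
    return clean, damaged
```

    Calling `demo()` returns an empty problem list for the intact copy and names the overwritten file for the damaged one. A real ingestion service would also have to carry the manifest itself over a trusted channel.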

  • Component Overview (PAWN)

  • Target Collections (PAWN)

    Digital Image Collection: rich metadata in various formats

    Web site crawling: online and interactive content

    GLCF Landsat data: spatial and temporal metadata; large quantity (over 15,000 objects)

  • Lightweight Preservation Environment (LPE)

    The Lightweight Preservation Environment is an archival system based on a modular design using grid and web services.

    The current implementation relies mostly on Globus technologies.

    Primarily, we've focused on wrapping logic around those components.

  • Developed Components (LPE)

    Data Manager (DM): Organizes data and queries between the user and the other components

    Policy Manager (PM): Ensures that a minimum number of copies exist for any given file

    Transformation Manager (TM): Executes specific transformations on a named file on a given storage node and returns the results
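
    The Policy Manager's rule, a minimum number of copies for any given file, can be sketched as a small planning function. This is only an illustration of the idea, not the LPE implementation. The `ReplicaCatalog` class, the `plan_replication` function, and the node names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ReplicaCatalog:
    """Hypothetical in-memory record of which storage nodes hold each file."""
    locations: dict[str, set[str]] = field(default_factory=dict)


def plan_replication(
    catalog: ReplicaCatalog, all_nodes: list[str], min_copies: int
) -> dict[str, list[str]]:
    """Return {file: nodes to copy to} for every file below min_copies."""
    plan: dict[str, list[str]] = {}
    for name, holders in sorted(catalog.locations.items()):
        missing = min_copies - len(holders)
        if missing <= 0:
            continue  # policy already satisfied for this file
        # Pick nodes that do not yet hold the file, in a stable order.
        candidates = sorted(set(all_nodes) - holders)
        plan[name] = candidates[:missing]
    return plan


catalog = ReplicaCatalog({
    "a.tif": {"node1"},
    "b.tif": {"node1", "node2", "node3"},
})
plan = plan_replication(catalog, ["node1", "node2", "node3"], min_copies=2)
```

    Here only `a.tif` falls below the policy, so the plan asks for one extra copy of it. If fewer spare nodes exist than copies are missing, the plan simply lists the nodes that are available.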

  • Grid Retrieval and Search Platform (GRASP)

    Based on concepts from the Earth Science Data Interface (ESDI) developed at the UMIACS GLCF.

    Provides a graphical interface into data grid holdings.

    Access to entire GLCF holdings through the Storage Resource Broker (SRB).

  • GRASP Architecture

    I/O Abstraction Layer

  • GRASP Architecture

    GRASP uses a data grid as an abstract storage repository.

    Metadata in the grid is mined from the grid itself or from external sources and published into a browsable form.

    Data grids may allow for platform-independent metadata, but may not be optimal for access.
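
    One way to picture the "abstract storage repository" idea is a narrow interface that hides the grid, plus a step that mines metadata once and publishes it as a browsable index. The sketch below is an assumption-laden illustration. `StorageBackend`, `InMemoryGrid`, and `build_browse_index` are hypothetical names, and a real backend would wrap a data grid client rather than a dictionary.

```python
from abc import ABC, abstractmethod


class StorageBackend(ABC):
    """Hypothetical I/O abstraction: the browser sees only these two calls."""

    @abstractmethod
    def list_objects(self) -> list[str]: ...

    @abstractmethod
    def read_metadata(self, name: str) -> dict[str, str]: ...


class InMemoryGrid(StorageBackend):
    """Stand-in for a data grid, used only to make the sketch runnable."""

    def __init__(self, objects: dict[str, dict[str, str]]):
        self._objects = objects

    def list_objects(self) -> list[str]:
        return sorted(self._objects)

    def read_metadata(self, name: str) -> dict[str, str]:
        return dict(self._objects[name])


def build_browse_index(backend: StorageBackend, key: str) -> dict[str, list[str]]:
    """Mine metadata once and group object names by one attribute."""
    index: dict[str, list[str]] = {}
    for name in backend.list_objects():
        value = backend.read_metadata(name).get(key, "unknown")
        index.setdefault(value, []).append(name)
    return index


grid = InMemoryGrid({
    "p015r033.tif": {"sensor": "Landsat"},
    "mod44.hdf": {"sensor": "MODIS"},
    "notes.txt": {},
})
index = build_browse_index(grid, "sensor")
```

    Building the index ahead of time reflects the slide's point that grid metadata may not be optimal for access: the browser reads the published index instead of querying the grid on every request.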

  • GRASP Screenshot

  • Global Land Cover Facility

    Mission: The GLCF Mission is to encourage the use of remotely sensed imagery, derived products and applications within a broad range of science communities in a manner that improves comprehension of the nature and causes of land cover change and its impact on the Earth.

    Goal: The GLCF Goal is to provide free access to an integrated collection of critical land cover and Earth science data through systems that are designed to maximize user outreach and that promote development of novel tools for ordering, visualizing and manipulating spatial data.

  • Data Collections

    The majority of the holdings are Landsat and MODIS data

  • Data Distribution

    Data at the GLCF: approximately 5.1 TB compressed, approximately 13 TB uncompressed

    Anticipated Production Rate: triple or quadruple current data holdings within the next two years

    [Chart: Data Traffic, megabytes per month]

    Month      Hits  Percent Hits    Megabytes  Percent Megabytes
    Aug-02      204         0.00%      3069.45              0.00%
    Sep-02     1200         0.00%     10815.11              0.00%
    Oct-02    29236         0.20%    269121.47              0.20%
    Nov-02    66575         0.40%    647432.61              0.50%
    Dec-02    93848         0.60%      1047074              0.80%
    Jan-03    50413         0.30%    565833.28              0.40%
    Feb-03    87220         0.50%   1133355.52              0.90%
    Mar-03   118209         0.70%   1151651.19              0.90%
    Apr-03     4395         0.00%     42959.72              0.00%
    May-03   108673         0.60%    974617.09              0.80%
    Jun-03   453071         2.70%   2179940.59              1.70%
    Jul-03   369807         2.20%   2771023.11              2.20%
    Aug-03   236600         1.40%    2205639.4              1.80%
    Sep-03   406597         2.40%   4140785.21              3.30%
    Oct-03   459309         2.70%   3436878.95              2.70%
    Nov-03   524100         3.10%   4539745.95              3.60%
    Dec-03   437363         2.60%   4101629.09              3.30%
    Jan-04   391131         2.30%   3249906.85              2.60%
    Feb-04   745425         4.40%   4020107.61              3.20%
    Mar-04   581670         3.50%   4903974.43              3.90%
    Apr-04   528674         3.10%   5438624.95              4.30%
    May-04   690935         4.10%   7707098.48              6.10%
    Jun-04  1158653         6.90%   8626152.86              6.80%
    Jul-04   729302         4.30%   7245293.24              5.70%
    Aug-04  1178884         7.00%   9072188.58              7.20%
    Sep-04  1045282         6.20%   8458979.08              6.70%
    Oct-04  1164782         6.90%   8109359.08              6.40%
    Nov-04  1171139         6.90%   6427214.83              5.10%
    Dec-04  1421328         8.40%   9061943.48              7.20%
    Jan-05  1623997         9.60%  10023031.44              8.00%
    Feb-05   981081         5.80%   4444209.12              3.50%

  • Data Discovery Applications

    ESDI Web Interface

    User friendly

    Search, Retrieve, Discover

    Scalable: over 9 TB a month!
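
    The "over 9 TB a month" figure can be checked against the peak month in the traffic data (Jan-05, 10023031.44 megabytes). The slides do not say whether a terabyte means 1024² or 1000² megabytes, so the hypothetical helper below computes both as a quick arithmetic check.

```python
PEAK_MONTH_MB = 10023031.44  # Jan-05 megabytes from the traffic data


def megabytes_to_terabytes(mb: float, binary: bool = True) -> float:
    """Convert megabytes to terabytes; binary uses 1024**2, else 1000**2."""
    factor = 1024 ** 2 if binary else 1000 ** 2
    return mb / factor


tb_binary = megabytes_to_terabytes(PEAK_MONTH_MB)            # about 9.56
tb_decimal = megabytes_to_terabytes(PEAK_MONTH_MB, False)    # about 10.02
```

    Under either convention the peak month exceeds 9 TB, which is consistent with the claim on the slide.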

  • GLCF Architecture

    Scalable and Reliable

    [Architecture diagram. Labels: SunFire V100; Sun; ProFTPd servers]

  • Participation Possibilities

    PAWN ingestion component: minimal geospatial metadata support planned; can be expanded to support NGDA endpoint

    GRASP display component: solid core components; end-user interfaces need additional polishing

    GLCF data holdings: additional hardware required if additional data and access mechanisms (grid, etc.) are required

    Other possibilities include grid infrastructure, GSI security, format registry, etc.

  • Questions