Profile Serialization IIPC GA 2015
TRANSCRIPT
![Page 1: Profile Serialization IIPC GA 2015](https://reader030.vdocuments.us/reader030/viewer/2022020307/55b737c9bb61eb30038b47c2/html5/thumbnails/1.jpg)
Archive Profile Serialization
Sawood Alam @ibnesayeed
Computer Science Department, Old Dominion University
Norfolk, Virginia - 23529
![Page 2: Profile Serialization IIPC GA 2015](https://reader030.vdocuments.us/reader030/viewer/2022020307/55b737c9bb61eb30038b47c2/html5/thumbnails/2.jpg)
Archive Profile

- High-level digest of an archive
- Predicts presence of mementos of a URI-R in an archive
- Provides various statistics about the holdings
- Small in size
- Publicly available
- Easy to update and partially patch
- Useful for Memento query routing and other things
![Page 3: Profile Serialization IIPC GA 2015](https://reader030.vdocuments.us/reader030/viewer/2022020307/55b737c9bb61eb30038b47c2/html5/thumbnails/3.jpg)
Profile Contents

- How to organize contents?
- What goes in it?
- How to serialize it?
![Page 4: Profile Serialization IIPC GA 2015](https://reader030.vdocuments.us/reader030/viewer/2022020307/55b737c9bb61eb30038b47c2/html5/thumbnails/4.jpg)
Flat Organization

{"...":{},"stats":{"suburi":{"edu)/":{"urim":{"max":3,"min":1,"total":32},"urir":12},"edu,harvard)/":{"urim":{"max":1,"min":1,"total":2},"urir":2},"edu,harvard,law,blogs)/":{"urim":{"max":1,
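A flat organization like this can be built by counting each URI-R once under every one of its host prefixes. The sketch below assumes SURT-like keys made from reversed host labels. `suburi_keys` and `flat_stats` are hypothetical names used only for illustration, not functions from the profiler script, and the input counts are made up.

```python
from urllib.parse import urlsplit


def suburi_keys(uri):
    """Return SURT-like host prefixes, shortest first (assumed key scheme)."""
    host = urlsplit(uri).hostname or ""
    labels = host.split(".")[::-1] if host else []
    return [",".join(labels[:n]) + ")/" for n in range(1, len(labels) + 1)]


def flat_stats(memento_counts):
    """Build a flat profile from a mapping of URI-R -> number of mementos."""
    suburi = {}
    for uri, count in memento_counts.items():
        for key in suburi_keys(uri):
            entry = suburi.setdefault(
                key, {"urim": {"max": count, "min": count, "total": 0}, "urir": 0}
            )
            entry["urim"]["max"] = max(entry["urim"]["max"], count)
            entry["urim"]["min"] = min(entry["urim"]["min"], count)
            entry["urim"]["total"] += count
            entry["urir"] += 1
    return {"stats": {"suburi": suburi}}


# Illustrative input only; these counts are not taken from any archive.
profile = flat_stats({
    "http://blogs.law.harvard.edu/a": 1,
    "http://harvard.edu/": 1,
    "http://odu.edu/": 3,
})
```

Every key sits at the same level, so a lookup is a single dictionary access, at the cost of repeating the shared prefix in each key.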
![Page 5: Profile Serialization IIPC GA 2015](https://reader030.vdocuments.us/reader030/viewer/2022020307/55b737c9bb61eb30038b47c2/html5/thumbnails/5.jpg)
Grouped Organization

{"...":{},"stats":{"tld":{"com)/":{"urim":{"max":10,"min":2,"total":72},"urir":34},"edu)/":{"urim":{"max":3,"min":1,"total":32},"urir":12},"...":{}},"domain":{
![Page 6: Profile Serialization IIPC GA 2015](https://reader030.vdocuments.us/reader030/viewer/2022020307/55b737c9bb61eb30038b47c2/html5/thumbnails/6.jpg)
Nested Organization

{"...":{},"stats":{"tld":{"com)/":{"domain":{"com,adobe)/":{"urim":{"max":3,"min":3,"total":6},"urir":2},"...":{},},"urim":{"max":3,"min":1,"total":17},"urir":13},
![Page 7: Profile Serialization IIPC GA 2015](https://reader030.vdocuments.us/reader030/viewer/2022020307/55b737c9bb61eb30038b47c2/html5/thumbnails/7.jpg)
Frequency Metrics

{"...":{},"stats":{"suburi":{"com)/":{"urim":{"1stqu":4.2,"3rdqu":7.13,"max":12,"mean":6.52,"median":8,"min":1,"sd":4.18,"total":86},"urir":15},"...":{}},"...":{}}}
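The frequency metrics for one key could be derived from the list of memento counts of the URI-Rs under that key. This is a minimal sketch using Python's `statistics` module. The slide does not say how quartiles or the standard deviation are defined, so the inclusive quartile method and the sample standard deviation used here are assumptions, and `urim_metrics` is a hypothetical helper name.

```python
import statistics


def urim_metrics(counts):
    """Summarize per-URI-R memento counts (needs at least two values)."""
    # Assumption: inclusive quartiles and sample standard deviation.
    q1, median, q3 = statistics.quantiles(counts, n=4, method="inclusive")
    return {
        "1stqu": round(q1, 2),
        "3rdqu": round(q3, 2),
        "max": max(counts),
        "mean": round(statistics.mean(counts), 2),
        "median": median,
        "min": min(counts),
        "sd": round(statistics.stdev(counts), 2),
        "total": sum(counts),
    }


# Illustrative counts only.
metrics = urim_metrics([1, 2, 3, 4, 5])
```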
![Page 8: Profile Serialization IIPC GA 2015](https://reader030.vdocuments.us/reader030/viewer/2022020307/55b737c9bb61eb30038b47c2/html5/thumbnails/8.jpg)
JSON Serialization

- Can have complex nested data structure
- JSON-LD for linked data
- No partial key lookup
- Unsuitable for text processing tools
- Allows processing only when fully loaded
- A single malformed character makes it unparsable
- Difficult to patch
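The fragility point is easy to reproduce with any standard JSON parser. The sketch below uses Python's `json` module on a made-up one-line profile. `try_parse` is a hypothetical helper that returns `None` when parsing fails.

```python
import json


def try_parse(text):
    """Return the parsed document, or None if it is not valid JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None


# Illustrative document; the last character is then corrupted on purpose.
good = '{"stats":{"suburi":{"edu)/":{"urir":12}}}}'
bad = good[:-1] + "]"

parsed_good = try_parse(good)
parsed_bad = try_parse(bad)  # the whole document is lost, not just one entry
```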
![Page 9: Profile Serialization IIPC GA 2015](https://reader030.vdocuments.us/reader030/viewer/2022020307/55b737c9bb61eb30038b47c2/html5/thumbnails/9.jpg)
Sample JSON Profile

{"@context":"https://oduwsdl.github.io/context/archprofile.jsonld","@id":"http://www.webarchive.org.uk/ukwa/","about":{"accesspoint":"http://www.webarchive.org.uk/wayback/","mechanism":"http://oduwsdl.github.io/terms/mechanism#cdx","name":"UKWA 1996 Collection","profile_updated":"2015-01-20T17:25:30Z","suburi_class":"http://oduwsdl.github.io/terms/suburi#H3P1","more_meta_data":"..."},"stats":{"language":{"en-US":{"urim":{"max":13,"min":1,"total":47529},"urir":25621},"more_languages":"..."},
![Page 10: Profile Serialization IIPC GA 2015](https://reader030.vdocuments.us/reader030/viewer/2022020307/55b737c9bb61eb30038b47c2/html5/thumbnails/10.jpg)
CDXJSON Serialization

- Fusion of CDX and JSON file formats
- A key followed by strict single line JSON value
- Unlike CDX, values can have arbitrary attributes
- Text processing tool friendly
- No single root node or single document restrictions
- Enables binary search
- Enables partial key lookup
- Error resilient
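Because each record is one line, a reader can parse records independently and skip damaged ones. The sketch below assumes the line layout described on the next slide: a key, a single space, and a one-line JSON value. `parse_cdxjson` is a hypothetical name, and the sample records are made up.

```python
import json


def parse_cdxjson(text):
    """Parse key/JSON lines; return (records, line numbers that failed)."""
    records, errors = [], []
    for lineno, line in enumerate(text.splitlines(), 1):
        if not line.strip():
            continue
        key, sep, value = line.partition(" ")
        try:
            if not sep:
                raise ValueError("no space between key and value")
            records.append((key, json.loads(value)))
        except ValueError:  # json.JSONDecodeError is a subclass of ValueError
            errors.append(lineno)
    return records, errors


# Illustrative data; the second line is truncated on purpose.
sample = 'uk)/ {"urir":3}\nuk,co)/ {"urir":2\nuk,co,bbc)/ {"urir":1}\n'
records, errors = parse_cdxjson(sample)
```

Only the damaged line is lost. The records before and after it remain usable, unlike a single JSON document with the same defect.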
![Page 11: Profile Serialization IIPC GA 2015](https://reader030.vdocuments.us/reader030/viewer/2022020307/55b737c9bb61eb30038b47c2/html5/thumbnails/11.jpg)
Sample CDXJSON Profile

Key String SPACE Single Line JSON NEWLINE

@context "https://oduwsdl.github.io/contexts/archiveprofile.jsonld"
@id "http://www.webarchive.org.uk/ukwa/"
@about {"name":"UKWA 1996 Collection","type":"suburi#H3P1","...":
uk)/ {"urim":{"max":8,"min":1,"total":932432},"urir":867817},
uk,co)/ {"urim":{"max":8,"min":1,"total":410979},"urir":378686
uk,co,bbc)/ {"urim":{"max":2,"min":1,"total":128},"urir":115},
uk,co,bbc)/images {"urim":{"max":1,"min":1,"total":3},"urir":
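When the lines are kept in sorted order, binary search can find the first record with a given key prefix, and the matching records follow it contiguously. The sketch below uses Python's `bisect` on an in-memory list. `prefix_lookup` is a hypothetical name and the JSON values are made-up placeholders, not figures from the slide.

```python
from bisect import bisect_left


def prefix_lookup(sorted_lines, prefix):
    """Return all lines whose key starts with prefix (lines must be sorted)."""
    start = bisect_left(sorted_lines, prefix)
    matches = []
    for line in sorted_lines[start:]:
        if not line.startswith(prefix):
            break
        matches.append(line)
    return matches


# Illustrative records, already in lexicographic order.
lines = [
    'uk)/ {"urir":4}',
    'uk,co)/ {"urir":3}',
    'uk,co,bbc)/ {"urir":2}',
    'uk,co,bbc)/images {"urir":1}',
]
bbc_records = prefix_lookup(lines, "uk,co,bbc)/")
```

For a profile too large to load, the same idea could be applied by seeking within the sorted file instead of a list. That variant is not shown here.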
![Page 12: Profile Serialization IIPC GA 2015](https://reader030.vdocuments.us/reader030/viewer/2022020307/55b737c9bb61eb30038b47c2/html5/thumbnails/12.jpg)
Conclusions and Future Work

- CDXJSON offers scalability and failure resilience
- Reduces the profile size as it allows partial key lookup
- TODO: Update profiler script to output in CDXJSON
- TODO: Formalize CDXJSON format

Implementation code is available at:
GitHub: /oduwsdl/suburi_generator
GitHub: /oduwsdl/archive_profiler
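As a rough illustration of the first TODO, a flat JSON profile could be rewritten as CDXJSON by emitting one sorted line per key. This is only a sketch under the assumption that the statistics live under `stats` and `suburi` as in the flat example. `to_cdxjson` is a hypothetical name and does not describe the actual profiler script.

```python
import json


def to_cdxjson(profile):
    """Serialize stats.suburi entries as sorted 'key SPACE json' lines."""
    suburi = profile.get("stats", {}).get("suburi", {})
    lines = [
        key + " " + json.dumps(value, sort_keys=True, separators=(",", ":"))
        for key, value in suburi.items()
    ]
    return "\n".join(sorted(lines)) + "\n"


# Values copied from the flat organization example above.
flat = {"stats": {"suburi": {
    "edu,harvard)/": {"urim": {"max": 1, "min": 1, "total": 2}, "urir": 2},
    "edu)/": {"urim": {"max": 3, "min": 1, "total": 32}, "urir": 12},
}}}
cdxjson_text = to_cdxjson(flat)
```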