Heritrix 3: librarian features BnF proposal March 2015

Upload: annice-perkins

Post on 31-Dec-2015



Heritrix 3: librarian features

BnF proposal, March 2015


Context

• Follow-up to our NetarchiveSuite workshop in Tallinn:
  – https://sbforge.org/display/NAS/2015+Workshop+Conclusion
• Identified work packages:
  – tests
  – template migration
  – implementation of important but missing curator features for common operations in Heritrix 3
• BnF will further describe use cases, share them with the community for feedback, and implement the following features as a minimal Heritrix UI add-on


From H1…


… to H3


Common curator operations

• Search crawl.log
• Add filter on current job (job configuration)
• Change domains/hosts budget (job configuration)
• View or delete frontier URIs


Search crawl.log (NASC61)

• Add a page with the same layout but with 2 additional form fields:
  – Regular expression:
  – Show matches: 1000 (default # of matching URIs)
  – Action => Display URIs (reverse order by default)
• Possibility to refresh the display (F5)
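The search behaviour described above can be sketched in a few lines of Python. This is only an illustration of the proposal, not Heritrix code: the function name `search_crawl_log`, its parameters, and the sample log lines are hypothetical, and substring matching (`re.search`) is an assumption about how the regular expression would be applied.

```python
import re
from typing import Iterable, List, Tuple

# Made-up, simplified log lines for illustration only (not real crawl.log output).
SAMPLE_LOG = [
    "2015-03-01T10:00:00Z 200 1234 http://example.org/",
    "2015-03-01T10:00:01Z 404 0 http://example.org/missing",
    "2015-03-01T10:00:02Z 200 987 http://other.example/",
    "2015-03-01T10:00:03Z 200 555 http://example.org/page",
]


def search_crawl_log(lines: Iterable[str], pattern: str,
                     max_matches: int = 1000,
                     reverse: bool = True) -> Tuple[List[str], int]:
    """Return (lines to display, total number of matching lines)."""
    regex = re.compile(pattern)
    matches = [line.rstrip("\n") for line in lines if regex.search(line)]
    if reverse:
        # Reverse order by default: the most recent log entries come first.
        matches.reverse()
    # Only the first max_matches lines are displayed; the total is kept
    # so the page can say "displaying 1-1000 out of N".
    return matches[:max_matches], len(matches)
```

Refreshing the page (F5) would simply call the function again on the current contents of the log.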


Draft UI for « Search crawl log »

[UI mockup. Visible labels: Status + job ID, Home, Display URIs, Forward / Reversed, Matching lines: 1000, Lines: displaying 1-1000 out of 12345]




Add filter on current job (DecideRule) (NASC60)

• Not necessary to view active filters that were included from job start (NASC59)

• Add a page containing a rejectTemporarily area working with the following parameters:
  – Decision: REJECT
  – List-logic: OR
  – Regexp-list: empty at job start, free textarea which can be manually edited and sorted (440 px wide, 20 lines)
  – Action => Save: save current filters and activate them for the current job
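The REJECT / OR logic above can be illustrated with a short Python sketch. The helper names `parse_regex_list` and `is_rejected` are hypothetical; requiring the expression to match the whole URI (`re.fullmatch`) is an assumption, and Python's `re` syntax is not identical to the Java regular expressions Heritrix itself uses.

```python
import re
from typing import List


def parse_regex_list(textarea: str) -> List[str]:
    """Read the textarea content: one regular expression per non-blank line."""
    return [line.strip() for line in textarea.splitlines() if line.strip()]


def is_rejected(uri: str, regex_list: List[str]) -> bool:
    """List-logic OR: the URI is rejected if ANY expression matches it.

    Assumption: an expression must match the whole URI (full match).
    With an empty list (the state at job start) nothing is rejected.
    """
    return any(re.fullmatch(rx, uri) for rx in regex_list)
```

With the list `.*\.pdf` and `.*calendar.*`, for example, `http://example.org/a.pdf` would be rejected while `http://example.org/index.html` would still be crawled.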


Draft UI for « Add filter on current job »

[UI mockup. Visible labels: Status + job ID, Home]

All URIs matching any of the following regular expressions will be rejected from the current job.

Regular expressions: [text area] [Save]




Change domains/hosts budget

• Works with queue-total-budget and quota-enforcer systems

• Add a page containing:
  – a list of domains/hosts (in domain alphabetical order)
  – their associated budget value (which can be edited)
  – only those whose budget is not set by default
  – and a form field to add a new domain/host
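A minimal Python sketch of the data behind such a page, assuming budgets are kept as per-host overrides on top of a job-wide default. The names `DEFAULT_BUDGET`, `set_budget`, `budget_for` and `rows_to_display` are hypothetical, and plain string sorting is a simplification of "domain alphabetical order".

```python
from typing import Dict, List, Tuple

# Assumed job-wide default (the draft UI shows a queue-total-budget of 100 000 URIs).
DEFAULT_BUDGET = 100_000


def set_budget(overrides: Dict[str, int], host: str, budget: int) -> None:
    """Add or edit the budget of a domain/host.

    Setting a host back to the default removes its override, so that only
    domains/hosts with a non-default budget are listed.
    """
    if budget == DEFAULT_BUDGET:
        overrides.pop(host, None)
    else:
        overrides[host] = budget


def budget_for(overrides: Dict[str, int], host: str) -> int:
    """Budget that applies to a host: its override, else the default."""
    return overrides.get(host, DEFAULT_BUDGET)


def rows_to_display(overrides: Dict[str, int]) -> List[Tuple[str, int]]:
    """Rows of the page: (domain/host, budget), in alphabetical order."""
    return sorted(overrides.items())
```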


Draft UI for « Change domains/hosts budget »

[UI mockup. Visible labels: Status + job ID, Home, Save (four buttons)]

Budget defined in job configuration: queue-total-budget of 100 000 URIs.

Budgets of following domains/hosts have been changed in the current job:

bnf.fr 140 000
ina.fr 139 000
cnc.fr 139 500

New domain/host: toto.fr – 130 000




View or delete frontier URIs (NASC56 + NASC57 + NASC58)

• Add a page containing 2 form fields:
  – Regular expression:
  – Show matches: 1000 (default # of matching URIs)
  – Action A => Display URIs: displays the matching URIs and the # of matching URIs, and gives the possibility to view the next block of matching URIs
  – Action B => Delete URIs: deletes the matching URIs and indicates the # of matching URIs
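The two actions can be sketched in Python against a frontier held as a plain in-memory list. This is a simplification for illustration: the real Heritrix frontier is not a Python list, the function names `view_matching` and `delete_matching` are hypothetical, and substring matching (`re.search`) is an assumption.

```python
import re
from typing import List, Tuple


def view_matching(frontier: List[str], pattern: str,
                  offset: int = 0,
                  page_size: int = 1000) -> Tuple[List[str], int]:
    """Action A: return one block of matching URIs and the total match count.

    Calling again with offset + page_size gives the next block.
    """
    regex = re.compile(pattern)
    matching = [uri for uri in frontier if regex.search(uri)]
    return matching[offset:offset + page_size], len(matching)


def delete_matching(frontier: List[str], pattern: str) -> int:
    """Action B: remove matching URIs in place and return how many were deleted."""
    regex = re.compile(pattern)
    kept = [uri for uri in frontier if not regex.search(uri)]
    deleted = len(frontier) - len(kept)
    frontier[:] = kept  # update the caller's list in place
    return deleted
```

Viewing before deleting with the same expression lets the curator check the match count first.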


Draft UI for « View or delete frontier URIs »

[UI mockup. Visible labels: Status + job ID, Home, Matching lines: 1000, URIs: displaying 1-1000 out of 12345]

Pause the job first to view frontier


[Screenshot. Visible labels: search (twice); Job configuration: add filter, change budget]


Comparison with BAnQ
