British Library - Digitising Historic Newspapers
DESCRIPTION
Presentation given at the workshop "Optical Character Recognition (OCR) for the mass digitisation of textual materials: Improving Access to Text", held at UKOLN, University of Bath, on 24th September 2009
TRANSCRIPT
British Library/JISC – Digital Newspapers
ALY CONTEH
DIGITISATION PROGRAMME MANAGER
BATH, 24 SEPTEMBER 2009
825 million pages
BACKGROUND
Funding was secured in 2004 & 2007 from JISC to provide on-line access to a mass of historic newspaper content for learning, teaching and research.
Deliverables
• Scanning of complete newspaper runs held by BL
• 3 million pages of C18 & C19 newspapers stabilised and filmed
• Article zoning and page extraction
• OCR of page images
• Production of required metadata
PROJECT AIMS
• Free access for the academic community to a content-rich online service
• Access to out-of-copyright UK printed material
• Access to a mix of national and provincial newspapers, the majority from new microfilm
• Access to the entire content of each newspaper via OCR, including adverts, pictures, tables and all articles
SELECTION AND CONSULTATION
• Creation of User Panel of academics
• UK-wide coverage, breadth of century, national and regional titles
• Online questionnaire made and the exercise conducted Feb-Mar 2005
• Users asked to rank titles in order of priority
• Replies endorsed UK-wide coverage; relevant mix of national/regional titles
48 titles
3 million pages
10 million articles?
Variations in quality
Variations in structure: daily vs weekly, size, layout
PHYSICAL CHARACTERISTICS OF SOURCE MATERIAL
Bleed through
Stains
Tight binding
Holes/tears
Creases
Paper quality
Inconsistent inking
Dirt
Stamps
Printer errors
Animals
Repairs
Lamination
High level view of processes
Original Metadata XML Encoding
Microfilm Digital Images Website/Delivery System
Delivery: Greyscale v Bitonal
OCR: Greyscale v Bitonal
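As a rough illustration of what the greyscale versus bitonal choice involves: a bitonal image keeps one bit per pixel, so producing it from greyscale means picking a cut-off. The Python sketch below is not the project's workflow. It assumes a fixed global threshold and a made-up strip of pixel values, where production imaging software would typically use more adaptive methods. The function name `to_bitonal` and the sample values are hypothetical.

```python
def to_bitonal(grey_rows, threshold=128):
    """Map 8-bit greyscale pixels (0 = black, 255 = white) to 0 (ink) or 1 (paper)."""
    return [[1 if px >= threshold else 0 for px in row] for row in grey_rows]


# Made-up pixel strip: clean paper (about 250), solid ink (about 30)
# and a faint, under-inked stroke (about 140).
strip = [[250, 30, 245, 140, 248]]

# With a cut-off of 128 the faint stroke is classed as paper and vanishes;
# with a cut-off of 160 it survives as ink.
print(to_bitonal(strip, threshold=128))
print(to_bitonal(strip, threshold=160))
```

Greyscale keeps the faint stroke available for the OCR engine or the reader to interpret, at the cost of larger files. Bitonal commits to a decision about it at scan time.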
THE OCR CHALLENGE
• Tiny text
• Varying formats
• Uneven printing
• Vertical skew
• Multiple columns
Optical Character Recognition (OCR)
THE VIKING'S SONG
Now skall to the Vikings, the Vikings so bold,
So fearless in battle, so famous of old,
With swarthy, tanned features, and long locks of gold;
Ahoi ! my bold Vikings, ahoi !

We plunder the noble, we plunder the priest,
We rob the fat abbot to furnish our feast,
There's no fare so fine as the convent-fed beast;
Ahoi ! my bold Vikings, ahoi I

What vessels of Venice can vaunt to be lighter?
What blades of Toledo can boast being brighter?
What man to the Viking can match as a fighter?
Ahoi I my bold Vikings, ahoi I

Our sword is our father, our ship is our mother,
Our shield is our sister, our breastplate our brother,-
Thus, ask us our kindred, we say we've no other;
Ahoi ! my bold Vikings, ahoi !

So now slack the ropes, turn the sails to the wind,
And smartly the reefs of the canvas unbind,
As we sweep o'er the ocean more plunder to find;
Ahoi ! my bold Vikings, ahoi !
Exceptional Good Poor Worthless
(Exrh-ads from the New York Papers.)
JACKSON IONEY.It is with great pleasure that we per-ceive the true Jackson money is now iacirculation.. Half eagles of the.newJackson coinage are passing freely from,hand to hand this morning, and all who^get hold of them seem to feel at oncethe superiority of such real money tothe miserable p.laper substitute withlwhich, the spirit of aristocracy wouldstill continue to cheat the people. Thenew eagles, half eagles, and quartersare really beautiful coins-at least sowe ate assured, in relation to the eaglesand quarters, and so we can attestfroux:our own examination, in relation to thehalves. The Globe says, "It is de-voutly to be hoped that the Mint maybe able to suppl, all the pressing de-mands -on it-and .that overy~indepen-dent citizen may obtain a low pieces tocarry and preserve as a charm againstthe sorceries of the mammoth.
I SINGULAR AND SERIOUS ACCIDENT.- -O11 WeU Iwtje enoon Mr. Charles %Vyber, of the Borougll-roadt, VFleet Prison to visit a friend, and joined a party IIIroom, who entered into the foolish a seincitllt of 1pcnny-p)icce to the top of thle room, andl eatchivugtle mothli upon its descent to the lloor. Mr.) Lsidered a perfect adept at this game bot time Ile"last found its way into thle throat, where it V't0tdtwards of half an hour. A Surgeon tried tv folCebut being ulnable to do so, lie contrived to mDOVC tinto the stomach. Mr. Wyyber was commmpariatLvllieved by the penny-piece being riemoved out ot thewas enabiled, in the evening, to be carried tohackney-coach.
la 112 B ik e my lat arrived the>Pylades,-. lliot; aod. Abe- 3ineva, CNeee 4orn Neath,' titch ,cuim; ,'t;ohn_ IoMelwl fri ytiil SUn-.die8; ,FrietndiLp, St&ar, froniidon, 'Ui wine andgrocerieu ;: ;aletn, Bker, from Liverpool,. witfi eoal.;'4Stalled the AluidonG.: ceror' Lkndon, with sundries;: ;Two Rrothwsj'@ Whe~atn-;- Pylade', Eiot; Har'tinny,;;: Fisbley; ::Iiiveiy Peggy:-(flth add tie JAne, Redman,for eathly Newpot;agd llford; -Tw Br.otherAs, lawces,fos Lysixowjvithbinehol V pirI-ihzure;vi etsey, Per-wIliti; iIudstry, ModA - ~tbi ,Al~t,,'enniugs, for.:IP1~iOntI, StIth Ltu .c*ar An'l? Hawkinss foirouck , + iii ballasto I _______~ ~ ~ ~~~Ai
Key factors affecting OCR accuracy
1. Mass production environment – impossible to hand-tweak every image, compromise between time and quality
2. Software – always improving and developing
3. Quality of text varies within a run – see images
4. Complexity of layouts and formats varies between 48 titles
5. Microfilm source – doesn’t affect this project as the microfilm is of a very good standard, but could in future projects
Why bother with OCR?
Calculating OCR character accuracy is time-consuming and ultimately misleading
Character accuracy vs Word accuracy
Word accuracy vs Significant Words
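The difference between these measures can be made concrete with a small sketch. The Python below is an illustration only, not the method used to produce the project's figures. It assumes a hand-corrected transcription is available for comparison, aligns the two texts with the standard-library `difflib.SequenceMatcher`, and treats any word outside a small, hypothetical stop-word list as "significant". The corrected reading of the sample fragment is also an assumption.

```python
from difflib import SequenceMatcher

# Hypothetical stop-word list: the slides do not define "significant word".
STOP_WORDS = {"a", "an", "and", "in", "is", "it", "of", "that", "the", "to", "we", "with"}


def matched_indices(truth_items, ocr_items):
    """Indices of truth_items that align with an identical item in ocr_items."""
    matcher = SequenceMatcher(None, truth_items, ocr_items, autojunk=False)
    hits = set()
    for block in matcher.get_matching_blocks():
        hits.update(range(block.a, block.a + block.size))
    return hits


def character_accuracy(truth, ocr):
    """Share of ground-truth characters recovered by the OCR text."""
    return len(matched_indices(truth, ocr)) / len(truth)


def word_accuracy(truth, ocr):
    """Share of ground-truth words recovered exactly."""
    words = truth.split()
    return len(matched_indices(words, ocr.split())) / len(words)


def significant_word_accuracy(truth, ocr):
    """Share of non-stop-word ground-truth words recovered exactly."""
    words = truth.split()
    hits = matched_indices(words, ocr.split())
    significant = [i for i, w in enumerate(words) if w.lower() not in STOP_WORDS]
    return sum(1 for i in significant if i in hits) / len(significant)


# OCR fragment from the "Jackson money" sample; the corrected reading is assumed.
truth = "the true Jackson money is now in circulation"
ocr = "the true Jackson money is now iacirculation"

print(f"character accuracy:        {character_accuracy(truth, ocr):.3f}")
print(f"word accuracy:             {word_accuracy(truth, ocr):.3f}")
print(f"significant word accuracy: {significant_word_accuracy(truth, ocr):.3f}")
```

In this fragment the OCR drops one space and misreads one letter, yet two of the eight words no longer match, and one of those is a word a researcher might search for. This is why a high character score can overstate how findable the content is.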
Why OCR?
Provides smallest level of access into the information
Size of project is such that detailed descriptions in the metadata are impossible
[Chart: accuracy by Newspaper Code, scale 50.0 to 100.0; series: characters, words, significant words, words with capital letter start]
They had the internet in 1816!
The Morning Chronicle (London, England), Saturday, May 18, 1816; Issue 14678
and a DVD in 1803!
The Morning Chronicle (London, England), Friday, June 10, 1803; Issue 10625
Why Good Quality OCR Matters
January 1874
Three ways to access information
By
1. Metadata — title, place of publication, dates of publication, issue number, number of pages, page quality rating, illustration indicator
2. Browsing — article images, page images, browse by issue or title
3. OCR — actual text of page as rendered by automatic OCR process
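A minimal sketch of how the three access routes might sit on one record, assuming a simple in-memory model rather than the project's actual XML schema. The class and field names are hypothetical, `issue_date` is a simplification of "dates of publication", and reusing the four labels from the OCR sample slide as the page quality scale is an assumption. Page counts, quality ratings and OCR text in the sample records are placeholders, not real data.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum


class PageQuality(Enum):
    # Assumed scale, borrowed from the labels on the OCR sample slide.
    EXCEPTIONAL = "Exceptional"
    GOOD = "Good"
    POOR = "Poor"
    WORTHLESS = "Worthless"


@dataclass
class PageRecord:
    title: str
    place_of_publication: str
    issue_date: date  # simplification of "dates of publication"
    issue_number: int
    number_of_pages: int
    page_quality: PageQuality
    has_illustration: bool
    ocr_text: str = ""


def search(records, *, title=None, year=None, text=None):
    """Filter by metadata (title, year) and/or a case-insensitive OCR text match."""
    results = []
    for record in records:
        if title is not None and record.title != title:
            continue
        if year is not None and record.issue_date.year != year:
            continue
        if text is not None and text.lower() not in record.ocr_text.lower():
            continue
        results.append(record)
    return results


# Titles, dates and issue numbers come from the slides; everything else is made up.
records = [
    PageRecord("The Morning Chronicle", "London, England", date(1816, 5, 18), 14678,
               number_of_pages=4, page_quality=PageQuality.GOOD,
               has_illustration=False,
               ocr_text="placeholder report of parliamentary proceedings"),
    PageRecord("The Morning Chronicle", "London, England", date(1803, 6, 10), 10625,
               number_of_pages=4, page_quality=PageQuality.POOR,
               has_illustration=False,
               ocr_text="placeholder shipping intelligence"),
]

# Browsing by title, narrowing by metadata, and searching the OCR text.
print(len(search(records, title="The Morning Chronicle")))
print([r.issue_number for r in search(records, year=1816)])
print([r.issue_number for r in search(records, text="SHIPPING")])
```

Metadata and browsing only reach as far as the issue or page. The `text` filter is the only route into what the page actually says, which is why its usefulness depends on OCR quality.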
Storage
TIFF
Or
JPEG2000
Costs
British Newspapers 1800-1900 Budget re-categorised to show set up costs 31 August 2006
Setup, 2%
Website, 5%
Salary, 25%
Ongoing Overheads, 5%
Microfilming, 23%
Digitisation, 40%
Summary
Access is determined by
– The available technology e.g. OCR, document structure analysis
– By the size of the project – mass production environment is a limitation; no hand tweaking
– By the source material – there are limitations with poor source material
This project has been a trailblazer: complex and challenging.
We have learnt a great deal about how to give users better, quicker and fuller access to the content.