Media Service Highlights - ATAPY Software


Upload: vuongdung

Post on 18-Sep-2018


ATAPY Media Service Department was established to help libraries, data archives, publishing houses, and other information-intensive organizations with digitization and electronic publishing. For materials dating back many decades or even centuries, digitization is synonymous with preservation and successful dissemination of cultural values.

ATAPY has worked with a wide variety of materials, including books in old European languages, backlogs of periodicals (wide-format ones among them), theater scripts, and more. However complicated the material is - featuring pale or uneven print, symbols from multiple languages on one page, outdated fonts, scientific formulae, and other elements that are obstacles for the majority of modern OCR systems - ATAPY has the means and resources to transform the material into a searchable, accessible and well-structured electronic archive that is safe from fire, flood, or the uncivilized reader.

ATAPY also carries out mass data capture from standard structured and semi-structured documents (forms). The sensitive nature of information contained in forms often requires exceptionally high OCR accuracy (e.g., financial documents or educational testing), which cannot be achieved without manual verification and data validation. ABBYY FormReader and ABBYY FlexiCapture technology, strengthened by ATAPY's engineering experience and backed up by a pool of qualified operators, ensures the required accuracy in practically any European language, with minimal manual intervention. ATAPY has hands-on experience in the development of custom data validation tools and export modules that allow export to third-party information and Document Management systems, pre-OCR image enhancement tools, and other technical means that enable smarter and more error-free data entry, as compared to the traditional brute-force approach.

ATAPY Software was established in 2001 with active participation from ABBYY Software House, the manufacturer of the FineReader OCR product family. ATAPY focuses on custom software development in the fields of OCR and data capture, document imaging, document management and computer linguistics.

Media Service (scanning, recognition, data entry, proofreading, formatting, XML markup, etc.) is an important part of the company's success. Compared to conventional media service bureaus, ATAPY is able to offer better results in a shorter time by developing on-demand software tools that allow ATAPY to streamline or even fully automate certain jobs. This approach is especially efficient in large-scale digitization projects, allowing for a significant reduction in the amount of manual labor and ensuring the highest quality within reasonable digitization budgets.

Media Service Highlights


Another strong point about ATAPY’s Media Service is the ability to handle material with complex layout, such as newspapers and magazines. ATAPY has created digital archives for several European publications issued in English, Danish, German, and Swedish languages.

This experience led to ATAPY’s development of the Smart Newspaper Page Zoning Tool - a specialized page segmentation product targeted at newspaper-type layouts. This tool has been continuously developed and improved over a number of years based on ATAPY’s work on various periodicals, especially focusing on old editions (first half of the XX century). A distinctive feature of the tool is a flexible set of parameters that affects the segmentation process. Users can tune up the tool performance to achieve the best results on a particular type of material.

The intelligence accumulated in this development allows the tool to correctly identify column borders and difficult headings, and to sort out decorative and layout-specific elements - such as frames and separators - that often mislead modern OCR systems. This significantly reduces the manual labor required to correct the segmentation produced by most other OCR systems.

These are actual examples of one and the same newspaper page natively segmented by ABBYY FineReader and segmented by ABBYY FineReader with the help of ATAPY’s Smart Newspaper Page Zoning Tool. Segmentation using ATAPY’s product is visibly more accurate and requires no correction.

Figure panels: ABBYY FineReader; ABBYY FineReader + Smart Newspaper Page Zoning Tool

The way it works at ATAPY Media Service: services and techniques

I. Scanning

ATAPY is well equipped to provide scanning services. For Metzler Verlag, a German publishing house, ATAPY carried out scanning and recognition of the 85-volume Pauly's Encyclopedia of Antiquity (Realencyclopädie der classischen Altertumswissenschaft). 59,500 pages were scanned in high-resolution grayscale mode and stored on 198 CD-ROMs.

The University of Innsbruck entrusted ATAPY with a batch of XIX-century Austrian books for high-resolution scanning. Due to the high value of the books, they were shipped back and forth via courier mail; even so, the postal charges did not outweigh the cost savings that the University gained by outsourcing the task to ATAPY.

Scanning services can be provided at the recently opened ATAPY Sales and Technical support office in Munich, Germany that brings ATAPY closer to many of its customers. When reasonable, the material can be scanned locally and sent to ATAPY over the Internet in a mutually agreed format.

II. Pre-OCR image processing

This phase is required when recognition quality suffers due to source image flaws such as noise (speckles of various kinds), page skew, colored or patterned backgrounds, etc. ATAPY uses a number of its own imaging tools, as well as third-party ones, to enhance the image quality prior to OCR and so make the automatic recognition phase as efficient as possible.
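One of the flaws named above, isolated speckles, can be illustrated with a small sketch. This is not ATAPY's tool, whose internals the brochure does not describe; it is a minimal example that assumes the page is already binarized into a grid of 0 (background) and 1 (ink), and the function name `despeckle` and its `min_neighbors` parameter are hypothetical.

```python
from typing import List

Image = List[List[int]]  # assumption: 1 = ink pixel, 0 = background


def despeckle(image: Image, min_neighbors: int = 1) -> Image:
    """Return a copy of `image` with isolated ink pixels removed.

    An ink pixel is dropped when fewer than `min_neighbors` of its
    8 surrounding pixels are also ink.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    cleaned = [row[:] for row in image]
    for y in range(height):
        for x in range(width):
            if image[y][x] != 1:
                continue
            neighbors = 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    if dy == 0 and dx == 0:
                        continue
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < height and 0 <= nx < width:
                        neighbors += image[ny][nx]
            if neighbors < min_neighbors:
                cleaned[y][x] = 0
    return cleaned


# A short horizontal stroke plus one stray speck in the bottom-right corner.
page = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1],
]
cleaned_page = despeckle(page)
```

Real scans would also need skew correction and background removal, which this sketch leaves out.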

III. Recognition

Scanned images are recognized using ABBYY FineReader OCR/ICR technology. Sometimes additional programming effort is required to tune up and customize ABBYY products. This may happen if the customer has specific requirements not covered by “off-the-shelf” products, or if the material exhibits special characteristics, such as unusual fonts, unsupported language dialects, non-standard symbols and characters, specific layout, and the like.

A good example is our contribution to the international Meta-E initiative - a project undertaken by a consortium of 14 universities from 7 European countries and the US and co-funded by the European Commission. ATAPY's part of the project included tuning ABBYY FineReader to work with text printed in old European languages. ATAPY implemented Language Models for five Old European languages to be used in ABBYY FineReader OCR dictionaries. These dictionaries, together with ABBYY's part of the project - adaptation of FineReader for reading Frakturschrift (a Gothic black-letter typeface typical of old European books) - led to the creation of ABBYY FineReader XIX, a product introduced by ABBYY as a specialized FineReader version targeted at old printed sources.

Since then, ABBYY FineReader XIX has become one of the main tools ATAPY uses when working with old material and sources printed in Fraktur. It has proven its efficiency on a number of projects for European organizations.

IV. Verification and QA

Despite the intelligence applied during the pre-OCR phase and powerful ABBYY OCR technology, in many cases - especially with difficult materials such as old print or scientific content - a human eye is required to back up machine recognition. ATAPY employs a number of professional multilingual operators well-trained in proofreading/correction techniques. When ultimate quality of recognition is a project requirement, the double verification technique is used, which means that each page is verified by two independent operators. Such an approach drastically improves the results as compared to ordinary, single verification, and makes it possible for ATAPY to achieve quality rates as high as 99.997% (the figure obtained by ATAPY QA Department in one of the projects).
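The double verification idea can be sketched as a comparison of two independently corrected transcripts, with every disagreement flagged for a final decision. The brochure does not say how ATAPY reconciles the two passes, so the function name `find_disagreements` and the use of Python's standard `difflib` module are assumptions made for illustration.

```python
import difflib
from typing import List, Tuple


def find_disagreements(first: str, second: str) -> List[Tuple[int, str, str]]:
    """Compare two operators' versions of the same page.

    Returns (offset in first text, first operator's text, second operator's
    text) for every span where the two versions differ.
    """
    matcher = difflib.SequenceMatcher(a=first, b=second, autojunk=False)
    conflicts = []
    for tag, a_start, a_end, b_start, b_end in matcher.get_opcodes():
        if tag != "equal":
            conflicts.append((a_start, first[a_start:a_end], second[b_start:b_end]))
    return conflicts


# Hypothetical example: the two operators disagree on one character.
conflicts = find_disagreements("Alterthumskunde", "Allerthumskunde")
```

A mistake survives this check only if both operators make it at the same place, which is the intuition behind the improvement over single verification.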

ATAPY's engineering strengths and experience come into play again by supplying handy custom utilities that help operators automate routine tasks, such as inserting non-keyboard characters or fixing OCR mistakes specific to certain types of material. Over years of providing Media services, ATAPY has developed an arsenal of such utilities, which are reused in new projects.

ABBYY

ATAPY Software

Engineernaya Street, 16, 630090 Novosibirsk, Russia
Tel. +7 383 36 39 699
www.atapy.com [email protected]

V. Pre-publishing

If required, ATAPY converts the entire mass of recognized and verified material into any specific data format: multi-layer PDF, XML/XHTML, database, etc. - including those not offered by off-the-shelf OCR products. Most of ATAPY's tools and algorithms used at the preceding phases are already optimized for subsequent XML conversion; therefore, this phase is largely automatic. In some cases additional processing is required, such as specific XML markup, DTP layout correction, etc.

An example of work in this area is the project for the Royal Danish Library, in which ATAPY converted 230 old Danish books forming the entire Danish Literary Canon, into XML. ATAPY marked up the text with XML tags and validated the resulting files against the customer's XML schema.

In the Landolt-Börnstein Encyclopedia digitization project, which ATAPY Software carried out for Springer Verlag, the material was converted into a customer-specific XML format known at Springer as A++.

VI. Additional IT services

ATAPY’s services also include enhancing third-party Electronic Record Management Systems and Document Management Systems with OCR modules based on ABBYY OCR toolkits.

For PRNet, a media monitoring agency in Turkey, ATAPY provided such integration, and also enhanced PRNet's web application with a number of additional features and modules, such as Statistics and Reporting, Web-based Administration, etc.

ATAPY’s head office is located in Novosibirsk, Russia. Novosibirsk has long been recognized as one of the largest Russian hubs of scientific research, including Artificial Intelligence technology. The first industrial ICR system for reading ZIP codes, created back in the 1970s and still used by the Russian postal service, was developed in Novosibirsk. At the same time, our Novosibirsk location allows us to offer services at attractive prices that fit into the budgets of libraries and public organizations.

The recently opened Sales and Technical support office in Munich, Germany, provides for closer interaction between ATAPY Software and its European clientele. In addition, by signing a contract with the German company ATAPY Software GmbH, customers can eliminate concerns associated with foreign contracts.

Location

Elsenheimerstrasse 47, 80687 Munich, Germany
Tel. +49 89 5111 5968 [email protected]

ATAPY Software GmbH

©2010-2016 ATAPY Software. All rights reserved. ABBYY and ABBYY FineReader are registered trademarks of ABBYY Software House.

All the other trademarks are the property of their respective owners.

Digitization of the Landolt-Börnstein Encyclopedia

Springer Verlag, the largest European scientific publisher, and ATAPY Software undertake a joint project that will benefit the world’s scientific community

For the last 6 years, ATAPY Software has been engaged in ongoing cooperation with the world’s second-largest scientific publishing house, Springer Verlag, a company in the Springer Science+Business Media group.

In 2003, Springer was searching for a contractor to digitize the Landolt-Börnstein Numerical Data and Functional Relationships in Science and Technology, New Series, a modern edition of the 180,000-page encyclopedia of chemistry, physics and technology. Digital conversion of scientific material presents a number of specific challenges. Before coming to ATAPY, Springer had made several digitization attempts, but they were not efficient enough, mostly due to out-of-date technology.

Cooperation started with a pilot project in which ATAPY converted several excerpts from the Encyclopedia to text format. This successful start grew into a solid partnership. Throughout this period, ATAPY has converted 166 volumes of Landolt-Börnstein to XML, each one containing 500 pages, and more volumes are still to come.

The most serious challenge was the large amount of purely scientific data (formulae, tables, etc.) that contained special symbols missing from the Unicode character map. This issue was partially overcome by the creation of special dictionaries inside ABBYY FineReader, the software package used for full-text recognition, and by implementing a specialized program to help operators promptly insert non-keyboard symbols during the verification phase.

H. Landolt's reference book "Physikalisch-chemische Tabellen" (first issued in 1883 in Germany) presents physicochemical constants of organic and inorganic substances in a tabular format. The totally remade 6th edition, named "Landolt-Börnstein Zahlenwerte und Funktionen aus Physik, Chemie, Astronomie, Geophysik und Technik", was issued between 1950 and 1980.

The appearance of new methods of research inspired the release of the "New Series", a reference book named "Landolt-Börnstein. New Series. Numerical Data and Functional Relationships in Science and Technology". Since 1961 more than 150 volumes have been issued.

One of Springer’s major requirements is an accuracy level above 99.99%, which means less than one mistake per 10,000 characters.
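The arithmetic behind this requirement can be checked with a short sketch. The function names are hypothetical and the sample counts are made up for illustration; exact fractions are used so that the boundary case (exactly one mistake per 10,000 characters) is not blurred by floating-point rounding.

```python
from fractions import Fraction

# Target from the text: accuracy strictly above 99.99%.
TARGET = Fraction(9999, 10000)


def accuracy(errors: int, total_chars: int) -> Fraction:
    """Share of correctly captured characters, as an exact fraction."""
    return 1 - Fraction(errors, total_chars)


def meets_target(errors: int, total_chars: int, target: Fraction = TARGET) -> bool:
    """True when accuracy is strictly above the target."""
    return accuracy(errors, total_chars) > target


# Exactly one mistake per 10,000 characters is 99.99%, which is not above it.
borderline = meets_target(1, 10_000)
# Illustrative figures only: 3 mistakes in 40,000 characters is 99.9925%.
sample_ok = meets_target(3, 40_000)
sample_percent = float(accuracy(3, 40_000) * 100)
```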

The goal has been consistently achieved with the help of the excellent ABBYY OCR technology, adjusted for this particular task by ATAPY’s engineers, and the meticulous verification work of the company’s Media Service Department.

ATAPY’s developers have automated the conversion of reference lists following each chapter of the edition to A++XML, the customer-specific XML format.

The LB volumes already converted cover such subjects as Elementary particles, Nuclei and atoms, Molecules and free radicals, Condensed matter, Physical chemistry, Geophysics, Astronomy and Astrophysics, and Biophysics. They are available online in PDF format at Springer’s Online Landolt-Börnstein resource, which offers an extended search engine with keyword and substance/property index-based search.

Springer Science+Business Media, or Springer, is a worldwide publishing house based in Germany that publishes textbooks, academic reference books, and peer-reviewed topical journals, with a focus on science, technology, mathematics, and medicine. Within the science, technology, and medicine sector, Springer is the largest book publisher and the second-largest journal publisher worldwide, with over 60 publishing houses, 1,900 journals, 5,500 new books published each year, sales of 924 million euro (in 2006) and 5,000 employees. Springer has major offices in Berlin, Heidelberg, Dordrecht (the Netherlands) and New York.

Workflow diagram: Scanning; Segmenting, recognition; Double proofreading; Layout correction, formulae editing; PDF document; Online publishing

Springer Science+Business Media
Heidelberger Platz 3, 14197 Berlin, Germany
Tel. +49 6221 4870, Fax +49 6221 3450
www.springer.com

"I am happy to have such long-term cooperation with ATAPY Software in this complex matter. They’re doing an excellent job preparing Landolt-Börnstein for the future. I thank ATAPY’s team for all the extensive work done, and I’m ready to continue this partnership," said Dr. Rainer Poerschke, Head of the Landolt-Börnstein Department (2008), Springer-Verlag.

Electronic dictionaries and translation systems are an area of great practical importance in the ever-globalizing world. ABBYY Software House, the world leader in OCR/ICR and linguistic technologies, develops and sells Lingvo electronic dictionaries. For many years Lingvo has been known as the best English-Russian dictionary on the market. Version 8.0 supported three more languages. For the next version, it was decided to use the world’s latest best-of-breed dictionaries that represent the modern state of supported languages.

ABBYY turned to ATAPY Software, its outsourcing partner in Novosibirsk, for digitization of two dictionaries from the list selected by the Linguistics Department. The 3-volume 1,750-page Leping German-Russian Dictionary and the 830-page Narumov Spanish-Russian Dictionary were chosen to be recognized and proofread for automatic conversion to the Lingvo database.

The highest possible text recognition accuracy was obviously a requirement. A single mistake could break the alphabetical order of the words and tear a word away from its paradigm. If the number of mistakes went beyond a very modest threshold, the dictionary would be unsearchable. Proper interpretation of special dictionary marks was equally important for the project. They were used as field delimiters in the automatic database conversion process and had to achieve 100% recognition. Special marks appeared either as text characteristics (bold/italics), as special symbols (brackets, asterisks), or as a combination of the two (e.g., italic brackets indicated a dictionary comment). Omitting a single bracket or missing italics would break the article's structure. That is why the project required both intelligent programming and a highly qualified manual effort - a true challenge for any contractor in the media service sphere.
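To show why a single lost bracket matters when marks act as field delimiters, here is a minimal sketch of a bracket check that could run before database conversion. It is not ATAPY's converter, which the text does not describe; the function name, the set of bracket pairs and the sample entry are all hypothetical.

```python
from typing import Dict, List, Tuple

# Assumption: these are the bracket pairs used as delimiters.
PAIRS: Dict[str, str] = {")": "(", "]": "[", ">": "<"}
OPENERS = set(PAIRS.values())


def unbalanced_brackets(entry: str) -> List[int]:
    """Return the positions of brackets that have no matching partner."""
    stack: List[Tuple[int, str]] = []  # (position, opening bracket)
    problems: List[int] = []
    for position, char in enumerate(entry):
        if char in OPENERS:
            stack.append((position, char))
        elif char in PAIRS:
            if stack and stack[-1][1] == PAIRS[char]:
                stack.pop()
            else:
                problems.append(position)
    # Anything left on the stack was opened but never closed.
    problems.extend(position for position, _ in stack)
    return sorted(problems)


# Hypothetical dictionary line; the second copy has lost its closing bracket.
good_entry = "Haus n (-es, Häuser) house"
bad_entry = "Haus n (-es, Häuser house"
good_result = unbalanced_brackets(good_entry)
bad_result = unbalanced_brackets(bad_entry)
```

A check like this catches only symbol-level damage; lost bold or italic formatting would need the font attributes exported by the OCR engine.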

“ATAPY attained 99.992% text accuracy in the German-Russian Dictionary (1 mistake per 8,760 symbols), and 99.997% quality for the Spanish-Russian Dictionary project (1 mistake per 31,500 symbols). They also corrected many mistakes in the source dictionary text, including typographical misprints and even mistakes in special marks that are almost impossible to detect without special programming tools and a profound knowledge of linguistics.”

Anna Zhavoronkova, Project Manager, ABBYY Software House

Development of International Computer Dictionaries for ABBYY

The ABBYY Lingvo 8.0 product line includes ABBYY Lingvo 8.0 Multilingual Edition, ABBYY Lingvo 8.0 for Pocket PC, as well as an updated and expanded version of ABBYY Lingvo English-Russian Edition.

ABBYY Lingvo 8.0 Multilingual Edition supports eight translation directions: English-Russian, German-Russian, French-Russian, Italian-Russian, Russian-English, Russian-German, Russian-French, and Russian-Italian. This Edition of ABBYY Lingvo includes more than 40 dictionaries containing more than 2,400,000 entries.

ATAPY Software

©2013-2016 ATAPY Software. All rights reserved. ABBYY and ABBYY Lingvo are registered trademarks of ABBYY Software.

All the other trademarks are the property of their respective owners.

The dictionaries were scanned and automatically recognized with ABBYY FineReader OCR, which was specially tuned for processing this material. Then a team of qualified operators proofread and cross-checked the results using the double verification technique to ensure recognition accuracy. Double verification detected unexpected situations, such as typos in the source dictionary text, which were corrected according to ABBYY's guidelines.

In its effort to maximize automation of the proofreading work, ATAPY developed and customized a number of in-house utilities such as Glyphica, a tool for quick input of characters that are not found on the keyboard. For Leping’s Dictionary, ATAPY developed a custom converter with built-in spell-checking and punctuation-checking utilities that weeded out mistakes unnoticed in previous stages and finally converted the material into the Lingvo vocabulary database.

ABBYY

2b/6 Otradnaya Street 127273 Moscow, Russia Tel. +7 (495) 783 3700 www.abbyy.com [email protected]

ABBYY Software House (www.abbyy.com) is based in Moscow, Russia. The company was founded in 1989. ABBYY has over 880 employees, with offices in Russia, USA, Ukraine, UK, Germany, Taiwan, Japan and Cyprus. ABBYY has developed software in the fields of artificial intelligence, document recognition, data capture and applied linguistics. ABBYY is most notable for its optical character recognition software, FineReader.

Participation in the Development of FineReader XIX

ATAPY’s linguists created Language Models of five Old European languages for the FineReader XIX - an OCR system for the conversion of old European books to “modern” digital formats

Meta-E is a collaborative initiative established by a consortium of 14 universities from 7 European countries and the US that is co-funded by the European Union. The project is focused on providing a technology base for the digitization and web-publishing of valuable old printed sources spanning several centuries of European history. This required an OCR system capable of recognizing historical texts from the period 1800-1938, including those printed in Frakturschrift (an old-style black-letter typeface that was prevalent at the time). There were no omnifont Frakturschrift systems available: all OCR products had to be trained on each individual book before processing it. Meta-E coordinators started looking for a high-quality OCR package to augment according to their requirements. ABBYY FineReader was chosen due to its unrivalled recognition accuracy, support for 176 modern languages, and user-friendliness. ABBYY Software House, the international manufacturer of FineReader products, began work as a direct contractor to develop the omnifont element of the project (introducing the Frakturschrift graphics to FineReader). The linguistic part of the project was subcontracted to ATAPY Software, ABBYY's long-term partner in OCR and linguistic development.

ATAPY's role in the Meta-E project was constructing Old Language Models (LMs) for 5 European languages: English, French, German, Italian, and Spanish. An LM is a computer database that describes the vocabulary of a language. FineReader uses LMs during recognition to build OCR hypotheses and for spell-checking. LMs are not just full lists of words in all possible grammatical forms, because such a database would be enormous and hard to manage. FineReader's LMs store only the stem of each word and describe the grammar as a set of flexible rules (paradigms). Each stem is assigned a list of paradigms; applying them to the stem produces all possible forms of the word. ATAPY studied a large number of authentic dictionaries and original old European texts dating back to the targeted time period, reviewed the word stock, added the words that had been phased out of the languages, and corrected the paradigm assignments to synchronize the LMs with the actual grammatical practices used at the time.
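The stem-and-paradigm idea can be sketched in a few lines. FineReader's actual LM format is not described in the text, so the data structures, paradigm names and sample words below are all hypothetical; the sketch only shows how a short list of stems plus reusable ending rules expands into every word form.

```python
from typing import Dict, List, Set

# Hypothetical paradigms: each name maps to the endings it can attach to a stem.
PARADIGMS: Dict[str, List[str]] = {
    "noun_s": ["", "s"],
    "verb_regular": ["", "s", "ed", "ing"],
}

# Hypothetical stems, each assigned a list of paradigm names.
STEMS: Dict[str, List[str]] = {
    "book": ["noun_s", "verb_regular"],
    "walk": ["verb_regular"],
}


def expand(stem: str) -> Set[str]:
    """Apply every paradigm assigned to `stem` and collect the word forms."""
    forms: Set[str] = set()
    for paradigm_name in STEMS[stem]:
        for ending in PARADIGMS[paradigm_name]:
            forms.add(stem + ending)
    return forms


def build_vocabulary() -> Set[str]:
    """Expand all stems into the full set of recognizable word forms."""
    vocabulary: Set[str] = set()
    for stem in STEMS:
        vocabulary |= expand(stem)
    return vocabulary


vocabulary = build_vocabulary()
```

Adding a historical spelling then means adding one stem or one paradigm rather than listing every inflected form by hand.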

To complete this task, ATAPY's linguists carefully selected 10 dictionaries published between 1808 and 1930 that reflected the state of the 5 languages. ATAPY also thoroughly analyzed 105 authentic books from that period, comprising more than 50 MB of text. The next step was to build FineReader LMs. ATAPY's linguists manually compared the information from the authentic dictionaries and texts (about 500,000 entries overall) with the existing FineReader vocabularies. This work amounted to a total of 458,767 words, of which 61% remained unchanged and 36% were added to the vocabularies from the analyzed sources. About 3% of the words had their paradigms corrected according to XVIII to early XX century grammar rules; to make this correction, the linguists added 159 historical grammar paradigms that were missing in the contemporary models.

Finally, the LMs were compiled and tested on the control text corpus. They demonstrated 98.91% vocabulary coverage for Old English, 99.16% for Old French, 96.58% for Old German, 98.58% for Old Italian, and 98.79% for Old Spanish.
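A vocabulary coverage figure of this kind can be computed as the share of corpus words found in the vocabulary. The text does not say whether ATAPY counted running words or distinct words, so the sketch below assumes running words, and the function name and sample data are hypothetical.

```python
from typing import Iterable, Set


def vocabulary_coverage(tokens: Iterable[str], vocabulary: Set[str]) -> float:
    """Percentage of corpus tokens found in the vocabulary (case-insensitive)."""
    token_list = list(tokens)
    if not token_list:
        raise ValueError("the control corpus is empty")
    known = sum(1 for token in token_list if token.lower() in vocabulary)
    return 100.0 * known / len(token_list)


# Illustrative data only: one of four tokens is missing from the vocabulary.
sample_coverage = vocabulary_coverage(
    ["The", "olde", "book", "the"], {"the", "book"}
)
```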

“I'd have the FineReader XIX installed here on my computer. The Frakturschrift recognition is very good. Even though old text recognition is not a large and growing market, I am sure all the service bureaus here in Germany will be ordering 1 or 2 copies and have it run 7x24”

Johannes Stöpetie, CEO, ABBYY Europe GmbH

To illustrate the above, let's look at a few examples where the regular FineReader package, or any other contemporary OCR system, will make a lot of mistakes. 'Alterthumskunde' may become 'Allerlhumskunde' in the first fragment and in the second fragment, 'UEBERSICHT' ('Übersicht' in modern German) gets recognized as two words 'UEBER SICHT', etc. These mistakes happen for two reasons. The first is poor printing quality and there is no way to improve it at this stage. The second is the old spelling used in the incorrectly recognized words. All existing OCR systems are targeted at modern texts and therefore only know modern spelling.

Once the 5 LMs were merged into the FineReader 7 shell, ABBYY was able to offer a specialized product that "knows" the spelling specifics of old European languages - FineReader XIX. There is much less chance that this product will make mistakes in areas similar to those mentioned above. Users are now able to OCR old texts with higher quality and save a lot of time that was previously spent on error correction.

ABBYY Europe GmbH is a European department of ABBYY based in Munich, Germany. ABBYY Software House is the manufacturer of software products in the fields of artificial intelligence, document recognition and applied linguistics. One of the most notable products by ABBYY is the optical character recognition package ABBYY FineReader.

ABBYY FineReader XIX has become a powerful tool assisting the Meta-E consortium in its large-scale digitization work. In addition, as the industry's first boxed OCR product to recognize Renaissance and Late Medieval sources, it is specially targeted at European libraries and public organizations engaged in the preservation and publication of cultural assets.

©2010-2016 ATAPY Software. All rights reserved. ABBYY, ABBYY FineReader and FineReader XIX are registered trademarks of ABBYY Software House.

All the other trademarks are the property of their respective owners.

ABBYY Europe Software House
Elsenheimerstrasse 49, 80687 Munich, Germany
Tel. +49 89 5111590 [email protected] www.abbyyeu.com

ATAPY Helps a UNESCO School in a Transatlantic Slave Trade Education Project

The Fredensborg is the best-documented wreck of a Transatlantic slave trade ship located to date.

The ship left Copenhagen in June 1767 and traded for 265 slaves at Danish-Norwegian forts along the Gold Coast of Africa. About 10% of the slaves died during the ship's middle passage, and over one-third of the crew also died during the voyage. The Dutch sold their human cargo at St. Croix, and then loaded the ship with sugar, tobacco, and other tropical products for the return trip. The ship had almost reached its destination when it was wrecked during a violent storm.

On September 15, 1974, divers Odd K Osmundsen, Tore Svalesen and Leif Svalesen discovered wreckage and giant elephant tusks at the bottom of the sea near Tromoy, off the southern coast of Norway. Along with the ivory, cannons, ship timber and other interesting objects were found. Almost everything was hidden under layers of seaweed, rocks, and sand. However, as a result of thorough planning and intense study of old documents from the archives, the three divers knew exactly what they had found.

The Danish-Norwegian slave ship Fredensborg that sank on December 1, 1768, was a typical ship engaged in the Triangular Trade.

Triangular Trade is the name given to the trading route used by European merchants who exchanged goods with Africans for slaves, shipped the slaves to the Americas, sold them and brought goods from the Americas back to Europe. Ships left Europe with cargoes carrying a broad assortment of goods considered suitable for the slave trade. Once anchored at the forts, the interiors of the ships were rebuilt to accommodate enslaved Africans.

In September 2003, ATAPY was contacted by Mr. Jeff Klinto, an educator at the Vesthimmerlands Gymnasium. This UNESCO school participates actively in the Transatlantic Slave Trade Education Project. The project was supported by the Danish UNESCO Committee and The Digital North Denmark (Det Digitale Nordjylland) project.

As a part of the project, Mr. Klinto initiated the creation of a CD-ROM with teaching materials about Danish involvement in the Triangular Trade. The CD-ROM had to contain contemporary materials as well as materials from the age when the use of Gothic letters was common. That is where the project faced a challenge.

Det Digitale Nordjylland

Mr. Klinto explained,

"Gothic letters caused a number of problems in relation to the OCR programs that are currently available on the market. These problems led us to the Royal Library in Copenhagen. They recommended we approach the Russian company ATAPY Software, that handled the Royal Library's Gothic materials. However, the thought of having Gothic texts that were written in Danish, handled in Novosibirsk by Russian employees seemed unrealistic. The materials were from a period when there was no national orthography so the dictionaries in the OCR program would be useless. How would they be able to work on this? Add to this the very uneven quality of the printing in the old works, and the task seemed impossible."

©2003-2016 ATAPY Software. All rights reserved.

Projekt 202, Att. Jeff Klinto, Vesthimmerlands [email protected]


Old Danish books page samples

Photo: Reconstruction of a slave ship by students of Vesthimmerlands Gymnasium

The Media Service Department at ATAPY accepted the challenge: it recognized, proofread and exported to HTML more than 5,500 pages of Old Danish books.

Mr. Klinto goes on to emphasize the quality of results,

The Transatlantic Slave Trade Project aims at breaking the silence surrounding the Transatlantic Slave Trade. By learning about the past, young people can fully understand the present and prepare a better future together in a world free of all types of enslavement, injustice, discrimination and prejudice.

UNESCO and the UNESCO logo are copyrights of UNESCO - the United Nations Educational, Scientific and Cultural Organization.

All trademarks used are the property of their respective owners.

"The materials that were returned were of a high standard, and ATAPY was incredibly obliging and helpful. The communication with project leadership functioned excellently and the team worked wonders with materials that were often of poor quality. Therefore, I sincerely recommend this company. Thanks to the competent staff of ATAPY, it is now possible for the public to have access to materials that may not be issued at libraries anymore because of their age and rarity. It is also worth noting that the work was done for a very reasonable price.”

ATAPY Software

Engineernaya Street, 16 630090, Novosibirsk, Russia Tel. +7 383 36 39 699 www.atapy.com [email protected]

ATAPY Software converts the entire Danish Classic Literature Canon into XML

Det Kongelige Bibliotek, København og Arkiv for Dansk Litteratur

The Royal Danish Library in Copenhagen (http://www.kb.dk) has the largest book collection in Northern Europe and strives to facilitate access to its resources using advanced technologies. As part of this effort, the Library launched an ambitious project, "Danish Literature Archive", to convert the entire Danish literary canon (the works of 70 selected Danish authors from the XIth to the early part of the XXth century) into computer text, specifically into XML format, so that the works would be available on the web.

The large number of books, their diverse content (verse, prose, pictures, tables, notes and comments) and the preservation requirements for layout and typesetting made this an especially challenging job. To make it possible, the contractor would have to possess seemingly incompatible qualities. On the one hand, the company had to be competent with modern Optical Character Recognition packages, proficient in XML coding and capable of designing specialized software instruments to facilitate the conversion process. This required high IT qualifications and extensive hands-on experience in data capture technologies. On the other hand, almost all real-life mass data input projects still involve a lot of manual labor. No matter how accurate an OCR system is, it will make mistakes - especially when working with such difficult material as old books with complex layouts. Another issue with the Library’s material was that full automation of XML coding was not possible because of the diversity of attributes. The contractor had to be able to provide many qualified operators at a reasonable price or the project cost would exceed the financial capability of any library.

The Library IT staff looked for a partner outside the EU to solve these problems. They were attracted to Russia because it is the home of ABBYY FineReader, the world-renowned OCR system. Following a trial process lasting several months, the Library selected ATAPY Software, a leading developer of custom OCR solutions based on FineReader technologies and an experienced media service provider. The pilot projects demonstrated that ATAPY combined high IT professionalism with access to an extensive pool of qualified multi-lingual operator resources.

"Working with ATAPY has been a pleasure. We were impressed with the degree of attention paid to producing the best possible text of the works and the accuracy of the results.”

Virginia Laursen, Webmaster

Royal Danish Library

The Royal Danish Library in Copenhagen is the national library of Denmark and the largest library in the Nordic countries. It contains numerous historical treasures and all works printed in Denmark since the XVIIth century are available there. Thanks to past donations, the library houses most of the known Danish printed works since the printing of the first Danish book in 1482.

P.O. Box 2149 DK-1016 Copenhagen, Denmark Tel. +45 33 47 47 47 www.kb.dk [email protected]

The Royal Danish Library (Det Kongelige Bibliotek)

The book conversion process was organized into three key phases:

1. Reading scanned images into text format. The Library provided ATAPY with scanned pages in TIFF format. The quality of the images was remarkably good, which contributed substantially to the efficiency of the remaining stages. ABBYY FineReader automatically analyzed the images, segmenting them to distinguish text from pictures and revealing the table structure. Layout operators reviewed the segmentation results. Pages were recognized using FineReader's omnifont capabilities, augmented with many font-specific patterns, which raised the recognition quality for most of the old books. A group of operators then proofread the OCR results. Special attention was given to non-Danish texts, as some of them could not be OCRed at all (Old Greek, Hebrew, etc.).

2. Preparation of initial XML documents. Verified text was exported to Microsoft Word format. XML operators, armed with an arsenal of custom tools and macro programs, used Microsoft Word as the environment for adding XML tags. This was a meticulous task since the full list of tags contained over 50 entries and only half of them could be identified automatically. The remaining half had to be spotted and marked manually in the Danish-language text.

3. Assembly of book XML files. Once markup was finished, XML specialists assembled the books adding supplementary "entire-book" tags and bibliographic information.
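Phase 3 can be pictured with a minimal sketch. The tag names (`book`, `bibl`, `chapter`), the metadata fields and the helper `assemble_book` are hypothetical and chosen only for illustration; the Library's actual schema of over 50 tags is not reproduced in this brochure.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List


def assemble_book(metadata: Dict[str, str], chapter_fragments: List[str]) -> ET.Element:
    """Wrap already marked-up chapter fragments in a single book element."""
    book = ET.Element("book")  # hypothetical "entire-book" tag

    # Bibliographic header: one child element per metadata field.
    bibl = ET.SubElement(book, "bibl")
    for field, value in metadata.items():
        ET.SubElement(bibl, field).text = value

    # Each fragment is the XML produced for one chapter in phase 2.
    # ET.fromstring raises ParseError if a fragment is not well-formed,
    # which catches unbalanced tags before the book is assembled.
    for fragment in chapter_fragments:
        book.append(ET.fromstring(fragment))

    return book


if __name__ == "__main__":
    demo = assemble_book(
        {"author": "Example Author", "title": "Example Title"},
        ["<chapter><head>I</head><p>First paragraph.</p></chapter>"],
    )
    print(ET.tostring(demo, encoding="unicode"))
```

Parsing every fragment before appending it means a markup slip made in Microsoft Word surfaces at assembly time rather than after delivery.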

Being both a software company and a media service company, ATAPY was able to dispatch experienced customization engineers and develop project-specific program utilities for every conversion phase. This allowed ATAPY to decrease the processing time by 10 to 20% as the project evolved and to pass the savings along to the client. The project was successfully completed and all the books are available online at http://www.adl.dk.

After years in the sphere of media service, ATAPY has become an expert in the field by working with texts that had different layouts, structures and languages. These included library cards, encyclopedia articles, magazine publications, rarities dating back to the XIXth century and other materials representing all genres and formats. In addition to the Royal Danish Library, the list of ATAPY's media service clients includes Springer Publishing House (Germany), University of Innsbruck (Austria), J.B. Metzler Verlag (Germany), EasyData B.V. (Netherlands), Consodata (France) and PRNet (Turkey), among other institutions and companies. ATAPY utilizes a highly effective data capture process that relies on both IT infrastructure and human resources. High-speed, high-quality multi-language material processing, client communication in four languages and very affordable pricing are ATAPY's trademarks, applied to every contract, big or small.

©2010-2016 ATAPY Software. All rights reserved. ABBYY and ABBYY FineReader are registered trademarks of ABBYY Software House.

All the other trademarks are the property of their respective owners.

Danish page samples


ATAPY acquires new clients in the field of media services

ATAPY prides itself on being able to handle the most challenging data capture tasks by employing its intelligent digitization approach. The approach involves using specialized software tools during all phases of the process to ensure high accuracy of results, with minimum manual effort.

One of the recent projects involving intelligent text digitization was conducted for "Nordic Sounds", a Danish musical magazine that brings together coverage of a range of musical genres in contemporary Northern European music. The magazine is widely distributed outside Denmark and therefore published in English.

"Nordic Sounds" editors requested the creation of a digital archive for all issues up to the current publication. The resulting archive could not just be a collection of digitized texts, but had to serve as a true archive that could be searched and structured. This goal was perfectly achievable using XML as an output format.

The task was given to ATAPY’s Media Service Department. In order to meet the customer's requirements, the department specialists had to create one XML file per article, which required analysis of the magazine content. Although the majority of magazine materials were in English, a lot of proper names and quotations were in Northern European languages (Danish, Swedish, Norwegian, Finnish, and Icelandic). This peculiarity required special attention from the engineer-linguists who worked on "Nordic Sounds" digitization and XML conversion. Past and current experience in processing multi-language information sources (Danish in particular) was of great use.
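The "one XML file per article" requirement, together with the mixed-language content, might look roughly like the sketch below. The element names, the file-naming pattern and both helper functions are assumptions made for illustration; the schema actually agreed with "Nordic Sounds" is not given here.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Tuple

Paragraph = Tuple[str, str]  # (text, language code such as "en" or "da")


def article_to_xml(title: str, paragraphs: List[Paragraph], main_lang: str = "en") -> ET.Element:
    """Build one <article> element; paragraphs in another language get xml:lang."""
    article = ET.Element("article", {"xml:lang": main_lang})
    ET.SubElement(article, "title").text = title
    for text, lang in paragraphs:
        p = ET.SubElement(article, "p")
        p.text = text
        if lang != main_lang:
            # Flag quotations and names in other languages for linguistic review.
            p.set("xml:lang", lang)
    return article


def issue_to_files(issue_id: str, articles: List[Tuple[str, List[Paragraph]]]) -> Dict[str, str]:
    """Map a file name to the serialized XML of each article in one issue."""
    files: Dict[str, str] = {}
    for number, (title, paragraphs) in enumerate(articles, start=1):
        name = f"{issue_id}_{number:03d}.xml"  # hypothetical naming pattern
        files[name] = ET.tostring(article_to_xml(title, paragraphs), encoding="unicode")
    return files
```

Marking the language on each paragraph keeps the archive searchable per language and tells a proofreader which passages need a Danish, Swedish or Icelandic speaker.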

ATAPY Software achieved the goal on time (the project lasted approximately two months) with excellent quality results, as noted by "Nordic Sounds".

"All files validated against the schema, very nice! I took a closer look at a random selection of files, and was very impressed by the quality of your work! The quality of the meta-data as well as OCR-treated text is excellent, so, I think we can regard this as "mission accomplished" and sign the Act of Acceptance."

Henning Olesen, IT Project Manager, The State and University Library, Universitetsparken, DK-Aarhus

Backlog Conversion of Danish Musical Magazines

Thanks to ATAPY, the Nordic Sounds magazine archive has been available online since May 2005 as part of the Online Music Research Library (www.dvm.nu).

After such a successful start, ATAPY digitized backlogs for two more popular Danish musical magazines: "MM" and "GAFFA". The GAFFA magazine archive (1983 up to 2008) is also published online with full keyword search and original page image retrieval.


Aarhus University

Nordre Ringgade 1, 8000 Aarhus C Tel. +45 8942 1111 Fax: +45 8942 [email protected] www.au.dk


Aarhus University (www.au.dk), located in the city of Aarhus, Denmark, is Denmark's second oldest and second largest university (after the University of Copenhagen). The University was founded in 1928 and has an annual enrollment of more than 35,000 students. Aarhus University housed Denmark's first Professor of sociology (Theodor Geiger, 1938–1952), and in 1997 Professor Jens Christian Skou received the Nobel Prize for Chemistry for his discovery of the sodium-potassium pump.


GAFFA and MM covers

GAFFA spread samples


ATAPY Media Service Operations: Helping Preserve Swedish Cultural Heritage

In this project, ATAPY processed more than 12,000 pages in Old Swedish. Double verification was used in almost 50% of the material to ensure high recognition accuracy and excellent text searchability. All the digitized material is now available online on one of Riksteatern’s web sites in Microsoft Word and PDF formats.

The Swedish National Touring Theatre (Riksteatern Sweden) decided to convert its collection of Old Swedish plays into a digital format.

According to the project requirements, texts were to be converted to Microsoft Word with the original page design preserved as much as possible. Due to the material’s age and layout specifics, the job required an expert knowledge of OCR technology in addition to a heavy manual formatting effort.

Building an archive of Old Swedish plays for Riksteatern Sweden

Results

Riksteatern is the name of the popular "National Touring Theater"/"National Theater Company" in Sweden. Established in 1933 with the goal to promote and produce quality theater throughout Sweden, Riksteatern is now the largest touring theater company in Sweden. It is financed and owned by 240 local Swedish economic associations.

ATAPY converted a series of old Swedish printed sources dating back to the XVIII-XIX centuries into text format for the Gothenburg University Library.

This ongoing project, comprising several phases, currently exceeds 75,000 pages; approximately 65% of the material was subject to full verification. Material yielding better OCR results underwent partial verification (uncertainly recognized symbols only), which provided considerable cost savings. As the next step, ATAPY conducted a manual markup of files for subsequent conversion to XML.
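The difference between full and partial verification can be sketched as follows. The per-character confidence scores, the 0.90 threshold and the function names are assumptions made for illustration; this text does not describe how uncertain symbols were actually flagged.

```python
from typing import List, Tuple

RecognizedChar = Tuple[str, float]  # (character, OCR confidence between 0.0 and 1.0)


def chars_to_verify(line: List[RecognizedChar], threshold: float = 0.90) -> List[int]:
    """Return the positions an operator checks under partial verification."""
    return [index for index, (_, confidence) in enumerate(line) if confidence < threshold]


def manual_share(line: List[RecognizedChar], threshold: float = 0.90) -> float:
    """Fraction of the line that still needs a human look (1.0 means full verification)."""
    if not line:
        return 0.0
    return len(chars_to_verify(line, threshold)) / len(line)


if __name__ == "__main__":
    sample = [("S", 0.99), ("k", 0.95), ("å", 0.42), ("l", 0.97)]
    print(chars_to_verify(sample), manual_share(sample))
```

Under this model, well-printed material sends only a handful of positions to an operator, which is where the cost savings of partial verification would come from.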

Digitization of Old Prints Collection for Gothenburg University

ATAPY’s track record in Scandinavian countries includes such projects as digitization of a large collection of books for the Royal Danish Library, creating an archive of XVIII-century Northern European prints for a UNESCO educational project and magazine backlog conversion for a Danish musical publishing house. As a result of this work, ATAPY was recently entrusted with several new Scandinavian projects, primarily in Sweden.

Selma Lagerlof (1858-1940) is one of Sweden’s most prominent authors, winner of the Nobel Prize in Literature and a Swedish Academy Member. She left a literary legacy of more than 2,500 pages. In 2010, the National Library of Sweden launched a project to make this material available online. One of ATAPY’s former customers recommended the company as an excellent service provider with affordable prices and hands-on experience with sources in Scandinavian languages.

The project involved the following phases:

OCR of scanned images; full verification of OCR results; XML markup of basic layout elements (titles, page numbers, separator elements, etc.).

Creation of an electronic archive of Selma Lagerlof’s works for National Library of Sweden

The National Library of Sweden is a state agency with offices in Stockholm. The Library has collected virtually everything printed in Sweden or in Swedish since 1661. Currently the Library coordinates services and programs for all research libraries in Sweden and administers LIBRIS, the Swedish national library catalog system.

ATAPY processed the material within a tight deadline using three media service operators. That year the Library was able to publish selected portions of its Lagerlof Collection online to commemorate the 150th anniversary of Selma Lagerlof's birth.

In both projects, ATAPY faced challenges typical for working with old books: low quality of the original page images (old weathered paper, pale print, etc.); uneven lines, “jumping” print, and varied spacing between letters, words and lines; Old Swedish words and grammar (ABBYY FineReader and sometimes even FineReader XIX dictionaries failed).

ATAPY overcame these challenges by using ABBYY FineReader XIX (a specialized package for processing prints in Old European languages and typefaces), smart segmentation of the material, and qualified manual services when necessary. ATAPY’s strategy is to automate work whenever possible to minimize customer cost without sacrificing quality.

Results

The University of Gothenburg is a major university in Northern Europe (approximately 37,000 students). The University’s 40 Departments cover most scientific disciplines, making the University one of Sweden’s most diversified higher education institutions.

ABBYY, ABBYY FineReader and ABBYY FineReader XIX are registered trademarks of ABBYY Software House. All the other trademarks used are the property of their respective owners.

©2011-2016 ATAPY Software. All rights reserved.


Data Input Project for Novosibirsk Mayor’s Office

ATAPY together with “Zolotaya Korona”, a Russian nationwide retail electronic payment network, conducted a project for the Novosibirsk Mayor's Office.

Designed to modernize the procedure for public transportation fare collection, the project introduced electronic passes with microprocessor plastic cards for use on subways, buses and trolleybuses.

A primary challenge was the large number of passengers with social security benefits and discounts. The Novosibirsk Public Transportation Authority compensates carriers for these passengers from the city budget. One of the goals of the “Transportation Pass” project was to retain the benefits/discounts plan for people who were entitled to them while providing a precise and convenient mechanism for gathering transportation statistics on these passengers. It was also important to eliminate opportunities for fraud, as this was a weak point in the previous transportation pass system.

“Zolotaya Korona” offered a solution based on contactless microprocessor cards with personalized cards for citizens entitled to benefits/discounts and generic cards for regular passengers. “Zolotaya Korona” issued separate types of personal cards for each category of beneficiary: students (the Student Transportation Pass), school children (the School Child Transportation Pass) and social security beneficiaries (the Social Security Transportation Pass).

There was another challenge. To obtain a personal transportation pass, a person had to fill out a machine-readable form. “Zolotaya Korona” would receive tens of thousands of application forms that needed to be processed quickly and with a high degree of accuracy. An additional requirement was that a color photo of the applicant be stored in the database.

The “paper flood” peak was expected when Student Transportation Passes were issued. To deal with this, “Zolotaya Korona” turned to ATAPY Software, a data capture company with a proven track record.

The data capture process involved the following phases:
1. Filling in the form by the applicant (in handprint)
2. Scanning
3. Machine recognition
4. Verification of recognized data
5. Export to the database
6. Card production

Card issue

Zolotaya Korona is a Russian nationwide retail electronic payment network uniting 220 banks from 75 regions of Russia, CIS and foreign countries. “Zolotaya Korona” cards are accepted in 273 cities in Russia as well as in Ukraine, Belarus, Kyrgyzstan, Mongolia and China. The “Zolotaya Korona” system was established in 1994 by the Center of Financial Technologies.

Working in close cooperation with engineers from “Zolotaya Korona”, ATAPY developers helped to design a machine-readable application form to be filled in by students in Phase 1. The form was specially optimized for processing by ABBYY FormReader. Next, ATAPY engineers developed a special pre-processing algorithm to be applied to form images between Phases 2 and 3. It removed the color background from the form and improved the hand-printed text recognition quality while retaining the color photo. The procedure significantly reduced form processing turnaround time during Phases 3-5 (recognition and verification of data using ABBYY FormReader) while ensuring high accuracy of the captured data.
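The idea behind such a pre-processing step can be illustrated with a minimal sketch. It is not ATAPY's algorithm, which is not described here: the darkness threshold, the rectangular photo region and the function names are all assumptions, and the image is modeled as plain rows of RGB tuples to keep the example free of third-party libraries.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]      # (red, green, blue), each 0-255
Image = List[List[Pixel]]         # rows of pixels
Box = Tuple[int, int, int, int]   # left, top, right, bottom (right/bottom exclusive)

WHITE: Pixel = (255, 255, 255)


def is_ink(pixel: Pixel, dark_threshold: int = 100) -> bool:
    """Assume handwriting is dark: every color channel is below the threshold."""
    return max(pixel) < dark_threshold


def drop_background(image: Image, photo_box: Box, dark_threshold: int = 100) -> Image:
    """Whiten the colored form background, keeping ink and the photo region untouched."""
    left, top, right, bottom = photo_box
    cleaned: Image = []
    for y, row in enumerate(image):
        new_row: List[Pixel] = []
        for x, pixel in enumerate(row):
            inside_photo = left <= x < right and top <= y < bottom
            if inside_photo or is_ink(pixel, dark_threshold):
                new_row.append(pixel)   # keep photo pixels and pen strokes as scanned
            else:
                new_row.append(WHITE)   # everything else is treated as background
        cleaned.append(new_row)
    return cleaned
```

Because the photo position is fixed by the form design, a rectangle known in advance is enough to protect it, while the recognition engine receives dark handprint on a clean white page.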

Thanks to the combined efforts of “Zolotaya Korona” and ATAPY Software, Novosibirsk students can now travel on all means of transportation using the Student Transportation Pass without reaching for money or a student ID, or entering a PIN code.

In the near future, “Zolotaya Korona” plans to replicate this valuable experience in other Russian cities.

Shaturskaya Street, 2 630055 Novosibirsk, Russia Tel. +7 383 336 49 49 www.korona.net [email protected]


sense your media

PRNet, Turkey, is a Media Monitoring and Analysis company serving over 300 corporate clients. The company acts as a strategic partner for communication specialists and executives, aiming to develop their corporate reputation and assess the results of their communication strategies. PRNet provides access to their online database where customers can search more than 25 thousand clips and 80 million results stored since 2000, survey 4,500 pages of newspapers and magazines, view videos of 74 TV channels recorded on a 24/7 basis, and access more than 1,000 Internet portals. According to ISO 500 research, 7 of the top 10 Turkish companies, and 84 of the top 100, prefer PRNet for serving their media-monitoring and industrial information needs.


For more than a century, daily, systematic analysis of printed media has been an important tool for successful businesses worldwide. Throughout those years, the rustling of pages and jingle of scissors were the constant audio background for the media clipping process.

The arrival of the digital age changed the scene. Fewer and fewer scissors were used as agencies switched to scanning printed material. Paper no longer left the scanner room, and reading was done from computer monitors.

Overall, the processing of newspapers and magazines still required too much human input to automate, so the amount of labor spent by media clipping companies remained largely unchanged. In the early 1990s, OCR programs worked for letters and faxes but were useless when confronted with the complex layout and font variety of newspapers.

Press Clipping Solution DALIAN 2.0 for PRNet

PRNet Spring Giz Plaza B Blok 17/18, Maslak 80670 Istanbul, Turkey Tel. +90 212 328 18 09 Fax +90 212 328 18 07 www.prnet.com.tr [email protected]


In 1997, a Turkish media research company named PRNet approached ABBYY Software House, the manufacturer of FineReader OCR products, with a request to design a system for streamlining the clipping process. Dalian 1.0 went into operation in 1998, delivering subscribers a previously unheard-of service. As early as nine in the morning, subscribers could log on to PRNet's web site, click on their own customized albums, and view a new page with clippings from that day's morning newspapers. Only clippings containing a subscriber's keywords went to his/her albums. Content was delivered as text and pictures in HTML format, with keywords highlighted, allowing the subscriber to copy and paste it into other software for distribution or editing. All major Turkish publications were covered (50 titles). The clippings were preserved in an MS SQL Server database for long-term storage and future reference.

This was achieved with an average staff of 14 operators, a great efficiency compared to less sophisticated systems. The workload was largely shifted to unattended computers: OCR PCs were rack-mounted 10 units tall to fit into a single room, with one hot-switchable monitor for control.

When the new version of FineReader OCR was released, PRNet invited ABBYY to migrate Dalian to this new platform. In accordance with new corporate outsourcing policies, ABBYY transferred the project to ATAPY Software, an IT development company specializing in custom OCR tools. In addition to migration, PRNet asked ATAPY to add web-based administration, system statistics and reports, a web client for extended media search, improved output for clippings, along with other features and enhancements.

The new Dalian 2.0 went into operation in 2003. It provides media insights to approx. 80 clients, including the Turkish offices of Alcatel, Compaq, Toyota, Unilever, Vestel, CNN, Reebok, and Siemens, as well as such local giants as members of the Koç Group and leading Turkish banks.

The dramatic improvement in recognition rate, the ability to employ home-based operators working through web interfaces, and other advancements in system functionality and manageability place Dalian 2.0 in the top ranks of modern press clipping software solutions.

©2003-2016 ATAPY Software. All rights reserved. ABBYY and ABBYY FineReader are registered trademarks of ABBYY Software. All other trademarks are the property of their respective owners.
