Rosette - Basis Technology

Modern enterprise is well-acquainted with the promise of big data to revolutionize our insights and decision making, although it is less well-known that up to 80% of big data is represented by Big Text. Big Text is large quantities of “unstructured” text chunks found in documents, webpages, and databases with all the hallmarks of big data: the three Vs (Volume, Velocity, and Variety). Big Text is also multilingual, covering many languages and scripts, in all of their complexities and challenges.

Because of the intrinsic nature of unstructured text, standard enterprise data solutions have a very limited ability to understand and utilize this treasure trove of information.

Rosette® is a suite of software components for use in enterprise applications. It uses linguistic analysis, statistical modeling, and machine learning to accurately process Big Text, revealing valuable information and actionable data.

Individually, each component is a robust tool for processing language, documents, or names. Combined, they create powerful solutions that deliver useful information for better decisions and deep value for their users. Our customers across the globe, in government, finance, eDiscovery, search, social media, and beyond, depend on Rosette to analyze and transform their Big Text.

Gain insight and deep value from unstructured text

55 Supported Languages

Start using ROSETTE today Try our free product evaluation

www.basistech.com

KEY FEATURES

- Simple API

- Fast and scalable

- Industrial-strength support

- Easy installation

- Flexible and customizable

- Java or C++

- Unix, Linux, Mac, or Windows

- Built to work with Apache Solr™ and Elasticsearch

- Cloudera certified partner


+1 617-386-2090

What is Big Text?

It’s huge volumes of multilingual, unstructured text that must be processed to deliver insights and build connections. It’s President Clinton helping Malawi. Secretary Clinton in [Urdu: “Islamabad”]. The 福島 [Japanese: “Fukushima”] meltdown. This is Big Text.

[Figure: the sentences above annotated with part-of-speech tags (Pronoun, Verb, Noun, Determiner, Adjective, Preposition, Conjunction) and entity types (Person, Place, Concept, Title, Name).]

Rosette®

BIG TEXT ANALYTICS

- RLI (Language Identifier): Identify languages and encodings
- RBL (Base Linguistics): Search many languages with high accuracy
- REX (Entity Extractor): Tag names of people, places, and organizations
- RNI (Name Indexer): Match names between many variations
- RNT (Name Translator): Translate foreign names into English
- RCA (Categorizer): Categorize everything in sight
- RSA (Sentiment Analyzer): Detect the sentiments of your text
- RES (Entity Resolver): Make real-world connections in your data

Better Search

Tagged Entities

Real Identities

Matched Names

Sorted Languages

Translated Names

Sorted Content

Actionable Insights



© 2015 Basis Technology Corporation. “Basis Technology Corporation” , “Rosette”, and “Highlight” are registered trademarks of Basis Technology Corporation. “Big Text Analytics” is a trademark of Basis Technology Corporation. All other trademarks, service marks, and logos used in this document are the property of their respective owners. (2015-06-29-RLP)

WEST COAST

1700 Montgomery St., San Francisco, CA 94111

FEDERAL

2553 Dulles View Dr., Suite 450, Herndon, VA 20171

HEADQUARTERS

One Alewife Center, Cambridge, MA 02140

EUROPE

Furzeground Way, Middlesex UB11 1BD, UK

ASIA

9-6 Nibancho, Chiyoda-ku, Tokyo 102-0084, Japan

[Figure: three panels labeled THE PROBLEM, THE ROSETTE SOLUTION, and THE RESULT]

CHARACTERISTICS

- 80% of Big Data

- Unstructured

- Multilingual

- Huge Volume


Rosette® Language Identifier (RLI) analyzes text from a few words to whole documents, to detect the languages and character encoding with speed and very high accuracy. Automatic language identification is the necessary first step for applications that categorize, search, process, and store text in many languages. Individual documents may be routed to language specialists, or sent into language-specific analysis pipelines (such as Rosette Base Linguistics) to improve the quality of search results.

For applications that analyze tweets, search keywords, and other short text, RLI offers market-leading accuracy for language detection given 1-3 words (<20 bytes) up to a full sentence.

RLI achieves its incredible accuracy through the use of proprietary algorithms with information-rich language profiles derived from statistical analysis, in addition to language-specific methods for short text language detection. Basis Technology continually improves the Rosette product family with language additions, feature updates, and the latest innovations from the academic world.

Identify languages and transform encodings

55 Supported Languages

KEY FEATURES

- Simple API

- Fast and scalable


- Easy installation


- Java or C++


- Component of the Rosette SDK


Start using RLI today Try our free product evaluation

www.basistech.com

[Figure: RLI analysis of a multilingual sample. Primary Language: French. Primary Script: Latin. Language shares shown: 8%, 22%, 31%, 39% across English, Chinese, French, and Arabic (pairing not shown).]

Sample text:

- English: Instantly identify and triage many languages within large volumes of text.
- French: Identifiez et triez instantanément plusieurs langues à travers de nombreux textes.
- Chinese: 即时识别和处理大量多语言文本。
- Arabic: التحديد والتصنيف الفوري للعديد من اللغات ضمن كميات كبيرة من النصوص.


- Albanian: ISO-8859-1, Windows-1252
- Arabic: ISO-8859-6, Windows-720, Windows-1256
- Arabic (transliterated): ISO-8859-1, Windows-1252, Windows-1256
- Bengali: ISCII-Bengali
- Bulgarian: ISO-8859-5, Windows-1251, KOI8-R
- Catalan: ISO-8859-1, Windows-1252
- Chinese, Simplified: GB-2312, GB-18030, HZ-GB-2312, ISO-2022-CN
- Chinese, Traditional: Big5, Big5-HKSCS
- Croatian: Windows-1250
- Czech: ISO-8859-2, Windows-1250
- Danish: ISO-8859-1, Windows-1252
- Dutch: ISO-8859-1, Windows-1252
- English: ISO-8859-1, Windows-1252
- Estonian: ISO-8859-13, Windows-1257
- Finnish: ISO-8859-1, Windows-1252
- French: ISO-8859-1, Windows-1252
- German: ISO-8859-1, Windows-1252
- Greek: ISO-8859-7, Windows-1253
- Gujarati: ISCII-Gujarati
- Hebrew: ISO-8859-8, Windows-1255
- Hindi: ISCII-Hindi
- Hungarian: ISO-8859-2, Windows-1250
- Icelandic: ISO-8859-1, Windows-1252
- Indonesian: ISO-8859-1, Windows-1252
- Italian: ISO-8859-1, Windows-1252
- Japanese: EUC-JP, ISO-2022-JP, Shift-JIS, Shift-JIS-2004 (JIS X 0213)
- Kannada: ISCII-Kannada
- Korean: EUC-KR, ISO-2022-KR
- Kurdish: Windows-1256
- Kurdish (transliterated): ISO-8859-1, Windows-1252, Windows-1256
- Latvian: ISO-8859-13, Windows-1257

IDENTIFICATION FEATURES

- Identifies the primary or dominant language of a document

- Identifies the language scripts within the document, such as Latin and Cyrillic

- Determines the languages and their percentages within multilingual documents

- Works with texts that have been transliterated, such as Arabic chat that is written in the Latin script

- Accurate with short strings—from 1-3 words (<20 bytes) to a full sentence to enable full analysis of search queries, tweets, image captions, metadata, news headlines, email subject lines, and more.
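As a rough illustration of the percentage feature in the list above, the sketch below computes per-language shares from segments that have already been labeled. The `language_shares` function and the `(language, text)` segment format are hypothetical and are not the RLI API; measuring shares by character count is a simplifying assumption.

```python
from collections import Counter


def language_shares(segments):
    """Return each language's share of the text, in percent of characters.

    `segments` is a list of (language, text) pairs (a hypothetical format).
    """
    totals = Counter()
    for language, text in segments:
        totals[language] += len(text)
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {lang: round(100 * count / grand_total, 1)
            for lang, count in totals.items()}
```

The language with the largest share would be reported as the primary language of the document.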

LANGUAGE BOUNDARY LOCATOR

Digital text is often composed of multiple languages within the same document, presenting a challenge to computers and humans alike. RLI enriches the text with start and end markers for each language within multilingual documents, even when all the languages (such as English, French, German, or Italian) are written in the same script. Boundaries of each writing system are also detected, such as Latin, Cyrillic, Japanese kana, or Chinese hanzi.
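To make the idea of writing-system boundaries concrete, here is a minimal sketch that splits a string into runs of the same script. It is not how RLI works internally: it relies only on the first word of each character's Unicode name (for example `LATIN` or `CYRILLIC`), and the `script_runs` helper is a hypothetical name. Separating languages that share a script, such as English and French, needs statistical models and is out of scope here.

```python
import unicodedata


def script_of(ch):
    """Rough script label from the Unicode character name; None for non-letters."""
    if not ch.isalpha():
        return None
    return unicodedata.name(ch, "UNKNOWN").split()[0]


def script_runs(text):
    """Return (start, end, script) runs; spaces and punctuation join the current run."""
    runs = []
    start, current = 0, None
    for i, ch in enumerate(text):
        script = script_of(ch)
        if script is None:
            continue
        if current is None:
            current = script
        elif script != current:
            runs.append((start, i, current))
            start, current = i, script
    if current is not None:
        runs.append((start, len(text), current))
    return runs
```

Each run carries start and end offsets, which is the same kind of marker the paragraph above describes for language regions.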

- 188 Language/Encoding Pairs
- 55 Languages with Unicode
- 7 Latin Script Variants (Transliterations)
- 44 Legacy Encodings

ENCODING CONVERSION

Although modern text standards, such as XML, are built on Unicode, many existing applications, documents, websites, and data streams use “legacy encodings,” such as ASCII, ISO 8859-1, Shift-JIS, and many others.

Rosette accurately converts large collections of text in these legacy encodings into a single, uniform Unicode format. The converted text can then be processed the same way regardless of language, which eliminates data corruption and other problems caused by incompatible encodings.
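The conversion step itself can be sketched with Python's built-in codecs. This is an illustration only, not the Rosette API: the function names are hypothetical, and the encoding of each input is assumed to be known already (detecting it is the part RLI is described as handling).

```python
def to_unicode(raw: bytes, encoding: str) -> str:
    """Decode bytes in a known legacy encoding into a Unicode string."""
    return raw.decode(encoding, errors="strict")


def normalize_to_utf8(documents):
    """Convert (bytes, encoding) pairs into UTF-8 encoded bytes."""
    return [to_unicode(raw, encoding).encode("utf-8")
            for raw, encoding in documents]


# Three documents in three different legacy encodings.
documents = [
    ("café".encode("iso-8859-1"), "iso-8859-1"),
    ("日本語".encode("shift_jis"), "shift_jis"),
    ("Привет".encode("koi8-r"), "koi8-r"),
]
utf8_documents = normalize_to_utf8(documents)
```

Using `errors="strict"` makes a wrong encoding guess fail loudly instead of silently producing corrupted text.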

LANGUAGE AND ENCODING COMPATIBILITY

- Lithuanian: ISO-8859-13, Windows-1257
- Macedonian: ISO-8859-5, Windows-1251
- Malay: ISO-8859-1, Windows-1252
- Malayalam: ISCII-Malayalam
- Norwegian: ISO-8859-1, Windows-1252
- Pashto: ISO-8859-6, Windows-1256
- Pashto (transliterated): ISO-8859-1, Windows-1252
- Persian: ISO-8859-6, Windows-1256
- Persian (transliterated): ISO-8859-1, Windows-1252, Windows-1256
- Polish: ISO-8859-2, Windows-1250
- Portuguese: ISO-8859-1, Windows-1252
- Romanian: ISO-8859-2, Windows-1250
- Russian: ISO-8859-5, Windows-1251, KOI8-R, IBM-866, Mac Cyrillic
- Serbian: ISO-8859-5, Windows-1251
- Serbian (transliterated): ISO-8859-2, Windows-1250
- Slovak: Windows-1250
- Slovenian: Windows-1250
- Somali: ISO-8859-1, Windows-1252
- Spanish: ISO-8859-1, Windows-1252
- Swedish: ISO-8859-1, Windows-1252
- Tagalog: ISO-8859-1, Windows-1252
- Tamil: ISCII-Tamil
- Telugu: ISCII-Telugu
- Thai: Windows-874
- Turkish: ISO-8859-9, Windows-1254
- Ukrainian: ISO-8859-5, Windows-1251, KOI8-R
- Urdu: ISO-8859-6, Windows-1256
- Urdu (transliterated): ISO-8859-1, Windows-1252
- Uzbek: ISO-8859-5, Windows-1251, KOI8-R
- Uzbek (transliterated): Windows-1251
- Vietnamese: TCVN, VIQR, VISCII, VNI, VPS

[Figure: multilingual sample text with language boundaries marked. Languages: ENGLISH, FRENCH, GERMAN, SPANISH]

J'ai été surprise par cette surprise. Vice President Biden spoke about this in Munich. El carpintero prensa los bordes de la placa decorativa. Proper wound care management prevents die Geige gibt einen schoenen Laut von sich.


Every language, including English, presents unique and difficult challenges for search applications that must deliver relevant and precise results. Rosette® Base Linguistics (RBL) enables enterprise applications to effectively search or process text in many languages by providing a complete set of linguistic services. RBL enriches the original text in its native language for best-of-class natural language processing, improving speed and accuracy.

As linguistics experts with deep understanding at the intersection of language and technology, Basis Technology continually improves the Rosette product family with language additions, feature updates, and the latest innovations from the academic world.

Search many languages with high accuracy

40 Supported Languages

KEY FEATURES

- Simple API

- Fast and scalable


- Easy installation


- Java or C++


- Customizable features such as user dictionaries, orthographic normalization, and script conversion

- Built to work with Apache Solr™ and Elasticsearch

- Cloudera certified partner


Start using RBL today Try our free product evaluation

www.basistech.com

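Improve the speed and accuracy of your search application with advanced linguistic analysis.

[Figure: the sentence above tagged by part of speech: Improve (Verb) the (Determiner) speed (Noun) and (Conjunction) accuracy (Noun) of (Preposition) your (Determiner) search (Noun) application (Noun) with (Preposition) advanced (Adjective) linguistic (Adjective) analysis (Noun) . (Punctuation)]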



TOKENIZATION

Many search tools use bigrams to understand languages written without spaces between words. This results in a larger index size and a reduction in relevancy. RBL, in contrast, accurately identifies and separates each word through advanced statistical modeling. The resulting token output (also known as segmentation) minimizes index size, enhances search accuracy, and increases relevancy.

DECOMPOUNDING

RBL breaks down compound words into sub-components and delivers each individual element to be indexed. This is especially useful for increasing search relevancy in languages such as German and Korean.

Available Languages

WESTERN EUROPE
- Catalan*
- Czech
- Danish
- Dutch
- English
- Finnish*
- French
- German
- Greek
- Italian
- Norwegian
- Portuguese
- Spanish
- Swedish

EASTERN EUROPE
- Albanian*
- Bulgarian*
- Croatian*
- Estonian*
- Hungarian
- Latvian*
- Polish
- Romanian
- Russian
- Serbian*
- Slovak*
- Slovenian*
- Turkish
- Ukrainian*

Advanced Morphological Features

LEMMATIZATION

Most search engines utilize a crude method of chopping off characters at the end of a word in the hopes of removing unimportant differences. This method, called stemming, often results in extra recall and poor precision. Instead, RBL finds the true dictionary form of each word, known as a lemma, by using vocabulary, context, and advanced morphological analysis. Indexing the root form increases search relevancy and slims the search index by not indexing all inflected forms. Alternative lemmas are also made available to supplement indexing.

PART OF SPEECH TAGGING

As part of the lemmatization process, statistical modeling is used to determine the correct part of speech, even with ambiguous words.

Each token is then tagged for enhanced comprehension and search relevancy. Because different languages have different grammars, part-of-speech tags differ from language to language.

Rosette supports the Universal POS Tag standard from which the developer can map to Penn Treebank or other POS tag systems.

MIDDLE EAST
- Arabic
- Hebrew
- Pashto
- Persian
- Urdu

ASIA
- Chinese, Simplified
- Chinese, Traditional
- Indonesian
- Japanese
- Korean
- Malay*
- Thai

Example: German

Samstagmorgen is a compound word formed with Samstag (Saturday) and morgen (morning). Decompounding allows for an appropriate match when searching for "Samstag".
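The German example can be sketched as a dictionary lookup. This is a toy under stated assumptions: the `decompound` function and its small vocabulary are hypothetical, and it handles only two-part compounds with no linking letters, whereas real decompounding has to deal with much more.

```python
def decompound(word, vocabulary):
    """Split a compound into two known words, or return it unchanged."""
    lower = word.lower()
    # Try every split point that leaves at least three letters on each side.
    for i in range(3, len(lower) - 2):
        head, tail = lower[:i], lower[i:]
        if head in vocabulary and tail in vocabulary:
            return [head, tail]
    return [lower]


vocabulary = {"samstag", "morgen"}
index_terms = decompound("Samstagmorgen", vocabulary)
```

Indexing both parts is what lets a query for "Samstag" match a document that only contains "Samstagmorgen".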

Example: English

Linguistic analysis is useful for every language; lemmatization for English improves recall and precision.

NOUN PHRASE EXTRACTION

Certain nouns, especially proper names, can be very tricky to identify as a single entity. RBL groups the nouns and their modifiers, which is useful in document clustering and concept extraction.

SENTENCE DETECTION

The start and end of each sentence are automatically identified even though punctuation use may be ambiguous.

CHALLENGE | QUERY | STEM | LEMMA
Two unrelated words may share a stem. | animals, animated | anim | animal, animate
Stemming may deliver unintended results. | several | sever | several
Irregular verbs and nouns stump the stemmer. | spoke | spoke | speak (v.), spoke (n.)
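The contrast in the table can be reproduced with a few lines of code. Both functions are toys: the suffix rules in `crude_stem` are chosen only to mirror the table rows, and the `LEMMAS` lookup stands in for the vocabulary, context, and morphological analysis that the text attributes to RBL. None of these names come from the product.

```python
# Toy suffix rules, longest first; chosen only to mirror the table rows.
SUFFIXES = ("ated", "als", "al", "ed", "s")


def crude_stem(word):
    """Chop off the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Hypothetical lemma dictionary keyed by (word, part of speech).
LEMMAS = {
    ("animals", "NOUN"): "animal",
    ("animated", "VERB"): "animate",
    ("several", "ADJ"): "several",
    ("spoke", "VERB"): "speak",
    ("spoke", "NOUN"): "spoke",
}


def lemma(word, pos):
    """Look up the dictionary form; fall back to the word itself."""
    return LEMMAS.get((word, pos), word)
```

Note that the lemma lookup needs the part of speech to tell the verb "spoke" from the noun, which is why tagging and lemmatization are described together.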




Example: Chinese

Consider the problem of indexing “Beijing University Biology Department” and a subsequent search for “student”:

[Figure: the seven-character phrase indexed two ways, followed by a SEARCH for “student”.]

- BIGRAMMING: the INDEX holds every overlapping pair of characters, including non-words and a pair spanning “University” and “Biology” that reads as “student”. "Student" incorrectly hits “Beijing University Biology Department”.
- RBL MORPHOLOGICAL TOKENIZATION: the INDEX holds only the actual words of the phrase. "Student" correctly misses “Beijing University Biology Department”.
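The false hit in this example can be reproduced in a few lines. Two assumptions are made here: the Chinese string `北京大学生物系` is an assumed rendering of “Beijing University Biology Department”, and greedy longest-match against a small hypothetical vocabulary stands in for the statistical tokenization that RBL is described as using.

```python
def bigrams(text):
    """Every overlapping pair of characters, as a bigram indexer would store."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def longest_match_tokens(text, vocabulary):
    """Greedy longest-match segmentation; unknown characters become single tokens."""
    max_len = max((len(word) for word in vocabulary), default=1)
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocabulary:
                tokens.append(piece)
                i += size
                break
    return tokens


phrase = "北京大学生物系"           # assumed: "Beijing University Biology Department"
query = "学生"                      # "student"
vocabulary = {"北京大学", "生物系", "学生"}

bigram_hit = query in bigrams(phrase)
word_hit = query in longest_match_tokens(phrase, vocabulary)
```

The bigram index contains the query because the last character of "University" and the first of "Biology" sit next to each other; the word-level index does not.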

* Limited Support


Rosette® Entity Extractor (REX) delivers structure, clarity, and insight, by revealing the key information—names, places, organizations, products, and other words and phrases—lying hidden within large volumes of unstructured Big Text.

REX is the foundation for applications in eDiscovery, social media analysis, financial compliance, and government intelligence. The effectiveness of these mission-critical applications depends on REX for its accuracy, robustness, and ability to find entities across many languages.

By nature, statistically trained models are most accurate on the type of data they are trained on. Besides machine learning from a wide range of text beyond news articles, REX is unique among named entity recognition software in its adaptability. REX’s field training mechanism enables you to add your text data to increase REX’s accuracy on your text.

Accurate & adaptable statistical entity extraction

17 Supported Languages

KEY FEATURES


- Simple API

- Fast and scalable


- Easy installation


- Java or C++


Start using REX today Try our free product evaluation

www.basistech.com


Automatically find names of people, places, products, and organizations in text across many languages.


Available Languages

- Arabic
- Hebrew
- Pashto
- Persian
- Urdu
- Chinese, Simplified
- Chinese, Traditional
- Indonesian
- Japanese
- Korean
- Dutch
- English
- French
- German
- Italian
- Portuguese
- Russian
- Spanish

Additional languages are available through custom development.

Predefined Entity Types

REX natively supports the following entity types. User-defined entities, such as SKU numbers, are also available.

- Person
- Location
- Organization
- Product
- Title
- Nationality
- Religion
- Credit Card Number
- Geographic Coordinate
- Money
- Generic Number
- Personal ID Number
- Phone Number
- Email Address/URL
- Distance
- Date
- Time

STATISTICAL ENTITY EXTRACTION

Statistical modeling with advanced linguistics solves the three biggest challenges in entity extraction: finding entities which cannot be exhaustively listed, finding entities which are yet unknown, and using context to distinguish between similar entities, e.g., the place “Newton, MA” and the person “Isaac Newton”.

How it works

REX in action

FIELD TRAINING FOR INCREASED ACCURACY

For users with text that is particularly challenging in format, style, or vocabulary, REX’s unique field training capability has multiple mechanisms to adapt its statistical model to their data. Users just add a quantity of their data (unannotated or annotated), and rebuild the model for maximum accuracy.

PATTERN-MATCHING RULES

Rules expressed as regular expressions find entities which follow a pattern, such as dates, times, and email addresses. Many standard string patterns are included with REX; customers can customize by editing or adding their own rules, based on their specific needs.
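A minimal sketch of rule-based extraction follows. The three patterns and the `extract` function are illustrative assumptions, not the rules shipped with REX, and they are deliberately simple (for example, only ISO-style dates are recognized).

```python
import re

# Hypothetical rule set: entity label -> compiled pattern.
RULES = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "DATE": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "TIME": re.compile(r"\b\d{1,2}:\d{2}\b"),
}


def extract(text):
    """Return (start, end, label, value) for every rule match, in text order."""
    found = []
    for label, pattern in RULES.items():
        for match in pattern.finditer(text):
            found.append((match.start(), match.end(), label, match.group()))
    return sorted(found)
```

Adding a customer-specific entity, such as a SKU number, would amount to adding one more entry to the rule table.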

CUSTOM ENTITY LISTS

Custom lists are helpful when users know that specific words or phrases in their data are almost never misspelled and always refer to the same thing (i.e., are unambiguous). REX comes with such lists for entity types like religions and nationalities.
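List-based matching can be sketched as exact lookup of known phrases. The `build_list_matcher` helper and the sample entries are hypothetical; the only design point shown is that longer phrases are tried first so that "South Korean" is not reported as just "Korean".

```python
import re


def build_list_matcher(entity_list):
    """Build a matcher from a {phrase: entity_type} dictionary (exact, case-sensitive)."""
    if not entity_list:
        return lambda text: []
    longest_first = sorted(entity_list, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(phrase) for phrase in longest_first) + r")\b"
    )

    def match(text):
        return [(m.group(), entity_list[m.group()]) for m in pattern.finditer(text)]

    return match


nationalities = {
    "Canadian": "NATIONALITY",
    "Korean": "NATIONALITY",
    "South Korean": "NATIONALITY",
}
find_nationalities = build_list_matcher(nationalities)
```

Because matching is exact, this approach only suits terms that are rarely misspelled and never ambiguous, as the paragraph above notes.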


[Figure: sample texts with extracted entities highlighted by type. Legend: Person, Location, Organization, Date, Time, Title]

The New York Philharmonic Orchestra will make a historic trip to North Korea in February, it has announced. The orchestra's president and executive director, Zarin Mehta, said it would play in the capital Pyongyang on February 26. In August, the reclusive communist country's Ministry of Culture sent an invitation to the orchestra at Lincoln Center in Manhattan.

Dominique de Villepin a été nommé Premier ministre ce mardi en fin de matinée par Jacques Chirac. L'ancien ministre de l'Intérieur, qui n'a jamais participé à une élection, a déjeuné avec les députés UMP et UDF à l'invitation du président de l'Assemblée nationale, Jean-Louis Debré.

الخميس 5/2/1431 هـ - الموافق 21/1/2010 م (آخر تحديث) الساعة 10:01 (مكة المكرمة)، 7:01 (غرينتش) ناتو يفكر بمسؤول مدني لأفغانستان يخطط حلف شمال الأطلسي (ناتو) لتعيين مسؤول مدني كبير في أفغانستان، وسط دعوات لتحسين التنسيق السياسي والتنموي في البلاد وفق ما نقلته صحيفة وول ستريت

小澤征爾は、日本を代表する世界的な指揮者である。1973年、38歳のときに、アメリカ5大オーケストラの一つであるボストン交響楽団の音楽監督に就任した。


Rosette® Entity Resolver (RES) reveals meaningful information in your text. It connects the words that represent real-world things to one another and to entities in an entity database like Wikipedia, both within and across documents.

Good-quality entity resolution means dealing with three key problems: variety, where one thing can have many names; ambiguity, where many things can have very similar if not exactly the same name; and ghosts, where some collection of names identifies a previously unknown real-world thing.

RES enriches your text with high-quality metadata, enabling you to perform intuitive, entity-centric search and discovery. With it you can power notification applications designed to detect and track new people in text streams. It provides excellent raw material for building the custom knowledge graphs at the heart of many of today’s most innovative applications.

Linking and learning for real-world data

KEY FEATURES

- Standard training from 2.5M Wikipedia entities

- “Learning” Mode: Identifies previously unknown entities (“ghosts”) and learns new aliases from text as it processes

- “Linking” Mode: Rapidly links only known entities

- Custom entity database training

- Fast and scalable



- Java


Start using RES. Try our free product evaluation

www.basistech.com



Connect your unstructured text to the real-world people, organizations and places you care about.

EXAMPLES:

- Paris: Paris, Texas (33°39 N, 95°32 W) or Paris, France (48°51 N 2°21 E)
- Tamerlan Tsarnaev: Tamerlin Tsarnaev (TheAtlantic.com) or Tamerlane Tsarnaevy (Mir24.net)
- Apple: Apple Corps Ltd. (Music) or Apple Inc. (Technology)



Organize Big Text using entity linking and learning

Rosette Entity Resolver can be run in one of two modes:

Linking

In linking mode, RES will link the names of people, places and organizations in the text to entities in the entity database. Anything that can’t be associated with an existing entity will be ignored. This mode is optimized for high scale and stable throughput.

Learning

In learning mode, RES not only links names to known entities, but also discovers new entities mentioned in the text (often called “ghosts”), and remembers the new aliases and contexts it has found for all entities. For example, once “J. Doe” has been encountered and linked to the “John Doe” entity, future occurrences of “J. Doe” will be matched with greater confidence.

Under the hood

RES uses a machine-learned model to associate names and their contexts with collections of information drawn from the entity databases with known entities. In linking mode, RES fixes both the number of entities and the information within. Learning mode allows new entities to be created and new information to be added to existing entities. As this system state grows, RES intelligently prunes the information to maintain performance.

Measuring Confidence

Confidence measures are essential for effective use of statistically based systems. RES can be configured to deliver confidence measures with each of its clustering and linking decisions, allowing developers to use the RES output intelligently.

[Figure: sample texts and the entities RES resolves from them, in two steps: 1 Link to Known Entities, 2 Learn about New Entities]

Un estadounidense que aprendió a saltar en paracaídas durante la Segunda Guerra Mundial cumplió su sueño de poner en práctica su habilidad a los 90 años de edad. Lester Slate saltó este domingo en el estado de Maine acompañado de un guía paracaidista. A pesar de haber volado en numerosas ocasiones, según la prensa local, este veterano de la marina estadounidense nunca se había lanzado en paracaídas. Slate señaló que se sintió inspirado por el expresidente de Estados Unidos Goerge H.W. Bush, quien realizó un salto con motivo de su 85 cumpleaños en 2009. Tras pisar tierra, el nonagenario dijo que esperaba poder repetir en su 95 y 100 cumpleaños.

FISHING NEWS

KENNEBUNKPORT, Maine - Three generations of an American political dynasty went fishing off the Maine coast. The family set out together in the morning on their new powerboat, built by Penobscot Boat Builders. On board were the first President Bush; the second President Bush and his niece, Noelle Bush.

Known entities:

- George H. W. Bush: 41st U.S. President, ID: USPRES41, DOB: June 12, 1924
- George W. Bush: 43rd U.S. President, ID: USPRES43, DOB: July 6, 1946

New entities:

- Noelle Bush (Unknown Person), ID: BD239852
- Penobscot Boat Builders (Unknown Organization), ID: TF354723

Full Support Languages

- English
- Spanish
- Chinese

Partial Support Languages

- Russian
- Japanese
- Arabic
- Korean
- Pashto
- Persian (Dari)
- Persian (Farsi)
- Urdu

Custom training

RES comes pre-trained to link to a Wikipedia-derived 2M+ entity database. RES may be further trained by adding to this entity database or by providing an entirely new entity database.

What is training?

Training currently involves adding information about real-world entities to the system such as names, aliases, related entities, and example documents.

A simple example is adding a new alias to a Wikipedia-derived entity to improve resolution accuracy.

EXAMPLE: Basketball player Jeremy Lin is often referred to as “Linsanity”.

Training allows developers to add the “Linsanity” alias to the entry for Jeremy Lin. The next time “Linsanity” is encountered, it will be resolved appropriately.
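The difference between linking mode, learning mode, and alias training in this section can be sketched with a toy alias table. Everything here is hypothetical: the `ToyResolver` class, the entity IDs, and the exact-match lookup are stand-ins for the machine-learned model the text describes, and no context or confidence scoring is modeled.

```python
class ToyResolver:
    """Alias table that mimics linking mode, learning mode, and alias training."""

    def __init__(self, aliases, learning=False):
        # Map lowercased alias -> entity ID (IDs are made up for illustration).
        self.aliases = {alias.lower(): entity for alias, entity in aliases.items()}
        self.learning = learning
        self._next_ghost = 1

    def add_alias(self, alias, entity_id):
        """Training step: attach a new alias to an existing entity."""
        self.aliases[alias.lower()] = entity_id

    def resolve(self, mention):
        key = mention.lower()
        if key in self.aliases:
            return self.aliases[key]
        if not self.learning:
            return None  # linking mode: unknown mentions are ignored
        # Learning mode: create a "ghost" entity and remember the mention.
        ghost_id = f"GHOST-{self._next_ghost}"
        self._next_ghost += 1
        self.aliases[key] = ghost_id
        return ghost_id


linker = ToyResolver({"Jeremy Lin": "JEREMY_LIN"})
before_training = linker.resolve("Linsanity")
linker.add_alias("Linsanity", "JEREMY_LIN")
after_training = linker.resolve("Linsanity")

learner = ToyResolver({}, learning=True)
first_seen = learner.resolve("Noelle Bush")
seen_again = learner.resolve("Noelle Bush")
```

In linking mode the table never grows, which fits the text's point about stable throughput; in learning mode it grows with every new mention, which is why the text mentions pruning.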


Names are the linchpin that connects data points in financial compliance, anti-fraud, government intelligence, law enforcement, and identity verification. Yet names are challenging to connect because of their incredible variation in misspellings, nicknames, initials, and titles. In international databases, a single name may also appear in many languages!

Rosette® Name Indexer (RNI) solves these challenges with a linguistic, knowledge-based system that compares and matches names of people, places, and organizations despite their many variations. RNI is unrivalled in its ability to match names because of its intelligent approach.

As linguistics experts with deep understanding at the intersection of language and technology, Basis Technology continually improves the Rosette product family with language additions, feature updates, and the latest innovations from the academic world. RNI is unrivalled in its ability to match the names of entities—find out how your organization can utilize this pioneering technology for extraordinary results.

Accurate fuzzy name matching in many languages

14 Supported Languages

KEY FEATURES


- Simple API

- Fast and scalable


- Easy installation


- Java


- Matches names of people, places, and organizations

- Increases name search accuracy

- Ranks results by relevancy with a similarity score

- Built to work with Apache Solr™ and Elasticsearch


Start using RNI today. Try our free product evaluation.

www.basistech.com

Figure: RNI matches variants of a name to a single record and gives each match a similarity score. The record shown is Franklin D. Roosevelt (32nd U.S. President; ID: USPRES32; DOB: Jan. 30, 1882), with a short biography. The variants shown are: 富兰克林·罗塞费尔特; Gov. Franklin Roosevelt; Frank Delano Roosevelt; Franklin Rosenvelt; President Roosevelt; Рузвельт, Франклин; F. D. R.; F. D. Roosev. Their similarity scores range from 73% to 97%.



Our knowledge-based system applies the latest in Natural Language Processing (NLP) to match names intelligently, based on their linguistic and cultural structures and norms.

Unlike expensive and less accurate legacy solutions driven by thousands of spelling variants from known names, RNI analyzes the intrinsic structure of each name component and performs an intelligent comparison using advanced linguistic algorithms.

Our approach is not limited to a particular list of variants and reduces the likelihood of both “false positives” (wrong matches) and “false negatives” (missed matches).

List-driven systems cannot equal RNI for matching never-before-seen names or mis-segmented names (Mary Ellen vs. MaryEllen).

Available Languages and Scripts

- Arabic scripts: Arabic, Persian, Pashto, Urdu

- Cyrillic: Russian

- Hangul: Korean

- Hanzi (Simplified & Traditional): Chinese

- Kanji, Katakana, Hiragana: Japanese

- Roman scripts: English, Spanish, French, Italian, German, Portuguese

RNI matches names from these languages either in transliteration to English or written in their native scripts.

Name Matching Capabilities


Compatibility

- Same name in multiple languages: Mao Zedong ↔ Мао Цзэдун ↔ 毛泽东

- Phonetic spelling differences: Cairns ↔ Kearns ↔ Kerns

- Transliteration spelling differences: Abdul Rasheed ↔ Abd-al-Rasheed ↔ Abdulrashid

- Nicknames: William ↔ Will ↔ Bill ↔ Billy

- Initials: J. E. Smith ↔ James Earl Smith

- Titles and honorifics: Dr. ↔ Mr. ↔ Ph.D.

- Out-of-order name components: Diaz, Carlos Alfonzo ↔ Carlos Alfonzo Diaz

- Missing name components: Phillip Charles Carr ↔ Phillip Carr

- Missing spaces or hyphens: MaryEllen ↔ Mary Ellen ↔ Mary-Ellen

- Truncated name components: McDonalds ↔ McD ↔ McDonald

- Name split inconsistently across database fields: Dick • Van Dyke ↔ Dick Van • Dyke

© 2015 Basis Technology Corporation. “Basis Technology Corporation”, “Rosette”, and “Highlight” are registered trademarks of Basis Technology Corporation. “Big Text Analytics” is a trademark of Basis Technology Corporation. All other trademarks, service marks, and logos used in this document are the property of their respective owners. (2015-06-29-RNI)


The Rosette Advantage

Financial Compliance

Financial institutions use RNI to manage and update watchlists to block terrorist access to funds, simultaneously avoiding compliance violations and protecting their reputation. Applications also include fraud detection, anti-money-laundering screening, and document triage.

Government Intelligence

Names are often the most critical data point in intelligence, law enforcement, and border control. RNI is being adopted throughout the U.S. government to address the challenge of matching names in all their variations, particularly names from languages written in non-Latin scripts, such as Arabic, Russian, Chinese, Korean, or Persian.

Identity Verification in the Sharing Economy

Trust is foundational to the sharing economy. Whether booking room rentals, rides, or odd jobs, it is important to establish ways to connect the online and offline worlds to reinforce that trust and confidence.

Name matching is a key component of verifying online identities against real-world documentation (passports, driver’s licenses). Members of the sharing economy such as Airbnb rely on RNI to match names originating from all over the world, including names written in scripts other than the Roman alphabet.

Integration Options

Rosette® Name Indexer integrates easily into Apache Solr™ as a plug-in or into applications as a Java library to support its main use cases. RNI can also be adapted to match the needs of each application.

Apache Solr

Apache Solr™-based search systems can add high-quality fuzzy name matching to every search by adding name fields. RNI provides a special Solr field type for names. This means Solr can index documents with multiple name fields, each with multiple values (e.g., an “alias” field may contain more than one name). Each document can also contain non-name fields such as dates or plain text.

<field name="primary">Muhammad Ali</field>
<field name="alias">Cassius Clay Jr</field>
<field name="alias">The Greatest</field>
<field name="dob">1/7/1942</field>

A single query can then be constructed that gives different weight to the various fields. For example, a single query can find movies starring “Binedict Cumberbund” with screenplays by “Giyermo Diltoro” that were released around 2014.

Java Library

Any application that needs name matching can directly integrate a Java library, which takes care of storing watchlists without incurring the overhead of a web-service call.
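As a rough illustration of the Solr document layout described above, the following Python sketch builds an XML update message containing the four example fields. It assumes Solr's standard XML update format (an `<add>` element wrapping a `<doc>`); the helper name `build_solr_doc` is hypothetical, and nothing here is specific to RNI. Which fields use the RNI name field type would be declared in the Solr schema, which is not shown.

```python
import xml.etree.ElementTree as ET


def build_solr_doc(fields):
    """Build a Solr XML <add><doc> message from (name, value) pairs.

    Repeating a field name produces a multi-valued field, as with "alias".
    """
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in fields:
        field = ET.SubElement(doc, "field", {"name": name})
        field.text = value
    # encoding="unicode" returns a str without an XML declaration
    return ET.tostring(add, encoding="unicode")


# Field names and values taken from the example above.
record = [
    ("primary", "Muhammad Ali"),
    ("alias", "Cassius Clay Jr"),
    ("alias", "The Greatest"),
    ("dob", "1/7/1942"),
]

xml_message = build_solr_doc(record)
print(xml_message)
```

Keeping the record as a list of pairs, rather than a dictionary, is what allows the same field name to appear more than once.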

Customize To Your Needs

- Set the minimum threshold of the similarity score to manage the precision and recall of the returned search results.

- Ignore a given list of words (“stopwords”) with respect to matching (e.g., titles, honorifics).

- Force two name words to always match with a given score (e.g., “Elizabeth” and “Lisbeth” always match at 90%).

- Force two names to always match with a given score (e.g., “John Doe” and “Joe Bloggs” always match at 95%).

- Link multiple names to a single individual (e.g., queries for “Marilyn Monroe” and “Norma Jeane Mortensen” include the same person).
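To make these options concrete, here is a toy Python sketch of how a score threshold, a stopword list, and forced match scores could interact. It is not RNI's algorithm or API: the string similarity comes from Python's `difflib.SequenceMatcher` as a stand-in, and every name in it (`STOPWORDS`, `TOKEN_OVERRIDES`, `NAME_OVERRIDES`, `name_score`, `search`) is hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical settings mirroring the options listed above.
STOPWORDS = {"dr", "mr", "mrs", "ms", "phd"}
TOKEN_OVERRIDES = {frozenset({"elizabeth", "lisbeth"}): 0.90}
NAME_OVERRIDES = {frozenset({"john doe", "joe bloggs"}): 0.95}


def tokens(name):
    """Lowercase, strip periods and commas, and drop stopwords."""
    cleaned = name.lower().replace(".", "").replace(",", " ")
    return [t for t in cleaned.split() if t not in STOPWORDS]


def token_score(a, b):
    """Score two name words, honoring forced word-level scores."""
    if a == b:
        return 1.0
    forced = TOKEN_OVERRIDES.get(frozenset({a, b}))
    if forced is not None:
        return forced
    # Stand-in similarity; a real matcher would use linguistic knowledge.
    return SequenceMatcher(None, a, b).ratio()


def name_score(query, candidate):
    """Score two full names, honoring forced name-level scores."""
    q, c = tokens(query), tokens(candidate)
    forced = NAME_OVERRIDES.get(frozenset({" ".join(q), " ".join(c)}))
    if forced is not None:
        return forced
    if not q or not c:
        return 0.0
    # Average each query word's best match, ignoring word order.
    best = [max(token_score(t, u) for u in c) for t in q]
    return sum(best) / len(best)


def search(query, watchlist, threshold=0.8):
    """Return (name, score) pairs at or above the threshold, best first."""
    scored = [(name, name_score(query, name)) for name in watchlist]
    hits = [(n, s) for n, s in scored if s >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


watchlist = ["Mary Ellen Carr", "Joe Bloggs", "Lisbeth Smith"]
print(search("Dr. Elizabeth Smith", watchlist, threshold=0.8))
print(name_score("John Doe", "Joe Bloggs"))  # forced score: 0.95
```

Raising the threshold trades recall for precision: with `threshold=0.99` the same query returns no hits, while lowering it admits weaker candidates.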


Use Cases

Names are an essential source of information, but most names in the world are not written in English, rendering them nearly useless to Anglocentric corporations and governments. These organizations must quickly and accurately translate names, often at a very large scale. Rosette® Name Translator (RNT) can quickly process millions of names from foreign languages and produce highly accurate, standardized English translations using industry-leading technologies, such as linguistic algorithms and statistical modeling. RNT can also translate a name written in English into its equivalent in most supported languages, such as Arabic or Chinese.

As linguistics experts with deep understanding at the intersection of language and technology, Basis Technology continually improves the Rosette product family with language additions, feature updates, and the latest innovations from the academic world.

Instantly translate many names to (and from) English

10 Supported Languages

KEY FEATURES


- Simple API

- Fast and scalable


- Easy installation


- Java


Select Customers


+1 617-386-2090

Start using RNT today. Try our free product evaluation.

www.basistech.com

Figure: example name translations, with the origin, entity type, and language identified for each name.

- أبو يوسف يعقوب / Abu-Yusif Ya'qub: origin Arabic, entity type Person, language Arabic

- 姚明 / Yao Ming: origin Chinese, entity type Person, language Chinese

- 信濃川 / Shinano River: origin Japanese, entity type Location, language Japanese

- جون كينيدي / John Kennedy: origin English, entity type Person, language Arabic

- Чан Хо Пак / Chan Ho Pak: origin Korean, entity type Person, language Russian



A DIFFICULT PROBLEM

Translating names from other languages into English is quite difficult. Even the most powerful and expensive “machine translation” systems struggle when confronted with the task of accurately translating large numbers of names. Why is this so hard?

A FEW CHALLENGES:

- Which words in a name should be translated according to their spelling (i.e., transliterated) and which words according to their meaning?

- Within a language, there may be conflicting conventions for translation. Both “Fuji” and “Huzi” are accepted name spellings of the iconic Japanese volcano. Arguments over spelling the capital of Ukraine as “Kiev” vs. “Kyiv” have almost triggered diplomatic crises.

- Common practice may conflict with organizational standards. For example, the name of the former ruler of Iraq typically appears in the news media as “Saddam Hussein”. However, the CIA’s official spelling is “Saddam Husayn”. Similarly, the conventional spelling of the Syrian ruler’s name is “Assad”, but CIA guidelines say “Asad”.

- A name written in a foreign language may be native to that language, such as محمود أحمدي نجاد (Mahmoud Ahmadinejad), or may be an English name written in a foreign alphabet, such as جورج دبليو بوش (George W. Bush).

HOW IT WORKS

RNT combines dictionary look-ups and transliteration to find the most accurate English spelling of a name. First, the foreign name is looked up in user-supplied name dictionaries, known as gazetteers. If the name is not found, RNT transliterates it into English using linguistic algorithms and statistical modeling, then matches the result against preferred name standards. For example, names written in Chinese are converted from ideographic characters into a phonetic representation. Names written in “unvocalized” Arabic (i.e., without short vowels) are automatically vocalized to enable a phonetic translation according to any of several user-selected standard systems.
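The two-step flow described above (gazetteer look-up first, transliteration as the fallback) can be sketched in a few lines of Python. This is a simplified illustration, not RNT's implementation: the gazetteer entry, the partial Cyrillic table, and the function names are all hypothetical, and the sketch omits the statistical modeling, Arabic vocalization, Chinese segmentation, and standards matching that the text mentions.

```python
# Hypothetical user-supplied gazetteer: names with a preferred English form.
GAZETTEER = {"Москва": "Moscow"}

# Tiny illustrative Cyrillic-to-Latin table; not a complete or official standard.
CYRILLIC_TO_LATIN = {
    "а": "a", "в": "v", "и": "i", "к": "k", "м": "m",
    "н": "n", "о": "o", "р": "r", "с": "s",
}


def transliterate(word):
    """Map each letter by spelling; unknown letters become '?'."""
    latin = "".join(CYRILLIC_TO_LATIN.get(ch, "?") for ch in word.lower())
    return latin.capitalize()


def translate_name(name):
    """Prefer the gazetteer entry; otherwise transliterate word by word."""
    if name in GAZETTEER:
        return GAZETTEER[name]
    return " ".join(transliterate(word) for word in name.split())


print(translate_name("Москва"))        # gazetteer hit: Moscow
print(transliterate("Москва"))         # letter-by-letter fallback: Moskva
print(translate_name("Иван Смирнов"))  # not in gazetteer: Ivan Smirnov
```

Checking the gazetteer first is what lets a conventional spelling such as “Moscow” win over the letter-by-letter result “Moskva”.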

UNIQUE CAPABILITIES

- Generate consistent “conventional spellings” of frequently appearing foreign names

- Process “unrecognized” names, i.e., those not appearing in any known catalog of foreign names

- Incorporate complex transliteration standards (such as those of the U.S. Intelligence Community (IC) or the U.S. Board on Geographic Names) for translating a name from a foreign alphabet into English

- Automatically resolve name spelling ambiguities in the source language, such as partial vocalization of Arabic, or word segmentation in Chinese

Available Language Pairs

- Arabic ↔ English

- Dari ↔ English

- Farsi ↔ English

- Pashto ↔ English

- Urdu → English

- Chinese ↔ English

- Japanese → English

- Korean ↔ English

- Russian ↔ English

↔ Indicates names can be translated to and from English

→ Indicates names can be translated only to English

Additional languages are available via custom development.
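An application that calls a name translator might first check whether a requested direction is supported. The Python sketch below transcribes the language pairs listed above into a lookup table, assuming that the two pairs without a two-way marker (Urdu and Japanese) translate into English only. The names `LANGUAGE_PAIRS` and `can_translate` are hypothetical and not part of any Rosette API.

```python
# Direction support per language, transcribed from the list above.
# "both" = to and from English; "to_english" = into English only.
LANGUAGE_PAIRS = {
    "Arabic": "both",
    "Dari": "both",
    "Farsi": "both",
    "Pashto": "both",
    "Urdu": "to_english",
    "Chinese": "both",
    "Japanese": "to_english",
    "Korean": "both",
    "Russian": "both",
}


def can_translate(source, target):
    """Return True if names can be translated from source to target."""
    if target == "English":
        return source in LANGUAGE_PAIRS
    if source == "English":
        return LANGUAGE_PAIRS.get(target) == "both"
    # Every listed pair has English on one side.
    return False


print(can_translate("Urdu", "English"))    # True
print(can_translate("English", "Urdu"))    # False
print(can_translate("English", "Arabic"))  # True
```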


Compatibility

COMBINING REX & RNT

Rosette Entity Extractor (REX) may be paired with RNT to extract and translate key names in a document, with accuracy superior to either statistical or rule-based machine translation systems. This approach may also be used to enrich or remediate the output of such systems in situations where translations of entire paragraphs or documents are required.

Figure: REX extracts the names from the Arabic passage below, and RNT renders them in English.

واقترحت واشنطن بدلا من ذلك إنشاء محكمة جديدة تابعة للأمم المتحدة والاتحاد الأفريقي في تنزانيا وتعهدت بتقديم دعم مالي كبير لها، مطالبة الدول الغنية الأخرى بتوفير مساعدات مماثلة. وأكدت القائمة بأعمال المندوب الأميركي في مجلس الأمن آن باترسون اهتمام بلادها بمساءلة من وصفتهم بمرتكبي الأعمال الوحشية في دارفور

The extracted names are Washington, Tanzania, Anne Patterson, American, Darfur, Security Council, United Nations, and African Union, each tagged by type (Person, Location, Organization, or Nationality).

Figure: RNT renders a title and name as “King 'Abdallah Bin-'Abd-al-'Aziz”. The title ملك is translated by meaning (“King” instead of “Malik”), while the name عبد الله بن عبد العزيز is transliterated (instead of being translated as “Servant of God”, son of “Servant of the Precious One”).

© 2015 Basis Technology Corporation. “Basis Technology Corporation”, “Rosette”, and “Highlight” are registered trademarks of Basis Technology Corporation. “Big Text Analytics” is a trademark of Basis Technology Corporation. All other trademarks, service marks, and logos used in this document are the property of their respective owners. (2015-06-29-RNT)
