II-SDV 2015, 20-21 April 2015 in Nice

Boehringer Ingelheim Pharma GmbH & Co. KG
Scientific Information Center – S.I.C.

WebCrawling / Internet Research
Emancipation from Public Search

Aleksandar Kapisoda & Klaus Kater (black swan)

Post on 16-Jul-2015

Content

1. Intro: Why we need our own web crawler and search engine

2. Focus on competitive technology and startups: Building proprietary SEARCHCORPORA to

• Find new technology, e.g. university spin-offs / licenses (search)

• Monitor activities of known competitors (alerting)

3. Scientific Information Center - Workflow

4. What S.I.C. Can Now Offer to the Customers

• Targeted SEARCHCORPORA

• Automatic alerting

5. Outlook: What we want to achieve in the next steps

• Ontology mapping

Intro

Why We Need Our Own Web Crawler and Search Engine

The Sea of Information

Our claim is to search all of the sea, not just its surface!

The Sea of Information

• Personal Web Observation (Browser with Google)

• Internet of Things (Patient Health Sensor Data)

• Internal Information (Corporate Databases, Intranet)

• Social Media

• News Feeds (RSS, Email Alerts, Newsletters)

Our Lack of Information

Personal Web Observation

Social Media

What we actually find using public search (Google)

Our Lack of Information

All other information is Deep Web information that cannot be searched with Public Search.

The Lack of Information

[Diagram: WWW, Array of Googlebots, Google repository, Google Rating Magic, Google Ads, Surf behavior, User profile, google.com / .de / …, max 1000 results]

Public search does not allow access to Deep Web information, and also:

• Number of results artificially limited

• Search hit filter logic is not revealed

• Single document content index

Focus on Competitive Technology and Startups

Building proprietary SEARCHCORPORA

Case Studies

Focus on Competitive Technology and Startups: Building Proprietary SEARCHCORPORA

Find new technology, e.g. university spin-offs / licenses (PULL)

• Provide custom SEARCHCORPUS

• Start from technology transfer organizations / universities (spin-offs in 1st step)

1. Crawl information about spin-offs companies (address, website)

2. Extract technology categories

3. Crawl and index websites

4. Build SEARCHCORPUS

• Customize SEARCHCORPUS Viewer 1)

• Publish SEARCHCORPUS Viewer in corporate intranet

1) In addition to common search queries, we support fuzzy search, proximity search and phrases.
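The three query types named in the footnote can be illustrated with a minimal sketch. It uses only the Python standard library; the function names, the similarity cutoff and the sample sentence are assumptions made for illustration and do not describe how the SEARCHCORPUS Viewer is implemented.

```python
import difflib
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def phrase_match(text, phrase):
    """True if the words of `phrase` occur consecutively in `text`."""
    tokens, words = tokenize(text), tokenize(phrase)
    return any(tokens[i:i + len(words)] == words
               for i in range(len(tokens) - len(words) + 1))


def proximity_match(text, a, b, max_distance):
    """True if `a` and `b` occur within `max_distance` tokens of each other."""
    tokens = tokenize(text)
    pos_a = [i for i, t in enumerate(tokens) if t == a.lower()]
    pos_b = [i for i, t in enumerate(tokens) if t == b.lower()]
    return any(abs(i - j) <= max_distance for i in pos_a for j in pos_b)


def fuzzy_match(text, term, cutoff=0.8):
    """True if some token is similar to `term` (tolerates small typos)."""
    close = difflib.get_close_matches(term.lower(), tokenize(text), n=1, cutoff=cutoff)
    return bool(close)


# Hypothetical sample sentence from a spin-off website.
doc = "The spin-off licenses a novel antibody platform from the university."

print(phrase_match(doc, "antibody platform"))            # phrase search
print(proximity_match(doc, "licenses", "platform", 4))   # proximity search
print(fuzzy_match(doc, "antibdy"))                       # fuzzy search with a typo
```

A production search engine would evaluate such queries against an index rather than scanning each document, but the matching rules are the same in spirit.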

Side Note: Annotating target documents with topic specific content to build searchable contexts

Surface Web

Deep Web

Corporate Resources

We can find documents using search terms that appear in the context but not necessarily in the document’s content.
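One way to picture this is an index that stores each document under the terms of its content and of its annotated context. The sketch below is an assumption for illustration only: the class name `ContextIndex`, the sample document and the sample context are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Return the set of lowercase word tokens in `text`."""
    return set(re.findall(r"\w+", text.lower()))


class ContextIndex:
    """Tiny inverted index over document content plus annotated context."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of document ids

    def add(self, doc_id, content, context=""):
        # Context terms are indexed exactly like content terms.
        for term in tokenize(content) | tokenize(context):
            self.postings[term].add(doc_id)

    def search(self, query):
        """Return ids of documents that contain every query term."""
        terms = tokenize(query)
        if not terms:
            return set()
        return set.intersection(*(self.postings.get(t, set()) for t in terms))


index = ContextIndex()
index.add(
    "doc-1",
    content="Our platform delivers mRNA into cells.",
    context="University spin-off, technology category: gene therapy",
)

# "gene therapy" appears only in the context, yet the document is found.
print(index.search("gene therapy"))
```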

Focus on Competitive Technology and Startups: Building Proprietary SEARCHCORPORA - PULL

Find new technology, e.g. university spin-offs / licenses (PULL)

http://www.example_url.com

names and data of targets

crawl

extract

crawl

Target SEARCHCORPUS

expressions to scrape data from pages of published targets

SEARCHCORPUS Viewer
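The crawl, extract, crawl chain in this diagram can be sketched as follows. Everything here is a hypothetical stand-in: the listing page, the regular expression used as the "expression to scrape data", the `example.com` URLs and the `fake_fetch` function, which replaces real HTTP requests so that the sketch runs offline.

```python
import re

# Hypothetical listing page of a technology transfer organization.
LISTING_HTML = """
<ul>
  <li><a href="http://alpha.example.com">Alpha Bio</a> <span>Diagnostics</span></li>
  <li><a href="http://beta.example.com">Beta Tech</a> <span>Drug Delivery</span></li>
</ul>
"""

# Assumed scraping expression: one regex with a named group per data field.
TARGET_EXPR = re.compile(
    r'<a href="(?P<website>[^"]+)">(?P<name>[^<]+)</a>\s*'
    r'<span>(?P<category>[^<]+)</span>'
)

# Stand-in for the second crawl step: page text keyed by URL.
FAKE_WEB = {
    "http://alpha.example.com": "Alpha Bio develops point-of-care diagnostics.",
    "http://beta.example.com": "Beta Tech builds nanoparticle drug delivery systems.",
}


def extract_targets(html):
    """Extract name, website and technology category of each target."""
    return [m.groupdict() for m in TARGET_EXPR.finditer(html)]


def fake_fetch(url):
    """Return the stored page text for `url` (no network access)."""
    return FAKE_WEB.get(url, "")


def build_corpus(targets, fetch):
    """Crawl each target website and attach its text to the target record."""
    return [{**t, "text": fetch(t["website"])} for t in targets]


targets = extract_targets(LISTING_HTML)
corpus = build_corpus(targets, fake_fetch)
for entry in corpus:
    print(entry["name"], "|", entry["category"], "|", entry["text"])
```

A real crawler would typically use an HTML parser instead of a regular expression and would fetch pages over HTTP; the regex keeps the sketch short.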

Focus on Competitive Technology and Startups: Building Proprietary SEARCHCORPORA - PUSH

Monitor activities of known competitors (PUSH)

• Weekly alerts

• Currently concentrating on public companies (3 different websites as sources)

1. Crawl and extract ticker symbols (>15,000 public companies)

2. Crawl and scrape company information (address, website, industry, sector)

3. Crawl and index company news

• For each topic of interest, we create targets as search queries 1)

e.g. “oncology AND acquisition” to find out who acquired oncology companies

• Alerts are automatically sent by email

1) In addition to common search queries, we support fuzzy search, proximity search and phrases.
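A target query such as “oncology AND acquisition” can be evaluated against news items with a few lines of code. This is a minimal sketch under the assumption that a query is a plain conjunction of single words; the headlines are invented and the function names are hypothetical.

```python
import re


def parse_and_query(query):
    """Split a query such as 'a AND b' into its lowercase terms."""
    return [t.strip().lower() for t in re.split(r"\s+AND\s+", query) if t.strip()]


def matches(query, text):
    """True if every term of the AND query occurs as a word in `text`."""
    tokens = set(re.findall(r"\w+", text.lower()))
    return all(term in tokens for term in parse_and_query(query))


# Invented headlines for illustration.
news = [
    "Acme Pharma announces acquisition of oncology startup",
    "Acme Pharma opens new oncology lab",
]

hits = [item for item in news if matches("oncology AND acquisition", item)]
print(hits)
```

Exact word matching misses variants such as "acquires" or "acquired", which is one reason fuzzy search is useful alongside plain queries.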

Focus on Competitive Technology and Startups: Building Proprietary SEARCHCORPORA - PUSH

Monitor activities of known competitors (PUSH)

http://www.example1.com…

http://www.finance.example.com - seed urls

crawl

extract

crawl

Company data, industry, sector, Description, …

expressions to extract stock market ticker symbols

newspage.com seed urls

crawl

newspage.com company news pages

crawl

newspage.com company news

link

Company news corpus

User profile

match

Matching news

send alerts

On a monthly schedule

On a weekly schedule

Email alert
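The link, match and send alerts steps of this diagram can be sketched with the standard library. The company records, ticker symbols, news items, profile and email address below are invented, and the message is only composed, not sent; actual delivery and scheduling are left out.

```python
from email.message import EmailMessage

# Hypothetical company data keyed by ticker symbol.
COMPANIES = {
    "ACME": {"name": "Acme Pharma", "sector": "Healthcare"},
    "BOLT": {"name": "Bolt Motors", "sector": "Automotive"},
}

# Hypothetical company news corpus.
NEWS = [
    {"ticker": "ACME", "title": "Acme Pharma completes acquisition of oncology startup"},
    {"ticker": "BOLT", "title": "Bolt Motors announces acquisition of battery supplier"},
]

# Hypothetical user profile holding the search terms of one alert.
PROFILE = {"email": "analyst@example.com", "terms": ["oncology", "acquisition"]}


def link(news, companies):
    """Attach the company record to each news item via its ticker symbol."""
    return [{**item, **companies[item["ticker"]]}
            for item in news if item["ticker"] in companies]


def match(profile, linked_news):
    """Keep news items whose title contains every term of the profile."""
    return [item for item in linked_news
            if all(term in item["title"].lower() for term in profile["terms"])]


def build_alert(profile, hits):
    """Compose (but do not send) the alert email for the matching news."""
    msg = EmailMessage()
    msg["To"] = profile["email"]
    msg["Subject"] = f"Weekly alert: {len(hits)} matching news item(s)"
    msg.set_content("\n".join(
        f'{h["name"]} ({h["ticker"]}): {h["title"]}' for h in hits))
    return msg


hits = match(PROFILE, link(NEWS, COMPANIES))
alert = build_alert(PROFILE, hits)
print(alert)
```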

Scientific Information Center Workflow

Implementing a Business Process to Offer SEARCH as a Service

Scientific Information Center Workflow

Project Inquiry

Specify Scope

Setup Chains

Review

Research Department / Customer

Crawler

Crawling

Analyzing

Daily use

Scheduled Updates

possibly in iterations

Information Scientist

Scientific Information Center Workflow

Information Scientist

Engineer

Viewer

SEARCHCORPUS Designer

Scheduler / Engine

Container

Tools

Report / XLS

Research Department

Scientific Information Center Workflow

Pay off

[Chart: cost and pay off over time t, with phases Development, Test, Actual Usage and Ongoing Optimization; no predetermined end of life time]

The value of a SEARCHCORPUS increases over time.

What S.I.C. Can Now Offer to the Customers

Automatic alerting (Push)

• Project

• Alert Profile (Search Terms)

• Scheduled Updates

• Scheduled Alerts

• Email Client

Targeted SEARCHCORPORA (Pull)

• Project

• SEARCH Profile (Targets)

• Scheduled Updates

• Faceted SEARCH

• SEARCHCORPUS Viewer (blended into BI Intranet Solution)

Crawler: SIC Crawler

Outlook: What We Want to Achieve in the Next Steps

Technology

User Perspective

GUI for defining Alert Profiles

• Broader project scopes

• Larger SEARCHCORPORA

• More sources

Ontology Mapping

• Map SEARCHCORPUS entries to Ontologies

• Faceting over Ontologies

• Ontology Management: Import AND build ontologies
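Mapping entries to an ontology and faceting over it can be illustrated with a small sketch. The ontology, its labels and the sample entries are invented for illustration; matching by substring is a simplifying assumption, not the planned mapping method.

```python
from collections import Counter

# Hypothetical mini ontology: concept -> parent concept and matching labels.
ONTOLOGY = {
    "therapeutic area": {"parent": None, "labels": []},
    "oncology": {"parent": "therapeutic area", "labels": ["oncology", "cancer", "tumor"]},
    "immunology": {"parent": "therapeutic area", "labels": ["immunology", "autoimmune"]},
}


def map_entry(text):
    """Return the concepts whose labels occur in the entry text."""
    lowered = text.lower()
    return {concept for concept, node in ONTOLOGY.items()
            if any(label in lowered for label in node["labels"])}


def with_ancestors(concepts):
    """Add all parent concepts so that facets roll up the hierarchy."""
    result = set()
    for concept in concepts:
        while concept is not None:
            result.add(concept)
            concept = ONTOLOGY[concept]["parent"]
    return result


def facet_counts(entries):
    """Count how many entries fall under each ontology concept."""
    counts = Counter()
    for text in entries:
        counts.update(with_ancestors(map_entry(text)))
    return counts


# Invented SEARCHCORPUS entries.
entries = [
    "New cancer immunotherapy startup",
    "Autoimmune disease platform",
    "Tumor imaging sensor",
]

print(facet_counts(entries))
```

Because each entry is counted once per concept and once per ancestor, a facet on a parent concept covers all entries mapped to its children.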

Contact Information

Aleksandar Kapisoda

[email protected]

[email protected]

Klaus Kater
