surfacing the deep web (2 slides per page)

Upload: arthur-weiss

Post on 05-Dec-2014

DESCRIPTION

Presentation given at Internet Librarian International 2013 - Websearch Academy. 14 October 2013 on finding deep-web / invisible web content

TRANSCRIPT

Page 1: Surfacing the deep web (2 slides per page)

Surfacing the Web Websearch Academy 2013

14 October 2013

© Arthur Weiss, AWARE, 2013 1

© AWARE 2013 Tel: +44 20 8954 9121 • Fax: +44 20 8954 2102 • Web: www.marketing-intelligence.co.uk

Arthur Weiss Email: [email protected] / Twitter: @awareci

www.marketing-intelligence.co.uk 14 October 2013

Surfacing the Deep Web WebSearch Academy

Internet Librarian International

Not everything can be found with Google…

The ‘Invisible Web’ or ‘Deep Web’ consists of web pages and documents that conventional search engines do not index, or index only poorly or incompletely.

5 Types of “Invisibility”

•  Not search engine optimised, so pages fail to appear in “simple” searches
•  Not indexed by search engines
•  Subscription or proprietary content
•  Excluded from search index
•  Encrypted or non-indexable content

Know your tool kit

Standard Google, or multiple approaches & tools

What do I need to find?

What sort of needle? What sort of haystack?

http://www.morguefile.com/archive/display/21091

•  Why will the information be available?
•  Where will it be held? (Who will know it?)
•  Can I obtain it legally and ethically from this source and, if so, how?
•  If not, are there other sources or ways of obtaining the information?
•  After obtaining the information, are any checks needed to verify it?
•  What is the information’s relationship to other information?

Not everything is online or can be found!

•  Try to find:
  - Original TV coverage of the storming of the Bastille [1]
  - A newspaper interview with Christopher Columbus, following his return from discovering America
  - A recording of Abraham Lincoln delivering the Gettysburg Address
  - A photo of Jesus in his crib (Question from a 9-year-old: “Why didn’t anybody take photos with their phones?”)

[1] With thanks to Karen Blakeman of RBA Information (rba.co.uk) for these examples

“Forty-two! Is that all you’ve got to show for seven and a half million years’ work?”

“I checked it very thoroughly and that quite definitely is the answer. I think the problem, to be quite honest with you, is that you’ve never actually known what the question is.”

Douglas Adams, “The Hitchhiker’s Guide to the Galaxy”

If you start from the wrong question, it doesn’t matter which approach or tool you use, or how you use it: your results will be poor or wrong.

Before starting to search: consider sources for the subject / topic of interest

•  Would any of the relevant pages be in another language? e.g. “cheap hotel in Dubai” OR “فندق اقتصادي في دبي” (Arabic: “economy hotel in Dubai”)
•  Are there societies, organisations, people, or groups that may have information? (Who/where else could have information?)
•  What search tool / approach is most likely to access or index the information’s location (container)?
•  Are there unique terms or jargon that lead to a specialist tool? e.g. lung cancer (consumer) versus pulmonary carcinoma (medical)
•  Why is the information likely to be available?
•  Consider also file formats and the location of search terms

Before starting to search: consider search terms for the topic or subject of interest

•  How might the information be written? e.g. “I work for Xcompany” to search for employees of Xcompany; “X is better than” for comparisons
•  Are any keywords likely to be in irrelevant documents that should be excluded from searches?
•  Are any keywords part of a common phrase?
•  Are there any other words likely to be in documents on the topic?
•  Are there any synonyms or variant spellings? e.g. tyre or tire; aluminium or aluminum; candy or sweet; Basle or Basel
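
The checklist above can be turned into a query string fairly mechanically. The sketch below is only an illustration and not part of the original slides: the function names `phrase` and `build_query` are hypothetical, the terms "recycling" and "bicycle" are made up, and the syntax (double quotes for phrases, OR between alternatives, a leading minus for exclusions) is assumed to be the common search-engine convention. Operator support varies between engines, so treat the output as a starting point.

```python
def phrase(text: str) -> str:
    """Wrap multi-word text in double quotes so it is searched as an exact phrase."""
    return f'"{text}"' if " " in text else text


def build_query(required, alternatives=(), excluded=()):
    """Combine search terms into a single query string.

    required     -- terms or phrases that should all appear
    alternatives -- groups of synonyms / variant spellings, joined with OR
    excluded     -- terms to exclude, written with a leading minus sign
    """
    parts = [phrase(t) for t in required]
    parts += [" OR ".join(phrase(t) for t in group) for group in alternatives]
    parts += ["-" + phrase(t) for t in excluded]
    return " ".join(parts)


# "recycling" and "bicycle" are made-up terms for illustration only.
query = build_query(
    required=["recycling"],
    alternatives=[("tyre", "tire")],
    excluded=["bicycle"],
)
print(query)  # recycling tyre OR tire -bicycle

# A phrase reflecting how the information might be written.
employee_query = build_query(required=["I work for Xcompany"])
print(employee_query)  # "I work for Xcompany"
```

Keeping synonyms in explicit groups makes it easy to add or drop a variant spelling without rewriting the whole query.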

Research Planning

•  Information Requirements: break down into individual questions that, when answered, will provide the required knowledge
•  Don’t start searching without knowing what you are looking for, and why

An example research plan

Copy & fill in a sheet for each key information question / topic.

Sheet headings: Research Topic; Research Questions (break the topic down into answerable questions); Sources; Search Approach / Parameters; Type of information expected; Comments / Possible problems

Example entries:

Source | Search approach / parameters | Type of information expected | Comments / possible problems
LinkedIn | Job title, current employer, etc. | People profiles | May not be accurate or up to date
Google Scholar | Author name, topic, date, etc. | Citations, academic research papers… | Doesn’t cover everything
National Statistics | Site search engine | Census & demographic data | May be old or incomplete
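
For anyone who prefers to keep the research plan in a script rather than on paper, the sheet can be modelled as a small data structure. This is a minimal sketch under stated assumptions: the class and field names (`ResearchPlan`, `PlanEntry`, and so on) are hypothetical, and the topic and question strings are placeholders. Only the three example entries come from the slide.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PlanEntry:
    """One row of the sheet: a source and how it will be searched."""
    source: str
    search_approach: str
    expected_information: str
    possible_problems: str


@dataclass
class ResearchPlan:
    """One sheet per key information question / topic."""
    topic: str
    questions: List[str] = field(default_factory=list)
    entries: List[PlanEntry] = field(default_factory=list)

    def summary(self) -> List[str]:
        """Return one readable line per source, problems included."""
        return [
            f"{e.source}: {e.search_approach} -> {e.expected_information} "
            f"(watch for: {e.possible_problems})"
            for e in self.entries
        ]


# Topic and question are placeholders; the entries mirror the slide.
plan = ResearchPlan(
    topic="<research topic>",
    questions=["<first answerable question>"],
    entries=[
        PlanEntry("LinkedIn", "job title, current employer, etc.",
                  "people profiles", "may not be accurate or up to date"),
        PlanEntry("Google Scholar", "author name, topic, date, etc.",
                  "citations, academic research papers", "doesn't cover everything"),
        PlanEntry("National Statistics", "site search engine",
                  "census & demographic data", "may be old or incomplete"),
    ],
)

for line in plan.summary():
    print(line)
```

Recording the likely problems next to each source keeps the verification step visible when results come back.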

Advanced Searching

•  Use advanced search operators and options, e.g. filetype: / intitle: / inurl: / .. (numeric range) and * (wildcard)
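
As an illustration of how these operators combine, the sketch below assembles a query string and a search URL. It is an assumption-laden example rather than anything shown on the slide: the helper names `advanced_query` and `search_url` are hypothetical, the example terms are made up, and the base URL assumes Google's usual `search?q=` form. Operator behaviour changes over time and differs between engines, so check the results rather than trusting the syntax.

```python
from urllib.parse import urlencode


def advanced_query(terms, filetype=None, intitle=None, inurl=None, number_range=None):
    """Append optional advanced operators to a base set of search terms."""
    parts = [terms]
    if filetype:
        parts.append(f"filetype:{filetype}")   # restrict to one file format
    if intitle:
        parts.append(f"intitle:{intitle}")     # word must appear in the page title
    if inurl:
        parts.append(f"inurl:{inurl}")         # word must appear in the URL
    if number_range:
        low, high = number_range
        parts.append(f"{low}..{high}")         # numeric range, e.g. years
    return " ".join(parts)


def search_url(query, base="https://www.google.com/search"):
    """Build a search URL with the query safely percent-encoded."""
    return f"{base}?{urlencode({'q': query})}"


# Example values are made up; "*" stands in for any word(s) in the phrase.
query = advanced_query(
    "coffee * statistics",
    filetype="pdf",
    intitle="report",
    number_range=(2010, 2013),
)
print(query)  # coffee * statistics filetype:pdf intitle:report 2010..2013
print(search_url(query))
```

Building the query from named parameters makes it easy to drop one operator at a time when a search returns too few results.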

Search Engines – not just Google

Specialist Search / Deep Web Search

Search for Information “Containers”

•  Knowing a reason for the information to be available can lead to an information source
  - Who else would want this information?
  - Search for topic + “Database”, e.g. “coffee database” (the slide shows the first two results)

Case Examples – Economics by Country

Case Examples – Trade Statistics

Case Examples – Economic Indicators

Case Examples – Genealogy

Proprietary sites / Blocked from Index

•  Register for password-protected sites
•  Use site search or site map, if available
•  If a robots.txt file exists, you may be able to view the hidden pages it lists, e.g. nytimes.com/robots.txt
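
To show what reading a robots.txt file for hidden paths might look like, here is a minimal sketch that pulls the `Disallow` entries out of a robots.txt text. The function name `disallowed_paths` is hypothetical, and the sample file content is invented for illustration (it is not the contents of any real site's file). The sketch ignores which user agent each rule applies to and simply lists every path the site asks crawlers to skip.

```python
from typing import List


def disallowed_paths(robots_txt: str) -> List[str]:
    """Return the unique Disallow paths in a robots.txt text, in file order."""
    paths: List[str] = []
    for raw_line in robots_txt.splitlines():
        line = raw_line.split("#", 1)[0].strip()   # drop trailing comments
        if ":" not in line:
            continue                               # blank or malformed line
        field_name, value = line.split(":", 1)
        path = value.strip()
        # An empty Disallow value means "nothing is blocked", so skip it.
        if field_name.strip().lower() == "disallow" and path and path not in paths:
            paths.append(path)
    return paths


# Invented sample content, for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /archive/private/
Disallow: /search   # internal search results
Allow: /public/

User-agent: ExampleBot
Disallow: /archive/private/
Disallow:
"""

hidden = disallowed_paths(SAMPLE_ROBOTS_TXT)
print(hidden)  # ['/archive/private/', '/search']
```

The listed paths are only requests to crawlers not to index those areas; whether a page can actually be viewed, and whether it is appropriate to do so, still depends on the site's access controls and terms.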

Content that can’t / won’t be indexed

•  Non-textual information, e.g. multimedia / audiovisual
  - Bing has search operators that can find RSS feeds (hasfeed:) and pages containing specific types of file (e.g. mp3 files – contains:mp3)
  - Search for related textual information, e.g. descriptions or sources (e.g. artwork or film titles)
•  Encrypted information / .onion sites
  - Project Tor (torproject.org) and the Tor browser
  - Access encrypted sites via proxy servers

Searching TOR

•  On regular Google: fake passport site:onion.to

TOR / .Onion Sites

Arthur Weiss is the managing director of AWARE, a UK-based consultancy specialising in marketing & competitive intelligence analysis.

Contact Details:
Web Site: www.marketing-intelligence.co.uk
E-mail: [email protected]
Twitter: @awareci
Telephone: +44 20 8954 9121
Fax: +44 20 8954 2102

Any Questions?