
Post on 21-Dec-2015


Copyright © 2008 Mark Logic Corporation. All rights reserved. 1

Unlock Content™


MarkLogic Server: Under The Hood

Mary Holstege, Principal Engineer


MarkLogic Server

XML Server

Special-purpose DBMS for XML

Semi-structured

Hierarchical

Designed for 100s of TB of XML


How Did We Get Here?

Founder: Christopher Lindblad

MIT

Architect of Ultraseek Server, an intranet search engine product

Met people that wanted to use a search engine like a database

Rich query language

Guaranteed correctness

Transactions


Consider an Application

Documents + metadata

Documents: rich, variable structure

Want: complex full-text search

Want: combined text, metadata, structure-aware search

Want: granular ad hoc access

Want: real-time query

How do you build it?


Two-headed Monster


A Different Approach

Soul of a Search Engine: Data Model And Queries

Database: On-disk Organization And Transactions


Data Model

Document

Title

Author

Abstract

Section

Section

Footer

Section

Section

Section (cont’d)

Metadata


Data Model

A database for XML . . .

. . . uses the XML Data Model

XML is a tree

[Figure: tree rooted at Document, with nodes Title, Author (First, Last), Section (several, nested), and Metadata]


Example Document

<article>

<title>MarkLogic Server: The Best Place for XML</title>

<author><first-name>John</first-name><last-name>Kreisa</last-name></author>

<abstract>

Where should one put their XML? <company>Mark Logic</company> has the best answer to this question: MarkLogic Server. . . .

</abstract>

<body>

<section>

<section> This high performance engine can . . . </section>

</section>

<section> Using an inverted index technique . . . </section>

</body>

<copyright>Copyright© 2008 Mark Logic Corporation. All rights Reserved.</copyright>

</article>


What Queries Is It Good At?

1) Full-Text Search

Find all documents that contain the phrase “high performance”.

2) XML Structure

Find all articles that have an abstract.

3) XML Semantics

Find all documents that mention the company “Mark Logic”.

4) All of the above . . . at the same time

Find all articles that contain the phrase “high performance” and mention the company “Mark Logic” in the abstract.


1) Full-text Search

Find all documents that contain the phrase “high performance”


1) Full-text Search

Document   very   high   performance   index
122        0      1      0             0
123        1      0      1             1
124        0      0      0             0
125        0      1      0             0
126        0      1      1             0
127        1      0      0             0
129        1      1      0             0
130        0      1      1             1

Find all documents that contain the phrase “high performance”


1) Full-text Search

UNIVERSAL INDEX

Term                   Document References
“very”                 123, 127, 129, 152, 344, 791 . . .
“high”                 122, 125, 126, 129, 130, 167 . . .
“performance”          123, 126, 130, 142, 143, 167 . . .
“index”                123, 130, 131, 135, 162, 177 . . .
“high performance”     126, 130, 167, 212, 219, 377 . . .
“very high”            . . .
“performance index”    . . .

126, 130, 167, 212, 219, 377 . . .

Find all documents that contain the phrase “high performance”
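A minimal sketch of the idea on this slide, not MarkLogic's implementation: index every word and every adjacent two-word phrase as a term, and map each term to the documents that contain it. The function names and the toy corpus are assumptions made for illustration; only the document ids echo the slide.

```python
from collections import defaultdict

def terms(text):
    """Yield single-word terms and adjacent two-word phrase terms."""
    words = text.lower().split()
    yield from words
    for first, second in zip(words, words[1:]):
        yield f"{first} {second}"

def build_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in terms(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

# Hypothetical toy corpus: the ids echo the slide, the contents are invented.
DOCS = {
    122: "high availability",
    123: "very good performance index",
    126: "this high performance engine",
    129: "very high throughput",
    130: "a high performance index",
}

INDEX = build_index(DOCS)
print(INDEX["high performance"])  # [126, 130]
```

With the phrase stored as its own term, the phrase query is a single lookup rather than a scan of every document.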


2) XML Structure

Find all articles that have an abstract


2) XML Structure

UNIVERSAL INDEX

Term                   Document References
“very”                 123, 127, 129, 152, 344, 791 . . .
“high”                 122, 125, 126, 129, 130, 167 . . .
“performance”          123, 126, 130, 142, 143, 167 . . .
“index”                123, 130, 131, 135, 162, 177 . . .
“high performance”     126, 130, 167, 212, 219, 377 . . .
<article>              . . .
<article>/<abstract>   . . .

126, 130, 167, 212, 219, 377 . . .

Find all articles that have an abstract


3) XML Semantics

Find all documents that mention the company “Mark Logic”


3) XML Semantics

UNIVERSAL INDEX

Term                    Document References
“very”                  123, 127, 129, 152, 344, 791 . . .
“high”                  122, 125, 126, 129, 130, 167 . . .
“performance”           123, 126, 130, 142, 143, 167 . . .
“index”                 123, 130, 131, 135, 162, 177 . . .
“high performance”      126, 130, 167, 212, 219, 377 . . .
<article>               . . .
<article>/<abstract>    . . .
<company>Mark Logic</   . . .

126, 130, 167, 212, 219, 377 . . .

Find all documents that mention the company “Mark Logic”


4) All Of The Above

Find all articles that contain the phrase “high performance” and mention the company “Mark Logic” in the abstract


4) All Of The Above

UNIVERSAL INDEX

Term                    Document References
“very”                  123, 127, 129, 152, 344, 791 . . .
“high”                  122, 125, 126, 129, 130, 167 . . .
“performance”           123, 126, 130, 142, 143, 167 . . .
“index”                 123, 130, 131, 135, 162, 177 . . .
“high performance”      126, 130, 167, 212, 219, 377 . . .
<article>               . . .
<article>/<abstract>    . . .
<abstract>/<company>    . . .
<company>Mark Logic</   . . .

126, 130, 167, 212, 219, 377 . . .

Find all articles that contain the phrase “high performance” and mention the company “Mark Logic” in the abstract
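A rough sketch of how one index can answer the combined query, under the assumption that words, phrases, element names, parent/child paths, and element values are all stored as terms and that a query intersects their document lists. The term kinds, function names, and the three sample documents are invented for illustration and are not the server's actual term scheme.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def doc_terms(root):
    """Yield word, phrase, element, parent/child path and element-value terms."""
    words = " ".join(root.itertext()).lower().split()
    for word in words:
        yield ("word", word)
    for a, b in zip(words, words[1:]):
        yield ("phrase", f"{a} {b}")
    yield ("element", root.tag)
    for parent in root.iter():
        for child in parent:
            yield ("element", child.tag)
            yield ("path", f"{parent.tag}/{child.tag}")
            if len(child) == 0 and child.text:
                yield ("value", f"{child.tag}={child.text.strip()}")

def build_index(docs):
    """Map every term to the set of document ids that produce it."""
    index = defaultdict(set)
    for doc_id, xml_text in docs.items():
        for term in doc_terms(ET.fromstring(xml_text)):
            index[term].add(doc_id)
    return index

def search(index, *query_terms):
    """Intersect the document lists of all query terms."""
    lists = [index.get(term, set()) for term in query_terms]
    return sorted(set.intersection(*lists)) if lists else []

# Hypothetical documents: only document 1 satisfies every condition.
DOCS = {
    1: "<article><abstract><company>Mark Logic</company> has an answer</abstract>"
       "<body><section>This high performance engine</section></body></article>",
    2: "<article><body><section>A high performance car by "
       "<company>Mark Logic</company></section></body></article>",
    3: "<memo><abstract><company>Mark Logic</company> high performance</abstract></memo>",
}

INDEX = build_index(DOCS)
HITS = search(
    INDEX,
    ("phrase", "high performance"),
    ("element", "article"),
    ("path", "abstract/company"),
    ("value", "company=Mark Logic"),
)
print(HITS)  # [1]
```

The intersection works at document granularity, so it yields candidates; this sketch does not check where in the document each term occurs.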


Scalar Indexes

UNIVERSAL INDEX

Term                    Document References
“very”                  123, 127, 129, 152, 344, 791 . . .
“high”                  122, 125, 126, 129, 130, 167 . . .
“performance”           123, 126, 130, 142, 143, 167 . . .
“index”                 123, 130, 131, 135, 162, 177 . . .
“high performance”      126, 130, 167, 212, 219, 377 . . .
<article>               . . .
<article>/<abstract>    . . .
<abstract>/<company>    . . .
<company>Mark Logic</   . . .

126, 130, 167, …

Identify a set of documents based on criteria and then characterize the set with scalar indexes (float, dateTime, string etc.)
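One way to picture a scalar index, as a sketch only: keep (value, document) pairs sorted by value, so that range scans are binary searches and a candidate set from the universal index can be characterized by its values. The class name, method names, and the year data are assumptions for illustration.

```python
from bisect import bisect_left, bisect_right

class RangeIndex:
    """Sorted (value, doc_id) pairs for one scalar element, e.g. a year."""

    def __init__(self, pairs):
        self.pairs = sorted(pairs)
        self.values = [value for value, _ in self.pairs]

    def docs_between(self, low, high):
        """Document ids whose value lies in the closed range [low, high]."""
        start = bisect_left(self.values, low)
        stop = bisect_right(self.values, high)
        return {doc for _, doc in self.pairs[start:stop]}

    def values_for(self, doc_ids):
        """Distinct values found in a given set of documents, in order."""
        return sorted({value for value, doc in self.pairs if doc in doc_ids})

# Hypothetical publication years for five documents.
YEARS = RangeIndex([(2006, 126), (2008, 130), (2007, 167), (2008, 212), (2005, 123)])

CANDIDATES = {126, 130, 167}               # e.g. the result of a term lookup
print(YEARS.values_for(CANDIDATES))        # [2006, 2007, 2008]
print(sorted(CANDIDATES & YEARS.docs_between(2007, 2008)))  # [130, 167]
```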


Geospatial, too

UNIVERSAL INDEX

Term                    Document References
“very”                  123, 127, 129, 152, 344, 791 . . .
“high”                  122, 125, 126, 129, 130, 167 . . .
“performance”           123, 126, 130, 142, 143, 167 . . .
“index”                 123, 130, 131, 135, 162, 177 . . .
“high performance”      126, 130, 167, 212, 219, 377 . . .
<article>               . . .
<article>/<abstract>    . . .
<abstract>/<company>    . . .
<company>Mark Logic</   . . .

126, 130, 167, …

Just a special kind of scalar index, except values are points and scan operators know about Earth geometry
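As an illustration of a scan operator that knows about Earth geometry, the sketch below stores (point, document) pairs and filters them by great-circle distance using the haversine formula. The formula choice, the function names, and the sample coordinates are assumptions; the slide does not say which geometry the server uses.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation

def great_circle_km(a, b):
    """Haversine distance between two (latitude, longitude) points in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))

def docs_within(point_index, center, radius_km):
    """Scan the point index and keep documents inside the circle."""
    return sorted(
        doc for point, doc in point_index
        if great_circle_km(point, center) <= radius_km
    )

# Hypothetical point values attached to three documents.
POINTS = [
    ((37.77, -122.42), 126),  # San Francisco
    ((37.34, -121.89), 130),  # San Jose
    ((40.71, -74.01), 167),   # New York
]

NEARBY = docs_within(POINTS, center=(37.51, -122.26), radius_km=50)
print(NEARBY)  # [126, 130]
```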


Universal Index Is Our Hammer

We turn queries into nails


Examples Of Nails

Directories

Exclusive, hierarchical, analogous to file system, map to URI

Collections

Set-based, N:N relationship

Security

Invisible to your app


Many Shapes And Sizes

News Article Book Research Report

Slide Presentation Product Sheet Operations Manual


Load As Is

XML is self-describing

<article>

<title>MarkLogic Server: . . .</title>

<author>

<first-name>John</first-name>

<last-name>Kreisa</last-name>

</author>

<abstract>

. . . . <company>Mark Logic</company>

</abstract>

<body>

<section>

<section> . . .</section>

</section>

<section> . . . index . . . </section>

</body>

<copyright>Copyright© . . . </copyright>

</article>


Load As Is

XML is self-describing

[Figure: the article drawn as a tree. <article> has children <title>, <author>, <abstract>, <body>, and <copyright>; <author> holds <first-name> and <last-name>; <abstract> holds <company>; <body> holds nested <section> elements. Text leaves: MarkLogic Server: . . . ; John; Kreisa; MarkLogic; . . . index. . .]


Load As Is

[Figure: the article tree again, element nodes with quoted text leaves: "MarkLogic Server: . . .", "John", "Kreisa", "MarkLogic", " . . . index. . . "]

XML is self-describing


Load As Is

[Figure: the article tree again, element nodes with quoted text leaves: "MarkLogic Server: . . .", "John", "Kreisa", "MarkLogic", " . . . index. . . "]

XML is self-describing

No Schema Needed!
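To make the "no schema needed" point concrete, here is a small sketch that recovers the tree of any well-formed document from its markup alone, using Python's standard XML parser. The `outline` helper and the two sample documents are assumptions for illustration; nothing here reflects how the server loads data internally.

```python
import xml.etree.ElementTree as ET

def outline(element, depth=0):
    """Return (depth, tag, text) rows discovered from the markup alone."""
    text = (element.text or "").strip()
    rows = [(depth, element.tag, text)]
    for child in element:
        rows.extend(outline(child, depth + 1))
    return rows

# Two differently shaped documents, loaded without declaring either shape.
ARTICLE = (
    "<article><title>MarkLogic Server</title>"
    "<author><first-name>John</first-name><last-name>Kreisa</last-name></author>"
    "</article>"
)
MEMO = "<memo><to>Staff</to><note>Read the <em>manual</em></note></memo>"

ARTICLE_ROWS = outline(ET.fromstring(ARTICLE))
MEMO_ROWS = outline(ET.fromstring(MEMO))

for depth, tag, text in ARTICLE_ROWS + MEMO_ROWS:
    print("  " * depth + f"<{tag}> {text}".rstrip())
```

The same function handles both inputs because element names and nesting travel with the data.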


Degrees Of Flexibility

[Figure: two-axis chart. Structure axis: Ad hoc to Predefined. Queries axis: Ad hoc to Predefined. Plotted systems: IMS, IDMS, Relational Databases, Search Engines, MarkLogic Server, XML Server]


The Query Language

[Figure labels: XML; Universal Index; XQuery; Full-Text Search; XML Structure; XML Semantics; Application Logic; Manipulate XML; Render Results; Load As Is]


The Programming Language

[Figure labels: XML; Universal Index; XQuery; Full-Text Search; XML Structure; XML Semantics; Application Logic; Manipulate XML; Render Results; Load As Is]


A Different Approach

Soul of a Search Engine: Data Model And Queries

Database: On-disk Organization And Transactions


What’s In A Database?

No tables

No rows

forests . . .

. . . . of trees

Database

Forest1 Forest2 Forest3


The Cluster

[Figure: a cluster of hosts e1, e2 . . . ek and hosts d1, d2, d3 . . . dl, with forests Forest1, Forest2, Forest3, Forest4 . . . Forestm]


What About Updates?

Typical XML document:

10KB – 1MB

Referenced by 1,000s to 10,000s of term lists

Search engines are bad at updates

Many indexes to update

Option: Index and Information out of sync

Option: Slow

We want

High throughput

Transactions (ACID)

So how do we avoid updates?


Solution: Temporal Database

No update! No delete!

Only insert and read-at-a-time

Every document has two timestamps

“created”, “expired”
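The insert-only model can be sketched as follows, assuming a single logical clock and one version record per insert. An update is an insert that also stamps the old version as expired, a delete only stamps, and a read at time t sees versions with created <= t < expired. The class and its methods are hypothetical names, not the server's API.

```python
import itertools

FOREVER = float("inf")

class TemporalStore:
    """Versions are never overwritten; they only gain an expiry timestamp."""

    def __init__(self):
        self._clock = itertools.count(1)
        self.versions = []  # dicts with uri, body, created, expired

    def _expire(self, uri, timestamp):
        for version in self.versions:
            if version["uri"] == uri and version["expired"] == FOREVER:
                version["expired"] = timestamp

    def insert(self, uri, body):
        """Create a document, or 'update' it by expiring the old version."""
        timestamp = next(self._clock)
        self._expire(uri, timestamp)
        self.versions.append(
            {"uri": uri, "body": body, "created": timestamp, "expired": FOREVER}
        )
        return timestamp

    def delete(self, uri):
        """A delete just marks the current version as expired."""
        timestamp = next(self._clock)
        self._expire(uri, timestamp)
        return timestamp

    def read(self, uri, at):
        """Return the version visible at timestamp `at`, if any."""
        for version in self.versions:
            if version["uri"] == uri and version["created"] <= at < version["expired"]:
                return version["body"]
        return None

store = TemporalStore()
t1 = store.insert("a.xml", "a v1")
t2 = store.insert("b.xml", "b v1")
t3 = store.insert("a.xml", "a v2")   # update = expire + insert
t4 = store.delete("b.xml")
print(store.read("a.xml", at=t2), "|", store.read("a.xml", at=t3))  # a v1 | a v2
```

A query that fixes its timestamp when it starts keeps seeing the same versions regardless of later inserts, which is what makes lock-free reads possible in this model.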


Temporal Database

[Figure: timeline running from timestamp 520 to 528 with the operations Create a.xml, Create b.xml, Update a.xml (twice), Delete b.xml . . . and two Query markers]


The Cluster

[Figure: a cluster of hosts e1, e2 . . . ek and hosts d1, d2, d3 . . . dl, with forests Forest1, Forest2, Forest3, Forest4 . . . Forestm]


A Single Forest

[Figure: Forestk on its host, with Stand1, Stand2 . . . Standn and two buffers]


1. Create A New Tree

[Figure: Forestk on its host, with Stand1, Stand2 . . . Standn and two buffers]


2. Expire Trees

[Figure: Forestk on its host, with Stand1, Stand2 . . . Standn and two buffers]


3. Save A Buffer To Disk

[Figure: Forestk on its host, with Stand1, Stand2 . . . Standn and two buffers]


4. Optimization: Merge Stands

[Figure: Forestk on its host, with a buffer]


The Four Forest Operations

1. Create a new document
   • Into a buffer

2. Mark a document as expired
   • Memory-mapped document timestamps per stand

3. Write buffer out to disk
   • Our buffers are 100s of megabytes
   • For performance, double buffer

4. Merge
   • Background process
   • Optimization: reduces number of stands in forest
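The four operations can be sketched with a toy forest, under several simplifying assumptions: stands are plain in-memory objects standing in for on-disk files, expiry is a per-stand set rather than memory-mapped timestamps, there is one buffer instead of two, and merge runs on demand rather than in the background. All names are hypothetical.

```python
class Stand:
    """A batch of documents plus the set of uris expired within it."""

    def __init__(self):
        self.docs = {}        # uri -> body
        self.expired = set()  # uris marked expired in this stand

    def live(self):
        return {uri: body for uri, body in self.docs.items() if uri not in self.expired}

class Forest:
    def __init__(self, buffer_limit=2):
        self.buffer_limit = buffer_limit  # tiny, for demonstration only
        self.buffer = Stand()             # in-memory stand
        self.stands = []                  # stand-ins for on-disk stands

    def expire(self, uri):
        """Operation 2: mark every existing copy of the document as expired."""
        for stand in [self.buffer, *self.stands]:
            if uri in stand.docs:
                stand.expired.add(uri)

    def create(self, uri, body):
        """Operation 1: new documents always go into the buffer."""
        self.expire(uri)
        self.buffer.docs[uri] = body
        self.buffer.expired.discard(uri)
        if len(self.buffer.docs) >= self.buffer_limit:
            self.flush()

    def flush(self):
        """Operation 3: the full buffer becomes a new stand."""
        if self.buffer.docs:
            self.stands.append(self.buffer)
            self.buffer = Stand()

    def merge(self):
        """Operation 4: fold all stands into one, dropping expired documents."""
        merged = Stand()
        for stand in self.stands:
            merged.docs.update(stand.live())
        self.stands = [merged] if merged.docs else []

    def read(self, uri):
        for stand in [self.buffer, *self.stands]:
            if uri in stand.docs and uri not in stand.expired:
                return stand.docs[uri]
        return None

forest = Forest(buffer_limit=2)
forest.create("a.xml", "a v1")
forest.create("b.xml", "b v1")   # buffer is full, so it is flushed to a stand
forest.create("a.xml", "a v2")   # expires the old a.xml in the first stand
forest.expire("b.xml")
forest.flush()
stands_before_merge = len(forest.stands)
forest.merge()
print(stands_before_merge, len(forest.stands), forest.read("a.xml"))  # 2 1 a v2
```

Existing stands are only ever marked, never rewritten in place; merge is where expired documents are dropped.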


Consistency And Throughput

2-phase commit

Transactions span forests

Recovery

Forest Journals

Lock-free queries

Use the search engine at a point-in-time

Increased throughput

Time travel?
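A textbook sketch of two-phase commit across forests, with each forest writing to its own journal. This shows the general protocol only; the record format, class, and function names are assumptions, and the slide does not describe how the server implements it.

```python
class ForestParticipant:
    """One forest taking part in a transaction that spans forests."""

    def __init__(self, name, fail_on_prepare=False):
        self.name = name
        self.fail_on_prepare = fail_on_prepare  # simulate a forest that cannot prepare
        self.journal = []    # records appended before each acknowledgement
        self.committed = {}  # uri -> body
        self.pending = None

    def prepare(self, writes):
        if self.fail_on_prepare:
            return False
        self.journal.append(("prepare", dict(writes)))
        self.pending = dict(writes)
        return True

    def commit(self):
        self.journal.append(("commit",))
        self.committed.update(self.pending)
        self.pending = None

    def abort(self):
        self.journal.append(("abort",))
        self.pending = None

def two_phase_commit(writes_by_forest):
    """Commit everywhere or nowhere. Takes a list of (participant, writes)."""
    prepared = []
    for forest, writes in writes_by_forest:      # phase 1: ask everyone to prepare
        if forest.prepare(writes):
            prepared.append(forest)
        else:
            for participant in prepared:         # one refusal aborts the rest
                participant.abort()
            return False
    for forest in prepared:                      # phase 2: everyone commits
        forest.commit()
    return True

f1, f2 = ForestParticipant("Forest1"), ForestParticipant("Forest2")
ok = two_phase_commit([(f1, {"a.xml": "v1"}), (f2, {"b.xml": "v1"})])

f3 = ForestParticipant("Forest3", fail_on_prepare=True)
failed = two_phase_commit([(f1, {"c.xml": "v1"}), (f3, {"d.xml": "v1"})])
print(ok, failed, sorted(f1.committed))  # True False ['a.xml']
```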


A Different Approach

Soul of a Search Engine: Data Model And Queries

Database: On-disk Organization And Transactions


Summary

XML as data model

Ad hoc schema

A search engine core

Universal Index

Temporal transaction model

High throughput while keeping . . .

Performance and scalability of a search engine


Mary Holstege

Principal Engineer

[email protected]

t: 650.655.2336

f: 650.655.2310

Thank You


The Cluster

[Figure: a cluster of hosts e1, e2 . . . ek and hosts d1, d2, d3 . . . dl, with forests Forest1, Forest2, Forest3, Forest4 . . . Forestm]