A hop is a relationship, R, from entity E to entity F. Strong Rule Mining (SRM) finds all frequent, confident rules A⇒C.

Figure: bit matrix of R(E,F), with E indexed 1..4 and F indexed 2..5.

0 0 0 1
0 0 1 0
0 0 0 1
0 1 0 0

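SRMs are categorized by the number of hops, k, by whether they are transitive or non-transitive, and by the focus entity. ARM is a 1-hop, non-transitive (A, C ⊆ E), F-focused SRM (1nF).

A⇒C is strong if

$$\frac{\mathrm{ct}\big(\&_{e\in A} R_e \;\&\; P_C\big)}{\mathrm{ct}\big(\&_{e\in A} R_e\big)} \ge \mathrm{mncf} \qquad\text{and}\qquad \mathrm{ct}\big(\&_{e\in A} R_e\big) \ge \mathrm{mnsp},$$

where ct counts 1-bits, $R_e$ is the bit vector of R for e, and $P_C$ is the bit mask of C.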

consequent upward closure: If A⇒C is non-confident, then so is A⇒D for all subsets, D, of C. So for each frequent antecedent, A, use upward closure to mine all of its confident consequents.

antecedent downward closure: If A is frequent, all of its subsets are frequent. Or, if A is infrequent, then so are all of its supersets. Since frequency involves only A, we can mine for all qualifying antecedents efficiently using downward closure.

Transitive (a+c)-hop Apriori strong rule mining, with a focus entity that is a hops from the antecedent and c hops from the consequent: if a (respectively c) is odd, use downward closure on that step of mining strong (frequent and confident) rules; if it is even, use upward closure.

In this case A is 1-hop from F (odd, use downward closure). C is 0-hops from F (even, use upward closure).
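The F-focused support and confidence tests can be sketched with plain bit lists. This is a minimal sketch, not the slides' pTree implementation: the function names, the dictionary layout of R and the sample data are all assumptions made for illustration.

```python
def and_all(vectors, width):
    """Bitwise AND of 0/1 lists; the AND over an empty collection is all ones."""
    result = [1] * width
    for vector in vectors:
        result = [a & b for a, b in zip(result, vector)]
    return result


def strong_1hop(R, A, mask_C, mnsp, mncf):
    """Return (support, confidence, is_strong) for A => C, focused on F."""
    antecedent = and_all([R[e] for e in A], len(mask_C))   # &_{e in A} R_e
    support = sum(antecedent)                               # ct(&_{e in A} R_e)
    if support == 0:
        return 0, 0.0, False
    both = sum(a & c for a, c in zip(antecedent, mask_C))   # ct(... & P_C)
    confidence = both / support
    return support, confidence, support >= mnsp and confidence >= mncf


# Hypothetical data: R[e] is the bit vector of e over four F values.
R = {1: [1, 0, 1, 1], 2: [1, 1, 1, 0]}
result = strong_1hop(R, A=[1, 2], mask_C=[1, 0, 0, 0], mnsp=2, mncf=0.5)
print(result)
```

Adding an element to A can only clear bits in the antecedent vector, which is why support shrinks on supersets (downward closure on the antecedent step).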

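We will check more examples to see whether the odd⇒downward, even⇒upward theorem seems to hold.

1-hop, transitive (A ⊆ E, C ⊆ F), F-focused SRM (1tF)

1-hop, transitive, E-focused SRM (1tE): A⇒C is strong if

$$\frac{\mathrm{ct}\big(P_A \;\&\; \&_{f\in C} R_f\big)}{\mathrm{ct}(P_A)} \ge \mathrm{mncf} \qquad\text{and}\qquad |A| = \mathrm{ct}(P_A) \ge \mathrm{mnsp}.$$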

antecedent upward closure: If A is infrequent, then so are all of its subsets.

consequent downward closure: If A⇒C is non-confident, then so is A⇒D for all supersets, D, of C.

In this case A is 0-hops from E (even, use upward closure). C is 1-hop from E (odd, use downward closure).

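2-hop transitive, F-focused: A⇒C is strong if

$$\frac{\mathrm{ct}\big(\&_{e\in A} R_e \;\&\; \&_{g\in C} S_g\big)}{\mathrm{ct}\big(\&_{e\in A} R_e\big)} \ge \mathrm{mncf} \qquad\text{and}\qquad \mathrm{ct}\big(\&_{e\in A} R_e\big) \ge \mathrm{mnsp}.$$

Figure: bit matrices of R(E,F) and S(F,G), with E and G indexed 1..4 and F indexed 2..5; antecedent A ⊆ E, consequent C ⊆ G.

R(E,F):
0 0 0 1
0 0 1 0
0 0 0 1
0 1 0 0

S(F,G):
1 0 0 1
0 1 1 1
1 0 0 0
1 1 0 0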

Apriori for 2-hops: Find all frequent antecedents, A, using downward closure. Find C1G, the set of g's such that A⇒{g} is confident. Find C2G, the set of pairs from C1G that are confident consequents for antecedent A. Find C3G, the set of triples such that all of their sub-pairs are in C2G (à la Apriori), etc.

(a,c) = (1,1): both odd, so downward, downward is correct.
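The first step (finding all frequent antecedents with downward closure) can be sketched as a level-wise search. This is only an illustrative sketch: the function names and sample data are assumptions, and support is taken to be ct(&_{e in A} R_e) as in the F-focused formulas.

```python
from itertools import combinations


def support(R, A, width):
    """ct(&_{e in A} R_e) for 0/1 lists R[e] of the given width."""
    bits = [1] * width
    for e in A:
        bits = [a & b for a, b in zip(bits, R[e])]
    return sum(bits)


def frequent_antecedents(R, mnsp):
    """Level-wise (Apriori-style) search: a k-set is tested only if all
    of its (k-1)-subsets were frequent (downward closure)."""
    width = len(next(iter(R.values())))
    level = [frozenset([e]) for e in sorted(R) if support(R, [e], width) >= mnsp]
    frequent = list(level)
    k = 1
    while level:
        k += 1
        previous = set(level)
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        level = [c for c in sorted(candidates, key=sorted)
                 if all(frozenset(s) in previous for s in combinations(c, k - 1))
                 and support(R, c, width) >= mnsp]
        frequent.extend(level)
    return frequent


# Hypothetical relationship R(E,F): three E values over four F values.
R = {1: [1, 1, 0, 1], 2: [1, 1, 0, 0], 3: [0, 1, 1, 0]}
found = frequent_antecedents(R, mnsp=2)
print(found)
```

The confident consequents C1G, C2G, C3G, ... would then be grown the same way for each frequent A, using the confidence test in place of the support test.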

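2-hop transitive, G-focused: A⇒C is strong if

$$\frac{\mathrm{ct}\big(\&_{f\in \mathrm{list}(\&_{e\in A} R_e)} S_f \;\&\; P_C\big)}{\mathrm{ct}\big(\&_{f\in \mathrm{list}(\&_{e\in A} R_e)} S_f\big)} \ge \mathrm{mncf} \qquad\text{and}\qquad \mathrm{ct}\big(\&_{f\in \mathrm{list}(\&_{e\in A} R_e)} S_f\big) \ge \mathrm{mnsp}.$$

1. (antecedent upward closure) If A is infrequent, then so are all of its subsets.

2. (consequent upward closure) If A⇒C is non-confident, then so is A⇒D for all subsets, D, of C.

(a,c) = (2,0): both even, so upward, upward is correct.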
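The G-focused test goes through an intermediate list of F values. A minimal sketch, assuming R[e] is a bit list over F and S[f] is a bit list over G (the names, layout and data are illustrative assumptions, not the slides' implementation):

```python
def and_all(vectors, width):
    """Bitwise AND of 0/1 lists; the AND over an empty collection is all ones."""
    result = [1] * width
    for vector in vectors:
        result = [a & b for a, b in zip(result, vector)]
    return result


def two_hop_g_focus(R, S, A, mask_C, f_labels):
    """Return (f_list, support, confidence) for A => C, focused on G."""
    r_a = and_all([R[e] for e in A], len(f_labels))         # &_{e in A} R_e
    f_list = [f for f, bit in zip(f_labels, r_a) if bit]    # list(&_{e in A} R_e)
    s_list = and_all([S[f] for f in f_list], len(mask_C))   # &_{f in list} S_f
    support = sum(s_list)
    both = sum(s & c for s, c in zip(s_list, mask_C))       # ct(... & P_C)
    confidence = both / support if support else 0.0
    return f_list, support, confidence


# Hypothetical data: F is labelled 2..5, G has four values.
R = {1: [1, 0, 0, 1], 2: [1, 1, 0, 1]}
S = {2: [1, 1, 0, 1], 3: [0, 0, 0, 0], 4: [1, 1, 1, 1], 5: [0, 0, 1, 1]}
result = two_hop_g_focus(R, S, A=[1, 2], mask_C=[1, 1, 0, 1], f_labels=[2, 3, 4, 5])
print(result)
```

Shrinking A can only lengthen the list of F values, so the AND over that list produces fewer ones; this is the reason the antecedent step uses upward closure here.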

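2-hop transitive, E-focused: A⇒C is strong if

$$\frac{\mathrm{ct}\big(P_A \;\&\; \&_{f\in \mathrm{list}(\&_{g\in C} S_g)} R_f\big)}{\mathrm{ct}(P_A)} \ge \mathrm{mncf} \qquad\text{and}\qquad \mathrm{ct}(P_A) \ge \mathrm{mnsp}.$$

antecedent upward closure: If A is infrequent, so are all of its subsets.

consequent upward closure: If A⇒C is non-confident, so is A⇒D for all subsets, D, of C.

(a,c) = (0,2): both even, so upward, upward is correct.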

A⇒C is confident if a high fraction of the f ∈ F that are related to every a ∈ A are also related to every c ∈ C.

F is the focus entity and the high fraction is the MinimumConfidence ratio.

Figure: the three RoloDexes of the DTPe cube: TD cards (Term 1..9 by Doc 1..3, one card per Position k = 1..7), PD cards (Pos 1..7 by Doc 1..3, one card per Term k = 1..9), and PT cards (Pos 1..7 by Term 1..9, one card per Doc k = 1..3).

We can form multi-hop relationships from RoloDex cards. Does this open up a new area of text mining for the three DTP RoloDexes?

Recall: A⇒C is confident if a high fraction of the f ∈ F that are related to every a ∈ A are also related to every c ∈ C. F is the focus entity and the high fraction is the MinimumConfidence ratio.

Figure: 2-hop chain of the cards DT(P=h) and DT(P=k), sharing the Term axis (T, 1..9), with Doc axes (D, 1..3); A and C are sets of documents.

A confident DThk rule means: a high fraction of the terms, t ∈ T, in Position=h of every doc in A are also in Position=k of every doc in C.

Is there a high-payoff research area here?

Figure: 2-hop chain of the cards DP(T=h) and DP(T=k), sharing the Position axis (P, 1..7), with Doc axes (D, 1..3); A and C are sets of documents.

A confident DPhk rule means: a high fraction of the Positions, p ∈ P, that hold Term=h for every doc in A also hold Term=k in Pos=p for every doc in C.

Is this a high-payoff research area?

Figure: 2-hop chain of the cards TP(D=h) and TP(D=k), sharing the Position axis (P, 1..7), with Term axes (T, 1..9); A and C are sets of terms.

A confident TPhk rule means: a high fraction of the Positions, p ∈ P, in Doc=h that hold every Term t ∈ A also hold every Term t ∈ C in Doc=k.

This only makes sense for A, C singleton Terms. Also, it seems like P would have to be a singleton?


Figure: 2-hop chain of the cards TD(P=h) and TD(P=k), sharing the Doc axis (D, 1..3), with Term axes (T, 1..9); A and C are sets of terms.

A confident TDhk rule means: a high fraction of the Documents, d ∈ D, having every Term t ∈ A in Position=h also have every Term t ∈ C in Position=k. Again, A and C must be singletons. High payoff?

It suggests, in 1-hop ARM, looking for strong TD rules: a high fraction of the Documents, d ∈ D, having every Term t ∈ A also have every Term t ∈ C. Again, A and C must be singletons. Is there a high-payoff research area here?

Figure: 2-hop chain of the cards PD(T=h) and PD(T=k), sharing the Doc axis (D, 1..3), with Position axes (P, 1..7); A and C are sets of positions.

A confident PDhk rule means: a high fraction of the Documents, d ∈ D, having Term=h in every Pos p ∈ A also have Term=k in every Pos p ∈ C. High payoff?

Figure: 2-hop chain of the cards PT(D=h) and PT(D=k), sharing the Term axis (T, 1..9), with Position axes (P, 1..7); A and C are sets of positions.

A confident PThk rule means: a high fraction of the Terms, t ∈ T, in Doc=h that occur at every Pos p ∈ A also occur at every Pos p ∈ C in Doc=k.


More on forming multi-hop relationships from RoloDex cards.

A⇒B is confident if a high fraction of the f ∈ F that are related to every a ∈ A are also related to every b ∈ B. F is the focus entity and the high fraction is the MinimumConfidence ratio.

Figure: 2-hop chain of the cards Buys(Day=1) and Buys(Day=2), sharing the Customer axis (C, 1..9), with Item axes (I, 1..3); A and B are sets of items.

A confident Buy12 rule means: of the customers who buy all of A on Day=1, most will buy all of B on Day=2.

Consider the Market Basket RoloDex (different Cust-Item card for each day)

Figure: the Buys(Day=k) card, Cust (1..9) by Item (1..3).

“Buys” pathways?

Figure: chain of the cards Buys(Day=1), Buys(Day=2) and Buys(Day=3) over Items (I, 1..3) and Customers (C, 1..9); A, B and D are sets of items.

A confident Buy123 pathway means: of the customers who buy all of A on Day=1, most will buy all of B on Day=2, and most of those customers will buy all of D on Day=3.

Figure: chain of the cards Buys(Day=1), Buys(Day=2), Buys(Day=3) and Buys(Day=4) over Items (I, 1..3) and Customers (C, 1..9); A, B, D and E are sets of items.

A confident Buy1234 pathway means: of the customers who buy all of A on Day=1, most will buy all of B on Day=2, then most of those customers will buy all of D on Day=3, and most of those customers will buy all of E on Day=4.
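A "Buys" pathway can be checked one day at a time, keeping only the customers who have followed the pathway so far. The following is a minimal sketch under assumed names and toy data (each day's card is represented as a dictionary from customer to the set of items bought that day):

```python
def buyers_of_all(card, items):
    """Customers whose purchases on this card include every item in `items`."""
    wanted = set(items)
    return {cust for cust, bought in card.items() if wanted <= bought}


def pathway_confidences(cards, itemsets):
    """Step-by-step confidences along Day=1, Day=2, ...:
    the fraction of customers still on the pathway after each further day."""
    current = buyers_of_all(cards[0], itemsets[0])
    confidences = []
    for card, items in zip(cards[1:], itemsets[1:]):
        following = current & buyers_of_all(card, items)
        confidences.append(len(following) / len(current) if current else 0.0)
        current = following
    return confidences


# Hypothetical three-day Market Basket RoloDex for customers 1..4.
day1 = {1: {"a", "b"}, 2: {"a", "b", "c"}, 3: {"a"}, 4: {"a", "b"}}
day2 = {1: {"x"}, 2: {"x", "y"}, 3: {"x"}, 4: set()}
day3 = {1: {"z"}, 2: set(), 3: {"z"}, 4: {"z"}}
confs = pathway_confidences([day1, day2, day3], [{"a", "b"}, {"x"}, {"z"}])
print(confs)
```

A pathway would be called confident when every entry of the returned list meets the minimum confidence threshold.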

A⇒C is confident if a high fraction of the f ∈ F related to every a ∈ A are also related to every c ∈ C.

Consider the Protein-Protein Interaction RoloDex (a different Gene-Gene card for each interaction involved in some pathway).

Figure: the Interaction=k card, Gene (1..9) by Gene (1..3).

What is a biological pathway?

A biological pathway is a series of actions among molecules in a cell that leads to a certain product or a change in the cell. Such a pathway can trigger the assembly of new molecules, such as a fat or protein. Pathways can also turn genes on and off, or spur a cell to move.

How do biological pathways work?

For your body to develop properly and stay healthy, many things must work together at many different levels - from organs to cells to genes.

From both inside and outside the body, cells are constantly receiving chemical cues prompted by such things as injury, infection, stress or even the presence or lack of food. To react and adjust to these cues, cells send and receive signals through biological pathways. The molecules that make up biological pathways interact with signals, as well as with each other, to carry out their designated tasks.

Biological pathways can act over short or long distances. For example, some cells send signals to nearby cells to repair localized damage, such as a scratch on a knee. Other cells produce substances, such as hormones, that travel through the blood to distant target cells.

These biological pathways control a person's response to the world. For example, some pathways subtly affect how the body processes drugs, while others play a major role in how a fertilized egg develops into a baby. Other pathways maintain balance while a person is walking, control how and when the pupil in the eye opens or closes in response to light, and affect the skin's reaction to changing temperature.

Biological pathways do not always work properly. When something goes wrong in a pathway, the result can be a disease such as cancer or diabetes.

What are some types of biological pathways?

There are many types of biological pathways. Among the most well-known are pathways involved in metabolism, in the regulation of genes and in the transmission of signals.

Metabolic pathways make possible the chemical reactions that occur in our bodies. An example of a metabolic pathway is the process by which cells break down food into energy molecules that can be stored for later use. Other metabolic pathways actually help to build molecules.

Gene-regulation pathways turn genes on and off. Such action is vital because genes provide the recipe by which cells produce proteins, which are the key components needed to carry out nearly every task in our bodies. Proteins make up our muscles and organs, help our bodies move and defend us against germs.

Signal transduction pathways move a signal from a cell's exterior to its interior. Different cells are able to receive specific signals through structures on their surface called receptors. After interacting with these receptors, the signal travels into the cell, where its message is transmitted by specialized proteins that trigger a specific reaction in the cell. For example, a chemical signal from outside the cell might direct the cell to produce a particular protein inside the cell. In turn, that protein may be a signal that prompts the cell to move.

What is a biological network?

Researchers are learning that biological pathways are far more complicated than once thought. Most pathways do not start at point A and end at point B. In fact, many pathways have no real boundaries, and pathways often work together to accomplish tasks. When multiple biological pathways interact with each other, they form a biological network.

How do researchers find biological pathways?

Researchers have discovered many important biological pathways through laboratory studies of cultured cells, bacteria, fruit flies, mice and other organisms. Many of the pathways identified in these model systems are the same as, or are similar to, counterparts in humans.

Still, many biological pathways remain to be discovered. It will take years of research to identify and understand the complex connections among all the molecules in all biological pathways, as well as to understand how these pathways work together.

Figure: the RoloDex model. Entities: Customer, Item, Gene, Doc, Exp, Author, term, People, PI, movie, Course. Cards: cust-item card, author-doc card, term-doc card, doc-doc card, exp-gene card, gene-gene card (ppi), exp-PI card, customer rates movie card, customer rates movie as 5 card, Enrollments card (Course, people), term-term card (share stem?).

Figure: DataCube Model for 3 entities: items, people and terms.

Relational Model:

Items: i1 i2 i3 i4 i5
|0 001|0 |0 11|
|1 001|0 |1 01|
|2 010|1 |0 10|

People: p1 p2 p3 p4
|0 100|A|M|
|1 001|T|M|
|2 010|S|F|
|3 011|B|F|
|4 100|C|M|

Terms: t1 t2 t3 t4 t5 t6
|1 010|1 101|2 11|
|2 001|0 000|3 11|
|3 011|1 001|3 11|
|4 011|3 001|0 00|

Relationship: p1 i1 t1
|0 0| 1
|0 1| 1
|1 0| 1
|2 0| 2
|3 0| 2
|4 1| 2
|5 1| 2

RoloDex Model: 2 entities, many relationships.

One can form multi-hops with any of these cards. Are there any that provide an interesting setting for ARM data mining?

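3-hop

Figure: bit matrices of R(E,F), S(F,G) and T(G,H), with E and G indexed 1..4 and F and H indexed 2..5; antecedent A ⊆ E, consequent C ⊆ H.

R(E,F):
0 0 0 1
0 0 1 0
0 0 0 1
0 1 0 0

S(F,G):
1 0 0 1
0 1 1 1
1 0 0 0
1 1 0 0

T(G,H):
0 0 0 1
1 0 1 0
0 0 0 1
0 1 0 1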

Collapse T: $T_C \equiv \{g \in G \mid T(g,h)\ \forall h \in C\}$. That is just the 2-hop case with $T_C \subseteq G$ replacing C. (∀ can be replaced by ∃ or any other quantifier. The choice of quantifier should match that intended for C.) Collapse T and S: $ST_C \equiv \{f \in F \mid S(f,g)\ \forall g \in T_C\}$. Then it is the 1-hop case with $ST_C$ replacing C.
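Collapsing a relationship onto the consequent can be sketched with sets. This is an illustrative sketch only: the `collapse` helper, the dictionary-of-sets layout and the sample relationships are assumptions. The quantifier is passed in as Python's built-in `all` (for ∀) or `any` (for ∃).

```python
def collapse(relation, targets, quantifier=all):
    """Return {x | relation(x, t) holds for all (or any) t in targets}.

    `relation` maps each x to the set of values it is related to.
    """
    return {x for x, related in relation.items()
            if quantifier(t in related for t in targets)}


# Hypothetical T(G,H) and S(F,G), stored as "x -> set of related values".
T = {1: {2, 5}, 2: {2, 4}, 3: {5}, 4: {3, 5}}
S = {2: {1, 3}, 3: {2}, 4: {1, 3, 4}, 5: {1, 2}}

C = {5}
T_C = collapse(T, C)        # g related to every h in C
ST_C = collapse(S, T_C)     # f related to every g in T_C
print(T_C, ST_C)
```

After the two collapses, the 3-hop rule A⇒C can be tested as a 1-hop rule on R with ST_C as the consequent.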

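Focus on F: A⇒C is strong if

$$\frac{\mathrm{ct}\big(\&_{e\in A} R_e \;\&\; \&_{g\in \mathrm{list}(\&_{h\in C} T_h)} S_g\big)}{\mathrm{ct}\big(\&_{e\in A} R_e\big)} \ge \mathrm{mncnf} \qquad\text{and}\qquad \mathrm{ct}\big(\&_{e\in A} R_e\big) \ge \mathrm{mnsup}.$$

Example: ct(1001 & &_{g=1,3,4} S_g) / ct(1001) = ct(1001 & 1001 & 1000 & 1100) / 2 = ct(1000) / 2 = 1/2.

Focus on G: A⇒C is strong if

$$\frac{\mathrm{ct}\big(\&_{f\in \mathrm{list}(\&_{e\in A} R_e)} S_f \;\&\; \&_{h\in C} T_h\big)}{\mathrm{ct}\big(\&_{f\in \mathrm{list}(\&_{e\in A} R_e)} S_f\big)} \ge \mathrm{mncnf} \qquad\text{and}\qquad \mathrm{ct}\big(\&_{f\in \mathrm{list}(\&_{e\in A} R_e)} S_f\big) \ge \mathrm{mnsup}.$$

Example: ct(&_{f=2,5} S_f & 1101) / ct(&_{f=2,5} S_f) = ct(1101 & 0011 & 1101) / ct(1101 & 0011) = ct(0001) / ct(0001) = 1/1 = 1.

Are they different? Yes, because the confidences can be different numbers.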
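The two worked confidences (1/2 when focused on F, 1 when focused on G) can be re-checked with integer bit operations. The bit patterns below are the ones quoted in the example; the variable names and the `ct` helper are assumptions for this sketch.

```python
def ct(bits):
    """Count the 1-bits of a non-negative integer bit pattern."""
    return bin(bits).count("1")


# Focus on F: &_{e in A} R_e = 1001, and S_g for g = 1, 3, 4.
r_a = 0b1001
s_g = [0b1001, 0b1000, 0b1100]
numerator = r_a
for vector in s_g:
    numerator &= vector
conf_f = ct(numerator) / ct(r_a)

# Focus on G: S_f for f = 2, 5, and &_{h in C} T_h = 1101.
s_2, s_5, t_c = 0b1101, 0b0011, 0b1101
denominator = s_2 & s_5
conf_g = ct(denominator & t_c) / ct(denominator)

print(conf_f, conf_g)
```

The same antecedent and consequent give different confidences under the two focuses, which is the point made in the text.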

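Focus on F

antecedent downward closure: A infrequent implies all supersets infrequent. A is 1 hop from F (down).

consequent upward closure: A⇒C non-confident implies A⇒D non-confident for D ⊆ C. C is 2 hops from F (up).

Focus on G

antecedent upward closure: A infrequent implies all subsets infrequent. A is 2 hops from G (up).

consequent downward closure: A⇒C non-confident implies A⇒D non-confident for D ⊇ C. C is 1 hop from G (down).

Focus on E: A⇒C is strong if

$$\frac{\mathrm{ct}\big(P_A \;\&\; \&_{f\in \mathrm{list}(\&_{g\in \mathrm{list}(\&_{h\in C} T_h)} S_g)} R_f\big)}{\mathrm{ct}(P_A)} \ge \mathrm{mncnf} \qquad\text{and}\qquad \mathrm{ct}(P_A) \ge \mathrm{mnsup}.$$

antecedent upward closure: A infrequent implies all subsets infrequent. A is 0 hops from E (up).

consequent downward closure: A⇒C non-confident implies A⇒D non-confident for D ⊇ C. C is 3 hops from E (down).

Focus on H: A⇒C is strong if

$$\frac{\mathrm{ct}\big(\&_{g\in \mathrm{list}(\&_{f\in \mathrm{list}(\&_{e\in A} R_e)} S_f)} T_g \;\&\; P_C\big)}{\mathrm{ct}\big(\&_{g\in \mathrm{list}(\&_{f\in \mathrm{list}(\&_{e\in A} R_e)} S_f)} T_g\big)} \ge \mathrm{mncnf} \qquad\text{and}\qquad \mathrm{ct}\big(\&_{g\in \mathrm{list}(\&_{f\in \mathrm{list}(\&_{e\in A} R_e)} S_f)} T_g\big) \ge \mathrm{mnsp}.$$

antecedent downward closure: A infrequent implies all supersets infrequent. A is 3 hops from H (down).

consequent upward closure: A⇒C non-confident implies A⇒D non-confident for D ⊆ C. C is 0 hops from H (up).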

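4-hop

Figure: bit matrices of R(E,F), S(F,G), T(G,H) and U(H,I), with E, G and I indexed 1..4 and F and H indexed 2..5; antecedent A ⊆ E, consequent C ⊆ I.

R(E,F):
0 0 0 1
0 0 1 0
0 0 0 1
0 1 0 0

S(F,G):
1 0 0 1
0 1 1 1
1 0 0 0
1 1 0 0

T(G,H):
0 0 0 1
1 0 1 0
0 0 0 1
0 1 0 1

U(H,I):
1 0 0 1
0 1 0 1
1 0 0 0
1 1 0 0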

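Focus on G? Replace C by $U_C$ and A by $R_A$ as above (not different from 2-hop?).

Focus on H ($R_A$ for A, use 3-hop) or focus on F ($U_C$ for C, use 3-hop).

Another focus on G (the main way): A⇒C is strong if

$$\frac{\mathrm{ct}\big(\&_{f\in \mathrm{list}(\&_{e\in A} R_e)} S_f \;\&\; \&_{h\in \mathrm{list}(\&_{i\in C} U_i)} T_h\big)}{\mathrm{ct}\big(\&_{f\in \mathrm{list}(\&_{e\in A} R_e)} S_f\big)} \ge \mathrm{mncnf} \qquad\text{and}\qquad \mathrm{ct}\big(\&_{f\in \mathrm{list}(\&_{e\in A} R_e)} S_f\big) \ge \mathrm{mnsup}.$$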

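F = G = H = genes and S, T = gene-gene interactions. More than 3: S1, ..., Sn?

$$\frac{\mathrm{ct}\big(S_1(\&_{e\in A} R_e,\ \&_{i\in C} U_i)\big) + \mathrm{ct}\big(S_2(\&_{e\in A} R_e,\ \&_{i\in C} U_i)\big) + \dots + \mathrm{ct}\big(S_n(\&_{e\in A} R_e,\ \&_{i\in C} U_i)\big)}{\mathrm{ct}\big(\&_{e\in A} R_e\big)\cdot n \cdot \mathrm{ct}\big(\&_{i\in C} U_i\big)} \ge \mathrm{mncnf}$$

If the S cube can be implemented so that counts of the 3-rectangle (shown in blue) can be made directly, calculation of confidence would be fast.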

Figure: R(E,G), a cube of gene-gene cards S1(G,G), ..., Sn(G,G), and U(G,I); antecedent A ⊆ E, consequent C ⊆ I.

R(E,G):
0 0 1 1
0 0 1 1
0 0 0 1
0 1 0 0

Each card Sk(G,G):
1 0 0 1
0 1 1 1
1 0 0 0
1 1 0 0

U(G,I):
1 0 1 1
0 1 1 1
1 0 0 0
1 1 0 0

4-hop APRIORI, focus on G: A⇒C is strong if

$$\frac{\mathrm{ct}\big(\&_{f\in \mathrm{list}(\&_{e\in A} R_e)} S_f \;\&\; \&_{h\in \mathrm{list}(\&_{i\in C} U_i)} T_h\big)}{\mathrm{ct}\big(\&_{f\in \mathrm{list}(\&_{e\in A} R_e)} S_f\big)} \ge \mathrm{mncnf} \qquad\text{and}\qquad \mathrm{ct}\big(\&_{f\in \mathrm{list}(\&_{e\in A} R_e)} S_f\big) \ge \mathrm{mnsup}.$$

1. (antecedent upward closure) If A is infrequent, then so are all of its subsets (the "list" will be larger, so the AND over the list will produce fewer ones). Frequency involves only A, so mine all qualifying antecedents using upward closure.

2. (consequent upward closure) If A⇒C is non-confident, then so is A⇒D for all subsets, D, of C (the "list" will be larger, so the AND over the list will produce fewer ones). So for each frequent antecedent, A, use upward closure to mine out all confident consequents, C.

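5-hop

Figure: bit matrices of R(E,F), S(F,G), T(G,H), U(H,I) and V(I,J), with E, G and I indexed 1..4 and F, H and J indexed 2..5; antecedent A ⊆ E, consequent C ⊆ J.

R(E,F):
0 0 0 1
0 0 1 0
0 0 0 1
0 1 0 0

S(F,G):
1 0 0 1
0 1 1 1
1 0 0 0
1 1 0 0

T(G,H):
0 0 0 1
1 0 1 0
0 0 0 1
0 1 0 1

U(H,I):
1 0 0 1
0 1 0 1
1 0 0 0
1 1 0 0

V(I,J):
0 0 0 1
1 0 1 0
0 0 0 1
0 1 0 1

5-hop APRIORI, focus on G: A⇒C is strong if

$$\frac{\mathrm{ct}\big(\&_{f\in \mathrm{list}(\&_{e\in A} R_e)} S_f \;\&\; \&_{h\in \mathrm{list}(\&_{i\in \mathrm{list}(\&_{j\in C} V_j)} U_i)} T_h\big)}{\mathrm{ct}\big(\&_{f\in \mathrm{list}(\&_{e\in A} R_e)} S_f\big)} \ge \mathrm{mncnf} \qquad\text{and}\qquad \mathrm{ct}\big(\&_{f\in \mathrm{list}(\&_{e\in A} R_e)} S_f\big) \ge \mathrm{mnsup}.$$

1. (antecedent upward closure) If A is infrequent, then so are all of its subsets (the "list" will be larger, so the AND over the list will produce fewer ones). Frequency involves only A, so mine all qualifying antecedents using upward closure.

2. (consequent downward closure) If A⇒C is non-confident, then so is A⇒D for all supersets, D, of C. So for each frequent antecedent, A, use downward closure to mine out all confident consequents, C.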

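6-hop

Figure: bit matrices of Q(D,E), R(E,F), S(F,G), T(G,H), U(H,I) and V(I,J), with E, G and I indexed 1..4 and D, F, H and J indexed 2..5; the antecedent is taken in D and the consequent C in J.

Q(D,E):
1 1 0 1
0 0 0 1
1 1 0 1
1 1 0 0

R(E,F):
0 0 0 1
0 0 1 0
0 0 0 1
0 1 0 0

S(F,G):
1 0 0 1
0 1 1 1
1 0 0 0
1 1 0 0

T(G,H):
0 0 0 1
1 0 1 0
0 0 0 1
0 1 0 1

U(H,I):
1 0 0 1
0 1 0 1
1 0 0 0
1 1 0 0

V(I,J):
0 0 0 1
1 0 1 0
0 0 0 1
0 1 0 1

6-hop APRIORI, focus on G: the rule is strong if

$$\frac{\mathrm{ct}\big(\&_{f\in \mathrm{list}(\&_{e\in \mathrm{list}(\&_{d\in D} Q_d)} R_e)} S_f \;\&\; \&_{h\in \mathrm{list}(\&_{i\in \mathrm{list}(\&_{j\in C} V_j)} U_i)} T_h\big)}{\mathrm{ct}\big(\&_{f\in \mathrm{list}(\&_{e\in \mathrm{list}(\&_{d\in D} Q_d)} R_e)} S_f\big)} \ge \mathrm{mncnf} \qquad\text{and}\qquad \mathrm{ct}\big(\&_{f\in \mathrm{list}(\&_{e\in \mathrm{list}(\&_{d\in D} Q_d)} R_e)} S_f\big) \ge \mathrm{mnsup}.$$

1. (antecedent downward closure) If A is infrequent, then so are all of its supersets. Frequency involves only A, so mine all qualifying antecedents using downward closure.

2. (consequent downward closure) If A⇒C is non-confident, then so is A⇒D for all supersets, D, of C. So for each frequent antecedent, A, use downward closure to mine out all confident consequents, C.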

The conclusion we have demonstrated (but not proven) is: for (a+c)-hop transitive Apriori ARM whose focus is the entity that is a hops from the antecedent and c hops from the consequent, if a (respectively c) is odd, use downward closure on that step of mining strong (frequent and confident) rules; if it is even, use upward closure.
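The odd/even rule is easy to tabulate for every focus placement of a k-hop chain. A small sketch (the function names are assumptions; the rule itself is the demonstrated-but-unproven conjecture stated above):

```python
def closure_direction(hops):
    """Odd hop count -> downward closure, even hop count -> upward closure."""
    return "downward" if hops % 2 == 1 else "upward"


def closures_for_focus(a, c):
    """Closure to use on the antecedent step and on the consequent step when
    the focus entity is a hops from the antecedent and c hops from the consequent."""
    return closure_direction(a), closure_direction(c)


def closure_table(k):
    """All focus placements of a k-hop chain, keyed by (a, c) with a + c = k."""
    return {(a, k - a): closures_for_focus(a, k - a) for a in range(k + 1)}


table_3hop = closure_table(3)
for placement, directions in table_3hop.items():
    print(placement, directions)
```

For the 3-hop chain this reproduces the four cases worked on the slides: focus on E (0,3), F (1,2), G (2,1) and H (3,0).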

Given any 1-hop labeled relationship (e.g., cells have values from {1,2,…,n}), there is:
1. a natural n-hop transitive relationship, A implies D, obtained by alternating entities for each specific label-value relationship;
2. cards for each entity consisting of the bitslices of cell values.

E.g., in Netflix, Rating(Cust,Movie) has label set {0,1,2,3,4,5}, so in 1. it generates a bona fide 6-hop transitive relationship.

In 2., an alternative is to bitmap each label value (rather than bitslicing them). Below, $R_{n-i}$ can be bitslices or bitmaps.
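The difference between bitmapping and bitslicing a labeled relationship can be sketched on a tiny rating matrix. The matrix, the function names and the nested-list layout are assumptions for illustration only.

```python
def bitmaps(matrix, labels):
    """One 0/1 card per label value: cell is 1 where the rating equals that value."""
    return {value: [[int(cell == value) for cell in row] for row in matrix]
            for value in labels}


def bitslices(matrix, nbits):
    """One 0/1 card per binary digit: card k holds bit k of every cell."""
    return {k: [[(cell >> k) & 1 for cell in row] for row in matrix]
            for k in range(nbits)}


# Hypothetical Rating(Cust,Movie): 2 customers by 3 movies, labels 0..5.
rating = [[0, 5, 3],
          [4, 0, 1]]

maps = bitmaps(rating, labels=range(6))   # 6 cards, one per label value
slices = bitslices(rating, nbits=3)       # 3 cards, since 5 < 2**3
print(maps[5], slices[2])
```

Bitmaps give one card per label (six here), while bitslices need only enough cards to cover the binary width of the largest label (three here).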

Figure: 6-hop chain of the cards R0(M,C), R1(C,M), R2(M,C), R3(C,M), R4(M,C) and R5(C,M), alternating Movies (M, 1..4) and Customers (C, 2..5); antecedent A, consequent D.

E.g., equity trading on a given day: QuantityBought(Cust,Stock) with labels {0,1,2,3,4,5} (where n means n thousand shares), so that generates a bona fide 6-hop transitive relationship.

E.g., equity trading, moved similarly: define "moved similarly" on a day, giving StockStock(#DaysMovedSimilarlyOfLast10).

E.g., equity trading, moved similarly 2: define "moved similarly" to mean that stock2 moved similarly to what stock1 did the previous day. Define the relationship StockStock(#DaysMovedSimilarlyOfLast10).

E.g., Gene-Experiment: label values could be "expression level". Intervalize and go!

Has Strong Transitive Rule Mining (STRM) been done? Are there downward and upward closure theorems for it already? Is it useful? That is, are there good examples of use: stocks, gene-experiment, MBR, Netflix predictor, ...?

Figure: the cards R0(E,F), ..., Rn-2(E,F), Rn-1(E,F) of a labeled relationship, with E indexed 1..4 and F indexed 2..5; antecedent A, consequent D.

Let Types be an entity which clusters Items (moves Items up the semantic hierarchy).

E.g., in a store, Types might include: dairy, hardware, household, canned, snacks, baking, meats, produce, bakery, automotive, electronics, toddler, boys, girls, women, men, pharmacy, garden, toys, farm.

Let A be an ItemSet wholly of one Type, TA, and let D be a TypesSet which does not include TA. Then:

A⇒D might mean: if ∃ i ∈ A s.t. BB(i,c), then ∀ t ∈ D, B(c,t).




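A⇒D frequent might mean:

$\mathrm{ct}(\&_{i\in A} BB_i) \ge \mathrm{mnsp}$

$\mathrm{ct}(\,|_{i\in A} BB_i) \ge \mathrm{mnsp}$

$\mathrm{ct}(\&_{t\in D} B_t) \ge \mathrm{mnsp}$

$\mathrm{ct}(\,|_{t\in D} B_t) \ge \mathrm{mnsp}$

$\mathrm{ct}(\&_{i\in A} BB_i \;\&\; \&_{t\in D} B_t) \ge \mathrm{mnsp}$, etc.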

Figure: 2-hop chain of the cards BoughtBy(I,C) and Buys(C,T) over Items (1..20), Customers (2..5) and Types (of Items) (1..4); A is an ItemSet and D is a TypesSet.

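A⇒D confident might mean:

$\mathrm{ct}(\&_{i\in A} BB_i \;\&\; \&_{t\in D} B_t) \,/\, \mathrm{ct}(\&_{i\in A} BB_i) \ge \mathrm{mncf}$

$\mathrm{ct}(\&_{i\in A} BB_i \;\&\; |_{t\in D} B_t) \,/\, \mathrm{ct}(\&_{i\in A} BB_i) \ge \mathrm{mncf}$

$\mathrm{ct}(\,|_{i\in A} BB_i \;\&\; |_{t\in D} B_t) \,/\, \mathrm{ct}(\,|_{i\in A} BB_i) \ge \mathrm{mncf}$

$\mathrm{ct}(\,|_{i\in A} BB_i \;\&\; \&_{t\in D} B_t) \,/\, \mathrm{ct}(\,|_{i\in A} BB_i) \ge \mathrm{mncf}$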
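The four candidate confidence definitions differ only in whether the item vectors and the type vectors are combined with AND or with OR. A minimal sketch, assuming BB[i] and B[t] are 0/1 lists over the same customers (the names and toy data are illustrative assumptions):

```python
from functools import reduce
from operator import and_, or_


def fold(vectors, op):
    """Combine equal-length 0/1 lists position by position with AND or OR."""
    return [reduce(op, column) for column in zip(*vectors)]


def confidence(BB, B, A, D, ante_op=and_, cons_op=and_):
    """ct(antecedent & consequent) / ct(antecedent) for the chosen combiners."""
    antecedent = fold([BB[i] for i in A], ante_op)
    consequent = fold([B[t] for t in D], cons_op)
    support = sum(antecedent)
    if support == 0:
        return 0.0
    return sum(a & c for a, c in zip(antecedent, consequent)) / support


# Hypothetical cards over four customers.
BB = {"milk": [1, 1, 0, 1], "eggs": [1, 0, 1, 1]}      # BoughtBy(I,C)
B = {"toys": [1, 0, 0, 1], "garden": [0, 0, 1, 1]}     # Buys(C,T)
A, D = ["milk", "eggs"], ["toys", "garden"]

results = {
    ("and", "and"): confidence(BB, B, A, D, and_, and_),
    ("and", "or"): confidence(BB, B, A, D, and_, or_),
    ("or", "or"): confidence(BB, B, A, D, or_, or_),
    ("or", "and"): confidence(BB, B, A, D, or_, and_),
}
print(results)
```

On the same data the four definitions give four different values, so the choice of quantifier has to be fixed before mining.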

Text Mining using pTrees

Figure: DTPe PpTreeSet, indexed by (T,D): one bit vector over Positions per (Term, Doc) pair, e.g., for Term = buy in Doc1, Doc2 and Doc3.

DTPe Position Table:

Pos  T1D1  T1D2  T1D3  ...  T9D1  …  T9D3
1    1     0     1     ...  0     …  0
.
.
.
7    0     …     0     ...  1     …  1

Figure: DTPe Data Cube, with axes Doc (1..3), Term (1..9) and Position (1..7), and its three RoloDexes: TD cards (one per Position, P=k, k=1..7), PD cards (one per Term, T=k, k=1..9) and PT cards (one per Doc, D=k, k=1,2,3).

DTPe Document Table:

Doc  T1P1  …  T1P7  ...  T9P1  …  T9P7
1    1     …  0     ...  0     …  0
2    0     …  0     ...  1     …  0
3    0     …  0     ...  1     …  1

Classical Document Table:

Doc  Auth  …  Date    ...  Subj1  …  Subjm
1    1        1/2/13  ...  0      …  0
2    0        2/2/15  ...  1      …  0
3    0        3/3/14  ...  1      …  1

Figure: DTPe DocTbl DpTreeSet, indexed by (T,P): for each Term (an, and, April, are, apple, always, all, AAPL, buy, ...) a bit vector over Positions 1..7.

Figure: Classical DocTbl DpTreeSet: pTrees for the columns Auth, Date, Subj1, ..., Subjm.

DTPe Term Table:

Term  P1D1  P1D2  P1D3  ...  P7D1  …  P7D3
1     1     0     1     ...  0     …  0
.
.
.
9     0     …     0     ...  1     …  1

DTPe Term Usage Table:

Term  P1D1  P1D2  P1D3  ...  P7D1  …  P7D3
1     noun  verb  adj   adv  …        noun
.
.
.
9     adj   noun  noun  adj           noun

Figure: DTPe TpTreeSet, indexed by (D,P): for each (Doc, Position) pair in Doc1, Doc2 and Doc3, a bit vector over Terms. The Term Usage Table gives one pTree per usage value, e.g., P1D1 noun and P1D1 adj.

tf is the +rollup of the DTPe datacube along the position dimension. One can use any measurement or data structure of measurements, e.g., DT tfidf, in which each cell has a decimal tfidf, which can be bitsliced directly into whole-number bitslices plus fractional bitslices (one for each binary digit to the right of the binary point; no need to shift!) using MOD(INT(x/2^k), 2). E.g., a tfidf = 3.5 is

k:   3 2 1 0 -1 -2
bit: 0 0 1 1  1  0
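The MOD(INT(x/2^k), 2) formula can be checked directly, including for negative k (the fractional bitslices). A minimal sketch, assuming non-negative values of x; the function names are illustrative.

```python
import math


def bit(x, k):
    """Bit k of a non-negative decimal x: MOD(INT(x / 2**k), 2).

    Negative k gives the binary digits to the right of the binary point.
    """
    return int(math.floor(x / 2 ** k)) % 2


def bitslice_value(x, ks):
    """Bits of x at each position in ks, in the given order."""
    return [bit(x, k) for k in ks]


positions = [3, 2, 1, 0, -1, -2]
bits_3_5 = bitslice_value(3.5, positions)     # 3.5 is 11.10 in binary
bits_0_75 = bitslice_value(0.75, positions)   # 0.75 is 0.11 in binary
print(bits_3_5, bits_0_75)
```

Applying `bit` to every cell of a tfidf column for a fixed k yields the bitslice for that k, so no scaling or shifting of the decimal values is needed.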

Figure: DTtf (DocTerm termfreq) Data Cube, with axes Docs (1..3) and Terms.

DT tfidf Doc Table:

Doc  T1   T2  ...  T9
1    .75  0   ...  1
2    0    1   ...  .25
3    0    0   ...  0

Figure: DT tfidf DpTreeSet, with bitslices T1k1, T1k0, T1k-1, T1k-2, ...

Rating of T=stock at doc date close: 1=sell, 2=hold, 3=buy, 0=non-stock Term.

Figure: DT SR (DocTerm StockRating) Cube, with axes Docs (1..3) and Terms.

Figure: DT SR bitslice DpTreeSet (e.g., T2k2, T2k1) and DT SR bitmap DpTreeSet (e.g., T2,R=buy; T2,R=hold; T2,R=sell).