Download - GET THAT PROTEIN!
![Page 1: GET THAT PROTEIN!](https://reader031.vdocuments.us/reader031/viewer/2022032805/56813323550346895d99ffb0/html5/thumbnails/1.jpg)
1
Patrick Lambrix
Department of Computer and Information Science
Linköpings universitet
Information retrieval
GET THAT PROTEIN!
![Page 2: GET THAT PROTEIN!](https://reader031.vdocuments.us/reader031/viewer/2022032805/56813323550346895d99ffb0/html5/thumbnails/2.jpg)
2
Electronic Data Sources
• Data in electronic form• Used in every day life and research
![Page 3: GET THAT PROTEIN!](https://reader031.vdocuments.us/reader031/viewer/2022032805/56813323550346895d99ffb0/html5/thumbnails/3.jpg)
3
Data sourcesScientificresults Model
Queries Answers
Data source
PhysicalData source
Data sourceManagementSystem
Query/update processing
Access to stored data
![Page 4: GET THAT PROTEIN!](https://reader031.vdocuments.us/reader031/viewer/2022032805/56813323550346895d99ffb0/html5/thumbnails/4.jpg)
4
Storing and accessing textual information
• What information is stored?• How is the information stored?
- high level• How is the information retrieved?
![Page 5: GET THAT PROTEIN!](https://reader031.vdocuments.us/reader031/viewer/2022032805/56813323550346895d99ffb0/html5/thumbnails/5.jpg)
5
What information is stored?
• Model the information
- Entity-Relationship model (ER)
- Unified Modeling Language (UML)
![Page 6: GET THAT PROTEIN!](https://reader031.vdocuments.us/reader031/viewer/2022032805/56813323550346895d99ffb0/html5/thumbnails/6.jpg)
6
What information is stored? - ER
• entities and attributes• entity types• key attributes• relationships• cardinality constraints
• EER: sub-types
![Page 7: GET THAT PROTEIN!](https://reader031.vdocuments.us/reader031/viewer/2022032805/56813323550346895d99ffb0/html5/thumbnails/7.jpg)
7
1 tgctacccgc gcccgggctt ctggggtgtt ccccaaccac ggcccagccc tgccacaccc 61 cccgcccccg gcctccgcag ctcggcatgg gcgcgggggt gctcgtcctg ggcgcctccg 121 agcccggtaa cctgtcgtcg gccgcaccgc tccccgacgg cgcggccacc gcggcgcggc 181 tgctggtgcc cgcgtcgccg cccgcctcgt tgctgcctcc cgccagcgaa agccccgagc 241 cgctgtctca gcagtggaca gcgggcatgg gtctgctgat ggcgctcatc gtgctgctca 301 tcgtggcggg caatgtgctg gtgatcgtgg ccatcgccaa gacgccgcgg ctgcagacgc 361 tcaccaacct cttcatcatg tccctggcca gcgccgacct ggtcatgggg ctgctggtgg 421 tgccgttcgg ggccaccatc gtggtgtggg gccgctggga gtacggctcc ttcttctgcg 481 agctgtggac ctcagtggac gtgctgtgcg tgacggccag catcgagacc ctgtgtgtca 541 ttgccctgga ccgctacctc gccatcacct cgcccttccg ctaccagagc ctgctgacgc 601 gcgcgcgggc gcggggcctc gtgtgcaccg tgtgggccat ctcggccctg gtgtccttcc 661 tgcccatcct catgcactgg tggcgggcgg agagcgacga ggcgcgccgc tgctacaacg 721 accccaagtg ctgcgacttc gtcaccaacc gggcctacgc catcgcctcg tccgtagtct 781 ccttctacgt gcccctgtgc atcatggcct tcgtgtacct gcgggtgttc cgcgaggccc 841 agaagcaggt gaagaagatc gacagctgcg agcgccgttt cctcggcggc ccagcgcggc 901 cgccctcgcc ctcgccctcg cccgtccccg cgcccgcgcc gccgcccgga cccccgcgcc 961 ccgccgccgc cgccgccacc gccccgctgg ccaacgggcg tgcgggtaag cggcggccct 1021 cgcgcctcgt ggccctacgc gagcagaagg cgctcaagac gctgggcatc atcatgggcg 1081 tcttcacgct ctgctggctg cccttcttcc tggccaacgt ggtgaaggcc ttccaccgcg 1141 agctggtgcc cgaccgcctc ttcgtcttct tcaactggct gggctacgcc aactcggcct 1201 tcaaccccat catctactgc cgcagccccg acttccgcaa ggccttccag ggactgctct 1261 gctgcgcgcg cagggctgcc cgccggcgcc acgcgaccca cggagaccgg ccgcgcgcct 1321 cgggctgtct ggcccggccc ggacccccgc catcgcccgg ggccgcctcg gacgacgacg 1381 acgacgatgt cgtcggggcc acgccgcccg cgcgcctgct ggagccctgg gccggctgca 1441 acggcggggc ggcggcggac agcgactcga gcctggacga gccgtgccgc cccggcttcg 1501 cctcggaatc caaggtgtag ggcccggcgc ggggcgcgga ctccgggcac ggcttcccag 1561 gggaacgagg agatctgtgt ttacttaaga ccgatagcag gtgaactcga agcccacaat 1621 cctcgtctga atcatccgag gcaaagagaa aagccacgga ccgttgcaca aaaaggaaag 1681 tttgggaagg gatgggagag tggcttgctg atgttccttg ttg
![Page 8: GET THAT PROTEIN!](https://reader031.vdocuments.us/reader031/viewer/2022032805/56813323550346895d99ffb0/html5/thumbnails/8.jpg)
8
DEFINITION Homo sapiens adrenergic, beta-1-, receptorACCESSION NM_000684SOURCE ORGANISM humanREFERENCE 1
AUTHORS Frielle, Collins, Daniel, Caron, Lefkowitz,Kobilka
TITLE Cloning of the cDNA for the human beta 1-adrenergic receptorREFERENCE 2
AUTHORS Frielle, Kobilka, Lefkowitz, CaronTITLE Human beta 1- and beta 2-adrenergic
receptors: structurally and functionally related receptors derived from distinctgenes
![Page 9: GET THAT PROTEIN!](https://reader031.vdocuments.us/reader031/viewer/2022032805/56813323550346895d99ffb0/html5/thumbnails/9.jpg)
9
Reference
protein-id
accession definition
source
article-id
title
author
PROTEIN
ARTICLE
m
n
Entity-relationship
![Page 10: GET THAT PROTEIN!](https://reader031.vdocuments.us/reader031/viewer/2022032805/56813323550346895d99ffb0/html5/thumbnails/10.jpg)
10
Storing and accessing textual information
• What information is stored?• How is the information stored?
- high level• How is the information retrieved?
![Page 11: GET THAT PROTEIN!](https://reader031.vdocuments.us/reader031/viewer/2022032805/56813323550346895d99ffb0/html5/thumbnails/11.jpg)
11
Storing textual information
• Text (IR)• Semi-structured data• Data models (DB)• Rules + Facts (KB)
structure precision
![Page 12: GET THAT PROTEIN!](https://reader031.vdocuments.us/reader031/viewer/2022032805/56813323550346895d99ffb0/html5/thumbnails/12.jpg)
12
Storing textual information -Text - Information Retrieval
• search using words• conceptual models:
boolean, vector, probabilistic, …• file model:
flat file, inverted file, ...
![Page 13: GET THAT PROTEIN!](https://reader031.vdocuments.us/reader031/viewer/2022032805/56813323550346895d99ffb0/html5/thumbnails/13.jpg)
13
WORD HITS LINK DOC# LINK DOCUMENTS
receptor
cloning
adrenergic 32
1
5
2
5
1
22
53
…
…
…
…
…
…
…
…
…
…
…
…
… …
… …
… …
Doc1
Doc2
…
Inverted file Postings file Document file
IR - File model: inverted files
![Page 14: GET THAT PROTEIN!](https://reader031.vdocuments.us/reader031/viewer/2022032805/56813323550346895d99ffb0/html5/thumbnails/14.jpg)
14
IR – File model: inverted files
• Controlled vocabulary• Stop list• Stemming
![Page 15: GET THAT PROTEIN!](https://reader031.vdocuments.us/reader031/viewer/2022032805/56813323550346895d99ffb0/html5/thumbnails/15.jpg)
15
IR - formal characterization
Information retrieval model: (D,Q,F,R)• D is a set of document representations• Q is a set of queries• F is a framework for modeling document
representations, queries and their relationships
• R associates a real number to document-query-pairs (ranking)
![Page 16: GET THAT PROTEIN!](https://reader031.vdocuments.us/reader031/viewer/2022032805/56813323550346895d99ffb0/html5/thumbnails/16.jpg)
16
IR - conceptual models
Classic information retrieval
• Boolean model• Vector model• Probabilistic model
![Page 17: GET THAT PROTEIN!](https://reader031.vdocuments.us/reader031/viewer/2022032805/56813323550346895d99ffb0/html5/thumbnails/17.jpg)
17
Boolean model
cloningadrenergic receptor
Doc1
Doc2
(1 1 0)
(0 1 0)
yes
no
yes
yes
-->
-->
no
no
Document representation
![Page 18: GET THAT PROTEIN!](https://reader031.vdocuments.us/reader031/viewer/2022032805/56813323550346895d99ffb0/html5/thumbnails/18.jpg)
18
Boolean model
Q1: cloning and (adrenergic or receptor)
queries : boolean (and, or, not)
Queries are translated to disjunctive normal form (DNF)
DNF: disjunction of conjunctions of terms with or without ‘not’Rules: not not A --> A not(A and B) --> not A or not B not(A or B) --> not A and not B (A or B) and C --> (A and C) or (B and C) A and (B or C) --> (A and B) or (A and C) (A and B) or C --> (A or C) and (B or C) A or (B and C) --> (A or B) and (A or C)
![Page 19: GET THAT PROTEIN!](https://reader031.vdocuments.us/reader031/viewer/2022032805/56813323550346895d99ffb0/html5/thumbnails/19.jpg)
19
Boolean modelQ1: cloning and (adrenergic or receptor)
--> (cloning and adrenergic) or (cloning and receptor)
(cloning and adrenergic) or (cloning and receptor)--> (cloning and adrenergic and receptor) or (cloning and adrenergic and not receptor) or (cloning and receptor and adrenergic) or (cloning and receptor and not adrenergic)--> (1 1 1) or (1 1 0) or (1 1 1) or (0 1 1)--> (1 1 1) or (1 1 0) or (0 1 1)
DNF is completed + translated to same representation as documents
![Page 20: GET THAT PROTEIN!](https://reader031.vdocuments.us/reader031/viewer/2022032805/56813323550346895d99ffb0/html5/thumbnails/20.jpg)
20
Boolean model
cloningadrenergic receptor
Doc1
Doc2
(1 1 0)
(0 1 0)
yes
no
yes
yes
-->
-->
Q1: cloning and (adrenergic or receptor)
--> (1 1 0) or (1 1 1) or (0 1 1) Result: Doc1
Q2: cloning and not adrenergic
--> (0 1 0) or (0 1 1) Result: Doc2
no
no
![Page 21: GET THAT PROTEIN!](https://reader031.vdocuments.us/reader031/viewer/2022032805/56813323550346895d99ffb0/html5/thumbnails/21.jpg)
21
Boolean model
Advantages• based on intuitive and simple formal model
(set theory and boolean algebra)
Disadvantages• binary decisions
- words are relevant or not
- document is relevant or not, no notion of partial match
![Page 22: GET THAT PROTEIN!](https://reader031.vdocuments.us/reader031/viewer/2022032805/56813323550346895d99ffb0/html5/thumbnails/22.jpg)
22
Boolean model
cloningadrenergic receptor
Doc1
Doc2
(1 1 0)
(0 1 0)
yes
no
yes
yes
-->
-->
Q3: adrenergic and receptor
--> (1 0 1) or (1 1 1) Result: empty
no
no
![Page 23: GET THAT PROTEIN!](https://reader031.vdocuments.us/reader031/viewer/2022032805/56813323550346895d99ffb0/html5/thumbnails/23.jpg)
23
Vector model (simplified)
Doc1 (1,1,0)Doc2 (0,1,0)cloning
receptor
adrenergic
Q (1,1,1)
sim(d,q) = d . q |d| x |q|
![Page 24: GET THAT PROTEIN!](https://reader031.vdocuments.us/reader031/viewer/2022032805/56813323550346895d99ffb0/html5/thumbnails/24.jpg)
24
Vector model
• Introduce weights in document vectors
(e.g. Doc3 (0, 0.5, 0))• Weights represent importance of the term
for describing the document contents• Weights are positive real numbers• Term does not occur -> weight = 0
![Page 25: GET THAT PROTEIN!](https://reader031.vdocuments.us/reader031/viewer/2022032805/56813323550346895d99ffb0/html5/thumbnails/25.jpg)
25
Vector model
Doc1 (1,1,0)Doc3 (0,0.5,0)cloning
receptor
adrenergic
Q4 (0.5,0.5,0.5)
sim(d,q) = d . q |d| x |q|
![Page 26: GET THAT PROTEIN!](https://reader031.vdocuments.us/reader031/viewer/2022032805/56813323550346895d99ffb0/html5/thumbnails/26.jpg)
26
Vector model
• How to define weights? tf-idf
dj (w1,j, …, wt,j)
wi,j = weight for term ki in document dj
= fi,j x idfi
![Page 27: GET THAT PROTEIN!](https://reader031.vdocuments.us/reader031/viewer/2022032805/56813323550346895d99ffb0/html5/thumbnails/27.jpg)
27
Vector model
• How to define weights? tf-idf
term frequency freqi,j: how often does term ki occur in document dj?
normalized term frequency:
fi,j = freqi,j / maxl freql,j
![Page 28: GET THAT PROTEIN!](https://reader031.vdocuments.us/reader031/viewer/2022032805/56813323550346895d99ffb0/html5/thumbnails/28.jpg)
28
Vector model
• How to define weights? tf-idf
document frequency : in how many documents does term ki occur?
N = total number of documents
ni = number of documents in which ki occurs
inverse document frequency idfi: log (N / ni)
![Page 29: GET THAT PROTEIN!](https://reader031.vdocuments.us/reader031/viewer/2022032805/56813323550346895d99ffb0/html5/thumbnails/29.jpg)
29
Vector model
• How to define weights for query?
recommendation:
q= (w1,q, …, wt,j)
wi,q = weight for term ki in q
= (0.5 + 0.5 fi,q) x idfi
![Page 30: GET THAT PROTEIN!](https://reader031.vdocuments.us/reader031/viewer/2022032805/56813323550346895d99ffb0/html5/thumbnails/30.jpg)
30
Vector model
• Advantages
- term weighting improves retrieval performance
- partial matching
- ranking according to similarity
Disadvantage
- assumption of mutually independent terms?
![Page 31: GET THAT PROTEIN!](https://reader031.vdocuments.us/reader031/viewer/2022032805/56813323550346895d99ffb0/html5/thumbnails/31.jpg)
31
Probabilistic model
weights are binary (wi,j = 0 or wi,j = 1)
R: the set of relevant documents for query q
Rc: the set of non-relevant documents for q
P(R|dj): probability that dj is relevant to q
P(Rc|dj): probability that dj is not relevant to q
sim(dj,q) = P(R|dj) / P(Rc|dj)
![Page 32: GET THAT PROTEIN!](https://reader031.vdocuments.us/reader031/viewer/2022032805/56813323550346895d99ffb0/html5/thumbnails/32.jpg)
32
Probabilistic model
sim(dj,q) = P(R|dj) / P(Rc|dj)
(Bayes’ rule, independence of index terms, take logarithms, P(ki|R) + P(not ki|R) = 1)
--> SIM(dj,q) ==
SUM wi,q x wi,j x
(log(P(ki|R) / (1- P(ki|R))) +
log(P(ki|Rc) / (1- P(ki|Rc))))
ti=1
![Page 33: GET THAT PROTEIN!](https://reader031.vdocuments.us/reader031/viewer/2022032805/56813323550346895d99ffb0/html5/thumbnails/33.jpg)
33
Probabilistic model
• How to compute P(ki|R) and P(ki|Rc)?
- initially: P(ki|R) = 0.5 and P(ki|Rc) = ni/N
- Repeat: retrieve documents and rank them
V: subset of documents (e.g. r best ranked)
Vi: subset of V, elements contain ki
P(ki|R) = |Vi| / |V|
and P(ki|Rc) = (ni-|Vi|) /(N-|V|)
![Page 34: GET THAT PROTEIN!](https://reader031.vdocuments.us/reader031/viewer/2022032805/56813323550346895d99ffb0/html5/thumbnails/34.jpg)
34
Probabilistic model
• Advantages:
- ranking of documents with respect to probability of being relevant
• Disadvantages:
- initial guess about relevance
- all weights are binary
- independence assumption?
![Page 35: GET THAT PROTEIN!](https://reader031.vdocuments.us/reader031/viewer/2022032805/56813323550346895d99ffb0/html5/thumbnails/35.jpg)
35
IR - measures
Precision =
number of found relevant documents
total number of found documents
Recall =
number of found relevant documents
total number of relevant documents
![Page 36: GET THAT PROTEIN!](https://reader031.vdocuments.us/reader031/viewer/2022032805/56813323550346895d99ffb0/html5/thumbnails/36.jpg)
36
IR - measures
Relevant documents |R|
Answer set |A|
Relevant documentsin the answer set |RA|
Precision = |RA| / |A|
Recall = |RA| / |R|
![Page 37: GET THAT PROTEIN!](https://reader031.vdocuments.us/reader031/viewer/2022032805/56813323550346895d99ffb0/html5/thumbnails/37.jpg)
37
Related work at IDA/ADIT
• Use of IR/text mining in – Ontology engineering
• Defining similarity between concepts (OA)• Defining relationships between concepts (OD)
• Semantic Web• Databases
![Page 38: GET THAT PROTEIN!](https://reader031.vdocuments.us/reader031/viewer/2022032805/56813323550346895d99ffb0/html5/thumbnails/38.jpg)
38
![Page 39: GET THAT PROTEIN!](https://reader031.vdocuments.us/reader031/viewer/2022032805/56813323550346895d99ffb0/html5/thumbnails/39.jpg)
39
Literature
Baeza-Yates, R., Ribeiro-Neto, B., Modern Information Retrieval, Addison-Wesley, 1999.