A Full-Text Search Algorithm for Long Queries
Alexey Kolosoff, Michael Bogatyrev
Laboratory of Information Systems, Faculty of Cybernetics, Tula State University
Table of Contents
Problem statement
Suggested algorithm
Queries processing
Documents ranking
Experimental results
Problem statement
Environment
A question-answer portal is considered.
Answers are produced by technical support staff.
Several existing databases may contain the answer to a question.
The task is to decrease the workload of the support team.
Workflow
Before: Customer → Web form → Support
After: Customer → Web form → Search. If search helps, the customer gets the answer; if search doesn’t help, the question goes to Support.
Data Processing
Diagram elements: Q&A database, forums, web form, documents database (help, FAQ, etc.), support team, search system
Search system input: customers’ questions (natural language text)
Search system output: links to documents
Using Message Subject for Search - Results
Why it’s not a Typical Web Search
Queries consist of multiple sentences rather than a few keywords
The number of documents is relatively small (tens of thousands)
Indexed documents cover a single subject or several related subjects
Suggested algorithm
The Suggested Algorithm
1. Build a conceptual graph (CG) for the input text and filter out unrelated words
2. Get the concepts mentioned in the text (context matrix)
3. Get documents with the same concepts (filter out irrelevant documents)
4. Rank documents
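The concept-based filtering step can be illustrated with an inverted index from concepts to document ids. This is only a minimal sketch, not the authors' implementation: the helper names `build_concept_index` and `candidate_documents`, the concept labels, and the sample documents are hypothetical, and the sketch assumes a document qualifies when it shares at least one concept with the query. How concepts are extracted from the conceptual graph is not covered here.

```python
from collections import defaultdict

def build_concept_index(doc_concepts):
    """Map each concept to the set of document ids that mention it."""
    index = defaultdict(set)
    for doc_id, concepts in doc_concepts.items():
        for concept in concepts:
            index[concept].add(doc_id)
    return index

def candidate_documents(index, query_concepts):
    """Return ids of documents sharing at least one concept with the query (assumed rule)."""
    candidates = set()
    for concept in query_concepts:
        candidates |= index.get(concept, set())
    return candidates

# Hypothetical documents, each tagged with the concepts it mentions.
docs = {
    "kb-1": {"checkpoint", "script"},
    "kb-2": {"license", "installation"},
    "kb-3": {"script", "debugger"},
}
index = build_concept_index(docs)
candidates = candidate_documents(index, {"script", "checkpoint"})
print(sorted(candidates))
```

Only the candidate set would then be passed to ranking, which is why this lookup is expected to be cheaper than a keyword search over the whole corpus.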
Advantages of the Algorithm
Filtering words and sentences excludes words and phrases that probably do not affect the meaning of the text. The task of text search reduces to phrase search.
Using concepts to filter articles decreases the impact of polysemy on search results.
Retrieving articles with specific concepts is expected to be faster than searching the entire corpus for articles with specific keywords.
Queries processing
Queries Processing
Noise word filtering
Phrase detection
Word form expansion (with lower weight)
Synonym expansion (with lower weight)
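A rough sketch of three of these steps (noise word filtering, word form expansion, and synonym expansion) is shown below; phrase detection is left out. The word lists and the weights 0.5 and 0.3 are illustrative assumptions, since the slides only state that expanded terms receive a lower weight than the original query words.

```python
# All dictionaries and weights below are hypothetical illustrations.
NOISE_WORDS = {"a", "the", "of", "i", "have", "with", "but", "it", "and"}
WORD_FORMS = {"checkpoint": ["checkpoints"], "test": ["tests", "testing"]}
SYNONYMS = {"script": ["macro"]}
FORM_WEIGHT = 0.5      # assumed weight for added word forms
SYNONYM_WEIGHT = 0.3   # assumed weight for added synonyms

def expand_query(text):
    """Return {term: weight}; original words get 1.0, expansions get less."""
    weights = {}
    for word in text.lower().split():
        if word in NOISE_WORDS:
            continue
        weights[word] = 1.0  # an original query word always has full weight
        for form in WORD_FORMS.get(word, []):
            weights.setdefault(form, FORM_WEIGHT)
        for synonym in SYNONYMS.get(word, []):
            weights.setdefault(synonym, SYNONYM_WEIGHT)
    return weights

print(expand_query("a script with checkpoint"))
```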
Phrases Detection – Punctuation Marks (During Indexing)
Examples:
issue-tracking tools => [N, N + 0.25]
...the issue, but{stop word} tracking changes... => [N, N + 3]
Object.Method() => Object[N], Method[N + 0.25]
…some object. Method A shows… => object[N], Method[N + 15]
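One way to read these examples is as a position-assignment rule applied during indexing. The sketch below assumes increments inferred from the four examples: 1 between ordinary words, 0.25 across a hyphen or dot inside a token, one extra position for a comma, and 15 across a sentence break. It also assumes that stop words still occupy a position. These values and the tokenizer are assumptions made for illustration, not a description of the authors' indexer.

```python
import re

# Increments inferred from the slide's examples (assumptions).
WORD_STEP = 1.0        # gap between adjacent words
INNER_STEP = 0.25      # hyphen or dot inside a token, e.g. issue-tracking
COMMA_EXTRA = 1.0      # extra gap introduced by a comma
SENTENCE_STEP = 15.0   # gap across a sentence break

TOKEN_RE = re.compile(
    r"(?P<word>[A-Za-z0-9]+)"
    r"|(?P<sent>[.!?]+(?=\s|$))"         # sentence-ending punctuation
    r"|(?P<inner>[-.](?=[A-Za-z0-9]))"   # joiner directly followed by a word
    r"|(?P<comma>,)"
)

def assign_positions(text, start=0.0):
    """Return [(word, position)] using punctuation-dependent increments."""
    positions = []
    pos = None
    step = WORD_STEP
    for match in TOKEN_RE.finditer(text):
        kind = match.lastgroup
        if kind == "word":
            pos = start if pos is None else pos + step
            positions.append((match.group(), pos))
            step = WORD_STEP
        elif kind == "inner":
            step = INNER_STEP
        elif kind == "comma":
            step = WORD_STEP + COMMA_EXTRA
        elif kind == "sent":
            step = SENTENCE_STEP
    return positions

print(assign_positions("issue-tracking tools"))
print(assign_positions("some object. Method A"))
```

Fractional positions keep the parts of a compound token closer together than ordinary neighbours, while the large sentence-break increment pushes words from different sentences apart.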
Phrases Detection - Semantics
Despite possible errors, users tend to use correct word combinations when describing technical details.
A conceptual graph built from a question’s text allows filtering out unrelated words and word combinations that are not grammatically correct.
How Conceptual Graphs are Built
Flowchart: Begin → Input text → Recognize language → Parsing sentences → For all sentences → Parsing sentence elements → For all sentence elements → Normal word forms and morphology → Language? → (Russian) Concepts & relations in Russian / (English) Concepts & relations in English → Output CG → End
Morphological analysis
• word formation paradigms from the Russian and English languages
• using dictionaries
Semantic analysis
• semantic role labelling
• using templates
Sample Query
“Hi there!
i have a script test with a bunch of checkpoints, but when it hits a checkpoint cannot be verified, the execution of the script stops and any tests after the failed checkpoint do not get executed. Thank's in advance.Randy”
A Conceptual Graph Fragment
Filtered out Sentences
Parsing results
Phrases are used as the input for full-text search.
Each phrase has its own weight.
As a result, the task of searching for a given text reduces to searching for a number of phrases, which the suggested algorithm can solve.
Documents ranking
Document Model
A vector model is used to represent an indexed document or a query. Each term is stored with its weight (relative frequency) and its word positions:
The native methods of the control =>
the (0.333) [0, 4], native (0.166) [1], methods (0.166) [2], of (0.166) [3], control (0.166) [5]
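A minimal sketch of such a vector is given below, assuming that a term's weight is its relative frequency in the text and that positions are zero-based word offsets. The function name `document_vector` and the whitespace tokenization are illustrative choices, not taken from the slides.

```python
from collections import defaultdict

def document_vector(text):
    """Return {term: (weight, positions)} with weight = relative term frequency."""
    words = text.lower().split()
    positions = defaultdict(list)
    for offset, word in enumerate(words):
        positions[word].append(offset)
    total = len(words)
    return {word: (len(offsets) / total, offsets) for word, offsets in positions.items()}

vector = document_vector("The native methods of the control")
for term, (weight, offsets) in vector.items():
    print(f"{term}: weight={weight:.3f}, positions={offsets}")
```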
The Sought-for Phrase is Present in an Indexed Document if…
...the distance between each pair of phrase words is less than M, where M is the artificial word position increment applied at sentence breaks.
Sample query: AJAX applications testing
“AJAX web applications are, indeed, difficult for testing.” => total words distance: 7
“No AJAX applications. Testing desktop applications is another task.” => total words distance: 16 (the sentence break between “applications” and “Testing” contributes the increment M)
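The presence test can be sketched as a check over the positions at which the phrase words occur. The value M = 15 is an assumption read off the earlier sentence-break example (N + 15), and the word positions passed in below are hypothetical values chosen to reproduce the totals of 7 and 16 quoted above; the slides do not list the individual positions.

```python
M = 15  # assumed sentence-break increment

def pair_gaps(positions):
    """Distances between consecutive phrase words, in phrase order."""
    return [abs(b - a) for a, b in zip(positions, positions[1:])]

def total_distance(positions):
    """Total words distance for one occurrence of a phrase."""
    return sum(pair_gaps(positions))

def phrase_present(positions, m=M):
    """The phrase counts as present only if every pair gap is below m."""
    return all(gap < m for gap in pair_gaps(positions))

# Hypothetical positions of "AJAX", "applications", "testing".
same_sentence = [0, 2, 7]      # all three words within one sentence
across_break = [1, 2, 17]      # "testing" falls after a sentence break

print(total_distance(same_sentence), phrase_present(same_sentence))
print(total_distance(across_break), phrase_present(across_break))
```

Under these assumptions the second occurrence is rejected because one gap equals M, so a phrase never matches across a sentence boundary.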
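Phrases Relevance
$$R_{phrase}(q, d_j) = \frac{\sum_{i=1}^{N} w_{p_i} \, R_p(p_i, d_j)}{N}$$
This is the arithmetic mean over the phrases $p_i$ detected in a query $q$, where $w_{p_i}$ is the weight of the phrase in the query and $R_p$ is the document’s relevance for $p_i$, calculated via the following formula:
$$R_p(p_i, d_j) = \sum_{k} \frac{n_{t_i}^2}{\Delta_{p_i,k}^2}$$
where $n_{t_i}$ is the number of words in $p_i$ and $\Delta_{p_i,k}$ is the total words distance in a document ($d_j$), calculated for each occurrence $k$ of $p_i$ (the original symbol for the distance did not survive extraction; $\Delta$ stands in for it).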
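The following sketch computes the phrases relevance under one plausible reading of the slide: each occurrence of a phrase contributes (number of words in the phrase)² divided by (total words distance of that occurrence)², these contributions are summed per phrase, and the weighted per-phrase values are averaged over the phrases in the query. The slide's formulas are damaged in this transcript, so this reading, the function names, and the sample numbers are all assumptions.

```python
def phrase_doc_relevance(n_words, occurrence_distances):
    """Assumed R_p: sum over occurrences of n_words^2 / distance^2."""
    return sum(
        (n_words ** 2) / (distance ** 2)
        for distance in occurrence_distances
        if distance > 0  # skip degenerate occurrences to avoid division by zero
    )

def phrases_relevance(phrases):
    """Assumed R_phrase: mean of weight * R_p over the query's phrases.

    `phrases` is a list of (weight, n_words, occurrence_distances) tuples.
    """
    if not phrases:
        return 0.0
    total = sum(
        weight * phrase_doc_relevance(n_words, distances)
        for weight, n_words, distances in phrases
    )
    return total / len(phrases)

# Hypothetical query with two phrases: a 3-word phrase found once with
# total distance 7, and a 2-word phrase found twice (distances 1 and 4).
score = phrases_relevance([(1.0, 3, [7]), (0.5, 2, [1, 4])])
print(round(score, 4))
```

With this reading, tightly packed occurrences dominate the score, since the contribution falls off with the square of the distance.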
Resulting Relevance
$$R(q, d_j) = \sum_{k} \left( W_{field_k} \cdot R_{phrase}(q, d_j) \right)$$
where $R_{phrase}$ is the phrases relevance and $W_{field}$ is the indexed field’s weight.
Experimental results
Performance Measurement Formula
$$DCG_p = rel_1 + \sum_{i=2}^{p} \frac{rel_i}{\log_2 i}$$
where $rel_i$ is the assessor-defined relevance [0..2], $i$ is the result’s order number, and $p = 10$ is the number of considered results.
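The measure is the classic form of discounted cumulative gain, in which the first result counts in full and each later result i is discounted by log2(i). A short sketch follows; the function name `dcg` and the sample relevance lists are illustrative.

```python
import math

def dcg(relevances, p=10):
    """Discounted cumulative gain over the top-p results.

    `relevances` lists assessor grades in ranked order (0..2 in the slides).
    """
    top = relevances[:p]
    if not top:
        return 0.0
    score = float(top[0])  # the first result is not discounted
    for rank, rel in enumerate(top[1:], start=2):
        score += rel / math.log2(rank)
    return score

print(round(dcg([2] * 10), 2))          # best possible top 10
print(round(dcg([2, 0, 1, 0, 2]), 3))   # a mixed ranking
```

Ten results that all carry the top grade of 2 give about 10.51, which is the maximum quoted for the experiments.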
Experimental Results
The average quality of «top 10» search results (discounted cumulative gain), max = 10.51
[Chart: average DCG (vertical axis, 0 to 8) against the number of queries (horizontal axis, 5 to 30) for three systems: New Algorithm, SQL Server iFTS, Google]
Conclusions
Finding an answer automatically requires phrase search.
Conceptual graphs allow detecting phrases in natural language queries.
Storing conceptual graphs instead of document vectors as indexes, and using the graphs directly for relevance calculation, may be an interesting approach; it will be examined in future work.
Thank you! Any questions?