Multilingual Search System
DESCRIPTION
A multilingual search system, developed as part of Information Retrieval. The presentation covers the implementation of a search system using Solr.
TRANSCRIPT
![Page 1: Multilingual Search System](https://reader036.vdocuments.us/reader036/viewer/2022062410/5695cfe71a28ab9b02900ba6/html5/thumbnails/1.jpg)
Multilingual Search System
TEAM NAME – SHIELD
Vamshi Krishna Padidela (50169645)
Manikant Manohar Kapuganti (50170071)
Pramod Rangaraju (50169514)
Sudheer Bondada (50170321)
Nikhil Ayyagari (50169485)
![Page 2: Multilingual Search System](https://reader036.vdocuments.us/reader036/viewer/2022062410/5695cfe71a28ab9b02900ba6/html5/thumbnails/2.jpg)
Introduction
In this project, we built a retrieval system powered by Solr to search within tweets.
The dataset comprises 11,000 tweets in multiple languages, collected through Twitter’s REST API. The tweets cover two topics, ISIS and health, each with significant subtopics.
The UI for the search system is built on the Banana framework, which provides powerful dashboard capabilities for visualizing analytics over large datasets.
![Page 3: Multilingual Search System](https://reader036.vdocuments.us/reader036/viewer/2022062410/5695cfe71a28ab9b02900ba6/html5/thumbnails/3.jpg)
We implemented the following components:
1. Content Tagging (Monolingual)
2. Faceted Search
3. Cross-Document Analytics
4. Topic Models and/or LSI
![Page 4: Multilingual Search System](https://reader036.vdocuments.us/reader036/viewer/2022062410/5695cfe71a28ab9b02900ba6/html5/thumbnails/4.jpg)
Content Tagging (Monolingual)
We implemented content tagging using Alchemy’s Entity Extraction API.
The Alchemy API uses Natural Language Processing to identify proper nouns (places, people, organizations).
The tags that the Alchemy API returns for each tweet are added to that tweet in a new field, “tags”.
The new JSON file, with the added “tags” field, is re-indexed in Solr.
The tags give insight into metrics such as the popularity of a person or place over a period of time.
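The tagging step above can be sketched roughly as follows. This is a minimal illustration and not the team's actual code: `add_tags` and `toy_extractor` are hypothetical names, and `toy_extractor` is a crude stand-in (capitalized words only) for the entity extraction service, whose real request format is not shown in the slides. The sample tweet is made up.

```python
import json

def add_tags(tweets, extract_entities):
    """Return copies of the tweets with an added 'tags' field.

    extract_entities is any callable that maps tweet text to a list of
    entity strings (an assumption standing in for the external API).
    """
    tagged = []
    for tweet in tweets:
        entities = extract_entities(tweet.get("text", ""))
        # Remove duplicate entities while keeping first-seen order.
        tags = list(dict.fromkeys(entities))
        # Copy the tweet so the original document is left untouched.
        tagged.append({**tweet, "tags": tags})
    return tagged

def toy_extractor(text):
    """Hypothetical stand-in: treat capitalized words as entities."""
    words = (w.strip(".,!?") for w in text.split())
    return [w for w in words if w[:1].isupper()]

# Made-up sample document for illustration only.
tweets = [{"id": "1", "text": "Doctors in Paris warn about flu in Paris."}]
tagged = add_tags(tweets, toy_extractor)

# The resulting JSON is what would be sent to Solr for re-indexing.
print(json.dumps(tagged, indent=2))
```

Keeping the extractor as a parameter means the same loop works whichever entity extraction service is plugged in.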
![Page 5: Multilingual Search System](https://reader036.vdocuments.us/reader036/viewer/2022062410/5695cfe71a28ab9b02900ba6/html5/thumbnails/5.jpg)
Results from Alchemy API’s content tagging
![Page 6: Multilingual Search System](https://reader036.vdocuments.us/reader036/viewer/2022062410/5695cfe71a28ab9b02900ba6/html5/thumbnails/6.jpg)
Tags for a search field
![Page 7: Multilingual Search System](https://reader036.vdocuments.us/reader036/viewer/2022062410/5695cfe71a28ab9b02900ba6/html5/thumbnails/7.jpg)
Tags displayed in order of frequency, most used first
![Page 8: Multilingual Search System](https://reader036.vdocuments.us/reader036/viewer/2022062410/5695cfe71a28ab9b02900ba6/html5/thumbnails/8.jpg)
Faceted Search
Faceted search is available through the Banana framework: a search can be narrowed by fields such as text, language, and location.
Facets work like filters, with the addition of a document count for each value.
Faceted search helps in building dashboards for various analytical purposes.
Faceted search is also called faceted browsing, faceted navigation, guided navigation, and sometimes parametric search.
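As a rough sketch of what a facet request looks like underneath the dashboard, the snippet below builds Solr query parameters and reads back the counts. The parameters `facet`, `facet.field`, `fq`, `rows`, and `wt` are standard Solr request parameters; the field names `text`, `lang`, and `location`, the helper names, and the sample counts are assumptions made for illustration.

```python
from urllib.parse import urlencode

def facet_query_params(query, facet_fields, filters=None):
    """Build a URL query string for a faceted Solr search."""
    params = [
        ("q", query),
        ("wt", "json"),
        ("rows", "0"),        # only the facet counts are needed here
        ("facet", "true"),
    ]
    # One facet.field parameter per field to count on.
    params += [("facet.field", field) for field in facet_fields]
    # Each filter query (fq) restricts the result set, like a filter.
    params += [("fq", f"{field}:{value}")
               for field, value in (filters or {}).items()]
    return urlencode(params)

def parse_facet_counts(flat):
    """Convert Solr's default flat facet list
    [value, count, value, count, ...] into a dict."""
    return dict(zip(flat[0::2], flat[1::2]))

query_string = facet_query_params(
    "text:health", ["lang", "location"], {"lang": "en"}
)
print(query_string)

# Made-up counts in the shape Solr returns under facet_counts.facet_fields.
print(parse_facet_counts(["en", 120, "de", 45, "ru", 30]))
```

The `fq` filter narrows the documents, while each `facet.field` reports how many matching documents carry each value, which is the "filter plus document count" behavior described above.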
![Page 9: Multilingual Search System](https://reader036.vdocuments.us/reader036/viewer/2022062410/5695cfe71a28ab9b02900ba6/html5/thumbnails/9.jpg)
Facets and filters
![Page 10: Multilingual Search System](https://reader036.vdocuments.us/reader036/viewer/2022062410/5695cfe71a28ab9b02900ba6/html5/thumbnails/10.jpg)
Pie chart showing the geographical distribution
![Page 11: Multilingual Search System](https://reader036.vdocuments.us/reader036/viewer/2022062410/5695cfe71a28ab9b02900ba6/html5/thumbnails/11.jpg)
Cross-Document Analytics
![Page 12: Multilingual Search System](https://reader036.vdocuments.us/reader036/viewer/2022062410/5695cfe71a28ab9b02900ba6/html5/thumbnails/12.jpg)
Distribution of tweets against time and location
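The aggregation behind a chart like this can be sketched as a simple count of tweets per day and location. This is an illustrative sketch, not the dashboard's actual query: the field names `created_at` and `location`, the timestamp format, and the sample records are all assumptions.

```python
from collections import Counter
from datetime import datetime

# Assumed timestamp layout (ISO 8601 in UTC, as Solr stores dates).
TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%SZ"

def tweets_per_day_and_location(tweets):
    """Count tweets for each (day, location) pair."""
    counts = Counter()
    for tweet in tweets:
        created = datetime.strptime(tweet["created_at"], TIMESTAMP_FORMAT)
        day = created.date().isoformat()
        # Tweets without a location are grouped under "unknown".
        location = tweet.get("location", "unknown")
        counts[(day, location)] += 1
    return counts

# Made-up sample records for illustration only.
sample = [
    {"created_at": "2015-12-01T10:15:00Z", "location": "Paris"},
    {"created_at": "2015-12-01T18:40:00Z", "location": "Paris"},
    {"created_at": "2015-12-02T09:05:00Z", "location": "Berlin"},
    {"created_at": "2015-12-02T11:30:00Z"},
]
counts = tweets_per_day_and_location(sample)
for (day, location), n in sorted(counts.items()):
    print(day, location, n)
```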
![Page 13: Multilingual Search System](https://reader036.vdocuments.us/reader036/viewer/2022062410/5695cfe71a28ab9b02900ba6/html5/thumbnails/13.jpg)
Topic Models-LSI
We implemented Latent Semantic Indexing (LSI) on the collected data to demonstrate semantic search as opposed to keyword search.
Latent Dirichlet Allocation (LDA) is a probabilistic topic model that extends the ideas behind LSI.
We use LDA to extract the set of topics present in the collection.
LDA processes the tweets to find the topic distribution for each document and the document distribution for each topic.
The LDA algorithm is run on the vectors generated from the sequence file.
We are using MALLET (Machine Learning for Language Toolkit) for topic generation. (Results pending)
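For reference, the standard textbook formulation of LSI is sketched below. The slides do not show how it was implemented here, so the symbols are assumptions for illustration: $X$ is the term-by-tweet matrix, $k$ is the number of latent dimensions kept, $q$ is a query expressed as a term vector, and $\hat{d}_j$ is the $j$-th row of $V_k$ (the reduced representation of tweet $j$).

$$
X \approx X_k = U_k \Sigma_k V_k^{\top}, \qquad
\hat{q} = \Sigma_k^{-1} U_k^{\top} q, \qquad
\operatorname{sim}(\hat{q}, \hat{d}_j) = \frac{\hat{q} \cdot \hat{d}_j}{\lVert \hat{q} \rVert \, \lVert \hat{d}_j \rVert}
$$

Because the query and the tweets are compared in the reduced $k$-dimensional space, a tweet can match a query without sharing any literal keyword with it, which is the sense in which the search is semantic.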
![Page 14: Multilingual Search System](https://reader036.vdocuments.us/reader036/viewer/2022062410/5695cfe71a28ab9b02900ba6/html5/thumbnails/14.jpg)
Search System UI – 1/2
![Page 15: Multilingual Search System](https://reader036.vdocuments.us/reader036/viewer/2022062410/5695cfe71a28ab9b02900ba6/html5/thumbnails/15.jpg)
Search System UI – 2/2
![Page 16: Multilingual Search System](https://reader036.vdocuments.us/reader036/viewer/2022062410/5695cfe71a28ab9b02900ba6/html5/thumbnails/16.jpg)
Thank You!!