faceted browsing: analysis and implementation of a big data solution using apache solr

18
Faceted Browsing: Analysis and implementation of a Big Data Solution using Apache Solr. Paolo Malavolta University of Modena and Reggio Emilia April 15, 2014 dvisor: rof. Sonia Bergamaschi o-Advisor: rof. H.V. Jagadish ott. Ing. Francesco Guerra

Upload: york

Post on 24-Feb-2016

49 views

Category:

Documents


1 download

DESCRIPTION

April 15, 2014. Faceted Browsing: Analysis and implementation of a Big Data Solution using Apache Solr . Advisor: Prof. Sonia Bergamaschi Co-Advisor: Prof. H.V. Jagadish Dott. Ing. Francesco Guerra. Paolo Malavolta University of Modena and Reggio Emilia. Big Data visualization. - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Faceted Browsing: Analysis and implementation of a Big Data Solution using Apache  Solr

Faceted Browsing: Analysis and implementation of a Big Data Solution using Apache

Solr.

Paolo MalavoltaUniversity of Modena and Reggio Emilia

April 15, 2014

Advisor:Prof. Sonia BergamaschiCo-Advisor:Prof. H.V. JagadishDott. Ing. Francesco Guerra

Page 2: Faceted Browsing: Analysis and implementation of a Big Data Solution using Apache  Solr

Big Data visualization

90% of the data in the world today was created in the last 2 years alone. -IBM-

2/18

50% of our brains are

involved invisual processing

70% of all of our sensory

receptorsare in our eyes

Why should we be interested in visualization? Because the human visual system is a pattern seeker of enormous power and subtlety. The eye and the visual cortex of the brain form a massively parallel processor that provides the highest-bandwidth channel into human cognitive centers. At higher levels of processing, perception and cognition are closely interrelated, which is the reason why the words “understanding” and “seeing” are synonymous.

-Colin Ware, 2000-

Page 3: Faceted Browsing: Analysis and implementation of a Big Data Solution using Apache  Solr

Apache Solr

Solr (pronounced “solar”) is the most popular and open source enterprise search platform from the Apache Lucene project. Its major features include: Full-text search, Dynamic clustering, Near Real-time indexing, Traditional DB, rich document (e.g., Word, PDF, jpeg) handling, NoSQL DB support, HTTP Restfull requests.

3/18

Page 4: Faceted Browsing: Analysis and implementation of a Big Data Solution using Apache  Solr

Big Data solution

Jetty

4/18

Jetty features: Tomcat features:• Full-featured and standards-based.• Embeddable and Asynchronous.• Open source and commercially usable.• Dual licensed under Apache and Eclipse.• Flexible and extensible, Enterprise

scalable. Low maintenance cost.• Small and Efficient.

• Famous open source under Apache.• Easier to embed Tomcat in your

applications.• Implements the Servlet 3.0, JSP 2.2 and

JSP-EL 2.2 support.• Strong and widely commercially usable

and use.• Easy integrated with other application

such as Spring.• Flexible and extensible, Enterprise

scalable.• Faster JSP parsing.• Stable.

Tomcat

Page 5: Faceted Browsing: Analysis and implementation of a Big Data Solution using Apache  Solr

Big Data solution

Solr Tomcat Apache ZooKeeper:

Sharding Replication

5/18

Page 6: Faceted Browsing: Analysis and implementation of a Big Data Solution using Apache  Solr

Big Data solution

Solr Tomcat Apache ZooKeeper:

Sharding Replication

Hadoop Distributed File System (HDFS)

6/18

Big Data solutionover

Apache Solr

Page 7: Faceted Browsing: Analysis and implementation of a Big Data Solution using Apache  Solr

Faceted Browsing

Work Solr over Big Data DONE

Big Data visualization over Solr NOT YET!

7/18

WE NEED IT

Page 8: Faceted Browsing: Analysis and implementation of a Big Data Solution using Apache  Solr

Faceted Browsing

Faceted Browsing over Solr AND between facets OR within facet

8/18

Page 9: Faceted Browsing: Analysis and implementation of a Big Data Solution using Apache  Solr

Faceted Browsing

Faceted Browsing concerns: Checkbox Scale Efficiency

9/18

CAPCOLOR (79)[x] Brown (23)[x] Gray (45)[ ] Red (11)CAPSURFACE (18)[ ] Scaly (7)[ ] Smooth (11)ODOR (183) [ ] None (34)[x] Pungent (80)[ ] Spicy (65)[ ] Anise (4)

Page 10: Faceted Browsing: Analysis and implementation of a Big Data Solution using Apache  Solr

Faceted Browsing

How to make Faceted Browsing over Solr: Analysis and learning of a multi-faceted query Modified Solr front-end Added javascript

10/18

TAG MEANING

/solr/collection1/browse? the url that show the Solr browsing web page

&q= tag that identify a main query

&fq= tag that identify a subquery

{!tag=dt}CapColor:brown tag used to select a specific faced field and its facet value

%22OR%22 tag needed when are chosen more facet value of the same facet field

facet=on tag to enable the faceted browsing

&facet.field= tag to choose a facet field when faceted browsing is enabled

{!ex=dt}Bruises tag to select a specific facet field to show when faceted browsing is enabled

http://localhost:8081/solr/collection1/browse? &q=&fq={!tag=dt}CapColor:brown%22OR%22{!tag=dt}CapColor:gray &q=&fq={!tag=dt}Odor:pungent &facet=on &facet.field={!ex=dt}CapColor &facet.field={!ex=dt}CapSurface &facet.field={!ex=dt}Odor

Page 11: Faceted Browsing: Analysis and implementation of a Big Data Solution using Apache  Solr

Bayesian Networks

Issues: Huge volume of facet User search experience

11/18

How: Bayesian networks Open Markov

Open source GUI available API available

Page 12: Faceted Browsing: Analysis and implementation of a Big Data Solution using Apache  Solr

Bayesian Networks

How to: Added javascript Added Servlet Open Markov API modified

12/18

Page 13: Faceted Browsing: Analysis and implementation of a Big Data Solution using Apache  Solr

Bayesian Networks

Aware User Search

Dynamic Summary

13/18

Page 14: Faceted Browsing: Analysis and implementation of a Big Data Solution using Apache  Solr

Keywords Search

Keywords Search over Relational Database: Very popular method Allow non-expert user to formulate queries Not needed to how the data is represented inside DB

14/18

Current techniques: A-priori instance analysis Construct an index Retrieve information from index

Experimented approach: Learning Bayesian Network from

DB Nodes represents columns Retrieve information from BN

Page 15: Faceted Browsing: Analysis and implementation of a Big Data Solution using Apache  Solr

Keywords Search

How it works:

15/18

Learning Bayesian NetworkAutomatic Hill Climbing algorithm

Compute probabilities:Variable Elimination algorithm

Print results:Quick Sort algorithm

Page 16: Faceted Browsing: Analysis and implementation of a Big Data Solution using Apache  Solr

Keywords Search

Test: Portion of a real DB 2-way path Full outer join:

First 1000 rows

50 runs

16/18

2 EVIDENCE 3 EVIDENCE

TOP 1 TOP 3 TOP 1 TOP 3

6/13=0.461 10/13 = 0.769 6/12 = 0.5 9/12 = 0.692

8/13 = 0.615 10/13 = 0.769 8/12 = 0.666 9/12 = 0.692

... ... ... ...

10/13 = 0.769 12/13=0.923 9/12 = 0.692 11/12=0.916

8/13 = 0.615 10/13 = 0.769 7/12=0.583 9/12 = 0.692

MEAN OF ALL THE RESULTS

0.615 0.7228 0.633 0.7832

Execution time: ~ 3.0 sec.

Page 17: Faceted Browsing: Analysis and implementation of a Big Data Solution using Apache  Solr

Conclusions and Future work

17/18

Conclusions: I have provided over Apache Solr:

Big Data solution Faceted browsing Integrated and ready-to-use solution Aware user search using Bayesian Networks Dynamic summary using Bayesian

Networks

I have provided over Keywords Search:

A different approach for keyword search over DB

A software that allow user to formulate queries

Future work: For Solr I can:

Use Machine Learning algorithm to automate the learning of the facets

For Keywords Search I can: Learn the BN with more Database rows for better

results

Page 18: Faceted Browsing: Analysis and implementation of a Big Data Solution using Apache  Solr

Thank for your time.