Ramzi Alqrainy
• MSc in Computer Science, University of Jordan, Amman, Jordan
• Senior Enterprise Search / Data Engineer @ OpenSooq.com
• Technical Reviewer for the books “Scaling Apache Solr” and “Apache Solr Search Patterns”
• Co-founder of Solr.ar group
• Built 8 search engines for different models in the last 2 years
• Active blogger and presenter on Information Retrieval
Agenda
• Why is Arabic Language Important?
• Arabic Language is Complex
• How we use Apache Solr @ OpenSooq?
• Localization Concept with SolrCloud
• Ranking and Relevancy
• Apache Solr Implementations @ OpenSooq
Why is Arabic Language Important?
• The Arabic language ranks fourth among the top languages on the web.
• The number of Arab Internet users grew from 65 million in 2011 to 135 million in 2013
Arabic Language is Complex
• Arabic Orthography and Print
§ Arabic has a right-to-left connected script that uses 28 basic letters, which change shape depending on their positions in words.
• Arabic Diacritics
§ Diacritics help disambiguate the meaning of words.
§ For example, the two words Alam (عَلَم, meaning “flag”) and Eilm (عِلم, meaning “knowledge”) share the same letters علم (Elm) but differ in diacritics.
Arabic Language is Complex
• Arabic Morphology
§ Arabic words are divided into three main types: nouns, verbs, and particles.
§ Arabic nouns (which include adjectives and adverbs) and verbs are derived from a closed set of around 10,000 roots.
Arabic Language is Complex
• Arabic Dialects
§ There are 6 dominant dialects with many variations of them, plus dozens of less widely spoken dialects.
§ E.g., the concept corresponding to “I want” is expressed as عاوز (Eawz) in Egyptian, أبغى (Abgy) in Gulf, أبي (Aby) in Iraqi, and بدي (bdy) in Levantine.
• Arabizi (Transliteration)
§ Arabic is sometimes written using Latin characters in transliterated form.
§ Arabizi uses numerals to represent Arabic letters.
§ E.g., “2” and “3” represent the letters أ (which sounds like “a” as in apple) and ع (E, a guttural “aa”) respectively.
How we use Apache Solr @ OpenSooq?
• A leading classifieds ads website in the Middle East and North Africa.
• Right now: more than 7K concurrent users on average.
• Activity-Per-Second: 240 APS.
§ Adding/Editing/Deleting posts
§ Adding comments
§ Sending messages to buyers/sellers, etc.
• More than 40K hits on Apache Solr per minute.
Arabic Normalization
• Some common spelling mistakes are widely accepted. For example, the verb ادرس (Adrs) in the imperative mood (meaning “study” as a command) would be written as أدرس.
• Arabic content is normalized according to the following steps:
§ Remove punctuation.
§ Remove diacritics (primarily weak vowels).
§ Remove non-letters.
§ Replace آ, إ, and أ with ا (A, alef) when they are the first letter of a word.
§ Replace final ى with ي (Ya).
§ Replace final ة with ه (Ha).
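The normalization steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used at OpenSooq: the function names are hypothetical, "letters" are taken to be any character in a Unicode letter category, and diacritics are taken to be Unicode combining marks.

```python
import unicodedata

ALEF_VARIANTS = "\u0622\u0625\u0623"  # alef with madda, hamza below, hamza above
ALEF = "\u0627"                       # bare alef
ALEF_MAQSURA = "\u0649"               # final form to be replaced by ya
YA = "\u064A"
TA_MARBUTA = "\u0629"                 # final form to be replaced by ha
HA = "\u0647"


def strip_non_letters(text):
    """Drop diacritics and turn punctuation, digits, and symbols into spaces."""
    out = []
    for ch in text:
        category = unicodedata.category(ch)
        if category.startswith("L"):
            out.append(ch)      # keep letters
        elif category.startswith("M"):
            continue            # combining marks (diacritics): remove
        else:
            out.append(" ")     # anything else separates words
    return "".join(out)


def normalize_word(word):
    """Apply the per-word replacements (assumes a non-empty, letters-only word)."""
    if word[0] in ALEF_VARIANTS:
        word = ALEF + word[1:]
    if word[-1] == ALEF_MAQSURA:
        word = word[:-1] + YA
    elif word[-1] == TA_MARBUTA:
        word = word[:-1] + HA
    return word


def normalize(text):
    """Normalize a whole string, returning space-separated normalized words."""
    return " ".join(normalize_word(w) for w in strip_non_letters(text).split())


# The two diacritized words for "flag" and "knowledge" collapse to the same string.
print(normalize("\u0639\u064E\u0644\u064E\u0645") == normalize("\u0639\u0650\u0644\u0645"))
```

Because the same function would be applied at index time and at query time, a query typed with أ matches content typed with ا, and the reverse.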
Arabic Light Stemmer
• A light stemmer is not dictionary driven.
• This algorithm follows a rule-based prefix-removal mechanism.
Arabic Light Stemmer
• The light stemmer light10 outperformed the other approaches and is becoming widely used in Arabic information retrieval.
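To make the rule-based idea concrete, here is a minimal sketch of a light stemmer in Python. It is a simplification and not a faithful light10 implementation: the affix lists and minimum-length rules are assumptions modelled on commonly published light-stemming rules, the function name `light_stem` is hypothetical, and the sketch also strips a few suffixes, which goes beyond the prefix removal mentioned on the slide.

```python
# Prefixes, longest first: wal, bal, kal, fal, al, ll, w (assumed list).
PREFIXES = [
    "\u0648\u0627\u0644",  # wal
    "\u0628\u0627\u0644",  # bal
    "\u0643\u0627\u0644",  # kal
    "\u0641\u0627\u0644",  # fal
    "\u0627\u0644",        # al (definite article)
    "\u0644\u0644",        # ll
    "\u0648",              # w (conjunction "and")
]

# Suffixes checked in this order: ha, an, at, wn, yn, yh, yp, h, p, y (assumed list).
SUFFIXES = [
    "\u0647\u0627", "\u0627\u0646", "\u0627\u062A", "\u0648\u0646",
    "\u064A\u0646", "\u064A\u0647", "\u064A\u0629",
    "\u0647", "\u0629", "\u064A",
]


def light_stem(word):
    """Strip at most one prefix, then make one pass over the suffix list."""
    for prefix in PREFIXES:
        # A one-letter prefix must leave at least 3 letters, longer ones at least 2.
        min_len = len(prefix) + (3 if len(prefix) == 1 else 2)
        if word.startswith(prefix) and len(word) >= min_len:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        # A suffix is removed only if at least 2 letters remain.
        if word.endswith(suffix) and len(word) >= len(suffix) + 2:
            word = word[: -len(suffix)]
    return word


# "the book" (alktAb) and "and the book" (walktAb) reduce to the same stem.
print(light_stem("\u0627\u0644\u0643\u062A\u0627\u0628")
      == light_stem("\u0648\u0627\u0644\u0643\u062A\u0627\u0628"))
```

No dictionary lookup happens anywhere: the stem is whatever remains after the affix rules fire, which is why the approach is cheap but can occasionally over-stem.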
Arabic Light Stemmer
• Sometimes a stemmer might not do what you want out of the box.
• A protected-words list protects words from being modified by stemmers.
Stop words and Synonyms
• Removing stop words is important to ensure high performance and improve recall.
https://github.com/Ramzi-Alqrainy/Arabic-IR/blob/master/stopwords-ar.txt
• Synonym mapping, that is, matching strings of tokens and replacing them with other strings of tokens, will improve precision and recall.
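The two token filters described above can be sketched as follows. This is an illustrative Python version, not Solr's implementation; the function names, the English placeholder tokens, and the tiny stop word and synonym tables are all hypothetical stand-ins for real resources such as an Arabic stop word file.

```python
def remove_stop_words(tokens, stop_words):
    """Drop every token that appears in the stop word set."""
    return [t for t in tokens if t not in stop_words]


def apply_synonyms(tokens, synonyms):
    """Replace token sequences using a {tuple_of_tokens: tuple_of_tokens} map.

    At each position the longest matching sequence wins; tokens that
    match nothing are copied through unchanged.
    """
    max_len = max((len(key) for key in synonyms), default=1)
    out = []
    i = 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + n])
            if key in synonyms:
                out.extend(synonyms[key])
                i += n
                break
        else:
            # No synonym rule starts at this position.
            out.append(tokens[i])
            i += 1
    return out


# Hypothetical example data (placeholders, not taken from any real list).
STOP_WORDS = {"the", "for"}
SYNONYMS = {("cell", "phone"): ("mobile",)}

tokens = remove_stop_words(["the", "cell", "phone", "for", "sale"], STOP_WORDS)
print(apply_synonyms(tokens, SYNONYMS))  # ['mobile', 'sale']
```

Running stop word removal before synonym replacement is a design choice of this sketch; the order matters when a synonym rule itself contains a stop word.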
Ranking and Relevancy: Boost documents by age
• Is it enough to just sort by date, newest first?
• Boost more recent documents and penalize older documents just for being old.
• Recency Boosting
bf=recip(ms(NOW,post_inserted_date),3.16e-11,0.08,0.05)^5
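To see what the boost function does to documents of different ages, the arithmetic can be reproduced outside Solr. Solr's recip(x,m,a,b) evaluates a/(m*x+b), and 3.16e-11 is roughly 1 divided by the number of milliseconds in a year. The Python names below are hypothetical, and treating the trailing ^5 as a multiplicative weight of 5 is an assumption of this sketch.

```python
MS_PER_YEAR = 365.25 * 24 * 3600 * 1000  # about 3.16e10 milliseconds


def recip(x, m, a, b):
    """Reciprocal function with the same shape as Solr's recip: a / (m*x + b)."""
    return a / (m * x + b)


def recency_boost(age_ms, weight=5.0):
    """Boost for a document that is age_ms milliseconds old.

    m, a, b are the constants from the slide; weight stands in for ^5
    (assumed here to be a multiplicative weight).
    """
    return weight * recip(age_ms, 3.16e-11, 0.08, 0.05)


for years in (0, 0.5, 1, 2, 5):
    boost = recency_boost(years * MS_PER_YEAR)
    print(f"{years:>4} year(s) old -> boost {boost:.3f}")
```

A brand-new post gets 5 * 0.08 / 0.05 = 8.0, and the value falls off smoothly as the post ages, so older documents are demoted without being excluded the way a hard date sort or filter would.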
Apache Solr Implementations @ OpenSooq
§ Anti-Spam
§ Checking Relevancy
§ Tag Generation
§ Recommendation System