
UPTEC IT 13 002

Degree project, 30 credits, January 2013

Solr and the cloud

Johannes Nilsson


Faculty of Science and Technology (Teknisk-naturvetenskaplig fakultet), UTH unit. Visiting address: Ångströmlaboratoriet, Lägerhyddsvägen 1, Hus 4, Plan 0. Postal address: Box 536, 751 21 Uppsala. Telephone: 018 – 471 30 03. Fax: 018 – 471 30 00. Website: http://www.teknat.uu.se/student

Abstract

Solr and the cloud

Johannes Nilsson

This thesis is produced within the M. Sc. in Computer and Information Engineering program at Uppsala University in collaboration with Itera Consulting AB.

Itera Consulting is an IT company based in Stockholm that, among other things, works with implementations of Apache Solr. Apache Solr is very scalable, and it is the engine for search and navigation functionality on many of the large websites around the world, such as apple.com and DN.se.

The scope of this thesis was to give a general overview of what the cloud is and the services it offers, to analyse security aspects that are specific to the cloud, and to analyse how Solr's indexing time performance is affected by scaling of infrastructure and by sharding.

To define the cloud as a service and identify its specific security aspects, a literature study was carried out. It resulted in a summarising definition of the cloud and a compiled analysis of security issues specific to the cloud.

To analyse how Solr's indexing time performance is affected by scaling the infrastructure and by sharding, a prototype was built with Solr on Windows Azure, which is Microsoft's cloud platform. Tests were then designed and run on Windows Azure VMs, which were set up in varying sizes and numbers.

Printed by: ITC

Sponsor: Itera Consulting AB
ISSN: 1401-5749, UPTEC IT 13 002
Examiner: Lars-Åke Nordén
Subject reviewer: Arnold Pears
Supervisor: Johan Persson


Swedish Summary

This degree project was carried out within the Master of Science in Engineering programme in Information Technology at Uppsala University, in collaboration with Itera Consulting AB.

Itera Consulting AB is an IT consultancy based in Stockholm that, among other things, works with implementations of Apache Solr. Apache Solr is a powerful search platform used on many of the world's large websites, such as apple.com and DN.se.

The problem description for the work can be divided into two parts. The first part consists of giving a general definition of what a cloud is and of the services a cloud offers, and of analysing security aspects that are specific to the cloud. The second part consists of analysing how Solr's indexing time is affected by scaling of infrastructure and by sharding.

To define the cloud as a service and analyse the security aspects specific to the cloud, a literature study was carried out, resulting in a summarising definition of the cloud and a compiled analysis of security aspects specific to the cloud.

To then analyse how Solr's indexing time is affected by scaling of infrastructure and by sharding, a prototype of Solr was built on Windows Azure, which is Microsoft's cloud platform. Tests were then designed and run on Windows Azure virtual machines, which were set up in varying sizes and numbers.


0.1 List of Acronyms

ACID Atomicity, Consistency, Isolation and Durability

AICPA American Institute of Certified Public Accountants

API Application Program Interface

EC2 Elastic Compute Cloud

CICA Canadian Institute of Chartered Accountants

CPU Central Processing Unit

CSA Cloud Security Alliance

ENISA European Network and Information Security Agency

GAPP Generally Accepted Privacy Principles

GUI Graphical User Interface

HTTP Hyper Text Transfer Protocol

IaaS Infrastructure as a Service

IIS Internet Information Server

I/O Input Output

JVM Java Virtual Machine

LRU Least Recently Used

MAC Message Authentication Codes

NIST National Institute of Standards and Technology

OVF Open Virtualization Format

PaaS Platform as a Service

SaaS Software as a Service

SLA Service-Level Agreement

SPI Software, Platform or Infrastructure (as a Service)

TLS Transport Layer Security

VHD Virtual Hard Drive

VM Virtual Machine


Contents

0.1 List of Acronyms

1 Introduction
1.1 Background
1.2 Purpose
1.3 Task Description
1.4 Scope and Limitations
1.5 Study of Solr

2 Method
2.1 Literature Study
2.2 Implementation of a Prototype
2.3 Tests
2.4 Result

3 Related Work
3.1 Existing solutions
3.2 Solr
3.2.1 Scaling Lucene and Solr
3.2.2 Solr Near Realtime Search
3.3 Overview of Cloud Technologies

4 The Cloud
4.1 Definition
4.1.1 The SPI model
4.1.2 Deployment Models
4.2 Security
4.2.1 Responsibility of Security in the SPI model
4.2.2 Location of Data
4.2.3 Confidentiality and Integrity of Data
4.2.4 Privacy
4.2.5 Vendor Lock In
4.2.6 Availability
4.2.7 Centralisation of Data
4.2.8 Shared Resources
4.3 Pros and Cons
4.3.1 Pros
4.3.2 Cons

5 Implementation
5.1 Theory
5.1.1 Windows Azure
5.1.2 Solr 4.0 and SolrCloud
5.2 The Implementation
5.2.1 Setting up the VM Cluster
5.2.2 Setting up Solr

6 Test set up
6.1 JVM heap size
6.2 Threads
6.3 Client
6.4 Test Documents
6.5 Base-line
6.6 Tests
6.6.1 Scaling infrastructure
6.6.2 Sharding
6.6.3 Combination of Scaling Infrastructure and Sharding
6.6.4 Collecting Data

7 Results and Analysis
7.1 Scaling of Infrastructure
7.2 Sharding
7.3 Combination of Scaling Infrastructure and Sharding
7.4 Summary
7.5 The Benchmark Data

8 Conclusion and Discussion

9 Future Work

Appendices

A PowerShell Script

B The Client Code

C Collected Data


Chapter 1

Introduction

1.1 Background

Solr is an open-source search platform from the Apache Lucene project. Its features include powerful full-text search, hit highlighting, faceting, dynamic clustering, database integration and rich document (e.g. Word, PDF) handling. Solr is very scalable, and it is the engine for search and navigation functionality on many of the large websites around the world, such as apple.com and DN.se [30]. Solr is written in Java and runs as a standalone text-search server in a servlet container such as Tomcat [5].

Itera Consulting has developed a plug-in for Swedish language support and has connected Solr to EPiServer. This solution is today distributed to the customer as a local installation. This thesis analyses the possibilities and disadvantages of the cloud, and how Solr can utilise the scalability of the cloud.

1.2 Purpose

This thesis aims to define the cloud as a service and to give a general overview of what the cloud is and the services it offers. The thesis will also analyse security aspects that are specific to the cloud, and how Solr’s indexing time performance is affected by scaling of infrastructure and by sharding.


1.3 Task Description

• Define the cloud as a service and its possibilities and disadvantages.

• Analyse security issues that are specific for the cloud.

• Develop a prototype that integrates Solr with Microsoft’s cloud platform Windows Azure.

– Analyse how scaling of infrastructure affects Solr’s performance in the indexing process.

– Analyse how sharding affects Solr’s indexing time performance.

• Propose how Solr can utilize a cloud.

1.4 Scope and Limitations

• The Solr version used for the prototype in this report will be Solr 4.0.

• The developed prototype is intended to be used to analyse and evaluate how scaling of Solr affects the performance of the indexing process. Testing of this will be performed in the cloud. The prototype is not intended to be a commercial product, but how such a product can be developed will be discussed in this report.

• This report will not investigate how hard drive I/O affects performance.

• This report will not discuss performance factors for Solr indexing other than scaling of VMs on Windows Azure and sharding.

1.5 Study of Solr

To be able to conduct this study, a certain level of knowledge about Solr is required. This will be achieved by taking the course Solr Unleashed, a course supervised by LucidWorks.


Chapter 2

Method

To define the cloud as a service and analyse security aspects specific to the cloud, a literature study will be done. Then a prototype will be implemented on Microsoft’s cloud platform Windows Azure, to analyse the effects of scaling the infrastructure of Solr and how sharding affects indexing time performance. The analysis will be based on tests performed with Solr on Windows Azure. From the results of these tests and the analysis of the cloud, a proposal for how the project should be further developed will be compiled.

2.1 Literature Study

The literature study will begin with an analysis of related work, followed by a study of the cloud. The study will define the cloud as a service and determine its possibilities and weaknesses. This phase will also involve defining the security aspects that are specific to the cloud.

2.2 Implementation of a Prototype

To analyse how Solr performs in the cloud, Microsoft’s Windows Azure will be used as the cloud platform. A design of the prototype will be developed and implemented on Windows Azure. When the prototype is implemented, an analysis will be made of the indexing time performance when scaling the infrastructure that runs Solr, and of how sharding (see 5.1.2 and 5.2.2) affects the indexing performance.


2.3 Tests

To test how scaling of infrastructure affects Solr’s indexing time performance, Solr will be set up on virtual machines of different sizes and in varying numbers. To test how sharding affects Solr’s indexing time performance, Solr will be set up on a single virtual machine, of different sizes and with a varying number of shards.
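As a rough illustration of what sharding means for the indexing tests, the sketch below spreads documents over a number of shards by hashing each document id. It is a simplified stand-in and not Solr's actual routing logic; the names `shard_for` and `distribute`, and the choice of CRC32 as the hash, are assumptions made only for this example.

```python
import zlib
from collections import Counter


def shard_for(doc_id: str, num_shards: int) -> int:
    """Pick a shard index from a stable hash of the document id (illustrative only)."""
    if num_shards < 1:
        raise ValueError("num_shards must be at least 1")
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def distribute(doc_ids, num_shards):
    """Count how many documents each shard would receive."""
    counts = Counter(shard_for(d, num_shards) for d in doc_ids)
    return [counts.get(i, 0) for i in range(num_shards)]


if __name__ == "__main__":
    ids = [f"doc-{i}" for i in range(10_000)]
    for n in (1, 2, 4):
        print(n, distribute(ids, n))
```

With more shards, each shard receives a smaller part of the document set, which is the effect the sharding tests are meant to measure in terms of indexing time.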

2.4 Result

When the analyses above are completed, their results will be used to compile a proposal for how the project should be developed in the future.


Chapter 3

Related Work

3.1 Existing solutions

Several providers have already adapted their Solr solution to the cloud; one of them is LucidWorks Enterprise. LucidWorks has developed a free, but closed, solution that is built on top of Apache Solr. They offer this solution both as an on-premise service and as a cloud service on Amazon’s Elastic Compute Cloud (EC2) platform, as well as on Microsoft’s Azure platform. They claim that the difference between a cloud-based deployment and an on-premise server solution is that the on-premise solution requires configuration and tests, whilst a cloud-based solution can be provisioned within minutes [17]. Other examples of providers that have already adapted their Solr solution to the cloud are Axis12 with their “A12 find” [1] and TNR Global with SolrHQ [34].

Amazon also has an enterprise search solution named Amazon CloudSearch. Users create their search domain and upload the data they want to make searchable to Amazon CloudSearch. Amazon then automatically provisions the required resources. This means that Amazon takes care of scaling the resources as the data volume and the query frequency grow, and customers do not have to worry about data partitioning, software patches, or hardware provisioning [3].

3.2 Solr

Baracuda Networks, Inc and Nutschell, LLZ

Baracuda Networks, Inc and Nutschell, LLZ gave a presentation at the O’Reilly OSCON Data conference in July 2011, titled “Scaling Solr Horizontally in the Cloud” [26]. The presentation covers different techniques that can be used to scale Solr in the cloud when working with multiple indexes. Examples include using multiple cores with common schemas, and activating and deactivating cores using a Least Recently Used (LRU) algorithm.
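The idea of activating and deactivating cores with an LRU policy can be sketched as follows. This is a minimal illustration under stated assumptions, not the presenters' implementation: the class `CorePool` and its methods are hypothetical, and loading or unloading a real Solr core is replaced by a placeholder object.

```python
from collections import OrderedDict


class CorePool:
    """Keep at most `capacity` cores active; deactivate the least recently used one."""

    def __init__(self, capacity: int):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._active = OrderedDict()  # core name -> placeholder for a loaded core

    def use(self, core_name: str):
        """Mark a core as used, activating it first if needed.

        Returns the name of the core that was deactivated to make room, or None.
        """
        if core_name in self._active:
            self._active.move_to_end(core_name)  # now the most recently used
            return None
        evicted = None
        if len(self._active) >= self.capacity:
            evicted, _ = self._active.popitem(last=False)  # drop the oldest entry
        self._active[core_name] = object()  # hypothetical stand-in for loading the core
        return evicted

    def active_cores(self):
        """Active core names, least recently used first."""
        return list(self._active)
```

A pool with room for two cores that is asked for cores a, b, a and then c deactivates b, since a was used more recently.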

3.2.1 Scaling Lucene and Solr

Mark Miller is a Lucene/Solr committer and Apache member, and is well known for his work with Solr. He has written a post on the open source search community site SearchHub, titled “Scaling Lucene and Solr” [25]. In the post he explains how to get the most out of a single machine, as well as how to utilise multiple machines to handle a large index and large query volumes.

3.2.2 Solr Near Realtime Search

Near real time search is a Solr feature that makes documents available for search almost immediately after they are indexed. This means that when indexing 100 documents, users can search document 1 as soon as it is ready, and do not have to wait until all 100 documents are ready.

To do this Solr uses a soft commit instead of the usual commit, to avoid the parts of the standard commit that can be costly. A usual “hard” commit is still needed to ensure that documents are in stable storage, while soft commits allow users to see and search a very near real time version of the index during the indexing process. If something goes wrong, however, users can only be certain to have a consistent index up to the last “hard” commit [29].
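The difference between the two kinds of commit can be illustrated with a toy model. The sketch below is not how Solr is implemented; the class `ToyIndex` and its methods are hypothetical and only mimic the behaviour described above, where a soft commit controls what is searchable and a hard commit controls what survives a failure.

```python
class ToyIndex:
    """Toy model: soft commits control visibility, hard commits control durability."""

    def __init__(self):
        self._pending = []     # added but not yet searchable
        self._searchable = []  # visible to searches
        self._durable = []     # would survive a crash

    def add(self, doc):
        self._pending.append(doc)

    def soft_commit(self):
        """Make pending documents searchable without making them durable."""
        self._searchable.extend(self._pending)
        self._pending.clear()

    def hard_commit(self):
        """Make everything added so far both searchable and durable."""
        self.soft_commit()
        self._durable = list(self._searchable)

    def crash(self):
        """Simulate a failure: only the state of the last hard commit remains."""
        self._pending.clear()
        self._searchable = list(self._durable)

    def search(self):
        return list(self._searchable)
```

In this model a document added after the last hard commit becomes searchable after a soft commit, but is gone again after `crash()`.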

3.3 Overview of Cloud Technologies

Useful articles related to the definition and security of the cloud:

• The National Institute of Standards and Technology (NIST) has compiled a clear definition of the cloud in “The NIST Definition of Cloud Computing” [37].

• The European Network and Information Security Agency (ENISA) has analysed risks of cloud computing in “Cloud Computing Risk Assessment” [10].

• The Cloud Security Alliance (CSA) has written about security in the cloud in “Security Guidance for Critical Areas of Focus in Cloud Computing V2.1”.

• A very good article that goes through many interesting aspects of cloud computing is “A view of cloud computing” by Michael Armbrust et al. [7].


Chapter 4

The Cloud

“A cloud is a type of parallel and distributed system consisting of a collection of inter-connected and virtualized computers that are dynamically provisioned and presented as one or more unified computing resource(s) based on service-level agreements established through negotiation between the service provider and consumers.” [9]

4.1 Definition

It is not easy to say exactly what cloud computing is, but there have been a few attempts at a definition.

The basis of cloud computing is that the hardware, operating system and applications have been abstracted from each other into three separate virtual components. One possibility this enables is that an operating system component can easily migrate from one hardware component to another. This is not new, but cloud computing has constructed a concept to utilise it in a new way, to optimise the use of computing resources, including servers, applications, storage and network services.

This section discusses cloud approaches and provides an overview of the current state of the art. In the following sub-sections the cloud service and deployment models will be described.

Tim Mather et al. [20] choose to define the cloud by five characteristics, as does the Cloud Security Alliance (CSA) [27, 11], which follows the National Institute of Standards and Technology (NIST) definition in [37]. The five characteristics defined by [20] are multi-tenancy (shared resources), massive scalability, elasticity, pay as you go, and self-provisioning of resources, which differ a little from the NIST [37] definition. On closer examination, however, the fundamentals of the characteristics of Tim Mather et al. [20] and NIST [37] are the same. The NIST definition is more comprehensive, though, and is also used by CSA [27] and the European Network and Information Security Agency (ENISA) [11]. It is presented here in a summarised form:

According to NIST [37], cloud computing enables on-demand network access to a shared pool of configurable computing resources, including servers, applications, storage and network services, that can be provisioned and released with minimal management effort or interaction with the service provider. The five characteristics according to NIST are:

• On demand self service: A consumer of a cloud should be able to provision computing capabilities, including network storage and server time. This should be provided without human interaction.

• Broad network access: A cloud should be accessible over the network and through standard mechanisms such as thin or thick clients.

• Resource pooling: A cloud’s resources should be pooled to serve multiple consumers. Physical and virtual resources should be assigned and reassigned dynamically, according to the demand of the consumers. The consumer does not have control over the exact location of the resources that are provided, but may be able to specify regions, such as country or continent. Resources include processing, memory, storage and network bandwidth.

• Rapid elasticity: A cloud should be able to scale rapidly outward and inward in proportion to demand, and often appears to have unlimited resources.

• Measured service: A cloud automatically optimises and controls resources by taking advantage of a metering capability that is used to provide a pay per use service. This is done at an appropriate level of abstraction, depending on the type of service. The usage of resources can be controlled, monitored and reported in a way that provides transparency for both consumers and providers.
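Two of the characteristics above, rapid elasticity and measured service, lend themselves to a small numeric sketch. The functions below are illustrative assumptions, not part of the NIST definition: demand is expressed in hypothetical requests per second, each instance is assumed to handle a fixed load, and the price per instance hour is an invented parameter.

```python
import math


def instances_needed(demand_rps: float, capacity_per_instance_rps: float, minimum: int = 1) -> int:
    """Rapid elasticity: size the pool to the current demand, never below a floor."""
    if capacity_per_instance_rps <= 0:
        raise ValueError("capacity_per_instance_rps must be positive")
    return max(minimum, math.ceil(demand_rps / capacity_per_instance_rps))


def metered_cost(instance_hours: float, price_per_instance_hour: float) -> float:
    """Measured service: the bill follows metered usage, not the provisioned peak."""
    return instance_hours * price_per_instance_hour


def daily_bill(hourly_demand_rps, capacity_per_instance_rps, price_per_instance_hour):
    """Sum the cost of running just enough instances for each hour of demand."""
    hours = sum(instances_needed(d, capacity_per_instance_rps) for d in hourly_demand_rps)
    return metered_cost(hours, price_per_instance_hour)
```

For example, hourly demands of 50, 250 and 100 requests per second with 100 requests per second per instance need 1, 3 and 1 instances, so the consumer pays for 5 instance hours rather than for 3 instances during all three hours.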

4.1.1 The SPI model

Cloud service delivery can be categorised into three levels: Software, Platform and Infrastructure (as a Service) (SPI) [10, 27, 37]. This categorisation is therefore referred to as the SPI model [20, 27].


Described from a consumer’s perspective, IaaS would be the hardware and network. PaaS would provide the operating system and the capability to deploy consumer-created applications. SaaS would be applications that are ready to be used and are running on a cloud infrastructure, that is, on a collection of hardware and software that enables the characteristics in the cloud definition [37].

The three levels are built upon each other: IaaS is the foundation of all cloud services and is level 1, PaaS is built upon IaaS and is level 2, and SaaS is level 3 and is in turn built upon PaaS and IaaS [27].

The three levels are modularised, which means that a PaaS instance can run on any IaaS instance, and a SaaS instance can run on any underlying instance. If an instance in one level goes down, the instances that run on top of it can therefore easily be transferred to another instance. This is provided through hypervisor technologies that dynamically provision resource collections to meet a specified service-level agreement (SLA) [8]. This gives an idea of how the SPI model is structured. Figure 4.1 illustrates the SPI model, and is followed by more detailed descriptions of the three services.

Figure 4.1: Simplified illustration of the SPI model

Infrastructure as a Service (IaaS)

IaaS includes the entire infrastructure, from facilities to the hardware platforms inside them [27]. The consumer of the IaaS is able to manage and interact with the underlying infrastructure through a set of Application Program Interfaces (APIs) provided by the cloud provider [27]. These APIs give the consumer control over the operating system and applications; they enable the consumer to deploy applications and run arbitrary software [27, 37].

What differentiates IaaS from a traditional hosted application model with a dedicated server is that the cloud approach offers a pay per use model and the ability to scale computing resources, memory and storage depending on demand. This can be done at close to real time speed [20].

In summary, IaaS gives the consumer control over processing, storage and network resources, while the management and control of the underlying infrastructure is provided by the cloud provider [10, 11, 27, 20, 37].

Platform as a Service (PaaS)

PaaS is built on top of IaaS and adds a development environment for application developers, who can develop and distribute applications through a provider’s platform [10, 11, 20, 27, 37]. The consumer of a cloud is dependent on the programming tools and languages supplied by the cloud provider, which can be a constraint. Nor does the consumer control the underlying cloud infrastructure, including network, servers, operating system and storage. The consumer can, however, have control over the configuration of the hosting environment [27, 37], and has control over data tiers and scalability, which must be built in [20]. Actual scalability, reliability and security should be built into a PaaS solution by the developer [20].

PaaS has a low cost of entry since it supports pay per use, and enables deployment of applications without the complexity and cost of buying servers and setting them up [20].

In summary, PaaS offers the consumer a development platform and abstracts the consumer from the underlying infrastructure, network, servers, operating system and storage resources. It provides a platform where the consumer can develop, test and maintain the application in the same environment [11].

Software as a Service (SaaS)

SaaS provides the customer with the ability to utilise a provider’s application that is ready to be used and is running on a cloud infrastructure. SaaS is often distributed through a thin client such as a web browser, but the service is not bound to this, and it could be accessed through any authorised device [11, 20, 27, 37].


SaaS is built upon IaaS and PaaS, and is complete from a hardware, software and support perspective; SaaS delivers the entire user experience, including the content, its presentation, the application and management capabilities [20, 27]. SaaS is usually distributed through a pay per use model, or alternatively the customer can rent the software on a subscription, but the customer does not purchase the software.

SaaS provides an organisation with an opportunity to outsource hosting and management of applications to a third party, with the benefit that there is no need to preload software onto each device in the organisation, and that less software management is needed. This will reduce the cost of application software licensing, but in some cases there may be some preparatory work to establish company-specific data [20].

In summary, SaaS provides the consumer with software without the need to host or manage it. The consumer does not control or manage any of the underlying infrastructure, with the possible exception of limited user-specific application and configuration settings [10, 20, 27, 37].

4.1.2 Deployment Models

In this section we describe how clouds can be managed and deployed within organisations. There are a few ways to deploy a cloud implementation. It could for example be managed and maintained internally or externally to the organisation, which is often referred to as private and public clouds, but the line between the deployment models is a bit more complex than that. In the following sub-sections four of the most common deployment models are described.

Private Cloud

A private cloud is defined by being used by only a single organisation [11, 20, 27, 37]. Private clouds are built upon cloud computing principles, on premise or off premise, but they can only be accessed within the private network of the organisation, and the infrastructure is not shared with any other organisation [10, 11, 20, 27, 37]. According to [11] there are five main reasons why an organisation might choose a private cloud deployment model:

• Optimising utilisation of in-house resources that exist within the organisation.

• Concerns about security issues, including privacy and trust.

• Data transfer cost. Transferring data to a public cloud from local infrastructure could be expensive.

• The organisation requires full control over mission-critical activities behind its firewalls.

• Academic use for research and teaching.

Public Cloud

Public clouds are the most common deployment model [11]. A public cloud is hosted, operated and managed by a third-party vendor, and multiple organisations may subscribe to the same public cloud [10, 20, 27, 37]. Customers share a common infrastructure, and have little oversight of and a low degree of control over the physical and logical security of the public cloud. Security management and supervision of the cloud is therefore relegated to the cloud provider [20], who has full ownership of the cloud, with its own policy, profit, value, costing and charging model [11].

Community Cloud

A community cloud is described by [20] as a variant of a private cloud, with the difference that a community cloud is provided by a vendor who is bound by a custom SLA. In contrast, [10, 11, 27, 37] argue that a community cloud is shared among a defined set of organisations with shared concerns such as security requirements, policy and other compliance considerations. [11, 27, 37] also argue that a community cloud does not have to be hosted by a third party; it could also be hosted and managed by one or more of the organisations that share the cloud.

Hybrid Cloud

A hybrid cloud is a combination of two or more cloud deployment models that are distinct, but are bound together by in-house developed or standardised technology that enables data and application portability [11, 27, 37]. One example of use could be cloud bursting [11, 27, 37], which means that an application runs on a private cloud but bursts into a public cloud when the demand for computing capacity peaks [41]. Another example is organisations that run less sensitive applications in a public cloud while they maintain sensitive applications in house, in a private cloud, in order to optimise their resources [11, 20].
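Cloud bursting can be reduced to a simple placement rule: serve demand from the private cloud until its capacity is reached, and send only the overflow to the public cloud. The sketch below illustrates that rule; the function name `place_workload` and the abstract "units" of demand and capacity are assumptions made for the example.

```python
def place_workload(demand_units: int, private_capacity_units: int) -> dict:
    """Split demand between a private cloud and a public cloud (burst on overflow)."""
    if demand_units < 0 or private_capacity_units < 0:
        raise ValueError("demand and capacity must be non-negative")
    private = min(demand_units, private_capacity_units)
    public = demand_units - private  # only non-zero when the private cloud is full
    return {"private": private, "public": public}
```

With a private capacity of 100 units, a demand of 80 stays entirely in the private cloud, while a demand of 130 places 100 units privately and bursts 30 units into the public cloud.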


4.2 Security

4.2.1 Responsibility of Security in the SPI model

In SaaS, security is generally integrated into the solution. Consumers of a SaaS can only make small configuration changes themselves, and the service has limited scalability; this leaves the responsibility for security to the cloud service provider [27].

In PaaS, developers are able to build their own applications as they like on top of the service provided. This gives PaaS users more possibilities, at the expense of security features integrated by the provider, since the consumer is responsible for the security within their own application. Consumers can, however, add additional security themselves, and customise security to their own requirements [27].

IaaS provides even fewer integrated application security features, if any. The consumer of an IaaS has great extensibility options, but takes on all responsibility for security beyond protecting the infrastructure itself. The responsibility of the consumer then includes managing the security of the operating system and applications [27].

The conclusion that can be drawn from this is that less abstraction means more responsibility for the consumer; the further down in the SPI model, the more responsibility for management and security lies with the consumer [7, 27].

4.2.2 Location of Data

Access and application data management do not have the same meaning and have different regulation requirements depending on the location of the data [15, 20, 27]. Laws differ between countries; some are based on the location of the organisation, some on the physical location of the data center storing the data. This makes it hard to determine whose jurisdiction the data falls under [20, 35].

To determine the exact location of data is very difficult due to the dynamic nature of a cloud, but it may be possible to restrict the location of data to larger regions [20, 27]. When using Windows Azure it is possible to restrict data to four different regions within the U.S., two in Europe and two in Asia [22]. Amazon offers to restrict data to four different regions within the U.S., one in South America, one in Europe and two in Asia [12].

Legislation on handling data in the cloud has not yet caught up with the technology [20]. This has to be considered before migrating potentially sensitive corporate data into the cloud.

4.2.3 Confidentiality and Integrity of Data

Data integrity is critical in a system. It is easily achieved in a traditional standalone system with a single database, where integrity is maintained via transactions that follow the Atomicity, Consistency, Isolation and Durability (ACID) properties [35]. This gets more complicated in a cloud, which can be seen as a distributed system with transactions across multiple data sources and databases. To handle transactions involving multiple databases, a central global transaction manager can be used, so that when a transaction is in the pipeline, each application is able to participate via a resource manager [35]. Transactions and guaranteed delivery are not supported by the Hyper Text Transfer Protocol (HTTP) at the protocol level, so these have to be implemented at the API level, for example by using Message Authentication Codes (MAC) [20, 35].
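
As an illustration of a MAC at the API level, the following minimal sketch signs a request body with HMAC-SHA256 from the Python standard library and verifies it on the receiving side. The key, the payload and the function names are assumptions made for the example; they are not taken from any of the cited sources.

```python
import hashlib
import hmac

# Hypothetical shared secret; in practice it would be provisioned out of band.
SECRET_KEY = b"example-shared-secret"


def sign(message: bytes, key: bytes = SECRET_KEY) -> str:
    """Return a hex-encoded HMAC-SHA256 tag for the message."""
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def verify(message: bytes, tag: str, key: bytes = SECRET_KEY) -> bool:
    """Recompute the tag and compare in constant time to detect tampering."""
    return hmac.compare_digest(sign(message, key), tag)


# Hypothetical request body sent together with its tag.
payload = b'{"id": "doc-1", "title": "Solr and the cloud"}'
tag = sign(payload)
```

The receiver calls verify(payload, tag) and rejects the request if the result is False. Note that a MAC only detects modification; it does not by itself provide transactions or guaranteed delivery.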

To ensure confidentiality of data, a tested encryption scheme should be used to encrypt data in transit over networks, data at rest and data in backup media. Since resources are shared in the cloud, it may also be a good idea to encrypt dynamic data, such as data in memory [16, 20, 27].

One reason why data in transit over the network should be encrypted is that data is spread over a wide range of components. There may be tenants sharing the same components who try to collect information about other tenants' data [27].

Encrypting data at rest, such as data on disk or in a live production database, protects against other tenants sharing the same cloud, intruders and even the cloud provider. This can for example be achieved by encrypting data before sending it into the cloud, so that the only data transferred to and stored in the cloud is the ciphertext of the original data [27].

Data in backup media should be encrypted because this protects against misuse of data in case of intrusion or theft. This is often done by the cloud provider, but it is the consumer's responsibility to verify that it is actually implemented by the provider [27].

4.2.4 Privacy

The American Institute of Certified Public Accountants (AICPA) and the Canadian Institute of Chartered Accountants (CICA) have defined privacy in the Generally Accepted Privacy Principles (GAPP) standard [4] as:

”The rights and obligations of individuals and organizations with respect to the collection, use, retention, disclosure and disposal of personal information”

The concept of privacy is not the same everywhere; it can vary among or within countries, cultures and jurisdictions [20]. Privacy is therefore closely related to the location of data (section 4.2.2): the difficulty of restricting and exactly controlling data location in cloud computing brings a risk of violating local laws [20].

Another possible problem resulting from the architecture of a cloud, where data is replicated across multiple systems and sites to provide increased availability, is destroying data [20]. Since it is known to be hard to truly destroy stored data on a disk, it may be very hard to ensure that all information has been deleted from the cloud.

When it comes to responsibility for protecting the privacy of data, it may not be obvious whether it is the provider of the cloud or the consumer who is responsible. In the eyes of the law, liability lies with the organisation that collects the data [20]. On the Windows Azure website you can read the following [38]:

”An end user should direct privacy-related requests to the entity providing a service to the end user. Microsoft is not responsible for the privacy practices of our customers using our Services.”

And:

”Some data may be particularly sensitive to you or your organization or be subject to specific regulatory requirements. You are responsible for determining whether our security meets your requirements.”

The conclusion is that a thorough investigation and analysis of the data in question should be carried out before putting the data into a cloud. This is to be on safe ground and avoid being caught in an unpleasant situation, since it is up to the consumer of a cloud to make sure the cloud meets the desired security requirements.

4.2.5 Vendor Lock In

The cloud computing industry is relatively young, so a standardised API has not yet taken a steady hold of the market, and many providers have their own APIs for managing their service [6, 20]. The lack of a standardised API for cloud services is a concern that prevents some organisations from adopting cloud solutions, since it can cause organisations to get locked into a specific vendor [6].

SaaS may have the most obvious risk of lock-in, where data can be designed specifically for the provided software, although this is not specific to cloud computing [20].

PaaS may lock in consumers in the API layer and in the component layer. Consumers may not only be bound to use custom APIs of the provider, but may also be locked in to the provider's back end data store services. It is the responsibility of the consumers of a PaaS to develop their own solution to export data from the applications they have developed [20].

Lock-in into IaaS varies depending on the service consumed. If for example consuming cloud storage, the consumer will not be affected by incompatible virtual machine formats, but instead by data lock-in [20]. IaaS often uses hypervisor based Virtual Machines (VMs), which bundle together VM metadata and software within a cloud provider, making it hard to migrate to another cloud provider. This can be prevented if cloud providers agree to use solutions like Open Virtualization Format (OVF) [20].

It may seem attractive for cloud computing providers to lock in their customers. Providers may fear a significant price drop if they make it too easy for consumers to migrate between them. But it is not only the price that matters; it is also the quality of the provided service, for example how easy it is to use and how reliable it is. Standardisation of cloud computing may also enable use of the same software infrastructure in a private cloud and in a public cloud, which could be used to capture heavy workloads [6] and get more organisations to use cloud solutions.

4.2.6 Availability

On the Windows Azure home page it is stated that Microsoft guarantees connectivity 99.95% of the time when two or more role instances are deployed in different upgrade and fault domains [39]. What this means in reality is hard to grasp. Table 4.1 below shows availability in percent converted to downtime in HH:MM:SS per day, month and year. The table shows that small changes of downtime in percent give significant differences of downtime in actual time.
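
The conversion from an availability percentage to downtime can be sketched as follows, here applied to the 99.95% figure quoted above. This is a minimal sketch: the 30-day month and rounding to the nearest second are assumptions of the example, so results may differ by a second or so from tables that use other conventions.

```python
DAY_SECONDS = 24 * 60 * 60
MONTH_SECONDS = 30 * DAY_SECONDS  # assumption: a 30-day month


def downtime_seconds(availability_percent, period_seconds):
    """Seconds of allowed downtime in a period at the given availability."""
    return (100.0 - availability_percent) / 100.0 * period_seconds


def as_hms(seconds):
    """Format a duration as HH:MM:SS, rounded to the nearest second."""
    total = round(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


per_day = as_hms(downtime_seconds(99.95, DAY_SECONDS))      # "00:00:43"
per_month = as_hms(downtime_seconds(99.95, MONTH_SECONDS))  # "00:21:36"
```

Under these assumptions, 99.95% availability corresponds to roughly 43 seconds of downtime per day, or about 21 and a half minutes per 30-day month.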

Availability is key to many organisations [7]. Even if a cloud provider has data centers all over the world and uses different network providers, it may go down because common software and infrastructure fail, or the provider may even go out of business [7]. The latter happened to the cloud provider Coghead in February 2009, giving its customers nine weeks to get their data off its servers before it would be lost [20]. Both Amazon and Azure have also suffered downtime due to technical issues [21, 23].

               Total downtime (HH:MM:SS)
Availability   Per day      Per month   Per year
99.999%        00:00:00:4   00:00:26    00:05:15
99.99%         00:00:08     00:04:22    00:52:35
99.9%          00:01:26     00:43:49    08:45:56
99%            00:14:23     07:18:17    87:39:29

Table 4.1: The table shows the availability in percent converted to downtime in HH:MM:SS, per day, month and year [20].

To ensure availability there should not be a single point of failure. Windows Azure addresses this problem by offering data replication across geographically diverse data centers, and automatic failover in the occurrence of a failure in a layer of the Azure platform infrastructure [36]. Windows Azure calls its solution an ”Availability set”, which is directly related to fault domains and update domains. Windows Azure defines a fault domain in terms of single points of failure, such as the network switch or power unit of a rack of servers, so a fault domain corresponds closely to a rack of physical servers. The availability set is used to ensure that the machines running an application are located in different fault domains. It is also used to ensure that the machines are located in different update domains. When a Windows Azure VM is updated, which happens periodically, it is shut down to apply the upgrade. The update domain ensures that not all machines are updated at the same time [18]. An illustration of an availability set can be seen in Figure 4.2.

Figure 4.2: Windows Azure availability set [18]

But having just one provider means having a single point of failure from a consumer perspective [7], as seen in the examples given above. One solution to this problem is having more than one cloud provider, just as the providers have multiple data centers and network providers. This can give very high availability, and should be considered if availability is key to the organisation.

4.2.7 Centralisation of Data

A cloud typically manages huge volumes of data, and therefore becomes a desirable target for cyber-crime and attacks. A cloud collects its consumers under the same roof, so instead of attacking each company one by one, an attacker can hit several sites simultaneously in one single attack [16].

4.2.8 Shared Resources

When using a public cloud, consumers share resources with each other in a physical infrastructure. [24] have shown that this can result in new vulnerabilities, by showing that it is possible to map the internal infrastructure of a cloud, in this case Amazon EC2. [24] claim to be able to identify a chosen target's VMs and extract information from those VMs.

4.3 Pros and Cons

4.3.1 Pros

Scalability: The ability to scale computing resources, which can be done both horizontally and vertically in a cloud. This means that it is possible to resize memory and storage on each ”box” based on usage requirements. It is also possible to resize the cluster of ”boxes”, adding more if more computing power is needed, or putting some of them to sleep when the power requirements are low [20]. This can be done at close to real-time speeds in a cloud, making the cloud a very dynamic infrastructure.

Cover peaks: Traditionally, resources are static, which can lead to complications if demand for a service varies with time. As seen in Figure 4.3, this can result in lost revenue, since customer requests cannot be met at demand peaks. Adding static capacity to cover these peaks can leave a lot of resources unused when demand is low, as illustrated in Figure 4.4 a). Cloud computing provides the ability to fit the allocation of resources dynamically to the demand at a given time, as seen in Figure 4.4 b), streamlining the usage of resources.

Figure 4.3: Illustration of uncovered demand peaks. The grey areas illustrate lost revenue [28].

Figure 4.4: The grey areas illustrate unused resources. In a) is a statically provisioned data center, b) is a virtual data center in a cloud [28].

Pay as you go: Cloud computing services come with pay as you go offerings. This means that you pay for exactly what you use. In combination with the scaling opportunities, this means that consumers do not pay for unused resources as they may have to do with static provisioning, such as in Figure 4.4 a).

Best of breed technology and resources: When outsourcing infrastructure to a cloud provider, organisations get access to best of breed technology for a fraction of the cost of buying it themselves [20].

Low cost of entry: Cloud services allow consumers to develop and deploy applications without the cost and complexity of setting up their own servers. Cloud computing also offers the opportunity to start at a small scale and then add resources as demand grows. This gives a low cost of entry for developers [20].

Maintenance and security: Maintenance and parts of security are outsourced to the cloud provider, such as physical security at the data centres, administrative and operational staff, and monitoring of servers, software and network. As described in section 4.2.6, cloud providers also offer extra functionality; for example, Windows Azure offers automatic failover in the occurrence of a failure in a layer, and cryptographic protection of messages within Azure with Transport Layer Security (TLS), which can also be applied on demand between end users and VMs in Azure [36]. Even though a cloud provider has features to back up stored data, it may be wise for an organisation not to rely totally on one provider; as discussed in the availability section, having just one provider can be seen as a single point of failure.

Modularised: In cloud computing, hardware, operating system and applications have been abstracted from each other into three separate virtual components (see the SPI model, section 4.1.1). One example of what this enables is that an operating system component can easily migrate from one hardware component to another, which contributes to the scaling characteristics of the cloud.

4.3.2 Cons

Sharing resources: Using shared resources can introduce new vulnerabilities, as described in the shared resources section 4.2.8, where co-tenants can locate and extract information from targeted VMs.

Performance unpredictability: Shared resources can also lead to unpredictability in performance. [6] claim that multiple VMs can share Central Processing Unit (CPU) and memory well with each other; the problem is Input/Output (I/O) interference. Tests carried out by [6] showed a mean disk write bandwidth of 55 MBytes/sec with a standard deviation of about 9 MBytes/sec. That is more than 16% of the mean (9/55 ≈ 0.164), which is significant. One possible solution is to improve interrupts and I/O channels in the cloud architecture and operating system. Another possibility is that flash memory will decrease I/O interference in the future [6]. However, this unpredictability does not only apply to cloud computing, but to all solutions with shared resources.

Location of data: It is hard to determine the exact location of data due to the dynamic nature of a cloud. It is possible to restrict data to larger regions, but this may not be enough. Laws differ between countries, and it may be very difficult to determine whose jurisdiction should be followed, as described in section 4.2.2.

Ensuring privacy: Closely related to the location of data is the issue of ensuring privacy in the cloud. Since laws are not the same everywhere, putting privacy related data into a cloud can result in a risk of violating local laws, as described in section 4.2.4.

Destroying data: In order to, for instance, provide increased availability, data is replicated across multiple systems and sites of a cloud. This makes it hard to destroy data, as described in section 4.2.4, which in turn makes it hard to ensure that all private information is deleted, for example upon request.

Vendor lock in: Many providers use their own APIs, and there is not yet agreement on a standardised API used by all providers. This means that data can be locked in to a specific provider, as described in section 4.2.5.

Centralisation of data: A cloud centralises great amounts of data, and can therefore become a desirable target, as described in section 4.2.7.

Chapter 5

Implementation

5.1 Theory

5.1.1 Windows Azure

Microsoft’s cloud platform Windows Azure can be used in many ways, and it provides IaaS, web hosting and PaaS [14]. To be able to build the prototype on the Windows Azure platform, it is necessary to understand its structure. This section therefore gives a brief description of Windows Azure.

Execution Models

To execute applications Windows Azure provides several options, and more have been developed during the short time it has been used for this report. Below are descriptions of the two services that have been considered for the implementation of the prototype: VMs and cloud services.

Virtual Machines

IaaS is provided through VMs in Windows Azure, which can be created on demand, with a limitation of twenty CPU cores per Windows Azure subscription, from a standard image that is provided or from a self-supplied image. When choosing the first option, Windows Azure offers multiple Virtual Hard Disk (VHD) images, such as Windows Server 2008 R2, Windows Server 2012, Windows Server 2008 R2 with SQL Server and also Linux images [14].

A user can store changes while running a VM and shut it down; since the changes are stored to the VHD, the next time the VM is created everything is as it was just before it was shut down [14].

CHAPTER 5. IMPLEMENTATION Solr and the cloud

The VMs are paid for by the hour; the different size options and their prices can be seen in Table 5.1. The price of a Windows VM includes Windows licensing costs, whereas the price of a non-Windows VM does not; it instead allows separately licensed deployments of non-Windows host operating systems [14].

Compute         CPU      Memory   Bandwidth    Supported    Price Windows    Price non-Windows
instance size   cores                          data disks   VM/hour          VM/hour
                                                            (General         (General
                                                            Availability)    Availability)
Extra small     shared   768 MB   5 Mbps       1            $0.02            $0.02
Small           1        1.75 GB  100 Mbps     2            $0.115           $0.085
Medium          2        3.5 GB   200 Mbps     4            $0.23            $0.17
Large           4        7 GB     400 Mbps     8            $0.46            $0.34
Extra large     8        14 GB    800 Mbps     16           $0.92            $0.68

Table 5.1: Size and price information of VMs in Windows Azure [14, 40].

Since this is IaaS the abstraction level is low, and users have to handle a large part of the administration themselves, but it also gives users more control and flexibility. More about IaaS can be found in 4.1.1.
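
As a small worked example of the hourly pricing, the sketch below computes the cost per hour of a cluster of Windows VMs from the prices in Table 5.1. The dictionary and function names are chosen for the example only.

```python
# Hourly prices for Windows VMs (General Availability), taken from Table 5.1.
WINDOWS_VM_PRICE_PER_HOUR = {
    "extra small": 0.02,
    "small": 0.115,
    "medium": 0.23,
    "large": 0.46,
    "extra large": 0.92,
}


def cluster_cost_per_hour(size, vm_count):
    """Hourly cost in USD of vm_count Windows VMs of the given size."""
    return WINDOWS_VM_PRICE_PER_HOUR[size] * vm_count


# Two clusters with the same total number of cores (16).
sixteen_small = cluster_cost_per_hour("small", 16)
four_large = cluster_cost_per_hour("large", 4)
```

With these prices, 16 small VMs and 4 large VMs both come to about $1.84 per hour, since the price in Table 5.1 scales linearly with the number of cores from small to extra large.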

Cloud Services

This is a PaaS solution provided by Windows Azure; more about PaaS can be found in 4.1.1. Microsoft’s solution is designed to support scalable, low-administration applications that can be developed using a range of different technologies such as C#, Java, PHP, Python and Node.js. The developed applications then execute in VMs that run a version of Windows Server. The VMs used in a cloud service are referred to as instances, and are distinct from the VMs discussed in the previous section, because Windows Azure manages the instances itself. This means that Azure handles things like operating system patches and the roll-out of newly patched images, and also monitors the VMs and restarts them if they fail [14].

There are two kinds of roles that an instance can have; as mentioned, both are based on Windows Server. The roles are named web role and worker role, and the difference between them is that web role instances run Internet Information Services (IIS), and worker roles do not. This means that web roles can take requests from users, and worker roles cannot. To scale an application, developers can request that more instances are created or that instances are shut down. This is easily done in the Windows Azure portal, a Graphical User Interface (GUI) for administration of Windows Azure services [14]. There is a limit of 25 roles per cloud service, but if more roles are needed, cloud services can be connected via a virtual network and together form a cloud service with more than 25 roles [40].

Affinity Group

VMs in a Windows Azure subscription often work together, and when they do it is desirable to have the machines physically close to each other to get optimal performance out of the collaboration. If they are located far away from each other, the latency of the communication between the VMs increases [2].

To ensure that collaborating VMs are located close to each other, Windows Azure provides the opportunity to create an affinity group. If services are placed in the same affinity group, Windows Azure knows that these should be kept within the same data center cluster, and will locate them as close to each other as possible when they are deployed [2].

5.1.2 Solr 4.0 and SolrCloud

Solr

Solr is an open-source search platform from the Apache Lucene project. Included functionality comprises powerful full-text search, hit highlighting, faceting, dynamic clustering, database integration and rich document (e.g. Word, PDF) handling. Solr is very scalable, and it is the engine for search and navigation functionality on many large websites around the world, such as apple.com and DN.se [30]. Solr is written in Java and runs as a standalone full-text search server in a servlet container such as Tomcat [5].

Solr uses a REST-like HTTP/XML and JSON API, which makes it compatible with virtually any programming language. In the implementation for this thesis, SolrJ has been used to communicate with the Solr instances. SolrJ is an API that makes it easy for a Java client to communicate with Solr. Documents are indexed into Solr via XML, JSON, CSV or binary over HTTP. Queries are sent via HTTP GET and return XML, JSON, CSV or binary results [5].

SolrCloud

SolrCloud came with Solr 4.0, which was released on 12 Oct 2012 [31]. It should not be mistaken for a new cloud vendor; what SolrCloud does is provide Solr with tools that make it easier for developers to set up a cluster of Solr servers that is fault tolerant and highly available [32]. New features provided by SolrCloud, as summarised by [5]:

• Centralized Apache ZooKeeper based configuration

• Automated distributed indexing/sharding - send documents to any node and they will be forwarded to the correct shard

• Near Real-Time indexing with immediate push-based replication (also support for slower pull-based replication)

• Transaction log ensures no updates are lost even if the documents are not yet indexed to disk

• Automated query fail over, index leader election and recovery in caseof failure

• No single point of failure

SolrCloud Glossary

To get a better understanding of SolrCloud, a walk through of the SolrCloud glossary is needed. Here are some terms used when talking about SolrCloud, with explanations from [32]:

Schema: An XML file that describes how incoming data should be indexed into Solr.

Solrconfig: An XML file that describes the configuration of Solr.

Node: A JVM instance running Solr.

SolrCore: An individual Solr instance, represents a logical index.

Cluster: A cluster is a set of Solr nodes managed as a single unit. The entire cluster must have a single schema and configuration information stored in solrconfig.

Collection: Multiple documents that make up one logical index.

Document: A group of fields and their values. Documents are the basic unit of storage, and their specific locations are found using an index. Documents are assigned to shards using standard hashing. Documents are versioned after each write operation.

Partition: A partition is a subset of the entire document collection. A partition is created in such a way that all of its documents can be contained in a single index.

Shard: A partition may be stored in multiple nodes as specified by the replication factor. All these nodes collectively form a shard. Each shard can have 1 leader that handles indexing, and N replicas that are used for handling queries.

Figure 5.1: Illustration of a shard

Overseer: The Overseer coordinates the clusters. It keeps track of the existing nodes and shards and assigns shards to nodes.

Leader: Each shard has one node identified as its leader. All the writes for documents belonging to a partition are routed through the leader.

ZooKeeper: Apache ZooKeeper keeps track of configuration and naming, among other things, for a cluster of Solr nodes. A ZooKeeper cluster is used as the central configuration store for the cluster, a coordinator for operations requiring distributed synchronization, and the system of record for cluster topology.

Ensemble: Multiple ZooKeeper instances running simultaneously.
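
The glossary states that documents are assigned to shards using hashing. The idea can be illustrated with the simplified sketch below, which maps a document id to a shard index by hashing the id and taking the remainder modulo the number of shards. This is an illustration only: MD5 is used because it is available in the Python standard library, and neither the hash function nor the modulo scheme is claimed to be what Solr itself uses.

```python
import hashlib


def shard_for(doc_id: str, num_shards: int) -> int:
    """Map a document id to a shard index in the range [0, num_shards)."""
    digest = hashlib.md5(doc_id.encode("utf-8")).digest()
    # Interpret the first four bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:4], "big") % num_shards


# Hypothetical document ids distributed over four shards.
doc_ids = [f"doc-{i}" for i in range(1000)]
assignment = {doc_id: shard_for(doc_id, 4) for doc_id in doc_ids}
```

Because the shard index depends only on the document id, any node that receives a document can work out which shard it belongs to and forward it there, which matches the automated distributed indexing feature listed above.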

To summarise the concept of SolrCores and collections: a single instance of Solr has a SolrCore that can be described as a single index. When using SolrCloud, a single index can span multiple Solr instances, which means that a single index can be made up of several SolrCores on different Solr instances. Multiple Solr instances can be run on a single machine or on several machines. All the SolrCores that together make up a single logical index are called a collection; a collection is a single index that spans many SolrCores [33].

5.2 The Implementation

This implementation is intended to analyse how scaling of the underlying infrastructure and sharding affect the performance of the indexing process of Solr. To examine this, Solr 4.0 and SolrCloud have been used and deployed on Microsoft’s cloud platform Windows Azure. More specifically, Windows Azure's IaaS solution with VMs has been used to get full control of the implementation.

5.2.1 Setting up the VM Cluster

The first thing that was done in the process of setting up the cluster of VMs with Windows Server 2008 R2 for this implementation was creating a virtual network. This was done through the Windows Azure portal, which is a GUI for administration of services provided by Windows Azure. The VMs were created through the same portal and connected to the virtual network. This implies that the VMs are connected to the same affinity group, to ensure they are as close to each other as possible.

Configurations had to be made to the VMs to fit the needs of the implementation. To avoid configuring each VM one by one, the configuration was first done on one VM. An image was then created from that VM and used to create all the other VMs, which were then ready to be used, with the exception that all endpoints (communication ports) had to be added specifically for each VM. This was done through a Windows PowerShell script that can be found in appendix A. The following configurations were made to the image VM:

• Allow ICMP in the firewall for both IPv4 and IPv6, to be able to ping other VMs in the cluster (see http://blogs.biztalk360.com/windows-azure-virtual-machines-virtual-network-may-not-ping-automatically/ for how this is done).

• Install Java jre7 and allow Java through the firewall.

• Install Solr 4.0.

• Add ”<VM internal IP address> <VM host name>” to C:\Windows\System32\drivers\etc\hosts. The full address of a Windows Azure VM is <VM host name>.cloudapp.net, but ZooKeeper stores only the <VM host name> part of the address. This entry therefore enables the Solr instances to communicate with each other when they are located on separate VMs.

When the VM cluster is set up, ports for Solr communication should be added as endpoints. This is done with the PowerShell script found in appendix A. Endpoints are needed for external communication with Solr; to be able to connect to the administration GUI of a Solr instance, an endpoint needs to be opened to the port number that the Solr instance runs on.

5.2.2 Setting up Solr

The basic set up of Solr is illustrated in Figure 5.2. All test set ups have in common that they run one internal ZooKeeper. For a production set up it is not advised to run ZooKeeper internally, because ZooKeeper does not currently allow the nodes in an ensemble to be changed dynamically without a restart. But when SolrCloud is in a steady state, which it is when neither the configuration nor the number of nodes in the cluster is changing, Solr does not communicate much with ZooKeeper, other than for very lightweight operations. This is confirmed by Mark Miller, one of the developers of SolrCloud, in [19]. Running ZooKeeper internally should therefore not affect the outcome of these tests.

Figure 5.2: Illustration of how Solr is set up

The decision to run just one ZooKeeper internally within the first started Solr instance was made to simplify the set up, and so that each Solr set up would have the same ZooKeeper set up.

Chapter 6

Test set up

6.1 JVM heap size

The memory consumption of a Java Virtual Machine (JVM) running Solr was controlled by -Xms and -Xmx, which can be given as input parameters when starting up a Solr instance. These JVM options control the heap size of a JVM: -Xms sets the initial heap size and -Xmx sets the maximum heap size. The heap size for each set up was chosen depending on the memory size of the VM, and set to the largest power of two that fits into the VM's memory. So if a small VM with a memory size of 1.75 GB was used, the heap size was set to 1024 MB. If a large VM with a memory size of 7 GB was used, the heap size was set to 4096 MB. For the tests where more than one shard was running on the same VM, the shards divided the memory among them, so if two Solr instances were sharing a small VM each instance got (1024/2) 512 MB of memory, four instances got (1024/4) 256 MB of memory and so on.
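
The heap size rule described above can be sketched as follows. The function names are chosen for the example, and setting -Xms and -Xmx to the same value is an assumption of the sketch; the text only states which value was chosen for the heap size.

```python
def largest_power_of_two_mb(memory_mb):
    """Largest power of two (in MB) that does not exceed the VM memory."""
    power = 1
    while power * 2 <= memory_mb:
        power *= 2
    return power


def heap_per_instance_mb(vm_memory_gb, instances=1):
    """Heap size in MB for each Solr instance sharing one VM."""
    total_heap_mb = largest_power_of_two_mb(vm_memory_gb * 1024)
    return total_heap_mb // instances


def jvm_heap_flags(heap_mb):
    """JVM options for the heap; assumes initial and maximum size are equal."""
    return [f"-Xms{heap_mb}m", f"-Xmx{heap_mb}m"]


small_vm_heap = heap_per_instance_mb(1.75)        # 1024
large_vm_heap = heap_per_instance_mb(7)           # 4096
two_on_small = heap_per_instance_mb(1.75, 2)      # 512
flags = jvm_heap_flags(two_on_small)              # ["-Xms512m", "-Xmx512m"]
```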

6.2 Threads

To utilise the full capacity of the underlying infrastructure and use more than one CPU core at a time, multiple threads need to be used. Solr is multithreaded, but to benefit from this, Solr needs to be called from multiple clients, since Solr runs as many threads as there are calling clients.


6.3 Client

The client is the one sending documents to the Solr cluster (see 5.2), and is set up on a large Windows Azure VM. The client is multi threaded, and each thread has its own connection to Solr, so that Solr's multi threaded feature is used. The code of the client can be found in appendix B.

6.4 Test Documents

The documents used for the tests are 5000 Solr documents that are created in the client; the code of the client can be found in appendix B.
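How the client in appendix B divides the 5000 test documents among its threads can be sketched as follows. The helper class is hypothetical and stands apart from the client code; it only mirrors the integer division and remainder handling done there, without any connection to Solr.

```java
import java.util.Arrays;

// Hypothetical helper; mirrors the document split done in ThreadIndexer.java (appendix B).
final class DocSplit {

    // Every thread gets totalDocs / numThreads documents;
    // the last thread additionally gets the remainder.
    static int[] docsPerThread(int totalDocs, int numThreads) {
        int[] counts = new int[numThreads];
        Arrays.fill(counts, totalDocs / numThreads);
        counts[numThreads - 1] += totalDocs % numThreads;
        return counts;
    }

    public static void main(String[] args) {
        // Thread counts used in the tests (see appendix C).
        for (int threads : new int[] {1, 2, 4, 8, 16, 32, 64, 128}) {
            int[] counts = docsPerThread(5000, threads);
            System.out.println(threads + " threads: first thread " + counts[0]
                    + " docs, last thread " + counts[threads - 1] + " docs");
        }
    }
}
```

With thread counts that do not divide 5000 evenly, the last thread indexes a few more documents than the others.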

6.5 Base-line

To get a default value to compare all the other set ups to, Solr was first set up with one shard on a small Windows Azure VM that has 1 core and 1,75 GB memory. The test documents were then indexed onto this set up using 1 thread, and the resulting default value became 86239 ms.

6.6 Tests

6.6.1 Scaling infrastructure

To investigate how the underlying infrastructure affects the indexing process time of Solr, the VMs were set up and configured in the following way:

1. Change VM size from small (1 core and 1,75 GB memory) to large (4 cores and 7 GB memory), with Solr set up on one VM and one shard.

2. Set up the cluster with 1, 2, 4, 8 and 16 small VMs, with one shard on each VM.

3. Set up the cluster with 1, 2 and 4 large VMs, with one shard on each VM.

Because of the limitation of twenty cores per subscription (see 5.1.1) there can be no more than 4 large VMs.


6.6.2 Sharding

To analyse how sharding affects the indexing process time the following tests were performed:

1. Set up 1 small VM with 1, 2, 4, 8 and 16 shards.

2. Set up 1 large VM with 1, 2, 4, 8 and 16 shards.

6.6.3 Combination of Scaling Infrastructure and Sharding

1. Set up 2 and 4 large VMs with 1, 2, 4, 8 and 16 shards.

2. Set up 1 and 2 extra large VM(s) with 4 shards.

6.6.4 Collecting Data

Each test configuration is run 3 times, and the average time of these runs is used as the result. The tables for the collected data can be found in appendix C.
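As a small worked example of this averaging, the sketch below uses the three base-line runs listed in Table C.1 (small VM, 1 shard, 1 thread). The class and method names are illustrative only.

```java
// Illustrative sketch: average indexing time over the runs of one test configuration.
final class RunAverage {

    // Arithmetic mean of the measured run times in milliseconds.
    static double averageMs(long... runsMs) {
        long sum = 0;
        for (long run : runsMs) {
            sum += run;
        }
        return (double) sum / runsMs.length;
    }

    public static void main(String[] args) {
        // The three runs for 1 small VM, 1 shard, 1 thread from Table C.1.
        double baseline = averageMs(86203, 85640, 86874);
        System.out.println("Average time: " + baseline + " ms");
    }
}
```

The three runs average to 86239 ms, which is the base-line value given in section 6.5.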


Chapter 7

Results and Analysis

7.1 Scaling of Infrastructure

Test 1

Figure 7.1: Speed up on 1 small VM compared to 1 large VM

In Figure 7.1 a small Windows Azure VM is compared to a large Windows Azure VM, both set up with one shard. Speed up is the ratio between the base-line and the execution time of the set up. The graph of the small VM reaches a speed up of 4,1, and the graph of the large VM reaches a speed up of 6,5. This means that the increase from 1 core and 1,75 GB of memory to 4 cores and 7 GB of memory raised the speed up of Solr by 2,4.
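The speed up values can be reproduced from the averages in appendix C. The sketch below is illustrative: the class and method names are assumptions, and the two execution times are the 8-thread averages for one shard from Table C.1, which is assumed to be where the graphs in Figure 7.1 peak.

```java
// Illustrative sketch: speed up as the ratio between the base-line and an execution time.
final class SpeedUp {

    // Base-line from section 6.5: 1 small VM, 1 shard, 1 thread.
    static final double BASELINE_MS = 86239.0;

    static double speedUp(double executionMs) {
        return BASELINE_MS / executionMs;
    }

    public static void main(String[] args) {
        double small = speedUp(21078.0); // 1 small VM, 1 shard, 8 threads (Table C.1)
        double large = speedUp(13187.0); // 1 large VM, 1 shard, 8 threads (Table C.1)
        System.out.printf("small VM: %.1f, large VM: %.1f%n", small, large);
    }
}
```

Rounded to one decimal this gives the 4,1 and 6,5 quoted above.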

Test 2

Figure 7.2: Speed up n small VMs using n shards

In Figure 7.2 one small VM is compared to n small VMs, where n ∈ [2, 4, 8, 16], and each VM has one shard. In contrast to increasing the power of one machine, there is a big drop in performance from 1 to 2 VMs, from a speed up of 4,1 to 2,8, which is 1,3 smaller than using one small VM. The speed up then continues to decrease for n>2, but not as drastically. 1 large VM and 4 small VMs have about the same capacity, 4 cores and 7 GB memory, but the difference in speed up is significant. Having all cores and memory on one large VM gives a speed up that is 3,8 larger than when having the cores and memory spread out over 4 small VMs.


Test 3

Figure 7.3: Speed up n large VMs using n shards

In Figure 7.3, 1 small VM is compared to n large VMs, where n ∈ [1, 2, 4], and each VM has 1 shard. A similarity can be seen with the result in Figure 7.2. Here there is also a drop from 1 VM to 2 VMs: from a speed up of 6,5 using 1 VM to 4,8 using 2 VMs, which is 1,7 smaller than using 1 large VM. The performance also decreases when using 4 large VMs, but it is very similar to when using 2 VMs, so adding the 2 extra VMs does not give an extra boost in indexing time performance.

Based on these tests it seems that adding more cores and memory in the form of more VMs gives a decrease in Solr's indexing time performance, but adding more cores and memory to one single machine gives a significant increase in Solr's indexing time performance.


7.2 Sharding

Test 1

Figure 7.4: Speed up on 1 small VM

In Figure 7.4 one small VM with one shard is compared to one small VM with n shards, where n ∈ [2, 4, 8, 16]. It can be seen that sharding has a very strong negative effect on the Solr indexing time performance.

Test 2

Figure 7.5: Speed up on 1 large VM

In Figure 7.5 the same comparison is done with a large VM, with n ∈ [2, 4, 8, 16]. Here it can also be seen that sharding has a negative effect. It is not as drastic as for a small VM, but it is still significant. There is a big drop in performance from 1 to 2 shards, and from 2 to 4 shards; after that the drop is no longer as significant.

An explanation for this is that the overhead of dividing the work among the shards outweighs the benefit of having more SolrCores that can compute the index. The decrease in indexing time performance when adding more shards is more significant the fewer shards there are. Instead of the wanted effect of boosting the indexing time performance by dividing the work among more SolrCores, the performance decreases.
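The effect of sharding on one large VM can be traced in the appendix C data. The sketch below is only an illustration under stated assumptions: the names are hypothetical, and the times are the 8-thread averages for the large VM from Tables C.1 to C.5, assumed to be the values behind the peaks in Figure 7.5.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch: speed up per shard count on 1 large VM, relative to the base-line.
final class ShardingSpeedUp {

    static final double BASELINE_MS = 86239.0; // base-line from section 6.5

    // Maps each shard count to base-line / average time, in the order given.
    static Map<Integer, Double> speedUpByShards(int[] shards, double[] avgTimesMs) {
        Map<Integer, Double> result = new LinkedHashMap<>();
        for (int i = 0; i < shards.length; i++) {
            result.put(shards[i], BASELINE_MS / avgTimesMs[i]);
        }
        return result;
    }

    public static void main(String[] args) {
        int[] shards = {1, 2, 4, 8, 16};
        // Large VM, 8 threads, average times from Tables C.1, C.2, C.3, C.4 and C.5.
        double[] avgTimesMs = {13187.0, 16712.667, 24827.667, 22874.667, 26775.332};
        for (Map.Entry<Integer, Double> e : speedUpByShards(shards, avgTimesMs).entrySet()) {
            System.out.printf("%2d shards: speed up %.1f%n", e.getKey(), e.getValue());
        }
    }
}
```

The largest drops in speed up occur between 1 and 2 shards and between 2 and 4 shards, in line with the description above.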

An observation from comparing Figure 7.3 and Figure 7.5 is illustrated in Figure 7.6, where having 2 shards on 1 large VM is compared to having 2 large VMs with 1 shard on each VM. Figure 7.6 shows that the biggest time thief seems to be the overhead of sharding, and not the overhead of having the shards on separate VMs. When comparing Figure 7.2 and Figure 7.4 it seems that adding more VMs can increase performance when there is more than one shard. But adding more power to one single machine, as in Figure 7.5, outperforms adding more machines for the Solr indexing time performance. This is further analysed in 7.3.


Figure 7.6: Speed up with 2 shards on 1 large VM compared to 2 large VMs


7.3 Combination of Scaling Infrastructure and Sharding

Test 1

Figure 7.7: Speed up with n shards on 2 large VMs

In Figure 7.7, 1 small VM with 1 shard is compared to 2 large VMs with n shards, where n ∈ [2, 4, 8, 16], and the shards are evenly spread among the VMs. Compared to the 1 large VM set up seen in Figure 7.5, the 2 large VM set up decreases the speed up when using 2 shards, but increases the performance for more than 2 shards. Instead of a speed up between 3,2 and 3,8 for n>2 when using 1 large VM, the speed up is between 4,4 and 4,7 when using 2 large VMs.


Test 2

Figure 7.8: Speed up with n shards on 4 large VMs

Using 4 large VMs as in Figure 7.8 gives a small increase of the speed up for 4 shards, from 4,4 to 4,7. But that is not much at all considering that 8 cores and 14 GB of memory have been added when going from 2 large VMs to 4, and for n>4 there is a decrease in performance.

Test 3

If 2 extra large VMs are used instead, equally many cores and as much memory are used as with 4 large VMs, as seen in Table 7.1. But there is a much greater speed up, as seen in Figure 7.9, where 1 and 2 extra large VMs are compared to 2 and 4 large VMs. Instead of a 0,3 increase of the speed up as when going from 2 to 4 large VMs, there is a 2,7 increase of the speed up when using 2 extra large VMs. So using 2 extra large VMs gives a significant increase of the speed up compared to using one extra large VM. It can also be seen that 1 extra large VM has the same performance as 2 large VMs when using 4 shards.


Figure 7.9: Speed up with 4 shards, on 2 and 4 large VMs and 2 extra large VMs

VMs              Cores   Memory
4 large          16      28 GB
2 extra large    16      28 GB

Table 7.1: Cores and memory on 4 large and 2 extra large VMs [14, 40].

7.4 Summary

From the analysis in section 7 it can be seen that sharding in itself is a big consumer of time when distributing the index over SolrCores. This is because the overhead of dividing the work among the shards outweighs the benefit of having more SolrCores that can compute the index. The negative impact on indexing time performance is greater the fewer shards there are; going from 1 to 2 shards decreases the performance more than going from 4 to 8 shards.

Sharding is also the reason why it is not so beneficial to scale Solr horizontally on more machines, when considering indexing time performance. This is because shards have to be used to distribute the index over the machines. If more than one shard is used and an increase of the indexing time performance is wanted, the best option when using Windows Azure is to scale up in VM size first. If that is not enough when the extra large VM size is reached, adding one more extra large server gives further speed up. When using smaller VM sizes like small and large, adding more VMs of the same size will not give a significant increase of the performance.

The number of threads is also very significant for the indexing time performance. All the tests show that 8 threads give the best performance, except when using 2 extra large VMs, where 16 threads give the best speed up. Multi threaded indexers have not been used by Itera Consulting before, so the implementation of a multi threaded indexer alone will likely enhance the indexing time performance significantly.
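Picking the best thread count for one configuration amounts to finding the column with the lowest average time. The sketch below illustrates this for a single row only: the names are hypothetical, and the times are the averages for 1 large VM with 1 shard from Table C.1.

```java
// Illustrative sketch: find the thread count with the lowest average indexing time.
final class BestThreadCount {

    // Returns the entry of threads whose average time is smallest.
    static int bestThreads(int[] threads, double[] avgTimesMs) {
        int best = 0;
        for (int i = 1; i < avgTimesMs.length; i++) {
            if (avgTimesMs[i] < avgTimesMs[best]) {
                best = i;
            }
        }
        return threads[best];
    }

    public static void main(String[] args) {
        int[] threads = {1, 2, 4, 8, 16, 32, 64, 128};
        // Average times in ms for 1 large VM, 1 shard (Table C.1).
        double[] avgTimesMs = {83768.334, 42996.667, 24410.334, 13187.0,
                               53739.334, 42619.334, 45681.0, 50771.0};
        System.out.println("Best thread count: " + bestThreads(threads, avgTimesMs));
    }
}
```

For this row the minimum is at 8 threads; the same check can be applied to each row of the tables in appendix C.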

7.5 The Benchmark Data

Because the documents used in these tests are created directly in the client and not read from disk, I/O does not affect the performance as it can when documents are read from disk. Since this is a potential bottleneck, performance with documents read from disk will be lower; how much lower cannot currently be estimated, and needs to be evaluated empirically.


Chapter 8

Conclusion and Discussion

The Cloud

A cloud is a type of distributed system consisting of virtualized computers that are inter-connected and dynamically provisioned, presented as one or more unified resource(s) to the end user. Typical characteristics of a cloud are that resources are shared among multiple tenants, and that users pay as they go and are able to provision computing capabilities without human interaction. A cloud should also be able to scale rapidly, in proportion to demand.

Cloud services can be categorised into three levels: IaaS, PaaS and SaaS, known as the SPI model. Simplified, IaaS can be described as hardware and network, PaaS as the OS and development environment, and SaaS as ready applications. The three levels run on top of each other and are modularised, which means that if an instance in one level goes down, the instances that run on top of it can easily be transferred to another instance.

Shared resources are one security issue with the cloud. To ensure the confidentiality of data, encryption should be used for data in transit over networks, at rest and in back up media.

Due to the dynamic nature of a cloud it is very difficult to determine the exact location of data, and the location can only be specified as a larger region. This may cause complications since laws differ between countries, which can be a big issue. One specific example is privacy of data. Because of this I think many cloud solutions will be deployed as hybrid clouds, where sensitive data is kept within a private part of the cloud, to get full control of the sensitive data as well as access to the advantages of a public cloud.

Another issue that needs to be taken care of is an agreement on a standard API to manage cloud services, to avoid vendor lock-in. If this issue is dealt with, more organisations will probably adopt cloud solutions.

The Implementation

Sharding in itself does not increase indexing time performance, because the overhead of dividing the work among the shards outweighs the benefit of having more SolrCores that can compute the index. The negative impact on indexing time performance is greater the fewer shards there are, which indicates that the overhead time increases in a logarithmic manner.

Sharding is also the reason why it is not so beneficial to scale Solr horizontally on more machines, when considering indexing time performance. The tests show that to increase indexing time performance when limited by 20 cores and Windows Azure's VM sizes, scaling up in VM size gives the best result. If scaling up in VM size is no longer possible, that is, when already having 1 extra large VM, adding 1 more extra large VM gives further speed up.

A reason why sharding in itself does not increase indexing time performance in these tests may be that shards are not created in a parallel manner with the techniques used in these tests. A solution may be to integrate Apache Solr with Apache Hadoop, as suggested in the future work section 9. Hadoop offers a sophisticated way of distributing data to a cluster of leader servers, and according to [13] it can be used to create shards in a parallel manner.

From the tests it can also be seen that the number of threads is very significant for the indexing time performance. Most tests peak at 8 threads. Why this is cannot be concluded from these tests, but an assumption is that it is due to a limitation in Solr. Multi threaded indexers have not been used by Itera Consulting before, so the implementation of a multi threaded indexer alone will likely enhance the indexing time performance significantly.

The Viability of Deploying Solr as a Cloud Service

The cloud can be used in a very dynamic way, meaning that VM sizes can be changed as desired, as can the number of VMs. This in combination with SolrCloud's robustness, with automatic fail over and recovery features, opens the opportunity to build a very dynamic Solr implementation, with the possibility to add and delete replicas as the query rate goes up and down.

It might also be possible to resize the leader of a shard as desired, but since this will cause a restart of the leader VM, and Solr automatically elects a new leader from the replicas of that shard when this happens, this can be a bit complicated. Currently there is no feature that makes it possible to choose a new leader, but if that feature is added in the future, a replica of the desired size can be added to the shard. This new replica is then appointed as the Crown Prince of the leader, and then all that has to be done is to kill the leader and the switch is done.

If these features are implemented and the system's demand peaks and lows are followed, it will likely save money since you only pay for the resources you use, and the system will also respond to customer needs in a dynamic way.

Figure 8.1: Illustration of how Solr can be set up in the cloud, where each leader and replica runs on its own role in Windows Azure Cloud Services.


Chapter 9

Future Work

Dynamics for Leaders and Replicas

Development of dynamics for replicas and leaders, and a load balancer to determine how many of these should be used. An analysis for determining the values for the load balancer.

Solr in a PaaS Solution

Analyse how Solr could be developed on Windows Azure's Cloud Services, Windows Azure's PaaS solution. This would minimize administration time for scaling a Solr set up with multiple machines. It may for example be useful for the development of dynamics for leaders and replicas.

I/O

I/O is a potential bottleneck, and will likely lower performance. Further experiments to determine the actual impact of file I/O bandwidth to disk are recommended. This could be done for files of varying size and number.

Hadoop

Hadoop enables a sophisticated way to build very large indexes over N leader servers, by creating shards in a parallel manner [13]. This may be a solution to utilise the potential of shards and reduce the indexing time even further, and is strongly recommended to look into, especially when working with very large indexes holding terabytes of data.


Bibliography

[1] A12 Find. axis twelve. Sept. 2012. url: http://www.axistwelve.com/find.

[2] About Affinity Groups for Virtual Network. Microsoft. 2012. url: http://msdn.microsoft.com/en-us/library/jj156085.aspx.

[3] Amazon CloudSearch. Amazon. 2012. url: https://aws.amazon.com/cloudsearch/.

[4] An Executive Overview of GAPP. American Institute of Certified Public Accountants (AICPA) and Canadian Institute of Chartered Accountants (CICA). 2012. url: http://www.aicpa.org/InterestAreas/InformationTechnology/Resources/Privacy/GenerallyAcceptedPrivacyPrinciples/DownloadableDocuments/10261378ExecOverviewGAPP.pdf.

[5] Apache Solr. Apache. 2012. url: http://lucene.apache.org/solr/index.html.

[6] Michael Armbrust et al. Above the Clouds: A Berkeley View of Cloud Computing. UC Berkeley Reliable Adaptive Distributed Systems Laboratory. Feb. 2009. url: https://radlab.cs.berkeley.edu/.

[7] Michael Armbrust et al. “A view of cloud computing”. In: Communications of the ACM, V53 Issue 4. 2010, pp. 50–58.

[8] Rajkumar Buyya, Chee Shin Yeo, and Srikumar Venugopal. “Market-Oriented Cloud Computing: Vision, Hype, and Reality for Delivering IT Services as Computing Utilities”. In: 10th IEEE International Conference on High Performance Computing and Communications. 2008, pp. 5–13.

[9] Rajkumar Buyya et al. “Cloud computing and emerging IT platforms: Vision, hype, and reality for delivering computing as the 5th utility”. In: Future Generation Computer Systems V25. 2009, pp. 599–616.


[10] Cloud Computing Risk Assessment. The European Network and Information Security Agency (ENISA). Nov. 2009. url: https://www.enisa.europa.eu/activities/risk-management/files/deliverables/cloud-computing-risk-assessment.

[11] Tharam Dillon, Chen Wu, and Elizabeth Chang. “Cloud Computing: Issues and Challenges”. In: 24th IEEE International Conference on Advanced Information Networking and Applications. 2010, pp. 27–33.

[12] Global Infrastructure. Amazon. Sept. 2012. url: https://aws.amazon.com/about-aws/globalinfrastructure/.

[13] Hadoop. Apache. 2012. url: http://wiki.apache.org/solr/HadoopIndexing.

[14] Introducing Windows Azure. Microsoft. 2012. url: https://www.windowsazure.com/en-us/develop/net/fundamentals/intro-to-windows-azure/.

[15] Balachandra Reddy Kandukuri, Ramakrishna Paturi V, and Dr. Atanu Rakshit. “Data Security in the World of Cloud Computing”. In: IEEE International Conference on Services Computing. 2009, pp. 517–520.

[16] L. M. Kaufman. “Data Security in the World of Cloud Computing”. In: Security and Privacy, IEEE, V7 issue 4. 2009, pp. 61–64.

[17] LucidWorks Search. LucidWorks. Sept. 2012. url: http://www.lucidworks.com.

[18] Manage the Availability of Virtual Machines. Microsoft. 2012. url: https://www.windowsazure.com/en-us/manage/windows/common-tasks/manage-vm-availability/.

[19] Mark Miller. Apache. 2012. url: http://lucene.472066.n3.nabble.com/Internal-Vs-External-ZooKeeper-td4019543.html.

[20] Tim Mather, Subra Kumaraswamy, and Shahed Latif. Cloud Security and Privacy: An Enterprise Perspective on Risks and Compliance. 1005 Gravenstein Highway North, Sebastopol, CA 95472, United States of America: O’Reilly Media, Inc, 2009.

[21] Microsoft Windows Azure Downtime Blamed on Leap Year Bug. eWeek. Mar. 2012. url: http://www.eweek.com/c/a/Enterprise-Applications/Microsoft-Windows-Azure-Downtime-Blamed-on-Leap-Year-Bug-707169/.

[22] Privacy. Microsoft. Sept. 2012. url: https://www.windowsazure.com/en-us/support/trust-center/privacy/.


[23] Reddit, Quora, Foursquare, Hootsuite Go Down Due To Amazon EC2 Cloud Service Troubles. The Huffington Post. Apr. 2011. url: http://www.huffingtonpost.com/2011/04/21/amazon-ec2-takes-down-reddit-quora-foursquare-hootsuite_n_851964.html.

[24] Thomas Ristenpart et al. “Hey, You, Get Off of My Cloud: Exploring Information Leakage in Third-Party Compute Clouds”. In: 16th ACM conference on Computer and communications security. 2009, pp. 199–212.

[25] Scaling Lucene and Solr. Apache. 2009. url: http://searchhub.org/2009/09/02/scaling-lucene-and-solr.

[26] Scaling Solr Horizontally in the Cloud. Baracuda Networks, Inc and Nutschell, LLZ. 2011. url: http://lanyrd.com/2011/oscon-data/sghpk/.

[27] Security Guidance for Critical Areas of Focus in Cloud Computing V2.1. Cloud Security Alliance (CSA). Dec. 2009. url: www.cloudsecurityalliance.org/guidance/csaguide.v2.1.pdf.

[28] Slides: Above the Clouds: A Berkeley View of Cloud Computing. RAD Lab. 2009. url: http://static.usenix.org/event/lisa09/tech/slides/fox.pdf.

[29] Solr Near Realtime Search. Apache. 2012. url: http://wiki.apache.org/solr/NearRealtimeSearch.

[30] Solr search. itera Consulting. Sept. 2012. url: http://iteraconsulting.se/Vara-tjanster/Sok/Solr/.

[31] Solr Wiki. Apache. 2012. url: http://wiki.apache.org/solr/Solr4.0.

[32] SolrCloud. LucidWorks. 2012. url: http://lucidworks.lucidimagination.com/display/solr/SolrCloud.

[33] SolrCloud. Apache. 2012. url: http://wiki.apache.org/solr/SolrCloud.

[34] SolrHQ. TNR Global. Sept. 2012. url: http://www.tnrglobal.com/lucene-solr/solrhq-solr-in-the-cloud.

[35] S. Subashini and V. Kavitha. “A survey on security issues in service delivery models of cloud computing”. In: Journal of Network and Computer Applications V34, Issue 1. 2011, pp. 1–11.

[36] Technical Overview of the Security Features in the Windows Azure Platform. Microsoft. 2012. url: https://www.windowsazure.com/en-us/support/legal/security-overview/.


[37] The NIST Definition of Cloud Computing. National Institute of Standards and Technology (NIST). Sept. 2011. url: http://csrc.nist.gov/publications/nistpubs/800-145/SP800-145.pdf.

[38] Windows Azure Privacy Statement. Microsoft. 2012. url: https://www.windowsazure.com/en-us/support/legal/privacy-statement/?l=en-us.

[39] Windows Azure SLA. Microsoft. 2012. url: https://www.windowsazure.com/en-us/support/legal/sla/.

[40] Windows Azure Virtual Machines. Senior Technical Evangelist – Windows Azure. 2012. url: http://michaelwasham.com/2012/06/08/understanding-windows-azure-virtual-machines/.

[41] Timothy Wood et al. “CloudNet: dynamic pooling of cloud resources by live WAN migration of virtual machines”. In: Proceedings of the 7th ACM SIGPLAN/SIGOPS international conference on Virtual execution environments. 2011, pp. 121–132.


Appendices


Appendix A

PowerShell Script

An introduction to the Windows Azure PowerShell cmdlets used to modify Windows Azure VMs can be found at:

http://msdn.microsoft.com/en-us/library/windowsazure/jj554332.aspx

Below are the commands used specifically for this report:

# SetUpClusterEndpoints.ps1

Set-ExecutionPolicy RemoteSigned

Import-Module "C:\Program Files (x86)\Microsoft SDKs\Windows Azure\PowerShell\Azure\Azure.psd1"

Import-AzurePublishSettingsFile "<path to mysettings.publishsettings file>"

Select-AzureSubscription -SubscriptionName $mySubscriptionName

$i = 1

do {
    # Add the Solr port (8983) as an endpoint on VM number $i
    Get-AzureVM -ServiceName ("solrVM" + $i) -Name ("solrVM" + $i) |
        Add-AzureEndpoint -Name "solr" -Protocol "TCP" -PublicPort 8983 -LocalPort 8983 |
        Update-AzureVM

    $i++
}
while ($i -le 16)


Appendix B

The Client Code

ThreadIndexer.java

//ThreadIndexer.java

import java.io.FileInputStream;
import java.util.Properties;
import java.util.concurrent.CyclicBarrier;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class ThreadIndexer {

    private static Properties properties;
    private static int numThreads;
    private static String solrUrl;

    public static void main(String[] args) throws Exception {
        //Read the number of threads and the Solr URL from the properties file
        try (FileInputStream in = new FileInputStream("config/indexing.properties")) {
            properties = new Properties();
            properties.load(in);
            numThreads = Integer.parseInt(properties.getProperty("threads"));
            solrUrl = properties.getProperty("solrUrl");
        } catch (Exception ex) {
            System.out.println("Could not read properties file: " + ex.getMessage());
            return;
        }

        //Barrier that waits for numThreads+1 threads (all workers and
        //the main thread) to call await before proceeding
        final CyclicBarrier barrier = new CyclicBarrier(numThreads + 1);
        SolrServer server = new HttpSolrServer(solrUrl);
        WorkerThread[] threads = new WorkerThread[numThreads];

        int totalDocs = 5000;
        int docsPerThread = totalDocs / numThreads;
        int leftover = totalDocs % numThreads;

        //Start taking time
        long start1 = System.currentTimeMillis();

        //Distribute docsPerThread documents to each of the first n-1 threads
        for (int i = 0; i < numThreads - 1; i++) {
            threads[i] = new WorkerThread("thread_" + i,
                    barrier, docsPerThread, properties, server);
        }

        //Distribute the remaining documents to the last thread
        threads[numThreads - 1] = new WorkerThread("thread_" + (numThreads - 1),
                barrier, docsPerThread + leftover, properties, server);

        try {
            //The main thread waits at the barrier until all worker
            //threads have reached it, then commits
            barrier.await();
            server.commit();

            //Stop taking time
            long end1 = System.currentTimeMillis();

            //Calculate the total run time
            long totTimeAdd = end1 - start1;
            System.out.println("Total time adding " + totalDocs
                    + " docs (" + totTimeAdd + " ms)");
        } catch (Exception e) {
            System.out.println("In Main: " + e);
        }
    }
}

WorkerThread.java

//WorkerThread.java

import java.util.Properties;
import java.util.concurrent.CyclicBarrier;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.common.SolrInputDocument;

public class WorkerThread implements Runnable {

    private Thread runner;
    private CyclicBarrier barrier;
    private int thisThreadsDocs;
    private Properties properties;
    private SolrServer server;

    //Initiate WorkerThread and start its thread
    public WorkerThread(String threadName, CyclicBarrier b, int d,
            Properties p, SolrServer s) {
        runner = new Thread(this, threadName);
        barrier = b;
        thisThreadsDocs = d;
        properties = p;
        server = s;
        runner.start();
    }

    //Run WorkerThread
    public void run() {
        try {
            //Add thisThreadsDocs number of docs to server
            for (int i = 0; i < thisThreadsDocs; i++) {
                addDocument(server, i, runner.getName());
            }
            //Tell barrier this WorkerThread is done
            //and wait for the other threads to finish
            barrier.await();
        } catch (Exception e) {
            System.out.println("In thread " + runner.getName() + ": " + e);
        }
    }

    //Add document to solr with threadName and i as id
    public static void addDocument(SolrServer solr, int i, String threadName) {
        try {
            SolrInputDocument solrDoc = new SolrInputDocument();
            solrDoc.addField("id", threadName + "_" + i);
            solrDoc.addField("name", threadName);
            solrDoc.addField("price", i);
            solr.add(solrDoc);
        } catch (SolrServerException e) {
            e.printStackTrace();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

indexing.properties

#indexing.properties
solrUrl=http://<VM DNS name>:8983/solr
threads=8


Appendix C

Collected Data

VM size / Threads 1           2           4           8           16          32          64          128
Small (1st run)   86203       46937       28375       22391       69680       65054       64435       65421
2nd run           85640       45953       27515       20281       69724       66014       66626       66665
3rd run           86874       45734       27812       20562       69249       67266       66872       67260
average time      86239       46208       27900.667   21078       69551       66111.332   65977.667   66448.667

Large (1st run)   83625       42231       24688       14281       62171       45218       45156       47296
2nd run           83919       44406       23575       12718       46756       40781       46168       47139
3rd run           83761       42353       24968       12562       52291       41859       45719       57878
average time      83768.334   42996.667   24410.334   13187       53739.334   42619.334   45681       50771

Table C.1: Indexing time in milliseconds on 1 server, 1 shard

VM size / Threads 1           2           4           8           16          32          64          128
Small (1st run)   97411       63998       42046       57874       48452       46295       48780       48858
2nd run           99582       63687       42943       57811       43374       45386       46858       47358
3rd run           98391       63562       43030       57608       46858       45412       47640       47546
average time      98461.332   63749       42673       57764.332   46228       45697.667   47759.334   47920.667

Large (1st run)   86575       44444       25833       16968       51308       39872       47092       45764
2nd run           91419       47774       25145       16374       48665       42342       47592       43448
3rd run           86154       46181       25364       16796       50137       40640       45577       46968
average time      88049.332   46133       25447.334   16712.667   50036.667   40951.332   46753.667   45393.332

Table C.2: Indexing time in milliseconds on 1 server, 2 shards


VM size / Treads 1 2 4 8 16 32 64 128Small (1st run) 111066 71264 65154 54608 47686 50811 44562 478582 run 111717 67841 68138 53576 53998 52592 48655 443983 run 111121 66889 69951 55467 45186 51420 43014 46241average time 111301.332 68664.667 67747.667 54550.332 48956.667 51607.66 45410.332 46165.667

Large (1st run) 145849 72288 40415 27671 63373 65373 64170 658262 run 144755 72613 37929 23437 63201 66171 64483 632173 run 145041 72093 37724 23375 63670 64686 64093 64170average time 145215 72331.332 38689.332 24827.667 63414.667 65410 64248.667 64404.332

Table C.3: Indexing time in milliseconds on 1 server 4 shard

VM size / Treads 1 2 4 8 16 32 64 128Small (1st run) 138636 95341 61905 67499 49530 52823 51014 502582 run 143637 93825 63076 64048 54599 53157 51940 507793 run 140149 95685 63030 65576 53971 51966 52172 51493average time 140807.332 94950.332 62670.332 65707.667 52700 52648.66 51708.667 50843.332

Large (1st run) 99794 46452 28499 23843 70967 64171 67811 696702 run 94560 47280 29530 22765 71998 69170 70639 677173 run 90373 47030 29093 22016 73373 68732 68498 66389average time 94909 46920.667 29040.667 22874.667 72112.667 67357.667 68982.667 67925.332

Table C.4: Indexing time in milliseconds on 1 server 8 shard

VM size / Treads 1 2 4 8 16 32 64 128Small (1st run) 125566 92876 92995 70629 66726 62570 62952 635672 run 122305 89928 87687 75661 67317 61212 63268 633933 run 119798 92629 94117 74952 66892 62533 61731 61175average time 122556.332 91811 91599.667 73747.332 66978.332 62105 62650.332 62711.667

Large (1st run) 174043 83810 44968 26953 74701 70279 69452 694362 run 167621 83138 44312 26296 74561 72108 69779 692953 run 172012 83247 43062 27077 73686 68796 64452 67874average time 171225.332 83398.332 44114 26775.332 74316 70394.332 67894.332 68868.332

Table C.5: Indexing time in milliseconds on 1 server 16 shard

VM size / Treads 1 2 4 8 16 32 64 128Small (1st run) 129499 58937 42468 29640 80341 73951 74029 726072 run 130702 61593 42764 29109 84185 73326 72904 708733 run 103530 67687 41406 32952 79185 73904 72825 72544average time 121243.667 62739 42212.667 30567 81237 73727 73252.667 72008

Table C.6: Indexing time in milliseconds on 2 servers, 2 shards


VM size / Threads                1           2           4           8          16          32          64         128
Small (1st run)             131980       92607       51077       32780       79935       73639       71326       71952
2nd run                     112200       84544       51467       32421       81217       72248       71842       71562
3rd run                     117606       87762       48795       31593       80716       71342       69905       69265
average time            120595.332   88304.332   50446.332   32264.667   80622.667   72409.667   71024.332   70926.332

Table C.7: Indexing time in milliseconds on 4 servers, 4 shards

VM size / Threads                1           2           4           8          16          32          64         128
Small (1st run)             170764       92217       51967       32999       82374       69873       68587       71852
2nd run                     172764       82546       46546       32156       67905       66861       69900       69929
3rd run                     164451       91639       45500       33499       79030       71674       68898       66803
average time            169326.332   88800.667   48004.332   32884.667   76436.332   69469.332   69128.332       69528

Table C.8: Indexing time in milliseconds on 8 servers, 8 shards

VM size / Threads                1           2           4           8          16          32          64         128
Small (1st run)             196765      107030       51452       36000       70092       71138       69874       70779
2nd run                     168530       82967       50077       37780       64327       72389       67311       70216
3rd run                     196936      100858       50596       37577       82498       71546       68976       70504
average time            187410.332   96951.667   50708.332       37119   72305.667       71691   68720.332   70499.667

Table C.9: Indexing time in milliseconds on 16 servers, 16 shards

VM size / Threads                1           2           4           8          16          32          64         128
Large (1st run)              92205       61282       33342       18065       67791       62169       60816       60160
2nd run                      95654       64387       33310       17861       67245       62461       60703       60049
3rd run                     100877       62502       33365       17705       67604       61863       59908       60503
average time             96245.332   62723.667       33339       17877   67546.667   62164.334   60475.667   60237.332

Table C.10: Indexing time in milliseconds on 2 servers, 2 shards

VM size / Threads                1           2           4           8          16          32          64         128
Large (1st run)             148470       57279       28985       17094       70976       64949       61338       61704
2nd run                     143264       55409       24679       19778       71074       60500       60496       60442
3rd run                     137087       60495       27255       22194       70641       64278       60729       61724
average time            142940.332   57727.667       26973   19688.667       70897   63242.332   60854.332       61290

Table C.11: Indexing time in milliseconds on 2 servers, 4 shards


VM size / Threads                1           2           4           8          16          32          64         128
Large (1st run)             161429       80603       32483       18857       67557       66264       63501       62114
2nd run                     161008       80790       31939       19513       67683       64952       64923       61381
3rd run                     161783       79740       33529       16735       70747       64281       63552       62068
average time            161406.667   80377.667   32650.332   18368.332   68662.332   65165.667       63992   61854.332

Table C.12: Indexing time in milliseconds on 2 servers, 8 shards

VM size / Threads                1           2           4           8          16          32          64         128
Large (1st run)             109802       85433       26517       18136       70332       67952       64742       64251
2nd run                     104034       85779       26688       19711       72020       67017       64554       64826
3rd run                     106721       85905       26952       19273       75188       66662       63919       62733
average time            106852.332   85705.667       26719       19040   72513.332   67210.332       64405   63936.667

Table C.13: Indexing time in milliseconds on 2 servers, 16 shards

VM size / Threads                1           2           4           8          16          32          64         128
Large (1st run)             155235       77461       38922       18976       69867       60583       59759       59206
2nd run                     156131       76966       38395       19755       65193       60462       60666       59458
3rd run                     155548       77389       37479       16837       65484       60237       59968       58889
average time                155638       77272   38265.332   18522.667       66848   60427.332       60131   59184.332

Table C.14: Indexing time in milliseconds on 4 servers, 4 shards

VM size / Threads                1           2           4           8          16          32          64         128
Large (1st run)             168997       84101       40746       20140       67517       61976       60904       60601
2nd run                     168497       83172       41183       20858       66596       62088       61184       62914
3rd run                     167095       82237       38763       19530       67145       62178       54590       59911
average time            168196.332       83170   40230.667       20176       67086   62080.667   58892.667       61142

Table C.15: Indexing time in milliseconds on 4 servers, 8 shards

VM size / Threads                1           2           4           8          16          32          64         128
Large (1st run)             180453       88580       43880       20424       69946       62454       62220       61582
2nd run                     184941       88459       42085       20534       70260       63656       61518       62875
3rd run                     181329       89163       41255       20488       68983       61611       61766       60814
average time                182241       88734   42406.667       20482   69729.667   62573.667   61834.667       61757

Table C.16: Indexing time in milliseconds on 4 servers, 16 shards


VM size / Threads                1           2           4           8          16          32          64         128
Extra Large (1st run)        93534       45903       29257       17785       65655       62527       62229       60758
2nd run                      90344       45309       28538       18817       66216       65840       63448       61149
3rd run                      90780       49075       29007       18708       65997       65964       61852       61288
average time             91552.667   46762.332       28934   18436.667       65956       64777   62509.667       61065

Table C.17: Indexing time in milliseconds on 1 server, 4 shards

VM size / Threads                1           2           4           8          16          32          64         128
Extra Large (1st run)       143578       43703       24469       17234       12749       63500       61530       58656
2nd run                     140217       45311       23437       14562       11781       61375       60421       60343
3rd run                     141280       48109       24781       16469       12046       61624       59999       58655
average time            141691.667   45707.667       24229   16088.332       12192   62166.332       60650       59218

Table C.18: Indexing time in milliseconds on 2 servers, 4 shards
