solr extracting data

Download Solr Extracting Data

If you can't read please download the document

Upload: semtech-solutions-ltd

Post on 16-Apr-2017

2.087 views

Category:

Technology


0 download

TRANSCRIPT

Solr Extracting Data

Start this session with a full Solr indexed repository

Movie cAiYBD4BQeE showed installation

Movie Th5Scvlyt-E showed Nutch web crawl

This movie will show how to

Extract data from Solr

Extract to xml or csv

Show aim to load into data warehouse

This movie assumes you know Linux

Solr Extracting Data

Progress so far, greyed out area yet to be examined

Checking Solr Data

Data should have been indexed in Solr

In Solr Admin window

Set 'Core Selector' = collection1

Click 'Query'

In Query window set fl field = url

Click Execute Query

The result ( next ) shows the filtered list of urls in Solr

Checking Solr Data

How To Extract

How could we get at Solr data ?

In admin console via query

Via http solr select

Via curl -o call using solr http select

What format of data that suits this purpose

Xml

Comma separated variable (csv)

How To Extract

We want to extract two columns from Solr

tstamp, url

We want to extract as csv ( csv in call below could be xml )

We want to extract to a file

So we will use an http call

http://localhost:8983/solr/select?q=*:*&fl=tstamp,url&wt=csv

We will also use a curl call

curl -o ''

How To Extract

Ceate a bash file in Solr install directory

cd solr-4-2-1/extract ; touch solr_url_extract.bash

chmod 755 solr_url_extract.bash

Add contents to bash file

#!/bin/bash

curl -o result.csv 'http://localhost:8983/solr/select?q=*:*&fl=tstamp,url&wt=csv'

mv result.csv result.csv.$(date +%Y%m%d.%H%M%S)

Now run the bash script

./solr_url_extract.bash

Check Output

Now we check whether we have data

ls -l shows

result.csv.20130506.124857

Check the content , wc -l shows 11 lines

Check the content , head -2 shows

tstamp, url

2013-05-04T01:56:58.157Z,http://www.mysite.co.nz/Search? DateRange=7& ...

Congratulations, you have extracted data from Solr

It's in CSV format ready to be loaded into a data warehouse

Possible Next Steps

Choose more fields to extract from data

Allow Nutch crawl to go deeper

Allow Nutch crawl to collect a lot more data

Look at facets in Solr data

Load CSV files into Data Warehouse Staging schema

Next movie will show next step in progress

Contact Us

Feel free to contact us at

www.semtech-solutions.co.nz

[email protected]

We offer IT project consultancy

We are happy to hear about your problems

You can just pay for those hours that you need

To solve your problems