solr extracting data
Solr Extracting Data
Start this session with a fully indexed Solr repository
Movie cAiYBD4BQeE showed installation
Movie Th5Scvlyt-E showed Nutch web crawl
This movie will show how to
Extract data from Solr
Extract to xml or csv
Show the aim of loading the extract into a data warehouse
This movie assumes you know Linux
Solr Extracting Data
Progress so far, greyed out area yet to be examined
Checking Solr Data
Data should have been indexed in Solr
In Solr Admin window
Set 'Core Selector' = collection1
Click 'Query'
In Query window set fl field = url
Click Execute Query
The result (next slide) shows the filtered list of URLs in Solr
Checking Solr Data
How To Extract
How could we get at Solr data?
In the admin console via a query
Via an HTTP call to the Solr select handler
Via a curl -o call using the same Solr HTTP select
Which data format suits this purpose?
XML
Comma-separated values (CSV)
How To Extract
We want to extract two columns from Solr
tstamp, url
We want to extract as CSV (wt=csv in the call below could be wt=xml)
We want to extract to a file
So we will use an http call
http://localhost:8983/solr/select?q=*:*&fl=tstamp,url&wt=csv
We will also use a curl call
curl -o ''
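The HTTP call above is made of three parts: the query (q), the field list (fl) and the response format (wt). A minimal sketch of building it from those parts follows. The function name solr_select_url and the SOLR_SELECT variable are made up for illustration; the optional rows parameter is a standard Solr parameter that this movie does not set, so Solr falls back to its default of 10 rows.

```shell
#!/bin/sh
# Sketch: build a Solr select URL from its parts.
# SOLR_SELECT is the endpoint used in this movie; override it if yours differs.
SOLR_SELECT="${SOLR_SELECT:-http://localhost:8983/solr/select}"

# solr_select_url FIELDS FORMAT [ROWS]   (hypothetical helper)
#   FIELDS  comma separated field list for fl, e.g. tstamp,url
#   FORMAT  response writer for wt, e.g. csv or xml
#   ROWS    optional row limit; Solr returns 10 rows when rows is omitted
solr_select_url() {
    fields="$1"
    format="$2"
    rows="${3:-}"
    url="${SOLR_SELECT}?q=*:*&fl=${fields}&wt=${format}"
    if [ -n "$rows" ]; then
        url="${url}&rows=${rows}"
    fi
    printf '%s\n' "$url"
}

# Print the CSV call from this slide, then an XML variant limited to 100 rows
solr_select_url tstamp,url csv
solr_select_url tstamp,url xml 100
```

The printed URL can be passed to curl, for example: curl -o result.csv "$(solr_select_url tstamp,url csv)". Keep the URL quoted, because & and * are special characters to the shell.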
How To Extract
Create a bash file in the Solr install directory
cd solr-4-2-1/extract ; touch solr_url_extract.bash
chmod 755 solr_url_extract.bash
Add contents to bash file
#!/bin/bash
curl -o result.csv 'http://localhost:8983/solr/select?q=*:*&fl=tstamp,url&wt=csv'
mv result.csv result.csv.$(date +%Y%m%d.%H%M%S)
Now run the bash script
./solr_url_extract.bash
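The script above renames result.csv even if curl could not reach Solr. One possible more defensive variant is sketched below, not the script used in this movie. The function names stamped_name and extract_csv and the temporary file name are assumptions for illustration; the curl options --fail, --silent and --show-error are standard curl flags.

```shell
#!/bin/bash
# Sketch of a more defensive solr_url_extract.bash (function names are hypothetical).

SOLR_QUERY_URL='http://localhost:8983/solr/select?q=*:*&fl=tstamp,url&wt=csv'

# Print a file name such as result.csv.20130506.124857
stamped_name() {
    printf '%s.%s\n' "$1" "$(date +%Y%m%d.%H%M%S)"
}

# Download to a temporary file and rename it only on success, so a failed
# request never leaves a timestamped file that looks like a good extract.
extract_csv() {
    tmp="result.csv.tmp.$$"
    if curl --fail --silent --show-error -o "$tmp" "$SOLR_QUERY_URL"; then
        target="$(stamped_name result.csv)"
        mv "$tmp" "$target"
        echo "wrote $target"
    else
        rm -f "$tmp"
        echo "extract failed" >&2
        return 1
    fi
}

# Add a call to extract_csv as the last line of the script to run the extract.
```

With --fail, curl returns a non-zero exit code on an HTTP error instead of saving the error page as if it were data.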
Check Output
Now we check whether we have data
ls -l shows
result.csv.20130506.124857
Check the content: wc -l shows 11 lines (one header line plus 10 rows, the Solr default when rows is not set)
Check the content: head -2 shows
tstamp,url
2013-05-04T01:56:58.157Z,http://www.mysite.co.nz/Search? DateRange=7& ...
Congratulations, you have extracted data from Solr
It's in CSV format ready to be loaded into a data warehouse
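The manual checks with wc -l and head could also be scripted before a file is handed to a warehouse load. This is a rough sketch only: check_extract is a made-up helper, it assumes the header line is exactly tstamp,url with no spaces, and the sample row uses a placeholder URL rather than real crawl data.

```shell
#!/bin/sh
# Sketch: sanity-check an extract file before loading it anywhere.

# check_extract FILE   (hypothetical helper)
#   Succeeds when FILE starts with the expected header and has at least one data row.
check_extract() {
    file="$1"
    expected_header='tstamp,url'
    header="$(head -n 1 "$file")"
    if [ "$header" != "$expected_header" ]; then
        echo "unexpected header: $header" >&2
        return 1
    fi
    # Total lines minus the header line gives the number of data rows
    total="$(wc -l < "$file")"
    rows=$((total - 1))
    if [ "$rows" -lt 1 ]; then
        echo "no data rows in $file" >&2
        return 1
    fi
    echo "$file: $rows data rows"
}

# Demonstration with a made-up two-line file (the URL is a placeholder)
sample="$(mktemp)"
printf 'tstamp,url\n2013-05-04T01:56:58.157Z,http://www.example.com/\n' > "$sample"
check_extract "$sample"
rm -f "$sample"
```

Against a real extract the call would look like: check_extract result.csv.20130506.124857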
Possible Next Steps
Choose more fields to extract from data
Allow Nutch crawl to go deeper
Allow Nutch crawl to collect a lot more data
Look at facets in Solr data
Load CSV files into Data Warehouse Staging schema
Next movie will show next step in progress
Contact Us
Feel free to contact us at
www.semtech-solutions.co.nz
We offer IT project consultancy
We are happy to hear about your problems
You pay only for the hours you need
to solve your problems