Spectrum Geocoding for Big Data Version 3.2 Spectrum Geocoding for Big Data User Guide


  • Spectrum™ Geocoding for Big Data Version 3.2

    Spectrum™ Geocoding for Big Data User Guide

  • Table of Contents

    1 - Welcome

    What is Spectrum™ Geocoding for Big Data?
    Spectrum™ Geocoding for Big Data Architecture
    System Requirements and Dependencies

    2 - Hive User-Defined Geocoding Functions

    Hive Variables
    Hive Output Fields
    Geocode UDF
    Reverse Geocode UDF
    GeocodeByKey UDF
    Sample Hive Geocoding Script
    Hive Jobs and Reports Using Zeppelin

    3 - Spark Jobs

    Spark Output Fields
    Geocode
    Reverse Geocode
    Spark API
    Sample Spark API Geocoding Application

    4 - Appendix

    Sample Data
    Geocode Preferences
    Geocode-Level Errors
    Download Permissions

  • 1 - Welcome

    In this section

    What is Spectrum™ Geocoding for Big Data?
    Spectrum™ Geocoding for Big Data Architecture
    System Requirements and Dependencies

  • Welcome

    What is Spectrum™ Geocoding for Big Data?

    Pitney Bowes Spectrum™ Geocoding for Big Data is a toolkit for processing enterprise data for large-scale spatial analysis. Billions of records in a single file can be processed in parallel, using Hive and Apache Spark's cluster processing frameworks, yielding results faster than ever. Data processing that takes weeks with traditional techniques can be completed in a few hours with this product.

    This guide describes the Spark jobs and Hive user-defined functions for the following geocoding operations.

    • Geocode: Accepts an address as input and returns the standardized US or international address, the geocoded point, and attribution.

    • Reverse Geocode: Accepts longitude, latitude, and coordinate system as input and returns address information for the location.

    • Geocode By Key: Accepts a key (such as a PB_KEY) and returns geocoding attributes such as longitude, latitude, or a point geometry.


    Spectrum™ Geocoding for Big Data Architecture


    Spectrum™ Geocoding for Big Data packages its components into an SDK for Big Data platforms such as Hadoop, for use with Spark and Hive.

    SDK provides:

    • Integration APIs for Geocoding (Geocode, Reverse Geocode, Geocode By Key)
    • Input datasets and metadata

    API Types:

    • Pre-built Hive UDF wrappers for Geocoding operations
    • Core Geocoding APIs (security enabled via Kerberos and Apache Sentry for Hive)

    System Requirements and Dependencies

    Spectrum™ Geocoding for Big Data is a collection of jar files that can be deployed to your Hadoop system.

    This product is verified on the following Hadoop distributions.

    • Cloudera 5.12 and 6.0
    • Hortonworks 2.6 and 3.0
    • EMR 5.10
    • MapR 6.0 and above, with MapR Expansion Pack (MEP) 5.0.0

    To use these jar files, you must be familiar with configuring Hadoop in Hortonworks, Cloudera, EMR, or MapR and with developing applications for distributed processing. For more information, refer to the Hortonworks (http://docs.hortonworks.com/index.html), Cloudera (http://www.cloudera.com/documentation.html), EMR (https://aws.amazon.com/documentation/emr/), or MapR (https://mapr.com/docs/) documentation.

    To use the product, the following must be installed on your system:

    For Hive:

    • Hive version 1.2.1 or above

    For a Hive client:

    • Beeline, for example

    For Spark and Zeppelin Notebook:


    • Java JDK version 1.8 or above
    • Hadoop version 2.6.0 or above
    • Spark version 1.6.0 or above (2.0 or above required for MapR and Cloudera 6.0)
    • Zeppelin Notebook is not supported in Cloudera


  • 2 - Hive User-Defined Geocoding Functions

    In this section

    Hive Variables
    Hive Output Fields
    Geocode UDF
    Reverse Geocode UDF
    GeocodeByKey UDF
    Sample Hive Geocoding Script
    Hive Jobs and Reports Using Zeppelin

  • Hive User-Defined Geocoding Functions

    Hive Variables

    The Geocoding UDFs use Hive variables (some of which are required) to set various properties; these take precedence over the corresponding system properties. They can also be set in the optional config map of a UDF query, where they take precedence over both the Hive variables and the system properties.
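    As a sketch of this precedence (the values below are illustrative, not recommendations; the variable name and UDF signature are those documented in this chapter), the same property can be set at the session level and then overridden for a single query:

    ```sql
    -- Session-level Hive variable: overrides any system property.
    set pb.geocoding.error.limit=5;

    -- Per-query config map: overrides the session variable for this query only.
    SELECT geocode("600 Maryland Ave SW, Washington, DC 20002", "USA",
                   map('pb.geocoding.error.limit', '20')) AS result;
    ```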

    Variable: pb.geocoding.config.location
    Examples:
      hdfs:///pb/geocoding/software/resources/config/
      /mapr//pb/geocoding/software/resources/config/
    Location of the directory containing the geocoding configuration. This variable must be set. This directory contains JsonDataConfig.json, which contains the location and information of the installed geocoding reference data.

    Variable: pb.geocoding.binaries.location
    Examples:
      hdfs:///pb/geocoding/software/resources/nativeLibraries/bin/linux64
      /mapr//pb/geocoding/software/resources/nativeLibraries/bin/linux64/
    Location of the directory containing the geocoding libraries. This variable must be set. This directory contains native libraries used by the geocoder.

    Variable: pb.geocoding.preferences.filepath
    Examples:
      hdfs:///pb/geocoding/software/resources/config/geocodePreferences.xml
      /mapr//pb/geocoding/software/resources/config/geocodePreferences.xml
    File path of the geocoding preferences file. This optional file can be edited by advanced users to change the behavior of the geocoder.

    Variable: pb.geocoding.output.fields
    Example: X,Y,formattedStreetAddress,formattedLocationAddress
    Comma-separated list of fields requested from the geocoder. You must either set this variable or set the preferences in the UDF query. For more information about output fields, see Hive Output Fields.

    Variable: pb.download.location
    Example: /pb/downloads
    Note: Use this variable only if reference data was distributed remotely via HDFS or S3.
    Location of the directory where reference data will be downloaded. This path must exist on every data node and on the HiveServer2 node.

    Variable: pb.download.group
    Example: pbdownloads
    Note: Use this variable only if reference data was distributed remotely via HDFS or S3.
    This optional property applies only to POSIX-compliant platforms such as Linux. It specifies the operating system group that should be applied to the downloaded data on the local file system, so that each Hadoop service can update the data when required. This group should be present on all nodes in the cluster, and the operating system user executing the Hadoop service should be a member of this group. For more information, see Download Permissions on page 48.

    Variable: pb.geocoding.error.limit
    Example: 1
    The number of geocoding errors to allow before failing a task for "too many errors". This prevents a task (and thus the job) from continuing in the case of a likely configuration error. The default value is 10.

    Variable: pb.geocoding.error.limit.disabled
    Example: true
    Disables the error limit. All errors will be logged but will not cause the task or job to fail.

    Setting Variables

    Since these are Hive configuration variables, you can set them permanently by editing the hiveserver2-site.xml file in your cluster.

    Example (using HDFS)

    pb.geocoding.config.location=hdfs:///pb/geocoding/software/resources/config/
    pb.geocoding.binaries.location=hdfs:///pb/geocoding/software/resources/nativeLibraries/bin/linux64/
    pb.geocoding.output.fields=X,Y,formattedStreetAddress,formattedLocationAddress
    pb.geocoding.preferences.filepath=hdfs:///pb/geocoding/software/resources/config/geocodePreferences.xml
    pb.download.location=/pb/downloads
    pb.download.group=pbdownloads
    pb.geocoding.error.limit=1

    Example (using MapR)

    pb.geocoding.config.location=/mapr//pb/geocoding/software/resources/config/
    pb.geocoding.binaries.location=/mapr//pb/geocoding/software/resources/nativeLibraries/bin/linux64/
    pb.geocoding.output.fields=X,Y,formattedStreetAddress,formattedLocationAddress
    pb.geocoding.preferences.filepath=/mapr//pb/geocoding/software/resources/config/geocodePreferences.xml
    pb.download.location=/pb/downloads
    pb.download.group=pbdownloads
    pb.geocoding.error.limit=1

    Alternatively you can set them temporarily in each Hive session that you open.

    For example, if you wish to geocode in Beeline you can set the variables in the following way:

    Example (using HDFS)

    set pb.geocoding.config.location=hdfs:///pb/geocoding/software/resources/config/;
    set pb.geocoding.binaries.location=hdfs:///pb/geocoding/software/resources/nativeLibraries/bin/linux64/;
    set pb.geocoding.output.fields=X,Y,formattedStreetAddress,formattedLocationAddress;
    set pb.geocoding.preferences.filepath=hdfs:///pb/geocoding/software/resources/config/geocodePreferences.xml;
    set pb.download.location=/pb/downloads;
    set pb.download.group=pbdownloads;
    set pb.geocoding.error.limit=1;

    Example (using MapR)

    set pb.geocoding.config.location=/mapr//pb/geocoding/software/resources/config/;
    set pb.geocoding.binaries.location=/mapr//pb/geocoding/software/resources/nativeLibraries/bin/linux64/;
    set pb.geocoding.output.fields=X,Y,formattedStreetAddress,formattedLocationAddress;
    set pb.geocoding.preferences.filepath=/mapr//pb/geocoding/software/resources/config/geocodePreferences.xml;
    set pb.download.location=/pb/downloads;
    set pb.download.group=pbdownloads;
    set pb.geocoding.error.limit=1;

    Variables set in the local session override any values in the hiveserver2-site file. This means you can set default values in hiveserver2-site and override them in Beeline when necessary.

    System Variables (Deprecated)

    In earlier releases, these Hive variables were set in the system namespace. That method is deprecated as of version 3.0 and is no longer recommended.
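    Assuming the deprecated form used Hive's system namespace prefix (the system: syntax is Hive's standard system-property namespace; treat the exact legacy form here as an assumption), the old and new styles would compare as follows:

    ```sql
    -- Deprecated (pre-3.0): variable set in the system namespace
    set system:pb.geocoding.config.location=hdfs:///pb/geocoding/software/resources/config/;

    -- Current: plain Hive variable
    set pb.geocoding.config.location=hdfs:///pb/geocoding/software/resources/config/;
    ```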


    Hive Output Fields

    The result of a Geocode UDF or Reverse Geocode UDF is a struct that will look similar to the example below.

    {
      "x": "-73.700257",
      "y": "42.678161",
      "formattedstreetaddress": "350 JORDAN RD",
      "formattedlocationaddress": "TROY, NY 12180-8352",
      "error": null
    }

    To see examples of queries using this struct, see Geocode UDF or Reverse Geocode UDF. To change the fields present in the struct, you must either use the pb.geocoding.output.fields variable or set the preferences in the UDF query.

    Note: For information on custom output fields per country, see the appendix in the Global Geocoding SDK Developer Guide (http://support.pb.com/help/hadoop/ggs/GlobalGeocodingSDK_DeveloperGuide.pdf), which is also available on the Spectrum Spatial for Big Data documentation landing page (http://support.pb.com/help/hadoop/landingpage/index.html).

    Output Fields Variable

    The output fields can be set with the variable pb.geocoding.output.fields, a comma-separated list of fields. This list is case-sensitive, and a mistyped field will return null. The above example can be created from the following statement.

    set pb.geocoding.output.fields=X,Y,formattedStreetAddress,formattedLocationAddress;

    PBKey and Other Custom Fields

    Certain output fields will not be returned unless you edit the geocode preferences. To learn more, see Geocode Preferences on page 46.

    Error Field

    The error field is always returned and does not need to be requested. If the value is null, then no errors occurred during the function's execution. See Geocode-Level Errors on page 47 for explanations of possible errors.
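    Since the error field is always present, it can be used to isolate failed rows for inspection; a minimal sketch, assuming a customers table shaped like the examples in this chapter:

    ```sql
    -- Return only the rows that failed to geocode, with the reason.
    SELECT customer_id, result.error
    FROM customers
    LATERAL VIEW OUTER inline(array(geocode("", address, "", city, state, postcode, country))) result
    WHERE result.error IS NOT NULL;
    ```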


    Geocode UDF

    Description

    The Geocode function accepts an address as an argument and returns geocoding attributes such as longitude, latitude, or a point geometry.

    Note: In order to use this UDF, you must set the required options either as Hive variables or in the optional config map parameter.

    Syntax

    Geocode(String mainAddressLine, String country, [Map config])

    Geocode(String mainAddressLine, String addressLastLine, String country, [Map config])

    Geocode(String addressNumber, String streetName, String unitValue, String areaName3, String areaName1, String postCode1, String country, [Map config])

    Geocode(Map input, [Map config])

    Parameters

    Address Fields

    Parameter Type Description

    mainAddressLine String the street address or a single line address

    addressLastLine String the location level information like city and postal code

    addressNumber String the house number information

    streetName String the street name

    unitType String the unit type (such as Apt. or Suite)

    unitValue String the unit value


    areaName1 String the largest geographic division in an area, such as a state, province, or region

    areaName2 String the secondary geographic division in an area, such as a county or district

    areaName3 String the city or town name

    areaName4 String the smallest geographic division in an area, such as a city subdivision or locality

    postCode1 String the primary postal code

    postCode2 String the secondary postal code

    placeName String the name of a business, building, or location

    country String the ISO-3166 two- or three-character abbreviation for the country, or the country name

    Input and Configuration Maps

    Parameter Type Description

    input Map key-value pairs of address fields and input values. The keys must be constant values from the list of address fields in the table above. The values can be either constants or column references. Maps can be constructed with Hive's map() complex type constructor (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-ComplexTypeConstructors).


    config Map can be used for setting both Hive properties and Preferences

    Hive Properties

    Options that set the Hive Variables on page 8.

    Note: Any values specified here will override values set as Hive properties.

    All values must be constant. If a key is present but has a null value, the property is cleared.

    Preferences

    settings that control the matching and geocoding criteria, return values and fields

    Note: Any values specified here will override values set in the geocode preferences file.

    All values must be constant. If a key is present but has a null value, the preference is cleared.

    For more information, see Geocode Preferences on page 46.

    Note: The function may time out when using a large number of datasets that are stored in remote locations (such as HDFS and S3). If you are using Hive with the MapReduce engine, you can adjust the value of the mapreduce.task.timeout property.
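    For example, when running Hive on the MapReduce engine, the timeout could be raised for the session before geocoding (the value is in milliseconds; 1800000 here is an illustrative choice, not a recommended setting):

    ```sql
    -- Allow up to 30 minutes per task while remote reference data is downloaded.
    set mapreduce.task.timeout=1800000;
    ```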

    Return Values

    This function returns struct values described in Hive Output Fields on page 11.

    Examples

    In order to use this UDF, you must set the required options either as Hive variables or in the optional config map parameter.

    All the examples below are using the following output fields: X, Y, formattedStreetAddress, formattedLocationAddress

    Single Geocode returning struct

    All these queries return the same response, a struct containing all the requested output fields.

    SELECT geocode("600 Maryland Ave SW, Washington, DC 20002", "USA") AS result;

    SELECT geocode(map("mainAddressLine", "600 Maryland Ave SW", "addressLastLine", "Washington, DC 20002", "country", "USA")) AS result;

    Output:


    Result

    {"x":"-77.020707","y":"38.886115","formattedstreetaddress":"600 MARYLAND AVE SW","formattedlocationaddress":"WASHINGTON, DC 20024-2520","error":null}

    Single Geocode returning a single row of columns

    Use a built-in, table-generating function called inline to turn the struct into columns.

    SELECT inline(array(geocode("600 Maryland Ave SW, Washington, DC 20002", "USA")));

    Output:

    X           Y          Street Address       Location Address           Error
    -77.020707  38.886115  600 MARYLAND AVE SW  WASHINGTON, DC 20024-2520  null

    Single Geocode returning particular fields

    SELECT result.x AS x, result.y AS y FROM (SELECT inline(array(geocode("600 Maryland Ave SW, Washington, DC 20002", "USA")))) result;

    Output:

    X Y

    -77.020707 38.886115

    Table Geocode with struct

    SELECT customer_id, customer_name, customer_phone, geocode("", t.address, "", t.city, t.state, t.postcode, t.country) AS geocoded_result FROM customers t;

    Output:

    ID          Name                 Phone               Result

    0000000001  John Michael Dorian  +(1)-(555)-6647575  {"x":"-79.794777","y":"34.191205","formattedstreetaddress":"1205 W PALMETTO ST","formattedlocationaddress":"FLORENCE, SC 29501-4131","pb_key":"P0000KPW08AT","error":null}

    0000000002  Elliot Reid          +(1)-(555)-5406548  {"x":"-122.393078","y":"37.753697","formattedstreetaddress":"1107 PENNSYLVANIA AVE","formattedlocationaddress":"SAN FRANCISCO, CA 94107-3413","pb_key":"P00002T1L8H7","error":null}

    0000000003  Jan Itor             +(1)-(555)-3925210  {"x":"-76.30457","y":"40.045757","formattedstreetaddress":"478 N DUKE ST","formattedlocationaddress":"LANCASTER, PA 17602-4967","pb_key":"P0000JCCUDFI","error":null}

    Table Geocode

    SELECT customer_id, customer_name, customer_phone, result.* FROM customers LATERAL VIEW OUTER inline(array(geocode("", address, "", city, state, postcode, country))) result;

    Output:

    ID          Name                 Phone               X            Y          Street Address         Location Address              Error

    0000000001  John Michael Dorian  +(1)-(555)-6647575  -79.794777   34.191205  1205 W PALMETTO ST     FLORENCE, SC 29501-4131       null

    0000000002  Elliot Reid          +(1)-(555)-5406548  -122.393078  37.753697  1107 PENNSYLVANIA AVE  SAN FRANCISCO, CA 94107-3413  null

    0000000003  Jan Itor             +(1)-(555)-3925210  -76.30457    40.045757  478 N DUKE ST          LANCASTER, PA 17602-4967      null

    Table Geocode into new table

    CREATE TABLE customers_geocoded AS
    SELECT customer_id, customer_name, customer_phone, result.x, result.y,
      result.formattedStreetAddress, result.formattedLocationAddress
    FROM customers
    LATERAL VIEW OUTER inline(array(geocode("", address, "", city, state, postcode, country))) result;

    Table Geocode using optional config map syntax

    SELECT geocode(
      map('mainAddressLine', t.address, 'areaName3', t.city, 'areaName1', t.state,
          'postCode1', t.zipcode, 'country', 'USA'),
      map('pb.geocoding.config.location', 'hdfs:///pb/geocoding/software/resources/config',
          'pb.geocoding.binaries.location', 'hdfs:///pb/geocoding/software/resources/nativeLibraries/bin/linux64/',
          'pb.geocoding.output.fields', 'formattedStreetAddress,formattedLocationAddress,precisionCode,PB_KEY',
          'matchMode', 'RELAXED',
          'returnAllCandidateInfo', 'true')
    ) FROM myTable t;


    Reverse Geocode UDF

    Description

    The ReverseGeocode function accepts X, Y, and coordinateSystem, or a point, as arguments and returns address information for that location, along with address attribution.

    Note: In order to use this UDF, you must set the required options either as Hive variables or in the optional config map parameter.

    Syntax

    ReverseGeocode(double X, double Y, [String coordinateSystem], [Map config])

    ReverseGeocode(double X, double Y, [Map config])

    ReverseGeocode(WritableGeometry point, [Map config])

    Parameters

    Parameter Type Description

    X Double Longitude of the address location in degrees.

    Y Double Latitude of the address location in degrees.

    coordinateSystem String The coordinate system that you want to convert the geometry to. The format must be the European Petroleum Survey Group (EPSG) code or the SRID code

    If you use a signature that only uses numeric X and Y coordinates, the coordinate system can be set in the geocodePreferences.xml file. If no value is set, then the default coordinate system is WGS84 (EPSG:4326).

    point WritableGeometry The WritableGeometry will most likely be the output of a different spatial function, such as ST_Point.


    config Map can be used for setting both Hive properties and Preferences

    Hive Properties

    Options that set the Hive Variables on page 8.

    Note: Any values specified here will override values set as Hive properties.

    All values must be constant. If a key is present but has a null value, the property is cleared.

    Preferences

    settings that control the matching and geocoding criteria, return values and fields

    Note: Any values specified here will override values set in the geocode preferences file.

    All values must be constant. If a key is present but has a null value, the preference is cleared.

    For more information, see Geocode Preferences on page 46.

    Note: The function may time out when using a large number of datasets that are stored in remote locations (such as HDFS and S3). If you are using Hive with the MapReduce engine, you can adjust the value of the mapreduce.task.timeout property.

    Return Values

    This function returns struct values described in Hive Output Fields on page 11.

    Examples

    In order to use this UDF, you must set the required options either as Hive variables or in the optional config map parameter.

    All the examples below are using the following output fields: formattedStreetAddress, areaName3, areaName1, postCode1, country

    Single ReverseGeocode returning struct

    This example does not specify the coordinate system in the query nor in the geocodePreferences.xml file. Therefore the default coordinate system is used, WGS84 (EPSG:4326).

    SELECT ReverseGeocode(-77.020707, 38.886115) AS result;

    This example passes coordinates that are not in the default coordinate system, so the coordinate system must be given:

    SELECT ReverseGeocode(-8573905.88, 4705371.63, "EPSG:3857") AS result;


    This example uses a function from the Location Intelligence for Big Data product to generate a WritableGeometry.

    SELECT ReverseGeocode(FromGeoJSON('{"type": "Point", "coordinates": [-77.020707, 38.886115]}')) AS result;

    All these queries return the same response, a struct containing all the requested output fields:

    Result

    {"formattedstreetaddress":"600 MARYLAND AVE SW","areaname3":"WASHINGTON","areaname1":"DC","postcode1":"20024","country":"USA","error":null}

    Single ReverseGeocode returning a single row of columns

    Use a built-in, table-generating function called inline to turn the struct into columns.

    SELECT inline(array(ReverseGeocode(-74.036008, 40.721673)));

    Output:

    Street Address       Area Name 3  Area Name 1  Postcode  Country  Error

    600 MARYLAND AVE SW  WASHINGTON   DC           20024     USA      null

    Single ReverseGeocode returning particular fields

    SELECT result.areaname3, result.areaname1, result.postCode1, result.country FROM (SELECT inline(array(ReverseGeocode(-77.020707, 38.886115)))) result;

    Output:

    Area Name 3 Area Name 1 Postcode Country

    WASHINGTON DC 20024 USA

    Table ReverseGeocode with struct

    SELECT customer_id, customer_name, ReverseGeocode(customers.x, customers.y) AS reverse_geocoded_result FROM customers;

    Output:


    ID          Name              Result

    0000000001  Kevin Ball        {"formattedstreetaddress":"2113 S HOMAN AVE","areaname3":"CHICAGO","areaname1":"IL","postcode1":"60623","country":"USA","error":null}

    0000000002  Fiona Gallagher   {"formattedstreetaddress":"2119 S HOMAN AVE","areaname3":"CHICAGO","areaname1":"IL","postcode1":"60623","country":"USA","error":null}

    0000000003  Mickey Milkovich  {"formattedstreetaddress":"1955 S TRUMBULL AVE","areaname3":"CHICAGO","areaname1":"IL","postcode1":"60623","country":"USA","error":null}

    Table ReverseGeocode

    SELECT customer_id, customer_name, result.* FROM customers LATERAL VIEW OUTER inline(array(ReverseGeocode(customers.x, customers.y))) result;

    Output:

    ID          Name              Street Address       Area Name 3  Area Name 1  Postcode  Country  Error

    0000000001  Kevin Ball        2113 S HOMAN AVE     CHICAGO      IL           60623     USA      null

    0000000002  Fiona Gallagher   2119 S HOMAN AVE     CHICAGO      IL           60623     USA      null

    0000000003  Mickey Milkovich  1955 S TRUMBULL AVE  CHICAGO      IL           60623     USA      null

    Table ReverseGeocode into new table

    CREATE TABLE customers_reverse_geocoded AS
    SELECT customer_id, customer_name, result.formattedStreetAddress, result.areaName3,
      result.areaName1, result.postCode1, result.country
    FROM customers
    LATERAL VIEW OUTER inline(array(ReverseGeocode(customers.x, customers.y))) result;

    Table ReverseGeocode using optional config map syntax

    SELECT ReverseGeocode(t.x, t.y,
      map('pb.geocoding.config.location', 'hdfs:///pb/geocoding/software/resources/config',
          'pb.geocoding.binaries.location', 'hdfs:///pb/geocoding/software/resources/nativeLibraries/bin/linux64/',
          'pb.geocoding.output.fields', 'formattedStreetAddress,formattedLocationAddress,PB_KEY',
          'returnAllCandidateInfo', 'true')
    ) FROM myTable t;


    GeocodeByKey UDF

    Description

    The GeocodeByKey function accepts a key (such as a PB_KEY) and returns geocoding attributes such as longitude, latitude, or a point geometry (the same results that are produced by the Geocode function).

    Note: In order to use this UDF, you must set the required options either as Hive variables or in the optional config map parameter.

    Syntax

    GeocodeByKey(String key, [String type], String country, [Map config])

    Parameters

    Parameter Type Description

    key String the unique identifier for the address.

    type String Optional. The type of key for a lookup. The default value is PB_KEY.

    country String the ISO-3166 two- or three-character abbreviation for the country of the address, or country name


    config Map can be used for setting both Hive properties and Preferences

    Hive Properties

    Options that set the Hive Variables on page 8.

    Note: Any values specified here will override values set as Hive properties.

    All values must be constant. If a key is present but has a null value, the property is cleared.

    Preferences

    settings that control the matching and geocoding criteria, return values and fields

    Note: Any values specified here will override values set in the geocode preferences file.

    All values must be constant. If a key is present but has a null value, the preference is cleared.

    For more information, see Geocode Preferences on page 46.

    Note: The function may time out when using a large number of datasets that are stored in remote locations (such as HDFS and S3). If you are using Hive with the MapReduce engine, you can adjust the value of the mapreduce.task.timeout property.

    Return Values

    This function returns struct values described in Hive Output Fields on page 11.

    Examples

    In order to use this UDF, you must set the required options either as Hive variables or in the optional config map parameter.

    All the examples below are using the following output fields: X, Y, formattedStreetAddress, formattedLocationAddress

    Single GeocodeByKey returning struct

    All these queries return the same response, a struct containing all the requested output fields.

    SELECT GeocodeByKey("P000053KV1L4", "PB_KEY", "USA") AS result;

    Output:


    Result

    {"x":"-77.020707","y":"38.886115","formattedstreetaddress":"600 MARYLAND AVE SW","formattedlocationaddress":"WASHINGTON, DC 20024-2520","error":null}

    Single GeocodeByKey returning a single row of columns

    Use a built-in, table-generating function called inline to turn the struct into columns.

    SELECT inline(array(GeocodeByKey("P000053KV1L4", "PB_KEY", "USA")));

    Output:

    X           Y          Street Address       Location Address           Error
    -77.020707  38.886115  600 MARYLAND AVE SW  WASHINGTON, DC 20024-2520  null

    Single GeocodeByKey returning particular fields

    SELECT result.x AS x, result.y AS y FROM (SELECT inline(array(GeocodeByKey("P000053KV1L4", "PB_KEY", "USA")))) result;

    Output:

    X Y

    -77.020707 38.886115

    Table GeocodeByKey with struct

    SELECT customer_id, customer_name, customer_phone, GeocodeByKey(t.key, t.country) AS geocoded_result FROM customers t;

    Output:

    ID          Name                 Phone               Result

    0000000001  John Michael Dorian  +(1)-(555)-6647575  {"x":"-79.794777","y":"34.191205","formattedstreetaddress":"1205 W PALMETTO ST","formattedlocationaddress":"FLORENCE, SC 29501-4131","pb_key":"P0000KPW08AT","error":null}

    0000000002  Elliot Reid          +(1)-(555)-5406548  {"x":"-122.393078","y":"37.753697","formattedstreetaddress":"1107 PENNSYLVANIA AVE","formattedlocationaddress":"SAN FRANCISCO, CA 94107-3413","pb_key":"P00002T1L8H7","error":null}

    0000000003  Jan Itor             +(1)-(555)-3925210  {"x":"-76.30457","y":"40.045757","formattedstreetaddress":"478 N DUKE ST","formattedlocationaddress":"LANCASTER, PA 17602-4967","pb_key":"P0000JCCUDFI","error":null}

    Table GeocodeByKey

    SELECT customer_id, customer_name, customer_phone, result.* FROM customers LATERAL VIEW OUTER inline(array(GeocodeByKey("P000053KV1L4", "PB_KEY", "USA"))) result;

    Output:

    ID          Name                 Phone               X            Y          Street Address         Location Address              Error

    0000000001  John Michael Dorian  +(1)-(555)-6647575  -79.794777   34.191205  1205 W PALMETTO ST     FLORENCE, SC 29501-4131       null

    0000000002  Elliot Reid          +(1)-(555)-5406548  -122.393078  37.753697  1107 PENNSYLVANIA AVE  SAN FRANCISCO, CA 94107-3413  null

    0000000003  Jan Itor             +(1)-(555)-3925210  -76.30457    40.045757  478 N DUKE ST          LANCASTER, PA 17602-4967      null

    Table GeocodeByKey into new table

CREATE TABLE customers_geocoded AS SELECT customer_id, customer_name, customer_phone, result.x, result.y, result.formattedStreetAddress, result.formattedLocationAddress FROM customers LATERAL VIEW OUTER inline(array(GeocodeByKey(key, country))) result;

    Table GeocodeByKey using optional config map syntax

SELECT GeocodeByKey(t.key, t.country, map('pb.geocoding.config.location', 'hdfs:///pb/geocoding/software/resources/config', 'pb.geocoding.binaries.location', 'hdfs:///pb/geocoding/software/resources/nativeLibraries/bin/linux64/', 'pb.geocoding.output.fields', 'X,Y,formattedStreetAddress,formattedLocationAddress')) FROM myTable t;


    Sample Hive Geocoding Script

    Note: This topic assumes the product is installed to /pb/geocoding/software as described in the Spectrum™ Geocoding for Big Data Install Guide.

    A SQL script is available that, when run in a Hive session, demonstrates multipass geocoding against the addresses.csv file in /sampleData/DC/. For all first-pass results with the suboptimal precision code of Z1, another geocoding pass will be performed as a single line address to improve the results.

    The script is located at: /hive/examples/MultipassGeocoding.sql

    From a Hive session (for example, using the Beeline client), run the script using the following command:

    !run /pb/geocoding/software/hive/examples/MultipassGeocoding.sql

    Multipass Geocoding Script

-- This script assumes the data and product are installed in the locations specified in the Geocoding Install Guide

-- The following creates a table and then populates it with the data from the local file system
CREATE TABLE IF NOT EXISTS PB_multipassgeocodeaddress(street String, city String, state String, postal_code String) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
LOAD DATA LOCAL INPATH '/pb/geocoding/software/sampleData/DC/addresses.csv' OVERWRITE INTO TABLE PB_multipassgeocodeaddress;

-- Create a temporary function that uses the geocoder
CREATE TEMPORARY FUNCTION Geocode AS 'com.pb.bigdata.geocoding.hive.Geocode';

-- Set the needed geocoding parameters; notice we are storing the config and binaries in HDFS instead of the local file system
SET pb.geocoding.output.fields=precisionCode,formattedStreetAddress,country;
SET pb.geocoding.config.location=hdfs:///pb/geocoding/software/resources/config/;
SET pb.geocoding.binaries.location=hdfs:///pb/geocoding/software/resources/nativeLibraries/bin/linux64/;

-- Set where the data should go
SET pb.download.location=/pb/downloads;

-- Now we can geocode the table
CREATE TABLE PB_firstpass AS SELECT street, city, postal_code, result.precisionCode, result.formattedStreetAddress FROM PB_multipassgeocodeaddress LATERAL VIEW OUTER inline(array(Geocode("", street, "", city, "", postal_code, "USA"))) result;

-- Display the results
SELECT * FROM PB_firstpass;

-- Try to get a better result for all "Z1" geocodes by running as single line
CREATE TABLE PB_secondpass AS SELECT street, city, postal_code, result.precisionCode, result.formattedStreetAddress FROM PB_firstpass LATERAL VIEW OUTER inline(array(Geocode(CONCAT(street, " ", city, " ", postal_code), "USA"))) result WHERE PB_firstpass.precisioncode="Z1";

-- Display the results
SELECT * FROM PB_secondpass;


    Hive Jobs and Reports Using Zeppelin

Apache Zeppelin is an open source, web-based notebook that allows you to run Hive jobs and analyze the data interactively. To get set up with Zeppelin, see the Zeppelin Quick Start page at http://zeppelin.apache.org/docs/0.7.2/install/install.html.

    Note: Zeppelin is supported in Hortonworks and EMR. It is not supported on Cloudera.

We provide a sample geocoding JSON file that can be imported into Zeppelin. Once imported, a notebook is created that provides Hive queries for geocoding addresses and generating a summary report for geocoded addresses. The code and relevant information are provided in the notebook.

To access Zeppelin, from the Ambari interface, go to Zeppelin > Quick Links > Zeppelin UI.

The Zeppelin sample JSON file is located in the /zeppelin directory where you extracted the spectrum-bigdata-geocoding-version.zip file.

The process involves running a sequence of paragraphs that contain the instructions for the job. You may run all paragraphs in the order given, or one paragraph at a time. The paragraph order is important: if you run them individually, run them in the order provided.

    Geocoding Addresses

    This notebook has Hive queries that you can use to create a Hive input address table, geocode the input addresses stored in the input table, and create a geocoded output address table. Modify the Hive queries as per the required schema of the input and output table.

    Geocoding Summary Reports

    This notebook is used to generate geocoding metrics as summary reports over an input table of geocoded addresses. It provides these summary reports based on whether the input geocoding reference data is international or US-specific.

    There are two categories of reports:

    International (one or more countries):

    • Location Quality Summary - This report gives the geocoded address location precision as a count and percentage. It is based on the precisionCode output field of a geocoded address and used to analyze the location precision of geocoded international single country or mixed country addresses.

    • Records Changed Summary - This report gives the count and percentage of addresses needing correction. It is based on the precisionCode output field of a geocoded address. It can be used to generate metrics for geocoded international single country or mixed country addresses.

• PB Key Associations - This report gives the count and percentage of addresses that matched with a PB_KEY. It is based on the PB_KEY output field of a geocoded address. It can be used to generate metrics for geocoded addresses that have MLD datasets.

    US:


    • Address Match Summary - This report gives the count and percentage of addresses meeting the USPS standards. It is based on the MATCH_CODE output field of a geocoded address.

    • Census Block Stats - This report uses Census block ID output fields for geocoded addresses to determine the count and percentage of addresses assigned to a tract, block group, and block.
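Both report categories reduce to counting the values of a single output field and expressing each count as a percentage of the total. As a minimal sketch (the precision code values below are illustrative, not taken from real output), a Location Quality style summary could be computed like this:

```python
from collections import Counter

# Illustrative precisionCode values; real ones come from the precisionCode
# output field of each geocoded address in the input table.
precision_codes = ["S8", "S8", "S5", "Z1", "S8", "Z1", "S5", "S8"]

counts = Counter(precision_codes)
total = len(precision_codes)

# Count and percentage per precision code, highest count first.
for code, count in counts.most_common():
    print(f"{code}: {count} ({100 * count / total:.1f}%)")
# S8: 4 (50.0%)
# S5: 2 (25.0%)
# Z1: 2 (25.0%)
```

The same pattern applies to the other reports: swap in MATCH_CODE, PB_KEY presence, or Census block ID fields as the value being counted.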


3 - Spark Jobs

To use Spark jobs:

• the input and output must both be in CSV format
• installation must be completed as outlined in the Spectrum™ Geocoding for Big Data Install Guide on the Spectrum Spatial for Big Data documentation landing page

    Spark drivers for both versions of Spark (1.0 and 2.0) are pre-built and delivered with the Spectrum™ Geocoding for Big Data toolkit. The drivers are also available as a Gradle project that you can build and customize to meet your specific needs via the Spark API.

    Note: Building drivers requires an Internet connection.

    In this section

    Spark Output Fields 29 Geocode 30 Reverse Geocode 35 Spark API 40 Sample Spark API Geocoding Application 40


    Spark Output Fields

    The result of a Geocode or Reverse Geocode Spark Job is a CSV file with all columns from the input plus additional requested output fields. To change the output fields, you must use the geocoding-output-fields parameter.

    Note: For information on custom output fields per country, see the appendix in the Global Geocoding SDK Developer Guide, which is also available on the Spectrum Spatial for Big Data documentation landing page.

    Output Fields Parameter

Set the geocoding-output-fields parameter to a space-separated list of fields. For example:

spark-submit
--class com.pb.bigdata.geocoding.spark.app.GeocodeDriver
--master yarn --deploy-mode cluster
/pb/geocoding/software/spark2/driver/spectrum-bigdata-geocoding-spark2drivers-version-all.jar
--input /user/pbuser/customers/addresses.csv
--output /user/pbuser/customers_geocoded
--geocoding-output-fields x y formattedStreetAddress formattedLocationAddress
--geocoding-config-location hdfs:///pb/geocoding/software/resources/config/
--geocoding-binaries-location hdfs:///pb/geocoding/software/resources/nativeLibraries/bin/linux64/
--download-location /pb/downloads
--geocoding-preferences-filepath hdfs:///pb/geocoding/software/resources/config/geocodePreferences.xml
--geocoding-input-fields streetName=0 areaName3=1 areaName1=2 postCode1=3

    This list is case-sensitive and a mistyped field will return a column of null values.

    PBKey and Other Custom Fields

    Certain output fields will not be returned unless you edit the geocodePreferences.xml file. This file path is set using the geocoding-preferences-filepath command line parameter. To learn more, see Geocode Preferences on page 46.

    Errors

All input records will go to the output. If a row errors, it will still be written to the --output location. In the case where no candidates are found, you will get null values for all geocoding output fields. Errored rows also get null values for all geocoding output fields; you can add the error field to your output to see any error information. See Geocode-Level Errors on page 47 for explanations of possible errors.
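As an example of downstream handling, a consumer of the output can separate errored rows from successful ones by checking the error column. The Python sketch below uses an inline sample with made-up column names; real output columns depend on the fields you requested:

```python
import csv
import io

# Inline stand-in for a geocoded output CSV; the "error" column is present
# because the job was run with --error-field error (column names illustrative).
output_csv = """name,x,y,error
Richard Hendricks,-118.568145,34.165625,
Bad Record,,,Geocoding error: address could not be parsed
"""

rows = list(csv.DictReader(io.StringIO(output_csv)))
geocoded = [r for r in rows if not r["error"]]  # rows with an empty error column
failed = [r for r in rows if r["error"]]        # rows carrying an error message

print(len(geocoded), len(failed))  # 1 1
```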


    Geocode

This application is a Spark job that processes an entire book of business to assign a geocode location for each property. It appends geocode attributes to an input CSV file of addresses.

    For example, the input is a CSV file representing the following customer data and addresses:

    Name Address City State Zip Country

    Richard Hendricks 5230 Penfield Ave Woodland Hills CA 91364 USA

    Peter Gregory 54 State St #701 Albany NY 12207 USA

    Russ Hanneman 3 Comma St Warwick RI 02889 USA

The Spark job is then run with the requested geocoding-output-fields: x and y.

    The output is a CSV with the input columns and requested output columns:

Name               Address            City            State  Zip    Country  X            Y

Richard Hendricks  5230 Penfield Ave  Woodland Hills  CA     91364  USA      -118.568145  34.165625

Peter Gregory      54 State St #701   Albany          NY     12207  USA      -73.752      42.64892

Russ Hanneman      3 Comma St         Warwick         RI     02889  USA      -71.383139   41.713852

    Job Commands

    To run the Spark job you must use the spark-submit script in Spark’s bin directory.

    The following are example Spark commands. Use the command that's appropriate for your version of Spark.


    Example 1 (Spark 1.0 using HDFS)

spark-submit
--class com.pb.bigdata.geocoding.spark.app.GeocodeDriver
--master yarn --deploy-mode cluster
/pb/geocoding/software/spark1/driver/spectrum-bigdata-geocoding-spark1-version-all.jar
--input /user/pbuser/customers/addresses.csv
--output /user/pbuser/customers_geocoded
--geocoding-output-fields x y
--geocoding-config-location hdfs:///pb/geocoding/software/resources/config/
--geocoding-binaries-location hdfs:///pb/geocoding/software/resources/nativeLibraries/bin/linux64/
--download-location /pb/downloads
--geocoding-preferences-filepath hdfs:///pb/geocoding/software/resources/config/geocodePreferences.xml
--geocoding-input-fields streetName=0 areaName3=1 areaName1=2 postCode1=3
--num-partitions=15
--combine

    Example 2 (Spark 2.0 using MapR)

spark-submit
--class com.pb.bigdata.geocoding.spark.app.GeocodeDriver
--master yarn --deploy-mode cluster
/pb/geocoding/software/spark2/driver/spectrum-bigdata-geocoding-spark2-version-all.jar
--input /user/pbuser/customers/addresses.csv
--output /user/pbuser/customers_geocoded
--geocoding-output-fields x y
--geocoding-config-location /mapr//pb/geocoding/software/resources/config/
--geocoding-binaries-location /mapr//pb/geocoding/software/resources/nativeLibraries/bin/linux64/
--geocoding-preferences-filepath /mapr//pb/geocoding/software/resources/config/geocodePreferences.xml
--geocoding-input-fields streetName=0 areaName3=1 areaName1=2 postCode1=3
--num-partitions=15
--combine

    Custom job arguments

    All parameters are declared with a double dash.

    Parameter Example

--input    --input /user/pbuser/customers/addresses.csv

The location of the input CSV file.

    --output --output /user/pbuser/customers_geocoded

    The location of the directory for the output, including all input columns along with the fields requested in the geocoding-output-fields parameter as well as any errors.


    Parameter Example

    --geocoding-output-fields --geocoding-output-fields x y

    The requested geocode fields to be included in the output. For more information, see Spark Output Fields on page 29.

--geocoding-config-location

Location of the directory containing geocoding configurations. This property must be set. This directory contains JsonDataConfig.json, which contains the location and information of the installed geocoding reference data.

Examples:
--geocoding-config-location hdfs:///pb/geocoding/software/resources/config/
--geocoding-config-location /mapr//pb/geocoding/software/resources/config/

--geocoding-binaries-location

Location of the directory containing geocoding libraries. This property must be set.

Examples:
--geocoding-binaries-location hdfs:///pb/geocoding/software/resources/nativeLibraries/bin/linux64
--geocoding-binaries-location /mapr//pb/geocoding/software/resources/nativeLibraries/bin/linux64

--geocoding-preferences-filepath

File path of the geocoding preferences file. This optional file can be edited by advanced users to change the behavior of the geocoder.

Examples:
--geocoding-preferences-filepath hdfs:///pb/geocoding/software/resources/config/geocodePreferences.xml
--geocoding-preferences-filepath /mapr//pb/geocoding/software/resources/config/geocodePreferences.xml

    --geocoding-input-fields

Address field mappings for the input, specified in mixed or camelCase form.

    Note: For detailed information on input fields, see the Global Geocoding SDK Developer Guide, which is also available on the Spectrum Spatial for Big Data documentation landing page.

• Specifying individual address fields by input column index:

--geocoding-input-fields streetName=0 areaName3=1 areaName1=2 postCode1=3

• Using column names from the input CSV file (requires a header in the CSV file and setting --csv header=true):

--geocoding-input-fields streetName=street areaName3=city areaName1=state postCode1=zip

• Specifying input as a single line, where multiple input CSV fields are concatenated into one address field:

--geocoding-input-fields mainAddressLine=0,1,2,3
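The difference between per-field mappings and a single-line mapping can be pictured with a small sketch. The dictionary-building code below is purely illustrative; it is not how the driver implements the mapping:

```python
# One input CSV row (columns: street, city, state, zip)
row = ["5230 Penfield Ave", "Woodland Hills", "CA", "91364"]

# --geocoding-input-fields streetName=0 areaName3=1 areaName1=2 postCode1=3
# maps each column index to its own address field
by_index = {"streetName": row[0], "areaName3": row[1],
            "areaName1": row[2], "postCode1": row[3]}

# --geocoding-input-fields mainAddressLine=0,1,2,3
# concatenates the referenced columns into one single-line address field
single_line = {"mainAddressLine": " ".join(row[i] for i in (0, 1, 2, 3))}

print(by_index["streetName"])          # 5230 Penfield Ave
print(single_line["mainAddressLine"])  # 5230 Penfield Ave Woodland Hills CA 91364
```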


    Parameter Example

    Optional Parameters

    --download-location --download-location /pb/downloads

    Location of the directory where reference data will be downloaded to. This path must exist on every data node and the HiveServer2 node.

    Note: Use only if reference data was distributed remotely via HDFS or S3.

    --geocoding-country --geocoding-country USA

    If your input data does not have country information then you can specify the country as a parameter. Alternatively, you can use a column reference in --geocoding-input-fields (for example: --geocoding-input-fields country=2)

    --overwrite --overwrite

    Including this parameter will tell the job to overwrite the output directory. Otherwise the job will fail if this directory already has content. This parameter does not have a value.

    --num-partitions --num-partitions=15

    The minimum number of partitions used to split up the input file.

--download-group    --download-group pbdownloads

    This property is only used for POSIX-compliant platforms like Linux. It specifies the operating system group which should be applied to the downloaded data on a local file system, so that each Hadoop service can update the data when required. This group should be present on all nodes in the cluster and the operating system user executing the Hadoop service should be a part of this group.

    For more information, see Download Permissions on page 48.

    Note: Use only if reference data was distributed remotely via HDFS or S3.


    Parameter Example

    --error-field --error-field error

    Add the error field to your output to see any error information.

    --combine --combine

Including this parameter will tell the job to combine all output files into a single output file. Otherwise the job will create multiple output files; the number of output files depends on the number of partitions specified by the user.

    Note: As the size of the data to be combined grows, especially past the size of the space available on a single node, there is a chance of getting errors.

    CSV Options

    Any standard Databricks CSV entry can also be used as needed, including the following options:

    CSV Option Example

--csv delimiter    --csv delimiter=,  or  --csv delimiter='\t'

Default is the Databricks CSV default: ,

    --csv quote --csv quote=\'

    Default is the Databricks CSV default: "

    --csv escape --csv escape=\'

    Default is the Databricks CSV default: \

    --csv header --csv header=true

Default is the Databricks CSV default: false.
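As an illustration of how the delimiter and quote options interact, the following sketch parses one line with the same delimiter and quote defaults as in the table above. Python's csv module is used only as an analogy; the Spark job itself parses with the Databricks CSV reader:

```python
import csv
import io

# A quoted field may contain the delimiter; the quote character protects it.
line = '"Hendricks, Richard","5230 Penfield Ave",Woodland Hills,CA\n'

reader = csv.reader(io.StringIO(line), delimiter=",", quotechar='"')
fields = next(reader)
print(fields)
# ['Hendricks, Richard', '5230 Penfield Ave', 'Woodland Hills', 'CA']
```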


    Reverse Geocode

This application is a Spark job that processes longitude and latitude coordinates and returns an address and other geocode attributes. It appends these attributes to an input CSV file of coordinates.

    For example, the input is a CSV file representing the following customer data and coordinates:

    Name X Y

    James McGraw -83.687971 43.011972

    Jack Rackham -74.023676 41.923198

    Edward Teach -71.369907 42.459512

    Then the Spark job is run with the requested geocoding-output-fields: formattedStreetAddress, areaName3, areaName1, and postCode1.

    The output is a CSV with the input columns and requested columns:

Name          X           Y          Street Address    Area Name 3  Area Name 1  Postcode

James McGraw  -83.687971  43.011972  900 S SAGINAW ST  FLINT        MI           48502

Jack Rackham  -74.023676  41.923198  26 VALENTINE AVE  KINGSTON     NY           12401

Edward Teach  -71.369907  42.459512  63 WOOD ST        CONCORD      MA           01742

    Job Commands

    To run the Spark job you must use the spark-submit script in Spark’s bin directory.

    The following are example Spark commands. Use the command that's appropriate for your version of Spark.


    Example 1 (Spark 1.0 using HDFS)

spark-submit
--class com.pb.bigdata.geocoding.spark.app.ReverseGeocodeDriver
--master yarn --deploy-mode cluster
/pb/geocoding/software/spark1/driver/spectrum-bigdata-geocoding-spark1-version-all.jar
--input /user/pbuser/customers/locations.csv
--output /user/pbuser/customers_addresses
--geocoding-output-fields formattedStreetAddress areaName3 areaName1 postCode1 country
--geocoding-config-location hdfs:///pb/geocoding/software/resources/config/
--geocoding-binaries-location hdfs:///pb/geocoding/software/resources/nativeLibraries/bin/linux64/
--download-location /pb/downloads
--geocoding-preferences-filepath hdfs:///pb/geocoding/software/resources/config/geocodePreferences.xml
--geocoding-input-fields x=0 y=1
--combine

    Example 2 (Spark 2.0 using MapR)

spark-submit
--class com.pb.bigdata.geocoding.spark.app.ReverseGeocodeDriver
--master yarn --deploy-mode cluster
/pb/geocoding/software/spark2/driver/spectrum-bigdata-geocoding-spark2-version-all.jar
--input /user/pbuser/customers/locations.csv
--output /user/pbuser/customers_addresses
--geocoding-output-fields formattedStreetAddress areaName3 areaName1 postCode1 country
--geocoding-config-location /mapr//pb/geocoding/software/resources/config/
--geocoding-binaries-location /mapr//pb/geocoding/software/resources/nativeLibraries/bin/linux64/
--geocoding-preferences-filepath /mapr//pb/geocoding/software/resources/config/geocodePreferences.xml
--geocoding-input-fields x=0 y=1
--combine

    Custom job arguments

    All parameters are declared with a double dash.

    Parameter Example

    --input --input /user/pbuser/customers/locations.csv

The location of the input CSV file.

    --output --output /user/pbuser/customers_addresses

The location of the directory for the output, including all input columns along with the fields requested in the geocoding-output-fields parameter as well as any errors.


    Parameter Example

--geocoding-output-fields    --geocoding-output-fields formattedStreetAddress areaName3 areaName1 postCode1 country

The requested geocode fields to be included in the output. For more information, see Spark Output Fields on page 29.

--geocoding-config-location

Location of the directory containing geocoding configurations. This property must be set. This directory contains JsonDataConfig.json, which contains the location and information of the installed geocoding reference data.

Examples:
--geocoding-config-location hdfs:///pb/geocoding/software/resources/config/
--geocoding-config-location /mapr//pb/geocoding/software/resources/config/

--geocoding-binaries-location

Location of the directory containing geocoding libraries. This property must be set.

Examples:
--geocoding-binaries-location hdfs:///pb/geocoding/software/resources/nativeLibraries/bin/linux64
--geocoding-binaries-location /mapr//pb/geocoding/software/resources/nativeLibraries/bin/linux64

--geocoding-preferences-filepath

File path of the geocoding preferences file. This optional file can be edited by advanced users to change the behavior of the geocoder.

Examples:
--geocoding-preferences-filepath hdfs:///pb/geocoding/software/resources/config/geocodePreferences.xml
--geocoding-preferences-filepath /mapr//pb/geocoding/software/resources/config/geocodePreferences.xml

    --geocoding-input-fields --geocoding-input-fields x=0 y=1

The X (Longitude) column and Y (Latitude) column. Optionally include csys and country if you want them from individual rows in the input rather than a single global setting for all rows.

    Optional Parameters


    Parameter Example

    --download-location --download-location /pb/downloads

    The location of the directory where reference data will be downloaded to. This path must exist on every data node and the HiveServer2 node. This property is required if your reference data is distributed on HDFS, as recommended. It is also required if you use HDFS paths for configuration files, binaries, or preferences.

    Note: Use only if reference data was distributed remotely via HDFS or S3.

    --geocoding-country --geocoding-country USA

If your input data does not have country information, then you can specify the country as a parameter. Alternatively, you can use a column reference in --geocoding-input-fields (for example: --geocoding-input-fields x=0 y=1 country=2).

--geocoding-csys    --geocoding-csys epsg:4326

If your input data does not have coordinate system information, then you can specify the coordinate system as a parameter. Alternatively, you can use a column reference in --geocoding-input-fields (for example: --geocoding-input-fields x=0 y=1 csys=2).

    --overwrite --overwrite

    Including this parameter will tell the job to overwrite the output directory. Otherwise the job will fail if this directory already has content. This parameter does not have a value.

    --num-partitions --num-partitions=15

    The minimum number of partitions used to split up the input file.


    Parameter Example

    --download-group --download-group pbdownloads

    This property is only used for POSIX-compliant platforms like Linux. It specifies the operating system group which should be applied to the downloaded data on a local file system, so that each Hadoop service can update the data when required. This group should be present on all nodes in the cluster and the operating system user executing the Hadoop service should be a part of this group.

    For more information, see Download Permissions on page 48.

    Note: Use only if reference data was distributed remotely via HDFS or S3.

    --error-field --error-field error

    Add the error field to your output to see any error information.

    --combine --combine

Including this parameter will tell the job to combine all output files into a single output file. Otherwise the job will create multiple output files; the number of output files depends on the number of partitions specified by the user.

    Note: As the size of the data to be combined grows, especially past the size of the space available on a single node, there is a chance of getting errors.

    CSV Options

    Any standard Databricks CSV entry can also be used as needed, including the following options:

    CSV Option Example

--csv delimiter    --csv delimiter=,  or  --csv delimiter='\t'

Default is the Databricks CSV default: ,


    CSV Option Example

    --csv quote --csv quote=\'

    Default is the Databricks CSV default: "

    --csv escape --csv escape=\'

    Default is the Databricks CSV default: \

    --csv header --csv header=true

Default is the Databricks CSV default: false.

    Spark API

    In-process Spark APIs for geocoding, reverse geocoding, and geocoding by key are also available for Spark versions 1.0 and 2.0:

• spectrum-bigdata-geocoding-version.zip/spark1/sdk
• spectrum-bigdata-geocoding-version.zip/spark2/sdk

    When writing basic Spark jobs, you can use this API to augment dataframes with fields from a geocode result. For example, you can add some specific geocoding result fields from the top candidate, then process that result with an aggregation or filter. See the javadocs provided with the API for details. Javadocs are also available on the Spectrum Spatial for Big Data documentation landing page.

    Sample Spark API Geocoding Application

    Note: This topic assumes the product is installed to /pb/geocoding/software as described in the Spectrum™ Geocoding for Big Data Install Guide. An Internet connection is also required to build the sample.

    A sample for Spark API geocoding in Scala is available that demonstrates multipass geocoding against the addresses.csv file provided in /sampleData/DC. For all first-pass results with suboptimal precision codes (that is, any code other than S8), another geocoding pass will be performed as a single line address to improve the results.


    To build the sample in your environment:

    1. Navigate to the appropriate directory for your version of Spark:

    Version Path

    1.0 /pb/geocoding/software/spark1/samples/multipass

    2.0 /pb/geocoding/software/spark2/samples/multipass

    2. At the command prompt, set the gradlew file to be executable then run the gradle build:

chmod +x gradlew
./gradlew build

    A build/libs folder is created containing the MultipassGeocodingSample-version-all.jar file for the sample.

    3. Copy the jar to a location on the master node, such as /pb/master/geocoding/sample/driver/MultipassGeocodingSample-version-all.jar.

4. You will also need to copy the sample addresses.csv file to HDFS, the MapR file system, or S3. In the directory where you extracted the spectrum-bigdata-geocoding-version.zip file, navigate to /sampleData/DC (if you followed the Spectrum™ Geocoding for Big Data Install Guide, this location would be /pb/geocoding/software/sampleData/DC).

    5. Copy addresses.csv to a temporary location on a node, then create a location on HDFS, the MapR file system, or S3 and copy addresses.csv to it, for example:

    Example (using HDFS)

    hadoop fs -mkdir hdfs:///pb/geocoding/sampleData/DC

hadoop fs -copyFromLocal /pb/geocoding/software/sampleData/DC/addresses.csv hdfs:///pb/geocoding/sampleData/DC

    Example (using MapR)

    hadoop fs -mkdir /mapr//pb/geocoding/sampleData/DC

hadoop fs -copyFromLocal /pb/geocoding/software/sampleData/DC/addresses.csv /mapr//pb/geocoding/sampleData/DC

6. Specify the download group if required. It is an optional argument, specific to POSIX-compliant platforms like Linux. It specifies the operating system group which should be applied to the downloaded data on a local file system, so that each Hadoop service can update the data when required. This group should be present on all nodes in the cluster, and the operating system user executing the Hadoop service should be a part of this group. This is the last argument in the spark-submit command.

    The sample is now built and ready to use. If you want to edit the Scala sample code, it is located in:

spark1/samples/multipass/MultipassGeocodingSample/src/main/scala/com/pb/sample/spark/app
or
spark2/samples/multipass/MultipassGeocodingSample/src/main/scala/com/pb/sample/spark2/app

    Example Spark command (using HDFS):

spark-submit
--class com.pb.sample.spark.app.MultipassGeocodeDriver
--master yarn --deploy-mode cluster
/pb/master/geocoding/sample/driver/MultipassGeocodingSample-version-all.jar
hdfs:///pb/geocoding/sampleData/DC/addresses.csv
hdfs:///pb/geocoding/software/resources/config
hdfs:///pb/geocoding/software/resources/nativeLibraries/bin/linux64
/pb/downloads
hdfs:///pb/user/multipassGeocoding/output
pbdownloads

    Example Spark command (using MapR):

spark-submit
--class com.pb.sample.spark.app.MultipassGeocodeDriver
--master yarn --deploy-mode cluster
/pb/master/geocoding/sample/driver/MultipassGeocodingSample-version-all.jar
/mapr//pb/geocoding/sampleData/DC/addresses.csv
/mapr//pb/geocoding/software/resources/config
/mapr//pb/geocoding/software/resources/nativeLibraries/bin/linux64
/mapr//pb/user/multipassGeocoding/output


  • 4 - Appendix

    In this section

Sample Data 44
Geocode Preferences 46
Geocode-Level Errors 47
Download Permissions 48
Copyright 49


    Sample Data

    If you want to perform geocoding and reverse geocoding but do not have a geocoding data dictionary or sample input, you can use the samples that are in the product under /pb/geocoding/software/sampleData/DC.

    Installing Washington D.C. Data Dictionary

The data dictionary used for the samples covers all of Washington D.C. and works for both geocoding and reverse geocoding. Follow the instructions in the Spectrum™ Geocoding for Big Data Geocoding Install Guide to install the geocoding reference data as you would normal reference data, but use the /pb/geocoding/software/sampleData/DC/sampleDictionary directory instead of an SPD file.

Also, this directory is already extracted, so you do not need to use the Geocoding CLI to extract the SPD. Instead, copy the directory to the data folder:

cp -R /pb/geocoding/software/sampleData/DC/sampleDictionary /pb/geocoding/data

After the geocoding reference data is installed, follow the rest of the instructions in the Spectrum™ Geocoding for Big Data User Guide on the Spectrum Spatial for Big Data documentation landing page (http://support.pb.com/help/hadoop/landingpage/index.html), depending on the services you want to use.

    Sample Addresses for Geocoding

A sample CSV file of addresses is provided at /pb/geocoding/software/sampleData/DC/addresses.csv. The file contains 200 addresses in Washington D.C., some of which intentionally contain errors to demonstrate error handling. Here are the first two rows:

    3400 IDAHO AVE NW # 402 WASHINGTON DC 20016

    1130 17TH ST NW # 200 WASHINGTON DC 20036

To use this file you will need to copy it, for example, to HDFS:

hadoop fs -mkdir dc_addresses
hadoop fs -copyFromLocal /pb/geocoding/software/sampleData/DC/addresses.csv dc_addresses/address.csv

    This would create a file at dc_addresses/address.csv in your user's home directory in HDFS.




• If you are using Spark, you can point directly to this HDFS directory or file.
• If you are using Hive, you need the additional step of creating a Hive table. Use a supported JDBC client such as Beeline and run the following CREATE statement, replacing pbuser with your own user.

CREATE EXTERNAL TABLE dc_addresses(address STRING, city STRING, state STRING, postcode STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ("separatorChar" = ",")
STORED AS TEXTFILE
LOCATION '/user/pbuser/dc_addresses';

This assumes you connected to Hive as your end user. If you are connected as the Hive user, you will need to give the Hive user permissions to that directory:

    sudo -u hdfs hadoop fs -chown -R hive: /user/pbuser/dc_addresses

    Sample Coordinates for Reverse Geocoding

A sample CSV file of locations is provided at /pb/geocoding/software/sampleData/DC/locations.csv. The file contains 10 locations in Washington D.C. Here are the first two rows:

    -76.97921145819674 38.89325106109181

    -77.0405668395 38.901288308

To use this file you will need to copy it, for example, to HDFS:

hadoop fs -mkdir dc_locations
hadoop fs -copyFromLocal /pb/geocoding/software/sampleData/DC/locations.csv dc_locations/locations.csv

    This would create a file at dc_locations/locations.csv in your user's home directory in HDFS.
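A common mistake with reverse-geocode input is swapping the coordinate order; the sample file stores longitude first, then latitude. Before submitting a job, a cheap sanity check is to confirm every pair falls inside a known bounding box for the data's region. This is a sketch, not product code; the bounding-box values for Washington D.C. are rough assumptions:

```python
def in_bbox(rows, lon_range, lat_range):
    """Check that every (lon, lat) pair falls inside a known bounding box
    for the data's region -- a cheap way to catch swapped coordinate order."""
    return all(lon_range[0] <= lon <= lon_range[1] and
               lat_range[0] <= lat <= lat_range[1]
               for lon, lat in rows)

# Rough Washington D.C. bounding box (an assumption for this sketch).
DC_LON, DC_LAT = (-77.2, -76.8), (38.7, 39.1)

# First two rows of the sample locations.csv, as (lon, lat) tuples.
rows = [(-76.97921145819674, 38.89325106109181),
        (-77.0405668395, 38.901288308)]

print(in_bbox(rows, DC_LON, DC_LAT))                              # True
print(in_bbox([(lat, lon) for lon, lat in rows], DC_LON, DC_LAT))  # False
```

Note that a simple range check against (-180, 180) and (-90, 90) would not catch swapped D.C. coordinates, since both values happen to be under 90 in magnitude; a regional bounding box does.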

• If you are using Spark, you can point directly to this HDFS directory or file.
• If you are using Hive, you need the additional step of creating a Hive table. Use a supported JDBC client such as Beeline and run the following CREATE statement, replacing pbuser with your own user.

CREATE EXTERNAL TABLE dc_locations(x DOUBLE, y DOUBLE)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ("separatorChar" = ",")
STORED AS TEXTFILE
LOCATION '/user/pbuser/dc_locations';

This assumes you connected to Hive as your end user. If you are connected as the Hive user, you will need to give the Hive user permissions to that directory:

    sudo -u hdfs hadoop fs -chown -R hive: /user/pbuser/dc_locations

    Geocode Preferences

You may need to change how the geocoder behaves or set an option that is not directly exposed. To do so, edit the geocodePreferences.xml file located at hdfs:///pb/geocoding/software/resources/config/geocodePreferences.xml or /mapr//pb/geocoding/software/resources/config/geocodePreferences.xml.

    If you do not want to store this remotely (on HDFS or S3), then it must exist on all data nodes and master nodes.

    PB_Key

    Master Location Data datasets support a unique feature called PB_KEY. This is an identifier attached to every address record that facilitates data enrichment for faster and more informed decision making. To get PB_KEY, add the following XML element inside the geocodePreferences.xml file:

    true

    You will still need to request PB_KEY in the outputFields field.

    Default Coordinate System for Reverse Geocoding

    If you do not specify the coordinate system when performing a reverse geocode, then the default used is WGS84 (EPSG:4326). If you want a different default to be used, then set the following XML element inside the geocodePreferences.xml file:

    epsg:3857
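For context on what switching the default from EPSG:4326 to EPSG:3857 means: WGS84 coordinates are degrees of longitude and latitude, while EPSG:3857 (Web Mercator) coordinates are meters on a projected plane. The following is a minimal sketch of the standard spherical Web Mercator conversion, shown only to illustrate the two systems; it is not part of the product:

```python
import math

R = 6378137.0  # Web Mercator sphere radius in meters

def wgs84_to_web_mercator(lon, lat):
    """Project EPSG:4326 degrees to EPSG:3857 meters (spherical formula)."""
    x = math.radians(lon) * R
    y = math.log(math.tan(math.pi / 4 + math.radians(lat) / 2)) * R
    return x, y

def web_mercator_to_wgs84(x, y):
    """Inverse projection back to longitude/latitude in degrees."""
    lon = math.degrees(x / R)
    lat = math.degrees(2 * math.atan(math.exp(y / R)) - math.pi / 2)
    return lon, lat

# One of the sample D.C. locations, round-tripped through EPSG:3857.
x, y = wgs84_to_web_mercator(-77.0405668395, 38.901288308)
lon, lat = web_mercator_to_wgs84(x, y)
```

The round trip recovers the original degrees to within floating-point precision, which is a quick way to see that the two systems carry the same location in different units.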

    Spectrum™ Geocoding for Big Data 3.2 Spectrum™ Geocoding for Big Data User Guide 46

  • Appendix

    Geocode-Level Errors

    These errors occur in the error field of an individual geocode.

Geocoding candidate not found

A location could not be found for the given address. This can be solved in different ways:

• Verify the proper columns are being used in the geocode function. For example, the state column may improperly be passed in the city argument. If the majority of geocodes hit this error, this is probably the cause; consider revisiting the UDF page of the function you are trying to use.
• The address may need to be more precise or cleaned up.

Even after doing all of this, it is still possible for an address to be ungeocodable.

Data not found

The geocoding reference data is missing or misconfigured on at least one master or data node. This can be solved in different ways:

• Verify that JsonDataConfig.json has paths properly pointing to the installed geocoding data dictionaries. Every node must be configured, and all nodes must be configured the same way. The data dictionaries must also exist on each data node and master node.
• Verify you have data dictionaries that cover the country or region you are trying to geocode.

Invalid record

This row is invalid or malformed. If all addresses have this error, verify the column indexes match up with your data. Otherwise look into the following:

• Verify the row has the correct number of columns.
• If the row contains special characters, such as the delimiter, quote, or escape characters, verify the row is properly escaped.
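The last two checks can be automated before loading data. Below is a minimal sketch (not product code) that flags rows with the wrong column count, assuming a 4-column comma-separated address file like the sample; the example rows are hypothetical:

```python
import csv
import io

EXPECTED_COLUMNS = 4  # address, city, state, postcode (the sample layout)

def find_invalid_rows(text):
    """Return (line_number, row) pairs whose parsed column count differs
    from EXPECTED_COLUMNS. The csv module applies standard quote handling,
    so rows with stray unescaped delimiters also surface here."""
    bad = []
    for lineno, row in enumerate(csv.reader(io.StringIO(text)), start=1):
        if len(row) != EXPECTED_COLUMNS:
            bad.append((lineno, row))
    return bad

sample = (
    '3400 IDAHO AVE NW # 402,WASHINGTON,DC,20016\n'
    '1130 17TH ST NW # 200,WASHINGTON,DC\n'       # missing postcode column
    '"1600 PENN, AVE NW",WASHINGTON,DC,20500\n'   # embedded comma, quoted
)
print(find_invalid_rows(sample))  # only line 2 is flagged
```

Note that the third row passes because its embedded comma is properly quoted, which is exactly the escaping the error message above asks you to verify.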



    Download Permissions

Setting the download permissions allows multiple services to download and update the downloaded data when required. You should have a common operating system group that all the service users who need to download data belong to. For example, if Hive and YARN jobs need to download data to the same download location, then both the Hive and YARN operating system users should be members of a common operating system group. The download directory should be owned by that group and have Read, Write, and Execute permissions (775) for the owner and group.

    Your group should contain services and users that will run jobs in your cluster. You may skip services you will not use or do not have installed. Services include YARN, Hive, Zeppelin, and Hue.

You should also include all operating system users who will run jobs, such as pbuser.

    1. Add the group.

    sudo groupadd pbdownloads

    2. Add users to the group.

sudo usermod -a -G pbdownloads hive
sudo usermod -a -G pbdownloads yarn
sudo usermod -a -G pbdownloads zeppelin
sudo usermod -a -G pbdownloads hue
sudo usermod -a -G pbdownloads pbuser

    3. Using a window where no job is running, restart all the services whose operating system users were added to the new group.

    4. Using a window where no job is running, restart the session of all the operating system users that were added to new group (for example, pbuser).

5. Change the group of the download directory specified in the pb.download.location property to the common operating system group, and update its permissions to 775.

sudo chgrp pbdownloads /pb/downloads
sudo chmod 775 /pb/downloads
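After step 5, it can be useful to verify the result programmatically. The following is a small sketch for POSIX systems, not part of the product; the pbdownloads group and /pb/downloads path come from the examples above:

```python
import grp
import os
import stat

def check_download_dir(path, expected_group):
    """Return (mode_ok, group_ok): whether `path` has 775 permissions
    and is owned by `expected_group`, as the steps above require."""
    st = os.stat(path)
    mode_ok = stat.S_IMODE(st.st_mode) == 0o775
    group_ok = grp.getgrgid(st.st_gid).gr_name == expected_group
    return mode_ok, group_ok

# Example usage: check_download_dir("/pb/downloads", "pbdownloads")
```

Running this on each node catches the common case where one node was missed when the group or permissions were updated.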



    Copyright

    Information in this document is subject to change without notice and does not represent a commitment on the part of the vendor or its representatives. No part of this document may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, without the written permission of Pitney Bowes Software Inc., 350 Jordan Road, Troy, New York 12180. © 2019 Pitney Bowes Software Inc. All rights reserved. Location Intelligence APIs are trademarks of Pitney Bowes Software Inc. All other marks and trademarks are property of their respective holders.


Pitney Bowes
3001 Summer Street

    Stamford CT 06926-0700

    USA

    www.pitneybowes.com

    © 2019 Pitney Bowes Software Inc.

    All rights reserved

