Spectrum Geocoding for Big Data Version 3.2 Spectrum Geocoding for Big Data User Guide


  • Spectrum™ Geocoding for Big Data Version 3.2

    Spectrum™ Geocoding for Big Data User Guide

  • Table of Contents

    1 - Welcome

    What is Spectrum™ Geocoding for Big Data?
    Spectrum™ Geocoding for Big Data Architecture
    System Requirements and Dependencies

    2 - Hive User-Defined Geocoding Functions

    Hive Variables
    Hive Output Fields
    Geocode UDF
    Reverse Geocode UDF
    GeocodeByKey UDF
    Sample Hive Geocoding Script
    Hive Jobs and Reports Using Zeppelin

    3 - Spark Jobs

    Spark Output Fields
    Geocode
    Reverse Geocode
    Spark API
    Sample Spark API Geocoding Application

    4 - Appendix

    Sample Data
    Geocode Preferences
    Geocode-Level Errors
    Download Permissions

  • 1 - Welcome

    In this section

    What is Spectrum™ Geocoding for Big Data?
    Spectrum™ Geocoding for Big Data Architecture
    System Requirements and Dependencies

  • Welcome

    What is Spectrum™ Geocoding for Big Data?

    Pitney Bowes Spectrum™ Geocoding for Big Data is a toolkit for processing enterprise data for large-scale spatial analysis. Billions of records in a single file can be processed in parallel, using Hive and Apache Spark's cluster processing frameworks, yielding results faster than ever. Data processing that takes weeks with traditional techniques can be completed in a few hours with this product.

    This guide describes the Spark jobs and Hive user-defined functions for the following geocoding operations.

    • Geocode: Accepts an address as input and returns the standardized US or international address, the geocoded point, and attribution.

    • Reverse Geocode: Accepts longitude, latitude, and coordinate system as input and returns address information for the location.

    • Geocode By Key: Accepts a key (such as a PB_KEY) and returns geocoding attributes such as longitude, latitude, or a point geometry.


    Spectrum™ Geocoding for Big Data Architecture


    Spectrum™ Geocoding for Big Data packages its components into an SDK for Big Data platforms such as Hadoop, for use with Spark and Hive.

    SDK provides:

    • Integration APIs for Geocoding (Geocode, Reverse Geocode, Geocode By Key)
    • Input datasets and metadata

    API Types:

    • Pre-built Hive UDF wrappers for Geocoding operations
    • Core Geocoding APIs (security enabled via Kerberos and Apache Sentry for Hive)

    System Requirements and Dependencies

    Spectrum™ Geocoding for Big Data is a collection of jar files that can be deployed to your Hadoop system.

    This product is verified on the following Hadoop distributions.

    • Cloudera 5.12 and 6.0
    • Hortonworks 2.6 and 3.0
    • EMR 5.10
    • MapR 6.0 and above, with MapR Expansion Pack (MEP) 5.0.0

    To use these jar files, you must be familiar with configuring Hadoop in Hortonworks, Cloudera, EMR, or MapR and with developing applications for distributed processing. For more information, refer to the Hortonworks (http://docs.hortonworks.com/index.html), Cloudera (http://www.cloudera.com/documentation.html), EMR (https://aws.amazon.com/documentation/emr/), or MapR (https://mapr.com/docs/) documentation.

    To use the product, the following must be installed on your system:

    For Hive:

    • Hive version 1.2.1 or above

    For a Hive client:

    • Beeline, for example

    For Spark and Zeppelin Notebook:


    • Java JDK version 1.8 or above
    • Hadoop version 2.6.0 or above
    • Spark version 1.6.0 or above (2.0 or above required for MapR and Cloudera 6.0)
    • Zeppelin Notebook is not supported in Cloudera


  • 2 - Hive User-Defined Geocoding Functions

    In this section

    Hive Variables
    Hive Output Fields
    Geocode UDF
    Reverse Geocode UDF
    GeocodeByKey UDF
    Sample Hive Geocoding Script
    Hive Jobs and Reports Using Zeppelin

  • Hive User-Defined Geocoding Functions

    Hive Variables

    The Geocoding UDFs use Hive variables (some of which are required) to set various properties; these take precedence over the corresponding system properties. They can also be set in the optional config map of a UDF query, where they take precedence over both the Hive variables and the system properties.
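    As a sketch of this precedence (the values below are illustrative, not recommendations; the variable name and UDF signature are those documented in this chapter), the same property can be set at the session level and then overridden for a single query:

    ```sql
    -- Session-level Hive variable: overrides any system property.
    set pb.geocoding.error.limit=5;

    -- Per-query config map: overrides the session variable for this query only.
    SELECT geocode("600 Maryland Ave SW, Washington, DC 20002", "USA",
                   map('pb.geocoding.error.limit', '20')) AS result;
    ```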

    Variable: pb.geocoding.config.location
    Examples:
      hdfs:///pb/geocoding/software/resources/config/
      /mapr//pb/geocoding/software/resources/config/
    Location of the directory containing the geocoding configuration. This variable must be set. This directory contains JsonDataConfig.json, which contains the location and information of the installed geocoding reference data.

    Variable: pb.geocoding.binaries.location
    Examples:
      hdfs:///pb/geocoding/software/resources/nativeLibraries/bin/linux64
      /mapr//pb/geocoding/software/resources/nativeLibraries/bin/linux64/
    Location of the directory containing the geocoding libraries. This variable must be set. This directory contains native libraries used by the geocoder.

    Variable: pb.geocoding.preferences.filepath
    Examples:
      hdfs:///pb/geocoding/software/resources/config/geocodePreferences.xml
      /mapr//pb/geocoding/software/resources/config/geocodePreferences.xml
    File path of the geocoding preferences file. This optional file can be edited by advanced users to change the behavior of the geocoder.

    Variable: pb.geocoding.output.fields
    Example: X,Y,formattedStreetAddress,formattedLocationAddress
    Comma-separated list of fields requested from the geocoder. You must either set this variable or set the preferences in the UDF query. For more information about output fields, see Hive Output Fields.

    Variable: pb.download.location
    Example: /pb/downloads
    Note: Use this variable only if reference data was distributed remotely via HDFS or S3.
    Location of the directory where reference data will be downloaded. This path must exist on every data node and on the HiveServer2 node.

    Variable: pb.download.group
    Example: pbdownloads
    Note: Use this variable only if reference data was distributed remotely via HDFS or S3.
    This optional property applies only to POSIX-compliant platforms such as Linux. It specifies the operating system group that should be applied to the downloaded data on the local file system, so that each Hadoop service can update the data when required. This group should be present on all nodes in the cluster, and the operating system user executing the Hadoop service should be a member of this group. For more information, see Download Permissions on page 48.

    Variable: pb.geocoding.error.limit
    Example: 1
    The number of geocoding errors to allow before failing a task for "too many errors". This prevents a task (and thus the job) from continuing in the case of a likely configuration error. The default value is 10.

    Variable: pb.geocoding.error.limit.disabled
    Example: true
    Disables the error limit. All errors will be logged but will not cause the task or job to fail.

    Setting Variables

    Since these are Hive configuration variables, you can set them permanently by editing the hiveserver2-site.xml file in your cluster.

    Example (using HDFS)

    pb.geocoding.config.location=hdfs:///pb/geocoding/software/resources/config/
    pb.geocoding.binaries.location=hdfs:///pb/geocoding/software/resources/nativeLibraries/bin/linux64/
    pb.geocoding.output.fields=X,Y,formattedStreetAddress,formattedLocationAddress
    pb.geocoding.preferences.filepath=hdfs:///pb/geocoding/software/resources/config/geocodePreferences.xml
    pb.download.location=/pb/downloads
    pb.download.group=pbdownloads
    pb.geocoding.error.limit=1

    Example (using MapR)

    pb.geocoding.config.location=/mapr//pb/geocoding/software/resources/config/
    pb.geocoding.binaries.location=/mapr//pb/geocoding/software/resources/nativeLibraries/bin/linux64/
    pb.geocoding.output.fields=X,Y,formattedStreetAddress,formattedLocationAddress
    pb.geocoding.preferences.filepath=/mapr//pb/geocoding/software/resources/config/geocodePreferences.xml
    pb.download.location=/pb/downloads
    pb.download.group=pbdownloads
    pb.geocoding.error.limit=1

    Alternatively you can set them temporarily in each Hive session that you open.

    For example, if you wish to geocode in Beeline you can set the variables in the following way:

    Example (using HDFS)

    set pb.geocoding.config.location=hdfs:///pb/geocoding/software/resources/config/;
    set pb.geocoding.binaries.location=hdfs:///pb/geocoding/software/resources/nativeLibraries/bin/linux64/;
    set pb.geocoding.output.fields=X,Y,formattedStreetAddress,formattedLocationAddress;
    set pb.geocoding.preferences.filepath=hdfs:///pb/geocoding/software/resources/config/geocodePreferences.xml;
    set pb.download.location=/pb/downloads;
    set pb.download.group=pbdownloads;
    set pb.geocoding.error.limit=1;

    Example (using MapR)

    set pb.geocoding.config.location=/mapr//pb/geocoding/software/resources/config/;
    set pb.geocoding.binaries.location=/mapr//pb/geocoding/software/resources/nativeLibraries/bin/linux64/;
    set pb.geocoding.output.fields=X,Y,formattedStreetAddress,formattedLocationAddress;
    set pb.geocoding.preferences.filepath=/mapr//pb/geocoding/software/resources/config/geocodePreferences.xml;
    set pb.download.location=/pb/downloads;
    set pb.download.group=pbdownloads;
    set pb.geocoding.error.limit=1;

    Variables set in the local session override any values in the hiveserver2-site file. This means you can set default values in hiveserver2-site and override them in Beeline when necessary.

    System Variables (Deprecated)

    In earlier releases, these Hive variables were set in the system namespace. That method is deprecated as of version 3.0 and is no longer recommended.
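    Assuming the deprecated form used Hive's system namespace prefix (the system: syntax is Hive's standard system-property namespace; treat the exact legacy form here as an assumption), the old and new styles would compare as follows:

    ```sql
    -- Deprecated (pre-3.0): variable set in the system namespace
    set system:pb.geocoding.config.location=hdfs:///pb/geocoding/software/resources/config/;

    -- Current: plain Hive variable
    set pb.geocoding.config.location=hdfs:///pb/geocoding/software/resources/config/;
    ```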


    Hive Output Fields

    The result of a Geocode UDF or Reverse Geocode UDF is a struct that will look similar to the example below.

    {
      "x": "-73.700257",
      "y": "42.678161",
      "formattedstreetaddress": "350 JORDAN RD",
      "formattedlocationaddress": "TROY, NY 12180-8352",
      "error": null
    }

    To see examples of queries using this struct, see Geocode UDF or Reverse Geocode UDF. To change the fields present in the struct, you must either use the pb.geocoding.output.fields variable or set the preferences in the UDF query.

    Note: For information on custom output fields per country, see the appendix in the Global Geocoding SDK Developer Guide (http://support.pb.com/help/hadoop/ggs/GlobalGeocodingSDK_DeveloperGuide.pdf), which is also available on the Spectrum Spatial for Big Data documentation landing page (http://support.pb.com/help/hadoop/landingpage/index.html).

    Output Fields Variable

    The output fields can be set with the variable pb.geocoding.output.fields, a comma-separated list of fields. This list is case-sensitive, and a mistyped field will return null. The above example can be created from the following statement.

    set pb.geocoding.output.fields=X,Y,formattedStreetAddress,formattedLocationAddress;

    PBKey and Other Custom Fields

    Certain output fields will not be returned unless you edit the geocode preferences. To learn more, see Geocode Preferences on page 46.

    Error Field

    The error field is always returned and does not need to be requested. If the value is null, then no errors occurred during the function's execution. See Geocode-Level Errors on page 47 for explanations of possible errors.
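    Since the error field is always present, it can be used to isolate failed rows for inspection; a minimal sketch, assuming a customers table shaped like the examples in this chapter:

    ```sql
    -- Return only the rows that failed to geocode, with the reason.
    SELECT customer_id, result.error
    FROM customers
    LATERAL VIEW OUTER inline(array(geocode("", address, "", city, state, postcode, country))) result
    WHERE result.error IS NOT NULL;
    ```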


    Geocode UDF

    Description

    The Geocode function accepts an address as an argument and returns geocoding attributes such as longitude, latitude, or a point geometry.

    Note: In order to use this UDF, you must set the required options either as Hive variables or in the optional config map parameter.

    Syntax

    Geocode(String mainAddressLine, String country, [Map config])

    Geocode(String mainAddressLine, String addressLastLine, String country, [Map config])

    Geocode(String addressNumber, String streetName, String unitValue, String areaName3, String areaName1, String postCode1, String country, [Map config])

    Geocode(Map input, [Map config])

    Parameters

    Address Fields

    Parameter Type Description

    mainAddressLine String the street address or a single line address

    addressLastLine String the location level information like city and postal code

    addressNumber String the house number information

    streetName String the street name

    unitType String the unit type (such as Apt. or Suite)

    unitValue String the unit value


    areaName1 String the largest geographic division in an area, such as a state, province, or region

    areaName2 String the secondary geographic division in an area, such as a county or district

    areaName3 String the city or town name

    areaName4 String the smallest geographic division in an area, such as a city subdivision or locality

    postCode1 String the primary postal code

    postCode2 String the secondary postal code

    placeName String the name of a business, building, or location

    country String the ISO-3166 two- or three-character abbreviation for the country, or the country name

    Input and Configuration Maps

    Parameter Type Description

    input Map key-value pairs of address fields and input values. The keys must be constant values from the list of address fields in the table above. The values can be either constants or column references. Maps can be constructed with Hive's map() complex type constructor (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-ComplexTypeConstructors).


    config Map can be used for setting both Hive properties and Preferences

    Hive Properties

    Options that set the Hive Variables on page 8.

    Note: Any values specified here will override values set as Hive properties.

    All values must be constant. If a key is present but has a null value, the property is cleared.

    Preferences

    settings that control the matching and geocoding criteria, return values and fields

    Note: Any values specified here will override values set in the geocode preferences file.

    All values must be constant. If a key is present but has a null value, the preference is cleared.

    For more information, see Geocode Preferences on page 46.

    Note: The function may time out when using a large number of datasets that are stored in remote locations (such as HDFS and S3). If you are using Hive with the MapReduce engine, you can adjust the value of the mapreduce.task.timeout property.
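    For example, when running Hive on the MapReduce engine, the timeout could be raised for the session before geocoding (the value is in milliseconds; 1800000 here is an illustrative choice, not a recommended setting):

    ```sql
    -- Allow up to 30 minutes per task while remote reference data is downloaded.
    set mapreduce.task.timeout=1800000;
    ```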

    Return Values

    This function returns struct values described in Hive Output Fields on page 11.

    Examples

    In order to use this UDF, you must set the required options either as Hive variables or in the optional config map parameter.

    All the examples below are using the following output fields: X, Y, formattedStreetAddress, formattedLocationAddress

    Single Geocode returning struct

    All these queries return the same response, a struct containing all the requested output fields.

    SELECT geocode("600 Maryland Ave SW, Washington, DC 20002", "USA") AS result;

    SELECT geocode(map("mainAddressLine", "600 Maryland Ave SW", "addressLastLine", "Washington, DC 20002", "country", "USA")) AS result;

    Output:


    Result

    {"x":"-77.020707","y":"38.886115","formattedstreetaddress":"600 MARYLAND AVE SW","formattedlocationaddress":"WASHINGTON, DC 20024-2520","error":null}

    Single Geocode returning a single row of columns

    Use a built-in, table-generating function called inline to turn the struct into columns.

    SELECT inline(array(geocode("600 Maryland Ave SW, Washington, DC 20002", "USA")));

    Output:

    X           Y          Street Address       Location Address           Error
    -77.020707  38.886115  600 MARYLAND AVE SW  WASHINGTON, DC 20024-2520  null

    Single Geocode returning particular fields

    SELECT result.x AS x, result.y AS y FROM (SELECT inline(array(geocode("600 Maryland Ave SW, Washington, DC 20002", "USA")))) result;

    Output:

    X Y

    -77.020707 38.886115

    Table Geocode with struct

    SELECT customer_id, customer_name, customer_phone, geocode("", t.address, "", t.city, t.state, t.postcode, t.country) AS geocoded_result FROM customers t;

    Output:

    ID          Name                 Phone               Result

    0000000001  John Michael Dorian  +(1)-(555)-6647575  {"x":"-79.794777","y":"34.191205","formattedstreetaddress":"1205 W PALMETTO ST","formattedlocationaddress":"FLORENCE, SC 29501-4131","pb_key":"P0000KPW08AT","error":null}

    0000000002  Elliot Reid          +(1)-(555)-5406548  {"x":"-122.393078","y":"37.753697","formattedstreetaddress":"1107 PENNSYLVANIA AVE","formattedlocationaddress":"SAN FRANCISCO, CA 94107-3413","pb_key":"P00002T1L8H7","error":null}

    0000000003  Jan Itor             +(1)-(555)-3925210  {"x":"-76.30457","y":"40.045757","formattedstreetaddress":"478 N DUKE ST","formattedlocationaddress":"LANCASTER, PA 17602-4967","pb_key":"P0000JCCUDFI","error":null}

    Table Geocode

    SELECT customer_id, customer_name, customer_phone, result.* FROM customers LATERAL VIEW OUTER inline(array(geocode("", address, "", city, state, postcode, country))) result;

    Output:

    ID          Name                 Phone               X            Y          Street Address         Location Address              Error

    0000000001  John Michael Dorian  +(1)-(555)-6647575  -79.794777   34.191205  1205 W PALMETTO ST     FLORENCE, SC 29501-4131       null

    0000000002  Elliot Reid          +(1)-(555)-5406548  -122.393078  37.753697  1107 PENNSYLVANIA AVE  SAN FRANCISCO, CA 94107-3413  null

    0000000003  Jan Itor             +(1)-(555)-3925210  -76.30457    40.045757  478 N DUKE ST          LANCASTER, PA 17602-4967      null

    Table Geocode into new table

    CREATE TABLE customers_geocoded AS
    SELECT customer_id, customer_name, customer_phone, result.x, result.y,
      result.formattedStreetAddress, result.formattedLocationAddress
    FROM customers
    LATERAL VIEW OUTER inline(array(geocode("", address, "", city, state, postcode, country))) result;

    Table Geocode using optional config map syntax

    SELECT geocode(
      map('mainAddressLine', t.address, 'areaName3', t.city, 'areaName1', t.state,
          'postCode1', t.zipcode, 'country', 'USA'),
      map('pb.geocoding.config.location', 'hdfs:///pb/geocoding/software/resources/config',
          'pb.geocoding.binaries.location', 'hdfs:///pb/geocoding/software/resources/nativeLibraries/bin/linux64/',
          'pb.geocoding.output.fields', 'formattedStreetAddress,formattedLocationAddress,precisionCode,PB_KEY',
          'matchMode', 'RELAXED',
          'returnAllCandidateInfo', 'true')
    ) FROM myTable t;


    Reverse Geocode UDF

    Description

    The ReverseGeocode function accepts X, Y, and coordinateSystem, or a point, as arguments and returns address information for that location, along with address attribution.

    Note: In order to use this UDF, you must set the required options either as Hive variables or in the optional config map parameter.

    Syntax

    ReverseGeocode(double X, double Y, [String coordinateSystem], [Map config])

    ReverseGeocode(double X, double Y, [Map config])

    ReverseGeocode(WritableGeometry point, [Map config])

    Parameters

    Parameter Type Description

    X Double Longitude of the address location in degrees.

    Y Double Latitude of the address location in degrees.

    coordinateSystem String The coordinate system that you want to convert the geometry to. The format must be the European Petroleum Survey Group (EPSG) code or the SRID code

    If you use a signature that only uses numeric X and Y coordinates, the coordinate system can be set in the geocodePreferences.xml file. If no value is set, then the default coordinate system is WGS84 (EPSG:4326).

    point WritableGeometry The WritableGeometry will most likely be the output of a different spatial function, such as ST_Point.


    config Map can be used for setting both Hive properties and Preferences

    Hive Properties

    Options that set the Hive Variables on page 8.

    Note: Any values specified here will override values set as Hive properties.

    All values must be constant. If a key is present but has a null value, the property is cleared.

    Preferences

    settings that control the matching and geocoding criteria, return values and fields

    Note: Any values specified here will override values set in the geocode preferences file.

    All values must be constant. If a key is present but has a null value, the preference is cleared.

    For more information, see Geocode Preferences on page 46.

    Note: The function may time out when using a large number of datasets that are stored in remote locations (such as HDFS and S3). If you are using Hive with the MapReduce engine, you can adjust the value of the mapreduce.task.timeout property.

    Return Values

    This function returns struct values described in Hive Output Fields on page 11.

    Examples

    In order to use this UDF, you must set the required options either as Hive variables or in the optional config map parameter.

    All the examples below are using the following output fields: formattedStreetAddress, areaName3, areaName1, postCode1, country

    Single ReverseGeocode returning struct

    This example does not specify the coordinate system in the query nor in the geocodePreferences.xml file. Therefore the default coordinate system is used, WGS84 (EPSG:4326).

    SELECT ReverseGeocode(-77.020707, 38.886115) AS result;

    This example passes coordinates that are not in the default coordinate system, so the coordinate system must be given:

    SELECT ReverseGeocode(-8573905.88, 4705371.63, "EPSG:3857") AS result;


    This example uses a function from the Location Intelligence for Big Data product to generate a WritableGeometry.

    SELECT ReverseGeocode(FromGeoJSON('{"type": "Point", "coordinates": [-77.020707, 38.886115]}')) AS result;

    All these queries return the same response, a struct containing all the requested output fields:

    Result

    {"formattedstreetaddress":"600 MARYLAND AVE SW","areaname3":"WASHINGTON","areaname1":"DC","postcode1":"20024","country":"USA","error":null}

    Single ReverseGeocode returning a single row of columns

    Use a built-in, table-generating function called inline to turn the struct into columns.

    SELECT inline(array(ReverseGeocode(-74.036008, 40.721673)));

    Output:

    Street Address       Area Name 3  Area Name 1  Postcode  Country  Error

    600 MARYLAND AVE SW  WASHINGTON   DC           20024     USA      null

    Single ReverseGeocode returning particular fields

    SELECT result.areaname3, result.areaname1, result.postCode1, result.country FROM (SELECT inline(array(ReverseGeocode(-77.020707, 38.886115)))) result;

    Output:

    Area Name 3 Area Name 1 Postcode Country

    WASHINGTON DC 20024 USA

    Table ReverseGeocode with struct

    SELECT customer_id, customer_name, ReverseGeocode(customers.x, customers.y) AS reverse_geocoded_result FROM customers;

    Output:


    ID          Name              Result

    0000000001  Kevin Ball        {"formattedstreetaddress":"2113 S HOMAN AVE","areaname3":"CHICAGO","areaname1":"IL","postcode1":"60623","country":"USA","error":null}

    0000000002  Fiona Gallagher   {"formattedstreetaddress":"2119 S HOMAN AVE","areaname3":"CHICAGO","areaname1":"IL","postcode1":"60623","country":"USA","error":null}

    0000000003  Mickey Milkovich  {"formattedstreetaddress":"1955 S TRUMBULL AVE","areaname3":"CHICAGO","areaname1":"IL","postcode1":"60623","country":"USA","error":null}

    Table ReverseGeocode

    SELECT customer_id, customer_name, result.* FROM customers LATERAL VIEW OUTER inline(array(ReverseGeocode(customers.x, customers.y))) result;

    Output:

    ID          Name              Street Address       Area Name 3  Area Name 1  Postcode  Country  Error

    0000000001  Kevin Ball        2113 S HOMAN AVE     CHICAGO      IL           60623     USA      null

    0000000002  Fiona Gallagher   2119 S HOMAN AVE     CHICAGO      IL           60623     USA      null

    0000000003  Mickey Milkovich  1955 S TRUMBULL AVE  CHICAGO      IL           60623     USA      null

    Table ReverseGeocode into new table

    CREATE TABLE customers_reverse_geocoded AS
    SELECT customer_id, customer_name, result.formattedStreetAddress, result.areaName3,
      result.areaName1, result.postCode1, result.country
    FROM customers
    LATERAL VIEW OUTER inline(array(ReverseGeocode(customers.x, customers.y))) result;

    Table ReverseGeocode using optional config map syntax

    SELECT ReverseGeocode(t.x, t.y,
      map('pb.geocoding.config.location', 'hdfs:///pb/geocoding/software/resources/config',
          'pb.geocoding.binaries.location', 'hdfs:///pb/geocoding/software/resources/nativeLibraries/bin/linux64/',
          'pb.geocoding.output.fields', 'formattedStreetAddress,formattedLocationAddress,PB_KEY',
          'returnAllCandidateInfo', 'true')
    ) FROM myTable t;


    GeocodeByKey UDF

    Description

    The GeocodeByKey function accepts a key (such as a PB_KEY) and returns geocoding attributes such as longitude, latitude, or a point geometry (the same results that are produced by the Geocode function).

    Note: In order to use this UDF, you must set the required options either as Hive variables or in the optional config map parameter.

    Syntax

    GeocodeByKey(String key, [String type], String country, [Map config])

    Parameters

    Parameter Type Description

    key String the unique identifier for the address.

    type String Optional. The type of key for a lookup. The default value is PB_KEY.

    country String the ISO-3166 two- or three-character abbreviation for the country of the address, or country name


    config Map can be used for setting both Hive properties and Preferences

    Hive Properties

    Options that set the Hive Variables on page 8.

    Note: Any values specified here will override values set as Hive properties.

    All values must be constant. If a key is present but has a null value, the property is cleared.

    Preferences

    settings that control the matching and geocoding criteria, return values and fields

    Note: Any values specified here will override values set in the geocode preferences file.

    All values must be constant. If a key is present but has a null value, the preference is cleared.

    For more information, see Geocode Preferences on page 46.

    Note: The function may time out when using a large number of datasets that are stored in remote locations (such as HDFS and S3). If you are using Hive with the MapReduce engine, you can adjust the value of the mapreduce.task.timeout property.

    Return Values

    This function returns struct values described in Hive Output Fields on page 11.

    Examples

    In order to use this UDF, you must set the required options either as Hive variables or in the optional config map parameter.

    All the examples below are using the following output fields: X, Y, formattedStreetAddress, formattedLocationAddress

    Single GeocodeByKey returning struct

    All these queries return the same response, a struct containing all the requested output fields.

    SELECT GeocodeByKey("P000053KV1L4", "PB_KEY", "USA") AS result;

    Output:


    Result

    {"x":"-77.020707","y":"38.886115","formattedstreetaddress":"600 MARYLAND AVE SW","formattedlocationaddress":"WASHINGTON, DC 20024-2520","error":null}

    Single GeocodeByKey returning a single row of columns

    Use a built-in, table-generating function called inline to turn the struct into columns.

    SELECT inline(array(GeocodeByKey("P000053KV1L4", "PB_KEY", "USA")));

    Output:

    X           Y          Street Address       Location Address           Error
    -77.020707  38.886115  600 MARYLAND AVE SW  WASHINGTON, DC 20024-2520  null

    Single GeocodeByKey returning particular fields

    SELECT result.x AS x, result.y AS y FROM (SELECT inline(array(GeocodeByKey("P000053KV1L4", "PB_KEY", "USA")))) result;

    Output:

    X Y

    -77.020707 38.886115

    Table GeocodeByKey with struct

    SELECT customer_id, customer_name, customer_phone, GeocodeByKey(t.key, t.country) AS geocoded_result FROM customers t;

    Output:

    ID          Name                 Phone               Result

    0000000001  John Michael Dorian  +(1)-(555)-6647575  {"x":"-79.794777","y":"34.191205","formattedstreetaddress":"1205 W PALMETTO ST","formattedlocationaddress":"FLORENCE, SC 29501-4131","pb_key":"P0000KPW08AT","error":null}

    0000000002  Elliot Reid          +(1)-(555)-5406548  {"x":"-122.393078","y":"37.753697","formattedstreetaddress":"1107 PENNSYLVANIA AVE","formattedlocationaddress":"SAN FRANCISCO, CA 94107-3413","pb_key":"P00002T1L8H7","error":null}

    0000000003  Jan Itor             +(1)-(555)-3925210  {"x":"-76.30457","y":"40.045757","formattedstreetaddress":"478 N DUKE ST","formattedlocationaddress":"LANCASTER, PA 17602-4967","pb_key":"P0000JCCUDFI","error":null}

    Table GeocodeByKey

    SELECT customer_id, customer_name, customer_phone, result.* FROM customers LATERAL VIEW OUTER inline(array(GeocodeByKey("P000053KV1L4", "PB_KEY", "USA"))) result;

    Output:

    ID          Name                 Phone               X            Y          Street Address         Location Address              Error

    0000000001  John Michael Dorian  +(1)-(555)-6647575  -79.794777   34.191205  1205 W PALMETTO ST     FLORENCE, SC 29501-4131       null

    0000000002  Elliot Reid          +(1)-(555)-5406548  -122.393078  37.753697  1107 PENNSYLVANIA AVE  SAN FRANCISCO, CA 94107-3413  null

    0000000003  Jan Itor             +(1)-(555)-3925210  -76.30457    40.045757  478 N DUKE ST          LANCASTER, PA 17602-4967      null

    Table GeocodeByKey into new table

CREATE TABLE customers_geocoded AS SELECT customer_id, customer_name, customer_phone, result.x, result.y, result.formattedStreetAddress, result.formattedLocationAddress FROM customers LATERAL VIEW OUTER inline(array(GeocodeByKey(key, country))) result;

    Table GeocodeByKey using optional config map syntax

SELECT GeocodeByKey(t.key, t.country, map('pb.geocoding.config.location', 'hdfs:///pb/geocoding/software/resources/config', 'pb.geocoding.binaries.location', 'hdfs:///pb/geocoding/software/resources/nativeLibraries/bin/linux64/', 'pb.geocoding.output.fields', 'X,Y,formattedStreetAddress,formattedLocationAddress')) FROM myTable t;


    Sample Hive Geocoding Script

    Note: This topic assumes the product is installed to /pb/geocoding/software as described in the Spectrum™ Geocoding for Big Data Install Guide.

    A SQL script is available that, when run in a Hive session, demonstrates multipass geocoding against the addresses.csv file in /sampleData/DC/. For all first-pass results with the suboptimal precision code of Z1, another geocoding pass will be performed as a single line address to improve the results.

    The script is located at: /hive/examples/MultipassGeocoding.sql

    From a Hive session (for example, using the Beeline client), run the script using the following command:

    !run /pb/geocoding/software/hive/examples/MultipassGeocoding.sql

    Multipass Geocoding Script

-- This script assumes the data and product are installed in the locations specified in the Geocoding Install Guide

-- The following creates a table and then populates it with the data from the local file system
CREATE TABLE IF NOT EXISTS PB_multipassgeocodeaddress(street String, city String, state String, postal_code String) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
LOAD DATA LOCAL INPATH '/pb/geocoding/software/sampleData/DC/addresses.csv' OVERWRITE INTO TABLE PB_multipassgeocodeaddress;

-- Create a temporary function that uses the geocoder
CREATE TEMPORARY FUNCTION Geocode AS 'com.pb.bigdata.geocoding.hive.Geocode';

-- Set the needed geocoding parameters; notice we are storing the config and binaries in HDFS instead of the local file system
SET pb.geocoding.output.fields=precisionCode,formattedStreetAddress,country;
SET pb.geocoding.config.location=hdfs:///pb/geocoding/software/resources/config/;
SET pb.geocoding.binaries.location=hdfs:///pb/geocoding/software/resources/nativeLibraries/bin/linux64/;

-- Set where the data should go
SET pb.download.location=/pb/downloads;

-- Now we can geocode the table
CREATE TABLE PB_firstpass AS SELECT street, city, postal_code, result.precisionCode, result.formattedStreetAddress FROM PB_multipassgeocodeaddress LATERAL VIEW OUTER inline(array(Geocode("", street, "", city, "", postal_code, "USA"))) result;

-- Display the results
SELECT * FROM PB_firstpass;

-- Try to get a better result for all "Z1" geocodes by running as single line
CREATE TABLE PB_secondpass AS SELECT street, city, postal_code, result.precisionCode, result.formattedStreetAddress FROM PB_firstpass LATERAL VIEW OUTER inline(array(Geocode(CONCAT(street, " ", city, " ", postal_code), "USA"))) result WHERE PB_firstpass.precisioncode="Z1";

-- Display the results
SELECT * FROM PB_secondpass;


    Hive Jobs and Reports Using Zeppelin

Apache Zeppelin is an open source, web-based notebook that allows you to run Hive jobs and analyze the data interactively. To get set up with Zeppelin, see the Zeppelin Quick Start page at http://zeppelin.apache.org/docs/0.7.2/install/install.html.

    Note: Zeppelin is supported in Hortonworks and EMR. It is not supported on Cloudera.

We provide a sample geocoding JSON file that can be imported into Zeppelin. Once imported, a notebook is created that provides Hive queries for geocoding addresses and generating a summary report for geocoded addresses. The code and relevant information are provided in the notebook.

To access Zeppelin, from the Ambari interface, go to Zeppelin > Quick Links > Zeppelin UI.

The Zeppelin sample JSON file is located in the /zeppelin directory where you extracted the spectrum-bigdata-geocoding-version.zip file.

The process involves running a sequence of paragraphs that contain the instructions for the job. You may run all paragraphs in the order given, or one paragraph at a time. The paragraph order is important: if you run them individually, run them in the order provided.

    Geocoding Addresses

    This notebook has Hive queries that you can use to create a Hive input address table, geocode the input addresses stored in the input table, and create a geocoded output address table. Modify the Hive queries as per the required schema of the input and output table.

    Geocoding Summary Reports

    This notebook is used to generate geocoding metrics as summary reports over an input table of geocoded addresses. It provides these summary reports based on whether the input geocoding reference data is international or US-specific.

    There are two categories of reports:

    International (one or more countries):

    • Location Quality Summary - This report gives the geocoded address location precision as a count and percentage. It is based on the precisionCode output field of a geocoded address and used to analyze the location precision of geocoded international single country or mixed country addresses.

    • Records Changed Summary - This report gives the count and percentage of addresses needing correction. It is based on the precisionCode output field of a geocoded address. It can be used to generate metrics for geocoded international single country or mixed country addresses.

• PB Key Associations - This report gives the count and percentage of addresses that matched with a PB_KEY. It is based on the PB_KEY output field of a geocoded address. It can be used to generate metrics for geocoded addresses that have MLD datasets.

    US:


    • Address Match Summary - This report gives the count and percentage of addresses meeting the USPS standards. It is based on the MATCH_CODE output field of a geocoded address.

    • Census Block Stats - This report uses Census block ID output fields for geocoded addresses to determine the count and percentage of addresses assigned to a tract, block group, and block.
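Both report categories reduce to counting the values of a single output field and expressing each count as a percentage of the total. As a minimal sketch (the precision code values below are illustrative, not taken from real output), a Location Quality style summary could be computed like this:

```python
from collections import Counter

# Illustrative precisionCode values; real ones come from the precisionCode
# output field of each geocoded address in the input table.
precision_codes = ["S8", "S8", "S5", "Z1", "S8", "Z1", "S5", "S8"]

counts = Counter(precision_codes)
total = len(precision_codes)

# Count and percentage per precision code, highest count first.
for code, count in counts.most_common():
    print(f"{code}: {count} ({100 * count / total:.1f}%)")
# S8: 4 (50.0%)
# S5: 2 (25.0%)
# Z1: 2 (25.0%)
```

The same pattern applies to the other reports: swap in MATCH_CODE, PB_KEY presence, or Census block ID fields as the value being counted.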


3 - Spark Jobs

To use Spark jobs:

• the input and output must both be in CSV format
• installation must be completed as outlined in the Spectrum™ Geocoding for Big Data Install Guide on the Spectrum Spatial for Big Data documentation landing page

    Spark drivers for both versions of Spark (1.0 and 2.0) are pre-built and delivered with the Spectrum™ Geocoding for Big Data toolkit. The drivers are also available as a Gradle project that you can build and customize to meet your specific needs via the Spark API.

    Note: Building drivers requires an Internet connection.

    In this section

    Spark Output Fields 29 Geocode 30 Reverse Geocode 35 Spark API 40 Sample Spark API Geocoding Application 40


    Spark Output Fields

    The result of a Geocode or Reverse Geocode Spark Job is a CSV file with all columns from the input plus additional requested output fields. To change the output fields, you must use the geocoding-output-fields parameter.

    Note: For information on custom output fields per country, see the appendix in the Global Geocoding SDK Developer Guide, which is also available on the Spectrum Spatial for Big Data documentation landing page.

    Output Fields Parameter

Set the geocoding-output-fields parameter to a space-separated list of fields. For example:

spark-submit
--class com.pb.bigdata.geocoding.spark.app.GeocodeDriver
--master yarn --deploy-mode cluster
/pb/geocoding/software/spark2/driver/spectrum-bigdata-geocoding-spark2drivers-version-all.jar
--input /user/pbuser/customers/addresses.csv
--output /user/pbuser/customers_geocoded
--geocoding-output-fields x y formattedStreetAddress formattedLocationAddress
--geocoding-config-location hdfs:///pb/geocoding/software/resources/config/
--geocoding-binaries-location hdfs:///pb/geocoding/software/resources/nativeLibraries/bin/linux64/
--download-location /pb/downloads
--geocoding-preferences-filepath hdfs:///pb/geocoding/software/resources/config/geocodePreferences.xml
--geocoding-input-fields streetName=0 areaName3=1 areaName1=2 postCode1=3

    This list is case-sensitive and a mistyped field will return a column of null values.

    PBKey and Other Custom Fields

    Certain output fields will not be returned unless you edit the geocodePreferences.xml file. This file path is set using the geocoding-preferences-filepath command line parameter. To learn more, see Geocode Preferences on page 46.

    Errors

All input records will go to the output. If a row errors, it will still be written to the --output location. In the case where no candidates are found, you will get null values for all geocoding output fields. Errored rows also get null values for all geocoding output fields; you can add the error field to your output to see any error information. See Geocode-Level Errors on page 47 for explanations of possible errors.
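As an example of downstream handling, a consumer of the output can separate errored rows from successful ones by checking the error column. The Python sketch below uses an inline sample with made-up column names; real output columns depend on the fields you requested:

```python
import csv
import io

# Inline stand-in for a geocoded output CSV; the "error" column is present
# because the job was run with --error-field error (column names illustrative).
output_csv = """name,x,y,error
Richard Hendricks,-118.568145,34.165625,
Bad Record,,,Geocoding error: address could not be parsed
"""

rows = list(csv.DictReader(io.StringIO(output_csv)))
geocoded = [r for r in rows if not r["error"]]  # rows with an empty error column
failed = [r for r in rows if r["error"]]        # rows carrying an error message

print(len(geocoded), len(failed))  # 1 1
```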


    Geocode

This application is a Spark job that processes an entire book of business to assign a geocode location for each property. It appends geocode attributes to an input CSV file of addresses.

    For example, the input is a CSV file representing the following customer data and addresses:

    Name Address City State Zip Country

    Richard Hendricks 5230 Penfield Ave Woodland Hills CA 91364 USA

    Peter Gregory 54 State St #701 Albany NY 12207 USA

    Russ Hanneman 3 Comma St Warwick RI 02889 USA

The Spark job is then run with the requested geocoding-output-fields: x and y.

    The output is a CSV with the input columns and requested output columns:

Name               Address            City            State  Zip    Country  X            Y

Richard Hendricks  5230 Penfield Ave  Woodland Hills  CA     91364  USA      -118.568145  34.165625

Peter Gregory      54 State St #701   Albany          NY     12207  USA      -73.752      42.64892

Russ Hanneman      3 Comma St         Warwick         RI     02889  USA      -71.383139   41.713852

    Job Commands

    To run the Spark job you must use the spark-submit script in Spark’s bin directory.

    The following are example Spark commands. Use the command that's appropriate for your version of Spark.


    Example 1 (Spark 1.0 using HDFS)

spark-submit
--class com.pb.bigdata.geocoding.spark.app.GeocodeDriver
--master yarn --deploy-mode cluster
/pb/geocoding/software/spark1/driver/spectrum-bigdata-geocoding-spark1-version-all.jar
--input /user/pbuser/customers/addresses.csv
--output /user/pbuser/customers_geocoded
--geocoding-output-fields x y
--geocoding-config-location hdfs:///pb/geocoding/software/resources/config/
--geocoding-binaries-location hdfs:///pb/geocoding/software/resources/nativeLibraries/bin/linux64/
--download-location /pb/downloads
--geocoding-preferences-filepath hdfs:///pb/geocoding/software/resources/config/geocodePreferences.xml
--geocoding-input-fields streetName=0 areaName3=1 areaName1=2 postCode1=3
--num-partitions=15
--combine

    Example 2 (Spark 2.0 using MapR)

spark-submit
--class com.pb.bigdata.geocoding.spark.app.GeocodeDriver
--master yarn --deploy-mode cluster
/pb/geocoding/software/spark2/driver/spectrum-bigdata-geocoding-spark2-version-all.jar
--input /user/pbuser/customers/addresses.csv
--output /user/pbuser/customers_geocoded
--geocoding-output-fields x y
--geocoding-config-location /mapr//pb/geocoding/software/resources/config/
--geocoding-binaries-location /mapr//pb/geocoding/software/resources/nativeLibraries/bin/linux64/
--geocoding-preferences-filepath /mapr//pb/geocoding/software/resources/config/geocodePreferences.xml
--geocoding-input-fields streetName=0 areaName3=1 areaName1=2 postCode1=3
--num-partitions=15
--combine

    Custom job arguments

    All parameters are declared with a double dash.

    Parameter Example

--input    --input /user/pbuser/customers/addresses.csv

The location of the input CSV file.

    --output --output /user/pbuser/customers_geocoded

    The location of the directory for the output, including all input columns along with the fields requested in the geocoding-output-fields parameter as well as any errors.


    Parameter Example

    --geocoding-output-fields --geocoding-output-fields x y

    The requested geocode fields to be included in the output. For more information, see Spark Output Fields on page 29.

--geocoding-config-location

Location of the directory containing geocoding configurations. This property must be set. This directory contains JsonDataConfig.json, which contains the location and information of the installed geocoding reference data.

Examples:
--geocoding-config-location hdfs:///pb/geocoding/software/resources/config/
--geocoding-config-location /mapr//pb/geocoding/software/resources/config/

--geocoding-binaries-location

Location of the directory containing geocoding libraries. This property must be set.

Examples:
--geocoding-binaries-location hdfs:///pb/geocoding/software/resources/nativeLibraries/bin/linux64
--geocoding-binaries-location /mapr//pb/geocoding/software/resources/nativeLibraries/bin/linux64

--geocoding-preferences-filepath

File path of the geocoding preferences file. This optional file can be edited by advanced users to change the behavior of the geocoder.

Examples:
--geocoding-preferences-filepath hdfs:///pb/geocoding/software/resources/config/geocodePreferences.xml
--geocoding-preferences-filepath /mapr//pb/geocoding/software/resources/config/geocodePreferences.xml

    --geocoding-input-fields

Address field mappings for the input, specified in mixed or camelCase form.

    Note: For detailed information on input fields, see the Global Geocoding SDK Developer Guide, which is also available on the Spectrum Spatial for Big Data documentation landing page.

• Specifying individual address fields by input column index:

--geocoding-input-fields streetName=0 areaName3=1 areaName1=2 postCode1=3

• Using column names from the input CSV file (requires a header in the CSV file and setting --csv header=true):

--geocoding-input-fields streetName=street areaName3=city areaName1=state postCode1=zip

• Specifying input as a single line, where multiple input CSV fields are concatenated into one address field:

--geocoding-input-fields mainAddressLine=0,1,2,3
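The difference between per-field mappings and a single-line mapping can be pictured with a small sketch. The dictionary-building code below is purely illustrative; it is not how the driver implements the mapping:

```python
# One input CSV row (columns: street, city, state, zip)
row = ["5230 Penfield Ave", "Woodland Hills", "CA", "91364"]

# --geocoding-input-fields streetName=0 areaName3=1 areaName1=2 postCode1=3
# maps each column index to its own address field
by_index = {"streetName": row[0], "areaName3": row[1],
            "areaName1": row[2], "postCode1": row[3]}

# --geocoding-input-fields mainAddressLine=0,1,2,3
# concatenates the referenced columns into one single-line address field
single_line = {"mainAddressLine": " ".join(row[i] for i in (0, 1, 2, 3))}

print(by_index["streetName"])          # 5230 Penfield Ave
print(single_line["mainAddressLine"])  # 5230 Penfield Ave Woodland Hills CA 91364
```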


    Parameter Example

    Optional Parameters

    --download-location --download-location /pb/downloads

    Location of the directory where reference data will be downloaded to. This path must exist on every data node and the HiveServer2 node.

    Note: Use only if reference data was distributed remotely via HDFS or S3.

    --geocoding-country --geocoding-country USA

    If your input data does not have country information then you can specify the country as a parameter. Alternatively, you can use a column reference in --geocoding-input-fields (for example: --geocoding-input-fields country=2)

    --overwrite --overwrite

    Including this parameter will tell the job to overwrite the output directory. Otherwise the job will fail if this directory already has content. This parameter does not have a value.

    --num-partitions --num-partitions=15

    The minimum number of partitions used to split up the input file.

--download-group    --download-group pbdownloads

    This property is only used for POSIX-compliant platforms like Linux. It specifies the operating system group which should be applied to the downloaded data on a local file system, so that each Hadoop service can update the data when required. This group should be present on all nodes in the cluster and the operating system user executing the Hadoop service should be a part of this group.

    For more information, see Download Permissions on page 48.

    Note: Use only if reference data was distributed remotely via HDFS or S3.


    Parameter Example

    --error-field --error-field error

    Add the error field to your output to see any error information.

    --combine --combine

Including this parameter will tell the job to combine all output files into a single output file. Otherwise the job will create multiple output files; the number of output files depends on the number of partitions specified by the user.

    Note: As the size of the data to be combined grows, especially past the size of the space available on a single node, there is a chance of getting errors.

    CSV Options

    Any standard Databricks CSV entry can also be used as needed, including the following options:

    CSV Option Example

--csv delimiter    --csv delimiter=,  or  --csv delimiter='\t'

Default is the Databricks CSV default: ,

    --csv quote --csv quote=\'

    Default is the Databricks CSV default: "

    --csv escape --csv escape=\'

    Default is the Databricks CSV default: \

    --csv header --csv header=true

Default is the Databricks CSV default: false.
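As an illustration of how the delimiter and quote options interact, the following sketch parses one line with the same delimiter and quote defaults as in the table above. Python's csv module is used only as an analogy; the Spark job itself parses with the Databricks CSV reader:

```python
import csv
import io

# A quoted field may contain the delimiter; the quote character protects it.
line = '"Hendricks, Richard","5230 Penfield Ave",Woodland Hills,CA\n'

reader = csv.reader(io.StringIO(line), delimiter=",", quotechar='"')
fields = next(reader)
print(fields)
# ['Hendricks, Richard', '5230 Penfield Ave', 'Woodland Hills', 'CA']
```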


    Reverse Geocode

This application is a Spark job that processes longitude and latitude coordinates and returns an address and other geocode attributes. It appends these attributes to an input CSV file of coordinates.

    For example, the input is a CSV file representing the following customer data and coordinates:

    Name X Y

    James McGraw -83.687971 43.011972

    Jack Rackham -74.023676 41.923198

    Edward Teach -71.369907 42.459512

    Then the Spark job is run with the requested geocoding-output-fields: formattedStreetAddress, areaName3, areaName1, and postCode1.

    The output is a CSV with the input columns and requested columns:

Name          X           Y          Street Address    Area Name 3  Area Name 1  Postcode

James McGraw  -83.687971  43.011972  900 S SAGINAW ST  FLINT        MI           48502

Jack Rackham  -74.023676  41.923198  26 VALENTINE AVE  KINGSTON     NY           12401

Edward Teach  -71.369907  42.459512  63 WOOD ST        CONCORD      MA           01742

    Job Commands

    To run the Spark job you must use the spark-submit script in Spark’s bin directory.

    The following are example Spark commands. Use the command that's appropriate for your version of Spark.


    Example 1 (Spark 1.0 using HDFS)

spark-submit
--class com.pb.bigdata.geocoding.spark.app.ReverseGeocodeDriver
--master yarn --deploy-mode cluster
/pb/geocoding/software/spark1/driver/spectrum-bigdata-geocoding-spark1-version-all.jar
--input /user/pbuser/customers/locations.csv
--output /user/pbuser/customers_addresses
--geocoding-output-fields formattedStreetAddress areaName3 areaName1 postCode1 country
--geocoding-config-location hdfs:///pb/geocoding/software/resources/config/
--geocoding-binaries-location hdfs:///pb/geocoding/software/resources/nativeLibraries/bin/linux64/
--download-location /pb/downloads
--geocoding-preferences-filepath hdfs:///pb/geocoding/software/resources/config/geocodePreferences.xml
--geocoding-input-fields x=0 y=1
--combine

    Example 2 (Spark 2.0 using MapR)

spark-submit
--class com.pb.bigdata.geocoding.spark.app.ReverseGeocodeDriver
--master yarn --deploy-mode cluster
/pb/geocoding/software/spark2/driver/spectrum-bigdata-geocoding-spark2-version-all.jar
--input /user/pbuser/customers/locations.csv
--output /user/pbuser/customers_addresses
--geocoding-output-fields formattedStreetAddress areaName3 areaName1 postCode1 country
--geocoding-config-location /mapr//pb/geocoding/software/resources/config/
--geocoding-binaries-location /mapr//pb/geocoding/software/resources/nativeLibraries/bin/linux64/
--geocoding-preferences-filepath /mapr//pb/geocoding/software/resources/config/geocodePreferences.xml
--geocoding-input-fields x=0 y=1
--combine

    Custom job arguments

    All parameters are declared with a double dash.

    Parameter Example

    --input --input /user/pbuser/customers/locations.csv

The location of the input CSV file.

    --output --output /user/pbuser/customers_addresses

The location of the directory for the output, including all input columns along with the fields requested in the geocoding-output-fields parameter as well as any errors.


    Parameter Example

--geocoding-output-fields    --geocoding-output-fields formattedStreetAddress areaName3 areaName1 postCode1 country

The requested geocode fields to be included in the output. For more information, see Spark Output Fields on page 29.

--geocoding-config-location

Location of the directory containing geocoding configurations. This property must be set. This directory contains JsonDataConfig.json, which contains the location and information of the installed geocoding reference data.

Examples:
--geocoding-config-location hdfs:///pb/geocoding/software/resources/config/
--geocoding-config-location /mapr//pb/geocoding/software/resources/config/

--geocoding-binaries-location

Location of the directory containing geocoding libraries. This property must be set.

Examples:
--geocoding-binaries-location hdfs:///pb/geocoding/software/resources/nativeLibraries/bin/linux64
--geocoding-binaries-location /mapr//pb/geocoding/software/resources/nativeLibraries/bin/linux64

--geocoding-preferences-filepath

File path of the geocoding preferences file. This optional file can be edited by advanced users to change the behavior of the geocoder.

Examples:
--geocoding-preferences-filepath hdfs:///pb/geocoding/software/resources/config/geocodePreferences.xml
--geocoding-preferences-filepath /mapr//pb/geocoding/software/resources/config/geocodePreferences.xml

    --geocoding-input-fields --geocoding-input-fields x=0 y=1

The X (Longitude) column and Y (Latitude) column. Optionally include csys and country if you want them from individual rows in the input rather than a single global setting for all rows.

    Optional Parameters


    Parameter Example

    --download-location --download-location /pb/downloads

    The location of the directory where reference data will be downloaded to. This path must exist on every data node and the HiveServer2 node. This property is required if your reference data is distributed on HDFS, as recommended. It is also required if you use HDFS paths for configuration files, binaries, or preferences.

    Note: Use only if reference data was distributed remotely via HDFS or S3.

    --geocoding-country --geocoding-country USA

If your input data does not have country information, then you can specify the country as a parameter. Alternatively, you can use a column reference in --geocoding-input-fields (for example: --geocoding-input-fields x=0 y=1 country=2).

--geocoding-csys    --geocoding-csys epsg:4326

If your input data does not have coordinate system information, then you can specify the coordinate system as a parameter. Alternatively, you can use a column reference in --geocoding-input-fields (for example: --geocoding-input-fields x=0 y=1 csys=2).

    --overwrite --overwrite

    Including this parameter will tell the job to overwrite the output directory. Otherwise the job will fail if this directory already has content. This parameter does not have a value.

    --num-partitions --num-partitions=15

    The minimum number of partitions used to split up the input file.


    Parameter Example

    --download-group --download-group pbdownloads

    This property is only used for POSIX-compliant platforms like Linux. It specifies the operating system group which should be applied to the downloaded data on a local file system, so that each Hadoop service can update the data when required. This group should be present on all nodes in the cluster and the operating system user executing the Hadoop service should be a part of this group.

    For more information, see Download Permissions on page 48.

    Note: Use only if reference data was distributed remotely via HDFS or S3.

    --error-field --error-field error

    Add the error field to your output to see any error information.

    --combine --combine

Including this parameter will tell the job to combine all output files into a single output file. Otherwise the job will create multiple output files; the number of output files depends on the number of partitions specified by the user.

    Note: As the size of the data to be combined grows, especially past the size of the space available on a single node, there is a chance of getting errors.

    CSV Options

    Any standard Databricks CSV entry can also be used as needed, including the following options:

    CSV Option Example

--csv delimiter    --csv delimiter=,  or  --csv delimiter='\t'

Default is the Databricks CSV default: ,


    CSV Option Example

    --csv quote --csv quote=\'

    Default is the Databricks CSV default: "

    --csv escape --csv escape=\'

    Default is the Databricks CSV default: \

    --csv header --csv header=true

Default is the Databricks CSV default: false.

    Spark API

    In-process Spark APIs for geocoding, reverse geocoding, and geocoding by key are also available for Spark versions 1.0 and 2.0:

• spectrum-bigdata-geocoding-version.zip/spark1/sdk
• spectrum-bigdata-geocoding-version.zip/spark2/sdk

    When writing basic Spark jobs, you can use this API to augment dataframes with fields from a geocode result. For example, you can add some specific geocoding result fields from the top candidate, then process that result with an aggregation or filter. See the javadocs provided with the API for details. Javadocs are also available on the Spectrum Spatial for Big Data documentation landing page.

    Sample Spark API Geocoding Application

    Note: This topic assumes the product is installed to /pb/geocoding/software as described in the Spectrum™ Geocoding for Big Data Install Guide. An Internet connection is also required to build the sample.

    A sample for Spark API geocoding in Scala is available that demonstrates multipass geocoding against the addresses.csv file provided in /sampleData/DC. For all first-pass results with suboptimal precision codes (that is, any code other than S8), another geocoding pass will be performed as a single line address to improve the results.


    To build the sample in your environment:

    1. Navigate to the appropriate directory for your version of Spark:

    Version Path

    1.0 /pb/geocoding/software/spark1/samples/multipass

    2.0 /pb/geocoding/software/spark2/samples/multipass

    2. At the command prompt, set the gradlew file to be executable then run the gradle build:

chmod +x gradlew
./gradlew build

    A build/libs folder is created containing the MultipassGeocodingSample-version-all.jar file for the sample.

    3. Copy the jar to a location on the master node, such as /pb/master/geocoding/sample/driver/MultipassGeocodingSample-version-all.jar.

4. You will also need to copy the sample addresses.csv file to HDFS, the MapR file system, or S3. In the directory where you extracted the spectrum-bigdata-geocoding-version.zip file, navigate to /sampleData/DC (if you followed the Spectrum™ Geocoding for Big Data Install Guide, this location would be /pb/geocoding/software/sampleData/DC).

    5. Copy addresses.csv to a temporary location on a node, then create a location on HDFS, the MapR file system, or S3 and copy addresses.csv to it, for example:

    Example (using HDFS)

    hadoop fs -mkdir hdfs:///pb/geocoding/sampleData/DC

hadoop fs -copyFromLocal /pb/geocoding/software/sampleData/DC/addresses.csv hdfs:///pb/geocoding/sampleData/DC

    Example (using MapR)

    hadoop fs -mkdir /mapr//pb/geocoding/sampleData/DC

hadoop fs -copyFromLocal /pb/geocoding/software/sampleData/DC/addresses.csv /mapr//pb/geocoding/sampleData/DC

6. Specify the download group if required. It is an optional argument, specific to POSIX-compliant platforms like Linux. It specifies the operating system group which should be applied to the downloaded data on a local file system, so that each Hadoop service can update the data when required. This group should be present on all nodes in the cluster, and the operating system user executing the Hadoop service should be a part of this group. This is the last argument in the spark-submit command.

    The sample is now built and ready to use. If you want to edit the Scala sample code, it is located in:

spark1/samples/multipass/MultipassGeocodingSample/src/main/scala/com/pb/sample/spark/app
or
spark2/samples/multipass/MultipassGeocodingSample/src/main/scala/com/pb/sample/spark2/app

    Example Spark command (using HDFS):

spark-submit
--class com.pb.sample.spark.app.MultipassGeocodeDriver
--master yarn --deploy-mode cluster
/pb/master/geocoding/sample/driver/MultipassGeocodingSample-version-all.jar
hdfs:///pb/geocoding/sampleData/DC/addresses.csv
hdfs:///pb/geocoding/software/resources/config
hdfs:///pb/geocoding/software/resources/nativeLibraries/bin/linux64
/pb/downloads
hdfs:///pb/user/multipassGeocoding/output
pbdownloads

    Example Spark command (using MapR):

spark-submit
--class com.pb.sample.spark.app.MultipassGeocodeDriver
--master yarn --deploy-mode cluster
/pb/master/geocoding/sample/driver/MultipassGeocodingSample-version-all.jar
/mapr//pb/geocoding/sampleData/DC/addresses.csv
/mapr//pb/geocoding/software/resources/config
/mapr//pb/geocoding/software/resources/nativeLibraries/bin/linux64
/mapr//pb/user/multipassGeocoding/output


  • 4 - Appendix

    In this section

Sample Data 44
Geocode Preferences 46
Geocode-Level Errors 47
Download Permissions 48
Copyright 49


    Sample Data

    If you want to perform geocoding and reverse geocoding but do not have a geocoding data dictionary or sample input, you can use the samples that are in the product under /pb/geocoding/software/sampleData/DC.

    Installing Washington D.C. Data Dictionary

The data dictionary used for the samples covers all of Washington D.C. and works for both geocoding and reverse geocoding. Follow the instructions in the Spectrum™ Geocoding for Big Data Geocoding Install Guide to install the geocoding reference data as you would normal reference data, but use the /pb/geocoding/software/sampleData/DC/sampleDictionary directory instead of an SPD file.

Also, this directory is already extracted, so you do not need to use the Geocoding CLI to extract the SPD. Instead, copy the directory to the data folder:

cp -R /pb/geocoding/software/sampleData/DC/sampleDictionary /pb/geocoding/data

After the geocoding reference data is installed, follow the rest of the instructions in the Spectrum™ Geocoding for Big Data User Guide on the Spectrum Spatial for Big Data documentation landing page (http://support.pb.com/help/hadoop/landingpage/index.html), depending on the services you want to use.

    Sample Addresses for Geocoding

A sample CSV file of addresses is provided at /pb/geocoding/software/sampleData/DC/addresses.csv. The file contains 200 addresses in Washington D.C., some of which intentionally contain errors to demonstrate error handling. Here are the first two rows:

    3400 IDAHO AVE NW # 402 WASHINGTON DC 20016

    1130 17TH ST NW # 200 WASHINGTON DC 20036

To use this file you will need to copy it, for example, to HDFS:

hadoop fs -mkdir dc_addresses
hadoop fs -copyFromLocal /pb/geocoding/software/sampleData/DC/addresses.csv dc_addresses/address.csv

    This would create a file at dc_addresses/address.csv in your user's home directory in HDFS.




• If you are using Spark, you can point directly to this HDFS directory or file.
• If you are using Hive, you need the additional step of creating a Hive table. Use a supported JDBC client such as Beeline and run the following CREATE statement, replacing pbuser with your own user.

CREATE EXTERNAL TABLE dc_addresses(address STRING, city STRING, state STRING, postcode STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ("separatorChar" = ",")
STORED AS TEXTFILE
LOCATION '/user/pbuser/dc_addresses';

This assumes you connected to Hive as your end user. If you are connected as the Hive user, you will need to give the Hive user permissions to that directory:

    sudo -u hdfs hadoop fs -chown -R hive: /user/pbuser/dc_addresses

    Sample Coordinates for Reverse Geocoding

A sample CSV file of locations is provided at /pb/geocoding/software/sampleData/DC/locations.csv. The file contains 10 locations in Washington D.C. Here are the first two rows:

    -76.97921145819674 38.89325106109181

    -77.0405668395 38.901288308

To use this file you will need to copy it, for example, to HDFS:

hadoop fs -mkdir dc_locations
hadoop fs -copyFromLocal /pb/geocoding/software/sampleData/DC/locations.csv dc_locations/locations.csv

    This would create a file at dc_locations/locations.csv in your user's home directory in HDFS.
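A common mistake with reverse-geocode input is swapping the coordinate order; the sample file stores longitude first, then latitude. Before submitting a job, a cheap sanity check is to confirm every pair falls inside a known bounding box for the data's region. This is a sketch, not product code; the bounding-box values for Washington D.C. are rough assumptions:

```python
def in_bbox(rows, lon_range, lat_range):
    """Check that every (lon, lat) pair falls inside a known bounding box
    for the data's region -- a cheap way to catch swapped coordinate order."""
    return all(lon_range[0] <= lon <= lon_range[1] and
               lat_range[0] <= lat <= lat_range[1]
               for lon, lat in rows)

# Rough Washington D.C. bounding box (an assumption for this sketch).
DC_LON, DC_LAT = (-77.2, -76.8), (38.7, 39.1)

# First two rows of the sample locations.csv, as (lon, lat) tuples.
rows = [(-76.97921145819674, 38.89325106109181),
        (-77.0405668395, 38.901288308)]

print(in_bbox(rows, DC_LON, DC_LAT))                              # True
print(in_bbox([(lat, lon) for lon, lat in rows], DC_LON, DC_LAT))  # False
```

Note that a simple range check against (-180, 180) and (-90, 90) would not catch swapped D.C. coordinates, since both values happen to be under 90 in magnitude; a regional bounding box does.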

• If you are using Spark, you can point directly to this HDFS directory or file.
• If you are using Hive, you need the additional step of creating a Hive table. Use a supported JDBC client such as Beeline and run the following CREATE statement, replacing pbuser with your own user.

CREATE EXTERNAL TABLE dc_locations(x DOUBLE, y DOUBLE)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ("separatorChar" = ",")
STORED AS TEXTFILE
LOCATION '/user/pbuser/dc_locations';

This assumes you connected to Hive as your end user. If you are connected as the Hive user, you will need to give the Hive user permissions to that directory:

    sudo -u hdfs hadoop fs -chown -R hive: /user/pbuser/dc_locations

    Geocode Preferences

You may need to change how the geocoder behaves or set an option that is not directly exposed. To do so, edit the geocodePreferences.xml file located at hdfs:///pb/geocoding/software/resources/config/geocodePreferences.xml or /mapr//pb/geocoding/software/resources/config/geocodePreferences.xml.

    If you do not want to store this remotely (on HDFS or S3), then it must exist on all data nodes and master nodes.

    PB_Key

    Master Location Data datasets support a unique feature called PB_KEY. This is an identifier attached to every address record that facilitates data enrichment for faster and more informed decision making. To get PB_KEY, add the following XML element inside the geocodePreferences.xml file:

    true

    You will still need to request PB_KEY in the outputFields field.

    Default Coordinate System for Reverse Geocoding

    If you do not specify the coordinate system when performing a reverse geocode, then the default used is WGS84 (EPSG:4326). If you want a different default to be used, then set the following XML element inside the geocodePreferences.xml file:

    epsg:3857
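For context on what switching the default from EPSG:4326 to EPSG:3857 means: WGS84 coordinates are degrees of longitude and latitude, while EPSG:3857 (Web Mercator) coordinates are meters on a projected plane. The following is a minimal sketch of the standard spherical Web Mercator conversion, shown only to illustrate the two systems; it is not part of the product:

```python
import math

R = 6378137.0  # Web Mercator sphere radius in meters

def wgs84_to_web_mercator(lon, lat):
    """Project EPSG:4326 degrees to EPSG:3857 meters (spherical formula)."""
    x = math.radians(lon) * R
    y = math.log(math.tan(math.pi / 4 + math.radians(lat) / 2)) * R
    return x, y

def web_mercator_to_wgs84(x, y):
    """Inverse projection back to longitude/latitude in degrees."""
    lon = math.degrees(x / R)
    lat = math.degrees(2 * math.atan(math.exp(y / R)) - math.pi / 2)
    return lon, lat

# One of the sample D.C. locations, round-tripped through EPSG:3857.
x, y = wgs84_to_web_mercator(-77.0405668395, 38.901288308)
lon, lat = web_mercator_to_wgs84(x, y)
```

The round trip recovers the original degrees to within floating-point precision, which is a quick way to see that the two systems carry the same location in different units.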

    Spectrum™ Geocoding for Big Data 3.2 Spectrum™ Geocoding for Big Data User Guide 46

  • Appendix

    Geocode-Level Errors

    These errors occur in the error field of an individual geocode.

Geocoding candidate not found

A location could not be found for the given address. This can be solved in different ways:

• Verify the proper columns are being used in the geocode function. For example, the state column may improperly be passed in the city argument. If the majority of geocodes hit this error, this is probably the cause; consider revisiting the UDF page of the function you are trying to use.
• The address may need to be more precise or cleaned up.

Even after doing all of this, it is still possible for an address to be ungeocodable.

Data not found

The geocoding reference data is missing or misconfigured on at least one master or data node. This can be solved in different ways:

• Verify that JsonDataConfig.json has paths properly pointing to the installed geocoding data dictionaries. Every node must be configured, and all nodes must be configured the same way. The data dictionaries must also exist on each data node and master node.
• Verify you have data dictionaries that cover the country or region you are trying to geocode.

Invalid record

This row is invalid or malformed. If all addresses have this error, verify the column indexes match up with your data. Otherwise look into the following:

• Verify the row has the correct number of columns.
• If the row contains special characters, such as the delimiter, quote, or escape characters, verify the row is properly escaped.
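The last two checks can be automated before loading data. Below is a minimal sketch (not product code) that flags rows with the wrong column count, assuming a 4-column comma-separated address file like the sample; the example rows are hypothetical:

```python
import csv
import io

EXPECTED_COLUMNS = 4  # address, city, state, postcode (the sample layout)

def find_invalid_rows(text):
    """Return (line_number, row) pairs whose parsed column count differs
    from EXPECTED_COLUMNS. The csv module applies standard quote handling,
    so rows with stray unescaped delimiters also surface here."""
    bad = []
    for lineno, row in enumerate(csv.reader(io.StringIO(text)), start=1):
        if len(row) != EXPECTED_COLUMNS:
            bad.append((lineno, row))
    return bad

sample = (
    '3400 IDAHO AVE NW # 402,WASHINGTON,DC,20016\n'
    '1130 17TH ST NW # 200,WASHINGTON,DC\n'       # missing postcode column
    '"1600 PENN, AVE NW",WASHINGTON,DC,20500\n'   # embedded comma, quoted
)
print(find_invalid_rows(sample))  # only line 2 is flagged
```

Note that the third row passes because its embedded comma is properly quoted, which is exactly the escaping the error message above asks you to verify.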



    Download Permissions

Setting the download permissions allows multiple services to download and update the downloaded data when required. You should have a common operating system group that all the service users who need to download data belong to. For example, if Hive and YARN jobs need to download data to the same download location, then both the Hive and YARN operating system users should be members of a common operating system group. The download directory should be owned by that group and have Read, Write, and Execute permissions (775) for the owner and group.

    Your group should contain services and users that will run jobs in your cluster. You may skip services you will not use or do not have installed. Services include YARN, Hive, Zeppelin, and Hue.

You should also include all operating system users who will run jobs, such as pbuser.

    1. Add the group.

    sudo groupadd pbdownloads

    2. Add users to the group.

sudo usermod -a -G pbdownloads hive
sudo usermod -a -G pbdownloads yarn
sudo usermod -a -G pbdownloads zeppelin
sudo usermod -a -G pbdownloads hue
sudo usermod -a -G pbdownloads pbuser

    3. Using a window where no job is running, restart all the services whose operating system users were added to the new group.

    4. Using a window where no job is running, restart the session of all the operating system users that were added to new group (for example, pbuser).

5. Change the group of the download directory specified in the pb.download.location property to the common operating system group, and update its permissions to 775.

sudo chgrp pbdownloads /pb/downloads
sudo chmod 775 /pb/downloads
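After step 5, it can be useful to verify the result programmatically. The following is a small sketch for POSIX systems, not part of the product; the pbdownloads group and /pb/downloads path come from the examples above:

```python
import grp
import os
import stat

def check_download_dir(path, expected_group):
    """Return (mode_ok, group_ok): whether `path` has 775 permissions
    and is owned by `expected_group`, as the steps above require."""
    st = os.stat(path)
    mode_ok = stat.S_IMODE(st.st_mode) == 0o775
    group_ok = grp.getgrgid(st.st_gid).gr_name == expected_group
    return mode_ok, group_ok

# Example usage: check_download_dir("/pb/downloads", "pbdownloads")
```

Running this on each node catches the common case where one node was missed when the group or permissions were updated.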



    Copyright

    Information in this document is subject to change without notice and does not represent a commitment on the part of the vendor or its representatives. No part of this document may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, without the written permission of Pitney Bowes Software Inc., 350 Jordan Road, Troy, New York 12180. © 2019 Pitney Bowes Software Inc. All rights reserved. Location Intelligence APIs are trademarks of Pitney Bowes Software Inc. All other marks and trademarks are property of their respective holders.


Pitney Bowes
3001 Summer Street

    Stamford CT 06926-0700

    USA

    www.pitneybowes.com

    © 2019 Pitney Bowes Software Inc.

    All rights reserved

