Post on 14-Jul-2015

Apache Kite

Making life easier in Hadoop

Why Kite?
- Codify expert patterns and practices for building data-oriented systems and applications.
- Let developers focus on business logic, not plumbing or infrastructure.
- Provide smart defaults for platform choices.
- Support piecemeal adoption via loosely-coupled modules.

Kite Data Module

Provides APIs to interact with data in Hadoop.

The data module contains APIs and utilities for defining and performing actions on datasets:
- entities
- schemas
- datasets
- dataset repositories
- loading data
- dataset writers
- viewing data

Entities, Schemas

Entity: a single record in a dataset. It is a plain Java object, analogous to a row in an RDBMS table.

Schema: a schema specifies the field names and data types for a dataset. Kite relies on Apache Avro for schemas. A schema can be defined:
- Using the Java API (the Avro schema is inferred from a Java class or Avro data)
- Using the command line (the Avro schema is inferred from a Java class or CSV data)

Datasets

Dataset: a collection of zero or more entities, analogous to an RDBMS table.
- The HDFS implementation of a dataset is stored as Snappy-compressed Avro data files by default. Data can also be stored in the column-oriented Parquet format.
- Dataset performance can be improved with a partition strategy:
  - Based on one or more fields in the entity.
  - Partitioning can use hash, identity, or date (year, month, day, hour) strategies.
  - It provides coarse-grained organization of the data.
  - A partition strategy is configured in a JSON-based format.
  - A partition strategy can be applied only when the dataset is created; it cannot be altered later.
- We can work with a subset of a dataset's entities using the Views API.
- Datasets are identified by URIs.

Dataset URIs: depending on the dataset scheme, a dataset URI is specified using one of the following patterns.
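To make the partitioning idea concrete, here is a minimal Python sketch of what a hash partition strategy does conceptually: a source field is hashed into a fixed number of buckets, and the bucket determines the coarse-grained partition a record lands in. This is illustrative only, not Kite's actual implementation (Kite derives partitions internally from the JSON strategy file).

```python
import hashlib

def hash_partition(entity, source_field, buckets):
    """Return the partition segment an entity falls into (illustrative).

    A stable hash (md5) is used so the bucket is deterministic for a
    given field value; determinism is what makes partition pruning work.
    """
    value = str(entity[source_field]).encode("utf-8")
    bucket = int(hashlib.md5(value).hexdigest(), 16) % buckets
    return f"{source_field}_hash={bucket}"

movie = {"id": 42, "title": "Vertigo", "releaseDate": "1958-05-09"}
print(hash_partition(movie, "id", 16))  # one of 16 coarse buckets
```

A query that filters on the partitioned field then only needs to read one bucket's directory instead of scanning the whole dataset.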

View URIs: a view URI is constructed by changing the prefix of a dataset URI from dataset: to view:. Query arguments can be added as name/value pairs, similar to query arguments in an HTTP URL.
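The textual transformation is simple enough to sketch. The helper below is hypothetical (Kite constructs these URIs for you), and the movies dataset name is an assumed example:

```python
from urllib.parse import urlencode

def to_view_uri(dataset_uri, **constraints):
    """Turn a dataset: URI into a view: URI with query constraints.

    Hypothetical illustration of the documented rule: swap the
    dataset: prefix for view:, then append name/value pairs.
    """
    assert dataset_uri.startswith("dataset:")
    view = "view:" + dataset_uri[len("dataset:"):]
    return view + "?" + urlencode(constraints) if constraints else view

print(to_view_uri("dataset:hive:movies", year=2015))
# view:hive:movies?year=2015
```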

Hive:     dataset:hive:/
HDFS:     dataset:hdfs:///
Local FS: dataset:file:///
HBase:    dataset:hbase:/

Dataset Repositories, Loading, Dataset Writers, Viewing Data

Dataset Repositories: the physical storage location for datasets, equivalent to a database in the RDBMS model.
- Required for logical grouping, security, access controls, backup policies, etc.
- Each dataset belongs to exactly one dataset repository.
- Kite does not provide functionality for copying or moving a dataset from one repository to another. (However, it can be done via MapReduce.)

Loading: we can load comma-separated values into a dataset repository using the CLI.

Dataset Writers: used to add entities to datasets.

Viewing Data: we can query the data using Hive/Impala, or use the CLI.

Kite Dataset Lifecycle

Generate Schema

A Kite dataset is defined using an Avro schema. The schema can be written manually or generated from a Java object or a CSV data file.

CLI command for schema generation from a Java class:

kite-dataset obj-schema org.kitesdk.cli.example.Movie -o movie.avsc

CLI command for schema generation from a CSV file:

kite-dataset csv-schema movie.csv --class Movie -o movie.avsc

Example Schema Generation

package org.kitesdk.examples.data;

/** Movie class */
class Movie {
  private int id;
  private String title;
  private String releaseDate;
  . . .

  public Movie() {
    // Empty constructor for serialization purposes
  }
}

{
  "type": "record",
  "name": "Movie",
  "namespace": "org.kitesdk.examples.data",
  "fields": [
    {"name": "id", "type": "int"},
    {"name": "title", "type": "string"},
    {"name": "releaseDate", "type": "string"}
  ]
}

Create Dataset

A dataset is created using the Avro schema:

kite-dataset create movie --schema movie.avsc

Partition Strategy:
- Logical partitions for improving performance
- Specified using a JSON file

Example: movie.json

[ { "source" : "id", "type" : "int", "name" : "id" } ]
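Strategies can also be combined in one file. A sketch (the field names timestamp and id are assumed; the partitioner vocabulary of year, month, hash, and buckets follows Kite's partition-strategy JSON format):

```json
[
  {"type": "year",  "source": "timestamp", "name": "year"},
  {"type": "month", "source": "timestamp", "name": "month"},
  {"type": "hash",  "source": "id", "buckets": 16, "name": "id_hash"}
]
```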

kite-dataset create movie --schema movie.avsc --partition-by movie.json

Create Dataset

Column Mapping:
- Specifies how data should be stored in HBase for maximum performance
- Specified in a JSON file
- Each definition is a JSON object with the following fields:
  - source: the field in the entity
  - type: where the field data is stored (cells in HBase)
  - family: the column family in the HBase table
  - qualifier: the column name in the HBase table

Example:

{"source" : "timestamp", "type" : "column", "family" : "m", "qualifier" : "ts"}

There are five mapping types:
1. Column
2. Counter
3. keyAsColumn
4. Key
5. Version

Populate-Validate-Update-Annihilate Dataset

Populate Dataset: there are various ways to populate a dataset:
- Importing from CSV files
- Copying from another dataset
- Using Flume ingestion, etc.

Validate Dataset: the show command can be used to validate the loaded data.

Update Dataset: Kite supports schema evolution via Avro.

Annihilate Dataset: delete the dataset when it is no longer required.
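The whole lifecycle above can be driven from the CLI. A sketch of the sequence (dataset and file names are illustrative; create, csv-import, show, update, and delete are kite-dataset subcommands):

```
# Create the dataset from its Avro schema
kite-dataset create movie --schema movie.avsc

# Populate: import comma-separated values
kite-dataset csv-import movies.csv movie

# Validate: print records from the dataset
kite-dataset show movie

# Update: evolve the schema (Avro-compatible change)
kite-dataset update movie --schema movie_v2.avsc

# Annihilate: delete the dataset when no longer needed
kite-dataset delete movie
```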