data structure for ctpp 2012-2016 data set€¦  · web viewthe etl process creates...

18
Data Structure for CTPP 2012-2016 Data Set 1. Source Data from Census The data from Census comes in a form of ZIP files (one per state). Each ZIP file contains a collection of CSV files (one per data table). Below is an example of data (first few rows) in AK_2012thru2016_A101102.csv file. The first column stores a geographic identifier (state of Alaska in screenshot above), the TBLID column contains the Table ID, LINENO contains the Line Number in the table (later mapped to specific Variables and Categories), EST contains the Estimate, MOE contains the Margin of Error. The last column (Source) is sometimes used to attach notes. Various lookup tables are also provided, which enables the software to construct Variables (called Dimensions in Analysis Services world) with codes and labels. 2. Data ETL Process After data is received from Census various ETL (Extraction, Transformation and Load) processes are run by Beyond 20/20 using SQL Server Integration Services (SSIS) packages. The goal of ETL is to load data into SQL Server databases in a form and shape suitable for building SQL Server Analysis Services (SSAS) cubes that can be analyzed by Perspective web application. The ETL process creates CTPP_2012_2016_DW database containing Fact Tables, Dimension Tables and a few other useful tables. Fact Tables are created for every data table and geographic level.

Upload: others

Post on 13-Jun-2020

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Data Structure for CTPP 2012-2016 Data Set€¦  · Web viewThe ETL process creates CTPP_2012_2016_DW database containing Fact Tables, Dimension Tables and a few other useful tables

Data Structure for CTPP 2012-2016 Data Set

1. Source Data from CensusThe data from Census comes in a form of ZIP files (one per state). Each ZIP file contains a collection of CSV files (one per data table).

Below is an example of data (first few rows) in AK_2012thru2016_A101102.csv file.

The first column stores a geographic identifier (state of Alaska in screenshot above), the TBLID column contains the Table ID, LINENO contains the Line Number in the table (later mapped to specific Variables and Categories), EST contains the Estimate, MOE contains the Margin of Error. The last column (Source) is sometimes used to attach notes.

Various lookup tables are also provided, which enables the software to construct Variables (called Dimensions in Analysis Services world) with codes and labels.

2. Data ETL ProcessAfter data is received from Census various ETL (Extraction, Transformation and Load) processes are run by Beyond 20/20 using SQL Server Integration Services (SSIS) packages. The goal of ETL is to load data into SQL Server databases in a form and shape suitable for building SQL Server Analysis Services (SSAS) cubes that can be analyzed by Perspective web application.

The ETL process creates CTPP_2012_2016_DW database containing Fact Tables, Dimension Tables and a few other useful tables.

Fact Tables are created for every data table and geographic level.

For Part 1 (RESIDENCE) there are 103 data tables and 11 geographic levels (C01, C02, C03, C04, C05, C07, C08, C09, C11, C12, C13), which results in 1,133 fact tables. The tables are named FACT_PART1_UNI(n)_C(l), where (n) runs from 1 to 103 and (l) is the level number.

For example table FACT_PART1_UNI1_C02 contains data for first RESIDENCE data table (A101101) at State level (C02). The order of tables is based on the order in “Part1Tables” sheet of “2012-2016 CTPP Requirements.xlsx” document supplied by Census. The sheet contains 102 tables, but there is additional calculated table B112108 inserted at position 96.

Page 2: Data Structure for CTPP 2012-2016 Data Set€¦  · Web viewThe ETL process creates CTPP_2012_2016_DW database containing Fact Tables, Dimension Tables and a few other useful tables

Here is an example of data (first few rows) stored in FACT_PART1_UNI1_C02 table.

The contents of each fact table are based on the contents of the corresponding data table. Table A101101 contains Population measure and Age(11) dimension (as defined in “table_content_lookup.csv” document).

The first 3 columns correspond to the measure used in the table. “Population” contains the actual value, “Population_mv” contains the Missing Value Indicator and “Population_fn” contains Footnote Indicator for each cell.

The next columns (0, 1 or more depending on the number of dimensions) contain Category Id for each dimension. In case of Age_11 value of 1 corresponds to “Total, all ages”, value of 2 corresponds to “Under 16 years”, etc.

“Dim_output” column contains the value of special Output dimension that you’ll see later in the cubes. Value of “est” corresponds to Estimate; value of “SE2” corresponds to Standard Error Squared (aka Variance of Standard Error). Only these 2 values are stored in the fact tables as the other values (Standard Error, Margin of Error, MOE Percent of Estimate) can be calculated from SE2. The reason SE2 is the value stored is that it is useful for calculating aggregate values.

The last column (“geoId”) stores the geographic identifier. In the screenshot above it is Alabama.

Page 3: Data Structure for CTPP 2012-2016 Data Set€¦  · Web viewThe ETL process creates CTPP_2012_2016_DW database containing Fact Tables, Dimension Tables and a few other useful tables

For Part 2 (WORKPLACE) there are 47 data tables and 11 geographic levels (C21, C22, C23, C24, C25, C27,C28, C29, C31, C32, C33) which results in 517 fact tables. The tables are named FACT_PART2_UNI(n)_C(l), where (n) runs from 1 to 47 and (l) is the level number.

For example table FACT_PART2_UNI1_C22 contains data for first WORKPLACE data table (A202101) at State level (C22). The order of tables is based on the order in “Part2Tables” sheet of “2012-2016 CTPP Requirements.xlsx” document supplied by Census.

Here is an example of data (first few rows) stored in FACT_PART2_UNI1_C22 table.

Refer to description of Part 1 fact tables for the order and meaning of each column.

For Part 3 (FLOW) there are 24 data tables and 17 geographic levels (C42, C43, C44, C45, C48, C49, C50, C51, C52, C53, C54, C55, C56, C57, C58, C59, C60) which results in 408 fact tables. The tables are named FACT_PART3_UNI(n)_C(l), where (n) runs from 1 to 24 and (l) is the level number.

For example table FACT_PART3_UNI1_C42 contains data for first FLOW data table (A302103) at State -> State flow level (C42). The order of tables is based on the order in “Part3Tables” sheet of “2012-2016 CTPP Requirements.xlsx” document supplied by Census.

Page 4: Data Structure for CTPP 2012-2016 Data Set€¦  · Web viewThe ETL process creates CTPP_2012_2016_DW database containing Fact Tables, Dimension Tables and a few other useful tables

Here is an example of data (first few rows) stored in FACT_PART3_UNI1_C42 table.

The table layout is similar to Part 1 and Part 2 tables except for the last 2 columns which contain 2 geographic identifiers, one for RESIDENCE and one for WORKPLACE.

Database CTPP_2012_2016_DW also contains Dimension Tables, both geographic and non-geographic. All dimension tables are prefixed with “dim_” followed by dimension name. For example table “dim_Age_9” for dimension “Age(9)” is shown below.

Column “Category” matches the dimension value in the fact tables; column “CategoryLabel” provides descriptive text and the last column (“Parent”) indicates the hierarchical parent of dimension member. Value of 0 indicates top level (no parent). In the above example categories 2-9 are children of category 1.

Page 5: Data Structure for CTPP 2012-2016 Data Set€¦  · Web viewThe ETL process creates CTPP_2012_2016_DW database containing Fact Tables, Dimension Tables and a few other useful tables

RESIDENCE geographic dimensions (one for each level) are stored in tables prefixed with “dim_geo_res_“ and followed by level code. For example table “dim_geo_res_C03” (shown partially below) contains RESIDENCE County dimension.

The first 5 columns (GeoID, GeoIDLabel, FipsCode, FipsName, SplitFipsCode) are present in all levels.

The “ExternalParent” column is only present below the National level and contains the geo code of the parent from a higher level in geographic hierarchy (in this example C0200US01 is the code of Alabama, which is the external parent of all counties in Alabama).

The “StateParents” column is only present below the State level. At some levels it may contain multiple entries as shown in table “dim_geo_res_C07” (RESIDENCE MSA dimension) below.

Page 6: Data Structure for CTPP 2012-2016 Data Set€¦  · Web viewThe ETL process creates CTPP_2012_2016_DW database containing Fact Tables, Dimension Tables and a few other useful tables

WORKPLACE geographic dimensions (one for each level) are stored in tables prefixed with “dim_geo_pow_“ and followed by level code. For example table “dim_geo_pow_C23” (shown partially below) contains WORKPLACE County dimension.

In addition to Fact and Dimension Tables the database CTPP_2012_2016_DW also contains other useful tables.

RESIDENCE levels are stored in “res_geo_levels” table shown below.

WORKPLACE levels are stored in “pow_geo_levels” table.

Page 7: Data Structure for CTPP 2012-2016 Data Set€¦  · Web viewThe ETL process creates CTPP_2012_2016_DW database containing Fact Tables, Dimension Tables and a few other useful tables

FLOW levels are stored in “flow_geo_levels” table.

Table “Footnotes” contains the dictionary of all footnotes (hover notes) used by the application.

Table “Missing_values” contains the dictionary of all missing values used by the application.

Page 8: Data Structure for CTPP 2012-2016 Data Set€¦  · Web viewThe ETL process creates CTPP_2012_2016_DW database containing Fact Tables, Dimension Tables and a few other useful tables

Valid flows are stored in tables prefixed with “geo_flow_” and followed by FLOW level code. For example table “geo_flow_C42” partially shown below show all valid State -> State flows.

Table “Selection_indicators” contains a list of Selection Indicators for every RESIDENCE and WORKPLACE geographic level. These indicators can be displayed on RESIDENCE or WORKPLACE selection map in the Perspective web application. The table is partially shown below.

The data for each Selection Indicator is stored in tables prefixed by “selection_data_” and followed by geographic level code. For example table “selection_data_C02” contains the data for Selection Indicators at RESIDENCE County level.

Most of the tables contained in CTPP_2012_2016_DW database are only required for building SSAS Cubes and are not directly required by Perspective web application. There are a few tables to which the

Page 9: Data Structure for CTPP 2012-2016 Data Set€¦  · Web viewThe ETL process creates CTPP_2012_2016_DW database containing Fact Tables, Dimension Tables and a few other useful tables

application needs access. These tables are copied from CTPP_2012_2016_DW database into a smaller CTPP_2012_2016_DW_APP database which accessible to the application.

CTPP_2012_2016_DW_APP database contains the following tables:

- Geographic Dimension Tables,- Level Tables,- Valid Flows Tables,- Selection Indicators and Selection Data Tables.

Another database used by Perspective application is CTPP_2012_2016_DW_GEOM supplied by CartoVista. This database contains shapes for various geographic identifiers plus other important geographic information.

3. SSAS CubesFinally SQL Server Analysis Services (SSAS) cubes are built from Fact Tables and Dimension Tables generated by the ETL process.

Just like Fact Tables the SSAS cubes are also built for every Data Table and every geographic level in all 3 parts (RESIDENCE, WORKPLACE, FLOW).

The cubes are divided into 7 SSAS databases: CTPP_2012_2016_Cubes_1A, CTPP_2012_2016_Cubes_1B, CTPP_2012_2016_Cubes_1C, CTPP_2012_2016_Cubes_1D, CTPP_2012_2016_Cubes_2A, CTPP_2012_2016_Cubes_2B, CTPP_2012_2016_Cubes_3A. The number before the last letter corresponds to Part Number (1, 2 or 3).

Database CTPP_2012_2016_Cubes_1A contains first 25 RESIDENCE cubes for each geographic level (C01 to C13), so total of 275 cubes. The cubes are named RES_CUBE(n)_ C(l), where (n) is the table number and (l) is the level number. Single-digit table numbers are prefixed with “0”.

Page 10: Data Structure for CTPP 2012-2016 Data Set€¦  · Web viewThe ETL process creates CTPP_2012_2016_DW database containing Fact Tables, Dimension Tables and a few other useful tables

For example cube RES_CUBE01_C02 contains data for first RESIDENCE table (A101101) at State level. This cube is used to display “A101101 - Age (11) (All persons)” Perspective report (at State level) shown below.

Database CTPP_2012_2016_Cubes_1B contains RESIDENCE cubes for tables 26-50 for each geographic level (C01 to C13), so total of 275 cubes. The cubes use the same naming convention as tables in database CTPP_2012_2016_Cubes_1A.

Database CTPP_2012_2016_Cubes_1C contains RESIDENCE cubes for tables 51-75 for each geographic level (C01 to C13), so total of 275 cubes. The cubes use the same naming convention as tables in database CTPP_2012_2016_Cubes_1A.

Database CTPP_2012_2016_Cubes_1D contains RESIDENCE cubes for tables 76-103 for each geographic level (C01 to C13), so total of 308 cubes. The cubes use the same naming convention as tables in database CTPP_2012_2016_Cubes_1A.

Page 11: Data Structure for CTPP 2012-2016 Data Set€¦  · Web viewThe ETL process creates CTPP_2012_2016_DW database containing Fact Tables, Dimension Tables and a few other useful tables

Database CTPP_2012_2016_Cubes_2A contains first 25 WORKPLACE cubes for each geographic level (C21 to C33), so total of 275 cubes. The cubes are named WRK_CUBE(n)_ C(l), where (n) is the table number and (l) is the level number. Single-digit table numbers are prefixed with “0”.

For example cube WRK_CUBE01_C22 contains data for first WORKPLACE table (A202101) at State level. This cube is used to display “A202101 - Age of Worker (10) (Workers 16 years and over)” Perspective report (at State level) shown below.

Database CTPP_2012_2016_Cubes_2B contains WORKPLACE cubes for tables 26-47 for each geographic level (C21 to C33), so total of 242 cubes. The cubes use the same naming convention as tables in database CTPP_2012_2016_Cubes_2A.

Page 12: Data Structure for CTPP 2012-2016 Data Set€¦  · Web viewThe ETL process creates CTPP_2012_2016_DW database containing Fact Tables, Dimension Tables and a few other useful tables

Database CTPP_2012_2016_Cubes_3A contains FLOW cubes for all 24 FLOW tables for each geographic level (C42 to C60), total of 408 cubes. The cubes are named FLOW_CUBE(n)_ C(l), where (n) is the table number and (l) is the level number. Single-digit table numbers are prefixed with “0”.

For example cube FLOW_CUBE01_C42 contains data for first FLOW table (A302103) at State -> State flow level. This cube is used to display “A302103 - Means of transportation (18) (Workers 16 years and over)” Perspective report (at State -> State flow level) shown below.