s3.wp.wsu.edu€¦ · web viewhere all we are doing is creating a new column and putting a text...

12
Useful SQL Functions to Format and Categorize Data: DATEPART(), DATENAME(), CASE(), RANK(), TOP N() Note: DATEPART() and DATENAME() are examined here again to reinforce a prior module. They are added here for convenience. Analysts spend a lot of time analyzing business transactions by different date and time categories. Who knew it was so easy to make new columns that can be used as slicers in a reporting software? Take a look at this next table, it is the result of running the SQL query below. This document also introduces several other very useful SQL functions. USE [AdventureWorksDW2012]; SELECT [SalesOrderNumber], OrderDate , FORMAT([OrderDate], 'd', 'en-US' ) as [Order Date] , DATEPART(DAY, OrderDate) AS [Day of Month] , DATEPART(WEEKDAY, OrderDate) AS [Day of Week] , DATENAME(WEEKDAY, OrderDate) AS [Day] , DATEPART(DAYOFYEAR, OrderDate) AS [Day of Year] , DATEPART(MONTH, OrderDate) AS [Month#] , DATENAME(MONTH, OrderDate) AS [Month] , DATEPART(WEEK, OrderDate) AS [Week # of Year] , CONCAT('Q' , DATEPART(quarter, OrderDate)) AS [Qtr] , DATEPART(year, OrderDate) AS [Year] --the above functions are available in SQL Server 2012 and newer. The following functions are older and also in Excel , MONTH(orderdate) AS [Month Num] , DAY(orderdate) AS [Day Num] , YEAR(orderdate) AS [Year Num] The DATEPART() function is simple to use yet powerful and useful. The DATEPART() and related DATENAME() functions are used here to create new columns of data that are later used to pivot (aggregate) or slice the data, such as retrieving data grouped by week or month. For example a bar chart used to explore whether consumers buy different colored bicycles depending on day of week, or month of year is not possible without first creating the day and month dimensions used to categorize the data. DATENAME() is another helpful SQL function that can pull out the English day of the week or the month of the year (ie Monday or January, etc.) out of a date field. Notice the usage of CONCAT to show a proper Fiscal Quarter. This is concatenation function to merge textual columns. 1

Upload: others

Post on 10-Oct-2019

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: s3.wp.wsu.edu€¦ · Web viewHere all we are doing is creating a new column and putting a text value into it depending on what category a value falls within. We are putting the rows

Useful SQL Functions to Format and Categorize Data: DATEPART(), DATENAME(), CASE(), RANK(), TOP N()Note: DATEPART() and DATENAME() are examined here again to reinforce a prior module. They are added here for convenience. Analysts spend a lot of time analyzing business transactions by different date and time categories. Who knew it was so easy to make new columns that can be used as slicers in a reporting software? Take a look at this next table, it is the result of running the SQL query below.

This document also introduces several other very useful SQL functions.

USE [AdventureWorksDW2012];SELECT [SalesOrderNumber], OrderDate, FORMAT([OrderDate], 'd', 'en-US' ) as [Order Date], DATEPART(DAY, OrderDate) AS [Day of Month], DATEPART(WEEKDAY, OrderDate) AS [Day of Week], DATENAME(WEEKDAY, OrderDate) AS [Day], DATEPART(DAYOFYEAR, OrderDate) AS [Day of Year]

, DATEPART(MONTH, OrderDate) AS [Month#], DATENAME(MONTH, OrderDate) AS [Month]

, DATEPART(WEEK, OrderDate) AS [Week # of Year], CONCAT('Q' , DATEPART(quarter, OrderDate)) AS [Qtr], DATEPART(year, OrderDate) AS [Year]

--the above functions are available in SQL Server 2012 and newer. The following functions are older and also in Excel, MONTH(orderdate) AS [Month Num], DAY(orderdate) AS [Day Num], YEAR(orderdate) AS [Year Num]

-- this next formula is needed to be able to daw line charts over the years of your dataset, (YEAR(orderdate) *100) + MONTH(orderdate) AS [Year-Month]

FROM [dbo].[FactInternetSales]

The DATEPART() function is simple to use yet powerful and useful. The DATEPART() and related DATENAME() functions are used here to create new columns of data that are later used to pivot (aggregate) or slice the data, such as retrieving data grouped by week or month. For example a bar chart used to explore whether consumers buy different colored bicycles depending on day of week, or month of year is not possible without first creating the day and month dimensions used to categorize the data.

DATENAME() is another helpful SQL function that can pull out the English day of the week or the month of the year (ie Monday or January, etc.) out of a date field.

Notice the usage of CONCAT to show a proper Fiscal Quarter. This is concatenation function to merge textual columns.

The MONTH(), DAY(), YEAR() functions tried and tested and keepers, especially because they work in EXCEL and in pre-2012 versions of SQL

The last formula that creates the [YEAR-Month] column is the way to show all the months for a dataset. More information at http://msdn.microsoft.com/en-us/library/ms174420.aspx

1

Page 2: s3.wp.wsu.edu€¦ · Web viewHere all we are doing is creating a new column and putting a text value into it depending on what category a value falls within. We are putting the rows

I hope you can envision (better yet practice!) the usefulness of DATEPART() and DATENAME() to provide slicers and categories for charts. Here we introduce another useful technique to categorize data for data visualization the mighty CASE statement! Any case statement is used to perform different functionality depending on a test condition. While an IF/THEN statement uses an either/or logic– CASE statements use a, “it depends” logic to perform different outcomes.

Here all we are doing is creating a new column and putting a text value into it depending on what category a value falls within. We are putting the rows into one category to be able to examine them further. The text value can be used as a slicer or category in a bar chart.

The CASE statement below creates a new column named [Cost Category] and places one of several values into it ($, $$, $$$, $$$$). Notice the last column in the figure below and the last line of the CASE statement AS [new column name]. In this instance the AS statement gives the column name. The purpose of this type of CASE statement is to categorize data. It is also possible to put records into groups such as quartiles, here we write the logic to do the categorization. So we are just adding a new column with a word in it that indicates the category the record belongs to. This new column can then be used as a slicer.

2

Page 3: s3.wp.wsu.edu€¦ · Web viewHere all we are doing is creating a new column and putting a text value into it depending on what category a value falls within. We are putting the rows

USE [Featherman_Analytics];SELECT [Category], [Sub Category], [Model], [Product], [Cost], ([Cost] * 1.25) as [25% Markup], ([Cost] * 1.5) as [50% Markup], [Dealer Price]

, CASE WHEN [COST] IS NULL THEN ' 'WHEN [Cost] BETWEEN 0 AND 99.99 THEN '$'WHEN [Cost] BETWEEN 100 AND 299.99 THEN '$$'WHEN [Cost] BETWEEN 300 AND 499.99 THEN '$$$'WHEN [Cost] >= 500 THEN '$$$$'ENDAS [Cost Category]

FROM [dbo].[AW_Products_Flattened]--WHERE [Dealer Price] < ([Cost] * 1.25)

We start with a simple example of adding more columns that calculate the sales price if a 25% or 50% markup was applied. This is used to identify the products selling at lower than a 25% markup (which is probably just break-even). If you copy this SQL query, and then remove the hyphens on the last line of code (to include the WHERE clause) then you should retrieve a list of the products that are probably being sold to bike shops at too low a price.

Another data formatting task that analysts perform repeatedly is to put records into textual categories -- i.e., such as low, medium, high levels of a dimension --or perhaps to examine sales records and segment consumers into categories (i.e., infrequent customer, regular customer, and VIP customer). We can also perform this functionality in DAX using a SWITCH statement, here is the TSQL equivalent. The approach is similar, create a new column and place the textual group name in that new column based a categorization scheme such as ranges. The example here puts a value into each row (either the symbol $, $$, $$$, or $$$$) for a new column based on ranges of values of the cost of a product (<100, >=100 & <300, >=300 & <500, and >=500.

The CASE syntax starts with the term CASE and ends with the term END. The AS term is used to supply a column name. While there are many uses for a CASE statement such as to apply different discounting pattern, based on different levels of another variable (ie # the more days a product has been in stock on the shelf, the higher the level of discounting); the most common use of CASE statements is to create categories as shown here.

3

Updates from WSU alumni indicate that analysts write a lot of CASE statements to add columns of descriptors (to put rows into categories) and to create slicers.

Page 4: s3.wp.wsu.edu€¦ · Web viewHere all we are doing is creating a new column and putting a text value into it depending on what category a value falls within. We are putting the rows

USE [Featherman_Analytics];SELECT [Category], [Sub Category], [Model], [Product], [Cost], ([Cost] * 1.25) as [25% Markup], ([Cost] * 1.5) as [50% Markup], [Dealer Price]

, CASE WHEN [Dealer Price] < [COST] THEN 'FUBAR'WHEN [Dealer Price] < ([Cost] * 1.25) THEN 'small profit'WHEN [Dealer Price] > ([Cost] * 1.25) THEN 'good profit'ENDAS [Cost Category]

FROM [dbo].[AW_Products_Flattened]

This example should drive home the point about the usefulness of the technique. Each line in the WHEN statement can provide a different probe or test of the data and provide a different result in the column. Here we are still using categorical labels, we could also perform different calculations based on a test condition.

How does the functionality work? The values in each row are evaluated. The first When condition that equates to true will stop the evaluation of that row and the functionality after the THEN statement is performed (here different words are written into the column. The terms are used later as filters in a slicer or on the X-axis of a column chart.

Let’s not forget that the SQL engine is examining every row of the dataset and providing the cost category.

Go ahead and run this query it’s fun.

4

Page 5: s3.wp.wsu.edu€¦ · Web viewHere all we are doing is creating a new column and putting a text value into it depending on what category a value falls within. We are putting the rows

USE [AdventureWorksDW2012];SELECT Top 100 [EnglishCountryRegionName], [StateProvinceName],[City], [ResellerName], sum([OrderQuantity]) as [Total], CASE WHEN sum([OrderQuantity]) IS NULL THEN ' 'WHEN sum([OrderQuantity]) BETWEEN 0 AND 99 THEN 'New Store'WHEN sum([OrderQuantity]) BETWEEN 100 AND 299 THEN 'Emerging Store'WHEN sum([OrderQuantity]) BETWEEN 300 AND 799 THEN 'Good Store'WHEN sum([OrderQuantity]) BETWEEN 800 AND 2499 THEN 'Frequent Flyer'WHEN sum([OrderQuantity]) >= 2500 THEN 'Elite Partner'ENDAS [Cost Category] - - this should be [Store Category],RANK() OVER (ORDER BY sum([OrderQuantity]) DESC) AS Ranking

FROM [dbo].[DimReseller] as rINNER JOIN [dbo].[FactResellerSales] as s ON s.[ResellerKey] = r.[ResellerKey] INNER JOIN [dbo].[DimGeography] as g ON g.[GeographyKey] = r.[GeographyKey]

GROUP BY [EnglishCountryRegionName], [StateProvinceName],[City],[ResellerName]

Here are two small improvements. We can cut the number of rows to the Top 100 (or any #). You can also use Top 10 Percent. We also rank the output (here ranking the stores in how many units they purchased).

Here we calculate the # units sold to each reseller, then we use the CASE statement to create a new column that has a word in it.

The powerful RANK() function organizes the rows based on rank. The ranking is provided and the data is sorted on the ranking column. No need to belabor the syntax but the ranking is being done OVER the entire dataset. We have to specify what to perform the ranking using. Hmm what if the ranking was performed using average sale?

Note the BETWEEN syntax is very useful and works whereas the > or < code can only be used at the top end and bottom end.

5

Page 6: s3.wp.wsu.edu€¦ · Web viewHere all we are doing is creating a new column and putting a text value into it depending on what category a value falls within. We are putting the rows

As you can see the SQL done well, provides the data in the right format for data visualization. Here maps in Power BI are used to visualize the data. The map can also drill down from state to city. The slicer on Country is especially helpful to clean up the presentation (show only certain parts of the globe at one time). Notice the use of a hierarchy in the fields list, which enables the drill down. Run the query and slice on store category (the result of the CASE field).

One more thing…

6

Page 7: s3.wp.wsu.edu€¦ · Web viewHere all we are doing is creating a new column and putting a text value into it depending on what category a value falls within. We are putting the rows

Actually CASE() statements are very powerful and can be used in 100 different ways. Did you notice however that you have to specify the ranges (ie BETWEEN 200 and 299.99. Specifying the numeric ranges gives you 100% control, but sometimes you just want to put records into groups such as terciles or quartiles based on some value.

You can use NTILE() to quickly break your records into groups of roughly equal size. For example an NTILE(5) would create five ‘bins’ each of which would hold roughly 20% of the records. By default a new column is made with a number in it to signify which group (ie 1 = lowest, 5 = highest). . It is very easy for your reporting software to read the number in the cell and render it as an indicant (green, yellow, red stop light graphic for example).You can also place categorizing words in each row as shown in the following example.

, CASE NTILE(4) OVER (PARTITION BY ([EnglishProductSubcategoryName])ORDER BY SUM(SalesAmount) DESC)WHEN 1 THEN 'Gangbusters'WHEN 2 THEN 'Great Month'WHEN 3 THEN 'Good Month'WHEN 4 THEN 'Slow Month'END AS [Sales Rating]

The NTILE() places each row into a group based on the ORDER BY statement. If you place the NTILE inside a CASE statement then you can place words inside the column rather than numbers. While numbers are useful for drawing charts and stoplight KPI’s, the words are useful in a slicer or bar chart.

The CASE() statement then amplifies the usefulness of the NTILE() command. In this example the sales are totaled and categorized inside each Sub-Category rather than over the entire dataset, so you can see the

best sales months for Road Bikes for example. Leave out the partition to see the best months for the entire business.

7

Page 8: s3.wp.wsu.edu€¦ · Web viewHere all we are doing is creating a new column and putting a text value into it depending on what category a value falls within. We are putting the rows

8

Hope this content is useful to you either now or later.

Featherman