Skip to content

2c. SQL Aggregations

Lionel Mann edited this page Apr 12, 2018 · 1 revision
  • COUNT - Counts how many rows are in a particular column

  • SUM - Adds all the values in a particular column

  • MIN and MAX - Returns the lowest and highest values in a column

  • AVG - Calculates the average of all the values in a particular column

These functions operate down columns NOT across rows.


NULLs

NULLs are a datatype that specifies where no data exists in SQL. They are often ignored in our aggregation functions.

It's different from a zero or space. Nulls could exist because of poorly logged data or based on the data design itself.

There are two common ways in which you are likely to encounter NULLs:

  1. NULLs frequently occur when performing a LEFT or RIGHT JOIN.

  2. NULLs can also occur from simply missing data in our database.

When identifying NULLs in a WHERE clause, we write IS NULL or IS NOT NULL. We don't use =, because NULL isn't considered a value in SQL. Rather, it is a property of the data.

Example

SELECT *
FROM accounts
WHERE primary_pos IS NULL

COUNT

COUNT counts the number of rows in a table.

Example

SELECT COUNT(*)
FROM accounts;

But we could have just as easily chosen a column to drop into the aggregation function:

SELECT COUNT(accounts.id) AS "Account ID"
FROM accounts;

Notice that COUNT does not consider rows that have NULL values. Therefore, this can be useful for quickly identifying which rows have missing data.

You can find the number of rows that have NULL.

Example

SELECT *
FROM accounts
WHERE primary_pos IS NULL

SUM

Unlike COUNT, you can only use SUM on numeric columns. However, SUM will ignore NULL values.

NOTE: you cannot use SUM() like you can with COUNT(). SUM will treat NULLs as 0

Example

SELECT SUM(standard_qty) AS standard,
       SUM(gloss_qty) AS gloss,
       SUM(poster_qty) AS poster
FROM orders

The result will be 1 row with 3 columns with the SUM of each specified column.

Examples:

Find the total amount of poster_qty paper ordered in the orders table.

SELECT SUM(poster_qty)
FROM orders

Find the total amount spent on standard_amt_usd and gloss_amt_usd paper for each order in the orders table. This should give a dollar amount for each order in the table.

SELECT standard_amt_usd + gloss_amt_usd AS total_standard_gloss
FROM orders;

Find the standard_amt_usd per unit of standard_qty paper.

SELECT SUM(standard_amt_usd)/SUM(standard_qty) AS standard_price_per_unit
FROM orders;

MIN and MAX

MIN and MAX are aggregators that again ignore NULL values. Functionally, MIN and MAX are similar to COUNT in that they can be used on non-numerical columns. Depending on the column type, MIN will return the lowest number, earliest date, or non-numerical value as early in the alphabet as possible. As you might suspect, MAX does the opposite—it returns the highest number, the latest date, or the non-numerical value closest alphabetically to "Z."

SELECT MIN(poster_qty),
       MAX(poster_qty)
FROM orders

AVG

AVG returns the mean of the data - that is the sum of all of the values in the column divided by the number of values in a column. This aggregate function again ignores the NULL values in both the numerator and the denominator.


GROUP BY

  • GROUP BY can be used to aggregate data within subsets of the data. For example, grouping for different accounts, different regions, or different sales representatives.

  • Any column in the SELECT statement that is not within an aggregator must be in the GROUP BY clause.

  • The GROUP BY always goes between WHERE and ORDER BY.

Examples

Which account (by name) placed the earliest order?

SELECT name, occurred_at
FROM accounts
JOIN orders
ON accounts.id = orders.account_id
ORDER BY occurred_at
LIMIT 1;

Find the total sales in usd for each account. You should include two columns - the total sales for each company's orders in usd and the company name.

SELECT a.name, SUM(total_amt_usd) total_sales
FROM orders o
JOIN accounts a
ON a.id = o.account_id
GROUP BY a.name;

You can GROUP BY multiple columns at once. It is often used to aggregate across a number of different segments.

The order of columns listed in the ORDER BY clause does make a difference. You are ordering the columns from left to right.

Example

SELECT account_id, channel, COUNT(id) AS events
FROM web_events
GROUP BY account_id, channel
ORDER BY acount_id, events DESC

The results of the above example would order first by account_id and then by events within the account_id

Determine the number of times a particular channel was used in the web_events table for each region. Your final table should have three columns - the region name, the channel, and the number of occurrences. Order your table with the highest number of occurrences first.

SELECT r.name, w.channel, COUNT(*) num_events
FROM accounts a
JOIN web_events w
ON a.id = w.account_id
JOIN sales_reps s
ON s.id = a.sales_rep_id
JOIN region r
ON r.id = s.region_id
GROUP BY r.name, w.channel
ORDER BY num_events DESC;

DISTINCT

DISTINCT is always used in SELECT statements, and it provides the unique rows for all columns written in the SELECT statement. Therefore, you only use DISTINCT once in any particular SELECT statement.

You can think of DISTINCT this way. If you want to group by some columns but you don't want to include any aggregations, you can use DISTINCT instead.

SELECT DISTINCT column1, column2, column3
FROM table1;

Example

Use DISTINCT to test if there are any accounts associated with more than one region.

The below two queries have the same number of resulting rows (351), so we know that every account is associated with only one region. If each account was associated with more than one region, the first query should have returned more rows than the second query.

SELECT a.id as "account id", r.id as "region id", 
a.name as "account name", r.name as "region name"
FROM accounts a
JOIN sales_reps s
ON s.id = a.sales_rep_id
JOIN region r
ON r.id = s.region_id;

and

SELECT DISTINCT id, name
FROM accounts;

HAVING

HAVING is the 'clean' way to filter a query that has been aggregated, but this is also commonly done using a subquery. Essentially, any time you want to perform a WHERE on an element of your query that was created by an aggregate, you need to use HAVING instead.

The WHERE clause does not allow you to filter on aggregate columns. This will not work.

SELECT account.id, SUM(total_amt_usd) AS sum_total_amt_usd
FROM orders
WHERE SUM(total_amt_usd) >= 25000
GROUP BY 1
ORDER BY 2 DESC

Instead use HAVING

SELECT account.id, SUM(total_amt_usd) AS sum_total_amt_usd
FROM orders
GROUP BY 1
HAVING SUM(total_amt_usd) >= 25000

This filters the query down to just accounts with more or equal to $25,000 in sales

Example

How many of the sales reps have more than 5 accounts that they manage?

SELECT s.id, s.name, COUNT(*) num_accounts
FROM accounts a
JOIN sales_reps s
ON s.id = a.sales_rep_id
GROUP BY s.id
HAVING COUNT(*) > 5
ORDER BY num_accounts;

How many accounts spent more than 30,000 usd total across all orders?

SELECT a.name, SUM(o.total_amt_usd) total_spent
FROM accounts a
JOIN orders o
ON a.id = o.account_id
GROUP BY a.name
HAVING SUM(o.total_amt_usd) > 30000
ORDER BY total_spent;

Which accounts used facebook as a channel to contact customers more than 6 times?

SELECT a.name, w.channel, COUNT(*) num_channel
FROM accounts a
JOIN web_events w
ON a.id = w.account_id
GROUP BY a.name, w.channel
HAVING COUNT(*) > 6 AND w.channel = 'facebook'
ORDER BY num_channel;

DATE

GROUPing BY a date column is not usually very useful in SQL, as these columns tend to have transaction data down to a second. Keeping date information at such a granular data is both a blessing and a curse, as it gives really precise information (a blessing), but it makes grouping information together directly difficult (a curse).

Lucky for us, there are a number of built-in SQL functions that are aimed at helping us improve our experience in working with dates.

Suppose we have these dates

2017-04-01 3:56:02
2017-04-01 5:01:23
2017-04-01 12:05:42
2017-04-01 22:10:10

In order to group by day, we will need to adjust all the times on 2017-04-01 to read:

2017-04-01 00:00:00

That way, when we group by date, we get every event that occurred for all hours, minutes and seconds of April 1st. They will all be grouped together.

We can do this using the DATE_TRUNC.

DATE_TRUNC

DATE_TRUNC allows you to truncate your date to a particular part of your date-time column. Common trunctions are day, month, and year.

Example:

SELECT DATE_TRUNC('day', occurred_at) AS day, SUM(standard_qty) AS standard_qty_sum
FROM orders
GROUP BY DATE_TRUNC('day' occurred_at)
ORDER BY DATE_TRUNC('day' occurred_at)

NOTE: It's important to group by the same metric as in the SELECT statement to ensure your results are consistent.

DATE_PART

DATE_PART can be useful for pulling a specific portion of a date.

NOTE: You can reference the columns in your select statement in GROUP BY and ORDER BY clauses with numbers that follow the order they appear in the select statement.

SELECT standard_qty, COUNT(*)
FROM orders
GROUP BY 1
ORDER BY 1

Examples

Find the sales in terms of total dollars for all orders in each year, ordered from greatest to least. Do you notice any trends in the yearly sales totals?

SELECT DATE_TRUNC('year', occurred_at) AS year, SUM(total_amt_usd) AS Total
FROM orders
GROUP BY 1
ORDER BY 2 DESC;

When we look at the yearly totals, you might notice that 2013 and 2017 have much smaller totals than all other years. If we look further at the monthly data, we see that for 2013 and 2017 there is only one month of sales for each of these years (12 for 2013 and 1 for 2017). Therefore, neither of these are evenly represented. Sales have been increasing year over year, with 2016 being the largest sales to date. At this rate, we might expect 2017 to have the largest sales.

Which month did Parch & Posey have the greatest sales in terms of total dollars? Are all months evenly represented by the dataset? In order for this to be 'fair', we should remove the sales from 2013 and 2017. For the same reasons as discussed above.

SELECT DATE_PART('month', occurred_at) ord_month, SUM(total_amt_usd) total_spent
FROM orders
WHERE occurred_at BETWEEN '2014-01-01' AND '2017-01-01'
GROUP BY 1
ORDER BY 2 DESC;

Which year did Parch & Posey have the greatest sales in terms of total number of orders? Are all years evenly represented by the dataset?

SELECT DATE_TRUNC('year', occurred_at) AS year,
COUNT(*) AS Total
FROM orders
GROUP BY 1
ORDER BY 2 DESC;

Which month did Parch & Posey have the greatest sales in terms of total number of orders? Are all months evenly represented by the dataset?

SELECT DATE_PART('month', occurred_at) ord_month, COUNT(*) total_sales
FROM orders
WHERE occurred_at BETWEEN '2014-01-01' AND '2017-01-01'
GROUP BY 1
ORDER BY 2 DESC;

December still has the most sales, but interestingly, November has the second most sales (but not the most dollar sales. To make a fair comparison from one month to another 2017 and 2013 data were removed.


CASE

In order to use CASE we need to make a derived column which takes data from existing columns and modifies. them. We can also do this using arithmetic. The CASE statement is SQL way of handling "IF" "THEN" logic.

  • The CASE statement always goes in the SELECT clause.

  • CASE must include the following components: WHEN, THEN, and END. ELSE is an optional component to catch cases that didn’t meet any of the other previous CASE conditions.

  • You can make any conditional statement using any conditional operator (like WHERE) between WHEN and THEN. This includes stringing together multiple conditional statements using AND and OR.

  • You can include multiple WHEN statements, as well as an ELSE statement again, to deal with any unaddressed conditions.

Example

SELECT account_id, occurred_at, total,
   CASE WHEN total > 500 THEN 'Over 500'
        WHEN total > 300 AND total <= 500 THEN '301 - 500'
        WHEN total > 100 AND total <= 300 THEN '101 - 300'
        ELSE '100 and under' END AS total_group
FROM orders

Create a column that divides the standard_amt_usd by the standard_qty to find the unit price for standard paper for each order. Limit the results to the first 10 orders, and include the id and account_id fields.

NOTE - you will be thrown an error with the correct solution to this question. This is for a division by zero

Let's see how we can use the CASE statement to get around this error.

SELECT id, account_id, standard_amt_usd/standard_qty AS unit_price
FROM orders
LIMIT 10;

Now, let's use a CASE statement. This way any time the standard_qty is zero, we will return 0, and otherwise, we will return the unit_price.

SELECT account_id, 
   CASE WHEN standard_qty = 0 OR standard_qty IS NULL THEN 0
        ELSE standard_amt_usd/standard_qty END AS unit_price
FROM orders
LIMIT 10;

CASE and Aggregations

For example, grouping orders and then counting orders up in each group is useful.

Example

SELECT CASE WHEN total > 500 THEN 'Over 500'
            ELSE '500 or under' END AS total_group,
       COUNT(*) AS order_count
FROM orders
GROUP BY 1

Examples

We would like to understand 3 different levels of customers based on the amount associated with their purchases. The top branch includes anyone with a Lifetime Value (total sales of all orders) greater than 200,000 usd. The second branch is between 200,000 and 100,000 usd. The lowest branch is anyone under 100,000 usd. Provide a table that includes the level associated with each account. You should provide the account name, the total sales of all orders for the customer, and the level. Order with the top spending customers listed first.

SELECT a.name, SUM(total_amt_usd) total, 
     CASE WHEN SUM(total_amt_usd) > 200000 THEN 'top'
     WHEN  SUM(total_amt_usd) > 100000 THEN 'middle'
     ELSE 'low' END AS customer_level
FROM orders o
JOIN accounts a
ON o.account_id = a.id 
GROUP BY 1
ORDER BY 2 DESC;

We would now like to perform a similar calculation to the first, but we want to obtain the total amount spent by customers only in 2016 and 2017. Keep the same levels as in the previous question. Order with the top spending customers listed first.

SELECT a.name, SUM(total_amt_usd) total_spent, 
     CASE WHEN SUM(total_amt_usd) > 200000 THEN 'top'
     WHEN  SUM(total_amt_usd) > 100000 THEN 'middle'
     ELSE 'low' END AS customer_level
FROM orders o
JOIN accounts a
ON o.account_id = a.id
WHERE occurred_at > '2015-12-31' 
GROUP BY 1
ORDER BY 2 DESC;

Clone this wiki locally