Skip to content

2a. SQL Basics

Lionel Mann edited this page Apr 12, 2018 · 1 revision

SQL (Structured Query Language)

Entity Relationship Diagrams

An entity relationship diagram (ERD) is a common way to view data in a database. Below is the ERD for the database we will use from Parch & Posey. The data analyzed includes:

  • The names of the tables.
  • The columns in each table.
  • The way the tables work together.

What to Notice

In the Parch & Posey database there are five tables (essentially 5 spreadsheets):

  • web_events
  • accounts
  • orders
  • sales_reps
  • region

You can think of each of these tables as an individual spreadsheet. Then the columns in each spreadsheet are listed below the table name. For example, the region table has two columns: id and name. The web_events table has four columns.

The "crow's foot" that connects the tables together shows us how the columns in one table relate to the columns in another table.

You will notice some of the columns in the tables have PK or FK next to the column name, while other columns don't have a label at all.

If you look a little closer, you might notice that the PK is associated with the first column in every table. The PK here stands for primary key. A primary key exists in every table, and it is a column that has a unique value for every row.

Keys

Primary Key (PK)

A primary key is a unique column in a particular table. This is the first column in each of our tables. Here, those columns are all called id, but that doesn't necessarily have to be the name. It is common that the primary key is the first column in our tables in most databases.

Foreign Key (FK)

A foreign key is when we see a primary key in another table. We can see these in the previous ERD the foreign keys are provided as:

  • region_id
  • account_id
  • sales_rep_id

Each of these is linked to the primary key of another table.

Primary - Foreign Key Link

In the above image you can see that:

  1. The region_id is the foreign key.
  2. The region_id is linked to id - this is the primary-foreign key link that connects these two tables.
  3. The crow's foot shows that the FK can actually appear in many rows in the sales_reps table.
  4. While the single line is telling us that the PK shows that id appears only once per row in this table.

Why Do Data Analysts Use SQL?

There are some major advantages to using traditional relational databases, which we interact with using SQL. The five most apparent are:

  • SQL is easy to understand.
  • Traditional databases allow us to access data directly.
  • Traditional databases allow us to audit and replicate our data.
  • SQL is a great tool for analyzing multiple tables at once.
  • SQL allows you to analyze more complex questions than dashboard tools like Google Analytics.

Type of SQL Statements

The key to SQL is understanding statements. A few statements include:

  • CREATE TABLE is a statement that creates a new table in a database.
  • DROP TABLE is a statement that removes a table in a database.
  • SELECT allows you to read data and display it. This is called a query.

Every SQL Query

Every query will have at least a SELECT and FROM statement. The SELECT statement is where you put the columns for which you would like to show the data. The FROM statement is where you put the tables from which you would like to pull data.

You can SELECT everything from a table using * or SELECT specific tables using comma separated values such as:

SELECT id, account_id, occurred_at 
FROM orders;

LIMIT Clause

The LIMIT statement is useful when you want to see just the first few rows of a table. This can be much faster for loading than if we load the entire dataset.

SELECT *
FROM orders
LIMIT 10;

or

SELECT occurred_at, account_id, channel 
FROM web_events 
LIMIT 15;

ORDER BY Clause

The ORDER BY statement allows us to order our table by any row.

The ORDER BY statement is always after the SELECT and FROM statements, but it is before the LIMIT statement. If we are using the LIMIT statement, it will always appear last.

DESC can be added after the column in your ORDER BY statement to sort in descending order, as the default is to sort in ascending order.

Example

Write a query to return the 10 earliest orders in the orders table. Include the id, occurred_at, and total_amt_usd.

SELECT id, occurred_at, total_amt_usd 
FROM orders 
ORDER BY occurred_at DESC 
LIMIT 10;

We can also ORDER BY more than one column at a time. The statement sorts according to columns listed from left first and to the right after that.

Example

Write a query that returns the top 5 rows from orders, ordered according to newest to oldest, but with the largest total_amt_usd for each date listed first for each date.

SELECT *
FROM orders
ORDER BY occurred_at DESC, total_amt_usd DESC
LIMIT 5;

WHERE

Using the WHERE statement, we can subset out tables based on conditions that must be met. The WHERE clause goes after FROM but before ORDER BY and LIMIT.

Common symbols used within WHERE statements include:

  • > (greater than)

  • < (less than)

  • >= (greater than or equal to)

  • <= (less than or equal to)

  • = (equal to)

  • != (not equal to)

SELECT *
FROM orders
WHERE account_id = 4251
ORDER BY occurred_at
LIMIT 5

Example

Pull the first 5 rows and all columns from the orders table that have a dollar amount of gloss_amt_usd greater than or equal to 1000.

SELECT *
FROM orders
WHERE gloss_amt_usd >= 1000
LIMIT 5

The WHERE statement can also be used with non-numerical data. We can use the = and != operators here. You also need to be sure to use single quotes (just be careful if you have quotes in the original text) with the text data.

Commonly when we are using WHERE with non-numeric data fields, we use the LIKE, NOT, or IN operators.

Example

Filter the accounts table to include the company name, website, and the primary point of contact (primary_poc) for Exxon Mobil in the accounts table.

SELECT name, website, primary_poc
FROM accounts
WHERE name = 'Exxon Mobil'

Derived Columns

Creating a new column that is a combination of existing columns is known as a derived column.

Common operators include:

  1. * (Multiplication)

  2. + (Addition)

  3. - (Subtraction)

  4. / (Division)

Order of Operations

The same order of operations applies when using arithmetic operators in SQL.

The following two statements have very different end results:

Standard_qty / standard_qty + gloss_qty + poster_qty
standard_qty / (standard_qty + gloss_qty + poster_qty)

Example

Create a column that divides the standard_amt_usd by the standard_qty to find the unit price for standard paper for each order. Limit the results to the first 10 orders, and include the id and account_id fields.

SELECT id, account_id, standard_amt_usd/standard_qty AS unit_price
FROM orders
LIMIT 10;

Write a query that finds the percentage of revenue that comes from poster paper for each order. You will need to use only the columns that end with _usd.

SELECT id, account_id, poster_amt_usd/(standard_amt_usd + gloss_amt_usd + poster_amt_usd) 
AS post_per
FROM orders;

Logical Operators

LIKE

This allows you to perform operations similar to using WHERE and =, but for cases when you might not know exactly what you are looking for.

The LIKE operator is extremely useful for working with text. You will use LIKE within a WHERE clause. The LIKE operator is frequently used with %. The % tells us that we might want any number of characters leading up to a particular set of characters or following a certain set of characters. Remember you will need to use single quotes for the text you pass to the LIKE operator, because of this lower and uppercase letters are not the same within the string. Searching for 'T' is not the same as searching for 't'.

Example

Use the accounts table to find:

All the companies whose names start with 'C'.

SELECT *
FROM accounts
WHERE name LIKE 'C%'

All companies whose names contain the string 'one' somewhere in the name.

SELECT *
FROM accounts
WHERE name LIKE '%one%'

All companies whose names end with 's'.

SELECT *
FROM accounts
WHERE name LIKE '%s'

IN

This allows you to perform operations similar to using WHERE and =, but for more than one condition.

The IN operator is useful for working with both numeric and text columns. This operator allows you to use a =, but for more than one item of that particular column. We can check one, two or many column values for which we want to pull data, but all within the same query.

Examples

Use the accounts table to find the account name, primary_poc, and sales_rep_id for Walmart, Target, and Nordstrom.

SELECT name, primary_poc, sales_rep_id
FROM accounts
WHERE name IN ('Walmart','Target','Nordstrom')

Use the web_events table to find all information regarding individuals who were contacted via the channel of organic or AdWords.

SELECT *
FROM web_events
WHERE channel IN ('organic','adwords')

NOT

This is used with IN and LIKE to select all of the rows NOT LIKE or NOT IN a certain condition.

The NOT operator is an extremely useful operator for working with the previous two operators we introduced: IN and LIKE. By specifying NOT LIKE or NOT IN, we can grab all of the rows that do not meet particular criteria.

Examples

Use the accounts table to find the account name, primary poc, and sales rep id for all stores except Walmart, Target, and Nordstrom.

SELECT name, primary_poc, sales_rep_id
FROM accounts
WHERE name NOT IN ('Walmart','Target','Nordstrom')

All the companies whose names do not start with 'C'.

SELECT *
FROM accounts
WHERE name NOT LIKE 'C%'

AND & BETWEEN

These allow you to combine operations where all combined conditions must be true.

The AND operator is used within a WHERE statement to consider more than one logical clause at a time. Each time you link a new statement with an AND, you will need to specify the column you are interested in looking at. You may link as many statements as you would like to consider at the same time. This operator works with all of the operations we have seen so far including arithmetic operators (+, *, -, /). LIKE, IN, and NOT logic can also be linked together using the AND operator.

Sometimes we can make a cleaner statement using BETWEEN than we using AND. Particularly this is true when we are using the same column for different parts of our AND statement.

Instead of writing:

WHERE column >= 6 AND column <= 10

we can instead write, equivalently:

WHERE column BETWEEN 6 AND 10

Examples

Write a query that returns all the orders where the standard_qty is over 1000, the poster_qty is 0, and the gloss_qty is 0.

SELECT *
FROM orders
WHERE standard_qty > 1000 AND poster_qty = 0 AND gloss_qty = 0

Using the accounts table find all the companies whose names do not start with 'C' and end with 's'.

SELECT *
FROM accounts
WHERE name NOT LIKE 'C%' AND name LIKE '%s'

Use the web_events table to find all information regarding individuals who were contacted via organic or adwords and started their account at any point in 2016 sorted from newest to oldest.

SELECT *
FROM web_events
WHERE channel IN ('organic', 'adwords') 
    AND occurred_at 
    BETWEEN '2016-01-01' AND '2016-12-31'
ORDER BY occurred_at DESC

OR

This allows you to combine operations where at least one of the combined conditions must be true.

Similar to the AND operator, the OR operator can combine multiple statements. Each time you link a new statement with an OR, you will need to specify the column you are interested in looking at. You may link as many statements as you would like to consider at the same time. This operator works with all of the operations we have seen so far including arithmetic operators (+, *, -, /), LIKE, IN, NOT, AND, and BETWEEN logic can all be linked together using the OR operator.

When combining multiple of these operations, we frequently might need to use parentheses to assure that logic we want to perform is being executed correctly.

Examples

Find list of orders ids where either gloss_qty or poster_qty is greater than 4000. Only include the id field in the resulting table.

SELECT id
FROM orders
WHERE gloss_qty > 4000 OR poster_qty > 4000

Write a query that returns a list of orders where the standard_qty is zero and either the gloss_qty or poster_qty is over 1000.

SELECT *
FROM orders
WHERE standard_qty = 0 AND (gloss_qty > 1000 OR poster_qty > 1000)

Find all the company names that start with a 'C' or 'W', and the primary contact contains 'ana' or 'Ana', but it doesn't contain 'eana'.

SELECT *
FROM accounts
WHERE (name LIKE 'C%' OR name LIKE 'W%') 
    AND ((primary_poc LIKE '%ana%' OR primary_poc LIKE '%Ana%') 
    AND primary_poc NOT LIKE '%eana%');

Clone this wiki locally