-
Notifications
You must be signed in to change notification settings - Fork 11
3a. Data Visualization
What visualization you use not only depends on what data types you are working with but also how many columns in your dataset you want to look at in a single plot.
If you are only interested in one column in your dataset, this is called UNIVARIATE analysis.
For univariate quantitative data, both continuous and discrete, the most common plot to visualize the data is a histogram.
For quantitative data, if we are just looking at one column worth of data, we have four common visuals:
- Histogram (most cases this is used)
- Normal Quantile Plot
- Stem and Leaf Plot
- Box and Whisker Plot

The most common plot for categorical data is a bar chart. The bar chart is like a histogram, but the bins are determined based on a set category, and not based on a range that the chart creator can change.
For categorical data, if we are looking at just one variable (column), we have three common visuals:
- Bar Chart (most cases)
- Pie Chart
- Pareto Chart

NOTE: avoid 3-dimensional plots/charts as they can be misleading
If you are comparing two quantitative variables to one another, such as price and sales or heights and weights, the most common plot for such an analysis is a scatter plot.
These plots can be used to visualize both the strength and direction of the relationship between the two variables.
An example of a strong positive relationship is when one variable increases the other variable also increases.
An example of a strong negative relationship is when one variable increases the other variable decreases.

In a scatter plot, as the points spread out from one another, this weakens the relationship. The strength is not in the direction but in the closeness of the points in the chart. Generally, we consider strength as either weak, moderate or strong. And direction as simply positive or negative. A numerical number to capture these two aspects is called the correlation coefficient. The correlation coefficient is often denoted by r, and is used to tell us the strength and direction of a linear relationship.
The correlation coefficient (r) is always between -1 and 1. For example r=0.83. Where the closer it is to 1 or -1, the stronger the relationship. The negative tells you that there is a negative relationship and the positive tells you there is a positive relationship.
NOTE: In statistics, the Pearson correlation coefficient (PCC, pronounced /ˈpɪərsən/), also referred to as Pearson's r, the Pearson product-moment correlation coefficient (PPMCC) or the bivariate correlation,[1] is a measure of the linear correlation between two variables X and Y. It has a value between +1 and −1, where 1 is total positive linear correlation, 0 is no linear correlation, and −1 is total negative linear correlation. It is widely used in the sciences. It was developed by Karl Pearson from a related idea introduced by Francis Galton in the 1880s.
This needs to be taken with a grain of salt and is highly dependent on the area that the statistics is being applied. For example, social sciences or environmental.
Strong Moderate Weak
0.7 <= |r| <= 1.00 0.3 <= |r| < 0.70 0.0 <= |r| < 0.3
It can be calculated in Excel using CORREL(col1, col2), where col1 and col2 are the two columns you are looking to compare to one another.
Line plots are a common plot for viewing data over time. These plots allow us to quickly identify overall trends, seasonal occurrences, peaks, and valleys in the data. You will commonly see these used in looking at stock prices over time, but really tracking anything over time can be easily viewed using these plots.
The key to building great data visualizations is in aiming them at answering the questions you want answered. Here is an example of multiple charts asking different questions based on the same data.
Consider you run a blog and you want to understand traffic on the blog. For simplicity, let's use 12 months worth of traffic.
- If we wanted to know the month that had the most traffic, the best plot would be a bar chart, as this would put the months in order from highest to lowest traffic. Each bar is the amount of traffic for a particular month.

- If we wanted to know if traffic was increasing over time. The best chart would be a line chart.

- What if we had multiple years worth of data and we wanted do know the distribution of what the number of visitors looked like. How frequently were there more than 2000 visitors in a month? What was the most common number of visitors? The bar or line chart is not the best for this. And neither of these questions involve the time component, but rather specifically looking at the count. A histogram will help answer these questions.

Using the data, we just found 3 ways to present the data. The right way to visualize the data is what helps you answer your question.
When we introduce categorical data into our quantitative visualization many more options become available. For example, if wanted to compare two categorical variables to one another we would probably use a side by side bar chart.
If we had more than two variables, for example, sales over time for 3 products. We might start with a line chart since we are dealing with time but then have 3 lines for each product on top of each other.

- Images from Udacity: Data Foundations Nanodegree program