
Run ggplot(data = mpg). What do you see?


This code creates an empty plot. The ggplot() function creates the background of the plot, but since no layers were specified with a geom function, nothing is drawn.

How many rows are in mpg? How many columns?

There are 234 rows and 11 columns in the mpg data frame.

nrow(mpg)
#> [1] 234
ncol(mpg)
#> [1] 11

The glimpse() function also displays the number of rows and columns in a data frame.

glimpse(mpg)
#> Rows: 234
#> Columns: 11
#> $ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", …
#> $ model        <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", …
#> $ displ        <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2…
#> $ year         <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 20…
#> $ cyl          <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8,…
#> $ trans        <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "aut…
#> $ drv          <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4", "…
#> $ cty          <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, …
#> $ hwy          <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, …
#> $ fl           <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "…
#> $ class        <chr> "compact", "compact", "compact", "compact", "compact", "…

What does the drv variable describe? Read the help for ?mpg to find out.

What happens if you make a scatter plot of class vs drv? Why is the plot not useful?

The resulting scatterplot has only a few points.

ggplot(mpg, aes(x = class, y = drv)) + geom_point()


A scatter plot is not a useful display of these variables since both drv and class are categorical variables. Since categorical variables typically take a small number of values, there are a limited number of unique combinations of (x, y) values that can be displayed. In this data, drv takes 3 values and class takes 7 values, meaning that there are only 21 values that could be plotted on a scatterplot of drv vs. class. In this data, only 12 combinations of (drv, class) are observed.

A simple scatter plot does not show how many observations there are for each (x, y) value. As such, scatterplots work best for plotting a continuous x and a continuous y variable, and when all (x, y) values are unique.

Warning: The following code uses functions introduced in a later section. Come back to this after reading section 7.5.2, which introduces methods for plotting two categorical variables. The first is geom_count() which is similar to a scatterplot but uses the size of the points to show the number of observations at an (x, y) point.

ggplot(mpg, aes(x = class, y = drv)) + geom_count()


The second is geom_tile() which uses a color scale to show the number of observations with each (x, y) value.
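
One sketch of that approach (the original code is not reproduced here; this version counts the observations first with count()):

mpg %>%
  count(class, drv) %>%
  ggplot(aes(x = class, y = drv)) +
  geom_tile(mapping = aes(fill = n))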


In the previous plot, there are many missing tiles. These missing tiles represent unobserved combinations of class and drv values. These missing values are not unknown, but represent values of (class, drv) where n = 0. The complete() function in the tidyr package adds new rows to a data frame for missing combinations of columns. The following code adds rows for missing combinations of class and drv and uses the fill argument to set n = 0 for those new rows.
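
A sketch combining count() and complete() (an illustration, not necessarily the original code):

mpg %>%
  count(class, drv) %>%
  complete(class, drv, fill = list(n = 0)) %>%
  ggplot(aes(x = class, y = drv)) +
  geom_tile(mapping = aes(fill = n))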

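
The code under discussion places colour = "blue" inside aes(); a sketch of that call:

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, colour = "blue"))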

The argument colour = "blue" is included within the mapping argument, and as such, it is treated as an aesthetic, which is a mapping between a variable and a value. In the expression colour = "blue", "blue" is interpreted as a categorical variable which takes only a single value, "blue". If this is confusing, consider how colour = 1:234 and colour = 1 would be interpreted by aes().

The following code produces the expected result.
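
A sketch of the corrected call, with the colour set outside of aes():

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), colour = "blue")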


Which variables in mpg are categorical? Which variables are continuous? (Hint: type ?mpg to read the documentation for the dataset). How can you see this information when you run mpg?

The following list contains the categorical variables in mpg:

  • manufacturer
  • model
  • trans
  • drv
  • fl
  • class

The following list contains the continuous variables in mpg:

  • displ
  • year
  • cyl
  • cty
  • hwy

In the printed data frame, the angle brackets at the top of each column give the type of each variable.

Those with <chr> above their columns are categorical, while those with <dbl> or <int> are continuous. The exact meaning of these types will be discussed in “Chapter 15: Vectors”.

glimpse() is another function that concisely displays the type of each column in the data frame:

glimpse(mpg)
#> Rows: 234
#> Columns: 11
#> $ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", …
#> $ model        <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", …
#> $ displ        <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2…
#> $ year         <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 20…
#> $ cyl          <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8,…
#> $ trans        <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "aut…
#> $ drv          <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4", "…
#> $ cty          <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, …
#> $ hwy          <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, …
#> $ fl           <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "…
#> $ class        <chr> "compact", "compact", "compact", "compact", "compact", "…

For those lists, I treated any non-numeric variable as categorical and any numeric variable as continuous. This largely corresponds to the heuristics ggplot() uses when interpreting variables as discrete or continuous.

However, this definition of continuous vs. categorical misses several important cases. Of the numeric variables, year and cyl (cylinders) clearly take on discrete values. The variables cty and hwy are stored as integers (int), so they also only take on discrete values. Even displ, which is stored as a double, takes on only a limited number of distinct values; in some sense, due to measurement and computational constraints, all numeric variables are discrete. But unlike the categorical variables, it is possible to add and subtract these numeric variables in a meaningful way. The typology of levels of measurement is one way of classifying variables by the operations they support.

In this case, the R data types largely encode the semantics of the variables; e.g., integer variables are stored as integers, categorical variables with no order are stored as character vectors, and so on. However, that is not always the case. Instead, the data could have stored the categorical class variable as an integer with values 1–7, where the documentation would note that 1 = “compact”, 2 = “midsize”, and so on. Even though this integer vector could be added, multiplied, subtracted, and divided, those operations would be meaningless.

Fundamentally, categorizing variables as “discrete”, “continuous”, “ordinal”, “nominal”, “categorical”, etc. is about specifying what operations can be performed on the variables. Discrete variables support counting and calculating the mode. Variables with an ordering support sorting and calculating quantiles. Variables that have an interval scale support addition and subtraction and operations such as taking the mean that rely on these primitives. In this way, the classification of data and variable types is an informal class system, something that is beyond the scope of R4DS but discussed in Advanced R.

Map a continuous variable to color, size, and shape. How do these aesthetics behave differently for categorical vs. continuous variables?

The variable cty, city miles per gallon, is a continuous variable.

ggplot(mpg, aes(x = displ, y = hwy, colour = cty)) + geom_point()


Instead of using discrete colors, the continuous variable uses a scale that varies from a light to dark blue color.

ggplot(mpg, aes(x = displ, y = hwy, size = cty)) + geom_point()


When cty is mapped to size, the sizes of the points vary continuously with its value.


When a continuous value is mapped to shape, it gives an error. Though we could split a continuous variable into discrete categories and use a shape aesthetic, this would conceptually not make sense. A numeric variable has an order, but shapes do not. It is clear that smaller points correspond to smaller values, or once the color scale is given, which colors correspond to larger or smaller values. But it is not clear whether a square is greater or less than a circle.
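
A sketch of a call that triggers the error (the exact error message is not reproduced here):

# errors: a continuous variable cannot be mapped to the shape aesthetic
ggplot(mpg, aes(x = displ, y = hwy, shape = cty)) +
  geom_point()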

What happens if you map the same variable to multiple aesthetics?
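
The original plot is not reproduced here, but based on the description below, a call along these lines produces it:

ggplot(mpg, aes(x = displ, y = hwy, colour = hwy, size = displ)) +
  geom_point()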


In the above plot, hwy is mapped to both location on the y-axis and color, and displ is mapped to both location on the x-axis and size. The code works and produces a plot, even if it is a bad one. Mapping a single variable to multiple aesthetics is redundant, so in most cases it should be avoided.

What happens if you map an aesthetic to something other than a variable name, like aes(colour = displ < 5)?

ggplot(mpg, aes(x = displ, y = hwy, colour = displ < 5)) + geom_point()


Aesthetics can also be mapped to expressions like displ < 5. The ggplot() function behaves as if a temporary variable was added to the data with values equal to the result of the expression. In this case, the result of displ < 5 is a logical variable which takes values of TRUE or FALSE.

This also explains why, in Exercise 3.3.1, the expression colour = "blue" created a categorical variable with only one category: “blue”.

In the following plot the class variable is mapped to color.
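
A sketch of that plot, assuming the usual displ and hwy axes:

ggplot(mpg, aes(x = displ, y = hwy, colour = class)) +
  geom_point()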


Advantages of encoding class with facets instead of color include the ability to encode more distinct categories. For me, it is difficult to distinguish between the colors of "midsize" and "minivan".
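
For comparison, faceting encodes class by panel rather than by color (a sketch):

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(~ class)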

Given human visual perception, the maximum number of colors to use when encoding unordered categorical (qualitative) data is nine, and in practice often far fewer than that. Displaying observations from different categories on different scales makes it difficult to directly compare values of observations across categories. However, it can make it easier to compare the shape of the relationship between the x and y variables across categories.

Disadvantages of encoding the class variable with facets instead of the color aesthetic include the difficulty of comparing the values of observations between categories since the observations for each category are on different plots. Using the same x- and y-scales for all facets makes it easier to compare values of observations across categories, but it is still more difficult than if they had been displayed on the same plot. Since encoding class within color also places all points on the same plot, it visualizes the unconditional relationship between the x and y variables; with facets, the unconditional relationship is no longer visualized since the points are spread across multiple plots.

The benefits of encoding a variable with faceting instead of color increase with both the number of points and the number of categories. With a large number of points, there is often overlap, and it is difficult to handle overlapping points drawn in different colors. Jittering will still work with color, but it only works well if there are few points and the categories do not overlap much; otherwise the overlapping colors will no longer be distinct, and it will be hard to pick out the patterns of different categories visually. Transparency (alpha) does not combine well with color either, since the mixing of overlapping transparent colors no longer represents the colors of the categories. Binning methods already use color to encode the density of points in each bin, so color cannot also be used to encode categories.

As the number of categories increases, the difference between colors decreases, to the point that the color of categories will no longer be visually distinct.

Read ?facet_wrap. What does nrow do? What does ncol do? What other options control the layout of the individual panels? Why doesn’t facet_grid() have nrow and ncol variables?

The nrow and ncol arguments determine the number of rows and columns, respectively, to use when laying out the facets. They are needed because facet_wrap() facets on only one variable, so the number of panels alone does not determine the layout. Other arguments, such as dir and as.table, also affect how the panels are arranged.

The nrow and ncol arguments are unnecessary for facet_grid() since the number of unique values of the variables specified in the function determines the number of rows and columns.
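
For illustration (a sketch, not from the original text):

# facet_wrap(): one faceting variable; nrow and ncol control the panel layout
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(~ class, nrow = 2)
# facet_grid(): the two faceting variables determine the rows and columns
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_grid(drv ~ cyl)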

When using facet_grid() you should usually put the variable with more unique levels in the columns. Why?

There will be more space for columns if the plot is laid out horizontally (landscape).

What geom would you use to draw a line chart? A boxplot? A histogram? An area chart?

  • line chart: geom_line()
  • boxplot: geom_boxplot()
  • histogram: geom_histogram()
  • area chart: geom_area()

What does show.legend = FALSE do? What happens if you remove it? Why do you think I used it earlier in the chapter?

The argument show.legend = FALSE hides the legend for that layer.

Consider this example earlier in the chapter.
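
That example was, approximately:

ggplot(data = mpg) +
  geom_smooth(
    mapping = aes(x = displ, y = hwy, colour = drv),
    show.legend = FALSE
  )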


In that plot, there is no legend. Removing the show.legend argument or setting show.legend = TRUE will result in the plot having a legend displaying the mapping between colors and drv.


In the chapter, the legend is suppressed because with three plots, adding a legend to only the last plot would make the sizes of plots different. Different sized plots would make it more difficult to see how arguments change the appearance of the plots. The purpose of those plots is to show the difference between no groups, using a group aesthetic, and using a color aesthetic, which creates implicit groups. In that example, the legend isn’t necessary since looking up the values associated with each color isn’t necessary to make that point.

Recreate the R code necessary to generate the following graphs.


The following code will generate those plots.


What is the default geom associated with stat_summary()? How could you rewrite the previous plot to use that geom function instead of the stat function?

The “previous plot” referred to in the question is the following.
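
It was, approximately (shown here with the current argument names; see the note below on the renaming in ggplot2 3.3.0):

ggplot(data = diamonds) +
  stat_summary(
    mapping = aes(x = cut, y = depth),
    fun.min = min,
    fun.max = max,
    fun = median
  )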

The arguments fun.ymin, fun.ymax, and fun.y have been deprecated and replaced with fun.min, fun.max, and fun in ggplot2 v 3.3.0.

The default geom for stat_summary() is geom_pointrange(). The default stat for geom_pointrange() is stat_identity(), but we can add the argument stat = "summary" to use stat_summary() instead.
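
A sketch of that rewrite:

ggplot(data = diamonds) +
  geom_pointrange(
    mapping = aes(x = cut, y = depth),
    stat = "summary"
  )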


The resulting message says that, with no summary function supplied, stat_summary() defaults to mean_se(), which uses the mean and its standard error to calculate the middle point and the endpoints of the line. However, in the original plot the min and max values were used for the endpoints. To recreate the original plot we need to specify values for fun.min, fun.max, and fun.
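
A sketch of the full recreation:

ggplot(data = diamonds) +
  geom_pointrange(
    mapping = aes(x = cut, y = depth),
    stat = "summary",
    fun.min = min,
    fun.max = max,
    fun = median
  )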


What does geom_col() do? How is it different to geom_bar()?

The geom_col() function has a different default stat than geom_bar(). The default stat of geom_col() is stat_identity(), which leaves the data as is. The geom_col() function expects that the data contains x values and y values which represent the bar height.

The default stat of geom_bar() is stat_count(). The geom_bar() function only expects an x variable. The stat, stat_count(), preprocesses input data by counting the number of observations for each value of x. The y aesthetic uses the values of these counts.
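
A small illustration of the difference (a sketch, not from the original text):

# geom_bar() counts the rows for each value of x
ggplot(mpg, aes(x = class)) +
  geom_bar()
# geom_col() uses the y values supplied in the data
mpg %>%
  count(class) %>%
  ggplot(aes(x = class, y = n)) +
  geom_col()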

Most geoms and stats come in pairs that are almost always used in concert. Read through the documentation and make a list of all the pairs. What do they have in common?

The following table lists the pairs of geoms and stats that are almost always used in concert.

Complementary geoms and stats
geom_bar() stat_count()
geom_bin2d() stat_bin_2d()
geom_boxplot() stat_boxplot()
geom_contour_filled() stat_contour_filled()
geom_contour() stat_contour()
geom_count() stat_sum()
geom_density_2d() stat_density_2d()
geom_density() stat_density()
geom_dotplot() stat_bindot()
geom_function() stat_function()
geom_sf() stat_sf()
geom_smooth() stat_smooth()
geom_violin() stat_ydensity()
geom_hex() stat_bin_hex()
geom_qq_line() stat_qq_line()
geom_qq() stat_qq()
geom_quantile() stat_quantile()

These pairs of geoms and stats tend to share their names, such as stat_smooth() and geom_smooth(), and to be documented on the same help page. The pairs of geoms and stats that are used in concert often have each other as the default stat (for a geom) or geom (for a stat).

The following tables contain the geoms and stats in ggplot2 and their defaults as of version 3.3.0. Many geoms have stat_identity() as the default stat.

ggplot2 geom layers and their default stats.
geom_abline() stat_identity()
geom_area() stat_identity()
geom_bar() stat_count() x
geom_bin2d() stat_bin_2d() x
geom_blank() None
geom_boxplot() stat_boxplot() x
geom_col() stat_identity()
geom_count() stat_sum() x
geom_contour_filled() stat_contour_filled() x
geom_contour() stat_contour() x
geom_crossbar() stat_identity()
geom_curve() stat_identity()
geom_density_2d_filled() stat_density_2d_filled() x
geom_density_2d() stat_density_2d() x
geom_density() stat_density() x
geom_dotplot() stat_bindot() x
geom_errorbar() stat_identity()
geom_errorbarh() stat_identity()
geom_freqpoly() stat_bin() x
geom_function() stat_function() x
geom_hex() stat_bin_hex() x
geom_histogram() stat_bin() x
geom_hline() stat_identity()
geom_jitter() stat_identity()
geom_label() stat_identity()
geom_line() stat_identity()
geom_linerange() stat_identity()
geom_map() stat_identity()
geom_path() stat_identity()
geom_point() stat_identity()
geom_pointrange() stat_identity()
geom_polygon() stat_identity()
geom_qq_line() stat_qq_line() x
geom_qq() stat_qq() x
geom_quantile() stat_quantile() x
geom_raster() stat_identity()
geom_rect() stat_identity()
geom_ribbon() stat_identity()
geom_rug() stat_identity()
geom_segment() stat_identity()
geom_sf_label() stat_sf_coordinates() x
geom_sf_text() stat_sf_coordinates() x
geom_sf() stat_sf() x
geom_smooth() stat_smooth() x
geom_spoke() stat_identity()
geom_step() stat_identity()
geom_text() stat_identity()
geom_tile() stat_identity()
geom_violin() stat_ydensity() x
geom_vline() stat_identity()
ggplot2 stat layers and their default geoms.
stat_bin_2d() geom_tile()
stat_bin_hex() geom_hex() x
stat_bin() geom_bar() x
stat_boxplot() geom_boxplot() x
stat_count() geom_bar() x
stat_contour_filled() geom_contour_filled() x
stat_contour() geom_contour() x
stat_density_2d_filled() geom_density_2d_filled() x
stat_density_2d() geom_density_2d() x
stat_density() geom_area()
stat_ecdf() geom_step()
stat_ellipse() geom_path()
stat_function() geom_function() x
stat_function() geom_path()
stat_identity() geom_point()
stat_qq_line() geom_path()
stat_qq() geom_point()
stat_quantile() geom_quantile() x
stat_sf_coordinates() geom_point()
stat_sf() geom_rect()
stat_smooth() geom_smooth() x
stat_sum() geom_point()
stat_summary_2d() geom_tile()
stat_summary_bin() geom_pointrange()
stat_summary_hex() geom_hex()
stat_summary() geom_pointrange()
stat_unique() geom_point()
stat_ydensity() geom_violin() x

What variables does stat_smooth() compute? What parameters control its behavior?

The function stat_smooth() calculates the following variables:

  • y: predicted value
  • ymin: lower value of the confidence interval
  • ymax: upper value of the confidence interval
  • se: standard error

The “Computed Variables” section of the stat_smooth() documentation contains these variables.

The parameters that control the behavior of stat_smooth() include:

  • method: This is the method used to compute the smoothing line. If NULL, a default method is used based on the sample size: stats::loess() when there are fewer than 1,000 observations in a group, and mgcv::gam() with formula = y ~ s(x, bs = "cs") otherwise. Alternatively, the user can provide a character vector with a function name, e.g. "lm" or "loess", or a function, e.g. MASS::rlm.

  • formula: When providing a custom method argument, the formula to use. The default is y ~ x. For example, to use the line implied by lm(y ~ x + I(x ^ 2) + I(x ^ 3)), use method = "lm" or method = lm and formula = y ~ x + I(x ^ 2) + I(x ^ 3).

  • method.args: A list of additional arguments, other than the formula (which is already specified in the formula argument), to pass to the function specified in method.

  • se: If TRUE, display standard error bands; if FALSE, display only the line.

  • na.rm: If FALSE, missing values are removed with a warning; if TRUE, they are silently removed. The default is FALSE in order to make debugging easier. If missing values are known to be in the data, the warning can be ignored, but if missing values are not anticipated, this warning can help catch errors.

TODO: Plots with examples illustrating the uses of these arguments.
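
As a partial illustration (a sketch added here; the displ/hwy mapping and the quadratic formula are arbitrary choices), the method, formula, and se parameters can be used like this:

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  stat_smooth(method = "lm", formula = y ~ x + I(x^2), se = FALSE)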

In our proportion bar chart, we need to set group = 1 Why? In other words, what is the problem with these two graphs?

If group = 1 is not included, then all the bars in the plot will have the same height: a height of 1. This is because geom_bar() treats each value of x as its own group, and the stat computes the proportions within each group, so every proportion equals 1.

ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut, y = ..prop..))


The problem with these two plots is that the proportions are calculated within the groups.


The following code will produce the intended bar chart for the case with no fill aesthetic.
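
A sketch of that code:

ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = ..prop.., group = 1))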


With the fill aesthetic, the heights of the bars need to be normalized.
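
A sketch of one way to do this, dividing the counts by the total count:

ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = ..count.. / sum(..count..), fill = color))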


What parameters to geom_jitter() control the amount of jittering?

From the geom_jitter() documentation, two arguments control the amount of jittering:

  • width controls the amount of horizontal displacement, and
  • height controls the amount of vertical displacement.

The default values of width and height introduce noise in both directions. Here is what the plot looks like with the default values of height and width.


However, we can change these parameters. Here are a few examples to understand how these parameters affect the amount of jittering. When width = 0, there is no horizontal jitter.
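
The calls for the cases discussed here are, in order (a sketch; the cty vs. hwy mapping is taken from the later examples in this answer):

ggplot(mpg, aes(x = cty, y = hwy)) + geom_jitter(width = 0)             # no horizontal jitter
ggplot(mpg, aes(x = cty, y = hwy)) + geom_jitter(width = 20)            # too much horizontal jitter
ggplot(mpg, aes(x = cty, y = hwy)) + geom_jitter(height = 0)            # no vertical jitter
ggplot(mpg, aes(x = cty, y = hwy)) + geom_jitter(height = 15)           # too much vertical jitter
ggplot(mpg, aes(x = cty, y = hwy)) + geom_jitter(width = 0, height = 0) # equivalent to geom_point()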


When width = 20, there is too much horizontal jitter.


When height = 0, there is no vertical jitter.


When height = 15, there is too much vertical jitter.


When width = 0 and height = 0, there is neither horizontal or vertical jitter, and the plot produced is identical to the one produced with geom_point().


Note that the height and width arguments are in the units of the data. Thus height = 1 (width = 1) corresponds to different relative amounts of jittering depending on the scale of the y (x) variable. The default values of height and width are 40% of the resolution() of the data, where the resolution is the smallest non-zero distance between adjacent values of a variable; since points are jittered in both the positive and negative directions, the jittered values occupy about 80% of the implied bins. When x and y are discrete variables, their resolutions are both equal to 1, so the defaults amount to height = 0.4 and width = 0.4.

The default values of height and width in geom_jitter() are non-zero, so unless both height and width are explicitly set to 0, there will be some jitter.

ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) + geom_jitter()


Compare and contrast geom_jitter() with geom_count().

The geom geom_jitter() adds random variation to the locations of the points on the graph. In other words, it “jitters” the locations of points slightly. This method reduces overplotting since two points with the same location are unlikely to have the same random variation.

ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) + geom_jitter()


However, the reduction in overlapping comes at the cost of slightly changing the x and y values of the points.

The geom geom_count() sizes the points relative to the number of observations. Combinations of (x, y) values with more observations will be larger than those with fewer observations.

ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) + geom_count()


The geom_count() geom does not change x and y coordinates of the points. However, if the points are close together and counts are large, the size of some points can itself create overplotting. For example, in the following example, a third variable mapped to color is added to the plot. In this case, geom_count() is less readable than geom_jitter() when adding a third variable as a color aesthetic.
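
A sketch of the two plots being compared:

ggplot(mpg, aes(x = cty, y = hwy, colour = class)) +
  geom_jitter()
ggplot(mpg, aes(x = cty, y = hwy, colour = class)) +
  geom_count()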


Combining geom_count() with jitter, which is specified with the position argument to geom_count() rather than with a separate geom, helps reduce the overplotting a little.
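
A sketch of that combination:

ggplot(mpg, aes(x = cty, y = hwy, colour = class)) +
  geom_count(position = "jitter")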


But as this example shows, unfortunately, there is no universal solution to overplotting. The costs and benefits of different approaches will depend on the structure of the data and the goal of the data scientist.

What’s the default position adjustment for geom_boxplot()? Create a visualization of the mpg dataset that demonstrates it.

The default position for geom_boxplot() is "dodge2", which is a shortcut for position_dodge2(). This position adjustment does not change the vertical position of a geom but moves the geom horizontally to avoid overlapping other geoms. See the documentation for position_dodge2() for additional discussion on how it works.

When we add colour = class to the box plot, the boxplots for the different levels of class are placed side by side within each value of drv, i.e., dodged.

ggplot(data = mpg, aes(x = drv, y = hwy, colour = class)) + geom_boxplot()

If position_identity() is used, the boxplots overlap.
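
A sketch of that call:

ggplot(data = mpg, aes(x = drv, y = hwy, colour = class)) +
  geom_boxplot(position = "identity")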


What does labs() do? Read the documentation.

The labs() function adds axis titles, plot titles, and captions to the plot.


The arguments to labs() are optional, so you can add as many or as few of these as are needed.
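
For example (a sketch; the label text is made up for illustration):

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  labs(
    title = "Engine displacement and highway mileage",
    x = "Engine displacement (litres)",
    y = "Highway miles per gallon",
    caption = "Data: the mpg dataset"
  )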


The labs() function is not the only function that adds titles to plots. The xlab(), ylab(), and x- and y-scale functions can add axis titles. The ggtitle() function adds plot titles.

What’s the difference between coord_quickmap() and coord_map()?

The coord_map() function uses map projections to project the three-dimensional Earth onto a two-dimensional plane. By default, coord_map() uses the Mercator projection. This projection is applied to all the geoms in the plot. The coord_quickmap() function uses an approximate but faster map projection. This approximation ignores the curvature of the Earth and adjusts the map for the latitude/longitude ratio. The coord_quickmap() projection is faster than coord_map() both because the projection is computationally easier and because, unlike coord_map(), the coordinates of the individual geoms do not need to be transformed.

See the coord_map() documentation for more information on these functions and some examples.

What does the plot below tell you about the relationship between city and highway mpg? Why is coord_fixed() important? What does geom_abline() do?

The function coord_fixed() ensures that the line produced by geom_abline() is at a 45-degree angle. A 45-degree line makes it easy to compare the highway and city mileage to the case in which city and highway MPG were equal.
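
The plot in question was, approximately:

ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point() +
  geom_abline() +
  coord_fixed()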


If we didn’t include coord_fixed(), then the line would no longer have an angle of 45 degrees.


On average, humans are best able to perceive differences in angles relative to 45 degrees. See Cleveland (1993b), Cleveland (1994), Cleveland (1993a), Cleveland, McGill, and McGill (1988), and Heer and Agrawala (2006) for discussion of how the aspect ratio of a plot affects perception of the values it encodes, evidence that 45 degrees is generally the optimal banking angle, and methods to calculate the optimal aspect ratio of a plot. The function ggthemes::bank_slopes() will calculate the optimal aspect ratio to bank slopes to 45 degrees.

Cleveland, William S. 1993b. Visualizing Information. Hobart Press.

Cleveland, William S. 1994. The Elements of Graphing Data. Hobart Press.

Cleveland, William S., Marylyn E. McGill, and Robert McGill. 1988. “The Shape Parameter of a Two-Variable Graph.” Journal of the American Statistical Association 83 (402). [American Statistical Association, Taylor & Francis, Ltd.]: 289–300. https://www.jstor.org/stable/2288843.

Heer, Jeffrey, and Maneesh Agrawala. 2006. “Multi-Scale Banking to 45º.” IEEE Transactions on Visualization and Computer Graphics 12 (5, September/October). https://doi.org/10.1109/TVCG.2006.163.



The variable being printed is my_varıable, not my_variable: the seventh character is “ı” (“LATIN SMALL LETTER DOTLESS I”), not “i”.

While it wouldn’t have helped much in this case, the need to distinguish similar characters in code is one reason why fonts that clearly distinguish them are preferred in programming. It is especially important to distinguish between two sets of similar-looking characters:

  • the numeral zero (0), the Latin small letter O (o), and the Latin capital letter O (O),
  • the numeral one (1), the Latin small letter I (i), the Latin capital letter I (I), and Latin small letter L (l).

In such fonts, zero and the Latin letter O are often distinguished by giving the zero glyph either a dot in the interior or a slash through it. Some examples of fonts with dotted or slashed zero glyphs are Consolas, DejaVu Sans Mono, Monaco, Menlo, Source Code Pro, and Fira Code.

Error messages of the form "object '...' not found" mean exactly what they say. R cannot find an object with that name. Unfortunately, the error does not tell you why that object cannot be found, because R does not know the reason that the object does not exist. The most common scenarios in which I encounter this error message are

  1. I forgot to create the object, or an error prevented the object from being created.

  2. I made a typo in the object’s name, either when using it or when I created it (as in the example above), or I forgot what I had originally named it. If you find yourself often writing the wrong name for an object, it is a good indication that the original name was not a good one.

  3. I forgot to load the package that contains the object using library().

The error message is: argument "data" is missing, with no default. This error is the result of a typo: dota instead of data.

ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy))


R could not find the function fliter() because we made a typo: fliter instead of filter.

We aren’t done yet. But the error message gives a suggestion. Let’s follow it.

R says it can’t find the object diamond. This is a typo; the data frame is named diamonds.

How did I know? I started typing in diamond and RStudio completed it to diamonds. Since diamonds includes the variable carat and the code works, that appears to have been the problem.

Press Alt + Shift + K. What happens? How can you get to the same place using the menus?

This gives a menu with keyboard shortcuts. This can be found in the menu under Tools -> Keyboard Shortcuts Help.



library("nycflights13") library("tidyverse")

Find all flights that

  1. Had an arrival delay of two or more hours
  2. Flew to Houston (IAH or HOU)
  3. Were operated by United, American, or Delta
  4. Departed in summer (July, August, and September)
  5. Arrived more than two hours late, but didn’t leave late
  6. Were delayed by at least an hour, but made up over 30 minutes in flight
  7. Departed between midnight and 6 am (inclusive)

The answer to each part follows.

  1. Since the arr_delay variable is measured in minutes, find flights with an arrival delay of 120 or more minutes.

    filter(flights, arr_delay >= 120) #> # A tibble: 10,200 x 19 #> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time #> <int> <int> <int> <int> <int> <dbl> <int> <int> #> 1 2013 1 1 811 630 101 1047 830 #> 2 2013 1 1 848 1835 853 1001 1950 #> 3 2013 1 1 957 733 144 1056 853 #> 4 2013 1 1 1114 900 134 1447 1222 #> 5 2013 1 1 1505 1310 115 1638 1431 #> 6 2013 1 1 1525 1340 105 1831 1626 #> # … with 10,194 more rows, and 11 more variables: arr_delay <dbl>, #> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>, #> # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

  2. The flights that flew to Houston are those flights where the destination (dest) is either “IAH” or “HOU”.

    filter(flights, dest == "IAH" | dest == "HOU") #> # A tibble: 9,313 x 19 #> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time #> <int> <int> <int> <int> <int> <dbl> <int> <int> #> 1 2013 1 1 517 515 2 830 819 #> 2 2013 1 1 533 529 4 850 830 #> 3 2013 1 1 623 627 -4 933 932 #> 4 2013 1 1 728 732 -4 1041 1038 #> 5 2013 1 1 739 739 0 1104 1038 #> 6 2013 1 1 908 908 0 1228 1219 #> # … with 9,307 more rows, and 11 more variables: arr_delay <dbl>, #> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>, #> # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

    However, using %in% is more compact and would scale to cases where there were more than two airports we were interested in.

    filter(flights, dest %in% c("IAH", "HOU")) #> # A tibble: 9,313 x 19 #> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time #> <int> <int> <int> <int> <int> <dbl> <int> <int> #> 1 2013 1 1 517 515 2 830 819 #> 2 2013 1 1 533 529 4 850 830 #> 3 2013 1 1 623 627 -4 933 932 #> 4 2013 1 1 728 732 -4 1041 1038 #> 5 2013 1 1 739 739 0 1104 1038 #> 6 2013 1 1 908 908 0 1228 1219 #> # … with 9,307 more rows, and 11 more variables: arr_delay <dbl>, #> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>, #> # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

  3. In the flights dataset, the column carrier indicates the airline, but it uses two-character carrier codes. We can find the carrier codes for the airlines in the airlines dataset. Since the carrier code dataset only has 16 rows, and the names of the airlines in that dataset are not exactly “United”, “American”, or “Delta”, it is easiest to manually look up their carrier codes in that data.

    The carrier code for Delta is "DL", for American is "AA", and for United is "UA". Using these carrier codes, we check whether carrier is one of them.

    filter(flights, carrier %in% c("AA", "DL", "UA")) #> # A tibble: 139,504 x 19 #> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time #> <int> <int> <int> <int> <int> <dbl> <int> <int> #> 1 2013 1 1 517 515 2 830 819 #> 2 2013 1 1 533 529 4 850 830 #> 3 2013 1 1 542 540 2 923 850 #> 4 2013 1 1 554 600 -6 812 837 #> 5 2013 1 1 554 558 -4 740 728 #> 6 2013 1 1 558 600 -2 753 745 #> # … with 139,498 more rows, and 11 more variables: arr_delay <dbl>, #> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>, #> # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

  4. The variable month has the month, and it is numeric. So, the summer flights are those that departed in months 7 (July), 8 (August), and 9 (September).

    filter(flights, month >= 7, month <= 9) #> # A tibble: 86,326 x 19 #> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time #> <int> <int> <int> <int> <int> <dbl> <int> <int> #> 1 2013 7 1 1 2029 212 236 2359 #> 2 2013 7 1 2 2359 3 344 344 #> 3 2013 7 1 29 2245 104 151 1 #> 4 2013 7 1 43 2130 193 322 14 #> 5 2013 7 1 44 2150 174 300 100 #> 6 2013 7 1 46 2051 235 304 2358 #> # … with 86,320 more rows, and 11 more variables: arr_delay <dbl>, #> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>, #> # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

    The %in% operator is an alternative. If the : operator is used to specify the integer range, the expression is readable and compact.

    filter(flights, month %in% 7:9) #> # A tibble: 86,326 x 19 #> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time #> <int> <int> <int> <int> <int> <dbl> <int> <int> #> 1 2013 7 1 1 2029 212 236 2359 #> 2 2013 7 1 2 2359 3 344 344 #> 3 2013 7 1 29 2245 104 151 1 #> 4 2013 7 1 43 2130 193 322 14 #> 5 2013 7 1 44 2150 174 300 100 #> 6 2013 7 1 46 2051 235 304 2358 #> # … with 86,320 more rows, and 11 more variables: arr_delay <dbl>, #> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>, #> # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

    We could also use the | operator. However, the | does not scale to many choices. Even with only three choices, it is quite verbose.

    filter(flights, month == 7 | month == 8 | month == 9) #> # A tibble: 86,326 x 19 #> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time #> <int> <int> <int> <int> <int> <dbl> <int> <int> #> 1 2013 7 1 1 2029 212 236 2359 #> 2 2013 7 1 2 2359 3 344 344 #> 3 2013 7 1 29 2245 104 151 1 #> 4 2013 7 1 43 2130 193 322 14 #> 5 2013 7 1 44 2150 174 300 100 #> 6 2013 7 1 46 2051 235 304 2358 #> # … with 86,320 more rows, and 11 more variables: arr_delay <dbl>, #> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>, #> # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

    We can also use the between() function as shown in Exercise 5.2.2.

  5. Flights that arrived more than two hours late, but didn’t leave late will have an arrival delay of more than 120 minutes (arr_delay > 120) and a non-positive departure delay (dep_delay <= 0).

    filter(flights, arr_delay > 120, dep_delay <= 0) #> # A tibble: 29 x 19 #> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time #> <int> <int> <int> <int> <int> <dbl> <int> <int> #> 1 2013 1 27 1419 1420 -1 1754 1550 #> 2 2013 10 7 1350 1350 0 1736 1526 #> 3 2013 10 7 1357 1359 -2 1858 1654 #> 4 2013 10 16 657 700 -3 1258 1056 #> 5 2013 11 1 658 700 -2 1329 1015 #> 6 2013 3 18 1844 1847 -3 39 2219 #> # … with 23 more rows, and 11 more variables: arr_delay <dbl>, carrier <chr>, #> # flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, #> # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

  6. Were delayed by at least an hour, but made up over 30 minutes in flight. If a flight was delayed by at least an hour, then dep_delay >= 60. If the flight didn’t make up any time in the air, then its arrival would be delayed by the same amount as its departure, meaning dep_delay == arr_delay, or alternatively, dep_delay - arr_delay == 0. If it makes up over 30 minutes in the air, then the arrival delay must be at least 30 minutes less than the departure delay, which is stated as dep_delay - arr_delay > 30.

    filter(flights, dep_delay >= 60, dep_delay - arr_delay > 30) #> # A tibble: 1,844 x 19 #> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time #> <int> <int> <int> <int> <int> <dbl> <int> <int> #> 1 2013 1 1 2205 1720 285 46 2040 #> 2 2013 1 1 2326 2130 116 131 18 #> 3 2013 1 3 1503 1221 162 1803 1555 #> 4 2013 1 3 1839 1700 99 2056 1950 #> 5 2013 1 3 1850 1745 65 2148 2120 #> 6 2013 1 3 1941 1759 102 2246 2139 #> # … with 1,838 more rows, and 11 more variables: arr_delay <dbl>, #> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>, #> # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

  7. Finding flights that departed between midnight and 6 a.m. is complicated by the way in which times are represented in the data.
    In dep_time, midnight is represented by 2400, not 0. You can verify this by checking the minimum and maximum of dep_time.

    This is an example of why it is always good to check the summary statistics of your data. Unfortunately, this means we cannot simply check that dep_time < 600, because we also have to consider the special case of midnight.

    filter(flights, dep_time <= 600 | dep_time == 2400) #> # A tibble: 9,373 x 19 #> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time #> <int> <int> <int> <int> <int> <dbl> <int> <int> #> 1 2013 1 1 517 515 2 830 819 #> 2 2013 1 1 533 529 4 850 830 #> 3 2013 1 1 542 540 2 923 850 #> 4 2013 1 1 544 545 -1 1004 1022 #> 5 2013 1 1 554 600 -6 812 837 #> 6 2013 1 1 554 558 -4 740 728 #> # … with 9,367 more rows, and 11 more variables: arr_delay <dbl>, #> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>, #> # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

    Alternatively, we could use the modulo operator, %%. The modulo operator returns the remainder of division. Let’s see how this affects our times.

    c(600, 1200, 2400) %% 2400
    #> [1] 600 1200 0

    Since 2400 %% 2400 == 0 and all other times are left unchanged, we can compare the result of the modulo operation to 600,

    filter(flights, dep_time %% 2400 <= 600) #> # A tibble: 9,373 x 19 #> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time #> <int> <int> <int> <int> <int> <dbl> <int> <int> #> 1 2013 1 1 517 515 2 830 819 #> 2 2013 1 1 533 529 4 850 830 #> 3 2013 1 1 542 540 2 923 850 #> 4 2013 1 1 544 545 -1 1004 1022 #> 5 2013 1 1 554 600 -6 812 837 #> 6 2013 1 1 554 558 -4 740 728 #> # … with 9,367 more rows, and 11 more variables: arr_delay <dbl>, #> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>, #> # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

    This filter expression is more compact, but its readability depends on the familiarity of the reader with modular arithmetic.

Another useful dplyr filtering helper is between(). What does it do? Can you use it to simplify the code needed to answer the previous challenges?

The expression between(x, left, right) is equivalent to x >= left & x <= right.

Of the answers in the previous question, we could simplify the statement of departed in summer (month >= 7 & month <= 9) using the between() function.

filter(flights, between(month, 7, 9)) #> # A tibble: 86,326 x 19 #> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time #> <int> <int> <int> <int> <int> <dbl> <int> <int> #> 1 2013 7 1 1 2029 212 236 2359 #> 2 2013 7 1 2 2359 3 344 344 #> 3 2013 7 1 29 2245 104 151 1 #> 4 2013 7 1 43 2130 193 322 14 #> 5 2013 7 1 44 2150 174 300 100 #> 6 2013 7 1 46 2051 235 304 2358 #> # … with 86,320 more rows, and 11 more variables: arr_delay <dbl>, #> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>, #> # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

How many flights have a missing dep_time? What other variables are missing? What might these rows represent?

Find the rows of flights with a missing departure time (dep_time) using the is.na() function.

filter(flights, is.na(dep_time)) #> # A tibble: 8,255 x 19 #> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time #> <int> <int> <int> <int> <int> <dbl> <int> <int> #> 1 2013 1 1 NA 1630 NA NA 1815 #> 2 2013 1 1 NA 1935 NA NA 2240 #> 3 2013 1 1 NA 1500 NA NA 1825 #> 4 2013 1 1 NA 600 NA NA 901 #> 5 2013 1 2 NA 1540 NA NA 1747 #> 6 2013 1 2 NA 1620 NA NA 1746 #> # … with 8,249 more rows, and 11 more variables: arr_delay <dbl>, #> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>, #> # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Notably, the arrival time (arr_time) is also missing for these rows. These seem to be cancelled flights.

The output of the function summary() includes the number of missing values for all non-character variables.

Why is NA ^ 0 not missing? Why is NA | TRUE not missing? Why is FALSE & NA not missing? Can you figure out the general rule? (NA * 0 is a tricky counterexample!)

NA ^ 0 == 1 since for all numeric values \(x ^ 0 = 1\).

NA | TRUE is TRUE because anything or TRUE is TRUE. If the missing value were TRUE, then TRUE | TRUE == TRUE, and if the missing value was FALSE, then FALSE | TRUE == TRUE.

The value of NA & FALSE is FALSE because anything and FALSE is always FALSE. If the missing value were TRUE, then TRUE & FALSE == FALSE, and if the missing value was FALSE, then FALSE & FALSE == FALSE.

For NA | FALSE, the value is unknown since TRUE | FALSE == TRUE, but FALSE | FALSE == FALSE.

For NA & TRUE, the value is unknown since FALSE & TRUE == FALSE, but TRUE & TRUE == TRUE.
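
For reference, these expressions evaluate as follows (added here for illustration):

NA ^ 0
#> [1] 1
NA | TRUE
#> [1] TRUE
FALSE & NA
#> [1] FALSE
NA * 0
#> [1] NA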

Since \(x * 0 = 0\) for all finite numbers we might expect NA * 0 == 0, but that’s not the case. The reason that NA * 0 != 0 is that \(0 \times \infty\) and \(0 \times -\infty\) are undefined. R represents undefined results as NaN, which is an abbreviation of “not a number”.

Inf * 0
#> [1] NaN
-Inf * 0
#> [1] NaN

How could you use arrange() to sort all missing values to the start? (Hint: use is.na()).

The arrange() function puts NA values last.

arrange(flights, dep_time) %>% tail() #> # A tibble: 6 x 19 #> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time #> <int> <int> <int> <int> <int> <dbl> <int> <int> #> 1 2013 9 30 NA 1842 NA NA 2019 #> 2 2013 9 30 NA 1455 NA NA 1634 #> 3 2013 9 30 NA 2200 NA NA 2312 #> 4 2013 9 30 NA 1210 NA NA 1330 #> 5 2013 9 30 NA 1159 NA NA 1344 #> 6 2013 9 30 NA 840 NA NA 1020 #> # … with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, #> # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, #> # hour <dbl>, minute <dbl>, time_hour <dttm>

Using desc() does not change that.

arrange(flights, desc(dep_time)) #> # A tibble: 336,776 x 19 #> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time #> <int> <int> <int> <int> <int> <dbl> <int> <int> #> 1 2013 10 30 2400 2359 1 327 337 #> 2 2013 11 27 2400 2359 1 515 445 #> 3 2013 12 5 2400 2359 1 427 440 #> 4 2013 12 9 2400 2359 1 432 440 #> 5 2013 12 9 2400 2250 70 59 2356 #> 6 2013 12 13 2400 2359 1 432 440 #> # … with 336,770 more rows, and 11 more variables: arr_delay <dbl>, #> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>, #> # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

To put NA values first, we can add an indicator of whether the column has a missing value. Then we sort by the missing indicator column and the column of interest. For example, to sort the data frame by departure time (dep_time) in ascending order but NA values first, run the following.

arrange(flights, desc(is.na(dep_time)), dep_time) #> # A tibble: 336,776 x 19 #> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time #> <int> <int> <int> <int> <int> <dbl> <int> <int> #> 1 2013 1 1 NA 1630 NA NA 1815 #> 2 2013 1 1 NA 1935 NA NA 2240 #> 3 2013 1 1 NA 1500 NA NA 1825 #> 4 2013 1 1 NA 600 NA NA 901 #> 5 2013 1 2 NA 1540 NA NA 1747 #> 6 2013 1 2 NA 1620 NA NA 1746 #> # … with 336,770 more rows, and 11 more variables: arr_delay <dbl>, #> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>, #> # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

The flights will first be sorted by desc(is.na(dep_time)). Since is.na(dep_time) is TRUE when dep_time is missing and FALSE when it is not, and TRUE > FALSE, sorting it in descending order places the rows with missing values of dep_time first.

Sort flights to find the most delayed flights. Find the flights that left earliest.

Find the most delayed flights by sorting the table by departure delay, dep_delay, in descending order.

arrange(flights, desc(dep_delay)) #> # A tibble: 336,776 x 19 #> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time #> <int> <int> <int> <int> <int> <dbl> <int> <int> #> 1 2013 1 9 641 900 1301 1242 1530 #> 2 2013 6 15 1432 1935 1137 1607 2120 #> 3 2013 1 10 1121 1635 1126 1239 1810 #> 4 2013 9 20 1139 1845 1014 1457 2210 #> 5 2013 7 22 845 1600 1005 1044 1815 #> 6 2013 4 10 1100 1900 960 1342 2211 #> # … with 336,770 more rows, and 11 more variables: arr_delay <dbl>, #> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>, #> # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

The most delayed flight was HA 51, JFK to HNL, which was scheduled to leave on January 09, 2013 09:00. Note that the departure time is given as 641, which seems to be less than the scheduled departure time. But the departure was delayed 1,301 minutes, which is 21 hours, 41 minutes. The departure time is the day after the scheduled departure time. Be happy that you weren’t on that flight, and if you happened to have been on that flight and are reading this, I’m sorry for you.

Similarly, the earliest departing flight can be found by sorting dep_delay in ascending order.

arrange(flights, dep_delay) #> # A tibble: 336,776 x 19 #> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time #> <int> <int> <int> <int> <int> <dbl> <int> <int> #> 1 2013 12 7 2040 2123 -43 40 2352 #> 2 2013 2 3 2022 2055 -33 2240 2338 #> 3 2013 11 10 1408 1440 -32 1549 1559 #> 4 2013 1 11 1900 1930 -30 2233 2243 #> 5 2013 1 29 1703 1730 -27 1947 1957 #> 6 2013 8 9 729 755 -26 1002 955 #> # … with 336,770 more rows, and 11 more variables: arr_delay <dbl>, #> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>, #> # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Flight B6 97 (JFK to DEN) scheduled to depart on December 07, 2013 at 21:23 departed 43 minutes early.

Sort flights to find the fastest flights.

There are actually two ways to interpret this question: one that can be solved by using arrange(), and a more complex interpretation that requires creation of a new variable using mutate(), which we haven’t seen demonstrated before.

The colloquial interpretation of “fastest” flight can be understood to mean “the flight with the shortest flight time”. We can use arrange to sort our data by the air_time variable to find the shortest flights:

head(arrange(flights, air_time)) #> # A tibble: 6 x 19 #> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time #> <int> <int> <int> <int> <int> <dbl> <int> <int> #> 1 2013 1 16 1355 1315 40 1442 1411 #> 2 2013 4 13 537 527 10 622 628 #> 3 2013 12 6 922 851 31 1021 954 #> 4 2013 2 3 2153 2129 24 2247 2224 #> 5 2013 2 5 1303 1315 -12 1342 1411 #> 6 2013 2 12 2123 2130 -7 2211 2225 #> # … with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, #> # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, #> # hour <dbl>, minute <dbl>, time_hour <dttm>

Another definition of the “fastest flight” is the flight with the highest average ground speed. The ground speed is not included in the data, but it can be calculated from the distance and air_time of the flight.

head(arrange(flights, desc(distance / air_time))) #> # A tibble: 6 x 19 #> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time #> <int> <int> <int> <int> <int> <dbl> <int> <int> #> 1 2013 5 25 1709 1700 9 1923 1937 #> 2 2013 7 2 1558 1513 45 1745 1719 #> 3 2013 5 13 2040 2025 15 2225 2226 #> 4 2013 3 23 1914 1910 4 2045 2043 #> 5 2013 1 12 1559 1600 -1 1849 1917 #> 6 2013 11 17 650 655 -5 1059 1150 #> # … with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, #> # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, #> # hour <dbl>, minute <dbl>, time_hour <dttm>

Which flights traveled the longest? Which traveled the shortest?

To find the longest flight, sort the flights by the distance column in descending order.

arrange(flights, desc(distance)) #> # A tibble: 336,776 x 19 #> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time #> <int> <int> <int> <int> <int> <dbl> <int> <int> #> 1 2013 1 1 857 900 -3 1516 1530 #> 2 2013 1 2 909 900 9 1525 1530 #> 3 2013 1 3 914 900 14 1504 1530 #> 4 2013 1 4 900 900 0 1516 1530 #> 5 2013 1 5 858 900 -2 1519 1530 #> 6 2013 1 6 1019 900 79 1558 1530 #> # … with 336,770 more rows, and 11 more variables: arr_delay <dbl>, #> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>, #> # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

The longest flight is HA 51, JFK to HNL, which is 4,983 miles.

To find the shortest flight, sort the flights by the distance in ascending order, which is the default sort order.

arrange(flights, distance) #> # A tibble: 336,776 x 19 #> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time #> <int> <int> <int> <int> <int> <dbl> <int> <int> #> 1 2013 7 27 NA 106 NA NA 245 #> 2 2013 1 3 2127 2129 -2 2222 2224 #> 3 2013 1 4 1240 1200 40 1333 1306 #> 4 2013 1 4 1829 1615 134 1937 1721 #> 5 2013 1 4 2128 2129 -1 2218 2224 #> 6 2013 1 5 1155 1200 -5 1241 1306 #> # … with 336,770 more rows, and 11 more variables: arr_delay <dbl>, #> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>, #> # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

The shortest flight is US 1632, EWR to LGA, which is only 17 miles. This is a flight between two of the New York area airports. However, since this flight is missing a departure time, it either did not actually fly or there is a problem with the data.

The terms “longest” and “shortest” could also refer to the time of the flight instead of the distance. In that case, the longest and shortest flights can be found by sorting by the air_time column. The longest flights by air time are the following.

arrange(flights, desc(air_time)) #> # A tibble: 336,776 x 19 #> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time #> <int> <int> <int> <int> <int> <dbl> <int> <int> #> 1 2013 3 17 1337 1335 2 1937 1836 #> 2 2013 2 6 853 900 -7 1542 1540 #> 3 2013 3 15 1001 1000 1 1551 1530 #> 4 2013 3 17 1006 1000 6 1607 1530 #> 5 2013 3 16 1001 1000 1 1544 1530 #> 6 2013 2 5 900 900 0 1555 1540 #> # … with 336,770 more rows, and 11 more variables: arr_delay <dbl>, #> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>, #> # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

The shortest flights by air time are the following.

arrange(flights, air_time) #> # A tibble: 336,776 x 19 #> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time #> <int> <int> <int> <int> <int> <dbl> <int> <int> #> 1 2013 1 16 1355 1315 40 1442 1411 #> 2 2013 4 13 537 527 10 622 628 #> 3 2013 12 6 922 851 31 1021 954 #> 4 2013 2 3 2153 2129 24 2247 2224 #> 5 2013 2 5 1303 1315 -12 1342 1411 #> 6 2013 2 12 2123 2130 -7 2211 2225 #> # … with 336,770 more rows, and 11 more variables: arr_delay <dbl>, #> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>, #> # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Brainstorm as many ways as possible to select dep_time, dep_delay, arr_time, and arr_delay from flights.

These are a few ways to select columns; a combined sketch of the approaches appears after this list.

  • Specify column names as unquoted variable names.

  • Specify column names as strings.

  • Specify the column numbers of the variables.

    This works, but is not good practice for two reasons. First, the column location of variables may change, resulting in code that may continue to run without error, but produce the wrong answer. Second, the code is obfuscated, since it is not clear from the code which variables are being selected. What variable does column 6 correspond to? I just wrote the code, and I’ve already forgotten.

  • Specify the names of the variables with a character vector and any_of() or all_of().

    This is useful because the names of the variables can be stored in a variable and passed to all_of() or any_of().

    These two functions replace the deprecated function one_of().

  • Selecting the variables by matching the start of their names using starts_with().

  • Selecting the variables using regular expressions with matches(). Regular expressions provide a flexible way to match string patterns and are discussed in the Strings chapter.

  • Specify the names of the variables with a character vector and use the bang-bang operator (!!).

    This and the following answers use the features of tidy evaluation not covered in R4DS but covered in the Programming with dplyr vignette.

  • Specify the names of the variables in a character or list vector and use the bang-bang-bang operator.

  • Specify the unquoted names of the variables in a list using syms() and use the bang-bang-bang operator.
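A combined sketch of these approaches (assuming, as elsewhere in this document, that the tidyverse and nycflights13 packages are loaded; the vars helper below is my own, introduced for illustration):

# Unquoted variable names and quoted strings
select(flights, dep_time, dep_delay, arr_time, arr_delay)
select(flights, "dep_time", "dep_delay", "arr_time", "arr_delay")

# Column positions (fragile; shown only to illustrate)
select(flights, 4, 6, 7, 9)

# A character vector combined with all_of() or any_of()
vars <- c("dep_time", "dep_delay", "arr_time", "arr_delay")
select(flights, all_of(vars))
select(flights, any_of(vars))

# Matching the start of the names, or a regular expression
select(flights, starts_with("dep_"), starts_with("arr_"))
select(flights, matches("^(dep|arr)_(time|delay)$"))

# Tidy evaluation: unquote a character vector, or splice a list of symbols
select(flights, !!vars)
select(flights, !!!rlang::syms(vars))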


What happens if you include the name of a variable multiple times in a select() call?

The select() call ignores the duplication. Any duplicated variables are only included once, in the first location they appear. The select() function does not raise an error or warning or print any message if there are duplicated variables.
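For example:

select(flights, year, month, day, year, year)  # returns year, month, and day once each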

This behavior is useful because it means that we can use select() with everything() in order to easily change the order of columns without having to specify the names of all the columns.

select(flights, arr_delay, everything()) #> # A tibble: 336,776 x 19 #> arr_delay year month day dep_time sched_dep_time dep_delay arr_time #> <dbl> <int> <int> <int> <int> <int> <dbl> <int> #> 1 11 2013 1 1 517 515 2 830 #> 2 20 2013 1 1 533 529 4 850 #> 3 33 2013 1 1 542 540 2 923 #> 4 -18 2013 1 1 544 545 -1 1004 #> 5 -25 2013 1 1 554 600 -6 812 #> 6 12 2013 1 1 554 558 -4 740 #> # … with 336,770 more rows, and 11 more variables: sched_arr_time <int>, #> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>, #> # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

What does the one_of() function do? Why might it be helpful in conjunction with this vector?

The one_of() function selects variables using a character vector of names rather than unquoted variable name arguments. This function is useful because it is easier to programmatically generate character vectors of variable names than to generate unquoted variable names, which are easier to type interactively.

In the most recent versions of dplyr, one_of has been deprecated in favor of two functions: all_of() and any_of(). These functions behave similarly if all variables are present in the data frame.

These functions differ in their strictness. The function all_of() will raise an error if one of the variable names is not present, while any_of() will ignore it.

The deprecated function one_of() will raise a warning if an unknown column is encountered.

In the most recent versions of dplyr, the one_of() function is less necessary due to new behavior in the selection functions. The select() function can now accept the name of a vector containing the variable names you wish to select:
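A minimal sketch of what such a call might look like (the vars vector here is hypothetical; recent versions of tidyselect accept an external vector but print a note recommending all_of()):

vars <- c("dep_time", "dep_delay", "arr_time", "arr_delay")
select(flights, vars)  # newer tidyselect versions may suggest all_of(vars) instead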

However, there is a problem with the previous code. The name vars could refer either to a column named vars in flights or to a separate variable named vars. What the code does depends on whether or not vars is a column in flights: if it were, select() would select only that column.

Since vars is not a column in flights, select() uses the value of the vars variable and selects those columns. To make the code unambiguous regardless of the column names in the data frame, wrap the character vector in all_of(), or use the !!! (bang-bang-bang) operator.

This behavior, which is used by many tidyverse functions, is an example of what is called non-standard evaluation (NSE) in R. See the dplyr vignette, Programming with dplyr, for more information on this topic.

Does the result of running the following code surprise you? How do the select helpers deal with case by default? How can you change that default?

The default behavior for contains() is to ignore case. This may or may not surprise you. If this behavior does not surprise you, that could be why it is the default. Users searching for variable names probably have a better sense of the letters in the variable than their capitalization. A second, technical, reason is that dplyr works with more than R data frames. It can also work with a variety of databases. Some of these database engines have case insensitive column names, so making functions that match variable names case insensitive by default will make the behavior of select() consistent regardless of whether the table is stored as an R data frame or in a database.

To change this behavior, add the argument ignore.case = FALSE.
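For instance:

select(flights, contains("TIME"))                       # matches dep_time, sched_dep_time, and so on, ignoring case
select(flights, contains("TIME", ignore.case = FALSE))  # matches nothing, since no column name contains uppercase TIME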

Currently dep_time and sched_dep_time are convenient to look at, but hard to compute with because they’re not really continuous numbers. Convert them to a more convenient representation of number of minutes since midnight.

To get the departure time in minutes since midnight, take the hour digits of dep_time, multiply them by 60, and add the minute digits. For example, 1504 represents 15:04 (or 3:04 PM), which is 904 minutes after midnight. To generalize this approach, we need a way to split the hour digits from the minute digits. Dividing by 100 and discarding the remainder with the integer division operator, %/%, gives us the hour.

Instead of %/%, we could also use / along with trunc() or floor(), but round() would not work. To get the minutes, instead of discarding the remainder of the division by 100, we only want the remainder. So we use the modulo operator, %%, discussed in the Other Useful Functions section.

Now, we can combine the hours (multiplied by 60 to convert them to minutes) and minutes to get the number of minutes after midnight.

1504 %/% 100 * 60 + 1504 %% 100 #> [1] 904

There is one remaining issue. Midnight is represented by 2400, which would correspond to 1440 minutes since midnight, but it should correspond to 0. After converting all the times to minutes after midnight, x %% 1440 will convert 1440 to zero while keeping all the other times the same.
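For example:

(2400 %/% 100 * 60 + 2400 %% 100) %% 1440
#> [1] 0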

Now we will put it all together. The following code creates a new data frame flights_times with columns dep_time_mins and sched_dep_time_mins. These columns convert dep_time and sched_dep_time, respectively, to minutes since midnight.
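A sketch of that data frame, using the integer division and modulo operations described above (the column names follow the prose):

flights_times <- mutate(flights,
  dep_time_mins = (dep_time %/% 100 * 60 + dep_time %% 100) %% 1440,
  sched_dep_time_mins = (sched_dep_time %/% 100 * 60 + sched_dep_time %% 100) %% 1440
)
select(flights_times, dep_time, dep_time_mins, sched_dep_time, sched_dep_time_mins)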

Looking ahead to the Functions chapter, this is precisely the sort of situation in which it would make sense to write a function to avoid copying and pasting code. We could define a function time2mins(), which converts a vector of times from the format used in flights to minutes since midnight.

time2mins <- function(x) { (x %/% 100 * 60 + x %% 100) %% 1440 }

Using time2mins, the previous code simplifies to the following.
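Something along these lines:

flights_times <- mutate(flights,
  dep_time_mins = time2mins(dep_time),
  sched_dep_time_mins = time2mins(sched_dep_time)
)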

Compare air_time with arr_time - dep_time. What do you expect to see? What do you see? What do you need to do to fix it?

I expect that air_time is the difference between the arrival (arr_time) and departure times (dep_time). In other words, air_time = arr_time - dep_time.

To check this relationship, I’ll first need to convert the times to a form more amenable to arithmetic operations, using the same calculations as the previous exercise.
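A sketch of that conversion (the flights_airtime and air_time_diff names are my own, chosen to match the code that follows):

flights_airtime <- mutate(flights,
  dep_time = (dep_time %/% 100 * 60 + dep_time %% 100) %% 1440,
  arr_time = (arr_time %/% 100 * 60 + arr_time %% 100) %% 1440,
  air_time_diff = air_time - arr_time + dep_time
)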

So, does air_time = arr_time - dep_time? If so, there should be no flights with non-zero values of air_time_diff.

nrow(filter(flights_airtime, air_time_diff != 0)) #> [1] 327150

It turns out that there are many flights for which air_time != arr_time - dep_time. Other than data errors, I can think of two reasons why air_time would not equal arr_time - dep_time.

  1. The flight passes midnight, so arr_time < dep_time. In these cases, the difference between air_time and arr_time - dep_time will be off by 24 hours (1,440 minutes).

  2. The flight crosses time zones, so the difference will be off by multiples of 60 minutes. All flights in flights departed from New York City and are domestic flights in the US, which means they all fly to the same or more westerly time zones. Given the time zones in the US, the differences due to time zones should be 60 minutes (Central), 120 minutes (Mountain), 180 minutes (Pacific), 240 minutes (Alaska), or 300 minutes (Hawaii).

Both of these explanations have clear patterns that I would expect to see if they were true. In particular, in both cases, since time zones and crossing midnight only affect the hour part of the time, all values of air_time_diff should be divisible by 60. I’ll visually check this hypothesis by plotting the distribution of air_time_diff. If those two explanations are correct, the distribution of air_time_diff should comprise only spikes at multiples of 60.
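A sketch of that plot:

ggplot(flights_airtime, aes(x = air_time_diff)) +
  geom_histogram(binwidth = 1)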

This is not the case. While the distribution of air_time_diff has modes at multiples of 60, as hypothesized, it also shows many flights for which the difference between air_time and arr_time - dep_time is not divisible by 60.

Let’s also look at flights with Los Angeles as a destination. The discrepancy should be 180 minutes.
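A sketch, filtering to dest == "LAX":

ggplot(filter(flights_airtime, dest == "LAX"), aes(x = air_time_diff)) +
  geom_histogram(binwidth = 1)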


To fix these time-zone issues, I would want to convert all the times to date-times to handle overnight flights, and from local time to a common time zone, most likely UTC, to handle flights crossing time zones. The tzone column of nycflights13::airports gives the time zone of each airport. See the “Dates and Times” chapter for an introduction to working with date and time data.

But that still leaves the other differences unexplained. So what else might be going on? There seem to be too many problems for this to be data entry error, so I’m probably missing something. I’ll reread the documentation to make sure that I understand the definitions of arr_time, dep_time, and air_time. The documentation contains a link to the source of the flights data, https://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236.

That source includes taxi and wheels-off/wheels-on fields that are not included in the nycflights13 data. It appears that the air_time variable refers to flight time, which is defined as the time between wheels-off (take-off) and wheels-on (landing); it does not include time spent on the runway taxiing to and from the gates. With this new understanding of the data, I now know that the relationship between air_time, arr_time, and dep_time is air_time <= arr_time - dep_time, assuming that arr_time and dep_time are measured in the same time zone.

Compare dep_time, sched_dep_time, and dep_delay. How would you expect those three numbers to be related?

I would expect the departure delay (dep_delay) to be equal to the difference between the actual departure time (dep_time) and the scheduled departure time (sched_dep_time): dep_delay = dep_time - sched_dep_time.

As with the previous question, the first step is to convert all times to the number of minutes since midnight. The column dep_delay_diff is the difference between the dep_delay column and the departure delay calculated directly from the scheduled and actual departure times.
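A sketch of that calculation (the column names below match the output that follows):

flights_deptime <- mutate(flights,
  dep_time_min = (dep_time %/% 100 * 60 + dep_time %% 100) %% 1440,
  sched_dep_time_min = (sched_dep_time %/% 100 * 60 + sched_dep_time %% 100) %% 1440,
  dep_delay_diff = dep_delay - dep_time_min + sched_dep_time_min
)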

Does dep_delay_diff equal zero for all rows?

filter(flights_deptime, dep_delay_diff != 0) #> # A tibble: 1,236 x 22 #> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time #> <int> <int> <int> <int> <int> <dbl> <int> <int> #> 1 2013 1 1 848 1835 853 1001 1950 #> 2 2013 1 2 42 2359 43 518 442 #> 3 2013 1 2 126 2250 156 233 2359 #> 4 2013 1 3 32 2359 33 504 442 #> 5 2013 1 3 50 2145 185 203 2311 #> 6 2013 1 3 235 2359 156 700 437 #> # … with 1,230 more rows, and 14 more variables: arr_delay <dbl>, #> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>, #> # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>, #> # dep_time_min <dbl>, sched_dep_time_min <dbl>, dep_delay_diff <dbl>

No. Unlike the last question, time zones are not an issue since we are only considering departure times. However, the discrepancies could be because a flight was scheduled to depart before midnight, but was delayed until after midnight. All of these discrepancies are exactly equal to 1440 (24 hours), and the flights with these discrepancies were scheduled to depart later in the day.

Thus, the only cases in which the departure delay does not equal the difference between the scheduled and actual departure times are due to a quirk in how these columns were stored.

Find the 10 most delayed flights using a ranking function. How do you want to handle ties? Carefully read the documentation for min_rank().

The dplyr package provides multiple functions for ranking, which differ in how they handle tied values: row_number(), min_rank(), dense_rank(). To see how they work, let’s create a data frame with duplicate values in a vector and see how ranking functions handle ties.

rankme <- tibble( x = c(10, 5, 1, 5, 5) )

The function row_number() assigns each element a unique value. The result is equivalent to the index (or row) number of each element after sorting the vector, hence its name.

The min_rank() and dense_rank() functions assign tied values the same rank, but differ in how they assign values to the next rank. For each set of tied values, the min_rank() function assigns a rank equal to the number of values less than that tied value plus one. In contrast, the dense_rank() function assigns a rank equal to the number of distinct values less than that tied value plus one. To see the difference between dense_rank() and min_rank(), compare the value of rankme$x_min_rank and rankme$x_dense_rank for x = 10.
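A sketch applying all three ranking functions to the toy data (the new column names are my own):

rankme <- mutate(rankme,
  x_row_number = row_number(x),
  x_min_rank = min_rank(x),
  x_dense_rank = dense_rank(x)
)
arrange(rankme, x)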

If I had to choose one for presenting rankings to someone else, I would use min_rank() since its results correspond to the most common usage of rankings in sports or other competitions. In the code below, I use all three functions, but since there are no ties in the top 10 flights, the results don’t differ.

flights_delayed <- mutate(flights, dep_delay_min_rank = min_rank(desc(dep_delay)), dep_delay_row_number = row_number(desc(dep_delay)), dep_delay_dense_rank = dense_rank(desc(dep_delay)) ) flights_delayed <- filter(flights_delayed, !(dep_delay_min_rank > 10 | dep_delay_row_number > 10 | dep_delay_dense_rank > 10)) flights_delayed <- arrange(flights_delayed, dep_delay_min_rank) print(select(flights_delayed, month, day, carrier, flight, dep_delay, dep_delay_min_rank, dep_delay_row_number, dep_delay_dense_rank), n = Inf) #> # A tibble: 10 x 8 #> month day carrier flight dep_delay dep_delay_min_r… dep_delay_row_n… #> <int> <int> <chr> <int> <dbl> <int> <int> #> 1 1 9 HA 51 1301 1 1 #> 2 6 15 MQ 3535 1137 2 2 #> 3 1 10 MQ 3695 1126 3 3 #> 4 9 20 AA 177 1014 4 4 #> 5 7 22 MQ 3075 1005 5 5 #> 6 4 10 DL 2391 960 6 6 #> 7 3 17 DL 2119 911 7 7 #> 8 6 27 DL 2007 899 8 8 #> 9 7 22 DL 2047 898 9 9 #> 10 12 5 AA 172 896 10 10 #> # … with 1 more variable: dep_delay_dense_rank <int>

In addition to the functions covered here, the rank() function provides several more ways of ranking elements.

There are other ways to solve this problem that do not use ranking functions. To select the top 10, sort values with arrange() and select the top rows with slice():
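For example:

flights_delayed2 <- arrange(flights, desc(dep_delay))
flights_delayed2 <- slice(flights_delayed2, 1:10)
select(flights_delayed2, month, day, carrier, flight, dep_delay)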

Alternatively, we could use top_n().
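For example (note that top_n() keeps all rows tied with the tenth value):

flights_delayed3 <- top_n(flights, 10, dep_delay)
flights_delayed3 <- arrange(flights_delayed3, desc(dep_delay))
select(flights_delayed3, month, day, carrier, flight, dep_delay)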

The approach using arrange() and slice() will always select exactly 10 rows, even if there are ties. Ranking functions, and top_n() (which is a shortcut built on min_rank()), provide more control over how tied values are handled: they can return all rows with the 10 largest values of dep_delay rather than exactly 10 rows. If there are no ties, these approaches are equivalent. If there are ties, then which is more appropriate depends on the use.

What does 1:3 + 1:10 return? Why?

The code given in the question returns the following.
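# running the expression from the question
1:3 + 1:10
#> Warning in 1:3 + 1:10: longer object length is not a multiple of shorter object length
#> [1]  2  4  6  5  7  9  8 10 12 11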

This is equivalent to the following.
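# 1:3 recycled to length 10 and added element-wise
c(1 + 1, 2 + 2, 3 + 3, 1 + 4, 2 + 5, 3 + 6, 1 + 7, 2 + 8, 3 + 9, 1 + 10)
#> [1]  2  4  6  5  7  9  8 10 12 11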

When adding two vectors, R recycles the shorter vector’s values to create a vector of the same length as the longer vector. The code also raises a warning that the length of the longer vector is not a multiple of the length of the shorter vector. The warning is raised because, when this occurs, the recycling is often unintended and may indicate a bug.

What trigonometric functions does R provide?

All trigonometric functions are described in a single help page, named Trig. You can open the documentation for these functions with ?Trig or by using ? with any of the functions, for example, ?sin.

R provides functions for the three primary trigonometric functions: sine (sin()), cosine (cos()), and tangent (tan()). The input angles to all these functions are in radians.
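For example:

sin(pi / 2)
#> [1] 1
cos(pi)
#> [1] -1
tan(pi / 4)
#> [1] 1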

In the previous code, I used the variable pi. R provides the variable pi, which is set to the value of the mathematical constant \(\pi\).

Although R provides the pi variable, there is nothing preventing a user from changing its value. For example, I could redefine pi to 3.14 or any other value.

pi <- 3.14 pi #> [1] 3.14 pi <- "Apple" pi #> [1] "Apple"

For that reason, if you are using the built-in pi variable in computations and are paranoid, you may want to always reference it as base::pi.

In the previous code block, since the angles were in radians, I wrote them as \(\pi\) times some number. Since it is often easier to write radians as multiples of \(\pi\), R provides some convenience functions that do just that. The function sinpi(x) is equivalent to sin(pi * x). The functions cospi() and tanpi() are similarly defined for the cos and tan functions, respectively.

R also provides the inverse functions arc-cosine (acos()), arc-sine (asin()), and arc-tangent (atan()).

Finally, R provides the function atan2(). Calling atan2(y, x) returns the angle between the x-axis and the vector from (0,0) to (x, y).

atan2(c(1, 0, -1, 0), c(0, 1, 0, -1)) #> [1] 1.57 0.00 -1.57 3.14

Brainstorm at least 5 different ways to assess the typical delay characteristics of a group of flights. Consider the following scenarios:

  • A flight is 15 minutes early 50% of the time, and 15 minutes late 50% of the time.
  • A flight is always 10 minutes late.
  • A flight is 30 minutes early 50% of the time, and 30 minutes late 50% of the time.
  • 99% of the time a flight is on time. 1% of the time it’s 2 hours late.

Which is more important: arrival delay or departure delay?

What this question gets at is a fundamental question of data analysis: the cost function. As analysts, the reason we are interested in flight delays is that they are costly to passengers. It is worth thinking carefully about how they are costly and using that information in ranking and measuring these scenarios.

In many scenarios, arrival delay is more important. In most cases, arriving late is more costly to the passenger since it could disrupt the next stages of their travel, such as connecting flights or scheduled meetings. If a departure is delayed without affecting the arrival time, the delay does not disrupt those plans, nor does it increase the total time spent traveling. Such a delay could even be beneficial, if less time is spent in the cramped confines of the airplane itself, or a negative, if that delayed time is still spent in the cramped confines of the airplane on the runway.

Variation in arrival time is worse than consistency. If a flight is always 30 minutes late and that delay is known, then it is as if the arrival time is that delayed time. The traveler could easily plan for this. But higher variation in flight times makes it harder to plan.

Come up with another approach that will give you the same output as not_cancelled %>% count(dest) and not_cancelled %>% count(tailnum, wt = distance) (without using count()).

not_cancelled <- flights %>% filter(!is.na(dep_delay), !is.na(arr_delay))

The first expression is the following.
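not_cancelled %>% count(dest)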

The count() function counts the number of instances within each group of variables. Instead of using the count() function, we can combine the group_by() and summarise() verbs.
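A sketch using length() to count the rows in each group:

not_cancelled %>%
  group_by(dest) %>%
  summarise(n = length(dest))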

An alternative to length() for counting the number of observations in each group is the function n().
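For example:

not_cancelled %>%
  group_by(dest) %>%
  summarise(n = n())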

Another alternative to count() is to use group_by() followed by tally(). In fact, count() is effectively a short-cut for group_by() followed by tally().

The second expression also uses the count() function, but adds a wt argument.
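not_cancelled %>% count(tailnum, wt = distance)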

As before, we can replicate count() by combining the group_by() and summarise() verbs. But this time instead of using length(), we will use sum() with the weighting variable.
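For example:

not_cancelled %>%
  group_by(tailnum) %>%
  summarise(n = sum(distance))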

Like the previous example, we can also use the combination group_by() and tally(). Any arguments to tally() are summed.

Our definition of cancelled flights (is.na(dep_delay) | is.na(arr_delay)) is slightly suboptimal. Why? Which is the most important column?

If a flight never departs, then it won’t arrive. A flight could also depart and not arrive if it crashes, or if it is redirected and lands in an airport other than its intended destination. So the most important column is arr_delay, which indicates the amount of delay in arrival.

In this data, there are flights where dep_time and arr_time are not missing but arr_delay is missing. Some further research found that these rows correspond to diverted flights. The BTS database that is the source for the flights table contains additional information for diverted flights that is not included in the nycflights13 data. The source contains a column DivArrDelay with the description:

Difference in minutes between scheduled and actual arrival time for a diverted flight reaching scheduled destination. The ArrDelay column remains NULL for all diverted flights.

Look at the number of cancelled flights per day. Is there a pattern? Is the proportion of cancelled flights related to the average delay?

One pattern in cancelled flights per day is that the number of cancelled flights increases with the total number of flights per day. The proportion of cancelled flights increases with the average delay of flights.

To answer these questions, I use the definition of cancelled flights from Section 5.6.3, is.na(dep_delay) | is.na(arr_delay), whose negation, by De Morgan’s law, is !is.na(dep_delay) & !is.na(arr_delay).

The first part of the question asks for any pattern in the number of cancelled flights per day. I’ll look at the relationship between the number of cancelled flights per day and the total number of flights in a day. There should be an increasing relationship for two reasons. First, if all flights are equally likely to be cancelled, then days with more flights should have a higher number of cancellations. Second, it is likely that days with more flights would have a higher probability of cancellations because congestion itself can cause delays and any delay would affect more flights, and large delays can lead to cancellations.
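A sketch of that calculation (the data frame and column names below are my own):

cancelled_per_day <- flights %>%
  mutate(cancelled = (is.na(arr_delay) | is.na(dep_delay))) %>%
  group_by(year, month, day) %>%
  summarise(
    cancelled_num = sum(cancelled),
    flights_num = n()
  )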

Plotting flights_num against cancelled_num shows that the number of flights cancelled increases with the total number of flights.
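For example:

ggplot(cancelled_per_day) +
  geom_point(aes(x = flights_num, y = cancelled_num))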


The second part of the question asks whether there is a relationship between the proportion of flights cancelled and the average departure delay. I implied this in my answer to the first part of the question, when I noted that increasing delays could result in increased cancellations. The question does not specify which delay, so I will show the relationship for both.

There is a strong increasing relationship between the proportion of cancelled flights and both the average departure delay and the average arrival delay.
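A sketch of both comparisons (names are my own; cancelled flights are excluded from the delay averages with na.rm = TRUE):

cancelled_and_delays <- flights %>%
  mutate(cancelled = (is.na(arr_delay) | is.na(dep_delay))) %>%
  group_by(year, month, day) %>%
  summarise(
    cancelled_prop = mean(cancelled),
    avg_dep_delay = mean(dep_delay, na.rm = TRUE),
    avg_arr_delay = mean(arr_delay, na.rm = TRUE)
  ) %>%
  ungroup()

ggplot(cancelled_and_delays) +
  geom_point(aes(x = avg_dep_delay, y = cancelled_prop))

ggplot(cancelled_and_delays) +
  geom_point(aes(x = avg_arr_delay, y = cancelled_prop))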


Which carrier has the worst delays? Challenge: can you disentangle the effects of bad airports vs. bad carriers? Why/why not? (Hint: think about flights %>% group_by(carrier, dest) %>% summarise(n()))

What airline corresponds to the "F9" carrier code?
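One sketch of how the ranking and the lookup might be done (mean arrival delay is only one reasonable metric; the F9 code belongs to Frontier Airlines in the airlines table):

flights %>%
  group_by(carrier) %>%
  summarise(arr_delay = mean(arr_delay, na.rm = TRUE)) %>%
  arrange(desc(arr_delay))

filter(airlines, carrier == "F9")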

You can get part of the way to disentangling the effects of bad airports versus bad carriers by comparing the average delay of each carrier to the average delay of flights within a route (flights from the same origin to the same destination). Comparing delays between carriers and within each route disentangles the effect of carriers and airports. A better analysis would compare the average delay of a carrier’s flights to the average delay of all other carriers’ flights within a route.

There are more sophisticated ways to do this analysis, but comparing the delay of flights within each route goes a long way toward disentangling airport and carrier effects. To see a more complete example of this analysis, see this FiveThirtyEight piece.

Refer back to the lists of useful mutate and filtering functions. Describe how each operation changes when you combine it with grouping.

Summary functions (mean()), offset functions (lead(), lag()), and ranking functions (min_rank(), row_number()) operate within each group when used with group_by() in mutate() or filter(). Arithmetic operators (+, -), logical operators (<, ==), modular arithmetic operators (%%, %/%), and logarithmic functions (log()) are not affected by group_by().

Summary functions like mean(), median(), sum(), and sd(), and others covered in the section Useful Summary Functions, calculate their values within each group when used with mutate() or filter() and group_by().

Arithmetic operators +, -, *, /, ^ are not affected by group_by().

The modular arithmetic operators %/% and %% are not affected by group_by().

The logarithmic functions log(), log2(), and log10() are not affected by group_by().

The offset functions lead() and lag() respect the groupings in group_by(). The functions lag() and lead() will only return values within each group.

The cumulative and rolling aggregate functions cumsum(), cumprod(), cummin(), cummax(), and cummean() calculate values within each group.

Logical comparisons, <, <=, >, >=, !=, and == are not affected by group_by().

Ranking functions like min_rank() work within each group when used with group_by().

Though not asked in the question, note that arrange() ignores groups when sorting values.

However, the order of values from arrange() can interact with groups when used with functions that rely on the ordering of elements, such as lead(), lag(), or cumsum().
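A small sketch with a toy tibble of my own illustrates several of these behaviors at once:

tibble(x = 1:9, group = rep(c("a", "b", "c"), each = 3)) %>%
  group_by(group) %>%
  mutate(
    x_mean = mean(x),      # the group mean, not the overall mean
    x_lag = lag(x),        # NA at the start of each group
    x_cumsum = cumsum(x),  # restarts within each group
    x_rank = min_rank(x),  # ranks computed within each group
    x_double = x * 2       # arithmetic ignores the grouping
  )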

Which plane (tailnum) has the worst on-time record?

The question does not define a way to measure on-time record, so I will consider two metrics:

  1. proportion of flights not delayed or cancelled, and
  2. mean arrival delay.

The first metric is the proportion of not-cancelled and on-time flights. I use the presence of an arrival time as an indicator that a flight was not cancelled. However, there are many planes that have never flown an on-time flight. Additionally, many of the planes that have the lowest proportion of on-time flights have only flown a small number of flights.

So, I will only consider planes that flew at least 20 flights. I chose 20 because it is a round number near the first quartile of the number of flights per plane.

The plane with the worst on-time record that flew at least 20 flights is:
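A sketch of this calculation (the on_time indicator and the 20-flight cutoff follow the prose; column names are my own):

flights %>%
  filter(!is.na(arr_delay)) %>%
  mutate(on_time = !is.na(arr_time) & (arr_delay <= 0)) %>%
  group_by(tailnum) %>%
  summarise(on_time = mean(on_time), n = n()) %>%
  filter(n >= 20) %>%
  arrange(on_time) %>%
  head(1)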

There are cases where arr_delay is missing but arr_time is not missing. I have not debugged the cause of this bad data, so these rows are dropped for the purposes of this exercise.

The second metric is the mean minutes delayed. As with the previous metric, I will only consider planes that flew at least 20 flights. A different plane has the worst on-time record when measured as average minutes delayed.
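A sketch of this version:

flights %>%
  filter(!is.na(arr_delay)) %>%
  group_by(tailnum) %>%
  summarise(arr_delay = mean(arr_delay), n = n()) %>%
  filter(n >= 20) %>%
  arrange(desc(arr_delay)) %>%
  head(1)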

For each destination, compute the total minutes of delay. For each flight, compute the proportion of the total delay for its destination.

The key to answering this question is to only include delayed flights when calculating the total delay and proportion of delay.

flights %>% filter(arr_delay > 0) %>% group_by(dest) %>% mutate( arr_delay_total = sum(arr_delay), arr_delay_prop = arr_delay / arr_delay_total ) %>% select(dest, month, day, dep_time, carrier, flight, arr_delay, arr_delay_prop) %>% arrange(dest, desc(arr_delay_prop)) #> # A tibble: 133,004 x 8 #> # Groups: dest [103] #> dest month day dep_time carrier flight arr_delay arr_delay_prop #> <chr> <int> <int> <int> <chr> <int> <dbl> <dbl> #> 1 ABQ 7 22 2145 B6 1505 153 0.0341 #> 2 ABQ 12 14 2223 B6 65 149 0.0332 #> 3 ABQ 10 15 2146 B6 65 138 0.0308 #> 4 ABQ 7 23 2206 B6 1505 137 0.0305 #> 5 ABQ 12 17 2220 B6 65 136 0.0303 #> 6 ABQ 7 10 2025 B6 1505 126 0.0281 #> # … with 132,998 more rows

There is some ambiguity in the meaning of the term flights in the question. The first example defined a flight as a row in the flights table, which is a trip by an aircraft from an airport at a particular date and time. However, flight could also refer to the flight number, which is the code a carrier uses for an airline service of a route. For example, AA1 is the flight number of the 09:00 American Airlines flight between JFK and LAX. The flight number is contained in the flights$flight column, though what is called a “flight” is a combination of the flights$carrier and flights$flight columns.

flights %>% filter(arr_delay > 0) %>% group_by(dest, origin, carrier, flight) %>% summarise(arr_delay = sum(arr_delay)) %>% group_by(dest) %>% mutate( arr_delay_prop = arr_delay / sum(arr_delay) ) %>% arrange(dest, desc(arr_delay_prop)) %>% select(carrier, flight, origin, dest, arr_delay_prop) #> `summarise()` regrouping output by 'dest', 'origin', 'carrier' (override with `.groups` argument) #> # A tibble: 8,834 x 5 #> # Groups: dest [103] #> carrier flight origin dest arr_delay_prop #> <chr> <int> <chr> <chr> <dbl> #> 1 B6 1505 JFK ABQ 0.567 #> 2 B6 65 JFK ABQ 0.433 #> 3 B6 1191 JFK ACK 0.475 #> 4 B6 1491 JFK ACK 0.414 #> 5 B6 1291 JFK ACK 0.0898 #> 6 B6 1195 JFK ACK 0.0208 #> # … with 8,828 more rows

Delays are typically temporally correlated: even once the problem that caused the initial delay has been resolved, later flights are delayed to allow earlier flights to leave. Using lag() explore how the delay of a flight is related to the delay of the immediately preceding flight.

This calculates the departure delay of the preceding flight from the same airport.
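A sketch of that calculation (the lagged_delays name and the sort order are my own choices; flights are ordered within each origin by date and departure time):

lagged_delays <- flights %>%
  arrange(origin, month, day, dep_time) %>%
  group_by(origin) %>%
  mutate(dep_delay_lag = lag(dep_delay)) %>%
  filter(!is.na(dep_delay), !is.na(dep_delay_lag))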

This plots the mean departure delay of a flight against the delay of the immediately preceding flight. For delays of less than two hours, the relationship between the delay of the preceding flight and the current flight is nearly linear. After that, the relationship becomes more variable, as long-delayed flights are interspersed with flights leaving on time. After about 8 hours, a delayed flight is likely to be followed by a flight leaving on time.
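A sketch of that plot:

lagged_delays %>%
  group_by(dep_delay_lag) %>%
  summarise(dep_delay_mean = mean(dep_delay)) %>%
  ggplot(aes(x = dep_delay_lag, y = dep_delay_mean)) +
  geom_point() +
  labs(x = "Previous flight's departure delay (minutes)",
       y = "Mean departure delay (minutes)")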


The overall relationship looks similar in all three origin airports.


Look at each destination. Can you find flights that are suspiciously fast? (i.e. flights that represent a potential data entry error). Compute the air time of a flight relative to the shortest flight to that destination. Which flights were most delayed in the air?

When calculating this answer we should only compare flights within the same (origin, destination) pair.

To find unusual observations, we need to first put them on the same scale. I will standardize values by subtracting the mean from each and then dividing each by the standard deviation. \[ \mathsf{standardized}(x) = \frac{x - \mathsf{mean}(x)}{\mathsf{sd}(x)} . \] A standardized variable is often called a \(z\)-score. The units of the standardized variable are standard deviations from the mean. This will put the flight times from different routes on the same scale. The larger the magnitude of the standardized variable for an observation, the more unusual the observation is. Flights with negative values of the standardized variable are faster than the mean flight for that route, while those with positive values are slower than the mean flight for that route.

I add 1 to the denominator to avoid dividing by zero when the standard deviation of a route’s air times is zero. Note that the ungroup() here is not strictly necessary. However, I will be using this data frame later. Through experience, I have found that I have fewer bugs when I keep a data frame grouped only for those verbs that need it. If I did not ungroup() this data frame, the arrange() used later would not work as expected. It is better to err on the side of calling ungroup() even when it is unnecessary.
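A sketch of the standardization (the names are my own; the + 1 in the denominator reflects the note above):

standardized_flights <- flights %>%
  filter(!is.na(air_time)) %>%
  group_by(origin, dest) %>%
  mutate(
    air_time_mean = mean(air_time),
    air_time_sd = sd(air_time),
    air_time_standard = (air_time - air_time_mean) / (air_time_sd + 1)
  ) %>%
  ungroup()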

The distribution of the standardized air times has a long right tail.


Unusually fast flights are those flights with the smallest standardized values.

I used width = Inf to ensure that all columns will be printed.
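For example:

standardized_flights %>%
  arrange(air_time_standard) %>%
  select(carrier, flight, origin, dest, month, day,
         air_time, air_time_mean, air_time_standard) %>%
  head(10) %>%
  print(width = Inf)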

The fastest flight is DL1499 from LGA to ATL which departed on 2013-05-25 at 17:09. It has an air time of 65 minutes, compared to an average flight time of 114 minutes for its route. This is 4.6 standard deviations below the average flight on its route.

It is important to note that this does not necessarily imply that there was a data entry error. We should check these flights to see whether there was some reason for the difference. It may be that we are missing some piece of information that explains these unusual times.

A potential issue with the way we standardized the flights is that the mean and standard deviation are themselves sensitive to outliers, and outliers are exactly what we are looking for. Instead of standardizing with the mean and standard deviation, we could use the median as a measure of central tendency and the interquartile range (IQR) as a measure of spread; the median and IQR are more resistant to outliers.
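A sketch of the robust version (again adding 1 to the denominator to guard against routes with an IQR of zero; that choice is mine):

standardized_flights2 <- flights %>%
  filter(!is.na(air_time)) %>%
  group_by(origin, dest) %>%
  mutate(
    air_time_median = median(air_time),
    air_time_iqr = IQR(air_time),
    air_time_standard = (air_time - air_time_median) / (air_time_iqr + 1)
  ) %>%
  ungroup()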

The distribution of the standardized air times using this new definition also has a long right tail of slow flights.


Unusually fast flights are those flights with the smallest standardized values.

All of these answers have relied only on using a distribution of comparable observations to find unusual observations. In this case, the comparable observations were flights from the same origin to the same destination. Apart from our knowledge that flights from the same origin to the same destination should have similar air times, we have not used any other domain-specific knowledge.

But we know much more about this problem. The most obvious piece of knowledge we have is that flights cannot travel back in time, so there should never be a flight with a negative air time. We also know that aircraft have maximum speeds. While different aircraft have different cruising speeds, commercial airliners typically cruise at air speeds around 547–575 mph. Calculating the ground speed of aircraft is complicated by the influence of winds, especially jet streams, on the ground speed of flights. A strong tailwind can increase the ground speed of an aircraft by 200 mph; for example, in 2018 a transatlantic flight traveled at 770 mph due to a strong jet-stream tailwind (the retired Concorde aside, such speeds are exceptional). This means that any flight traveling at speeds greater than 800 mph is implausible, and it may be worth checking flights traveling at greater than 600 or 700 mph. Ground speed could also be used to identify aircraft flying implausibly slowly.

Joining the flights data with the aircraft type in the planes table and getting information about the typical or top speeds of those aircraft could provide a more detailed way to identify implausibly fast or slow flights. Additional data on high-altitude wind speeds at the time of the flight would help further.

Knowing the substance of the data analysis at hand is one of the most important tools of a data scientist. The tools of statistics are a complement, not a substitute, for that knowledge.

With that in mind, let’s plot the distribution of the ground speed of flights. The modal flight in this data has a ground speed of between 400 and 500 mph, the distribution has a long left tail of slower flights below 400 mph that make up the majority of flights, and there are very few flights with a ground speed over 500 mph.
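A sketch of that plot (ground speed in mph is distance / air_time * 60):

ggplot(flights, aes(x = distance / air_time * 60)) +
  geom_histogram(binwidth = 10) +
  labs(x = "Ground speed (mph)")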


The fastest flight is the same one identified as the largest outlier earlier. Its ground speed was 703 mph. This is fast for a commercial jet, but not impossible.

One explanation for unusually fast flights is that they are “making up time” in the air by flying faster. Commercial aircraft do not fly at their top speed since the airlines are also concerned about fuel consumption. But, if a flight is delayed on the ground, it may fly faster than usual in order to avoid a late arrival. So, I would expect that some of the unusually fast flights were delayed on departure.

flights %>% mutate(mph = distance / (air_time / 60)) %>% arrange(desc(mph)) %>% select( origin, dest, mph, year, month, day, dep_time, flight, carrier, dep_delay, arr_delay ) #> # A tibble: 336,776 x 11 #> origin dest mph year month day dep_time flight carrier dep_delay #> <chr> <chr> <dbl> <int> <int> <int> <int> <int> <chr> <dbl> #> 1 LGA ATL 703. 2013 5 25 1709 1499 DL 9 #> 2 EWR MSP 650. 2013 7 2 1558 4667 EV 45 #> 3 EWR GSP 648 2013 5 13 2040 4292 EV 15 #> 4 EWR BNA 641. 2013 3 23 1914 3805 EV 4 #> 5 LGA PBI 591. 2013 1 12 1559 1902 DL -1 #> 6 JFK SJU 564 2013 11 17 650 315 DL -5 #> # … with 336,770 more rows, and 1 more variable: arr_delay <dbl>

Five of the top ten flights had departure delays, and three of those were able to make up that time in the air and arrive ahead of schedule.

Overall, there were a few flights that seemed unusually fast, but they all fall into the realm of plausibility and likely are not data entry problems. [Ed. Please correct me if I am missing something]

The second part of the question asks us to compare flights to the fastest flight on a route to find the flights most delayed in the air. I will calculate the amount a flight is delayed in the air in two ways. The first is the absolute delay, defined as the number of minutes longer than the fastest flight on that route, air_time - min(air_time). The second is the relative delay, which is the percentage increase in air time relative to the time of the fastest flight along that route, (air_time - min(air_time)) / min(air_time) * 100.
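A sketch of both measures (the column names air_time_min, air_time_delay, and air_time_delay_pct match the output shown below):

air_time_delayed <- flights %>%
  group_by(origin, dest) %>%
  mutate(
    air_time_min = min(air_time, na.rm = TRUE),
    air_time_delay = air_time - air_time_min,
    air_time_delay_pct = air_time_delay / air_time_min * 100
  )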

The most delayed flight in air in minutes was DL841 from JFK to SFO which departed on 2013-07-28 at 17:27. It took 189 minutes longer than the flight with the shortest air time on its route.

air_time_delayed %>% arrange(desc(air_time_delay)) %>% select( air_time_delay, carrier, flight, origin, dest, year, month, day, dep_time, air_time, air_time_min ) %>% head() %>% print(width = Inf) #> # A tibble: 6 x 11 #> # Groups: origin, dest [5] #> air_time_delay carrier flight origin dest year month day dep_time air_time #> <dbl> <chr> <int> <chr> <chr> <int> <int> <int> <int> <dbl> #> 1 189 DL 841 JFK SFO 2013 7 28 1727 490 #> 2 165 DL 426 JFK LAX 2013 11 22 1812 440 #> 3 163 AA 575 JFK EGE 2013 1 28 1806 382 #> 4 147 DL 17 JFK LAX 2013 7 10 1814 422 #> 5 145 UA 745 LGA DEN 2013 9 10 1513 331 #> 6 143 UA 587 EWR LAS 2013 11 22 2142 399 #> air_time_min #> <dbl> #> 1 301 #> 2 275 #> 3 219 #> 4 275 #> 5 186 #> 6 256

The most delayed flight in air as a percentage of the fastest flight along that route was US2136 from LGA to BOS departing on 2013-06-17 at 16:52. It took 410% longer than the flight with the shortest air time on its route.

air_time_delayed %>% arrange(desc(air_time_delay)) %>% select( air_time_delay_pct, carrier, flight, origin, dest, year, month, day, dep_time, air_time, air_time_min ) %>% head() %>% print(width = Inf) #> # A tibble: 6 x 11 #> # Groups: origin, dest [5] #> air_time_delay_pct carrier flight origin dest year month day dep_time #> <dbl> <chr> <int> <chr> <chr> <int> <int> <int> <int> #> 1 62.8 DL 841 JFK SFO 2013 7 28 1727 #> 2 60 DL 426 JFK LAX 2013 11 22 1812 #> 3 74.4 AA 575 JFK EGE 2013 1 28 1806 #> 4 53.5 DL 17 JFK LAX 2013 7 10 1814 #> 5 78.0 UA 745 LGA DEN 2013 9 10 1513 #> 6 55.9 UA 587 EWR LAS 2013 11 22 2142 #> air_time air_time_min #> <dbl> <dbl> #> 1 490 301 #> 2 440 275 #> 3 382 219 #> 4 422 275 #> 5 331 186 #> 6 399 256

Find all destinations that are flown by at least two carriers. Use that information to rank the carriers.

To restate this question, we are asked to rank airlines by the number of destinations that they fly to, considering only those airports that are flown to by two or more airlines. There are two steps to calculating this ranking. First, find all airports serviced by two or more carriers. Then, rank carriers by the number of those destinations that they service.
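A sketch of those two steps:

flights %>%
  # keep only destinations served by at least two carriers
  group_by(dest) %>%
  filter(n_distinct(carrier) >= 2) %>%
  # then rank carriers by the number of those destinations they serve
  group_by(carrier) %>%
  summarise(n_dest = n_distinct(dest)) %>%
  arrange(desc(n_dest))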

The carrier "EV" flies to the most destinations, considering only airports flown to by two or more carriers. What airline does the "EV" carrier code correspond to?

Unless you know the airline industry, it is likely that you don’t recognize ExpressJet; I certainly didn’t. It is a regional airline that partners with major airlines to fly from hubs (larger airports) to smaller airports. This means that many of the shorter flights of major carriers are operated by ExpressJet. This business model explains why ExpressJet services the most destinations.

Among the airlines that fly to only one destination from New York are Alaska Airlines and Hawaiian Airlines.

For each plane, count the number of flights before the first delay of greater than 1 hour.

The question does not specify arrival or departure delay. I consider dep_delay in this answer, though similar code could be used for arr_delay.
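A sketch using a grouped cumulative sum (column names are my own; flights are ordered within each plane by date and departure time):

flights %>%
  select(tailnum, year, month, day, dep_time, dep_delay) %>%
  filter(!is.na(dep_delay)) %>%
  arrange(tailnum, year, month, day, dep_time) %>%
  group_by(tailnum) %>%
  mutate(delay_gt_1hr = dep_delay > 60) %>%
  mutate(before_first_delay = cumsum(delay_gt_1hr) < 1) %>%
  summarise(total_flights = sum(before_first_delay))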



Explore the distribution of each of the x, y, and z variables in diamonds. What do you learn? Think about a diamond and how you might decide which dimension is the length, width, and depth.

First, I’ll calculate summary statistics for these variables and plot their distributions.
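For example:

summary(select(diamonds, x, y, z))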

ggplot(diamonds) + geom_histogram(mapping = aes(x = x), binwidth = 0.01)


ggplot(diamonds) + geom_histogram(mapping = aes(x = y), binwidth = 0.01)


ggplot(diamonds) + geom_histogram(mapping = aes(x = z), binwidth = 0.01)


There are several noticeable features of the distributions:

  1. x and y are larger than z,
  2. there are outliers,
  3. they are all right skewed, and
  4. they are multimodal or “spiky”.

The typical values of x and y are larger than z, with x and y having inter-quartile ranges of 4.7–6.5, while z has an inter-quartile range of 2.9–4.0.

There are two types of outliers in this data. Some diamonds have values of zero and some have abnormally large values of x, y, or z.

These appear to be either data entry errors, or an undocumented convention in the dataset for indicating missing values. An alternative hypothesis would be that values of zero are the result of rounding values like 0.002 down, but since there are no diamonds with values of 0.01, that does not seem to be the case.
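The zero-valued observations can be listed directly:

filter(diamonds, x == 0 | y == 0 | z == 0)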

There are also some diamonds with values of y and z that are abnormally large. There are diamonds with y == 58.9 and y == 31.8, and one with z == 31.8. These are probably data errors since the values do not seem in line with the values of the other variables.

Initially, I only considered univariate outliers. However, to check the plausibility of those outliers I would informally consider how consistent their values are with the values of the other variables. In this case, scatter plots of each combination of x, y, and z shows these outliers much more clearly.

ggplot(diamonds, aes(x = x, y = y)) + geom_point()


ggplot(diamonds, aes(x = x, y = z)) + geom_point()


ggplot(diamonds, aes(x = y, y = z)) + geom_point()


Removing the outliers from x, y, and z makes the distribution easier to see. The right skewness of these distributions is unsurprising; there should be more smaller diamonds than larger ones, and these values can never be negative. More interestingly, there are spikes in the distribution at certain values. These spikes often, but not exclusively, occur near integer values. Without knowing more about diamond cutting, I can’t say more about what these spikes represent. If you know, add a comment. I would guess that some diamond sizes are used more often than others, and these spikes correspond to those sizes. Also, I would guess that the cut and carat of a diamond imply its x, y, and z values. Since there are spikes in the distribution of carat sizes, and only a few different cuts, that could result in these spikes. I’ll leave it to readers to figure out if that’s the case.


According to the documentation for diamonds, x is length, y is width, and z is depth. If documentation were unavailable, I would compare the values of the variables to match them to the length, width, and depth. I would expect length to always be less than width, otherwise the length would be called the width. I would also search for the definitions of length, width, and depth with respect to diamond cuts. Depth can be expressed as a percentage of the length/width of the diamond, which means it should be less than both the length and the width.

It appears that depth (z) is always smaller than length (x) or width (y), perhaps because a shallower depth helps when setting diamonds in jewelry and because of how it affects the reflection of light. Length is greater than width in less than half the observations, the opposite of my expectations.

Explore the distribution of price. Do you discover anything unusual or surprising? (Hint: Carefully think about the binwidth and make sure you try a wide range of values.)

  • The price data has many spikes, but I can’t tell what each spike corresponds to. The following plots don’t show much difference in the distributions in the last one or two digits.
  • There are no diamonds with a price of $1,500 (between $1,455 and $1,545, inclusive).
  • There’s a bulge in the distribution around $750.
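Sketches of the plots referred to above (the binwidths are my own choices):

ggplot(filter(diamonds, price < 2500), aes(x = price)) +
  geom_histogram(binwidth = 10, center = 0)

ggplot(diamonds, aes(x = price)) +
  geom_histogram(binwidth = 100, center = 0)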


The last digits of prices are often not uniformly distributed. They are often round, ending in 0 or 5 (for one-half). Another common pattern is ending in 99, as in $1999. If we plot the distribution of the last one and two digits of prices do we observe patterns like that?
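A sketch using the modulo operator to extract the last one and two digits:

diamonds %>%
  mutate(ending = price %% 10) %>%
  ggplot(aes(x = ending)) +
  geom_histogram(binwidth = 1, center = 0)

diamonds %>%
  mutate(ending = price %% 100) %>%
  ggplot(aes(x = ending)) +
  geom_histogram(binwidth = 1, center = 0)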


What happens to missing values in a histogram? What happens to missing values in a bar chart? Why is there a difference?

Missing values are removed when the number of observations in each bin is calculated. See the warning message: Removed 9 rows containing non-finite values (stat_bin).
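For example, using a diamonds2 data frame in which implausible y values have been replaced with NA (as in the chapter; the exact construction here is my own sketch):

diamonds2 <- diamonds %>%
  mutate(y = ifelse(y < 3 | y > 20, NA, y))

ggplot(diamonds2, aes(x = y)) +
  geom_histogram(binwidth = 0.5)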


In the geom_bar() function, NA is treated as another category. The x aesthetic in geom_bar() requires a discrete (categorical) variable, and missing values act like another category.
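A sketch that introduces missing values into a categorical variable and plots them (the 10% missing rate is arbitrary):

diamonds %>%
  mutate(cut = if_else(runif(n()) < 0.1, NA_character_, as.character(cut))) %>%
  ggplot() +
  geom_bar(mapping = aes(x = cut))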


In a histogram, the x aesthetic variable needs to be numeric, and stat_bin() groups the observations by ranges into bins. Since the numeric value of the NA observations is unknown, they cannot be placed in a particular bin, and are dropped.

What variable in the diamonds dataset is most important for predicting the price of a diamond? How is that variable correlated with cut? Why does the combination of those two relationships lead to lower quality diamonds being more expensive?

What are the general relationships of each variable with the price of the diamonds? I will consider the variables: carat, clarity, color, and cut. I ignore the dimensions of the diamond since carat measures size, and thus incorporates most of the information contained in these variables.

Since both price and carat are continuous variables, I use a scatter plot to visualize their relationship.

ggplot(diamonds, aes(x = carat, y = price)) + geom_point()

However, since there are a large number of points in the data, I will instead use a boxplot, binning carat as suggested in the chapter:
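For example, with a bin width of 0.1 carat:

ggplot(data = diamonds, mapping = aes(x = carat, y = price)) +
  geom_boxplot(mapping = aes(group = cut_width(carat, 0.1)))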

Note that the choice of the binning width is important: if it were too large, it would obscure any relationship, and if it were too small, the values in the bins could be too variable to reveal underlying trends.

Version 3.3.0 of ggplot2 introduced changes to boxplots that may affect the orientation.

This geom treats each axis differently and, thus, can thus have two orientations. Often the orientation is easy to deduce from a combination of the given mappings and the types of positional scales in use. Thus, ggplot2 will by default try to guess which orientation the layer should have. Under rare circumstances, the orientation is ambiguous and guessing may fail

If you are getting something different with your code, check the version of ggplot2. Use orientation = "x" (vertical boxplots) or orientation = "y" (horizontal boxplots) to explicitly specify how the geom should treat these axes.

The variables color and clarity are ordered categorical variables. The chapter suggests visualizing a categorical and continuous variable using frequency polygons or boxplots. In this case, I will use a box plot since it will better show a relationship between the variables.

There is a weak negative relationship between color and price. The scale of diamond color goes from D (best) to J (worst). Currently, the levels of diamonds$color are in the wrong order. Before plotting, I will reverse the order of the color levels so they will be in increasing order of quality on the x-axis. The color column is an example of a factor variable, which is covered in the “Factors” chapter of R4DS.
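A sketch, reversing the factor levels with forcats::fct_rev():

diamonds %>%
  mutate(color = fct_rev(color)) %>%
  ggplot(aes(x = color, y = price)) +
  geom_boxplot()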


There is also a weak negative relationship between clarity and price. The scale of clarity goes from I1 (worst) to IF (best).


For both clarity and color, there is a much larger amount of variation within each category than between categories. Carat is clearly the single best predictor of diamond prices.

Now that we have established that carat appears to be the best predictor of price, what is the relationship between it and cut? Since this is an example of a continuous (carat) and categorical (cut) variable, it can be visualized with a box plot.

ggplot(diamonds, aes(x = cut, y = carat)) + geom_boxplot()


There is a lot of variability in the distribution of carat sizes within each cut category. There is a slight negative relationship between carat and cut. Noticeably, the largest carat diamonds have a cut of “Fair” (the lowest).

This negative relationship can be due to the way in which diamonds are selected for sale. A larger diamond can be profitably sold with a lower quality cut, while a smaller diamond requires a better cut.

Install the ggstance package, and create a horizontal box plot. How does this compare to using coord_flip()?

Earlier, we created this horizontal box plot of the distribution of hwy by class, using geom_boxplot() and coord_flip():
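ggplot(data = mpg) +
  geom_boxplot(mapping = aes(x = class, y = hwy)) +
  coord_flip()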


In this case the output looks the same, but x and y aesthetics are flipped.
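A sketch with ggstance (assuming the package is installed):

library(ggstance)

ggplot(data = mpg) +
  geom_boxploth(mapping = aes(y = class, x = hwy))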


Current versions of ggplot2 (since version 3.3.0) do not require coord_flip(). All geoms can choose their orientation, which is inferred from the aesthetic mapping. In this case, switching x and y produces a horizontal boxplot.


The orientation argument is used to explicitly specify the axis orientation of the plot.


One problem with box plots is that they were developed in an era of much smaller datasets and tend to display a prohibitively large number of “outlying values”. One approach to remedy this problem is the letter value plot. Install the lvplot package, and try using geom_lv() to display the distribution of price vs cut. What do you learn?

How do you interpret the plots?

Like box-plots, the boxes of the letter-value plot correspond to quantiles. However, they incorporate far more quantiles than box-plots. They are useful for larger datasets because,

  1. larger datasets can give precise estimates of quantiles beyond the quartiles, and
  2. in expectation, larger datasets should have more outliers (in absolute numbers).

library("lvplot")

ggplot(diamonds, aes(x = cut, y = price)) + geom_lv()

The letter-value plot is described in Hofmann, Wickham, and Kafadar (2017).

Compare and contrast geom_violin() with a faceted geom_histogram(), or a colored geom_freqpoly(). What are the pros and cons of each method?

I produce plots for these three methods below. The geom_freqpoly() is better for look-up: given a price, it is easy to tell which cut has the highest density. However, the overlapping lines make it difficult to see how the overall distributions relate to each other. The geom_violin() and faceted geom_histogram() have similar strengths and weaknesses: it is easy to visually distinguish differences in the overall shape of the distributions (skewness, central values, variance, etc.), but since we can’t easily compare the vertical values of the distributions, it is difficult to look up which category has the highest density for a given price. All of these methods depend on tuning parameters that determine the level of smoothness of the distribution.
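Sketches of the three approaches; the binwidths and facet layout are illustrative choices, not the only reasonable ones:

# Colored frequency polygon: plot density rather than count so cuts are comparable
ggplot(data = diamonds, mapping = aes(x = price, y = after_stat(density))) +
  geom_freqpoly(mapping = aes(color = cut), binwidth = 500)

# Faceted histogram: one panel per cut, with free y scales
ggplot(data = diamonds, mapping = aes(x = price)) +
  geom_histogram(binwidth = 500) +
  facet_wrap(~cut, ncol = 1, scales = "free_y")

# Violin plot of price within each cut
ggplot(data = diamonds, mapping = aes(x = cut, y = price)) +
  geom_violin() +
  coord_flip()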

The violin plot was first described in Hintze and Nelson (1998).

If you have a small dataset, it’s sometimes useful to use geom_jitter() to see the relationship between a continuous and categorical variable. The ggbeeswarm package provides a number of methods similar to geom_jitter(). List them and briefly describe what each one does.

There are two methods:

  • geom_quasirandom() produces plots that are a mix of jitter and violin plots. There are several different methods that determine exactly how the random location of the points is generated.
  • geom_beeswarm() produces a plot similar to a violin plot, but does so by offsetting the individual points so they do not overlap.

I’ll use the mpg box plot example. Since these methods display individual points, they are better suited for smaller datasets.
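Minimal sketches of both geoms (reordering class by median hwy is an assumption; geom_quasirandom() also has a method argument that controls how the offsets are generated):

library("ggbeeswarm")

# jitter-like offsets that approximate the density of each group
ggplot(data = mpg) +
  geom_quasirandom(mapping = aes(x = reorder(class, hwy, FUN = median), y = hwy))

# points packed side by side so they do not overlap
ggplot(data = mpg) +
  geom_beeswarm(mapping = aes(x = reorder(class, hwy, FUN = median), y = hwy))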

How could you rescale the count dataset above to more clearly show the distribution of cut within color, or color within cut?

To clearly show the distribution of cut within color, calculate a new variable prop which is the proportion of each cut within a color. This is done using a grouped mutate.
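A sketch of the grouped mutate, assuming dplyr is loaded as elsewhere in these solutions; the viridis fill scale and the limits argument (discussed below) are illustrative choices:

diamonds %>%
  count(color, cut) %>%
  group_by(color) %>%                      # proportion of each cut within a color
  mutate(prop = n / sum(n)) %>%
  ggplot(aes(x = color, y = cut)) +
  geom_tile(aes(fill = prop)) +
  scale_fill_viridis_c(limits = c(0, 1))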

Similarly, to scale by the distribution of color within cut,
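group by cut instead of color (same assumptions as the previous sketch):

diamonds %>%
  count(color, cut) %>%
  group_by(cut) %>%                        # proportion of each color within a cut
  mutate(prop = n / sum(n)) %>%
  ggplot(aes(x = color, y = cut)) +
  geom_tile(aes(fill = prop)) +
  scale_fill_viridis_c(limits = c(0, 1))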

I add limits = c(0, 1) to put the color scale between 0 and 1, the logical boundaries of proportions. This makes it possible to compare each cell to its actual value and would improve comparisons across multiple plots. However, it also restricts the range of colors used, which makes differences within this dataset harder to see. Using the default limits (the minimum and maximum observed values) emphasizes relative differences within the dataset, but makes comparisons across datasets harder.

Use geom_tile() together with dplyr to explore how average flight delays vary by destination and month of year. What makes the plot difficult to read? How could you improve it?
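One possible sketch, assuming the flights table from nycflights13, dplyr for the summary, and mean departure delay as the delay measure:

library("nycflights13")

flights %>%
  group_by(month, dest) %>%
  summarise(dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
  ggplot(aes(x = factor(month), y = dest, fill = dep_delay)) +
  geom_tile() +
  labs(x = "Month", y = "Destination", fill = "Departure delay")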

There are several things that could be done to improve it,

  • sort destinations by a meaningful quantity (distance, number of flights, average delay)
  • remove missing values

How to treat missing values is difficult. In this case, missing values correspond to airports which don’t have regular flights (at least one flight each month) from NYC. These are likely smaller airports (with higher variance in their average due to fewer observations). When we group all pairs of (month, dest) again by dest, we should have a total count of 12 (one for each month) per group (dest). This makes it easy to filter.
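A sketch of these improvements, keeping only destinations with flights in all 12 months and ordering destinations by their mean delay (the ordering variable is one of several reasonable choices):

flights %>%
  group_by(month, dest) %>%
  summarise(dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
  group_by(dest) %>%
  filter(n() == 12) %>%                         # keep destinations observed in every month
  ungroup() %>%
  mutate(dest = reorder(dest, dep_delay)) %>%   # sort destinations by average delay
  ggplot(aes(x = factor(month), y = dest, fill = dep_delay)) +
  geom_tile() +
  labs(x = "Month", y = "Destination", fill = "Departure delay")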

Why is it slightly better to use aes(x = color, y = cut) rather than aes(x = cut, y = color) in the example above?

Instead of summarizing the conditional distribution with a box plot, you could use a frequency polygon. What do you need to consider when using cut_width() vs cut_number()? How does that impact a visualization of the 2d distribution of carat and price?

Both cut_width() and cut_number() split a variable into groups. When using cut_width(), we need to choose the width, and the number of bins will be calculated automatically. When using cut_number(), we need to specify the number of bins, and the widths will be calculated automatically.

In either case, we want to choose the bin widths and number to be large enough to aggregate observations to remove noise, but not so large as to remove all the signal.

If categorical colors are used, no more than eight colors should be used in order to keep them distinct. Using cut_number(), I will split carat into five groups with equal numbers of observations (quintiles).
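A sketch with cut_number(); the binwidth of the frequency polygon is an illustrative choice:

ggplot(diamonds, aes(x = price, color = cut_number(carat, 5))) +
  geom_freqpoly(binwidth = 500) +
  labs(x = "Price", y = "Count", color = "Carat")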

Alternatively, I could use cut_width() to specify the widths at which to cut. I will choose 1-carat widths. Since there are very few diamonds larger than 2 carats, this is not as informative. However, using a width of 0.5 carats creates too many groups, and splitting at non-whole numbers is unappealing.
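The cut_width() version, again with an illustrative binwidth:

ggplot(diamonds, aes(x = price, color = cut_width(carat, 1, boundary = 0))) +
  geom_freqpoly(binwidth = 500) +
  labs(x = "Price", y = "Count", color = "Carat")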

Visualize the distribution of carat, partitioned by price.

Plotted with box plots for 10 bins, each containing an equal number of observations, so the bin widths in price are determined by the data.
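A sketch, binning price into ten groups of equal size with cut_number():

ggplot(diamonds, aes(x = cut_number(price, 10), y = carat)) +
  geom_boxplot() +
  coord_flip() +
  xlab("Price")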

Plotted with box plots for 10 equal-width bins of $2,000. The argument boundary = 0 ensures that the first bin is $0–$2,000.
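A sketch of the equal-width version:

ggplot(diamonds, aes(x = cut_width(price, 2000, boundary = 0), y = carat)) +
  geom_boxplot() +
  coord_flip() +
  xlab("Price")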

How does the price distribution of very large diamonds compare to small diamonds? Is it as you expect, or does it surprise you?

The distribution of very large diamonds is more variable. I am not surprised, since I knew little about diamond prices; after the fact, it does not seem surprising (as is true of many things in hindsight). I would guess that this is due to the way in which diamonds are selected for retail sale. Suppose that someone selling a diamond only finds it profitable to sell it if some combination of size, cut, clarity, and color is above a certain threshold. The smallest diamonds are only profitable to sell if they are exceptional on all the other factors (cut, clarity, and color), so the small diamonds sold have similar characteristics. However, larger diamonds may be profitable regardless of the values of the other factors. Thus we will observe large diamonds with a wider variety of cut, clarity, and color, and hence more variability in prices.

Combine two of the techniques you’ve learned to visualize the combined distribution of cut, carat, and price.

There are many options to try, so your solutions may vary from mine. Here are a few options that I tried.
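Two sketches that combine binned carat with boxplots of price and cut (the five-group binning is an illustrative choice):

# price by carat quintile, with a box for each cut within each bin
ggplot(diamonds, aes(x = cut_number(carat, 5), y = price, color = cut)) +
  geom_boxplot() +
  xlab("Carat")

# price by cut, with a box for each carat quintile within each cut
ggplot(diamonds, aes(x = cut, y = price, color = cut_number(carat, 5))) +
  geom_boxplot() +
  labs(color = "Carat")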

Two-dimensional plots reveal outliers that are not visible in one-dimensional plots. For example, some points in the plot below have an unusual combination of x and y values, which makes the points outliers even though their x and y values appear normal when examined separately.