Skip to content

Fix order of diamond color factor. #3146

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

jrnold
Copy link
Contributor

@jrnold jrnold commented Feb 16, 2019

The current color variable in the diamonds dataset is an ordered factor, but the levels are in decreasing order of color quality rather than increasing order.

library("ggplot2")
class(ggplot2::diamonds)
#> [1] "tbl_df"     "tbl"        "data.frame"
levels(ggplot2::diamonds$color)
#> [1] "D" "E" "F" "G" "H" "I" "J"

Created on 2019-02-15 by the reprex package (v0.2.1)

The factor order of diamonds$coloris currently D < ... < J. However, since "D" is the best grade and "J" the words, the levels should be J < ... < D. See https://en.wikipedia.org/wiki/Diamond_color#Grading_fancy_color_diamonds.

This pull request puts the factor levels in the correct order.

The factor order of the diamonds dataset is currently `D < ... < J`, but since "D" is the best grade, it should be `J < ... < D`.

https://en.wikipedia.org/wiki/Diamond_color#Grading_fancy_color_diamonds
jrnold added a commit to jrnold/r4ds-exercise-solutions that referenced this pull request Feb 16, 2019
Since the levels of colors were in the incorrect order (see tidyverse/ggplot2#3146), the interpretation of the relationship between color and price was backwards.

Fixes #199
@clauswilke
Copy link
Member

This PR has the potential to mess up a lot of existing examples and tutorials, so we should only merge if we're absolutely certain that this change is required. And I'm not entirely convinced. For rankings of quality, you can usually rank from worst to best or from best to worst. For example, if you submit a grant proposal to the National Institutes of Health, it will be ranked on a scale from 1 to 9, with 1 being exceptional and 9 terrible.

The current diamonds dataset ranks color from best to worst, and you're proposing to instead rank from worst to best. This may be an improvement in some cases and a regression in others. For example, the simple dplyr code below which counts the number of diamonds of different colors produces a table in the same order as the color table on Wikipedia.

library(tidyverse)

diamonds %>%
  group_by(color) %>%
  tally() %>%
  arrange(color)
#> # A tibble: 7 x 2
#>   color     n
#>   <ord> <int>
#> 1 D      6775
#> 2 E      9797
#> 3 F      9542
#> 4 G     11292
#> 5 H      8304
#> 6 I      5422
#> 7 J      2808

Created on 2019-02-16 by the reprex package (v0.2.1)

@yutannihilation
Copy link
Member

This PR has the potential to mess up a lot of existing examples and tutorials, so we should only merge if we're absolutely certain that this change is required.

Agreed. As I commented on #2962, it seems almost impossible to safely deprecate and update a dataset.

But, from this comment on the original issue (jrnold/r4ds-exercise-solutions#199), I feel this is just a documentation problem.

The levels of the ggplot2::diamonds$color are specified correctly in the documentation, but incorrectly in the variable.

Should line this be from D (best) to J (worst)?

#' \item{color}{diamond colour, from J (worst) to D (best)}

@jrnold
Copy link
Contributor Author

jrnold commented Feb 16, 2019

I think that the argument about breaking existing examples is key. So I'm going to close it.

Changing that line in the documentation may help.

Yes, quality can be ordered from best to worst or worst to best. But the ordering of color is inconsistent with that of cut and clarity. Of these, color is the only one that is ordered from best to worst. I suppose an argument for the ordering of color is that for letter grades there is prior knowledge that they tend to be ordered from best to worst, which is why it is different.

> head(diamonds$color)
[1] E E E I J J
Levels: D < E < F < G < H < I < J
> head(diamonds$clarity)
[1] SI2  SI1  VS1  VS2  SI2  VVS2
Levels: I1 < SI2 < SI1 < VS2 < VS1 < VVS2 < VVS1 < IF
> head(diamonds$cut)
[1] Ideal     Premium   Good      Premium   Good      Very Good
Levels: Fair < Good < Very Good < Premium < Ideal

@jrnold jrnold closed this Feb 16, 2019
@clauswilke
Copy link
Member

@jrnold Would you be willing to submit a PR that improves the documentation?

Put the order that the levels of `diamonds$color` are mentioned in the docs match their order in the variable.

See tidyverse#3146 (comment)
@jrnold jrnold reopened this Feb 17, 2019
@jrnold
Copy link
Contributor Author

jrnold commented Feb 17, 2019

Removed the commit with changes to thediamonds dataset. Added the doc changes suggested by @yutannihilation

@yutannihilation
Copy link
Member

Thanks, could you update the docs by running devtools::document()?

@yutannihilation
Copy link
Member

Thanks, could you update the docs by running devtools::document()?

No worries, I'll do this.

@yutannihilation yutannihilation merged commit 6be7893 into tidyverse:master Mar 7, 2019
@yutannihilation
Copy link
Member

Thanks!

@lock
Copy link

lock bot commented Sep 3, 2019

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

@lock lock bot locked and limited conversation to collaborators Sep 3, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants