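Project author: njtierney

Project description:
Tidy data structures, summaries, and visualisations for missing data
Primary language: R
Project URL: git://github.com/njtierney/naniar.git
Created: 2015-12-10T07:11:30Z
Project community: https://github.com/njtierney/naniar

License: Other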


naniar

naniar provides principled, tidy ways to summarise, visualise, and
manipulate missing data with minimal deviations from the workflows in
ggplot2 and tidy data. It does this by providing:

  • Shadow matrices, a tidy data structure for missing data:
    • bind_shadow() and nabular()
  • Shorthand summaries for missing data:
    • n_miss() and n_complete()
    • pct_miss() and pct_complete()
  • Numerical summaries of missing data in variables and cases:
    • miss_var_summary() and miss_var_table()
    • miss_case_summary() and miss_case_table()
  • Statistical tests of missingness:
    • mcar_test()
  • Visualisation for missing data:
    • geom_miss_point()
    • gg_miss_var()
    • gg_miss_case()
    • gg_miss_fct()

For more details on the workflow and theory underpinning naniar, read
the vignette Getting started with naniar.

For a short primer on the data visualisation available in naniar, read
the vignette Gallery of Missing Data Visualisations.

For full details of the package, including

Installation

You can install naniar from CRAN:

  install.packages("naniar")

Or you can install the development version from GitHub using remotes:

  # install.packages("remotes")
  remotes::install_github("njtierney/naniar")

A short overview of naniar

Visualising missing data might sound a little strange: how do you
visualise something that is not there? One approach to visualising
missing data comes from ggobi and manet, which replace NA values
with values 10% lower than the minimum value in that variable. This
visualisation is provided with the geom_miss_point() ggplot2 geom,
which we illustrate by exploring the relationship between Ozone and
Solar radiation from the airquality dataset.

  library(ggplot2)
  ggplot(data = airquality,
         aes(x = Ozone,
             y = Solar.R)) +
    geom_point()
  #> Warning: Removed 42 rows containing missing values or values outside the scale range
  #> (`geom_point()`).

ggplot2 drops the rows with missing values from the plot and warns
about how many were removed.

We can instead use geom_miss_point() to display the missing data:

  library(naniar)
  ggplot(data = airquality,
         aes(x = Ozone,
             y = Solar.R)) +
    geom_miss_point()

geom_miss_point() has shifted the missing values to now be 10% below
the minimum value. The missing values are a different colour so that
missingness becomes pre-attentive. As it is a ggplot2 geom, it supports
faceting and other ggplot2 features:

  p1 <-
    ggplot(data = airquality,
           aes(x = Ozone,
               y = Solar.R)) +
    geom_miss_point() +
    facet_wrap(~Month, ncol = 2) +
    theme(legend.position = "bottom")
  p1
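
The shifting idea behind this kind of plot can be sketched in base R. The helper below is hypothetical (shift_missing() and its prop_below argument are not part of naniar), and it assumes that "10% below the minimum" means the minimum minus 10% of the observed range. It illustrates the idea only and is not naniar's implementation.

```r
# Hypothetical helper (not part of naniar): replace each NA with a value
# placed `prop_below` of the observed range below the observed minimum.
shift_missing <- function(x, prop_below = 0.1) {
  x_min <- min(x, na.rm = TRUE)
  x_range <- max(x, na.rm = TRUE) - x_min
  ifelse(is.na(x), x_min - prop_below * x_range, x)
}

ozone_shifted <- shift_missing(airquality$Ozone)

# Every former NA now sits below the smallest observed Ozone value,
# so it can be drawn on the same axis as the observed data.
all(ozone_shifted[is.na(airquality$Ozone)] <
      min(airquality$Ozone, na.rm = TRUE))
```

Because the shifted values fall outside the observed range, they cannot be mistaken for real measurements when plotted.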

Data Structures

naniar provides a data structure for working with missing data, the
shadow matrix (Swayne and Buja, 1998).
The shadow matrix is the same dimension as the data, and consists of
binary indicators of missingness of data values, where missing is
represented as “NA”, and not missing is represented as “!NA”. Variable
names are kept the same, with the suffix “_NA” added to each
variable.

  head(airquality)
  #>   Ozone Solar.R Wind Temp Month Day
  #> 1    41     190  7.4   67     5   1
  #> 2    36     118  8.0   72     5   2
  #> 3    12     149 12.6   74     5   3
  #> 4    18     313 11.5   62     5   4
  #> 5    NA      NA 14.3   56     5   5
  #> 6    28      NA 14.9   66     5   6
  as_shadow(airquality)
  #> # A tibble: 153 × 6
  #>    Ozone_NA Solar.R_NA Wind_NA Temp_NA Month_NA Day_NA
  #>    <fct>    <fct>      <fct>   <fct>   <fct>    <fct>
  #>  1 !NA      !NA        !NA     !NA     !NA      !NA
  #>  2 !NA      !NA        !NA     !NA     !NA      !NA
  #>  3 !NA      !NA        !NA     !NA     !NA      !NA
  #>  4 !NA      !NA        !NA     !NA     !NA      !NA
  #>  5 NA       NA         !NA     !NA     !NA      !NA
  #>  6 !NA      NA         !NA     !NA     !NA      !NA
  #>  7 !NA      !NA        !NA     !NA     !NA      !NA
  #>  8 !NA      !NA        !NA     !NA     !NA      !NA
  #>  9 !NA      !NA        !NA     !NA     !NA      !NA
  #> 10 NA       !NA        !NA     !NA     !NA      !NA
  #> # ℹ 143 more rows
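
To make the structure concrete, here is a minimal base-R sketch that builds a shadow-style data frame by hand. The function name make_shadow() is hypothetical and this is not how naniar implements as_shadow(); it only mirrors the description above (same dimensions, “NA”/“!NA” indicators, “_NA” suffix).

```r
# Hypothetical base-R sketch of a shadow matrix (not naniar's implementation).
make_shadow <- function(data) {
  # One factor per column: "NA" where the value is missing, "!NA" otherwise.
  shadow <- as.data.frame(lapply(data, function(x) {
    factor(ifelse(is.na(x), "NA", "!NA"), levels = c("!NA", "NA"))
  }))
  # Keep the original variable names and add the "_NA" suffix.
  names(shadow) <- paste0(names(data), "_NA")
  shadow
}

shadow <- make_shadow(airquality)
dim(shadow)             # same dimensions as airquality
table(shadow$Ozone_NA)  # counts of "!NA" and "NA" for Ozone
```

Storing the indicators as factors with fixed levels means every column has the same two categories, even when a variable has no missing values.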

Binding the shadow data to the data helps you keep better track of the
missing values. This format is called “nabular”, a portmanteau of NA
and tabular. You can bind the shadow to the data using bind_shadow()
or nabular():

  bind_shadow(airquality)
  #> # A tibble: 153 × 12
  #>    Ozone Solar.R  Wind  Temp Month   Day Ozone_NA Solar.R_NA Wind_NA Temp_NA
  #>    <int>   <int> <dbl> <int> <int> <int> <fct>    <fct>      <fct>   <fct>
  #>  1    41     190   7.4    67     5     1 !NA      !NA        !NA     !NA
  #>  2    36     118   8      72     5     2 !NA      !NA        !NA     !NA
  #>  3    12     149  12.6    74     5     3 !NA      !NA        !NA     !NA
  #>  4    18     313  11.5    62     5     4 !NA      !NA        !NA     !NA
  #>  5    NA      NA  14.3    56     5     5 NA       NA         !NA     !NA
  #>  6    28      NA  14.9    66     5     6 !NA      NA         !NA     !NA
  #>  7    23     299   8.6    65     5     7 !NA      !NA        !NA     !NA
  #>  8    19      99  13.8    59     5     8 !NA      !NA        !NA     !NA
  #>  9     8      19  20.1    61     5     9 !NA      !NA        !NA     !NA
  #> 10    NA     194   8.6    69     5    10 NA       !NA        !NA     !NA
  #> # ℹ 143 more rows
  #> # ℹ 2 more variables: Month_NA <fct>, Day_NA <fct>
  nabular(airquality)
  #> # A tibble: 153 × 12
  #>    Ozone Solar.R  Wind  Temp Month   Day Ozone_NA Solar.R_NA Wind_NA Temp_NA
  #>    <int>   <int> <dbl> <int> <int> <int> <fct>    <fct>      <fct>   <fct>
  #>  1    41     190   7.4    67     5     1 !NA      !NA        !NA     !NA
  #>  2    36     118   8      72     5     2 !NA      !NA        !NA     !NA
  #>  3    12     149  12.6    74     5     3 !NA      !NA        !NA     !NA
  #>  4    18     313  11.5    62     5     4 !NA      !NA        !NA     !NA
  #>  5    NA      NA  14.3    56     5     5 NA       NA         !NA     !NA
  #>  6    28      NA  14.9    66     5     6 !NA      NA         !NA     !NA
  #>  7    23     299   8.6    65     5     7 !NA      !NA        !NA     !NA
  #>  8    19      99  13.8    59     5     8 !NA      !NA        !NA     !NA
  #>  9     8      19  20.1    61     5     9 !NA      !NA        !NA     !NA
  #> 10    NA     194   8.6    69     5    10 NA       !NA        !NA     !NA
  #> # ℹ 143 more rows
  #> # ℹ 2 more variables: Month_NA <fct>, Day_NA <fct>

Using the nabular format helps you manage where missing values are in
your dataset and makes it easy to do visualisations where you split by
missingness:

  airquality %>%
    bind_shadow() %>%
    ggplot(aes(x = Temp,
               fill = Ozone_NA)) +
    geom_density(alpha = 0.5)
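
The same split can be summarised numerically. The following base-R sketch is an illustration only and does not use naniar; the variable names ozone_na, temp_n, and temp_mean are made up for this example. It compares Temp between rows where Ozone is observed and rows where it is missing.

```r
# Flag rows where Ozone is missing, using the same labels as the shadow matrix.
ozone_na <- factor(ifelse(is.na(airquality$Ozone), "NA", "!NA"),
                   levels = c("!NA", "NA"))

# Number of rows in each missingness group.
temp_n <- table(ozone_na)

# Mean temperature within each missingness group.
temp_mean <- tapply(airquality$Temp, ozone_na, mean)

temp_n
temp_mean
```

A clear difference between the two group means would suggest that Ozone missingness is related to temperature, which is the pattern the density plot is meant to reveal.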

And even visualise imputations:

  airquality %>%
    bind_shadow() %>%
    as.data.frame() %>%
    simputation::impute_lm(Ozone ~ Temp + Solar.R) %>%
    ggplot(aes(x = Solar.R,
               y = Ozone,
               colour = Ozone_NA)) +
    geom_point()
  #> Warning: Removed 7 rows containing missing values or values outside the scale range
  #> (`geom_point()`).

Or create an upset plot, which shows the combinations of missingness
across cases, using the gg_miss_upset() function:

  gg_miss_upset(airquality)

naniar does this while following consistent principles that are easy to
read, thanks to the tools of the tidyverse.

naniar also provides handy visualisations for each variable:

  gg_miss_var(airquality)

Or the number of missings in a given variable at a repeating span:

  gg_miss_span(pedestrian,
               var = hourly_counts,
               span_every = 1500)

You can read about all of the visualisations in naniar in the vignette
Gallery of missing data visualisations using naniar.

naniar also provides handy helpers for calculating the number,
proportion, and percentage of missing and complete observations:

  n_miss(airquality)
  #> [1] 44
  n_complete(airquality)
  #> [1] 874
  prop_miss(airquality)
  #> [1] 0.04793028
  prop_complete(airquality)
  #> [1] 0.9520697
  pct_miss(airquality)
  #> [1] 4.793028
  pct_complete(airquality)
  #> [1] 95.20697
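
For readers who want to see what these helpers count, the same quantities can be reproduced with base R. This is a sketch of the arithmetic only, not naniar's implementation; the variable names are made up for this example.

```r
# Logical matrix: TRUE wherever a value in airquality is NA.
is_missing <- is.na(airquality)

n_missing    <- sum(is_missing)     # number of missing values
n_observed   <- sum(!is_missing)    # number of non-missing values
prop_missing <- mean(is_missing)    # missing values / all values
pct_missing  <- 100 * prop_missing  # the same proportion as a percentage

c(n_missing = n_missing, n_observed = n_observed,
  prop_missing = prop_missing, pct_missing = pct_missing)
```

Here the counts are over all 153 × 6 = 918 cells of the data frame, which is why 44 missing values correspond to roughly 4.8%.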

Numerical summaries for missing data

naniar provides numerical summaries of missing data that follow a
consistent naming rule: every function begins with miss_. Summaries
that focus on variables, or on a single selected variable, start with
miss_var_, and summaries for cases (rows, in the order they were
originally collected) start with miss_case_. All of these functions
that return data frames also work with dplyr’s group_by().

For example, we can look at the number and percent of missings in each
variable and case with miss_var_summary() and miss_case_summary(),
which both return output ordered by the number of missing values.

  miss_var_summary(airquality)
  #> # A tibble: 6 × 3
  #>   variable n_miss pct_miss
  #>   <chr>     <int>    <num>
  #> 1 Ozone        37    24.2
  #> 2 Solar.R       7     4.58
  #> 3 Wind          0     0
  #> 4 Temp          0     0
  #> 5 Month         0     0
  #> 6 Day           0     0
  miss_case_summary(airquality)
  #> # A tibble: 153 × 3
  #>     case n_miss pct_miss
  #>    <int>  <int>    <dbl>
  #>  1     5      2     33.3
  #>  2    27      2     33.3
  #>  3     6      1     16.7
  #>  4    10      1     16.7
  #>  5    11      1     16.7
  #>  6    25      1     16.7
  #>  7    26      1     16.7
  #>  8    32      1     16.7
  #>  9    33      1     16.7
  #> 10    34      1     16.7
  #> # ℹ 143 more rows

You can also use group_by() to work out the number of missings in each
variable within each level of a grouping variable.

  library(dplyr)
  #>
  #> Attaching package: 'dplyr'
  #> The following objects are masked from 'package:stats':
  #>
  #>     filter, lag
  #> The following objects are masked from 'package:base':
  #>
  #>     intersect, setdiff, setequal, union
  airquality %>%
    group_by(Month) %>%
    miss_var_summary()
  #> # A tibble: 25 × 4
  #> # Groups:   Month [5]
  #>    Month variable n_miss pct_miss
  #>    <int> <chr>     <int>    <num>
  #>  1     5 Ozone         5     16.1
  #>  2     5 Solar.R       4     12.9
  #>  3     5 Wind          0      0
  #>  4     5 Temp          0      0
  #>  5     5 Day           0      0
  #>  6     6 Ozone        21     70
  #>  7     6 Solar.R       0      0
  #>  8     6 Wind          0      0
  #>  9     6 Temp          0      0
  #> 10     6 Day           0      0
  #> # ℹ 15 more rows
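
As a cross-check on what these summaries report, the per-variable, per-case, and per-group counts can be approximated in base R. This sketch is not how naniar computes them, and the variable names are made up for this example.

```r
# Missing values per variable, most-missing first.
var_n_miss <- sort(colSums(is.na(airquality)), decreasing = TRUE)
var_pct_miss <- 100 * var_n_miss / nrow(airquality)

# Missing values per case (row), in the original row order.
case_n_miss <- rowSums(is.na(airquality))

# Missing Ozone values within each Month (the grouped summary for one variable).
ozone_miss_by_month <- tapply(is.na(airquality$Ozone), airquality$Month, sum)

var_n_miss
head(case_n_miss)
ozone_miss_by_month
```

The per-variable percentages divide by the number of rows (153), while per-case percentages divide by the number of variables (6), which explains the 33.3 and 16.7 values in the case summary.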

You can read more about all of these functions in the vignette “Getting
Started with naniar”.

Statistical tests of missingness

naniar provides mcar_test() for Little’s (1988) statistical test for
missing completely at random (MCAR) data. The null hypothesis in this
test is that the data is MCAR, and the test statistic is a chi-squared
value. Given the high statistic value and low p-value, we can conclude
that the airquality data is not missing completely at random:

  mcar_test(airquality)
  #> # A tibble: 1 × 4
  #>   statistic    df p.value missing.patterns
  #>       <dbl> <dbl>   <dbl>            <int>
  #> 1      35.1    14 0.00142                4
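
The reported p-value is the upper-tail probability of a chi-squared distribution with the given degrees of freedom. As a rough check, it can be recomputed in base R from the rounded statistic printed above. Because the input is rounded to 35.1, the result only approximately matches the printed p-value.

```r
# Values copied from the mcar_test() output above (the statistic is rounded).
statistic <- 35.1
df <- 14

# Upper-tail probability: P(chi-squared with `df` degrees of freedom >= statistic).
p_value <- pchisq(statistic, df = df, lower.tail = FALSE)
p_value
```

A p-value this small means a statistic at least this large would be rare if the data were MCAR, which is why the null hypothesis is rejected.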

Contributions

Please note that this project is released with a Contributor Code of
Conduct. By participating in this project you agree to abide by its
terms.

Future Work

  • Extend the geom_miss_* family to include categorical variables and
    bivariate plots (scatterplots, density overlays)
  • SQL translation for databases
  • Big Data tools (sparklyr, sparklingwater)
  • Work well with other imputation engines / processes
  • Provide tools for assessing goodness of fit for classical approaches
    of MCAR, MAR, and MNAR (graphical inference from nullabor package)

Acknowledgements

Firstly, thanks to Di Cook for giving the initial inspiration for the
package and laying down the rich theory and literature that the work in
naniar is built upon. Naming credit (once again!) goes to Miles McBain.
Among various other things, Miles also worked out how to overload the
missing data and make it work as a geom. Thanks also to Colin Fay for
helping me understand tidy evaluation and for features such as
replace_to_na, miss_*_cumsum, and more.

A note on the name

naniar was previously named ggmissing and initially provided a ggplot
geom and some other visualisations. ggmissing was changed to naniar
to reflect the fact that this package is going to be bigger in scope,
and is not just related to ggplot2. Specifically, the package is
designed to provide a suite of tools for visualising missing values and
imputations, and for manipulating and summarising missing data.

…But why naniar?

Well, I think it is useful to think of missing values in data being like
this other dimension, perhaps like C.S. Lewis’s Narnia: a different
world, hidden away. You go inside, and sometimes it seems like you’ve
spent no time in there but time has passed very quickly, or the
opposite. Also, NAniar = na in r, and if you so desire, naniar may
sound like “noneoya” in an nz/aussie accent. Full credit to @MilesMcbain
for the name, and @Hadley for the rearranged spelling.