Missing Data
In this post, we will review the types of missing data and methods for dealing with it. Data is often missing or wrong, for a variety of reasons:
- Measurement gauges break down.
- Recording errors occur during transmission, data entry, and storage.
- Human interaction causes a variety of challenges, such as incorrect inputs or withheld or unavailable data (e.g. medical history).
This post does not deal with censored data. Censored data occurs when a data point has not yet been fully observed. For example, if you are studying a component’s reliability, it may be unrealistic to run all the components to failure. If a test is suspended without the component failing, then its actual failure point is censored: you only know the lower bound of the component’s lifetime. Bayesian approaches are particularly effective when censored data occurs in time-to-event applications such as reliability theory, survival theory (healthcare outcomes), and geoscientific predictions (when will Mount Rainier erupt next?).
Types of missing data
Missing Completely At Random (MCAR): Missingness does not depend on observed or unobserved data. There are no systematic differences between what is observed and what is not.
- Easy to deal with
- Ignorable missingness
Missing At Random (MAR): Missingness depends only on the observed data. MAR occurs when the missingness is not completely random (there are systematic differences between what is observed and what is not) but can be fully accounted for by variables for which there is complete information.
- Easy to deal with
- Ignorable missingness
- The missing data is correlated with other data in the data set. For example, perhaps it’s dangerous (to us) to measure the bills of large and aggressive penguins, so that variable may be missing or inaccurately measured for those types of penguins.
Missing Not At Random (MNAR): Neither MCAR nor MAR holds; missingness may depend on the data that is missing (there are systematic differences between what is observed and what is not) and the causes are not accounted for. Usually, MNAR indicates that the situations in which missingness occurs depend on hidden or unobserved causes. This is the most dangerous and difficult type of missingness.
- Difficult to deal with
- Non-ignorable missingness
- Which values are missing may depend on other factors such as data collection design, reporting biases, and selection biases.
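The three mechanisms can be made concrete with a small simulation. The following is a minimal sketch in base R on made-up data, not the penguin data: the variable names, the sample size, and the missingness probabilities are all assumptions chosen for illustration. It removes values of one variable under each mechanism and compares the mean of what is left with the true mean.

```r
set.seed(42)
n <- 10000

# Hypothetical toy data: body mass is always observed, bill length may go missing
body_mass   <- rnorm(n, mean = 4200, sd = 800)
bill_length <- 44 + 0.004 * (body_mass - 4200) + rnorm(n, sd = 3)

# Probability that bill length is missing under each mechanism (assumed values)
p_mcar <- rep(0.3, n)                       # constant: ignores all data
p_mar  <- plogis((body_mass - 4200) / 400)  # depends on the observed body mass
p_mnar <- plogis((bill_length - 44) / 2)    # depends on the missing value itself

# Replace each value with NA with the given probability
drop_values <- function(x, p) ifelse(runif(length(x)) < p, NA, x)

bill_mcar <- drop_values(bill_length, p_mcar)
bill_mar  <- drop_values(bill_length, p_mar)
bill_mnar <- drop_values(bill_length, p_mnar)

# Mean of the values that remain, next to the mean of the full data
observed_means <- c(
  truth = mean(bill_length),
  mcar  = mean(bill_mcar, na.rm = TRUE),
  mar   = mean(bill_mar,  na.rm = TRUE),
  mnar  = mean(bill_mnar, na.rm = TRUE)
)
round(observed_means, 2)
```

In this setup the MCAR mean should stay close to the truth, while the MAR and MNAR means are pulled downward because larger values are more likely to be dropped. Under MAR the gap is explained by body mass, which is fully observed; under MNAR the observed variables cannot fully explain it.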
Patterns in missing data
Some factors are more likely to be missing than others (e.g. because they are difficult to collect or are not routinely collected).
Other factors are more likely to be missing for more complex reasons (e.g. a reluctance to report income levels, a radar gun used for speeds outside its operating range, or a death date that cannot be recorded for living patients). This results in bias and must be accounted for differently.
Methods for handling missing data
Ideally, missing data has been discussed, planned for, and mitigated well in advance of the modeling stage. However, missing data is a reality for data scientists even in the best situations. We’ll cover three common methods for handling missing data, discuss their pros and cons, and demo them in R.
- Omit missing data
- Use categorical variables to indicate missing data
- Estimate (impute) missing data
Loading and Preparing Data
I’ll be using the recipes package from tidymodels, which was discussed in a previous post, and the palmerpenguins data as our example data.
First, let’s load the data.
The data contain information for 344 penguins. There are 3 different species of penguins, collected from 3 islands in the Palmer Archipelago, Antarctica. It includes size measurements for adult foraging penguins near Palmer Station, Antarctica.
# features available
penguins |> str()
tibble [344 x 7] (S3: tbl_df/tbl/data.frame)
$ species : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
$ island : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
$ bill_length_mm : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
$ bill_depth_mm : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
$ flipper_length_mm: int [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
$ body_mass_g : int [1:344] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
$ sex : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
The data contains 333 complete cases; the remaining 11 rows contain 19 missing values in total.
# count NA by variable
species island bill_length_mm bill_depth_mm
0 0 2 2
flipper_length_mm body_mass_g sex
2 2 11
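Counts like these can be computed in base R. The snippet below is a minimal sketch on a hypothetical five-row data frame (`toy` is made up for illustration and is not the penguin data); it is not necessarily the code that produced the output above.

```r
# Hypothetical toy data frame standing in for a real data set
toy <- data.frame(
  species        = c("Adelie", "Adelie", "Gentoo", "Chinstrap", "Gentoo"),
  bill_length_mm = c(39.1, NA, 46.5, NA, 50.0),
  sex            = c("male", NA, "female", "female", NA)
)

# number of missing values in each column
na_by_variable <- colSums(is.na(toy))
na_by_variable

# number of rows with no missing values at all
n_complete <- sum(complete.cases(toy))
n_complete

# number of rows that na.omit() would drop
n_dropped <- nrow(toy) - nrow(na.omit(toy))
n_dropped
```

Here one row is missing two values at once, which is why the number of dropped rows is smaller than the total number of missing values. The same effect explains why 19 missing values in the penguin data affect only 11 rows.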
Since I want full control over the data, I’m going to drop the 11 rows that contain the 19 missing values, split the data into test and training sets, then intentionally set up multiple missing data scenarios for comparison.
# remove NAs from penguins
data <- penguins |> na.omit()
# Put 3/4 of the data into the training set
data_split <- initial_split(data, prop = 3/4, strata = species)
# Create data frames for the two sets
data_train <- training(data_split)
data_test <- testing(data_split)
Set up missing data
# no missing
data_train_full <- data_train
# missing data
data_train_missing <- data_train
# randomly drop 20% of island
data_train_missing$island[sample(1:nrow(data_train_missing), 49)] <- NA
# randomly drop 10% of bill length and depth
data_train_missing$bill_length_mm[sample(1:nrow(data_train_missing), 24)] <- NA
data_train_missing$bill_depth_mm[sample(1:nrow(data_train_missing), 24)] <- NA
# randomly drop 5% of all others
data_train_missing$flipper_length_mm[sample(1:nrow(data_train_missing), 12)] <- NA
data_train_missing$body_mass_g[sample(1:nrow(data_train_missing), 12)] <- NA
data_train_missing$sex[sample(1:nrow(data_train_missing), 12)] <- NA
# check
species island bill_length_mm bill_depth_mm
0 49 24 24
flipper_length_mm body_mass_g sex
12 12 12
# set up base recipe
rec_base <- recipe(species ~ ., data = data_train_missing)
Omit missing data
The first option is to simply omit or discard the missing data. That’s easy to implement and doesn’t introduce errors from estimated values. However, you have to weigh this against the risk of losing too many data points: many large data sets have hundreds or thousands of variables, and if you removed every data point with a missing value you could eliminate practically the entire data set. Additionally, simply removing data can leave you with a censored or biased sample when the missingness is not completely random.
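To get a feel for how quickly omission eats into wide data, here is a back-of-the-envelope calculation. It assumes, purely for illustration, that every variable is missing independently with the same probability; real data rarely behaves this neatly, so treat the numbers as a rough guide only.

```r
# Assumption for illustration: each variable is missing independently
# with the same probability p_missing
p_missing <- 0.05
n_vars <- c(1, 10, 50, 100)

# probability that a row has no missing value in any of its variables
p_complete <- (1 - p_missing)^n_vars
data.frame(n_vars, p_complete = round(p_complete, 3))
```

Under these assumptions, with 5% missingness per variable, about 60% of rows survive with 10 variables and fewer than 1% survive with 100 variables.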
Example: Omit
rec_omit <- rec_base |>
  step_naomit(all_predictors()) |>
  # train
  prep()
# apply to missing data
data_omit <- bake(rec_omit, new_data=NULL)
# we lose 111 of 249 data points if we choose to omit all the missing data!
data_omit |> nrow()
[1] 138
data_train_full |> nrow()
[1] 249
Use categorical variables to indicate missing data
The second option is to keep the rows and add a category (or an indicator column) that marks a value as missing. The missing data can be biased! To account for that, we can include interactions.
If we include interaction terms between the new categorical variable and all other variables, then essentially we’re creating two separate models: one for when there’s missing data in this variable and one for when there isn’t. So it’s really like a tree model with a single split.
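The "two separate models" claim can be checked directly. The sketch below uses ordinary least squares on made-up data as a stand-in (the post itself predicts species, so this is an analogy rather than the same model); `flag`, `x`, and `y` are hypothetical names. It fits one model with a full interaction and compares it with two models fitted separately on each group.

```r
set.seed(1)
n <- 200

# Hypothetical toy data: `flag` marks rows where some other variable was missing
flag <- rep(c(0, 1), each = n / 2)
x <- rnorm(n)
y <- ifelse(flag == 1, 2 - 1.5 * x, 1 + 0.5 * x) + rnorm(n, sd = 0.3)
toy <- data.frame(y, x, flag)

# one model with the indicator interacting with the predictor
fit_joint <- lm(y ~ flag * x, data = toy)

# two separate models, one per group
fit_observed <- lm(y ~ x, data = toy[toy$flag == 0, ])
fit_missing  <- lm(y ~ x, data = toy[toy$flag == 1, ])

# intercept and slope implied by the joint model for each group
b <- coef(fit_joint)
joint_observed <- c(b[["(Intercept)"]], b[["x"]])
joint_missing  <- c(b[["(Intercept)"]] + b[["flag"]], b[["x"]] + b[["flag:x"]])

# both comparisons should report TRUE
same_observed <- all.equal(joint_observed, unname(coef(fit_observed)))
same_missing  <- all.equal(joint_missing,  unname(coef(fit_missing)))
c(same_observed, same_missing)
```

The joint model has exactly as many free coefficients as the two separate models combined, which is why the fits coincide. The price is that each group is estimated from its own rows only, so a small "missing" group yields noisy coefficients.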
Example: Categorical value
Here, I add a "Missing" category to island.
rec_cat <- rec_base |>
  # convert to character (easier)
  step_mutate(island = as.character(island)) |>
  # change NA -> "Missing"
  step_mutate(island = ifelse(is.na(island),'Missing',island)) |>
  # convert back to factor
  step_mutate(island = as.factor(island)) |>
  # dummy step with one hot encoding
  step_dummy(island,one_hot = TRUE) |>
  # set interaction term between the missing island category and all other vars
  step_interact(terms = ~island_Missing:all_predictors()) |>
  # train
  prep()
# apply to missing data
data_cat <- bake(rec_cat, new_data=NULL)
# these are the new variables for the model
data_cat |> names()
[1] "bill_length_mm" "bill_depth_mm"
[3] "flipper_length_mm" "body_mass_g"
[5] "sex" "species"
[7] "island_Biscoe" "island_Dream"
[9] "island_Missing" "island_Torgersen"
[11] "island_Missing_x_bill_length_mm" "island_Missing_x_bill_depth_mm"
[13] "island_Missing_x_flipper_length_mm" "island_Missing_x_body_mass_g"
[15] "island_Missing_x_sexmale" "island_Missing_x_island_Biscoe"
[17] "island_Missing_x_island_Dream" "island_Missing_x_island_Torgersen"
Example: Categorical value for numeric data
For numerical values, replace each NA with 0 and add an indicator column that records which values were missing. In this example, I create a missing indicator for bill length, then create the required interaction terms.
rec_cat_num <- rec_base |>
  # set up missing category
  step_mutate(bill_length_missing = ifelse(is.na(bill_length_mm),'Yes','No')) |>
  # set bill length -> 0 if NA
  step_mutate(bill_length_mm = ifelse(is.na(bill_length_mm),0,bill_length_mm)) |>
  # convert the missing category to a factor
  step_mutate(bill_length_missing = as.factor(bill_length_missing)) |>
  # dummy step without one hot encoding (a single Yes column)
  step_dummy(bill_length_missing,one_hot = FALSE) |>
  # set interaction term between the missing category and all other vars
  step_interact(terms = ~bill_length_missing_Yes:all_predictors()) |>
  # train
  prep()
# apply to missing data
data_cat_num <- bake(rec_cat_num, new_data=NULL)
# these are the new variables for the model
data_cat_num |> names()
[1] "island"
[2] "bill_length_mm"
[3] "bill_depth_mm"
[4] "flipper_length_mm"
[5] "body_mass_g"
[6] "sex"
[7] "species"
[8] "bill_length_missing_Yes"
[9] "bill_length_missing_Yes_x_islandDream"
[10] "bill_length_missing_Yes_x_islandTorgersen"
[11] "bill_length_missing_Yes_x_bill_length_mm"
[12] "bill_length_missing_Yes_x_bill_depth_mm"
[13] "bill_length_missing_Yes_x_flipper_length_mm"
[14] "bill_length_missing_Yes_x_body_mass_g"
[15] "bill_length_missing_Yes_x_sexmale"
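Why is it safe to write 0 into a numeric column? The following sketch, on made-up data with hypothetical names, suggests an answer for the linear-regression case: as long as the indicator column is in the model, the zeros are absorbed by the indicator and the slope matches the one estimated from complete cases. Without the indicator, the zeros are treated as real measurements.

```r
set.seed(2)
n <- 300

# Hypothetical toy data with 20% of x set to NA completely at random
x <- rnorm(n, mean = 44, sd = 5)
y <- 10 + 0.8 * x + rnorm(n)
x[sample(n, 60)] <- NA

# indicator column, then replace NA with 0
x_missing <- as.integer(is.na(x))
x_filled  <- ifelse(is.na(x), 0, x)

# zero-fill with indicator vs. complete cases only
fit_flagged  <- lm(y ~ x_filled + x_missing)
fit_complete <- lm(y ~ x, subset = !is.na(x))
slope_flagged  <- coef(fit_flagged)[["x_filled"]]
slope_complete <- coef(fit_complete)[["x"]]

# zero-fill without the indicator: the zeros distort the slope
slope_naive <- coef(lm(y ~ x_filled))[["x_filled"]]

round(c(flagged = slope_flagged, complete = slope_complete, naive = slope_naive), 3)
```

The first two slopes agree, while the naive slope is pulled far toward zero. This holds for a linear model with a main-effect indicator; other model families may treat the filled-in zeros differently, so the equivalence should not be assumed in general.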
Estimate missing data
General guidelines for imputation
- Data is used twice, so it can lead to over-fitting
- Limit the amount of imputation to no more than 5% per factor
- If more than 5% is missing, use omission or categorical value methods
Approaches to imputation
- Mid-range value: use the mean or median (numeric), or the mode (categorical).
- Regression: Reduce or eliminate the problem of bias by using other factors to predict the missing value. Essentially, build a model for each factor.
- Perturbation: Accounts for bias and variability. Essentially, add a perturbation to each imputed value (e.g. shift it up or down by a random amount drawn from a normal distribution).
Method | Pro | Con |
Mid-range value |
Regression |
Perturbation |
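The three approaches can be sketched in a few lines of base R. The data below is made up (the names `x`, `y_obs`, and the coefficients are assumptions for illustration), and using the residual standard deviation of the regression as the perturbation scale is one reasonable choice among several, not a prescribed method.

```r
set.seed(3)
n <- 500

# Hypothetical toy data: y depends on x, and 10% of y is then set to NA
x <- rnorm(n, mean = 200, sd = 14)
y <- 50 * x - 5800 + rnorm(n, sd = 400)
y_obs <- y
y_obs[sample(n, 50)] <- NA
miss <- is.na(y_obs)

# 1. mid-range value: fill with the mean of the observed values
y_mean <- ifelse(miss, mean(y_obs, na.rm = TRUE), y_obs)

# 2. regression: predict the missing values from x
fit <- lm(y_obs ~ x)  # lm() drops the rows with NA by default
pred <- predict(fit, newdata = data.frame(x = x))
y_reg <- ifelse(miss, pred, y_obs)

# 3. perturbation: add normal noise scaled to the regression's residual spread
y_pert <- y_reg
y_pert[miss] <- y_reg[miss] + rnorm(sum(miss), sd = sigma(fit))

# spread of each completed variable next to the spread of the observed values
round(c(observed   = sd(y_obs, na.rm = TRUE),
        mean_fill  = sd(y_mean),
        regression = sd(y_reg),
        perturbed  = sd(y_pert)))
```

Mean filling leaves the mean unchanged but always shrinks the standard deviation, because every imputed value sits exactly at the center. Regression imputation uses the relationship with `x`, and perturbation puts back some of the scatter that a fitted line removes.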
Do we add additional error from imputation and perturbation?
Yup! Total error = imputation error + perturbation error + model error. However, regular data almost certainly has errors as well. It’s up to you as the data scientist to decide what trade-offs to make in a given situation.
There are many approaches to imputation. For example, advanced methods like multivariate imputation by chained equations (MICE) can impute multiple factor values together.
Example: Estimation (Imputation)
Let’s estimate a few of the values with different methods. Permutation sampling is covered in rsample.
rec_impute <- rec_base |>
  # impute bill length, depth and flipper length with mean
  step_impute_mean(bill_length_mm,bill_depth_mm,flipper_length_mm) |>
  # impute sex with mode
  step_impute_mode(sex) |>
  # impute body mass with linear model
  # mass ~ sex + bill_length_mm + bill_depth_mm + flipper_length_mm
  step_impute_linear(body_mass_g, impute_with = imp_vars(sex,bill_length_mm,bill_depth_mm,flipper_length_mm)) |>
  # impute island with knn
  step_impute_knn(island, neighbors = 5) |>
  # train
  prep()
# apply to missing data
data_impute <- bake(rec_impute, new_data=NULL)
# check
island bill_length_mm bill_depth_mm flipper_length_mm
0 0 0 0
body_mass_g sex species
0 0 0
Summarizing Methods
Method | Pro | Con |
Discard |
Categorical Value |
Estimate (Impute) |
