NULL values: When Missing Data is Valuable
In statistics, null values generally fall into one of two categories: values that are missing at random and values that are not. This article focuses on the second case. Sometimes a new variable that flags when another variable is missing can be just as predictive and informative as the variable itself.
When missingness is not random, there is usually an underlying explanation. The explanation may vary widely depending on the type of analysis you’re performing, but trying to understand how missing values could be representing characteristics of your underlying population will help you to avoid bias when carrying out your analysis.
Sometimes a variable may be missing due to the way that information is collected. For example, if data comes from direct contact with a constituent, they might choose not to give you their home phone number, which would result in a missing value. In this case, whether or not you have certain information about a constituent can be very telling – choosing to provide both a cell phone number and an email address, for example, might mean that they are more open to being contacted. So, while a cell phone number and an email address might be marginally useful in a model (if we focused on area code and email provider, perhaps), a variable marking the presence or absence of this information could be very useful as an indicator of connectedness or openness. By creating a binary variable to represent what information we have about a constituent, we can get an idea of when a missing value is actually valuable information in and of itself.
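As a minimal sketch of the binary variable described above, the snippet below uses pandas on a made-up constituent table. The column names (`cell_phone`, `email`, `has_cell_phone`, `has_email`, `contact_channels`) and the sample records are assumptions for illustration only, not part of any real dataset.

```python
import pandas as pd

# Hypothetical constituent records; None marks information that was not provided.
constituents = pd.DataFrame({
    "constituent_id": [1, 2, 3, 4],
    "cell_phone": ["555-0101", None, "555-0103", None],
    "email": ["a@example.org", "b@example.org", None, None],
})

# 1 if the field was provided, 0 if it is missing.
constituents["has_cell_phone"] = constituents["cell_phone"].notna().astype(int)
constituents["has_email"] = constituents["email"].notna().astype(int)

# Count of contact channels provided, as one rough proxy for openness to contact.
constituents["contact_channels"] = (
    constituents[["has_cell_phone", "has_email"]].sum(axis=1)
)

print(constituents[["constituent_id", "has_cell_phone", "has_email", "contact_channels"]])
```

The indicator columns can then be offered to a model alongside, or instead of, the raw phone and email fields.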
Another classic example of a survey variable that may not be missing at random is income. Historically, researchers have found that people with higher incomes are less likely to report them, so there could be a systematic difference between the group of people who report income and the group who self-censor. If we use the income data as-is, we carry that bias into any subsequent models. By creating a binary variable to represent the presence or absence of income data, we can test whether reporters and non-reporters differ in terms of our y-variable. This way, if there is a difference between the populations, we give the model a way to capture that difference going forward.
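One way the income check might look in pandas is sketched below. The data is invented, and the names `income_missing` and `donated` (standing in for the y-variable) are hypothetical. Comparing group means is only a first look; with real data you would follow it with a formal statistical test before drawing conclusions.

```python
import numpy as np
import pandas as pd

# Hypothetical survey data; NaN marks respondents who did not report income.
survey = pd.DataFrame({
    "income": [42000.0, np.nan, 58000.0, np.nan, 61000.0, 39000.0],
    "donated": [0, 1, 0, 1, 1, 0],  # stand-in for the y-variable
})

# Binary variable: 1 if income is missing, 0 if it was reported.
survey["income_missing"] = survey["income"].isna().astype(int)

# Compare the average of y between reporters (0) and non-reporters (1).
rate_by_group = survey.groupby("income_missing")["donated"].mean()
print(rate_by_group)
```

If the two groups differ noticeably, keeping `income_missing` as a model input lets the model account for that difference.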
Similarly, for categorical variables such as ethnicity, creating a new category like ‘unknown’ for any missing values gives you the ability to test whether the presence or absence of ethnicity data makes a difference in terms of your y-variable. Whenever there’s a chance that data isn’t missing at random, marking where it is missing is worth the effort to reduce bias in resulting models.
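For the categorical case, a small sketch under the same assumptions (pandas, invented data, placeholder category labels "A" and "B", and a hypothetical y-variable column `y`) could look like this:

```python
import pandas as pd

# Hypothetical records; None marks a missing ethnicity value.
people = pd.DataFrame({
    "ethnicity": ["A", None, "B", "A", None],
    "y": [1, 0, 1, 0, 0],  # stand-in for the y-variable
})

# Treat missingness as its own category instead of dropping the rows.
people["ethnicity"] = people["ethnicity"].fillna("unknown")

# One indicator column per category, including the new 'unknown' level.
dummies = pd.get_dummies(people["ethnicity"], prefix="ethnicity")

# Average of y within each category, 'unknown' included.
y_by_category = people.groupby("ethnicity")["y"].mean()
print(y_by_category)
```

Because 'unknown' is now an ordinary level of the variable, it can be compared against the other categories like any other group.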
Missing data, by nature, is problematic. Whenever we use historical data to gain insight, we want our sample to be as representative of the actual population as possible. By determining the source of missingness in our datasets, we can decide more accurately how to handle missing values and improve model quality. If we know that values are not missing at random, we want to build that knowledge into our dataset and resulting models as best we can.