A Missing Data Challenge

Be careful with missing values. I have heard advice recently that it’s often okay to just ignore them. Sure, sometimes… but be careful! We were recently given some data that looked like this. Let’s say that it represents the number of shoppers visiting eight different retail stores over the course of a week. (I have anonymized the data.)

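| Store | 5/4/2015 | 5/5/2015 | 5/6/2015 | 5/7/2015 | 5/8/2015 | 5/9/2015 | 5/10/2015 |
|---|---|---|---|---|---|---|---|
| 1 | | 1150 | 1065 | 1155 | 1091 | | 1104 |
| 2 | | | | | | | |
| 3 | 1167 | 1328 | 1189 | 1151 | 828 | 800 | 1110 |
| 4 | 2130 | 1853 | 1064 | | | | |
| 5 | 2041 | 2014 | 1461 | | 1578 | 1346 | |
| 6 | 3016 | 2699 | 2043 | 2757 | 2414 | 2268 | |
| 7 | | 1282 | 893 | 1197 | 1243 | | |
| 8 | 2752 | 2001 | | 2071 | | | 1511 |
| Average | 2221.2 | 1761.0 | 1285.8 | 1666.2 | 1430.8 | 1471.3 | 1241.7 |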

Let’s say our team had developed a forecasting method and was asked to compare its results against this data. What sorts of problems could you encounter if you simply ignored the missing values? Are the averages trustworthy? If you wanted to fill in the missing values above, how would you do that? And what the heck do you do about store 2?
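One way to start poking at the "are the averages trustworthy?" question is to check which stores actually contribute to each day's average. The following is a minimal Python sketch, not a recommended treatment: it assumes blank cells mean "no value recorded" (encoded here as `None`), and the names `DAYS`, `VISITS`, and `daily_summary` are made up for illustration. Averaging only the observed values reproduces the Average row in the table, so that row appears to have been computed by simply skipping the blanks.

```python
# Visitor counts per store for 5/4/2015 through 5/10/2015.
# None marks a missing value (an assumption about what the blanks mean).
DAYS = ["5/4", "5/5", "5/6", "5/7", "5/8", "5/9", "5/10"]
VISITS = {
    1: [None, 1150, 1065, 1155, 1091, None, 1104],
    2: [None, None, None, None, None, None, None],
    3: [1167, 1328, 1189, 1151, 828, 800, 1110],
    4: [2130, 1853, 1064, None, None, None, None],
    5: [2041, 2014, 1461, None, 1578, 1346, None],
    6: [3016, 2699, 2043, 2757, 2414, 2268, None],
    7: [None, 1282, 893, 1197, 1243, None, None],
    8: [2752, 2001, None, 2071, None, None, 1511],
}


def daily_summary(visits, days):
    """For each day, return (mean of observed values, stores that reported)."""
    summary = {}
    for i, day in enumerate(days):
        # Keep only the stores that have a value on this day.
        observed = {s: row[i] for s, row in visits.items() if row[i] is not None}
        mean = sum(observed.values()) / len(observed) if observed else None
        summary[day] = (mean, sorted(observed))
    return summary


if __name__ == "__main__":
    for day, (mean, stores) in daily_summary(VISITS, DAYS).items():
        print(f"{day:>5}  mean={mean:7.1f}  n={len(stores)}  stores={stores}")
```

The thing to look at is the list of stores per day. The 5/4 average is over stores 3, 4, 5, 6, and 8, while the 5/10 average is over stores 1, 3, and 8. The busiest store in the table, store 6, drops out on the last day, so part of the apparent decline over the week comes from which stores reported rather than from any change in traffic.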

Data acquisition is not as fun to think about as building cool machine learning or optimization models, but is every bit as important, and the issues are often subtle.