A Missing Data Challenge
Be careful with missing values. I have heard advice recently that it’s often okay to just ignore missing values. Sure, sometimes…but be careful! We were recently given some data that looked like this – let’s say that it represents the number of shoppers visiting eight different retail stores over the course of a week. (I have anonymized the data.)
| Store | 5/4/2015 | 5/5/2015 | 5/6/2015 | 5/7/2015 | 5/8/2015 | 5/9/2015 | 5/10/2015 |
|---|---|---|---|---|---|---|---|
| 1 | | 1150 | 1065 | 1155 | 1091 | | 1104 |
| 2 | | | | | | | |
| 3 | 1167 | 1328 | 1189 | 1151 | 828 | 800 | 1110 |
| 4 | 2130 | 1853 | 1064 | | | | |
| 5 | 2041 | 2014 | 1461 | | 1578 | 1346 | |
| 6 | 3016 | 2699 | 2043 | 2757 | 2414 | 2268 | |
| 7 | | 1282 | 893 | 1197 | 1243 | | |
| 8 | 2752 | 2001 | | 2071 | | | 1511 |
| Average | 2221.2 | 1761.0 | 1285.8 | 1666.2 | 1430.8 | 1471.3 | 1241.7 |
Let’s say our team had developed a forecasting method and was asked to compare our results against this data. What sorts of problems could you encounter if you simply ignored the missing values? Are the averages trustworthy? If you wanted to fill in the missing values above, how would you do that? And what the heck do you do about store 2?
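One way to start on the "are the averages trustworthy?" question is to count how many stores actually contribute to each daily average. The sketch below is only an illustration, not the method our team used: the `visits` dictionary and the `column_summary` helper are names made up for this example, `None` stands in for a blank cell, and the position of each blank is an assumption inferred from the published column averages.

```python
from statistics import mean

dates = ["5/4", "5/5", "5/6", "5/7", "5/8", "5/9", "5/10"]

# None marks a missing observation. Where the blanks fall in each row is an
# assumption, chosen so that every column reproduces the "Average" row.
visits = {
    1: [None, 1150, 1065, 1155, 1091, None, 1104],
    2: [None] * 7,
    3: [1167, 1328, 1189, 1151, 828, 800, 1110],
    4: [2130, 1853, 1064, None, None, None, None],
    5: [2041, 2014, 1461, None, 1578, 1346, None],
    6: [3016, 2699, 2043, 2757, 2414, 2268, None],
    7: [None, 1282, 893, 1197, 1243, None, None],
    8: [2752, 2001, None, 2071, None, None, 1511],
}


def column_summary(visits, n_days):
    """For each day, average the observed values and count the stores reporting."""
    summary = []
    for day in range(n_days):
        observed = [row[day] for row in visits.values() if row[day] is not None]
        summary.append((round(mean(observed), 1), len(observed)))
    return summary


summary = column_summary(visits, len(dates))
for date, (avg, n) in zip(dates, summary):
    print(f"{date:>5}: average {avg:7.1f} over {n} of {len(visits)} stores")
```

Under that assumed layout, the number of stores behind each average ranges from seven on 5/5 down to three on 5/9 and 5/10, so the daily averages are not comparing like with like. The 5/4 figure, for instance, leaves out stores 1 and 7, which are among the lower-traffic stores on the days they do report, so it probably overstates a typical day.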
Data acquisition is not as fun to think about as building cool machine learning or optimization models, but it is every bit as important, and the issues are often subtle.