A Missing Data Challenge

Be careful with missing values. I have recently heard the advice that it is often okay to simply ignore them. Sure, sometimes, but be careful! We were recently given some data that looked like the table below. Let’s say that it represents the number of shoppers visiting eight different retail stores on each day of one week. (I have anonymized the data.)

Store     5/4/2015   5/5/2015   5/6/2015   5/7/2015   5/8/2015   5/9/2015  5/10/2015
1                        1150       1065       1155       1091                  1104
2
3             1167       1328       1189       1151        828        800       1110
4             2130       1853       1064
5             2041       2014       1461                  1578       1346
6             3016       2699       2043       2757       2414       2268
7                        1282        893       1197       1243
8             2752       2001                  2071                             1511
Average     2221.2     1761.0     1285.8     1666.2     1430.8     1471.3     1241.7
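
To see what the Average row is really measuring, here is a minimal Python sketch that recomputes it from the table. The `visits` dictionary and the `daily_summary` helper are names I made up for illustration, and `None` stands in for a blank cell. The sketch averages only the stores that reported on a given day, which matches the Average row above, and it also reports how many stores went into each figure.

```python
# Shopper counts per store for 5/4/2015 through 5/10/2015, copied from the table.
# None marks a blank cell; store 2 has no data at all.
DAYS = ["5/4", "5/5", "5/6", "5/7", "5/8", "5/9", "5/10"]
visits = {
    1: [None, 1150, 1065, 1155, 1091, None, 1104],
    2: [None] * 7,
    3: [1167, 1328, 1189, 1151, 828, 800, 1110],
    4: [2130, 1853, 1064, None, None, None, None],
    5: [2041, 2014, 1461, None, 1578, 1346, None],
    6: [3016, 2699, 2043, 2757, 2414, 2268, None],
    7: [None, 1282, 893, 1197, 1243, None, None],
    8: [2752, 2001, None, 2071, None, None, 1511],
}


def daily_summary(visits, days):
    """Return {day: (mean over reporting stores, number of reporting stores)}.

    Assumes at least one store reported on every day, which holds for this table.
    """
    summary = {}
    for i, day in enumerate(days):
        present = [row[i] for row in visits.values() if row[i] is not None]
        summary[day] = (sum(present) / len(present), len(present))
    return summary


summary = daily_summary(visits, DAYS)
for day, (mean, n) in summary.items():
    print(f"{day:>5}: mean={mean:7.1f} from {n} of {len(visits)} stores")
```

The count is the interesting part: the 5/5 average is built from seven stores, while the 5/9 and 5/10 averages are each built from only three, and not the same three. Each day's average describes a different group of stores.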

Let’s say our team had developed a forecasting method and was asked to compare our results against this data. What sorts of problems could you encounter if you simply ignored the missing values? Are the averages trustworthy? If you wanted to fill in the missing values above, how would you do that? And what the heck do you do about store 2, which has no data at all?
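
As one possible starting point, and not a recommendation, here is a Python sketch of a deliberately naive fill: replace each blank with that store's own average over the days it did report. The function name `fill_with_store_mean` and the `visits` layout are my own assumptions for illustration. This approach ignores any day-of-week pattern, and it has nothing to work with for store 2, so the sketch sets such stores aside instead of inventing numbers for them.

```python
# Shopper counts per store for 5/4/2015 through 5/10/2015, copied from the table.
# None marks a blank cell.
DAYS = ["5/4", "5/5", "5/6", "5/7", "5/8", "5/9", "5/10"]
visits = {
    1: [None, 1150, 1065, 1155, 1091, None, 1104],
    2: [None] * 7,
    3: [1167, 1328, 1189, 1151, 828, 800, 1110],
    4: [2130, 1853, 1064, None, None, None, None],
    5: [2041, 2014, 1461, None, 1578, 1346, None],
    6: [3016, 2699, 2043, 2757, 2414, 2268, None],
    7: [None, 1282, 893, 1197, 1243, None, None],
    8: [2752, 2001, None, 2071, None, None, 1511],
}


def fill_with_store_mean(visits):
    """Replace each None with that store's mean over its observed days.

    Stores with no observations at all are returned separately,
    because there is nothing to estimate from.
    """
    filled, unfillable = {}, []
    for store, row in visits.items():
        observed = [v for v in row if v is not None]
        if not observed:
            unfillable.append(store)
            continue
        store_mean = sum(observed) / len(observed)
        filled[store] = [store_mean if v is None else v for v in row]
    return filled, unfillable


filled, unfillable = fill_with_store_mean(visits)

# Compare the observed-only daily mean with the mean after filling.
for i, day in enumerate(DAYS):
    observed = [row[i] for row in visits.values() if row[i] is not None]
    imputed = [row[i] for row in filled.values()]
    print(f"{day:>5}: observed-only mean={sum(observed) / len(observed):7.1f}  "
          f"store-mean-filled mean={sum(imputed) / len(imputed):7.1f}")
print("stores with no data:", unfillable)
```

Comparing the two columns of output shows how much each daily average depends on which stores happen to be missing that day. Whether a store's own mean is a sensible stand-in is a separate question that depends on why the values are missing in the first place.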

Data acquisition is not as fun to think about as building cool machine learning or optimization models, but it is every bit as important, and the issues are often subtle.