Demand Forecasting of Perishable Products

The objective of this project is to minimize wastage of meal kits in retail stores. Currently, waste is controlled by tracking each individual item from the source to the point of sale, which is a cumbersome and labor-intensive process. To realize the objective using machine learning, the first step is to produce an accurate forecast of demand. This project focuses on generating accurate forecasts for each individual item (46 unique items) at each store (47 unique stores). Three datasets are used for demand forecasting.

  • Product dataset
    • Product_ID
    • Cook Products
    • Sales Product
    • ShelfLife_after_Cook
    • ShelfLife_WFM
    • Price
    • Cost
  • Locations dataset

    • Cook_Loc
    • Sales_Loc
    • Market
  • Sales dataset

    • Locationno
    • Market
    • Saledate
    • Productid
    • Unit_gross_sales

Note: The unaltered dataset is not shared. The Locations dataset contains features obtained from https://censusreporter.org/ that are not present in the original dataset.

# import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt # visualization
import seaborn as sns  # visualization
import holidays   # stored list of holidays in U.S.

# Model
import lightgbm as lgb
import shap                  # For error analysis
from sklearn.model_selection import TimeSeriesSplit, GridSearchCV, RandomizedSearchCV
from sklearn.metrics import make_scorer
from sklearn.preprocessing import LabelEncoder       

# Configuration
import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
# Read the dataset containing the products and sales of the products across the stores in the state of Texas

prod = pd.read_csv('WFM_Products', delimiter = '\t')
locations = pd.read_csv("locations.csv")
sales = pd.read_csv('WFM_Sales.txt', delimiter = '\t',  usecols = ['LOCATIONNO', 'MARKET', 'SALEDATE',
        'PRODUCTID', 'UNIT_GROSS_SALES'],
                    parse_dates = ['SALEDATE'])
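
The locations table (which carries the census-derived store features mentioned in the note above) is read here but only used later. As a rough illustration of how those store-level features could be attached to the sales records, here is a minimal sketch; it assumes the Sales_Loc column in locations.csv holds the same store identifiers as LOCATIONNO in the sales data, which may not hold exactly for the real files.

# Hypothetical join of census-derived store features onto the sales records.
# Assumes locations['Sales_Loc'] uses the same identifiers as sales['LOCATIONNO'].
store_features = locations.drop_duplicates(subset = 'Sales_Loc')
sales_with_loc = sales.merge(store_features, left_on = 'LOCATIONNO',
                             right_on = 'Sales_Loc', how = 'left')
print(sales_with_loc.shape)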

Data Cleaning

The Product dataset contains nine missing values in the Cost column. The client provided the cost for four of the nine; the remaining products are third-party products and were deemed unnecessary for the analysis. The Price column contains a '$' symbol, which is stripped before the column is converted from string to float.

prod.isna().sum()
Product_ID              0
Cook Products           0
Sales Product           0
ShelfLife_after_Cook    0
ShelfLife_WFM           0
Price                   0
Cost                    9
dtype: int64
# Costs provided by the client for four of the missing values
miss = {900979: 2.37, 901974: 1.78, 902588: 2.66, 
        902901: 2.15}

# Fill in the client-provided costs
for key, val in miss.items():
    prod.loc[prod['Product_ID'] == key, 'Cost'] = val

# Drop the remaining (third-party) products with missing costs
prod.dropna(inplace = True)
i = prod[prod['Product_ID'] == 902901].index
prod.drop(i, inplace = True)

# Strip the '$' symbol and convert Price to float
prod['Price'] = prod['Price'].str.replace('$', '', regex = False).astype(float)

The third-party items are removed from the sales dataset as well.

sales['Cost'] = sales['PRODUCTID'].map(prod.set_index('Product_ID')['Cost'])

# Drop the third-party products from the sales data
third_party = [902729, 902322, 902787, 902901, 902323, 902733,
               902730, 902732, 902911, 901902, 902547, 902546]
index_names = sales[sales['PRODUCTID'].isin(third_party)].index

sales.drop(index_names, inplace = True)
sales.dropna(inplace = True)
print(sales.shape, "\n")
sales.head()
(290014, 6) 
LOCATIONNO MARKET SALEDATE PRODUCTID UNIT_GROSS_SALES Cost
1 10073 Austin 2019-09-08 902287 1.0 1.82
2 10086 Austin 2019-09-08 901352 1.0 2.28
4 10017 Austin 2019-09-08 901352 1.0 2.28
5 10085 Austin 2019-09-08 902714 1.0 1.96
6 10073 Austin 2019-09-08 902934 1.0 0.96

We check the total number of stores and number of unique items in the data. We also check the time range of the data.

# How many stores and items are there?
print("number of unique items :",sales.PRODUCTID.nunique())
print("number of unique store :",sales.LOCATIONNO.nunique())
number of unique items : 46
number of unique store : 47
# Time Range
sales["SALEDATE"].min(), sales["SALEDATE"].max()
(Timestamp('2019-09-08 00:00:00'), Timestamp('2021-04-06 00:00:00'))

Although there are 46 unique items, not all of them are sold in every store. Most stores carry 45 or 46 of the products, except store number 4, which carries only 36. Each store and item is encoded with a label encoder so that the store and item identifiers can be incorporated in the model.

# How many items are in each store?
# Encode store and item identifiers so they can be used as model inputs
le_store = LabelEncoder()
sales['LOCATIONNO'] = le_store.fit_transform(sales['LOCATIONNO'].values)
inverse_loc = le_store.inverse_transform(sales['LOCATIONNO'])   # mapping back to the original store IDs
trans_loc = sales['LOCATIONNO']

le_item = LabelEncoder()
sales['PRODUCTID'] = le_item.fit_transform(sales['PRODUCTID'].values)

sales.groupby(["LOCATIONNO"])["PRODUCTID"].nunique()
LOCATIONNO
0     46
1     46
2     46
3     46
4     36
5     45
6     45
7     46
8     46
9     45
10    46
11    46
12    46
13    45
14    46
15    45
16    45
17    46
18    46
19    46
20    46
21    46
22    46
23    46
24    46
25    46
26    46
27    45
28    46
29    46
30    46
31    45
32    46
33    46
34    46
35    45
36    46
37    45
38    46
39    46
40    46
41    46
42    45
43    46
44    46
45    46
46    45
Name: PRODUCTID, dtype: int64

The next step is to look at summary statistics for each store, and then for each item, to find out which stores and items sell the most. We find that store number 1 outsells most of the other stores.

# Summary Stats for each store
sales.groupby(["LOCATIONNO"]).agg({"UNIT_GROSS_SALES": ["count","sum", "mean", 
                                                        "median", "std", "min", "max"]})
UNIT_GROSS_SALES
count sum mean median std min max
LOCATIONNO
0 7466 22435.23 3.004987 2.0 2.688379 0.0 25.0
1 6624 15233.00 2.299668 2.0 1.962819 0.0 14.0
2 7430 18278.17 2.460050 2.0 2.107790 0.0 16.0
3 6367 11731.00 1.842469 1.0 1.621748 0.0 14.0
4 4890 17768.00 3.633538 3.0 3.411545 0.0 36.0
5 5095 8027.75 1.575613 1.0 1.507925 0.0 13.0
6 5224 8745.00 1.674005 1.0 1.619680 0.0 14.0
7 5487 9790.80 1.784363 1.0 1.819337 0.0 20.0
8 5358 10026.00 1.871221 1.0 1.705230 0.0 12.0
9 5097 6627.00 1.300177 1.0 1.319514 0.0 9.0
10 4956 7925.70 1.599213 1.0 1.520835 0.0 11.0
11 4520 5239.00 1.159071 1.0 1.248378 0.0 10.0
12 6262 19845.88 3.169256 2.0 2.914780 0.0 30.0
13 4651 5736.00 1.233283 1.0 1.275230 0.0 12.0
14 5726 13203.00 2.305798 2.0 2.081381 0.0 16.0
15 5738 12329.29 2.148709 2.0 2.003545 0.0 16.1
16 5379 9544.90 1.774475 1.0 1.721056 0.0 18.0
17 6903 16933.00 2.452991 2.0 2.094687 0.0 16.0
18 6389 12324.00 1.928940 2.0 1.671323 0.0 20.0
19 7158 17659.00 2.467030 2.0 2.074877 0.0 15.0
20 6254 10149.00 1.622801 1.0 1.457495 0.0 12.0
21 6634 12702.00 1.914682 2.0 1.650885 0.0 13.0
22 6617 10787.00 1.630195 1.0 1.533882 0.0 15.0
23 6887 12882.00 1.870481 2.0 1.617796 0.0 14.0
24 6988 17344.00 2.481969 2.0 2.162589 0.0 18.0
25 6381 11172.00 1.750823 1.0 1.573079 0.0 16.0
26 6736 14235.00 2.113272 2.0 1.740413 0.0 12.0
27 6319 13571.00 2.147650 2.0 2.002659 0.0 16.0
28 6318 10464.00 1.656220 1.0 1.418869 0.0 11.0
29 5911 9410.42 1.592018 1.0 1.396673 0.0 15.0
30 8480 49427.82 5.828752 4.0 5.271472 0.0 85.0
31 5992 8855.00 1.477804 1.0 1.367350 0.0 18.0
32 7301 17564.35 2.405746 2.0 2.044244 0.0 25.0
33 7849 24421.02 3.111354 3.0 2.600314 0.0 22.0
34 5941 7162.00 1.205521 1.0 1.216750 0.0 11.0
35 5561 8248.25 1.483231 1.0 1.373421 0.0 12.0
36 5699 11326.00 1.987366 1.0 2.049454 0.0 29.0
37 5653 7343.00 1.298956 1.0 1.254628 0.0 9.0
38 6153 11187.00 1.818137 1.0 1.590339 0.0 13.0
39 6677 15645.00 2.343118 2.0 2.166764 0.0 22.0
40 6492 11928.00 1.837338 2.0 1.541209 0.0 14.0
41 6305 11331.00 1.797145 1.0 1.547141 0.0 12.0
42 6215 13815.00 2.222848 2.0 2.045364 0.0 15.0
43 5702 7844.00 1.375658 1.0 1.373397 0.0 12.0
44 6511 12741.73 1.956954 2.0 1.667920 0.0 18.0
45 6309 10392.00 1.647171 1.0 1.557735 0.0 13.0
46 5409 7842.00 1.449806 1.0 1.300789 0.0 12.0
# Summary Stats for each item
sales.groupby(["PRODUCTID"]).agg({"UNIT_GROSS_SALES": ["count","sum", "mean", "median",
                                                       "std", "min", "max"]})
UNIT_GROSS_SALES
count sum mean median std min max
PRODUCTID
0 881 1489.00 1.690125 1.0 1.149310 1.00 12.0
1 15271 39342.84 2.576311 2.0 2.507954 0.00 28.0
2 8558 23905.00 2.793293 2.0 2.145337 0.00 22.0
3 18194 51367.75 2.823335 2.0 2.821902 0.00 59.0
4 100 165.00 1.650000 1.0 1.192358 1.00 7.0
5 6961 15097.77 2.168908 2.0 1.496724 0.42 16.0
6 5752 13468.00 2.341446 2.0 1.631576 0.00 13.0
7 3199 5679.00 1.775242 1.0 1.152993 0.00 13.0
8 8967 21627.71 2.411923 2.0 2.707817 0.00 27.0
9 4387 7152.65 1.630419 1.0 0.942318 0.00 7.0
10 1002 1627.00 1.623752 1.0 1.042146 1.00 9.0
11 8836 12853.00 1.454617 0.0 2.310401 0.00 25.0
12 3086 5614.00 1.819183 1.0 1.048439 1.00 7.0
13 8958 17678.39 1.973475 1.0 2.339349 0.00 27.0
14 843 1317.00 1.562278 1.0 0.915348 1.00 7.0
15 6124 14881.00 2.429948 2.0 1.885544 0.00 22.0
16 11353 30249.00 2.664406 2.0 1.953958 0.00 25.0
17 4306 9994.81 2.321136 2.0 1.703804 0.11 15.0
18 2591 5086.00 1.962949 2.0 1.313200 1.00 11.0
19 1987 5283.00 2.658782 2.0 2.037989 0.00 18.0
20 5255 9629.00 1.832350 1.0 1.116143 1.00 11.0
21 1106 1902.50 1.720163 1.0 1.057462 1.00 8.5
22 6964 14536.00 2.087306 2.0 1.387399 0.00 14.0
23 2839 4561.00 1.606552 1.0 0.924477 0.00 11.0
24 3504 6131.00 1.749715 1.0 1.033758 1.00 8.0
25 5595 11575.00 2.068811 2.0 1.299769 0.00 10.0
26 3078 5623.00 1.826836 2.0 1.059599 0.00 9.0
27 13402 35149.25 2.622687 2.0 2.708546 0.00 35.0
28 4546 10005.00 2.200836 2.0 1.461169 0.00 14.0
29 3925 7301.00 1.860127 1.0 1.145417 1.00 8.0
30 1498 3500.00 2.336449 2.0 1.517748 1.00 13.0
31 755 1435.00 1.900662 1.0 1.301677 0.00 9.0
32 1358 2502.00 1.842415 1.0 1.160730 0.00 11.0
33 6111 13937.90 2.280789 2.0 1.605280 0.00 19.0
34 13443 31241.59 2.324004 2.0 2.274236 0.00 26.0
35 5769 11913.00 2.065003 2.0 1.470268 0.00 15.0
36 836 1309.00 1.565789 1.0 0.973531 1.00 7.0
37 13299 30004.98 2.256183 2.0 2.299639 0.00 57.0
38 13522 33390.15 2.469320 2.0 2.412362 0.00 29.0
39 8837 9548.00 1.080457 0.0 1.863488 0.00 22.0
40 8836 10658.50 1.206258 0.0 1.859811 0.00 23.0
41 8836 18154.00 2.054550 1.0 3.142165 0.00 85.0
42 8836 19553.76 2.212965 1.0 3.178544 0.00 43.0
43 8836 13038.76 1.475641 0.0 2.216686 0.00 31.0
44 8836 9877.00 1.117813 0.0 1.928911 0.00 25.0
45 8836 8837.00 1.000113 0.0 1.571769 0.00 15.0
sns.set_context("talk", font_scale=1.4)
markets = ['Dallas', 'Houston', 'Frontier', 'Austin', 'San Antonio']

fig, axes = plt.subplots(2, 3, figsize=(10, 10))
for i, market in enumerate(markets):
    # Map market i to its position in the 2 x 3 grid
    row, col = divmod(i, 3)
    sales[sales.MARKET == market].UNIT_GROSS_SALES.hist(ax=axes[row, col])
    axes[row, col].set_title("Market " + market, fontsize = 15)

plt.tight_layout(pad= 4.5)
plt.suptitle("Histogram: Market Sales");

# Histogram Sales
sns.set_context("talk", font_scale=1.4)
fig, axes = plt.subplots(8, 6, figsize=(20, 35))
for i in range(0, 47):
    # Map store i to its position in the 8 x 6 grid
    row, col = divmod(i, 6)
    sales[sales.LOCATIONNO == i].UNIT_GROSS_SALES.hist(ax=axes[row, col])
    axes[row, col].set_title("Store " + str(i), fontsize = 15)

plt.tight_layout(pad= 3)
plt.suptitle("Histogram: Sales");

The following plot shows the sale of all the items in store 1.

store = 1
sub = sales[sales.LOCATIONNO == store].set_index("SALEDATE")

# sns.set_context("talk", font_scale=1.4)
fig, axes = plt.subplots(8, 6, figsize=(35, 40))
for i in range(0, 46):
    # Map item i to its position in the 8 x 6 grid
    row, col = divmod(i, 6)
    sub[sub.PRODUCTID == i].UNIT_GROSS_SALES.plot(ax=axes[row, col], 
                                  legend=True, label = "Item "+str(i)+" Sales")

plt.tight_layout(pad=3)
plt.suptitle("Store 1 Items");

We check the correlation of total daily sales between stores to find out whether there are stores or markets whose sales move together. Based on the correlation data, redundant data can be removed to improve the model's prediction accuracy.

# Correlation between total sales of stores

storesales = sales.groupby(["SALEDATE", "LOCATIONNO"]).UNIT_GROSS_SALES.sum().reset_index().set_index("SALEDATE")
corr =  pd.pivot_table(storesales, values = "UNIT_GROSS_SALES", columns="LOCATIONNO", index="SALEDATE").corr(method = "spearman")
plt.figure(figsize = (30,30))
sns.heatmap(corr[(corr >= 0.5) | (corr <= -0.5)], 
            cmap='viridis', vmax=1.0, vmin=-1.0, linewidths=0.1,
            annot=True, annot_kws={"size": 9}, square=True);
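
To make the heatmap actionable (as noted above, correlated stores point to redundant data), the most highly correlated store pairs can be listed directly from the corr matrix just computed; the 0.9 cutoff below is only illustrative.

# List the store pairs with the highest Spearman correlation of total sales
# (0.9 is an illustrative threshold, not one used elsewhere in this project)
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
high_corr = corr.where(mask).stack().sort_values(ascending=False)
print(high_corr[high_corr > 0.9].head(10))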

# Correlation between total sales at market level

sns.set_context("talk", font_scale=1.4)
marketsales = sales.groupby(["SALEDATE", "MARKET"]).UNIT_GROSS_SALES.sum().reset_index().set_index("SALEDATE")
corr =  pd.pivot_table(marketsales, values = "UNIT_GROSS_SALES", columns="MARKET", index="SALEDATE").corr(method = "spearman")
plt.figure(figsize = (10,10))
sns.heatmap(corr[(corr >= 0.5) | (corr <= -0.5)], 
            cmap='viridis', vmax=1.0, vmin=-1.0, linewidths=0.1,
            annot=True, annot_kws={"size": 9}, square=True);

A two-sample hypothesis test is conducted to see which stores have similar sales distributions. Depending on the normality and variance homogeneity of the groups, either a parametric t-test or a non-parametric Mann-Whitney U test is used. The same test is carried out for items and for markets, and the results are incorporated during the feature engineering phase of the project. This kind of two-group comparison is also commonly referred to as A/B testing.

# T Test
def CompareTwoGroups(dataframe, group, target):
    
    import itertools
    from scipy.stats import shapiro
    import scipy.stats as stats
    
    # 1. Normality Test: Shapiro Test
    # 2. Homogeneity Test: Levene Test
    # 3. Parametric or Non-Parametric T Test: T-Test, Welch Test, Mann Whitney U
    
    # Create Combinations
    item_comb = list(itertools.combinations(dataframe[group].unique(), 2))
    
    AB = pd.DataFrame()
    for i in range(0, len(item_comb)):
        # Define Groups
        groupA = dataframe[dataframe[group] == item_comb[i][0]][target]
        groupB = dataframe[dataframe[group] == item_comb[i][1]][target]
        
        # Assumption: Normality
        ntA = shapiro(groupA)[1] < 0.05
        ntB = shapiro(groupB)[1] < 0.05
        # H0: Distribution is Normal! - False
        # H1: Distribution is not Normal! - True
        
        if (ntA == False) & (ntB == False): # "H0: Normal Distribution"
            # Parametric Test
            # Assumption: Homogeneity of variances
            leveneTest = stats.levene(groupA, groupB)[1] < 0.05
            # H0: Homogeneity: False
            # H1: Heterogeneous: True
            if leveneTest == False:
                # Homogeneity
                ttest = stats.ttest_ind(groupA, groupB, equal_var=True)[1]
                # H0: M1 = M2 - False
                # H1: M1 != M2 - True
            else:
                # Heterogeneous
                ttest = stats.ttest_ind(groupA, groupB, equal_var=False)[1]
                # H0: M1 = M2 - False
                # H1: M1 != M2 - True
        else:
            # Non-Parametric Test
            ttest = stats.mannwhitneyu(groupA, groupB)[1] 
            # H0: M1 = M2 - False
            # H1: M1 != M2 - True
            
        temp = pd.DataFrame({"Compare Two Groups":[ttest < 0.05], 
                             "p-value":[ttest],
                             "GroupA_Mean":[groupA.mean()], "GroupB_Mean":[groupB.mean()],
                             "GroupA_Median":[groupA.median()], "GroupB_Median":[groupB.median()],
                             "GroupA_Count":[groupA.count()], "GroupB_Count":[groupB.count()]
                            }, index = [item_comb[i]])
        temp["Compare Two Groups"] = np.where(temp["Compare Two Groups"] == True, "Different Groups", "Similar Groups")
        temp["TestType"] = np.where((ntA == False) & (ntB == False), "Parametric", "Non-Parametric")
        
        AB = pd.concat([AB, temp[["TestType", "Compare Two Groups", "p-value","GroupA_Median", "GroupB_Median","GroupA_Mean", "GroupB_Mean",
                                 "GroupA_Count", "GroupB_Count"]]])
        
    return AB
    
    
ctg_store = CompareTwoGroups(storesales, group = "LOCATIONNO", target = "UNIT_GROSS_SALES")
ctg_ss = ctg_store[ctg_store["Compare Two Groups"] == "Similar Groups"]
ctg_ss
TestType Compare Two Groups p-value GroupA_Median GroupB_Median GroupA_Mean GroupB_Mean GroupA_Count GroupB_Count
(0, 2) Non-Parametric Similar Groups 0.233386 31.0 32.0 38.950052 33.354325 576 548
(20, 29) Non-Parametric Similar Groups 0.337784 15.0 15.0 17.650435 17.266826 575 545
(20, 36) Non-Parametric Similar Groups 0.115768 15.0 15.0 17.650435 21.573333 575 525
(20, 10) Non-Parametric Similar Groups 0.472980 15.0 15.0 17.650435 18.136613 575 437
(31, 35) Non-Parametric Similar Groups 0.380683 14.0 14.0 15.453752 15.801245 573 522
(31, 43) Non-Parametric Similar Groups 0.310684 14.0 13.0 15.453752 14.912548 573 526
(31, 46) Non-Parametric Similar Groups 0.401160 14.0 13.0 15.453752 15.316406 573 512
(31, 9) Non-Parametric Similar Groups 0.361584 14.0 14.0 15.453752 14.959368 573 443
(32, 17) Non-Parametric Similar Groups 0.278548 27.5 27.0 30.599913 30.956124 574 547
(32, 24) Non-Parametric Similar Groups 0.281458 27.5 27.0 30.599913 31.649635 574 548
(32, 14) Non-Parametric Similar Groups 0.103080 27.5 26.0 30.599913 29.803612 574 443
(33, 12) Non-Parametric Similar Groups 0.275046 36.0 38.0 42.471339 44.900181 575 442
(3, 22) Non-Parametric Similar Groups 0.291572 18.0 19.0 21.446069 19.720293 547 547
(3, 25) Non-Parametric Similar Groups 0.388627 18.0 18.0 21.446069 20.424132 547 547
(3, 28) Non-Parametric Similar Groups 0.096847 18.0 17.0 21.446069 19.094891 547 548
(3, 38) Non-Parametric Similar Groups 0.059920 18.0 19.0 21.446069 21.227704 547 527
(3, 45) Non-Parametric Similar Groups 0.347969 18.0 18.0 21.446069 19.719165 547 527
(3, 7) Non-Parametric Similar Groups 0.208064 18.0 19.0 21.446069 22.051351 547 444
(3, 16) Non-Parametric Similar Groups 0.133355 18.0 19.0 21.446069 21.497523 547 444
(17, 24) Non-Parametric Similar Groups 0.146562 27.0 27.0 30.956124 31.649635 547 548
(17, 39) Non-Parametric Similar Groups 0.068823 27.0 25.0 30.956124 29.686907 547 527
(17, 1) Non-Parametric Similar Groups 0.142868 27.0 25.0 30.956124 29.070611 547 524
(17, 14) Non-Parametric Similar Groups 0.245477 27.0 26.0 30.956124 29.803612 547 443
(18, 22) Non-Parametric Similar Groups 0.078444 19.0 19.0 22.530165 19.720293 547 547
(18, 27) Non-Parametric Similar Groups 0.152819 19.0 20.0 22.530165 24.764599 547 548
(18, 38) Non-Parametric Similar Groups 0.389069 19.0 19.0 22.530165 21.227704 547 527
(18, 41) Non-Parametric Similar Groups 0.476575 19.0 20.0 22.530165 21.541825 547 526
(18, 7) Non-Parametric Similar Groups 0.216345 19.0 19.0 22.530165 22.051351 547 444
(18, 16) Non-Parametric Similar Groups 0.322617 19.0 19.0 22.530165 21.497523 547 444
(18, 8) Non-Parametric Similar Groups 0.407530 19.0 19.0 22.530165 22.734694 547 441
(19, 24) Non-Parametric Similar Groups 0.163828 29.0 27.0 32.224453 31.649635 548 548
(21, 23) Non-Parametric Similar Groups 0.160270 21.0 22.0 23.178832 23.507299 548 548
(21, 27) Non-Parametric Similar Groups 0.209002 21.0 20.0 23.178832 24.764599 548 548
(21, 40) Non-Parametric Similar Groups 0.471788 21.0 21.0 23.178832 22.633776 548 527
(21, 42) Non-Parametric Similar Groups 0.359416 21.0 20.0 23.178832 26.465517 548 522
(21, 44) Non-Parametric Similar Groups 0.213510 21.0 22.0 23.178832 24.177856 548 527
(21, 8) Non-Parametric Similar Groups 0.085189 21.0 19.0 23.178832 22.734694 548 441
(22, 25) Non-Parametric Similar Groups 0.187974 19.0 18.0 19.720293 20.424132 547 547
(22, 38) Non-Parametric Similar Groups 0.111677 19.0 19.0 19.720293 21.227704 547 527
(22, 45) Non-Parametric Similar Groups 0.179004 19.0 18.0 19.720293 19.719165 547 527
(22, 7) Non-Parametric Similar Groups 0.297087 19.0 19.0 19.720293 22.051351 547 444
(22, 16) Non-Parametric Similar Groups 0.141047 19.0 19.0 19.720293 21.497523 547 444
(23, 40) Non-Parametric Similar Groups 0.144820 22.0 21.0 23.507299 22.633776 548 527
(23, 42) Non-Parametric Similar Groups 0.284477 22.0 20.0 23.507299 26.465517 548 522
(23, 44) Non-Parametric Similar Groups 0.440538 22.0 22.0 23.507299 24.177856 548 527
(25, 28) Non-Parametric Similar Groups 0.141127 18.0 17.0 20.424132 19.094891 547 548
(25, 45) Non-Parametric Similar Groups 0.480488 18.0 18.0 20.424132 19.719165 547 527
(25, 6) Non-Parametric Similar Groups 0.062208 18.0 17.0 20.424132 19.740406 547 443
(25, 7) Non-Parametric Similar Groups 0.112849 18.0 19.0 20.424132 22.051351 547 444
(25, 16) Non-Parametric Similar Groups 0.057894 18.0 19.0 20.424132 21.497523 547 444
(27, 38) Non-Parametric Similar Groups 0.083419 20.0 19.0 24.764599 21.227704 548 527
(27, 40) Non-Parametric Similar Groups 0.218523 20.0 21.0 24.764599 22.633776 548 527
(27, 41) Non-Parametric Similar Groups 0.137742 20.0 20.0 24.764599 21.541825 548 526
(27, 42) Non-Parametric Similar Groups 0.146031 20.0 20.0 24.764599 26.465517 548 522
(27, 44) Non-Parametric Similar Groups 0.072787 20.0 22.0 24.764599 24.177856 548 527
(27, 16) Non-Parametric Similar Groups 0.068801 20.0 19.0 24.764599 21.497523 548 444
(27, 8) Non-Parametric Similar Groups 0.233037 20.0 19.0 24.764599 22.734694 548 441
(28, 45) Non-Parametric Similar Groups 0.157909 17.0 18.0 19.094891 19.719165 548 527
(28, 5) Non-Parametric Similar Groups 0.082682 17.0 17.0 19.094891 18.162330 548 442
(28, 6) Non-Parametric Similar Groups 0.282677 17.0 17.0 19.094891 19.740406 548 443
(29, 36) Non-Parametric Similar Groups 0.056004 15.0 15.0 17.266826 21.573333 545 525
(29, 10) Non-Parametric Similar Groups 0.340116 15.0 15.0 17.266826 18.136613 545 437
(35, 43) Non-Parametric Similar Groups 0.216217 14.0 13.0 15.801245 14.912548 522 526
(35, 46) Non-Parametric Similar Groups 0.272802 14.0 13.0 15.801245 15.316406 522 512
(35, 9) Non-Parametric Similar Groups 0.498473 14.0 14.0 15.801245 14.959368 522 443
(37, 34) Non-Parametric Similar Groups 0.478634 13.0 12.0 14.013359 13.615970 524 526
(39, 1) Non-Parametric Similar Groups 0.330655 25.0 25.0 29.686907 29.070611 527 524
(39, 14) Non-Parametric Similar Groups 0.232906 25.0 26.0 29.686907 29.803612 527 443
(39, 15) Non-Parametric Similar Groups 0.388649 25.0 26.0 29.686907 27.831354 527 443
(1, 14) Non-Parametric Similar Groups 0.364876 25.0 26.0 29.070611 29.803612 524 443
(1, 15) Non-Parametric Similar Groups 0.219119 25.0 26.0 29.070611 27.831354 524 443
(36, 5) Non-Parametric Similar Groups 0.323113 15.0 17.0 21.573333 18.162330 525 442
(36, 6) Non-Parametric Similar Groups 0.116215 15.0 17.0 21.573333 19.740406 525 443
(36, 10) Non-Parametric Similar Groups 0.125947 15.0 15.0 21.573333 18.136613 525 437
(38, 41) Non-Parametric Similar Groups 0.352019 19.0 20.0 21.227704 21.541825 527 526
(38, 7) Non-Parametric Similar Groups 0.304000 19.0 19.0 21.227704 22.051351 527 444
(38, 16) Non-Parametric Similar Groups 0.455081 19.0 19.0 21.227704 21.497523 527 444
(38, 8) Non-Parametric Similar Groups 0.247194 19.0 19.0 21.227704 22.734694 527 441
(40, 42) Non-Parametric Similar Groups 0.381842 21.0 20.0 22.633776 26.465517 527 522
(40, 44) Non-Parametric Similar Groups 0.188451 21.0 22.0 22.633776 24.177856 527 527
(40, 8) Non-Parametric Similar Groups 0.078163 21.0 19.0 22.633776 22.734694 527 441
(41, 7) Non-Parametric Similar Groups 0.204610 20.0 19.0 21.541825 22.051351 526 444
(41, 16) Non-Parametric Similar Groups 0.334767 20.0 19.0 21.541825 21.497523 526 444
(41, 8) Non-Parametric Similar Groups 0.338689 20.0 19.0 21.541825 22.734694 526 441
(42, 44) Non-Parametric Similar Groups 0.387730 20.0 22.0 26.465517 24.177856 522 527
(43, 46) Non-Parametric Similar Groups 0.432120 13.0 13.0 14.912548 15.316406 526 512
(43, 9) Non-Parametric Similar Groups 0.217890 13.0 14.0 14.912548 14.959368 526 443
(45, 6) Non-Parametric Similar Groups 0.072031 18.0 17.0 19.719165 19.740406 527 443
(45, 7) Non-Parametric Similar Groups 0.092305 18.0 19.0 19.719165 22.051351 527 444
(46, 9) Non-Parametric Similar Groups 0.298520 13.0 14.0 15.316406 14.959368 512 443
(5, 6) Non-Parametric Similar Groups 0.247393 17.0 17.0 18.162330 19.740406 442 443
(5, 10) Non-Parametric Similar Groups 0.062061 17.0 15.0 18.162330 18.136613 442 437
(7, 16) Non-Parametric Similar Groups 0.386442 19.0 19.0 22.051351 21.497523 444 444
(7, 8) Non-Parametric Similar Groups 0.151309 19.0 19.0 22.051351 22.734694 444 441
(11, 13) Non-Parametric Similar Groups 0.094553 10.0 11.0 12.071429 13.036364 434 440
(14, 15) Non-Parametric Similar Groups 0.134353 26.0 26.0 29.803612 27.831354 443 443
(16, 8) Non-Parametric Similar Groups 0.213394 19.0 19.0 21.497523 22.734694 444 441
itemsales = sales.groupby(["SALEDATE", "PRODUCTID"]).UNIT_GROSS_SALES.sum().reset_index().set_index("SALEDATE")
ctg_is = CompareTwoGroups(itemsales, group = "PRODUCTID", target = "UNIT_GROSS_SALES")
ctg_iss = ctg_is[ctg_is["Compare Two Groups"] == "Similar Groups"]
ctg_iss
TestType Compare Two Groups p-value GroupA_Median GroupB_Median GroupA_Mean GroupB_Mean GroupA_Count GroupB_Count
(6, 22) Non-Parametric Similar Groups 0.431454 48.0 46.0 47.590106 49.274576 283 295
(6, 25) Non-Parametric Similar Groups 0.062028 48.0 46.0 47.590106 46.115538 283 251
(6, 28) Non-Parametric Similar Groups 0.191243 48.0 51.0 47.590106 49.776119 283 201
(6, 18) Non-Parametric Similar Groups 0.395740 48.0 47.0 47.590106 47.981132 283 106
(6, 39) Non-Parametric Similar Groups 0.282581 48.0 55.0 47.590106 50.518519 283 189
(6, 44) Non-Parametric Similar Groups 0.279319 48.0 57.0 47.590106 52.537234 283 188
(6, 45) Non-Parametric Similar Groups 0.437603 48.0 51.0 47.590106 47.005319 283 188
(7, 32) Non-Parametric Similar Groups 0.267717 24.5 20.0 25.581081 26.617021 222 94
(7, 0) Non-Parametric Similar Groups 0.110582 24.5 27.0 25.581081 26.589286 222 56
(7, 10) Non-Parametric Similar Groups 0.385214 24.5 26.0 25.581081 23.926471 222 68
(7, 36) Non-Parametric Similar Groups 0.234398 24.5 25.0 25.581081 22.964912 222 57
(7, 14) Non-Parametric Similar Groups 0.267800 24.5 24.0 25.581081 23.105263 222 57
(9, 24) Non-Parametric Similar Groups 0.061105 28.0 31.0 28.610600 30.809045 250 199
(9, 21) Non-Parametric Similar Groups 0.235222 28.0 32.0 28.610600 29.269231 250 65
(9, 0) Non-Parametric Similar Groups 0.292907 28.0 27.0 28.610600 26.589286 250 56
(16, 19) Non-Parametric Similar Groups 0.313315 78.0 79.0 78.568831 78.850746 385 67
(16, 15) Non-Parametric Similar Groups 0.106617 78.0 73.5 78.568831 77.505208 385 192
(20, 29) Non-Parametric Similar Groups 0.330028 33.0 34.0 34.389286 36.143564 280 202
(20, 21) Non-Parametric Similar Groups 0.066783 33.0 32.0 34.389286 29.269231 280 65
(22, 25) Non-Parametric Similar Groups 0.055195 46.0 46.0 49.274576 46.115538 295 251
(22, 28) Non-Parametric Similar Groups 0.271737 46.0 51.0 49.274576 49.776119 295 201
(22, 18) Non-Parametric Similar Groups 0.477400 46.0 47.0 49.274576 47.981132 295 106
(22, 39) Non-Parametric Similar Groups 0.377522 46.0 55.0 49.274576 50.518519 295 189
(22, 44) Non-Parametric Similar Groups 0.403045 46.0 57.0 49.274576 52.537234 295 188
(22, 45) Non-Parametric Similar Groups 0.413270 46.0 51.0 49.274576 47.005319 295 188
(23, 30) Non-Parametric Similar Groups 0.278406 23.0 19.0 22.357843 26.315789 204 133
(23, 31) Non-Parametric Similar Groups 0.257764 23.0 19.0 22.357843 20.797101 204 69
(23, 32) Non-Parametric Similar Groups 0.094785 23.0 20.0 22.357843 26.617021 204 94
(23, 10) Non-Parametric Similar Groups 0.317212 23.0 26.0 22.357843 23.926471 204 68
(23, 36) Non-Parametric Similar Groups 0.416263 23.0 25.0 22.357843 22.964912 204 57
(23, 14) Non-Parametric Similar Groups 0.387084 23.0 24.0 22.357843 23.105263 204 57
(24, 21) Non-Parametric Similar Groups 0.332758 31.0 32.0 30.809045 29.269231 199 65
(25, 18) Non-Parametric Similar Groups 0.132354 46.0 47.0 46.115538 47.981132 251 106
(25, 39) Non-Parametric Similar Groups 0.211923 46.0 55.0 46.115538 50.518519 251 189
(25, 44) Non-Parametric Similar Groups 0.302716 46.0 57.0 46.115538 52.537234 251 188
(25, 45) Non-Parametric Similar Groups 0.238029 46.0 51.0 46.115538 47.005319 251 188
(28, 18) Non-Parametric Similar Groups 0.210105 51.0 47.0 49.776119 47.981132 201 106
(28, 39) Non-Parametric Similar Groups 0.311082 51.0 55.0 49.776119 50.518519 201 189
(28, 40) Non-Parametric Similar Groups 0.232932 51.0 54.0 49.776119 56.694149 201 188
(28, 44) Non-Parametric Similar Groups 0.395745 51.0 57.0 49.776119 52.537234 201 188
(28, 45) Non-Parametric Similar Groups 0.163656 51.0 51.0 49.776119 47.005319 201 188
(30, 31) Non-Parametric Similar Groups 0.308437 19.0 19.0 26.315789 20.797101 133 69
(30, 32) Non-Parametric Similar Groups 0.225922 19.0 20.0 26.315789 26.617021 133 94
(30, 21) Non-Parametric Similar Groups 0.071822 19.0 32.0 26.315789 29.269231 133 65
(30, 10) Non-Parametric Similar Groups 0.288977 19.0 26.0 26.315789 23.926471 133 68
(30, 36) Non-Parametric Similar Groups 0.281821 19.0 25.0 26.315789 22.964912 133 57
(30, 14) Non-Parametric Similar Groups 0.245616 19.0 24.0 26.315789 23.105263 133 57
(1, 3) Non-Parametric Similar Groups 0.205472 91.0 96.0 102.189195 107.689203 385 477
(1, 2) Non-Parametric Similar Groups 0.221018 91.0 91.0 102.189195 95.239044 385 251
(1, 8) Non-Parametric Similar Groups 0.053662 91.0 104.0 102.189195 111.483041 385 194
(1, 27) Non-Parametric Similar Groups 0.290834 91.0 93.0 102.189195 99.011972 385 355
(1, 38) Non-Parametric Similar Groups 0.230607 91.0 94.0 102.189195 98.206324 385 340
(1, 41) Non-Parametric Similar Groups 0.259650 91.0 104.5 102.189195 96.563830 385 188
(1, 42) Non-Parametric Similar Groups 0.170894 91.0 110.5 102.189195 104.009362 385 188
(31, 32) Non-Parametric Similar Groups 0.089981 19.0 20.0 20.797101 26.617021 69 94
(31, 36) Non-Parametric Similar Groups 0.062689 19.0 25.0 20.797101 22.964912 69 57
(32, 21) Non-Parametric Similar Groups 0.179261 20.0 32.0 26.617021 29.269231 94 65
(32, 10) Non-Parametric Similar Groups 0.445321 20.0 26.0 26.617021 23.926471 94 68
(32, 36) Non-Parametric Similar Groups 0.448063 20.0 25.0 26.617021 22.964912 94 57
(32, 14) Non-Parametric Similar Groups 0.485450 20.0 24.0 26.617021 23.105263 94 57
(26, 12) Non-Parametric Similar Groups 0.328254 40.0 41.5 41.043796 41.279412 137 136
(26, 44) Non-Parametric Similar Groups 0.069902 40.0 57.0 41.043796 52.537234 137 188
(12, 44) Non-Parametric Similar Groups 0.070772 41.5 57.0 41.279412 52.537234 136 188
(3, 8) Non-Parametric Similar Groups 0.181581 96.0 104.0 107.689203 111.483041 477 194
(3, 27) Non-Parametric Similar Groups 0.056285 96.0 93.0 107.689203 99.011972 477 355
(3, 41) Non-Parametric Similar Groups 0.068409 96.0 104.5 107.689203 96.563830 477 188
(3, 42) Non-Parametric Similar Groups 0.458308 96.0 110.5 107.689203 104.009362 477 188
(2, 27) Non-Parametric Similar Groups 0.289723 91.0 93.0 95.239044 99.011972 251 355
(2, 38) Non-Parametric Similar Groups 0.369148 91.0 94.0 95.239044 98.206324 251 340
(2, 13) Non-Parametric Similar Groups 0.127806 91.0 86.0 95.239044 91.597876 251 193
(2, 41) Non-Parametric Similar Groups 0.275901 91.0 104.5 95.239044 96.563830 251 188
(4, 44) Non-Parametric Similar Groups 0.050453 3.0 57.0 8.684211 52.537234 19 188
(5, 17) Non-Parametric Similar Groups 0.068951 58.0 64.0 60.150478 62.467562 251 160
(5, 35) Non-Parametric Similar Groups 0.361863 58.0 56.5 60.150478 57.274038 251 208
(5, 40) Non-Parametric Similar Groups 0.157302 58.0 54.0 60.150478 56.694149 251 188
(18, 39) Non-Parametric Similar Groups 0.270713 47.0 55.0 47.981132 50.518519 106 189
(18, 40) Non-Parametric Similar Groups 0.064286 47.0 54.0 47.981132 56.694149 106 188
(18, 44) Non-Parametric Similar Groups 0.325905 47.0 57.0 47.981132 52.537234 106 188
(18, 45) Non-Parametric Similar Groups 0.338588 47.0 51.0 47.981132 47.005319 106 188
(19, 15) Non-Parametric Similar Groups 0.103538 79.0 73.5 78.850746 77.505208 67 192
(19, 37) Non-Parametric Similar Groups 0.264685 79.0 81.5 78.850746 88.249941 67 340
(19, 13) Non-Parametric Similar Groups 0.084556 79.0 86.0 78.850746 91.597876 67 193
(19, 41) Non-Parametric Similar Groups 0.079300 79.0 104.5 78.850746 96.563830 67 188
(0, 10) Non-Parametric Similar Groups 0.244027 27.0 26.0 26.589286 23.926471 56 68
(0, 36) Non-Parametric Similar Groups 0.103580 27.0 25.0 26.589286 22.964912 56 57
(0, 14) Non-Parametric Similar Groups 0.081279 27.0 24.0 26.589286 23.105263 56 57
(10, 36) Non-Parametric Similar Groups 0.305561 26.0 25.0 23.926471 22.964912 68 57
(10, 14) Non-Parametric Similar Groups 0.297753 26.0 24.0 23.926471 23.105263 68 57
(15, 11) Non-Parametric Similar Groups 0.165177 73.5 78.0 77.505208 68.367021 192 188
(15, 43) Non-Parametric Similar Groups 0.102486 73.5 76.5 77.505208 69.355106 192 188
(8, 42) Non-Parametric Similar Groups 0.267297 104.0 110.5 111.483041 104.009362 194 188
(27, 38) Non-Parametric Similar Groups 0.425943 93.0 94.0 99.011972 98.206324 355 340
(27, 13) Non-Parametric Similar Groups 0.074162 93.0 86.0 99.011972 91.597876 355 193
(27, 41) Non-Parametric Similar Groups 0.234647 93.0 104.5 99.011972 96.563830 355 188
(27, 42) Non-Parametric Similar Groups 0.247916 93.0 110.5 99.011972 104.009362 355 188
(39, 40) Non-Parametric Similar Groups 0.179638 55.0 54.0 50.518519 56.694149 189 188
(39, 44) Non-Parametric Similar Groups 0.375144 55.0 57.0 50.518519 52.537234 189 188
(39, 45) Non-Parametric Similar Groups 0.208922 55.0 51.0 50.518519 47.005319 189 188
(17, 33) Non-Parametric Similar Groups 0.082676 64.0 68.0 62.467562 67.332850 160 207
(17, 11) Non-Parametric Similar Groups 0.159181 64.0 78.0 62.467562 68.367021 160 188
(17, 43) Non-Parametric Similar Groups 0.225410 64.0 76.5 62.467562 69.355106 160 188
(33, 11) Non-Parametric Similar Groups 0.276477 68.0 78.0 67.332850 68.367021 207 188
(33, 43) Non-Parametric Similar Groups 0.435254 68.0 76.5 67.332850 69.355106 207 188
(34, 13) Non-Parametric Similar Groups 0.497191 88.0 86.0 92.158083 91.597876 339 193
(34, 41) Non-Parametric Similar Groups 0.253872 88.0 104.5 92.158083 96.563830 339 188
(35, 40) Non-Parametric Similar Groups 0.240636 56.5 54.0 57.274038 56.694149 208 188
(35, 44) Non-Parametric Similar Groups 0.121123 56.5 57.0 57.274038 52.537234 208 188
(36, 14) Non-Parametric Similar Groups 0.490951 25.0 24.0 22.964912 23.105263 57 57
(37, 13) Non-Parametric Similar Groups 0.107570 81.5 86.0 88.249941 91.597876 340 193
(37, 41) Non-Parametric Similar Groups 0.115668 81.5 104.5 88.249941 96.563830 340 188
(38, 13) Non-Parametric Similar Groups 0.093953 94.0 86.0 98.206324 91.597876 340 193
(38, 41) Non-Parametric Similar Groups 0.381587 94.0 104.5 98.206324 96.563830 340 188
(38, 42) Non-Parametric Similar Groups 0.124984 94.0 110.5 98.206324 104.009362 340 188
(13, 41) Non-Parametric Similar Groups 0.309650 86.0 104.5 91.597876 96.563830 193 188
(11, 43) Non-Parametric Similar Groups 0.473016 78.0 76.5 68.367021 69.355106 188 188
(40, 44) Non-Parametric Similar Groups 0.243153 54.0 57.0 56.694149 52.537234 188 188
(41, 42) Non-Parametric Similar Groups 0.167488 104.5 110.5 96.563830 104.009362 188 188
(44, 45) Non-Parametric Similar Groups 0.129569 57.0 51.0 52.537234 47.005319 188 188
CompareTwoGroups(marketsales, group = "MARKET", target = "UNIT_GROSS_SALES")
TestType Compare Two Groups p-value GroupA_Median GroupB_Median GroupA_Mean GroupB_Mean GroupA_Count GroupB_Count
(Austin, Dallas) Non-Parametric Different Groups 2.142448e-28 223.5 315.0 261.493785 345.159545 576 549
(Austin, Houston) Non-Parametric Similar Groups 4.293589e-01 223.5 216.0 261.493785 244.136837 576 528
(Austin, San Antonio) Non-Parametric Different Groups 5.611602e-173 223.5 41.0 261.493785 43.897021 576 527
(Austin, Frontier) Non-Parametric Different Groups 2.915368e-02 223.5 232.5 261.493785 263.604324 576 444
(Dallas, Houston) Non-Parametric Different Groups 4.189280e-39 315.0 216.0 345.159545 244.136837 549 528
(Dallas, San Antonio) Non-Parametric Different Groups 3.759319e-172 315.0 41.0 345.159545 43.897021 549 527
(Dallas, Frontier) Non-Parametric Different Groups 1.311596e-22 315.0 232.5 345.159545 263.604324 549 444
(Houston, San Antonio) Non-Parametric Different Groups 1.266149e-165 216.0 41.0 244.136837 43.897021 528 527
(Houston, Frontier) Non-Parametric Different Groups 1.424168e-02 216.0 232.5 244.136837 263.604324 528 444
(San Antonio, Frontier) Non-Parametric Different Groups 2.881501e-153 41.0 232.5 43.897021 263.604324 527 444

Feature Engineering

The following features are engineered and incorporated into the model to improve prediction accuracy.

  • Time Related Features
    • Year
    • Month
    • Week
    • Day
    • Weekday
    • Day of the year
    • Day of the week
    • Holidays for the state of Texas
    • Is the saledate a weekend?
    • Is the saledate month end?
    • Is the saledate month start?
    • Is the saledate quarter end?
    • Is the saledate quarter start?
    • Is the saledate year end?
    • Is the saledate year start?
  • Lagged Features
    • sales lag for 3 days
    • sales lag for 7 days
    • sales lag for 14 days
    • sales lag for 30 days
    • sales lag for 60 days
    • sales lag for 90 days
  • Moving Average Features
    • Rolling mean
  • Hypothesis Testing: Similarity Features
    • Similar store sales
    • Similar item sales
    • Similar market sales
  • Exponentially Weighted Mean Features
# 1. Time Related Features
#####################################################
def add_datepart(df, clmn):
    df['Year'] = df[clmn].dt.year
    df['Month'] = df[clmn].dt.month
    df['Week'] = df[clmn].dt.isocalendar().week
    df['Day'] = df[clmn].dt.day
    
    wkday = []
    doty = []
    for dt in df[clmn]:
        wkday.append(dt.weekday())
        doty.append(dt.timetuple().tm_yday)
    df['DayofWeek'] = wkday
    df['DayofYear'] = doty
    
    tx = holidays.US(state = 'TX')
    is_holiday = []
    for i in df[clmn]: 
           is_holiday.append(i in tx)
            
    df['Is_holiday'] = is_holiday
    df['Is_holiday'] = df['Is_holiday'].astype(int)
    df['Is_wknd'] = df['DayofWeek'] // 4       # integer division flags Friday-Sunday (weekday 4-6) as weekend
    df['Is_month_end'] = df[clmn].dt.is_month_end.astype(int)
    df['Is_month_start'] = df[clmn].dt.is_month_start.astype(int)
    df['Is_quarter_end'] = df[clmn].dt.is_quarter_end.astype(int)
    df['Is_quarter_start'] = df[clmn].dt.is_quarter_start.astype(int)
    df['Is_year_end'] = df[clmn].dt.is_year_end.astype(int)
    df['Is_year_start'] = df[clmn].dt.is_year_start.astype(int)
#     df.drop([clmn], axis = 1, inplace = True)
    return df

new_sales = add_datepart(sales, 'SALEDATE').copy()
# Rolling Summary Stats Features
#####################################################
# Rolling mean of the previous i observations, computed within each store-item group
for i in [4, 7, 14, 30, 60, 100]:
    new_sales["sales_roll_mean_"+str(i)] = new_sales.groupby(["LOCATIONNO", "PRODUCTID"]).UNIT_GROSS_SALES.transform(
        lambda x: x.shift(1).rolling(i).mean())
# List of similar store sales and similar item sales from the non-parametric t-test.

sim_store = list(ctg_ss.index)
sim_item = list(ctg_iss.index)
# 2. Hypothesis Testing: Similarity
#####################################################

# Store Based
j = 1
for i in sim_store:
    if j == 1:
        new_sales["StoreSalesSimilarity"] = np.where(new_sales.LOCATIONNO.isin(list(i)), j, 0)
    else:
        new_sales["StoreSalesSimilarity"] = np.where(new_sales.LOCATIONNO.isin(list(i)), j, new_sales["StoreSalesSimilarity"])
    j+=1

# item Based
j = 1
for i in sim_item:
    if j == 1:
        new_sales["ItemSalesSimilarity"] = np.where(new_sales.PRODUCTID.isin(list(i)), j, 0)
    else:
        new_sales["ItemSalesSimilarity"] = np.where(new_sales.PRODUCTID.isin(list(i)), j, new_sales["ItemSalesSimilarity"])
    j+=1

# Market Based
# Austin and Houston are the only market pair found similar in the test above
new_sales["MarketSalesSimilarity"] = np.where(new_sales.MARKET.isin(['Austin', 'Houston']), 1, 0)

# 3. Lag/Shifted Features
#####################################################

new_sales.sort_values(by=['LOCATIONNO', 'PRODUCTID', 'SALEDATE'], axis=0, inplace=True)

def lag_features(dataframe, lags, groups = ['LOCATIONNO', 'PRODUCTID'], target = 'UNIT_GROSS_SALES', prefix = ''):
    dataframe = dataframe.copy()
    for lag in lags:
        dataframe[prefix + str(lag)] = dataframe.groupby(groups)[target].transform(
            lambda x: x.shift(lag))
    return dataframe
lag_sales = lag_features(new_sales, lags = [3, 7, 14, 30, 60, 90], 
                         groups = ['LOCATIONNO', 'PRODUCTID'], 
                         target = 'UNIT_GROSS_SALES', prefix = 'sales_lag_').copy()
lag_sales.head()
LOCATIONNO MARKET SALEDATE PRODUCTID UNIT_GROSS_SALES Cost Year Month Week Day DayofWeek DayofYear Is_holiday Is_wknd Is_month_end Is_month_start Is_quarter_end Is_quarter_start Is_year_end Is_year_start sales_roll_mean_4 sales_roll_mean_7 sales_roll_mean_14 sales_roll_mean_30 sales_roll_mean_60 sales_roll_mean_100 StoreSalesSimilarity ItemSalesSimilarity MarketSalesSimilarity sales_lag_3 sales_lag_7 sales_lag_14 sales_lag_30 sales_lag_60 sales_lag_90
88361 0 Austin 2020-03-24 0 1.0 1.91 2020 3 13 24 1 84 0 0 0 0 0 0 0 0 1.0 1.428571 1.285714 NaN NaN NaN 1 86 1 NaN NaN NaN NaN NaN NaN
88948 0 Austin 2020-03-25 0 3.0 1.91 2020 3 13 25 2 85 0 0 0 0 0 0 0 0 1.0 1.000000 1.285714 0.733333 1.133333 1.35 1 86 1 NaN NaN NaN NaN NaN NaN
89312 0 Austin 2020-03-26 0 3.0 1.91 2020 3 13 26 3 86 0 0 0 0 0 0 0 0 2.0 1.714286 1.500000 1.766667 1.950000 NaN 1 86 1 NaN NaN NaN NaN NaN NaN
89530 0 Austin 2020-03-27 0 1.0 1.91 2020 3 13 27 4 87 0 1 0 0 0 0 0 0 1.0 0.571429 1.071429 1.233333 1.250000 1.63 1 86 1 1.0 NaN NaN NaN NaN NaN
89919 0 Austin 2020-03-28 0 1.0 1.91 2020 3 13 28 5 88 0 1 0 0 0 0 0 0 1.5 1.714286 1.571429 1.833333 NaN NaN 1 86 1 3.0 NaN NaN NaN NaN NaN

Drop redundant lag features, keeping only those most strongly correlated with the target.

def drop_cor(dataframe, name, index):
    # Rank the columns whose name contains `name` by their correlation with the target,
    # keep the top (index - 1) of them, and drop the remaining matching columns in place.
    keep = dataframe[dataframe.columns[dataframe.columns.str.contains(name)].tolist()+["UNIT_GROSS_SALES"]].corr().UNIT_GROSS_SALES.sort_values(ascending = False).index[1:index]
    to_drop = dataframe.drop(keep, axis = 1).columns[dataframe.drop(keep, axis = 1).columns.str.contains(name)]
    dataframe.drop(to_drop, axis = 1, inplace = True)

drop_cor(lag_sales, "sales_lag", 5)
lag_sales.head()
LOCATIONNO MARKET SALEDATE PRODUCTID UNIT_GROSS_SALES Cost Year Month Week Day DayofWeek DayofYear Is_holiday Is_wknd Is_month_end Is_month_start Is_quarter_end Is_quarter_start Is_year_end Is_year_start sales_roll_mean_4 sales_roll_mean_7 sales_roll_mean_14 sales_roll_mean_30 sales_roll_mean_60 sales_roll_mean_100 StoreSalesSimilarity ItemSalesSimilarity MarketSalesSimilarity sales_lag_3 sales_lag_7 sales_lag_14 sales_lag_30
88361 0 Austin 2020-03-24 0 1.0 1.91 2020 3 13 24 1 84 0 0 0 0 0 0 0 0 1.0 1.428571 1.285714 NaN NaN NaN 1 86 1 NaN NaN NaN NaN
88948 0 Austin 2020-03-25 0 3.0 1.91 2020 3 13 25 2 85 0 0 0 0 0 0 0 0 1.0 1.000000 1.285714 0.733333 1.133333 1.35 1 86 1 NaN NaN NaN NaN
89312 0 Austin 2020-03-26 0 3.0 1.91 2020 3 13 26 3 86 0 0 0 0 0 0 0 0 2.0 1.714286 1.500000 1.766667 1.950000 NaN 1 86 1 NaN NaN NaN NaN
89530 0 Austin 2020-03-27 0 1.0 1.91 2020 3 13 27 4 87 0 1 0 0 0 0 0 0 1.0 0.571429 1.071429 1.233333 1.250000 1.63 1 86 1 1.0 NaN NaN NaN
89919 0 Austin 2020-03-28 0 1.0 1.91 2020 3 13 28 5 88 0 1 0 0 0 0 0 0 1.5 1.714286 1.571429 1.833333 NaN NaN 1 86 1 3.0 NaN NaN NaN
# 4. Last i. Months
#####################################################
lag_sales["monthyear"] = lag_sales.SALEDATE.dt.to_period('M')

# Store-Item Based
for i in [3, 6, 9, 12, 15]:
    last_months = lag_sales.groupby(["LOCATIONNO", "PRODUCTID", "monthyear"]).UNIT_GROSS_SALES.agg([
        "sum", "mean", "std", "min", "max"]).shift(i).reset_index()
    last_months.columns = ['LOCATIONNO', 'PRODUCTID', 'monthyear', 'last_'+str(i)+'months_sales_sum',
                           'last_'+str(i)+'months_sales_mean', 'last_'+str(i)+'months_sales_std',
                           'last_'+str(i)+'months_sales_min', 'last_'+str(i)+'months_sales_max']
    lag_sales = pd.merge(lag_sales, last_months, how   = "left", on = ["LOCATIONNO", "PRODUCTID", "monthyear"])
del last_months, i

drop_cor(lag_sales, "last_", 4)

# Store Based


for i in [3, 6, 9, 12]:
    last_months = lag_sales.groupby(["LOCATIONNO", "monthyear"]).UNIT_GROSS_SALES.agg([
        "sum", "mean", "std", "min", "max"]).shift(i).reset_index()
    last_months.columns = ['LOCATIONNO', 'monthyear', 'store_last_'+str(i)+'months_sales_sum',
                           'store_last_'+str(i)+'months_sales_mean', 'store_last_'+str(i)+'months_sales_std',
                           'store_last_'+str(i)+'months_sales_min', 'store_last_'+str(i)+'months_sales_max']
    lag_sales = pd.merge(lag_sales, last_months, how = "left", on = ["LOCATIONNO", "monthyear"])
del last_months, i


# Item Based
for i in [3, 6, 9, 12]:
    last_months = lag_sales.groupby(["PRODUCTID", "monthyear"]).UNIT_GROSS_SALES.agg([
        "sum", "mean", "std", "min", "max"]).shift(i).reset_index()
    last_months.columns = ['PRODUCTID', 'monthyear', 'item_last_'+str(i)+'months_sales_sum',
                           'item_last_'+str(i)+'months_sales_mean', 'item_last_'+str(i)+'months_sales_std',
                           'item_last_'+str(i)+'months_sales_min', 'item_last_'+str(i)+'months_sales_max']
    lag_sales = pd.merge(lag_sales, last_months, how = "left", on = ["PRODUCTID", "monthyear"])
del last_months, i


# Similarity Based


for i in [3, 6, 9, 12]:
    last_months = lag_sales.groupby(["StoreSalesSimilarity", "monthyear"]).UNIT_GROSS_SALES.agg([
        "sum", "mean", "std", "min", "max"]).shift(i).reset_index()
    last_months.columns = ['StoreSalesSimilarity', 'monthyear', 'storesim_last_'+str(i)+'months_sales_sum',
                           'storesim_last_'+str(i)+'months_sales_mean', 'storesim_last_'+str(i)+'months_sales_std',
                           'storesim_last_'+str(i)+'months_sales_min', 'storesim_last_'+str(i)+'months_sales_max']
    lag_sales = pd.merge(lag_sales, last_months, how = "left", on = ["StoreSalesSimilarity", "monthyear"])
del last_months, i

for i in [3, 6, 9, 12]:
    last_months = lag_sales.groupby(["ItemSalesSimilarity", "monthyear"]).UNIT_GROSS_SALES.agg([
        "sum", "mean", "std", "min", "max"]).shift(i).reset_index()
    last_months.columns = ['ItemSalesSimilarity', 'monthyear', 'itemsim_last_'+str(i)+'months_sales_sum',
                           'itemsim_last_'+str(i)+'months_sales_mean', 'itemsim_last_'+str(i)+'months_sales_std',
                           'itemsim_last_'+str(i)+'months_sales_min', 'itemsim_last_'+str(i)+'months_sales_max']
    lag_sales = pd.merge(lag_sales, last_months, how = "left", on = ["ItemSalesSimilarity", "monthyear"])
del last_months, i

lag_sales.drop("monthyear", axis = 1, inplace = True)
# 5. Last i. day of week
#####################################################
lag_sales.sort_values(["LOCATIONNO", "PRODUCTID", "DayofWeek", "SALEDATE"], inplace = True)

lag_sales = lag_features(lag_sales, lags = np.arange(12,41, 1).tolist()+[3, 7, 60, 90],
                  groups = ["LOCATIONNO", "PRODUCTID", "DayofWeek"], target = 'UNIT_GROSS_SALES', prefix = 'dayofweek_sales_lag_')

lag_sales[lag_sales.columns[lag_sales.columns.str.contains("dayofweek_sales_lag_")].tolist()+["UNIT_GROSS_SALES"]].corr().UNIT_GROSS_SALES.sort_values(ascending = False)

drop_cor(lag_sales, "dayofweek_sales_lag_", 5)

lag_sales.sort_values(["LOCATIONNO", "PRODUCTID", "SALEDATE"], inplace = True)

#####################################################
# Exponentially Weighted Mean Features
#####################################################
def ewm_features(dataframe, alphas, lags):
    dataframe = dataframe.copy()
    for alpha in alphas:
        for lag in lags:
            dataframe['sales_ewm_alpha_' + str(alpha).replace(".", "") + "_lag_" + str(lag)] = \
                dataframe.groupby(["LOCATIONNO", "PRODUCTID"])['UNIT_GROSS_SALES']. \
                    transform(lambda x: x.shift(lag).ewm(alpha=alpha).mean())
    return dataframe

alphas = [0.95, 0.9, 0.8, 0.7, 0.5]
lags = [3, 7, 14, 30, 60, 90]

df = ewm_features(lag_sales, alphas, lags).copy()

df.shape
(290014, 150)
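
The lag, rolling, and exponentially weighted features necessarily introduce NaNs at the start of each store-item history. No imputation is applied in this notebook, since LightGBM handles missing values natively; a quick check of how much is missing:

# Fraction of missing values per column, highest first
print(df.isna().mean().sort_values(ascending=False).head(10))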

Train-Validation Split

As in most model-building processes, the data is split into training and validation sets. Since this is time series data, the split is chronological: the months of January and February 2021 form the validation set, and any data after February 2021 is set aside as the test set. The test set is only used at the end to check the performance of the final model.

# Dataframe must be sorted by date because of Time Series Split
df = df.sort_values("SALEDATE").reset_index(drop = True)
df['Week'] = df['Week'].astype(int)
df = pd.get_dummies(df, columns = ['MARKET'])

# Train Validation Split
# Validation set includes 2 months (Jan, Feb)
train = df.loc[(df["SALEDATE"] < "2021-01-01"), :]
val = df.loc[(df["SALEDATE"] >= "2021-01-01") & (df["SALEDATE"] < "2021-03-01"), :]
test = df.loc[(df["SALEDATE"] >= "2021-03-01")] # Any data after Feb 28 is set aside as test set.

cols = [col for col in train.columns if col not in ['SALEDATE', "UNIT_GROSS_SALES", "Year"]]

Y_train = train['UNIT_GROSS_SALES']
X_train = train[cols]

Y_val = val['UNIT_GROSS_SALES']
X_val = val[cols]

Y_train.shape, X_train.shape, Y_val.shape, X_val.shape
((217822,), (217822, 151), (44368,), (44368, 151))

Custom Cost Function

Symmetric mean absolute percentage error (SMAPE) is used as the cost function:

$$\mathrm{SMAPE} = \frac{100\%}{n} \sum_{t=1}^{n} \frac{|F_t - A_t|}{(|A_t| + |F_t|)/2}$$

where $A_t$ is the actual value and $F_t$ is the forecast value. The absolute difference between $A_t$ and $F_t$ is divided by half the sum of the absolute values of the actual value $A_t$ and the forecast value $F_t$. This quantity is summed over every fitted point $t$ and divided by the number of fitted points $n$.

# SMAPE: Symmetric mean absolute percentage error (adjusted MAPE)
def smape(preds, target):
    n = len(preds)
    masked_arr = ~((preds == 0) & (target == 0))
    preds, target = preds[masked_arr], target[masked_arr]
    num = np.abs(preds-target)
    denom = np.abs(preds)+np.abs(target)
    smape_val = (200*np.sum(num/denom))/n
    return smape_val

def lgbm_smape(y_true, y_pred):
    smape_val = smape(y_true, y_pred)
    return 'SMAPE', smape_val, False
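
A tiny sanity check of the metric on made-up numbers confirms the scaling (SMAPE ranges from 0 to 200):

# Per-point terms are 200*|2-4|/(2+4) and 200*|5-5|/(5+5), averaged over n = 2 points
preds  = np.array([2.0, 5.0])
actual = np.array([4.0, 5.0])
print(smape(preds, actual))   # ~33.33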

First Model

After completing the feature engineering phase, a first model is built with default parameters and all of the features, and its performance is analysed. The model considered here is LightGBM (LGBM), chosen for the advantages listed below.

LightGBM is a gradient boosting framework that uses tree based learning algorithms. It is designed to be distributed and efficient with the following advantages:

  • Faster training speed and higher efficiency
  • Lower memory usage
  • Better accuracy
  • Capability to handle large-scale data.
first_model = lgb.LGBMRegressor(random_state=384).fit(X_train, Y_train, 
                                                      eval_metric= lambda y_true, y_pred: [lgbm_smape(y_true, y_pred)])

print("TRAIN SMAPE:", smape(Y_train, first_model.predict(X_train)))
print("VALID SMAPE:", smape(Y_val, first_model.predict(X_val)))
TRAIN SMAPE: 69.37855260588468
VALID SMAPE: 96.27532578267159
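
The gap between the training and validation SMAPE suggests the default model overfits. A natural next step is hyperparameter tuning; the sklearn utilities imported at the top (TimeSeriesSplit, RandomizedSearchCV, make_scorer) can be combined with the SMAPE function for this. The sketch below is only illustrative and its parameter grid is hypothetical, not the settings used in this project.

# Hypothetical hyperparameter search sketch; the fit is left commented out
tscv = TimeSeriesSplit(n_splits=3)
smape_scorer = make_scorer(smape, greater_is_better=False)

param_dist = {
    "num_leaves": [31, 63, 127],
    "learning_rate": [0.01, 0.05, 0.1],
    "n_estimators": [200, 500, 1000],
    "min_child_samples": [10, 20, 40],
}

search = RandomizedSearchCV(
    lgb.LGBMRegressor(random_state=384),
    param_distributions=param_dist,
    n_iter=10,
    cv=tscv,
    scoring=smape_scorer,
    random_state=384,
)
# search.fit(X_train, Y_train)
# print(search.best_params_, -search.best_score_)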

Feature Importance

def plot_lgb_importances(model, plot=False, num=10):
    from matplotlib import pyplot as plt
    import seaborn as sns
    
    
    
    # LGBM API
    #gain = model.feature_importance('gain')
    #feat_imp = pd.DataFrame({'feature': model.feature_name(),
    #                         'split': model.feature_importance('split'),
    #                         'gain': 100 * gain / gain.sum()}).sort_values('gain', ascending=False)
    
    # SKLEARN API
    gain = model.booster_.feature_importance(importance_type='gain')
    feat_imp = pd.DataFrame({'feature': model.feature_name_,
                             'split': model.booster_.feature_importance(importance_type='split'),
                             'gain': 100 * gain / gain.sum()}).sort_values('gain', ascending=False)
    if plot:
        plt.figure(figsize=(10, 10))
        sns.set_context("talk", font_scale=1)
        sns.set_style("whitegrid")
        sns.barplot(x="gain", y="feature", data=feat_imp[0:25], color = 'lightblue')
        plt.title('feature')
        plt.tight_layout()
        plt.show()
    else:
        print(feat_imp.head(num))
        return feat_imp

feature_imp_df = plot_lgb_importances(first_model, num=10)
                           feature  split       gain
141       sales_ewm_alpha_05_lag_7     78  31.387042
140       sales_ewm_alpha_05_lag_3    100  22.170319
142      sales_ewm_alpha_05_lag_14     51   9.177836
6                        DayofWeek    267   5.746324
114          dayofweek_sales_lag_3     76   2.918399
7                        DayofYear    190   2.184596
21             sales_roll_mean_100    204   1.659234
20              sales_roll_mean_60    173   1.539139
5                              Day    165   1.366325
43   store_last_9months_sales_mean      7   1.329502
feature_imp_df.shape, feature_imp_df[feature_imp_df.gain > 0].shape, feature_imp_df[feature_imp_df.gain > 0.57].shape
((151, 3), (133, 3), (21, 3))
plot_lgb_importances(first_model, plot=True)

Shap

SHAP (SHapley Additive exPlanations) is a game theoretic approach to explain the output of any machine learning model. It connects optimal credit allocation with local explanations using the classic Shapley values from game theory and their related extensions.

sns.set_context('talk', font_scale = 1.4)
explainer = shap.Explainer(first_model)
shap_values_train = explainer(X_train)
shap_values_valid = explainer(X_val)

len(shap_values_train), len(shap_values_valid)
(217822, 44368)

A beeswarm summary plot

The beeswarm plot is designed to display an information-dense summary of how the top features in a dataset impact the model’s output. Each instance of the given explanation is represented by a single dot on each feature row. The x position of the dot is determined by the SHAP value (shap_values.values[instance, feature]) of that feature, and dots “pile up” along each feature row to show density. Color is used to display the original value of a feature (shap_values.data[instance, feature]). In the plots below we can see that the sales lag for 3 days is the most important feature on average for the training data; for the validation data, the sales lag for 7 days is the most important feature on average.

# summarize the effects of all the features
sns.set_context('talk', font_scale = 1.4)
shap.plots.beeswarm(shap_values_train, max_display=30)

# summarize the effects of all the features
sns.set_context('talk', font_scale = 1.4)
shap.plots.beeswarm(shap_values_valid, max_display=30)

shap.plots.bar(shap_values_train, max_display=30)

Error Analysis

error = pd.DataFrame({
    "date":val.SALEDATE,
    "store":X_val.LOCATIONNO,
    "item":X_val.PRODUCTID,
    "actual":Y_val,
    "pred":first_model.predict(X_val)
}).reset_index(drop = True)

error["error"] = np.abs(error.actual-error.pred)

error.sort_values("error", ascending=False).head(20)
date store item actual pred error
38403 2021-02-21 30 41 85.0 11.357431 73.642569
38456 2021-02-21 30 37 57.0 9.129747 47.870253
38569 2021-02-21 30 3 59.0 13.590359 45.409641
38439 2021-02-21 4 41 36.0 5.282435 30.717565
18216 2021-01-25 30 42 43.0 13.941786 29.058214
39840 2021-02-22 30 42 38.0 9.921250 28.078750
43925 2021-02-28 30 41 44.0 16.407923 27.592077
14008 2021-01-19 30 42 32.0 8.389453 23.610547
38932 2021-02-21 4 37 31.0 7.826655 23.173345
38563 2021-02-21 30 43 31.0 8.182316 22.817684
39088 2021-02-21 36 3 29.0 7.635129 21.364871
41019 2021-02-24 30 3 26.0 4.827231 21.172769
29090 2021-02-08 30 41 32.0 11.175341 20.824659
22973 2021-01-31 12 42 30.0 9.534384 20.465616
40879 2021-02-24 30 27 25.0 4.890703 20.109297
8366 2021-01-12 30 27 29.0 9.047829 19.952171
23307 2021-01-31 30 41 35.0 15.066269 19.933731
28965 2021-02-08 4 37 26.0 6.205603 19.794397
27421 2021-02-06 30 27 27.0 7.547040 19.452960
35710 2021-02-17 32 41 25.0 5.810368 19.189632
error[["actual", "pred", "error"]].describe([0.7, 0.8, 0.9, 0.95, 0.99]).T
count mean std min 50% 70% 80% 90% 95% 99% max
actual 44368.0 2.584691 3.101834 0.000000 2.000000 3.000000 4.000000 6.000000 8.000000 14.000000 85.000000
pred 44368.0 2.350267 1.485043 0.101703 2.007167 2.537990 2.980707 3.834862 4.929053 8.650099 18.277589
error 44368.0 1.864230 1.917119 0.000021 1.405939 2.089352 2.657121 3.826397 5.232841 9.339638 73.642569

Waterfall plots are designed to display explanations for individual predictions, so they expect a single row of an Explanation object as input. The bottom of a waterfall plot starts as the expected value of the model output, and then each row shows how the positive (red) or negative (blue) contribution of each feature moves the value from the expected model output over the background dataset to the model output for this prediction.

sns.set_context('talk', font_scale = 1.4)
shap.plots.waterfall(shap_values_valid[30125])

A force plot shows how much a given feature contributes to the prediction, compared to making that prediction at some baseline value of that feature. Force plots are interactive; since GitHub had difficulty rendering them, I have uploaded pictures of the plots instead.

# visualize the first prediction's explanation with a force plot
shap.initjs()
shap.plots.force(shap_values_valid[30125])

# visualize the explanation for all of the predictions with a force plot;
# to avoid overwhelming the browser, only the first 100 rows are passed
valid_shap = explainer.shap_values(X_val)
shap.initjs()
shap.plots.force(explainer.expected_value, valid_shap[:100, :], X_val.iloc[:100, :])

# Mean Absolute Error
error.groupby(["store", "item"]).error.mean().sort_values(ascending = False)
store  item
30     41      8.796683
       3       7.730369
       27      7.417767
       42      7.331254
       37      6.566694
       1       6.551101
       38      6.450013
       8       5.688377
       34      5.215296
       11      4.847538
       43      4.786728
12     42      4.604071
       3       4.430166
30     13      4.366960
       40      4.349082
0      34      4.321580
       42      4.259871
30     39      4.251495
4      37      4.141325
       8       4.049376
33     3       3.985145
4      42      3.982138
12     8       3.964201
       27      3.933068
0      11      3.931204
       3       3.796247
       1       3.767996
12     34      3.690778
0      27      3.663062
30     44      3.632818
12     41      3.628897
4      3       3.591541
0      8       3.581333
4      41      3.575084
24     1       3.554002
0      41      3.517047
1      3       3.429900
4      38      3.424165
       1       3.372646
       27      3.366829
36     3       3.359900
33     8       3.340034
       1       3.331595
30     45      3.328554
14     41      3.318947
27     3       3.310898
39     42      3.283005
33     38      3.237391
39     1       3.190325
17     27      3.186373
32     41      3.174905
33     27      3.141269
4      13      3.132354
12     38      3.108171
       11      3.100573
19     3       3.097261
39     27      3.074528
24     42      3.071504
12     13      3.059989
0      44      2.998813
33     11      2.953767
42     1       2.942637
12     1       2.926406
14     38      2.910530
0      43      2.908644
2      13      2.896304
33     34      2.895418
42     13      2.895023
4      39      2.891847
33     41      2.886471
39     3       2.878725
0      37      2.872639
12     44      2.863298
15     41      2.859410
27     43      2.835657
14     34      2.831470
12     43      2.803658
2      42      2.795462
19     38      2.767300
17     8       2.758636
12     40      2.756067
14     42      2.754917
17     1       2.754808
       34      2.743868
33     44      2.742895
24     43      2.742398
8      41      2.708218
18     1       2.701318
17     38      2.678119
1      27      2.665648
14     37      2.664356
3      41      2.661593
42     3       2.646940
19     27      2.634222
27     42      2.632597
       27      2.621078
2      8       2.599549
16     1       2.599387
14     8       2.591194
32     11      2.586096
0      38      2.585180
32     37      2.580516
15     37      2.577988
19     41      2.577918
24     38      2.572516
2      27      2.558549
0      13      2.553786
24     13      2.549550
32     8       2.541285
7      38      2.538866
39     37      2.534707
       41      2.533105
36     41      2.526091
       42      2.524357
4      43      2.523554
24     27      2.517807
12     39      2.514400
       37      2.510257
32     3       2.495721
7      3       2.472591
32     27      2.472559
42     37      2.471904
27     1       2.468068
33     39      2.466154
16     41      2.456669
27     41      2.455888
       13      2.454431
17     44      2.451902
42     41      2.436222
24     41      2.427852
26     3       2.425128
15     42      2.423113
33     37      2.413726
2      1       2.412799
17     3       2.410413
0      45      2.398012
24     3       2.396723
32     42      2.390878
1      1       2.364970
7      41      2.363477
24     8       2.363074
2      3       2.360506
19     42      2.349965
24     37      2.349415
19     37      2.346062
18     11      2.344202
32     38      2.340597
7      37      2.332521
4      34      2.330637
33     42      2.328101
32     34      2.319122
14     27      2.315701
16     3       2.286446
15     27      2.285725
42     27      2.272444
1      8       2.258241
36     39      2.254372
21     42      2.237494
27     38      2.237187
16     27      2.231148
18     3       2.228452
42     11      2.226240
19     1       2.223807
36     37      2.216599
39     11      2.212751
3      3       2.209608
44     3       2.205291
19     8       2.202903
26     8       2.193774
1      44      2.187075
17     13      2.183657
25     1       2.183614
1      37      2.175282
21     38      2.171630
39     43      2.163067
3      37      2.157286
19     34      2.152494
1      11      2.149056
15     3       2.147725
17     43      2.134811
27     11      2.134671
1      38      2.134429
15     38      2.134370
8      42      2.130037
19     11      2.126680
36     27      2.125526
42     42      2.116560
7      1       2.115409
32     13      2.114694
1      42      2.113248
2      43      2.112902
42     38      2.108225
18     41      2.104434
16     42      2.101756
20     8       2.094922
23     3       2.089880
38     1       2.088971
18     42      2.083530
27     8       2.076874
44     43      2.066242
45     37      2.066145
15     1       2.066116
21     13      2.064104
39     8       2.063827
28     41      2.062566
0      40      2.060275
4      11      2.059599
17     42      2.058140
3      42      2.057786
38     3       2.054505
8      37      2.051368
43     42      2.041615
40     37      2.038056
33     13      2.035505
14     3       2.033955
8      3       2.033846
22     42      2.033555
36     13      2.029968
24     34      2.029384
16     38      2.027139
1      43      2.024892
26     41      2.024144
16     8       2.022475
11     41      2.012826
26     37      2.008662
       13      2.003392
44     1       1.999507
5      42      1.997661
1      13      1.996955
44     44      1.996313
10     41      1.989540
40     3       1.987509
25     42      1.984760
26     1       1.984527
1      41      1.978309
33     43      1.973729
17     41      1.973248
26     34      1.973114
41     27      1.970733
21     37      1.962348
3      1       1.960896
23     8       1.960636
44     42      1.957164
3      38      1.956553
42     43      1.949328
18     37      1.949230
39     13      1.948035
44     41      1.945350
20     41      1.945303
18     34      1.940320
44     8       1.940279
17     37      1.935355
33     40      1.933901
3      43      1.933638
25     3       1.932040
24     40      1.927933
6      42      1.927292
22     8       1.926427
39     44      1.917073
4      45      1.915448
26     27      1.915327
42     40      1.912418
27     34      1.912250
42     8       1.911520
3      8       1.909868
19     43      1.909620
39     38      1.908371
2      37      1.907649
18     27      1.902842
6      41      1.901781
3      11      1.896718
36     1       1.889901
44     27      1.886836
1      39      1.881931
36     8       1.873640
14     43      1.872762
26     38      1.871297
45     44      1.865992
39     39      1.865921
44     13      1.864748
27     45      1.863641
42     44      1.862351
21     1       1.859969
41     1       1.858548
15     34      1.858174
25     8       1.857868
2      34      1.857595
0      39      1.856275
20     34      1.851778
32     40      1.848275
24     44      1.845841
43     3       1.843875
31     41      1.843710
19     13      1.842873
17     11      1.842854
3      34      1.839337
21     43      1.838972
8      38      1.835077
43     13      1.824492
18     38      1.820752
25     41      1.820603
2      41      1.814130
14     39      1.813962
3      13      1.813783
20     13      1.813201
35     34      1.811165
20     27      1.810796
10     27      1.803908
14     44      1.801503
2      38      1.798955
6      34      1.798940
42     39      1.798875
45     27      1.796934
21     11      1.796591
36     11      1.793527
35     38      1.790625
8      34      1.790520
5      38      1.790312
21     39      1.787890
31     38      1.787751
7      8       1.783308
14     1       1.783071
24     39      1.781381
15     8       1.776310
9      41      1.768692
25     13      1.761052
10     3       1.760061
16     11      1.753975
28     37      1.752716
36     38      1.751559
6      1       1.748578
7      34      1.744266
38     37      1.743989
39     40      1.741437
18     13      1.737523
42     34      1.736864
2      40      1.735879
10     34      1.732812
32     1       1.730147
8      1       1.729259
3      27      1.728229
32     39      1.727863
16     44      1.726095
20     1       1.725890
40     42      1.724510
1      34      1.721047
25     34      1.710974
19     44      1.710354
26     44      1.707571
44     37      1.707304
32     44      1.705164
21     8       1.697304
38     27      1.693728
22     43      1.692139
45     38      1.691840
28     8       1.691809
20     42      1.690213
41     41      1.690204
45     11      1.688956
21     41      1.687332
15     43      1.683894
10     8       1.681712
46     27      1.678961
33     45      1.677837
22     13      1.669968
21     27      1.669549
37     3       1.667952
29     1       1.664966
39     34      1.664316
19     39      1.662362
36     43      1.660920
35     41      1.658878
38     11      1.658647
20     44      1.655116
7      42      1.653392
4      44      1.650880
35     42      1.649943
41     37      1.643977
22     27      1.642987
41     42      1.641927
14     13      1.639081
35     3       1.637331
28     42      1.635993
26     42      1.635693
36     34      1.633226
21     3       1.631875
23     27      1.629565
13     42      1.629360
41     3       1.629063
6      38      1.628237
2      11      1.625523
46     42      1.625001
39     45      1.624650
31     34      1.624203
43     8       1.622206
45     3       1.617795
2      44      1.616223
5      1       1.615027
       3       1.611376
17     45      1.609809
6      3       1.608170
16     13      1.605829
17     39      1.605315
10     42      1.605111
44     34      1.604073
26     40      1.602739
8      8       1.601100
20     3       1.600886
37     42      1.599365
46     1       1.599363
7      11      1.595823
19     40      1.595275
25     37      1.594842
3      39      1.594068
27     44      1.592834
28     38      1.591825
4      40      1.591698
15     40      1.589640
40     11      1.589380
43     1       1.588752
8      13      1.588291
       27      1.587870
20     37      1.586954
40     27      1.585805
       1       1.582457
35     43      1.580955
44     39      1.580266
6      37      1.580047
10     40      1.571123
23     1       1.570028
38     43      1.569349
8      43      1.566729
27     37      1.566515
45     41      1.563603
21     44      1.563534
27     39      1.558764
29     42      1.557502
45     42      1.557341
18     39      1.555327
17     40      1.553271
18     44      1.553210
26     11      1.552130
36     45      1.551781
31     37      1.548940
10     38      1.542138
25     38      1.536770
3      45      1.533392
14     11      1.533245
11     42      1.533169
41     13      1.531229
22     41      1.531058
6      27      1.530885
31     27      1.529504
22     3       1.528377
25     44      1.526525
44     38      1.522776
15     13      1.522659
23     42      1.522417
18     40      1.516749
35     27      1.516605
26     39      1.516518
9      37      1.515547
27     40      1.514145
18     43      1.506296
26     43      1.501951
22     34      1.501277
9      8       1.501055
31     8       1.497446
36     44      1.494615
46     8       1.488384
37     27      1.485193
20     39      1.482098
13     38      1.480978
45     40      1.477098
31     3       1.476729
5      37      1.476702
15     39      1.473841
20     43      1.472823
25     40      1.472325
43     41      1.471055
25     43      1.470083
       11      1.469726
29     38      1.467283
35     44      1.465114
19     45      1.461398
29     13      1.459881
38     42      1.459658
45     1       1.456833
43     43      1.455565
6      8       1.454127
13     41      1.453342
20     11      1.453098
41     8       1.452080
13     8       1.441653
11     3       1.439156
25     27      1.435201
24     11      1.434468
18     8       1.434382
9      38      1.433600
28     3       1.433528
31     13      1.425882
16     43      1.425030
5      34      1.419403
24     45      1.417889
32     45      1.413463
14     45      1.413133
28     1       1.409939
32     43      1.409382
36     40      1.408973
40     38      1.408733
       41      1.406929
38     8       1.406324
35     1       1.404390
8      40      1.403327
46     39      1.397547
13     1       1.396676
35     8       1.393965
15     44      1.392225
46     41      1.390153
38     38      1.386612
46     43      1.384864
9      34      1.383108
34     27      1.383070
11     38      1.382509
6      44      1.381311
21     34      1.381135
11     1       1.380937
6      45      1.380672
41     44      1.379753
7      27      1.377542
10     44      1.376408
45     34      1.372226
22     37      1.371057
11     27      1.370857
       34      1.364955
42     45      1.364858
23     37      1.364523
37     38      1.364031
44     11      1.362768
28     27      1.356454
16     34      1.355161
23     38      1.351869
43     11      1.350829
3      44      1.349670
10     11      1.347824
7      13      1.346960
10     13      1.346377
5      41      1.346221
13     27      1.344241
       3       1.343940
43     27      1.343367
44     40      1.341989
46     3       1.337838
13     34      1.335442
38     40      1.331371
20     38      1.331275
1      40      1.325123
46     37      1.319289
29     41      1.318189
26     45      1.317941
35     37      1.316348
15     11      1.310689
28     43      1.307173
23     13      1.303640
15     45      1.303377
12     45      1.301729
35     13      1.300401
21     45      1.300292
14     40      1.298674
41     43      1.297476
10     1       1.297088
20     40      1.290303
40     39      1.288799
46     13      1.288330
10     39      1.286495
9      27      1.286425
8      44      1.285620
44     45      1.285025
22     40      1.280695
5      13      1.279332
29     37      1.278220
13     13      1.277620
31     44      1.277343
5      40      1.275691
2      39      1.275454
8      11      1.273474
23     41      1.272285
1      45      1.271653
16     37      1.271520
41     11      1.267283
31     42      1.265529
23     34      1.265441
46     11      1.263586
10     37      1.263164
22     44      1.262131
41     38      1.261945
40     43      1.261918
37     43      1.261241
40     13      1.260632
37     41      1.258518
11     11      1.254279
13     45      1.251213
8      39      1.244513
43     37      1.241200
38     39      1.240999
2      45      1.240259
22     1       1.235999
45     43      1.235887
13     44      1.229849
41     39      1.229810
37     11      1.228697
40     8       1.223290
23     11      1.220861
41     40      1.220079
22     38      1.217430
40     44      1.215226
5      11      1.211048
11     13      1.209994
35     39      1.209714
40     34      1.207613
23     39      1.205320
7      44      1.203008
6      40      1.199670
       43      1.199251
34     38      1.190904
6      11      1.190330
29     39      1.189879
22     45      1.186641
34     1       1.186173
25     45      1.180808
11     40      1.180772
28     34      1.179983
38     44      1.179506
11     44      1.176823
43     44      1.175224
31     11      1.169490
28     11      1.168184
38     13      1.167285
35     40      1.161437
29     43      1.159207
8      45      1.157875
29     40      1.156590
9      1       1.153502
16     39      1.149489
46     44      1.148087
21     40      1.146468
18     45      1.146364
6      13      1.144507
10     43      1.138214
28     40      1.135324
23     43      1.129761
41     34      1.121338
38     41      1.121165
7      40      1.120401
16     40      1.118409
23     40      1.115234
38     45      1.114810
5      8       1.113851
20     45      1.111943
5      27      1.102829
13     37      1.101488
       40      1.100397
43     34      1.100327
37     13      1.100285
34     42      1.098209
28     44      1.095334
11     37      1.094298
45     13      1.093022
28     39      1.092296
38     34      1.092063
31     1       1.087255
28     13      1.086889
3      40      1.083108
37     1       1.082157
9      3       1.081618
29     11      1.078017
5      44      1.074451
45     45      1.072343
       39      1.069386
34     11      1.066682
46     38      1.065999
29     27      1.065617
34     34      1.065393
43     38      1.062559
37     37      1.059753
7      45      1.059462
6      39      1.058750
7      43      1.058048
29     8       1.055055
25     39      1.053514
13     11      1.052914
29     34      1.050733
10     45      1.049201
31     39      1.048510
16     45      1.046897
37     44      1.038964
29     3       1.038124
46     34      1.037908
29     45      1.037392
22     11      1.034675
7      39      1.032651
23     44      1.025163
45     8       1.020356
35     45      1.011467
46     40      1.010654
37     34      1.009384
34     13      1.004393
46     45      1.003513
29     44      0.995035
37     8       0.993700
31     40      0.987735
34     8       0.982164
40     40      0.980587
41     45      0.980504
31     45      0.970299
37     40      0.967668
9      13      0.953869
34     37      0.950879
13     39      0.948704
28     45      0.946098
13     43      0.936056
31     43      0.929694
37     39      0.929275
23     45      0.928652
34     41      0.927818
40     45      0.917975
35     11      0.905323
37     45      0.903277
9      42      0.902879
34     44      0.901609
       43      0.897353
22     39      0.888240
11     8       0.887651
5      43      0.865483
       39      0.857759
34     45      0.853622
       40      0.847206
9      39      0.844469
       11      0.843750
5      45      0.842548
43     45      0.832047
9      44      0.808320
43     40      0.807350
34     3       0.802921
43     39      0.792631
34     39      0.789006
11     39      0.730254
9      43      0.719432
11     45      0.704662
9      40      0.702955
11     43      0.697794
9      45      0.670643
Name: error, dtype: float64
# Mean Absolute Error
error.groupby(["item"]).error.mean().sort_values(ascending = False)
item
3     2.300384
41    2.237997
42    2.224660
27    2.125854
1     2.119102
8     1.999352
38    1.992689
37    1.982540
34    1.842083
13    1.798909
11    1.722505
43    1.682012
44    1.597961
39    1.484561
40    1.441437
45    1.275640
Name: error, dtype: float64
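Since the model is scored on SMAPE, the error can also be broken down per item with the same metric. The short sketch below is not part of the original analysis; it simply reuses the smape helper defined earlier on the validation error table:

# SMAPE per item on the validation set, worst first
item_smape = (error.groupby("item")
                   .apply(lambda g: smape(g.pred.values, g.actual.values))
                   .sort_values(ascending=False))
print(item_smape.head(10))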
# Store 1 Actual - Pred
sns.set_context("talk", font_scale=1.4)
sub = error[error.store == 1].set_index("date")
fig, axes = plt.subplots(8, 6, figsize=(30, 40))
for i in range(46):
    ax = axes[i // 6, i % 6]  # lay the 46 items out on an 8 x 6 grid of panels
    sub[sub.item == i].actual.plot(ax=ax, legend=True, label="Item " + str(i) + " Sales")
    sub[sub.item == i].pred.plot(ax=ax, legend=True, label="Item " + str(i) + " Pred", linestyle="dashed")

plt.tight_layout(pad=3)
plt.suptitle("Store 1 Items");
plt.show()

sns.set_context('talk', font_scale = 1.4)
fig, axes = plt.subplots(4, 2, figsize = (20,20))
for axi in axes.flat:
    axi.ticklabel_format(style="sci", axis="y", scilimits=(0,10))
    axi.ticklabel_format(style="sci", axis="x", scilimits=(0,10))
    axi.get_yaxis().set_major_formatter(plt.FuncFormatter(lambda x, loc: "{:,}".format(int(x))))
    axi.get_xaxis().set_major_formatter(plt.FuncFormatter(lambda x, loc: "{:,}".format(int(x))))
    
(error.actual-error.pred).hist(ax = axes[0, 0], color = "steelblue", bins = 20)
error.error.hist(ax = axes[0,1], color = "steelblue", bins = 20)
sr = error.copy()
sr["StandardizedR"] = (sr.error / (sr.actual-sr.pred).std())
sr["StandardizedR2"] = ((sr.error / (sr.actual-sr.pred).std())**2)
sr.plot.scatter(x = "pred",y = "StandardizedR", color = "red", ax = axes[1,0])
sr.plot.scatter(x = "pred",y = "StandardizedR2", color = "red", ax = axes[1,1])
error.actual.hist(ax = axes[2, 0], color = "purple", bins = 20)
error.pred.hist(ax = axes[2, 1], color = "purple", bins = 20)
error.plot.scatter(x = "actual",y = "pred", color = "seagreen", ax = axes[3,0]);

# QQ Plot
import statsmodels.api as sm
sm.qqplot(sr.pred, ax = axes[3,1], c = "seagreen")
plt.suptitle("ERROR ANALYSIS", fontsize = 20)
axes[0,0].set_title("Error Histogram", fontsize = 15)
axes[0,1].set_title("Absolute Error Histogram", fontsize = 15)
axes[1,0].set_title("Standardized Residuals & Fitted Values", fontsize = 15)
axes[1,1].set_title("Standardized Residuals^2 & Fitted Values", fontsize = 15)
axes[2,0].set_title("Actual Histogram", fontsize = 15)
axes[2,1].set_title("Pred Histogram", fontsize = 15);
axes[3,0].set_title("Actual Pred Relationship", fontsize = 15);
axes[3,1].set_title("QQ Plot", fontsize = 15);
axes[1,0].set_xlabel("Fitted Values (Pred)", fontsize = 12)
axes[1,1].set_xlabel("Fitted Values (Pred)", fontsize = 12)
axes[3,0].set_xlabel("Actual", fontsize = 12)
axes[1,0].set_ylabel("Standardized Residuals", fontsize = 12)
axes[1,1].set_ylabel("Standardized Residuals^2", fontsize = 12)
axes[3,0].set_ylabel("Pred", fontsize = 12)
fig.tight_layout(pad=3.0)
plt.savefig("errors.png")
plt.show()

Next Model

Default Parameters & Feature Selection with LGBM Feature Importance

# First model feature importance
cols = feature_imp_df[feature_imp_df.gain > 0.015].feature.tolist()
print("Independent Variables:", len(cols))

second_model = lgb.LGBMRegressor(random_state=384).fit(
    X_train[cols], Y_train, 
    eval_metric = lambda y_true, y_pred: [lgbm_smape(y_true, y_pred)])

print("TRAIN SMAPE:", smape(Y_train, second_model.predict(X_train[cols])))
print("VALID SMAPE:", smape(Y_val, second_model.predict(X_val[cols])))
Independent Variables: 117
TRAIN SMAPE: 69.48193950896932
VALID SMAPE: 95.93174397344767
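The 0.015 gain cutoff is one reasonable choice; another option is a cumulative-gain rule. The sketch below is only an illustration (the 99% cutoff is an arbitrary assumption, not part of the original analysis): it keeps the smallest set of features that together account for roughly 99% of the total gain.

# Hypothetical alternative: keep features covering ~99% of the total gain
imp_sorted = feature_imp_df.sort_values("gain", ascending=False).reset_index(drop=True)
imp_sorted["cum_gain"] = imp_sorted["gain"].cumsum()   # 'gain' is already expressed in percent
cols_cum = imp_sorted.loc[imp_sorted["cum_gain"] <= 99, "feature"].tolist()
print("Features kept by the cumulative-gain rule:", len(cols_cum))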
lgbm_params = {
    "num_leaves": [20, 31],              # Default 31
    "max_depth": [-1, 20, 30],           # Default -1
    "learning_rate": [0.1, 0.05],        # Default 0.1
    "n_estimators": [10000, 15000],      # Default 100
    "min_split_gain": [0.0, 2, 5],       # Default 0
    "min_child_samples": [10, 20, 30],   # Default 20
    "colsample_bytree": [0.5, 0.8, 1.0], # Default 1
    "reg_alpha": [0.0, 0.5, 1],          # Default 0
    "reg_lambda": [0.0, 0.5, 1]          # Default 0
}

model = lgb.LGBMRegressor(random_state=384)

tscv = TimeSeriesSplit(n_splits=3)
rsearch = RandomizedSearchCV(model, lgbm_params, random_state=384, cv=tscv,
                             scoring=make_scorer(smape, greater_is_better=False),  # SMAPE is minimized
                             verbose=True, n_jobs=-1).fit(X_train[cols], Y_train)

print(rsearch.best_params_)
Fitting 3 folds for each of 10 candidates, totalling 30 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  30 out of  30 | elapsed: 142.9min finished


{'reg_lambda': 0.0, 'reg_alpha': 1, 'num_leaves': 20, 'n_estimators': 10000, 'min_split_gain': 0.0, 'min_child_samples': 20, 'max_depth': 20, 'learning_rate': 0.1, 'colsample_bytree': 1.0}
model_tuned = lgb.LGBMRegressor(**rsearch.best_params_, random_state=384).fit(X_train[cols], Y_train)

print("TRAIN SMAPE:", smape(Y_train, model_tuned.predict(X_train[cols])))
print("VALID SMAPE:", smape(Y_val, model_tuned.predict(X_val[cols])))
TRAIN SMAPE: 48.433166801003026
VALID SMAPE: 97.24837091281393
model_tuned2 = lgb.LGBMRegressor(num_leaves=31, n_estimators=15000, max_depth=20, random_state=384, metric = "custom")      
model_tuned2.fit(
    X_train[cols], Y_train,
    eval_metric= lambda y_true, y_pred: [lgbm_smape(y_true, y_pred)],
    eval_set = [(X_train[cols], Y_train), (X_val[cols], Y_val)],
    eval_names = ["Train", "Valid"],
    early_stopping_rounds= 1000, verbose = 500
)
print("Best Iteration:", model_tuned2.booster_.best_iteration)
Training until validation scores don't improve for 1000 rounds
[500]   Train's SMAPE: 66.2324  Valid's SMAPE: 95.4403
[1000]  Train's SMAPE: 63.2125  Valid's SMAPE: 95.5624
[1500]  Train's SMAPE: 60.7468  Valid's SMAPE: 95.4656
Early stopping, best iteration is:
[696]   Train's SMAPE: 64.8907  Valid's SMAPE: 95.3564
Best Iteration: 696
df.sort_values(["LOCATIONNO", "PRODUCTID", "SALEDATE"], inplace = True)

train_final = df.loc[(df["SALEDATE"] < "2021-03-01"), :]
test_final = df.loc[(df["SALEDATE"] >= "2021-03-01"), :]

X_train_final = train_final[cols]
Y_train_final = train_final.UNIT_GROSS_SALES
X_test_final = test_final[cols]



final_model = lgb.LGBMRegressor(num_leaves=31, n_estimators=15000, max_depth=20, random_state=384, metric = "custom")
final_model.set_params(n_estimators=model_tuned2.booster_.best_iteration) # Best iteration: 696
final_model.fit(X_train_final[cols], Y_train_final,
                eval_metric= lambda y_true, y_pred: [lgbm_smape(y_true, y_pred)])
LGBMRegressor(max_depth=20, metric='custom', n_estimators=696, random_state=384)
final_model.predict(X_test_final)
array([6.82137589, 5.82712452, 3.76803532, ..., 2.00950222, 1.83432827,
       1.17740984])
sales_test = test_final[['LOCATIONNO', 'SALEDATE', 'PRODUCTID', 'UNIT_GROSS_SALES']]
sales_test['Predictions'] = np.round(final_model.predict(X_test_final))
sales_test['error'] = sales_test['UNIT_GROSS_SALES'] - sales_test['Predictions']
sales_test.head()
LOCATIONNO SALEDATE PRODUCTID UNIT_GROSS_SALES Predictions error
262726 0 2021-03-01 1 6.0 7.0 -1.0
263527 0 2021-03-02 1 12.0 6.0 6.0
264122 0 2021-03-03 1 1.0 4.0 -3.0
264932 0 2021-03-04 1 4.0 4.0 0.0
265271 0 2021-03-05 1 6.0 3.0 3.0
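For completeness, the held-out March test period can be scored with the same metric used for training and validation. This is a quick sketch using the rounded predictions in sales_test; the resulting number is not reported in the original write-up:

# SMAPE on the test period, using the rounded predictions
print("TEST SMAPE:", smape(sales_test['Predictions'].values,
                           sales_test['UNIT_GROSS_SALES'].values))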
item1 = sales_test[(sales_test['LOCATIONNO'] == 1) & (sales_test['PRODUCTID'] == 34)][['UNIT_GROSS_SALES', 'Predictions', 'error']]
sns.lineplot(data=item1, palette="tab10", linewidth=2.5)

plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
<matplotlib.legend.Legend at 0x2ac01a5f8b0>

Future Work

The next step in the process is to feed these forecasts into a newsvendor model to determine the stocking quantity for each item in each store. A simulation can then be run to identify how many deliveries should be scheduled throughout the week to minimize waste.
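As a rough illustration of that step, the sketch below computes the classic newsvendor order quantity as the critical-ratio quantile of an assumed demand distribution. Everything here is hypothetical: it assumes the item's Price and Cost capture the underage and overage economics, zero salvage value for expired kits, and a normal approximation of daily demand built from the point forecast and an error spread.

import numpy as np
from scipy.stats import norm

def newsvendor_quantity(forecast_mean, forecast_std, price, cost, salvage=0.0):
    # Underage cost: margin lost when we stock too little; overage cost: loss when a kit expires unsold
    cu = price - cost
    co = cost - salvage
    critical_ratio = cu / (cu + co)
    # Stock up to the critical-ratio quantile of the (approximate) demand distribution
    return norm.ppf(critical_ratio, loc=forecast_mean, scale=forecast_std)

# Hypothetical example: forecast of 6 units with spread 2, price $8.99, cost $4.50
print(np.ceil(newsvendor_quantity(6, 2, 8.99, 4.50)))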

Amol Kulkarni
Ph.D.

My research interests include application of Machine learning algorithms to the fields of Marketing and Supply Chain Engineering, Decision Theory and Process Optimization.