Demand Forecasting of Perishable Products

The objective of this project is to minimize wastage of meal kits in retail stores. Currently, waste is controlled by tracking each individual item from the source to the point of sale, which is a cumbersome and labor-intensive process. To realize the objective using machine learning, the first step is to produce an accurate forecast of demand. This project focuses on generating accurate forecasts for each individual item (46 unique items) at each store (47 unique stores). Three datasets are used for demand forecasting.

  • Product dataset
    • Product_ID
    • Cook Products
    • Sales Product
    • ShelfLife_after_Cook
    • ShelfLife_WFM
    • Price
    • Cost
  • Locations dataset

    • Cook_Loc
    • Sales_Loc
    • Market
  • Sales dataset

    • Locationno
    • Market
    • Saledate
    • Productid
    • Unit_gross_sales

Note: The unaltered dataset is not shared. The Locations dataset contains features obtained from https://censusreporter.org/ that are not present in the original dataset.

# import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt # visualization
import seaborn as sns  # visualization
import holidays   # stored list of holidays in U.S.

# Model
import lightgbm as lgb
import shap                  # For error analysis
from sklearn.model_selection import TimeSeriesSplit, GridSearchCV, RandomizedSearchCV
from sklearn.metrics import make_scorer
from sklearn.preprocessing import LabelEncoder       

# Configuration
import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
# Read the dataset containing the products and sales of the products across the stores in the state of Texas

prod = pd.read_csv('WFM_Products', delimiter = '\t')
locations = pd.read_csv("locations.csv")
sales = pd.read_csv('WFM_Sales.txt', delimiter = '\t',  usecols = ['LOCATIONNO', 'MARKET', 'SALEDATE',
        'PRODUCTID', 'UNIT_GROSS_SALES'],
                    parse_dates = ['SALEDATE'])
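
The locations table (which carries the census-derived store features mentioned in the note above) is read here but only used later. As a rough illustration of how those store-level features could be attached to the sales records, here is a minimal sketch; it assumes the Sales_Loc column in locations.csv holds the same store identifiers as LOCATIONNO in the sales data, which may not hold exactly for the real files.

# Hypothetical join of census-derived store features onto the sales records.
# Assumes locations['Sales_Loc'] uses the same identifiers as sales['LOCATIONNO'].
store_features = locations.drop_duplicates(subset = 'Sales_Loc')
sales_with_loc = sales.merge(store_features, left_on = 'LOCATIONNO',
                             right_on = 'Sales_Loc', how = 'left')
print(sales_with_loc.shape)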

Data Cleaning

The Product dataset contains nine missing values in the Cost column. The client provided the cost for four of the nine; the remaining products are third-party products and were deemed unnecessary for the analysis. The Price column contains a '$' symbol, which is stripped before the column is converted from string to float.

prod.isna().sum()
Product_ID              0
Cook Products           0
Sales Product           0
ShelfLife_after_Cook    0
ShelfLife_WFM           0
Price                   0
Cost                    9
dtype: int64
# Costs provided by the client for four of the missing values
miss = {900979: 2.37, 901974: 1.78, 902588: 2.66, 
        902901: 2.15}

# Fill in the client-provided costs
for key, val in miss.items():
    prod.loc[prod['Product_ID'] == key, 'Cost'] = val

# Drop the remaining (third-party) products with missing costs
prod.dropna(inplace = True)
i = prod[prod['Product_ID'] == 902901].index
prod.drop(i, inplace = True)

# Strip the '$' symbol and convert Price to float
prod['Price'] = prod['Price'].str.replace('$', '', regex = False).astype(float)

The third-party items are removed from the sales dataset as well.

sales['Cost'] = sales['PRODUCTID'].map(prod.set_index('Product_ID')['Cost'])

# Drop the third-party products from the sales data
third_party = [902729, 902322, 902787, 902901, 902323, 902733,
               902730, 902732, 902911, 901902, 902547, 902546]
index_names = sales[sales['PRODUCTID'].isin(third_party)].index

sales.drop(index_names, inplace = True)
sales.dropna(inplace = True)
print(sales.shape, "\n")
sales.head()
(290014, 6) 
LOCATIONNO MARKET SALEDATE PRODUCTID UNIT_GROSS_SALES Cost
1 10073 Austin 2019-09-08 902287 1.0 1.82
2 10086 Austin 2019-09-08 901352 1.0 2.28
4 10017 Austin 2019-09-08 901352 1.0 2.28
5 10085 Austin 2019-09-08 902714 1.0 1.96
6 10073 Austin 2019-09-08 902934 1.0 0.96

We check the total number of stores and number of unique items in the data. We also check the time range of the data.

# How many stores and items are there?
print("number of unique items :",sales.PRODUCTID.nunique())
print("number of unique store :",sales.LOCATIONNO.nunique())
number of unique items : 46
number of unique store : 47
# Time Range
sales["SALEDATE"].min(), sales["SALEDATE"].max()
(Timestamp('2019-09-08 00:00:00'), Timestamp('2021-04-06 00:00:00'))

Although there are 46 unique items, not all of them are sold in every store. Most stores carry 45 or 46 of the products, except store number 4, which carries only 36. Each store and item is encoded with a label encoder so that the store and item identifiers can be incorporated in the model.

# How many items are in each store?
# Encode store and item identifiers so they can be used as model inputs
le_store = LabelEncoder()
sales['LOCATIONNO'] = le_store.fit_transform(sales['LOCATIONNO'].values)
inverse_loc = le_store.inverse_transform(sales['LOCATIONNO'])   # mapping back to the original store IDs
trans_loc = sales['LOCATIONNO']

le_item = LabelEncoder()
sales['PRODUCTID'] = le_item.fit_transform(sales['PRODUCTID'].values)

sales.groupby(["LOCATIONNO"])["PRODUCTID"].nunique()
LOCATIONNO
0     46
1     46
2     46
3     46
4     36
5     45
6     45
7     46
8     46
9     45
10    46
11    46
12    46
13    45
14    46
15    45
16    45
17    46
18    46
19    46
20    46
21    46
22    46
23    46
24    46
25    46
26    46
27    45
28    46
29    46
30    46
31    45
32    46
33    46
34    46
35    45
36    46
37    45
38    46
39    46
40    46
41    46
42    45
43    46
44    46
45    46
46    45
Name: PRODUCTID, dtype: int64

The next step is to look at summary statistics for each store, and then for each item, to find out which stores and items sell the most. We find that store number 1 outsells most of the other stores.

# Summary Stats for each store
sales.groupby(["LOCATIONNO"]).agg({"UNIT_GROSS_SALES": ["count","sum", "mean", 
                                                        "median", "std", "min", "max"]})
UNIT_GROSS_SALES
count sum mean median std min max
LOCATIONNO
0 7466 22435.23 3.004987 2.0 2.688379 0.0 25.0
1 6624 15233.00 2.299668 2.0 1.962819 0.0 14.0
2 7430 18278.17 2.460050 2.0 2.107790 0.0 16.0
3 6367 11731.00 1.842469 1.0 1.621748 0.0 14.0
4 4890 17768.00 3.633538 3.0 3.411545 0.0 36.0
5 5095 8027.75 1.575613 1.0 1.507925 0.0 13.0
6 5224 8745.00 1.674005 1.0 1.619680 0.0 14.0
7 5487 9790.80 1.784363 1.0 1.819337 0.0 20.0
8 5358 10026.00 1.871221 1.0 1.705230 0.0 12.0
9 5097 6627.00 1.300177 1.0 1.319514 0.0 9.0
10 4956 7925.70 1.599213 1.0 1.520835 0.0 11.0
11 4520 5239.00 1.159071 1.0 1.248378 0.0 10.0
12 6262 19845.88 3.169256 2.0 2.914780 0.0 30.0
13 4651 5736.00 1.233283 1.0 1.275230 0.0 12.0
14 5726 13203.00 2.305798 2.0 2.081381 0.0 16.0
15 5738 12329.29 2.148709 2.0 2.003545 0.0 16.1
16 5379 9544.90 1.774475 1.0 1.721056 0.0 18.0
17 6903 16933.00 2.452991 2.0 2.094687 0.0 16.0
18 6389 12324.00 1.928940 2.0 1.671323 0.0 20.0
19 7158 17659.00 2.467030 2.0 2.074877 0.0 15.0
20 6254 10149.00 1.622801 1.0 1.457495 0.0 12.0
21 6634 12702.00 1.914682 2.0 1.650885 0.0 13.0
22 6617 10787.00 1.630195 1.0 1.533882 0.0 15.0
23 6887 12882.00 1.870481 2.0 1.617796 0.0 14.0
24 6988 17344.00 2.481969 2.0 2.162589 0.0 18.0
25 6381 11172.00 1.750823 1.0 1.573079 0.0 16.0
26 6736 14235.00 2.113272 2.0 1.740413 0.0 12.0
27 6319 13571.00 2.147650 2.0 2.002659 0.0 16.0
28 6318 10464.00 1.656220 1.0 1.418869 0.0 11.0
29 5911 9410.42 1.592018 1.0 1.396673 0.0 15.0
30 8480 49427.82 5.828752 4.0 5.271472 0.0 85.0
31 5992 8855.00 1.477804 1.0 1.367350 0.0 18.0
32 7301 17564.35 2.405746 2.0 2.044244 0.0 25.0
33 7849 24421.02 3.111354 3.0 2.600314 0.0 22.0
34 5941 7162.00 1.205521 1.0 1.216750 0.0 11.0
35 5561 8248.25 1.483231 1.0 1.373421 0.0 12.0
36 5699 11326.00 1.987366 1.0 2.049454 0.0 29.0
37 5653 7343.00 1.298956 1.0 1.254628 0.0 9.0
38 6153 11187.00 1.818137 1.0 1.590339 0.0 13.0
39 6677 15645.00 2.343118 2.0 2.166764 0.0 22.0
40 6492 11928.00 1.837338 2.0 1.541209 0.0 14.0
41 6305 11331.00 1.797145 1.0 1.547141 0.0 12.0
42 6215 13815.00 2.222848 2.0 2.045364 0.0 15.0
43 5702 7844.00 1.375658 1.0 1.373397 0.0 12.0
44 6511 12741.73 1.956954 2.0 1.667920 0.0 18.0
45 6309 10392.00 1.647171 1.0 1.557735 0.0 13.0
46 5409 7842.00 1.449806 1.0 1.300789 0.0 12.0
# Summary Stats for each item
sales.groupby(["PRODUCTID"]).agg({"UNIT_GROSS_SALES": ["count","sum", "mean", "median",
                                                       "std", "min", "max"]})
UNIT_GROSS_SALES
count sum mean median std min max
PRODUCTID
0 881 1489.00 1.690125 1.0 1.149310 1.00 12.0
1 15271 39342.84 2.576311 2.0 2.507954 0.00 28.0
2 8558 23905.00 2.793293 2.0 2.145337 0.00 22.0
3 18194 51367.75 2.823335 2.0 2.821902 0.00 59.0
4 100 165.00 1.650000 1.0 1.192358 1.00 7.0
5 6961 15097.77 2.168908 2.0 1.496724 0.42 16.0
6 5752 13468.00 2.341446 2.0 1.631576 0.00 13.0
7 3199 5679.00 1.775242 1.0 1.152993 0.00 13.0
8 8967 21627.71 2.411923 2.0 2.707817 0.00 27.0
9 4387 7152.65 1.630419 1.0 0.942318 0.00 7.0
10 1002 1627.00 1.623752 1.0 1.042146 1.00 9.0
11 8836 12853.00 1.454617 0.0 2.310401 0.00 25.0
12 3086 5614.00 1.819183 1.0 1.048439 1.00 7.0
13 8958 17678.39 1.973475 1.0 2.339349 0.00 27.0
14 843 1317.00 1.562278 1.0 0.915348 1.00 7.0
15 6124 14881.00 2.429948 2.0 1.885544 0.00 22.0
16 11353 30249.00 2.664406 2.0 1.953958 0.00 25.0
17 4306 9994.81 2.321136 2.0 1.703804 0.11 15.0
18 2591 5086.00 1.962949 2.0 1.313200 1.00 11.0
19 1987 5283.00 2.658782 2.0 2.037989 0.00 18.0
20 5255 9629.00 1.832350 1.0 1.116143 1.00 11.0
21 1106 1902.50 1.720163 1.0 1.057462 1.00 8.5
22 6964 14536.00 2.087306 2.0 1.387399 0.00 14.0
23 2839 4561.00 1.606552 1.0 0.924477 0.00 11.0
24 3504 6131.00 1.749715 1.0 1.033758 1.00 8.0
25 5595 11575.00 2.068811 2.0 1.299769 0.00 10.0
26 3078 5623.00 1.826836 2.0 1.059599 0.00 9.0
27 13402 35149.25 2.622687 2.0 2.708546 0.00 35.0
28 4546 10005.00 2.200836 2.0 1.461169 0.00 14.0
29 3925 7301.00 1.860127 1.0 1.145417 1.00 8.0
30 1498 3500.00 2.336449 2.0 1.517748 1.00 13.0
31 755 1435.00 1.900662 1.0 1.301677 0.00 9.0
32 1358 2502.00 1.842415 1.0 1.160730 0.00 11.0
33 6111 13937.90 2.280789 2.0 1.605280 0.00 19.0
34 13443 31241.59 2.324004 2.0 2.274236 0.00 26.0
35 5769 11913.00 2.065003 2.0 1.470268 0.00 15.0
36 836 1309.00 1.565789 1.0 0.973531 1.00 7.0
37 13299 30004.98 2.256183 2.0 2.299639 0.00 57.0
38 13522 33390.15 2.469320 2.0 2.412362 0.00 29.0
39 8837 9548.00 1.080457 0.0 1.863488 0.00 22.0
40 8836 10658.50 1.206258 0.0 1.859811 0.00 23.0
41 8836 18154.00 2.054550 1.0 3.142165 0.00 85.0
42 8836 19553.76 2.212965 1.0 3.178544 0.00 43.0
43 8836 13038.76 1.475641 0.0 2.216686 0.00 31.0
44 8836 9877.00 1.117813 0.0 1.928911 0.00 25.0
45 8836 8837.00 1.000113 0.0 1.571769 0.00 15.0
sns.set_context("talk", font_scale=1.4)
markets = ['Dallas', 'Houston', 'Frontier', 'Austin', 'San Antonio']

fig, axes = plt.subplots(2, 3, figsize=(10, 10))
for i, market in enumerate(markets):
    # Map market i to its position in the 2 x 3 grid
    row, col = divmod(i, 3)
    sales[sales.MARKET == market].UNIT_GROSS_SALES.hist(ax=axes[row, col])
    axes[row, col].set_title("Market " + market, fontsize = 15)

plt.tight_layout(pad= 4.5)
plt.suptitle("Histogram: Market Sales");

# Histogram Sales
sns.set_context("talk", font_scale=1.4)
fig, axes = plt.subplots(8, 6, figsize=(20, 35))
for i in range(0, 47):
    # Map store i to its position in the 8 x 6 grid
    row, col = divmod(i, 6)
    sales[sales.LOCATIONNO == i].UNIT_GROSS_SALES.hist(ax=axes[row, col])
    axes[row, col].set_title("Store " + str(i), fontsize = 15)

plt.tight_layout(pad= 3)
plt.suptitle("Histogram: Sales");

The following plot shows the sale of all the items in store 1.

store = 1
sub = sales[sales.LOCATIONNO == store].set_index("SALEDATE")

# sns.set_context("talk", font_scale=1.4)
fig, axes = plt.subplots(8, 6, figsize=(35, 40))
for i in range(0, 46):
    # Map item i to its position in the 8 x 6 grid
    row, col = divmod(i, 6)
    sub[sub.PRODUCTID == i].UNIT_GROSS_SALES.plot(ax=axes[row, col], 
                                  legend=True, label = "Item "+str(i)+" Sales")

plt.tight_layout(pad=3)
plt.suptitle("Store 1 Items");

We check the correlation of total daily sales between stores to find out whether there are stores or markets whose sales move together. Based on the correlation data, redundant data can be removed to improve the model's prediction accuracy.

# Correlation between total sales of stores

storesales = sales.groupby(["SALEDATE", "LOCATIONNO"]).UNIT_GROSS_SALES.sum().reset_index().set_index("SALEDATE")
corr =  pd.pivot_table(storesales, values = "UNIT_GROSS_SALES", columns="LOCATIONNO", index="SALEDATE").corr(method = "spearman")
plt.figure(figsize = (30,30))
sns.heatmap(corr[(corr >= 0.5) | (corr <= -0.5)], 
            cmap='viridis', vmax=1.0, vmin=-1.0, linewidths=0.1,
            annot=True, annot_kws={"size": 9}, square=True);
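
To make the heatmap actionable (as noted above, correlated stores point to redundant data), the most highly correlated store pairs can be listed directly from the corr matrix just computed; the 0.9 cutoff below is only illustrative.

# List the store pairs with the highest Spearman correlation of total sales
# (0.9 is an illustrative threshold, not one used elsewhere in this project)
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
high_corr = corr.where(mask).stack().sort_values(ascending=False)
print(high_corr[high_corr > 0.9].head(10))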

# Correlation between total sales at market level

sns.set_context("talk", font_scale=1.4)
marketsales = sales.groupby(["SALEDATE", "MARKET"]).UNIT_GROSS_SALES.sum().reset_index().set_index("SALEDATE")
corr =  pd.pivot_table(marketsales, values = "UNIT_GROSS_SALES", columns="MARKET", index="SALEDATE").corr(method = "spearman")
plt.figure(figsize = (10,10))
sns.heatmap(corr[(corr >= 0.5) | (corr <= -0.5)], 
            cmap='viridis', vmax=1.0, vmin=-1.0, linewidths=0.1,
            annot=True, annot_kws={"size": 9}, square=True);

A two-sample hypothesis test is conducted to see which stores have similar sales distributions. Depending on the normality and variance homogeneity of the groups, either a parametric t-test or a non-parametric Mann-Whitney U test is used. The same test is carried out for items and for markets, and the results are incorporated during the feature engineering phase of the project. This kind of two-group comparison is also commonly referred to as A/B testing.

# T Test
def CompareTwoGroups(dataframe, group, target):
    
    import itertools
    from scipy.stats import shapiro
    import scipy.stats as stats
    
    # 1. Normality Test: Shapiro Test
    # 2. Homogeneity Test: Levene Test
    # 3. Parametric or Non-Parametric T Test: T-Test, Welch Test, Mann Whitney U
    
    # Create Combinations
    item_comb = list(itertools.combinations(dataframe[group].unique(), 2))
    
    AB = pd.DataFrame()
    for i in range(0, len(item_comb)):
        # Define Groups
        groupA = dataframe[dataframe[group] == item_comb[i][0]][target]
        groupB = dataframe[dataframe[group] == item_comb[i][1]][target]
        
        # Assumption: Normality
        ntA = shapiro(groupA)[1] < 0.05
        ntB = shapiro(groupB)[1] < 0.05
        # H0: Distribution is Normal! - False
        # H1: Distribution is not Normal! - True
        
        if (ntA == False) & (ntB == False): # "H0: Normal Distribution"
            # Parametric Test
            # Assumption: Homogeneity of variances
            leveneTest = stats.levene(groupA, groupB)[1] < 0.05
            # H0: Homogeneity: False
            # H1: Heterogeneous: True
            if leveneTest == False:
                # Homogeneity
                ttest = stats.ttest_ind(groupA, groupB, equal_var=True)[1]
                # H0: M1 = M2 - False
                # H1: M1 != M2 - True
            else:
                # Heterogeneous
                ttest = stats.ttest_ind(groupA, groupB, equal_var=False)[1]
                # H0: M1 = M2 - False
                # H1: M1 != M2 - True
        else:
            # Non-Parametric Test
            ttest = stats.mannwhitneyu(groupA, groupB)[1] 
            # H0: M1 = M2 - False
            # H1: M1 != M2 - True
            
        temp = pd.DataFrame({"Compare Two Groups":[ttest < 0.05], 
                             "p-value":[ttest],
                             "GroupA_Mean":[groupA.mean()], "GroupB_Mean":[groupB.mean()],
                             "GroupA_Median":[groupA.median()], "GroupB_Median":[groupB.median()],
                             "GroupA_Count":[groupA.count()], "GroupB_Count":[groupB.count()]
                            }, index = [item_comb[i]])
        temp["Compare Two Groups"] = np.where(temp["Compare Two Groups"] == True, "Different Groups", "Similar Groups")
        temp["TestType"] = np.where((ntA == False) & (ntB == False), "Parametric", "Non-Parametric")
        
        AB = pd.concat([AB, temp[["TestType", "Compare Two Groups", "p-value","GroupA_Median", "GroupB_Median","GroupA_Mean", "GroupB_Mean",
                                 "GroupA_Count", "GroupB_Count"]]])
        
    return AB
    
    
ctg_store = CompareTwoGroups(storesales, group = "LOCATIONNO", target = "UNIT_GROSS_SALES")
ctg_ss = ctg_store[ctg_store["Compare Two Groups"] == "Similar Groups"]
ctg_ss
TestType Compare Two Groups p-value GroupA_Median GroupB_Median GroupA_Mean GroupB_Mean GroupA_Count GroupB_Count
(0, 2) Non-Parametric Similar Groups 0.233386 31.0 32.0 38.950052 33.354325 576 548
(20, 29) Non-Parametric Similar Groups 0.337784 15.0 15.0 17.650435 17.266826 575 545
(20, 36) Non-Parametric Similar Groups 0.115768 15.0 15.0 17.650435 21.573333 575 525
(20, 10) Non-Parametric Similar Groups 0.472980 15.0 15.0 17.650435 18.136613 575 437
(31, 35) Non-Parametric Similar Groups 0.380683 14.0 14.0 15.453752 15.801245 573 522
(31, 43) Non-Parametric Similar Groups 0.310684 14.0 13.0 15.453752 14.912548 573 526
(31, 46) Non-Parametric Similar Groups 0.401160 14.0 13.0 15.453752 15.316406 573 512
(31, 9) Non-Parametric Similar Groups 0.361584 14.0 14.0 15.453752 14.959368 573 443
(32, 17) Non-Parametric Similar Groups 0.278548 27.5 27.0 30.599913 30.956124 574 547
(32, 24) Non-Parametric Similar Groups 0.281458 27.5 27.0 30.599913 31.649635 574 548
(32, 14) Non-Parametric Similar Groups 0.103080 27.5 26.0 30.599913 29.803612 574 443
(33, 12) Non-Parametric Similar Groups 0.275046 36.0 38.0 42.471339 44.900181 575 442
(3, 22) Non-Parametric Similar Groups 0.291572 18.0 19.0 21.446069 19.720293 547 547
(3, 25) Non-Parametric Similar Groups 0.388627 18.0 18.0 21.446069 20.424132 547 547
(3, 28) Non-Parametric Similar Groups 0.096847 18.0 17.0 21.446069 19.094891 547 548
(3, 38) Non-Parametric Similar Groups 0.059920 18.0 19.0 21.446069 21.227704 547 527
(3, 45) Non-Parametric Similar Groups 0.347969 18.0 18.0 21.446069 19.719165 547 527
(3, 7) Non-Parametric Similar Groups 0.208064 18.0 19.0 21.446069 22.051351 547 444
(3, 16) Non-Parametric Similar Groups 0.133355 18.0 19.0 21.446069 21.497523 547 444
(17, 24) Non-Parametric Similar Groups 0.146562 27.0 27.0 30.956124 31.649635 547 548
(17, 39) Non-Parametric Similar Groups 0.068823 27.0 25.0 30.956124 29.686907 547 527
(17, 1) Non-Parametric Similar Groups 0.142868 27.0 25.0 30.956124 29.070611 547 524
(17, 14) Non-Parametric Similar Groups 0.245477 27.0 26.0 30.956124 29.803612 547 443
(18, 22) Non-Parametric Similar Groups 0.078444 19.0 19.0 22.530165 19.720293 547 547
(18, 27) Non-Parametric Similar Groups 0.152819 19.0 20.0 22.530165 24.764599 547 548
(18, 38) Non-Parametric Similar Groups 0.389069 19.0 19.0 22.530165 21.227704 547 527
(18, 41) Non-Parametric Similar Groups 0.476575 19.0 20.0 22.530165 21.541825 547 526
(18, 7) Non-Parametric Similar Groups 0.216345 19.0 19.0 22.530165 22.051351 547 444
(18, 16) Non-Parametric Similar Groups 0.322617 19.0 19.0 22.530165 21.497523 547 444
(18, 8) Non-Parametric Similar Groups 0.407530 19.0 19.0 22.530165 22.734694 547 441
(19, 24) Non-Parametric Similar Groups 0.163828 29.0 27.0 32.224453 31.649635 548 548
(21, 23) Non-Parametric Similar Groups 0.160270 21.0 22.0 23.178832 23.507299 548 548
(21, 27) Non-Parametric Similar Groups 0.209002 21.0 20.0 23.178832 24.764599 548 548
(21, 40) Non-Parametric Similar Groups 0.471788 21.0 21.0 23.178832 22.633776 548 527
(21, 42) Non-Parametric Similar Groups 0.359416 21.0 20.0 23.178832 26.465517 548 522
(21, 44) Non-Parametric Similar Groups 0.213510 21.0 22.0 23.178832 24.177856 548 527
(21, 8) Non-Parametric Similar Groups 0.085189 21.0 19.0 23.178832 22.734694 548 441
(22, 25) Non-Parametric Similar Groups 0.187974 19.0 18.0 19.720293 20.424132 547 547
(22, 38) Non-Parametric Similar Groups 0.111677 19.0 19.0 19.720293 21.227704 547 527
(22, 45) Non-Parametric Similar Groups 0.179004 19.0 18.0 19.720293 19.719165 547 527
(22, 7) Non-Parametric Similar Groups 0.297087 19.0 19.0 19.720293 22.051351 547 444
(22, 16) Non-Parametric Similar Groups 0.141047 19.0 19.0 19.720293 21.497523 547 444
(23, 40) Non-Parametric Similar Groups 0.144820 22.0 21.0 23.507299 22.633776 548 527
(23, 42) Non-Parametric Similar Groups 0.284477 22.0 20.0 23.507299 26.465517 548 522
(23, 44) Non-Parametric Similar Groups 0.440538 22.0 22.0 23.507299 24.177856 548 527
(25, 28) Non-Parametric Similar Groups 0.141127 18.0 17.0 20.424132 19.094891 547 548
(25, 45) Non-Parametric Similar Groups 0.480488 18.0 18.0 20.424132 19.719165 547 527
(25, 6) Non-Parametric Similar Groups 0.062208 18.0 17.0 20.424132 19.740406 547 443
(25, 7) Non-Parametric Similar Groups 0.112849 18.0 19.0 20.424132 22.051351 547 444
(25, 16) Non-Parametric Similar Groups 0.057894 18.0 19.0 20.424132 21.497523 547 444
(27, 38) Non-Parametric Similar Groups 0.083419 20.0 19.0 24.764599 21.227704 548 527
(27, 40) Non-Parametric Similar Groups 0.218523 20.0 21.0 24.764599 22.633776 548 527
(27, 41) Non-Parametric Similar Groups 0.137742 20.0 20.0 24.764599 21.541825 548 526
(27, 42) Non-Parametric Similar Groups 0.146031 20.0 20.0 24.764599 26.465517 548 522
(27, 44) Non-Parametric Similar Groups 0.072787 20.0 22.0 24.764599 24.177856 548 527
(27, 16) Non-Parametric Similar Groups 0.068801 20.0 19.0 24.764599 21.497523 548 444
(27, 8) Non-Parametric Similar Groups 0.233037 20.0 19.0 24.764599 22.734694 548 441
(28, 45) Non-Parametric Similar Groups 0.157909 17.0 18.0 19.094891 19.719165 548 527
(28, 5) Non-Parametric Similar Groups 0.082682 17.0 17.0 19.094891 18.162330 548 442
(28, 6) Non-Parametric Similar Groups 0.282677 17.0 17.0 19.094891 19.740406 548 443
(29, 36) Non-Parametric Similar Groups 0.056004 15.0 15.0 17.266826 21.573333 545 525
(29, 10) Non-Parametric Similar Groups 0.340116 15.0 15.0 17.266826 18.136613 545 437
(35, 43) Non-Parametric Similar Groups 0.216217 14.0 13.0 15.801245 14.912548 522 526
(35, 46) Non-Parametric Similar Groups 0.272802 14.0 13.0 15.801245 15.316406 522 512
(35, 9) Non-Parametric Similar Groups 0.498473 14.0 14.0 15.801245 14.959368 522 443
(37, 34) Non-Parametric Similar Groups 0.478634 13.0 12.0 14.013359 13.615970 524 526
(39, 1) Non-Parametric Similar Groups 0.330655 25.0 25.0 29.686907 29.070611 527 524
(39, 14) Non-Parametric Similar Groups 0.232906 25.0 26.0 29.686907 29.803612 527 443
(39, 15) Non-Parametric Similar Groups 0.388649 25.0 26.0 29.686907 27.831354 527 443
(1, 14) Non-Parametric Similar Groups 0.364876 25.0 26.0 29.070611 29.803612 524 443
(1, 15) Non-Parametric Similar Groups 0.219119 25.0 26.0 29.070611 27.831354 524 443
(36, 5) Non-Parametric Similar Groups 0.323113 15.0 17.0 21.573333 18.162330 525 442
(36, 6) Non-Parametric Similar Groups 0.116215 15.0 17.0 21.573333 19.740406 525 443
(36, 10) Non-Parametric Similar Groups 0.125947 15.0 15.0 21.573333 18.136613 525 437
(38, 41) Non-Parametric Similar Groups 0.352019 19.0 20.0 21.227704 21.541825 527 526
(38, 7) Non-Parametric Similar Groups 0.304000 19.0 19.0 21.227704 22.051351 527 444
(38, 16) Non-Parametric Similar Groups 0.455081 19.0 19.0 21.227704 21.497523 527 444
(38, 8) Non-Parametric Similar Groups 0.247194 19.0 19.0 21.227704 22.734694 527 441
(40, 42) Non-Parametric Similar Groups 0.381842 21.0 20.0 22.633776 26.465517 527 522
(40, 44) Non-Parametric Similar Groups 0.188451 21.0 22.0 22.633776 24.177856 527 527
(40, 8) Non-Parametric Similar Groups 0.078163 21.0 19.0 22.633776 22.734694 527 441
(41, 7) Non-Parametric Similar Groups 0.204610 20.0 19.0 21.541825 22.051351 526 444
(41, 16) Non-Parametric Similar Groups 0.334767 20.0 19.0 21.541825 21.497523 526 444
(41, 8) Non-Parametric Similar Groups 0.338689 20.0 19.0 21.541825 22.734694 526 441
(42, 44) Non-Parametric Similar Groups 0.387730 20.0 22.0 26.465517 24.177856 522 527
(43, 46) Non-Parametric Similar Groups 0.432120 13.0 13.0 14.912548 15.316406 526 512
(43, 9) Non-Parametric Similar Groups 0.217890 13.0 14.0 14.912548 14.959368 526 443
(45, 6) Non-Parametric Similar Groups 0.072031 18.0 17.0 19.719165 19.740406 527 443
(45, 7) Non-Parametric Similar Groups 0.092305 18.0 19.0 19.719165 22.051351 527 444
(46, 9) Non-Parametric Similar Groups 0.298520 13.0 14.0 15.316406 14.959368 512 443
(5, 6) Non-Parametric Similar Groups 0.247393 17.0 17.0 18.162330 19.740406 442 443
(5, 10) Non-Parametric Similar Groups 0.062061 17.0 15.0 18.162330 18.136613 442 437
(7, 16) Non-Parametric Similar Groups 0.386442 19.0 19.0 22.051351 21.497523 444 444
(7, 8) Non-Parametric Similar Groups 0.151309 19.0 19.0 22.051351 22.734694 444 441
(11, 13) Non-Parametric Similar Groups 0.094553 10.0 11.0 12.071429 13.036364 434 440
(14, 15) Non-Parametric Similar Groups 0.134353 26.0 26.0 29.803612 27.831354 443 443
(16, 8) Non-Parametric Similar Groups 0.213394 19.0 19.0 21.497523 22.734694 444 441
itemsales = sales.groupby(["SALEDATE", "PRODUCTID"]).UNIT_GROSS_SALES.sum().reset_index().set_index("SALEDATE")
ctg_is = CompareTwoGroups(itemsales, group = "PRODUCTID", target = "UNIT_GROSS_SALES")
ctg_iss = ctg_is[ctg_is["Compare Two Groups"] == "Similar Groups"]
ctg_iss
TestType Compare Two Groups p-value GroupA_Median GroupB_Median GroupA_Mean GroupB_Mean GroupA_Count GroupB_Count
(6, 22) Non-Parametric Similar Groups 0.431454 48.0 46.0 47.590106 49.274576 283 295
(6, 25) Non-Parametric Similar Groups 0.062028 48.0 46.0 47.590106 46.115538 283 251
(6, 28) Non-Parametric Similar Groups 0.191243 48.0 51.0 47.590106 49.776119 283 201
(6, 18) Non-Parametric Similar Groups 0.395740 48.0 47.0 47.590106 47.981132 283 106
(6, 39) Non-Parametric Similar Groups 0.282581 48.0 55.0 47.590106 50.518519 283 189
(6, 44) Non-Parametric Similar Groups 0.279319 48.0 57.0 47.590106 52.537234 283 188
(6, 45) Non-Parametric Similar Groups 0.437603 48.0 51.0 47.590106 47.005319 283 188
(7, 32) Non-Parametric Similar Groups 0.267717 24.5 20.0 25.581081 26.617021 222 94
(7, 0) Non-Parametric Similar Groups 0.110582 24.5 27.0 25.581081 26.589286 222 56
(7, 10) Non-Parametric Similar Groups 0.385214 24.5 26.0 25.581081 23.926471 222 68
(7, 36) Non-Parametric Similar Groups 0.234398 24.5 25.0 25.581081 22.964912 222 57
(7, 14) Non-Parametric Similar Groups 0.267800 24.5 24.0 25.581081 23.105263 222 57
(9, 24) Non-Parametric Similar Groups 0.061105 28.0 31.0 28.610600 30.809045 250 199
(9, 21) Non-Parametric Similar Groups 0.235222 28.0 32.0 28.610600 29.269231 250 65
(9, 0) Non-Parametric Similar Groups 0.292907 28.0 27.0 28.610600 26.589286 250 56
(16, 19) Non-Parametric Similar Groups 0.313315 78.0 79.0 78.568831 78.850746 385 67
(16, 15) Non-Parametric Similar Groups 0.106617 78.0 73.5 78.568831 77.505208 385 192
(20, 29) Non-Parametric Similar Groups 0.330028 33.0 34.0 34.389286 36.143564 280 202
(20, 21) Non-Parametric Similar Groups 0.066783 33.0 32.0 34.389286 29.269231 280 65
(22, 25) Non-Parametric Similar Groups 0.055195 46.0 46.0 49.274576 46.115538 295 251
(22, 28) Non-Parametric Similar Groups 0.271737 46.0 51.0 49.274576 49.776119 295 201
(22, 18) Non-Parametric Similar Groups 0.477400 46.0 47.0 49.274576 47.981132 295 106
(22, 39) Non-Parametric Similar Groups 0.377522 46.0 55.0 49.274576 50.518519 295 189
(22, 44) Non-Parametric Similar Groups 0.403045 46.0 57.0 49.274576 52.537234 295 188
(22, 45) Non-Parametric Similar Groups 0.413270 46.0 51.0 49.274576 47.005319 295 188
(23, 30) Non-Parametric Similar Groups 0.278406 23.0 19.0 22.357843 26.315789 204 133
(23, 31) Non-Parametric Similar Groups 0.257764 23.0 19.0 22.357843 20.797101 204 69
(23, 32) Non-Parametric Similar Groups 0.094785 23.0 20.0 22.357843 26.617021 204 94
(23, 10) Non-Parametric Similar Groups 0.317212 23.0 26.0 22.357843 23.926471 204 68
(23, 36) Non-Parametric Similar Groups 0.416263 23.0 25.0 22.357843 22.964912 204 57
(23, 14) Non-Parametric Similar Groups 0.387084 23.0 24.0 22.357843 23.105263 204 57
(24, 21) Non-Parametric Similar Groups 0.332758 31.0 32.0 30.809045 29.269231 199 65
(25, 18) Non-Parametric Similar Groups 0.132354 46.0 47.0 46.115538 47.981132 251 106
(25, 39) Non-Parametric Similar Groups 0.211923 46.0 55.0 46.115538 50.518519 251 189
(25, 44) Non-Parametric Similar Groups 0.302716 46.0 57.0 46.115538 52.537234 251 188
(25, 45) Non-Parametric Similar Groups 0.238029 46.0 51.0 46.115538 47.005319 251 188
(28, 18) Non-Parametric Similar Groups 0.210105 51.0 47.0 49.776119 47.981132 201 106
(28, 39) Non-Parametric Similar Groups 0.311082 51.0 55.0 49.776119 50.518519 201 189
(28, 40) Non-Parametric Similar Groups 0.232932 51.0 54.0 49.776119 56.694149 201 188
(28, 44) Non-Parametric Similar Groups 0.395745 51.0 57.0 49.776119 52.537234 201 188
(28, 45) Non-Parametric Similar Groups 0.163656 51.0 51.0 49.776119 47.005319 201 188
(30, 31) Non-Parametric Similar Groups 0.308437 19.0 19.0 26.315789 20.797101 133 69
(30, 32) Non-Parametric Similar Groups 0.225922 19.0 20.0 26.315789 26.617021 133 94
(30, 21) Non-Parametric Similar Groups 0.071822 19.0 32.0 26.315789 29.269231 133 65
(30, 10) Non-Parametric Similar Groups 0.288977 19.0 26.0 26.315789 23.926471 133 68
(30, 36) Non-Parametric Similar Groups 0.281821 19.0 25.0 26.315789 22.964912 133 57
(30, 14) Non-Parametric Similar Groups 0.245616 19.0 24.0 26.315789 23.105263 133 57
(1, 3) Non-Parametric Similar Groups 0.205472 91.0 96.0 102.189195 107.689203 385 477
(1, 2) Non-Parametric Similar Groups 0.221018 91.0 91.0 102.189195 95.239044 385 251
(1, 8) Non-Parametric Similar Groups 0.053662 91.0 104.0 102.189195 111.483041 385 194
(1, 27) Non-Parametric Similar Groups 0.290834 91.0 93.0 102.189195 99.011972 385 355
(1, 38) Non-Parametric Similar Groups 0.230607 91.0 94.0 102.189195 98.206324 385 340
(1, 41) Non-Parametric Similar Groups 0.259650 91.0 104.5 102.189195 96.563830 385 188
(1, 42) Non-Parametric Similar Groups 0.170894 91.0 110.5 102.189195 104.009362 385 188
(31, 32) Non-Parametric Similar Groups 0.089981 19.0 20.0 20.797101 26.617021 69 94
(31, 36) Non-Parametric Similar Groups 0.062689 19.0 25.0 20.797101 22.964912 69 57
(32, 21) Non-Parametric Similar Groups 0.179261 20.0 32.0 26.617021 29.269231 94 65
(32, 10) Non-Parametric Similar Groups 0.445321 20.0 26.0 26.617021 23.926471 94 68
(32, 36) Non-Parametric Similar Groups 0.448063 20.0 25.0 26.617021 22.964912 94 57
(32, 14) Non-Parametric Similar Groups 0.485450 20.0 24.0 26.617021 23.105263 94 57
(26, 12) Non-Parametric Similar Groups 0.328254 40.0 41.5 41.043796 41.279412 137 136
(26, 44) Non-Parametric Similar Groups 0.069902 40.0 57.0 41.043796 52.537234 137 188
(12, 44) Non-Parametric Similar Groups 0.070772 41.5 57.0 41.279412 52.537234 136 188
(3, 8) Non-Parametric Similar Groups 0.181581 96.0 104.0 107.689203 111.483041 477 194
(3, 27) Non-Parametric Similar Groups 0.056285 96.0 93.0 107.689203 99.011972 477 355
(3, 41) Non-Parametric Similar Groups 0.068409 96.0 104.5 107.689203 96.563830 477 188
(3, 42) Non-Parametric Similar Groups 0.458308 96.0 110.5 107.689203 104.009362 477 188
(2, 27) Non-Parametric Similar Groups 0.289723 91.0 93.0 95.239044 99.011972 251 355
(2, 38) Non-Parametric Similar Groups 0.369148 91.0 94.0 95.239044 98.206324 251 340
(2, 13) Non-Parametric Similar Groups 0.127806 91.0 86.0 95.239044 91.597876 251 193
(2, 41) Non-Parametric Similar Groups 0.275901 91.0 104.5 95.239044 96.563830 251 188
(4, 44) Non-Parametric Similar Groups 0.050453 3.0 57.0 8.684211 52.537234 19 188
(5, 17) Non-Parametric Similar Groups 0.068951 58.0 64.0 60.150478 62.467562 251 160
(5, 35) Non-Parametric Similar Groups 0.361863 58.0 56.5 60.150478 57.274038 251 208
(5, 40) Non-Parametric Similar Groups 0.157302 58.0 54.0 60.150478 56.694149 251 188
(18, 39) Non-Parametric Similar Groups 0.270713 47.0 55.0 47.981132 50.518519 106 189
(18, 40) Non-Parametric Similar Groups 0.064286 47.0 54.0 47.981132 56.694149 106 188
(18, 44) Non-Parametric Similar Groups 0.325905 47.0 57.0 47.981132 52.537234 106 188
(18, 45) Non-Parametric Similar Groups 0.338588 47.0 51.0 47.981132 47.005319 106 188
(19, 15) Non-Parametric Similar Groups 0.103538 79.0 73.5 78.850746 77.505208 67 192
(19, 37) Non-Parametric Similar Groups 0.264685 79.0 81.5 78.850746 88.249941 67 340
(19, 13) Non-Parametric Similar Groups 0.084556 79.0 86.0 78.850746 91.597876 67 193
(19, 41) Non-Parametric Similar Groups 0.079300 79.0 104.5 78.850746 96.563830 67 188
(0, 10) Non-Parametric Similar Groups 0.244027 27.0 26.0 26.589286 23.926471 56 68
(0, 36) Non-Parametric Similar Groups 0.103580 27.0 25.0 26.589286 22.964912 56 57
(0, 14) Non-Parametric Similar Groups 0.081279 27.0 24.0 26.589286 23.105263 56 57
(10, 36) Non-Parametric Similar Groups 0.305561 26.0 25.0 23.926471 22.964912 68 57
(10, 14) Non-Parametric Similar Groups 0.297753 26.0 24.0 23.926471 23.105263 68 57
(15, 11) Non-Parametric Similar Groups 0.165177 73.5 78.0 77.505208 68.367021 192 188
(15, 43) Non-Parametric Similar Groups 0.102486 73.5 76.5 77.505208 69.355106 192 188
(8, 42) Non-Parametric Similar Groups 0.267297 104.0 110.5 111.483041 104.009362 194 188
(27, 38) Non-Parametric Similar Groups 0.425943 93.0 94.0 99.011972 98.206324 355 340
(27, 13) Non-Parametric Similar Groups 0.074162 93.0 86.0 99.011972 91.597876 355 193
(27, 41) Non-Parametric Similar Groups 0.234647 93.0 104.5 99.011972 96.563830 355 188
(27, 42) Non-Parametric Similar Groups 0.247916 93.0 110.5 99.011972 104.009362 355 188
(39, 40) Non-Parametric Similar Groups 0.179638 55.0 54.0 50.518519 56.694149 189 188
(39, 44) Non-Parametric Similar Groups 0.375144 55.0 57.0 50.518519 52.537234 189 188
(39, 45) Non-Parametric Similar Groups 0.208922 55.0 51.0 50.518519 47.005319 189 188
(17, 33) Non-Parametric Similar Groups 0.082676 64.0 68.0 62.467562 67.332850 160 207
(17, 11) Non-Parametric Similar Groups 0.159181 64.0 78.0 62.467562 68.367021 160 188
(17, 43) Non-Parametric Similar Groups 0.225410 64.0 76.5 62.467562 69.355106 160 188
(33, 11) Non-Parametric Similar Groups 0.276477 68.0 78.0 67.332850 68.367021 207 188
(33, 43) Non-Parametric Similar Groups 0.435254 68.0 76.5 67.332850 69.355106 207 188
(34, 13) Non-Parametric Similar Groups 0.497191 88.0 86.0 92.158083 91.597876 339 193
(34, 41) Non-Parametric Similar Groups 0.253872 88.0 104.5 92.158083 96.563830 339 188
(35, 40) Non-Parametric Similar Groups 0.240636 56.5 54.0 57.274038 56.694149 208 188
(35, 44) Non-Parametric Similar Groups 0.121123 56.5 57.0 57.274038 52.537234 208 188
(36, 14) Non-Parametric Similar Groups 0.490951 25.0 24.0 22.964912 23.105263 57 57
(37, 13) Non-Parametric Similar Groups 0.107570 81.5 86.0 88.249941 91.597876 340 193
(37, 41) Non-Parametric Similar Groups 0.115668 81.5 104.5 88.249941 96.563830 340 188
(38, 13) Non-Parametric Similar Groups 0.093953 94.0 86.0 98.206324 91.597876 340 193
(38, 41) Non-Parametric Similar Groups 0.381587 94.0 104.5 98.206324 96.563830 340 188
(38, 42) Non-Parametric Similar Groups 0.124984 94.0 110.5 98.206324 104.009362 340 188
(13, 41) Non-Parametric Similar Groups 0.309650 86.0 104.5 91.597876 96.563830 193 188
(11, 43) Non-Parametric Similar Groups 0.473016 78.0 76.5 68.367021 69.355106 188 188
(40, 44) Non-Parametric Similar Groups 0.243153 54.0 57.0 56.694149 52.537234 188 188
(41, 42) Non-Parametric Similar Groups 0.167488 104.5 110.5 96.563830 104.009362 188 188
(44, 45) Non-Parametric Similar Groups 0.129569 57.0 51.0 52.537234 47.005319 188 188
CompareTwoGroups(marketsales, group = "MARKET", target = "UNIT_GROSS_SALES")
TestType Compare Two Groups p-value GroupA_Median GroupB_Median GroupA_Mean GroupB_Mean GroupA_Count GroupB_Count
(Austin, Dallas) Non-Parametric Different Groups 2.142448e-28 223.5 315.0 261.493785 345.159545 576 549
(Austin, Houston) Non-Parametric Similar Groups 4.293589e-01 223.5 216.0 261.493785 244.136837 576 528
(Austin, San Antonio) Non-Parametric Different Groups 5.611602e-173 223.5 41.0 261.493785 43.897021 576 527
(Austin, Frontier) Non-Parametric Different Groups 2.915368e-02 223.5 232.5 261.493785 263.604324 576 444
(Dallas, Houston) Non-Parametric Different Groups 4.189280e-39 315.0 216.0 345.159545 244.136837 549 528
(Dallas, San Antonio) Non-Parametric Different Groups 3.759319e-172 315.0 41.0 345.159545 43.897021 549 527
(Dallas, Frontier) Non-Parametric Different Groups 1.311596e-22 315.0 232.5 345.159545 263.604324 549 444
(Houston, San Antonio) Non-Parametric Different Groups 1.266149e-165 216.0 41.0 244.136837 43.897021 528 527
(Houston, Frontier) Non-Parametric Different Groups 1.424168e-02 216.0 232.5 244.136837 263.604324 528 444
(San Antonio, Frontier) Non-Parametric Different Groups 2.881501e-153 41.0 232.5 43.897021 263.604324 527 444

Feature Engineering

The following features are engineered and incorporated into the model to improve prediction accuracy.

  • Time Related Features
    • Year
    • Month
    • Week
    • Day
    • Weekday
    • Day of the year
    • Day of the week
    • Holidays for the state of Texas
    • Is the saledate a weekend?
    • Is the saledate month end?
    • Is the saledate month start?
    • Is the saledate quarter end?
    • Is the saledate quarter start?
    • Is the saledate year end?
    • Is the saledate year start?
  • Lagged Features
    • sales lag for 3 days
    • sales lag for 7 days
    • sales lag for 14 days
    • sales lag for 30 days
    • sales lag for 60 days
    • sales lag for 90 days
  • Moving Average Features
    • Rolling mean
  • Hypothesis Testing: Similarity Features
    • Similar store sales
    • Similar item sales
    • Similar market sales
  • Exponentially Weighted Mean Features
# 1. Time Related Features
#####################################################
def add_datepart(df, clmn):
    df['Year'] = df[clmn].dt.year
    df['Month'] = df[clmn].dt.month
    df['Week'] = df[clmn].dt.isocalendar().week
    df['Day'] = df[clmn].dt.day
    
    wkday = []
    doty = []
    for dt in df[clmn]:
        wkday.append(dt.weekday())
        doty.append(dt.timetuple().tm_yday)
    df['DayofWeek'] = wkday
    df['DayofYear'] = doty
    
    tx = holidays.US(state = 'TX')
    is_holiday = []
    for i in df[clmn]: 
           is_holiday.append(i in tx)
            
    df['Is_holiday'] = is_holiday
    df['Is_holiday'] = df['Is_holiday'].astype(int)
    df['Is_wknd'] = df['DayofWeek'] // 4       # integer division flags Friday-Sunday (weekday 4-6) as weekend
    df['Is_month_end'] = df[clmn].dt.is_month_end.astype(int)
    df['Is_month_start'] = df[clmn].dt.is_month_start.astype(int)
    df['Is_quarter_end'] = df[clmn].dt.is_quarter_end.astype(int)
    df['Is_quarter_start'] = df[clmn].dt.is_quarter_start.astype(int)
    df['Is_year_end'] = df[clmn].dt.is_year_end.astype(int)
    df['Is_year_start'] = df[clmn].dt.is_year_start.astype(int)
#     df.drop([clmn], axis = 1, inplace = True)
    return df

new_sales = add_datepart(sales, 'SALEDATE').copy()
# Rolling Summary Stats Features
#####################################################
# Rolling mean of the previous i observations, computed within each store-item group
for i in [4, 7, 14, 30, 60, 100]:
    new_sales["sales_roll_mean_"+str(i)] = new_sales.groupby(["LOCATIONNO", "PRODUCTID"]).UNIT_GROSS_SALES.transform(
        lambda x: x.shift(1).rolling(i).mean())
# List of similar store sales and similar item sales from the non-parametric t-test.

sim_store = list(ctg_ss.index)
sim_item = list(ctg_iss.index)
# 2. Hypothesis Testing: Similarity
#####################################################

# Store Based
j = 1
for i in sim_store:
    if j == 1:
        new_sales["StoreSalesSimilarity"] = np.where(new_sales.LOCATIONNO.isin(list(i)), j, 0)
    else:
        new_sales["StoreSalesSimilarity"] = np.where(new_sales.LOCATIONNO.isin(list(i)), j, new_sales["StoreSalesSimilarity"])
    j+=1

# item Based
j = 1
for i in sim_item:
    if j == 1:
        new_sales["ItemSalesSimilarity"] = np.where(new_sales.PRODUCTID.isin(list(i)), j, 0)
    else:
        new_sales["ItemSalesSimilarity"] = np.where(new_sales.PRODUCTID.isin(list(i)), j, new_sales["ItemSalesSimilarity"])
    j+=1

# Market Based
# Austin and Houston are the only market pair found similar in the test above
new_sales["MarketSalesSimilarity"] = np.where(new_sales.MARKET.isin(['Austin', 'Houston']), 1, 0)

# 3. Lag/Shifted Features
#####################################################

new_sales.sort_values(by=['LOCATIONNO', 'PRODUCTID', 'SALEDATE'], axis=0, inplace=True)

def lag_features(dataframe, lags, groups = ['LOCATIONNO', 'PRODUCTID'], target = 'UNIT_GROSS_SALES', prefix = ''):
    dataframe = dataframe.copy()
    for lag in lags:
        dataframe[prefix + str(lag)] = dataframe.groupby(groups)[target].transform(
            lambda x: x.shift(lag))
    return dataframe
lag_sales = lag_features(new_sales, lags = [3, 7, 14, 30, 60, 90], 
                         groups = ['LOCATIONNO', 'PRODUCTID'], 
                         target = 'UNIT_GROSS_SALES', prefix = 'sales_lag_').copy()
lag_sales.head()
LOCATIONNO MARKET SALEDATE PRODUCTID UNIT_GROSS_SALES Cost Year Month Week Day DayofWeek DayofYear Is_holiday Is_wknd Is_month_end Is_month_start Is_quarter_end Is_quarter_start Is_year_end Is_year_start sales_roll_mean_4 sales_roll_mean_7 sales_roll_mean_14 sales_roll_mean_30 sales_roll_mean_60 sales_roll_mean_100 StoreSalesSimilarity ItemSalesSimilarity MarketSalesSimilarity sales_lag_3 sales_lag_7 sales_lag_14 sales_lag_30 sales_lag_60 sales_lag_90
88361 0 Austin 2020-03-24 0 1.0 1.91 2020 3 13 24 1 84 0 0 0 0 0 0 0 0 1.0 1.428571 1.285714 NaN NaN NaN 1 86 1 NaN NaN NaN NaN NaN NaN
88948 0 Austin 2020-03-25 0 3.0 1.91 2020 3 13 25 2 85 0 0 0 0 0 0 0 0 1.0 1.000000 1.285714 0.733333 1.133333 1.35 1 86 1 NaN NaN NaN NaN NaN NaN
89312 0 Austin 2020-03-26 0 3.0 1.91 2020 3 13 26 3 86 0 0 0 0 0 0 0 0 2.0 1.714286 1.500000 1.766667 1.950000 NaN 1 86 1 NaN NaN NaN NaN NaN NaN
89530 0 Austin 2020-03-27 0 1.0 1.91 2020 3 13 27 4 87 0 1 0 0 0 0 0 0 1.0 0.571429 1.071429 1.233333 1.250000 1.63 1 86 1 1.0 NaN NaN NaN NaN NaN
89919 0 Austin 2020-03-28 0 1.0 1.91 2020 3 13 28 5 88 0 1 0 0 0 0 0 0 1.5 1.714286 1.571429 1.833333 NaN NaN 1 86 1 3.0 NaN NaN NaN NaN NaN

Drop redundant lag features, keeping only those most strongly correlated with the target.

def drop_cor(dataframe, name, index):
    # Rank the columns whose name contains `name` by their correlation with the target,
    # keep the top (index - 1) of them, and drop the remaining matching columns in place.
    keep = dataframe[dataframe.columns[dataframe.columns.str.contains(name)].tolist()+["UNIT_GROSS_SALES"]].corr().UNIT_GROSS_SALES.sort_values(ascending = False).index[1:index]
    to_drop = dataframe.drop(keep, axis = 1).columns[dataframe.drop(keep, axis = 1).columns.str.contains(name)]
    dataframe.drop(to_drop, axis = 1, inplace = True)

drop_cor(lag_sales, "sales_lag", 5)
lag_sales.head()
LOCATIONNO MARKET SALEDATE PRODUCTID UNIT_GROSS_SALES Cost Year Month Week Day DayofWeek DayofYear Is_holiday Is_wknd Is_month_end Is_month_start Is_quarter_end Is_quarter_start Is_year_end Is_year_start sales_roll_mean_4 sales_roll_mean_7 sales_roll_mean_14 sales_roll_mean_30 sales_roll_mean_60 sales_roll_mean_100 StoreSalesSimilarity ItemSalesSimilarity MarketSalesSimilarity sales_lag_3 sales_lag_7 sales_lag_14 sales_lag_30
88361 0 Austin 2020-03-24 0 1.0 1.91 2020 3 13 24 1 84 0 0 0 0 0 0 0 0 1.0 1.428571 1.285714 NaN NaN NaN 1 86 1 NaN NaN NaN NaN
88948 0 Austin 2020-03-25 0 3.0 1.91 2020 3 13 25 2 85 0 0 0 0 0 0 0 0 1.0 1.000000 1.285714 0.733333 1.133333 1.35 1 86 1 NaN NaN NaN NaN
89312 0 Austin 2020-03-26 0 3.0 1.91 2020 3 13 26 3 86 0 0 0 0 0 0 0 0 2.0 1.714286 1.500000 1.766667 1.950000 NaN 1 86 1 NaN NaN NaN NaN
89530 0 Austin 2020-03-27 0 1.0 1.91 2020 3 13 27 4 87 0 1 0 0 0 0 0 0 1.0 0.571429 1.071429 1.233333 1.250000 1.63 1 86 1 1.0 NaN NaN NaN
89919 0 Austin 2020-03-28 0 1.0 1.91 2020 3 13 28 5 88 0 1 0 0 0 0 0 0 1.5 1.714286 1.571429 1.833333 NaN NaN 1 86 1 3.0 NaN NaN NaN
# 4. Last i. Months
#####################################################
lag_sales["monthyear"] = lag_sales.SALEDATE.dt.to_period('M')

# Store-Item Based
for i in [3, 6, 9, 12, 15]:
    last_months = lag_sales.groupby(["LOCATIONNO", "PRODUCTID", "monthyear"]).UNIT_GROSS_SALES.agg([
        "sum", "mean", "std", "min", "max"]).shift(i).reset_index()
    last_months.columns = ['LOCATIONNO', 'PRODUCTID', 'monthyear', 'last_'+str(i)+'months_sales_sum',
                           'last_'+str(i)+'months_sales_mean', 'last_'+str(i)+'months_sales_std',
                           'last_'+str(i)+'months_sales_min', 'last_'+str(i)+'months_sales_max']
    lag_sales = pd.merge(lag_sales, last_months, how   = "left", on = ["LOCATIONNO", "PRODUCTID", "monthyear"])
del last_months, i

drop_cor(lag_sales, "last_", 4)

# Store Based


for i in [3, 6, 9, 12]:
    last_months = lag_sales.groupby(["LOCATIONNO", "monthyear"]).UNIT_GROSS_SALES.agg([
        "sum", "mean", "std", "min", "max"]).shift(i).reset_index()
    last_months.columns = ['LOCATIONNO', 'monthyear', 'store_last_'+str(i)+'months_sales_sum',
                           'store_last_'+str(i)+'months_sales_mean', 'store_last_'+str(i)+'months_sales_std',
                           'store_last_'+str(i)+'months_sales_min', 'store_last_'+str(i)+'months_sales_max']
    lag_sales = pd.merge(lag_sales, last_months, how = "left", on = ["LOCATIONNO", "monthyear"])
del last_months, i


# Item Based
for i in [3, 6, 9, 12]:
    last_months = lag_sales.groupby(["PRODUCTID", "monthyear"]).UNIT_GROSS_SALES.agg([
        "sum", "mean", "std", "min", "max"]).shift(i).reset_index()
    last_months.columns = ['PRODUCTID', 'monthyear', 'item_last_'+str(i)+'months_sales_sum',
                           'item_last_'+str(i)+'months_sales_mean', 'item_last_'+str(i)+'months_sales_std',
                           'item_last_'+str(i)+'months_sales_min', 'item_last_'+str(i)+'months_sales_max']
    lag_sales = pd.merge(lag_sales, last_months, how = "left", on = ["PRODUCTID", "monthyear"])
del last_months, i


# Similarity Based


for i in [3, 6, 9, 12]:
    last_months = lag_sales.groupby(["StoreSalesSimilarity", "monthyear"]).UNIT_GROSS_SALES.agg([
        "sum", "mean", "std", "min", "max"]).shift(i).reset_index()
    last_months.columns = ['StoreSalesSimilarity', 'monthyear', 'storesim_last_'+str(i)+'months_sales_sum',
                           'storesim_last_'+str(i)+'months_sales_mean', 'storesim_last_'+str(i)+'months_sales_std',
                           'storesim_last_'+str(i)+'months_sales_min', 'storesim_last_'+str(i)+'months_sales_max']
    lag_sales = pd.merge(lag_sales, last_months, how = "left", on = ["StoreSalesSimilarity", "monthyear"])
del last_months, i

for i in [3, 6, 9, 12]:
    last_months = lag_sales.groupby(["ItemSalesSimilarity", "monthyear"]).UNIT_GROSS_SALES.agg([
        "sum", "mean", "std", "min", "max"]).shift(i).reset_index()
    last_months.columns = ['ItemSalesSimilarity', 'monthyear', 'itemsim_last_'+str(i)+'months_sales_sum',
                           'itemsim_last_'+str(i)+'months_sales_mean', 'itemsim_last_'+str(i)+'months_sales_std',
                           'itemsim_last_'+str(i)+'months_sales_min', 'itemsim_last_'+str(i)+'months_sales_max']
    lag_sales = pd.merge(lag_sales, last_months, how = "left", on = ["ItemSalesSimilarity", "monthyear"])
del last_months, i

lag_sales.drop("monthyear", axis = 1, inplace = True)
# 5. Last i. day of week
#####################################################
lag_sales.sort_values(["LOCATIONNO", "PRODUCTID", "DayofWeek", "SALEDATE"], inplace = True)

lag_sales = lag_features(lag_sales, lags = np.arange(12,41, 1).tolist()+[3, 7, 60, 90],
                  groups = ["LOCATIONNO", "PRODUCTID", "DayofWeek"], target = 'UNIT_GROSS_SALES', prefix = 'dayofweek_sales_lag_')

lag_sales[lag_sales.columns[lag_sales.columns.str.contains("dayofweek_sales_lag_")].tolist()+["UNIT_GROSS_SALES"]].corr().UNIT_GROSS_SALES.sort_values(ascending = False)

drop_cor(lag_sales, "dayofweek_sales_lag_", 5)

lag_sales.sort_values(["LOCATIONNO", "PRODUCTID", "SALEDATE"], inplace = True)

#####################################################
# Exponentially Weighted Mean Features
#####################################################
def ewm_features(dataframe, alphas, lags):
    dataframe = dataframe.copy()
    for alpha in alphas:
        for lag in lags:
            dataframe['sales_ewm_alpha_' + str(alpha).replace(".", "") + "_lag_" + str(lag)] = \
                dataframe.groupby(["LOCATIONNO", "PRODUCTID"])['UNIT_GROSS_SALES']. \
                    transform(lambda x: x.shift(lag).ewm(alpha=alpha).mean())
    return dataframe

alphas = [0.95, 0.9, 0.8, 0.7, 0.5]
lags = [3, 7, 14, 30, 60, 90]

df = ewm_features(lag_sales, alphas, lags).copy()

df.shape
(290014, 150)
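
The lag, rolling, and exponentially weighted features necessarily introduce NaNs at the start of each store-item history. No imputation is applied in this notebook, since LightGBM handles missing values natively; a quick check of how much is missing:

# Fraction of missing values per column, highest first
print(df.isna().mean().sort_values(ascending=False).head(10))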

Train-Validation Split

As in most model-building processes, the data is split into training and validation sets. Since this is time series data, the split is chronological: the months of January and February 2021 form the validation set, and any data after February 2021 is set aside as the test set. The test set is only used at the end to check the performance of the final model.

# Dataframe must be sorted by date because of Time Series Split
df = df.sort_values("SALEDATE").reset_index(drop = True)
df['Week'] = df['Week'].astype(int)
df = pd.get_dummies(df, columns = ['MARKET'])

# Train Validation Split
# Validation set includes 2 months (Jan, Feb)
train = df.loc[(df["SALEDATE"] < "2021-01-01"), :]
val = df.loc[(df["SALEDATE"] >= "2021-01-01") & (df["SALEDATE"] < "2021-03-01"), :]
test = df.loc[(df["SALEDATE"] >= "2021-03-01")] # Any data after Feb 28 is set aside as test set.

cols = [col for col in train.columns if col not in ['SALEDATE', "UNIT_GROSS_SALES", "Year"]]

Y_train = train['UNIT_GROSS_SALES']
X_train = train[cols]

Y_val = val['UNIT_GROSS_SALES']
X_val = val[cols]

Y_train.shape, X_train.shape, Y_val.shape, X_val.shape
((217822,), (217822, 151), (44368,), (44368, 151))

Custom Cost Function

Symmetric mean absolute percentage error (SMAPE) is used as the cost function:

$$\mathrm{SMAPE} = \frac{100\%}{n} \sum_{t=1}^{n} \frac{|F_t - A_t|}{(|A_t| + |F_t|)/2}$$

where $A_t$ is the actual value and $F_t$ is the forecast value. The absolute difference between $A_t$ and $F_t$ is divided by half the sum of the absolute values of the actual value $A_t$ and the forecast value $F_t$. This quantity is summed over every fitted point $t$ and divided by the number of fitted points $n$.

# SMAPE: Symmetric mean absolute percentage error (adjusted MAPE)
def smape(preds, target):
    n = len(preds)
    masked_arr = ~((preds == 0) & (target == 0))
    preds, target = preds[masked_arr], target[masked_arr]
    num = np.abs(preds-target)
    denom = np.abs(preds)+np.abs(target)
    smape_val = (200*np.sum(num/denom))/n
    return smape_val

def lgbm_smape(y_true, y_pred):
    smape_val = smape(y_true, y_pred)
    return 'SMAPE', smape_val, False
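
A tiny sanity check of the metric on made-up numbers confirms the scaling (SMAPE ranges from 0 to 200):

# Per-point terms are 200*|2-4|/(2+4) and 200*|5-5|/(5+5), averaged over n = 2 points
preds  = np.array([2.0, 5.0])
actual = np.array([4.0, 5.0])
print(smape(preds, actual))   # ~33.33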

First Model

After completing the feature engineering phase, a first model is built with default parameters and all of the features, and its performance is analysed. The model considered here is LightGBM (LGBM), chosen for the advantages listed below.

LightGBM is a gradient boosting framework that uses tree based learning algorithms. It is designed to be distributed and efficient with the following advantages:

  • Faster training speed and higher efficiency
  • Lower memory usage
  • Better accuracy
  • Capability to handle large-scale data.
first_model = lgb.LGBMRegressor(random_state=384).fit(X_train, Y_train, 
                                                      eval_metric= lambda y_true, y_pred: [lgbm_smape(y_true, y_pred)])

print("TRAIN SMAPE:", smape(Y_train, first_model.predict(X_train)))
print("VALID SMAPE:", smape(Y_val, first_model.predict(X_val)))
TRAIN SMAPE: 69.37855260588468
VALID SMAPE: 96.27532578267159
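
The gap between the training and validation SMAPE suggests the default model overfits. A natural next step is hyperparameter tuning; the sklearn utilities imported at the top (TimeSeriesSplit, RandomizedSearchCV, make_scorer) can be combined with the SMAPE function for this. The sketch below is only illustrative and its parameter grid is hypothetical, not the settings used in this project.

# Hypothetical hyperparameter search sketch; the fit is left commented out
tscv = TimeSeriesSplit(n_splits=3)
smape_scorer = make_scorer(smape, greater_is_better=False)

param_dist = {
    "num_leaves": [31, 63, 127],
    "learning_rate": [0.01, 0.05, 0.1],
    "n_estimators": [200, 500, 1000],
    "min_child_samples": [10, 20, 40],
}

search = RandomizedSearchCV(
    lgb.LGBMRegressor(random_state=384),
    param_distributions=param_dist,
    n_iter=10,
    cv=tscv,
    scoring=smape_scorer,
    random_state=384,
)
# search.fit(X_train, Y_train)
# print(search.best_params_, -search.best_score_)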

Feature Importance

def plot_lgb_importances(model, plot=False, num=10):
    from matplotlib import pyplot as plt
    import seaborn as sns
    
    
    
    # LGBM API
    #gain = model.feature_importance('gain')
    #feat_imp = pd.DataFrame({'feature': model.feature_name(),
    #                         'split': model.feature_importance('split'),
    #                         'gain': 100 * gain / gain.sum()}).sort_values('gain', ascending=False)
    
    # SKLEARN API
    gain = model.booster_.feature_importance(importance_type='gain')
    feat_imp = pd.DataFrame({'feature': model.feature_name_,
                             'split': model.booster_.feature_importance(importance_type='split'),
                             'gain': 100 * gain / gain.sum()}).sort_values('gain', ascending=False)
    if plot:
        plt.figure(figsize=(10, 10))
        sns.set_context("talk", font_scale=1)
        sns.set_style("whitegrid")
        sns.barplot(x="gain", y="feature", data=feat_imp[0:25], color = 'lightblue')
        plt.title('feature')
        plt.tight_layout()
        plt.show()
    else:
        print(feat_imp.head(num))
        return feat_imp

feature_imp_df = plot_lgb_importances(first_model, num=10)
                           feature  split       gain
141       sales_ewm_alpha_05_lag_7     78  31.387042
140       sales_ewm_alpha_05_lag_3    100  22.170319
142      sales_ewm_alpha_05_lag_14     51   9.177836
6                        DayofWeek    267   5.746324
114          dayofweek_sales_lag_3     76   2.918399
7                        DayofYear    190   2.184596
21             sales_roll_mean_100    204   1.659234
20              sales_roll_mean_60    173   1.539139
5                              Day    165   1.366325
43   store_last_9months_sales_mean      7   1.329502
feature_imp_df.shape, feature_imp_df[feature_imp_df.gain > 0].shape, feature_imp_df[feature_imp_df.gain > 0.57].shape
((151, 3), (133, 3), (21, 3))
plot_lgb_importances(first_model, plot=True)

Shap

SHAP (SHapley Additive exPlanations) is a game theoretic approach to explain the output of any machine learning model. It connects optimal credit allocation with local explanations using the classic Shapley values from game theory and their related extensions.

sns.set_context('talk', font_scale = 1.4)
explainer = shap.Explainer(first_model)
shap_values_train = explainer(X_train)
shap_values_valid = explainer(X_val)

len(shap_values_train), len(shap_values_valid)
(217822, 44368)

A beeswarm summary plot

The beeswarm plot is designed to display an information-dense summary of how the top features in a dataset impact the model’s output. Each instance of the given explanation is represented by a single dot on each feature row. The x position of the dot is determined by the SHAP value (shap_values.values[instance, feature]) of that feature, and dots “pile up” along each feature row to show density. Color is used to display the original value of a feature (shap_values.data[instance, feature]). In the plots below we can see that the sales lag for 3 days is the most important feature on average for the training data; for the validation data, the sales lag for 7 days is the most important feature on average.

# summarize the effects of all the features
sns.set_context('talk', font_scale = 1.4)
shap.plots.beeswarm(shap_values_train, max_display=30)

# summarize the effects of all the features
sns.set_context('talk', font_scale = 1.4)
shap.plots.beeswarm(shap_values_valid, max_display=30)

shap.plots.bar(shap_values_train, max_display=30)

Error Analysis

error = pd.DataFrame({
    "date":val.SALEDATE,
    "store":X_val.LOCATIONNO,
    "item":X_val.PRODUCTID,
    "actual":Y_val,
    "pred":first_model.predict(X_val)
}).reset_index(drop = True)

error["error"] = np.abs(error.actual-error.pred)

error.sort_values("error", ascending=False).head(20)
date store item actual pred error
38403 2021-02-21 30 41 85.0 11.357431 73.642569
38456 2021-02-21 30 37 57.0 9.129747 47.870253
38569 2021-02-21 30 3 59.0 13.590359 45.409641
38439 2021-02-21 4 41 36.0 5.282435 30.717565
18216 2021-01-25 30 42 43.0 13.941786 29.058214
39840 2021-02-22 30 42 38.0 9.921250 28.078750
43925 2021-02-28 30 41 44.0 16.407923 27.592077
14008 2021-01-19 30 42 32.0 8.389453 23.610547
38932 2021-02-21 4 37 31.0 7.826655 23.173345
38563 2021-02-21 30 43 31.0 8.182316 22.817684
39088 2021-02-21 36 3 29.0 7.635129 21.364871
41019 2021-02-24 30 3 26.0 4.827231 21.172769
29090 2021-02-08 30 41 32.0 11.175341 20.824659
22973 2021-01-31 12 42 30.0 9.534384 20.465616
40879 2021-02-24 30 27 25.0 4.890703 20.109297
8366 2021-01-12 30 27 29.0 9.047829 19.952171
23307 2021-01-31 30 41 35.0 15.066269 19.933731
28965 2021-02-08 4 37 26.0 6.205603 19.794397
27421 2021-02-06 30 27 27.0 7.547040 19.452960
35710 2021-02-17 32 41 25.0 5.810368 19.189632
error[["actual", "pred", "error"]].describe([0.7, 0.8, 0.9, 0.95, 0.99]).T
count mean std min 50% 70% 80% 90% 95% 99% max
actual 44368.0 2.584691 3.101834 0.000000 2.000000 3.000000 4.000000 6.000000 8.000000 14.000000 85.000000
pred 44368.0 2.350267 1.485043 0.101703 2.007167 2.537990 2.980707 3.834862 4.929053 8.650099 18.277589
error 44368.0 1.864230 1.917119 0.000021 1.405939 2.089352 2.657121 3.826397 5.232841 9.339638 73.642569

Waterfall plots are designed to display explanations for individual predictions, so they expect a single row of an Explanation object as input. The bottom of a waterfall plot starts as the expected value of the model output, and then each row shows how the positive (red) or negative (blue) contribution of each feature moves the value from the expected model output over the background dataset to the model output for this prediction.

sns.set_context('talk', font_scale = 1.4)
shap.plots.waterfall(shap_values_valid[30125])

A force plot shows how much a given feature contributes to the prediction, compared to making that prediction at some baseline value of that feature. Force plots are interactive; since GitHub had difficulty rendering them, I have uploaded pictures of the plots instead.

# visualize the first prediction's explanation with a force plot
shap.initjs()
shap.plots.force(shap_values_valid[30125])

# visualize the explanation for all of the predictions with a force plot;
# to avoid overwhelming the browser, only the first 100 rows are passed
valid_shap = explainer.shap_values(X_val)
shap.initjs()
shap.plots.force(explainer.expected_value, valid_shap[:100, :], X_val.iloc[:100, :])

# Mean Absolute Error
error.groupby(["store", "item"]).error.mean().sort_values(ascending = False)
store  item
30     41      8.796683
       3       7.730369
       27      7.417767
       42      7.331254
       37      6.566694
       1       6.551101
       38      6.450013
       8       5.688377
       34      5.215296
       11      4.847538
       43      4.786728
12     42      4.604071
       3       4.430166
30     13      4.366960
       40      4.349082
0      34      4.321580
       42      4.259871
30     39      4.251495
4      37      4.141325
       8       4.049376
33     3       3.985145
4      42      3.982138
12     8       3.964201
       27      3.933068
0      11      3.931204
       3       3.796247
       1       3.767996
12     34      3.690778
0      27      3.663062
30     44      3.632818
12     41      3.628897
4      3       3.591541
0      8       3.581333
4      41      3.575084
24     1       3.554002
0      41      3.517047
1      3       3.429900
4      38      3.424165
       1       3.372646
       27      3.366829
36     3       3.359900
33     8       3.340034
       1       3.331595
30     45      3.328554
14     41      3.318947
27     3       3.310898
39     42      3.283005
33     38      3.237391
39     1       3.190325
17     27      3.186373
32     41      3.174905
33     27      3.141269
4      13      3.132354
12     38      3.108171
       11      3.100573
19     3       3.097261
39     27      3.074528
24     42      3.071504
12     13      3.059989
0      44      2.998813
33     11      2.953767
42     1       2.942637
12     1       2.926406
14     38      2.910530
0      43      2.908644
2      13      2.896304
33     34      2.895418
42     13      2.895023
4      39      2.891847
33     41      2.886471
39     3       2.878725
0      37      2.872639
12     44      2.863298
15     41      2.859410
27     43      2.835657
14     34      2.831470
12     43      2.803658
2      42      2.795462
19     38      2.767300
17     8       2.758636
12     40      2.756067
14     42      2.754917
17     1       2.754808
       34      2.743868
33     44      2.742895
24     43      2.742398
8      41      2.708218
18     1       2.701318
17     38      2.678119
1      27      2.665648
14     37      2.664356
3      41      2.661593
42     3       2.646940
19     27      2.634222
27     42      2.632597
       27      2.621078
2      8       2.599549
16     1       2.599387
14     8       2.591194
32     11      2.586096
0      38      2.585180
32     37      2.580516
15     37      2.577988
19     41      2.577918
24     38      2.572516
2      27      2.558549
0      13      2.553786
24     13      2.549550
32     8       2.541285
7      38      2.538866
39     37      2.534707
       41      2.533105
36     41      2.526091
       42      2.524357
4      43      2.523554
24     27      2.517807
12     39      2.514400
       37      2.510257
32     3       2.495721
7      3       2.472591
32     27      2.472559
42     37      2.471904
27     1       2.468068
33     39      2.466154
16     41      2.456669
27     41      2.455888
       13      2.454431
17     44      2.451902
42     41      2.436222
24     41      2.427852
26     3       2.425128
15     42      2.423113
33     37      2.413726
2      1       2.412799
17     3       2.410413
0      45      2.398012
24     3       2.396723
32     42      2.390878
1      1       2.364970
7      41      2.363477
24     8       2.363074
2      3       2.360506
19     42      2.349965
24     37      2.349415
19     37      2.346062
18     11      2.344202
32     38      2.340597
7      37      2.332521
4      34      2.330637
33     42      2.328101
32     34      2.319122
14     27      2.315701
16     3       2.286446
15     27      2.285725
42     27      2.272444
1      8       2.258241
36     39      2.254372
21     42      2.237494
27     38      2.237187
16     27      2.231148
18     3       2.228452
42     11      2.226240
19     1       2.223807
36     37      2.216599
39     11      2.212751
3      3       2.209608
44     3       2.205291
19     8       2.202903
26     8       2.193774
1      44      2.187075
17     13      2.183657
25     1       2.183614
1      37      2.175282
21     38      2.171630
39     43      2.163067
3      37      2.157286
19     34      2.152494
1      11      2.149056
15     3       2.147725
17     43      2.134811
27     11      2.134671
1      38      2.134429
15     38      2.134370
8      42      2.130037
19     11      2.126680
36     27      2.125526
42     42      2.116560
7      1       2.115409
32     13      2.114694
1      42      2.113248
2      43      2.112902
42     38      2.108225
18     41      2.104434
16     42      2.101756
20     8       2.094922
23     3       2.089880
38     1       2.088971
18     42      2.083530
27     8       2.076874
44     43      2.066242
45     37      2.066145
15     1       2.066116
21     13      2.064104
39     8       2.063827
28     41      2.062566
0      40      2.060275
4      11      2.059599
17     42      2.058140
3      42      2.057786
38     3       2.054505
8      37      2.051368
43     42      2.041615
40     37      2.038056
33     13      2.035505
14     3       2.033955
8      3       2.033846
22     42      2.033555
36     13      2.029968
24     34      2.029384
16     38      2.027139
1      43      2.024892
26     41      2.024144
16     8       2.022475
11     41      2.012826
26     37      2.008662
       13      2.003392
44     1       1.999507
5      42      1.997661
1      13      1.996955
44     44      1.996313
10     41      1.989540
40     3       1.987509
25     42      1.984760
26     1       1.984527
1      41      1.978309
33     43      1.973729
17     41      1.973248
26     34      1.973114
41     27      1.970733
21     37      1.962348
3      1       1.960896
23     8       1.960636
44     42      1.957164
3      38      1.956553
42     43      1.949328
18     37      1.949230
39     13      1.948035
44     41      1.945350
20     41      1.945303
18     34      1.940320
44     8       1.940279
17     37      1.935355
33     40      1.933901
3      43      1.933638
25     3       1.932040
24     40      1.927933
6      42      1.927292
22     8       1.926427
39     44      1.917073
4      45      1.915448
26     27      1.915327
42     40      1.912418
27     34      1.912250
42     8       1.911520
3      8       1.909868
19     43      1.909620
39     38      1.908371
2      37      1.907649
18     27      1.902842
6      41      1.901781
3      11      1.896718
36     1       1.889901
44     27      1.886836
1      39      1.881931
36     8       1.873640
14     43      1.872762
26     38      1.871297
45     44      1.865992
39     39      1.865921
44     13      1.864748
27     45      1.863641
42     44      1.862351
21     1       1.859969
41     1       1.858548
15     34      1.858174
25     8       1.857868
2      34      1.857595
0      39      1.856275
20     34      1.851778
32     40      1.848275
24     44      1.845841
43     3       1.843875
31     41      1.843710
19     13      1.842873
17     11      1.842854
3      34      1.839337
21     43      1.838972
8      38      1.835077
43     13      1.824492
18     38      1.820752
25     41      1.820603
2      41      1.814130
14     39      1.813962
3      13      1.813783
20     13      1.813201
35     34      1.811165
20     27      1.810796
10     27      1.803908
14     44      1.801503
2      38      1.798955
6      34      1.798940
42     39      1.798875
45     27      1.796934
21     11      1.796591
36     11      1.793527
35     38      1.790625
8      34      1.790520
5      38      1.790312
21     39      1.787890
31     38      1.787751
7      8       1.783308
14     1       1.783071
24     39      1.781381
15     8       1.776310
9      41      1.768692
25     13      1.761052
10     3       1.760061
16     11      1.753975
28     37      1.752716
36     38      1.751559
6      1       1.748578
7      34      1.744266
38     37      1.743989
39     40      1.741437
18     13      1.737523
42     34      1.736864
2      40      1.735879
10     34      1.732812
32     1       1.730147
8      1       1.729259
3      27      1.728229
32     39      1.727863
16     44      1.726095
20     1       1.725890
40     42      1.724510
1      34      1.721047
25     34      1.710974
19     44      1.710354
26     44      1.707571
44     37      1.707304
32     44      1.705164
21     8       1.697304
38     27      1.693728
22     43      1.692139
45     38      1.691840
28     8       1.691809
20     42      1.690213
41     41      1.690204
45     11      1.688956
21     41      1.687332
15     43      1.683894
10     8       1.681712
46     27      1.678961
33     45      1.677837
22     13      1.669968
21     27      1.669549
37     3       1.667952
29     1       1.664966
39     34      1.664316
19     39      1.662362
36     43      1.660920
35     41      1.658878
38     11      1.658647
20     44      1.655116
7      42      1.653392
4      44      1.650880
35     42      1.649943
41     37      1.643977
22     27      1.642987
41     42      1.641927
14     13      1.639081
35     3       1.637331
28     42      1.635993
26     42      1.635693
36     34      1.633226
21     3       1.631875
23     27      1.629565
13     42      1.629360
41     3       1.629063
6      38      1.628237
2      11      1.625523
46     42      1.625001
39     45      1.624650
31     34      1.624203
43     8       1.622206
45     3       1.617795
2      44      1.616223
5      1       1.615027
       3       1.611376
17     45      1.609809
6      3       1.608170
16     13      1.605829
17     39      1.605315
10     42      1.605111
44     34      1.604073
26     40      1.602739
8      8       1.601100
20     3       1.600886
37     42      1.599365
46     1       1.599363
7      11      1.595823
19     40      1.595275
25     37      1.594842
3      39      1.594068
27     44      1.592834
28     38      1.591825
4      40      1.591698
15     40      1.589640
40     11      1.589380
43     1       1.588752
8      13      1.588291
       27      1.587870
20     37      1.586954
40     27      1.585805
       1       1.582457
35     43      1.580955
44     39      1.580266
6      37      1.580047
10     40      1.571123
23     1       1.570028
38     43      1.569349
8      43      1.566729
27     37      1.566515
45     41      1.563603
21     44      1.563534
27     39      1.558764
29     42      1.557502
45     42      1.557341
18     39      1.555327
17     40      1.553271
18     44      1.553210
26     11      1.552130
36     45      1.551781
31     37      1.548940
10     38      1.542138
25     38      1.536770
3      45      1.533392
14     11      1.533245
11     42      1.533169
41     13      1.531229
22     41      1.531058
6      27      1.530885
31     27      1.529504
22     3       1.528377
25     44      1.526525
44     38      1.522776
15     13      1.522659
23     42      1.522417
18     40      1.516749
35     27      1.516605
26     39      1.516518
9      37      1.515547
27     40      1.514145
18     43      1.506296
26     43      1.501951
22     34      1.501277
9      8       1.501055
31     8       1.497446
36     44      1.494615
46     8       1.488384
37     27      1.485193
20     39      1.482098
13     38      1.480978
45     40      1.477098
31     3       1.476729
5      37      1.476702
15     39      1.473841
20     43      1.472823
25     40      1.472325
43     41      1.471055
25     43      1.470083
       11      1.469726
29     38      1.467283
35     44      1.465114
19     45      1.461398
29     13      1.459881
38     42      1.459658
45     1       1.456833
43     43      1.455565
6      8       1.454127
13     41      1.453342
20     11      1.453098
41     8       1.452080
13     8       1.441653
11     3       1.439156
25     27      1.435201
24     11      1.434468
18     8       1.434382
9      38      1.433600
28     3       1.433528
31     13      1.425882
16     43      1.425030
5      34      1.419403
24     45      1.417889
32     45      1.413463
14     45      1.413133
28     1       1.409939
32     43      1.409382
36     40      1.408973
40     38      1.408733
       41      1.406929
38     8       1.406324
35     1       1.404390
8      40      1.403327
46     39      1.397547
13     1       1.396676
35     8       1.393965
15     44      1.392225
46     41      1.390153
38     38      1.386612
46     43      1.384864
9      34      1.383108
34     27      1.383070
11     38      1.382509
6      44      1.381311
21     34      1.381135
11     1       1.380937
6      45      1.380672
41     44      1.379753
7      27      1.377542
10     44      1.376408
45     34      1.372226
22     37      1.371057
11     27      1.370857
       34      1.364955
42     45      1.364858
23     37      1.364523
37     38      1.364031
44     11      1.362768
28     27      1.356454
16     34      1.355161
23     38      1.351869
43     11      1.350829
3      44      1.349670
10     11      1.347824
7      13      1.346960
10     13      1.346377
5      41      1.346221
13     27      1.344241
       3       1.343940
43     27      1.343367
44     40      1.341989
46     3       1.337838
13     34      1.335442
38     40      1.331371
20     38      1.331275
1      40      1.325123
46     37      1.319289
29     41      1.318189
26     45      1.317941
35     37      1.316348
15     11      1.310689
28     43      1.307173
23     13      1.303640
15     45      1.303377
12     45      1.301729
35     13      1.300401
21     45      1.300292
14     40      1.298674
41     43      1.297476
10     1       1.297088
20     40      1.290303
40     39      1.288799
46     13      1.288330
10     39      1.286495
9      27      1.286425
8      44      1.285620
44     45      1.285025
22     40      1.280695
5      13      1.279332
29     37      1.278220
13     13      1.277620
31     44      1.277343
5      40      1.275691
2      39      1.275454
8      11      1.273474
23     41      1.272285
1      45      1.271653
16     37      1.271520
41     11      1.267283
31     42      1.265529
23     34      1.265441
46     11      1.263586
10     37      1.263164
22     44      1.262131
41     38      1.261945
40     43      1.261918
37     43      1.261241
40     13      1.260632
37     41      1.258518
11     11      1.254279
13     45      1.251213
8      39      1.244513
43     37      1.241200
38     39      1.240999
2      45      1.240259
22     1       1.235999
45     43      1.235887
13     44      1.229849
41     39      1.229810
37     11      1.228697
40     8       1.223290
23     11      1.220861
41     40      1.220079
22     38      1.217430
40     44      1.215226
5      11      1.211048
11     13      1.209994
35     39      1.209714
40     34      1.207613
23     39      1.205320
7      44      1.203008
6      40      1.199670
       43      1.199251
34     38      1.190904
6      11      1.190330
29     39      1.189879
22     45      1.186641
34     1       1.186173
25     45      1.180808
11     40      1.180772
28     34      1.179983
38     44      1.179506
11     44      1.176823
43     44      1.175224
31     11      1.169490
28     11      1.168184
38     13      1.167285
35     40      1.161437
29     43      1.159207
8      45      1.157875
29     40      1.156590
9      1       1.153502
16     39      1.149489
46     44      1.148087
21     40      1.146468
18     45      1.146364
6      13      1.144507
10     43      1.138214
28     40      1.135324
23     43      1.129761
41     34      1.121338
38     41      1.121165
7      40      1.120401
16     40      1.118409
23     40      1.115234
38     45      1.114810
5      8       1.113851
20     45      1.111943
5      27      1.102829
13     37      1.101488
       40      1.100397
43     34      1.100327
37     13      1.100285
34     42      1.098209
28     44      1.095334
11     37      1.094298
45     13      1.093022
28     39      1.092296
38     34      1.092063
31     1       1.087255
28     13      1.086889
3      40      1.083108
37     1       1.082157
9      3       1.081618
29     11      1.078017
5      44      1.074451
45     45      1.072343
       39      1.069386
34     11      1.066682
46     38      1.065999
29     27      1.065617
34     34      1.065393
43     38      1.062559
37     37      1.059753
7      45      1.059462
6      39      1.058750
7      43      1.058048
29     8       1.055055
25     39      1.053514
13     11      1.052914
29     34      1.050733
10     45      1.049201
31     39      1.048510
16     45      1.046897
37     44      1.038964
29     3       1.038124
46     34      1.037908
29     45      1.037392
22     11      1.034675
7      39      1.032651
23     44      1.025163
45     8       1.020356
35     45      1.011467
46     40      1.010654
37     34      1.009384
34     13      1.004393
46     45      1.003513
29     44      0.995035
37     8       0.993700
31     40      0.987735
34     8       0.982164
40     40      0.980587
41     45      0.980504
31     45      0.970299
37     40      0.967668
9      13      0.953869
34     37      0.950879
13     39      0.948704
28     45      0.946098
13     43      0.936056
31     43      0.929694
37     39      0.929275
23     45      0.928652
34     41      0.927818
40     45      0.917975
35     11      0.905323
37     45      0.903277
9      42      0.902879
34     44      0.901609
       43      0.897353
22     39      0.888240
11     8       0.887651
5      43      0.865483
       39      0.857759
34     45      0.853622
       40      0.847206
9      39      0.844469
       11      0.843750
5      45      0.842548
43     45      0.832047
9      44      0.808320
43     40      0.807350
34     3       0.802921
43     39      0.792631
34     39      0.789006
11     39      0.730254
9      43      0.719432
11     45      0.704662
9      40      0.702955
11     43      0.697794
9      45      0.670643
Name: error, dtype: float64
# Mean Absolute Error
error.groupby(["item"]).error.mean().sort_values(ascending = False)
item
3     2.300384
41    2.237997
42    2.224660
27    2.125854
1     2.119102
8     1.999352
38    1.992689
37    1.982540
34    1.842083
13    1.798909
11    1.722505
43    1.682012
44    1.597961
39    1.484561
40    1.441437
45    1.275640
Name: error, dtype: float64
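Since the model is scored on SMAPE, the error can also be broken down per item with the same metric. The short sketch below is not part of the original analysis; it simply reuses the smape helper defined earlier on the validation error table:

# SMAPE per item on the validation set, worst first
item_smape = (error.groupby("item")
                   .apply(lambda g: smape(g.pred.values, g.actual.values))
                   .sort_values(ascending=False))
print(item_smape.head(10))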
# Store 1 Actual - Pred
sns.set_context("talk", font_scale=1.4)
sub = error[error.store == 1].set_index("date")
fig, axes = plt.subplots(8, 6, figsize=(30, 40))
for i in range(46):
    ax = axes[i // 6, i % 6]  # lay the 46 items out on an 8 x 6 grid of panels
    sub[sub.item == i].actual.plot(ax=ax, legend=True, label="Item " + str(i) + " Sales")
    sub[sub.item == i].pred.plot(ax=ax, legend=True, label="Item " + str(i) + " Pred", linestyle="dashed")

plt.tight_layout(pad=3)
plt.suptitle("Store 1 Items");
plt.show()

sns.set_context('talk', font_scale = 1.4)
fig, axes = plt.subplots(4, 2, figsize = (20,20))
for axi in axes.flat:
    axi.ticklabel_format(style="sci", axis="y", scilimits=(0,10))
    axi.ticklabel_format(style="sci", axis="x", scilimits=(0,10))
    axi.get_yaxis().set_major_formatter(plt.FuncFormatter(lambda x, loc: "{:,}".format(int(x))))
    axi.get_xaxis().set_major_formatter(plt.FuncFormatter(lambda x, loc: "{:,}".format(int(x))))
    
(error.actual-error.pred).hist(ax = axes[0, 0], color = "steelblue", bins = 20)
error.error.hist(ax = axes[0,1], color = "steelblue", bins = 20)
sr = error.copy()
sr["StandardizedR"] = (sr.error / (sr.actual-sr.pred).std())
sr["StandardizedR2"] = ((sr.error / (sr.actual-sr.pred).std())**2)
sr.plot.scatter(x = "pred",y = "StandardizedR", color = "red", ax = axes[1,0])
sr.plot.scatter(x = "pred",y = "StandardizedR2", color = "red", ax = axes[1,1])
error.actual.hist(ax = axes[2, 0], color = "purple", bins = 20)
error.pred.hist(ax = axes[2, 1], color = "purple", bins = 20)
error.plot.scatter(x = "actual",y = "pred", color = "seagreen", ax = axes[3,0]);

# QQ Plot
import statsmodels.api as sm
sm.qqplot(sr.pred, ax = axes[3,1], c = "seagreen")
plt.suptitle("ERROR ANALYSIS", fontsize = 20)
axes[0,0].set_title("Error Histogram", fontsize = 15)
axes[0,1].set_title("Absolute Error Histogram", fontsize = 15)
axes[1,0].set_title("Standardized Residuals & Fitted Values", fontsize = 15)
axes[1,1].set_title("Standardized Residuals^2 & Fitted Values", fontsize = 15)
axes[2,0].set_title("Actual Histogram", fontsize = 15)
axes[2,1].set_title("Pred Histogram", fontsize = 15);
axes[3,0].set_title("Actual Pred Relationship", fontsize = 15);
axes[3,1].set_title("QQ Plot", fontsize = 15);
axes[1,0].set_xlabel("Fitted Values (Pred)", fontsize = 12)
axes[1,1].set_xlabel("Fitted Values (Pred)", fontsize = 12)
axes[3,0].set_xlabel("Actual", fontsize = 12)
axes[1,0].set_ylabel("Standardized Residuals", fontsize = 12)
axes[1,1].set_ylabel("Standardized Residuals^2", fontsize = 12)
axes[3,0].set_ylabel("Pred", fontsize = 12)
fig.tight_layout(pad=3.0)
plt.savefig("errors.png")
plt.show()

Next Model

Default Parameters & Feature Selection with LGBM Feature Importance

# First model feature importance
cols = feature_imp_df[feature_imp_df.gain > 0.015].feature.tolist()
print("Independent Variables:", len(cols))

second_model = lgb.LGBMRegressor(random_state=384).fit(
    X_train[cols], Y_train, 
    eval_metric = lambda y_true, y_pred: [lgbm_smape(y_true, y_pred)])

print("TRAIN SMAPE:", smape(Y_train, second_model.predict(X_train[cols])))
print("VALID SMAPE:", smape(Y_val, second_model.predict(X_val[cols])))
Independent Variables: 117
TRAIN SMAPE: 69.48193950896932
VALID SMAPE: 95.93174397344767
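The 0.015 gain cutoff is one reasonable choice; another option is a cumulative-gain rule. The sketch below is only an illustration (the 99% cutoff is an arbitrary assumption, not part of the original analysis): it keeps the smallest set of features that together account for roughly 99% of the total gain.

# Hypothetical alternative: keep features covering ~99% of the total gain
imp_sorted = feature_imp_df.sort_values("gain", ascending=False).reset_index(drop=True)
imp_sorted["cum_gain"] = imp_sorted["gain"].cumsum()   # 'gain' is already expressed in percent
cols_cum = imp_sorted.loc[imp_sorted["cum_gain"] <= 99, "feature"].tolist()
print("Features kept by the cumulative-gain rule:", len(cols_cum))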
lgbm_params = {
    "num_leaves": [20, 31],              # Default 31
    "max_depth": [-1, 20, 30],           # Default -1
    "learning_rate": [0.1, 0.05],        # Default 0.1
    "n_estimators": [10000, 15000],      # Default 100
    "min_split_gain": [0.0, 2, 5],       # Default 0
    "min_child_samples": [10, 20, 30],   # Default 20
    "colsample_bytree": [0.5, 0.8, 1.0], # Default 1
    "reg_alpha": [0.0, 0.5, 1],          # Default 0
    "reg_lambda": [0.0, 0.5, 1]          # Default 0
}

model = lgb.LGBMRegressor(random_state=384)

tscv = TimeSeriesSplit(n_splits=3)
rsearch = RandomizedSearchCV(model, lgbm_params, random_state=384, cv=tscv,
                             scoring=make_scorer(smape, greater_is_better=False),  # SMAPE is minimized
                             verbose=True, n_jobs=-1).fit(X_train[cols], Y_train)

print(rsearch.best_params_)
Fitting 3 folds for each of 10 candidates, totalling 30 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  30 out of  30 | elapsed: 142.9min finished


{'reg_lambda': 0.0, 'reg_alpha': 1, 'num_leaves': 20, 'n_estimators': 10000, 'min_split_gain': 0.0, 'min_child_samples': 20, 'max_depth': 20, 'learning_rate': 0.1, 'colsample_bytree': 1.0}
model_tuned = lgb.LGBMRegressor(**rsearch.best_params_, random_state=384).fit(X_train[cols], Y_train)

print("TRAIN SMAPE:", smape(Y_train, model_tuned.predict(X_train[cols])))
print("VALID SMAPE:", smape(Y_val, model_tuned.predict(X_val[cols])))
TRAIN SMAPE: 48.433166801003026
VALID SMAPE: 97.24837091281393
model_tuned2 = lgb.LGBMRegressor(num_leaves=31, n_estimators=15000, max_depth=20, random_state=384, metric = "custom")      
model_tuned2.fit(
    X_train[cols], Y_train,
    eval_metric= lambda y_true, y_pred: [lgbm_smape(y_true, y_pred)],
    eval_set = [(X_train[cols], Y_train), (X_val[cols], Y_val)],
    eval_names = ["Train", "Valid"],
    early_stopping_rounds= 1000, verbose = 500
)
print("Best Iteration:", model_tuned2.booster_.best_iteration)
Training until validation scores don't improve for 1000 rounds
[500]   Train's SMAPE: 66.2324  Valid's SMAPE: 95.4403
[1000]  Train's SMAPE: 63.2125  Valid's SMAPE: 95.5624
[1500]  Train's SMAPE: 60.7468  Valid's SMAPE: 95.4656
Early stopping, best iteration is:
[696]   Train's SMAPE: 64.8907  Valid's SMAPE: 95.3564
Best Iteration: 696
df.sort_values(["LOCATIONNO", "PRODUCTID", "SALEDATE"], inplace = True)

train_final = df.loc[(df["SALEDATE"] < "2021-03-01"), :]
test_final = df.loc[(df["SALEDATE"] >= "2021-03-01"), :]

X_train_final = train_final[cols]
Y_train_final = train_final.UNIT_GROSS_SALES
X_test_final = test_final[cols]



final_model = lgb.LGBMRegressor(num_leaves=31, n_estimators=15000, max_depth=20, random_state=384, metric = "custom")
final_model.set_params(n_estimators=model_tuned2.booster_.best_iteration) # Best iteration: 696
final_model.fit(X_train_final[cols], Y_train_final,
                eval_metric= lambda y_true, y_pred: [lgbm_smape(y_true, y_pred)])
LGBMRegressor(max_depth=20, metric='custom', n_estimators=696, random_state=384)
final_model.predict(X_test_final)
array([6.82137589, 5.82712452, 3.76803532, ..., 2.00950222, 1.83432827,
       1.17740984])
sales_test = test_final[['LOCATIONNO', 'SALEDATE', 'PRODUCTID', 'UNIT_GROSS_SALES']]
sales_test['Predictions'] = np.round(final_model.predict(X_test_final))
sales_test['error'] = sales_test['UNIT_GROSS_SALES'] - sales_test['Predictions']
sales_test.head()
LOCATIONNO SALEDATE PRODUCTID UNIT_GROSS_SALES Predictions error
262726 0 2021-03-01 1 6.0 7.0 -1.0
263527 0 2021-03-02 1 12.0 6.0 6.0
264122 0 2021-03-03 1 1.0 4.0 -3.0
264932 0 2021-03-04 1 4.0 4.0 0.0
265271 0 2021-03-05 1 6.0 3.0 3.0
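For completeness, the held-out March test period can be scored with the same metric used for training and validation. This is a quick sketch using the rounded predictions in sales_test; the resulting number is not reported in the original write-up:

# SMAPE on the test period, using the rounded predictions
print("TEST SMAPE:", smape(sales_test['Predictions'].values,
                           sales_test['UNIT_GROSS_SALES'].values))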
item1 = sales_test[(sales_test['LOCATIONNO'] == 1) & (sales_test['PRODUCTID'] == 34)][['UNIT_GROSS_SALES', 'Predictions', 'error']]
sns.lineplot(data=item1, palette="tab10", linewidth=2.5)

plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
<matplotlib.legend.Legend at 0x2ac01a5f8b0>

Future Work

The next step in the process is to feed these forecasts into a newsvendor model to determine the stocking quantity for each item in each store. A simulation can then be run to identify how many deliveries should be scheduled throughout the week to minimize waste.
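As a rough illustration of that step, the sketch below computes the classic newsvendor order quantity as the critical-ratio quantile of an assumed demand distribution. Everything here is hypothetical: it assumes the item's Price and Cost capture the underage and overage economics, zero salvage value for expired kits, and a normal approximation of daily demand built from the point forecast and an error spread.

import numpy as np
from scipy.stats import norm

def newsvendor_quantity(forecast_mean, forecast_std, price, cost, salvage=0.0):
    # Underage cost: margin lost when we stock too little; overage cost: loss when a kit expires unsold
    cu = price - cost
    co = cost - salvage
    critical_ratio = cu / (cu + co)
    # Stock up to the critical-ratio quantile of the (approximate) demand distribution
    return norm.ppf(critical_ratio, loc=forecast_mean, scale=forecast_std)

# Hypothetical example: forecast of 6 units with spread 2, price $8.99, cost $4.50
print(np.ceil(newsvendor_quantity(6, 2, 8.99, 4.50)))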

Amol Kulkarni
Ph.D.

My research interests include application of Machine learning algorithms to the fields of Marketing and Supply Chain Engineering, Decision Theory and Process Optimization.