Reimbursement Exploratory Data Analysis

This is a exploratory data analysis for global reimbursement amount.

  • The date used is "Sent for Payment Date"
  • Jun 2017 and July 2017 has been dropped in this dataset as that time we are under initial implementation status

Target:

  1. To generate a complete picture of global reimbursement spending over time
  2. To prepare for modeling cash forecast
  3. To help day-to-day management and decision making

Benefit:

  1. Generate cash forecast model to help prediction
  2. Get insights and intuitions into global reimbursement spendings
In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
import seaborn as sns
import time
from datetime import datetime

import warnings
warnings.filterwarnings('ignore')

def fxn():
    warnings.warn("deprecated", DeprecationWarning)

with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    fxn()

%matplotlib inline
sns.set(font_scale = 1.2)

#plotly.offline doesn't push your charts to the clouds
import plotly.offline as pyo
#allows us to create the Data and Figure objects
from plotly.graph_objs import *
import plotly.graph_objs as go
#plotly.plotly pushes your charts to the cloud  
import plotly.plotly as py

#lets us see the charts in an iPython Notebook
pyo.offline.init_notebook_mode() # run at the start of every ipython 
In [2]:
from IPython.display import HTML

HTML('''<script>
code_show=true; 
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 } else {
 $('div.input').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the raw code."></form>''')
Out[2]:
In [3]:
# read and transfer date
df = pd.read_excel('sfpd_expenseType - Copy.xlsx',header=1)
df.drop(df.tail(1).index,inplace=True)

df['Year'] = df['Sent for Payment Date'].apply(lambda x: x.split('-')[0])
df['Month'] = df['Sent for Payment Date'].apply(lambda x: x.split('-')[1])

def func(x):
    return datetime.strptime(x, "%Y/%m/%d")
df['date']  = (df['Year'] + '/' + df['Month'] + '/1').apply(func)

# out of pocket dataframe
oop = df[df['Payment Type'] == 'Out-of-Pocket']

Total Reimbursement Spending

Trend

  • There is a clear trend that the amount reimbursed is increasing month by month.
  • There is a clear cycle pattern showed in the graph that we can use for estimating.
  • It seems that the largest amount is at year end, following that is a significant drop which maybe due to Christmas event.
In [4]:
# preparing data
total_df = oop.pivot_table('Expense Amount (rpt)','date',aggfunc=sum)
total_df.drop(total_df.index[0:2],inplace=True)
total_rolling = oop.pivot_table('Expense Amount (rpt)','date',aggfunc=sum).rolling(3).mean()
std = total_df.values.std()

# plot
sns.set_style('white')
plt.subplots(figsize=(18,6))

g = sns.lineplot(total_df.index,total_df['Expense Amount (rpt)'],label='OOP Amount')
g1= sns.lineplot(total_rolling.index,total_rolling['Expense Amount (rpt)'],label='Moving Avr (3)')

plt.fill_between(total_df.index,
                 total_df['Expense Amount (rpt)'] - std,
                 total_df['Expense Amount (rpt)'] + std,
                 alpha=0.1,label='Std')
# format
g.set_title('Global Reimbursement Trend (add moving avr and standard deviation)',fontdict={'fontsize':16},pad=15)

g.get_yaxis().set_major_formatter(
matplotlib.ticker.FuncFormatter(lambda x, p: format(int(x), ',')))

plt.show()

Difference

  • The seasonal/cycle trend can also be seen in this figure (group by 2 or 3 periods)
  • The absolute value of fluctuations has a mean of 2.1w, and a large standard deviation of 2.8w.
In [5]:
# plot
plt.subplots(figsize=(18,6))

g = sns.barplot(total_df.reset_index()['date'].dt.date,total_df.diff()['Expense Amount (rpt)'].values)

g.set_title('Rolling Differences for Monthly Reimbursement Amount',fontdict={'fontsize':16},pad=15)
plt.setp(g.get_xticklabels(), rotation=35) 

g.get_yaxis().set_major_formatter(
matplotlib.ticker.FuncFormatter(lambda x, p: format(int(x), ',')))

sns.despine()
plt.show()
# print out some information
print('Mean fluctuation each month: ${:,.2f}'.format(total_df.diff().apply(abs).mean()[0]))
print('Standard deviation: ${:,.2f}'.format(total_df.diff().std()[0]))
Mean fluctuation each month: $108,872.02
Standard deviation: $142,719.84

Statistic Data

From the historical data, we can know:

  • Average/mean reimbursement amount is 11w USD per month.
  • Median reimbursement amount is 10w USD per month.
  • Thus, the distribution is right skewed (means that there are some large amounts)
  • Standard deviation of 2.5w USD.
  • 90% of data lies in the range of 7w ~ 14w.

Distribution of Total Reimbursement Amount Each Month

In [6]:
total_oop = oop.pivot_table('Expense Amount (rpt)','date',aggfunc=sum)
total_oop.drop(total_oop.index[0:2],inplace=True)

sns.distplot(oop.pivot_table('Expense Amount (rpt)','date',aggfunc=sum))

print(total_oop.describe(percentiles=[.05,.25,.5,.75,.95]).apply(lambda x: '{:,.0f}'.format(x.values[0]),axis=1))
count         21
mean     559,929
std      128,652
min      328,200
5%       360,565
25%      480,256
50%      535,286
75%      661,012
95%      741,603
max      814,021
dtype: object

Parent Expense Type

In this section, we break down total expense spending into different expense types. Firstly, let's see the amount of each expense type.

Overall Amount

Not surprisingly, transportation and travel expenses (should these two types be together?) are the largest parts in employee reimbursement. Following is the Meal & Entertainment, which seems like 20% of the total. Communication (phone and internets) and HR benefit take another part. Surprisingly to see that marketing expenses only take like 2% or less.

In [7]:
# df
oop2018 = oop[oop['Sent for Payment Date'].apply(lambda x: '2018' in x)]

parent_et_df = (oop2018.pivot_table('Expense Amount (rpt)','Parent Expense Type', aggfunc=sum)
                .sort_values('Expense Amount (rpt)',ascending=False))

# plot
plt.subplots(figsize=(18,6))
g = sns.barplot(parent_et_df.index,parent_et_df['Expense Amount (rpt)'],palette='Set2')

# format
plt.setp(g.get_xticklabels(), rotation=35) 
sns.despine()
g.set_title('2018 Out-Of-Pocket Spending by Categories',fontdict={'fontsize':18},pad=15)
g.get_yaxis().set_major_formatter(
matplotlib.ticker.FuncFormatter(lambda x, p: format(int(x), ',')))

# annotate
for ind, val in enumerate(parent_et_df.values):
    g.annotate('{:,.2f}'.format(val[0]),(ind-.3,val[0]+10000),size=14)
In [8]:
parent_et_df
Out[8]:
Expense Amount (rpt)
Parent Expense Type
02. Transportation 3018615.20
01. Travel Expenses 1600285.90
03. Meals & Entertainment 1385768.25
04. Communications 411944.05
06. HR: Benefits, Relocation/Ex-Pat and Recruting 336544.70
09. Other 210751.60
08. IT, Operations and office expenses 143792.50
07. Marketing 73617.90
05. Subscriptions and Fees 22314.15

Let's then take a look at the picture for different subsidiaries. Three colors catch me at the first glance, which are orange, green and blue. We see that different subs have different portion of that content.

Percentage of total

To see the different percentage of categories for each sub, I made a stacked bar plot and a heatmap to visualize it. In these charts, we can see it clearly and say that:

  • Most of the subs have their the travel/transportation expenses as the top one or two expense types (60% ~ 70% of their total spending)
  • AU employees really fascinated with meal & entertainment(M&E takes 40% of total), interesting...
  • JP employees travel a lot (travel and transportation takes 77% of reimbursement, highest percentage)
  • We also see differences in NZ, CA, SW, SA and SZ which might be caused by their small organization scale.
In [9]:
tem_df = oop2018.pivot_table('Expense Amount (rpt)','Parent Expense Type','Sub', aggfunc=sum)
# total column
tem_df['Total'] = tem_df.apply(lambda x: x.sum(),axis=1)
# sort this column
tem_df.sort_values('Total',ascending=False,inplace=True)
# total row
tem_df.loc['Total',:] = tem_df.apply(lambda x: x.sum(),axis=0)
# sort this row
tem_df.sort_values('Total',axis=1,ascending=False,inplace=True)
# drop
tem_df.drop('Total',axis=0,inplace=True)
tem_df.drop('Total',axis=1,inplace=True)
# transform it into percentile
for col in tem_df.columns:
    total = tem_df[col].sum()
    tem_df[col] = tem_df[col].apply(lambda x: round(x/total,2))
In [10]:
# plot
sns.set(font_scale = 1)
plt.subplots(figsize=(12,7))

ax = sns.heatmap(tem_df.fillna(0)*100,cmap='coolwarm',annot=True)

ax.set_title('Percentage Of Expense Type',fontdict={'fontsize':16},pad=15)

for t in ax.texts: t.set_text(t.get_text() + "%")
    
plt.show()

Changes over time

One of the most important thing to create a model and predict a future value is that, the data you are predicting needs to have either seasonal pattern or a predictable trend.

Line Chart for Amount

In [11]:
oop['date'] = oop.date.dt.date
In [12]:
test = oop.pivot_table('Expense Amount (rpt)','date','Parent Expense Type', aggfunc=sum)
for col in test.columns:
    test[col] = test[col].apply(lambda x: round(x,2))
In [13]:
def visible(col):
    ls = ['01', '02', '03', '04']
    for i in ls:
        if i in col:
            return True
    return 'legendonly'

traces = []
for col in test.columns:
    traces.append({'type' : 'scatter',
                   'x' : test.index,
                   'y' : test[col],
                   'name' : col,
                   'mode' : 'lines',
                   'visible' : visible(col)})
data = Data(traces)


layout = {'title' : 'Interesting Trend in 2018',
         'xaxis' : {'title' : 'Month','tickformat':'%b %y'},
         'yaxis' : {'title' : 'Out-of-Pocket','tickformat':'$,0'}}

fig = Figure(traces, layout = layout)
pyo.iplot(fig)
f:\users\airya\anaconda3\envs\py3.7\lib\site-packages\plotly\graph_objs\_deprecations.py:39: DeprecationWarning:

plotly.graph_objs.Data is deprecated.
Please replace it with a list or tuple of instances of the following types
  - plotly.graph_objs.Scatter
  - plotly.graph_objs.Bar
  - plotly.graph_objs.Area
  - plotly.graph_objs.Histogram
  - etc.


In the line chart above, we can see:

  • An increasing trend for Travel, Transportation and Meal.
  • Transportation and meal have a peak during Nov and Dec, this might be the reason why Nov and Dec have a high overall amount of spending. We need to dig deeper for these two kind of parent types to see what the peak consists of.

Stacked bar chart and Line chart for percentage

In [14]:
# df
tem_df = oop.pivot_table('Expense Amount (rpt)','date','Parent Expense Type', aggfunc=sum).transpose()

for col in tem_df.columns:
    total = tem_df[col].sum()
    tem_df[col] = tem_df[col].apply(lambda x: round(x/total,2))

    
tem_df2 = tem_df.transpose()
tem_df2.index = [str(i)[:7] for i in tem_df2.index]

# plot
sns.set(font_scale = 1.5)
sns.set_style('white')
ax = tem_df2.plot(kind='bar',stacked=True,figsize=(18,8),legend=False,cmap='Set2')

# legend
patches, labels = ax.get_legend_handles_labels()
ax.legend(patches, labels, loc=5)
ax.legend(bbox_to_anchor=(1.02, 1.0))

# format
ax.set_ylabel('Percentage per Month',fontdict={'fontsize':16})
sns.despine()

plt.show()
In [15]:
# trying to use plotly... but the cmap is not cool

# traces = []

# for col in tem_df2.columns:
#     traces.append({'type': 'bar',
#                    'x': tem_df2.index,
#                    'y': tem_df2[col],
#                    'opacity' : 0.7,
#                   'name':col})

# data = Data(traces)
# layout = {'title' : "Proportion of Parent Expense Types, Jun 2017 to April 2019",
#          'xaxis' : {'title' : 'Month', 'tickformat':'%b %y'},
#          'yaxis' : {'title' : 'Proportion of Month Total','tickformat': '.0%'},
#          'barmode' :'stack'}
# fig = Figure(data = data, layout = layout)
# pyo.iplot(fig)

We can see the most unstable category is Meal & Entertainment. The rest of them are all good.

Child Expense Type

From the analysis of Parent Expense Types, we know that the three expense types below have most of the spendings. Their trend over time also shows a sudden growth in the end of 2018, and we are interested in this trend and would like to dig deeper into that.

  • Travel Expenses
  • Transportation
  • Meal & Entertainment

Now, we plot the data on the image for above selected parent types and classify them according to sub categories.

Interesting Trend in 2018

In [16]:
# shorten the list of expense types
mapping_dic = {
#     'Groud': ['Taxi','Public Transport','Parking','Train','Personal Car Mileage','Fuel','Car Rental','Tolls/Road Charges'],
'Meal': ['Meal While Traveling (employee only)','zNOTUSEDLunch',
         'Meal out-of-office (client/partners)','zNOTUSEDDinner',
        'Team Meal Out-of-office (employee only)','Meal in-office (employee only)'],
'Entertainment': ['Entertainment Out-of-office (employee only)','Entertainment Out-of-office (with Client)']}

def my_func(x):
    for category, item in mapping_dic.items():
        for i in item:
            if i == x:
                return category
    return x

oop2018['Expense Type'] = oop2018['Expense Type'].map(my_func)
In [17]:
# define a problem dataframe, which contains 3 parent types only
problem_df = (oop2018[oop2018['Parent Expense Type']
                  .apply(lambda x: x in ('01. Travel Expenses','02. Transportation', '03. Meals & Entertainment'))])

# temporary dataframe to show: amount, date and expense types
tem_df = (problem_df
          .pivot_table('Expense Amount (rpt)','date', 'Expense Type', aggfunc=sum)
          )

# amount rounded to 2 decimal places
for col in tem_df.columns:
    tem_df[col] = tem_df[col].apply(lambda x: round(x,2))

# fill na    
tem_df.fillna(0, inplace=True)

# vertical sum and sort by that row
tem_df.loc['Total',:] = tem_df.sum(axis=0)
tem_df = tem_df.sort_values('Total',ascending=False,axis=1)
tem_df.drop('Total', inplace=True)

# create another df for ranking
sup_df = tem_df.transpose()

for col in sup_df.columns:
    sup_df[col] = sup_df[col].rank(ascending=False).astype(int)

sup_df = sup_df.transpose()

# change the column names
sup_df.columns = [col + '_rank' for col in sup_df.columns]

# append the original df
sup_df = pd.concat([tem_df,sup_df],axis=1)
In [18]:
# create text support for the chart
for col in tem_df:
    sup_df[col + '_text'] = (sup_df
                                 .apply(lambda x: "{date}<br><b>{Type}</b><br>Amount: ${amount:,.2f}<br>Ranking: {rank}"
                                        .format(date=pd.to_datetime(x.name).strftime('%b %Y'),Type=col,amount=x[col],rank=x[col + '_rank']),axis=1))
In [19]:
ls = ['Meal','Taxi', 'Hotel','Entertainment', 'Personal Car Mileage',]

def chooseColour(type):
    if type in ls:
        return 'Red'
    else:
        return 'Grey'

    
def width(type):
    if type in ls:
        return 2
    else:
        return 1

traces = []
for col in tem_df.columns:
    if col != 'date' and 'rank' not in col:
        traces.append({'type' : 'scatter',
                       'x' : tem_df.index,
                       'y' : tem_df[col],
                       'name' : col,
                       'text' : sup_df[col + '_text'],
                       'line' : {'width' : width(col)},
                       'marker' : {'color' : chooseColour(col)},
                       'mode' : 'lines',
                       'opacity' : 0.7,
                       'hoverinfo' : 'text'
                      })
        
data = Data(traces)


layout = {'title' : 'Interesting Trend in 2018',
         'xaxis' : {'title' : 'Month'},
         'yaxis' : {'title' : 'Out-of-Pocket','tickformat':'$,0'},
         'hovermode' : 'closest'}

fig = Figure(data = data, layout = layout)
pyo.iplot(fig)

In the first glance, we notice that:

  • There are a lot of sub types...
  • Some are low in total amount and have no trend at all. Those are marked GREY.
  • Some are high in amount and show no trend at the end of the year, which are also marked GREY.
  • Some are high in amount and show a trend of growth at the end of the year, which are marked RED

For those types marked RED, I filter them out and create a new plot below for a clearer view.

In [20]:
def visible(col):
    ls = ['Meal','Taxi', 'Hotel','Entertainment', 'Personal Car Mileage',]
    for i in ls:
        if i == col:
            return True
    return 'legendonly'

traces = []
for col in tem_df.columns:
    if col != 'date' and 'rank' not in col:
        traces.append({'type' : 'scatter',
                       'x' : tem_df.index,
                       'y' : tem_df[col],
                       'name' : col,
                       'mode' : 'lines',
                       'text' : sup_df[col + '_text'],
                       'hoverinfo' : 'text',
                       'visible' : visible(col)})
data = Data(traces)


layout = {'title' : 'Interesting Trend in 2018',
         'xaxis' : {'title' : 'Month','tickformat':'%b %y'},
         'yaxis' : {'title' : 'Out-of-Pocket','tickformat':'$,0'},
         'hovermode' : 'closest'}

fig = Figure(data = data, layout = layout)
pyo.iplot(fig)
In [21]:
# get top n expense types
ls_exp_types = (problem_df
           .groupby(['Expense Type'])['Expense Amount (rpt)']
           .sum()
         .sort_values(ascending=False)
         .index[0:7])

# group smaller expense types into 'other misc types'
tem_df = problem_df.copy()
tem_df['Expense Type'] = tem_df['Expense Type'].apply(lambda x: x if x in ls_exp_types else 'other misc types')

# get pivot table
tem_df1 = (tem_df
           .groupby(['Sent for Payment Date','Expense Type'])['Expense Amount (rpt)']
           .sum()
           .unstack())


# vertical sum and sort by that row
tem_df1.loc['Total',:] = tem_df1.sum(axis=0)
tem_df1 = tem_df1.sort_values('Total',ascending=False,axis=1)
tem_df1.drop('Total', inplace=True)


# round the numbers for formatting
for col in tem_df1.columns:
    tem_df1[col] = tem_df1[col].apply(lambda x: round(x,2))

# plot
traces = []

for col in tem_df1.columns:
    if col != 'Total':
        traces.append({'type' : 'bar',
                       'name' : col,
                       'x' : tem_df1.index,
                       'y' : tem_df1[col],
                       'opacity' : 0.7})

layout = {'title' : "Top 6 Child Expense Types in 2018",
         'xaxis' : {'title' : 'Employee','tickformat':'%b %y'},
         'yaxis' : {'title' : 'Out-of-Pocket','tickformat':'$,0'},
         'barmode' : 'stack'}

fig = {'data' : traces,
      'layout' : layout}
pyo.iplot(fig)
In [22]:
tem_df1.index = pd.to_datetime(tem_df1.reset_index()['Sent for Payment Date']).dt.date

tem_df1.sort_index().style.background_gradient()
Out[22]:
Expense Type Meal Hotel Taxi Airfare other misc types Entertainment Personal Car Mileage Train
Sent for Payment Date
2018-01-01 78131.6 33920.7 50176.6 70025 38354.2 37474.8 14727 16759
2018-02-01 69006.7 56566.4 56227.3 59627.4 51990.2 60035.1 18193.2 13364.1
2018-03-01 99580.2 69503.2 73773.6 67189 78988.6 78043.9 42890.5 23012.5
2018-04-01 84926.6 78182.3 68803.5 59650.9 50899.4 52621.4 20648.9 10686.2
2018-05-01 83319.7 68523.2 69339.4 57607.5 70090.6 41501.4 21365.2 12275
2018-06-01 92950.4 126812 61095.8 76228 60779 32680.7 37790 24181.4
2018-07-01 113791 70307.8 85845.4 85978.6 86378.1 48186.9 25839.2 30378.2
2018-08-01 90177.6 59875.1 81919.4 70738.1 47992.1 29387.7 35320.7 20687.8
2018-09-01 94648.5 84239.7 60918.8 71735.2 66586.9 30513 40375.4 10713.9
2018-10-01 138482 97055.1 92139.9 77128.4 75699.8 39487 47831.1 15108.1
2018-11-01 158394 102916 93184.2 83034.3 106133 62471.3 60161.4 14773.4
2018-12-01 150569 126272 116034 74582.1 81011.4 40112.8 61198.6 27836

Data mining for the interesting trend

We want to know why there is a sudden increase at the end of the year. In other words, we need the root cause of them. In order to know that, I want to see the things listed:

  1. Business purposes - purposes are not properly categorized. Even we have three expenses for the same target, their purposes can be described in different manners, but I still wanna try what we can do.
    • Firstly, I need to query data with business purposes available in the dataset
    • Secondly, I need to convert those purposes into a Word Cloud which can visualize the purposes.
    • Thridly, I wanna compare the year end Word Cloud with the whole year Word Cloud to see the differences.
  2. Employees - I assume the year end explosion of expenses are caused by some employees only.
    • Some hold their expenses till the year end
    • Some celebrate
    • Some travel a lot at the year end
    • Christmas event expenses are paid by specific employees
  3. Subsidiaries - which subsidiary has the most increase in the year end?

Business Purposes in End of 2018

Business Purposes for Out-Of-Pocket Expenses, Aug 2018 to Dec 2018

From the word cloud, we know that we have a large portion of expense entries in the end of 2018 spent for China GM Summit.

In [23]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6357 entries, 0 to 6356
Data columns (total 10 columns):
Sub                      6357 non-null object
Payment Type             6357 non-null object
Parent Expense Type      6357 non-null object
Expense Type             6357 non-null object
Sent for Payment Date    6357 non-null object
Expense Amount (rpt)     6357 non-null float64
Reporting Currency       6357 non-null object
Year                     6357 non-null object
Month                    6357 non-null object
date                     6357 non-null datetime64[ns]
dtypes: datetime64[ns](1), float64(1), object(8)
memory usage: 546.3+ KB
In [24]:
df = pd.read_excel('Source Data - Reimbursement EDA - copy.xlsx')

filter_for_text = ['January','February','March','April','May','June','July',
 'August','September','October','November','December','Dinner','Lunch','Breakfast','Expense Report','Clara Lim']

import numpy as np

def filter_function(x):
    if pd.isna(x):
        return np.nan
    else:
        for i in filter_for_text:
            if i.lower() not in x.lower():
                pass
            else:
                return np.nan
        return x

import sys
from os import path
from PIL import Image

currdir = path.curdir
mask = np.array(Image.open(path.join(currdir, "cloud.png")))
In [25]:
time_ls = ['2018-12','2018-11','2018-10','2018-9']
endof2018_df = df[df['Sent for Payment Date'].apply(lambda x: x in time_ls)][df['Payment Type'] == 'Out-of-Pocket']
endof2018_df['Expense Type'] = endof2018_df['Expense Type'].map(my_func)

endof2018_df['Report Name'] = endof2018_df['Report Name'].astype(str)
endof2018_df['Purpose'] = endof2018_df['Purpose'].astype(str)

from wordcloud import WordCloud

text = "  ".join(endof2018_df['Purpose'].apply(filter_function).dropna()) + \
"  ".join(endof2018_df['Report Name'].apply(filter_function).dropna())

plt.subplots(figsize=(20,20))
wordcloud = WordCloud(background_color='white',
#                       colormap='ocean',
                      mask=mask,
                      max_words=100).generate(text)
plt.imshow(wordcloud)
plt.axis('off')

plt.show()

Employee Spending by Categories in the End of 2018

In [26]:
ls_exp_types = (endof2018_df
           .groupby(['Expense Type'])['Expense Amount (rpt)']
           .sum()
         .sort_values(ascending=False)
         .index[0:5])
tem_df = endof2018_df.copy()
tem_df['Expense Type'] = tem_df['Expense Type'].apply(lambda x: x if x in ls_exp_types else 'other misc types')

tem_df1 = (tem_df
           .groupby(['Employee','Expense Type'])['Expense Amount (rpt)']
           .sum()
           .unstack())

tem_df1['Total'] = tem_df1.sum(axis=1)
tem_df1 = tem_df1.sort_values('Total',ascending=False).iloc[0:10,:]

for col in tem_df1.columns:
    tem_df1[col] = tem_df1[col].apply(lambda x: round(x,2))
In [27]:
traces = []

for col in tem_df1.columns:
    if col != 'Total':
        

        traces.append({'type' : 'bar',
                       'name' : col,
                       'x' : tem_df1.index,
                       'y' : tem_df1[col],
                       'opacity' : 0.7})

layout = {'title' : "Top 10 Employee Spending, Sep 2018 to Dec 2018",
         'xaxis' : {'title' : 'Employee','tickformat':'%b %y'},
         'yaxis' : {'title' : 'USD Amount','tickformat':'$,0'},
         'barmode' : 'stack'}

fig = {'data' : traces,
      'layout' : layout}
pyo.iplot(fig)

TODO:

  1. We can add budget to the analysis to get a clearer image of our spending if it's available
  2. Detailed EDA for each subsidiary
    • We just proved that the overall trend is predictable. The remaining job is to know the possibility to predict reimbursement for individual subsidiary, especially for those small subs.
    • For the large spending categories like Travel, Transportation and Meal, we can certainly drill down deeper and see what's really going on here.
  3. Actionable Plots for users to check data online
  4. Time series Modeling and Prediction

Modification needed

  1. All figure in accounting format (thousand separator, 2 decimal places)
  2. Use Year and Month, don't use Date
  3. Keep One chart only (or less charts) for each overall estimating analysis