This module calculates and plots waterfall charts. The entire module was written by Pavel (Pak).
First, let's train a model to analyze.
from fastai.tabular.all import *
path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')
dep_var = 'salary'
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [Categorify, FillMissing, Normalize]
splits = IndexSplitter(list(range(800,1000)))(range_of(df))
to = TabularPandas(df, procs, cat_names, cont_names, y_names="salary", splits=splits)
dls = to.dataloaders()
learn = tabular_learner(dls, layers=[200,100], metrics=accuracy)
learn.fit(1, 1e-2)
How this version calculates each column's contribution:
- Calculate the mean prediction over the whole dataset. This is the starting point from which the prediction for an individual row is shifted.
- For every column, calculate the difference between this row's prediction and the mean prediction with that column shuffled. This shows how much, and in which direction, this column shifts the dep_var given the values in the row's other columns.
- Assume that these differences can be applied as forces on top of the initial mean prediction.
- Plot these forces.
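The steps above can be sketched on a toy example. This is a minimal illustration using only the standard library, not the module's actual implementation: `toy_predict`, `column_forces`, and the sample rows are hypothetical stand-ins for the trained model and the dataframe.

```python
import random
from statistics import mean

def toy_predict(row):
    # Hypothetical stand-in for a trained model: a fixed linear score.
    return 2.0 * row["age"] + 0.5 * row["hours"] - 1.0 * row["debt"]

def column_forces(rows, sample_row, columns, predict, seed=0):
    """Return (base, forces): forces[col] is the sample row's prediction
    minus the mean prediction with `col` replaced by shuffled dataset values."""
    rng = random.Random(seed)
    base = mean(predict(r) for r in rows)      # step 1: mean prediction
    row_pred = predict(sample_row)
    forces = {}
    for col in columns:
        shuffled = [r[col] for r in rows]
        rng.shuffle(shuffled)
        # step 2: keep the row's other columns fixed, swap in shuffled values
        preds = [predict({**sample_row, col: v}) for v in shuffled]
        forces[col] = row_pred - mean(preds)
    return base, forces

rows = [
    {"age": 25, "hours": 40, "debt": 3},
    {"age": 35, "hours": 45, "debt": 1},
    {"age": 45, "hours": 30, "debt": 6},
    {"age": 55, "hours": 50, "debt": 2},
]
sample = rows[1]
base, forces = column_forces(rows, sample, ["age", "hours", "debt"], toy_predict)
```

For this linear toy model the forces add up exactly to the row's prediction (`base + sum(forces.values())` equals `toy_predict(sample)`). For a real neural network they generally will not, which is why the third step is stated as an assumption.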
This class allows you to calculate and plot a waterfall graph. It can also be useful for determining and visualizing the best value of a particular feature for a given row of data. It calculates all the parameters needed to plot a waterfall graph for a `sampl_row`.
fields
list of columns to analyze; connected columns should be grouped in the same list element (as a nested list)
sampl_row
row that is analyzed
max_row_used
how many rows to use for the calculation; len(df) by default. Can be an absolute value or a coefficient (a fraction of len(df)). On big datasets it can safely be set lower, since a subset still gives enough data for calculating the differences; 10k rows is often enough
num_tests
is used to reduce memory consumption: each run uses `max_row_used/num_tests` rows, so the higher `num_tests` is, the lower the memory consumption
use_log=True
is needed if the dependent variable was log-transformed
use_int=True
is needed if we want the back-transformed (exponentiated) variable to be an integer rather than a float
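To make the row-budget and transform parameters concrete, here is a small sketch of how they could be interpreted. The helper names (`resolve_row_count`, `split_into_runs`, `detransform`) are hypothetical and are not part of the class. Treating any float no greater than 1 as a fraction, and rounding rather than truncating for `use_int`, are both assumptions.

```python
import math

def resolve_row_count(max_row_used, n_rows):
    """Assumption: a float <= 1 is a fraction of n_rows, anything else an absolute count."""
    if max_row_used is None:
        return n_rows                              # default: use every row
    if isinstance(max_row_used, float) and max_row_used <= 1:
        return round(n_rows * max_row_used)
    return min(int(max_row_used), n_rows)

def split_into_runs(n_used, num_tests):
    """Split range(n_used) into about num_tests (start, stop) slices,
    so each run only touches roughly n_used / num_tests rows."""
    size = math.ceil(n_used / num_tests)
    return [(s, min(s + size, n_used)) for s in range(0, n_used, size)]

def detransform(pred, use_log=True, use_int=True):
    """Undo a log transform of the dependent variable, optionally as an integer."""
    value = math.exp(pred) if use_log else pred
    return int(round(value)) if use_int else value  # rounding is an assumption

n_used = resolve_row_count(0.3, 1000)   # 30% of 1000 rows
runs = split_into_runs(n_used, 3)       # three memory-friendly runs
```

Smaller slices mean fewer rows are held in memory per prediction pass, at the cost of more passes.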
fields = cat_names+cont_names
cur_item = df.iloc[10]
cur_item
fields = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'age', 'fnlwgt', 'education-num']
cat_names+cont_names
wf = InterpretWaterfall(learn=learn, df=df, fields=fields,
                        sampl_row=cur_item, max_row_used=0.3)
wf.get_forces()
wf.plot_forces()
Let's see how different ages affect the prediction for this particular row.
wf.plot_variants(fields=['age'])
wf.get_variants_pd(fields=['age'])
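One plausible way to think about a variants analysis is to substitute each candidate value of a column into the row and compare the resulting predictions. The sketch below illustrates that idea only; `toy_predict` and `variants` are hypothetical names, and the actual internals of `plot_variants` and `get_variants_pd` may differ.

```python
def toy_predict(row):
    # Hypothetical stand-in model: the score peaks at age 45.
    return 50.0 - 0.05 * (row["age"] - 45) ** 2 + 2.0 * row["education-num"]

def variants(sample_row, column, candidates, predict):
    """Predict for sample_row with `column` replaced by each candidate value."""
    return {v: predict({**sample_row, column: v}) for v in candidates}

sample = {"age": 30, "education-num": 10}
by_age = variants(sample, "age", [25, 35, 45, 55], toy_predict)
best_age = max(by_age, key=by_age.get)   # candidate with the highest prediction
```

Here the row's other values stay fixed, so the comparison isolates the effect of `age` for this specific row rather than for the dataset as a whole.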
`education` and `education-num` are 100% correlated features, so we should definitely group them.
fields = ['workclass', ['education', 'education-num'], 'marital-status', 'occupation', 'relationship', 'race', 'age', 'fnlwgt']
wf = InterpretWaterfall(learn=learn, df=df, fields=fields,
                        sampl_row=cur_item, max_row_used=0.3)
wf.get_forces()
wf.plot_forces()
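A short sketch of why grouping connected columns matters: if they are shuffled independently, the model sees combinations that never occur in the data (for example `Masters` paired with the number for `HS-grad`). Shuffling the group as a unit keeps every pair valid. `shuffled_group_values` is a hypothetical helper for illustration, not the class's own code.

```python
import random

def shuffled_group_values(rows, group, seed=0):
    """Shuffle the columns in `group` together: one permutation of value tuples."""
    rng = random.Random(seed)
    values = [tuple(r[c] for c in group) for r in rows]
    rng.shuffle(values)
    return values

rows = [
    {"education": "HS-grad", "education-num": 9},
    {"education": "Bachelors", "education-num": 13},
    {"education": "Masters", "education-num": 14},
]
valid_pairs = {(r["education"], r["education-num"]) for r in rows}
shuffled = shuffled_group_values(rows, ["education", "education-num"])
# Every shuffled pair is still a combination that exists in the data.
all_valid = all(pair in valid_pairs for pair in shuffled)
```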
Methods exposed:
- `plot_forces`: plots the waterfall graph calculated at initialization
- `get_forces`: returns all the forces for a given row as an ordered dict
- `plot_variants`: plots a graph of the different variants of values of a particular column
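As an illustration of how an ordered mapping of forces turns into a waterfall layout, the sketch below stacks each force on the running total, starting from the mean prediction. It assumes the forces are a mapping from column name to a signed float; the exact structure returned by `get_forces` may differ. `waterfall_steps` and the numbers are hypothetical.

```python
from collections import OrderedDict

def waterfall_steps(base, forces):
    """Return ([(label, start, end), ...], final_level): each bar starts
    where the previous bar ended."""
    steps, level = [], base
    for name, force in forces.items():
        steps.append((name, level, level + force))
        level += force
    return steps, level

forces = OrderedDict([("age", -10.0), ("hours", 1.875), ("debt", 2.0)])
steps, final = waterfall_steps(97.625, forces)
```

The `(start, end)` pairs are what a bar chart needs to draw floating bars; the order of the dict determines the left-to-right order of the bars.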