This module calculates and plots a waterfall chart. The entire module was made by Pavel (Pak).
First, let's train a model to analyze.
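# fastai's tabular API provides untar_data, URLs, pd, TabularPandas, tabular_learner, etc.
from fastai.tabular.all import *
path = untar_data(URLs.ADULT_SAMPLE)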
df = pd.read_csv(path/'adult.csv')
dep_var = 'salary'
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [Categorify, FillMissing, Normalize]
splits = IndexSplitter(list(range(800,1000)))(range_of(df))
to = TabularPandas(df, procs, cat_names, cont_names, y_names="salary", splits=splits)
dls = to.dataloaders()
learn = tabular_learner(dls, layers=[200,100], metrics=accuracy)
learn.fit(1, 1e-2)
How this version calculates each column's contribution:
- Calculate the mean prediction over the whole dataset. This is the starting point from which the prediction for an individual row shifts.
- For every column, calculate the difference between this row's prediction and the mean prediction with that column shuffled (that is, how this particular column, given the values in the other columns, shifts the dep_var and in which direction).
- Assume that these differences can be applied as forces to the initial mean prediction.
- Plot these forces.
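The steps above can be sketched in plain Python. This is only one reading of the description, not the module's actual code: `toy_predict` is a hypothetical stand-in for a trained model, `column_forces` is an illustrative name, and "shuffling" a column is approximated by substituting every value that column takes in the dataset into the analyzed row while the row's other values stay fixed.

```python
from statistics import mean

def toy_predict(row):
    # Hypothetical stand-in for a trained model: a fixed linear function.
    return 2.0 * row["age"] + 5.0 * row["education-num"]

def column_forces(rows, sample_row, columns, predict):
    """Return (base, forces): the dataset mean prediction and one force per column."""
    # Step 1: mean prediction over the whole dataset (the starting point).
    base = mean(predict(r) for r in rows)
    row_pred = predict(sample_row)
    forces = {}
    for col in columns:
        # Step 2: keep the sample row's other values, vary only this column
        # over the values observed in the dataset, and average the predictions.
        preds = [predict({**sample_row, col: r[col]}) for r in rows]
        forces[col] = row_pred - mean(preds)
    return base, forces

rows = [
    {"age": 20, "education-num": 8},
    {"age": 40, "education-num": 12},
    {"age": 60, "education-num": 16},
]
sample_row = rows[2]
base, forces = column_forces(rows, sample_row, ["age", "education-num"], toy_predict)
print(base, forces)
```

For this additive toy model, `base + sum(forces.values())` equals the row's own prediction. For a real network with feature interactions that equality is not guaranteed, which is why the third step is stated as an assumption.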
This class lets you calculate and plot a waterfall graph. It can also be useful for determining and visualizing the best value of a particular feature for a given row of data. It calculates all the parameters needed to plot the waterfall graph for a `sampl_row`.
- `fields`: list of lists of columns to analyze; connected columns should be in the same list element (as a list)
- `sampl_row`: the row that is analyzed
- `max_row_used`: how many rows to use for the calculation, `len(df)` by default. Can be an absolute value or a coefficient (a fraction of `len(df)`). On big datasets it can easily be set to a lower value, as that is still enough data for calculating differences; 10k rows is often enough
- `num_tests`: used to reduce memory consumption. Each run uses `max_row_used/num_tests` rows, so the larger `num_tests` is, the lower the memory consumption
- `use_log=True`: needed if the dependent variable was log-transformed
- `use_int=True`: needed if the de-transformed (exponentiated) variable should be an integer rather than a float
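To make the row-count and de-transform parameters concrete, here is a minimal sketch of how they could be interpreted. All three helper names are hypothetical, and two details are assumptions not stated above: a float no larger than 1 is treated as a coefficient, and `use_int` rounds to the nearest integer.

```python
import math

def resolve_max_rows(max_row_used, n_rows):
    """Turn max_row_used into an absolute row count (assumed convention)."""
    if max_row_used is None:
        return n_rows                      # default: use the whole dataset
    if isinstance(max_row_used, float) and 0 < max_row_used <= 1:
        return max(1, round(n_rows * max_row_used))  # coefficient of len(df)
    return min(int(max_row_used), n_rows)  # absolute value, capped at len(df)

def chunk_bounds(n_used, num_tests):
    """Split n_used rows into num_tests runs of roughly n_used/num_tests rows."""
    size = math.ceil(n_used / num_tests)
    return [(start, min(start + size, n_used)) for start in range(0, n_used, size)]

def detransform(pred, use_log=False, use_int=False):
    """Undo a log transform of the dependent variable and optionally round."""
    value = math.exp(pred) if use_log else pred
    return round(value) if use_int else value

n_used = resolve_max_rows(0.3, 1000)
bounds = chunk_bounds(10, 3)
restored = detransform(math.log(50000.0), use_log=True, use_int=True)
print(n_used, bounds, restored)
```

Smaller chunks mean fewer rows are held in memory per run, at the cost of more runs.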
fields = cat_names+cont_names
cur_item = df.iloc[10]
cur_item
fields = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'age', 'fnlwgt', 'education-num']
cat_names+cont_names
wf = InterpretWaterfall(learn=learn, df=df, fields=fields,
                        sampl_row=cur_item, max_row_used=0.3)
wf.get_forces()
wf.plot_forces()
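A waterfall chart stacks each force on top of the running total, starting from the mean prediction. The sketch below shows only that bookkeeping, with made-up numbers; `waterfall_steps` is a hypothetical helper and does not reproduce what `plot_forces` does internally.

```python
def waterfall_steps(base, forces):
    """Return ([(name, start, end), ...], final_level) for a waterfall chart."""
    steps = []
    level = base
    for name, force in forces.items():
        # Each bar starts where the previous one ended.
        steps.append((name, level, level + force))
        level += force
    return steps, level

steps, final_level = waterfall_steps(140.0, {"age": 40.0, "education-num": 20.0})
for name, start, end in steps:
    print(f"{name}: {start} -> {end}")
print("final:", final_level)
```

Each `(start, end)` pair corresponds to one floating bar; the final level is where the chart ends for the analyzed row.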
Let's see how different ages affect the prediction for this particular row
wf.plot_variants(fields=['age'])
wf.get_variants_pd(fields=['age'])
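The idea of comparing variants of a single column can be illustrated without the library: replace the column's value in the analyzed row with each value seen in the data and record the prediction. This is a hedged sketch; `toy_predict` and `column_variants` are hypothetical names, and the real methods may select and order values differently.

```python
def toy_predict(row):
    # Hypothetical model whose prediction peaks at age 45.
    return 100.0 - (row["age"] - 45) ** 2 / 10.0

def column_variants(rows, sample_row, col, predict):
    """Return [(value, prediction)] for every distinct value of col in rows."""
    values = sorted({r[col] for r in rows})
    # Only col changes; every other value of the sample row stays as it is.
    return [(v, predict({**sample_row, col: v})) for v in values]

rows = [{"age": 25}, {"age": 35}, {"age": 45}, {"age": 55}]
sample_row = {"age": 25}
variants = column_variants(rows, sample_row, "age", toy_predict)
best_value, best_pred = max(variants, key=lambda pair: pair[1])
print(variants)
print("best age for this row:", best_value)
```

Sorting the result by prediction is one way to find the "best" value of a feature for a given row.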
`education` and `education-num` are 100% correlated features, so we should group them
fields = ['workclass', ['education', 'education-num'], 'marital-status', 'occupation', 'relationship', 'race', 'age', 'fnlwgt']
wf = InterpretWaterfall(learn=learn, df=df, fields=fields,
                        sampl_row=cur_item, max_row_used=0.3)
wf.get_forces()
wf.plot_forces()
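Grouping matters because varying one of two perfectly correlated columns on its own would feed the model combinations that never occur in the data. A minimal sketch of treating a nested list as one unit follows; `group_forces` and `toy_predict` are hypothetical, and the substitution scheme is an assumption about the approach rather than the module's code.

```python
from statistics import mean

def toy_predict(row):
    # Hypothetical model that only looks at the numeric columns.
    return 1.0 * row["age"] + 5.0 * row["education-num"]

def group_forces(rows, sample_row, fields, predict):
    """One force per field; a nested list of columns is varied together."""
    row_pred = predict(sample_row)
    forces = {}
    for field in fields:
        cols = list(field) if isinstance(field, (list, tuple)) else [field]
        preds = []
        for r in rows:
            replaced = dict(sample_row)
            # Copy all grouped columns from the same source row, so that
            # education and education-num always stay consistent.
            for c in cols:
                replaced[c] = r[c]
            preds.append(predict(replaced))
        forces[", ".join(cols)] = row_pred - mean(preds)
    return forces

rows = [
    {"age": 30, "education": "HS-grad", "education-num": 9},
    {"age": 40, "education": "Bachelors", "education-num": 13},
    {"age": 50, "education": "Masters", "education-num": 14},
]
forces = group_forces(rows, rows[1], [["education", "education-num"], "age"], toy_predict)
print(forces)
```

The grouped pair yields a single force, so the chart shows one bar for education instead of two bars that split the same effect.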
Methods exposed:
- `plot_forces`: plots the waterfall graph calculated in initialization
- `get_forces`: outputs all the forces for a given row as an ordered dict
- `plot_variants`: plots a graph of the different variants of values of a particular column