Methods for the tabular models, including data preparation and model prediction

As an example, we'll use the first five rows of the ADULT_SAMPLE dataset, which I have converted to a NumPy array below:

import pandas as pd
df = pd.read_csv('/home/ml1/.fastai/data/adult_sample/adult.csv')
df = df.head().to_numpy()

For procs we will use the same ones exported after training the model:

  • Note: we have to re-insert np.nan as a key in each Categorize mapping for lookups to work properly, since we can't pickle np.nan. This is done automatically later
import pickle
import numpy as np

with open('procs.pkl', 'rb') as handle:
    procs = pickle.load(handle)
    for proc in procs['Categorize']:
        procs['Categorize'][proc][np.nan] = 0 # restore np.nan as a key, since we can't pickle it
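
For reference, here is a purely hypothetical illustration of the structure the examples below assume procs has after the fix above; every key and value shown is made up:

import numpy as np

# Hypothetical structure only; the real exported dictionary may differ
procs = {
    'FillMissing': {4: 10.0},                        # cont column index -> training fill value
    'Categorize':  {1: {np.nan: 0, ' Private': 5}},  # cat column index -> {raw value: code}
    'Normalize':   {0: (38.58, 13.64)},              # cont column index -> (mean, std)
}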

FillMissing[source]

FillMissing(arr, procs)

Fills in missing data in conts and potentially generates a new categorical column

arr is expected to be a NumPy array, while procs should be the pre-processing dictionary exported after training

df[0]
array([49, ' Private', 101320, ' Assoc-acdm', 12.0, ' Married-civ-spouse',
       nan, ' Wife', ' White', ' Female', 0, 1902, 40, ' United-States',
       '>=50k'], dtype=object)
df = FillMissing(df, procs)
df[0]
array([49, ' Private', 101320, ' Assoc-acdm', 12.0, ' Married-civ-spouse',
       nan, ' Wife', ' White', ' Female', 0, 1902, 40, ' United-States',
       '>=50k', True, True, True], dtype=object)

Three boolean columns were appended at the end, one for each continuous variable that could contain missing values (True means the value was present)
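
For intuition, here is a minimal sketch of the idea, assuming procs['FillMissing'] maps continuous-column indices to the fill values saved at training time; the function name and structure are hypothetical, not the library's implementation:

import numpy as np

def fill_missing_sketch(arr, procs):
    "Fill nan in continuous columns; append one bool column per filled column"
    for idx, fill_val in procs['FillMissing'].items():
        col = arr[:, idx].astype(float)
        was_present = ~np.isnan(col)            # True where a value already existed
        col[~was_present] = fill_val            # replace missing entries with the training fill
        arr[:, idx] = col
        arr = np.concatenate([arr, was_present[:, None]], axis=1)
    return arr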

Categorize[source]

Categorize(arr, procs)

Encodes categorical data in arr based on procs

arr is expected to be a NumPy array, while procs should be the pre-processing dictionary exported after training

df[0]
array([49, ' Private', 101320, ' Assoc-acdm', 12.0, ' Married-civ-spouse',
       nan, ' Wife', ' White', ' Female', 0, 1902, 40, ' United-States',
       '>=50k', True, True, True], dtype=object)
df = Categorize(df, procs)
df[0]
array([49, 5, 101320, 8, 12.0, 3, 0, 6, 5, ' Female', 0, 1902, 40,
       ' United-States', '>=50k', 2, True, True], dtype=object)

Our categorical variables have all been converted to integers. Any columns left as strings are not used by the model and are ignored at inference time.
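
A minimal sketch of the encoding step, assuming procs['Categorize'] maps each categorical column index to a {raw value: integer code} dictionary, as the np.nan fix above suggests; the name is hypothetical:

def categorize_sketch(arr, procs):
    "Replace raw category values with the integer codes learned at training time"
    for idx, mapping in procs['Categorize'].items():
        arr[:, idx] = [mapping.get(val, 0) for val in arr[:, idx]]  # unseen values fall back to 0
    return arr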

Normalize[source]

Normalize(arr, procs)

Normalizes continuous data in arr based on procs

arr is expected to be a NumPy array, while procs should be the pre-processing dictionary exported after training

df[0]
array([49, 5, 101320, 8, 12.0, 3, 0, 6, 5, ' Female', 0, 1902, 40,
       ' United-States', '>=50k', 2, True, True], dtype=object)
df = Normalize(df, procs)
df[0]
array([0.7634343827572744, 5, 101320, 8, 12.0, 3, 0, 6, 5, ' Female', 0,
       1902, 40, ' United-States', '>=50k', 2, True, True], dtype=object)

Our continuous variables have now been normalized for the model
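
A minimal sketch, assuming procs['Normalize'] stores a (mean, std) pair per continuous column; again a hypothetical structure:

def normalize_sketch(arr, procs):
    "Scale continuous columns with the statistics saved at training time"
    for idx, (mean, std) in procs['Normalize'].items():
        arr[:, idx] = (arr[:, idx].astype(float) - mean) / std
    return arr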

apply_procs[source]

apply_procs(arr, procs)

Apply test-time pre-processing on NumPy array input

The pre-processing steps must run in a fixed order: FillMissing comes before Categorize, since FillMissing can add one or more is_na columns that Categorize then needs to encode

df = pd.read_csv('/home/ml1/.fastai/data/adult_sample/adult.csv')
df = df.head().to_numpy()
df[0]
array([49, ' Private', 101320, ' Assoc-acdm', 12.0, ' Married-civ-spouse',
       nan, ' Wife', ' White', ' Female', 0, 1902, 40, ' United-States',
       '>=50k'], dtype=object)
df = apply_procs(df, procs)
df[0]
array([0.7634343827572744, 5, 101320, 8, 12.0, 3, 0, 6, 5, ' Female', 0,
       1902, 40, ' United-States', '>=50k', 2, True, True], dtype=object)
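
Reusing the sketch helpers above, apply_procs plausibly amounts to chaining the three steps in the required order; a sketch, not the library's code:

def apply_procs_sketch(arr, procs):
    "FillMissing first, so Categorize can encode any freshly added is_na columns"
    arr = fill_missing_sketch(arr, procs)
    arr = categorize_sketch(arr, procs)
    arr = normalize_sketch(arr, procs)
    return arr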

class TabularDataset[source]

TabularDataset(arr, procs, bs=64, device='cuda')

A tabular PyTorch dataset based on procs with batch size bs on device

TabularDataset.__init__[source]

TabularDataset.__init__(arr, procs, bs=64, device='cuda')

Stores the array, grabs the indices for cats and conts, and generates batches

TabularDataset.make_batches[source]

TabularDataset.make_batches()

Splits the data into equal-sized batches, excluding the final partial batch
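
A sketch of one plausible batching scheme; the 5-row example below still comes through as a single smaller batch, which suggests leftover rows are kept as a final partial batch of their own. Names here are hypothetical:

def make_batches_sketch(arr, bs=64):
    "Full batches of bs rows, plus any leftover rows as a smaller final batch"
    n_full = len(arr) // bs
    batches = [arr[i*bs:(i+1)*bs] for i in range(n_full)]
    if len(arr) % bs:
        batches.append(arr[n_full*bs:])  # the partial batch the 5-row example relies on
    return batches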

df = pd.read_csv('/home/ml1/.fastai/data/adult_sample/adult.csv')
df = df.head().to_numpy()
df = apply_procs(df, procs)
dset = TabularDataset(df, procs)
dset[0]
[tensor([[ 5,  8,  3,  0,  6,  5,  2],
         [ 5, 13,  1,  5,  2,  5,  2],
         [ 5, 12,  1,  0,  5,  3,  1],
         [ 6, 15,  3, 11,  1,  2,  2],
         [ 7,  6,  3,  9,  6,  3,  1]], device='cuda:0'),
 tensor([[ 7.6343e-01,  1.0132e+05,  1.2000e+01],
         [ 3.9687e-01,  2.3675e+05,  1.4000e+01],
         [-4.3010e-02,  9.6185e+04,  1.0000e+01],
         [-4.3010e-02,  1.1285e+05,  1.5000e+01],
         [ 2.5024e-01,  8.2297e+04,  1.0000e+01]], device='cuda:0')]
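
For intuition about the output above: each batch gets split into an integer tensor of categorical codes and a float tensor of continuous values. A hedged sketch, assuming the cat and cont column indices are recoverable from procs; the helper name is hypothetical:

import torch

def to_tensors_sketch(batch, cat_idxs, cont_idxs, device='cuda'):
    "Split a pre-processed batch into [cats, conts] tensors on the target device"
    cats = torch.tensor(batch[:, cat_idxs].astype('int64'), device=device)
    conts = torch.tensor(batch[:, cont_idxs].astype('float32'), device=device)
    return [cats, conts]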

class tabular_learner[source]

tabular_learner(data_fn, model_fn)

A Learner-like wrapper for tabular data

tabular_learner.__init__[source]

tabular_learner.__init__(data_fn, model_fn)

Accepts a data_fn and a model_fn corresponding to the named pickle exports

learn = tabular_learner('procs.pkl', 'model.pkl')
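
A minimal sketch of what the constructor plausibly does, assuming model.pkl deserializes directly to a PyTorch module; the class name is hypothetical and the real loading logic may differ:

import pickle

class TabularLearnerSketch:
    def __init__(self, data_fn, model_fn):
        with open(data_fn, 'rb') as f:
            self.procs = pickle.load(f)  # the pre-processing dictionary
        with open(model_fn, 'rb') as f:
            self.model = pickle.load(f)  # the trained PyTorch model
        self.model.eval()                # inference mode for dropout/batchnorm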

tabular_learner.test_dl[source]

tabular_learner.test_dl(test_items, bs=64)

Applies procs to test_items

df = pd.read_csv('/home/ml1/.fastai/data/adult_sample/adult.csv')
dl = learn.test_dl(df.iloc[:5].to_numpy())
dl[0]
[tensor([[ 5,  8,  3,  0,  6,  5,  2],
         [ 5, 13,  1,  5,  2,  5,  2],
         [ 5, 12,  1,  0,  5,  3,  1],
         [ 6, 15,  3, 11,  1,  2,  2],
         [ 7,  6,  3,  9,  6,  3,  1]], device='cuda:0'),
 tensor([[ 7.6343e-01,  1.0132e+05,  1.2000e+01],
         [ 3.9687e-01,  2.3675e+05,  1.4000e+01],
         [-4.3010e-02,  9.6185e+04,  1.0000e+01],
         [-4.3010e-02,  1.1285e+05,  1.5000e+01],
         [ 2.5024e-01,  8.2297e+04,  1.0000e+01]], device='cuda:0')]
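
test_dl plausibly just composes the pieces already shown; a sketch using this document's apply_procs and TabularDataset:

def test_dl_sketch(learn, test_items, bs=64):
    "Pre-process raw rows and wrap them in model-ready batches"
    arr = apply_procs(test_items, learn.procs)
    return TabularDataset(arr, learn.procs, bs=bs)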

tabular_learner.predict[source]

tabular_learner.predict(inps)

Predict on a single batch of tensors

learn.predict(dl[0])
['<50k', '<50k', '<50k', '<50k', '<50k']
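
A hedged sketch of the prediction step, assuming a fastai-style forward pass model(cats, conts) and a made-up class vocab; both are assumptions, not the library's API:

import torch

def predict_sketch(learn, inps, vocab=('<50k', '>=50k')):
    "Run one batch through the model and decode argmax indices to labels"
    cats, conts = inps
    with torch.no_grad():               # no gradients needed at inference
        out = learn.model(cats, conts)  # assumed forward signature
    return [vocab[i] for i in out.argmax(dim=1).tolist()]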

tabular_learner.get_preds[source]

tabular_learner.get_preds(dl=None)

Predict on multiple batches of data in dl

learn.get_preds(dl=dl)
['<50k', '<50k', '<50k', '<50k', '<50k']
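
And get_preds presumably just loops the single-batch prediction over every batch; a sketch reusing the hypothetical helper above:

def get_preds_sketch(learn, dl):
    "Collect decoded predictions across all batches"
    preds = []
    for batch in dl:  # relies on the dataset being iterable/indexable
        preds += predict_sketch(learn, batch)
    return preds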