Methods for the tabular models, including data preparation and model prediction

For an example we'll use the first five rows of the ADULT_SAMPLE dataset, which I have converted to a NumPy array below:

import pandas as pd
df = pd.read_csv('/home/ml1/.fastai/data/adult_sample/adult.csv')
df = df.head().to_numpy()

For procs we will use the same ones from training a model:

  • Note: after loading, np.nan has to be re-added as a key in each Categorize mapping so that lookups work properly. This is done automatically later
import pickle
import numpy as np
with open('procs.pkl', 'rb') as handle:
    procs = pickle.load(handle)
    for proc in procs['Categorize']:
        procs['Categorize'][proc][np.nan] = 0 # re-add np.nan as a key, since it cannot be recovered from the pickle
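
Why the key has to be re-added can be seen in a small, self-contained sketch. The dictionary here is a stand-in, not the real contents of procs. NaN never compares equal to itself, so a dictionary lookup with np.nan only succeeds when the stored key is the very same object, and unpickling creates a new NaN object:

```python
import pickle
import numpy as np

# Stand-in for one Categorize mapping; the real procs contents are not shown here
vocab = {' Private': 5, np.nan: 0}
found_before = np.nan in vocab            # True: the key is the np.nan object itself

# Round-trip through pickle, as happens when procs are exported and reloaded
restored = pickle.loads(pickle.dumps(vocab))
found_after_load = np.nan in restored     # False: the restored NaN key is a different object

# Re-adding the key, as in the loop above, makes the lookup work again
restored[np.nan] = 0
found_after_fix = np.nan in restored      # True
```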


FillMissing(arr, procs)

Fills in missing data in conts and potentially generates a new categorical column

arr is expected to be a NumPy array, while procs should be the pre-processing dictionary exported after training

array([49, ' Private', 101320, ' Assoc-acdm', 12.0, ' Married-civ-spouse',
       nan, ' Wife', ' White', ' Female', 0, 1902, 40, ' United-States',
       '>=50k'], dtype=object)
df = FillMissing(df, procs)
array([49, ' Private', 101320, ' Assoc-acdm', 12.0, ' Married-civ-spouse',
       nan, ' Wife', ' White', ' Female', 0, 1902, 40, ' United-States',
       '>=50k', True, True, True], dtype=object)

Three boolean columns were added at the end, one for each continuous column that could contain missing values (True means the value exists).
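
As a rough illustration of the idea only, and not the library's actual FillMissing, the step can be sketched as follows. The function name fill_missing_sketch and the fill_vals layout (column index mapped to fill value) are assumptions made for this example; the real procs dictionary may be organised differently:

```python
import numpy as np

def is_nan(value):
    # Only floats can be NaN; strings and ints are always "present"
    return isinstance(value, float) and np.isnan(value)

def fill_missing_sketch(arr, fill_vals):
    """Fill NaNs in the given columns and append one bool column per filled column.

    fill_vals is a hypothetical {column index: fill value} mapping.
    """
    out = np.asarray(arr, dtype=object).copy()
    flags = []
    for col, fill in fill_vals.items():
        present = np.array([not is_nan(v) for v in out[:, col]])
        out[~present, col] = fill          # replace missing entries
        flags.append(present)              # True means the value exists
    flags = np.stack(flags, axis=1).astype(object)
    return np.concatenate([out, flags], axis=1)

rows = np.array([[49, ' Private', 12.0],
                 [38, ' Private', np.nan]], dtype=object)
filled = fill_missing_sketch(rows, {2: 10.0})
```

Each row gains one trailing flag, and the missing value in the second row is replaced by the fill value.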


Categorize(arr, procs)

Encodes categorical data in arr based on procs

arr is expected to be a NumPy array, while procs should be the pre-processing dictionary exported after training

array([49, ' Private', 101320, ' Assoc-acdm', 12.0, ' Married-civ-spouse',
       nan, ' Wife', ' White', ' Female', 0, 1902, 40, ' United-States',
       '>=50k', True, True, True], dtype=object)
df = Categorize(df, procs)
array([49, 5, 101320, 8, 12.0, 3, 0, 6, 5, ' Female', 0, 1902, 40,
       ' United-States', '>=50k', 2, True, True], dtype=object)

Our categorical variables are now all converted to integers. Any left as strings are not used by the model and are ignored at inference time.
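
A minimal sketch of this kind of encoding is shown below. It is an illustration under stated assumptions rather than the library's Categorize: the name categorize_sketch, the vocabs layout (column index mapped to a category-to-integer dictionary), the code 7 for ' State-gov', and the choice to send NaN and unseen categories to 0 are all hypothetical.

```python
import numpy as np

def encode_value(value, vocab):
    # NaN and categories not in the vocabulary both map to 0 (an assumption)
    if isinstance(value, float) and np.isnan(value):
        return 0
    return vocab.get(value, 0)

def categorize_sketch(arr, vocabs):
    """Replace categorical values with integer codes.

    vocabs is a hypothetical {column index: {category: code}} mapping.
    """
    out = np.asarray(arr, dtype=object).copy()
    for col, vocab in vocabs.items():
        out[:, col] = [encode_value(v, vocab) for v in out[:, col]]
    return out

rows = np.array([[49, ' Private'],
                 [38, ' State-gov'],
                 [53, np.nan]], dtype=object)
encoded = categorize_sketch(rows, {1: {' Private': 5, ' State-gov': 7}})
```

Columns without an entry in vocabs are left untouched, which mirrors how unused string columns remain in the array above.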


Normalize(arr, procs)

Normalizes continuous data in arr based on procs

arr is expected to be a NumPy array, while procs should be the pre-processing dictionary exported after training

array([49, 5, 101320, 8, 12.0, 3, 0, 6, 5, ' Female', 0, 1902, 40,
       ' United-States', '>=50k', 2, True, True], dtype=object)
df = Normalize(df, procs)
array([0.7634343827572744, 5, 101320, 8, 12.0, 3, 0, 6, 5, ' Female', 0,
       1902, 40, ' United-States', '>=50k', 2, True, True], dtype=object)

Our continuous variables have now been adjusted for the model.
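
For reference, standard normalization subtracts a column's training mean and divides by its training standard deviation. The sketch below assumes that formula and a hypothetical stats layout (column index mapped to a (mean, std) pair); neither the name normalize_sketch nor the statistics used here come from the library or the real procs:

```python
import numpy as np

def normalize_sketch(arr, stats):
    """Apply (x - mean) / std to the given columns.

    stats is a hypothetical {column index: (mean, std)} mapping.
    """
    out = np.asarray(arr, dtype=object).copy()
    for col, (mean, std) in stats.items():
        values = out[:, col].astype(float)
        out[:, col] = (values - mean) / std
    return out

rows = np.array([[50, ' Private'],
                 [30, ' Private']], dtype=object)
# Illustrative statistics only, not the values stored in procs
normalized = normalize_sketch(rows, {0: (40.0, 10.0)})
```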


apply_procs(arr, procs)

Apply test-time pre-processing on NumPy array input

The pre-processing steps must run in a specific order: FillMissing can add several is_na columns, and Categorize then has to encode those extra columns.

df = pd.read_csv('/home/ml1/.fastai/data/adult_sample/adult.csv')
df = df.head().to_numpy()
array([49, ' Private', 101320, ' Assoc-acdm', 12.0, ' Married-civ-spouse',
       nan, ' Wife', ' White', ' Female', 0, 1902, 40, ' United-States',
       '>=50k'], dtype=object)
df = apply_procs(df, procs)
array([0.7634343827572744, 5, 101320, 8, 12.0, 3, 0, 6, 5, ' Female', 0,
       1902, 40, ' United-States', '>=50k', 2, True, True], dtype=object)
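
The ordering constraint can be illustrated with a toy example. The helpers add_flag and encode_flag and the codes 1 and 2 are hypothetical stand-ins for the real steps: a flag column that has not been appended yet cannot be encoded.

```python
import numpy as np

def add_flag(arr, col):
    # Append a bool column that is True where the value in `col` is present
    present = np.array(
        [not (isinstance(v, float) and np.isnan(v)) for v in arr[:, col]],
        dtype=object,
    )
    return np.concatenate([arr, present[:, None]], axis=1)

def encode_flag(arr, col, codes):
    # Replace the values in `col` with their integer codes
    out = arr.copy()
    out[:, col] = [codes[v] for v in out[:, col]]
    return out

rows = np.array([[12.0], [np.nan]], dtype=object)
codes = {False: 1, True: 2}   # illustrative codes

# Correct order: the flag column exists before it is encoded
result = encode_flag(add_flag(rows, 0), 1, codes)

# Wrong order: column 1 does not exist yet, so encoding it fails
try:
    encode_flag(rows, 1, codes)
    wrong_order_failed = False
except IndexError:
    wrong_order_failed = True
```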

class TabularDataset

TabularDataset(arr, procs, bs=64)

A tabular NumPy dataset based on procs with batch size bs


TabularDataset.__init__(arr, procs, bs=64)

Stores array, grabs the indices for cats and conts, and generates batches



Splits data into equal-sized batches, excluding the final partial batch

df = pd.read_csv('/home/ml1/.fastai/data/adult_sample/adult.csv')
df = df.head().to_numpy()
df = apply_procs(df, procs)
dset = TabularDataset(df, procs)
[array([[ 5,  8,  3,  0,  6,  5,  2],
        [ 5, 13,  1,  5,  2,  5,  2],
        [ 5, 12,  1,  0,  5,  3,  1],
        [ 6, 15,  3, 11,  1,  2,  2],
        [ 7,  6,  3,  9,  6,  3,  1]]),
 array([[ 7.6343441e-01,  1.0132000e+05,  1.2000000e+01],
        [ 3.9686874e-01,  2.3674600e+05,  1.4000000e+01],
        [-4.3010049e-02,  9.6185000e+04,  1.0000000e+01],
        [-4.3010049e-02,  1.1284700e+05,  1.5000000e+01],
        [ 2.5024247e-01,  8.2297000e+04,  1.0000000e+01]], dtype=float32)]
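
One way such a dataset could separate and batch its inputs is sketched below. This is an assumption-laden illustration, not the TabularDataset implementation: the name split_batches and the cat_idx and cont_idx arguments are hypothetical, and the sketch assumes that a final partial batch is kept as a shorter last batch, as the five-row output above suggests.

```python
import numpy as np

def split_batches(arr, cat_idx, cont_idx, bs=64):
    """Split a processed array into (cats, conts) batches of size bs."""
    cats = arr[:, cat_idx].astype(np.int64)       # integer category codes
    conts = arr[:, cont_idx].astype(np.float32)   # continuous features
    n_full = len(arr) // bs
    batches = [(cats[i * bs:(i + 1) * bs], conts[i * bs:(i + 1) * bs])
               for i in range(n_full)]
    if len(arr) % bs:
        # Assumed behaviour: keep the leftover rows as a final, smaller batch
        batches.append((cats[n_full * bs:], conts[n_full * bs:]))
    return batches

# Five already-processed rows: column 0 is continuous, columns 1 and 2 are categorical
rows = np.array([[0.5, 5, 2],
                 [0.1, 5, 2],
                 [-0.3, 6, 1],
                 [0.2, 7, 2],
                 [0.9, 5, 1]], dtype=object)
batches = split_batches(rows, cat_idx=[1, 2], cont_idx=[0], bs=2)
```

With five rows and bs=2 this yields two full batches and one batch holding the remaining row.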

class tabular_learner


A Learner-like wrapper for tabular data



Accepts a filename fn pointing to the exported procs and the ONNX model

learn = tabular_learner('procs')


tabular_learner.test_dl(test_items, bs=64)

Applies procs to test_items

df = pd.read_csv('/home/ml1/.fastai/data/adult_sample/adult.csv')
dl = learn.test_dl(df.iloc[:5].to_numpy())
[array([[ 5,  8,  3,  0,  6,  5,  2],
        [ 5, 13,  1,  5,  2,  5,  2],
        [ 5, 12,  1,  0,  5,  3,  1],
        [ 6, 15,  3, 11,  1,  2,  2],
        [ 7,  6,  3,  9,  6,  3,  1]]),
 array([[ 7.6343441e-01,  1.0132000e+05,  1.2000000e+01],
        [ 3.9686874e-01,  2.3674600e+05,  1.4000000e+01],
        [-4.3010049e-02,  9.6185000e+04,  1.0000000e+01],
        [-4.3010049e-02,  1.1284700e+05,  1.5000000e+01],
        [ 2.5024247e-01,  8.2297000e+04,  1.0000000e+01]], dtype=float32)]



Predict a single NumPy item

['<50k', '<50k', '<50k', '<50k', '<50k']



Predict on multiple batches of data in dl

['<50k', '<50k', '<50k', '<50k', '<50k']
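
How per-batch model outputs might be turned into labels like the ones above is sketched here. Everything in it is hypothetical: predict_batches, the model_fn callable, the dummy_model stand-in (used in place of an ONNX session), and the label order are assumptions made so the example can run on its own.

```python
import numpy as np

def predict_batches(model_fn, batches, labels):
    """Run model_fn on every (cats, conts) batch and decode the top class."""
    preds = []
    for cats, conts in batches:
        scores = model_fn(cats, conts)             # shape: (batch size, n classes)
        preds.extend(labels[i] for i in scores.argmax(axis=1))
    return preds

def dummy_model(cats, conts):
    # Stand-in for a real model: always scores class 0 highest
    scores = np.zeros((len(cats), 2), dtype=np.float32)
    scores[:, 0] = 1.0
    return scores

labels = ['<50k', '>=50k']   # assumed label order
batches = [
    (np.array([[5, 2], [5, 2]]), np.array([[0.5], [0.1]], dtype=np.float32)),
    (np.array([[6, 1]]), np.array([[-0.3]], dtype=np.float32)),
]
preds = predict_batches(dummy_model, batches, labels)
```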