Methods for the tabular models, including data preparation and model prediction

For an example we'll use the first five rows of the ADULT_SAMPLE dataset, which I have converted to a NumPy array below:

import pandas as pd
df = pd.read_csv('/home/ml1/.fastai/data/adult_sample/adult.csv')
df = df.head().to_numpy()

For procs we will use the same ones from training a model:

  • Note: Categorize needs np.nan as a key in each category mapping to work properly, so we add it back after loading. This is done automatically later

import pickle
import numpy as np

with open('procs.pkl', 'rb') as handle:
    procs = pickle.load(handle)
    for proc in procs['Categorize']:
        procs['Categorize'][proc][np.nan] = 0 # we can't pickle np.nan
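
A likely reason for re-adding the key, offered as an assumption rather than a statement about how the library works: a dictionary finds a NaN key only through object identity, because NaN never compares equal to itself, and a pickle round trip rebuilds the key as a new object. The sketch below uses math.nan as a stand-in for np.nan and a made-up one-entry mapping.

```python
import math
import pickle

mapping = {math.nan: 0}
# Lookup with the very same object succeeds through the identity check
same_object_found = math.nan in mapping

# A pickle round trip rebuilds the key as a new float object
restored = pickle.loads(pickle.dumps(mapping))
found_after_roundtrip = math.nan in restored

# Re-adding the key with the live NaN object restores the lookup
restored[math.nan] = 0
found_after_fix = math.nan in restored

print(same_object_found, found_after_roundtrip, found_after_fix)
```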

FillMissing

FillMissing(arr, procs)

Fills in missing data in conts and potentially generates a new categorical column

arr is expected to be a NumPy array, while procs should be the pre-processing dictionary exported after training

df[0]
array([49, ' Private', 101320, ' Assoc-acdm', 12.0, ' Married-civ-spouse',
       nan, ' Wife', ' White', ' Female', 0, 1902, 40, ' United-States',
       '>=50k'], dtype=object)
df = FillMissing(df, procs)
df[0]
array([49, ' Private', 101320, ' Assoc-acdm', 12.0, ' Married-civ-spouse',
       nan, ' Wife', ' White', ' Female', 0, 1902, 40, ' United-States',
       '>=50k', True, True, True], dtype=object)

Three bool columns were added at the end, flagging potentially missing numerical values (True means the original value was present)
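
To make the idea concrete, here is a minimal sketch of filling missing values and appending flag columns. It is not the library's implementation: the na_dict layout (column index mapped to fill value), the fill values, and the name fill_missing_sketch are all hypothetical. The flag polarity follows the output above, where True marks a value that was present.

```python
import numpy as np

def is_missing(col):
    # NaN is the only float that is not equal to itself
    return np.array([isinstance(v, float) and v != v for v in col], dtype=bool)

def fill_missing_sketch(arr, na_dict):
    """Fill NaNs in the given columns and append one bool flag column each.

    na_dict is a hypothetical {column_index: fill_value} mapping.
    """
    arr = arr.copy()
    flags = []
    for col_idx, fill_value in na_dict.items():
        missing = is_missing(arr[:, col_idx])
        arr[missing, col_idx] = fill_value
        flags.append((~missing).astype(object))  # True: value was present
    return np.column_stack([arr] + flags)

rows = np.array([[49, ' Private', 12.0],
                 [44, ' Private', np.nan]], dtype=object)
filled = fill_missing_sketch(rows, {0: 38, 2: 10.0})  # made-up fill values
print(filled)
```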

Categorize

Categorize(arr, procs)

Encodes categorical data in arr based on procs

arr is expected to be a NumPy array, while procs should be the pre-processing dictionary exported after training

df[0]
array([49, ' Private', 101320, ' Assoc-acdm', 12.0, ' Married-civ-spouse',
       nan, ' Wife', ' White', ' Female', 0, 1902, 40, ' United-States',
       '>=50k', True, True, True], dtype=object)
df = Categorize(df, procs)
df[0]
array([49, 5, 101320, 8, 12.0, 3, 0, 6, 5, ' Female', 0, 1902, 40,
       ' United-States', '>=50k', 2, True, True], dtype=object)

Our categorical variables are now all converted to integers. Any left as strings are not used by the model and are ignored at inference time.
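
As a rough illustration of the encoding step, the sketch below replaces category values with integer codes from a lookup table. The cat_maps layout, the specific codes, and the choice to send unseen categories to 0 are assumptions made for this example, not taken from the library; only the np.nan to 0 mapping mirrors the procs setup at the top of this page.

```python
import numpy as np

def categorize_sketch(arr, cat_maps):
    """Replace category values with integer codes.

    cat_maps is a hypothetical {column_index: {category: code}} mapping.
    """
    arr = arr.copy()
    for col_idx, mapping in cat_maps.items():
        for row in range(len(arr)):
            value = arr[row, col_idx]
            # NaN never equals itself, so test for it explicitly; missing
            # or unseen categories fall back to code 0
            if isinstance(value, float) and value != value:
                arr[row, col_idx] = 0
            else:
                arr[row, col_idx] = mapping.get(value, 0)
    return arr

rows = np.array([[' Private', np.nan],
                 [' State-gov', ' Wife']], dtype=object)
cat_maps = {0: {' Private': 5, ' State-gov': 8}, 1: {' Wife': 6}}  # made-up codes
encoded = categorize_sketch(rows, cat_maps)
print(encoded)
```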

Normalize

Normalize(arr, procs)

Normalizes continuous data in arr based on procs

arr is expected to be a NumPy array, while procs should be the pre-processing dictionary exported after training

df[0]
array([49, 5, 101320, 8, 12.0, 3, 0, 6, 5, ' Female', 0, 1902, 40,
       ' United-States', '>=50k', 2, True, True], dtype=object)
df = Normalize(df, procs)
df[0]
array([0.7634343827572744, 5, 101320, 8, 12.0, 3, 0, 6, 5, ' Female', 0,
       1902, 40, ' United-States', '>=50k', 2, True, True], dtype=object)

Our continuous variables have now been adjusted for the model
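
For reference, normalization of this kind usually means subtracting a stored mean and dividing by a stored standard deviation. The sketch below assumes that form; the stats layout and the mean and std values are invented for illustration and are not the statistics stored in procs.

```python
import numpy as np

def normalize_sketch(arr, stats):
    """Standardize the listed columns as (value - mean) / std.

    stats is a hypothetical {column_index: (mean, std)} mapping.
    """
    arr = arr.copy()
    for col_idx, (mean, std) in stats.items():
        col = arr[:, col_idx].astype(np.float64)
        arr[:, col_idx] = (col - mean) / std
    return arr

rows = np.array([[49, ' Private'],
                 [38, ' Private']], dtype=object)
normed = normalize_sketch(rows, {0: (38.0, 14.0)})  # made-up mean and std
print(normed)
```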

apply_procs

apply_procs(arr, procs)

Apply test-time pre-processing on NumPy array input

The pre-processing steps must run in a fixed order (FillMissing, then Categorize, then Normalize), because FillMissing can append one or more is_na columns that Categorize then has to encode

df = pd.read_csv('/home/ml1/.fastai/data/adult_sample/adult.csv')
df = df.head().to_numpy()
df[0]
array([49, ' Private', 101320, ' Assoc-acdm', 12.0, ' Married-civ-spouse',
       nan, ' Wife', ' White', ' Female', 0, 1902, 40, ' United-States',
       '>=50k'], dtype=object)
df = apply_procs(df, procs)
df[0]
array([0.7634343827572744, 5, 101320, 8, 12.0, 3, 0, 6, 5, ' Female', 0,
       1902, 40, ' United-States', '>=50k', 2, True, True], dtype=object)
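
The toy pipeline below shows why the order matters. The two step functions are simplified stand-ins written for this example, not the library's FillMissing and Categorize, and the False to 1, True to 2 coding is an assumption.

```python
import numpy as np

def add_flags(arr):
    # Stand-in for FillMissing: append a bool "value present" column
    present = np.array([not (isinstance(v, float) and v != v)
                        for v in arr[:, 0]])
    return np.column_stack([arr, present.astype(object)])

def encode_flags(arr):
    # Stand-in for Categorize: encode the last column (False -> 1, True -> 2)
    arr = arr.copy()
    arr[:, -1] = [2 if flag else 1 for flag in arr[:, -1]]
    return arr

def apply_in_order(arr, steps):
    # Feed the output of each step into the next one
    for step in steps:
        arr = step(arr)
    return arr

rows = np.array([[12.0], [np.nan]], dtype=object)
right = apply_in_order(rows, [add_flags, encode_flags])
wrong = apply_in_order(rows, [encode_flags, add_flags])
print(right)
print(wrong)
```

In the reversed order the encoder runs before the flag column exists, so it overwrites the data column instead.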

class TabularDataset

TabularDataset(arr, procs, bs=64)

A tabular NumPy dataset based on procs with batch size bs

TabularDataset.__init__

TabularDataset.__init__(arr, procs, bs=64)

Stores array, grabs the indices for cats and conts, and generates batches
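
A small sketch of the split into categorical and continuous arrays, using the processed first row from apply_procs above. The column positions are not read from procs here; they are inferred by comparing that row with the dset[0] output in this section, so treat them as an assumption.

```python
import numpy as np

row = np.array([[0.7634343827572744, 5, 101320, 8, 12.0, 3, 0, 6, 5,
                 ' Female', 0, 1902, 40, ' United-States', '>=50k',
                 2, True, True]], dtype=object)

cat_idxs = [1, 3, 5, 6, 7, 8, 15]  # inferred positions of the categorical codes
cont_idxs = [0, 2, 4]              # inferred positions of the continuous values

# Select each group of columns and cast to the dtype the model input uses
cats = row[:, cat_idxs].astype(np.int64)
conts = row[:, cont_idxs].astype(np.float32)
print(cats)
print(conts)
```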

TabularDataset.make_batches

TabularDataset.make_batches()

Splits data into equal-sized batches, except for the final batch, which may be smaller

df = pd.read_csv('/home/ml1/.fastai/data/adult_sample/adult.csv')
df = df.head().to_numpy()
df = apply_procs(df, procs)
dset = TabularDataset(df, procs)
dset[0]
[array([[ 5,  8,  3,  0,  6,  5,  2],
        [ 5, 13,  1,  5,  2,  5,  2],
        [ 5, 12,  1,  0,  5,  3,  1],
        [ 6, 15,  3, 11,  1,  2,  2],
        [ 7,  6,  3,  9,  6,  3,  1]]),
 array([[ 7.6343441e-01,  1.0132000e+05,  1.2000000e+01],
        [ 3.9686874e-01,  2.3674600e+05,  1.4000000e+01],
        [-4.3010049e-02,  9.6185000e+04,  1.0000000e+01],
        [-4.3010049e-02,  1.1284700e+05,  1.5000000e+01],
        [ 2.5024247e-01,  8.2297000e+04,  1.0000000e+01]], dtype=float32)]
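
Batching of this kind can be sketched in one line of slicing. The function below is an illustration under that assumption, not the library's make_batches; with five rows and the default bs=64 it yields a single batch of five rows, matching the output above.

```python
import numpy as np

def make_batches_sketch(arr, bs=64):
    """Split rows into consecutive chunks of bs rows; the last may be shorter."""
    return [arr[start:start + bs] for start in range(0, len(arr), bs)]

data = np.arange(10).reshape(5, 2)
small_batches = make_batches_sketch(data, bs=2)
one_batch = make_batches_sketch(data)

print([len(b) for b in small_batches])  # → [2, 2, 1]
print(len(one_batch))                   # → 1
```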

class tabular_learner

tabular_learner(fn)

A Learner-like wrapper for tabular data

tabular_learner.__init__

tabular_learner.__init__(fn)

Accepts a fn pointing to exported procs and ONNX filename

learn = tabular_learner('procs')

tabular_learner.test_dl

tabular_learner.test_dl(test_items, bs=64)

Applies procs to test_items

df = pd.read_csv('/home/ml1/.fastai/data/adult_sample/adult.csv')
dl = learn.test_dl(df.iloc[:5].to_numpy())
dl[0]
[array([[ 5,  8,  3,  0,  6,  5,  2],
        [ 5, 13,  1,  5,  2,  5,  2],
        [ 5, 12,  1,  0,  5,  3,  1],
        [ 6, 15,  3, 11,  1,  2,  2],
        [ 7,  6,  3,  9,  6,  3,  1]]),
 array([[ 7.6343441e-01,  1.0132000e+05,  1.2000000e+01],
        [ 3.9686874e-01,  2.3674600e+05,  1.4000000e+01],
        [-4.3010049e-02,  9.6185000e+04,  1.0000000e+01],
        [-4.3010049e-02,  1.1284700e+05,  1.5000000e+01],
        [ 2.5024247e-01,  8.2297000e+04,  1.0000000e+01]], dtype=float32)]

tabular_learner.predict

tabular_learner.predict(inps)

Predict on a single NumPy input, such as one batch from test_dl

learn.predict(dl[0])
['<50k', '<50k', '<50k', '<50k', '<50k']
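
How the class labels come out of the model is not shown here. A common approach, assumed for this sketch, is to take the highest-scoring class per row and look it up in a vocabulary. The scores and the label order below are invented, and decode_sketch is a hypothetical helper.

```python
import numpy as np

def decode_sketch(scores, vocab):
    """Pick the highest-scoring class per row and return its label."""
    return [vocab[i] for i in np.argmax(scores, axis=1)]

vocab = ['<50k', '>=50k']              # label order is an assumption
scores = np.array([[2.1, -0.3],        # invented scores for two rows
                   [0.2, 1.7]], dtype=np.float32)
labels = decode_sketch(scores, vocab)
print(labels)
```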

tabular_learner.get_preds

tabular_learner.get_preds(dl=None)

Predict on multiple batches of data in dl

learn.get_preds(dl=dl)
['<50k', '<50k', '<50k', '<50k', '<50k']
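
Predicting over several batches amounts to calling the per-batch prediction in a loop and collecting the results. The sketch below assumes that shape; stub_predict is a hypothetical stand-in for learn.predict that always returns the same label.

```python
def get_preds_sketch(batches, predict_fn):
    """Run predict_fn on every batch and flatten the results into one list."""
    preds = []
    for batch in batches:
        preds.extend(predict_fn(batch))
    return preds

def stub_predict(batch):
    # Hypothetical stand-in for learn.predict: one label per row
    return ['<50k' for _ in batch]

batches = [[[1, 2], [3, 4]], [[5, 6]]]  # two batches: two rows, then one row
all_preds = get_preds_sketch(batches, stub_predict)
print(all_preds)
```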