Methods for the tabular models, including data preparation and model prediction

As an example, we'll use the first five rows of the ADULT_SAMPLE dataset, which I have converted to a NumPy array below:

import pandas as pd
df = pd.read_csv('/home/ml1/.fastai/data/adult_sample/adult.csv')
df = df.head().to_numpy()

For procs we will use the same ones exported after training the model:

  • Note: we have to re-insert np.nan as a key in each Categorize mapping for lookups to work properly, since we can't pickle np.nan. This is done automatically later
import pickle
import numpy as np

with open('procs.pkl', 'rb') as handle:
    procs = pickle.load(handle)
    for proc in procs['Categorize']:
        procs['Categorize'][proc][np.nan] = 0 # restore np.nan as a key, since we can't pickle it
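
For reference, here is a purely hypothetical illustration of the structure the examples below assume procs has after the fix above; every key and value shown is made up:

import numpy as np

# Hypothetical structure only; the real exported dictionary may differ
procs = {
    'FillMissing': {4: 10.0},                        # cont column index -> training fill value
    'Categorize':  {1: {np.nan: 0, ' Private': 5}},  # cat column index -> {raw value: code}
    'Normalize':   {0: (38.58, 13.64)},              # cont column index -> (mean, std)
}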

FillMissing[source]

FillMissing(arr, procs)

Fills in missing data in conts and potentially generates a new categorical column

arr is expected to be a NumPy array, while procs should be the pre-processing dictionary exported after training

df[0]
array([49, ' Private', 101320, ' Assoc-acdm', 12.0, ' Married-civ-spouse',
       nan, ' Wife', ' White', ' Female', 0, 1902, 40, ' United-States',
       '>=50k'], dtype=object)
df = FillMissing(df, procs)
df[0]
array([49, ' Private', 101320, ' Assoc-acdm', 12.0, ' Married-civ-spouse',
       nan, ' Wife', ' White', ' Female', 0, 1902, 40, ' United-States',
       '>=50k', True, True, True], dtype=object)

Three boolean columns were appended at the end, one for each continuous variable that could contain missing values (True means the value was present)
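
For intuition, here is a minimal sketch of the idea, assuming procs['FillMissing'] maps continuous-column indices to the fill values saved at training time; the function name and structure are hypothetical, not the library's implementation:

import numpy as np

def fill_missing_sketch(arr, procs):
    "Fill nan in continuous columns; append one bool column per filled column"
    for idx, fill_val in procs['FillMissing'].items():
        col = arr[:, idx].astype(float)
        was_present = ~np.isnan(col)            # True where a value already existed
        col[~was_present] = fill_val            # replace missing entries with the training fill
        arr[:, idx] = col
        arr = np.concatenate([arr, was_present[:, None]], axis=1)
    return arr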

Categorize[source]

Categorize(arr, procs)

Encodes categorical data in arr based on procs

arr is expected to be a NumPy array, while procs should be the pre-processing dictionary exported after training

df[0]
array([49, ' Private', 101320, ' Assoc-acdm', 12.0, ' Married-civ-spouse',
       nan, ' Wife', ' White', ' Female', 0, 1902, 40, ' United-States',
       '>=50k', True, True, True], dtype=object)
df = Categorize(df, procs)
df[0]
array([49, 5, 101320, 8, 12.0, 3, 0, 6, 5, ' Female', 0, 1902, 40,
       ' United-States', '>=50k', 2, True, True], dtype=object)

Our categorical variables have all been converted to integers. Any columns left as strings are not used by the model and are ignored at inference time.
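
A minimal sketch of the encoding step, assuming procs['Categorize'] maps each categorical column index to a {raw value: integer code} dictionary, as the np.nan fix above suggests; the name is hypothetical:

def categorize_sketch(arr, procs):
    "Replace raw category values with the integer codes learned at training time"
    for idx, mapping in procs['Categorize'].items():
        arr[:, idx] = [mapping.get(val, 0) for val in arr[:, idx]]  # unseen values fall back to 0
    return arr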

Normalize[source]

Normalize(arr, procs)

Normalizes continuous data in arr based on procs

arr is expected to be a NumPy array, while procs should be the pre-processing dictionary exported after training

df[0]
array([49, 5, 101320, 8, 12.0, 3, 0, 6, 5, ' Female', 0, 1902, 40,
       ' United-States', '>=50k', 2, True, True], dtype=object)
df = Normalize(df, procs)
df[0]
array([0.7634343827572744, 5, 101320, 8, 12.0, 3, 0, 6, 5, ' Female', 0,
       1902, 40, ' United-States', '>=50k', 2, True, True], dtype=object)

Our continuous variables have now been normalized for the model
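
A minimal sketch, assuming procs['Normalize'] stores a (mean, std) pair per continuous column; again a hypothetical structure:

def normalize_sketch(arr, procs):
    "Scale continuous columns with the statistics saved at training time"
    for idx, (mean, std) in procs['Normalize'].items():
        arr[:, idx] = (arr[:, idx].astype(float) - mean) / std
    return arr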

apply_procs[source]

apply_procs(arr, procs)

Apply test-time pre-processing on NumPy array input

The pre-processing steps must run in a fixed order: FillMissing comes before Categorize, since FillMissing can add one or more is_na columns that Categorize then needs to encode

df = pd.read_csv('/home/ml1/.fastai/data/adult_sample/adult.csv')
df = df.head().to_numpy()
df[0]
array([49, ' Private', 101320, ' Assoc-acdm', 12.0, ' Married-civ-spouse',
       nan, ' Wife', ' White', ' Female', 0, 1902, 40, ' United-States',
       '>=50k'], dtype=object)
df = apply_procs(df, procs)
df[0]
array([0.7634343827572744, 5, 101320, 8, 12.0, 3, 0, 6, 5, ' Female', 0,
       1902, 40, ' United-States', '>=50k', 2, True, True], dtype=object)
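
Reusing the sketch helpers above, apply_procs plausibly amounts to chaining the three steps in the required order; a sketch, not the library's code:

def apply_procs_sketch(arr, procs):
    "FillMissing first, so Categorize can encode any freshly added is_na columns"
    arr = fill_missing_sketch(arr, procs)
    arr = categorize_sketch(arr, procs)
    arr = normalize_sketch(arr, procs)
    return arr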

class TabularDataset[source]

TabularDataset(arr, procs, bs=64, device='cuda')

A tabular PyTorch dataset based on procs with batch size bs on device

TabularDataset.__init__[source]

TabularDataset.__init__(arr, procs, bs=64, device='cuda')

Stores the array, grabs the indices for cats and conts, and generates batches

TabularDataset.make_batches[source]

TabularDataset.make_batches()

Splits the data into equal-sized batches, excluding the final partial batch
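
A sketch of one plausible batching scheme; the 5-row example below still comes through as a single smaller batch, which suggests leftover rows are kept as a final partial batch of their own. Names here are hypothetical:

def make_batches_sketch(arr, bs=64):
    "Full batches of bs rows, plus any leftover rows as a smaller final batch"
    n_full = len(arr) // bs
    batches = [arr[i*bs:(i+1)*bs] for i in range(n_full)]
    if len(arr) % bs:
        batches.append(arr[n_full*bs:])  # the partial batch the 5-row example relies on
    return batches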

df = pd.read_csv('/home/ml1/.fastai/data/adult_sample/adult.csv')
df = df.head().to_numpy()
df = apply_procs(df, procs)
dset = TabularDataset(df, procs)
dset[0]
[tensor([[ 5,  8,  3,  0,  6,  5,  2],
         [ 5, 13,  1,  5,  2,  5,  2],
         [ 5, 12,  1,  0,  5,  3,  1],
         [ 6, 15,  3, 11,  1,  2,  2],
         [ 7,  6,  3,  9,  6,  3,  1]], device='cuda:0'),
 tensor([[ 7.6343e-01,  1.0132e+05,  1.2000e+01],
         [ 3.9687e-01,  2.3675e+05,  1.4000e+01],
         [-4.3010e-02,  9.6185e+04,  1.0000e+01],
         [-4.3010e-02,  1.1285e+05,  1.5000e+01],
         [ 2.5024e-01,  8.2297e+04,  1.0000e+01]], device='cuda:0')]
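
For intuition about the output above: each batch gets split into an integer tensor of categorical codes and a float tensor of continuous values. A hedged sketch, assuming the cat and cont column indices are recoverable from procs; the helper name is hypothetical:

import torch

def to_tensors_sketch(batch, cat_idxs, cont_idxs, device='cuda'):
    "Split a pre-processed batch into [cats, conts] tensors on the target device"
    cats = torch.tensor(batch[:, cat_idxs].astype('int64'), device=device)
    conts = torch.tensor(batch[:, cont_idxs].astype('float32'), device=device)
    return [cats, conts]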

class tabular_learner[source]

tabular_learner(data_fn, model_fn)

A Learner-like wrapper for tabular data

tabular_learner.__init__[source]

tabular_learner.__init__(data_fn, model_fn)

Accepts a data_fn and a model_fn corresponding to the named pickle exports

learn = tabular_learner('procs.pkl', 'model.pkl')
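
A minimal sketch of what the constructor plausibly does, assuming model.pkl deserializes directly to a PyTorch module; the class name is hypothetical and the real loading logic may differ:

import pickle

class TabularLearnerSketch:
    def __init__(self, data_fn, model_fn):
        with open(data_fn, 'rb') as f:
            self.procs = pickle.load(f)  # the pre-processing dictionary
        with open(model_fn, 'rb') as f:
            self.model = pickle.load(f)  # the trained PyTorch model
        self.model.eval()                # inference mode for dropout/batchnorm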

tabular_learner.test_dl[source]

tabular_learner.test_dl(test_items, bs=64)

Applies procs to test_items

df = pd.read_csv('/home/ml1/.fastai/data/adult_sample/adult.csv')
dl = learn.test_dl(df.iloc[:5].to_numpy())
dl[0]
[tensor([[ 5,  8,  3,  0,  6,  5,  2],
         [ 5, 13,  1,  5,  2,  5,  2],
         [ 5, 12,  1,  0,  5,  3,  1],
         [ 6, 15,  3, 11,  1,  2,  2],
         [ 7,  6,  3,  9,  6,  3,  1]], device='cuda:0'),
 tensor([[ 7.6343e-01,  1.0132e+05,  1.2000e+01],
         [ 3.9687e-01,  2.3675e+05,  1.4000e+01],
         [-4.3010e-02,  9.6185e+04,  1.0000e+01],
         [-4.3010e-02,  1.1285e+05,  1.5000e+01],
         [ 2.5024e-01,  8.2297e+04,  1.0000e+01]], device='cuda:0')]
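
test_dl plausibly just composes the pieces already shown; a sketch using this document's apply_procs and TabularDataset:

def test_dl_sketch(learn, test_items, bs=64):
    "Pre-process raw rows and wrap them in model-ready batches"
    arr = apply_procs(test_items, learn.procs)
    return TabularDataset(arr, learn.procs, bs=bs)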

tabular_learner.predict[source]

tabular_learner.predict(inps)

Predict on a single batch of tensors

learn.predict(dl[0])
['<50k', '<50k', '<50k', '<50k', '<50k']
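
A hedged sketch of the prediction step, assuming a fastai-style forward pass model(cats, conts) and a made-up class vocab; both are assumptions, not the library's API:

import torch

def predict_sketch(learn, inps, vocab=('<50k', '>=50k')):
    "Run one batch through the model and decode argmax indices to labels"
    cats, conts = inps
    with torch.no_grad():               # no gradients needed at inference
        out = learn.model(cats, conts)  # assumed forward signature
    return [vocab[i] for i in out.argmax(dim=1).tolist()]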

tabular_learner.get_preds[source]

tabular_learner.get_preds(dl=None)

Predict on multiple batches of data in dl

learn.get_preds(dl=dl)
['<50k', '<50k', '<50k', '<50k', '<50k']
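
And get_preds presumably just loops the single-batch prediction over every batch; a sketch reusing the hypothetical helper above:

def get_preds_sketch(learn, dl):
    "Collect decoded predictions across all batches"
    preds = []
    for batch in dl:  # relies on the dataset being iterable/indexable
        preds += predict_sketch(learn, batch)
    return preds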