For an example we'll use the first five rows of the ADULT_SAMPLE dataset, which I have converted to a NumPy array below:
import pandas as pd
df = pd.read_csv('/home/ml1/.fastai/data/adult_sample/adult.csv')
df = df.head().to_numpy()
For procs we will use the same ones exported after training the model:
- Note: we have to adjust the loaded Categorize proc so that np.nan is an index again for it to work properly. This is done automatically later
import pickle
import numpy as np

# load the pre-processing dictionary exported after training
with open('procs.pkl', 'rb') as handle:
    procs = pickle.load(handle)

# re-add np.nan as an index for every categorical mapping
for proc in procs['Categorize']:
    procs['Categorize'][proc][np.nan] = 0 # we can't pickle np.nan
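To keep the rest of this walkthrough concrete, here is the rough layout I will assume for that dictionary in the sketches below. This is an assumption for illustration only; the exact keys depend on how procs.pkl was exported during training:

# Hypothetical layout of the exported pre-processing dictionary (an assumption, not the exported format itself):
# procs = {
#     'FillMissing': {column_index: fill_value, ...},
#     'Categorize':  {column_index: {raw_value: integer_code, ...}, ...},
#     'Normalize':   {column_index: (mean, std), ...},
# }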
For FillMissing, arr is expected to be a NumPy array, while procs should be the pre-processing dictionary exported after training:
df[0]
df = FillMissing(df, procs)
df[0]
Three bool columns were added at the end for our potentially missing numerical values (if True, a value was missing there)
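As a rough idea of what a step like this involves (a minimal sketch under the assumed procs layout above, not the actual FillMissing source), filling amounts to replacing np.nan with the training-time fill value and recording where the gaps were:

import numpy as np

def fill_missing_sketch(arr, procs):
    # Assumes procs['FillMissing'] maps a continuous column index -> training-time fill value
    arr = arr.copy()
    na_flags = []
    for idx, fill_value in procs['FillMissing'].items():
        col = arr[:, idx].astype(float)
        missing = np.isnan(col)
        col[missing] = fill_value
        arr[:, idx] = col
        na_flags.append(missing)
    # the bool flags become the extra columns appended at the end
    return np.concatenate([arr, np.stack(na_flags, axis=1)], axis=1)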
Categorize has the same signature: arr is expected to be a NumPy array, while procs should be the pre-processing dictionary exported after training:
df[0]
df = Categorize(df, procs)
df[0]
Our categorical variables are now all converted to integers. Any left as strings are not used by the model and are ignored at inference time.
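Again only as a hedged sketch (assuming procs['Categorize'] maps a categorical column index to a {raw value: integer code} dictionary, which is also why we re-inserted the np.nan key after unpickling):

import numpy as np

def categorize_sketch(arr, procs):
    # Assumes procs['Categorize'] maps a categorical column index -> {raw value: integer code}
    arr = arr.copy()
    for idx, mapping in procs['Categorize'].items():
        # anything unseen at training time falls back to the np.nan entry (code 0)
        arr[:, idx] = [mapping.get(val, mapping[np.nan]) for val in arr[:, idx]]
    return arr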
Normalize follows the same pattern: arr is expected to be a NumPy array, while procs should be the pre-processing dictionary exported after training:
df[0]
df = Normalize(df, procs)
df[0]
Our continuous variables have now been adjusted for the model.
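A comparable sketch for this step, assuming procs['Normalize'] stores the training-set mean and standard deviation for each continuous column index:

def normalize_sketch(arr, procs):
    # Assumes procs['Normalize'] maps a continuous column index -> (mean, std) from training
    arr = arr.copy()
    for idx, (mean, std) in procs['Normalize'].items():
        arr[:, idx] = (arr[:, idx].astype(float) - mean) / std
    return arr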
The pre-processing must be applied in this specific order, as the number of columns Categorize sees can grow by a few beyond the original if FillMissing adds multiple is_na columns
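Under the same assumptions, chaining the steps in that required order would look roughly like this (a sketch of what a helper such as apply_procs does, not its actual source):

def apply_procs_sketch(arr, procs):
    # order matters: FillMissing may append is_na columns that Categorize then has to encode
    arr = fill_missing_sketch(arr, procs)
    arr = categorize_sketch(arr, procs)
    arr = normalize_sketch(arr, procs)
    return arr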
df = pd.read_csv('/home/ml1/.fastai/data/adult_sample/adult.csv')
df = df.head().to_numpy()
df[0]
df = apply_procs(df, procs)
df[0]
df = pd.read_csv('/home/ml1/.fastai/data/adult_sample/adult.csv')
df = df.head().to_numpy()
df = apply_procs(df, procs)
dset = TabularDataset(df, procs)
dset[0]
learn = tabular_learner('procs')
df = pd.read_csv('/home/ml1/.fastai/data/adult_sample/adult.csv')
dl = learn.test_dl(df.iloc[:5].to_numpy())
dl[0]
learn.predict(dl[0])
learn.get_preds(dl=dl)