Building an example `Dataset` and `DataLoader` with `NumPy`
For our data we'll first use `TabularPandas` for pre-processing. One possibility is to use `TabularPandas` for pre-processing only; another is to integrate NumPy into it directly.
from fastai.tabular.all import *

path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [Categorify, FillMissing, Normalize]
y_names = 'salary'
splits = RandomSplitter()(range_of(df))
We'll still build our regular `TabularPandas` object, as we haven't made any NumPy modifications yet:
to = TabularPandas(df, procs=procs, cat_names=cat_names, cont_names=cont_names,
y_names=y_names, splits=splits)
ds = NumpyDataset(to)
ds.bs = 3  # batch size used when indexing into the dataset
a,b,c = ds[[0]]  # categorical, continuous, and target arrays
test_eq(len(a), 3)
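`NumpyDataset` was defined earlier; as a rough, self-contained sketch of the behaviour relied on above (plain NumPy arrays standing in for a real `TabularPandas` object, and `SimpleNumpyDataset` a hypothetical name), indexing with a list of one starting index returns `bs` rows of categoricals, continuous variables, and targets:

```python
import numpy as np

class SimpleNumpyDataset:
    "Rough stand-in for NumpyDataset: holds cats, conts, and ys as arrays"
    def __init__(self, cats, conts, ys):
        self.cats, self.conts, self.ys = cats, conts, ys
        self.bs = 1  # how many rows a single indexing call returns

    def __len__(self): return len(self.cats)

    def __getitem__(self, idx):
        # idx is a list containing a starting index; slice out `bs` rows
        i = idx[0]
        return (self.cats[i:i+self.bs],
                self.conts[i:i+self.bs],
                self.ys[i:i+self.bs])

cats  = np.arange(30).reshape(10, 3)   # dummy categorical codes
conts = np.random.randn(10, 2)         # dummy continuous values
ys    = np.arange(10)                  # dummy targets
ds = SimpleNumpyDataset(cats, conts, ys)
ds.bs = 3
a, b, c = ds[[0]]  # three arrays of 3 rows each
```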
dl = NumpyDataLoader(ds, bs=3)
batch = next(iter(dl))
test_eq(len(dl), len(ds)//3+1)  # one extra partial batch, as len(ds) isn't a multiple of 3
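The length check works because a loader that keeps the final partial batch yields the ceiling of `len(ds)/bs` batches. A minimal sketch of that arithmetic:

```python
import math

def n_batches(n_items, bs, drop_last=False):
    # A loader that drops the last partial batch yields floor(n/bs) batches;
    # otherwise it yields ceil(n/bs).
    return n_items // bs if drop_last else math.ceil(n_items / bs)
```

Note that `len(ds)//3 + 1` equals this ceiling only when `len(ds)` isn't an exact multiple of 3, which holds for this dataset.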
To ensure we still see a speed improvement, we'll compare timings between the NumPy and fastai `DataLoader`s:
train_ds = NumpyDataset(to.train)
valid_ds = NumpyDataset(to.valid)
train_dl = NumpyDataLoader(train_ds, bs=64, shuffle=True, drop_last=True)
valid_dl = NumpyDataLoader(valid_ds, bs=64)
dls = to.dataloaders(bs=64)
%%timeit
# NumPy
for _ in train_dl: pass
%%timeit
# fastai
for _ in dls[0]: pass
%%timeit
# NumPy
for _ in valid_dl: pass
%%timeit
# fastai
for _ in dls[1]: pass
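Outside a notebook, where the `%%timeit` magic isn't available, the standard-library `timeit` module gives a comparable measurement. A sketch (`time_loader` is a hypothetical helper; swap the stand-in iterable for `train_dl`, `dls[0]`, etc.):

```python
import timeit

def time_loader(dl, repeat=3):
    # Best-of-`repeat` wall time, in seconds, for one full pass over `dl`
    timer = timeit.Timer(lambda: [None for _ in dl])
    return min(timer.repeat(repeat=repeat, number=1))

elapsed = time_loader(range(100_000))  # stand-in iterable
```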