Building an example `Dataset` and `DataLoader` with `NumPy`

For our data, we'll first use `TabularPandas` for pre-processing. One option is to use `TabularPandas` for pre-processing only; another is to integrate NumPy directly into it.

path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [Categorify, FillMissing, Normalize]
y_names = 'salary'
splits = RandomSplitter()(range_of(df))

We'll still build our regular `TabularPandas`, as we haven't made any NumPy modifications yet:

to = TabularPandas(df, procs=procs, cat_names=cat_names, cont_names=cont_names,
                   y_names=y_names, splits=splits)

class NumpyDataset

NumpyDataset(to:TabularPandas)

A Numpy dataset object from TabularPandas

ds = NumpyDataset(to)
ds.bs = 3
a,b,c = ds[[0]]
test_eq(len(a), 3)
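
The check above suggests that indexing returns a whole batch rather than a single row. A minimal sketch of that idea follows. It assumes the dataset stores its categorical, continuous, and target columns as three aligned NumPy arrays and takes the batch start from the first element of the index list. `SketchNumpyDataset` and its constructor arguments are hypothetical names used for illustration, not the library's implementation.

```python
import numpy as np

class SketchNumpyDataset:
    """Hypothetical stand-in: holds three aligned NumPy arrays and serves contiguous batches."""
    def __init__(self, cats, conts, ys, bs=1):
        self.cats = np.asarray(cats, dtype=np.int64)
        self.conts = np.asarray(conts, dtype=np.float32)
        self.ys = np.asarray(ys)
        self.bs = bs  # batch size; assumed to be set by whoever iterates the dataset

    def __getitem__(self, idx):
        # Assumption: `idx` is a list of row positions and only the first one
        # matters, because a batch is the contiguous slice that starts there.
        start = idx[0]
        stop = start + self.bs
        return self.cats[start:stop], self.conts[start:stop], self.ys[start:stop]

    def __len__(self):
        return len(self.cats)

# Toy data: 10 rows, 2 categorical columns, 3 continuous columns, binary target.
sketch_ds = SketchNumpyDataset(
    cats=np.arange(20).reshape(10, 2),
    conts=np.linspace(0, 1, 30).reshape(10, 3),
    ys=np.arange(10) % 2,
)
sketch_ds.bs = 3
cat_batch, cont_batch, y_batch = sketch_ds[[0]]
```

Slicing contiguous rows out of an array avoids per-row collation, which is one plausible reason for serving batches this way.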

class NumpyDataLoader

NumpyDataLoader(dataset, bs=1, **kwargs) :: DataLoader

Inherit from this to have all attr accesses in self._xtra passed down to self.default

dl = NumpyDataLoader(ds, bs=3)
batch = next(iter(dl))
test_eq(len(dl), len(ds)//3+1)
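
The length checked above is the number of batches, not the number of rows. As a small sketch, and assuming the loader keeps the last partial batch unless `drop_last` is set, the count is a ceiling division, so `len(ds)//3+1` matches it only when the dataset size is not a multiple of 3. The helper name `n_batches` is hypothetical.

```python
import math

def n_batches(n_items: int, bs: int, drop_last: bool = False) -> int:
    """Number of batches needed to cover `n_items` rows with batch size `bs`."""
    if drop_last:
        return n_items // bs          # the trailing partial batch is discarded
    return math.ceil(n_items / bs)    # the trailing partial batch is kept

# Compare the ceiling division with the `n // 3 + 1` shortcut for a few sizes.
sizes = {n: (n_batches(n, 3), n // 3 + 1) for n in (9, 10, 11)}
```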

NumpyDataLoader.shuffle_fn

NumpyDataLoader.shuffle_fn(x:NumpyDataLoader)

Shuffle the interior dataset
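
Shuffling the interior dataset only works if every array is reordered the same way, otherwise inputs and targets drift apart. The sketch below shows one way to do that with a single shared permutation. The function name `shuffle_aligned` and the `seed` parameter are assumptions made for this example.

```python
import numpy as np

def shuffle_aligned(cats, conts, ys, seed=None):
    """Return the three arrays reordered by one shared random permutation."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(cats))
    # Indexing every array with the same permutation keeps rows aligned.
    return cats[perm], conts[perm], ys[perm]

# Toy arrays built so that row i holds i, 10*i and 100*i respectively.
cats = np.arange(6).reshape(6, 1)
conts = np.arange(6, dtype=np.float32).reshape(6, 1) * 10
ys = np.arange(6) * 100
s_cats, s_conts, s_ys = shuffle_aligned(cats, conts, ys, seed=0)
```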

NumpyDataLoader.get_idxs

NumpyDataLoader.get_idxs(x:NumpyDataLoader)

Get indices to select
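
One way to read this, offered as an assumption rather than a description of the library's code: if the arrays have already been shuffled, the indices themselves can stay sequential, and each batch only needs its starting row. A sketch with the hypothetical helper `batch_starts`:

```python
def batch_starts(n_items, bs, drop_last=False):
    """Yield the starting row of each contiguous batch."""
    # With drop_last, stop early enough that every batch is full-sized.
    stop = n_items - bs + 1 if drop_last else n_items
    yield from range(0, stop, bs)

rows = list(range(10))
# Each batch is the contiguous slice beginning at a start position.
batches = [rows[s:s + 3] for s in batch_starts(len(rows), 3)]
full_batches = [rows[s:s + 3] for s in batch_starts(len(rows), 3, drop_last=True)]
```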

To check that the NumPy loaders are actually faster than fastai's own, we'll time a full pass over each:

train_ds = NumpyDataset(to.train)
valid_ds = NumpyDataset(to.valid)
train_dl = NumpyDataLoader(train_ds, bs=64, shuffle=True, drop_last=True)
valid_dl = NumpyDataLoader(valid_ds, bs=64)
dls = to.dataloaders(bs=64)
%%timeit
# Numpy
for _ in train_dl: pass
31.2 ms ± 35.8 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
# fastai
for _ in dls[0]: pass
1.02 s ± 784 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
# Numpy
for _ in valid_dl: pass
7.35 ms ± 12.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
# fastai
for _ in dls[1]: pass
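250 ms ± 1.24 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

On this run the NumPy loaders are a little over 30 times faster than the fastai ones, both on the training set (31.2 ms vs. 1.02 s) and on the validation set (7.35 ms vs. 250 ms).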
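
`%%timeit` is an IPython magic, so it is not available in a plain script. The sketch below approximates the same measurement with the standard library's `timeit` module, reporting the mean and standard deviation of the per-loop time over several runs. The helper name `time_full_pass` is hypothetical, and a `range` stands in for a data loader.

```python
import statistics
import timeit

def time_full_pass(iterable, repeat=7, number=10):
    """Time full passes over a re-iterable object; returns (mean, std) in seconds per loop."""
    def one_pass():
        for _ in iterable:
            pass
    # Each entry of `totals` is the time for `number` consecutive passes.
    totals = timeit.repeat(one_pass, repeat=repeat, number=number)
    per_loop = [t / number for t in totals]
    return statistics.mean(per_loop), statistics.stdev(per_loop)

# Stand-in for a loader such as `train_dl`; any re-iterable object works.
mean_s, std_s = time_full_pass(range(1000), repeat=3, number=5)
```

Passing `train_dl` or `dls[0]` in place of the `range` should give numbers comparable to the cells above, though absolute timings depend on the machine.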

class NumpyDataLoaders

NumpyDataLoaders(to, bs=64, val_bs=None, shuffle_train=True, device='cpu', **kwargs) :: DataLoaders

Basic wrapper around several DataLoaders.