Zero to Hero with fastai - Intermediate
A general overview of the major differences between the old fastai and the new
Zero to Hero
The “Zero to Hero” series is a series of three articles geared towards getting anyone familiar with the fastai library, each based upon a different skill set. The previous article is aimed at those who have barely heard of “Deep Learning” and have zero experience with frameworks. This article comes from the perspective of those who utilized the original fastai library in the past and understand the broad strokes of the library. Finally, the last article will briefly explain the advanced artifacts inside of fastai and how they all function. > Note: These articles presume you have read the previous one to avoid redundancy; please read it before continuing so you have the necessary context.
Who am I
My name is Zach Mueller. I’ve been heavily involved with and using fastai (including the newest version) for the better part of two years now. I’ve designed my own course geared around the library from an implementation standpoint without getting too complex. At the time of writing I’m still an undergraduate at the University of West Florida majoring in Computer Science. I’m also heavily involved in the fastai community, which I would implore you to join! My goal is to make fastai more approachable at all levels through examples, and to help further Jeremy’s dream in the process. My specific interests involve speeding up the framework for deployment, tabular neural networks, and providing overall usage guides to help further the community.
What will we cover in this article?
In the second iteration of “Zero to Hero” we will go through the major differences between the two APIs. We will look in more detail at the high-level DataBlock API, with 1:1 code examples showing how to adjust your code from the old fastai. Afterwards we will briefly look into the mid-level API and transforms to see how simple it can be to customize and adapt the framework to what you want, through two separate examples. Finally, we will go into customizing test sets to include labelled and non-labelled data.
Installing the library
First let’s install fastai:
!pip install fastai -qqq
What’s new?
Let’s first look at two sets of code for a tabular task, specifically the Adult Sample dataset. > Note: Some code cells are marked # DO NOT RUN. Do not run these if you choose to open this notebook in Colaboratory, as they reference the old codebase and will not work anymore.
As per usual, we’ll import the tabular library and use untar_data to grab the dataset:
from fastai.tabular.all import *
path = untar_data(URLs.ADULT_SAMPLE)
Then we will open the DataFrame in pandas:
df = pd.read_csv(path/'adult.csv')
df.head()
In both versions we still need to define our variables and procs, and the naming for each has not changed:
dep_var = 'salary'
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [FillMissing, Categorify, Normalize]
However what did change was the API. Before our code would have looked like so:
# DO NOT RUN
data = (TabularList.from_df(df, path=path, cat_names=cat_names, cont_names=cont_names, procs=procs)
        .split_by_idx(list(range(800,1000)))
        .label_from_df(cols=dep_var)
        .databunch())
Here we build a TabularList, then split that list, label it, and finally turn it into a DataBunch. This is gone, or at least simplified, in the new version. Instead we have TabularPandas, a complete rework of the tabular API that differs from the rest of the API. First, there are dedicated Splitter classes that we can call and use depending on our task. Since we are splitting by a list of indices, it makes sense to utilize the IndexSplitter class. To use it, we instantiate the class with our list of indices to split by, and then split our dataset by its indices. To explore more of these splitters, see the documentation.
splits = IndexSplitter(list(range(800,1000)))(range_of(df)); splits
Then we can pass everything to TabularPandas:
to = TabularPandas(df, procs=procs, cat_names=cat_names, cont_names=cont_names,
                   y_names='salary', splits=splits)
Something very unique and nice about TabularPandas is that we can actually use it in more than just fastai! To see how we can utilize it with Random Forests and XGBoost, see my course notebook where this is discussed.
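As a quick illustration (my own sketch here, not the course notebook itself), the processed tables are exposed through the .xs and .ys accessors, so handing them to scikit-learn looks like this:
from sklearn.ensemble import RandomForestClassifier
# `to` is the TabularPandas object built above; .xs/.ys hold the processed columns
X_train, y_train = to.train.xs, to.train.ys.values.ravel()
X_valid, y_valid = to.valid.xs, to.valid.ys.values.ravel()
rf = RandomForestClassifier(n_estimators=40)
rf.fit(X_train, y_train)
print(rf.score(X_valid, y_valid))  # accuracy on the validation split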
And now we can build our DataLoaders:
dls = to.dataloaders(bs=512)
From here, the API remains the same. We have a tabular_learner and we can fit, fit_one_cycle, etc. One minor change: to find the learning rate, it’s now just lr_find:
learn = tabular_learner(dls, layers=[200,100], metrics=accuracy)
learn.lr_find()
It will also return the two suggested learning rates shown at the top of the graph.
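As a small sketch, in the version current at the time of writing you can unpack those two suggestions and feed one straight into training (the epoch count here is illustrative):
lr_min, lr_steep = learn.lr_find()  # the two suggested rates
learn.fit_one_cycle(3, lr_steep)    # train with the 'steepest slope' suggestion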
A Text Example
The next major upgrade is the text API. This is a true example of the high-level DataBlock API discussed in the previous article. It follows the same overall pattern we saw in TabularPandas, just a bit more split up. Let’s take this example from my slides:
We’ll run through this step by step with a text example using IMDB_SAMPLE. Let’s load the library and grab our data:
from fastai.text.all import *
path_imdb = untar_data(URLs.IMDB_SAMPLE)
df_imdb = pd.read_csv(path_imdb/'texts.csv')
In the first version of fastai, building our language model DataLoader would look something like this:
# DO NOT RUN
data = (TextList.from_csv(path, 'texts.csv', cols='text')
        .split_by_rand_pct(0.1)
        .label_for_lm()
        .databunch(bs=8))
Now let’s convert this to the new API, following our pipeline description above.
- Define your input and output blocks:
Here we have text as an input, so we use TextBlock.from_df, since we have a DataFrame. > Note: text is a bit different in this regard, as things get tokenized when generating the Dataset. As a result, we have .from_df and .from_folder, specifying where the data comes from:
blocks = TextBlock.from_df(text_cols='text', res_col_name='text', is_lm=True)
An important distinction here: when tokenized, text_cols will be removed from our DataFrame and replaced with res_col_name, which is text by default. Finally, we specify that this is a language model by passing is_lm=True.
- Define our getters
Now we need to tell fastai how to grab the data. Our text is stored in a column named text, so we will use a ColReader to grab it:
get_x = ColReader('text')
- Split the Data
We’ll use another splitter like we did for tabular, this time RandomSplitter. When calling our DataBlock we won’t need to pass in the direct indices; fastai will do this for us, so we can define it as below:
splitter = RandomSplitter(valid_pct=0.2, seed=42)
- Label the Data
We have already done this by setting is_lm to True back when we defined our blocks. When we examine a non-language-model classification example next, you will be able to see the difference.
- Build the DataLoaders
Now let’s build our DataBlock by passing in what we have:
dblock = DataBlock(blocks=blocks,
                   get_x=get_x,
                   splitter=splitter)
And we can build our DataLoaders:
dls = dblock.dataloaders(df_imdb, bs=8)
Let’s look at an example batch:
dls.show_batch(max_n=2)
Now if we wanted to train, the API still looks the same: we call language_model_learner and pass in our data. We won’t train in this example, though, as that can take a while with language models:
lm_learn = language_model_learner(dls, arch=AWD_LSTM, metrics=accuracy)
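If you did want to train, a minimal sketch would look like the below (the learning rate and epoch count are illustrative, not tuned values):
lm_learn.fit_one_cycle(1, 2e-2)          # fine-tune the language model
lm_learn.save_encoder('fine_tuned_enc')  # save the encoder to reuse for classification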
Now let’s move on to a text classification example. This only requires two major changes to what we had before in our DataBlock: the addition of another block and a get_y to tell fastai where our label is:
blocks = (TextBlock.from_df(text_cols='text', res_col_name='text', is_lm=False), CategoryBlock())
We set is_lm to False (the default) and added a CategoryBlock, telling fastai we will be dealing with a classification problem. Next we need a get_y to say where the label is. It’s still in that same DataFrame, so we can use another ColReader:
get_y = ColReader('label')
Finally, we’ll make a new splitter that splits from a column, as our DataFrame has an is_valid column:
splitter = ColSplitter(col='is_valid')
Now let’s remake our DataBlock:
clas_dblock = DataBlock(blocks=blocks,
                        get_x=get_x,
                        get_y=get_y,
                        splitter=splitter)
And make some new DataLoaders:
dls = clas_dblock.dataloaders(df_imdb, bs=8)
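From here a classifier follows the same Learner pattern; a minimal sketch (again with illustrative hyperparameters) would be:
clas_learn = text_classifier_learner(dls, AWD_LSTM, metrics=accuracy)
clas_learn.fit_one_cycle(1, 2e-2)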
And that’s it for the text example! Now you’ve seen the basic building blocks and how it all works. For the final example we’ll walk through the PETS dataset as we did in the previous article, and recreate it with the API.
Pets
from fastai.vision.all import *
Let’s first grab our data:
path = untar_data(URLs.PETS)
We’ll define our blocks again. This time, since we have an image problem, we’ll use an ImageBlock and re-use CategoryBlock:
blocks = (ImageBlock(cls=PILImage), CategoryBlock())
Note that we can define sub-classes for blocks to use. If we were doing a black-and-white image problem (such as MNIST), we could define our ImageBlock as ImageBlock(cls=PILImageBW).
Next we want our getters. This is actually just as simple as get_image_files. Why? Let’s look:
imgs = get_image_files(path/'images')
imgs[0]
Here we have a list of our images, so this is all we actually need, since both our x and y are there.
Next we want to split the data. We did a random 80/20 split in the first article, so we will repeat this here using RandomSplitter:
splitter = RandomSplitter(valid_pct=0.2, seed=42)
Finally we need our labeller, which is a RegexLabeller:
get_y = RegexLabeller(pat=r'/([^/]+)_\d+.*')
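To see what that pattern extracts, here is a quick sanity check with plain re (the filename below is made up for illustration):
import re
fname = '/data/images/great_pyrenees_107.jpg'
re.search(r'/([^/]+)_\d+.*', fname).group(1)  # -> 'great_pyrenees'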
Now before we continue we need some item and batch transforms to augment our data:
item_tfms = RandomResizedCrop(460, min_scale=0.75, ratio=(1.,1.))
batch_tfms = [*aug_transforms(size=224, max_warp=0), Normalize.from_stats(*imagenet_stats)]
And finally we can work this into our DataBlock:
dblock = DataBlock(blocks=blocks,
                   get_items=get_image_files,
                   splitter=splitter,
                   get_y=get_y,
                   item_tfms=item_tfms,
                   batch_tfms=batch_tfms)
Let’s build our data:
dls = dblock.dataloaders(path/'images', bs=64)
And view some data just to be sure:
dls.show_batch()
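If you wanted to carry on and train a model on this data, a minimal transfer-learning sketch (the architecture and epoch count here are illustrative) would be:
learn = cnn_learner(dls, resnet34, metrics=accuracy)
learn.fine_tune(1)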
We have now seen three major examples of the API from a DataLoader perspective. Along with this article, I invite you to read my other articles related to the DataBlock API, as they cover a few more specifics regarding the API.
Test Sets
Finally, I mentioned the addition of labelled and non-labelled test sets. Back in the old fastai, when you used add_test for a test set and wanted it labelled, you had to do a weird workaround. This is no longer the case. With fastai’s test_dl method we can pass with_labels=True and it will attempt to label our data, provided it is labelled in the same format as the training data.
Note: tabular problems will always assume with_labels to be True if the y is present in the DataFrame
Now let’s first use the defaults for test_dl on some data:
dl = dls.test_dl(imgs[:10], with_labels=False)
We can look at a batch:
dl.show_batch()
And we just see the images, with no labels! Now if we change this:
dl = dls.test_dl(imgs[:10], with_labels=True)
dl.show_batch()
We have our labels again! This is fantastically nice, as you can then just pass this DataLoader into learn.validate by doing learn.validate(dl=dl) and there will be no complaints!
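For example, assuming learn is a trained Learner with accuracy as its only metric:
loss, acc = learn.validate(dl=dl)  # validate returns the loss followed by each metric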
Minor Changes and Closing Thoughts
Finally, let’s cover some minor naming changes.
- Passing callbacks to our Learner and during any fit is now done with cbs rather than callbacks (see the sketch below)
- Callbacks have more events in which you are able to adjust, and their naming is slightly different
- Metrics should inherit from AccumMetric, but loss functions do not need this
Everything else is a major API change or difference altogether.
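Here is a quick sketch of the new cbs naming, using two built-in callbacks purely as examples:
learn = cnn_learner(dls, resnet34, metrics=accuracy, cbs=[ShowGraphCallback()])
learn.fit_one_cycle(1, cbs=[EarlyStoppingCallback(monitor='accuracy', patience=2)])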
Thank you so much for reading, and I implore you to check out the new library! It’s been carefully crafted over the last year and a half (since the previous part 2) and has really turned into something special. These first two articles I wanted out the day of release, so part 3 will take me a few more days. Thanks again for reading and have fun exploring!