HuggingFace Course Notes, Chapter 1 (And Zero), Part 1
This notebook covers all of Chapter 0, and Chapter 1 up to "How do Transformers Work?"
Since HF in and of itself has no strict dependency requirements, they recommend installing `transformers[dev]`
so it gets all the dev requirements for "any imaginable use case".
A full list of what it installs is below:
deps = {
"Pillow": "Pillow",
"black": "black==21.4b0",
"cookiecutter": "cookiecutter==1.7.2",
"dataclasses": "dataclasses",
"datasets": "datasets",
"deepspeed": "deepspeed>=0.4.0",
"docutils": "docutils==0.16.0",
"fairscale": "fairscale>0.3",
"faiss-cpu": "faiss-cpu",
"fastapi": "fastapi",
"filelock": "filelock",
"flake8": "flake8>=3.8.3",
"flax": "flax>=0.3.4",
"fugashi": "fugashi>=1.0",
"huggingface-hub": "huggingface-hub==0.0.8",
"importlib_metadata": "importlib_metadata",
"ipadic": "ipadic>=1.0.0,<2.0",
"isort": "isort>=5.5.4",
"jax": "jax>=0.2.8",
"jaxlib": "jaxlib>=0.1.65",
"jieba": "jieba",
"keras2onnx": "keras2onnx",
"nltk": "nltk",
"numpy": "numpy>=1.17",
"onnxconverter-common": "onnxconverter-common",
"onnxruntime-tools": "onnxruntime-tools>=1.4.2",
"onnxruntime": "onnxruntime>=1.4.0",
"optuna": "optuna",
"packaging": "packaging",
"parameterized": "parameterized",
"protobuf": "protobuf",
"psutil": "psutil",
"pydantic": "pydantic",
"pytest": "pytest",
"pytest-sugar": "pytest-sugar",
"pytest-xdist": "pytest-xdist",
"python": "python>=3.6.0",
"ray": "ray",
"recommonmark": "recommonmark",
"regex": "regex!=2019.12.17",
"requests": "requests",
"rouge-score": "rouge-score",
"sacrebleu": "sacrebleu>=1.4.12",
"sacremoses": "sacremoses",
"sagemaker": "sagemaker>=2.31.0",
"scikit-learn": "scikit-learn",
"sentencepiece": "sentencepiece==0.1.91",
"soundfile": "soundfile",
"sphinx-copybutton": "sphinx-copybutton",
"sphinx-markdown-tables": "sphinx-markdown-tables",
"sphinx-rtd-theme": "sphinx-rtd-theme==0.4.3",
"sphinx": "sphinx==3.2.1",
"sphinxext-opengraph": "sphinxext-opengraph==0.4.1",
"starlette": "starlette",
"tensorflow-cpu": "tensorflow-cpu>=2.3",
"tensorflow": "tensorflow>=2.3",
"timeout-decorator": "timeout-decorator",
"timm": "timm",
"tokenizers": "tokenizers>=0.10.1,<0.11",
"torch": "torch>=1.0",
"torchaudio": "torchaudio",
"tqdm": "tqdm>=4.27",
"unidic": "unidic>=1.0.2",
"unidic_lite": "unidic_lite>=1.0.7",
"uvicorn": "uvicorn",
}
!pip install "transformers[dev]" -U >> /dev/null # Ensure we upgrade and clean the output (quoted so the brackets don't glob in the shell)
This should take a bit to run. I noticed four incompatibility errors in Colab; we'll see if they cause any issues.
!pip show transformers
Alright! We can move onto Chapter 1! 🤗
Introduction
Looks as though it's split into three main chunks:
- Introduction
- Diving in
- Advanced
Introduction gives a surface-level look at Transformer models and HF Transformers, fine-tuning a basic model, and sharing models and tokenizers.
Diving in goes further into the HF datasets and tokenizers libraries, basic NLP tasks, and how to ask for help (presumably on the forums or on Twitter?)
Advanced looks to cover specialized architectures, speeding up training, custom training loops (yay!), and contributing to HF itself.
The wonderful authors:
- Matthew Carrigan - MLE @ HF
- Lysandre Debut - MLE @ HF, worked with Transformers library from the very beginning
- Sylvain Gugger - Research Engineer @ HF, core maintainer of Transformers. And one of our favorite former fastai folk
What we will learn:
- The `pipeline` function
- The Transformer architecture
- Encoder, decoder, and encoder/decoder architectures and when to use each
Natural Language Processing
- What is it?
Classifying whole sentences or each word in a sentence, generating text content, question answering, and generating a new sentence from an input text
- Why is it challenging?
For a human, given "I am hungry" and "I am sad", we can easily tell how similar they are. That's hard for ML models.
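As an aside (my own illustration, not from the course): one crude way to have a model quantify that similarity is the `feature-extraction` pipeline (covered properly below), mean-pooling the token embeddings into a sentence vector and comparing with cosine similarity. A dedicated sentence-embedding model would do better; this is just a sketch:
import numpy as np
from transformers import pipeline

extractor = pipeline("feature-extraction")

def embed(text):
    # Output is a nested list of shape (batch, tokens, hidden); mean-pool the tokens
    return np.array(extractor(text))[0].mean(axis=0)

a, b = embed("I am hungry"), embed("I am sad")
print(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))  # cosine similarity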
The Model Hub is a super valuable resource because it contains thousands of pretrained models for you to use, and you can upload your own. The language model zoo.
Working with Pipelines, with Sylvain
Offhand note, I like that the videos are broken up into ~4-5 minute chunks
General approach to how I will take these notes:
- Watch video without notes
- Read the website and take notes
- Go back to the video and catch anything I missed
The `pipeline` is a very quick and powerful way to run inference with any HF model.
Let's break down one example they showed:
from transformers import pipeline
classifier = pipeline("sentiment-analysis")
classifier("I've been waiting for a HuggingFace course all my life!")
What did this do here?
- Downloaded a model (judging by the download bar). We don't know yet which model the default is.
- I think we downloaded a pretrained tokenizer too? (There's a sketch of how these get wired together just below.)
- Said model was the default for the `sentiment-analysis` task.
- We asked it to classify the sentiment in our sentence. Labels are positive and negative, and it gave us back an array of dictionaries with those values.
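To make the magic a bit less magical, here is a minimal sketch of what I believe `pipeline` wires up for us: a tokenizer and a model, both downloaded from the Hub. The checkpoint name below is my assumption for illustration, not something the course has confirmed yet:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"  # assumed default
tokenizer = AutoTokenizer.from_pretrained(checkpoint)           # the pretrained tokenizer
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)  # the pretrained model
classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)
classifier("I've been waiting for a HuggingFace course all my life!")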
We can also pass in multiple inputs/texts:
classifier([
"I've been waiting for a HuggingFace course my whole life.",
"I hate this so much!"
])
The default model for this task is a pretrained model fine-tuned for sentiment analysis in English. Let's see if I can't find it:
dir(classifier)
classifier.framework
type(classifier.model)
So it's a `DistilBertForSequenceClassification`, likely using the default, which would be `en-sentiment`.
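A quicker way to confirm what got loaded, using attributes that do exist on transformers configs:
print(classifier.model.config.name_or_path)  # the checkpoint identifier
print(classifier.model.config.id2label)      # e.g. {0: 'NEGATIVE', 1: 'POSITIVE'}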
Currently available pipeline tasks:
- `feature-extraction` (vector representation of a text)
- `fill-mask`
- `ner` (named entity recognition)
- `question-answering`
- `sentiment-analysis`
- `summarization`
- `text-generation`
- `translation`
- `zero-shot-classification`
Side note: I'm going to write a quick namespace class via `mk_class` in fastcore to hold all of these tasks, so I can get tab-completion.
!pip install fastcore >> /dev/null
from fastcore.basics import mk_class
cls_dict = {'FeatureExtraction':'feature-extraction',
'FillMask':'fill-mask',
'NER':'ner',
'QuestionAnswering':'question-answering',
'SentimentAnalysis':'sentiment-analysis',
'Summarization':'summarization',
'TextGeneration':'text-generation',
'Translation':'translation',
'ZeroShotClassification':'zero-shot-classification'
}
mk_class('Task', **cls_dict)
Task.FeatureExtraction
As you can see, all I've done is create a fancy namespace-like object from `fastcore` that holds my dictionary values as attributes instead.
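For comparison (my own aside), the standard library's `types.SimpleNamespace` pulls off the same trick without the fastcore dependency:
from types import SimpleNamespace

TaskNS = SimpleNamespace(**cls_dict)  # same attribute-style access
TaskNS.FeatureExtraction              # 'feature-extraction'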
Back to the HF stuff. Let's load in a pipeline:
classifier = pipeline(Task.ZeroShotClassification)
Seems this model took quite a bit longer to download, but our `Task` object is working great!
classifier(
"This is a course about the Transformers library",
candidate_labels=['education','politics','business']
)
Very interesting, so we can see right away it could tell this was educational! (Or fit the closest to that label.) I wonder how it works under the hood, something I may peruse later.
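From what I've read (not covered by the course yet, so treat this as my own hedged note): the default zero-shot pipeline is an NLI model, `facebook/bart-large-mnli`. Each candidate label is templated into a hypothesis and scored against our text as the premise. The template is exposed as a parameter, shown here with its default value:
# hypothesis_template defaults to "This example is {}."; making it explicit
classifier(
    "This is a course about the Transformers library",
    candidate_labels=['education', 'politics', 'business'],
    hypothesis_template="This example is {}."
)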
generator = pipeline(Task.TextGeneration)
generator("In this course we will teach you how to")
There are a few args we can control and pass to it, such as `num_return_sequences` and `max_length`.
The homework is to try and generate two sentences of 15 words each. Let's try that:
generator(
"In Marine Biology,",
num_return_sequences=2,
max_length=15
)
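One caveat I'll flag myself (not from the course): `max_length` counts tokens, prompt included, not words, so "15 words" is approximate at best. Newer transformers releases also accept `max_new_tokens`, which bounds only the generated continuation:
# max_new_tokens ignores the prompt length and limits just the new tokens
generator(
    "In Marine Biology,",
    num_return_sequences=2,
    max_new_tokens=15
)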
Cool! Easy to use. A headache I ran into: it's `num_return_sequences`, not `num_returned_sequences`.
Models can be found on the Model Hub. In this example we use `distilgpt2`:
generator = pipeline(Task.TextGeneration, model='distilgpt2')
generator(
"In this course, we will teach you how to",
max_length=30,
num_return_sequences=2
)
unmasker = pipeline(Task.FillMask)
unmasker('This course will teach you all about <mask> models.', top_k=2)
So here it thought the best word to fill that with was "mathematical", followed by "computational" (and it showed the filled-in sentences). `top_k` is how many possibilities are displayed. The model fills in the special `<mask>` word, and different models will have different mask tokens they try to fill. One way to check this is by looking at the mask word used in the widget (on the HF Model Hub).
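You can also ask the loaded tokenizer directly instead of checking the widget:
print(unmasker.tokenizer.mask_token)     # '<mask>' for this model
print(unmasker.tokenizer.mask_token_id)  # its integer id in the vocab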
ner = pipeline(Task.NER, grouped_entities=True)
ner("My name is Zach Mueller and I go to school in Pensacola")
What does having it not grouped do?
ner = pipeline(Task.NER, grouped_entities=False)
ner("My name is Zach Mueller and I go to school in Pensacola")
So we can see that the first run grouped "Zach" and "Mueller" together as a single entity, and "Pen", "sa", "cola" together too (likely split by the subword tokenizer). Having `grouped_entities=True` sounds like a good default in this case.
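Worth noting in case this kwarg ever throws a deprecation warning: newer transformers releases spell the same behavior as an `aggregation_strategy` instead:
# aggregation_strategy="simple" is the newer equivalent of grouped_entities=True
ner = pipeline(Task.NER, aggregation_strategy="simple")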
Most models suited to this task have some form of `POS` abbreviation in their name or tags.
qa = pipeline(Task.QuestionAnswering)
qa(
question="Where do I work?",
context="My name is Zach Mueller and I go to school in Pensacola"
)
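Note that this pipeline extracts a span from the provided context rather than generating text, so the result includes character offsets into the context:
# The answer is a literal substring of the context, located via start/end offsets
result = qa(
    question="Where do I go to school?",
    context="My name is Zach Mueller and I go to school in Pensacola"
)
print(result)  # e.g. {'score': ..., 'start': ..., 'end': ..., 'answer': 'Pensacola'}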
Summarization
Summarization reduces a text to a shorter one, while keeping most of the important aspects referenced in the text.
summarizer = pipeline(Task.Summarization)
summarizer("""
America has changed dramatically during recent years. Not only has the number of
graduates in traditional engineering disciplines such as mechanical, civil,
electrical, chemical, and aeronautical engineering declined, but in most of
the premier American universities engineering curricula now concentrate on
and encourage largely the study of engineering science. As a result, there
are declining offerings in engineering subjects dealing with infrastructure,
the environment, and related issues, and greater concentration on high
technology subjects, largely supporting increasingly complex scientific
developments. While the latter is important, it should not be at the expense
of more traditional engineering.
Rapidly developing economies such as China and India, as well as other
industrial countries in Europe and Asia, continue to encourage and advance
the teaching of engineering. Both China and India, respectively, graduate
six and eight times as many traditional engineers as does the United States.
Other industrial countries at minimum maintain their output, while America
suffers an increasingly serious decline in the number of engineering graduates
and a lack of well-educated engineers.
""")
Translation
The last task in the tutorial/lesson is machine translation. Usually the model name will have some `lang1_to_lang2` naming convention in the title, and the easiest way to pick one is to search the Model Hub. In this example we'll translate French to English (let's see how much I remember from my French classes in high school!)
translator = pipeline(Task.Translation, model='Helsinki-NLP/opus-mt-fr-en')
translator("Je m'apelle Zach, comment-vous est appelez-vous?")
We can also specify a `max_length` or `min_length` for the generated result.
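For example (values arbitrary; these kwargs are forwarded to the underlying generation call):
# Bounding the translated output length
translator(
    "Je m'appelle Zach, comment vous appelez-vous ?",
    max_length=40,
    min_length=5
)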
In the next chapter, we'll learn what's inside a `pipeline` and how to customize its behavior.