Hugging Face Transformers
Transformers is an opinionated library built for NLP researchers seeking to use, study, and extend large-scale transformer models.

The library was designed with two strong goals in mind:

  • Be as easy and fast to use as possible:

    • we strongly limit the number of user-facing abstractions to learn; in fact, there are almost no abstractions, just three standard classes required to use each model: configuration, model, and tokenizer,

    • all of these classes can be initialized in a simple and unified way from pretrained instances using a common from_pretrained() instantiation method, which takes care of downloading (if needed), caching, and loading the relevant class from a pretrained instance supplied with the library or from your own saved instance (see the sketch after this list),

    • as a consequence, this library is NOT a modular toolbox of building blocks for neural nets. If you want to extend or build upon the library, just use regular Python/PyTorch modules and inherit from the base classes of the library to reuse functionality such as model loading and saving.

  • Provide state-of-the-art models with performance as close as possible to the original models:

    • we provide at least one example for each architecture that reproduces a result provided by the official authors of said architecture,
    • the code is usually as close to the original code base as possible, which means some PyTorch code may not be as pytorchic as it could be as a result of being converted from TensorFlow code.
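As a minimal sketch of this unified instantiation pattern, assuming a recent version of the transformers package is installed and using the bert-base-uncased checkpoint purely as an example:

```python
from transformers import BertConfig, BertModel, BertTokenizer

# The same from_pretrained() method works for all three classes; it downloads
# the pretrained files if needed, caches them locally, and loads the instance.
config = BertConfig.from_pretrained("bert-base-uncased")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
```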

Three types of classes for each model

  • Model classes, e.g., BertModel, which are 20+ PyTorch models (torch.nn.Module subclasses) that work with the pretrained weights provided in the library. In TF 2.0, these are tf.keras.Model subclasses.

  • Configuration classes, which store all the parameters required to build a model, e.g., BertConfig. You don’t always need to instantiate these yourself. In particular, if you are using a pretrained model without any modification, creating the model will automatically take care of instantiating the configuration (which is part of the model).

  • Tokenizer classes, which store the vocabulary for each model and provide methods for encoding/decoding strings into lists of token embedding indices to be fed to a model, e.g., BertTokenizer.
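A short sketch of how these three classes fit together, assuming transformers and PyTorch are installed (the input string and checkpoint name are arbitrary examples):

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")  # the config is created for you

# Encode a string into vocabulary indices and feed them to the model.
input_ids = torch.tensor([tokenizer.encode("Hello, world!", add_special_tokens=True)])
with torch.no_grad():
    outputs = model(input_ids)
hidden_states = outputs[0]  # last-layer hidden states, shape (1, seq_len, hidden_size)
```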

All these classes can be instantiated from pretrained instances and saved locally using two methods:

  • from_pretrained() lets you instantiate a model/configuration/tokenizer from a pretrained version either provided by the library itself (currently 27 models are provided, as listed here) or stored locally (or on a server) by the user,

  • save_pretrained() lets you save a model/configuration/tokenizer locally so that it can be reloaded using from_pretrained().
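A minimal save/reload round-trip sketch, assuming a recent transformers version; the ./my-bert directory is a hypothetical local path:

```python
import os
from transformers import BertModel, BertTokenizer

model = BertModel.from_pretrained("bert-base-uncased")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Save model weights (plus configuration) and tokenizer vocabulary to a local directory.
os.makedirs("./my-bert", exist_ok=True)
model.save_pretrained("./my-bert")
tokenizer.save_pretrained("./my-bert")

# Reload later from that same directory with the usual from_pretrained() call.
model = BertModel.from_pretrained("./my-bert")
tokenizer = BertTokenizer.from_pretrained("./my-bert")
```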