There has been a lot of hype recently around the GPT-3 language model developed by OpenAI. I will dig deeper into the architecture of this kind of language model using PyTorch.
OpenAI released a public API that allows people to access their GPT language model. Paul Graham claims it might be the next Altair, which justifies me spending the afternoon looking into it.
"Maybe. It may be the Altair." (Paul Graham, @paulg, July 19, 2020)
I will be following the Transformer tutorial from PyTorch (since I use PyTorch at work), which implements the famous paper Attention Is All You Need. AFAIK this 2017 paper was the seminal work that started the new trend in NN architectures (BERT, ELM) that eventually led to the GPT models from OpenAI. To go over the paper there is a nice 1h Paper Reading Group by Rachael Tatman and also a famous blog post, The Illustrated Transformer, by Jay Alammar. Another interesting blog post is The Annotated Transformer from Harvard NLP, which is the text of the paper plus additional PyTorch code written from the ground up.
>>> Building a Transformer model using torch.nn.TransformerEncoder
The tutorial is about training an nn.TransformerEncoder to assign a probability to a given word (or sequence of words) following a sequence of words. Given the Encoder and Decoder, building the model is pretty simple, as explained in the define-the-model section.
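To make "assign a probability to the next word" concrete, here is a minimal sketch in plain Python (no PyTorch) of how raw model scores could be turned into next-word probabilities with a softmax. The four-word vocabulary and the scores are made up for illustration; a real model produces one score per word of a vocabulary with thousands of entries.

```python
import math

def softmax(scores):
    """Turn raw scores (logits) into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical toy vocabulary and invented scores for the word
# that follows "the cat sat on the".
vocab = ["mat", "dog", "moon", "sat"]
logits = [3.0, 1.0, 0.5, -1.0]

probs = softmax(logits)
next_word_probs = dict(zip(vocab, probs))

# The word with the highest score also gets the highest probability.
best = max(next_word_probs, key=next_word_probs.get)
print(best, next_word_probs)
```

Training then amounts to nudging the scores so that the word that actually came next in the text gets a higher probability.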
We take our text input, and we add two parts to it.
- mask = self._generate_square_subsequent_mask(len(src)): the most straightforward is the mask, which ensures that the prediction for position i depends only on the known outputs at positions less than i, by masking out the later positions.
- src = self.pos_encoder(src): the second is the positional encoding. We can't just pass plain text to the NN; we need to encode it with relative and absolute positional data that might look like "this is the third word in the sentence, the one before cat". The actual encoding is more complex, and is done using sine and cosine functions of different frequencies.
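The two ingredients above can be sketched in plain Python to see the actual numbers. This is not the tutorial's code (which works on tensors); the function names are my own, and the formulas are the ones I understand from the paper: the mask puts -inf on every future position, and the positional encoding uses sin on even dimensions and cos on odd dimensions.

```python
import math

def square_subsequent_mask(size):
    """Row i holds 0.0 for positions <= i (visible) and -inf for positions > i (hidden)."""
    return [[0.0 if j <= i else float("-inf") for j in range(size)]
            for i in range(size)]

def positional_encoding(max_len, d_model):
    """Sinusoidal encoding: PE[pos][2k] = sin(pos / 10000^(2k/d_model)),
    PE[pos][2k+1] = cos(pos / 10000^(2k/d_model))."""
    pe = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for i in range(0, d_model, 2):  # i walks over the even dimensions
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

mask = square_subsequent_mask(3)
for row in mask:
    print(row)

pe = positional_encoding(max_len=4, d_model=6)
for pos, row in enumerate(pe):
    print(pos, [round(v, 3) for v in row])
```

The mask is added to the attention scores before the softmax, so a -inf entry ends up with zero attention weight. The positional encoding is added to the word embeddings, so the same word gets a slightly different vector at each position.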
Once the above is in place, we just run our data through the Encoder and Decoder and we have our Transformer! Nice :) But of course I want to understand what is inside this nn.TransformerEncoder, so before running the training I will have a look inside it.
>>> Inside the Encoder: self-attention
To follow what happens inside the encoder we can look at the nn.TransformerEncoder source code, which has pretty decent docs. At the same time it is pretty helpful to keep an eye on both Self-Attention at a High Level from jalammar, which has a nice plain-English explanation of what's going on, and the attention code from the original paper. We are also at part two of the Kaggle Reading Group.
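As a reading aid for those sources, here is a minimal single-head sketch of scaled dot-product self-attention in plain Python, i.e. softmax(Q K^T / sqrt(d_k)) V with an optional causal mask. It is not the PyTorch implementation: there are no multiple heads, no biases and no dropout, and the helper names and the identity weight matrices in the example are assumptions chosen to keep the numbers easy to follow.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]  # exp(-inf) is 0.0, so masked entries vanish
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both given as lists of lists."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(col) for col in zip(*a)]

def self_attention(x, w_q, w_k, w_v, causal=True):
    """x holds one row per token; returns (output rows, attention weights)."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    scores = matmul(q, transpose(k))  # one score per (query token, key token) pair
    weights = []
    for i, row in enumerate(scores):
        scaled = [s / math.sqrt(d_k) for s in row]
        if causal:
            # Hide future tokens: position i only sees positions <= i.
            scaled = [s if j <= i else float("-inf") for j, s in enumerate(scaled)]
        weights.append(softmax(scaled))
    return matmul(weights, v), weights

# Three toy "token embeddings" of size 2, and identity projections for readability.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
out, weights = self_attention(x, identity, identity, identity)
for w in weights:
    print([round(p, 3) for p in w])
```

Each row of weights says how much one token looks at every other token. With the causal mask the first token can only look at itself, so its output is just its own value vector.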
>>> Visualising the inner workings of transformers (BERT)
After some theory I want to get an intuitive understanding of what happens inside a transformer. A good approach is to examine sentence embeddings using an encoder like BERT (the name stands for Bidirectional Encoder Representations from Transformers). A good repo is bert-as-service, which offers some helpful code snippets. The one I will try to use is this inner layers dimensionality reduction script: it takes one of the internal transformer layers of BERT of size [N*H] and uses PCA to project it onto a 2D plane to visualise how different sentences have different embeddings.
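To show what the PCA step does, here is a rough sketch in plain Python that projects N vectors of size H onto their top two principal components using power iteration. I am reading [N*H] as N sentences by H hidden dimensions, which is my assumption, and the six 4-dimensional rows below are invented stand-ins for embeddings, not real BERT output. In practice a library routine such as scikit-learn's PCA would be the usual choice; this version only exists to make the steps visible.

```python
import math

def pca_2d(rows, iters=200):
    """Project N vectors of size H onto their top-2 principal components."""
    n, h = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(h)]
    centered = [[r[j] - mean[j] for j in range(h)] for r in rows]
    # Sample covariance matrix (H x H) of the centered data.
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(h)]
           for a in range(h)]

    components = []
    for _ in range(2):
        v = [1.0 / math.sqrt(h)] * h  # deterministic start vector
        for _ in range(iters):
            # Power iteration: repeatedly apply cov and renormalise.
            w = [sum(cov[a][b] * v[b] for b in range(h)) for a in range(h)]
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        eigval = sum(v[a] * sum(cov[a][b] * v[b] for b in range(h))
                     for a in range(h))
        components.append(v)
        # Deflation: remove the direction just found before looking for the next one.
        cov = [[cov[a][b] - eigval * v[a] * v[b] for b in range(h)]
               for a in range(h)]

    return [[sum(c[j] * comp[j] for j in range(h)) for comp in components]
            for c in centered]

# Invented toy "embeddings": the first three rows mimic one topic, the last three another.
rows = [
    [1.0, 0.2, 0.1, 0.0],
    [0.9, 0.1, 0.3, 0.1],
    [1.1, 0.3, 0.0, 0.2],
    [0.0, 1.0, 0.2, 0.9],
    [0.1, 0.9, 0.1, 1.1],
    [0.2, 1.1, 0.3, 1.0],
]
points = pca_2d(rows)
for p in points:
    print([round(v, 3) for v in p])
```

Each sentence ends up as an (x, y) point that can go straight into a scatter plot. In this toy data the two groups land on opposite sides of the first axis.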
The results look good. I embedded sentences from the music and computer pages of Wikipedia and I plotted their BERT representation on a 2D plane. The topic distribution looks coherent.