Data Science Machine Learning Data Analysis
This channel is for programmers, coders, and software engineers.

1- Data Science
2- Machine Learning
3- Data Visualization
4- Artificial Intelligence
5- Data Analysis
6- Statistics
7- Deep Learning

Cross-promotion and ads: @hussein_sheikho
How do transformers work? Learn it by hand 👇

Walkthrough

[1] Given
↳ Input features from the previous block (5 positions)
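
A minimal NumPy sketch of this setup (the shapes follow the walkthrough: 5 positions, 3 feature dimensions; the actual values here are made up, and the snippets after each step below continue from it):

  import numpy as np

  # 5 positions (rows), each a 3-dimensional feature vector from the previous block
  X = np.arange(15, dtype=float).reshape(5, 3)
  rng = np.random.default_rng(0)  # for the illustrative random weights below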

[2] Attention
↳ Feed all 5 features into a query-key attention module (QK) to obtain an attention weight matrix (A). I will skip the details of this module here and unpack it in a follow-up post.
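
As a rough sketch of what a QK module computes, here is plain scaled dot-product attention; the projection matrices W_q and W_k are illustrative placeholders, not values from the post:

  d_k = 3
  W_q = rng.standard_normal((3, d_k))  # query projection (illustrative)
  W_k = rng.standard_normal((3, d_k))  # key projection (illustrative)
  Q, K = X @ W_q, X @ W_k
  scores = Q @ K.T / np.sqrt(d_k)                  # (5, 5) position-to-position scores
  A = np.exp(scores - scores.max(axis=1, keepdims=True))
  A /= A.sum(axis=1, keepdims=True)                # softmax: each row sums to 1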

[3] Attention Weighting
↳ Multiply the input features by the attention weight matrix to obtain the attention-weighted features (Z). Note that there are still 5 positions.
↳ The effect is to combine features across positions (horizontally); in this case, X1 := X1 + X2, X2 := X2 + X3, and so on.
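
Continuing the sketch, attention weighting is a single matrix product; a row of A like [1, 1, 0, 0, 0] would implement X1 := X1 + X2:

  Z = A @ X   # (5, 5) @ (5, 3) -> (5, 3): still 5 positions, mixed across positions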

[4] FFN: First Layer
↳ Feed all 5 attention weighted features into the first layer.
↳ Multiply these features by the weight matrix and add the biases.
↳ The effect is to combine features across feature dimensions (vertically).
↳ The dimensionality of each feature is increased from 3 to 4.
↳ Note that each position is processed by the same weight matrix. This is what the term "position-wise" refers to.
↳ Note that the FFN is essentially a multilayer perceptron.
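
Continuing the sketch (W1 and b1 are again illustrative random values):

  W1 = rng.standard_normal((3, 4))   # one weight matrix shared by all 5 positions
  b1 = rng.standard_normal(4)
  H = Z @ W1 + b1                    # (5, 3) @ (3, 4) -> (5, 4): d goes 3 -> 4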

[5] ReLU
↳ Negative values are set to zero by ReLU.
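
In code, ReLU is one element-wise operation:

  H = np.maximum(H, 0.0)   # negatives become zero; shape stays (5, 4)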

[6] FFN: Second Layer
↳ Feed all 5 features (now d=4) into the second layer.
↳ The dimensionality of each feature is decreased from 4 back to 3.
↳ The output is fed to the next block to repeat this process.
↳ Note that the next block would have a completely separate set of parameters.
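
And the second layer, with illustrative W2 and b2, projects back down to d=3:

  W2 = rng.standard_normal((4, 3))
  b2 = rng.standard_normal(3)
  out = H @ W2 + b2   # (5, 4) @ (4, 3) -> (5, 3): ready for the next block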

#ai #transformers #genai #learning
