Lecture Review: Transformer

1) Scaled Dot-Product Attention
input: Query, Key, Value (vectors)
output: a weighted sum of the values, computed for multiple queries at once (see the code sketch after this list)

2) Multi-head Attention
uses W matrices to map Q, K, V into h lower-dimensional spaces

3) Block-Based Model
each block consists of 2 sub-layers: multi-head attention and a two-layer feed-forward NN (with ReLU)
each sub-layer is wrapped with a residual connection and layer normalization → LayerNorm(x + sublayer(x))
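The two attention steps above can be sketched roughly as follows. This is a minimal NumPy sketch, not code from the lecture: the scaling by sqrt(d_k) and the per-head projection matrices (W_q, W_k, W_v and the output projection W_o) follow the standard Transformer formulation, and all names and shapes here are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax along the chosen axis
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q: (n_q, d_k), K: (n_kv, d_k), V: (n_kv, d_v)
    # output: one weighted sum of the values per query row
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n_q, n_kv) similarity scores
    weights = softmax(scores, axis=-1)   # attention distribution per query
    return weights @ V                   # (n_q, d_v)

def multi_head_attention(Q, K, V, W_q, W_k, W_v, W_o):
    # W_q, W_k, W_v: lists of h projection matrices (d_model -> d_k)
    # that map Q, K, V into h lower-dimensional spaces; W_o: (h*d_k, d_model)
    heads = [
        scaled_dot_product_attention(Q @ wq, K @ wk, V @ wv)
        for wq, wk, wv in zip(W_q, W_k, W_v)
    ]
    # concatenate the h heads and project back to the model dimension
    return np.concatenate(heads, axis=-1) @ W_o
```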
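The block structure from 3), i.e. LayerNorm(x + sublayer(x)) around each of the two sub-layers, could then be sketched like this. Again an assumption-laden sketch: layer_norm omits the learnable scale/shift parameters, and feed_forward is a hypothetical two-layer ReLU network standing in for the block's feed-forward sub-layer.

```python
def layer_norm(x, eps=1e-5):
    # normalize each position's feature vector (simplified: no learnable gain/bias)
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def feed_forward(x, W1, b1, W2, b2):
    # two-layer feed-forward NN with ReLU
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

def transformer_block(x, attn_params, ffn_params):
    # sub-layer 1: multi-head self-attention, then residual + LayerNorm
    x = layer_norm(x + multi_head_attention(x, x, x, *attn_params))
    # sub-layer 2: feed-forward NN, then residual + LayerNorm
    x = layer_norm(x + feed_forward(x, *ffn_params))
    return x
```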