Xavier Bresson    @xbresson    11/23/2021      

Sharing my lecture slides on Attention Nets/Transformers, with two simple code examples for (1) Language Modeling and (2) Sequence-To-Sequence Modeling, to understand Transformers from scratch. Slides: https://t.co/NQJnriY2Yt Codes: https://t.co/2RdgeANrbc
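Not the code from the linked repo; just a minimal sketch, in PyTorch, of what the first example (a from-scratch-style Transformer language model with a causal attention mask) might look like. The class name, model sizes, and toy vocabulary are made up for illustration.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyTransformerLM(nn.Module):
    """Decoder-style language model: embeddings -> Transformer layers with a
    causal mask (no attending to future tokens) -> projection to vocab logits."""
    def __init__(self, vocab_size=1000, d_model=128, nhead=4, num_layers=2, max_len=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Parameter(torch.zeros(max_len, d_model))   # learned positional encodings
        layer = nn.TransformerEncoderLayer(d_model, nhead,
                                           dim_feedforward=4 * d_model, batch_first=True)
        self.layers = nn.TransformerEncoder(layer, num_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)
        self.d_model = d_model

    def forward(self, tokens):                       # tokens: (batch, seq_len) integer ids
        seq_len = tokens.size(1)
        x = self.embed(tokens) * math.sqrt(self.d_model) + self.pos[:seq_len]
        # Upper-triangular -inf mask: position i may only attend to positions <= i.
        causal = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
        h = self.layers(x, mask=causal)
        return self.lm_head(h)                       # (batch, seq_len, vocab_size) next-token logits

model = TinyTransformerLM()
tokens = torch.randint(0, 1000, (2, 16))             # toy batch: 2 sequences of 16 token ids
logits = model(tokens)
# Standard LM objective: predict token t+1 from tokens up to t.
loss = F.cross_entropy(logits[:, :-1].reshape(-1, 1000), tokens[:, 1:].reshape(-1))
```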
  
  Related  

/MachineLearning    @slashML    12/2/2021      

(Rant) Most of us just pretend to understand Transformers https://t.co/Ulew5dfHnK
  

DeepMind    @DeepMind    7/6/2021      

Like Transformers, Perceivers process inputs using attention. But unlike Transformers, they first map inputs to a small latent space where processing is cheap & doesn’t depend on the input size. This allows us to build deep networks even when using large inputs like images. (2/)
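A rough sketch of that idea (not DeepMind's implementation; sizes and names below are arbitrary): a small set of learned latent vectors cross-attends to the large input once, and the deep self-attention stack then runs only over the latents, so its cost does not grow with the input size.

```python
import torch
import torch.nn as nn

class LatentCrossAttention(nn.Module):
    """Toy Perceiver-style block: N learned latents attend to M input tokens.
    The cross-attention costs O(N*M); the deep stack that follows only sees
    the N latents, so its cost is independent of the input size M."""
    def __init__(self, num_latents=64, d_model=256, nhead=4, depth=4):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, d_model) * 0.02)
        self.cross_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.latent_transformer = nn.TransformerEncoder(layer, depth)  # self-attention over latents only

    def forward(self, inputs):                        # inputs: (batch, M, d_model), M can be huge
        batch = inputs.size(0)
        z = self.latents.unsqueeze(0).expand(batch, -1, -1)        # (batch, N, d_model)
        z, _ = self.cross_attn(query=z, key=inputs, value=inputs)  # latents read from the input once
        return self.latent_transformer(z)             # deep processing at latent size N, not input size M

x = torch.randn(2, 50 * 50, 256)                      # e.g. a flattened 50x50 feature map
out = LatentCrossAttention()(x)                       # (2, 64, 256), independent of the 2500 input tokens
```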
  

Aran Komatsuzaki    @arankomatsuzaki    11/29/2021      

Sparse is Enough in Scaling Transformers: sparsifying attention and FFN leads to a dramatic speedup in decoding. https://t.co/RySSyfZTvo
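To give a flavor of why a sparse FFN helps at decode time, here is a toy top-k sparsification sketch; this is not the paper's actual mechanism, only a generic illustration of activating a small subset of FFN units per token so the second projection only needs a small slice of its weights.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKSparseFFN(nn.Module):
    """Toy sparse feed-forward block: keep only the k largest activations per
    token after the first projection and zero out the rest."""
    def __init__(self, d_model=256, d_ff=1024, k=64):
        super().__init__()
        self.w_in = nn.Linear(d_model, d_ff)
        self.w_out = nn.Linear(d_ff, d_model)
        self.k = k

    def forward(self, x):                    # x: (batch, seq_len, d_model)
        h = F.relu(self.w_in(x))             # (batch, seq_len, d_ff)
        topk = torch.topk(h, self.k, dim=-1)
        sparse_h = torch.zeros_like(h).scatter_(-1, topk.indices, topk.values)
        # Dense matmul here for simplicity; a real implementation would only
        # touch the k selected columns of w_out, which is where the speedup comes from.
        return self.w_out(sparse_h)

x = torch.randn(2, 8, 256)
y = TopKSparseFFN()(x)                       # (2, 8, 256)
```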
  

Xavier Bresson    @xbresson    12/1/2021      

I was asked about self-attention and cross-attention. See slides 53-58, which intuitively describe SA & CA (with adaptive context, receptive field, hierarchy) and why it is essential to use multiple layers for deep representation and multi-step reasoning. https://t.co/NQJnriY2Yt
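To make the SA/CA distinction concrete (not the slides' code, just a minimal sketch with arbitrary shapes): in self-attention the queries, keys, and values all come from the same sequence, while in cross-attention the queries come from one sequence and the keys/values from another (e.g. decoder states attending to encoder output).

```python
import torch
import torch.nn as nn

d_model, nhead = 128, 4
attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)

x = torch.randn(2, 10, d_model)   # one sequence (e.g. decoder states)
y = torch.randn(2, 25, d_model)   # another sequence (e.g. encoder output)

# Self-attention: Q, K, V all come from x -> each position of x looks at the rest of x.
sa_out, _ = attn(query=x, key=x, value=x)      # (2, 10, d_model)

# Cross-attention: Q from x, K and V from y -> each position of x reads from y.
ca_out, _ = attn(query=x, key=y, value=y)      # (2, 10, d_model)
```

Stacking several such layers is what gives the adaptive context and multi-step reasoning mentioned in the slides: each layer re-reads the (already attention-mixed) representations produced by the previous one.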
  