Mamba, Mamba-2 and Post-Transformer Architectures for Generative AI with Albert Gu - 693

https://www.youtube.com/watch?v=yceNl9C6Ir0

 

Attention vs. state space models:

  Attention keeps a KV cache and selects from it (softmax selection); a state space model compresses the context into a fixed-size state.

  Because of that compression, a state space model has a hard time recovering its exact past data.

  Attention works great with a well-defined tokenizer, where every token carries meaningful information, but it still needs compression (its KV cache grows with the sequence).
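
Below is a minimal NumPy sketch of this contrast (my own illustration, not code from the episode; the dimensions and the simple diagonal recurrence are assumptions). Attention appends every token to a growing KV cache and reads it back with softmax selection, while the state space model folds each token into a fixed-size state.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8          # model / head dimension (illustrative)
d_state = 16   # SSM state size (illustrative)

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

# --- Attention: memory grows linearly with sequence length ---
k_cache, v_cache = [], []

def attend(q, k, v):
    """Append (k, v) to the cache, then read with softmax selection."""
    k_cache.append(k)
    v_cache.append(v)
    K = np.stack(k_cache)                 # (t, d) -- grows every step
    V = np.stack(v_cache)
    scores = softmax(K @ q / np.sqrt(d))  # softmax selection over all past tokens
    return scores @ V

# --- SSM: memory is a fixed-size state, updated by a linear recurrence ---
A = 0.95 * np.eye(d_state)         # assumed stable state transition
B = rng.normal(size=(d_state, d))  # input projection
C = rng.normal(size=(d, d_state))  # output projection
h = np.zeros(d_state)              # the state never grows

def ssm_step(x):
    """Compress the new token into the state: h = A h + B x, y = C h."""
    global h
    h = A @ h + B @ x
    return C @ h

for t in range(64):
    x = rng.normal(size=d)
    y_attn = attend(x, x, x)  # cache now holds t+1 key/value pairs
    y_ssm = ssm_step(x)       # state is still just d_state numbers

print("KV cache entries:", len(k_cache))  # 64, grows with the sequence
print("SSM state size:  ", h.shape[0])    # 16, constant
```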

Many works integrate these two aspects (hybrid attention/SSM architectures).

- Think of ViT, which first compresses high-resolution patches into tokens and then runs a transformer over them; see the sketch below.
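
As a rough illustration of the ViT point (my own sketch, not from the episode; the patch size and embedding width are assumptions), a high-resolution image is first compressed into a short sequence of patch tokens, and only that much smaller sequence is handed to attention.

```python
import numpy as np

rng = np.random.default_rng(0)

H, W, C = 224, 224, 3  # raw image: 224 * 224 * 3 = 150,528 values
patch = 16             # 16x16 non-overlapping patches
d_model = 64           # token embedding size (illustrative)

image = rng.normal(size=(H, W, C))

# Cut the image into patches and flatten each one.
patches = image.reshape(H // patch, patch, W // patch, patch, C)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * C)  # (196, 768)

# A single learned linear projection does the "compression" into tokens.
W_embed = rng.normal(size=(patch * patch * C, d_model)) * 0.02
tokens = patches @ W_embed                                                  # (196, 64)

print("raw values per image:   ", image.size)    # 150528
print("tokens fed to attention:", tokens.shape)  # (196, 64)
```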

 

Tokenization (preprocessing) cannot be trained end to end, which echoes the old hand-crafted features vs. learned features debate. We need more flexible architectures that can work directly with raw data.
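
A minimal sketch of the raw-data direction (my own illustration; the byte-level embedding here is an assumption, not something prescribed in the episode): instead of a hand-crafted tokenizer as a frozen preprocessing step, raw bytes go through a learned embedding, so the whole pipeline could in principle be trained end to end.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 32  # embedding width (illustrative)

text = "Mamba runs on raw bytes."

# Tokenizer-free path: every byte is already an integer in [0, 256).
byte_ids = np.frombuffer(text.encode("utf-8"), dtype=np.uint8)

# A learned embedding table replaces the hand-crafted vocabulary.
byte_embed = rng.normal(size=(256, d_model)) * 0.02
x = byte_embed[byte_ids]  # (sequence_length, d_model), ready for an SSM or attention stack

print("bytes in:         ", len(byte_ids))
print("embedded sequence:", x.shape)
```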