https://www.youtube.com/watch?v=yceNl9C6Ir0
Attention vs. state-space models:
Attention keeps a KV cache and selects from it (softmax selection); a state-space model compresses history into a fixed-size state.
A state-space model has a hard time recovering its exact past data.
Attention works great on top of a well-defined tokenizer, where every token carries meaningful information, but it still needs compression (the KV cache grows with sequence length).
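A minimal toy sketch of the contrast (hypothetical NumPy code, not from the video; names and shapes are illustrative assumptions):

```python
import numpy as np

def attention_step(K, V, q):
    """Attention: cache every past (key, value) pair and select via softmax.
    The KV cache (K, V) grows with sequence length; nothing is forgotten."""
    scores = K @ q                        # similarity of query to all cached keys
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()              # softmax selection over the cache
    return weights @ V                    # weighted read over all past values

def ssm_step(state, A, B, x):
    """State-space model: compress all history into a fixed-size state.
    Past inputs can only be recovered approximately from the state."""
    return A @ state + B @ x              # state size is constant in sequence length
```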
Many works integrate these two approaches.
- Think of ViT, which first compresses high-resolution pixels into patch tokens and then runs a transformer.
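A sketch of that compression step (hypothetical NumPy code; a real ViT also applies a learned linear projection to each patch):

```python
import numpy as np

def patchify(image, patch=16):
    """Compress an H x W x C image into (H//patch * W//patch) flat patch tokens."""
    H, W, C = image.shape
    return (image
            .reshape(H // patch, patch, W // patch, patch, C)
            .transpose(0, 2, 1, 3, 4)             # group pixels by patch
            .reshape(-1, patch * patch * C))      # one row = one token

img = np.random.rand(224, 224, 3)
print(patchify(img).shape)  # (196, 768): 14 x 14 tokens, each 16*16*3 values
```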
Tokenizing (preprocessing) cannot be trained end-to-end, which echoes the old hand-crafted features vs. learned features debate. We need more flexible models that can run on raw data.
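To make the raw-data point concrete, a toy contrast (hypothetical code; the vocabulary and embedding table are made up for illustration):

```python
import numpy as np

# Hand-crafted preprocessing: a fixed, non-differentiable tokenizer.
vocab = {"attention": 0, "state": 1, "space": 2}       # toy hand-built vocab
ids = [vocab[w] for w in "attention state space".split()]

# Raw-data alternative: feed bytes directly; the "tokenizer" is just a
# learnable embedding table that trains end-to-end with the model.
raw = np.frombuffer("attention state space".encode(), dtype=np.uint8)
byte_emb = np.random.randn(256, 64) * 0.02             # stands in for learned weights
x = byte_emb[raw]                                      # (seq_len, 64) model inputs
```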