https://www.youtube.com/watch?v=yceNl9C6Ir0
Attention vs. state-space models:
Attention keeps a KV cache and selects from it (softmax selection); a state-space model compresses history into a fixed-size state.
A state-space model has a hard time recovering its exact past data.
Attention works great on top of a well-defined tokenizer, where every token carries meaningful information, but it still needs compression (the KV cache grows with sequence length).
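A minimal toy sketch of the contrast (hypothetical NumPy code, not from the video; names and shapes are illustrative assumptions):

```python
import numpy as np

def attention_step(K, V, q):
    """Attention: cache every past (key, value) pair and select via softmax.
    The KV cache (K, V) grows with sequence length; nothing is forgotten."""
    scores = K @ q                        # similarity of query to all cached keys
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()              # softmax selection over the cache
    return weights @ V                    # weighted read over all past values

def ssm_step(state, A, B, x):
    """State-space model: compress all history into a fixed-size state.
    Past inputs can only be recovered approximately from the state."""
    return A @ state + B @ x              # state size is constant in sequence length
```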
Many works integrate these two approaches.
- Think of ViT, which first compresses high-resolution pixels into patch tokens and then runs a transformer.
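A sketch of that compression step (hypothetical NumPy code; a real ViT also applies a learned linear projection to each patch):

```python
import numpy as np

def patchify(image, patch=16):
    """Compress an H x W x C image into (H//patch * W//patch) flat patch tokens."""
    H, W, C = image.shape
    return (image
            .reshape(H // patch, patch, W // patch, patch, C)
            .transpose(0, 2, 1, 3, 4)             # group pixels by patch
            .reshape(-1, patch * patch * C))      # one row = one token

img = np.random.rand(224, 224, 3)
print(patchify(img).shape)  # (196, 768): 14 x 14 tokens, each 16*16*3 values
```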
Tokenizing (preprocessing) cannot be trained end-to-end, which echoes the old hand-crafted features vs. learned features debate. We need more flexible models that can run on raw data.
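To make the raw-data point concrete, a toy contrast (hypothetical code; the vocabulary and embedding table are made up for illustration):

```python
import numpy as np

# Hand-crafted preprocessing: a fixed, non-differentiable tokenizer.
vocab = {"attention": 0, "state": 1, "space": 2}       # toy hand-built vocab
ids = [vocab[w] for w in "attention state space".split()]

# Raw-data alternative: feed bytes directly; the "tokenizer" is just a
# learnable embedding table that trains end-to-end with the model.
raw = np.frombuffer("attention state space".encode(), dtype=np.uint8)
byte_emb = np.random.randn(256, 64) * 0.02             # stands in for learned weights
x = byte_emb[raw]                                      # (seq_len, 64) model inputs
```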