Any truly multi-modal transformer architectures?
Most multi-modal architectures consume images as tokens projected into the same embedding dimension as text. Are there any architectures that treat text and images as first-class citizens and can also produce image tokens interleaved with text?
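To make the "images as tokens in the same dimension" pattern concrete, here's a minimal, hypothetical sketch (names and sizes are made up, not any specific model): image patches get linearly projected into the same model dimension as text embeddings and concatenated into one sequence.

```python
# Hypothetical sketch of the "images as tokens" pattern, not a specific published model.
import torch
import torch.nn as nn

class InterleavedMultimodalInput(nn.Module):
    def __init__(self, vocab_size=32000, d_model=768, patch_size=16, in_channels=3):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, d_model)
        # Image patches are linearly projected into the same d_model as text tokens.
        self.patch_embed = nn.Conv2d(in_channels, d_model,
                                     kernel_size=patch_size, stride=patch_size)

    def forward(self, text_ids, image):
        text_tokens = self.text_embed(text_ids)             # (B, T, d_model)
        patches = self.patch_embed(image)                   # (B, d_model, H/ps, W/ps)
        image_tokens = patches.flatten(2).transpose(1, 2)   # (B, N, d_model)
        # One shared sequence: image tokens followed by (or interleaved with) text tokens.
        return torch.cat([image_tokens, text_tokens], dim=1)
```

The question is whether any model also *generates* image tokens in that shared sequence, rather than only consuming them.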
What do you mean? We want images and text to live in the same latent space, and be represented by similar vectors if the two correlate. How else would you want to do it?
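For what it's worth, the "similar vectors if the two correlate" part is usually done with a contrastive objective; here's a rough sketch of that idea (assumed CLIP-style setup, not something stated in this thread):

```python
# Hedged sketch of contrastive alignment: matching image/text pairs are pulled toward
# similar vectors in a shared latent space, mismatched pairs are pushed apart.
import torch
import torch.nn.functional as F

def contrastive_loss(image_vecs, text_vecs, temperature=0.07):
    # Normalize so the dot product is cosine similarity in the shared space.
    image_vecs = F.normalize(image_vecs, dim=-1)
    text_vecs = F.normalize(text_vecs, dim=-1)
    logits = image_vecs @ text_vecs.t() / temperature       # (B, B) similarity matrix
    targets = torch.arange(len(logits), device=logits.device)
    # Diagonal entries are the correlated pairs; both directions are supervised.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```

That gives a shared embedding space, but by itself it's still a separate encoder per modality rather than one model emitting interleaved image and text tokens.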