MaMMUT: A simple vision-encoder text-decoder architecture for multimodal tasks

Comments

from Hacker News https://ift.tt/gCtdQ1s
