Alibaba's family of vision-language models. Used for image understanding, captioning, and tasks that combine text and images.
Multimodal Transformer
Natural language