
Foundations of Multimodal AI: Fusions, Architectures, and Beyond

From the McGurk effect to modern vision-language models and how they fuse modalities.

Type: Talk
Category: AI & ML
Level: Advanced
Duration: 120 min
Language: English
Tags: Multimodality, VLM, Fusion, CLIP, LLaVA, BLIP-2, Transformers

Abstract

A deep technical tour of multimodal AI: what 'multimodal' really means, the fusion techniques (early, middle, late) that combine modalities, the architectural evolution from RNNs and CNNs to Transformers and modern vision-language models (CLIP, LLaVA, BLIP-2, Qwen2-VL), plus datasets, training, evaluation, and where the field is heading.

Outline

  1. What is multimodality (human-centered vs machine-centric)
  2. Fusion techniques: early, middle, and late fusion
  3. Architecture evolution: RNN, CNN, Transformer, ViT
  4. Large vision-language models: CLIP, LLaVA, BLIP-2, MiniGPT-4, InstructBLIP
  5. Datasets for alignment and instruction tuning
  6. Training, evaluation metrics, and MLLM benchmarks
  7. The future: reasoning, multilingual, mobile-first, RAG
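
To make the fusion item in the outline concrete, here is a minimal toy sketch of where early, middle, and late fusion combine two modalities. It is not taken from the talk: the feature values, the `mean` and `double` stand-in "models", and all function names are hypothetical and chosen only to show at which stage the merge happens.

```python
def mean(values):
    """Toy stand-in for a model that maps a feature list to one score."""
    return sum(values) / len(values)


def double(values):
    """Toy stand-in for a per-modality encoder (hypothetical)."""
    return [2 * v for v in values]


def early_fusion(image_feat, audio_feat, joint_model):
    """Early fusion: concatenate input features, then run one joint model."""
    return joint_model(image_feat + audio_feat)  # list concatenation


def middle_fusion(image_feat, audio_feat, image_encoder, audio_encoder, head):
    """Middle fusion: encode each modality separately, merge the
    intermediate representations, then run a shared head."""
    hidden = image_encoder(image_feat) + audio_encoder(audio_feat)
    return head(hidden)


def late_fusion(image_feat, audio_feat, image_model, audio_model):
    """Late fusion: run one full model per modality, then average the scores."""
    return (image_model(image_feat) + audio_model(audio_feat)) / 2


# Made-up feature vectors for illustration only.
image_feat = [0.2, 0.4]
audio_feat = [0.9, 0.7, 0.8]

early_score = early_fusion(image_feat, audio_feat, mean)
middle_score = middle_fusion(image_feat, audio_feat, double, double, mean)
late_score = late_fusion(image_feat, audio_feat, mean, mean)
```

The three functions take the same inputs and differ only in the point at which the modalities meet, which is the design decision the outline item refers to.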

Key takeaways

  • Fusion strategy (early/middle/late) is the core design decision
  • A projection layer is what bridges a vision encoder to an LLM
  • Benchmarks reveal a 'blind faith in text' failure mode in VLMs
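
As a rough illustration of the projection-layer takeaway, the sketch below maps vision-encoder patch features into an LLM's embedding space with a single linear map and then concatenates them with text token embeddings. This is an assumption-laden toy, not the talk's code: the dimensions, random weights, and helper names (`make_linear`, `project`) are hypothetical, and real models may use an MLP or a more elaborate connector instead of one linear layer.

```python
import random


def make_linear(in_dim, out_dim, seed=0):
    """Create a small random weight matrix (out_dim x in_dim) and a zero bias."""
    rng = random.Random(seed)
    weights = [[rng.gauss(0.0, 0.02) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias


def project(features, weights, bias):
    """Apply y = W x + b to every feature vector in the list."""
    return [
        [sum(w * x for w, x in zip(row, feat)) + b for row, b in zip(weights, bias)]
        for feat in features
    ]


# Hypothetical sizes; real encoders and LLMs are far larger.
VISION_DIM, LLM_DIM = 8, 16
NUM_PATCHES, NUM_TEXT_TOKENS = 4, 3

rng = random.Random(1)
# Stand-ins for vision-encoder outputs and LLM text embeddings.
patch_features = [[rng.random() for _ in range(VISION_DIM)] for _ in range(NUM_PATCHES)]
text_embeddings = [[rng.random() for _ in range(LLM_DIM)] for _ in range(NUM_TEXT_TOKENS)]

weights, bias = make_linear(VISION_DIM, LLM_DIM)
visual_tokens = project(patch_features, weights, bias)

# After projection, visual tokens share the text embeddings' width,
# so the LLM can consume both as one input sequence.
llm_input = visual_tokens + text_embeddings
```

The point of the sketch is the shape change: vectors of width `VISION_DIM` become vectors of width `LLM_DIM`, which is what lets image patches sit in the same sequence as text tokens.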

Delivered once