Foundations of Multimodal AI: Fusions, Architectures, and Beyond
From the McGurk effect to modern vision-language models and how they fuse modalities.

- Type: Talk
- Category: AI & ML
- Level: Advanced
- Duration: 120 min
- Language: English
Multimodality, VLM, Fusion, CLIP, LLaVA, BLIP-2, Transformers
Abstract
A deep technical tour of multimodal AI: what 'multimodal' really means, the fusion techniques (early, middle, late) that combine modalities, the architectural evolution from RNNs and CNNs to Transformers and modern vision-language models (CLIP, LLaVA, BLIP-2, Qwen2-VL), plus datasets, training, evaluation, and where the field is heading.
Outline
- 01 What is multimodality (human-centered vs machine-centric)
- 02 Fusion techniques: early, middle, and late fusion
- 03 Architecture evolution: RNN, CNN, Transformer, ViT
- 04 Large vision-language models: CLIP, LLaVA, BLIP-2, MiniGPT-4, InstructBLIP
- 05 Datasets for alignment and instruction tuning
- 06 Training, evaluation metrics, and MLLM benchmarks
- 07 The future: reasoning, multilingual, mobile-first, RAG
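As a rough illustration of the three fusion strategies in item 02, the sketch below contrasts where the modalities are combined: at the input (early), at an intermediate representation (middle), or at the output scores (late). It is not taken from the talk. The feature sizes, the random dense layers, and all function names are assumptions chosen to keep the example small and dependency-free.

```python
import random

def init_layer(n_in, n_out, rng):
    """Return (weights, bias) for a small dense layer with random weights."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def linear(x, layer):
    """Apply a dense layer to a feature vector."""
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def early_fusion(image_feat, text_feat, joint_head):
    """Concatenate the input features, then run a single joint model."""
    return linear(image_feat + text_feat, joint_head)

def middle_fusion(image_feat, text_feat, image_enc, text_enc, joint_head):
    """Encode each modality separately, concatenate the hidden vectors, then classify."""
    hidden = linear(image_feat, image_enc) + linear(text_feat, text_enc)
    return linear(hidden, joint_head)

def late_fusion(image_feat, text_feat, image_head, text_head):
    """Run one full model per modality, then average their output scores."""
    image_scores = linear(image_feat, image_head)
    text_scores = linear(text_feat, text_head)
    return [(a + b) / 2 for a, b in zip(image_scores, text_scores)]

# Hypothetical sizes, for illustration only.
IMAGE_DIM, TEXT_DIM, HIDDEN_DIM, NUM_CLASSES = 8, 4, 5, 3

rng = random.Random(0)
image_feat = [rng.random() for _ in range(IMAGE_DIM)]
text_feat = [rng.random() for _ in range(TEXT_DIM)]

early_scores = early_fusion(
    image_feat, text_feat, init_layer(IMAGE_DIM + TEXT_DIM, NUM_CLASSES, rng)
)
middle_scores = middle_fusion(
    image_feat,
    text_feat,
    init_layer(IMAGE_DIM, HIDDEN_DIM, rng),
    init_layer(TEXT_DIM, HIDDEN_DIM, rng),
    init_layer(2 * HIDDEN_DIM, NUM_CLASSES, rng),
)
late_scores = late_fusion(
    image_feat,
    text_feat,
    init_layer(IMAGE_DIM, NUM_CLASSES, rng),
    init_layer(TEXT_DIM, NUM_CLASSES, rng),
)
```

All three variants produce one score per class; what differs is how much modality-specific processing happens before the features meet. Real systems would replace the dense layers with trained encoders, and averaging is only one of several ways to combine late-fusion outputs.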
Key takeaways
- Fusion strategy (early/middle/late) is the core design decision
- A projection layer is what bridges a vision encoder to an LLM
- Benchmarks reveal a 'blind faith in text' failure mode in VLMs
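The projection-layer takeaway can be made concrete with a minimal sketch, assuming a LLaVA-style setup in which a linear map converts vision-encoder patch embeddings to the LLM's embedding width so they can be placed in the same sequence as text token embeddings. The dimensions, random weights, and function names here are hypothetical and stand in for trained components.

```python
import random

def project_patches(patch_embeddings, weights, bias):
    """Map each vision patch embedding from the vision width to the LLM width."""
    return [
        [sum(w * v for w, v in zip(row, patch)) + b for row, b in zip(weights, bias)]
        for patch in patch_embeddings
    ]

def build_llm_input(projected_patches, text_embeddings):
    """Prepend the projected image tokens to the text token embeddings."""
    return projected_patches + text_embeddings

# Hypothetical sizes, for illustration only.
VISION_DIM, LLM_DIM, NUM_PATCHES, NUM_TEXT_TOKENS = 6, 4, 5, 3

rng = random.Random(0)

# Stand-ins for the outputs of a frozen vision encoder and the LLM's embedding table.
patch_embeddings = [[rng.random() for _ in range(VISION_DIM)] for _ in range(NUM_PATCHES)]
text_embeddings = [[rng.random() for _ in range(LLM_DIM)] for _ in range(NUM_TEXT_TOKENS)]

# The projection layer: in this sketch, a single linear map with random weights.
proj_weights = [[rng.uniform(-0.1, 0.1) for _ in range(VISION_DIM)] for _ in range(LLM_DIM)]
proj_bias = [0.0] * LLM_DIM

image_tokens = project_patches(patch_embeddings, proj_weights, proj_bias)
llm_input = build_llm_input(image_tokens, text_embeddings)
```

After projection, every element of `llm_input` has the same width, so the language model can attend over image and text positions uniformly. Some models use a small MLP or a query-based module in place of the single linear map shown here.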
Delivered once
- Event: Foundations of Multimodal AI (Online)
- Organizer: Data Engineering Pilipinas
- Date: Jul 14, 2025
- Reach: 40