Foundations of Multimodal AI: Fusions, Architectures, and Beyond

From the McGurk effect to modern vision-language models and how they fuse modalities.

Type: Talk
Category: AI & ML
Level: Advanced
Duration: 120 min
Language: English

Tags: Multimodality, VLM, Fusion, CLIP, LLaVA, BLIP-2, Transformers

Abstract

A deep technical tour of multimodal AI: what 'multimodal' really means, the fusion techniques (early, middle, late) that combine modalities, the architectural evolution from RNNs and CNNs to Transformers and modern vision-language models (CLIP, LLaVA, BLIP-2, Qwen2-VL), plus datasets, training, evaluation, and where the field is heading.
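The three fusion points named above can be illustrated with a toy example. This is a minimal sketch, not material from the talk: the linear scorer, the function names, and the weights are all hypothetical, and real systems use learned encoders rather than hand-set dot products. It only shows where in the pipeline the modalities meet.

```python
from typing import List

Vector = List[float]


def linear_score(features: Vector, weights: Vector, bias: float = 0.0) -> float:
    """Toy linear model (hypothetical stand-in for a learned network)."""
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    return sum(f * w for f, w in zip(features, weights)) + bias


def early_fusion(image_feats: Vector, text_feats: Vector, joint_weights: Vector) -> float:
    """Early fusion: concatenate the input features, then run one joint model."""
    return linear_score(image_feats + text_feats, joint_weights)


def middle_fusion(image_feats: Vector, text_feats: Vector,
                  image_weights: Vector, text_weights: Vector,
                  head_weights: Vector) -> float:
    """Middle fusion: encode each modality separately, merge the
    intermediate representations, then run a shared head on the merged vector.
    Each toy encoder here emits a single hidden value for brevity."""
    hidden = [
        linear_score(image_feats, image_weights),
        linear_score(text_feats, text_weights),
    ]
    return linear_score(hidden, head_weights)


def late_fusion(image_feats: Vector, text_feats: Vector,
                image_weights: Vector, text_weights: Vector) -> float:
    """Late fusion: run one full model per modality, then average the outputs."""
    image_score = linear_score(image_feats, image_weights)
    text_score = linear_score(text_feats, text_weights)
    return 0.5 * (image_score + text_score)


# Made-up feature vectors, for illustration only.
image_feats = [1.0, 2.0]
text_feats = [3.0]

early = early_fusion(image_feats, text_feats, [0.5, 0.5, 1.0])
middle = middle_fusion(image_feats, text_feats, [0.5, 0.5], [1.0], [1.0, 2.0])
late = late_fusion(image_feats, text_feats, [0.5, 0.5], [1.0])
print(early, middle, late)
```

In this sketch the only difference between the three functions is the stage at which image and text information are combined: at the input, at an intermediate representation, or at the final prediction.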

Outline

01 What is multimodality (human-centered vs machine-centric)
02 Fusion techniques: early, middle, and late fusion
03 Architecture evolution: RNN, CNN, Transformer, ViT
04 Large vision-language models: CLIP, LLaVA, BLIP-2, MiniGPT-4, InstructBLIP
05 Datasets for alignment and instruction tuning
06 Training, evaluation metrics, and MLLM benchmarks
07 The future: reasoning, multilingual, mobile-first, RAG

Key takeaways

Fusion strategy (early/middle/late) is the core design decision
A projection layer is what bridges a vision encoder to an LLM
Benchmarks reveal a 'blind faith in text' failure mode in VLMs
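The projection-layer takeaway can be sketched as follows. This is an illustrative toy under stated assumptions, not the implementation of any particular model: the dimensions, the weights, and the names `project`, `d_vision`, and `d_llm` are hypothetical. It assumes a LLaVA-style design in which a learned linear map (or small MLP) takes each patch embedding from the vision encoder into the LLM's token-embedding space, after which the projected patches are placed in the same sequence as the text token embeddings.

```python
from typing import List

Matrix = List[List[float]]


def project(patch_embeddings: Matrix, weight: Matrix, bias: List[float]) -> Matrix:
    """Map each patch embedding of size d_vision to size d_llm.

    `weight` has shape (d_llm, d_vision): one row of weights per output dimension.
    """
    projected = []
    for patch in patch_embeddings:
        row = [
            sum(p * w for p, w in zip(patch, out_weights)) + b
            for out_weights, b in zip(weight, bias)
        ]
        projected.append(row)
    return projected


# Hypothetical sizes: d_vision = 2, d_llm = 3 (real models use hundreds or thousands).
patch_embeddings = [[1.0, 2.0], [3.0, 4.0]]       # output of a vision encoder, 2 patches
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]     # would be learned in practice
bias = [0.0, 0.0, 0.0]

visual_tokens = project(patch_embeddings, weight, bias)

# Text token embeddings already live in the LLM's embedding space (size d_llm).
text_tokens = [[0.1, 0.2, 0.3]]

# The LLM consumes one sequence: projected visual tokens followed by text tokens.
llm_input = visual_tokens + text_tokens
print(llm_input)
```

The point of the sketch is the shape change: after projection, visual tokens have the same width as text tokens, so the language model can attend over both without modification.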

Delivered once

Event: Foundations of Multimodal AI (Online)
Organizer: Data Engineering Pilipinas
Date: Jul 14, 2025
Reach: 40
