Introduction to Multimodal Learning / Fall 2022
Course Description
With the rapid growth of big data, information dissemination has gradually shifted from text-only media to cross-media forms that combine images, audio, video, and signals from new sensors. Improving machine intelligence depends on learning from the human brain's cross-media capability, so that computers can overcome the limitations of single-modal analysis and perform more general multimodal analysis in the real world. As a result, multimodal learning has become a very active research field in recent years.
This course focuses on core techniques and modern advances for integrating different "modalities" into a shared representation or reasoning system. Specifically, the modalities covered include images/videos, text, and audio. The course will present (1) unimodal representation based on deep learning, including convolutional neural networks (CNNs), recurrent neural networks (RNNs), transformers, etc.; and (2) fundamental concepts in multimodal learning: multimodal representation learning, modality alignment, and multimodal fusion. These include, but are not limited to, multimodal auto-encoders, deep canonical correlation analysis, attention models, and multimodal recurrent neural networks. The course will also discuss many recent applications of multimodal learning, including multimodal pretraining, cross-modal retrieval, cross-modal translation, and visual question answering.
Logistics
- Time: Monday 10:10-12:00, Thursday 15:10-18:00
- Location: Room 302, Teaching Building 2, Peking University
- Office Hours: Thursday 4:30PM - 5:30PM (by email appointment only)
Prerequisites
- Linear Algebra
- Basic Probability and Statistics
- Introduction to AI
Grading
- Assignments (30%)
- Midterm (30%)
- Final (40%)
Instructors
Teaching Assistants

Jiahua Zhang