Introduction to Multimodal Learning / Fall 2022

Course Description

With the rapid growth of big data, information dissemination has gradually shifted from text-only media to cross-media forms that combine images, audio, video, and signals from new sensors. Enabling computers to learn from the human brain's cross-media capability, so that they can overcome the limitations of single-modal analysis and perform more general multimodal analysis in the real world, is crucial to improving machine intelligence. As a result, multimodal learning has become a very active research field in recent years.

This course focuses on core techniques and modern advances for integrating different "modalities" into a shared representation or reasoning system. Specifically, these modalities include images/videos, text, and audio. The course will present (1) unimodal representation based on deep learning, including convolutional neural networks (CNNs), recurrent neural networks (RNNs), transformers, etc., and (2) fundamental concepts in multimodal learning: multimodal representation learning, modality alignment, and multimodal fusion. Topics include, but are not limited to, multimodal auto-encoders, deep canonical correlation analysis, attention models, and multimodal recurrent neural networks. The course will also discuss many of the recent applications of multimodal learning, including multimodal pretraining, cross-modal retrieval, cross-modal translation, and visual question answering.


  • Time: Monday 10:10-12:00, Thursday 15:10-18:00
  • Location: Room 302, Teaching Building 2, Peking University
  • Office Hours: Thursday 4:30 PM - 5:30 PM (by email appointment only)


Prerequisites

  • Linear Algebra
  • Basic Probability and Statistics
  • Introduction to AI


Grading

  • Assignments (30%)
  • Midterm (30%)
  • Final (40%)


Teaching Assistants

Jiahua Zhang