Keynote Speeches

Day 1: Monday, November 21


Keynote: Prof. Jitendra Malik

Title: Deep Visual Understanding from Deep Learning

Vision is the interplay of the processes of recognition, reconstruction and grouping. The availability of large annotated datasets of images and of GPU computing has enabled artificial neural networks to become the machine learning technique of choice, leading to quite remarkable progress in the last few years. We now have a good capability to detect and localize objects and determine their 3D pose and layout in a scene. We can categorize people according to certain attributes or actions, and estimate their poses.

However, we are still quite short of “visual understanding”, which has many other aspects. For example, we should not just detect people, objects and actions but also link them together, by what we call “visual semantic role labeling”, essentially identifying subject-verb-object relationships. We should be able to make predictions: what will happen next in a video stream? Vision helps guide manipulation and locomotion, and this requires building dynamic models of the consequences of various actions. I will show several case studies of work along these lines, which holds the promise of developing learning procedures that are self-supervised and bypass the need for manual annotation.

Jitendra Malik was born in Mathura, India in 1960. He received the B.Tech degree in Electrical Engineering from the Indian Institute of Technology, Kanpur in 1980 and the PhD degree in Computer Science from Stanford University in 1985. In January 1986, he joined the University of California at Berkeley, where he is currently the Arthur J. Chick Professor in the Computer Science Division, Department of Electrical Engineering and Computer Sciences. He is also on the faculty of the Department of Bioengineering, and the Cognitive Science and Vision Science groups. During 2002-2004 he served as the Chair of the Computer Science Division and during 2004-2006 as the Department Chair of EECS.

Prof. Malik’s research group has worked on many different topics in computer vision, computational modeling of human vision, computer graphics and the analysis of biological images. Several well-known concepts and algorithms arose in this research, such as anisotropic diffusion, normalized cuts, high dynamic range imaging, and shape contexts. He has mentored more than 50 PhD students and postdoctoral fellows.

He received the gold medal for the best graduating student in Electrical Engineering from IIT Kanpur in 1980 and a Presidential Young Investigator Award in 1989. At UC Berkeley, he was selected for the Diane S. McEntyre Award for Excellence in Teaching in 2000 and a Miller Research Professorship in 2001, and was appointed the Arthur J. Chick Professor in 2002. He received the Distinguished Alumnus Award from IIT Kanpur in 2008. He was twice awarded the Longuet-Higgins Prize for a contribution that has stood the test of time, in 2007 and in 2008. He received the PAMI-TC Distinguished Researcher in Computer Vision Award in 2013, and in 2014 the K.S. Fu Prize from the International Association for Pattern Recognition. He is a fellow of the IEEE and the ACM. He is a member of the National Academy of Engineering and a fellow of the American Academy of Arts and Sciences, and in 2015 he was elected to the National Academy of Sciences.

Day 2: Tuesday, November 22


Keynote: Prof. Katsushi Ikeuchi (Microsoft Research)

Title: e-Intangible Heritage

Tangible heritage, such as temples and statues, is disappearing day by day due to human-caused and natural disasters. Intangible heritage, such as folk dances, local songs, and dialects, faces the same fate due to a lack of inheritors and the mixing of cultures. We have been developing methods to preserve such tangible and intangible heritage in digital form. This project, which we refer to as e-Heritage, aims not only to record heritage, but also to analyze the recorded data for better understanding and to display the data in new forms for promotion and education.

This talk mainly covers how to preserve intangible heritage, in particular Japanese and Taiwanese folk dances. The first half of my talk covers how to display a Japanese folk dance on a humanoid robot. Here, we follow the learning-from-observation paradigm, in which a robot learns how to dance by observing human dance. Due to the physical differences between a human and a robot, the robot cannot mimic every human action. Instead, the robot first extracts the important actions of a dance, referred to as key poses, and then converts these key poses into symbolic representations. These symbolic representations are mapped to robot motions depending on the robot hardware. The second half of my talk covers our effort to apply similar techniques to Taiwanese folk dances. We extract symbolic representations of folk dances, examine similarities among those folk dances, and compare the classification trees obtained from the representations with trees based on other aspects such as DNA, languages and social institutions.

Dr. Katsushi Ikeuchi is a Principal Researcher at Microsoft Research. He received a Ph.D. degree in Information Engineering from the University of Tokyo in 1978. After working at the AI Lab of MIT as a postdoctoral fellow for three years, the Electrotechnical Lab, Japan as a researcher for five years, the Robotics Institute of Carnegie Mellon University as a faculty member for ten years, and the University of Tokyo as a faculty member for nineteen years, he joined Microsoft Research Asia in 2015. His research interests span computer vision, robotics, and computer graphics. He has received several awards, including the IEEE-PAMI Distinguished Researcher Award, the Okawa Prize from the Okawa Foundation, and Shi-Ju-Ho-sho (the Medal of Honor with Purple Ribbon) from the Emperor of Japan. He is a fellow of IEEE, IEICE, IPSJ, and RSJ.

Day 3: Wednesday, November 23


Keynote: Prof. Marc Pollefeys

Title: Semantic 3D Reconstruction

While purely geometric models of the world can be sufficient for some applications, there are also many applications that need additional semantic information. In this talk I will focus on 3D reconstruction approaches which combine geometric and appearance cues to obtain semantic 3D reconstructions. Specifically, the approaches I will discuss are formulated as multi-label volumetric segmentation, i.e., each voxel is assigned a label corresponding to one of the semantic classes considered, including free space. We propose a formulation representing raw geometric and appearance data as unary or high-order (pixel-ray) energy terms on voxels, with class-pair-specific learned anisotropic smoothness terms to regularize the results. We will see how, by solving reconstruction and segmentation/recognition jointly, the quality of the results improves significantly and we can make progress towards 3D scene understanding.

Marc Pollefeys is director of science at Microsoft HoloLens and a full professor in the Dept. of Computer Science at ETH Zurich. Previously he was on the faculty at the University of North Carolina at Chapel Hill. He obtained his MS and PhD degrees from the KU Leuven in Belgium. His main area of research is computer vision. Dr. Pollefeys has received several prizes for his research, including a Marr Prize, an NSF CAREER award, a Packard Fellowship and an ERC Starting Grant. He is the author or co-author of more than 280 peer-reviewed papers. He will be general chair of ICCV 2019, was a general chair for ECCV 2014 in Zurich and was one of the program chairs for the IEEE Conf. on Computer Vision and Pattern Recognition 2009. Prof. Pollefeys was on the Editorial Board of the IEEE Transactions on Pattern Analysis and Machine Intelligence, the International Journal of Computer Vision, Foundations and Trends in Computer Graphics and Computer Vision and several other journals. He is an IEEE Fellow.