Latent space navigation – interpretation, probing and steering

on Sat, 9:00 in Main Room for 30 min

Abstract

Modern neural networks are powerful but opaque: they encode meaning in high-dimensional latent spaces that are hard to inspect. Interpretability research asks what these internal representations contain, and whether we can read, probe, and even steer them. In this talk, I’ll give an accessible introduction to latent-space interpretability for a broad audience and connect it to my own work on the structure of learned representations. I’ll show how human-meaningful concepts emerge across the layers of a network, and how concept regions tend to be approximately convex. These structural regularities offer a window into what these models have actually learned.
