VisualScratchpad: Grounding Visual Concepts in Large Vision Language Models

Submission to ICLR 2026 Workshop on
Principled Design for Trustworthy AI - Interpretability, Robustness, and Safety across Modalities
1KAIST AI, 2NAVER AI Lab, *Correspondence

To make vision-language model internals more accessible and enable systematic debugging, we introduce VisualScratchpad, an interactive interface for visual concept analysis during inference. A. During inference in a vision-language model, we extract the intermediate representation z from the vision encoder. B. A sparse autoencoder processes z to produce concept activations. The attention map from output text tokens to image tokens is applied at the patch level to weight these activations. Latents exhibiting similar activation patterns across output tokens are then clustered and visualized in a token-latent heatmap. C. The causal influence of these concepts on the model's output can be evaluated through latent ablation.
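The pipeline above (A–B) can be sketched numerically: SAE latent activations are computed per image patch, then re-weighted by the text-to-image attention map to give per-output-token concept activations. This is a minimal NumPy sketch with toy random tensors; all shapes and names (`latent_acts`, `attn`) are illustrative assumptions, not the released implementation.

```python
import numpy as np

# Hypothetical shapes: P image patches, D SAE latents, T output text tokens.
P, D, T = 576, 4096, 12
rng = np.random.default_rng(0)

# A./B. SAE latent activations for each image patch (stand-in for encoding
# the vision-encoder representation z with a sparse autoencoder).
latent_acts = rng.random((P, D))          # (P, D)

# Text-to-image attention: each output token's attention over image patches,
# normalized so every row sums to 1.
attn = rng.random((T, P))
attn /= attn.sum(axis=1, keepdims=True)

# Attention-weighted concept activations: one row per output token.
token_latent = attn @ latent_acts         # (T, D)
print(token_latent.shape)                 # (12, 4096)
```

Each row of `token_latent` is what the token-latent heatmap in panel B visualizes.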

Interactive Demo

Result Highlights


We present three failure cases in which LLaVA-Next-8B initially produces incorrect outputs.
Case 1: Even when the vision encoder successfully captures the correct cue, the model may fail to utilize it; providing a more explicit description in the input prompt corrects the output.
Case 2: When the model relies on a misleading visual cue, removing that cue leads to a change in prediction.
Case 3: Although the vision encoder may capture multiple plausible visual concepts, the model often relies disproportionately on the most dominant one.

How does VisualScratchpad work?

VisualScratchpad Method Overview
Attention-based concept re-ranking

A. SAEs return latent activations for each image patch. B. Image-level activations can be computed either by naïvely averaging activations across all patches, or C. by applying a weighted average in which the text-to-image attention map serves as the weighting coefficient, promoting concepts relevant to the text tokens to the top of the ranking. The bottom row shows the top-ranked concept obtained from each method.
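The contrast between B and C reduces to two reductions over the patch axis: an unweighted mean versus an attention-weighted mean. A minimal sketch with toy sizes (all tensors random; variable names are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
P, D = 64, 128                       # image patches, SAE latents (toy sizes)
acts = rng.random((P, D))            # per-patch latent activations

# B. Naive image-level score: unweighted mean over all patches.
naive = acts.mean(axis=0)            # (D,)

# C. Attention-weighted score: text-to-image attention as the weights.
attn = rng.random(P)
attn /= attn.sum()                   # weights sum to 1
weighted = attn @ acts               # (D,)

# Concept ranking under each scheme; the weighted ranking promotes latents
# active in patches the text tokens actually attend to.
top_naive = np.argsort(naive)[::-1][:5]
top_weighted = np.argsort(weighted)[::-1][:5]
print(top_naive, top_weighted)
```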

Token-Latent heatmap visualization and clustering

Raw values are difficult to analyze, so we normalize the heatmap column-wise to show whether a latent is attended specifically by certain tokens or broadly across all tokens. We then cluster the columns by their activation correlation using hierarchical clustering and sort them accordingly.
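A possible realization of this step, assuming a token × latent heatmap: min-max normalize each column, then order columns by hierarchical clustering on correlation distance. This sketch uses SciPy's `pdist`/`linkage`/`leaves_list`; the exact normalization and linkage method are assumptions, not necessarily those used in the paper.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, leaves_list
from scipy.spatial.distance import pdist

rng = np.random.default_rng(2)
T, D = 12, 20                        # output tokens x top latents (toy heatmap)
heat = rng.random((T, D))

# Column-wise min-max normalization: a column that peaks on a few tokens now
# stands out from one that is uniformly high across all tokens.
col_min = heat.min(axis=0)
col_max = heat.max(axis=0)
norm = (heat - col_min) / (col_max - col_min + 1e-8)

# Hierarchical clustering of columns by activation correlation:
# correlation distance between latent columns, average linkage.
dist = pdist(norm.T, metric="correlation")   # condensed (D choose 2) distances
Z = linkage(dist, method="average")
order = leaves_list(Z)                       # column order for display
sorted_heat = norm[:, order]                 # heatmap with correlated latents adjacent
```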

Causal inference

Ablating latent clusters corresponding to distinct semantic topics removes those topics from the generated output, demonstrating their causal role in shaping the model's predictions.
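Mechanically, ablating a latent cluster amounts to zeroing those latents' activations before the SAE decodes them back into the representation the model consumes, then re-running generation. A hypothetical sketch of just the zeroing step (the function name, shapes, and cluster indices are illustrative assumptions):

```python
import numpy as np

def ablate_cluster(latent_acts, cluster_ids):
    """Zero the activations of one latent cluster (returns a copy)."""
    out = latent_acts.copy()
    out[:, cluster_ids] = 0.0
    return out

rng = np.random.default_rng(3)
acts = rng.random((576, 128))        # (patches, latents), toy sizes
cluster = [3, 17, 42]                # latents forming one semantic topic

ablated = ablate_cluster(acts, cluster)

# In the full pipeline the ablated activations would be decoded by the SAE
# and fed back to the model; here we only verify the cluster is removed
# while all other latents are untouched.
assert ablated[:, cluster].sum() == 0.0
assert np.array_equal(np.delete(ablated, cluster, axis=1),
                      np.delete(acts, cluster, axis=1))
```

Comparing the model's output before and after this intervention is what establishes the cluster's causal role.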