Interactive Demo
The user first explores SAE statistics such as sparsity, activation levels, and label entropy to assess training quality. A well-trained SAE typically exhibits clusters of latents that are highly sparse and strongly activating.
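These per-latent statistics can be sketched as follows. This is an illustrative example, not the tool's actual code: `acts` is a hypothetical matrix of SAE latent activations, and the thresholds are placeholders.

```python
import numpy as np

# Hypothetical (n_samples, n_latents) matrix of SAE latent activations.
rng = np.random.default_rng(0)
acts = np.maximum(rng.normal(size=(1000, 16)), 0.0)  # ReLU-style latents

# Sparsity: fraction of samples on which each latent fires.
firing_rate = (acts > 0).mean(axis=0)

# Activation level: mean activation over the samples where the latent fires.
mean_when_active = np.where(
    firing_rate > 0,
    acts.sum(axis=0) / np.maximum((acts > 0).sum(axis=0), 1),
    0.0,
)

# "Highly sparse and strongly activating" in the sense above
# (0.05 and 1.0 are illustrative thresholds).
is_sparse_and_strong = (firing_rate < 0.05) & (mean_when_active > 1.0)
```

In practice a user would scan a scatter of `firing_rate` against `mean_when_active` rather than a hard threshold.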
The user then identifies meaningful latents from the clustering results of the decoder weights. Meaningful latents tend to form clear, coherent clusters.
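One common way to obtain such clusters, sketched here with synthetic weights (the matrix name `W_dec` and the clustering settings are assumptions, not the tool's documented method), is hierarchical clustering of the decoder directions under cosine distance:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical (n_latents, d_model) SAE decoder weight matrix.
rng = np.random.default_rng(1)
W_dec = rng.normal(size=(32, 64))
W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)  # unit-norm rows

# Average-linkage hierarchical clustering on cosine distance between
# decoder directions; latents in the same cluster point in similar
# directions in model space.
Z = linkage(W_dec, method="average", metric="cosine")
labels = fcluster(Z, t=0.8, criterion="distance")  # illustrative cutoff
```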
The user provides an input image to the VLM and specifies an input prompt. The interface then displays the tokenized outputs, along with the tokenized system prompt and input prompt.
The user can steer the model by selecting specific latents and assigning new values, then running the VLM again with the steered latents applied.
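Conceptually, steering amounts to encoding a hidden state with the SAE, clamping the selected latents to the user's values, and decoding back into model space. The sketch below uses synthetic weights; the names (`W_enc`, `W_dec`, `steer`) are illustrative placeholders, not the actual interface.

```python
import numpy as np

rng = np.random.default_rng(2)
d_model, n_latents = 64, 256
W_enc = rng.normal(size=(d_model, n_latents)) / np.sqrt(d_model)
W_dec = rng.normal(size=(n_latents, d_model)) / np.sqrt(n_latents)

def steer(h, edits):
    """Clamp selected SAE latents of hidden state h to new values."""
    z = np.maximum(h @ W_enc, 0.0)          # SAE encode (ReLU)
    for latent_idx, new_value in edits.items():
        z[..., latent_idx] = new_value      # overwrite chosen latents
    return z @ W_dec                        # decode back to model space

h = rng.normal(size=(d_model,))
h_steered = steer(h, {7: 5.0, 19: 0.0})     # boost latent 7, silence 19
```

The steered hidden state would then replace the original one at the hooked layer before the VLM continues generation.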
In the right panel, the user can view the VLM's output attention, showing the spatial regions of the input image that the model attends to for each output token or question token.
VisualScratchpad provides a token-latent heatmap view, where latents are clustered and visualized according to their activation similarity across output tokens.
Users may select individual clusters to steer the model based on the latents grouped within those clusters.
This panel visualizes each latent's activation weighted by the attention map for the selected token, allowing users to identify which latents contribute most strongly to the production of that token.
Users can optionally apply attention weighting to the SAE activation values and filter out noisy latents using sparsity and mean activation thresholds.
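The weighting and filtering steps above can be sketched as follows, with synthetic activations and illustrative threshold values (the actual thresholds are user-chosen in the interface):

```python
import numpy as np

rng = np.random.default_rng(3)
n_patches, n_latents = 576, 128
acts = np.maximum(rng.normal(size=(n_patches, n_latents)), 0.0)

# Attention map over image patches for the selected token, normalized
# to sum to 1 so it acts as weighting coefficients.
attn = rng.random(n_patches)
attn /= attn.sum()

weighted = attn @ acts                        # (n_latents,) per-latent score

# Filter noisy latents: drop those that fire almost everywhere or whose
# mean activation is negligible (0.9 and 0.01 are hypothetical cutoffs).
sparsity = (acts > 0).mean(axis=0)
mean_act = acts.mean(axis=0)
keep = (sparsity < 0.9) & (mean_act > 0.01)
filtered = np.where(keep, weighted, 0.0)
```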
In the right panel, users can inspect how the selected latent activates on the input image and on reference images through their spatial attributions.
Result Highlights
We present three failure cases in which LLaVA-Next-8B initially produces incorrect outputs.
Case 1: Even when the vision encoder successfully captures the correct cue, the model may fail to utilize it; providing a more explicit description in the input prompt corrects the output.
Case 2: When the model relies on a misleading visual cue, removing that cue leads to a change in prediction.
Case 3: Although the vision encoder may capture multiple plausible visual concepts, the model often relies disproportionately on the most dominant one.
How does VisualScratchpad work?
A. SAEs return latent activations for each image patch. B. Image-level activations can be computed by naïvely averaging activations across all patches, or C. by applying a weighted average where the text-to-image attention map serves as the weighting coefficient, promoting concepts relevant to the text tokens to the top of the ranking. The bottom row shows the top-ranked concept obtained from the corresponding method.
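The two pooling schemes (B and C) differ only in the weights used to average over patches. A minimal sketch with synthetic data (shapes and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
# Hypothetical (n_patches, n_latents) SAE activations for one image.
patch_acts = np.maximum(rng.normal(size=(576, 64)), 0.0)

# Text-to-image attention map over patches, normalized to sum to 1.
attn = rng.random(576)
attn /= attn.sum()

naive = patch_acts.mean(axis=0)        # B: uniform average over patches
weighted = attn @ patch_acts           # C: attention-weighted average

top_naive = int(naive.argmax())        # top-ranked concept, method B
top_weighted = int(weighted.argmax())  # top-ranked concept, method C
```

With real attention maps, method C promotes latents that fire on the patches the text tokens actually attend to, which is why the two methods can surface different top-ranked concepts.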
Raw values are difficult to analyze, so we normalize the matrix column-wise, which reveals whether a latent is attended to by specific tokens or uniformly across all tokens. We then cluster and sort the latents by their column-wise activation correlation using hierarchical clustering.
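This preprocessing can be sketched as follows, using a synthetic token-by-latent activation matrix (the data and linkage settings are assumptions for illustration):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, leaves_list

rng = np.random.default_rng(5)
M = np.maximum(rng.normal(size=(20, 12)), 0.0)   # tokens x latents

# Column-wise normalization: a peaked column means the latent is tied to
# specific tokens; a flat column means it responds to tokens uniformly.
col_max = M.max(axis=0, keepdims=True)
M_norm = M / np.maximum(col_max, 1e-8)

# Reorder latents by hierarchical clustering on the correlation between
# their activation columns, so correlated latents sit adjacent in the
# heatmap.
Z = linkage(M_norm.T, method="average", metric="correlation")
order = leaves_list(Z)                           # latent display order
M_sorted = M_norm[:, order]
```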
Ablating latent clusters corresponding to distinct semantic topics removes those topics from the generated output, demonstrating their causal role in shaping the model's predictions.
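The ablation itself is a simple intervention: zero every latent in the selected cluster before decoding, so the corresponding concept is absent from the activation the model sees. A hedged sketch with synthetic weights and cluster labels:

```python
import numpy as np

rng = np.random.default_rng(6)
n_latents, d_model = 64, 32
W_dec = rng.normal(size=(n_latents, d_model))    # hypothetical decoder
z = np.maximum(rng.normal(size=(n_latents,)), 0.0)
cluster_labels = rng.integers(0, 4, size=n_latents)

def ablate_cluster(z, labels, cluster_id):
    """Zero all latents belonging to the given cluster."""
    z_ab = z.copy()
    z_ab[labels == cluster_id] = 0.0
    return z_ab

# Decode the ablated latents back into model space; generation then
# proceeds with the cluster's concept removed.
h_ablated = ablate_cluster(z, cluster_labels, cluster_id=2) @ W_dec
```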