Interactive Demo
The user first explores SAE statistics such as sparsity, activation levels, and label entropy to assess training quality. A well-trained SAE typically exhibits clusters of latents that are highly sparse and strongly activating.
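These per-latent statistics can be sketched as follows. This is an illustrative example, not the tool's actual code: `acts` is a hypothetical matrix of SAE latent activations, and the thresholds are placeholders.

```python
import numpy as np

# Hypothetical (n_samples, n_latents) matrix of SAE latent activations.
rng = np.random.default_rng(0)
acts = np.maximum(rng.normal(size=(1000, 16)), 0.0)  # ReLU-style latents

# Sparsity: fraction of samples on which each latent fires.
firing_rate = (acts > 0).mean(axis=0)

# Activation level: mean activation over the samples where the latent fires.
mean_when_active = np.where(
    firing_rate > 0,
    acts.sum(axis=0) / np.maximum((acts > 0).sum(axis=0), 1),
    0.0,
)

# "Highly sparse and strongly activating" in the sense above
# (0.05 and 1.0 are illustrative thresholds).
is_sparse_and_strong = (firing_rate < 0.05) & (mean_when_active > 1.0)
```

In practice a user would scan a scatter of `firing_rate` against `mean_when_active` rather than a hard threshold.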
The user then identifies meaningful latents from the clustering results of the decoder weights. Meaningful latents tend to form clear, coherent clusters.
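One common way to obtain such clusters, sketched here with synthetic weights (the matrix name `W_dec` and the clustering settings are assumptions, not the tool's documented method), is hierarchical clustering of the decoder directions under cosine distance:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical (n_latents, d_model) SAE decoder weight matrix.
rng = np.random.default_rng(1)
W_dec = rng.normal(size=(32, 64))
W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)  # unit-norm rows

# Average-linkage hierarchical clustering on cosine distance between
# decoder directions; latents in the same cluster point in similar
# directions in model space.
Z = linkage(W_dec, method="average", metric="cosine")
labels = fcluster(Z, t=0.8, criterion="distance")  # illustrative cutoff
```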
The user provides an input image to the VLM and specifies an input prompt. The interface then displays the tokenized outputs, along with the tokenized system prompt and input prompt.
The user can steer the model by selecting specific latents and assigning new values, then running the VLM again with the steered latents applied.
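Conceptually, steering amounts to encoding a hidden state with the SAE, clamping the selected latents to the user's values, and decoding back into model space. The sketch below uses synthetic weights; the names (`W_enc`, `W_dec`, `steer`) are illustrative placeholders, not the actual interface.

```python
import numpy as np

rng = np.random.default_rng(2)
d_model, n_latents = 64, 256
W_enc = rng.normal(size=(d_model, n_latents)) / np.sqrt(d_model)
W_dec = rng.normal(size=(n_latents, d_model)) / np.sqrt(n_latents)

def steer(h, edits):
    """Clamp selected SAE latents of hidden state h to new values."""
    z = np.maximum(h @ W_enc, 0.0)          # SAE encode (ReLU)
    for latent_idx, new_value in edits.items():
        z[..., latent_idx] = new_value      # overwrite chosen latents
    return z @ W_dec                        # decode back to model space

h = rng.normal(size=(d_model,))
h_steered = steer(h, {7: 5.0, 19: 0.0})     # boost latent 7, silence 19
```

The steered hidden state would then replace the original one at the hooked layer before the VLM continues generation.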
In the right panel, the user can view the VLM's output attention, showing the spatial regions of the input image that the model attends to for each output token or question token.
VisualScratchpad provides a token-latent heatmap view, where latents are clustered and visualized according to their activation similarity across output tokens.
Users may select individual clusters to steer the model based on the latents grouped within those clusters.
This panel visualizes each latent's activation weighted by the attention map for the selected token, allowing users to identify which latents contribute most strongly to the production of that token.
Users can optionally apply attention weighting to the SAE activation values and filter out noisy latents using sparsity and mean activation thresholds.
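The weighting and filtering steps above can be sketched as follows, with synthetic activations and illustrative threshold values (the actual thresholds are user-chosen in the interface):

```python
import numpy as np

rng = np.random.default_rng(3)
n_patches, n_latents = 576, 128
acts = np.maximum(rng.normal(size=(n_patches, n_latents)), 0.0)

# Attention map over image patches for the selected token, normalized
# to sum to 1 so it acts as weighting coefficients.
attn = rng.random(n_patches)
attn /= attn.sum()

weighted = attn @ acts                        # (n_latents,) per-latent score

# Filter noisy latents: drop those that fire almost everywhere or whose
# mean activation is negligible (0.9 and 0.01 are hypothetical cutoffs).
sparsity = (acts > 0).mean(axis=0)
mean_act = acts.mean(axis=0)
keep = (sparsity < 0.9) & (mean_act > 0.01)
filtered = np.where(keep, weighted, 0.0)
```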
In the right panel, users can inspect how the selected latent activates on the input image and on reference images through their spatial attributions.
Result Highlights
We present three failure cases in which LLaVA-Next-8B initially produces incorrect outputs.
Case 1: Even when the vision encoder successfully captures the correct cue, the model may fail to utilize it; providing a more explicit description in the input prompt corrects the output.
Case 2: When the model relies on a misleading visual cue, removing that cue leads to a change in prediction.
Case 3: Although the vision encoder may capture multiple plausible visual concepts, the model often relies disproportionately on the most dominant one.
How does VisualScratchpad work?
A. SAEs return latent activations for each image patch. B. Image-level activations can be computed by naïvely averaging activations across all patches, or C. by applying a weighted average where the text-to-image attention map serves as the weighting coefficient, promoting concepts relevant to the text tokens to the top of the ranking. The bottom row shows the top-ranked concept obtained from the corresponding method.
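The two pooling schemes (B and C) differ only in the weights used to average over patches. A minimal sketch with synthetic data (shapes and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
# Hypothetical (n_patches, n_latents) SAE activations for one image.
patch_acts = np.maximum(rng.normal(size=(576, 64)), 0.0)

# Text-to-image attention map over patches, normalized to sum to 1.
attn = rng.random(576)
attn /= attn.sum()

naive = patch_acts.mean(axis=0)        # B: uniform average over patches
weighted = attn @ patch_acts           # C: attention-weighted average

top_naive = int(naive.argmax())        # top-ranked concept, method B
top_weighted = int(weighted.argmax())  # top-ranked concept, method C
```

With real attention maps, method C promotes latents that fire on the patches the text tokens actually attend to, which is why the two methods can surface different top-ranked concepts.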
Raw values are difficult to analyze, so we normalize the matrix column-wise, which reveals whether a latent is attended to by specific tokens or uniformly across all tokens. We then cluster and sort the latents by their column-wise activation correlation using hierarchical clustering.
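This preprocessing can be sketched as follows, using a synthetic token-by-latent activation matrix (the data and linkage settings are assumptions for illustration):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, leaves_list

rng = np.random.default_rng(5)
M = np.maximum(rng.normal(size=(20, 12)), 0.0)   # tokens x latents

# Column-wise normalization: a peaked column means the latent is tied to
# specific tokens; a flat column means it responds to tokens uniformly.
col_max = M.max(axis=0, keepdims=True)
M_norm = M / np.maximum(col_max, 1e-8)

# Reorder latents by hierarchical clustering on the correlation between
# their activation columns, so correlated latents sit adjacent in the
# heatmap.
Z = linkage(M_norm.T, method="average", metric="correlation")
order = leaves_list(Z)                           # latent display order
M_sorted = M_norm[:, order]
```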
Ablating latent clusters corresponding to distinct semantic topics removes those topics from the generated output, demonstrating their causal role in shaping the model's predictions.
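The ablation itself is a simple intervention: zero every latent in the selected cluster before decoding, so the corresponding concept is absent from the activation the model sees. A hedged sketch with synthetic weights and cluster labels:

```python
import numpy as np

rng = np.random.default_rng(6)
n_latents, d_model = 64, 32
W_dec = rng.normal(size=(n_latents, d_model))    # hypothetical decoder
z = np.maximum(rng.normal(size=(n_latents,)), 0.0)
cluster_labels = rng.integers(0, 4, size=n_latents)

def ablate_cluster(z, labels, cluster_id):
    """Zero all latents belonging to the given cluster."""
    z_ab = z.copy()
    z_ab[labels == cluster_id] = 0.0
    return z_ab

# Decode the ablated latents back into model space; generation then
# proceeds with the cluster's concept removed.
h_ablated = ablate_cluster(z, cluster_labels, cluster_id=2) @ W_dec
```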