# Diagnosing Knowledge Conflict in Multimodal Long-Chain Reasoning
Jing Tang \( {}^{1 * } \) Kun Wang \( {}^{2 * } \) Haolang Lu \( {}^{3 * } \) Hongjin Chen \( {}^{3} \) KaiTao Chen \( {}^{3} \) Zhongxiang Sun \( {}^{4} \) Qiankun Li \( {}^{2} \) Lingjuan Lyu \( {}^{5} \) Guoshun Nan \( {}^{3} \) Zhigang Zeng \( {}^{1} \)
jingtang@hust.edu.cn wang.kun@ntu.edu.sg luhaolang@bupt.edu.cn
## Abstract
Multimodal large language models in long chain-of-thought reasoning often fail when different knowledge sources provide conflicting signals. We formalize these failures under a unified notion of knowledge conflict, distinguishing input-level objective conflict from process-level effective conflict. Through probing internal representations, we reveal that: (I) Linear Separability: different conflict types are explicitly encoded as linearly separable features rather than entangled; (II) Depth Localization: conflict signals concentrate in mid-to-late layers, indicating a distinct processing stage for conflict encoding; (III) Hierarchical Consistency: aggregating noisy token-level signals along trajectories robustly recovers input-level conflict types; and (IV) Directional Asymmetry: reinforcing the model's implicit source preference under conflict is far easier than enforcing the opposite source. Our findings provide a mechanism-level view of multimodal reasoning under knowledge conflict and enable principled diagnosis and control of long-CoT failures. Code is available at anonymous link.
## 1. Introduction
Multimodal large language models (MLLMs) (Jin et al., 2025; Caffagni et al., 2024; Zhang et al., 2024a) have made substantial progress in visual understanding (Tong et al., 2024a; Ghatkesar et al., 2025; Ma et al., 2025), textual reasoning (Wang et al., 2024; Du et al., 2025; Mirzadeh et al., 2025), and cross-modal alignment (Yu et al., 2024; Yan et al., 2025; Yu et al., 2025), enabling complex perception-reasoning-decision workflows. A defining capability is long-form reasoning: beyond producing answers, these models can generate extended chains-of-thought (CoT) (Wang et al., 2025b; Yue et al., 2025) that support challenging multi-step tasks. However, recent work increasingly documents failures under mutually contradictory evidence or constraints: models may ignore explicit instructions (Wang et al., 2025a; Zhao et al., 2025), privilege the wrong evidential source (Guan et al., 2024; Liu et al., 2025b), or yield plausible yet goal-inconsistent conclusions (Fanous et al., 2025). These observations suggest that a key bottleneck in multimodal reasoning is not always missing information, but reliable decision-making under conflicting signals.
Building on these observations, prior work (Zhang et al., 2024c; Lu et al., 2024) has characterized abnormal behavior under conflicting signals from several largely independent angles. In retrieval-augmented generation, a central question is whether models remain faithful to retrieved evidence or drift toward parametric priors (Wu et al., 2024). In vision settings with counterfactual or commonsense-violating inputs, MLLMs are often found to underweight visual evidence and default to "reasonable" answers that match world knowledge (Tong et al., 2024b; Liu et al., 2025c). In high-stakes domains, studies further report over-accommodation to user assertions, which can pull predictions away from the underlying evidence (Sharma et al., 2024). Although these lines of work differ in tasks, datasets, and evaluation criteria, their failure modes are strikingly similar: when information sources disagree, models do not reliably follow the appropriate basis for a decision, and instead exhibit unstable, hard-to-control trade-offs across sources.
In this paper, we take a unified view that these phenomena arise from knowledge conflict in multimodal reasoning. When generating tokens, MLLMs jointly rely on multiple knowledge sources, including visual evidence, textual instructions and contextual constraints, and parametric priors stored in the model weights (Han et al., 2025; Liu et al., 2024a; Karamcheti et al., 2024). When these sources provide inconsistent signals for the same goal, the model must resolve which source to follow. Importantly, the resulting failures are not fabrications from missing knowledge, but incorrect source selection under conflict: the model may have access to competing plausible cues yet follow the wrong basis. Accordingly, our focus is not the act of answer generation itself, but whether conflict-induced failures can be localized, measured, and mechanistically tested.
---
\( {}^{1} \) Huazhong University of Science and Technology \( {}^{2} \) Nanyang Technological University \( {}^{3} \) Beijing University of Posts and Telecommunications \( {}^{4} \) Renmin University of China \( {}^{5} \) Sony AI, Zurich, Switzerland. Correspondence to: Guoshun Nan <nan-guoshun@gmail.com>, Zhigang Zeng <zgzeng@hust.edu.cn>.
Preprint. February 17, 2026.
---
Multimodal long-CoT reasoning (Ni et al., 2025) makes this problem sharper by unfolding decisions over many steps, with the internal reasoning state evolving over time. Under this setting, knowledge conflict can be triggered at any point and modality along the trajectory rather than only at the final answer. Once a step commits to the wrong basis, subsequent steps may continue from that premise in a locally coherent manner, eventually producing a globally incorrect conclusion (Zhang et al., 2024b). More challenging, such deviations are often masked by fluent rationales (Turpin et al., 2023), making it difficult to infer when the conflict emerged, what triggered it, and how it propagated from the final output alone. Understanding and correcting failures in long-CoT therefore requires step-level tools that can expose the underlying conflict dynamics.
In this work, we make the following contributions:

* We diagnose knowledge conflict dynamics on 7,500+ long-CoT trajectories from an objective conflict benchmark, where effective conflicts are activated in 78-90% of samples.
* Through layer-wise analysis of three models, we identify a depth-dependent conflict encoding stage. Using streaming probes to detect token-level conflict states, we find they exhibit high linear separability (93.2-98.8% AUC, 76.9-97.8% Recall@0.1), revealing them as explicit, decodable features.
* We employ three pluggable methods for intervention. These methods can either steer model outputs toward selected directions, reducing conflict frequency by up to 80%, or suppress high-confidence errors by up to 55%.
## 2. Related Work
Knowledge Conflict. Research on knowledge conflicts has identified three primary sources: conflicts between internal priors and visual information (Liu et al., 2025b; Du et al., 2025) or textual inputs (Zhang et al., 2025a; Su et al., 2024), and conflicts between visual and textual modalities (Deng et al., 2025). Building on these findings, significant efforts have been made to mitigate such conflicts through advanced strategies (Xie et al., 2024; Guo et al., 2024), including knowledge editing (Tan et al., 2024; Zhang et al., 2025d; Cheng et al., 2024; Chen et al., 2025) and retrieval augmentation (Huo et al., 2025; Zhang et al., 2025b; Li et al., 2025). These approaches have demonstrated potential in enhancing model faithfulness and reliability (Huang et al., 2025b; An et al., 2025; Shi et al., 2024; Zhang et al., 2024d; Lu et al., 2025). Although the above evidence suggests that conflicts are coupled and multi-source, existing solutions remain fragmented across modalities and fail to model conflicts holistically, thereby limiting their applicability in complex settings.
Probe Detection. Investigating internal states via probe detection is a developing field, yet the history of probing in LLMs (Kahana et al., 2025) provides clear precedents. Notably, the evolution of probe detection primarily centers on hallucination and faithfulness (Feng et al., 2025; Yi et al., 2025). Core techniques, such as linear probe generators (Kahana et al., 2025) and propositional probes (Feng et al., 2025), have inspired analogous approaches in watermark identification (Liu et al., 2025a), reward maximization (Li et al., 2024), and combinatorial optimization (Zhang et al., 2025e). However, these approaches predominantly focus on single-modal issues or specific downstream tasks, leaving the detection and localization of multimodal knowledge conflicts largely unexplored. Inspired by this, we introduce a specialized probe detection framework to identify the three sources of knowledge conflicts in MLLMs.

Figure 1. Overview of Knowledge Sources and Conflict Types. We categorize knowledge into Visual \( \left( {\mathcal{K}}_{\text{ vision }}\right) \) , Textual \( \left( {\mathcal{K}}_{\text{ text }}\right) \) , and Parametric Prior \( \left( {\mathcal{K}}_{\text{ prior }}\right) \) . Knowledge conflicts arise when factual statements from different sources act as incompatible signals. We define three primary conflict types: Vision-Text \( \left( {\mathcal{C}}_{\mathrm{{VT}}}\right) \) , Vision-Prior \( \left( {\mathcal{C}}_{\mathrm{{VP}}}\right) \) , and Prior-Text \( \left( {\mathcal{C}}_{\mathrm{{PT}}}\right) \) .
## 3. Conflict in Multimodal Reasoning
### 3.1. Knowledge Sources and Pairwise Conflicts
We consider a multimodal long-CoT reasoning task with input \( x = \left( {{X}_{V},{X}_{T}}\right) \) , where \( {X}_{V} \) denotes the visual input and \( {X}_{T} \) the textual input. Given a multimodal generative model \( {M}_{\theta } \) , reasoning unfolds as a sequence of tokens \( \tau \left( x\right) = \left( {{y}_{1},{y}_{2},\ldots ,{y}_{T}}\right) \) , with each token sampled as
\[
{y}_{t} \sim {M}_{\theta }\left( {\cdot \mid x,{y}_{ < t}}\right) . \tag{1}
\]
We denote the internal state at step \( t \) by
\[
{\mathbf{h}}_{t} = {f}_{\theta }\left( {x,{y}_{ < t}}\right) , \tag{2}
\]
where \( {f}_{\theta } \) denotes the model’s hidden representation extraction, i.e., the forward pass up to a specified layer.
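The sampling process in Eq. (1) and the per-step state extraction in Eq. (2) can be sketched with a toy stand-in model. The architecture, vocabulary size, hidden width, and the crude mean-pooling over the context below are illustrative assumptions, not the actual MLLM forward pass:

```python
import numpy as np

# Toy stand-in for M_theta whose forward pass exposes per-layer hidden
# states h_t^(l), mirroring Eq. (1)-(2). All sizes are illustrative.
rng = np.random.default_rng(0)
VOCAB, DIM, LAYERS = 50, 16, 4
EMB = rng.normal(size=(VOCAB, DIM))
W = [rng.normal(scale=0.3, size=(DIM, DIM)) for _ in range(LAYERS)]
OUT = rng.normal(size=(DIM, VOCAB))

def forward(tokens):
    """Return hidden states at the last position, shape (LAYERS+1, DIM)."""
    h = EMB[tokens].mean(axis=0)          # crude pooling over x, y_{<t}
    states = [h]
    for w in W:
        h = np.tanh(h @ w)                # one "layer" of the forward pass
        states.append(h)
    return np.stack(states)

def sample_trajectory(prompt, steps):
    """y_t ~ M_theta(. | x, y_{<t}); record h_t^(l) at every step t."""
    tokens, hidden = list(prompt), []
    for _ in range(steps):
        states = forward(np.array(tokens))
        hidden.append(states)             # h_t^(l) for l = 0..LAYERS
        logits = states[-1] @ OUT
        p = np.exp(logits - logits.max()); p /= p.sum()
        tokens.append(int(rng.choice(VOCAB, p=p)))
    return tokens, np.stack(hidden)       # (steps, LAYERS+1, DIM)

tokens, hidden = sample_trajectory([1, 2, 3], steps=5)
```

In a real MLLM the same interface is obtained by reading out the residual stream at a chosen layer for each generated token.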
To analyze how factual inconsistencies arise during reasoning, we abstract the knowledge available to the model into three sources, \( \mathcal{K} = \left\{ {{\mathcal{K}}_{\text{ vision }},{\mathcal{K}}_{\text{ text }},{\mathcal{K}}_{\text{ prior }}}\right\} \) .
Table 1. Output-level conflict profile across models (objective conflict subsets). We present statistics of generated trajectories under three types of conflict (model details in Appendix B). Metrics reported include sample count, average CoT length, average conflict spans per sample (spans are contiguous conflict segments identified via an automated LLM annotation pipeline and may consist of one or more tokens), conflict token density (proportion of conflicting tokens), and sample conflict rate (% of samples exhibiting effective conflict).
<table><tr><td rowspan="2">Metric</td><td colspan="4">Llama-3.2V-11B-cot</td><td colspan="4">R1-Onevision-7B</td><td colspan="4">Ocean-R1-7B-Instruct</td></tr><tr><td>\( {\mathcal{C}}_{\mathrm{{VP}}}^{o} = 1 \)</td><td>\( {\mathcal{C}}_{\mathrm{{VT}}}^{o} = 1 \)</td><td>\( {\mathcal{C}}_{\mathrm{{PT}}}^{o} = 1 \)</td><td>All</td><td>\( {\mathcal{C}}_{\mathrm{{VP}}}^{o} = 1 \)</td><td>\( {\mathcal{C}}_{\mathrm{{VT}}}^{o} = 1 \)</td><td>\( {\mathcal{C}}_{\mathrm{{PT}}}^{o} = 1 \)</td><td>All</td><td>\( {\mathcal{C}}_{\mathrm{{VP}}}^{o} = 1 \)</td><td>\( {\mathcal{C}}_{\mathrm{{VT}}}^{o} = 1 \)</td><td>\( {\mathcal{C}}_{\mathrm{{PT}}}^{o} = 1 \)</td><td>All</td></tr><tr><td>Samples</td><td>749</td><td>1012</td><td>803</td><td>2564</td><td>724</td><td>993</td><td>769</td><td>2486</td><td>640</td><td>1026</td><td>807</td><td>2473</td></tr><tr><td>Avg. CoT length (tokens)</td><td>326.79</td><td>1768.85</td><td>238.50</td><td>868.32</td><td>706.85</td><td>790.63</td><td>558.97</td><td>694.57</td><td>488.15</td><td>711.26</td><td>302.97</td><td>520.28</td></tr><tr><td>Avg. conflict spans per sample</td><td>2.69</td><td>6.20</td><td>4.04</td><td>4.50</td><td>3.66</td><td>6.73</td><td>7.02</td><td>5.93</td><td>8.68</td><td>9.00</td><td>5.43</td><td>7.75</td></tr><tr><td>Conflict token density (%)</td><td>4.92</td><td>1.65</td><td>11.25</td><td>5.61</td><td>3.20</td><td>2.16</td><td>7.68</td><td>4.17</td><td>8.70</td><td>3.23</td><td>11.77</td><td>7.43</td></tr><tr><td>Conflict Sample Ratio (%)</td><td>63.68</td><td>82.21</td><td>86.43</td><td>78.12</td><td>59.67</td><td>85.90</td><td>87.91</td><td>78.88</td><td>88.75</td><td>90.25</td><td>89.34</td><td>89.57</td></tr></table>
Here, \( {\mathcal{K}}_{\text{ vision }} \) consists of facts supported by the visual input \( {X}_{V} \), \( {\mathcal{K}}_{\text{ text }} \) consists of facts constrained by the textual input \( {X}_{T} \), and \( {\mathcal{K}}_{\text{ prior }} \) denotes parametric prior knowledge implicitly encoded in the model parameters \( \theta \).
For each knowledge source \( {\mathcal{K}}_{ * } \in \mathcal{K} \) , we represent its supported factual content as a set of atomic factual statements \( F\left( {\mathcal{K}}_{ * }\right) \) , where each element \( \psi \in F\left( {\mathcal{K}}_{ * }\right) \) corresponds to an indivisible factual judgment. We use \( {\psi }_{a} \bot {\psi }_{b} \) to denote that two facts are semantically incompatible, i.e., they cannot simultaneously be true under the given context.
Based on this notion, we define a pairwise knowledge conflict between two sources \( {\mathcal{K}}_{i} \) and \( {\mathcal{K}}_{j}\left( {i \neq j}\right) \) as the set of incompatible fact pairs:
\[
{\mathcal{C}}_{i, j} = \left\{ {\left( {{\psi }_{i},{\psi }_{j}}\right) \mid {\psi }_{i} \in F\left( {\mathcal{K}}_{i}\right) ,{\psi }_{j} \in F\left( {\mathcal{K}}_{j}\right) ,{\psi }_{i} \bot {\psi }_{j}}\right\} . \tag{3}
\]
In this work, we focus on three primary pairwise conflict types induced by the three knowledge sources: Vision-Prior \( \left( {\mathcal{C}}_{\mathrm{{VP}}}\right) \) , Vision-Text \( \left( {\mathcal{C}}_{\mathrm{{VT}}}\right) \) , and Prior-Text \( \left( {\mathcal{C}}_{\mathrm{{PT}}}\right) \) .
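Under the simplifying assumption that atomic facts are plain strings and incompatibility is given by an explicit relation (in practice it would be judged semantically, not by lookup), the pairwise conflict set of Eq. (3) can be sketched as:

```python
# Hand-listed psi_a ⊥ psi_b pairs; the statements are illustrative assumptions.
INCOMPATIBLE = {
    ("the traffic light is red", "the traffic light is green"),
    ("the answer is 7", "the answer is 9"),
}

def incompatible(a, b):
    """Symmetric incompatibility test standing in for psi_a ⊥ psi_b."""
    return (a, b) in INCOMPATIBLE or (b, a) in INCOMPATIBLE

def conflict_set(F_i, F_j):
    """C_{i,j}: all incompatible fact pairs across two knowledge sources (Eq. 3)."""
    return {(a, b) for a in F_i for b in F_j if incompatible(a, b)}

K_vision = {"the traffic light is red", "there are two cars"}
K_text = {"the traffic light is green", "there are two cars"}
C_VT = conflict_set(K_vision, K_text)
# An objective conflict of this type exists iff C_VT is non-empty.
```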
### 3.2. Objective vs. Effective Conflict
As illustrated in Figure 1, we distinguish between two related but fundamentally different notions: objective conflict, which is defined at the input level, and effective conflict, which manifests as a process-level state during reasoning.
Objective Conflict describes factual inconsistency induced by the input and the model's parametric priors, independent of any particular reasoning trajectory. Given a conflict type \( {\mathcal{C}}_{i, j} \in \left\{ {{\mathcal{C}}_{\mathrm{{VP}}},{\mathcal{C}}_{\mathrm{{VT}}},{\mathcal{C}}_{\mathrm{{PT}}}}\right\} \) , we define a binary variable \( {\mathcal{C}}_{i, j}^{o}\left( x\right) \in \{ 0,1\} \) to indicate whether the input \( x \) exhibits an objective conflict of type \( {\mathcal{C}}_{i, j} \) . For example, \( {\mathcal{C}}_{\mathrm{{VP}}}^{o}\left( x\right) = 1 \) indicates that the visual evidence \( {X}_{V} \) contradicts the parametric prior knowledge encoded in \( \theta \) with respect to a specific fact. By definition, \( {\mathcal{C}}_{i, j}^{o}\left( x\right) \) depends only on the factual relations supported by the input \( x \) and the model priors, and does not reference the reasoning process itself.
Importantly, the presence of an objective conflict does not by itself determine whether the model will engage with this conflict during inference. From the input-level specification alone, it is not directly inferable whether, when, or how a given conflict influences the model's internal reasoning dynamics. This gap motivates a process-level notion that captures conflict activation within the model.
Effective Conflict characterizes whether an objective conflict is actually triggered during reasoning and reflected in the model’s internal state. Concretely, we use \( {\mathcal{C}}_{i, j}^{e}\left( {t \mid x}\right) \in \; \{ 0,1\} \) to indicate whether, at reasoning step \( t \) , the model relies on mutually incompatible factual information of type \( {\mathcal{C}}_{i, j} \) . Here, \( {\mathcal{C}}_{i, j}^{e}\left( {t \mid x}\right) = 1 \) means that the corresponding conflict is active and influences the current reasoning step, as encoded in the internal state at that step.
The relationship between the two notions is asymmetric:
\[
\mathbb{P}\left( {{\mathcal{C}}_{i, j}^{e}\left( {t \mid x}\right) = 1 \mid {\mathcal{C}}_{i, j}^{o}\left( x\right) = 1}\right) < 1. \tag{4}
\]
That is, objective conflict captures whether a conflict exists at the input level, whereas effective conflict captures whether and when that conflict is activated in the model's internal state during reasoning. The former is induced jointly by the input and priors, while the latter is both model-dependent and process-dependent.
Objective conflict data construction. For mechanistic analysis, we construct an objective-conflict benchmark with isolated pairwise conflicts, where each example contains exactly one conflict type (VP, VT, or PT) and is intended to elicit effective conflict states. This setting is designed as a diagnostic stress-test of conflict arbitration under contradiction, rather than an estimate of in-the-wild conflict prevalence. For each input \( x \), we generate a long-CoT trajectory and align the input-level labels \( {\mathcal{C}}_{i, j}^{o}\left( x\right) \) with step-level effective conflict signals \( {\left\{ {\mathcal{C}}_{i, j}^{e}\left( t \mid x\right) \right\} }_{t = 1}^{T} \) inferred from the model outputs. Table 1 reports conflict activation statistics for this benchmark. Full details are provided in Appendix A.
## 4. Probing Conflict from Internal States
In Section 3, we formalize knowledge conflict as an input-level \( {\mathcal{C}}_{i, j}^{o}\left( x\right) \) and a process-level \( {\mathcal{C}}_{i, j}^{e}\left( {t \mid x}\right) \) . Moving forward, this section addresses the core question: Is \( {\mathcal{C}}_{i, j}^{e}\left( {t \mid x}\right) \) reflected in the model's internal states, and can it be identified in a streaming manner during generation?
### 4.1. Token-level Probing of Knowledge Conflict
We construct a streaming detector: when generating the \( t \) -th token, it determines whether an effective conflict \( {\mathcal{C}}_{i, j}^{e}\left( {t \mid x}\right) \) is triggered based solely on the hidden state \( {\mathbf{h}}_{t}^{\left( l\right) } \) . While prior work has employed probes for binary hallucination detection (Obeso et al., 2025), we extend this to a four-class classification task based on the definition in Section 3.2.
Here, we use \( z = 0 \) as the label indicating that no conflict is triggered (i.e., \( {\mathcal{C}}_{i, j}^{e}\left( {t \mid x}\right) = 0,\forall {\mathcal{C}}_{i, j} \) ), while \( z \in \{ 1,2,3\} \) corresponds to the active states of the specific pairwise knowledge conflicts \( {\mathcal{C}}_{i, j}^{e}\left( {t \mid x}\right) = 1 \) , namely \( {\mathcal{C}}_{\mathrm{{VP}}},{\mathcal{C}}_{\mathrm{{PT}}} \) , and \( {\mathcal{C}}_{\mathrm{{VT}}} \) .
Formally, we define a probe \( {f}_{\phi } \) that maps hidden states to a probability distribution over conflict labels:
\[
{P}_{\phi }\left( {z \mid {\mathbf{h}}_{t}^{\left( l\right) }}\right) = \operatorname{Softmax}\left( {{f}_{\phi }\left( {\mathbf{h}}_{t}^{\left( l\right) }\right) }\right) , z \in \{ 0,1,2,3\} . \tag{5}
\]
The supervision signal for training \( {f}_{\phi } \) comes from the span-level assertion annotations constructed in Table 1. We project the label of each annotated span to all its constituent tokens to obtain the dense label sequence \( \left\{ {z}_{t}\right\} \) .
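The span-to-token label projection can be sketched as follows. The span format (start, end, label), with an exclusive end index, is our own assumption about the annotation layout; labels follow the Section 4.1 convention (0 = no conflict, 1/2/3 = pairwise conflict types):

```python
def spans_to_token_labels(num_tokens, spans):
    """Project span annotations (start, end, label) to a dense label
    sequence {z_t}; tokens outside every span keep the background label 0."""
    z = [0] * num_tokens
    for start, end, label in spans:
        for t in range(start, min(end, num_tokens)):
            z[t] = label
    return z

# A 12-token trajectory with one C_PT span (label 2) and one C_VT span (label 3).
z = spans_to_token_labels(12, [(3, 6, 2), (9, 11, 3)])
```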
Since conflict tokens are extremely sparse in long-CoT, we train the probe using a weighted cross-entropy objective:
\[
{\mathcal{L}}_{\text{ probe }} = - \mathop{\sum }\limits_{t}{w}_{t}\log {P}_{\phi }\left( {{z}_{t} \mid {\mathbf{h}}_{t}^{\left( l\right) }}\right) , \tag{6}
\]
where \( {w}_{t} \) is a sample weight that assigns higher weight to \( z \in \{ 1,2,3\} \) (i.e., tokens where knowledge conflict \( {\mathcal{C}}_{i, j} \) occurs), preventing the probe from degenerating into predicting only the no-conflict background class. This objective allows the probe to maintain overall stability while remaining sufficiently sensitive to critical conflict-triggering moments. Full training details are provided in Appendix C.
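A minimal numpy sketch of the weighted objective in Eq. (6), using a linear probe. The specific 10x up-weighting of conflict tokens and the label distribution are illustrative assumptions; the actual training details are in Appendix C:

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, CLASSES = 32, 4
phi = rng.normal(scale=0.1, size=(DIM, CLASSES))    # linear probe weights f_phi

def probe_loss(H, z, conflict_weight=10.0):
    """Weighted cross-entropy of Eq. (6).
    H: (T, DIM) hidden states h_t^(l); z: (T,) labels in {0,1,2,3}."""
    logits = H @ phi
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    P = np.exp(logits)
    P /= P.sum(axis=1, keepdims=True)                # softmax over 4 classes
    w = np.where(z > 0, conflict_weight, 1.0)        # w_t: up-weight conflicts
    return -(w * np.log(P[np.arange(len(z)), z])).sum()

H = rng.normal(size=(100, DIM))
z = rng.choice(4, size=100, p=[0.9, 0.04, 0.03, 0.03])  # sparse conflict labels
loss = probe_loss(H, z)
```

Gradient descent on `phi` under this loss would complete the probe; only the objective itself is shown here.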
### 4.2. Verifying the Separability of Knowledge Conflicts
We evaluate whether the probe reliably diagnoses knowledge conflicts from internal states. Specifically, we examine the token-level separability of effective conflicts and whether their sample-level aggregation recovers the objective conflict types.

Figure 2. Token-level separability of effective conflict \( {\mathcal{C}}_{i, j}^{e}\left( {t \mid x}\right) \) . The left panel shows the confusion matrix over token-level conflict predictions. The right panels decompose performance into binary detection of conflict versus no-conflict, and fine-grained attribution among conflict types. Values denote row-normalized recall.
(I) Separability of Effective Conflicts: Local Signals in Sparse Regimes. We first examine whether the probe can distinguish different types of effective conflicts \( {\mathcal{C}}_{i, j}^{e}\left( {t \mid x}\right) \) from the model's internal states during reasoning.
As shown in Figure 2, the probe demonstrates robust discrimination capabilities. In the binary detection stage (Stage I), the model achieves a high True Negative rate of 88.7%, effectively filtering out non-conflicting steps. Conversely, a False Negative rate of 46.6% is observed, primarily driven by semantic sparsity within conflict spans, where 67.1% of \( {\mathcal{C}}_{\mathrm{{VP}}} \) tokens are misclassified as non-conflicting due to weak local signals. However, once effective conflict is activated (Stage II), the separability between conflict types sharply increases: \( {\mathcal{C}}_{\mathrm{{PT}}} \) and \( {\mathcal{C}}_{\mathrm{{VT}}} \) achieve near-perfect identification accuracies of 99.4% and 94.8%, respectively. Even \( {\mathcal{C}}_{\mathrm{{VP}}} \) , the most subtle type, sees its recognition accuracy jump from 26.6% in the global view to 80.7% in the conditioned view. The minimal off-diagonal confusion ( \( < 1\% \) between PT and others) confirms that effective conflict types possess distinct, highly separable internal representations.
Conclusion (Local Effective Conflicts): Even under extreme sparsity and noise, different types of effective knowledge conflicts \( {\mathcal{C}}_{i, j}^{e}\left( {t \mid x}\right) \) give rise to distinct local structures in the model's internal states that can be reliably captured by the probe. This validates the feasibility of streaming diagnosis of effective conflicts while revealing differences in their intrinsic detectability.

Figure 3. Sample-level separability of conflict types. We visualize the t-SNE projection of hidden states at layer 20 (R1-Onevision) and layer 39 (Llama-3.2V). The three conflict categories are colored according to their Objective Conflict labels, pre-defined during dataset construction. The top-right confusion matrices illustrate the sample-level attribution performance.
(II) Alignment to Objective Conflicts: Aggregating Effective Signals. We next examine whether aggregating local effective conflicts \( {\mathcal{C}}_{i, j}^{e}\left( {t \mid x}\right) \) along a reasoning trajectory recovers the corresponding objective conflict \( {\mathcal{C}}_{i, j}^{o}\left( x\right) \) defined at the input level. This analysis evaluates the robustness of effective conflict signals beyond individual steps.
For each long-CoT trajectory, we aggregate hidden states of activated effective conflicts via mean pooling to obtain a sample-level representation. We visualize these representations using t-SNE (Figure 3), where samples sharing the same objective conflict type form compact clusters that are well separated, indicating consistent global structure.
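The mean-pooling step above can be sketched as follows; the shapes and the fallback for trajectories with no activated conflict are our own assumptions:

```python
import numpy as np

def sample_representation(H, z):
    """Mean-pool hidden states at steps where an effective conflict is
    active (z_t > 0) to obtain one sample-level vector.
    H: (T, DIM) hidden states; z: (T,) predicted conflict labels."""
    mask = np.asarray(z) > 0
    if not mask.any():            # no activated conflict: pool all steps
        return H.mean(axis=0)
    return H[mask].mean(axis=0)

rng = np.random.default_rng(0)
H = rng.normal(size=(8, 4))
rep = sample_representation(H, [0, 0, 2, 2, 0, 3, 0, 0])
```

The resulting vectors are what a t-SNE projection (as in Figure 3) would be run over.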

Figure 4. Cross-layer distribution of conflict signals. Top row: attention-head activation ratio on conflict tokens vs. no-conflict tokens (lines), and their difference (bars), computed using effective conflict labels. Middle/bottom rows: layer-wise probe performance (one-vs-rest AUC and Recall@0.1) for \( {\mathcal{C}}_{\mathrm{{VP}}},{\mathcal{C}}_{\mathrm{{PT}}},{\mathcal{C}}_{\mathrm{{VT}}} \) across three MLLM backbones.
Quantitatively, we infer the objective conflict type by aggregating stepwise effective conflict activations:
\[
{\widehat{\mathcal{C}}}_{\text{ sample }} = \arg \mathop{\max }\limits_{{\mathcal{C}}_{i, j}}\mathop{\sum }\limits_{{t = 1}}^{T}\mathbb{I}\left\lbrack {{\mathcal{C}}_{i, j}^{e}\left( {t \mid x}\right) = 1}\right\rbrack . \tag{7}
\]
Comparing \( {\widehat{\mathcal{C}}}_{\text{ sample }} \) with the ground-truth objective labels \( {\mathcal{C}}_{i, j}^{o}\left( x\right) \) directly tests whether the model’s internal conflict aligns with the conflict structure inherent in the input.
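Eq. (7) amounts to a majority vote over stepwise activations, which can be sketched in a few lines. The string labels and the zero return value for conflict-free trajectories are illustrative assumptions:

```python
from collections import Counter

def aggregate_sample_label(step_labels):
    """Eq. (7): pick the conflict type with the most stepwise activations.
    step_labels: per-token labels, 0 = none, else a conflict type tag."""
    counts = Counter(l for l in step_labels if l != 0)
    return counts.most_common(1)[0][0] if counts else 0

pred = aggregate_sample_label([0, 0, "VT", "VT", 0, "VP", "VT", 0])
```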
As shown in the inset matrices of Figure 3, aggregation substantially enhances separability. Notably, \( {\mathcal{C}}_{\mathrm{{PT}}} \) achieves a perfect 100.0% on both R1-Onevision and Llama-3.2V, confirming that text-prior conflicts induce unique and stable shifts in internal states. The remaining confusion is largely confined to the visual-conflict types: for instance, 25.1% of \( {\mathcal{C}}_{\mathrm{{VT}}} \) samples in R1-Onevision are misclassified as \( {\mathcal{C}}_{\mathrm{{VP}}} \) , and \( {14.7}\% \) of \( {\mathcal{C}}_{\mathrm{{VP}}} \) samples in Llama-3.2V are misidentified as \( {\mathcal{C}}_{\mathrm{{VT}}} \) . This overlap is expected, as both categories involve failures in processing visual evidence, leading to partially shared representations.
### 4.3. Cross-Layer Distribution of Conflict Signals
We scan model depth to localize where effective knowledge conflicts are most strongly encoded. Concretely, for each layer \( l \) , we train the same token-level probe on hidden states \( {\mathbf{h}}_{t}^{\left( l\right) } \) and evaluate its one-vs-rest AUC / Recall@0.1 for \( \left\{ {{\mathcal{C}}_{\mathrm{{VP}}},{\mathcal{C}}_{\mathrm{{PT}}},{\mathcal{C}}_{\mathrm{{VT}}}}\right\} \) .
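The Recall@0.1 metric used in this scan (recall at the threshold whose false-positive rate is 0.1) can be sketched without any ML library; the interpolation-free thresholding below is an illustrative assumption:

```python
import numpy as np

def recall_at_fpr(scores, labels, fpr_budget=0.1):
    """Recall (TPR) at the largest threshold keeping FPR within budget.
    scores: per-token probe scores; labels: 1 = conflict class, 0 = rest."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    neg = np.sort(scores[labels == 0])[::-1]      # negative scores, high to low
    k = int(fpr_budget * len(neg))                # max false positives allowed
    thresh = neg[k] if k < len(neg) else -np.inf  # admits at most k negatives
    return float((scores[labels == 1] > thresh).mean())

rec = recall_at_fpr([0.9] * 10 + [0.1] * 10, [1] * 10 + [0] * 10)
```

Running this one-vs-rest for each conflict type and each layer yields the curves in Figure 4.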
Beyond probe separability, we also quantify a lightweight mechanistic correlate (Huang et al., 2025a): how attention-head activations differ between conflict and no-conflict token positions. Let \( {\mathcal{A}}^{\left( l\right) } \) denote the set of attention heads at layer \( l \) , and let \( {\mathbf{o}}_{t}^{\left( l, a\right) } \) be the output of head \( a \in {\mathcal{A}}^{\left( l\right) } \) at token \( t \) . We define token sets using effective conflict signals:
\[
{\mathcal{S}}_{\text{ conf }} = \{ \left( {x, t}\right) \mid \exists \left( {i, j}\right) ,{\mathcal{C}}_{i, j}^{e}\left( {t \mid x}\right) = 1\} , \tag{8}
\]

\[
{\mathcal{S}}_{\text{ nconf }} = \left\{ {\left( {x, t}\right) \mid \forall \left( {i, j}\right) ,{\mathcal{C}}_{i, j}^{e}\left( {t \mid x}\right) = 0}\right\} . \tag{9}
\]
The layer-wise head activation ratio on a token set \( \mathcal{S} \) is
\[
{R}^{\left( l\right) }\left( \mathcal{S}\right) = {\mathbb{E}}_{\left( {x, t}\right) \in \mathcal{S}}\frac{1}{\left| {\mathcal{A}}^{\left( l\right) }\right| }\mathop{\sum }\limits_{{a \in {\mathcal{A}}^{\left( l\right) }}}\mathbb{I}\left\lbrack {{\begin{Vmatrix}{\mathbf{o}}_{t}^{\left( l, a\right) }\end{Vmatrix}}_{2} > \gamma }\right\rbrack , \tag{10}
\]
where \( \gamma \) is a fixed activation threshold (details in Appendix C.3). We then report the activation drift
\[
\Delta {R}^{\left( l\right) } = {R}^{\left( l\right) }\left( {\mathcal{S}}_{\text{ conf }}\right) - {R}^{\left( l\right) }\left( {\mathcal{S}}_{\text{ nconf }}\right) , \tag{11}
\]
which measures how strongly attention activations shift when effective conflicts are triggered.
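Eqs. (10)-(11) can be sketched with simulated head outputs for a single layer; the head count, dimensions, conflict rate, and threshold gamma below are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
HEADS, DIM, TOKENS = 8, 16, 200
O = rng.normal(size=(TOKENS, HEADS, DIM))   # o_t^(l,a) for one layer l
conflict = rng.random(TOKENS) < 0.1         # effective-conflict token mask

def activation_ratio(O, mask, gamma):
    """R^(l)(S): mean fraction of heads with ||o||_2 > gamma over tokens in S
    (Eq. 10)."""
    norms = np.linalg.norm(O[mask], axis=-1)   # (|S|, HEADS) head-output norms
    return float((norms > gamma).mean())

gamma = 4.0
# Activation drift of Eq. (11): positive = enhancement, negative = suppression.
drift = activation_ratio(O, conflict, gamma) - activation_ratio(O, ~conflict, gamma)
```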
As shown in Figure 4, both measurements reveal distinct depth-dependent signatures. (I) Probe Separability: In 7B models (R1-Onevision, Ocean-R1), discrimination performance rises in early layers and peaks in the mid-to-late block (Layers 15-22), where AUC scores for \( {\mathcal{C}}_{\mathrm{{PT}}} \) and \( {\mathcal{C}}_{\mathrm{{VT}}} \) consistently exceed 93%, before declining in the final layers. Llama-3.2V pushes this saturation deeper, maintaining highly robust separability \( \left( { \geq {95}\% }\right) \) as deep as Layer 39. (II) Activation Drift: This aligns with attention shifts. R1-series models show negative drift (suppression) peaking at Layers 18-22, while Llama-3.2V displays positive drift (enhancement) in Layers 30-39. We term these co-located peaks (Layer 20 for 7B, 39 for 11B) the conflict encoding stage, anchoring our analysis.
---
Conclusion (Global Effective Conflicts): By aggregating stepwise effective conflict signals \( {\mathcal{C}}_{i, j}^{e}\left( {t \mid x}\right) \) along the reasoning trajectory, different objective conflict types \( {\mathcal{C}}_{i, j}^{o}\left( x\right) \) become clearly and robustly separable at the sample level. This indicates that effective conflicts are not merely local artifacts, but form consistent global patterns that reliably reflect the underlying input-level objective conflict structure.
---
Table 2. Assessment of conflict probe performance across three VLM backbones. We report AUC and Recall at FPR=0.1 (Rec@0.1) under the One-vs-Rest setting. Gray rows indicate the Span-Max aggregation, which consistently outperforms token-level baselines. Values are presented as percentages (%).
<table><tr><td rowspan="2">Models</td><td rowspan="2">Probe</td><td rowspan="2">Granularity</td><td colspan="4">AUC (%)</td><td colspan="4">Recall@0.1 (%)</td></tr>
<tr><td>w/o Conflict</td><td>\( {\mathcal{C}}_{\mathrm{{VP}}}^{o} = 1 \)</td><td>\( {\mathcal{C}}_{\mathrm{{PT}}}^{o} = 1 \)</td><td>\( {\mathcal{C}}_{\mathrm{{VT}}}^{o} = 1 \)</td><td>w/o Conflict</td><td>\( {\mathcal{C}}_{\mathrm{{VP}}}^{o} = 1 \)</td><td>\( {\mathcal{C}}_{\mathrm{{PT}}}^{o} = 1 \)</td><td>\( {\mathcal{C}}_{\mathrm{{VT}}}^{o} = 1 \)</td></tr>
<tr><td rowspan="6">R1-Onevision (7B)</td><td rowspan="3">Linear</td><td>All Token</td><td>81.7±0.1</td><td>86.3±0.2</td><td>92.0±0.1</td><td>94.8±0.2</td><td>50.0±0.3</td><td>56.8±0.2</td><td>75.1±0.1</td><td>87.3±0.3</td></tr>
<tr><td>Span Only</td><td>76.8±0.2</td><td>82.5±0.1</td><td>90.8±0.3</td><td>95.4±0.2</td><td>35.5±0.2</td><td>44.5±0.3</td><td>70.5±0.2</td><td>88.5±0.1</td></tr>
<tr><td>Span-Max</td><td>93.2±0.1</td><td>94.2±0.2</td><td>98.6±0.1</td><td>97.3±0.1</td><td>81.5±0.2</td><td>82.4±0.1</td><td>97.2±0.1</td><td>93.8±0.2</td></tr>
<tr><td rowspan="3">MLP</td><td>All Token</td><td>95.5±0.1</td><td>90.4±0.2</td><td>85.2±0.3</td><td>94.1±0.1</td><td>89.1±0.2</td><td>68.4±0.1</td><td>62.7±0.2</td><td>79.3±0.1</td></tr>
<tr><td>Span Only</td><td>95.7±0.2</td><td>86.1±0.3</td><td>80.3±0.1</td><td>93.3±0.2</td><td>89.8±0.1</td><td>53.0±0.2</td><td>43.7±0.2</td><td>76.8±0.3</td></tr>
<tr><td>Span-Max</td><td>97.3±0.1</td><td>94.5±0.1</td><td>93.2±0.2</td><td>99.1±0.1</td><td>93.4±0.2</td><td>82.4±0.1</td><td>82.1±0.1</td><td>98.7±0.2</td></tr>
<tr><td rowspan="6">Ocean-R1 (7B-Instruct)</td><td rowspan="3">Linear</td><td>All Token</td><td>83.0±0.2</td><td>90.6±0.1</td><td>94.2±0.2</td><td>94.9±0.1</td><td>53.7±0.3</td><td>69.4±0.1</td><td>81.3±0.2</td><td>85.6±0.1</td></tr>
<tr><td>Span Only</td><td>78.5±0.1</td><td>86.7±0.3</td><td>90.0±0.2</td><td>97.6±0.1</td><td>41.4±0.2</td><td>52.5±0.2</td><td>66.6±0.1</td><td>94.6±0.3</td></tr>
<tr><td>Span-Max</td><td>\( \mathbf{{95.0} \pm {0.2}} \)</td><td>95.9±0.1</td><td>98.6±0.1</td><td>98.8±0.1</td><td>85.7±0.1</td><td>87.9±0.2</td><td>97.1±0.1</td><td>97.8±0.2</td></tr>
<tr><td rowspan="3">MLP</td><td>All Token</td><td>95.5±0.1</td><td>92.8±0.1</td><td>85.0±0.2</td><td>95.5±0.1</td><td>87.1±0.2</td><td>75.6±0.3</td><td>61.6±0.1</td><td>85.2±0.2</td></tr>
<tr><td>Span Only</td><td>97.8±0.2</td><td>87.3±0.2</td><td>79.7±0.1</td><td>91.7±0.2</td><td>95.7±0.3</td><td>53.9±0.1</td><td>43.3±0.2</td><td>71.0±0.1</td></tr>
<tr><td>Span-Max</td><td>99.2</td><td>96.5±0.1</td><td>95.3±0.2</td><td>98.4±0.1</td><td>98.9±0.1</td><td>89.8±0.2</td><td>87.5±0.1</td><td>96.1±0.1</td></tr>
<tr><td rowspan="6">Llama-3.2V (11B-cot)</td><td rowspan="3">Linear</td><td>All Token</td><td>88.7±0.2</td><td>90.5±0.1</td><td>96.9±0.2</td><td>94.5±0.1</td><td>68.4±0.3</td><td>67.2±0.2</td><td>94.4±0.1</td><td>85.8±0.2</td></tr>
<tr><td>Span Only</td><td>79.6±0.2</td><td>85.8±0.2</td><td>90.2</td><td>95.2±0.3</td><td>43.2±0.1</td><td>51.1±0.2</td><td>66.0±0.2</td><td>88.4</td></tr>
<tr><td>Span-Max</td><td>\( \mathbf{{93.9} \pm {0.1}} \)</td><td>93.4</td><td>98.4</td><td>97.2±0.1</td><td>83.5±0.2</td><td>\( \mathbf{{76.9} \pm {0.1}} \)</td><td>96.1±0.2</td><td>93.1±0.1</td></tr>
<tr><td rowspan="3">MLP</td><td>All Token</td><td>95.8±0.2</td><td>90.7±0.1</td><td>88.7±0.2</td><td>96.9±0.1</td><td>89.4±0.1</td><td>64.3±0.2</td><td>70.6±0.3</td><td>93.7±0.2</td></tr>
<tr><td>Span Only</td><td>96.1</td><td>85.5±0.3</td><td>79.2±0.2</td><td>89.2±0.1</td><td>90.8±0.2</td><td>46.7±0.1</td><td>40.5±0.2</td><td>65.2±0.1</td></tr>
<tr><td>Span-Max</td><td>97.2</td><td>94.5±0.2</td><td>\( \mathbf{{93.4} \pm {0.1}} \)</td><td>97.8</td><td>93.2±0.1</td><td>82.3±0.2</td><td>82.3±0.1</td><td>94.4±0.2</td></tr></table>
Conclusion (Layer-level): Layer-scanning reveals that both probe separability and attention drift co-localize in a specific mid-to-late layer band across all three MLLM backbones. This indicates that conflict-related signals are depth-dependent and concentrated in a distinct "conflict encoding stage," bridging early perception and late decoding rather than being uniformly distributed across the network.
### 4.4. Linearity of Conflict Representation
To comprehensively assess the nature of the effective conflict signals \( {\mathcal{C}}_{i, j}^{e}\left( {t \mid x}\right) \) encoded in the hidden states \( {\mathbf{h}}_{t}^{\left( l\right) } \) (specifically, whether they are explicitly linear or highly entangled), we conducted experiments on the layers identified as the "Conflict Encoding Stage" in Section 4.3. We designed two probe architectures with distinct underlying assumptions: (I) Linear Probe \( \left( {f}_{lin}\right) \), consisting of a single projection layer \( \mathbf{W} \in {\mathbb{R}}^{d \times 4} \) (where \( d \) denotes the hidden-state dimension), which evaluates the Linear Separability of conflict states: high classification accuracy with a linear mapping would indicate that the model has formed clear, decoupled conflict boundaries at that layer. (II) MLP Probe \( \left( {f}_{mlp}\right) \), which assesses Non-linear Entanglement: recognizing the potential manifold complexity of deep Transformer features, we construct a deep MLP with three dimension-reducing layers \( \left( {{1024} \rightarrow {512} \rightarrow {256}}\right) \) and ReLU activations to capture high-order interaction features.
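The two probe heads can be sketched as follows; this is a minimal NumPy illustration (the softmax outputs, bias terms, and the 4-way head after the 256-d layer are our assumptions about unstated details, not the paper's exact implementation):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def linear_probe(h, W):
    """Linear probe f_lin: a single projection W in R^{d x 4} mapping hidden
    states to the 4 conflict classes (w/o conflict, C_VP, C_PT, C_VT)."""
    return softmax(h @ W)                  # (n_tokens, 4)

def mlp_probe(h, layers):
    """MLP probe f_mlp: dimension-reducing ReLU layers (1024 -> 512 -> 256
    in the paper) followed by a 4-way classification head."""
    x = h
    for W, b in layers[:-1]:
        x = np.maximum(x @ W + b, 0.0)     # ReLU hidden layers
    W, b = layers[-1]
    return softmax(x @ W + b)              # (n_tokens, 4)
```

High accuracy from `linear_probe` alone, with no gain from `mlp_probe`, is what this section reads as evidence of linear separability.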
As shown in Table 2, we report AUC and Recall@0.1 for both probes using "Span-Max" aggregation, which takes the maximum predicted probability across tokens within each span (details in Appendix C.5). The Linear Probe achieves strong performance across all conflict types: AUC reaches 93.2-98.8% and Recall@0.1 reaches 76.9-97.8%. For \( {\mathcal{C}}_{\mathrm{{PT}}} \) , Linear Probe achieves 98.6% AUC and 96.1-97.2% Recall@0.1; for \( {\mathcal{C}}_{\mathrm{{VP}}} \) and \( {\mathcal{C}}_{\mathrm{{VT}}} \) , it reaches 93.4-95.9% AUC and 76.9-87.9% Recall@0.1, comparable to MLP. The fact that a single linear layer suffices to achieve such performance indicates that for knowledge conflicts, the "features" extracted by LLMs are already explicitly disentangled in the high-dimensional space, and introducing additional nonlinear complexity (MLP) does not yield significant gain.
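The "Span-Max" aggregation itself is a one-liner; a minimal sketch (function name and span encoding are illustrative):

```python
def span_max(token_probs, spans):
    """'Span-Max' aggregation: score each span by the maximum predicted
    conflict probability over the tokens it covers (cf. Appendix C.5)."""
    # token_probs: per-token conflict probabilities from a probe
    # spans: (start, end) token-index ranges, end exclusive
    return [max(token_probs[s:e]) for s, e in spans]
```

For example, `span_max([0.1, 0.9, 0.2, 0.4], [(0, 2), (2, 4)])` returns `[0.9, 0.4]`, so a single strongly flagged token suffices to mark its span.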
Conclusion (Linearity): A simple linear probe achieves detection performance comparable to that of a non-linear MLP. This suggests that effective conflicts are not entangled within complex nonlinear manifolds, but are explicitly and approximately linearly separable, which makes real-time detection of conflict states during inference possible.
## 5. Intervening in Knowledge Conflict
Section 4 showed that effective conflicts \( {\mathcal{C}}_{i, j}^{e}\left( {t \mid x}\right) \) are streaming-decodable from internal states and are encoded as linearly separable features in specific mid-to-late layers. Building on this observation, we ask the following: given an input with \( {\mathcal{C}}_{i, j}^{o}\left( x\right) = 1 \) , can inference-time interventions bias the model toward a desired knowledge source, or suppress the activation of effective conflicts during generation?

Figure 5. Semantic performance of targeted source control. We evaluate three conflict subsets \( \left( {{\mathcal{C}}_{\mathrm{{VP}}}^{o},{\mathcal{C}}_{\mathrm{{VT}}}^{o},{\mathcal{C}}_{\mathrm{{PT}}}^{o}}\right) \) using judge-based metrics: ASR (Anchor Support Rate, ↑), ARR (Anchor Rejection Rate, ↓), and OER (Obvious Error Rate, ↓). Forward/Reverse denote intervening toward the truth-anchored (benchmark-reliable) vs. conflicting source. Arrows indicate relative changes against the baseline. Note that VCD is inapplicable to the non-visual \( {\mathcal{C}}_{\mathrm{{PT}}}^{o} \) subset.
### 5.1. A unified framework for directional interventions
Two control objectives. We study inference-time control under objectively conflicting inputs and consider two settings. (I) Targeted source control. We choose a target source \( {\mathcal{K}}_{s} \in \left\{ {{\mathcal{K}}_{i},{\mathcal{K}}_{j}}\right\} \) and intervene so that the model follows \( {\mathcal{K}}_{s} \) under conflict. This yields two directions: Forward, which intervenes toward the truth-anchored (benchmark-reliable) source, and Reverse, which enforces the opposite source. (II) Conflict mitigation. We measure whether interventions reduce how often effective conflicts are activated during generation, quantified by the expected fraction of reasoning steps at which a conflict is detected:
\[
{\mathbb{E}}_{x}{\mathbb{E}}_{t}\left\lbrack {\mathbb{I}\left\lbrack {{\mathcal{C}}_{i, j}^{e}\left( {t \mid x}\right) = 1}\right\rbrack }\right\rbrack . \tag{12}
\]
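An empirical estimate of Eq. (12) averages per-step conflict indicators over steps and then over inputs; a minimal sketch (names are ours):

```python
def conflict_activation_rate(trajectories):
    """Monte-Carlo estimate of Eq. (12): the expected fraction of reasoning
    steps t at which the effective conflict C^e_{i,j}(t|x) is active,
    averaged over inputs x."""
    # trajectories: one list of 0/1 indicator flags per input x
    per_input = [sum(flags) / len(flags) for flags in trajectories]
    return sum(per_input) / len(per_input)
```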
A unified view of directional interventions. Let \( {\ell }_{t} \in \; {\mathbb{R}}^{\left| \mathcal{V}\right| } \) denote the pre-softmax logits at step \( t \) . We view an inference-time intervention as modifying decoding through an additive logit perturbation, either directly or implicitly via hidden-state manipulation:
\[
{\widetilde{p}}_{t} = \operatorname{softmax}\left( {{\ell }_{t} + \Delta {\ell }_{t}}\right) ,\;\Delta {\ell }_{t} = \mathcal{I}\left( {x,{y}_{ < t}}\right) . \tag{13}
\]
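In code, the unified view of Eq. (13) is a single additive correction before the softmax; a sketch assuming `delta` has already been produced by some instantiation of \( \mathcal{I} \):

```python
import numpy as np

def perturbed_step(logits, delta):
    """Eq. (13): modified next-token distribution softmax(ell_t + delta_ell_t),
    where `delta` is the intervention's logit perturbation I(x, y_<t)."""
    z = logits + delta
    z = z - z.max()                    # numerical stability
    p = np.exp(z)
    return p / p.sum()
```

Setting `delta` to zeros recovers ordinary decoding; a positive entry shifts probability mass toward that token.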
We consider three instantiations of \( \mathcal{I} \): (I) Visual contrastive decoding (VCD). VCD applies a logit-level correction (Leng et al., 2023) and is restricted to conflicts involving visual sources (i.e., \( {\mathcal{C}}_{\mathrm{{VP}}}^{o} = 1 \) or \( {\mathcal{C}}_{\mathrm{{VT}}}^{o} = 1 \)). (II) Representation steering. Leveraging the linear separability found in Section 4, we adopt representation steering (Zhang et al., 2025c), which shifts the hidden state at a selected conflict-sensitive layer by a learned direction, i.e., \( {\widetilde{\mathbf{h}}}_{t} = {\mathbf{h}}_{t} + \lambda \mathbf{v} \), where \( \lambda \) is the steering strength and \( \mathbf{v} \) is the direction vector. (III) Probe-guided control. We use the streaming probe to score candidate continuations, reweighting decoding toward options less likely to trigger conflicts. For the top-\( k \) candidates \( {\mathcal{V}}_{k} \) with base probabilities \( {p}_{t}\left( w\right) \), we apply
\[
{\widetilde{p}}_{t}\left( w\right) \propto {p}_{t}\left( w\right) \exp \left( {\alpha {P}_{t}^{\left( w\right) }}\right) ,\;w \in {\mathcal{V}}_{k}, \tag{14}
\]
where \( {P}_{t}^{\left( w\right) } \) is the probe-predicted probability of the no-conflict state for the continuation committing to token \( w \) , and \( \alpha \) controls the strength of guidance. Full implementation details and hyperparameters are provided in Appendices D.4 and D.5.
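Eq. (14) can be sketched as follows (a simplified sketch: we assume the probe scores \( {P}_{t}^{\left( w\right) } \) for all top-\( k \) candidates are precomputed, whereas in practice each requires running the streaming probe on the state after committing to \( w \)):

```python
import numpy as np

def probe_guided_reweight(p_t, probe_scores, k, alpha):
    """Eq. (14): reweight the top-k candidates by exp(alpha * P_t^(w)),
    where P_t^(w) is the probe's no-conflict probability for token w."""
    topk = np.argsort(p_t)[-k:]                      # candidate set V_k
    w = p_t[topk] * np.exp(alpha * probe_scores[topk])
    p_new = np.zeros_like(p_t)
    p_new[topk] = w / w.sum()                        # renormalize over V_k
    return p_new
```

With `alpha = 0` this reduces to plain top-\( k \) renormalization; larger `alpha` shifts mass toward continuations the probe deems conflict-free.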
### 5.2. Targeted source control: semantic-level evaluation
We evaluate whether targeted interventions successfully bias the model toward a specified knowledge source under objectively conflicting inputs. We adopt an automated assertion-level judge, implemented with a strong off-the-shelf large language model, to assess semantic alignment with the target source. The judge extracts factual claims from the model output and verifies each claim against the corresponding truth anchor (image, input text, or world knowledge), producing compact aggregate metrics: ASR (Anchor Support Rate), ARR (Anchor Rejection Rate), and OER (Obvious Error Rate). To validate judge reliability, we conducted human verification on a stratified 10% subset (~1,500 spans), yielding high inter-annotator agreement \( \left( {\kappa = {0.87}}\right) \) and confirming that automated verdicts align closely with human perception of conflict resolution (details in Appendix D.2).
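Aggregating per-claim judge verdicts into the three rates is then straightforward; a minimal sketch (the verdict labels are our illustrative stand-ins for the taxonomy detailed in Appendix D.2):

```python
def judge_aggregate(verdicts):
    """Turn per-claim judge verdicts into aggregate metrics:
    ASR (Anchor Support Rate), ARR (Anchor Rejection Rate), and
    OER (Obvious Error Rate). Verdict labels here are illustrative."""
    n = len(verdicts)
    asr = sum(v == "support" for v in verdicts) / n  # claim supports the anchor
    arr = sum(v == "reject" for v in verdicts) / n   # claim rejects the anchor
    oer = sum(v == "error" for v in verdicts) / n    # claim is obviously wrong
    return asr, arr, oer
```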
As shown in Figure 5, targeted source control is feasible but exhibits a pronounced directional asymmetry. Across objective-conflict subsets, Forward interventions (toward the truth-anchored source; vision for VP/VT and prior knowledge for PT) reliably improve semantic alignment, whereas Reverse control (forcing reliance on the competing source) often degrades it. We hypothesize that this asymmetry reflects an internal source-reliability prior: when sources disagree, the model resists reversing arbitration away from the source it treats as reliable, even under strong contextual pressure. The asymmetry cannot be explained by construction bias alone: if it were purely a data artifact, we would expect the probe to learn shortcuts to anchor proximity rather than capture genuine conflict dynamics. Instead, the asymmetry persists across all three architecturally distinct backbones, suggesting it reflects shared instruction-tuning biases that favor user-provided context (Sharma et al., 2024; Zhang et al., 2025c). Under Forward control, probe-guided interventions improve ASR while lowering OER by \( \sim {30}\% \); VCD yields stronger but selective gains on \( {\mathcal{C}}_{\mathrm{{VP}}} \) (ASR +15%, ARR halved). Reverse control remains challenging: most methods regress or show negligible gains. Mechanistically, the probe primarily suppresses conflict states rather than enforcing weaker-source selection. This highlights a trade-off: VCD is high-gain but direction-sensitive, whereas representation steering reliably reduces errors (ARR/OER) but rarely drives sustained ASR gains.
Table 3. Token-level conflict mitigation under the forward direction. Results are reported on three objective-conflict subsets \( \left( {{\mathcal{C}}_{\mathrm{{VP}}}^{o} = 1}\right. \) , \( {\mathcal{C}}_{\mathrm{{VT}}}^{o} = 1 \) , and \( {\mathcal{C}}_{\mathrm{{PT}}}^{o} = 1 \) ) across three backbones. We report four token-level mitigation metrics: \( \mathbf{{SS}} \uparrow ,\mathbf{{CR}} \downarrow ,\mathbf{{CAC}} \downarrow \) and \( \mathbf{{CCI}} \downarrow \) (metric definitions in Appendix D.3). VCD is not applicable when \( {\mathcal{C}}_{\mathrm{{PT}}}^{o} = 1 \) and is therefore reported only for the first two subsets.
<table><tr><td rowspan="2">Method</td><td rowspan="2">Subset</td><td colspan="4">R1-Onevision-7B</td><td colspan="4">Ocean-R1-7B-Instruct</td><td colspan="4">Llama-3.2V-11B-cot</td></tr>
<tr><td>SS↑</td><td>CAC↓</td><td>CCI↓</td><td>CR↓</td><td>SS↑</td><td>CAC↓</td><td>CCI↓</td><td>CR↓</td><td>SS↑</td><td>CAC↓</td><td>CCI↓</td><td>CR↓</td></tr>
<tr><td rowspan="3">baseline</td><td>\( {\mathcal{C}}_{\mathrm{{VP}}}^{o} = 1 \)</td><td>0.94</td><td>0.04</td><td>0.70</td><td>0.03</td><td>0.89</td><td>0.07</td><td>0.71</td><td>0.06</td><td>0.94</td><td>0.04</td><td>0.45</td><td>0.02</td></tr>
<tr><td>\( {\mathcal{C}}_{\mathrm{{VT}}}^{o} = 1 \)</td><td>0.88</td><td>0.08</td><td>0.80</td><td>0.10</td><td>0.87</td><td>0.09</td><td>0.79</td><td>0.10</td><td>0.90</td><td>0.06</td><td>0.72</td><td>0.03</td></tr>
<tr><td>\( {\mathcal{C}}_{\mathrm{{PT}}}^{o} = 1 \)</td><td>0.82</td><td>0.12</td><td>0.80</td><td>0.15</td><td>0.82</td><td>0.12</td><td>0.80</td><td>0.15</td><td>0.84</td><td>0.11</td><td>0.70</td><td>0.11</td></tr>
<tr><td rowspan="2">VCD</td><td>\( {\mathcal{C}}_{\mathrm{{VP}}}^{o} = 1 \)</td><td>\( {0.92}^{-0.02} \)</td><td>\( {0.05}^{+0.01} \)</td><td>0.69</td><td>\( {0.04}^{+0.01} \)</td><td>0.90</td><td>\( {0.06}^{-0.01} \)</td><td>\( {0.69}^{-0.01} \)</td><td>\( {0.05}^{-0.01} \)</td><td>0.85</td><td>\( {0.08}^{+0.04} \)</td><td>0.63</td><td>\( {0.06}^{+0.05} \)</td></tr>
<tr><td>\( {\mathcal{C}}_{\mathrm{{VT}}}^{o} = 1 \)</td><td>\( {0.90}^{+0.01} \)</td><td>\( {0.07}^{-0.01} \)</td><td>0.79</td><td>\( {0.08}^{-0.01} \)</td><td>0.92</td><td>\( {0.05}^{-0.03} \)</td><td>\( {0.75}^{-0.04} \)</td><td>\( {0.05}^{-0.05} \)</td><td>0.78</td><td>\( {0.12}^{+0.06} \)</td><td>0.69</td><td>\( {0.15}^{+0.11} \)</td></tr>
<tr><td rowspan="3">steering</td><td>\( {\mathcal{C}}_{\mathrm{{VP}}}^{o} = 1 \)</td><td>\( {0.92}^{-0.02} \)</td><td>\( {0.05}^{+0.01} \)</td><td>0.69</td><td>\( {0.05}^{+0.02} \)</td><td>0.89</td><td>\( {0.07}^{+0.00} \)</td><td>\( {0.71}^{+0.00} \)</td><td>\( {0.07}^{+0.01} \)</td><td>0.92</td><td>\( {0.05}^{+0.01} \)</td><td>\( {0.55}^{+0.10} \)</td><td>\( {0.03}^{+0.02} \)</td></tr>
<tr><td>\( {\mathcal{C}}_{\mathrm{{VT}}}^{o} = 1 \)</td><td>\( {0.91}^{+0.02} \)</td><td>\( {0.06}^{-0.02} \)</td><td>0.76</td><td>\( {0.07}^{-0.03} \)</td><td>0.91</td><td>\( {0.06}^{-0.03} \)</td><td>\( {0.77}^{-0.03} \)</td><td>\( {0.06}^{-0.04} \)</td><td>\( {0.90}^{+0.00} \)</td><td>\( {0.06}^{-0.00} \)</td><td>\( {0.67}^{-0.04} \)</td><td>\( {0.04}^{+0.01} \)</td></tr>
<tr><td>\( {\mathcal{C}}_{\mathrm{{PT}}}^{o} = 1 \)</td><td>\( {0.77}^{-0.06} \)</td><td>\( {0.16}^{+0.04} \)</td><td>0.76</td><td>\( {0.20}^{+0.05} \)</td><td>\( {0.82}^{+0.00} \)</td><td>\( {0.12}^{-0.00} \)</td><td>0.80</td><td>\( {0.15}^{+0.00} \)</td><td>\( {0.84}^{+0.00} \)</td><td>\( {0.11}^{-0.00} \)</td><td>\( {0.69}^{-0.01} \)</td><td>\( {0.12}^{+0.01} \)</td></tr>
<tr><td rowspan="3">probe-guided</td><td>\( {\mathcal{C}}_{\mathrm{{VP}}}^{o} = 1 \)</td><td>\( {0.95}^{+0.01} \)</td><td>\( {0.03}^{-0.01} \)</td><td>\( {0.67}^{-0.03} \)</td><td>\( {0.02}^{-0.01} \)</td><td>\( {0.92}^{+0.03} \)</td><td>\( {0.05}^{-0.02} \)</td><td>0.66</td><td>\( {0.03}^{-0.03} \)</td><td>\( {0.94}^{+0.01} \)</td><td>\( {0.04}^{-0.00} \)</td><td>\( {0.39}^{-0.06} \)</td><td>\( {0.02}^{-0.00} \)</td></tr>
<tr><td>\( {\mathcal{C}}_{\mathrm{{VT}}}^{o} = 1 \)</td><td>\( {0.94}^{+0.06} \)</td><td>\( {0.04}^{-0.04} \)</td><td>\( {0.64}^{-0.16} \)</td><td>\( {0.02}^{-0.07} \)</td><td>\( {0.93}^{+0.06} \)</td><td>\( {0.04}^{-0.04} \)</td><td>0.72</td><td>\( {0.04}^{-0.06} \)</td><td>\( {0.92}^{+0.02} \)</td><td>\( {0.05}^{-0.01} \)</td><td>\( {0.67}^{-0.05} \)</td><td>\( {0.03}^{-0.00} \)</td></tr>
<tr><td>\( {\mathcal{C}}_{\mathrm{{PT}}}^{o} = 1 \)</td><td>\( {0.78}^{-0.04} \)</td><td>\( {0.10}^{-0.02} \)</td><td>\( {0.60}^{-0.20} \)</td><td>\( {0.15}^{+0.01} \)</td><td>\( {0.87}^{+0.04} \)</td><td>\( {0.08}^{-0.04} \)</td><td>0.72</td><td>\( {0.09}^{-0.06} \)</td><td>\( {0.87}^{+0.04} \)</td><td>\( {0.08}^{-0.03} \)</td><td>0.63</td><td>\( {0.10}^{-0.01} \)</td></tr></table>
Conclusion (Targeted Source Control). When objective conflicts are present, inference-time interventions exhibit a clear directional asymmetry: biasing the model toward fact-consistent, truth-anchored sources is significantly easier and more reliable than forcing it to rely on fact-inconsistent sources. This suggests that conflict resolution in MLLMs is governed by a stable, source-dependent inductive tendency, which can be strengthened but is difficult to reverse.
### 5.3. Conflict mitigation under the default direction
Semantic evaluation in Section 5.2 demonstrated that, under objectively conflicting inputs, inference-time interventions can bias model outputs toward the truth-anchored source. Here, we pose a complementary process-level question: under the default (Forward) direction, can we reduce the activation of effective conflicts \( {\mathcal{C}}_{i, j}^{e}\left( {t \mid x}\right) \) during generation? We employ token-level mitigation metrics (Support Score (SS), Conflict Rate (CR), Confidence-Adjusted Conflict (CAC), and Conflict Confidence Index (CCI)) to summarize these internal dynamics, complementing the independent semantic-correctness evaluation in Figure 5. Table 3 summarizes the token-level mitigation results. Interventions targeting the identified conflict features (Probe-guided control) consistently suppress conflict dynamics across backbones. Specifically, on the visually grounded subset \( \left( {\mathcal{C}}_{\mathrm{{VT}}}\right) \), the frequency of conflict activation (CR) decreases significantly (e.g., \( {0.10} \rightarrow {0.02} \) on R1-Onevision). Crucially, even when conflict frequency remains stable (e.g., \( {\mathcal{C}}_{\mathrm{{PT}}} \)), confidence-aware measures reveal deeper suppression (CCI drops by 25%), indicating that the intervention mitigates the intensity of conflicts even when it does not reduce their occurrence. In contrast, rigid interventions such as representation steering and unguided perturbations such as VCD struggle to generalize: VCD increases the conflict rate fivefold on Llama-3.2V for \( {\mathcal{C}}_{\mathrm{{VT}}}\;\left( {{0.03} \rightarrow {0.15}}\right) \). This disparity highlights that effective mitigation requires precise targeting of the conflict-encoding subspaces rather than broad adjustments.
Conclusion (Conflict Mitigation). Guiding the model toward the reliable source attenuates internal conflict dynamics during reasoning, reducing both the intensity and the frequency of effective conflict states. This implies that effective conflict \( {\mathcal{C}}_{i, j}^{e}\left( {t \mid x}\right) \) activation is not an inherent attribute of generation, but a plastic internal state that can be suppressed during reasoning.
## 6. Conclusion
In this work, we study failures in multimodal long-CoT reasoning from the perspective of knowledge conflict, rather than knowledge absence. By distinguishing objective conflicts from effective conflicts during reasoning, we show that many failures arise from how conflicting knowledge is resolved over time. We find that effective conflicts are encoded as explicit and linearly decodable signals, concentrated in mid-to-late layers of the model. Leveraging these signals, we uncover a pronounced directional asymmetry: guiding the model toward its reliability-aligned source is substantially easier than forcing conflict resolution in the opposite direction, indicating a biased and path-dependent mechanism. Looking forward, we hope this perspective motivates analysis and control methods for richer conflict structures and more complex multimodal reasoning settings.
## Impact Statement
This paper presents work whose goal is to advance the understanding and reliability of MLLMs in long-CoT reasoning scenarios. By diagnosing knowledge conflicts and their intervention mechanisms, our research contributes to making AI systems more transparent and trustworthy. The diagnostic framework and intervention methods proposed here could help identify and mitigate reasoning failures before deployment, potentially reducing the propagation of misinformation or hallucinated content in real-world applications. We do not foresee specific negative societal consequences that need to be highlighted beyond the general considerations applicable to advancing machine learning capabilities.