# Diagnosing Knowledge Conflict in Multimodal Long-Chain Reasoning
Jing Tang \( {}^{1 * } \) Kun Wang \( {}^{2 * } \) Haolang Lu \( {}^{3 * } \) Hongjin Chen \( {}^{3} \) KaiTao Chen \( {}^{3} \) Zhongxiang Sun \( {}^{4} \) Qiankun Li \( {}^{2} \) Lingjuan Lyu \( {}^{5} \) Guoshun Nan \( {}^{3} \) Zhigang Zeng \( {}^{1} \)
jingtang@hust.edu.cn wang.kun@ntu.edu.sg luhaolang@bupt.edu.cn
## Abstract
Multimodal large language models in long chain-of-thought reasoning often fail when different knowledge sources provide conflicting signals. We formalize these failures under a unified notion of knowledge conflict, distinguishing input-level objective conflict from process-level effective conflict. Through probing internal representations, we reveal that: (I) Linear Separability: different conflict types are explicitly encoded as linearly separable features rather than entangled; (II) Depth Localization: conflict signals concentrate in mid-to-late layers, indicating a distinct processing stage for conflict encoding; (III) Hierarchical Consistency: aggregating noisy token-level signals along trajectories robustly recovers input-level conflict types; and (IV) Directional Asymmetry: reinforcing the model's implicit source preference under conflict is far easier than enforcing the opposite source. Our findings provide a mechanism-level view of multimodal reasoning under knowledge conflict and enable principled diagnosis and control of long-CoT failures. Code is available at anonymous link.
## 1. Introduction
Multimodal large language models (MLLMs) (Jin et al., 2025; Caffagni et al., 2024; Zhang et al., 2024a) have made substantial progress in visual understanding (Tong et al., 2024a; Ghatkesar et al., 2025; Ma et al., 2025), textual reasoning (Wang et al., 2024; Du et al., 2025; Mirzadeh et al., 2025), and cross-modal alignment (Yu et al., 2024; Yan et al., 2025; Yu et al., 2025), enabling complex perception-reasoning-decision workflows. A defining capability is long-form reasoning: beyond producing answers, these models can generate extended chains-of-thought (CoT) (Wang et al., 2025b; Yue et al., 2025) that support challenging multi-step tasks. However, recent work increasingly documents failures under mutually contradictory evidence or constraints: models may ignore explicit instructions (Wang et al., 2025a; Zhao et al., 2025), privilege the wrong evidential source (Guan et al., 2024; Liu et al., 2025b), or yield plausible yet goal-inconsistent conclusions (Fanous et al., 2025). These observations suggest that a key bottleneck in multimodal reasoning is not always missing information, but reliable decision-making under conflicting signals.
Building on these observations, prior work (Zhang et al., 2024c; Lu et al., 2024) has characterized abnormal behavior under conflicting signals from several largely independent angles. In retrieval-augmented generation, a central question is whether models remain faithful to retrieved evidence or drift toward parametric priors (Wu et al., 2024). In vision settings with counterfactual or commonsense-violating inputs, MLLMs are often found to underweight visual evidence and default to "reasonable" answers that match world knowledge (Tong et al., 2024b; Liu et al., 2025c). In high-stakes domains, studies further report over-accommodation to user assertions, which can pull predictions away from the underlying evidence (Sharma et al., 2024). Although these lines of work differ in tasks, datasets, and evaluation criteria, their failure modes are strikingly similar: when information sources disagree, models do not reliably follow the appropriate basis for a decision, and instead exhibit unstable, hard-to-control trade-offs across sources.
In this paper, we take a unified view that these phenomena arise from knowledge conflict in multimodal reasoning. When generating tokens, MLLMs jointly rely on multiple knowledge sources, including visual evidence, textual instructions and contextual constraints, and parametric priors stored in the model weights (Han et al., 2025; Liu et al., 2024a; Karamcheti et al., 2024). When these sources provide inconsistent signals for the same goal, the model must resolve which source to follow. Importantly, the resulting failures are not fabrications from missing knowledge, but incorrect source selection under conflict: the model may have access to competing plausible cues yet follow the wrong basis. Accordingly, our focus is not the act of answer generation itself, but whether conflict-induced failures can be localized, measured, and mechanistically tested.
---
\( {}^{1} \) Huazhong University of Science and Technology \( {}^{2} \) Nanyang Technological University \( {}^{3} \) Beijing University of Posts and Telecommunications \( {}^{4} \) Renmin University of China \( {}^{5} \) Sony AI, Zurich, Switzerland. Correspondence to: Guoshun Nan <nan-guoshun@gmail.com>, Zhigang Zeng <zgzeng@hust.edu.cn>.
Preprint. February 17, 2026.
---
Multimodal long-CoT reasoning (Ni et al., 2025) makes this problem sharper by unfolding decisions over many steps, with the internal reasoning state evolving over time. Under this setting, knowledge conflict can be triggered at any point and modality along the trajectory rather than only at the final answer. Once a step commits to the wrong basis, subsequent steps may continue from that premise in a locally coherent manner, eventually producing a globally incorrect conclusion (Zhang et al., 2024b). More challengingly, such deviations are often masked by fluent rationales (Turpin et al., 2023), making it difficult to infer when the conflict emerged, what triggered it, and how it propagated from the final output alone. Understanding and correcting failures in long-CoT therefore requires step-level tools that can expose the underlying conflict dynamics.
In this work, our contributions are threefold:
* We diagnose knowledge conflict dynamics on 7,500+ long-CoT trajectories from an objective conflict benchmark, where effective conflicts are activated in 78-90% of samples.
* Through layer-wise analysis of three models, we identify a depth-dependent conflict encoding stage. Using streaming probes to detect token-level conflict states, we find that they exhibit high linear separability (93.2-98.8% AUC, 76.9-97.8% Recall@0.1), revealing them as explicit, decodable features.
* We employ three pluggable intervention methods, which can either steer model outputs toward selected directions, reducing conflict frequency by up to 80%, or suppress high-confidence errors by up to 55%.
## 2. Related Work
Knowledge Conflict. Research on knowledge conflicts has identified three primary sources: conflicts between internal priors and visual information (Liu et al., 2025b; Du. et al., 2025) or textual inputs (Zhang et al., 2025a; Su et al., 2024), and conflicts between visual and textual modalities (Deng et al., 2025). Building on these findings, significant efforts have been made to mitigate such conflicts through advanced strategies (Xie et al., 2024; Guo et al., 2024), including knowledge editing (Tan et al., 2024; Zhang et al., 2025d; Cheng et al., 2024; Chen et al., 2025) and retrieval augmentation (Huo et al., 2025; Zhang et al., 2025b; Li et al., 2025). These approaches have demonstrated potential in enhancing model faithfulness and reliability (Huang et al., 2025b; An et al., 2025; Shi et al., 2024; Zhang et al., 2024d; Lu et al., 2025). Although the above evidence suggests that conflicts are coupled and multi-source, existing solutions remain fragmented across modalities and fail to model conflicts holistically, thereby limiting their applicability in complex settings.
Probe Detection. Investigating internal states via probe detection is a developing field, yet the history of probing in LLMs (Kahana et al., 2025) provides clear precedents. Notably, the evolution of probe detection primarily centers on hallucination and faithfulness (Feng et al., 2025; Yi et al., 2025). Core techniques, such as linear probe generators (Kahana et al., 2025) and propositional probes (Feng et al., 2025), have inspired analogous approaches in watermark identification (Liu et al., 2025a), reward maximization (Li et al., 2024), and combinatorial optimization (Zhang et al., 2025e). However, these approaches predominantly focus on single-modal issues or specific downstream tasks, leaving the detection and localization of multimodal knowledge conflicts largely unexplored. Inspired by this, we introduce a specialized probe detection framework to identify the three sources of knowledge conflicts in MLLMs.
![bo_d6nb7sc601uc73e2hngg_1_901_194_698_547_0.jpg](images/bo_d6nb7sc601uc73e2hngg_1_901_194_698_547_0.jpg)
Figure 1. Overview of Knowledge Sources and Conflict Types. We categorize knowledge into Visual \( \left( {\mathcal{K}}_{\text{ vision }}\right) \) , Textual \( \left( {\mathcal{K}}_{\text{ text }}\right) \) , and Parametric Prior \( \left( {\mathcal{K}}_{\text{ prior }}\right) \) . Knowledge conflicts arise when factual statements from different sources act as incompatible signals. We define three primary conflict types: Vision-Text \( \left( {\mathcal{C}}_{\mathrm{{VT}}}\right) \) , Vision-Prior \( \left( {\mathcal{C}}_{\mathrm{{VP}}}\right) \) , and Prior-Text \( \left( {\mathcal{C}}_{\mathrm{{PT}}}\right) \) .
## 3. Conflict in Multimodal Reasoning
### 3.1. Knowledge Sources and Pairwise Conflicts
We consider a multimodal long-CoT reasoning task with input \( x = \left( {{X}_{V},{X}_{T}}\right) \) , where \( {X}_{V} \) denotes the visual input and \( {X}_{T} \) the textual input. Given a multimodal generative model \( {M}_{\theta } \) , reasoning unfolds as a sequence of tokens \( \tau \left( x\right) = \left( {{y}_{1},{y}_{2},\ldots ,{y}_{T}}\right) \) , with each token sampled as
\[
{y}_{t} \sim {M}_{\theta }\left( {\cdot \mid x,{y}_{ < t}}\right) . \tag{1}
\]
We denote the internal state at step \( t \) by
\[
{\mathbf{h}}_{t} = {f}_{\theta }\left( {x,{y}_{ < t}}\right) , \tag{2}
\]
where \( {f}_{\theta } \) denotes the model's hidden-representation extraction, i.e., the forward pass up to a specified layer.
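The hidden-state extraction in Eq. (2) can be sketched with a toy stand-in for \( {f}_{\theta } \). In practice one would read the per-layer `hidden_states` returned by an MLLM forward pass (e.g., `output_hidden_states=True` in Hugging Face transformers); the tiny `tanh` layer stack and all dimensions below are illustrative assumptions, not the models studied here:

```python
import numpy as np

def hidden_state(x_embed, layers, l):
    """Toy stand-in for f_theta: run the forward pass up to layer l
    and return the hidden state of the last position (Eq. 2)."""
    h = x_embed
    for W in layers[:l]:
        h = np.tanh(h @ W)   # one toy "transformer layer"
    return h[-1]             # internal state h_t^(l) at the current step

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))                      # 5 context tokens, dim 8
layers = [rng.normal(size=(8, 8)) * 0.1 for _ in range(4)]
h_t_l2 = hidden_state(x, layers, l=2)            # state at layer l = 2
```

The same slicing pattern (pick a layer, pick the last position) is what the later layer-wise probing analysis iterates over.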
To analyze how factual inconsistencies arise during reasoning, we abstract the knowledge available to the model into three sources, \( \mathcal{K} = \left\{ {{\mathcal{K}}_{\text{ vision }},{\mathcal{K}}_{\text{ text }},{\mathcal{K}}_{\text{ prior }}}\right\} \) .
Table 1. Output-level conflict profile across models (objective conflict subsets). We present statistics of generated trajectories under three types of conflict (model details in Appendix B). Metrics reported include sample count, average CoT length, average conflict spans per sample (spans are contiguous conflict segments identified via an automated LLM annotation pipeline and may consist of one or multiple tokens), conflict token density (proportion of conflicting tokens), and sample conflict rate (% of samples exhibiting effective conflict).
<table><tr><td rowspan="2">Metric</td><td colspan="4">Llama-3.2V-11B-cot</td><td colspan="4">R1-Onevision-7B</td><td colspan="4">Ocean-R1-7B-Instruct</td></tr><tr><td>\( {\mathcal{C}}_{\mathrm{{VP}}}^{o} = 1 \)</td><td>\( {\mathcal{C}}_{\mathrm{{VT}}}^{o} = 1 \)</td><td>\( {\mathcal{C}}_{\mathrm{{PT}}}^{o} = 1 \)</td><td>All</td><td>\( {\mathcal{C}}_{\mathrm{{VP}}}^{o} = 1 \)</td><td>\( {\mathcal{C}}_{\mathrm{{VT}}}^{o} = 1 \)</td><td>\( {\mathcal{C}}_{\mathrm{{PT}}}^{o} = 1 \)</td><td>All</td><td>\( {\mathcal{C}}_{\mathrm{{VP}}}^{o} = 1 \)</td><td>\( {\mathcal{C}}_{\mathrm{{VT}}}^{o} = 1 \)</td><td>\( {\mathcal{C}}_{\mathrm{{PT}}}^{o} = 1 \)</td><td>All</td></tr><tr><td>Samples</td><td>749</td><td>1012</td><td>803</td><td>2564</td><td>724</td><td>993</td><td>769</td><td>2486</td><td>640</td><td>1026</td><td>807</td><td>2473</td></tr><tr><td>Avg. CoT length (tokens)</td><td>326.79</td><td>1768.85</td><td>238.50</td><td>868.32</td><td>706.85</td><td>790.63</td><td>558.97</td><td>694.57</td><td>488.15</td><td>711.26</td><td>302.97</td><td>520.28</td></tr><tr><td>Avg. conflict spans per sample</td><td>2.69</td><td>6.20</td><td>4.04</td><td>4.50</td><td>3.66</td><td>6.73</td><td>7.02</td><td>5.93</td><td>8.68</td><td>9.00</td><td>5.43</td><td>7.75</td></tr><tr><td>Conflict token density (%)</td><td>4.92</td><td>1.65</td><td>11.25</td><td>5.61</td><td>3.20</td><td>2.16</td><td>7.68</td><td>4.17</td><td>8.70</td><td>3.23</td><td>11.77</td><td>7.43</td></tr><tr><td>Conflict Sample Ratio (%)</td><td>63.68</td><td>82.21</td><td>86.43</td><td>78.12</td><td>59.67</td><td>85.90</td><td>87.91</td><td>78.88</td><td>88.75</td><td>90.25</td><td>89.34</td><td>89.57</td></tr></table>
Here, \( {\mathcal{K}}_{\text{ vision }} \) consists of facts supported by the visual input \( {X}_{V},{\mathcal{K}}_{\text{ text }} \) consists of facts constrained by the textual input \( {X}_{T} \) , and \( {\mathcal{K}}_{\text{ prior }} \) denotes parametric prior knowledge implicitly encoded in the model parameters \( \theta \) .
For each knowledge source \( {\mathcal{K}}_{ * } \in \mathcal{K} \) , we represent its supported factual content as a set of atomic factual statements \( F\left( {\mathcal{K}}_{ * }\right) \) , where each element \( \psi \in F\left( {\mathcal{K}}_{ * }\right) \) corresponds to an indivisible factual judgment. We use \( {\psi }_{a} \bot {\psi }_{b} \) to denote that two facts are semantically incompatible, i.e., they cannot simultaneously be true under the given context.
Based on this notion, we define a pairwise knowledge conflict between two sources \( {\mathcal{K}}_{i} \) and \( {\mathcal{K}}_{j}\left( {i \neq j}\right) \) as the set of incompatible fact pairs:
\[
{\mathcal{C}}_{i, j} = \left\{ {\left( {{\psi }_{i},{\psi }_{j}}\right) \mid {\psi }_{i} \in F\left( {\mathcal{K}}_{i}\right) ,{\psi }_{j} \in F\left( {\mathcal{K}}_{j}\right) ,{\psi }_{i} \bot {\psi }_{j}}\right\} . \tag{3}
\]
In this work, we focus on three primary pairwise conflict types induced by the three knowledge sources: Vision-Prior \( \left( {\mathcal{C}}_{\mathrm{{VP}}}\right) \) , Vision-Text \( \left( {\mathcal{C}}_{\mathrm{{VT}}}\right) \) , and Prior-Text \( \left( {\mathcal{C}}_{\mathrm{{PT}}}\right) \) .
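As an illustration of Eq. (3), the conflict set \( {\mathcal{C}}_{i,j} \) can be enumerated directly once each source's facts are represented as atomic attribute-value statements. The facts and the incompatibility predicate below are hypothetical stand-ins, a minimal sketch rather than the paper's annotation pipeline:

```python
from itertools import product

# Hypothetical atomic facts per knowledge source (illustrative only).
F_vision = {("sky_color", "green")}                      # F(K_vision)
F_text   = {("sky_color", "blue"), ("n_cats", "2")}      # F(K_text)
F_prior  = {("sky_color", "blue")}                       # F(K_prior)

def incompatible(psi_a, psi_b):
    # Stand-in for psi_a ⊥ psi_b: two atomic facts conflict if they
    # assign different values to the same attribute.
    return psi_a[0] == psi_b[0] and psi_a[1] != psi_b[1]

def conflict_set(F_i, F_j):
    # Eq. (3): all incompatible cross-source fact pairs.
    return {(a, b) for a, b in product(F_i, F_j) if incompatible(a, b)}

C_VT = conflict_set(F_vision, F_text)    # vision vs. text conflict pairs
C_VP = conflict_set(F_vision, F_prior)   # vision vs. prior conflict pairs
C_PT = conflict_set(F_prior, F_text)     # empty here: prior and text agree
```

Here the nonempty `C_VT` and `C_VP` correspond to inputs with an objective conflict, while the empty `C_PT` corresponds to consistent sources.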
### 3.2. Objective vs. Effective Conflict
As illustrated in Figure 1, we distinguish between two related but fundamentally different notions: objective conflict, which is defined at the input level, and effective conflict, which manifests as a process-level state during reasoning.
Objective Conflict describes factual inconsistency induced by the input and the model's parametric priors, independent of any particular reasoning trajectory. Given a conflict type \( {\mathcal{C}}_{i, j} \in \left\{ {{\mathcal{C}}_{\mathrm{{VP}}},{\mathcal{C}}_{\mathrm{{VT}}},{\mathcal{C}}_{\mathrm{{PT}}}}\right\} \) , we define a binary variable \( {\mathcal{C}}_{i, j}^{o}\left( x\right) \in \{ 0,1\} \) to indicate whether the input \( x \) exhibits an objective conflict of type \( {\mathcal{C}}_{i, j} \) . For example, \( {\mathcal{C}}_{\mathrm{{VP}}}^{o}\left( x\right) = 1 \) indicates that the visual evidence \( {X}_{V} \) contradicts the parametric prior knowledge encoded in \( \theta \) with respect to a specific fact. By definition, \( {\mathcal{C}}_{i, j}^{o}\left( x\right) \) depends only on the factual relations supported by the input \( x \) and the model priors, and does not reference the reasoning process itself.
Importantly, the presence of an objective conflict does not by itself determine whether the model will engage with this conflict during inference. From the input-level specification alone, it is not directly inferable whether, when, or how a given conflict influences the model's internal reasoning dynamics. This gap motivates a process-level notion that captures conflict activation within the model.
Effective Conflict characterizes whether an objective conflict is actually triggered during reasoning and reflected in the model's internal state. Concretely, we use \( {\mathcal{C}}_{i, j}^{e}\left( {t \mid x}\right) \in \{ 0,1\} \) to indicate whether, at reasoning step \( t \) , the model relies on mutually incompatible factual information of type \( {\mathcal{C}}_{i, j} \) . Here, \( {\mathcal{C}}_{i, j}^{e}\left( {t \mid x}\right) = 1 \) means that the corresponding conflict is active and influences the current reasoning step, as encoded in the internal state at that step.
The relationship between the two notions is asymmetric:
\[
\mathbb{P}\left( {{\mathcal{C}}_{i, j}^{e}\left( {t \mid x}\right) = 1 \mid {\mathcal{C}}_{i, j}^{o}\left( x\right) = 1}\right) < 1. \tag{4}
\]
That is, objective conflict captures whether a conflict exists at the input level, whereas effective conflict captures whether and when that conflict is activated in the model's internal state during reasoning. The former is induced jointly by the input and priors, while the latter is both model-dependent and process-dependent.
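The asymmetry in Eq. (4) can be estimated empirically: given trajectories with input-level labels and step-level effective indicators, the conditional activation probability is simply the fraction of objectively conflicting samples whose trajectory ever fires. A toy sketch with hypothetical labels (the same quantity Table 1 reports as the sample conflict rate):

```python
# Hypothetical trajectories: each carries an input-level objective label
# and a per-step effective-conflict indicator sequence C^e(t|x).
samples = [
    {"objective": "VP", "effective": [0, 0, 1, 1, 0]},  # conflict activated
    {"objective": "VP", "effective": [0, 0, 0, 0, 0]},  # never activated
    {"objective": "VT", "effective": [1, 0, 0]},
]

def activation_rate(samples, conflict_type):
    """Empirical P(exists t: C^e(t|x)=1 | C^o(x)=1) for one conflict type."""
    subset = [s for s in samples if s["objective"] == conflict_type]
    activated = sum(any(s["effective"]) for s in subset)
    return activated / len(subset)

rate_vp = activation_rate(samples, "VP")  # < 1: the conflict exists but only sometimes fires
```

A rate strictly below 1, as for the VP samples above, is exactly the gap between objective and effective conflict that Eq. (4) formalizes.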
Objective conflict data construction. For mechanistic analysis, we construct an objective-conflict benchmark with isolated pairwise conflicts, where each example contains exactly one conflict type (VP, VT, or PT) and is intended to elicit effective conflict states. This setting is designed as a diagnostic stress-test of conflict arbitration under contradiction, rather than an estimate of in-the-wild conflict prevalence. For each input x, we generate a long-CoT trajectory and align the input-level labels \( {\mathcal{C}}_{i, j}^{o}\left( x\right) \) with step-level effective conflict signals \( {\left\{ {\mathcal{C}}_{i, j}^{e}\left( t \mid x\right) \right\} }_{t = 1}^{T} \) inferred from the model outputs. Table 1 reports conflict activation statistics for this benchmark. Full details are provided in Appendix A.
## 4. Probing Conflict from Internal States
In Section 3, we formalize knowledge conflict as an input-level \( {\mathcal{C}}_{i, j}^{o}\left( x\right) \) and a process-level \( {\mathcal{C}}_{i, j}^{e}\left( {t \mid x}\right) \) . Moving forward, this section addresses the core question: Is \( {\mathcal{C}}_{i, j}^{e}\left( {t \mid x}\right) \) reflected in the model's internal states, and can it be identified in a streaming manner during generation?
### 4.1. Token-level Probing of Knowledge Conflict
We construct a streaming detector: when generating the \( t \) -th token, it determines whether an effective conflict \( {\mathcal{C}}_{i, j}^{e}\left( {t \mid x}\right) \) is triggered based solely on the hidden state \( {\mathbf{h}}_{t}^{\left( l\right) } \) . While prior work has employed probes for binary hallucination detection (Obeso et al., 2025), we extend this to a four-class classification task based on the definition in Section 3.2.
Here, label \( z = 0 \) indicates that no conflict is triggered (i.e., \( {\mathcal{C}}_{i, j}^{e}\left( {t \mid x}\right) = 0,\forall {\mathcal{C}}_{i, j} \) ), while \( z \in \{ 1,2,3\} \) corresponds to the active state of a specific pairwise knowledge conflict \( {\mathcal{C}}_{i, j}^{e}\left( {t \mid x}\right) = 1 \) , namely \( {\mathcal{C}}_{\mathrm{{VP}}},{\mathcal{C}}_{\mathrm{{PT}}} \) , and \( {\mathcal{C}}_{\mathrm{{VT}}} \) , respectively.
Formally, we define a probe \( {f}_{\phi } \) that maps hidden states to a probability distribution over conflict labels:
\[
{P}_{\phi }\left( {z \mid {\mathbf{h}}_{t}^{\left( l\right) }}\right) = \operatorname{Softmax}\left( {{f}_{\phi }\left( {\mathbf{h}}_{t}^{\left( l\right) }\right) }\right) ,\; z \in \{ 0,1,2,3\} . \tag{5}
\]
The supervision signal for training \( {f}_{\phi } \) comes from the span-level assertion annotations constructed in Table 1. We project the label of each annotated span to all its constituent tokens to obtain the dense label sequence \( \left\{ {z}_{t}\right\} \) .
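Projecting span-level labels onto tokens is a simple densification step; a minimal sketch (the span boundaries and the 1/2/3 label encoding below are illustrative):

```python
def project_spans_to_tokens(n_tokens, spans):
    """Densify span annotations into a per-token label sequence {z_t}.
    spans: list of (start, end, label) with end exclusive; labels 1-3
    encode C_VP / C_PT / C_VT, and 0 is the no-conflict background."""
    z = [0] * n_tokens
    for start, end, label in spans:
        for t in range(start, min(end, n_tokens)):
            z[t] = label
    return z

# Two annotated spans inside a 10-token trajectory.
z = project_spans_to_tokens(10, [(2, 5, 1), (7, 8, 3)])
# z == [0, 0, 1, 1, 1, 0, 0, 3, 0, 0]
```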
Since conflict tokens are extremely sparse in long-CoT, we train the probe using a weighted cross-entropy objective:
\[
{\mathcal{L}}_{\text{ probe }} = - \mathop{\sum }\limits_{t}{w}_{t}\log {P}_{\phi }\left( {{z}_{t} \mid {\mathbf{h}}_{t}^{\left( l\right) }}\right) , \tag{6}
\]
where \( {w}_{t} \) is a sample weight that assigns higher weight to \( z \in \{ 1,2,3\} \) (i.e., tokens where knowledge conflict \( {\mathcal{C}}_{i, j} \) occurs), preventing the probe from degenerating into predicting only the no-conflict background class. This objective allows the probe to maintain overall stability while remaining sufficiently sensitive to critical conflict-triggering moments. Full training details are provided in Appendix C.
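A minimal numpy sketch of the weighted objective in Eq. (6), using a linear probe and a single gradient step on toy hidden states; the class weight of 5.0, the learning rate, and all data shapes are assumptions for illustration, not the paper's training configuration (Appendix C):

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def weighted_ce(W, H, z, w):
    """Eq. (6): weighted cross-entropy of a linear probe f_phi(h) = h W."""
    P = softmax(H @ W)                                   # (n_tokens, 4)
    return -(w * np.log(P[np.arange(len(z)), z] + 1e-12)).mean()

rng = np.random.default_rng(0)
H = rng.normal(size=(200, 16))           # toy hidden states h_t^(l)
z = rng.integers(0, 4, size=200)         # conflict labels, 0 = background
w = np.where(z > 0, 5.0, 1.0)            # upweight sparse conflict tokens

W = np.zeros((16, 4))
loss_before = weighted_ce(W, H, z, w)

# One gradient-descent step: the CE gradient in logit space is
# softmax(P) minus the one-hot target, scaled per token by w_t.
P = softmax(H @ W)
G = P.copy()
G[np.arange(len(z)), z] -= 1.0
W -= 0.01 * H.T @ (w[:, None] * G) / len(z)
loss_after = weighted_ce(W, H, z, w)
```

The per-token weights keep the gradient from being dominated by the overwhelming no-conflict background, which is the degenerate solution the text warns against.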
### 4.2. Verifying the Separability of Knowledge Conflicts
We evaluate whether the probe reliably diagnoses knowledge conflicts from internal states. Specifically, we examine the token-level separability of effective conflicts and whether their sample-level aggregation recovers the objective conflict types.
![bo_d6nb7sc601uc73e2hngg_3_159_1535_668_387_0.jpg](images/bo_d6nb7sc601uc73e2hngg_3_159_1535_668_387_0.jpg)
Figure 2. Token-level separability of effective conflict \( {\mathcal{C}}_{i, j}^{e}\left( {t \mid x}\right) \) . The left panel shows the confusion matrix over token-level conflict predictions. The right panels decompose performance into binary detection of conflict versus no-conflict, and fine-grained attribution among conflict types. Values denote row-normalized recall.
(I) Separability of Effective Conflicts: Local Signals in Sparse Regimes. We first examine whether the probe can distinguish different types of effective conflicts \( {\mathcal{C}}_{i, j}^{e}\left( {t \mid x}\right) \) from the model's internal states during reasoning.
As shown in Figure 2, the probe demonstrates robust discrimination capabilities. In the binary detection stage (Stage I), the model achieves a high True Negative rate of 88.7%, effectively filtering out non-conflicting steps. However, a False Negative rate of 46.6% is observed, primarily driven by semantic sparsity within conflict spans, where 67.1% of \( {\mathcal{C}}_{\mathrm{{VP}}} \) tokens are misclassified as non-conflicting due to weak local signals. Once effective conflict is activated (Stage II), the separability between conflict types sharply increases: \( {\mathcal{C}}_{\mathrm{{PT}}} \) and \( {\mathcal{C}}_{\mathrm{{VT}}} \) achieve near-perfect identification accuracies of 99.4% and 94.8%, respectively. Even \( {\mathcal{C}}_{\mathrm{{VP}}} \) , the most subtle type, sees its recognition accuracy jump from 26.6% in the global view to 80.7% in the conditioned view. The minimal off-diagonal confusion ( \( < 1\% \) between PT and others) confirms that effective conflict types possess distinct, highly separable internal representations.
Conclusion (Local Effective Conflicts): Even under extreme sparsity and noise, different types of effective knowledge conflicts \( {\mathcal{C}}_{i, j}^{e}\left( {t \mid x}\right) \) give rise to distinct local structures in the model's internal states that can be reliably captured by the probe. This validates the feasibility of streaming diagnosis of effective conflicts while revealing differences in their intrinsic detectability.
![bo_d6nb7sc601uc73e2hngg_3_896_1223_703_419_0.jpg](images/bo_d6nb7sc601uc73e2hngg_3_896_1223_703_419_0.jpg)
Figure 3. Sample-level separability of conflict types. We visualize the t-SNE projection of hidden states at layer 20 (R1-Onevision) and layer 39 (Llama-3.2V). The three conflict categories are colored according to their Objective Conflict labels, pre-defined during dataset construction. The top-right confusion matrices illustrate the sample-level attribution performance.
(II) Alignment to Objective Conflicts: Aggregating Effective Signals. We next examine whether aggregating local effective conflicts \( {\mathcal{C}}_{i, j}^{e}\left( {t \mid x}\right) \) along a reasoning trajectory recovers the corresponding objective conflict \( {\mathcal{C}}_{i, j}^{o}\left( x\right) \) defined at the input level. This analysis evaluates the robustness of effective conflict signals beyond individual steps.
For each long-CoT trajectory, we aggregate hidden states of activated effective conflicts via mean pooling to obtain a sample-level representation. We visualize these representations using t-SNE (Figure 3), where samples sharing the same objective conflict type form compact clusters that are well separated, indicating consistent global structure.
![bo_d6nb7sc601uc73e2hngg_4_155_190_1446_524_0.jpg](images/bo_d6nb7sc601uc73e2hngg_4_155_190_1446_524_0.jpg)
Figure 4. Cross-layer distribution of conflict signals. Top row: attention-head activation ratio on conflict tokens vs. no-conflict tokens (lines), and their difference (bars), computed using effective conflict labels. Middle/bottom rows: layer-wise probe performance (one-vs-rest AUC and Recall@0.1) for \( {\mathcal{C}}_{\mathrm{{VP}}},{\mathcal{C}}_{\mathrm{{PT}}},{\mathcal{C}}_{\mathrm{{VT}}} \) across three MLLM backbones.
Quantitatively, we infer the objective conflict type by aggregating stepwise effective conflict activations:
\[
{\widehat{\mathcal{C}}}_{\text{ sample }} = \arg \mathop{\max }\limits_{{\mathcal{C}}_{i, j}}\mathop{\sum }\limits_{{t = 1}}^{T}\mathbb{I}\left\lbrack {{\mathcal{C}}_{i, j}^{e}\left( {t \mid x}\right) = 1}\right\rbrack . \tag{7}
\]
Comparing \( {\widehat{\mathcal{C}}}_{\text{ sample }} \) with the ground-truth objective labels \( {\mathcal{C}}_{i, j}^{o}\left( x\right) \) directly tests whether the model's internal conflict aligns with the conflict structure inherent in the input.
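The arg-max aggregation in Eq. (7) reduces to counting activated steps per conflict type; a minimal sketch over hypothetical streaming probe outputs (labels follow the encoding of Section 4.1, with 0 as the no-conflict background):

```python
from collections import Counter

def aggregate_sample_label(token_preds):
    """Eq. (7): pick the conflict type with the most activated steps.
    token_preds: per-token probe outputs in {0, 1, 2, 3}; 0 = no conflict."""
    counts = Counter(p for p in token_preds if p != 0)
    return counts.most_common(1)[0][0] if counts else 0

preds = [0, 0, 2, 0, 2, 3, 0, 2]        # hypothetical streaming predictions
label = aggregate_sample_label(preds)   # 2, i.e. C_PT dominates this trajectory
```

Because the vote sums over the whole trajectory, isolated token-level false positives (like the single 3 above) are washed out, which is why sample-level attribution is markedly more robust than the per-token view.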
As shown in the inset matrices of Figure 3, aggregation substantially enhances separability. Notably, \( {\mathcal{C}}_{\mathrm{{PT}}} \) achieves a perfect 100.0% on both R1-Onevision and Llama-3.2V, confirming that text-prior conflicts induce unique and stable shifts in internal states. The remaining confusion is largely confined to the visual conflict types: for instance, 25.1% of \( {\mathcal{C}}_{\mathrm{{VT}}} \) samples in R1-Onevision are misclassified as \( {\mathcal{C}}_{\mathrm{{VP}}} \) , and 14.7% of \( {\mathcal{C}}_{\mathrm{{VP}}} \) samples in Llama-3.2V are misidentified as \( {\mathcal{C}}_{\mathrm{{VT}}} \) . This overlap is expected, as both categories involve failures in processing visual evidence, leading to partially shared representations.
### 4.3. Cross-Layer Distribution of Conflict Signals
We scan model depth to localize where effective knowledge conflicts are most strongly encoded. Concretely, for each layer \( l \) , we train the same token-level probe on hidden states \( {\mathbf{h}}_{t}^{\left( l\right) } \) and evaluate its one-vs-rest AUC / Recall@0.1 for \( \left\{ {{\mathcal{C}}_{\mathrm{{VP}}},{\mathcal{C}}_{\mathrm{{PT}}},{\mathcal{C}}_{\mathrm{{VT}}}}\right\} \) .
Beyond probe separability, we also quantify a lightweight mechanistic correlate (Huang et al., 2025a): how attention-head activations differ between conflict and no-conflict token positions. Let \( {\mathcal{A}}^{\left( l\right) } \) denote the set of attention heads at layer \( l \) , and let \( {\mathbf{o}}_{t}^{\left( l, a\right) } \) be the output of head \( a \in {\mathcal{A}}^{\left( l\right) } \) at token \( t \) . We define token sets using effective conflict signals:
\[
{\mathcal{S}}_{\text{ conf }} = \left\{ {\left( {x, t}\right) \mid \exists \left( {i, j}\right) ,{\mathcal{C}}_{i, j}^{e}\left( {t \mid x}\right) = 1}\right\} , \tag{8}
\]
\[
{\mathcal{S}}_{\text{ nconf }} = \left\{ {\left( {x, t}\right) \mid \forall \left( {i, j}\right) ,{\mathcal{C}}_{i, j}^{e}\left( {t \mid x}\right) = 0}\right\} . \tag{9}
\]
The layer-wise head activation ratio on a token set \( \mathcal{S} \) is
\[
{R}^{\left( l\right) }\left( \mathcal{S}\right) = {\mathbb{E}}_{\left( {x, t}\right) \in \mathcal{S}}\frac{1}{\left| {\mathcal{A}}^{\left( l\right) }\right| }\mathop{\sum }\limits_{{a \in {\mathcal{A}}^{\left( l\right) }}}\mathbb{I}\left\lbrack {{\begin{Vmatrix}{\mathbf{o}}_{t}^{\left( l, a\right) }\end{Vmatrix}}_{2} > \gamma }\right\rbrack , \tag{10}
\]
where \( \gamma \) is a fixed activation threshold (details in Appendix C.3). We then report the activation drift
\[
\Delta {R}^{\left( l\right) } = {R}^{\left( l\right) }\left( {\mathcal{S}}_{\text{ conf }}\right) - {R}^{\left( l\right) }\left( {\mathcal{S}}_{\text{ nconf }}\right) , \tag{11}
\]
which measures how strongly attention activations shift when effective conflicts are triggered.
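Eqs. (10)-(11) can be computed directly from per-head output norms; the sketch below uses random toy head outputs, a random conflict mask, and \( \gamma = 1 \) purely for illustration (the paper's threshold is set as in Appendix C.3):

```python
import numpy as np

def head_activation_ratio(head_outputs, token_mask, gamma=1.0):
    """Eq. (10): fraction of heads whose output L2 norm exceeds gamma,
    averaged over the tokens selected by token_mask.
    head_outputs: (n_tokens, n_heads, d_head) array of o_t^{(l,a)}."""
    norms = np.linalg.norm(head_outputs[token_mask], axis=-1)  # (n_sel, n_heads)
    return (norms > gamma).mean()

rng = np.random.default_rng(0)
o = rng.normal(size=(100, 8, 4))      # toy head outputs at one layer
conf = rng.random(100) < 0.2          # S_conf: effective-conflict token mask

# Eq. (11): activation drift between conflict and no-conflict tokens
delta_R = head_activation_ratio(o, conf) - head_activation_ratio(o, ~conf)
```

A negative `delta_R` corresponds to the suppression pattern reported for the R1-series models, a positive one to the enhancement seen in Llama-3.2V.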
As shown in Figure 4, both measurements reveal distinct depth-dependent signatures. (I) Probe Separability: In 7B models (R1-Onevision, Ocean-R1), discrimination performance rises in early layers and maximizes in the mid-to-late block (Layers 15-22), where AUC scores for \( {\mathcal{C}}_{\mathrm{{PT}}} \) and \( {\mathcal{C}}_{\mathrm{{VT}}} \) consistently exceed 93%, before declining in the final layers. Llama-3.2V pushes this saturation deeper, maintaining highly robust separability \( \left( { \geq {95}\% }\right) \) as deep as Layer 39. (II) Activation Drift: This aligns with attention shifts. R1-series models show negative drift (suppression) peaking at Layers 18-22, while Llama-3.2V displays positive drift (enhancement) in Layers 30-39. We term these co-located peaks (Layer 20 for 7B, 39 for 11B) the conflict encoding stage, anchoring our analysis.
---
Conclusion (Global Effective Conflicts): By aggregating stepwise effective conflict signals \( {\mathcal{C}}_{i, j}^{e}\left( {t \mid x}\right) \) along the reasoning trajectory, different objective conflict types \( {\mathcal{C}}_{i, j}^{o}\left( x\right) \) become clearly and robustly separable at the sample level. This indicates that effective conflicts are not merely local artifacts, but form consistent global patterns that reliably reflect the underlying input-level objective conflict structure.
---
Table 2. Assessment of conflict probe performance across three VLM backbones. We report AUC and Recall at FPR=0.1 (Rec@0.1) under the One-vs-Rest setting. Gray rows indicate the Span-Max aggregation, which consistently outperforms token-level baselines. Values are presented as percentages (%).
<table><tr><td rowspan="2">Models</td><td rowspan="2">Probe</td><td rowspan="2">Granularity</td><td colspan="4">AUC (%)</td><td colspan="4">Recall@0.1 (%)</td></tr><tr><td>w/o Conflict</td><td>\( {\mathcal{C}}_{\mathrm{VP}}^{o} = 1 \)</td><td>\( {\mathcal{C}}_{\mathrm{PT}}^{o} = 1 \)</td><td>\( {\mathcal{C}}_{\mathrm{VT}}^{o} = 1 \)</td><td>w/o Conflict</td><td>\( {\mathcal{C}}_{\mathrm{VP}}^{o} = 1 \)</td><td>\( {\mathcal{C}}_{\mathrm{PT}}^{o} = 1 \)</td><td>\( {\mathcal{C}}_{\mathrm{VT}}^{o} = 1 \)</td></tr>
<tr><td rowspan="6">R1-Onevision (7B)</td><td rowspan="3">Linear</td><td>All Token</td><td>81.7±0.1</td><td>86.3±0.2</td><td>92.0±0.1</td><td>94.8±0.2</td><td>50.0±0.3</td><td>56.8±0.2</td><td>75.1±0.1</td><td>87.3±0.3</td></tr>
<tr><td>Span Only</td><td>76.8±0.2</td><td>82.5±0.1</td><td>90.8±0.3</td><td>95.4±0.2</td><td>35.5±0.2</td><td>44.5±0.3</td><td>70.5±0.2</td><td>88.5±0.1</td></tr>
<tr><td>Span-Max</td><td>93.2±0.1</td><td>94.2±0.2</td><td>98.6±0.1</td><td>97.3±0.1</td><td>81.5±0.2</td><td>82.4±0.1</td><td>97.2±0.1</td><td>93.8±0.2</td></tr>
<tr><td rowspan="3">MLP</td><td>All Token</td><td>95.5±0.1</td><td>90.4±0.2</td><td>85.2±0.3</td><td>94.1±0.1</td><td>89.1±0.2</td><td>68.4±0.1</td><td>62.7±0.2</td><td>79.3±0.1</td></tr>
<tr><td>Span Only</td><td>95.7±0.2</td><td>86.1±0.3</td><td>80.3±0.1</td><td>93.3±0.2</td><td>89.8±0.1</td><td>53.0±0.2</td><td>43.7±0.2</td><td>76.8±0.3</td></tr>
<tr><td>Span-Max</td><td>97.3±0.1</td><td>94.5±0.1</td><td>93.2±0.2</td><td>99.1±0.1</td><td>93.4±0.2</td><td>82.4±0.1</td><td>82.1±0.1</td><td>98.7±0.2</td></tr>
<tr><td rowspan="6">Ocean-R1 (7B-Instruct)</td><td rowspan="3">Linear</td><td>All Token</td><td>83.0±0.2</td><td>90.6±0.1</td><td>94.2±0.2</td><td>94.9±0.1</td><td>53.7±0.3</td><td>69.4±0.1</td><td>81.3±0.2</td><td>85.6±0.1</td></tr>
<tr><td>Span Only</td><td>78.5±0.1</td><td>86.7±0.3</td><td>90.0±0.2</td><td>97.6±0.1</td><td>41.4±0.2</td><td>52.5±0.2</td><td>66.6±0.1</td><td>94.6±0.3</td></tr>
<tr><td>Span-Max</td><td>95.0±0.2</td><td>95.9±0.1</td><td>98.6±0.1</td><td>98.8±0.1</td><td>85.7±0.1</td><td>87.9±0.2</td><td>97.1±0.1</td><td>97.8±0.2</td></tr>
<tr><td rowspan="3">MLP</td><td>All Token</td><td>95.5±0.1</td><td>92.8±0.1</td><td>85.0±0.2</td><td>95.5±0.1</td><td>87.1±0.2</td><td>75.6±0.3</td><td>61.6±0.1</td><td>85.2±0.2</td></tr>
<tr><td>Span Only</td><td>97.8±0.2</td><td>87.3±0.2</td><td>79.7±0.1</td><td>91.7±0.2</td><td>95.7±0.3</td><td>53.9±0.1</td><td>43.3±0.2</td><td>71.0±0.1</td></tr>
<tr><td>Span-Max</td><td>99.2</td><td>96.5±0.1</td><td>95.3±0.2</td><td>98.4±0.1</td><td>98.9±0.1</td><td>89.8±0.2</td><td>87.5±0.1</td><td>96.1±0.1</td></tr>
<tr><td rowspan="6">Llama-3.2V (11B-cot)</td><td rowspan="3">Linear</td><td>All Token</td><td>88.7±0.2</td><td>90.5±0.1</td><td>96.9±0.2</td><td>94.5±0.1</td><td>68.4±0.3</td><td>67.2±0.2</td><td>94.4±0.1</td><td>85.8±0.2</td></tr>
<tr><td>Span Only</td><td>79.6±0.2</td><td>85.8±0.2</td><td>90.2</td><td>95.2±0.3</td><td>43.2±0.1</td><td>51.1±0.2</td><td>66.0±0.2</td><td>88.4</td></tr>
<tr><td>Span-Max</td><td>93.9±0.1</td><td>93.4</td><td>98.4</td><td>97.2±0.1</td><td>83.5±0.2</td><td>76.9±0.1</td><td>96.1±0.2</td><td>93.1±0.1</td></tr>
<tr><td rowspan="3">MLP</td><td>All Token</td><td>95.8±0.2</td><td>90.7±0.1</td><td>88.7±0.2</td><td>96.9±0.1</td><td>89.4±0.1</td><td>64.3±0.2</td><td>70.6±0.3</td><td>93.7±0.2</td></tr>
<tr><td>Span Only</td><td>96.1</td><td>85.5±0.3</td><td>79.2±0.2</td><td>89.2±0.1</td><td>90.8±0.2</td><td>46.7±0.1</td><td>40.5±0.2</td><td>65.2±0.1</td></tr>
<tr><td>Span-Max</td><td>97.2</td><td>94.5±0.2</td><td>93.4±0.1</td><td>97.8</td><td>93.2±0.1</td><td>82.3±0.2</td><td>82.3±0.1</td><td>94.4±0.2</td></tr></table>
Conclusion (Layer-level): Layer-scanning reveals that both probe separability and attention drift co-localize in a specific mid-to-late layer band across all three MLLM backbones. This indicates that conflict-related signals are depth-dependent and concentrated in a distinct "conflict encoding stage," bridging early perception and late decoding rather than being uniformly distributed across the network.
### 4.4. Linearity of Conflict Representation
To comprehensively assess the nature of effective conflict signals \( {\mathcal{C}}_{i, j}^{e}\left( {t \mid x}\right) \) encoded in the hidden states \( {\mathbf{h}}_{t}^{\left( l\right) } \) (specifically, whether they are explicitly linear or highly entangled), we conduct experiments on the layers identified as the "Conflict Encoding Stage" in Section 4.3. We design two probe architectures with distinct underlying assumptions: (I) Linear Probe \( \left( {f}_{lin}\right) \), consisting of a single projection layer \( \mathbf{W} \in {\mathbb{R}}^{d \times 4} \) (where \( d \) denotes the hidden state dimension), aimed at evaluating the Linear Separability of conflict states. High classification accuracy with a linear mapping would indicate that the model has formed clear, decoupled conflict boundaries at the current layer. (II) MLP Probe \( \left( {f}_{mlp}\right) \), designed to assess Non-linear Entanglement. Recognizing the potential manifold complexity in deep Transformer features, we construct a deep MLP with three dimension-reducing layers \( \left( {{1024} \rightarrow {512} \rightarrow {256}}\right) \) and ReLU activation to capture high-order interaction features.
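The two probe heads can be sketched as follows. This is an illustrative NumPy version with a reduced hidden size \( d \) and reduced MLP widths; the actual probes operate on the backbone's hidden dimension and use the \( 1024 \rightarrow 512 \rightarrow 256 \) reduction described above:

```python
import numpy as np

rng = np.random.default_rng(0)
d, num_classes = 64, 4   # d stands in for the backbone hidden size (illustrative)

def linear_probe(h, W, b):
    """(I) Linear probe f_lin: a single projection W in R^{d x 4}."""
    return h @ W + b

def mlp_probe(h, layers):
    """(II) MLP probe f_mlp: dimension-reducing layers with ReLU, then a head."""
    x = h
    for W, b in layers[:-1]:
        x = np.maximum(x @ W + b, 0.0)          # ReLU activation
    W, b = layers[-1]
    return x @ W + b

# Toy dims mirror the 1024 -> 512 -> 256 reduction at a smaller scale.
dims = [d, 32, 16, 8, num_classes]
layers = [(rng.standard_normal((m, n)), np.zeros(n))
          for m, n in zip(dims[:-1], dims[1:])]

h = rng.standard_normal((5, d))                 # five token hidden states h_t^{(l)}
W_lin, b_lin = rng.standard_normal((d, num_classes)), np.zeros(num_classes)
lin_logits = linear_probe(h, W_lin, b_lin)      # shape (5, 4): one score per class
mlp_logits = mlp_probe(h, layers)               # same output shape
```

Both probes map each token hidden state to logits over the four conflict states (no conflict plus the three conflict types); only their capacity differs, which is exactly the variable the comparison isolates.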
As shown in Table 2, we report AUC and Recall@0.1 for both probes using "Span-Max" aggregation, which takes the maximum predicted probability across tokens within each span (details in Appendix C.5). The Linear Probe achieves strong performance across all conflict types: AUC reaches 93.2-98.8% and Recall@0.1 reaches 76.9-97.8%. For \( {\mathcal{C}}_{\mathrm{{PT}}} \) , Linear Probe achieves 98.6% AUC and 96.1-97.2% Recall@0.1; for \( {\mathcal{C}}_{\mathrm{{VP}}} \) and \( {\mathcal{C}}_{\mathrm{{VT}}} \) , it reaches 93.4-95.9% AUC and 76.9-87.9% Recall@0.1, comparable to MLP. The fact that a single linear layer suffices to achieve such performance indicates that for knowledge conflicts, the "features" extracted by LLMs are already explicitly disentangled in the high-dimensional space, and introducing additional nonlinear complexity (MLP) does not yield significant gain.
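Span-Max aggregation itself is simple: for each span, take the per-class maximum of the token-level probabilities produced by a trained probe. A minimal sketch with toy numbers:

```python
def span_max(token_probs, spans):
    """Span-Max: the span-level score for each conflict class is the maximum
    predicted probability over the tokens inside that span."""
    scores = []
    for start, end in spans:                               # [start, end) token ranges
        window = token_probs[start:end]
        scores.append([max(col) for col in zip(*window)])  # per-class max
    return scores

# Toy example: 5 tokens, 2 classes, two spans.
probs = [[0.1, 0.9], [0.3, 0.2], [0.8, 0.1], [0.2, 0.4], [0.6, 0.3]]
span_scores = span_max(probs, [(0, 2), (2, 5)])
# -> [[0.3, 0.9], [0.8, 0.4]]
```

Taking the maximum rather than the mean means a single strongly conflict-flagged token is enough to mark the span, which is consistent with the gains over the token-level baselines in Table 2.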
Conclusion (Linearity): A simple linear probe achieves detection performance comparable to that of a non-linear MLP. This suggests that effective conflicts are not entangled within complex nonlinear manifolds, but are explicitly and approximately linearly separable, which makes real-time detection of conflict states during inference possible.
## 5. Intervening in Knowledge Conflict
Section 4 showed that effective conflicts \( {\mathcal{C}}_{i, j}^{e}\left( {t \mid x}\right) \) are streaming-decodable from internal states and are encoded as linearly separable features in specific mid-to-late layers. Building on this observation, we ask the following: given an input with \( {\mathcal{C}}_{i, j}^{o}\left( x\right) = 1 \) , can inference-time interventions bias the model toward a desired knowledge source, or suppress the activation of effective conflicts during generation?
![bo_d6nb7sc601uc73e2hngg_6_152_184_1451_464_0.jpg](images/bo_d6nb7sc601uc73e2hngg_6_152_184_1451_464_0.jpg)
Figure 5. Semantic performance of targeted source control. We evaluate three conflict subsets \( \left( {{\mathcal{C}}_{\mathrm{{VP}}}^{o},{\mathcal{C}}_{\mathrm{{VT}}}^{o},{\mathcal{C}}_{\mathrm{{PT}}}^{o}}\right) \) using judge-based metrics: ASR (Anchor Support Rate, ↑), ARR (Anchor Rejection Rate, ↓), and OER (Obvious Error Rate, ↓). Forward/Reverse denote intervening toward the truth-anchored (benchmark-reliable) vs. conflicting source. Arrows indicate relative changes against the baseline. Note that VCD is inapplicable to the non-visual \( {\mathcal{C}}_{\mathrm{{PT}}}^{o} \) subset.
#### 5.1. A unified framework for directional interventions
Two control objectives. We study inference-time control under objectively conflicting inputs and consider two settings. (I) Targeted source control. We choose a target source \( {\mathcal{K}}_{s} \in \left\{ {{\mathcal{K}}_{i},{\mathcal{K}}_{j}}\right\} \) and intervene so that the model follows \( {\mathcal{K}}_{s} \) under conflict. This yields two directions: Forward, which intervenes toward the truth-anchored (benchmark-reliable) source, and Reverse, which enforces the opposite source. (II) Conflict mitigation. We measure whether interventions reduce how often effective conflicts are activated during generation, quantified by the expected fraction of reasoning steps where a conflict is detected:
\[
{\mathbb{E}}_{x}{\mathbb{E}}_{t}\left\lbrack {\mathbb{I}\left\lbrack {{\mathcal{C}}_{i, j}^{e}\left( {t \mid x}\right) = 1}\right\rbrack }\right\rbrack . \tag{12}
\]
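Eq. (12) is simply the per-trajectory fraction of steps whose effective-conflict indicator fires, averaged over inputs; a minimal sketch:

```python
def conflict_activation_rate(trajectories):
    """Eq. (12): expected fraction of reasoning steps flagged as effective
    conflict. `trajectories` holds one binary sequence C^e(t|x) per input."""
    per_input = [sum(traj) / len(traj) for traj in trajectories]
    return sum(per_input) / len(per_input)

# Two inputs: conflict fires on 2/4 and 1/5 of the reasoning steps.
rate = conflict_activation_rate([[0, 1, 1, 0], [0, 0, 1, 0, 0]])
# (0.5 + 0.2) / 2 = 0.35
```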
A unified view of directional interventions. Let \( {\ell }_{t} \in \; {\mathbb{R}}^{\left| \mathcal{V}\right| } \) denote the pre-softmax logits at step \( t \) . We view an inference-time intervention as modifying decoding through an additive logit perturbation, either directly or implicitly via hidden-state manipulation:
\[
{\widetilde{p}}_{t} = \operatorname{softmax}\left( {{\ell }_{t} + \Delta {\ell }_{t}}\right) ,\;\Delta {\ell }_{t} = \mathcal{I}\left( {x,{y}_{ < t}}\right) . \tag{13}
\]
We consider three instantiations of \( \mathcal{I} \): (I) Visual contrastive decoding (VCD). VCD applies a logit-level correction (Leng et al., 2023) and is restricted to conflicts involving visual sources (i.e., \( {\mathcal{C}}_{\mathrm{{VP}}}^{o} = 1 \) or \( {\mathcal{C}}_{\mathrm{{VT}}}^{o} = 1 \)). (II) Representation steering. Leveraging the linear separability found in Section 4, we adopt representation steering (Zhang et al., 2025c), which shifts the hidden state at a selected conflict-sensitive layer by a learned direction, i.e., \( {\widetilde{\mathbf{h}}}_{t} = {\mathbf{h}}_{t} + \lambda \mathbf{v} \) (where \( \lambda \) is the steering strength and \( \mathbf{v} \) is the direction vector). (III) Probe-guided control. We use the streaming probe to score candidate continuations, reweighting decoding toward options less likely to trigger conflicts. For the top-\( k \) candidates \( {\mathcal{V}}_{k} \) with base probabilities \( {p}_{t}\left( w\right) \), we apply
\[
{\widetilde{p}}_{t}\left( w\right) \propto {p}_{t}\left( w\right) \exp \left( {\alpha {P}_{t}^{\left( w\right) }}\right) ,\;w \in {\mathcal{V}}_{k}, \tag{14}
\]
where \( {P}_{t}^{\left( w\right) } \) is the probe-predicted probability of the no-conflict state for the continuation committing to token \( w \) , and \( \alpha \) controls the strength of guidance. Full implementation details and hyperparameters are provided in Appendices D.4 and D.5.
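The probe-guided reweighting of Eq. (14) can be sketched as follows. Here `no_conflict_probs` stands in for the probe's no-conflict probability \( {P}_{t}^{(w)} \) for each candidate continuation, which in the real system requires scoring each candidate with the streaming probe:

```python
import math

def probe_guided_reweight(base_probs, no_conflict_probs, alpha=1.0):
    """Eq. (14): reweight each top-k candidate w by exp(alpha * P_t^{(w)}),
    where P_t^{(w)} is the probe's no-conflict probability for w, then
    renormalize over the candidate set."""
    weighted = {w: p * math.exp(alpha * no_conflict_probs[w])
                for w, p in base_probs.items()}
    z = sum(weighted.values())
    return {w: v / z for w, v in weighted.items()}

# Two candidates with equal base probability; the probe prefers "cat",
# so decoding mass shifts toward the less conflict-prone continuation.
out = probe_guided_reweight({"cat": 0.5, "dog": 0.5},
                            {"cat": 0.9, "dog": 0.1}, alpha=2.0)
```

With \( \alpha = 0 \) the reweighting is the identity, so \( \alpha \) smoothly interpolates between ordinary decoding and strongly conflict-averse decoding.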
### 5.2. Targeted source control: semantic-level evaluation
We evaluate whether targeted interventions successfully bias the model toward a specified knowledge source under objectively conflicting inputs. We adopt an automated assertion-level judge, implemented with a strong off-the-shelf large language model, to assess semantic alignment with the target source. The judge extracts factual claims from the model output and verifies each claim against the corresponding truth anchor (image, input text, or world knowledge), producing compact aggregate metrics: ASR (Anchor Support Rate), ARR (Anchor Rejection Rate), and OER (Obvious Error Rate). To validate judge reliability, we conducted human verification on a stratified 10% subset (~1,500 spans), yielding high inter-annotator agreement \( \left( {\kappa = {0.87}}\right) \), confirming that automated verdicts align closely with human perception of conflict resolution (details in Appendix D.2).
As shown in Figure 5, targeted source control is feasible but exhibits a pronounced directional asymmetry. Across objective-conflict subsets, Forward interventions (intervening toward the truth-anchored source; vision for VP/VT and prior knowledge for PT) reliably improve semantic alignment, whereas Reverse control (forcing reliance on the competing source) often degrades it. We hypothesize this asymmetry reflects an internal source-reliability prior: when sources disagree, the model resists reversing arbitration away from the source it treats as reliable, even under strong contextual pressure. This asymmetry cannot be explained by construction bias alone: if it were purely a data artifact, we would expect the probe to learn shortcuts to anchor proximity rather than capturing genuine conflict dynamics. However, the asymmetry persists across all three architecturally distinct backbones, suggesting it reflects shared instruction-tuning biases that favor user-provided context (Sharma et al., 2024; Zhang et al., 2025c). Under Forward control, probe-guided interventions improve ASR while lowering OER by \( \sim {30}\% \); VCD yields stronger but selective gains on \( {\mathcal{C}}_{\mathrm{{VP}}} \) (ASR +15%, ARR halved). Reverse control remains challenging: most methods regress or show negligible gains. Mechanistically, the probe primarily suppresses conflict states rather than enforcing weaker-source selection. This highlights a trade-off: VCD is high-gain but direction-sensitive, whereas representation steering reliably reduces errors (ARR/OER) but rarely drives sustained ASR gains.
Table 3. Token-level conflict mitigation under the forward direction. Results are reported on three objective-conflict subsets ( \( {\mathcal{C}}_{\mathrm{{VP}}}^{o} = 1 \), \( {\mathcal{C}}_{\mathrm{{VT}}}^{o} = 1 \), and \( {\mathcal{C}}_{\mathrm{{PT}}}^{o} = 1 \) ) across three backbones. We report four token-level mitigation metrics: SS↑, CR↓, CAC↓, and CCI↓ (metric definitions in Appendix D.3). Superscripts denote changes relative to the baseline. VCD is not applicable when \( {\mathcal{C}}_{\mathrm{{PT}}}^{o} = 1 \) and is therefore reported only for the first two subsets.
<table><tr><td rowspan="2">Method</td><td rowspan="2">Subset</td><td colspan="4">R1-Onevision-7B</td><td colspan="4">Ocean-R1-7B-Instruct</td><td colspan="4">Llama-3.2V-11B-cot</td></tr><tr><td>SS↑</td><td>CAC↓</td><td>CCI↓</td><td>CR↓</td><td>SS↑</td><td>CAC↓</td><td>CCI↓</td><td>CR↓</td><td>SS↑</td><td>CAC↓</td><td>CCI↓</td><td>CR↓</td></tr>
<tr><td rowspan="3">Baseline</td><td>\( {\mathcal{C}}_{\mathrm{VP}}^{o} = 1 \)</td><td>0.94</td><td>0.04</td><td>0.70</td><td>0.03</td><td>0.89</td><td>0.07</td><td>0.71</td><td>0.06</td><td>0.94</td><td>0.04</td><td>0.45</td><td>0.02</td></tr>
<tr><td>\( {\mathcal{C}}_{\mathrm{VT}}^{o} = 1 \)</td><td>0.88</td><td>0.08</td><td>0.80</td><td>0.10</td><td>0.87</td><td>0.09</td><td>0.79</td><td>0.10</td><td>0.90</td><td>0.06</td><td>0.72</td><td>0.03</td></tr>
<tr><td>\( {\mathcal{C}}_{\mathrm{PT}}^{o} = 1 \)</td><td>0.82</td><td>0.12</td><td>0.80</td><td>0.15</td><td>0.82</td><td>0.12</td><td>0.80</td><td>0.15</td><td>0.84</td><td>0.11</td><td>0.70</td><td>0.11</td></tr>
<tr><td rowspan="2">VCD</td><td>\( {\mathcal{C}}_{\mathrm{VP}}^{o} = 1 \)</td><td>\( {0.92}^{-{0.02}} \)</td><td>\( {0.05}^{+{0.01}} \)</td><td>0.69</td><td>\( {0.04}^{+{0.01}} \)</td><td>0.90</td><td>\( {0.06}^{-{0.01}} \)</td><td>\( {0.69}^{-{0.01}} \)</td><td>\( {0.05}^{-{0.01}} \)</td><td>0.85</td><td>\( {0.08}^{+{0.04}} \)</td><td>0.63</td><td>\( {0.06}^{+{0.05}} \)</td></tr>
<tr><td>\( {\mathcal{C}}_{\mathrm{VT}}^{o} = 1 \)</td><td>\( {0.90}^{+{0.01}} \)</td><td>\( {0.07}^{-{0.01}} \)</td><td>0.79</td><td>\( {0.08}^{-{0.01}} \)</td><td>0.92</td><td>\( {0.05}^{-{0.03}} \)</td><td>\( {0.75}^{-{0.04}} \)</td><td>\( {0.05}^{-{0.05}} \)</td><td>0.78</td><td>\( {0.12}^{+{0.06}} \)</td><td>0.69</td><td>\( {0.15}^{+{0.11}} \)</td></tr>
<tr><td rowspan="3">Steering</td><td>\( {\mathcal{C}}_{\mathrm{VP}}^{o} = 1 \)</td><td>\( {0.92}^{-{0.02}} \)</td><td>\( {0.05}^{+{0.01}} \)</td><td>0.69</td><td>\( {0.05}^{+{0.02}} \)</td><td>0.89</td><td>\( {0.07}^{+{0.00}} \)</td><td>\( {0.71}^{+{0.00}} \)</td><td>\( {0.07}^{+{0.01}} \)</td><td>0.92</td><td>\( {0.05}^{+{0.01}} \)</td><td>\( {0.55}^{+{0.10}} \)</td><td>\( {0.03}^{+{0.02}} \)</td></tr>
<tr><td>\( {\mathcal{C}}_{\mathrm{VT}}^{o} = 1 \)</td><td>\( {0.91}^{+{0.02}} \)</td><td>\( {0.06}^{-{0.02}} \)</td><td>0.76</td><td>\( {0.07}^{-{0.03}} \)</td><td>0.91</td><td>\( {0.06}^{-{0.03}} \)</td><td>\( {0.77}^{-{0.03}} \)</td><td>\( {0.06}^{-{0.04}} \)</td><td>\( {0.90}^{+{0.00}} \)</td><td>\( {0.06}^{-{0.00}} \)</td><td>\( {0.67}^{-{0.04}} \)</td><td>\( {0.04}^{+{0.01}} \)</td></tr>
<tr><td>\( {\mathcal{C}}_{\mathrm{PT}}^{o} = 1 \)</td><td>\( {0.77}^{-{0.06}} \)</td><td>\( {0.16}^{+{0.04}} \)</td><td>0.76</td><td>\( {0.20}^{+{0.05}} \)</td><td>\( {0.82}^{+{0.00}} \)</td><td>\( {0.12}^{-{0.00}} \)</td><td>0.80</td><td>\( {0.15}^{+{0.00}} \)</td><td>\( {0.84}^{+{0.00}} \)</td><td>\( {0.11}^{-{0.00}} \)</td><td>\( {0.69}^{-{0.01}} \)</td><td>\( {0.12}^{+{0.01}} \)</td></tr>
<tr><td rowspan="3">Probe-guided</td><td>\( {\mathcal{C}}_{\mathrm{VP}}^{o} = 1 \)</td><td>\( {0.95}^{+{0.01}} \)</td><td>\( {0.03}^{-{0.01}} \)</td><td>\( {0.67}^{-{0.03}} \)</td><td>\( {0.02}^{-{0.01}} \)</td><td>\( {0.92}^{+{0.03}} \)</td><td>\( {0.05}^{-{0.02}} \)</td><td>0.66</td><td>\( {0.03}^{-{0.03}} \)</td><td>\( {0.94}^{+{0.01}} \)</td><td>\( {0.04}^{-{0.00}} \)</td><td>\( {0.39}^{-{0.06}} \)</td><td>\( {0.02}^{-{0.00}} \)</td></tr>
<tr><td>\( {\mathcal{C}}_{\mathrm{VT}}^{o} = 1 \)</td><td>\( {0.94}^{+{0.06}} \)</td><td>\( {0.04}^{-{0.04}} \)</td><td>\( {0.64}^{-{0.16}} \)</td><td>\( {0.02}^{-{0.07}} \)</td><td>\( {0.93}^{+{0.06}} \)</td><td>\( {0.04}^{-{0.04}} \)</td><td>0.72</td><td>\( {0.04}^{-{0.06}} \)</td><td>\( {0.92}^{+{0.02}} \)</td><td>\( {0.05}^{-{0.01}} \)</td><td>\( {0.67}^{-{0.05}} \)</td><td>\( {0.03}^{-{0.00}} \)</td></tr>
<tr><td>\( {\mathcal{C}}_{\mathrm{PT}}^{o} = 1 \)</td><td>\( {0.78}^{-{0.04}} \)</td><td>\( {0.10}^{-{0.02}} \)</td><td>\( {0.60}^{-{0.20}} \)</td><td>\( {0.15}^{+{0.01}} \)</td><td>\( {0.87}^{+{0.04}} \)</td><td>\( {0.08}^{-{0.04}} \)</td><td>0.72</td><td>\( {0.09}^{-{0.06}} \)</td><td>\( {0.87}^{+{0.04}} \)</td><td>\( {0.08}^{-{0.03}} \)</td><td>0.63</td><td>\( {0.10}^{-{0.01}} \)</td></tr></table>
Conclusion (Targeted Source Control). When objective conflicts are present, inference-time interventions exhibit a clear directional asymmetry: biasing the model toward fact-consistent, truth-anchored sources is significantly easier and more reliable than forcing it to rely on fact-inconsistent sources. This suggests that conflict resolution in MLLMs is governed by a stable, source-dependent inductive tendency, which can be strengthened but is difficult to reverse.
### 5.3. Conflict mitigation under the default direction
Semantic evaluation in Section 5.2 demonstrated that, under objectively conflicting inputs, inference-time interventions can bias model outputs toward the truth-anchored source. Here, we pose a complementary process-level question: under the default (Forward) direction, can we reduce the activation of effective conflicts \( {\mathcal{C}}_{i, j}^{e}\left( {t \mid x}\right) \) during generation? We employ token-level mitigation metrics to summarize these internal dynamics (Support Score (SS), Conflict Rate (CR), Confidence-Adjusted Conflict (CAC), and Conflict Confidence Index (CCI)) as a further complement to the independent semantic correctness evaluation in Figure 5. Table 3 summarizes the token-level mitigation results. We observe that interventions targeting the identified conflict features (probe-guided control) consistently suppress conflict dynamics across backbones. Specifically, on visually involved subsets \( \left( {\mathcal{C}}_{\mathrm{{VT}}}\right) \), the frequency of conflict activation (CR) decreases significantly (e.g., \( {0.10} \rightarrow {0.02} \) on R1-Onevision). Crucially, even when conflict frequency remains stable (e.g., \( {\mathcal{C}}_{\mathrm{{PT}}} \)), confidence-aware measures reveal deeper suppression (CCI drops by 25%), indicating that the intervention mitigates the intensity of conflicts even if not their occurrence. In contrast, rigid interventions like representation steering or unguided perturbations like VCD struggle to generalize. For instance, VCD exacerbates the conflict rate roughly fivefold on Llama-3.2V for \( {\mathcal{C}}_{\mathrm{{VT}}} \left( {{0.03} \rightarrow {0.15}}\right) \). This disparity highlights that effective mitigation requires precise targeting of the conflict-encoding subspaces rather than broad adjustments.
Conclusion (Conflict Mitigation). Guiding the model toward the reliable source attenuates internal conflict dynamics during reasoning, reducing both the intensity and the frequency of effective conflict states. This implies that effective conflict \( {\mathcal{C}}_{i, j}^{e}\left( {t \mid x}\right) \) activation is not an inherent attribute of generation, but a plastic internal state that can be suppressed during reasoning.
## 6. Conclusion
In this work, we study failures in multimodal long-CoT reasoning from the perspective of knowledge conflict, rather than knowledge absence. By distinguishing objective conflicts from effective conflicts during reasoning, we show that many failures arise from how conflicting knowledge is resolved over time. We find that effective conflicts are encoded as explicit and linearly decodable signals, concentrated in mid-to-late layers of the model. Leveraging these signals, we uncover a pronounced directional asymmetry: guiding the model toward its reliability-aligned source is substantially easier than forcing conflict resolution in the opposite direction, indicating a biased and path-dependent mechanism. Looking forward, we hope this perspective motivates analysis and control methods for richer conflict structures and more complex multimodal reasoning settings.
## Impact Statement
This paper presents work whose goal is to advance the understanding and reliability of MLLMs in long-CoT reasoning scenarios. By diagnosing knowledge conflicts and their intervention mechanisms, our research contributes to making AI systems more transparent and trustworthy. The diagnostic framework and intervention methods proposed here could help identify and mitigate reasoning failures before deployment, potentially reducing the propagation of misinformation or hallucinated content in real-world applications. We do not foresee specific negative societal consequences that need to be highlighted beyond the general considerations applicable to advancing machine learning capabilities.

# TruthfulRAG: Resolving Factual-level Conflicts in Retrieval-Augmented Generation with Knowledge Graphs
Shuyi Liu, Yuming Shang, Xi Zhang*
Key Laboratory of Trustworthy Distributed Computing and Service (MoE)
Beijing University of Posts and Telecommunications, China
\{liushuyi111, shangym, zhangx\}@bupt.edu.cn
## Abstract
Retrieval-Augmented Generation (RAG) has emerged as a powerful framework for enhancing the capabilities of Large Language Models (LLMs) by integrating retrieval-based methods with generative models. As external knowledge repositories continue to expand and the parametric knowledge within models becomes outdated, a critical challenge for RAG systems is resolving conflicts between retrieved external information and LLMs' internal knowledge, which can significantly compromise the accuracy and reliability of generated content. However, existing approaches to conflict resolution typically operate at the token or semantic level, often leading to fragmented and partial understanding of factual discrepancies between LLMs' knowledge and context, particularly in knowledge-intensive tasks. To address this limitation, we propose TruthfulRAG, the first framework that leverages Knowledge Graphs (KGs) to resolve factual-level knowledge conflicts in RAG systems. Specifically, TruthfulRAG constructs KGs by systematically extracting triples from retrieved content, utilizes query-based graph retrieval to identify relevant knowledge, and employs entropy-based filtering mechanisms to precisely locate conflicting elements and mitigate factual inconsistencies, thereby enabling LLMs to generate faithful and accurate responses. Extensive experiments reveal that TruthfulRAG outperforms existing methods, effectively alleviating knowledge conflicts and improving the robustness and trustworthiness of RAG systems.
## Introduction
Large Language Models (LLMs) have demonstrated impressive performance across diverse natural language understanding and generation tasks (Achiam et al. 2023; Touvron et al. 2023; Yang et al. 2025). Despite their proficiency, LLMs remain ineffective in handling specialized, privacy-sensitive, or time-sensitive knowledge that is not encompassed within their training corpora (Zhang et al. 2024; Huang et al. 2025). As a solution, Retrieval-Augmented Generation (RAG) has emerged as a promising paradigm that enhances the relevance and factuality of the generated responses by integrating external knowledge retrieval with the remarkable generative capabilities of LLMs (Lewis et al. 2020; Gao et al. 2023; Fan et al. 2024). However, as RAG systems continuously update their knowledge repositories, the temporal disparity between dynamic external sources and static parametric knowledge within LLMs inevitably leads to knowledge conflicts (Xie et al. 2023; Xu et al. 2024; Shi et al. 2024), which can significantly undermine the accuracy and reliability of the generated content.
![bo_d6nbbd4601uc73e2hqsg_0_930_625_726_730_0.jpg](images/bo_d6nbbd4601uc73e2hqsg_0_930_625_726_730_0.jpg)
Figure 1: The illustration of knowledge conflicts and the differences between existing solutions and TruthfulRAG.
Recent research has begun to investigate the impact of knowledge conflicts on the performance of RAG systems (Chen, Zhang, and Choi 2022; Xie et al. 2023; Tan et al. 2024) and explore methods to mitigate such conflicts (Wang et al. 2024; Jin et al. 2024; Zhang et al. 2025; Bi et al. 2025). Existing resolution approaches can be categorized into two methodological types: (i) token-level methods, which manage LLMs' preference between internal and external knowledge by adjusting the probability distribution over the output tokens (Jin et al. 2024; Bi et al. 2025); (ii) semantic-level methods, which resolve conflicts by semantically integrating and aligning knowledge segments from internal and external sources (Wang et al. 2024; Zhang et al. 2025). However, these token-level or semantic-level conflict resolution methods generally employ coarse-grained strategies that rely on fragmented data representations, resulting in insufficient contextual awareness. This may prevent LLMs from accurately capturing complex interdependencies and fine-grained factual inconsistencies, especially in knowledge-intensive conflict scenarios (Han et al. 2024).
---
*Corresponding author.
Copyright © 2026, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
---
To address the above limitations, we propose TruthfulRAG, the first framework that leverages Knowledge Graphs (KGs) to resolve factual-level conflicts in RAG systems. As illustrated in Figure 1, unlike previous studies, TruthfulRAG uses structured triple-based knowledge representations to construct reliable contexts, thereby enhancing the confidence of LLMs in external knowledge and facilitating trustworthy reasoning. The TruthfulRAG framework comprises three key modules: (a) Graph Construction, which derives structured triples from retrieved external knowledge by identifying entities, relations, and attributes to construct knowledge graphs; (b) Graph Retrieval, which conducts query-based retrieval algorithms to obtain relevant knowledge that exhibits strong factual associations with the input query; and (c) Conflict Resolution, which applies entropy-based filtering techniques to locate conflicting elements and mitigate factual inconsistencies, ultimately forming more reliable reasoning paths and promoting more accurate outputs. This framework integrates seamlessly with existing RAG architectures, enabling the extraction of highly relevant and factually consistent knowledge, effectively eliminating factual-level conflicts and improving generation reliability.
The contributions of this paper are as follows:
- We discover that constructing contexts through textual representations on structured triples can enhance the confidence of LLMs in external knowledge, thereby promoting trustworthy and reliable model reasoning.
- We introduce TruthfulRAG, the first framework that leverages knowledge graphs to resolve factual-level conflicts in RAG systems through systematic triple extraction, query-based graph retrieval, and entropy-based filtering mechanisms.
- We conduct extensive experiments demonstrating that TruthfulRAG outperforms existing methods in mitigating knowledge conflicts while improving the robustness and trustworthiness of RAG systems.
## Methodology
In this section, we provide a detailed introduction to the TruthfulRAG framework. As illustrated in Figure 2, TruthfulRAG comprises three interconnected modules: (i) Graph Construction, which transforms unstructured retrieved content into structured knowledge graphs through systematic triple extraction; (ii) Graph Retrieval, which employs query-aware graph traversal algorithms to identify semantically relevant reasoning paths; and (iii) Conflict Resolution, which utilizes entropy-based filtering mechanisms to detect and mitigate factual inconsistencies between parametric and external knowledge.
## Graph Construction
The construction of a knowledge graph begins with the conversion of raw information retrieved from the RAG system into structured knowledge representations through systematic entity-relation-attribute extraction.
Given the retrieved content \( C \) for the user's query \( q \), we first perform fine-grained semantic segmentation to partition the content into coherent textual segments \( \mathcal{S} = \left\{ {{s}_{1},{s}_{2},\ldots ,{s}_{m}}\right\} \), where each segment \( {s}_{i} \) represents a semantically coherent unit containing factual information. For each textual segment \( {s}_{i} \in \mathcal{S} \), we employ the generative model \( \mathcal{M} \) from the RAG system to extract a set of structured knowledge triples \( {\mathcal{T}}_{i} = \left\{ {{\mathcal{T}}_{i,1},{\mathcal{T}}_{i,2},\ldots ,{\mathcal{T}}_{i, n}}\right\} \), with each triple \( {\mathcal{T}}_{i, j} = \left( {h, r, t}\right) \) consisting of a head entity \( h \), a relation \( r \), and a tail entity \( t \). This extraction process aims to capture both explicit factual statements and implicit semantic relationships embedded within the original content, thereby ensuring the comprehensiveness and semantic integrity of the knowledge representation.
The aggregated triple set from all retrieved content forms the foundation for constructing the knowledge graph \( \mathcal{G} \) :
\[
\mathcal{G} = \left( {\mathcal{E},\mathcal{R},{\mathcal{T}}_{\text{ all }}}\right) \tag{1}
\]
where \( \mathcal{E} = \mathop{\bigcup }\limits_{{i, j}}\left\{ {{h}_{i, j},{t}_{i, j}}\right\} \) represents the entity set collected from the heads and tails of all triples, \( \mathcal{R} = \mathop{\bigcup }\limits_{{i, j}}{r}_{i, j} \) denotes the relation set, and \( {\mathcal{T}}_{\text{all}} = \mathop{\bigcup }\limits_{{i, j}}{\mathcal{T}}_{i, j} \) constitutes the complete triple repository. This structured knowledge representation enables the filtering of low-information noise and captures detailed factual associations, thereby providing a clear and semantically enriched foundation for subsequent query-aware knowledge retrieval.
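Given a set of extracted triples, assembling \( \mathcal{G} = \left( \mathcal{E},\mathcal{R},{\mathcal{T}}_{\text{all}}\right) \) as in Eq. (1) is a straightforward aggregation; a minimal sketch with illustrative triples:

```python
def build_graph(triples):
    """Eq. (1): G = (E, R, T_all), where E collects head and tail entities
    and R collects relations across all extracted triples."""
    entities, relations = set(), set()
    for h, r, t in triples:
        entities.update((h, t))   # heads and tails form the entity set
        relations.add(r)          # relations form the relation set
    return entities, relations, list(triples)

# Toy triples standing in for output of the extraction model M.
triples = [("Paris", "capital_of", "France"),
           ("France", "member_of", "EU")]
E, R, T_all = build_graph(triples)
# E = {"Paris", "France", "EU"}, R = {"capital_of", "member_of"}
```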
## Graph Retrieval
To acquire knowledge that is strongly aligned with user queries at the factual level, we design a query-aware graph traversal algorithm that can identify critical knowledge paths within the graph, ensuring both semantic relevance and factual consistency in the retrieval process.
Initially, key elements are extracted from the user query \( q \) to serve as important references for matching components in the knowledge graph. These elements include the query's target entities, relations, and intent categories, denoted as \( {\mathcal{K}}_{q} \) . Subsequently, semantic similarity matching is employed to identify the top- \( k \) most relevant entities and relations within the knowledge graph:
\[
{\mathcal{E}}_{\mathrm{imp}} = \operatorname{TopK}\left( {\left\{ {\operatorname{sim}\left( {e,{\mathcal{K}}_{q}}\right) : e \in \mathcal{E}}\right\}, k}\right) \tag{2}
\]
\[
{\mathcal{R}}_{\mathrm{imp}} = \operatorname{TopK}\left( {\left\{ {\operatorname{sim}\left( {r,{\mathcal{K}}_{q}}\right) : r \in \mathcal{R}}\right\}, k}\right) \tag{3}
\]
where \( \operatorname{sim}\left( {\cdot , \cdot }\right) \) represents the semantic similarity function computed using dense embeddings, \( {\mathcal{E}}_{\mathrm{imp}} \) denotes the set of key entities, and \( {\mathcal{R}}_{\mathrm{imp}} \) represents the set of key relations. From each key entity \( e \in {\mathcal{E}}_{\mathrm{imp}} \), we perform a two-hop graph traversal to systematically collect the entire set of possible initial reasoning paths \( {\mathcal{P}}_{\mathrm{init}} \).
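The top-\( k \) selection in Eqs. (2) and (3) can be sketched as follows; the token-overlap similarity here is a toy stand-in for the dense-embedding \( \operatorname{sim} \) used by the actual system:

```python
def top_k(items, sim, key_elements, k):
    """Eqs. (2)-(3): keep the k graph elements most similar to the
    query's key elements K_q."""
    return sorted(items, key=lambda x: sim(x, key_elements), reverse=True)[:k]

def overlap_sim(element, key_elements):
    """Toy similarity: token overlap with K_q (illustrative only; the paper
    uses dense-embedding similarity)."""
    return len(set(element.lower().split("_")) & key_elements)

K_q = {"capital", "france"}                       # key elements of the query
candidates = ["Paris", "capital_of", "EU_member"]
E_imp = top_k(candidates, overlap_sim, K_q, k=2)
# "capital_of" scores highest; ties keep their original order.
```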
To further filter reasoning paths with stronger factual associations, we introduce a fact-aware scoring mechanism that evaluates the relevance of paths to the query based on the coverage of key entities and relations within each path \( p \):
\[
\operatorname{Ref}\left( p\right) = \alpha \cdot \frac{\left| \left\{ e \in p\right\} \cap {\mathcal{E}}_{\mathrm{imp}}\right| }{\left| {\mathcal{E}}_{\mathrm{imp}}\right| } + \beta \cdot \frac{\left| \left\{ r \in p\right\} \cap {\mathcal{R}}_{\mathrm{imp}}\right| }{\left| {\mathcal{R}}_{\mathrm{imp}}\right| } \tag{4}
\]
where \( \alpha \) and \( \beta \) are hyperparameters that control the relative importance of entity and relation coverage, respectively. The top-scored reasoning paths from \( {\mathcal{P}}_{\text{init}} \) constitute the core knowledge paths \( {\mathcal{P}}_{\text{super}} \):
\[
{\mathcal{P}}_{\text{super}} = \operatorname{TopK}\left( {\left\{ {\operatorname{Ref}\left( p\right) : p \in {\mathcal{P}}_{\text{init}}}\right\}, K}\right) \tag{5}
\]
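The scoring and selection of Eqs. (4)-(5) reduce to a few lines (a sketch under an assumed path representation of entity and relation lists; \( \alpha \), \( \beta \), and \( K \) as in the text):

```python
def ref_score(path, E_imp, R_imp, alpha=0.5, beta=0.5):
    """Eq. (4): weight a path's coverage of key entities and key relations.

    path: dict with "entities" and "relations" lists (our own representation).
    """
    ent_cov = len(set(path["entities"]) & E_imp) / len(E_imp) if E_imp else 0.0
    rel_cov = len(set(path["relations"]) & R_imp) / len(R_imp) if R_imp else 0.0
    return alpha * ent_cov + beta * rel_cov

def select_super_paths(paths, E_imp, R_imp, K, alpha=0.5, beta=0.5):
    """Eq. (5): keep the K highest-scoring initial paths as P_super."""
    return sorted(paths,
                  key=lambda p: ref_score(p, E_imp, R_imp, alpha, beta),
                  reverse=True)[:K]
```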
![bo_d6nbbd4601uc73e2hqsg_2_147_140_1502_806_0.jpg](images/bo_d6nbbd4601uc73e2hqsg_2_147_140_1502_806_0.jpg)
Figure 2: The overall pipeline of the TruthfulRAG framework. TruthfulRAG first extracts structured knowledge triples to construct a comprehensive knowledge graph. Subsequently, it employs query-aware graph traversal to identify salient reasoning paths, where each path comprises entities and relationships enriched with associated attributes. Finally, the framework applies entropy-based conflict resolution to detect and filter out corrective paths that challenge parametric misconceptions, thereby alleviating knowledge conflicts between internal and external information, prompting consistent and credible responses.
In order to construct detailed contextual information, each core reasoning path \( p \in {\mathcal{P}}_{\text{super}} \) is represented as a comprehensive contextual structure consisting of three essential components:
\[
p = {\mathcal{C}}_{\text{ path }} \oplus {\mathcal{C}}_{\text{ entities }} \oplus {\mathcal{C}}_{\text{ relations }} \tag{6}
\]
where:
- \( {\mathcal{C}}_{\text{path}} \) represents the complete sequential reasoning path \( {e}_{1}\overset{{r}_{1}}{ \rightarrow }{e}_{2}\overset{{r}_{2}}{ \rightarrow }\cdots \overset{{r}_{n - 1}}{ \rightarrow }{e}_{n} \), capturing the logical progression of entities connected through relational links.
- \( {\mathcal{C}}_{\text{entities}} = \left\{ {\left( {e,{\mathcal{A}}_{e}}\right) : e \in p \cap {\mathcal{E}}_{\text{imp}}}\right\} \) encompasses all important entities within the path along with their corresponding attribute descriptions \( {\mathcal{A}}_{e} \), providing thorough entity-specific information for the context.
- \( {\mathcal{C}}_{\text{relations}} = \left\{ {\left( {r,{\mathcal{A}}_{r}}\right) : r \in p \cap {\mathcal{R}}_{\text{imp}}}\right\} \) includes all important relations on the path together with their corresponding attributes \( {\mathcal{A}}_{r} \), enriching the semantic and contextual understanding of the relations.
This formalized representation of knowledge ensures that each extracted reasoning path preserves structural coherence through the entity-relation sequence and reinforces semantic richness via comprehensive attribute information, thereby facilitating more nuanced and context-aware knowledge integration for subsequent conflict resolution processes.
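A minimal serialization of the contextual structure in Eq. (6) could be (illustrative; the textual layout and helper names are our own assumptions, not the authors' format):

```python
def path_to_context(entities, relations, entity_attrs, relation_attrs):
    """Eq. (6): render C_path ⊕ C_entities ⊕ C_relations as one context block.

    entities:  [e_1, ..., e_n] along the path
    relations: [r_1, ..., r_{n-1}] linking consecutive entities
    entity_attrs / relation_attrs: attribute descriptions A_e / A_r
    """
    chain = entities[0] + "".join(
        f" --[{r}]--> {e}" for r, e in zip(relations, entities[1:]))
    ent_info = "; ".join(f"{e} ({entity_attrs.get(e, 'n/a')})" for e in entities)
    rel_info = "; ".join(f"{r} ({relation_attrs.get(r, 'n/a')})" for r in relations)
    return f"Path: {chain}\nEntities: {ent_info}\nRelations: {rel_info}"
```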
## Conflict Resolution
To address factual inconsistencies between parametric knowledge and external information, ensuring that LLMs consistently follow the retrieved knowledge paths to achieve accurate reasoning, we employ entropy-based model confidence analysis to investigate the influence of conflicting knowledge on model prediction uncertainty, thereby systematically identifying and resolving factual conflicts based on uncertainty quantification mechanisms.
We implement conflict detection by comparing model performance under two distinct conditions: (1) pure parametric generation without access to external context, and (2) retrieval-augmented generation that incorporates structured reasoning paths constructed from the knowledge graph. For parametric generation, we calculate the response probability of the LLM as a baseline:
\[
{P}_{\text{ param }}\left( {\text{ ans } \mid q}\right) = \mathcal{M}\left( q\right) \tag{7}
\]
where ans represents the generated answer and \( \mathcal{M}\left( q\right) \) denotes the response distribution of the LLM based solely on the query \( q \). For retrieval-augmented generation, we incorporate each reasoning path from \( {\mathcal{P}}_{\text{super}} \) as contextual information to obtain the model's output probability:
\[
{P}_{\text{aug}}\left( {\text{ans} \mid q, p}\right) = \mathcal{M}\left( {q \oplus p}\right), \quad \forall p \in {\mathcal{P}}_{\text{super}} \tag{8}
\]
where \( \mathcal{M}\left( {q \oplus p}\right) \) represents the response distribution of the LLM conditioned on the query \( q \) and its corresponding reasoning paths extracted from the knowledge graph.
Inspired by previous research on probability-based uncertainty estimation (Arora, Huang, and He 2021; Duan et al. 2024), we adopt entropy-based metrics to quantify the model's confidence in the retrieved knowledge:
\[
H\left( {P\left( {\text{ans} \mid \text{context}}\right) }\right) = - \frac{1}{\left| l\right| }\mathop{\sum }\limits_{{t = 1}}^{\left| l\right| }\mathop{\sum }\limits_{{i = 1}}^{k}{pr}_{i}^{\left( t\right) }{\log }_{2}{pr}_{i}^{\left( t\right) } \tag{9}
\]
where \( {pr}_{i}^{\left( t\right) } \) denotes the probability of the \( i \)-th of the top-\( k \) candidate tokens at position \( t \), and \( \left| l\right| \) denotes the token length of the answer. Accordingly, we obtain \( H\left( {{P}_{\text{param}}\left( {\text{ans} \mid q}\right) }\right) \) for parametric generation and \( H\left( {{P}_{\text{aug}}\left( {\text{ans} \mid q, p}\right) }\right) \) for retrieval-augmented generation with an individual reasoning path \( p \). Consequently, we can utilize the entropy variation under different reasoning paths as a characteristic indicator of knowledge conflict:
\[
\Delta {H}_{p} = H\left( {{P}_{\text{ aug }}\left( {\text{ ans } \mid q, p}\right) }\right) - H\left( {{P}_{\text{ param }}\left( {\text{ ans } \mid q}\right) }\right) \tag{10}
\]
where positive values of \( \Delta {H}_{p} \) indicate that the retrieved external knowledge intensifies uncertainty in the LLM's reasoning, signaling potential factual inconsistencies with its parametric knowledge, whereas negative values suggest that the retrieved knowledge aligns with the LLM's internal understanding, thereby reducing uncertainty. Reasoning paths whose entropy change exceeds a predefined threshold \( \tau \) are classified as \( {\mathcal{P}}_{\text{corrective}} \):
\[
{\mathcal{P}}_{\text{corrective}} = \left\{ {p \in {\mathcal{P}}_{\text{super}} : \Delta {H}_{p} > \tau }\right\} \tag{11}
\]
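Eqs. (9)-(11) can be sketched as follows (illustrative Python; the helper names and the layout of the top-\( k \) token probabilities are our own assumptions, not the authors' implementation):

```python
from math import log2

def answer_entropy(topk_probs_per_token):
    """Eq. (9): mean Shannon entropy (bits) over the answer's token positions.

    topk_probs_per_token: one list of top-k candidate-token probabilities
    per generated token position (length |l|).
    """
    total = -sum(p * log2(p)
                 for probs in topk_probs_per_token
                 for p in probs if p > 0)
    return total / len(topk_probs_per_token)

def corrective_paths(H_param, H_aug_by_path, tau):
    """Eqs. (10)-(11): a path is corrective when conditioning on it raises
    answer entropy by more than tau over the parametric baseline."""
    return [p for p, h_aug in H_aug_by_path.items() if h_aug - H_param > tau]
```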
These identified corrective knowledge paths, which effectively challenge and potentially rectify the LLM's internal misconceptions, are subsequently aggregated to construct the refined contextual input. The final response is then generated by the LLM based on the enriched context:
\[
\text{Response} = \mathcal{M}\left( {q \oplus {\mathcal{P}}_{\text{corrective}}}\right) \tag{12}
\]
This entropy-based conflict resolution mechanism ensures that LLMs consistently prioritize factually accurate external information when generating responses, improving reasoning accuracy and trustworthiness, thereby enhancing the overall robustness of the RAG system.
## Experiments
In this section, we present comprehensive experiments to evaluate the effectiveness of TruthfulRAG in resolving knowledge conflicts and enhancing the reliability of RAG systems. Specifically, we aim to address the following research questions: (1) How does TruthfulRAG perform compared to other methods in terms of factual accuracy? (2) What is the performance of TruthfulRAG in non-conflicting contexts? (3) To what extent do structured reasoning paths affect the confidence of LLMs compared to raw natural language context? (4) What are the individual contributions of each module within the TruthfulRAG framework?
## Experimental Setup
Datasets We conduct experiments on four datasets that encompass various knowledge-intensive tasks and conflict scenarios. FaithEval (Ming et al. 2025) is designed to assess whether LLMs remain faithful to unanswerable, inconsistent, or counterfactual contexts involving complex logical-level conflicts beyond the entity level. MuSiQue (Trivedi et al. 2022) and SQuAD (Rajpurkar et al. 2016) are drawn from the prior study KRE (Ying et al. 2024) and contain fact-level knowledge conflicts that necessitate compositional multi-hop reasoning, making them particularly suitable for evaluating knowledge integration and conflict resolution in complex reasoning scenarios. RealtimeQA (Kasai et al. 2023) focuses on temporal conflicts, where answers may quickly become outdated, leading to inconsistencies between static parametric knowledge and dynamic external sources.
Evaluated Models We select three representative LLMs across different architectures and model scales to ensure comprehensive evaluations: GPT-4o-mini (Achiam et al. 2023), Qwen2.5-7B-Instruct (Yang et al. 2025), and Mistral-7B-Instruct (Jiang et al. 2024). This selection encompasses both open-source and closed-source models, ensuring that TruthfulRAG is broadly applicable to RAG systems built upon diverse LLM backbones.
Baselines We compare TruthfulRAG against five baseline approaches spanning different methodological categories: (i) Direct Generation requires LLMs to generate responses solely based on their parametric knowledge without any external retrieval. (ii) Standard RAG represents the conventional retrieval-augmented generation paradigm, where LLMs generate responses using retrieved textual passages directly. (iii) KRE (Ying et al. 2024) serves as a representative prompt optimization method, which enhances reasoning faithfulness by adopting specialized prompting strategies to guide the model in resolving knowledge conflicts. (iv) COIECD (Yuan et al. 2024) represents the decoding manipulation category, which modifies the model's decoding strategy during the inference stage to guide LLMs toward greater reliance on retrieved context rather than parametric knowledge. (v) FaithfulRAG (Zhang et al. 2025) incorporates a self-reflection mechanism that identifies factual discrepancies between parametric knowledge and retrieved context, enabling LLMs to reason and integrate conflicting facts before generating content.
Evaluation Metrics Following prior studies, we adopt accuracy (ACC) as the primary evaluation metric, measuring the proportion of questions for which the LLM generates correct answers, thereby providing a direct assessment of the factual correctness of the generated responses. To evaluate the method's capability to precisely extract information pertinent to the target answer from retrieved corpora, we introduce the Context Precision Ratio (CPR) metric, which measures the proportion of answer-related content within the processed context:
\[
\mathrm{{CPR}} = \frac{\left| {\mathcal{A}}_{\text{ gold }} \cap {\mathcal{C}}_{\text{ processed }}\right| }{\left| {\mathcal{C}}_{\text{ processed }}\right| } \tag{13}
\]
where \( \left| {{\mathcal{A}}_{\text{gold}} \cap {\mathcal{C}}_{\text{processed}}}\right| \) denotes the length of the context segments directly related to the correct answer, and \( \left| {\mathcal{C}}_{\text{processed}}\right| \) represents the total length of the processed context.
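Under one plausible reading of Eq. (13), with lengths measured in characters, the metric could be computed as below (a hypothetical sketch; the paper does not specify its exact segmentation or length normalization, and the function name is our own):

```python
def context_precision_ratio(gold_segments, processed_context):
    """Eq. (13), read as length overlap: the share of the processed context
    occupied by segments known to relate to the gold answer."""
    if not processed_context:
        return 0.0
    covered = sum(len(seg) for seg in gold_segments if seg in processed_context)
    return covered / len(processed_context)
```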
<table><tr><td rowspan="2">Method</td><td rowspan="2">LLM</td><td colspan="4">Dataset</td><td rowspan="2">Avg.</td><td rowspan="2">Imp.</td></tr><tr><td>FaithEval</td><td>MuSiQue</td><td>RealtimeQA</td><td>SQuAD</td></tr><tr><td rowspan="3">w/o RAG</td><td>GPT-4o-mini</td><td>4.6</td><td>15.1</td><td>43.4</td><td>11.2</td><td>18.6</td><td>-</td></tr><tr><td>Qwen2.5-7B-Instruct</td><td>4.2</td><td>19.6</td><td>40.7</td><td>11.1</td><td>18.9</td><td>-</td></tr><tr><td>Mistral-7B-Instruct</td><td>6.3</td><td>13.8</td><td>29.2</td><td>11.5</td><td>15.2</td><td>-</td></tr><tr><td rowspan="3">w/ RAG</td><td>GPT-4o-mini</td><td>61.3</td><td>72.6</td><td>67.3</td><td>73.1</td><td>68.6</td><td>50.0</td></tr><tr><td>Qwen2.5-7B-Instruct</td><td>53.1</td><td>75.2</td><td>78.7</td><td>68.3</td><td>68.8</td><td>49.9</td></tr><tr><td>Mistral-7B-Instruct</td><td>61.9</td><td>67.6</td><td>52.2</td><td>67.2</td><td>62.2</td><td>47.0</td></tr><tr><td rowspan="3">KRE</td><td>GPT-4o-mini</td><td>50.7</td><td>34.6</td><td>47.5</td><td>65.3</td><td>49.5</td><td>30.9</td></tr><tr><td>Qwen2.5-7B-Instruct</td><td>59.6</td><td>70.7</td><td>86.7</td><td>73.7</td><td>72.7</td><td>53.8</td></tr><tr><td>Mistral-7B-Instruct</td><td>73.2</td><td>50.6</td><td>76.9</td><td>74.6</td><td>68.8</td><td>53.6</td></tr><tr><td rowspan="3">COIECD</td><td>GPT-4o-mini</td><td>53.9</td><td>56.4</td><td>48.7</td><td>57.6</td><td>54.2</td><td>35.6</td></tr><tr><td>Qwen2.5-7B-Instruct</td><td>62.3</td><td>69.7</td><td>78.8</td><td>70.8</td><td>70.4</td><td>51.5</td></tr><tr><td>Mistral-7B-Instruct</td><td>62.8</td><td>66.8</td><td>58.4</td><td>65.4</td><td>63.3</td><td>48.1</td></tr><tr><td rowspan="3">FaithfulRAG</td><td>GPT-4o-mini</td><td>67.2</td><td>79.3</td><td>78.8</td><td>80.8</td><td>76.5</td><td>58.0</td></tr><tr><td>Qwen2.5-7B-Instruct</td><td>71.8</td><td>78.0</td><td>84.1</td><td>78.3</td><td>78.1</td><td>59.1</td></tr><tr><td>Mistral-7B-Instruct</td><td>81.7</td><td>78.5</td><td>77.0</td><td>85.7</td><td>80.7</td><td>65.5</td></tr><tr><td rowspan="3">TruthfulRAG (Ours)</td><td>GPT-4o-mini</td><td>69.5</td><td>79.4</td><td>85.0</td><td>81.1</td><td>78.8</td><td>60.2</td></tr><tr><td>Qwen2.5-7B-Instruct</td><td>73.2</td><td>79.1</td><td>82.3</td><td>78.7</td><td>78.3</td><td>59.4</td></tr><tr><td>Mistral-7B-Instruct</td><td>81.9</td><td>79.3</td><td>81.4</td><td>82.7</td><td>81.3</td><td>66.1</td></tr></table>
Table 1: Comparison of ACC between TruthfulRAG and five baselines across four datasets with three representative LLMs. The best result for each backbone LLM within each dataset is highlighted in bold, and the second best is emphasized with an underline. Avg. denotes the arithmetic mean accuracy across the four datasets, while Imp. indicates the average improvement over the corresponding LLM's w/o RAG baseline.
Implementation Details For dense retrieval, cosine similarity is computed using embeddings generated by the all-MiniLM-L6-v2 model. For entropy-based filtering, we set model-specific thresholds \( \tau \) for the entropy variation \( \Delta {H}_{p} \): GPT-4o-mini and Mistral-7B-Instruct use \( \tau = 1 \), while Qwen2.5-7B-Instruct adopts a higher threshold of \( \tau = 3 \). All experiments are conducted on NVIDIA V100 GPUs with 32GB memory. To ensure reproducibility, the temperature for text generation is set to 0, and all Top-\( K \) values are set to 10.
## Results and Analysis
Overall Performance Table 1 presents a comprehensive comparison of TruthfulRAG against five baseline methods across four datasets, evaluating performance in terms of factual accuracy (ACC) using three representative LLMs. To facilitate overall assessment, we additionally report Avg., the arithmetic mean accuracy across the four datasets, and Imp., the average improvement over the corresponding LLM's w/o RAG baseline, serving as a proxy for the number of factual conflicts successfully corrected by the method from the LLM's parametric knowledge.
The results clearly demonstrate that TruthfulRAG consistently achieves superior or competitive performance relative to all baseline approaches. Specifically, it achieves the highest accuracy on FaithEval (81.9%), MuSiQue (79.4%), and RealtimeQA (85.0%), and ranks first or second on SQuAD across all models. Notably, TruthfulRAG achieves the highest overall performance across all backbone LLMs, attaining both the best average accuracy (Avg.) and the greatest relative improvement (Imp.) compared to all baseline methods. This clearly illustrates its robustness in mitigating factual inconsistencies that standard RAG systems struggle with due to unresolved evidence conflicts.
Compared to standard RAG systems, which exhibit significant variability in accuracy due to unresolved knowledge conflicts, TruthfulRAG achieves improvements ranging from 3.6% to 29.2%, highlighting its robustness in mitigating factual inconsistencies. Furthermore, while methods like FaithfulRAG and KRE offer partial gains through semantic alignment or prompt-based mechanisms, they fall short in consistently resolving fine-grained factual discrepancies. In contrast, TruthfulRAG integrates knowledge graph-based reasoning with entropy-guided conflict filtering mechanisms to identify and resolve contradictory information, thereby substantially enhancing factual reliability. These findings validate the effectiveness of TruthfulRAG in delivering accurate, faithful, and contextually grounded responses across diverse knowledge-intensive tasks.
Performance on Non-Conflicting Contexts To evaluate the robustness of TruthfulRAG in scenarios where retrieved contexts are free from factual conflicts, we conduct experiments on gold-standard datasets in which the retrieved passages are guaranteed to be non-contradictory.
As shown in Table 2, TruthfulRAG consistently outperforms all baseline methods across both the MuSiQue-golden and SQuAD-golden datasets. These findings substantiate that TruthfulRAG not only excels at resolving conflicting information but also maintains superior performance in non-conflicting contexts, thereby revealing its universal applicability and effectiveness. The consistent performance improvements can be attributed to the structured knowledge representation provided by the knowledge graph module, which enables the identification of fine-grained entities and relational links in non-conflicting contexts. This capability facilitates the extraction of query-relevant information and promotes a more comprehensive understanding and integration of factual knowledge by the LLMs. Notably, while methods such as KRE exhibit significant performance degradation in non-conflicting scenarios, TruthfulRAG maintains its robustness across diverse contextual settings. This consistency highlights its practical utility and reliability for real-world RAG applications.
<table><tr><td rowspan="2">Dataset</td><td colspan="6">Method</td></tr><tr><td>w/o RAG</td><td>w/ RAG</td><td>KRE</td><td>COIECD</td><td>FaithfulRAG</td><td>TruthfulRAG (Ours)</td></tr><tr><td>MuSiQue-golden</td><td>45.6</td><td>89.9</td><td>44.1(-45.8)</td><td>89.5(-0.4)</td><td>91.8(+1.9)</td><td>93.2 (+3.3)</td></tr><tr><td>SQuAD-golden</td><td>68.7</td><td>97.9</td><td>83.2(-14.7)</td><td>97.1(-0.8)</td><td>98.1(+0.2)</td><td>98.3 (+0.4)</td></tr></table>
Table 2: Performance comparison on non-conflicting contexts with GPT-4o-mini as the backbone LLM. The best result on each dataset is highlighted in bold. The numbers in parentheses indicate the change in accuracy compared to the standard RAG.
![bo_d6nbbd4601uc73e2hqsg_5_169_471_1470_348_0.jpg](images/bo_d6nbbd4601uc73e2hqsg_5_169_471_1470_348_0.jpg)
Figure 3: Comparison of LLM confidence, measured by negative log-probability (logprob) values using GPT-4o-mini, when reasoning with natural language contexts versus structured reasoning paths across four datasets. Lower negative logprob values indicate higher actual log-probability scores and thus increased model confidence in generating correct answers.
Impact of Structured Reasoning Paths To investigate the impact of structured reasoning paths on the confidence of LLMs relative to raw natural language context, we conduct a comprehensive analysis across four datasets. Specifically, we compare the model's confidence when reasoning with retrieved knowledge presented in natural language format or as structured reasoning paths derived through our knowledge graph construction mechanism. To quantify the model's confidence in its predicted answers, we measure the log-probability of the correct answer tokens generated by LLMs and compute the average across all test instances.
As shown in Figure 3, our experimental results reveal a consistent pattern across all evaluated datasets. Structured reasoning paths consistently lead to higher logprob values for correct answers compared to natural language contexts, indicating greater model confidence when reasoning with structured knowledge representations. This empirical evidence demonstrates that transforming unstructured natural language into structured reasoning paths through knowledge graphs significantly strengthens the LLM's confidence in following external retrieved knowledge for inference. Furthermore, this finding provides crucial insights into the superior performance of TruthfulRAG in both conflicting and non-conflicting semantic scenarios, as the enhanced confidence facilitates more reliable adherence to external knowledge sources, thereby supporting factual consistency and promoting the generation of faithful model outputs.
Ablation Study To comprehensively evaluate the contribution of each component in TruthfulRAG, we conduct systematic ablation experiments by removing key modules from the full framework. Since knowledge graph construction and retrieval are two closely coupled modules, we combine them as an integrated component for ablation evaluation.
As shown in Table 3, the complete TruthfulRAG framework achieves superior performance across all datasets, with accuracy improvements ranging from 6.8% to 17.7% compared to the standard RAG, demonstrating that the structured knowledge graph and the conflict resolution mechanism function synergistically to enhance both factual accuracy and contextual precision. The ablation results reveal several critical insights. First, when employing only the filtering mechanism without knowledge graph integration (w/o Knowledge Graph), although accuracy demonstrates modest improvements, CPR exhibits a notable decline across most datasets, particularly in MuSiQue (1.86 to 1.15) and SQuAD (2.71 to 1.97). This phenomenon indicates that LLMs encounter substantial difficulties in effectively extracting relevant information from naturally organized contexts, thereby constraining their ability to achieve higher accuracy. In contrast, when utilizing solely the knowledge graph component without conflict resolution (w/o Conflict Resolution), CPR achieves significant improvements, yet the introduction of extensive structured knowledge simultaneously introduces redundant information, resulting in limited improvements in accuracy across most datasets. These findings support our hypothesis that structured knowledge representations facilitate the precise localization of query-relevant information, enabling more targeted and effective information extraction compared to unstructured contexts.
<table><tr><td rowspan="2">Method</td><td colspan="4">Dataset</td></tr><tr><td>FaithEval</td><td>MuSiQue</td><td>RealtimeQA</td><td>SQuAD</td></tr><tr><td>Standard RAG</td><td>61.3 / 0.51</td><td>72.6 / 1.86</td><td>67.3 / 0.47</td><td>73.1 / 2.71</td></tr><tr><td>w/o Knowledge Graph</td><td>64.8 / 0.52</td><td>78.9 / 1.15</td><td>83.2 / 0.23</td><td>78.8 / 1.97</td></tr><tr><td>w/o Conflict Resolution</td><td>69.3 / 0.59</td><td>77.8 / 2.79</td><td>84.1 / 1.80</td><td>78.2 / 2.85</td></tr><tr><td>Full Method</td><td>69.5 / 0.56</td><td>79.4 / 2.25</td><td>85.0 / 1.54</td><td>81.1 / 2.56</td></tr></table>
Table 3: Ablation study results of different components in TruthfulRAG with GPT-4o-mini as the backbone LLM. The results are presented in the format ACC / CPR, where ACC denotes accuracy and CPR represents Context Precision Ratio.
## Related Work
This section reviews existing research on knowledge conflicts in RAG systems, categorizing the literature into two main areas: impact analysis and resolution strategies.
## Impact Analysis of Knowledge Conflicts
Recent studies have extensively explored the influence of knowledge conflicts on the performance of RAG systems (Longpre et al. 2021; Chen, Zhang, and Choi 2022; Xie et al. 2023; Tan et al. 2024; Ming et al. 2025), which primarily highlight differential preferences between the parametric knowledge and retrieved external information. Longpre et al. (Longpre et al. 2021) first expose entity-based knowledge conflicts in question answering, revealing that LLMs tend to rely on parametric memory when retrieved passages are perturbed or contain contradictory information. Chen et al. (Chen, Zhang, and Choi 2022) demonstrate that while retrieval-based LLMs predominantly depend on nonparametric evidence when recall is high, their confidence scores fail to reflect inconsistencies among retrieved documents. Xie et al. (Xie et al. 2023) find that LLMs are receptive to single external evidence, yet exhibit strong confirmation bias when presented with both supporting and conflicting information. Tan et al. (Tan et al. 2024) reveal a systematic bias toward self-generated contexts over retrieved ones, attributing this to the higher query-context similarity and semantic incompleteness of retrieved snippets.
Our work aligns with the non-parametric knowledge preference paradigm, aiming to guide LLMs to follow updated and comprehensive external knowledge while correcting for temporal and factual errors within internal memory, thereby generating accurate and trustworthy outputs.
## Solutions to Knowledge Conflicts
Current approaches for knowledge conflict resolution can be categorized into token-level and semantic-level methods (Jin et al. 2024; Wang et al. 2024; Bi et al. 2025; Zhang et al. 2025; Wang et al. 2025). Token-level approaches focus on fine-grained intervention during generation. \( {\mathrm{CD}}^{2} \) (Jin et al. 2024) employs attention weight manipulation to suppress parametric knowledge when conflicts are detected. ASTUTE RAG (Wang et al. 2024) utilizes gradient-based attribution to identify and mask conflicting tokens during inference. These methods achieve precise control, but often suffer from computational overhead and lack semantic awareness among generated contents. Semantic-level approaches operate at higher abstraction levels. CK-PLUG (Bi et al. 2025) develops parameter-efficient conflict resolution through adapter-based architectures that learn to weight parametric versus non-parametric knowledge dynamically. FaithfulRAG (Zhang et al. 2025) externalizes LLMs' parametric knowledge and aligns it with retrieved context, thereby achieving higher faithfulness without sacrificing accuracy. However, these methods primarily address surface-level conflicts without capturing the underlying factual relationships that drive knowledge inconsistencies.
Different from these approaches, TruthfulRAG leverages structured triple-based knowledge representations to precisely identify and resolve factual-level knowledge conflicts arising from complex natural language expressions, thereby ensuring the reliability and consistency of reasoning.
## Conclusion
In this paper, we introduce TruthfulRAG, the first framework that leverages knowledge graphs to address factual-level conflicts in RAG systems. By integrating systematic triple extraction, query-aware graph retrieval, and entropy-based filtering mechanisms, TruthfulRAG transforms unstructured retrieved contexts into structured reasoning paths that enhance LLMs' confidence in external knowledge while effectively mitigating factual inconsistencies. Our comprehensive experiments demonstrate that TruthfulRAG consistently outperforms existing SOTA methods. These results establish TruthfulRAG as a robust and generalizable solution for improving the trustworthiness and accuracy of RAG systems, with significant implications for knowledge-intensive applications requiring high reliability and precision.