
TruthfulRAG: Resolving Factual-level Conflicts in Retrieval-Augmented Generation with Knowledge Graphs

Shuyi Liu, Yuming Shang, Xi Zhang*

Key Laboratory of Trustworthy Distributed Computing and Service (MoE)

Beijing University of Posts and Telecommunications, China

{liushuyi111, shangym, zhangx}@bupt.edu.cn

Abstract

Retrieval-Augmented Generation (RAG) has emerged as a powerful framework for enhancing the capabilities of Large Language Models (LLMs) by integrating retrieval-based methods with generative models. As external knowledge repositories continue to expand and the parametric knowledge within models becomes outdated, a critical challenge for RAG systems is resolving conflicts between retrieved external information and LLMs' internal knowledge, which can significantly compromise the accuracy and reliability of generated content. However, existing approaches to conflict resolution typically operate at the token or semantic level, often leading to fragmented and partial understanding of factual discrepancies between LLMs' knowledge and context, particularly in knowledge-intensive tasks. To address this limitation, we propose TruthfulRAG, the first framework that leverages Knowledge Graphs (KGs) to resolve factual-level knowledge conflicts in RAG systems. Specifically, TruthfulRAG constructs KGs by systematically extracting triples from retrieved content, utilizes query-based graph retrieval to identify relevant knowledge, and employs entropy-based filtering mechanisms to precisely locate conflicting elements and mitigate factual inconsistencies, thereby enabling LLMs to generate faithful and accurate responses. Extensive experiments reveal that TruthfulRAG outperforms existing methods, effectively alleviating knowledge conflicts and improving the robustness and trustworthiness of RAG systems.

Introduction

Large Language Models (LLMs) have demonstrated impressive performance across diverse natural language understanding and generation tasks (Achiam et al. 2023; Touvron et al. 2023; Yang et al. 2025). Despite their proficiency, LLMs remain ineffective in handling specialized, privacy-sensitive, or time-sensitive knowledge that is not encompassed within their training corpora (Zhang et al. 2024; Huang et al. 2025). To address this, Retrieval-Augmented Generation (RAG) has emerged as a promising paradigm that enhances the relevance and factuality of generated responses by integrating external knowledge retrieval with the remarkable generative capabilities of LLMs (Lewis et al. 2020; Gao et al. 2023; Fan et al. 2024). However, as RAG systems continuously update their knowledge repositories, the temporal disparity between dynamic external sources and static parametric knowledge within LLMs inevitably leads to knowledge conflicts (Xie et al. 2023; Xu et al. 2024; Shi et al. 2024), which can significantly undermine the accuracy and reliability of the generated content.


Figure 1: The illustration of knowledge conflicts and the differences between existing solutions and TruthfulRAG.

Recent research has begun to investigate the impact of knowledge conflicts on the performance of RAG systems (Chen, Zhang, and Choi 2022; Xie et al. 2023; Tan et al. 2024) and explore methods to mitigate such conflicts (Wang et al. 2024; Jin et al. 2024; Zhang et al. 2025; Bi et al. 2025). Existing resolution approaches can be categorized into two methodological types: (i) token-level methods, which manage LLMs' preference between internal and external knowledge by adjusting the probability distribution over the output tokens (Jin et al. 2024; Bi et al. 2025); (ii) semantic-level methods, which resolve conflicts by semantically integrating and aligning knowledge segments from internal and external sources (Wang et al. 2024; Zhang et al. 2025). However, these token-level or semantic-level conflict resolution methods generally employ coarse-grained strategies that rely on fragmented data representations, resulting in insufficient contextual awareness. This may prevent LLMs from accurately capturing complex interdependencies and fine-grained factual inconsistencies, especially in knowledge-intensive conflict scenarios (Han et al. 2024).


*Corresponding author.

Copyright © 2026, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.


To address the above limitations, we propose TruthfulRAG, the first framework that leverages Knowledge Graphs (KGs) to resolve factual-level conflicts in RAG systems. As illustrated in Figure 1, unlike previous studies, TruthfulRAG uses structured triple-based knowledge representations to construct reliable contexts, thereby enhancing the confidence of LLMs in external knowledge and facilitating trustworthy reasoning. The TruthfulRAG framework comprises three key modules: (a) Graph Construction, which derives structured triples from retrieved external knowledge by identifying entities, relations, and attributes to construct knowledge graphs; (b) Graph Retrieval, which conducts query-based retrieval algorithms to obtain relevant knowledge that exhibits strong factual associations with the input query; and (c) Conflict Resolution, which applies entropy-based filtering techniques to locate conflicting elements and mitigate factual inconsistencies, ultimately forming more reliable reasoning paths and promoting more accurate outputs. This framework integrates seamlessly with existing RAG architectures, enabling the extraction of highly relevant and factually consistent knowledge, effectively eliminating factual-level conflicts and improving generation reliability.

The contributions of this paper are as follows:

  • We discover that constructing contexts through textual representations on structured triples can enhance the confidence of LLMs in external knowledge, thereby promoting trustworthy and reliable model reasoning.

  • We introduce TruthfulRAG, the first framework that leverages knowledge graphs to resolve factual-level conflicts in RAG systems through systematic triple extraction, query-based graph retrieval, and entropy-based filtering mechanisms.

  • We conduct extensive experiments demonstrating that TruthfulRAG outperforms existing methods in mitigating knowledge conflicts while improving the robustness and trustworthiness of RAG systems.

Methodology

In this section, we provide a detailed introduction to the TruthfulRAG framework. As illustrated in Figure 2, TruthfulRAG comprises three interconnected modules: (i) Graph Construction, which transforms unstructured retrieved content into structured knowledge graphs through systematic triple extraction; (ii) Graph Retrieval, which employs query-aware graph traversal algorithms to identify semantically relevant reasoning paths; and (iii) Conflict Resolution, which utilizes entropy-based filtering mechanisms to detect and mitigate factual inconsistencies between parametric and external knowledge.

Graph Construction

The construction of a knowledge graph begins with the conversion of raw information retrieved from the RAG system into structured knowledge representations through systematic entity-relation-attribute extraction.

Given the retrieved content $C$ for the user's query $q$, we first perform fine-grained semantic segmentation to partition the content into coherent textual segments $\mathcal{S} = \{s_1, s_2, \ldots, s_m\}$, where each segment $s_i$ represents a semantically coherent unit containing factual information. For each textual segment $s_i \in \mathcal{S}$, we employ the generative model $\mathcal{M}$ from the RAG system to extract a set of structured knowledge triples $\mathcal{T}_i = \{\mathcal{T}_{i,1}, \mathcal{T}_{i,2}, \ldots, \mathcal{T}_{i,n}\}$, with each triple $\mathcal{T}_{i,j} = (h, r, t)$ consisting of a head entity $h$, a relation $r$, and a tail entity $t$. This extraction process aims to capture both explicit factual statements and implicit semantic relationships embedded within the original content, thereby ensuring the comprehensiveness and semantic integrity of the knowledge representation.
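The paper does not specify the extraction prompt; as a sketch, the model $\mathcal{M}$ might be prompted per segment to emit JSON triples, which are then parsed defensively. `EXTRACTION_PROMPT` and `parse_triples` are hypothetical names, not the authors' implementation:

```python
import json
import re

# Hypothetical prompt template asking the generative model for triples.
EXTRACTION_PROMPT = """Extract knowledge triples from the passage below.
Return a JSON list of [head, relation, tail] triples.

Passage: {segment}"""

def parse_triples(llm_output):
    """Parse the model's JSON reply into (head, relation, tail) tuples,
    tolerating surrounding prose in the reply."""
    match = re.search(r"\[.*\]", llm_output, re.DOTALL)
    if not match:
        return []
    return [tuple(t) for t in json.loads(match.group(0)) if len(t) == 3]

# A reply the model might produce for one segment.
reply = 'Here are the triples: [["Paris", "capital_of", "France"]]'
triples = parse_triples(reply)
```

In practice a retry or repair step would guard against malformed JSON; the regex above only strips leading and trailing prose.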

The aggregated triple set from all retrieved content forms the foundation for constructing the knowledge graph $\mathcal{G}$:

$$\mathcal{G} = (\mathcal{E}, \mathcal{R}, \mathcal{T}_{\text{all}}) \tag{1}$$

where $\mathcal{E} = \bigcup_{i,j} \{h_{i,j}, t_{i,j}\}$ represents the entity set, $\mathcal{R} = \bigcup_{i,j} \{r_{i,j}\}$ denotes the relation set, and $\mathcal{T}_{\text{all}} = \bigcup_{i,j} \mathcal{T}_{i,j}$ constitutes the complete triple repository. This structured knowledge representation enables the filtering of low-information noise and captures detailed factual associations, thereby providing a clear and semantically enriched foundation for subsequent query-aware knowledge retrieval.
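Assuming the per-segment triples have already been extracted, assembling $\mathcal{G}$ of Eq. (1) reduces to set unions, plus an adjacency index that later traversal can use (a minimal sketch; names are illustrative):

```python
from collections import defaultdict

def build_graph(segment_triples):
    """Eq. (1): build G = (E, R, T_all) from per-segment (head, relation,
    tail) triples, plus an adjacency index for graph traversal."""
    entities, relations, triples = set(), set(), set()
    adjacency = defaultdict(list)  # head entity -> [(relation, tail), ...]
    for seg in segment_triples:
        for h, r, t in seg:
            entities.update((h, t))
            relations.add(r)
            triples.add((h, r, t))
            adjacency[h].append((r, t))
    return entities, relations, triples, adjacency

# Toy triples that might come from two retrieved segments.
segments = [
    [("Paris", "capital_of", "France")],
    [("France", "member_of", "EU"), ("Paris", "located_in", "France")],
]
E, R, T_all, adj = build_graph(segments)
```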

Graph Retrieval

To acquire knowledge that is strongly aligned with user queries at the factual level, we design a query-aware graph traversal algorithm that can identify critical knowledge paths within the graph, ensuring both semantic relevance and factual consistency in the retrieval process.

Initially, key elements are extracted from the user query $q$ to serve as important references for matching components in the knowledge graph. These elements include the query's target entities, relations, and intent categories, denoted as $\mathcal{K}_q$. Subsequently, semantic similarity matching is employed to identify the top-$k$ most relevant entities and relations within the knowledge graph:

$$\mathcal{E}_{\text{imp}} = \operatorname{TopK}\big(\{\operatorname{sim}(e, \mathcal{K}_q) : e \in \mathcal{E}\},\, k\big) \tag{2}$$

$$\mathcal{R}_{\text{imp}} = \operatorname{TopK}\big(\{\operatorname{sim}(r, \mathcal{K}_q) : r \in \mathcal{R}\},\, k\big) \tag{3}$$

where $\operatorname{sim}(\cdot, \cdot)$ represents the semantic similarity function computed using dense embeddings, $\mathcal{E}_{\text{imp}}$ denotes the set of key entities, and $\mathcal{R}_{\text{imp}}$ represents the set of key relations. From each key entity $e \in \mathcal{E}_{\text{imp}}$, we perform a two-hop graph traversal to systematically collect the entire set of possible initial reasoning paths $\mathcal{P}_{\text{init}}$.
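The top-$k$ matching of Eqs. (2)-(3) and the two-hop traversal can be sketched as follows, with a toy character-level Jaccard similarity standing in for the dense-embedding $\operatorname{sim}(\cdot,\cdot)$ (an assumption for self-containment; the paper uses all-MiniLM-L6-v2 embeddings):

```python
def jaccard(a, b):
    """Toy character-level similarity; a stand-in for dense-embedding sim."""
    sa, sb = set(a.lower()), set(b.lower())
    return len(sa & sb) / len(sa | sb)

def top_k_matches(candidates, query_keys, sim, k):
    """Eqs. (2)-(3): rank graph elements by similarity to the query's
    key elements K_q and keep the top-k."""
    return sorted(candidates,
                  key=lambda c: max(sim(c, q) for q in query_keys),
                  reverse=True)[:k]

def two_hop_paths(adjacency, start):
    """Collect one- and two-hop reasoning paths P_init from a key entity."""
    paths = []
    for r1, e2 in adjacency.get(start, []):
        paths.append([(start, r1, e2)])
        for r2, e3 in adjacency.get(e2, []):
            paths.append([(start, r1, e2), (e2, r2, e3)])
    return paths

relations = ["capital_of", "located_in", "member_of"]
r_imp = top_k_matches(relations, ["capital"], jaccard, k=1)

adjacency = {"Paris": [("capital_of", "France")],
             "France": [("member_of", "EU")]}
p_init = two_hop_paths(adjacency, "Paris")
```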

To further filter reasoning paths with stronger factual associations, we introduce a fact-aware scoring mechanism that evaluates the relevance of paths to the query based on the coverage of key entities and relations within each path p:

$$\operatorname{Ref}(p) = \alpha \cdot \frac{|\{e \in p\} \cap \mathcal{E}_{\text{imp}}|}{|\mathcal{E}_{\text{imp}}|} + \beta \cdot \frac{|\{r \in p\} \cap \mathcal{R}_{\text{imp}}|}{|\mathcal{R}_{\text{imp}}|} \tag{4}$$

where $\alpha$ and $\beta$ are hyperparameters that control the relative importance of entity and relation coverage, respectively. The top-scored reasoning paths from $\mathcal{P}_{\text{init}}$ constitute the core knowledge paths $\mathcal{P}_{\text{super}}$:

$$\mathcal{P}_{\text{super}} = \operatorname{TopK}\big(\{\operatorname{Ref}(p) : p \in \mathcal{P}_{\text{init}}\},\, K\big) \tag{5}$$
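A sketch of the scoring and selection in Eqs. (4)-(5), over paths represented as lists of (head, relation, tail) triples. The `alpha`/`beta` defaults here are illustrative, not the paper's tuned values:

```python
def ref_score(path, imp_entities, imp_relations, alpha=0.5, beta=0.5):
    """Eq. (4): weighted coverage of key entities and relations in a path."""
    ents = {e for h, _, t in path for e in (h, t)}
    rels = {r for _, r, _ in path}
    e_cov = len(ents & imp_entities) / len(imp_entities) if imp_entities else 0.0
    r_cov = len(rels & imp_relations) / len(imp_relations) if imp_relations else 0.0
    return alpha * e_cov + beta * r_cov

def select_super_paths(paths, imp_entities, imp_relations, K):
    """Eq. (5): keep the top-K initial paths by Ref score."""
    return sorted(paths,
                  key=lambda p: ref_score(p, imp_entities, imp_relations),
                  reverse=True)[:K]

short = [("Paris", "capital_of", "France")]
long = [("Paris", "capital_of", "France"), ("France", "member_of", "EU")]
p_super = select_super_paths([short, long], {"Paris", "EU"}, {"capital_of"}, K=1)
```

The longer path covers both key entities and the key relation, so it scores 1.0 and is selected over the shorter one.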


Figure 2: The overall pipeline of the TruthfulRAG framework. TruthfulRAG first extracts structured knowledge triples to construct a comprehensive knowledge graph. Subsequently, it employs query-aware graph traversal to identify salient reasoning paths, where each path comprises entities and relationships enriched with associated attributes. Finally, the framework applies entropy-based conflict resolution to detect and retain corrective paths that challenge parametric misconceptions, thereby alleviating knowledge conflicts between internal and external information and promoting consistent and credible responses.

In order to construct detailed contextual information, each core reasoning path $p \in \mathcal{P}_{\text{super}}$ is represented as a comprehensive contextual structure consisting of three essential components:

$$p = \mathcal{C}_{\text{path}} \oplus \mathcal{C}_{\text{entities}} \oplus \mathcal{C}_{\text{relations}} \tag{6}$$

where:

  • $\mathcal{C}_{\text{path}}$ represents the complete sequential reasoning path $e_1 \xrightarrow{r_1} e_2 \xrightarrow{r_2} \cdots \xrightarrow{r_{n-1}} e_n$, capturing the logical progression of entities connected through relational links.

  • $\mathcal{C}_{\text{entities}} = \{(e, \mathcal{A}_e) : e \in p \cap \mathcal{E}_{\text{imp}}\}$ encompasses all important entities within the path along with their corresponding attribute descriptions $\mathcal{A}_e$, providing thorough entity-specific information for the context.

  • $\mathcal{C}_{\text{relations}} = \{(r, \mathcal{A}_r) : r \in p \cap \mathcal{R}_{\text{imp}}\}$ includes all important relations on the path together with their corresponding attributes $\mathcal{A}_r$, enriching the semantic and contextual understanding of the relations.

This formalized representation of knowledge ensures that each extracted reasoning path preserves structural coherence through the entity-relation sequence and reinforces semantic richness via comprehensive attribute information, thereby facilitating more nuanced and context-aware knowledge integration for subsequent conflict resolution processes.
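The three-part context of Eq. (6) ultimately has to be rendered as text for the prompt; a minimal sketch with hypothetical attribute dictionaries (the paper does not fix a serialization format):

```python
def serialize_path(path, entity_attrs, relation_attrs):
    """Eq. (6): render a reasoning path as C_path + C_entities + C_relations."""
    # C_path: the sequential entity-relation chain.
    c_path = " -> ".join(f"{h} --[{r}]--> {t}" for h, r, t in path)
    ents = sorted({e for h, _, t in path for e in (h, t)})
    rels = sorted({r for _, r, _ in path})
    # C_entities / C_relations: attribute descriptions for path elements.
    c_entities = "; ".join(f"{e}: {entity_attrs[e]}" for e in ents if e in entity_attrs)
    c_relations = "; ".join(f"{r}: {relation_attrs[r]}" for r in rels if r in relation_attrs)
    return "\n".join(part for part in (c_path, c_entities, c_relations) if part)

path = [("Paris", "capital_of", "France")]
context = serialize_path(path,
                         {"Paris": "a city in Europe"},
                         {"capital_of": "seat of government"})
```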

Conflict Resolution

To address factual inconsistencies between parametric knowledge and external information, and to ensure that LLMs consistently follow the retrieved knowledge paths toward accurate reasoning, we employ entropy-based model confidence analysis to investigate the influence of conflicting knowledge on model prediction uncertainty, thereby systematically identifying and resolving factual conflicts through uncertainty quantification.

We implement conflict detection by comparing model performance under two distinct conditions: (1) pure parametric generation without access to external context, and (2) retrieval-augmented generation that incorporates structured reasoning paths constructed from the knowledge graph. For parametric generation, we calculate the response probability from the LLM as a baseline:

$$P_{\text{param}}(\text{ans} \mid q) = \mathcal{M}(q) \tag{7}$$

where ans represents the generated answer and $\mathcal{M}(q)$ denotes the response distribution of the LLM based solely on the query $q$. For retrieval-augmented generation, we incorporate each reasoning path from $\mathcal{P}_{\text{super}}$ as contextual information to obtain the model's output probability:

$$P_{\text{aug}}(\text{ans} \mid q, p) = \mathcal{M}(q \oplus p), \quad \forall p \in \mathcal{P}_{\text{super}} \tag{8}$$

where $\mathcal{M}(q \oplus p)$ represents the response distribution of the LLM conditioned on the query $q$ and the corresponding reasoning path $p$ extracted from the knowledge graph.

Inspired by previous research on probability-based uncertainty estimation (Arora, Huang, and He 2021; Duan et al. 2024), we adopt entropy-based metrics to quantify the model's confidence in the retrieved knowledge:

$$H\big(P(\text{ans} \mid \text{context})\big) = -\frac{1}{|l|}\sum_{t=1}^{|l|}\sum_{i=1}^{k} pr_i^{(t)} \log_2 pr_i^{(t)} \tag{9}$$

where $pr_i^{(t)}$ denotes the probability of the $i$-th of the top-$k$ candidate tokens at position $t$, and $|l|$ denotes the token length of the answer. Accordingly, we obtain $H(P_{\text{param}}(\text{ans} \mid q))$ for parametric generation and $H(P_{\text{aug}}(\text{ans} \mid q, p))$ for retrieval-augmented generation incorporating an individual reasoning path $p$. Consequently, we can utilize the entropy variation under different reasoning paths as a characteristic indicator of knowledge conflict:

$$\Delta H_p = H\big(P_{\text{aug}}(\text{ans} \mid q, p)\big) - H\big(P_{\text{param}}(\text{ans} \mid q)\big) \tag{10}$$

where positive values of $\Delta H_p$ indicate that the retrieved external knowledge intensifies uncertainty in the LLM's reasoning, potentially signaling factual inconsistencies with its parametric knowledge, whereas negative values suggest that the retrieved knowledge aligns with the LLM's internal understanding, thereby reducing uncertainty. Reasoning paths whose entropy changes exceed a predefined threshold $\tau$ are classified as $\mathcal{P}_{\text{corrective}}$:

$$\mathcal{P}_{\text{corrective}} = \{\, p \in \mathcal{P}_{\text{super}} : \Delta H_p > \tau \,\} \tag{11}$$

These identified corrective knowledge paths, which effectively challenge and potentially rectify the LLM's internal misconceptions, are subsequently aggregated to construct the refined contextual input. The final response is then generated by the LLM based on the enriched context:

$$\text{Response} = \mathcal{M}(q \oplus \mathcal{P}_{\text{corrective}}) \tag{12}$$

This entropy-based conflict resolution mechanism ensures that LLMs consistently prioritize factually accurate external information when generating responses, improving reasoning accuracy and trustworthiness, thereby enhancing the overall robustness of the RAG system.
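The entropy metric of Eq. (9) and the $\Delta H_p$ filter of Eqs. (10)-(11) can be sketched on toy top-$k$ token distributions (path names and probability values are illustrative, not real model outputs):

```python
import math

def mean_token_entropy(token_topk_probs):
    """Eq. (9): average base-2 Shannon entropy over the top-k candidate
    token distributions at each position of the generated answer."""
    total = sum(-sum(p * math.log2(p) for p in dist if p > 0)
                for dist in token_topk_probs)
    return total / len(token_topk_probs)

def corrective_paths(paths, h_param, h_aug, tau):
    """Eqs. (10)-(11): keep paths whose entropy increase over the
    parametric baseline exceeds the threshold tau."""
    return [p for p in paths if h_aug[p] - h_param > tau]

# Parametric answer (confident) vs. the answer conditioned on two
# hypothetical reasoning paths: one conflicting, one agreeing.
h_param = mean_token_entropy([[0.97, 0.01, 0.01, 0.01], [0.99, 0.01]])
h_aug = {
    "path_conflicting": mean_token_entropy([[0.4, 0.3, 0.2, 0.1],
                                            [0.25, 0.25, 0.25, 0.25]]),
    "path_agreeing": mean_token_entropy([[0.98, 0.01, 0.01], [0.99, 0.01]]),
}
selected = corrective_paths(h_aug.keys(), h_param, h_aug, tau=1.0)
```

The conflicting path flattens the token distributions and raises entropy well past $\tau$, so only it is retained as corrective context.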

Experiments

In this section, we present comprehensive experiments to evaluate the effectiveness of TruthfulRAG in resolving knowledge conflicts and enhancing the reliability of RAG systems. Specifically, we aim to address the following research questions: (1) How does TruthfulRAG perform compared to other methods in terms of factual accuracy? (2) What is the performance of TruthfulRAG in non-conflicting contexts? (3) To what extent do structured reasoning paths affect the confidence of LLMs compared to raw natural language context? (4) What are the individual contributions of each module within the TruthfulRAG framework?

Experimental Setup

Datasets We conduct experiments on four datasets that encompass various knowledge-intensive tasks and conflict scenarios. FaithEval (Ming et al. 2025) is designed to assess whether LLMs remain faithful to unanswerable, inconsistent, or counterfactual contexts involving complex logical-level conflicts beyond the entity level. MuSiQue (Trivedi et al. 2022) and SQuAD (Rajpurkar et al. 2016) come from the prior work KRE (Ying et al. 2024) and contain fact-level knowledge conflicts that necessitate compositional multi-hop reasoning, making them particularly suitable for evaluating knowledge integration and conflict resolution in complex reasoning scenarios. RealtimeQA (Kasai et al. 2023) focuses on temporal conflicts, where answers may quickly become outdated, leading to inconsistencies between static parametric knowledge and dynamic external sources.

Evaluated Models We select three representative LLMs across different architectures and model scales to ensure comprehensive evaluations: GPT-4o-mini (Achiam et al. 2023), Qwen2.5-7B-Instruct (Yang et al. 2025), and Mistral-7B-Instruct (Jiang et al. 2024). This selection encompasses both open-source and closed-source models, ensuring that TruthfulRAG is broadly applicable to RAG systems built upon diverse LLM backbones.

Baselines We compare TruthfulRAG against five baseline approaches spanning different methodological categories: (i) Direct Generation requires LLMs to generate responses solely based on their parametric knowledge without any external retrieval. (ii) Standard RAG represents the conventional retrieval-augmented generation paradigm, where LLMs generate responses using retrieved textual passages directly. (iii) KRE (Ying et al. 2024) serves as a representative prompt optimization method, which enhances reasoning faithfulness by adopting specialized prompting strategies to guide the model in resolving knowledge conflicts. (iv) COIECD (Yuan et al. 2024) represents the decoding manipulation category, which modifies the model's decoding strategy during the inference stage to guide LLMs toward greater reliance on retrieved context rather than parametric knowledge. (v) FaithfulRAG (Zhang et al. 2025) incorporates a self-reflection mechanism that identifies factual discrepancies between parametric knowledge and retrieved context, enabling LLMs to reason and integrate conflicting facts before generating content.

Evaluation Metrics Following prior studies, we adopt accuracy (ACC) as the primary evaluation metric, measuring the proportion of questions for which the LLM generates correct answers, thereby providing a direct assessment of the factual correctness of the generated responses. To evaluate the method's capability to precisely extract information pertinent to the target answer from retrieved corpora, we introduce the Context Precision Ratio (CPR) metric, which measures the proportion of answer-related content within the processed context:

$$\mathrm{CPR} = \frac{|\mathcal{A}_{\text{gold}} \cap \mathcal{C}_{\text{processed}}|}{|\mathcal{C}_{\text{processed}}|} \tag{13}$$

where $|\mathcal{A}_{\text{gold}} \cap \mathcal{C}_{\text{processed}}|$ denotes the length of segments in the processed context that are directly related to the correct answer, and $|\mathcal{C}_{\text{processed}}|$ represents the total length of the processed context.
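Under the assumption that lengths are measured in characters (the paper does not fix the unit), CPR from Eq. (13) can be sketched as:

```python
def context_precision_ratio(gold_segments, processed_context):
    """Eq. (13): fraction of the processed context (by character length)
    covered by answer-related gold segments that appear in it."""
    related = sum(len(seg) for seg in gold_segments if seg in processed_context)
    return related / len(processed_context)

ctx = "Paris is the capital of France."
cpr = context_precision_ratio(["capital of France"], ctx)
```

A real implementation would also merge overlapping gold segments so shared characters are not double-counted.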

| Method | LLM | FaithEval | MuSiQue | RealtimeQA | SQuAD | Avg. | Imp. |
|---|---|---|---|---|---|---|---|
| w/o RAG | GPT-4o-mini | 4.6 | 15.1 | 43.4 | 11.2 | 18.6 | - |
| | Qwen2.5-7B-Instruct | 4.2 | 19.6 | 40.7 | 11.1 | 18.9 | - |
| | Mistral-7B-Instruct | 6.3 | 13.8 | 29.2 | 11.5 | 15.2 | - |
| w/ RAG | GPT-4o-mini | 61.3 | 72.6 | 67.3 | 73.1 | 68.6 | 50.0 |
| | Qwen2.5-7B-Instruct | 53.1 | 75.2 | 78.7 | 68.3 | 68.8 | 49.9 |
| | Mistral-7B-Instruct | 61.9 | 67.6 | 52.2 | 67.2 | 62.2 | 47.0 |
| KRE | GPT-4o-mini | 50.7 | 34.6 | 47.5 | 65.3 | 49.5 | 30.9 |
| | Qwen2.5-7B-Instruct | 59.6 | 70.7 | **86.7** | 73.7 | 72.7 | 53.8 |
| | Mistral-7B-Instruct | 73.2 | 50.6 | 76.9 | 74.6 | 68.8 | 53.6 |
| COIECD | GPT-4o-mini | 53.9 | 56.4 | 48.7 | 57.6 | 54.2 | 35.6 |
| | Qwen2.5-7B-Instruct | 62.3 | 69.7 | 78.8 | 70.8 | 70.4 | 51.5 |
| | Mistral-7B-Instruct | 62.8 | 66.8 | 58.4 | 65.4 | 63.3 | 48.1 |
| FaithfulRAG | GPT-4o-mini | <u>67.2</u> | <u>79.3</u> | <u>78.8</u> | <u>80.8</u> | 76.5 | 58.0 |
| | Qwen2.5-7B-Instruct | <u>71.8</u> | <u>78.0</u> | <u>84.1</u> | <u>78.3</u> | 78.1 | 59.1 |
| | Mistral-7B-Instruct | <u>81.7</u> | <u>78.5</u> | <u>77.0</u> | **85.7** | 80.7 | 65.5 |
| TruthfulRAG (Ours) | GPT-4o-mini | **69.5** | **79.4** | **85.0** | **81.1** | 78.8 | 60.2 |
| | Qwen2.5-7B-Instruct | **73.2** | **79.1** | 82.3 | **78.7** | 78.3 | 59.4 |
| | Mistral-7B-Instruct | **81.9** | **79.3** | **81.4** | <u>82.7</u> | 81.3 | 66.1 |

Table 1: Comparison of ACC between TruthfulRAG and five baselines across four datasets within three representative LLMs. The best result for each backbone LLM within each dataset is highlighted in bold, and the second best is emphasized with an underline. Avg. denotes the arithmetic mean accuracy across the four datasets, while Imp. indicates the average improvement over the corresponding LLM's w/o RAG baseline.

Implementation Details For dense retrieval, cosine similarity is computed using embeddings generated by all-MiniLM-L6-v2. For entropy-based filtering, we set model-specific thresholds $\tau$ for the entropy variation $\Delta H_p$: GPT-4o-mini and Mistral-7B-Instruct use $\tau = 1$, while Qwen2.5-7B-Instruct adopts a higher threshold of $\tau = 3$. All experiments are conducted on NVIDIA V100 GPUs with 32GB memory. To ensure reproducibility, the temperature for text generation is set to 0, and all Top-$K$ values are set to 10.

Results and Analysis

Overall Performance Table 1 presents a comprehensive comparison of TruthfulRAG against five baseline methods across four datasets, evaluating performance in terms of factual accuracy (ACC) using three representative LLMs. To facilitate overall assessment, we additionally report Avg., the arithmetic mean accuracy across the four datasets, and Imp., the average improvement over the corresponding LLM's w/o RAG baseline, serving as a proxy for the number of factual conflicts successfully corrected by the method from the LLM's parametric knowledge.

The results clearly demonstrate that TruthfulRAG consistently achieves superior or competitive performance relative to all baseline approaches. Specifically, it achieves the highest accuracy on FaithEval (81.9%), MuSiQue (79.4%), and RealtimeQA (85.0%), and ranks first or second on SQuAD across all models. Notably, TruthfulRAG achieves the highest overall performance across all backbone LLMs, attaining both the best average accuracy (Avg.) and the greatest relative improvement (Imp.) compared to all baseline methods. This clearly illustrates its robustness in mitigating factual inconsistencies that standard RAG systems struggle with due to unresolved evidence conflicts.

Compared to standard RAG systems, which exhibit significant variability in accuracy due to unresolved knowledge conflicts, TruthfulRAG achieves improvements ranging from 3.6% to 29.2%, highlighting its robustness in mitigating factual inconsistencies. Furthermore, while methods like FaithfulRAG and KRE offer partial gains through semantic alignment or prompt-based mechanisms, they fall short in consistently resolving fine-grained factual discrepancies. In contrast, TruthfulRAG integrates knowledge graph-based reasoning with entropy-guided conflict filtering mechanisms to identify and resolve contradictory information, thereby substantially enhancing factual reliability. These findings validate the effectiveness of TruthfulRAG in delivering accurate, faithful, and contextually grounded responses across diverse knowledge-intensive tasks.

Performance on Non-Conflicting Contexts To evaluate the robustness of TruthfulRAG in scenarios where retrieved contexts free from factual conflicts, we conduct experiments on golden standard datasets in which the retrieved passages are guaranteed to be non-contradictory.

As shown in Table 2, TruthfulRAG consistently outperforms all baseline methods across both the MuSiQue-golden and SQuAD-golden datasets. These findings substantiate that TruthfulRAG not only excels at resolving conflicting information but also maintains superior performance in nonconflicting contexts, thereby revealing its universal applicability and effectiveness. The consistent performance improvements can be attributed to the structured knowledge representation provided by the knowledge graph module, which enables the identification of fine-grained entities and relational links in non-conflicting contexts. This capability facilitates the extraction of query-relevant information and promotes a more comprehensive understanding and integration of factual knowledge by the LLMs. Notably, while methods such as KRE exhibit significant performance degradation in non-conflicting scenarios, TruthfulRAG maintains its robustness across diverse contextual settings. This consistency highlights its practical utility and reliability for real-world RAG applications.

| Dataset | w/o RAG | w/ RAG | KRE | COIECD | FaithfulRAG | TruthfulRAG (Ours) |
|---|---|---|---|---|---|---|
| MuSiQue-golden | 45.6 | 89.9 | 44.1 (-45.8) | 89.5 (-0.4) | 91.8 (+1.9) | **93.2 (+3.3)** |
| SQuAD-golden | 68.7 | 97.9 | 83.2 (-14.7) | 97.1 (-0.8) | 98.1 (+0.2) | **98.3 (+0.4)** |

Table 2: Performance comparison on non-conflicting contexts with GPT-4o-mini as the backbone LLM. The best result on each dataset is highlighted in bold. The numbers in parentheses indicate the change in accuracy compared to the standard RAG.


Figure 3: Comparison of LLM confidence, measured by negative log-probability (logprob) values using GPT-4o-mini, when reasoning with natural language contexts versus structured reasoning paths across four datasets. Lower negative logprob values indicate higher actual log-probability scores and thus increased model confidence in generating correct answers.

Impact of Structured Reasoning Paths To investigate the impact of structured reasoning paths on the confidence of LLMs relative to raw natural language context, we conduct a comprehensive analysis across four datasets. Specifically, we compare the model's confidence when reasoning with retrieved knowledge presented in natural language format or as structured reasoning paths derived through our knowledge graph construction mechanism. To quantify the model's confidence in its predicted answers, we measure the log-probability of the correct answer tokens generated by LLMs and compute the average across all test instances.

As shown in Figure 3, our experimental results reveal a consistent pattern across all evaluated datasets. Structured reasoning paths consistently lead to higher logprob values for correct answers compared to natural language contexts, indicating greater model confidence when reasoning with structured knowledge representations. This empirical evidence demonstrates that transforming unstructured natural language into structured reasoning paths through knowledge graphs significantly strengthens the LLM's confidence in following external retrieved knowledge for inference. Furthermore, this finding provides crucial insights into the superior performance of TruthfulRAG in both conflicting and non-conflicting semantic scenarios, as the enhanced confidence facilitates more reliable adherence to external knowledge sources, thereby supporting factual consistency and promoting the generation of faithful model outputs.

Ablation Study To comprehensively evaluate the contribution of each component in TruthfulRAG, we conduct systematic ablation experiments by removing key modules from the full framework. Since knowledge graph construction and retrieval are two closely coupled modules, we combine them as an integrated component for ablation evaluation.

As shown in Table 3, the complete TruthfulRAG framework achieves superior performance across all datasets, with accuracy improvements ranging from 6.8% to 17.7% compared to the standard RAG, demonstrating that the structured knowledge graph and the conflict resolution mechanism function synergistically to enhance both factual accuracy and contextual precision. The ablation results reveal several critical insights. First, when employing only the filtering mechanism without knowledge graph integration (w/o Knowledge Graph), although accuracy demonstrates modest improvements, CPR exhibits a notable decline across most datasets, particularly in MuSiQue (1.86 to 1.15) and SQuAD (2.71 to 1.97). This phenomenon indicates that LLMs encounter substantial difficulties in effectively extracting relevant information from naturally organized contexts, thereby constraining their ability to achieve higher accuracy. In contrast, when utilizing solely the knowledge graph component without conflict resolution (w/o Conflict Resolution), CPR achieves significant improvements, yet the introduction of extensive structured knowledge simultaneously introduces redundant information, resulting in limited improvements in accuracy across most datasets. These findings support our hypothesis that structured knowledge representations facilitate the precise localization of query-relevant information, enabling more targeted and effective information extraction compared to unstructured contexts.

| Method | FaithEval | MuSiQue | RealtimeQA | SQuAD |
|---|---|---|---|---|
| Standard RAG | 61.3 / 0.51 | 72.6 / 1.86 | 67.3 / 0.47 | 73.1 / 2.71 |
| w/o Knowledge Graph | 64.8 / 0.52 | 78.9 / 1.15 | 83.2 / 0.23 | 78.8 / 1.97 |
| w/o Conflict Resolution | 69.3 / 0.59 | 77.8 / 2.79 | 84.1 / 1.80 | 78.2 / 2.85 |
| Full Method | 69.5 / 0.56 | 79.4 / 2.25 | 85.0 / 1.54 | 81.1 / 2.56 |

Table 3: Ablation study results of different components in TruthfulRAG with GPT-4o-mini as the backbone LLM. The results are presented in the format ACC / CPR, where ACC denotes accuracy and CPR represents Context Precision Ratio.

Related Work

This section reviews existing research on knowledge conflicts in RAG systems, categorizing the literature into two main areas: impact analysis and resolution strategies.

Impact Analysis of Knowledge Conflicts

Recent studies have extensively explored the influence of knowledge conflicts on the performance of RAG systems (Longpre et al. 2021; Chen, Zhang, and Choi 2022; Xie et al. 2023; Tan et al. 2024; Ming et al. 2025), which primarily highlight differential preferences between parametric knowledge and retrieved external information. Longpre et al. (Longpre et al. 2021) first expose entity-based knowledge conflicts in question answering, revealing that LLMs tend to rely on parametric memory when retrieved passages are perturbed or contain contradictory information. Chen et al. (Chen, Zhang, and Choi 2022) demonstrate that while retrieval-based LLMs predominantly depend on nonparametric evidence when recall is high, their confidence scores fail to reflect inconsistencies among retrieved documents. Xie et al. (Xie et al. 2023) find that LLMs are receptive to single external evidence, yet exhibit strong confirmation bias when presented with both supporting and conflicting information. Tan et al. (Tan et al. 2024) reveal a systematic bias toward self-generated contexts over retrieved ones, attributing this to the higher query-context similarity and semantic incompleteness of retrieved snippets.

Our work aligns with the non-parametric knowledge preference paradigm, aiming to guide LLMs to follow updated and comprehensive external knowledge while correcting for temporal and factual errors within internal memory, thereby generating accurate and trustworthy outputs.

Solutions to Knowledge Conflicts

Current approaches for knowledge conflict resolution can be categorized into token-level and semantic-level methods (Jin et al. 2024; Wang et al. 2024; Bi et al. 2025; Zhang et al. 2025; Wang et al. 2025). Token-level approaches focus on fine-grained intervention during generation. $CD^2$ (Jin et al. 2024) employs attention weight manipulation to suppress parametric knowledge when conflicts are detected. ASTUTE RAG (Wang et al. 2024) utilizes gradient-based attribution to identify and mask conflicting tokens during inference. These methods achieve precise control, but often suffer from computational overhead and lack semantic awareness among generated contents. Semantic-level approaches operate at higher abstraction levels. CK-PLUG (Bi et al. 2025) develops parameter-efficient conflict resolution through adapter-based architectures that learn to weight parametric versus non-parametric knowledge dynamically. FaithfulRAG (Zhang et al. 2025) externalizes LLMs' parametric knowledge and aligns it with retrieved context, thereby achieving higher faithfulness without sacrificing accuracy. However, these methods primarily address surface-level conflicts without capturing the underlying factual relationships that drive knowledge inconsistencies.

Different from these approaches, TruthfulRAG leverages structured triple-based knowledge representations to precisely identify and resolve factual-level knowledge conflicts arising from complex natural language expressions, thereby ensuring the reliability and consistency of reasoning.

Conclusion

In this paper, we introduce TruthfulRAG, the first framework that leverages knowledge graphs to address factual-level conflicts in RAG systems. By integrating systematic triple extraction, query-aware graph retrieval, and entropy-based filtering mechanisms, TruthfulRAG transforms unstructured retrieved contexts into structured reasoning paths that enhance LLMs' confidence in external knowledge while effectively mitigating factual inconsistencies. Our comprehensive experiments demonstrate that TruthfulRAG consistently outperforms existing SOTA methods. These results establish TruthfulRAG as a robust and generalizable solution for improving the trustworthiness and accuracy of RAG systems, with significant implications for knowledge-intensive applications requiring high reliability and precision.