commit 5e2eb7b8c08dd15fbede363687e64f9cd66bfd34
Author: 龙澳
Date: Thu Apr 2 09:48:38 2026 +0800

    first commit

diff --git a/RAG for Spatial Data.md b/RAG for Spatial Data.md

## Problems with Multi-source Retrieval Augmented Generation (spatial data is not considered)

Consider the Mars scenario:

1. The Discretization Failure of Continuous Topologies and the Fragmentation of Heterogeneous Reference Frames
Existing multi-source knowledge aggregation methods (e.g., Multi-source Line Graphs) depend heavily on discrete text entities and explicit semantic associations to build their graph topology. Martian science data, however, is intrinsically embedded in continuous Euclidean physical space and suffers from severe reference-frame fragmentation (e.g., the global absolute coordinate system used by orbiters vs. the local relative coordinate system used by rovers; the misalignment between Earth UTC and the Martian local solar day, Sol). On the one hand, discrete graph construction based solely on semantic entities cannot achieve physical spatial alignment across reference frames; on the other hand, forcing continuous spatial proximity and directional relations into a traditional discrete graph structure inevitably triggers "edge explosion (Edge Explosion)," destroying the sparsity-oriented optimizations of existing graph models. Traditional discrete logical graph structures therefore cannot bridge the gap between physical continuity and semantic discreteness, and they have become a structural bottleneck for planetary spatial reasoning.

2. The Contradiction Between Scientific Cognitive Conflict and Traditional "De-falsification" Mechanisms
The core assumption of existing multi-source RAG frameworks is that inter-source inconsistency usually stems from erroneous information or model "hallucination," so they rely on multi-level confidence computation to eliminate unreliable nodes. In deep-space exploration, however, there is no absolute ground truth: different platforms (e.g., orbiters vs. rovers) often produce significantly conflicting observations of the same target region because of differences in observation scale, penetration depth, and instrument principles (e.g., an orbiter detects surface hydrated minerals while in-situ drilling finds nothing anomalous). Such "conflict" is not data error but an inherent property of multi-dimensional scientific observation, and it carries clues to major discoveries such as geological evolution. Mechanically applying the conflict-filtering mechanisms of existing frameworks causes severe "over-smoothing (Over-smoothing)" that indiscriminately erases high-value scientific anomalies, fundamentally violating the knowledge-discovery principle of deep-space exploration: preserve disagreement, corroborate across sources.

## Problems with RAG for Spatial Data (the reliability of multi-source spatial data is not considered)

Current RAG for Spatial Data usually targets a single source [@zhang2025imagerag]: it processes one source (one large image) and assumes that image is the truth, so conflicts like "the image shows a house, but the text report says the house was demolished" never arise. It solves a Scale problem, not a Consistency problem. Even when multiple sources and modalities are involved, the focus is on Capability (how to get hard-to-handle spatial data, such as gigapixel imagery and heterogeneous databases, into RAG so the LLM can read it), not on Reliability (how to keep the LLM from fabricating answers when different sources disagree). [@yu2025spatialrag] and [@amendola2025spatiallyenhanced] study Hybrid Retrieval, combining spatial-database filtering (e.g., distance predicates) with semantic text search. But they emphasize Fusion and assume spatial data and text data are complementary. If the spatial DB says "there is a road here" while the text description says "the road is closed for construction," these frameworks will most likely hallucinate or simply ignore the conflict. They have no Conflict Resolution mechanism. [@wen2025rsrag] and [@canada2025multimodal] focus on dataset construction and vector-space alignment. They are indeed multi-source (Image + Text), but they mainly address Representation, i.e., mapping images and text into a shared vector space, and they cannot resolve the "logical gaps caused by sparsity" or the "conflicts between sources."

Current spatial RAG systems concentrate on aligning heterogeneous modalities (vector, raster, text). They overlook the inherent inconsistency and logical sparsity of spatial data (e.g., outdated POI text vs. fresh satellite imagery vs. imprecise OSM vectors). This leads to "spatial hallucination," where the LLM generates geometrically impossible or factually self-contradictory answers.

## My Approach

### 1. Addressing Pain Point 1: Missing Topology (Solving Missing Topology)

Deficiency analysis of the original paper: MultiRAG uses a Multi-source Line Graph (MLG) whose core idea is to turn "entity-relation-entity" triples into nodes. This structure captures only Logical Connectivity, e.g., "A is part of B." It cannot encode distance, bearing, or containment in Euclidean Space. To the LLM, "Near" and "Far" are just two ordinary word labels in such a graph, stripped of metric meaning.

Our solution: a Topo-Semantic Dual Graph. Instead of a single Line Graph, we build a two-layer coupled graph structure:

1. Layer 1: Semantic Line Graph (logic layer). Inherits the MultiRAG design and handles non-spatial semantic information (e.g., "Zhurong - belongs to - CNSA").
2. Layer 2: Spatial-Topology Graph. Definition: an explicit spatial index layer. We discretize space (e.g., using H3 hexagonal cells or S2 cells) or build an adjacency graph via Delaunay Triangulation. Novel operator: Spatial Edge Encoding. In MultiRAG, two nodes are linked because they share an entity; in Geo-MultiRAG, we introduce a "Spatial Proximity Edge": if entities $e_i$ and $e_j$ have spatial-projection IoU (Intersection over Union) $> 0$ or distance $dist(e_i, e_j) < \delta$, we add a weighted spatial edge between them.

Formalization for Paper: Let $\mathcal{G}_{sem} = (V, E_{sem})$ be the semantic graph.
We introduce a metric graph $\mathcal{G}_{geo} = (V, E_{geo})$, where an edge $e_{ij} \in E_{geo}$ exists iff:

$$\text{SpatialRel}(v_i, v_j) \in \{\text{Contains, Overlaps, Meets, Near}\}$$

We define a Spatial Encoding Kernel $K_{spa}(v_i, v_j)$ to replace the simple binary connection in MultiRAG:

$$K_{spa}(v_i, v_j) = \exp\left(-\frac{\|coord(v_i) - coord(v_j)\|^2}{2\sigma^2}\right) \cdot \mathbb{I}(\text{Visible})$$

Interpretation: during retrieval, the LLM no longer walks only along semantic links; through the spatial kernel it can also perceive entities that are physically adjacent but semantically unconnected (e.g., "dune" and "rover" share no edge in the semantic graph, yet their association is activated because they are spatially close).

### 2. Addressing Pain Point 2: The Multi-Scale Paradox (Solving Multi-Scale Paradox) (reframed as a Ground Truth problem)

Deficiency analysis of the original paper: MultiRAG computes confidence with Mutual Information Entropy, $I(v_i, v_j) = \sum_x \sum_y p(x,y) \log \frac{p(x,y)}{p(x)p(y)}$. Its assumption: if two sources "agree," confidence is high; if they disagree, confidence is low. On Mars this is fatal: CTX (6 m) says "Plain" while HiRISE (0.3 m) says "Rocky." The two descriptions are textually inconsistent (low mutual information), so MultiRAG treats them as a conflict (Hallucination) and kills one of them. In reality this is Multi-scale Complementarity.

Our solution: Resolution-Aware Entailment. We propose a new confidence module that computes "Entailment" rather than "Similarity." Resolution Factor: assign each data source $D_k$ a resolution weight $\lambda_k$ (e.g., HiRISE $\lambda = 1.0$, CTX $\lambda = 0.2$). Asymmetric Confidence: the original MultiRAG score is symmetric, $S(v_i, v_j) = S(v_j, v_i)$; we replace it with directed entailment. Scale-Consistent Scoring Function: for same-scale comparisons, keep the MultiRAG behavior (check consistency); for cross-scale comparisons (Source High vs. Source Low), check "Semantic Encompassment."

Formalization for Paper: We redefine the confidence score $C(v_{high}, v_{low})$ not as similarity, but as a conditional probability based on the resolution hierarchy:

$$Score(v_{high}, v_{low}) =
\begin{cases}
\text{Sim}(v_{high}, v_{low}), & \text{if } |\lambda_{high} - \lambda_{low}| < \epsilon \quad (\text{same scale: check conflict}) \\
\text{Entail}(v_{low} \to v_{high}), & \text{if } \lambda_{high} \gg \lambda_{low} \quad (\text{cross scale: check entailment})
\end{cases}$$

where $\text{Entail}(\cdot)$ is a Natural Language Inference (NLI) probability: does the coarse description (e.g., "Plain") logically permit the existence of the fine description (e.g., "Small Rocks")? "Plain" entails "Small Rocks"? $\rightarrow$ Yes (high confidence). "Lake" entails "Dune"? $\rightarrow$ No (low confidence, likely hallucination).

Impact: with this change, the model can say: "Source A says Plain, Source B says Rocks. Since Source B has higher resolution, and plains often contain small rocks, both are kept, and the final answer is enriched: 'A generally flat plain containing localized rocky fields'."

diff --git a/method.md b/method.md

# III. METHODOLOGY

## A. Framework of XXX-RAG

This section elaborates on the implementation of XXX-RAG. As shown in Fig. 3, the framework comprises three modules. The first step constructs a Hyperbolic Spatial Hypergraph (HySH) from multi-source planetary observation data, achieving unified spatiotemporal representation via n-ary observation hyperedges embedded in hyperbolic space; the second step performs spatiotemporal retrieval on the constructed HySH, where hyperbolic spatial proximity encoding and cross-resolution aggregation extract query-relevant multi-source evidence; the third step applies physics-informed conflict triage (PICT), which detects inter-source conflicts via cross-source interaction entropy, classifies them into four scientific categories, and applies conflict-aware confidence recalibration to preserve scientifically valuable disagreements while filtering noise. Finally, these steps are integrated into the XXX-RAG Prompting algorithm, ARP.

## B. Hyperbolic Spatial Hypergraph Construction

The XXX-RAG method begins by constructing a knowledge structure that can faithfully represent the continuous spatiotemporal topology of planetary multi-source data.
Unlike MultiRAG's Multi-source Line Graph (MLG), which relies on discrete text entities and binary triples, we introduce a hypergraph structure embedded in hyperbolic space to jointly address edge explosion and spatial scale hierarchy.

Specifically, we first design a spatial adapter for each observation data source to parse instrument metadata, spatial footprints, temporal windows, and spectral parameters. For orbital remote sensing data (e.g., HiRISE, CTX, CRISM), parsing involves extracting the image footprint geometry, ground sampling distance, and spectral band configuration from PDS labels. For in-situ data (e.g., rover spectrometers, ground-penetrating radar), parsing extracts the rover traverse coordinates, measurement timestamps in Sol, and instrument-specific parameters such as penetration depth. All temporal references are unified to Solar Longitude $L_s$ to enable cross-platform temporal comparison. The final integration of multi-source spatial data can be expressed as:

$$D_{Fusion} = \bigcup_{i=1}^{n} A_i^{spa}(D_i) \tag{2}$$

where $A_i^{spa} \in \{Ada_{orbital}, Ada_{insitu}, Ada_{derived}\}$ represents the spatial adapter parsing function for orbital, in-situ, and derived data products, respectively.

From the parsed data, we then construct the hyperbolic spatial hypergraph. The construction process involves three key phases: spatial observation hyperedge formation, scale-aware hyperbolic embedding, and cross-reference-frame alignment.

Definition 1. Spatial observation hyperedge. Given a multi-source spatial knowledge hypergraph $\mathcal{G}_{hyp} = (\mathcal{E}, \mathcal{R}, \mathcal{F}_{spa})$, a spatial observation hyperedge $f_{spa}^n \in \mathcal{F}_{spa}$ is defined as:

$$f_{spa}^n = (\mathcal{I}, \; \mathcal{P}_{foot}, \; \mathcal{T}_{win}, \; \mathcal{S}_{band}, \; \mathcal{O}_{target}, \; \ell_{res}) \tag{3}$$

where $\mathcal{I}$ denotes the instrument entity, $\mathcal{P}_{foot} \subset \mathbb{S}^2_{Mars}$ denotes the spatial footprint on the Martian sphere, $\mathcal{T}_{win}$ denotes the temporal acquisition window parameterized in $L_s$, $\mathcal{S}_{band}$ denotes the spectral band set, $\mathcal{O}_{target}$ denotes the set of target geological features, and $\ell_{res} \in \mathbb{R}^+$ denotes the ground sampling distance.

This definition lets spatial observation hyperedges compactly aggregate co-located multi-source observations. In a pairwise spatial graph, $k$ co-located spatial entities require $\binom{k}{2} = O(k^2)$ spatial proximity edges. With hyperedges, a single $n$-ary fact binds all $k$ entities, reducing edge complexity to $O(k)$. This directly resolves the edge explosion problem identified in our analysis of MultiRAG's MLG structure.

Definition 2. Scale-aware Lorentz embedding. We represent the spatial observation hypergraph in $d$-dimensional hyperbolic space $\mathbb{H}_K^d$ with constant negative curvature $K < 0$ using the Lorentz model. The embedding mapping $\Phi: \mathcal{F}_{spa} \rightarrow \mathbb{H}_K^d$ couples the radial depth with spatial resolution:

$$r\left(\Phi(f_{spa}^n)\right) = \frac{1}{\sqrt{-K}} \cosh\left(\sqrt{-K} \cdot g(\ell_{res})\right) \tag{4}$$

where $g(\ell_{res}) = -\log(\ell_{res} / \ell_{max})$ is monotonically decreasing in the ground sampling distance $\ell_{res}$, and $r(\mathbf{x}) = x_0$ denotes the radial depth (the Lorentz time component, which grows monotonically with the intrinsic hyperbolic distance from the origin).
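To make Eq. (4) concrete, the following minimal sketch (assuming unit curvature $K = -1$ and $\ell_{max} = 460$ m as the coarsest resolution; both are illustrative choices, not fixed by the text) shows how the radial depth separates MOLA-, CTX-, and HiRISE-scale observations:

```python
import math

def radial_depth(l_res, l_max=460.0, K=-1.0):
    """Eq. (4): r = (1/sqrt(-K)) * cosh(sqrt(-K) * g(l_res)),
    with g(l_res) = -log(l_res / l_max), decreasing in l_res."""
    g = -math.log(l_res / l_max)
    return (1.0 / math.sqrt(-K)) * math.cosh(math.sqrt(-K) * g)

# Coarse global data sits at the origin's time component (r = 1);
# finer data is pushed exponentially outward.
mola, ctx, hirise = radial_depth(460.0), radial_depth(6.0), radial_depth(0.3)
assert mola < ctx < hirise
```

Because $\cosh$ grows exponentially in $g(\ell_{res})$, each order of magnitude of refinement multiplies the radial depth, which is the behavior the exponential volume growth of $\mathbb{H}_K^d$ is meant to absorb.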
This embedding design is motivated by the following observation on the intrinsic geometry of planetary spatial data:

Proposition 1 (Spatial Scale-Curvature Correspondence). The planetary spatial observation hierarchy exhibits tree-like branching: each coarser-resolution observation spatially contains multiple finer-resolution observations. Let $N(\ell)$ denote the number of observations at resolution level $\ell$. For remote sensing data with total survey area $A_{coverage}$:

$$N(\ell) \propto A_{coverage} / \ell^2 \tag{5}$$

As the resolution $\ell$ becomes finer, $N(\ell)$ grows quadratically in $1/\ell$; equivalently, each halving of $\ell$ multiplies the observation count by a constant factor, so the count grows geometrically with depth in the containment tree — the exponential branching characteristic of negative-curvature spaces. The spatial scale hierarchy is therefore intrinsically hyperbolic, and a Euclidean embedding, whose volume grows only polynomially, cannot faithfully represent it.

Through this embedding, global coarse-resolution data (e.g., MOLA topography at ~460 m) is placed near the hyperbolic origin (small radial depth), while local high-resolution data (e.g., HiRISE at 0.3 m) is placed far from the origin (large radial depth). The exponential volume growth of $\mathbb{H}_K^d$ naturally accommodates the exponentially increasing number of observations at finer scales.

Finally, to address the heterogeneous reference frame problem (orbiter areocentric coordinates vs. rover-centric local coordinates), we align all observations to a global reference via parallel transport on the hyperbolic manifold:

$$\Phi_{aligned}(e) = \exp_{o_{g}}\left(\Gamma_{o_k \to o_{g}}\left(\log_{o_k}(\Phi_k(e))\right)\right) \tag{6}$$

where $\log_{o_k}$ is the logarithmic map at the local reference origin $o_k$, $\Gamma_{o_k \to o_{g}}$ is the parallel transport operator along the geodesic from $o_k$ to the global origin $o_g$, and $\exp_{o_g}$ is the exponential map at the global origin. Unlike Euclidean affine transformations, hyperbolic parallel transport preserves geodesic distances and radial depth, ensuring that scale hierarchy information is maintained after cross-frame alignment.

Here, we provide a simple example. As shown in Fig. 4, an observation region is covered by three sources at different resolutions: a CTX mosaic (6 m), a HiRISE strip (0.3 m), and a CRISM spectral cube (18 m). In the hyperbolic spatial hypergraph, the HiRISE observation (finest resolution) is embedded at the largest radial depth, while the CRISM observation (coarsest resolution) is nearest to the origin. A spatial observation hyperedge binds all three observations and their co-located geological features into a single $n$-ary fact, without requiring $O(k^2)$ pairwise edges.

## C. Spatiotemporal Retrieval with Cross-Resolution Aggregation

After the construction of the hyperbolic spatial hypergraph, the next step is to retrieve query-relevant multi-source spatial evidence. Given a user query $q$, we first employ the LLM to extract spatial intent, including target entities, spatial constraints (footprint, region), temporal constraints ($L_s$ range, Sol range), and resolution preferences. These are denoted as the query elements $\mathcal{K}_q$.

Subsequently, we perform spatiotemporal retrieval on the hypergraph. For each topic entity $e_s \in \mathcal{E}_q$ extracted from the query, we retrieve its incident spatial observation hyperedges $\mathcal{F}_{e_s} = \{f_{spa}^n \in \mathcal{F}_{spa} : e_s \in f_{spa}^n\}$ and derive pseudo-binary triples $(e_h, f_{spa}^n, e_t)$ for pairwise reasoning. For each candidate triple, we compute a spatiotemporal encoding that fuses semantic, structural, and physical-spatial signals:

$$\mathbf{x} = \left[\varphi(q) \| \varphi(e_h) \| \varphi(f_{spa}^n) \| \varphi(e_t) \| \delta(e_h, f_{spa}^n, e_t) \| \psi_{geo}(e_h, e_t)\right] \tag{7}$$

where $\varphi$ denotes a text embedding model, $\delta$ denotes a structural proximity encoding adapted from [6] to operate on hyperedges, and $\psi_{geo}$ is the hyperbolic spatial encoding defined as:

$$\psi_{geo}(e_h, e_t) = \left[d_K\left(\Phi(e_h), \Phi(e_t)\right), \; \Delta r(e_h, e_t), \; \cos\theta_{bearing}\right] \tag{8}$$

Here $d_K$ is the geodesic distance in $\mathbb{H}_K^d$ capturing physical proximity, $\Delta r = |r(\Phi(e_h)) - r(\Phi(e_t))|$ encodes the scale difference via the radial depth gap, and $\cos\theta_{bearing}$ encodes the directional relationship. A lightweight MLP classifier $f_\theta$ then scores the plausibility of each candidate triple:

$$\text{score}(e_h, f_{spa}^n, e_t) = f_\theta(\mathbf{x}) \in [0, 1] \tag{9}$$

Top-scored triples are retained, and their tail entities form the frontier for next-hop expansion, following an adaptive search strategy with density-aware thresholding.

After retrieval, the selected multi-source evidence typically spans multiple resolutions. To aggregate it into a unified representation without losing fine-scale information, we introduce the Spatial Outward Einstein Midpoint (Spatial OEM). The motivation stems from a known failure mode: naively averaging hyperbolic embeddings collapses representations toward the origin, destroying the hierarchical structure encoded in radial depth [7].
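Before the formal definition, a toy numeric sketch of the idea in a 2-dimensional Lorentz model with $K = -1$: `lorentz_point` is a helper introduced only for this illustration, and `spatial_oem` applies the radius-powered weighting pattern of Eq. (10) below with unit query-relevance weights; it is a sketch, not the full implementation.

```python
import numpy as np

def lorentz_point(r):
    """A point on the unit hyperboloid -x0^2 + x1^2 = -1 (K = -1)
    whose time component x0 = r plays the role of radial depth."""
    return np.array([r, np.sqrt(r**2 - 1.0)])

def spatial_oem(points, w, p=1.0):
    """Radially weighted hyperbolic midpoint: weights w_i * r_i^p * lambda_i
    with lambda_i = x0, then reprojection onto the hyperboloid.
    p = 0 recovers a plain Einstein-style midpoint."""
    r = points[:, 0]                      # radial depth = Lorentz factor
    weights = w * (r ** p) * r
    v = (weights[:, None] * points).sum(axis=0) / weights.sum()
    minkowski = -v[0] ** 2 + v[1] ** 2    # timelike: strictly negative
    return v / np.sqrt(-minkowski)        # rescale back onto -<x,x>_L = 1

pts = np.stack([lorentz_point(1.05),    # coarse observation, near origin
                lorentz_point(40.0)])   # fine observation, far from origin
w = np.ones(2)
plain = spatial_oem(pts, w, p=0.0)
oem = spatial_oem(pts, w, p=1.0)
assert oem[0] > plain[0]  # outward bias: fine-scale evidence dominates
```

Increasing `p` shifts the midpoint further toward the high-radius (fine-resolution) point, which is exactly the outward bias the theorem below makes precise.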
Given spatial observation hyperedge embeddings $\{\Phi(f_i)\}_{i=1}^n \subset \mathbb{H}_K^d$ with query-relevance weights $w_i$ and resolution-aware radial weighting $\phi_{res}(f_i) = r(\Phi(f_i))^p$:

$$\mathbf{m}_{K,p}^{Spa\text{-}OEM} = \Pi_K\left(\frac{\sum_{i=1}^{n} w_i \cdot \phi_{res}(f_i) \cdot \lambda_i \cdot \Phi(f_i)}{\sum_{i=1}^{n} w_i \cdot \phi_{res}(f_i) \cdot \lambda_i}\right) \tag{10}$$

where $\lambda_i = \Phi(f_i)_0$ is the Lorentz factor and $\Pi_K$ denotes reprojection onto $\mathbb{H}_K^d$.

Theorem 1 (Spatial OEM Outward Bias). For $p \geq 1$, the Spatial OEM satisfies:

$$r(\mathbf{m}_{K,p}^{Spa\text{-}OEM}) \geq r(\mathbf{m}_K^{Ein})$$

where $\mathbf{m}_K^{Ein}$ is the standard Einstein midpoint ($p = 0$).

*Proof.* The OEM weights $\tilde{w}_i \propto w_i \cdot r(\Phi(f_i))^{p+1}$ concentrate more mass on high-radius points than the Einstein weights $w_i \cdot r(\Phi(f_i))$. By the Chebyshev sum inequality applied to the co-monotonic sequences $a_i = r(\Phi(f_i))^{p+1}$ and $b_i = r(\Phi(f_i))$, the pre-projection time component satisfies $\tilde{v}_0 \geq \bar{r}_w$ (the weighted mean radius). Since the reprojection $\Pi_K$ preserves the ordering of time components, the result follows. $\square$

Notably, the outward bias guarantees that high-resolution observations dominate the aggregated representation. This is essential for planetary science retrieval: when a user queries a specific geological feature, the aggregated evidence should preserve the fine-scale observational details rather than being smoothed into a coarse-resolution summary.

## D. Physics-Informed Conflict Triage

We define the multi-source spatial evidence retrieved for a single query as observation-grounded homologous data. Although targeting the same query object, these data often provide inconsistent factual statements due to differences in instrument principles, observation geometry, and acquisition epochs. Unlike MultiRAG's MCC module, which assumes that inconsistency indicates unreliability and employs mutual information entropy to filter conflicting nodes, we adopt a fundamentally different paradigm: physics-informed conflict triage (PICT), which classifies conflicts by their physical origin and applies differentiated processing strategies.

1) Observation-Grounded Conflict Formalization: The first stage establishes a formal framework for reasoning about conflicts in the context of physical observations. Each knowledge source carries not only factual content but also a physical measurement model that constrains what it can observe.

Definition 3. Observation-grounded knowledge source. A planetary observation knowledge source is defined as $\mathcal{K}_s = (\mathcal{I}_s, \Omega_s, F(\mathcal{K}_s), \mathcal{M}_s)$, where $\mathcal{I}_s$ denotes the instrument, $\Omega_s = (\ell_{res}, \lambda_{band}, \theta_{view}, d_{pen})$ denotes the observation geometry parameters (spatial resolution, spectral band, viewing angle, penetration depth), $F(\mathcal{K}_s)$ denotes the set of atomic factual statements, and $\mathcal{M}_s$ denotes the physical measurement model mapping target properties through observation constraints to observable facts.

For two sources $\mathcal{K}_i$ and $\mathcal{K}_j$ ($i \neq j$), the pairwise conflict set is:

$$\mathcal{C}_{i,j} = \{(\psi_i, \psi_j) \mid \psi_i \in F(\mathcal{K}_i), \psi_j \in F(\mathcal{K}_j), \psi_i \bot \psi_j\} \tag{11}$$

where $\psi_i \bot \psi_j$ denotes semantic incompatibility. We further introduce the central distinction of PICT:

Definition 4. Explainable conflict and opaque conflict. A pairwise conflict $(\psi_i, \psi_j) \in \mathcal{C}_{i,j}$ is *explainable* if there exists a physical bridging function $\mathcal{B}$ such that:

$$\mathcal{B}(\Omega_i, \Omega_j, \mathcal{M}_i, \mathcal{M}_j) \models \neg(\psi_i \bot \psi_j) \tag{12}$$

i.e., the apparent inconsistency is resolvable by accounting for differences in observation constraints. Otherwise, the conflict is *opaque*. Based on this distinction, we define four conflict categories:

| Category | Condition | Strategy |
|----------|-----------|----------|
| Noise ($\mathcal{C}^{noise}$) | Opaque, with significant source authority disparity | Filter low-authority source |
| Instrument-Inherent ($\mathcal{C}^{inst}$) | Explainable via $\Omega_i \neq \Omega_j$ | Preserve with physical explanation |
| Scale-Dependent ($\mathcal{C}^{scale}$) | Explainable via $\ell_{res}^i \neq \ell_{res}^j$ | Preserve with cross-scale linkage |
| Temporal-Evolution ($\mathcal{C}^{temp}$) | Explainable via $\mathcal{T}_i \neq \mathcal{T}_j$ | Preserve with temporal ordering |

2) Cross-Source Interaction Entropy: In the second stage, we detect conflicts by measuring the information-theoretic interaction effect when two sources are jointly presented to the LLM. TruthfulRAG [9] compares retrieval-augmented entropy against parametric-only entropy ($\Delta H_p = H(P_{aug}) - H(P_{param})$); that formulation is inapplicable in our setting, where all knowledge is external. We instead measure the cross-source interaction:

$$\mathcal{H}_{inter}(p_i, p_j \mid q) = H\left(P(\text{ans} \mid q, p_i \oplus p_j)\right) - \frac{1}{2}\left[H\left(P(\text{ans} \mid q, p_i)\right) + H\left(P(\text{ans} \mid q, p_j)\right)\right] \tag{13}$$

where $H(\cdot)$ is the token-averaged entropy over the top-$k$ candidate tokens:

$$H\left(P(\text{ans} \mid \text{context})\right) = -\frac{1}{|l|}\sum_{t=1}^{|l|}\sum_{i=1}^{k} pr_i^{(t)} \log_2 pr_i^{(t)} \tag{14}$$

and $p_i \oplus p_j$ denotes the concatenation of both reasoning paths. Positive values of $\mathcal{H}_{inter}$ (super-additive uncertainty) indicate that the two sources contradict each other; near-zero values indicate independence or consistency; negative values (sub-additive) indicate mutual complementarity. Reasoning path pairs whose interaction entropy exceeds a predefined threshold $\epsilon$ are classified as detected conflicts:

$$\mathcal{C}^{detected} = \{(\psi_i, \psi_j) \mid \mathcal{H}_{inter}(p_i, p_j \mid q) > \epsilon\} \tag{15}$$

3) Conflict Classification and Confidence Recalibration: In the third stage, each detected conflict is classified and the node confidence is recalibrated accordingly. For each detected conflict, we construct a feature vector that fuses information-theoretic, physical, and neural signals:

$$\mathbf{z}_{conf} = \left[\mathcal{H}_{inter}, \; \|\Omega_i - \Omega_j\|, \; |\log(\ell_{res}^i / \ell_{res}^j)|, \; \Delta\mathcal{T}, \; \rho_{auth}(i,j), \; \mathbf{h}^{(l^*)}_{conf}\right] \tag{16}$$

where $\mathbf{h}^{(l^*)}_{conf}$ is the LLM hidden state at the conflict encoding layer (mid-to-late layers where conflict signals concentrate, following the depth localization finding of [8]). A lightweight classifier maps the feature vector to a conflict type:

$$\hat{c} = \arg\max_{c \in \{noise, inst, scale, temp\}} P_\theta(c \mid \mathbf{z}_{conf}) \tag{17}$$

Proposition 2 (Conflict Type Separability).
The four conflict types are distinguished by orthogonal physical dimensions: $\|\Omega_i - \Omega_j\|$ separates instrument conflicts; $|\log(\ell_{res}^i / \ell_{res}^j)|$ separates scale conflicts; $\Delta\mathcal{T}$ separates temporal conflicts; $\rho_{auth}$ separates noise conflicts. Since these physical features are independent of and complementary to the hidden state features $\mathbf{h}^{(l^*)}_{conf}$ (which encode semantic inconsistency and achieve > 93% AUC with a linear probe [8]), the four conflict types are linearly separable in the augmented feature space $\mathbf{z}_{conf}$.

Based on the classification result, we recalibrate the node confidence. This is the key departure from MultiRAG's MCC, which uniformly penalizes inconsistency:

$$C_{triage}(v) = \begin{cases} C_{MCC}(v) & \text{if } v \notin \mathcal{C}^{detected} \\ \alpha \cdot C_{MCC}(v) + (1-\alpha) \cdot \eta & \text{if } \hat{c} = noise \\ C_{MCC}(v) + \beta \cdot \mathcal{H}_{inter}^{-1} & \text{if } \hat{c} \in \{inst, scale\} \\ C_{MCC}(v) \cdot \gamma(|\Delta\mathcal{T}|) & \text{if } \hat{c} = temp \end{cases} \tag{18}$$

where $\eta < 0$ is a penalty term for noise conflicts, $\beta > 0$ is a boost coefficient for scientifically explainable conflicts, and $\gamma(|\Delta\mathcal{T}|)$ is a time-decay weighting function that prioritizes recent observations while preserving temporal evolution signals.

Theorem 2 (Anti-Over-Smoothing Guarantee). Let $V_{sci} \subset V$ denote the set of nodes involved in explainable scientific conflicts ($\mathcal{C}^{inst} \cup \mathcal{C}^{scale} \cup \mathcal{C}^{temp}$). Under PICT with $\beta > 0$:

$$C_{triage}(v) > C_{MCC}(v) \quad \forall v \in V_{sci} \tag{19}$$

*Proof.* For $v \in \mathcal{C}^{inst} \cup \mathcal{C}^{scale}$: $C_{triage}(v) = C_{MCC}(v) + \beta \cdot \mathcal{H}_{inter}^{-1}$. Since $\beta > 0$ and $\mathcal{H}_{inter} > \epsilon > 0$ (by the detection threshold in Eq. 15), $\beta \cdot \mathcal{H}_{inter}^{-1} > 0$, and thus $C_{triage}(v) > C_{MCC}(v)$. For $v \in \mathcal{C}^{temp}$: $\gamma(|\Delta\mathcal{T}|) > 1$ for temporal contrasts with scientific significance, ensuring amplification. $\square$

This theorem guarantees that scientifically valuable conflict nodes can never be filtered out by the confidence mechanism, directly addressing the over-smoothing problem.

Ultimately, we design the conflict triage algorithm, PICT, to replace the MCC of MultiRAG. For noise conflicts, the low-authority source is filtered (compatible with the original MCC logic). For instrument-inherent and scale-dependent conflicts, both sources are preserved, with a physical bridging explanation $\mathcal{B}(\Omega_i, \Omega_j)$ appended to the context. For temporal-evolution conflicts, a temporal ordering is constructed. All preserved evidence carries provenance metadata (DataID, source institution, instrument identity, observation timestamp) to ensure scientific traceability.

## E. XXX-RAG Prompting

We propose the XXX-RAG Prompting (ARP) algorithm for multi-source planetary spatial data retrieval. Given a user query $q$, the LLM is first employed to extract entities, spatial constraints, and temporal constraints, generating the corresponding logical and spatial relationships. The observation data then undergoes multi-source spatial adapter parsing to derive normalized datasets, followed by construction of a Hyperbolic Spatial Hypergraph (HySH) for spatiotemporal knowledge aggregation. Further, spatiotemporal retrieval with Spatial OEM aggregation is performed to obtain multi-source spatial evidence. Finally, by leveraging the PICT mechanism, conflict detection and triage are executed, and the triage-calibrated confidence is computed to enhance the reliability of the answer while preserving scientific conflicts.
The results are embedded into the context of the LLM, together with provenance and conflict explanations, to generate a scientifically faithful retrieval answer.

Algorithm 1. XXX-RAG Prompting (ARP)

---

procedure ARP$(q)$

$\quad$ $\mathcal{E}_q, \mathcal{R}_q, \mathcal{P}_{foot}, \mathcal{T}_{win} \leftarrow$ Spatial Intent Extraction$(q)$

$\quad$ $D_q \leftarrow$ Multi-source Spatial Adapter Parsing$(D)$

$\quad$ $\mathcal{G}_{hyp} \leftarrow$ HySH Construction$(D_q)$ $\quad\triangleright$ Eq. 3-6

$\quad$ $\mathcal{T}_q \leftarrow$ Spatiotemporal Retrieval$(\mathcal{G}_{hyp}, \mathcal{E}_q)$ $\quad\triangleright$ Eq. 7-9

$\quad$ $\mathbf{m}_{agg} \leftarrow$ Spatial OEM Aggregation$(\mathcal{T}_q)$ $\quad\triangleright$ Eq. 10

$\quad$ $\mathcal{C}^{detected} \leftarrow$ Cross-Source Interaction Entropy$(\mathcal{T}_q, q)$ $\quad\triangleright$ Eq. 13-15

$\quad$ **for** $(\psi_i, \psi_j) \in \mathcal{C}^{detected}$ **do**

$\quad\quad$ $\hat{c} \leftarrow$ Conflict Classification$(\mathbf{z}_{conf})$ $\quad\triangleright$ Eq. 16-17

$\quad\quad$ $C_{triage}(v) \leftarrow$ Confidence Recalibration$(v, \hat{c})$ $\quad\triangleright$ Eq. 18

$\quad$ **end for**

$\quad$ Context $\leftarrow$ Differential Context Construction$(q, \mathcal{T}_q, \hat{c})$

$\quad$ Answer $\leftarrow$ LLM$(q \oplus$ Context $\oplus$ Provenance$)$

$\quad$ **return** Answer

end procedure

---

It should be noted that the ARP algorithm integrates the HySH and PICT modules in a sequential pipeline: HySH provides spatially aligned multi-source evidence, which PICT then evaluates for conflict semantics. The two modules interact through three coupling points: (1) spatial alignment (Eq. 6) is a prerequisite for meaningful interaction entropy computation (Eq. 13); (2) the radial depth difference $\Delta r$ from HySH (Eq. 8) feeds directly into the PICT feature vector (Eq. 16) as the resolution disparity signal; (3) triage results feed back to boost the retrieval priority of scientifically interesting regions in subsequent queries.

diff --git a/paper_introduction.md b/paper_introduction.md

# AreoRAG: A Physics-Informed Framework for Multi-Source Retrieval Augmented Generation over Planetary Spatial Data

## I. INTRODUCTION

Large Language Models (LLMs) have achieved remarkable success across a variety of natural language processing tasks, attributable to their robust capabilities in understanding and generating language and symbols [1]. In knowledge-intensive retrieval tasks, Retrieval Augmented Generation (RAG) has become a standard solution paradigm [2]–[4]. Previous works [5]–[11] have made significant strides in addressing the inherent knowledge limitations of LLMs by introducing external knowledge bases, markedly improving the accuracy and fidelity of LLM responses. Notably, the synergy between LLMs and Knowledge Graphs (KGs) has been proposed to achieve more efficient and structured information retrieval [12]–[26], propelling the deep reasoning capabilities of RAG in multi-hop question answering, knowledge-intensive retrieval, and multi-source data fusion.

With the rapid advancement of deep space exploration programs, including NASA's Mars 2020 Perseverance mission, ESA's ExoMars, and CNSA's Tianwen-1 mission, the volume and heterogeneity of planetary observation data have grown at an unprecedented scale [27], [28]. These multi-source datasets — spanning orbital remote sensing imagery (e.g., HiRISE at 0.3 m, CTX at 6 m, CRISM spectral cubes), in-situ measurements (e.g., rover-mounted spectrometers, ground-penetrating radar), and derived products (e.g., digital terrain models, mineral abundance maps) — collectively constitute a rich yet highly complex knowledge ecosystem for planetary science [29].
The demand for intelligent retrieval over such multi-source planetary data has become increasingly urgent: researchers need to perform spatial semantic search (e.g., "find HiRISE images with dust devil tracks near the equator"), cross-source association (e.g., aggregating multi-resolution data for a target region), and temporally-aware retrieval (e.g., "images captured by Zhurong rover within the first 90 Sols after landing along its southward traverse"). These tasks require the RAG system to bridge the gap between natural language queries and the underlying spatiotemporal structure of planetary observations. + +Recent multi-source RAG frameworks, exemplified by MultiRAG [30], have demonstrated promising results in mitigating hallucinations arising from data sparsity and inter-source inconsistency through multi-source line graph construction and multi-level confidence computation. However, these frameworks are fundamentally designed for discrete textual entities (e.g., flight records, book metadata, stock transactions) with explicit semantic associations, and their direct application to planetary spatial data introduces critical structural failures. Building upon the categorization of retrieval challenges in multi-source settings [9], [30], we identify the following failure modes that are unique to multi-source planetary spatial data retrieval: + +1) **Spatial proximity collapse**: Existing graph-based RAG methods rely on discrete entity co-occurrence to establish edges. When applied to spatially continuous observation data, encoding spatial proximity (e.g., two overlapping image footprints) as binary edges leads to $O(k^2)$ edge explosion, fundamentally destroying the sparsity-oriented optimizations of line graph structures. + +2) **Scale hierarchy distortion**: Planetary observations inherently form a resolution hierarchy — a single CTX mosaic (6m) spatially contains dozens of HiRISE strips (0.3m), which in turn are nested within MOLA topographic grids (~460m). 
This containment relationship cannot be faithfully represented by flat, pairwise graph topologies. + +3) **Scientific conflict erasure**: Multi-level confidence mechanisms designed to filter "unreliable" nodes inadvertently eliminate scientifically valuable observational disagreements. When an orbital spectrometer detects hydrated minerals on the surface while in-situ drilling reveals no such signature at depth, this conflict is not data error but evidence of subsurface geological stratification — a potential major scientific discovery. + +Fig. 1 illustrates the fundamental differences between conventional text-based multi-source retrieval and planetary spatial data retrieval. The continuous spatial embedding, hierarchical resolution structure, and physics-grounded observational conflicts of planetary data are inherently incompatible with discrete graph topologies and de-falsification mechanisms designed for textual knowledge bases. Against this backdrop, we focus on addressing the retrieval challenges unique to multi-source planetary spatial data to empower knowledge-augmented generation for deep space exploration. This work primarily explores the following two fundamental challenges: + +**1) Failure of Discrete Representation for Continuous Spatiotemporal Topology.** Multi-source knowledge aggregation methods, such as multi-source line graphs (MLG) [30], [31], rely heavily on discrete text entities and explicit semantic associations to construct graph topology. However, planetary science data is intrinsically embedded in continuous Euclidean physical space. Attempting to encode continuous spatial proximity and directional relationships within traditional discrete graph structures inevitably triggers edge explosion, thereby undermining the efficiency gains that graph-based methods achieve for sparse data distributions. 
Specifically, for $k$ co-located spatial entities, pairwise spatial encoding requires $\binom{k}{2} = O(k^2)$ edges, while the observation hierarchy (from coarse-resolution global coverage to fine-resolution local strips) demands nested containment relationships that flat graph topologies cannot express. This structural bottleneck prevents existing discrete logical graph structures from bridging the gap between physical continuity and semantic discreteness, constituting a fundamental constraint on planetary spatial reasoning capabilities. + +**2) Contradiction Between Scientific Cognitive Conflict and Traditional De-Falsification Mechanisms.** The core assumption underlying existing multi-source RAG frameworks is that inter-source data inconsistency typically stems from erroneous information or model hallucination, and therefore relies on multi-level confidence computation to eliminate conflicting nodes [30], [33], [34]. However, in deep space exploration scenarios, where absolute ground truth is absent, different observation platforms (e.g., orbiters vs. rovers) often yield significantly conflicting observations of the same target region due to differences in observation scale, penetration depth, and instrument principles. For instance, an orbital spectrometer may detect surface hydrated minerals while in-situ drilling at the same location finds no mineralogical anomaly — such conflict is not data error but an inherent attribute of multi-dimensional scientific observation, potentially containing clues to major scientific discoveries such as geological evolution and subsurface water migration. If existing conflict-filtering mechanisms are applied indiscriminately, severe over-smoothing will result, uniformly erasing high-value scientific anomalies and fundamentally violating the knowledge discovery paradigm of "preserving disagreement, multi-source corroboration" that is central to deep space exploration. 
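The quadratic blow-up behind the first challenge can be made concrete with a toy count. The sketch below (ours, purely illustrative) compares the number of binary proximity edges for $k$ co-located observations against the $k$ node-hyperedge incidences that a single $n$-ary hyperedge would need:

```python
from math import comb

def pairwise_edges(k: int) -> int:
    # Binary "spatial proximity" edges between every pair of
    # co-located observations: C(k, 2) = k(k-1)/2, i.e. O(k^2).
    return comb(k, 2)

def hyperedge_incidences(k: int) -> int:
    # A single n-ary hyperedge binding the same k observations
    # needs only k node-hyperedge incidences, i.e. O(k).
    return k

for k in (10, 100, 1000):
    print(k, pairwise_edges(k), hyperedge_incidences(k))
# At k = 1000, pairwise encoding already needs 499,500 edges
# versus 1,000 incidences for one hyperedge.
```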
+ +To address these challenges, we propose AreoRAG, a novel physics-informed framework designed for multi-source retrieval augmented generation over planetary spatial data. First, we introduce the Hyperbolic Spatial Hypergraph (HySH) for unified spatiotemporal knowledge representation. By employing $n$-ary spatial observation hyperedges, HySH binds co-located multi-source observations into single hyperedges, reducing edge complexity from $O(k^2)$ to $O(k)$. Through scale-aware Lorentz embedding, the resolution hierarchy is naturally encoded via radial depth in hyperbolic space, where the exponential volume growth of negative-curvature geometry faithfully accommodates the exponentially increasing number of observations at finer scales. Second, we propose Physics-Informed Conflict Triage (PICT), which replaces the conventional conflict-filtering paradigm with a classify-then-differentiate strategy. PICT detects inter-source conflicts via cross-source interaction entropy, classifies each conflict into four physically-grounded categories (noise, instrument-inherent, scale-dependent, and temporal-evolution), and applies differentiated confidence recalibration — filtering only noise conflicts while preserving and annotating scientifically valuable disagreements with physical bridging explanations. We provide a formal anti-over-smoothing guarantee ensuring that nodes involved in explainable scientific conflicts can never be filtered out by the confidence mechanism. + +The contributions of this paper are summarized as follows: + +1) **Hyperbolic Spatial Knowledge Aggregation**: In the knowledge construction module, we introduce the Hyperbolic Spatial Hypergraph as a data structure for unified spatiotemporal representation of multi-source planetary observations. 
By coupling $n$-ary spatial observation hyperedges with scale-aware Lorentz embedding, this structure simultaneously resolves the edge explosion problem inherent in encoding continuous spatial proximity and faithfully represents the resolution hierarchy through the intrinsic geometry of hyperbolic space. We further introduce the Spatial Outward Einstein Midpoint for cross-resolution aggregation that provably preserves fine-scale observational details. + +2) **Physics-Informed Conflict Triage**: In the retrieval module, we propose a conflict detection and classification mechanism grounded in observation physics. By formalizing conflicts through observation geometry parameters and measuring cross-source interaction entropy, we classify inter-source disagreements into four categories with orthogonal physical signatures. A conflict-aware confidence recalibration strategy is designed to filter noise while preserving scientifically explainable conflicts with provenance metadata and physical bridging explanations, accompanied by a formal anti-over-smoothing guarantee (Theorem 2). + +3) **Experimental Validation and Performance Comparison**: We construct a multi-source planetary spatial retrieval benchmark encompassing orbital imagery, in-situ measurements, and derived products from Mars exploration missions. Extensive experiments demonstrate that AreoRAG significantly outperforms existing state-of-the-art multi-source RAG methods in both retrieval accuracy and scientific conflict preservation, while maintaining competitive efficiency through the compact hyperbolic representation. 
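To illustrate the $n$-ary spatial observation hyperedge idea in contribution 1, the following minimal sketch groups multi-source observations that share a discretized spatial footprint into a single hyperedge. The cell keys, observation ids, and `build_hyperedges` helper are our own illustrative assumptions (a real footprint key might be an H3/S2 cell id); the hyperbolic embedding itself is not shown:

```python
from collections import defaultdict

# Toy observations: (observation_id, source, spatial_cell).
# All names here are illustrative, not actual mission products.
observations = [
    ("hirise_001", "HiRISE", "cell_A"),
    ("ctx_042",    "CTX",    "cell_A"),
    ("crism_007",  "CRISM",  "cell_A"),
    ("mola_tile3", "MOLA",   "cell_B"),
]

def build_hyperedges(obs):
    # One hyperedge per spatial cell: co-located observations are
    # bound by a single n-ary edge instead of pairwise links,
    # giving O(k) incidences for k co-located observations.
    edges = defaultdict(list)
    for obs_id, source, cell in obs:
        edges[cell].append(obs_id)
    return dict(edges)

hyperedges = build_hyperedges(observations)
# "cell_A" binds three co-located multi-source observations in one edge.
```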
diff --git a/参考论文/MultiRAG.md b/参考论文/MultiRAG.md new file mode 100644 index 0000000..41abe43 --- /dev/null +++ b/参考论文/MultiRAG.md @@ -0,0 +1,545 @@

# MultiRAG: A Knowledge-guided Framework for Mitigating Hallucination in Multi-source Retrieval Augmented Generation

Wenlong Wu \( {}^{1} \), Haofen Wang \( {}^{2} \), Bohan Li \( {}^{1,3,4\text{ ✉ }} \), Peixuan Huang \( {}^{1} \), Xinzhe Zhao \( {}^{1} \) and Lei Liang \( {}^{5} \)

\( {}^{1} \) College of Artificial Intelligence, Nanjing University of Aeronautics and Astronautics, Key Laboratory of Brain-Machine Intelligence Technology, Ministry of Education

\( {}^{2} \) College of Design & Innovation, Tongji University

\( {}^{3} \) Key Laboratory of Intelligent Decision and Digital Operation, Ministry of Industry and Information Technology

\( {}^{4} \) Collaborative Innovation Center of Novel Software Technology and Industrialization

\( {}^{5} \) Ant Group Knowledge Graph Team

Email: \{wuwenlong, bhli, peixuanh, xinzhe_zhao\}@nuaa.edu.cn

carter.whfcarter@gmail.com, leywar.liang@antgroup.com

Abstract—Retrieval Augmented Generation (RAG) has emerged as a promising solution to address hallucination issues in Large Language Models (LLMs). However, the integration of multiple retrieval sources, while potentially more informative, introduces new challenges that can paradoxically exacerbate hallucination problems. These challenges manifest primarily in two aspects: the sparse distribution of multi-source data that hinders the capture of logical relationships and the inherent inconsistencies among different sources that lead to information conflicts. To address these challenges, we propose MultiRAG, a novel framework designed to mitigate hallucination in multi-source retrieval-augmented generation through knowledge-guided approaches.
Our framework introduces two key innovations: (1) a knowledge construction module that employs multi-source line graphs to efficiently aggregate logical relationships across different knowledge sources, effectively addressing the sparse data distribution issue; and (2) a sophisticated retrieval module that implements a multi-level confidence calculation mechanism, performing both graph-level and node-level assessments to identify and eliminate unreliable information nodes, thereby reducing hallucinations caused by inter-source inconsistencies. Extensive experiments on four multi-domain query datasets and two multi-hop QA datasets demonstrate that MultiRAG significantly enhances the reliability and efficiency of knowledge retrieval in complex multi-source scenarios. Our code is available at https://github.com/wuwenlong123/MultiRAG

Index Terms—Retrieval Augmented Generation, Large Language Models, Multi-source Retrieval, Knowledge Graphs, Hallucination Mitigation

## I. INTRODUCTION

Large Language Models (LLMs) have achieved remarkable success in handling a variety of natural language processing tasks, attributable to their robust capabilities in understanding and generating language and symbols [1]. In knowledge-intensive retrieval tasks, Retrieval Augmented Generation (RAG) has become a standardized solution paradigm [2]–[4]. Previous works [5]–[11] have made significant strides in addressing the inherent knowledge limitations of LLMs: by introducing external knowledge bases, they have markedly improved the accuracy and fidelity of LLM responses. However, recent studies have highlighted a significant drawback: the retrieval results of RAG are imperfect, including irrelevant, misleading, and even malicious information, ultimately leading to inaccurate LLM responses.

To address these limitations, the synergy between LLMs and Knowledge Graphs (KGs) has been proposed to achieve more efficient information retrieval [12].
On one hand, KGs can efficiently store data with fixed characteristics (such as temporal KGs, event KGs, etc.), thereby enhancing the processing capabilities of LLMs on specific data [13]–[20]. On the other hand, the collaboration between LLMs and KGs has significantly improved performance in multi-hop and multi-document question answering, including the credibility and interpretability of retrieval [21]. Furthermore, LLM-KG collaborative methods have also provided the latest solutions for knowledge-intensive retrieval tasks [22]–[26], propelling the deep reasoning capabilities of RAG.

Nevertheless, existing frameworks still fail to account for the complexity of real-world data. Although RAG can mitigate the generation of hallucinations, these hallucinations often stem from the internal knowledge of LLMs [27]–[29]. Inconsistent information sources and unreliable retrieval methods can still lead to retrieval biases and hallucinations in LLMs. This issue becomes particularly pronounced when dealing with information retrieval tasks that involve multi-source knowledge, where hallucinations are more prominent. Research [30] indicates that approximately 70% of retrieved paragraphs do not directly contain the correct query answers but instead include information indirectly related to the answers, causing misguidance and comprehension bias in LLMs.

Building upon the categorization of hallucinations in retrieval [9], we outline the three most common types of hallucinations encountered in multi-source data retrieval:

---

Wenlong Wu and Haofen Wang contributed equally to this work.

Bohan Li is the corresponding author.

---

![bo_d6oh96c601uc73e30m7g_1_133_134_762_691_0.jpg](images/bo_d6oh96c601uc73e30m7g_1_133_134_762_691_0.jpg)

Fig. 1: Single-source Retrieval & Multi-source Retrieval

1) Inter-source data inconsistency: Discrepancies between different data sources can lead to conflicting information, causing hallucinations in LLMs.
+ +2) Redundancy of similar data: There often exists data that is highly similar and semantically equivalent across multiple data sources, which can impose significant computational overhead on retrieval. + +3) Incomplete inference paths: Forming a comprehensive inference path from different data sources is challenging. Existing retrievers often fail to capture the complete logical associations within multiple data sources. + +Fig. 1 vividly illustrates the differences between single-source and multi-source data retrieval through CA981 flight analysis. The sparse distribution and inconsistency of data are unique issues in multi-source data retrieval, leading to severe hallucination bias in LLMs. Against this backdrop, we focus on addressing the issue of retrieval hallucinations in multi-source data retrieval to empower knowledge-augmented generation. This work primarily explores the following two fundamental challenges: + +1) Sparse Distribution of Multi-source Data: Multi-domain queries require fusing structured (SQL tables), semi-structured (JSON logs), and unstructured data (text reports). Due to the variability in data storage formats and sparsity, the connectivity between knowledge elements is low, making it difficult for RAG systems to effectively capture logical associations across sources, thereby affecting the recall rate and quality of retrieval results. + +2) Inter-source Data Inconsistency: Conversely, the inherent diversity in knowledge representations across multi-source data often leads to inconsistencies in retrieved fragments. These discrepancies may induce information conflicts during retrieval processes, thereby compromising response accuracy. This challenge becomes particularly pronounced in domain-specific complex reasoning and multi-hop question answering tasks. 
To address the issues above, we propose MultiRAG, a novel framework designed to mitigate hallucination in multi-source retrieval augmented generation through knowledge-guided approaches. Initially, we introduce multi-source line graphs for rapid aggregation of knowledge sources to tackle issues arising from sparse data distribution. Subsequently, based on these integrated multi-source line graphs, we propose a multi-level confidence calculation method to ensure the reliability of multi-source data queries. This approach not only enhances query efficiency but also strengthens the accuracy of results, providing an effective solution for multi-source knowledge-guided RAG.

The contributions of this paper are summarized as follows:

1) Multi-source Knowledge Aggregation: In the knowledge construction module, we introduce multi-source line graphs as a data structure for rapid aggregation and reconstruction of knowledge structures from multiple query-relevant data sources. This effectively captures inter-source data dependencies within chunk texts, thereby providing a unified and centralized representation of multi-source knowledge.

2) Multi-level Confidence Calculation: In the retrieval module, we perform graph-level and node-level confidence calculations on the extracted knowledge subgraphs. The aim is to filter out and eliminate low-quality subgraphs and inconsistent retrieval nodes, ultimately enhancing the quality of text embedded in context to alleviate retrieval hallucinations.

3) Experimental Validation and Performance Comparison: We conducted extensive experiments on existing multi-source retrieval datasets and two complex Q&A datasets, comparing our approach with existing state-of-the-art (SOTA) methods. These experiments demonstrate the robustness and accuracy of our proposed method in retrieval performance. In particular, on multi-source data retrieval tasks, our method significantly outperforms other SOTA methods by more than 10%.

## II. PRELIMINARY

In the field of Knowledge-Guided RAG, the primary challenges include efficiently accessing relevant knowledge and achieving reliable retrieval performance. This section introduces the core elements of our approach and precisely defines the problems we address.

Let \( Q = \left\{ {{q}_{1},{q}_{2},\ldots ,{q}_{n}}\right\} \) be the set of query instances, where each \( {q}_{i} \) corresponds to a distinct query. Let \( E = \left\{ {{e}_{1},{e}_{2},\ldots ,{e}_{m}}\right\} \) be the set of entities in the knowledge graph, where each \( {e}_{j} \) represents an entity. Let \( R = \left\{ {{r}_{1},{r}_{2},\ldots ,{r}_{p}}\right\} \) be the set of relationships in the knowledge graph, where each \( {r}_{k} \) represents a relationship. Let \( D = \left\{ {{d}_{1},{d}_{2},\ldots ,{d}_{t}}\right\} \) be the set of documents, where each \( {d}_{l} \) represents a document. We define the knowledge-guided retrieval augmented generation problem as follows:

\[
\arg \mathop{\max }\limits_{{{d}_{i} \in D}}{LLM}\left( {{q}_{i},{d}_{i}}\right) ,\mathop{\sum }\limits_{{{e}_{j} \in E}}\mathop{\sum }\limits_{{{r}_{k} \in R}}{KG}\left( {{e}_{j},{r}_{k},{d}_{i}}\right) \tag{1}
\]

where \( \operatorname{LLM}\left( {{q}_{i},{d}_{i}}\right) \) denotes the relevance score between query \( {q}_{i} \) and document \( {d}_{i} \) as assessed by the LLM, and \( \mathrm{{KG}}\left( {{e}_{j},{r}_{k},{d}_{i}}\right) \) represents the degree of match between entity \( {e}_{j} \), relationship \( {r}_{k} \), and document \( {d}_{i} \).

Furthermore, we optimize the knowledge construction and retrieval modules by introducing multi-source line graphs to accelerate knowledge establishment and enhance retrieval robustness. Specifically, the proposed approach is formally defined as follows:

Definition 1. Multi-source data fusion.
Given a set of sources \( H \), there exists data \( D = \{ d, \text{name}, c, \text{meta} \} \), where \( d \) represents the domain of the data, \( c \) represents the content of the data file, name represents the file/attribute name, and meta represents the file metadata. Through a multi-source data fusion algorithm, we can obtain normalized data \( \widehat{D} = \{ id, d, \text{name}, jsc, \text{meta}, (\text{cols\_index}) \} \). Here, \( {id} \) represents the unique identifier for normalization, \( d \) indicates the domain where the data file is located, name denotes the data file name, meta denotes the file metadata, and jsc denotes the file content stored using JSON-LD. If the stored data is structured data or another data format that can use a columnar storage model, the column index cols_index of all attributes is also stored for rapid retrieval and query. Fig. 2 provides an example of the JSON-LD format.

Definition 2. Multi-source line graph [31]. Given a multi-source knowledge graph \( \mathcal{G} \) and a transformed knowledge graph \( {\mathcal{G}}^{\prime } \) (multi-source line graph, MLG), the MLG satisfies the following characteristics:

1) A node in \( {\mathcal{G}}^{\prime } \) represents a triplet.

2) There is an associated edge between any two nodes in \( {\mathcal{G}}^{\prime } \) if and only if the triples represented by these two nodes share a common node.

Based on this definition, it can be inferred that the MLG achieves high aggregation of related nodes, which can greatly improve the efficiency of data retrieval and accelerate subsequent retrieval and query algorithms.

Definition 3. Multi-source homologous data. For any two nodes \( {v}_{1} \) and \( {v}_{2} \) in \( \mathcal{G} \), they are defined as multi-source homologous if and only if they belong to the same retrieval candidate set in a single search.

Definition 4. Homologous node and homologous subgraph.
Given a set of multi-domain homologous data \( {SV} = {\left\{ {v}_{i}\right\} }_{i = 1}^{n} \) in the knowledge graph \( \mathcal{G} \), we define the homologous center node as snode \( = \{ \text{name}, \text{meta}, \text{num}, C\left( v\right) \} \), the set of homologous nodes as \( {U}_{\text{ snode }} \), and the set of homologous edges as \( {E}_{\text{ snode }} \). Here, name represents the common attribute name, meta denotes the identical file metadata, num indicates the number of homologous data instances, and \( C\left( v\right) \) represents the data confidence. We define the association edges between snode and the nodes \( {v}_{i} \) as \( {\left\{ {e}_{i}\right\} }_{i = 1}^{n} \), where each \( {e}_{i} \) carries a weight \( {w}_{i} \) representing the contribution of node \( {v}_{i} \) to the data confidence calculation. Thus, the homologous center node, together with \( {U}_{\text{ snode }} \) and \( {E}_{\text{ snode }} \), forms the homologous subgraph subSG.

---

{
    "@context": "https://json-ld.org/contexts/person.jsonld",
    "@id": "http://dbpedia.org/resource/John_Lennon",
    "name": "John Lennon",
    "born": "1940-10-09",
    "spouse": "http://dbpedia.org/resource/Cynthia_Lennon"
}

---

Fig. 2: Data format of JSON-LD

Definition 5. Homologous triple line graph. All homologous subgraphs within the knowledge graph \( \mathcal{G} \) collectively constitute the homologous knowledge graph \( S\mathcal{G} \). By performing a line graph transformation on the homologous knowledge graph, we obtain the homologous triple line graph \( S{\mathcal{G}}^{\prime } \).

By constructing a homologous triple line graph, multi-source homologous data are aggregated into a single subgraph, centered around homologous nodes, enabling rapid consistency checks and conflict feedback for homologous data. Additionally, the knowledge graph contains a significant number of isolated nodes (i.e., nodes without homologous data), which are also incorporated into the homologous triple line graph.

Definition 6.
Candidate graph confidence and candidate node confidence. For a query \( Q\left( {q,\mathcal{G}}\right) \) on the knowledge graph \( \mathcal{G} \), the corresponding homologous line graph \( S{\mathcal{G}}^{\prime } \) is obtained. The candidate graph confidence estimates the overall credibility of the candidate homologous subgraph; the candidate node confidence assesses the credibility of each individual attribute node.

## III. METHODOLOGY

## A. Framework of MultiRAG

This section elaborates on the implementation of MultiRAG. As shown in Fig. 3, the first step involves segmenting and extracting multi-source data to construct the corresponding MLG, achieving preliminary aggregation of multi-source data; the second step reconstructs the MLG and performs subgraph extraction to identify candidate homologous subgraphs, ensuring consistent storage of homologous data for subsequent hallucination assessment; the third step calculates the graph-level and node-level confidence of the candidate subgraphs, eliminating low-quality nodes to enhance the credibility of the response, and returns the extracted trustworthy subgraphs to the LLM to form the final answer. Finally, these steps are integrated into the Multi-source Line Graph Prompting algorithm, MKLGP.

## B. Multi-source Line Graph Construction

The MultiRAG method initially employs an adapter structure to integrate multi-source data and standardize its storage format. For practical application scenarios, data is directly obtained from various non-homologous formats and transformed into a unified, normalized representation. Specifically, file names and metadata are parsed, and the domains to which the files belong are categorized. Subsequently, the data content is parsed and stored in JSON-LD format, thereby transforming it into linked data.
Finally, unique identifiers are assigned to the data, resulting in normalized datasets. + +![bo_d6oh96c601uc73e30m7g_3_208_135_1385_677_0.jpg](images/bo_d6oh96c601uc73e30m7g_3_208_135_1385_677_0.jpg) + +Fig. 3: Framework of MultiRAG, including three modules. + +Specifically, a unique adapter is designed for each distinct data format to facilitate data parsing. Although the implementation frameworks of these adapters are largely similar, it is essential to differentiate between the parsing processes for structured, semi-structured, and unstructured data. + +For structured data, parsing involves storing tabular information in JSON format, where attribute variables within the file are managed using a Decomposition Storage Model (DSM). This approach enables the extraction of all attribute information for consistency checks through the use of column indices. In the case of semi-structured data, parsing corresponds to storing tree-shaped data in JSON format with multi-layer nested structures. This data format lacks column indices and does not support fast retrieval, necessitating the use of tree or graph retrieval algorithms, such as DFS, for efficient searching. Finally, for unstructured data, the focus is currently limited to textual information, which is stored directly. Subsequent steps involve leveraging LLMs for entity and relationship extraction tasks to obtain the relevant information. + +The final integration of multi-source data can be expressed by the following formula: + +\[ +{D}_{\text{ Fusion }} = \mathop{\bigcup }\limits_{{i = 1}}^{n}{A}_{i}\left( {D}_{i}\right) \tag{2} +\] + +where \( {A}_{i} \in \left\{ {{Ad}{a}_{\text{ stru }},{Ad}{a}_{\text{ semi-s }},{Ad}{a}_{\text{ unstru }}}\right\} \) , representing the adapter parsing functions for structured data, semi-structured data, and unstructured data, respectively. 
\( {D}_{i} \in \left\{ {{D}_{\text{ stru }},{D}_{\text{ semi-s }},{D}_{\text{ unstru }}}\right\} \) represents the original datasets of structured data, semi-structured data, and unstructured data, respectively.

Through the parsed data \( {D}_{\text{ Fusion }} = \left\{ {{E}_{\mathrm{q}},{R}_{\mathrm{q}}}\right\} \), we further extract key information and link it to the knowledge graph. The knowledge construction process involves three key phases implemented through the OpenSPG framework [26], [32], in which we use the Custom Prompt module \( {}^{2} \) to integrate LLM-based knowledge extraction.

For entity recognition, we utilize the ner.py prompts within the kag/builder/prompt/default directory. We first define relevant entity types in the schema. Then, by adjusting the example.input and example.output in the ner.py prompts, we guide the LLM-based SchemaFreeExtractor to identify entities accurately.

In relationship extraction, the triple.py prompts play a crucial role. We define relationships in the schema and use the triple_prompt in the SchemaFreeExtractor. The instruction in triple.py ensures that the extracted Subject-Predicate-Object (SPO) triples are related to the entities in the entity_list, enabling effective relationship extraction.

Regarding attribute extraction, we rely on the entity standardization prompts in std.py. After entity recognition, the std_prompt in the SchemaFreeExtractor standardizes the entities and helps in extracting their attributes. We modify the example.input, example.named_entities, and example.output in std.py according to our data characteristics to optimize the attribute extraction process. Through these steps of customizing and applying OpenSPG's prompts, we achieve efficient knowledge extraction.
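The adapter dispatch of Eq. (2) can be sketched as follows. This is a hedged illustration of the fusion idea only: the function names (`ada_stru`, `ada_semi_s`, `ada_unstru`, `fuse`) and the toy records are our own assumptions, not MultiRAG's actual implementation.

```python
import json

# Illustrative adapters for Eq. (2): each maps a raw record into a
# normalized JSON-LD-like dict, as in Definition 1.
def ada_stru(record):
    # Structured (tabular) data: keep a column index for fast lookup.
    return {"jsc": json.dumps(record), "cols_index": sorted(record)}

def ada_semi_s(record):
    # Semi-structured (tree-shaped) data: store nested JSON as-is;
    # no column index, so lookups need tree/graph traversal.
    return {"jsc": json.dumps(record)}

def ada_unstru(text):
    # Unstructured text: stored directly; entities and relations
    # are extracted later by the LLM.
    return {"jsc": json.dumps({"text": text})}

def fuse(datasets):
    # D_Fusion = union over sources of A_i(D_i), with unique ids.
    fused = []
    for adapter, data in datasets:
        for i, d in enumerate(data):
            fused.append({"id": f"{adapter.__name__}_{i}", **adapter(d)})
    return fused

fused = fuse([
    (ada_stru,   [{"flight": "CA981", "status": "delayed"}]),
    (ada_unstru, ["Flight CA981 departed on time."]),
])
```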
The following formula describes the data extraction process:

\[
{KB} = \mathop{\sum }\limits_{{D}_{i}}\left( {\left\{ {{e}_{1},{e}_{2},\ldots ,{e}_{m}}\right\} \sqcup \left\{ {{r}_{1},{r}_{2},\ldots ,{r}_{n}}\right\} }\right) \tag{3}
\]

---

https://github.com/OpenSPG/openspg

https://openspg.yuque.com/

---

![bo_d6oh96c601uc73e30m7g_4_178_142_662_431_0.jpg](images/bo_d6oh96c601uc73e30m7g_4_178_142_662_431_0.jpg)

Fig. 4: Example of multi-source line graph transformation

## C. Homologous Subgraph Matching

After the preliminary extraction of information, the next step is to identify the multi-source homologous data group set \( \mathcal{{SV}}s \) and the isolated point set \( \mathcal{{LV}}s \). This process begins by initializing the unvisited node set \( {\mathcal{U}}_{\text{ unvisited }} = \mathcal{V} \), while setting the homologous data group set \( \mathcal{{SV}}s = \varnothing \) and the isolated point set \( \mathcal{{LV}}s = \varnothing \). We traverse all nodes, retrieving node information from the various domains; for matched homologous data, we construct the homologous node \( s{g}_{i} \) and its corresponding associated edge \( {e}_{i} \), adding them to the homologous node set \( {\mathcal{U}}_{sg} \) and edge set \( {\mathcal{E}}_{sg} \), respectively. After the traversal, \( \left( {{\mathcal{U}}_{sg},{\mathcal{E}}_{sg}}\right) \) is added to \( \mathcal{{SV}}s \). If no homologous data is obtained after one round of traversal, the node is added to the isolated point set \( \mathcal{{LV}}s \). Once a node has been processed, it is removed from the \( {\mathcal{U}}_{\text{ unvisited }} \) set. The time complexity of homologous subgraph matching is \( O\left( {n\log n}\right) \), where \( n \) is the number of nodes in the knowledge graph \( \mathcal{G} \).
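The grouping step of homologous subgraph matching can be sketched in a few lines. This is a simplified illustration under our own assumptions: nodes are dicts, and two nodes are treated as homologous when they share an attribute name and metadata key (in the spirit of Definitions 3 and 4); the real algorithm operates on the knowledge graph with per-domain retrieval.

```python
from collections import defaultdict

def match_homologous(nodes):
    # Group nodes sharing (name, meta); groups of size > 1 become
    # homologous data groups (SVs), singletons become isolated
    # points (LVs), mirroring the traversal described above.
    groups = defaultdict(list)
    for node in nodes:
        groups[(node["name"], node["meta"])].append(node["value"])
    svs = {k: v for k, v in groups.items() if len(v) > 1}
    lvs = [v[0] for v in groups.values() if len(v) == 1]
    return svs, lvs

# Toy nodes inspired by the paper's CA981 flight example.
nodes = [
    {"name": "arrival_time", "meta": "CA981", "value": "18:05"},
    {"name": "arrival_time", "meta": "CA981", "value": "18:30"},
    {"name": "aircraft",     "meta": "CA981", "value": "A330"},
]
svs, lvs = match_homologous(nodes)
# The two conflicting arrival times form one homologous group;
# the lone "aircraft" node is an isolated point.
```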
For each homologous subgraph in \( \mathcal{{SV}}s \), the homologous linear knowledge subgraph \( {sub}\mathcal{{SG}}{}^{\prime }{}_{i} \) is constructed by utilizing the homologous node set \( {\mathcal{U}}_{sg} \) and the homologous edge set \( {\mathcal{E}}_{sg} \). Subsequently, all \( {sub}\mathcal{{SG}}{}^{\prime }{}_{i} \) and the isolated point set \( \mathcal{{LV}}s \) are aggregated to obtain the homologous linear knowledge graph \( S{\mathcal{G}}^{\prime } \). It should be noted that \( S{\mathcal{G}}^{\prime } \) is solely used for consistency checks and retrieval queries of homologous data; other types of queries still operate on the original knowledge graph \( \mathcal{G} \).

Here, we provide a simple example of a homologous triple line graph. As shown in Fig. 4, a homologous node is associated with 4 homologous data points. After transformation into a triple line graph, it forms a complete graph of order 4, indicating that the four triples are pairwise homologous.

## D. Multi-level Confidence Computing

We define the candidate data from different domains obtained in a single retrieval as multi-source homologous data. These data have been extracted into a homologous line graph for temporary storage. Although targeting the same query object, they often provide inconsistent reference answers. Considering the varying retrieval errors, a multi-level confidence calculation method is adopted in this framework. First, the confidence of individual homologous line graphs is calculated, followed by the confidence of each candidate node, to determine the final set of answer candidates.

1) Graph-Level Confidence Computing: In the first stage, a confidence calculation method based on mutual information entropy is introduced to assess the confidence of homologous line graphs.
The core idea of this method is that if two nodes with the same attributes in a homologous line graph are close in content, their similarity is high, and thus their confidence is also high; conversely, if they are not, their confidence is low. + +Let \( \mathcal{G} \) be a homologous line graph, and \( \mathcal{N}\left( \mathcal{G}\right) \) be the set of nodes in the graph. For any two nodes \( {v}_{i},{v}_{j} \in \mathcal{N}\left( \mathcal{G}\right) \) with the same attributes, the similarity \( S\left( {{v}_{i},{v}_{j}}\right) \) between them is defined based on the calculation method of mutual information entropy. The mutual information entropy \( I\left( {{v}_{i},{v}_{j}}\right) \) measures the interdependence of the attribute content of the two nodes, and its calculation formula is: + +\[ +I\left( {{v}_{i},{v}_{j}}\right) = \mathop{\sum }\limits_{{x \in {V}_{i}}}\mathop{\sum }\limits_{{y \in {V}_{j}}}p\left( {x, y}\right) \log \left( \frac{p\left( {x, y}\right) }{p\left( x\right) p\left( y\right) }\right) \tag{4} +\] + +where \( {V}_{i} \) and \( {V}_{j} \) are the sets of attribute values for nodes \( {v}_{i} \) and \( {v}_{j} \) , respectively, \( p\left( {x, y}\right) \) is the joint probability distribution of \( {v}_{i} \) taking attribute value \( x \) and \( {v}_{j} \) taking attribute value \( y \) , and \( p\left( x\right) \) and \( p\left( y\right) \) are the marginal probability distributions of \( x \) and \( y \) , respectively. 
+ +The similarity \( S\left( {{v}_{i},{v}_{j}}\right) \) can be defined as the normalized form of mutual information entropy to ensure that its value lies within the interval \( \left\lbrack {0,1}\right\rbrack \) : + +\[ +S\left( {{v}_{i},{v}_{j}}\right) = \frac{I\left( {{v}_{i},{v}_{j}}\right) }{H\left( {V}_{i}\right) + H\left( {V}_{j}\right) } \tag{5} +\] + +where \( H\left( {V}_{i}\right) \) and \( H\left( {V}_{j}\right) \) are the entropies of the attribute value sets of nodes \( {v}_{i} \) and \( {v}_{j} \) , respectively, calculated as: + +\[ +H\left( V\right) = - \mathop{\sum }\limits_{{x \in V}}p\left( x\right) \log p\left( x\right) \tag{6} +\] + +Subsequently, the confidence \( C\left( \mathcal{G}\right) \) of the homologous line graph \( \mathcal{G} \) can be determined by calculating the average similarity \( S\left( {{v}_{i},{v}_{j}}\right) \) of all node pairs in the graph: + +\[ +C\left( \mathcal{G}\right) = \frac{1}{{\left| \mathcal{N}\left( \mathcal{G}\right) \right| }^{2} - \left| {\mathcal{N}\left( \mathcal{G}\right) }\right| }\mathop{\sum }\limits_{{{v}_{i} \in \mathcal{N}\left( \mathcal{G}\right) }}\mathop{\sum }\limits_{\substack{{{v}_{j} \in \mathcal{N}\left( \mathcal{G}\right) } \\ {j \neq i} }}S\left( {{v}_{i},{v}_{j}}\right) \tag{7} +\] + +where \( \left| {\mathcal{N}\left( \mathcal{G}\right) }\right| \) denotes the number of nodes in the graph. Notably, a homologous line graph exhibiting high confidence demonstrates that its constituent nodes maintain strong attribute-level consistency across their content representations. + +2) Node-Level Confidence Computing: In the second phase, the confidence of individual node \( C\left( v\right) \) is calculated, which takes into account the node's consistency, authority, and historical confidence. The following are the detailed calculation methods and formulas. 
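The graph-level computation in Eqs. (4)-(7) can be sketched in a few lines. This is a minimal illustration (function names are ours), assuming each node carries an aligned list of attribute observations so that the joint distribution in Eq. (4) is well defined:

```python
from collections import Counter
from math import log

def entropy(values):
    """Shannon entropy H(V) of a list of attribute values (Eq. 6)."""
    n = len(values)
    return -sum((c / n) * log(c / n) for c in Counter(values).values())

def mutual_information(vi, vj):
    """Mutual information I(v_i, v_j) over paired attribute values (Eq. 4)."""
    n = len(vi)
    joint = Counter(zip(vi, vj))
    pi, pj = Counter(vi), Counter(vj)
    return sum((c / n) * log((c / n) / ((pi[x] / n) * (pj[y] / n)))
               for (x, y), c in joint.items())

def similarity(vi, vj):
    """Normalized similarity S(v_i, v_j) in [0, 1] (Eq. 5)."""
    denom = entropy(vi) + entropy(vj)
    return mutual_information(vi, vj) / denom if denom > 0 else 0.0

def graph_confidence(nodes):
    """Graph confidence C(G): average similarity over ordered node pairs (Eq. 7)."""
    m = len(nodes)
    if m < 2:
        return 0.0
    total = sum(similarity(nodes[i], nodes[j])
                for i in range(m) for j in range(m) if i != j)
    return total / (m * m - m)
```

Note that under this normalization two nodes with identical attribute values reach \( S = 0.5 \) (since \( I = H \) and the denominator is \( 2H \)), while statistically independent values give \( S = 0 \).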
Algorithm 1 Multi-level Confidence Computing Algorithm

---

procedure CONFIDENCE_COMPUTING\( \left( v, D\right) \)

  \( S_{n}\left( v\right) \leftarrow \) Equation (8)

  \( \mathrm{Auth}_{LLM}\left( v\right) \leftarrow \) Equation (10)

  \( \mathrm{Auth}_{hist}\left( v\right) \leftarrow \) Equation (11)

  \( A\left( v\right) \leftarrow \) Equation (9)

  \( C\left( v\right) \leftarrow S_{n}\left( v\right) + A\left( v\right) \)

  return \( C\left( v\right) \)

end procedure

procedure MCC\( \left( \mathcal{G}, Q, \mathcal{D}\right) \)

  \( \mathcal{SV}s \leftarrow \varnothing, \mathcal{LV}s \leftarrow \varnothing \)

  \( \mathcal{U}_{unvisited} \leftarrow V \)

  while \( \mathcal{U}_{unvisited} \neq \varnothing \) do

    \( v \leftarrow \) pop a node from \( \mathcal{U}_{unvisited} \)

    for all \( D \in \mathcal{D} \) do

      if \( v \in \operatorname{Data}\left( Q, sub\mathcal{SG}^{\prime}_{i}\right) \) then

        \( C\left( v\right) \leftarrow \) CONFIDENCE_COMPUTING\( \left( v, D\right) \)

        if \( C\left( v\right) > \theta \) then

          \( \mathcal{U}_{sg} \leftarrow \mathcal{U}_{sg} \cup \{ v\} \)

          \( \mathcal{E}_{sg} \leftarrow \mathcal{E}_{sg} \cup \{ e_{i}\} \)

        else

          \( \mathcal{LV}s \leftarrow \mathcal{LV}s \cup \{ v\} \)

        end if

      end if

    end for

    if \( \mathcal{U}_{sg} \neq \varnothing \) then

      \( \mathcal{SV}s \leftarrow \mathcal{SV}s \cup \left( \mathcal{U}_{sg}, \mathcal{E}_{sg}\right) \)

      \( \mathcal{U}_{sg} \leftarrow \varnothing, \mathcal{E}_{sg} \leftarrow \varnothing \)

    end if

  end while

  return \( \mathcal{SV}s, \mathcal{LV}s \)

end procedure

---

a) Node Consistency Score: The node consistency score \( S_{n}\left( v\right) \) reflects the consistency of the node across different data sources. We use mutual information entropy to calculate the similarity between node pairs, thereby assessing consistency. For a node \( v \), its consistency score can be expressed as:

\[
S_{n}\left( v\right) = \frac{1}{\left| N\left( v\right) \right|} \sum_{u \in N\left( v\right)} S\left( v, u\right) \tag{8}
\]

where \( N\left( v\right) \) is the set of nodes with the same attributes as node \( v \), and \( S\left( v, u\right) \) is the similarity between nodes \( v \) and \( u \) as defined in Equation (5).

b) Node Authority Score: The authority score reflects the importance and authenticity of a node and is divided into two parts: the node's authority as assessed by an expert LLM and the node's historical authority. The authority score \( A\left( v\right) \) is calculated as:

\[
A\left( v\right) = \alpha \cdot \mathrm{Auth}_{LLM}\left( v\right) + \left( 1 - \alpha\right) \cdot \mathrm{Auth}_{hist}\left( v\right) \tag{9}
\]

where \( \alpha \) is a weight coefficient that balances the contributions of LLM-assessed authority and historical authority, satisfying \( 0 \leq \alpha \leq 1 \).
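The node-level scores combine as follows. This is a minimal sketch in which `similarity` is the pairwise function of Eq. (5), and the two authority terms (Eqs. (10) and (11)) are assumed to be precomputed inputs; all names are illustrative, not from the paper:

```python
def node_consistency(v_values, neighbors_values, similarity):
    """Consistency score S_n(v): mean similarity between node v and the
    nodes that share its attributes (Eq. 8)."""
    if not neighbors_values:
        return 0.0
    return sum(similarity(v_values, u) for u in neighbors_values) / len(neighbors_values)

def node_authority(auth_llm, auth_hist, alpha=0.5):
    """Authority score A(v): convex combination of LLM-assessed and
    historical authority (Eq. 9), with 0 <= alpha <= 1."""
    return alpha * auth_llm + (1 - alpha) * auth_hist

def node_confidence(v_values, neighbors_values, auth_llm, auth_hist,
                    similarity, alpha=0.5):
    """C(v) = S_n(v) + A(v), following Algorithm 1."""
    return (node_consistency(v_values, neighbors_values, similarity)
            + node_authority(auth_llm, auth_hist, alpha))
```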
Algorithm 2 Multi-source Knowledge Line Graph Prompting

---

procedure MKLGP\( \left( q\right) \)

  \( E_{q}, R_{q} \leftarrow \) Logic Form Generation\( \left( q\right) \)

  \( D_{q} \leftarrow \) Multi Document Extraction\( \left( V_{q}\right) \)

  \( \mathcal{SG}^{\prime} \leftarrow \operatorname{Prompt}\left( D_{q}\right) \)

  \( \mathcal{SV}s, \mathcal{LV}s \leftarrow \operatorname{MCC}\left( \mathcal{SG}^{\prime}, q, D_{q}\right) \)

  \( C_{nodes}, \mathcal{G}_{A} \leftarrow \operatorname{Prompt}\left( \mathcal{SV}s, \mathcal{LV}s\right) \)

  Answer \( \leftarrow \) Generating Trustworthy Answers\( \left( C_{nodes}, \mathcal{G}_{A}\right) \)

  return Answer

end procedure

---

Benefiting from the knowledge-credibility calculation idea of PTCA [33], \( \mathrm{Auth}_{LLM}\left( v\right) \) is assessed from the global influence and local connection strength of the node. The LLM comprehensively estimates the credibility of knowledge by integrating the association strength between entities, entity type information, and multi-step path information:

\[
\mathrm{Auth}_{LLM}\left( v\right) = \frac{1}{1 + e^{-\beta \cdot C_{LLM}\left( v\right)}} \tag{10}
\]

where \( C_{LLM}\left( v\right) \) is the authority score provided by the LLM for node \( v \), normalized by the average \( C_{LLM} \) value over all nodes, and \( \beta \) is a parameter that controls the steepness of the scoring curve.

c) Historical Authority: \( \mathrm{Auth}_{hist}\left( v\right) \) is an authority score based on the node's historical data. Inspired by Zhu's work [34], we use the credibility of historical data sources together with the current query-related data for an incremental estimate.

\[
\mathrm{Auth}_{hist}\left( v\right) = \frac{\mathcal{H} \cdot \Pr^{h}\left( D\right) + \sum_{v_{p} \in D_{v}\left\lbrack q\right\rbrack} \Pr\left( v_{p}\right)}{\mathcal{H} + \left| \operatorname{Data}\left( q, sub\mathcal{SG}^{\prime}_{i}\right) \right|} \tag{11}
\]

where \( \mathcal{H} \) is the number of entities provided by data source \( D \) for all historical queries, \( \Pr^{h}\left( D\right) \) is the historical credibility of data source \( D \), \( D_{v}\left\lbrack q\right\rbrack \) is the set of correct answers, and \( \operatorname{Data}\left( q, sub\mathcal{SG}^{\prime}_{i}\right) \) is the query-related data obtained from the multi-source line subgraph.

Ultimately, we designed the multi-level confidence computing algorithm, MCC, to calculate the credibility of the data sources in the homologous subgraph, ensuring the quality of the knowledge graph embedded in the LLM. The algorithm is shown in Algorithm 1.

It should be noted that the MCC algorithm does not directly provide the final graph and node confidences; these values are obtained through prompting to produce the ultimate results.

## E. Multi-source Knowledge Line Graph Prompting

We propose the Multi-source Knowledge Line Graph Prompting (MKLGP) algorithm for multi-source data retrieval. Given a user query \( q \), an LLM is first employed to extract the intent, entities, and relationships from \( q \) and to generate the corresponding logical relationships. The dataset then undergoes multi-document filtering to derive text chunks, followed by the construction of a Multi-source Line Graph (MLG) for knowledge aggregation. Further, the algorithm matches homologous subgraphs and utilizes the MCC algorithm to obtain a set of credible query nodes and isolated points \( \mathcal{SV}s, \mathcal{LV}s \).
Finally, the graph confidence is obtained via prompting, and the node confidence is calculated to enhance the credibility of the answer. The results are then embedded into the context of the LLM to generate a credible retrieval answer.

## IV. EXPERIMENTS

This section conducts experiments and performance analysis on the construction of homologous line graphs and the multi-level confidence computing modules. MultiRAG is compared with other SOTA multi-document retrieval QA methods, data fusion methods, and KBQA methods. Extensive experiments assess the robustness and efficiency of MultiRAG, aiming to answer the following questions.

- Q1: How does the retrieval recall performance of MultiRAG compare with other data fusion models and SOTA data retrieval models?

- Q2: What are the respective impacts of data sparsity and data inconsistency on the quality of retrieval recall?

- Q3: How effective are the two modules of MultiRAG individually?

- Q4: How does MultiRAG perform on multi-hop Q&A datasets after incorporating multi-level confidence computing?

- Q5: What are the time costs of the various modules in MultiRAG?

## A. Experimental Settings

a) Datasets: To validate the efficiency of multi-source line graph construction and its enhancement of retrieval performance, we conduct multi-source data fusion experiments on four real-world benchmark datasets [35]-[37], as shown in Table I. (1) The movie dataset comprises movie data collected from 13 sources. (2) The book dataset includes book data from 10 sources. (3) The flight dataset gathers information on over 1,200 flights from 20 sources. (4) The stock dataset collects transaction data for 1,000 stock symbols from 20 sources. In the experiments, we issue 100 queries for each of the four datasets to verify retrieval efficiency.

It is noteworthy that the Movies and Flights datasets are relatively dense, while the Books and Stocks datasets are relatively sparse, which can affect model performance.

Additionally, to validate the robustness of MultiRAG on complex Q&A datasets, we selected two multi-hop question answering datasets, HotpotQA [38] and 2WikiMultiHopQA [39]. Both are constructed from Wikipedia documents, allowing us to use a consistent document corpus and retriever to provide external references for LLMs. Considering the constraints of experimental costs, we subsampled 300 questions from the validation set of each experimental dataset.

TABLE I: Statistics of the datasets preprocessed
| Datasets | Data source | Sources | Entities | Relations | Queries |
|---|---|---|---|---|---|
| Movies | JSON(J) | 4 | 19701 | 45790 | 100 |
| | KG(K) | 5 | 100229 | 264709 | |
| | CSV(C) | 4 | 70276 | 184657 | |
| Books | JSON(J) | 3 | 3392 | 2824 | 100 |
| | CSV(C) | 3 | 2547 | 1812 | |
| | XML(X) | 4 | 2054 | 1509 | |
| Flights | CSV(C) | 10 | 48672 | 100835 | 100 |
| | JSON(J) | 10 | 41939 | 89339 | |
| Stocks | CSV(C) | 10 | 7799 | 11169 | 100 |
| | JSON(J) | 10 | 7759 | 10619 | |
b) Evaluation Metrics: To assess effectiveness, we adopt the F1 score as the evaluation metric for the data fusion results, following previous experimental setups [37], [40]-[42]. The F1 score is the harmonic mean of precision (P) and recall (R), calculated as follows:

\[
F1 = 2 \times \frac{P \times R}{P + R} \tag{12}
\]

Furthermore, to evaluate the retrieval credibility of the MKLGP algorithm, we use the recall metric, specifically Recall@K, to assess performance at three distinct stages: before subgraph filtering, before node filtering, and after node filtering. In addition, we employ the query response time \( T \) (measured in seconds) as a metric to verify the efficiency of knowledge aggregation.

c) Hyper-parameter Settings: For all baselines, we carefully adjusted the parameters according to the characteristics of MultiRAG. All methods were implemented in a Python 3.10 and CUDA 11.6 environment. Except for the experiments using GPT-3.5-Turbo for CoT, all other runs used Llama3-8B-Instruct as the base model. For each data format, after slicing into chunks, we stored the slice numbers, data source locations, and transformed triple nodes of the multi-source line graph in JSON-LD format, enabling simple cross-indexing.

For hyperparameter settings, the temperature parameter \( \beta \) (controlling the steepness of Equation (10)) was set to 0.5. The number of entities in historical queries was initialized to 50, the initial node confidence threshold was set to 0.7, and the graph confidence threshold was set to 0.5. All experiments were conducted on a device equipped with an Intel(R) Core(TM) Ultra 9 185H at 2.30 GHz and 512 GB of memory.

d) Baseline Models: To demonstrate the superiority of the MultiRAG method, we compare it with basic data fusion methods and SOTA methods, including multi-document question-answering methods and knowledge base question-answering methods.

Thanks to Zhu's work\( {}^{3} \) [34], we compare with the following baseline methods:

---

\( {}^{3} \) https://github.com/JunHao-Zhu/FusionQuery

---

TABLE II: Comparison with baseline methods and SOTA methods for multi-source knowledge fusion
Columns group into Data Fusion Methods (Baseline), SOTA Methods, and Our Method (MCC); each cell reports F1/% (Time/s).

| Datasets | Data source | TF | LTM | IR-CoT | MDQA | ChatKBQA | FusionQuery | MCC |
|---|---|---|---|---|---|---|---|---|
| Movies | J/K | 37.1 (9717) | 41.4 (1995) | 43.2 (1567) | 46.2 (1588) | 45.1 (3809) | 53.2 (122.4) | 52.6 (98.3) |
| | J/C | 41.9 (7214) | 42.9 (1884) | 45.0 (1399) | 44.5 (1360) | 42.7 (3246) | 52.7 (183.1) | 54.3 (75.1) |
| | K/C | 37.8 (2199) | 41.2 (1576) | 37.6 (1014) | 45.2 (987) | 40.4 (2027) | 42.5 (141.0) | 49.1 (86.0) |
| | J/K/C | 36.6 (11225) | 40.8 (2346) | 41.5 (2551) | 49.8 (2264) | 44.7 (5151) | 53.6 (137.8) | 54.8 (157) |
| Books | J/C | 40.2 (1017) | 42.4 (195.3) | 35.2 (147.6) | 55.7 (124.2) | 56.1 (165.0) | 58.5 (22.7) | 63.5 (13.66) |
| | J/X | 35.5 (1070) | 35.6 (277.7) | 36.1 (178.7) | 55.1 (115.6) | 54.7 (200.1) | 57.9 (20.6) | 63.1 (13.78) |
| | C/X | 43.0 (1033) | 44.1 (232.6) | 42.6 (184.5) | 57.2 (115.6) | 55.6 (201.4) | 60.3 (21.5) | 64.2 (13.54) |
| | J/C/X | 37.3 (2304) | 41.0 (413.2) | 40.4 (342.6) | 56.4 (222.6) | 57.1 (394.1) | 59.1 (47.0) | 66.8 (27.4) |
| Flights | C/J | 27.3 (6049) | 79.1 (14786) | 58.3 (214.0) | 76.5 (360) | 76.8 (376) | 74.2 (20.2) | 74.9 (80) |
| Stocks | C/J | 68.4 (2.30) | 19.2 (1337) | 64.8 (53.3) | 65.2 (78.4) | 64.0 (88.9) | 68.0 (0.33) | 78.6 (12.1) |
* The F1 score is for Q1 and time is for Q5.

* Bold represents the optimal metrics, while underlined text indicates the sub-optimal metrics. The same applies to the following text.

1) TruthFinder (TF) [37]: the classic iterative data fusion method.

2) LTM [42]: the probabilistic data fusion method.

3) CoT [43]: a foundational approach that involves step-by-step reasoning to reach a conclusion; we use GPT-3.5-Turbo as the base model.

4) Standard RAG [2]: a method that combines the strengths of retrieval and generation models to answer questions.

Moreover, we also summarize the SOTA methods below:

- IRCoT [44] is an advanced method that refines the reasoning process through iterative retrieval.

- ChatKBQA [45] is a conversational interface-based method for knowledge base question answering.

- MDQA [46] is a method designed to extract answers from multiple documents effectively.

- FusionQuery [34] is a SOTA method based on the efficient on-demand fusion query framework.

- RQ-RAG [47] is a method that integrates external documents and optimizes the query process to handle complex queries.

- MetaRAG [9] is a method that employs metacognitive strategies to enhance the retrieval process.

e) Dataset Preprocessing: To better align the datasets with real-world application scenarios and to demonstrate the applicability of the proposed method to multi-source data, we split and reconstructed the four datasets into three categories of data formats: tabular data (structured), nested JSON data (semi-structured), and XML data (semi-structured), stored in csv, json, and xml file formats, respectively. We also retained some data stored directly in KG format. Table I displays the detailed statistics after the dataset division.

## B. Evaluation of Multi-source Knowledge Aggregation (MKA)

## Q1: How does the retrieval recall performance of MultiRAG compare with other data fusion models and SOTA data retrieval models?

To validate the effectiveness of the multi-source knowledge aggregation module (MKA) in MultiRAG, we assess it using F1 scores and query times across four multi-source query datasets. By substituting the fusion query algorithm with different baseline models and SOTA models, multiple sets of experimental results are obtained to evaluate its performance in multi-domain querying. Table II summarizes the data querying performance of MKLGP and the baselines on the four datasets; Q1 focuses solely on the F1 scores of the methods, covering four data fusion methods and three SOTA methods that support data fusion.

Table II demonstrates that the MCC module outperforms all comparative models across the four datasets. Experimental results indicate that it achieves an F1 score more than 10% higher than the best baseline data fusion model and obtains superior performance compared to the other baselines. The MV method performs poorly on all datasets because it can only return a single answer per query, which fails to accommodate the common scenario where a query has multiple return values; for instance, a movie or a book typically has multiple directors or authors. However, the majority of methods perform significantly better on the Movies and Flights datasets than on the Books and Stocks datasets. This is because the Movies and Flights datasets are inherently denser, and previous SOTA models can match or outperform our approach when knowledge is abundant, which is acceptable. In contrast, on the sparser Books and Stocks datasets, our method achieves an average improvement of more than 10% over SOTA methods.

## Q2: What are the respective impacts of data sparsity and data inconsistency on the quality of retrieval recall?
MultiRAG demonstrates good robustness under varying data sparsity and inconsistency. To validate this, we conducted experiments from the following two perspectives. 1) Sparsity of multi-source data: we applied 30%, 50%, and 70% random relationship masking to the four pre-processed datasets, making the connections between data sparser while ensuring that the query answers remain retrievable. 2) Consistency of multi-source data: we added 30%, 50%, and 70% triple increments (the new triples are copies of the original triples) to the four pre-processed datasets and completely shuffled the relationship edges of the added triples to disrupt the consistency of the multi-source data. Subsequently, we ran MultiRAG on the datasets under both perturbation schemes.

TABLE III: Ablation experiments of multi-source knowledge aggregation (MKA) and multi-level confidence computing (MCC)
Each cell reports F1/% (QT/s, PT/s).

| Datasets | Source | MultiRAG | w/o MKA | w/o Graph Level | w/o Node Level | w/o MCC |
|---|---|---|---|---|---|---|
| Movies | J/K | 52.6 (25.7, 62.64) | 48.2 (2783, 62.64) | 45.3 (50.1, 58.2) | 38.7 (21.3, 0.31) | 31.6 (25.7, 0.28) |
| | J/C | 54.3 (12.7, 61.36) | 49.1 (1882, 61.36) | 46.8 (28.9, 57.4) | 40.2 (10.5, 0.29) | 30.5 (12.7, 0.29) |
| | K/C | 49.1 (31.6, 64.40) | 45.5 (4233, 64.40) | 42.7 (65.3, 61.8) | 35.9 (28.4, 0.27) | 33.1 (31.6, 0.29) |
| | J/K/C | 54.8 (39.2, 60.8) | 47.5 (4437, 60.8) | 48.1 (75.6, 56.2) | 41.5 (35.8, 0.30) | 34.7 (39.2, 0.32) |
| Books | J/C | 63.5 (1.19, 2.47) | 57.1 (11.9, 2.47) | 55.2 (4.7, 2.12) | 49.8 (0.92, 0.18) | 43.4 (1.19, 0.22) |
| | J/X | 63.1 (1.22, 2.56) | 59.3 (11.7, 2.62) | 54.7 (5.1, 2.24) | 48.3 (0.89, 0.19) | 42.6 (1.22, 0.22) |
| | C/X | 64.2 (1.16, 2.38) | 55.3 (8.39, 2.38) | 53.9 (3.9, 2.05) | 47.1 (0.85, 0.16) | 41.0 (1.16, 0.17) |
| | J/C/X | 66.8 (1.31, 3.07) | 57.2 (15.8, 3.08) | 59.4 (6.3, 2.89) | 52.7 (1.12, 0.21) | 36.4 (1.31, 0.20) |
| Flights | C/J | 74.9 (29.8, 109.9) | 72.2 (NAN, 109.9) | 68.3 (142.7, 98.5) | 61.4 (25.3, 0.85) | 52.1 (29.8, 1.07) |
| Stocks | C/J | 78.6 (2.72, 5.36) | 69.6 (450.8, 5.36) | 72.1 (8.9, 4.12) | 65.3 (1.98, 0.15) | 45.4 (2.72, 0.17) |
TABLE IV: Performance comparison on HotpotQA and 2WikiMultiHopQA datasets

| Method | HotpotQA Precision | HotpotQA Recall@5 | 2WikiMultiHopQA Precision | 2WikiMultiHopQA Recall@5 |
|---|---|---|---|---|
| Standard RAG | 34.1 | 33.5 | 25.6 | 26.2 |
| GPT-3.5-Turbo+CoT | 33.9 | 47.2 | 35.0 | 45.1 |
| IRCoT | 41.6 | 41.2 | 42.3 | 40.9 |
| ChatKBQA | 47.8 | 42.1 | 46.5 | 43.7 |
| MDQA | 48.6 | 52.5 | 44.1 | 45.8 |
| RQ-RAG | 51.6 | 49.3 | 45.3 | 44.6 |
| MetaRAG | 51.1 | 49.9 | 50.7 | 52.2 |
| MultiRAG | 59.3 | 62.7 | 55.7 | 61.2 |
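For concreteness, the two perturbation schemes of Q2 (random relationship masking and shuffled triple increments) can be sketched on a toy triple list. `mask_relations` and `add_inconsistent_triples` are our illustrative names for the procedures described earlier, not code from the paper:

```python
import random

def mask_relations(triples, ratio, answer_triples=(), seed=0):
    """Sparsity perturbation: randomly drop `ratio` of the relationship
    edges while keeping the triples needed to answer queries retrievable."""
    rng = random.Random(seed)
    keep_always = set(answer_triples)
    candidates = [t for t in triples if t not in keep_always]
    dropped = set(rng.sample(candidates, int(len(candidates) * ratio)))
    return [t for t in triples if t not in dropped]

def add_inconsistent_triples(triples, ratio, seed=0):
    """Consistency perturbation: append `ratio` copies of the original
    triples with their relation slot shuffled, breaking agreement."""
    rng = random.Random(seed)
    copies = [list(t) for t in rng.sample(triples, int(len(triples) * ratio))]
    relations = [t[1] for t in triples]
    for c in copies:
        c[1] = rng.choice(relations)  # scramble the relationship edge
    return triples + [tuple(c) for c in copies]
```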
Firstly, to address data sparsity, we conducted experiments on MultiRAG (ours) and ChatKBQA (SOTA). The experimental results demonstrate that MultiRAG exhibits significant robustness when faced with the challenge of data sparsity.

Specifically, after applying 30%, 50%, and 70% relationship masking, the F1 score of MultiRAG on the Books dataset only dropped from 66.8% to 60.0%. On the Stocks dataset, its F1 score decreased from 78.6% to 71.0%, as shown in Fig. 5b and Fig. 5d. These moderate decreases indicate that MultiRAG effectively maintains its performance even when a substantial number of relationships are masked.

In contrast, ChatKBQA's performance decline under the same conditions is more pronounced. On the Books dataset, its F1 score dropped from 59.1% to 53.0%, and on the Stocks dataset from 68.0% to 62.0%. This outcome reveals the challenges ChatKBQA faces when dealing with sparse data, especially when a large number of data connections are masked.

Next, we conducted robustness experiments on multi-source data consistency. We perturbed the Books and Stocks datasets to varying degrees to test the performance changes of MultiRAG and ChatKBQA when data consistency is disrupted. The experimental results show that MultiRAG demonstrates excellent robustness in the face of data consistency disruption, while ChatKBQA's performance declines rapidly under perturbation.

Specifically, as shown in Fig. 5a, on the Movies dataset, we added 30%, 50%, and 70% triple increments to the original dataset and randomized the relationship edges of the added triples. The results show that MultiRAG's F1 score slightly decreased from 54.8% to 52.1%, 51.5%, and 49.9%, while ChatKBQA's F1 score dropped significantly from 53.6% to 51.6%, 47.2%, and 40.8%. On the Flights dataset, shown in Fig. 5c, we performed the same perturbation operations: MultiRAG's F1 score slightly decreased from 74.9% to 73.4%, 72.9%, and 71.4%, while ChatKBQA's dropped substantially from 74.2% to 69.7%, 64.3%, and 55.8%.

These results indicate that even when data consistency is severely compromised, MultiRAG maintains a high level of performance stability, whereas ChatKBQA is more sensitive to disruptions in data consistency.

## C. Evaluation of Multi-level Confidence Computing

Calculating the confidence of subgraphs and nodes to filter trustworthy answers is in significant demand in critical domains such as finance and law. Considering the high temporal and spatial overhead of directly calculating the confidence of all nodes, we draw inspiration from the workflow of recommendation systems, mimicking their coarse- and fine-ranking process, and adopt the multi-level confidence computing method to filter credible nodes and enhance retrieval performance. Calculating the credibility of homologous subgraphs allows us to preliminarily determine whether the subgraphs containing answers can generate highly credible answers. For subgraphs with low confidence, more nodes need to be extracted to ensure the robustness of the overall retrieval; for subgraphs with high confidence, only 1-2 nodes are required to generate the correct answer.

![bo_d6oh96c601uc73e30m7g_9_135_136_1525_284_0.jpg](images/bo_d6oh96c601uc73e30m7g_9_135_136_1525_284_0.jpg)

Fig. 5: Experimental results of Q2, where (a) and (b) display the multi-source data sparsity experiments, and (c) and (d) display the multi-source data consistency experiments.

![bo_d6oh96c601uc73e30m7g_9_138_530_1525_421_0.jpg](images/bo_d6oh96c601uc73e30m7g_9_138_530_1525_421_0.jpg)

Fig. 6: F1 score and query time on Movies and Books with corruption levels of 0%, 10%, 30%, 50%, and 70% across different sources

## Q3: How effective are the two modules of MultiRAG individually?
a) Ablation Study on Component Effectiveness: The MKA module achieves significant efficiency-accuracy synergy through its MLG architecture. As shown in Table III, MLG construction introduces modest preprocessing time (12.7s-39.2s) while delivering a 10-100× query acceleration. Specifically, the Flights dataset shows a QT reduction from computational infeasibility (marked NAN) to 29.8s thanks to MLG's compact structure. Concurrently, MKA sustains consistent accuracy improvements: removing MKA causes F1 drops of 7.3% on Movies and 9.6% on Books, demonstrating MLG's effectiveness in connecting fragmented knowledge across sources.

The MCC module has an even more pronounced effect on performance and hallucination control. Disabling MCC causes drastic F1 degradation of 20.1% on Movies and 33.2% on Stocks, with PT values indicating increased hallucination risks. This validates MCC's critical role in eliminating unreliable information through hierarchical confidence computation.

b) Hierarchical Analysis of MCC: Stratified ablation reveals the complementary roles of the graph-level and node-level computations. For Movies (J/K/C configuration), removing graph-level filtering reduces F1 to 48.1% (+13.4% vs. MCC-disabled) with QT increasing to 75.6s (+93% vs. the full framework). Conversely, disabling node-level computation yields 41.5% F1 (+6.8% vs. baseline), showing that graph-level filtering alone cannot resolve local conflicts. The complete MCC framework achieves 54.8% F1 by synergistically combining both layers.

Error analysis shows distinct failure patterns: 38.7% of errors under graph-level removal (Movies J/K) stem from cross-source inconsistencies, while 52.7% of failures with node-level removal (Books J/C/X) originate from local authority issues. This confirms the functional specialization: graph-level filtering ensures global consistency, while node-level computation verifies local credibility.
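The coarse-to-fine division of labor between the two levels can be illustrated with a short sketch. The thresholds mirror the hyperparameter settings (graph 0.5, node 0.7); the function name and input layout are ours, and the confidence scores are assumed to be precomputed by the graph- and node-level formulas:

```python
def mcc_filter(subgraphs, graph_threshold=0.5, node_threshold=0.7):
    """Two-stage filtering: graph-level confidence first prunes whole
    homologous subgraphs (global consistency), then node-level
    confidence vets nodes inside the survivors (local credibility)."""
    credible_nodes, isolated = [], []
    for sg in subgraphs:
        if sg["confidence"] < graph_threshold:
            # Coarse stage: drop subgraphs with low cross-source agreement
            isolated.extend(sg["nodes"])
            continue
        for node, conf in sg["nodes"]:
            # Fine stage: keep only individually trustworthy nodes
            (credible_nodes if conf > node_threshold else isolated).append((node, conf))
    return credible_nodes, isolated
```

The example values below echo the flight case study (graph confidence 0.71; node scores 0.89, 0.47, 0.52).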
Fig. 7 demonstrates that an optimal balance between efficiency and accuracy is achieved at \( \alpha = 0.5 \), where the hybrid weighting of LLM-assessed authority and historical authority peaks with an F1 score of 67.7% and a balanced query time. Specifically, increasing \( \alpha \) towards 1.0, which emphasizes the LLM, reduces query time from 83.2 seconds \( \left( \alpha = 0.0\right) \) to 51.8 seconds \( \left( \alpha = 1.0\right) \) by minimizing historical data validation. Conversely, the F1 score follows a non-monotonic pattern, reaching its maximum at \( \alpha = 0.5 \) before declining as reliance on either the LLM or historical data becomes excessive. This equilibrium leverages the LLM's contextual adaptability (\( \mathrm{Auth}_{LLM} \)) while maintaining the stability of expert systems (\( \mathrm{Auth}_{hist} \)), as evidenced by a 62.4% reduction in errors during ablation studies when both components are utilized. By avoiding complete dependence on the LLM \( \left( \alpha \neq 1.0\right) \) and integrating probabilistic LLM inferences with deterministic historical patterns through multi-level confidence computing (Eq. 9), the methodology enhances robustness against data sparsity and noise, particularly on the Books and Stocks datasets.

Query: "What is the real-time status of Air China flight CA981 from Beijing Capital International Airport (PEK) to New York John F. Kennedy Airport (JFK)?"
| Stage | Content |
|---|---|
| Data Sources | Structured: CA981, PEK, JFK, Delayed, 2024-10-01 14:30 |
| | Semi-structured: {"flight": "CA981", "delay_reason": "Weather", "source": "AirChina"} |
| | Unstructured: "Typhoon Haikui impacts PEK departures after 14:00." |
| MKA Module | Structured parsing: flight attribute mapping |
| | LLM extraction: (CA981, DelayReason, Typhoon) @0.87 |

MLG Subgraph

![bo_d6oh96c601uc73e30m7g_10_452_427_964_319_0.jpg](images/bo_d6oh96c601uc73e30m7g_10_452_427_964_319_0.jpg)

| Stage | Content |
|---|---|
| MCC Module | With GCC: graph confidence = 0.71 (threshold = 0.5); filtered: ForumUser123 (0.47). Without GCC: unfiltered conflicts = 2 subgraphs |
| LLM Context | Trusted: CA981.Status=Delayed (0.89), DelayReason=Typhoon (0.85) |
| | Conflicts: ForumUser123: On-time (0.47), WeatherAPI: Clear (0.52) |
| Final Answer | Correct: "CA981 delayed until after 14:30 due to typhoon" |
| | Hallucinated: "CA981 on-time with possible delay after 14:30" |
![bo_d6oh96c601uc73e30m7g_10_172_1115_676_387_0.jpg](images/bo_d6oh96c601uc73e30m7g_10_172_1115_676_387_0.jpg)

Fig. 7: Influence of hyperparameter \( \alpha \) on multi-source retrieval

## Q4: How does MultiRAG perform on multi-hop Q&A datasets after incorporating multi-level confidence computing?

To assess the validity of the multi-level confidence computing method in reducing the hallucinations generated by large models and enhancing the credibility of Q&A systems, we compare the Recall@5 scores of different methods on the HotpotQA and 2WikiMultiHopQA datasets.

Table IV indicates that the multi-level confidence computing method not only achieves a higher average Recall@5 score but also maintains a lower standard deviation than traditional methods. This suggests that it performs more consistently across different queries, leading to fewer hallucinations and more reliable Q&A responses. The lower standard deviation is a testament to the robustness of the mechanism in handling variability in the data and the complexity of the queries.

Furthermore, we performed a detailed error analysis to identify the types and frequency of hallucinations in the responses generated by the different methods. The results show that the multi-level confidence computing method significantly reduces the frequency of hallucinations, particularly in cases where the context is ambiguous or the information is not readily available in the knowledge base.

## Q5: What are the time costs of the two modules in MultiRAG?

Intuitively, the MLG aggregates homologous data from several sources, ensuring the density of the retrieval subgraphs without the need to traverse and store an excessive number of invalid nodes, thereby significantly reducing the time cost of traversing and querying traditional knowledge graphs.
+ +Furthermore, although the SOTA methods are not specifically tailored for low-resource, high-noise data scenarios, they still exhibit considerable robustness and retrieval performance in such environments. Both the MDQA and ChatKBQA models employ LLM-based data retrieval approaches, with the primary temporal and spatial overheads focusing on token consumption and LLM-based searching. + +In contrast, MultiRAG concentrates its overhead on the construction of the MLG. While in the original context of the MLG, construction times are often within seconds and highly efficient, the introduction of an LLM still incurs additional temporal costs due to text generation, which remains acceptable. Ultimately, these methods all demonstrate satisfactory retrieval performance; however, due to the inherent noise in the datasets, improvements in the accuracy of question-answering are somewhat limited. + +## D. Case Study + +MultiRAG's effectiveness in multi-source integration is demonstrated through a real-world flight status query for "CA981 from Beijing to New York". As detailed in Table V case study exemplifies MultiRAG's unique strength in transforming fragmented, conflicting inputs into trustworthy answers through systematic source weighting and consensus modeling. + +Firstly, MultiRAG integrated three data formats: structured departure schedules, semi-structured delay codes from airline systems, and unstructured weather alerts. The MKA module extracted key relationships (flight-delay-typhoon) with a confidence score of 0.87. Subsequently, the MCC module resolved conflicts through hierarchical verification by filtering out low-reliability sources, such as user forums (confidence score of 0.47), while prioritizing data from airlines (confidence score of 0.89 ) and weather reports. 
This dual-layer validation, combining automated threshold checks (graph confidence of 0.71) with LLM-simulated expert reasoning, enabled the precise reconciliation of contradictory departure time claims. Ultimately, the system generated the verified conclusion, "Delayed until after 14:30 due to typhoon," while suppressing the inconsistent "on-time" report. + +## E. Restrictive Analysis + +Last but not least, we acknowledge several limitations inherent in our current framework. + +1) Lack of optimization of text chunk segmentation. + +2) Reliance on LLM-based expert evaluation, which may introduce potential security vulnerabilities. + +3) A focus on eliminating factual hallucinations, without handling of symbolic hallucinations. + +## V. RELATED WORK + +## A. Graph-Structured Approaches for Hallucination Mitigation + +Recent advancements have demonstrated unique advantages of graph structures in mitigating hallucinations within RAG systems. MetaRAG [9] establishes knowledge association verification through meta-cognitive graph reasoning paths, enhancing self-correction mechanisms in multi-hop QA. Graph-CoT [48] innovatively leverages Graph Neural Networks to establish bidirectional connections between KGs and the latent space of LLMs. As a result, it reduces factual inconsistencies by 37% on KGQA benchmarks. Inspired by neurobiology, HippoRAG [23] constructs offline memory graphs with a neural indexing mechanism, decreasing retrieval latency to one-eighth of traditional methods. ToG 2.0 [25] further advances this field by introducing a graph-context co-retrieval framework that dynamically balances structured and unstructured evidence, resulting in a 29% reduction in hallucination rates compared to unimodal approaches.
+ +Unlike prior approaches that primarily focus on unimodal confidence calculations, MultiRAG achieves superior hallucination mitigation through the adaptive filtering of conflicting subgraphs (GCC module) while maintaining multi-domain logical associations via its novel knowledge aggregation mechanism (MKA module). + +## B. Heterogeneous Graph Fusion for RAG + +The fusion of multi-source heterogeneous data relies on advanced graph representation techniques. FusionQuery [34] enhances cross-domain retrieval precision by integrating heterogeneous graphs and computing dynamic credibility evaluations. The Triple Line Graph [31] addresses the challenge of knowledge fragmentation by systematically aggregating cross-domain relationships, leading to the Multi-source Line Graph proposed in this paper. Additionally, leveraging the structured representation advantages of KAG [26] in knowledge-guided retrieval, we achieve a unified representation approach for multi-source KGs, underscoring the importance of heterogeneous graph fusion in real-world applications. + +## C. Hallucination Benchmark and Confidence-Aware Computing + +The evaluation of hallucinations in LLMs and associated confidence calculation methods are crucial for mitigating hallucinations. HaluEval [49] offers 5,000 annotated samples across five error categories, but lacks granularity for relational hallucinations. RefChecker [50] implements triple decomposition for fine-grained detection, improving precision by 26.1% over sentence-level methods. RAGTruth [51] contains nearly 18,000 RAG-generated responses with detailed manual annotations including word-level hallucination intensities. However, diverse and complex data sources continue to challenge existing evaluation frameworks. + +## VI. CONCLUSION + +In this work, we introduce MultiRAG, a framework designed to mitigate hallucination in multi-source knowledge-augmented generation.
To address hallucinations arising from data sparsity and inconsistency, we propose two key innovations: multi-source knowledge aggregation and multi-level confidence calculation. The introduction of multi-source line graphs enables efficient cross-domain data aggregation, enhancing knowledge connectivity and retrieval performance. Meanwhile, our multi-level confidence computing module adaptively filters out low-quality subgraphs and unreliable nodes. Future work will explore more challenging aspects of hallucination mitigation, particularly in multimodal retrieval and ultra-long text reasoning, to better adapt generative retrieval systems to real-world, open multi-source environments. + +# HypRAG: Hyperbolic Dense Retrieval for Retrieval Augmented Generation + +Hiren Madhu \( {}^{1} \) Ngoc Bui \( {}^{1} \) Ali Maatouk \( {}^{1} \) Leandros Tassiulas \( {}^{1} \) Smita Krishnaswamy \( {}^{1} \) Menglin Yang \( {}^{2} \) Sukanta Ganguly \( {}^{3} \) Kiran Srinivasan \( {}^{3} \) Rex Ying \( {}^{1} \) + +## Abstract + +Embedding geometry plays a fundamental role in retrieval quality, yet dense retrievers for retrieval-augmented generation (RAG) remain largely confined to Euclidean space. However, natural language exhibits hierarchical structure from broad topics to specific entities that Euclidean embeddings fail to preserve, causing semantically distant documents to appear spuriously similar and increasing hallucination risk. To address these limitations, we introduce hyperbolic dense retrieval, developing two model variants in the Lorentz model of hyperbolic space: HyTE-FH, a fully hyperbolic transformer, and HyTE-H, a hybrid architecture projecting pre-trained Euclidean embeddings into hyperbolic space.
To prevent representational collapse during sequence aggregation, we introduce the Outward Einstein Midpoint, a geometry-aware pooling operator that provably preserves hierarchical structure. On MTEB, HyTE-FH outperforms equivalent Euclidean baselines, while on RAGBench, HyTE-H achieves up to 29% gains over Euclidean baselines in context relevance and answer relevance using substantially smaller models than current state-of-the-art retrievers. Our analysis also reveals that hyperbolic representations encode document specificity through norm-based separation, with over 20% radial increase from general to specific concepts, a property absent in Euclidean embeddings, underscoring the critical role of geometric inductive bias in faithful RAG systems \( {}^{1} \). + +## 1. Introduction + +Dense retrieval forms the backbone of retrieval-augmented generation (RAG) systems (Lewis et al., 2020; Fan et al., 2024), where embedding quality directly determines whether generated responses are grounded in evidence or hallucinated. By retrieving relevant documents and conditioning generation on this context, RAG systems produce responses that are more attributable and aligned with verifiable sources (Ni et al., 2025). Yet, despite advances in retrieval architectures, current systems continue to rely on Euclidean embeddings, a choice inherited from standard neural networks rather than from language structure itself. + +![bo_d6nbcqk601uc73e2hscg_0_917_545_680_293_0.jpg](images/bo_d6nbcqk601uc73e2hscg_0_917_545_680_293_0.jpg) + +Figure 1. Hierarchies in Text. (A) Documents naturally organize into branching hierarchies where general topics spawn increasingly specific subtopics. Euclidean spaces distort such hierarchies due to crowding effects, while hyperbolic geometry preserves hierarchical relationships through exponential volume growth.
(B) Ricci curvature analysis of document embeddings from strong baselines reveals predominantly negative curvature, indicating tree-like semantic structure. + +Natural language inherently exhibits strong hierarchical organization (He et al., 2025b; Robinson et al., 2024), with semantic structure giving rise to locally tree-like neighborhoods. Euclidean spaces struggle to represent such branching hierarchies due to polynomial volume growth (He et al., 2025b), introducing shortcuts between hierarchically distinct regions that distort semantic relationships. In retrieval settings, these distortions can cause semantically distant documents to appear spuriously similar (Radovanovic et al., 2010; Bogolin et al., 2022), degrading retrieval precision (Reimers & Gurevych, 2021): a query about a specific subtopic may retrieve documents from sibling or parent categories that share similarity but lack the required specificity. + +To further see why geometry matters for retrieval, consider a query about transformer attention mechanisms (Figure 1A). Relevant documents form a natural hierarchy, from general concepts like NLP, to transformers, to specific components like multi-head attention, inducing tree-like semantic structure. Euclidean embeddings struggle to preserve this organization: representing both broad topics and specialized descendants forces a trade-off between semantic proximity and fine-grained separation, causing neighborhood crowding and distortion. Hyperbolic geometry resolves this tension through exponential volume growth, allowing general concepts to remain compact while specific documents spread outward. To test whether such structure appears empirically, we analyze Ollivier-Ricci curvature (Ni et al., 2019), a measure of local geometry where negative values indicate tree-like branching, on graphs built from MS MARCO document embeddings (Bajaj et al., 2016).
Across several strong models (Linq Embed Mistral, LLaMA Nemotron 8B, Qwen3 Embedding 4B), curvature distributions are predominantly negative (Figure 1B), providing empirical evidence that retrieval-relevant embeddings exhibit intrinsic hyperbolic structure and motivating hyperbolic geometry as a natural inductive bias for dense retrieval. + +--- + +\( {}^{1} \) Yale University, USA \( {}^{2} \) Hong Kong University of Science and Technology (Guangzhou), China \( {}^{3} \) NetApp, USA. Correspondence to: Rex Ying . + +Preprint. February 10, 2026. + +\( {}^{1} \) The code is available at: https://anonymous.4open.science/r/HypRAG-30C6 + +--- + +Recent work has begun exploring hyperbolic geometry for language modeling and RAG systems, though with different focus areas. HELM (He et al., 2025a) introduces a family of hyperbolic language models that operate entirely in hyperbolic space, but these models target text generation rather than retrieval. In the RAG setting, HyperbolicRAG (Cao et al., 2025) projects embeddings into the Poincaré ball to encode hierarchical depth within a static, pre-built knowledge graph, using dual-space retrieval that fuses Euclidean and hyperbolic rankings. However, HyperbolicRAG relies on Euclidean encoders to produce the initial embeddings, leaving the fundamental geometric mismatch unresolved. Moreover, aggregating token embeddings into document representations poses a challenge that existing work in hyperbolic learning does not address (Yang et al., 2024). As we show in Proposition 4.3, naively averaging tokens in Euclidean space before projecting to hyperbolic space causes representations to collapse toward the origin, destroying the hierarchical structure that is meant to be preserved. + +To this end, we introduce hyperbolic dense retrieval for RAG, framing embedding geometry as a design choice for improving evidence selection and grounding at the representation level. We study this through two complementary instantiations.
First, HyTE-FH (Hyperbolic Text Encoder, Fully Hyperbolic) operates entirely in the Lorentz model of hyperbolic space, enabling end-to-end representation learning. Second, HyTE-H (Hybrid) maps embeddings from off-the-shelf Euclidean encoders into hyperbolic space, allowing us to build on existing pre-trained Euclidean models. The Lorentz model's intrinsic geometry enables parameter-efficient scaling: HyTE-H outperforms Euclidean baselines that are several times (2-3x) its size, reducing memory footprint in resource-constrained settings. To address the aggregation challenge in both instantiations, we introduce the Outward Einstein Midpoint, a geometry-aware pooling operator that amplifies tokens farther from the origin, provably preserving hierarchical structure during pooling. + +Through extensive evaluation on RAGBench, we demonstrate that both hyperbolic variants consistently outperform Euclidean baselines in answer relevancy across multiple datasets, while achieving competitive performance on MTEB. Our experiments validate three key findings: (1) hyperbolic retrieval substantially improves RAG performance, with up to 29% gains over Euclidean baselines in context relevance and answer relevance; (2) hyperbolic models naturally encode concept-level hierarchies in their radial structure, with the fully hyperbolic model achieving a 20.2% radius increase from general to specific concepts, while Euclidean models fail to capture this organization; and (3) our theoretically grounded Outward Einstein Midpoint pooling preserves this hierarchical structure during aggregation. + +## 2. Related Works + +Text Embeddings and Dense Retrieval. Dense retrieval embeds queries and documents into a shared vector space and ranks candidates by similarity (e.g., dot product or cosine). Transformer bi-encoders (e.g., BERT (Devlin et al., 2019)) are widely used in this context due to their scalability with maximum inner product search (Karpukhin et al., 2020; Reimers & Gurevych, 2019).
Most methods train with contrastive objectives using in-batch and hard negatives (Gao et al., 2021; Izacard et al., 2021; Xiong et al., 2021), often following large-scale pretraining plus task-specific fine-tuning (Wang et al., 2022; Li et al., 2023; Nussbaum et al., 2025). More recently, decoder-only embedding models initialize from LLMs to exploit their pretrained linguistic knowledge (Muennighoff et al., 2024; Lee et al., 2024; Zhang et al., 2025). However, most retrievers remain reliant on inner products or distances in Euclidean geometry, an inductive bias often misaligned with the hierarchical structure of language and document collections. We address this gap by introducing hyperbolic geometry for text embeddings to better capture such a hierarchy. + +Retrieval Augmented Generation. RAG grounds LLMs in retrieved evidence to improve factuality and access external knowledge (Oche et al., 2025). It typically retrieves top-\( k \) contexts (often via dense retrieval) and conditions generation on them (Lewis et al., 2020). Since the context window is limited, retrieval quality is a key bottleneck for relevance and faithfulness (Friel et al., 2024a). Several methods improve reliability after retrieval: Self-RAG (Asai et al., 2024) and CRAG (Yan et al., 2024) use learned critics to filter or re-rank evidence, while GraphRAG (Han et al., 2024) leverages knowledge graphs for structured subgraph retrieval. These approaches operate downstream of the embedding space and are complementary to our geometric approach. Our goal is to improve RAG upstream by enhancing the retriever representations so that the initial top-\( k \) evidence is more reliable under realistic efficiency constraints. + +Hyperbolic Representation Learning.
Hyperbolic geometry is primarily known for its ability to better capture hierarchical, tree-like structures (Yang et al., 2023; Peng et al., 2021), which enhances performance in various tasks, including molecular generation (Liu et al., 2019), recommendation (Yang et al., 2021; Li et al., 2021), image retrieval (Khrulkov et al., 2020; Wei et al., 2024; Bui et al., 2025), and knowledge graph embedding (Ganea et al., 2018a; Dhingra et al., 2018). More recently, hyperbolic geometry has also shown promise for multi-modal embedding models (Desai et al., 2023; Ibrahimi et al., 2024; Pal et al., 2024) and foundation models (Yang et al., 2025; He et al., 2025a). In contrast to these works, we study how hyperbolic representations can improve retrieval in RAG systems. Concurrently, Cao et al. (2025) use hyperbolic geometry to improve RAG rankings, but obtain hyperbolic embeddings via a simple projection from Euclidean encoders; by contrast, we build on fully hyperbolic encoders trained end-to-end and address key challenges in this setting, including providing the theoretically grounded geometry-aware pooling for document-level representations. + +## 3. Hyperbolic Space Preliminaries + +In this section, we go over the preliminaries of the Lorentz model of hyperbolic space and introduce the basic building blocks that create HyTE-FH. + +### 3.1. Lorentz Model of Hyperbolic Space + +We represent all embeddings in \( d \) -dimensional hyperbolic space \( {\mathbb{H}}_{K}^{d} \) with constant negative curvature \( K < 0 \) using the Lorentz (hyperboloid) model.
In the Lorentz model, hyperbolic space is realized as the upper sheet of a two-sheeted hyperboloid embedded in \( {\mathbb{R}}^{d + 1} \) , + +\[ +{\mathbb{H}}_{K}^{d} = \left\{ {\mathbf{x} \in {\mathbb{R}}^{d + 1}\mid \langle \mathbf{x},\mathbf{x}{\rangle }_{L} = \frac{1}{K},{x}_{0} > 0}\right\} , +\] + +where the Lorentzian inner product is defined as \( \langle \mathbf{x},\mathbf{y}{\rangle }_{L} = \; - {x}_{0}{y}_{0} + \mathop{\sum }\limits_{{i = 1}}^{d}{x}_{i}{y}_{i} \) . This formulation admits closed-form expressions for geodesic distances, barycentric operations, and parallel transport, and expresses similarity directly through Lorentzian inner products. The geodesic distance between two points \( \mathbf{x},\mathbf{y} \in {\mathbb{H}}_{K}^{d} \) is given by \( {d}_{K}\left( {\mathbf{x},\mathbf{y}}\right) = \; \frac{1}{\sqrt{-K}}{\cosh }^{-1}\left( {K\langle \mathbf{x},\mathbf{y}{\rangle }_{L}}\right) \) , which is a monotone function of the Lorentzian inner product. + +To support optimization, we make use of exponential and logarithmic maps between the manifold and its tangent spaces. For a point \( \mathbf{x} \in {\mathbb{H}}_{K}^{d} \) , the logarithmic map \( {\log }_{x}\left( \cdot \right) \) maps nearby points to the tangent space \( {T}_{x}{\mathbb{H}}_{K}^{d} \) , while the exponential map \( {\exp }_{x}\left( \cdot \right) \) maps tangent vectors back to the manifold. These operators are used only where necessary for gradient-based updates, ensuring that all representations remain on \( {\mathbb{H}}_{K}^{d} \) and preserving the hierarchical structure induced by negative curvature. + +### 3.2. Hyperbolic Transformer Components + +Standard operations cannot be applied directly in hyperbolic space, as they may violate the manifold constraint (Yang et al., 2024). To address this, we introduce hyperbolic components that serve as the building blocks for our embedding model. 
These operations are performed via a re-centering procedure that applies Euclidean operations in a latent space and maps the result back to the Lorentz model. By doing so, the resulting vector is constructed to satisfy the Lorentz constraint, thereby preserving the hyperbolic structure of representations. We present these operations as follows. + +Lorentz Linear Layer. Given curvatures \( {K}_{1},{K}_{2} \) , and parameters \( \mathbf{W} \in {\mathbb{R}}^{\left( {n + 1}\right) \times m} \) and \( \mathbf{b} \in {\mathbb{R}}^{m} \) with \( \mathbf{z} = \; \left| {{\mathbf{W}}^{\top }\mathbf{x} + \mathbf{b}}\right| \) , the Lorentzian linear transformation (Yang et al.,2024) is the map HLT : \( {\mathbb{L}}^{{K}_{1}, n} \rightarrow {\mathbb{L}}^{{K}_{2}, m} \) given by, + +\[ +\operatorname{HLT}\left( {\mathbf{x};\mathbf{W},\mathbf{b}}\right) = \sqrt{\frac{{K}_{2}}{{K}_{1}}} \cdot \left\lbrack {\sqrt{\parallel \mathbf{z}{\parallel }^{2} - 1/{K}_{2}},\mathbf{z}}\right\rbrack +\] + +Hyperbolic Layer Normalization. Given token embed-dings \( X = {\left\{ {\mathbf{x}}_{i}\right\} }_{i = 1}^{n} \subset {\mathbb{H}}_{K}^{d} \) , hyperbolic layer normalization is defined as + +\[ +\text{ HypLayerNorm }\left( X\right) = \left( {\sqrt{\frac{{K}_{1}}{{K}_{2}}\parallel \mathbf{z}{\parallel }_{2}^{2} - \frac{1}{{K}_{2}}},\sqrt{\frac{{K}_{1}}{{K}_{2}}}\mathbf{z}}\right) +\] + +where \( z = {f}_{\mathrm{{LN}}}\left( {\mathbf{x}}_{i,\left\lbrack {1 : d}\right\rbrack }\right) ,{f}_{\mathrm{{LN}}}\left( \cdot \right) \) denotes standard Euclidean LayerNorm applied to the spatial components of the embedding, and \( {K}_{1},{K}_{2} > 0 \) are input and output curvature respectively. + +Lorentz Residual Connection. Let \( \mathbf{x}, f\left( \mathbf{x}\right) \in {\mathbb{L}}^{K, n} \) where \( \mathbf{x} \) is an input vector and \( f\left( \mathbf{x}\right) \) is the output of a neural network \( f \) . 
Then, the Lorentzian residual connection (He et al., 2025d) is given by \( \mathbf{x}{ \oplus }_{\mathcal{L}}f\left( \mathbf{x}\right) = {\alpha }_{1}\mathbf{x} + {\alpha }_{2}f\left( \mathbf{x}\right) \) , where + +\[ +{\alpha }_{i} = {w}_{i}/\left( {\sqrt{-K}{\begin{Vmatrix}{w}_{1}\mathbf{x} + {w}_{2}f\left( \mathbf{x}\right) \end{Vmatrix}}_{\mathcal{L}}}\right) ,\text{ for }i \in \{ 1,2\} , +\] + +where \( {\alpha }_{1},{\alpha }_{2} \) are weights parametrized by constants \( \left( {{w}_{1},{w}_{2}}\right) \in {\mathbb{R}}^{2} \smallsetminus \{ \left( {0,0}\right) \} . \) + +Hyperbolic Self-Attention. In hyperbolic attention, similarity is governed by hyperbolic geodesic distance (Ganea et al., 2018b). Given token embeddings \( X = {\left\{ {\mathbf{x}}_{i}\right\} }_{i = 1}^{n} \subset \; {\mathbb{H}}_{K}^{d} \) , queries, keys, and values are computed via Lorentz-linear transformations as \( \mathbf{Q} = \operatorname{HLT}\left( {X;{\mathbf{W}}^{Q},{\mathbf{b}}^{Q}}\right) ,\mathbf{K} = \; \operatorname{HLT}\left( {X;{\mathbf{W}}^{K},{\mathbf{b}}^{K}}\right) \) , and \( \mathbf{V} = \operatorname{HLT}\left( {X;{\mathbf{W}}^{V},{\mathbf{b}}^{V}}\right) \) , where HLT \( \left( \cdot \right) \) denotes a linear map in Lorentz space. Attention weights are computed using squared hyperbolic geodesic distances (He et al., 2025c; Chen et al., 2022) as + +\[ +{\nu }_{i, j} = \frac{\exp \left( {-{d}_{K}^{2}\left( {{\mathbf{q}}_{i},{\mathbf{k}}_{j}}\right) /\sqrt{m}}\right) }{\mathop{\sum }\limits_{{l = 1}}^{n}\exp \left( {-{d}_{K}^{2}\left( {{\mathbf{q}}_{i},{\mathbf{k}}_{l}}\right) /\sqrt{m}}\right) }, +\] + +![bo_d6nbcqk601uc73e2hscg_3_269_187_489_503_0.jpg](images/bo_d6nbcqk601uc73e2hscg_3_269_187_489_503_0.jpg) + +Figure 2. HyTE Architecture. A) HyTE-FH Encoder Block, B) HyTE-FH architecture, C) HyTE-H Architecture. + +with head dimension \( m \) . This prioritizes geodesic proximity rather than angular similarity.
The attended representation is obtained via a Lorentzian weighted midpoint + +\[ +{\operatorname{Att}}_{\mathcal{L}}{\left( \mathbf{x}\right) }_{i} = \frac{\mathop{\sum }\limits_{{j = 1}}^{n}{\nu }_{i, j}{\lambda }_{j}{\mathbf{v}}_{j}}{\sqrt{-K}{\begin{Vmatrix}\mathop{\sum }\limits_{{j = 1}}^{n}{\nu }_{i, j}{\lambda }_{j}{\mathbf{v}}_{j}\end{Vmatrix}}_{\mathcal{L}}}, +\] + +where \( {\lambda }_{j} = {v}_{j,0} \) is the Lorentz factor. Unlike Euclidean averaging, this aggregation remains on \( {\mathbb{H}}_{K}^{d} \) and preserves radial structure during contextualization. + +## 4. Method + +We now outline our approach to hyperbolic dense retrieval. We begin by introducing the two proposed HyTE architectures, followed by an analysis of why naïve pooling strategies fail in hyperbolic space, and conclude by presenting our geometry-aware aggregation operator. + +### 4.1. Architecture + +The hyperbolic encoder components described in Section 3 form the building blocks (Figure 2A) of HyTE-FH, our fully hyperbolic transformer (Figure 2B). By operating entirely within hyperbolic geometry, HyTE-FH preserves hierarchical structure throughout token-level contextualization, aggregation, and similarity computation, with semantic abstraction and specificity encoded along radial dimensions. HyTE-H (Figure 2C) instead projects pretrained Euclidean representations into hyperbolic space, which allows hyperbolic geometry to be leveraged with a strong initialization while avoiding the need to train a fully hyperbolic encoder from scratch. + +While hyperbolic self-attention enables geometry-consistent contextualization at the token level, dense retrieval requires aggregating variable-length sequences into fixed-dimensional representations. Standard approaches map representations to tangent space, aggregate in Euclidean space, then map back to the manifold (Yang et al., 2024; Desai et al., 2023), but this distorts hierarchical structure encoded in radial depth in both models.
In the following subsections, we analyze this failure mode formally and introduce a pooling operator designed to preserve hierarchical information. + +### 4.2. Failure of Naïve Hyperbolic Pooling + +Naïve pooling strategies that aggregate in Euclidean space (Yang et al., 2024; Desai et al., 2023) systematically contract representations toward the origin. This follows from hyperbolic convexity: for any \( {\left\{ {\mathbf{x}}_{i}\right\} }_{i = 0}^{n} \subset {\mathbb{H}}_{K}^{d} \) , the barycenter lies strictly closer to the origin than the maximum-radius point unless all points coincide. Consequently, document-level embeddings lose the radial separation that encodes document specificity through hierarchical depth. To address this failure mode, we first establish notation for projecting ambient vectors onto the hyperboloid and measuring radial depth. + +Definition 4.1 (Lorentz Projection). For \( \mathbf{v} \in {\mathbb{R}}^{d + 1} \) with \( \langle \mathbf{v},\mathbf{v}{\rangle }_{L} < 0 \) and \( {v}_{0} > 0 \) , let \( {\Pi }_{K}\left( \mathbf{v}\right) = \; \frac{\mathbf{v}}{\sqrt{K\langle \mathbf{v},\mathbf{v}{\rangle }_{L}}} \) denote the unique positive rescaling satisfying \( {\left\langle {\Pi }_{K}\left( \mathbf{v}\right) ,{\Pi }_{K}\left( \mathbf{v}\right) \right\rangle }_{L} = 1/K \) + +Definition 4.2 (Radial Depth). The radial depth of \( \mathbf{x} \in {\mathbb{H}}_{K}^{d} \) is \( r\left( \mathbf{x}\right) = {x}_{0} \) . Since \( {x}_{0} = \frac{1}{\sqrt{-K}}\cosh \left( {\sqrt{-K}\rho }\right) \) where \( \rho = {d}_{K}\left( {o,\mathbf{x}}\right) \) , ordering by \( {x}_{0} \) is equivalent to ordering by intrinsic hyperbolic distance from the origin. + +Semantically, radial depth encodes concept specificity: general concepts should lie near the origin while fine-grained entities should have larger radii. This provides a measurable signature for evaluating whether models learn meaningful hierarchical structure. 
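As a concrete illustration of Definitions 4.1 and 4.2, the following NumPy sketch (assuming curvature \( K = -1 \); function names are illustrative, not from the paper's released code) lifts spatial vectors onto the hyperboloid, applies the Lorentz projection \( \Pi_K \), and reads off radial depth and geodesic distance.

```python
import numpy as np

K = -1.0  # constant negative curvature, assumed -1 for this sketch

def lorentz_inner(x, y):
    # Lorentzian inner product: -x0*y0 + sum_i xi*yi
    return -x[0] * y[0] + np.dot(x[1:], y[1:])

def lorentz_project(v):
    # Definition 4.1: Pi_K(v) = v / sqrt(K * <v,v>_L), valid when <v,v>_L < 0 and v0 > 0
    return v / np.sqrt(K * lorentz_inner(v, v))

def radial_depth(x):
    # Definition 4.2: r(x) = x0, monotone in the hyperbolic distance from the origin
    return x[0]

def geodesic_dist(x, y):
    # d_K(x,y) = (1/sqrt(-K)) * arccosh(K * <x,y>_L); clip guards rounding below 1
    arg = np.clip(K * lorentz_inner(x, y), 1.0, None)
    return np.arccosh(arg) / np.sqrt(-K)

def lift(z):
    # place a spatial vector z on the hyperboloid: x0 = sqrt(||z||^2 - 1/K)
    return np.concatenate([[np.sqrt(np.dot(z, z) - 1.0 / K)], z])

x = lift(np.array([0.3, -0.5]))   # small spatial norm -> small radial depth
y = lift(np.array([2.0, 1.0]))    # larger spatial norm -> larger radial depth
assert np.isclose(lorentz_inner(x, x), 1.0 / K)   # on-manifold constraint
assert np.isclose(geodesic_dist(x, x), 0.0)       # zero self-distance
print(radial_depth(x), radial_depth(y), geodesic_dist(x, y))
```

The same projection is what the pooling operators below apply after averaging, which is exactly where the radial contraction of Proposition 4.3 appears.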
The simplest aggregation strategy is Euclidean averaging in the ambient space followed by reprojection. However, this approach provably contracts representations toward the origin (Ganea et al., 2018a; Chami et al., 2019), destroying hierarchical structure encoded in radial depth. We formalize this in the following proposition. + +Proposition 4.3 (Euclidean Mean Contracts). Let \( {\left\{ {\mathbf{x}}_{i}\right\} }_{i = 1}^{n} \subset {\mathbb{H}}_{K}^{d} \) with \( n \geq 2 \) . Define the Euclidean mean \( \overline{\mathbf{x}} = \frac{1}{n}\mathop{\sum }\limits_{{i = 1}}^{n}{\mathbf{x}}_{i} \) and its projection onto the hyperboloid \( {\mathbf{m}}^{\text{ Euc }} = {\Pi }_{K}\left( \overline{\mathbf{x}}\right) \) . Then, we have + +\[ +r\left( {\mathbf{m}}^{\text{ Euc }}\right) \leq \frac{1}{n}\mathop{\sum }\limits_{{i = 1}}^{n}r\left( {\mathbf{x}}_{i}\right) , +\] + +with equality if and only if all \( {\mathbf{x}}_{i} \) are identical. + +![bo_d6nbcqk601uc73e2hscg_4_155_184_708_295_0.jpg](images/bo_d6nbcqk601uc73e2hscg_4_155_184_708_295_0.jpg) + +Figure 3. Outward Einstein Midpoint. Size of token shows its contribution towards aggregation. + +The proof of this Proposition is available in Appendix A.2. This failure motivates a precise characterization of desirable pooling behavior. We formalize the requirement that pooling should preserve, rather than collapse, radial structure. + +Definition 4.4 (Outward Bias). A pooling operator \( \mathcal{P} \) : \( {\left( {\mathbb{H}}_{K}^{d}\right) }^{n} \rightarrow {\mathbb{H}}_{K}^{d} \) is outward-biased if \( r\left( {\mathcal{P}\left( {\left\{ {\mathbf{x}}_{i}\right\} }_{i = 1}^{n}\right) }\right) \geq \bar{r} \) , where \( \bar{r} \) is the weighted mean radius. + +A natural alternative is a weighted aggregation scheme in which token contributions are modulated by their relative importance. For example, Zhu et al. 
(2020) adopt the Einstein midpoint, the canonical barycenter in hyperbolic space (Gulcehre et al., 2019), to emphasize semantically specific tokens during pooling: since points near the boundary receive higher weight via the Lorentz factor \( {\lambda }_{i} = {x}_{i,0} \) , more informative content should dominate the aggregate. However, we show this intuition is misleading: the implicit radial weighting is fundamentally insufficient to counteract hyperbolic contraction at the document level. + +Proposition 4.5 (Implicit Radial Weighting is Insufficient). The Einstein midpoint weights points by the Lorentz factor \( {\lambda }_{i} = {x}_{i,0} \) , but this weighting grows as \( \exp \left( {\sqrt{-K}\rho }\right) \) while hyperbolic volume grows as \( \exp \left( {\left( {d - 1}\right) \sqrt{-K}\rho }\right) \) . Specifically, for a point \( \mathbf{x} \in {\mathbb{H}}_{K}^{d} \) at hyperbolic distance \( \rho \) from the origin \( o = \left( {1/\sqrt{-K},0,\ldots ,0}\right) \) , we have + +\[ +{x}_{0} = \frac{1}{\sqrt{-K}}\cosh \left( {\sqrt{-K}\rho }\right) \sim \frac{1}{2\sqrt{-K}}\exp \left( {\sqrt{-K}\rho }\right) +\] + +as \( \rho \rightarrow \infty \) . Thus, the Lorentz factor weighting undercompensates for the exponential growth of hyperbolic balls at large radii by a factor of \( \exp \left( {\left( {d - 2}\right) \sqrt{-K}\rho }\right) \) . + +These results establish that neither Euclidean averaging nor the standard Einstein midpoint satisfies the outward-bias property required for hierarchy-preserving aggregation. This motivates the design of a pooling operator with explicit radial amplification. The proof of this Proposition is available in Appendix A.3. + +### 4.3. Outward Einstein Midpoint Pooling + +To mitigate radial contraction during aggregation, we introduce the Outward Einstein Midpoint, a geometry-aware pooling operator that explicitly amplifies the contribution of tokens with larger hyperbolic radius.
Let \( {\left\{ {\mathbf{x}}_{i}\right\} }_{i = 1}^{n} \subset {\mathbb{H}}_{K}^{d} \) denote a sequence of token embeddings, with optional attention weights \( {w}_{i} \geq 0 \) , and \( {\lambda }_{i} \) denoting the Lorentz factors. We define a radius-dependent weighting function + +\[ +\phi \left( {x}_{i}\right) = {x}_{i,0}^{p},\;p > 0, +\] + +which is monotone in the radial coordinate. The Outward Einstein Midpoint is then given by + +\[ +{\mathbf{m}}_{K, p}^{\mathrm{{OEM}}} = \frac{\mathop{\sum }\limits_{{i = 1}}^{n}\left( {{w}_{i}\phi \left( {\mathbf{x}}_{i}\right) }\right) {\lambda }_{i}{\mathbf{x}}_{i}}{\mathop{\sum }\limits_{{i = 1}}^{n}\left( {{w}_{i}\phi \left( {\mathbf{x}}_{i}\right) }\right) {\lambda }_{i}}, +\] + +followed by reprojection onto the hyperboloid \( {\mathbb{H}}_{K}^{d} \) . + +As shown in Figure 3, by construction, this operator assigns disproportionately higher weight to tokens located farther from the origin, counteracting the contraction inherent to naïve averaging. We now establish theoretical guarantees for the Outward Einstein Midpoint, showing that it systematically improves upon the standard Einstein midpoint in preserving radial structure. + +Theorem 4.6 (OEM Pre-Projection Bound). Let \( \widetilde{\mathbf{v}} = \; \mathop{\sum }\limits_{{i = 1}}^{n}{\widetilde{w}}_{i}{\mathbf{x}}_{i} \) where \( {\widetilde{w}}_{i} \propto {w}_{i}{x}_{i,0}^{p + 1} \) are the normalized OEM weights. Then, for \( p \geq 0 \) , we have + +\[ +{\widetilde{v}}_{0} = \frac{\mathop{\sum }\limits_{{i = 1}}^{n}{w}_{i}{x}_{i,0}^{p + 2}}{\mathop{\sum }\limits_{{i = 1}}^{n}{w}_{i}{x}_{i,0}^{p + 1}} \geq \frac{\mathop{\sum }\limits_{{i = 1}}^{n}{w}_{i}{x}_{i,0}}{\mathop{\sum }\limits_{{i = 1}}^{n}{w}_{i}} = {\bar{r}}_{w}. +\] + +We apply Chebyshev's sum inequality to the co-monotonic sequences \( {a}_{i} = {x}_{i,0}^{p + 1} \) and \( {b}_{i} = {x}_{i,0} \) to prove this. Full proof can be found in Appendix A.4. 
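A minimal numerical sketch of the OEM (curvature \( K = -1 \) assumed; names are illustrative) that checks the pre-projection bound above on random token embeddings, and compares the post-projection radius against the standard Einstein midpoint, which the OEM recovers at \( p = 0 \):

```python
import numpy as np

K = -1.0  # curvature assumed to be -1 for this sketch

def lorentz_inner(x, y):
    return -x[0] * y[0] + np.dot(x[1:], y[1:])

def project(v):
    # Pi_K(v): rescale onto the hyperboloid <x,x>_L = 1/K
    return v / np.sqrt(K * lorentz_inner(v, v))

def lift(z):
    # place a spatial vector z on the hyperboloid (x0 = sqrt(||z||^2 + 1) for K = -1)
    return np.concatenate([[np.sqrt(np.dot(z, z) + 1.0)], z])

def oem(points, p=1.0, w=None):
    # Outward Einstein Midpoint: weights ~ w_i * phi(x_i) * lambda_i = w_i * x_{i,0}^(p+1)
    X = np.stack(points)
    w = np.ones(len(points)) if w is None else np.asarray(w, float)
    coef = w * X[:, 0] ** (p + 1.0)
    v_pre = (coef[:, None] * X).sum(axis=0) / coef.sum()
    return v_pre, project(v_pre)

rng = np.random.default_rng(0)
tokens = [lift(s * rng.normal(size=8)) for s in (0.1, 0.5, 1.0, 4.0)]
mean_radius = np.mean([t[0] for t in tokens])

v_pre, m_oem = oem(tokens, p=2.0)
_, m_ein = oem(tokens, p=0.0)   # p = 0 recovers the standard Einstein midpoint

assert v_pre[0] >= mean_radius  # pre-projection time component dominates mean radius
assert m_oem[0] >= m_ein[0]     # OEM retains more radial depth than Einstein midpoint
print(mean_radius, m_ein[0], m_oem[0])
```

Here the deepest token (scale 4.0) dominates the OEM aggregate because its weight grows as \( x_{i,0}^{p+1} \), which is the explicit radial amplification the operator is designed to provide.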
While projection onto \( {\mathbb{H}}_{K}^{d} \) contracts the radial coordinate, the OEM's concentration of weight on high-radius tokens inflates the pre-projection average, counteracting this effect. Theorem 4.6 establishes that OEM increases the pre-projection radial coordinate. The following theorem shows a stronger result: OEM provably dominates the standard Einstein midpoint in preserving radial structure. + +Theorem 4.7 (OEM Outward Bias). Let \( {\mathbf{m}}_{K}^{\text{ Ein }} \) denote the standard Einstein midpoint \( \left( {p = 0}\right) \) and \( {\mathbf{m}}_{K, p}^{\text{ OEM }} \) the Outward Einstein Midpoint. Then, for all \( p \geq 1 \) : + +\[ +r\left( {\mathbf{m}}_{K, p}^{\mathrm{{OEM}}}\right) \geq r\left( {\mathbf{m}}_{K}^{\mathrm{{Ein}}}\right) . +\] + +The OEM weights \( {\widetilde{w}}_{i} \propto {w}_{i}{x}_{i,0}^{p + 1} \) concentrate more mass on high-radius points than the Einstein weights \( {w}_{i}{x}_{i,0} \) , increasing the pre-projection time component while reducing pairwise dispersion. Full proof in Appendix A.5. Together, these results establish that the Outward Einstein Midpoint provably preserves hierarchical structure during aggregation, in contrast to both Euclidean averaging and the standard Einstein midpoint. We validate this empirically through concept-level hierarchy analysis (Section 5.2), showing that models using OEM pooling maintain monotonically increasing radii across semantic specificity levels, a property absent in Euclidean baselines. + +### 4.4. Training Methodology + +We train the hyperbolic encoder in three stages, with all objectives operating directly on the Lorentz manifold using geodesic-based similarity. + +Stage 1: Hyperbolic Masked Language Modeling. We initialize via masked language modeling (MLM), following the standard BERT objective in hyperbolic space. Contextualization is performed through hyperbolic self-attention, with all intermediate representations on the hyperboloid.
Predictions are produced using a Lorentzian multinomial logistic regression (LorentzMLR) (Bdeir et al., 2024) head, which defines class logits via Lorentzian inner products. Only HyTE-FH is trained on MLM, while for HyTE-H we choose a pre-trained Euclidean model as the MLM base to leverage a stronger initialization in low-resource settings. + +Stage 2: Unsupervised Contrastive Pre-Training. We fine-tune the resulting MLM model on query-document pairs by minimizing unsupervised contrastive loss. Similarity is defined as negative geodesic distance \( s\left( {q, d}\right) = \; - {d}_{K}\left( {q, d}\right) \) . The contrastive loss over in-batch negatives is + +\[ +{\mathcal{L}}_{\text{ ctr }} = - \frac{1}{N}\mathop{\sum }\limits_{{i = 1}}^{N}\log \frac{\exp \left( {s\left( {{\mathbf{q}}_{i},{\mathbf{d}}_{i}}\right) /\tau }\right) }{\mathop{\sum }\limits_{{j = 1}}^{N}\exp \left( {s\left( {{\mathbf{q}}_{i},{\mathbf{d}}_{j}}\right) /\tau }\right) }, +\] + +where \( \tau > 0 \) is a temperature parameter. + +Stage 3: Supervised Contrastive Learning Fine-tuning. In the final stage of training, we further fine-tune the encoder using supervised contrastive learning on labeled query-document data. Given a query \( {q}_{i} \) , a set of relevant documents \( {\mathcal{D}}_{i}^{ + } \) , and a set of non-relevant documents \( {\mathcal{D}}_{i}^{ - } \) , the supervised contrastive objective encourages the query representation to be closer to all relevant documents than to non-relevant ones + +\[ +{\mathcal{L}}_{\text{ sup }} = - \frac{1}{N}\mathop{\sum }\limits_{{i = 1}}^{N}\log \frac{\mathop{\sum }\limits_{{{d}^{ + } \in {\mathcal{D}}_{i}^{ + }}}\exp \left( {s\left( {{\mathbf{q}}_{i},{\mathbf{d}}^{ + }}\right) /\tau }\right) }{\mathop{\sum }\limits_{{d \in {\mathcal{D}}_{i}^{ + } \cup {\mathcal{D}}_{i}^{ - }}}\exp \left( {s\left( {{\mathbf{q}}_{i},\mathbf{d}}\right) /\tau }\right) }, +\] + +where \( \tau > 0 \) is a temperature parameter. This stage explicitly aligns hyperbolic distances with supervised relevance signals, refining retrieval behavior beyond unsupervised co-occurrence structure.
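The geodesic-based contrastive objective can be sketched as follows (NumPy, \( K = -1 \), in-batch negatives with the \( i \)-th document as the positive for the \( i \)-th query; helper names and shapes are illustrative, not the paper's implementation):

```python
import numpy as np

def lift(spatial):
    # Embed a spatial vector on the hyperboloid: x0 = sqrt(1 + ||x_s||^2)  (K = -1).
    spatial = np.asarray(spatial, dtype=float)
    return np.concatenate(([np.sqrt(1.0 + spatial @ spatial)], spatial))

def geodesic_similarity(Q, D):
    # s(q, d) = -d_K(q, d) with d_K(q, d) = arccosh(-<q, d>_L) for K = -1.
    inner = -np.outer(Q[:, 0], D[:, 0]) + Q[:, 1:] @ D[:, 1:].T
    return -np.arccosh(np.clip(-inner, 1.0, None))  # clip guards rounding below 1

def contrastive_loss(Q, D, tau=0.1):
    # In-batch InfoNCE: the i-th document is the positive for the i-th query.
    logits = geodesic_similarity(Q, D) / tau
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
Q = np.stack([lift(v) for v in rng.normal(size=(4, 8))])
D = np.stack([lift(v) for v in rng.normal(size=(4, 8))])
print(contrastive_loss(Q, D))  # a non-negative scalar
```

The supervised Stage-3 loss differs only in the numerator, which sums the exponentiated similarities over all relevant documents \( {\mathcal{D}}_{i}^{+} \) instead of a single positive.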
+ +Retrieval-Augmented Generation. At inference time, the trained hyperbolic encoder is used to retrieve the top- \( k \) documents \( \mathcal{C} \) for a given query. These retrieved documents are then provided as context to a downstream generative language model. Prompt formatting and generation follow standard practice and are provided in Appendix B. We present runtime and computational complexity in Appendix D. + +Table 1. Performance on MTEB benchmark. We report mean scores across tasks and task types. HyTE-FH performs best among the three models.
| Model | Mean (Task) | Mean (TaskType) |
| --- | --- | --- |
| EucBERT | 54.11 | 51.31 |
| HyTE-H \( {}^{\text{ Euc }} \) | 54.57 | 53.71 |
| HyTE-FH | **56.41** | **53.75** |
+ +## 5. Experiments and Results + +### 5.1. Experimental Setup + +Datasets. We pre-train our models using publicly available corpora following the data curation and filtering protocols introduced in nomic-embed (Nussbaum et al., 2025). For masked language modeling (MLM), we use the high-quality 2023 Wikipedia dump, which provides broad topical coverage and long-form text suitable for learning general-purpose semantic representations. For contrastive pre-training, we leverage approximately 235 million text pairs curated and filtered as described in (Nussbaum et al., 2025), designed to encourage semantic alignment across paraphrases and related content at scale. Finally, for task-specific fine-tuning, we use the training splits of the BEIR benchmark (Thakur et al., 2021), which comprises a diverse collection of retrieval tasks spanning multiple domains and query styles. + +Evaluation Benchmarks. We evaluate our approach on two complementary benchmarks: (1) the Massive Text Embedding Benchmark (MTEB) (Muennighoff et al., 2023) to assess embedding quality across diverse tasks, and (2) RAGBench (Friel et al., 2024b) for end-to-end RAG system evaluation. In MTEB, we particularly use the English part of the benchmark. RAGBench evaluates RAG systems on domain-specific question-answering datasets including CovidQA, Cuad, Emanual, DelucionQA, and ExpertQA. + +Baselines. We adopt different baseline strategies for our two models based on their training paradigms. For HyTE-FH, which is pre-trained from scratch, we train a fully Euclidean equivalent called EucBERT using the same architecture and training setup. This controlled comparison isolates the contribution of hyperbolic geometry. We also evaluate HyTE-H \( {}^{\mathrm{{Euc}}} \) , a hybrid hyperbolic model initialized with EucBERT. The three models are evaluated on MTEB and RAGBench. 
For HyTE-H \( {}^{\text{ bert }} \) , which is fine-tuned with modernbert-base (Warner et al., 2024) as base model, we compare against state-of-the-art embedding models smaller than 500M parameters, including gte-multilingual-base (Zhang et al., 2024), KaLM-embedding-multilingual-mini-v1 (Hu et al., 2025), and embeddinggemma-300m (Vera et al., 2025). + +Metrics. For MTEB, we report mean scores across tasks and task types. For RAG evaluation, we measure three key metrics using RAGAS (Es et al., 2024): (1) Faithfulness, which assesses whether generated answers are grounded in the retrieved context; (2) Context Relevance, which measures how relevant the retrieved documents are to the query; and (3) Answer Relevance, which evaluates how well the generated answer addresses the user's question. + +Table 2. RAG benchmark results comparing our model variants. + +
| Model | Average | CovidQA | Cuad | Emanual | DelucionQA | ExpertQA |
| --- | --- | --- | --- | --- | --- | --- |
| | F / CR / AR | F / CR / AR | F / CR / AR | F / CR / AR | F / CR / AR | F / CR / AR |
| EucBERT | 0.596 / 0.798 / 0.647 | 0.685 / 0.863 / 0.582 | 0.654 / 0.644 / 0.641 | 0.642 / 0.646 / 0.674 | 0.525 / **0.968** / 0.679 | 0.475 / 0.872 / 0.662 |
| HyTE-H \( {}^{\text{ Euc }} \) | 0.706 / 0.814 / 0.739 | 0.708 / 0.868 / 0.668 | **0.787** / 0.652 / 0.710 | **0.679** / **0.835** / **0.814** | 0.737 / 0.857 / 0.773 | 0.623 / 0.859 / 0.728 |
| HyTE-FH | **0.732** / **0.848** / **0.765** | **0.764** / **0.916** / **0.694** | 0.747 / **0.674** / **0.752** | 0.660 / 0.807 / 0.704 | **0.789** / 0.906 / **0.861** | **0.702** / **0.936** / **0.814** |
+ +\( \mathrm{F} = \) Faithfulness, \( \mathrm{{CR}} = \) Context Relevance, \( \mathrm{{AR}} = \) Answer Relevance. Best results in bold. + +Table 3. RAG benchmark results comparing our hybrid model with state-of-the-art embedding models. HyTE-H demonstrates competitive performance particularly in context relevance and answer relevance. + +
| Model | Average | CovidQA | Cuad | Emanual | DelucionQA | ExpertQA |
| --- | --- | --- | --- | --- | --- | --- |
| | F / CR / AR | F / CR / AR | F / CR / AR | F / CR / AR | F / CR / AR | F / CR / AR |
| ModernBert* | 0.617 / 0.748 / 0.632 | 0.656 / 0.895 / 0.5378 | 0.632 / 0.709 / 0.746 | 0.567 / 0.715 / 0.639 | 0.655 / 0.6657 / 0.5183 | 0.575 / 0.758 / 0.718 |
| GTE | 0.659 / 0.701 / 0.650 | 0.695 / 0.840 / 0.538 | 0.733 / 0.599 / 0.779 | 0.546 / 0.608 / 0.686 | 0.648 / 0.725 / 0.549 | 0.672 / 0.731 / 0.698 |
| Gemma | 0.603 / 0.735 / 0.684 | 0.685 / 0.760 / 0.497 | 0.724 / 0.600 / 0.778 | 0.555 / 0.884 / 0.687 | 0.612 / 0.643 / 0.705 | 0.442 / 0.791 / 0.755 |
| KaLM-mini-v1 | 0.624 / 0.719 / 0.591 | 0.656 / 0.787 / 0.528 | 0.742 / **0.789** / 0.716 | 0.565 / 0.776 / 0.616 | 0.553 / 0.581 / 0.573 | 0.607 / 0.666 / 0.522 |
| HyTE-H \( {}^{\text{ bert }} \) | **0.763** / **0.904** / **0.832** | **0.797** / **0.974** / **0.755** | **0.760** / 0.683 / **0.804** | **0.688** / **0.943** / **0.899** | **0.829** / **0.965** / **0.871** | **0.739** / **0.958** / **0.834** |
+ +\( \mathrm{F} = \) Faithfulness, \( \mathrm{{CR}} = \) Context Relevance, \( \mathrm{{AR}} = \) Answer Relevance. Best results in bold. + +Implementation. We implement all hyperbolic models using HyperCore (He et al., 2025e) and train on NVIDIA H100 GPUs. All three models, HyTE-FH, HyTE-H, and EucBERT, share the same architecture, each containing 149M parameters with 12 transformer layers and 768-dimensional embeddings. For generation and judging, we use Llama-3.1-8B-Instruct (Weerawardhena et al., 2025). For RAG benchmarks, we fix the retrieval context window size to 5 for all models to ensure a controlled comparison; we additionally report ablations with larger context sizes in Appendix Table A3. + +### 5.2. Results + +MTEB Benchmark. Table 1 reports performance on the MTEB benchmark. HyTE-FH achieves the highest mean score across tasks (56.41), outperforming both EucBERT (54.11) and HyTE-H \( {}^{\mathrm{{Euc}}} \) (54.57). On the task-type mean, HyTE-FH and HyTE-H \( {}^{\mathrm{{Euc}}} \) perform comparably (53.75 and 53.71, respectively), with both surpassing EucBERT (51.31). These results demonstrate that hyperbolic representations not only improve RAG retrieval but also remain competitive on general-purpose embedding benchmarks. We present task-wise results in Table A1. + +RAG Benchmark Results. Table 2 presents RAG benchmark results across five datasets. HyTE-FH achieves the best average performance across all three metrics: faithfulness (0.732), context relevance (0.848), and answer relevance (0.765). HyTE-H \( {}^{\mathrm{{Euc}}} \) ranks second overall, with both hyperbolic variants substantially outperforming EucBERT. On individual datasets, HyTE-FH leads on CovidQA, Cuad, DelucionQA, and ExpertQA, while HyTE-H \( {}^{\text{ Euc }} \) achieves the best context and answer relevance on Emanual. These results demonstrate that hyperbolic geometry consistently improves retrieval quality for RAG across diverse domains.
+ +Table 3 reports RAG performance across five datasets. HyTE-H \( {}^{\text{ bert }} \) consistently outperforms strong Euclidean embedding baselines across all metrics, with particularly large gains in context relevance and answer relevance. These improvements indicate that hyperbolic representations are more effective at retrieving structurally relevant evidence, which is critical for downstream generation quality in RAG pipelines. In qualitative case studies shown in Appendix E.1, we observe that Euclidean models frequently fail to retrieve key supporting passages altogether, whereas hyperbolic models recover relevant evidence more reliably, leading to more faithful and contextually grounded answers. + +Concept-Level Hierarchy Analysis. A central motivation for hyperbolic embeddings is their capacity to preserve hierarchical relationships (Section 4.2). To understand how models capture document hierarchy, we analyze learned radii (distances from the origin in the Poincaré ball) across five hierarchical levels: from Level 1 (most general, e.g., document-level topics) to Level 5 (most specific, e.g., fine-grained entities). Figure 4 presents these results. The fully hyperbolic model demonstrates clear hierarchical organization, with radii increasing monotonically from Level 1 (2.902) to Level 5 (3.488, +20.2%). This shows the model naturally places general concepts near the origin and specific details toward the boundary, consistent with hyperbolic geometry, where proximity to the origin represents generality. Euclidean models show flat or decreasing distributions: baselines either maintain constant norms across levels or decrease norms by \( {30}\% \) , reflecting an inverted structure. Hybrid models exhibit substantially larger radii from the hyperbolic component. The fine-tuned hybrid increases from 116.9 to 146.7, showing that fine-tuning induces structured hierarchy. We have attached the dataset for this case study in the supplementary material.
The concept-level hierarchy data is available in Appendix C. + +![bo_d6nbcqk601uc73e2hscg_7_156_187_1438_841_0.jpg](images/bo_d6nbcqk601uc73e2hscg_7_156_187_1438_841_0.jpg) + +Figure 4. Empirical validation of hierarchical encoding. Left: Euclidean models show flat or decreasing norms. Middle: HyTE-H demonstrates increasing norms, with fine-tuning enhancing this trend. Right: HyTE-FH achieves +20.2% total increase from L1 to L5. Bottom: Normalized comparison and percent change summary highlighting the contrasting behaviors of different geometric approaches. + +Ablation Studies. We compare two pooling strategies for aggregating token embeddings into document representations: CLS token pooling and OEM pooling. CLS pooling uses the representation of a special classification token, while OEM pooling performs geometry-aware aggregation directly in hyperbolic space. Table 4 shows that OEM pooling yields higher performance across both mean task and mean task-type metrics on MTEB retrieval tasks, indicating more effective document-level aggregation in the hyperbolic setting. We also show that using geodesic distance in the contrastive objective outperforms the Lorentz inner product (Appendix Table A2), suggesting better alignment of representations on the manifold. Additionally, hyperbolic models maintain strong performance with smaller retrieval budgets, whereas Euclidean baselines require larger context windows to achieve comparable results (Appendix Table A3). + +Table 4. Comparison of pooling strategies on MTEB tasks. OEM pooling leverages hyperbolic geometry for improved performance.
| Pooling Strategy | Mean (Task) | Mean (TaskType) |
| --- | --- | --- |
| CLS Token | 49.33 | 48.90 |
| OEM | **56.41** | **53.75** |
+ +## 6. Conclusion + +We introduced hyperbolic dense retrieval for RAG, showing that aligning embedding geometry with the hierarchical structure of language improves faithfulness and answer quality. Our approach preserves document-level structure during aggregation through a geometry-aware pooling operator, addressing a key failure mode of Euclidean retrieval pipelines. Across evaluations, we observe consistent gains using models substantially smaller than current state-of-the-art retrievers, highlighting the effectiveness of hyperbolic inductive bias over scale alone. Case studies further show that hyperbolic representations organize documents by specificity through norm-based separation, a property absent in Euclidean embeddings. These findings suggest that embedding geometry is a central design choice for reliable retrieval in RAG systems, with implications for future scalable and multimodal retrieval architectures. diff --git a/参考论文/geo-graph/HyperRAG.md b/参考论文/geo-graph/HyperRAG.md new file mode 100644 index 0000000..ba786d4 --- /dev/null +++ b/参考论文/geo-graph/HyperRAG.md @@ -0,0 +1,353 @@ +# HyperRAG: Reasoning N-ary Facts over Hypergraphs for Retrieval Augmented Generation + +Wen-Sheng Lien + +National Yang Ming Chiao Tung + +University + +Hsinchu, Taiwan + +vincentlien.ii13@nycu.edu.tw + +Yu-Kai Chan + +National Yang Ming Chiao Tung + +University + +Hsinchu, Taiwan + +ctw33888.ee13@nycu.edu.tw + +Hao-Lung Hsiao + +National Yang Ming Chiao Tung + +University + +Hsinchu, Taiwan + +hlhsiao.cs13@nycu.edu.tw + +Bo-Kai Ruan + +National Yang Ming Chiao Tung + +University + +Hsinchu, Taiwan + +bkruan.ee11@nycu.edu.tw + +Meng-Fen Chiang + +National Yang Ming Chiao Tung + +University + +Hsinchu, Taiwan + +meng.chiang@nycu.edu.tw + +Chien-An Chen + +E.SUN Bank + +Taipei, Taiwan + +lukechen-15953@esunbank.com + +Yi-Ren Yeh + +National Kaohsiung Normal + +University + +Kaohsiung, Taiwan + +yryeh@nknu.edu.tw + +Hong-Han Shuai + +National Yang Ming Chiao Tung + 
+University + +Hsinchu, Taiwan + +hhshuai@nycu.edu.tw + +## Abstract + +Graph-based Retrieval-Augmented Generation (RAG) typically operates on binary Knowledge Graphs (KGs). However, decomposing complex facts into binary triples often leads to semantic fragmentation and longer reasoning paths, increasing the risk of retrieval drift and computational overhead. In contrast, \( n \) -ary hypergraphs preserve high-order relational integrity, enabling shallower and more semantically cohesive inference. To exploit this topology, we propose HyperRAG, a framework tailored for \( n \) -ary hypergraphs featuring two complementary retrieval paradigms: (i) HyperRetriever learns structural-semantic reasoning over \( n \) -ary facts to construct query-conditioned relational chains. It enables accurate factual tracking, adaptive high-order traversal, and interpretable multi-hop reasoning under context constraints. (ii) HyperMemory leverages the LLM's parametric memory to guide beam search, dynamically scoring \( n \) -ary facts and entities for query-aware path expansion. Extensive evaluations on WikiTopics (11 closed-domain datasets) and three open-domain QA benchmarks (HotpotQA, MuSiQue, and 2WikiMultiHopQA) validate HyperRAG's effectiveness. HyperRetriever achieves the highest answer accuracy overall, with average gains of 2.95% in MRR and 1.23% in Hits@10 over the strongest baseline. Qualitative analysis further shows that HyperRetriever bridges reasoning gaps through adaptive and interpretable \( n \) -ary chain construction, benefiting both open and closed-domain QA. Our codes are publicly available at https://github.com/Vincent-Lien/HyperRAG.git. + +## CCS Concepts + +- Information systems \( \rightarrow \) Retrieval models and ranking; Language models; Question answering.
+ +## Keywords + +Hypergraph-based Retrieval-Augmented Generation, N-ary Relational Knowledge Graphs, Multi-hop Question Answering, Memory-Guided Adaptive Retrieval + +## ACM Reference Format: + +Wen-Sheng Lien, Yu-Kai Chan, Hao-Lung Hsiao, Bo-Kai Ruan, Meng-Fen Chiang, Chien-An Chen, Yi-Ren Yeh, and Hong-Han Shuai. 2026. HyperRAG: Reasoning N-ary Facts over Hypergraphs for Retrieval Augmented Generation. In Proceedings of the ACM Web Conference 2026 (WWW '26), April 13-17, 2026, Dubai, United Arab Emirates. ACM, New York, NY, USA, 12 pages. https://doi.org/10.1145/3774904.3792710 + +## 1 Introduction + +Retrieval-Augmented Generation (RAG) has established itself as a critical mechanism for augmenting Large Language Models (LLMs) with non-parametric external knowledge during inference [12, 17, 19, 20]. By dynamically retrieving verifiable information from external corpora without the need for extensive fine-tuning, RAG effectively mitigates intrinsic LLM limitations such as hallucinations and temporal obsolescence. This paradigm has proven particularly transformative for knowledge-intensive tasks, including open-domain question answering (QA), fact verification, and complex information extraction, driving significant innovation across both academia and industry. + +Current RAG methodologies broadly fall into three categories: document-based, graph-based, and hybrid approaches. Document-based methods utilize dense vector retrieval to match queries with textual segments, offering scalability but often failing to capture complex structural dependencies [5, 6]. Conversely, graph-based methods leverage Knowledge Graphs (KGs) to explicitly model relationships, enabling multi-hop reasoning over structured data [15, 31]. Hybrid approaches attempt to bridge these paradigms, balancing comprehensiveness with efficiency. However, despite the reasoning potential of graph-based methods, the prevailing reliance on binary KGs presents fundamental topological limitations.
+ +![bo_d6nbbuc601uc73e2hrig_1_187_247_648_687_0.jpg](images/bo_d6nbbuc601uc73e2hrig_1_187_247_648_687_0.jpg) + +Figure 1: Structural Comparison of (a) Knowledge Graphs and (b) Hypergraphs. For a given question \( q \) ,(a) requires 3-hop reasoning over binary facts, while (b) enables single-hop inference via an \( n \) -ary relational fact, yielding a more compact and expressive multi-entity representation. + +Traditional graph-based RAG methods predominantly rely on binary knowledge graphs, which suffer from notable limitations when applied to closed-domain question-answering scenarios. Specifically, binary KG approaches encounter two fundamental structural limitations. First, Semantic Fragmentation arises because binary relations limit the expressiveness required to capture complex multi-entity interactions, forcing the decomposition of holistic facts into disjoint triples that fail to represent intricate semantic nuances. Second, this fragmentation leads to Path Explosion, where conventional approaches incur significant computational costs due to the need for deep traversals over the vast binary relation space to reconnect these facts, enabling error propagation and undermining real-world practicality [18, 37]. To address these limitations, recent work advocates hypergraphs for structured retrieval in RAG. Hypergraphs natively encode higher-order ( \( n \) -ary) relations that bind multiple entities and roles, providing a richer semantic substrate than binary graphs [26]. As illustrated in Figure 1, the Path Explosion issue is evident when answering a question grounded on the topic entity "Bruce Seth Green," which requires a 3-hop binary traversal on a standard KG. In contrast, this reduces to a single hop through an \( n \) -ary relation in a hypergraph, yielding a more compact representation. 
Hypergraphs enable the direct modeling of higher-order relational chains, effectively mitigating Semantic Fragmentation and reducing the reasoning steps required to capture complex dependencies. + +Motivated by these insights, we introduce HyperRAG, an innovative retrieval-augmented generation framework designed explicitly for reasoning over \( n \) -ary hypergraphs. HyperRAG integrates two novel adaptive retrieval variants: (i) HyperRetriever, which uses a multilayer perceptron (MLP) to fuse structural and semantic embeddings, constructing query-conditioned relational chains that enable accurate and interpretable evidence aggregation within context and token constraints; and (ii) HyperMemory, which leverages the parametric memory of an LLM to guide beam search, dynamically scoring \( n \) -ary facts and entities for query-adaptive path expansion. By combining higher-order reasoning with shallower yet more expressive chains, HyperRAG locates key evidence without deep multi-hop traversal. Replacing the \( n \) -ary structure with a binary one reduces the average MRR from 36.45% to 34.15% and the average Hits@10 from 40.59% to 36.82% (Table 3), confirming that the \( n \) -ary representation contributes directly to response quality. + +Our key contributions are summarized as follows. + +- We propose HyperRAG, a pioneering framework that shifts the graph-RAG paradigm from binary triples to \( n \) -ary hypergraphs, tackling the issues of semantic fragmentation and path explosion. + +- We introduce HyperRetriever, a trainable MLP-based retrieval module that fuses structural and semantic signals to extract precise, interpretable evidence chains with low latency. + +- We develop HyperMemory, a synergistic retrieval approach that utilizes LLM parametric knowledge to guide symbolic beam search over hypergraphs for complex, query-adaptive reasoning.
+ +- Extensive evaluation across closed-domain and open-domain benchmarks demonstrates that HyperRAG consistently outperforms strong baselines, offering a superior trade-off between retrieval accuracy, reasoning interpretability, and system latency. + +## 2 Preliminaries + +### 2.1 Background + +Definition 2.1 ( \( n \) -ary Relational Knowledge Graph). An \( n \) -ary relational knowledge graph, or hypergraph, represents relational facts involving two or more entities and one or more relations. Formally, following the definition in [43], a hypergraph is defined as \( \mathcal{G} = \left( {\mathcal{E},\mathcal{R},\mathcal{F}}\right) \) , where \( \mathcal{E} \) denotes the set of entities, \( \mathcal{R} \) denotes the set of relations, and \( \mathcal{F} \) the set of \( n \) -ary relational facts (hyperedges). Each \( n \) -ary fact \( {f}^{n} \in \mathcal{F} \) , which consists of two or more entities, is represented as: \( {f}^{n} = {\left\{ {e}_{i}\right\} }_{i = 1}^{n} \) , where \( {\left\{ {e}_{i}\right\} }_{i = 1}^{n} \subseteq \mathcal{E} \) is a set of \( n \) entities with \( n \geq 2 \) . + +Unlike binary knowledge graphs, the \( n \) -ary representation inherently captures higher-order relational dependencies among multiple entities. \( n \) -ary relations cannot be faithfully decomposed into combinations of binary relations without losing structural integrity or introducing ambiguity in semantic interpretation [1, 9, 35]. We formalize faithful reduction and show that any straightforward binary scheme violates at least one of: (i) recoverability of the original tuples, (ii) role preservation, or (iii) multiplicity of co-participations. Please refer to Appendix A for more details on the recoverability of role-preserving hypergraph reduction, roles, and multiplicity. + +### 2.2 Problem Formulation + +Problem (Hypergraph-based RAG).
Given a question \( q \) , a hyper-graph \( \mathcal{G} \) representing \( n \) -ary relational structures, and a collection of source documents \( \mathcal{D} \) , the goal of hypergraph-based retrieval-augmented generation (RAG) is to generate faithful and contextually grounded answers \( a \) by leveraging salient multi-hop relational chains from \( \mathcal{G} \) and extracting relevant textual evidence from \( \mathcal{D} \) . + +Complexity: Native \( n \) -ary Hypergraph Retrieval. Let \( {N}_{e} = \left| \mathcal{E}\right| \) , \( {N}_{f} = \left| \mathcal{F}\right| \) , and \( \bar{n} \) be the average arity. A query binds \( k \) role-typed arguments, \( q = {\left\{ \left( {r}_{i} : {a}_{i}\right) \right\} }_{i = 1}^{k} \) , and asks for the remaining \( n - k \) roles. We maintain sorted posting lists over role incidences, \( \mathcal{P}\left( {r : a}\right) = \; \{ f \in \mathcal{F} : \left( {r : a}\right) \in f\} \) , with length \( d\left( {r : a}\right) \) . To answer \( q \) , the \( n \) -ary based retriever intersects the \( k \) posting lists by hyperedge IDs and reads the missing roles from each surviving hyperedge. Let \( {n}^{ \star } \) be the (max/avg) arity among matches. The running time is given by: + +\[ +{T}_{\mathrm{{HYP}}}\left( q\right) = O\left( {\mathop{\sum }\limits_{{i = 1}}^{k}d\left( {{r}_{i} : {a}_{i}}\right) + \text{ out }}\right) , \tag{1} +\] + +where out is the number of matching facts. In typical schemas, the relation arity is often bounded by a small constant (e.g., triadic, \( n \leq 3 \) ). As a result, for each match the retriever touches exactly one hyperedge record to materialize the unbound roles, yielding per-output overhead \( O\left( 1\right) \) . + +Complexity: Standard Binary KG Retrieval. Suppose each \( n \) - ary fact \( f \) is reified as an event node \( {e}_{f} \) with \( n \) role-typed binary edges (e.g., \( {\operatorname{role}}_{j}\left( {{e}_{f},{a}_{j}}\right) \) ). 
For each binding \( \left( {{r}_{i} : {a}_{i}}\right) \) , use the posting list of event IDs \( {\mathcal{P}}_{\text{ event }}\left( {{r}_{i} : {a}_{i}}\right) \) and intersect the \( k \) lists to obtain candidate events, mirroring the hypergraph intersection. For each surviving \( {e}_{f} \) , follow its remaining \( \left( {n - k}\right) \) role-edges to materialize unbound arguments. Let \( {d}_{\text{ event }}\left( {r : a}\right) = \left| {{\mathcal{P}}_{\text{ event }}\left( {r : a}\right) }\right| \) and let \( {n}^{ \star } \) be the (max/avg) arity over matches. The running time is given by: + +\[ +{T}_{\mathrm{{BIN}}}\left( q\right) = O\left( {\mathop{\sum }\limits_{{i = 1}}^{k}{d}_{\text{ event }}\left( {{r}_{i} : {a}_{i}}\right) + \text{ out } \cdot \left( {{n}^{ \star } - k}\right) }\right) . \tag{2} +\] + +Under a schema-bounded arity, the per-result overhead is up to \( \bar{n} \) role lookups to materialize the remaining arguments. In contrast, the hypergraph returns them from a single record. + +Complexity Gap. In a native hypergraph, all arguments of an \( n \) -ary fact co-reside in a single hyperedge record; thus materializing a hit is one read, i.e., \( O\left( 1\right) \) per result under bounded arity. In contrast, in an event-reified binary KG, the fact is split across \( n \) role-typed edges, reachable only via the intermediate event node \( {e}_{f} \) . As a result, materializing requires up to \( \left( {n - k}\right) \) pointer chases, yielding the out \( \cdot \bar{n} \) term, and usually incurs extra indirections/cache misses. + +## 3 Methodology + +We propose HyperRAG, a novel framework that enhances answer fidelity by integrating reasoning over condensed \( n \) -ary relational facts with textual evidence.
As depicted in Figure 2, HyperRAG features two retrieval paradigms: (i) HyperRetriever, which performs adaptive structural-semantic traversal to build interpretable, query-conditioned relational chains; (ii) HyperMemory, which utilizes the parametric knowledge of the LLM to guide symbolic beam search. Both variants ground the generation process in hypergraph structures, ensuring faithful and accurate multi-hop reasoning. + +![bo_d6nbbuc601uc73e2hrig_2_950_260_682_816_0.jpg](images/bo_d6nbbuc601uc73e2hrig_2_950_260_682_816_0.jpg) + +Figure 2: The overall framework of HyperRAG. + +### 3.1 HyperRetriever: Relational Chains Learning + +The motivation behind learning to extract fine-grained \( n \) -ary relational chains over hypergraph structures stems from two key challenges: (i) the well-documented tendency of LLMs to hallucinate factual content and (ii) the vast combinatorial search space of hypergraphs under limited token and context budgets [25]. To mitigate these challenges, we introduce a lightweight yet expressive retriever that integrates structural and semantic cues to rank salient \( n \) -ary facts aligned with query intent. + +3.1.1 Topic Entity Extraction. The purpose of obtaining the topic entity is to ground the query semantics onto the hypergraph \( \mathcal{G} \) . Formally, given a query \( q \) , we prompt an LLM with \( {p}_{\text{ topic }} \) to identify the set of topic entities that appear in \( q \) as follows: + +\[ +{\mathcal{E}}_{q} = \operatorname{LLM}\left( {{p}_{\text{ topic }}, q}\right) +\] + +where \( {\mathcal{E}}_{q} \) denotes the set of extracted entities in the query \( q \) . + +3.1.2 Hyperedge Retrieval and Triple Formation. For each extracted topic entity \( {e}_{s} \in {\mathcal{E}}_{q} \) , we retrieve its incident hyperedges from \( \mathcal{F} \) , formally defined as follows: + +\[ +{\mathcal{F}}_{{e}_{s}} = \left\{ {{f}^{n} \in \mathcal{F} : {e}_{s} \in {f}^{n}}\right\} .
+\] + +Each hyperedge \( {f}^{n} \in {\mathcal{F}}_{{e}_{s}} \) defines an \( n \) -ary relation over a subset of \( n \) entities. To enable pairwise reasoning, we derive a set of pseudo-binary triples by enumerating ordered entity pairs within each hyperedge for query \( q \) as follows: + +\[ +{\mathcal{T}}_{q} = \left\{ {\left( {{e}_{h},{f}^{n},{e}_{t}}\right) \mid {f}^{n} \in {\mathcal{F}}_{{e}_{s}},{e}_{h} \in {f}^{n},{e}_{t} \in {f}^{n}}\right\} , \tag{3} +\] + +where each pseudo-binary triple \( \left( {{e}_{h},{f}^{n},{e}_{t}}\right) \) consists of a head entity, the originating hyperedge, and a tail entity. + +3.1.3 Structural Proximity Encoding. To capture the structural proximity between entities in the hypergraph, we adapt the directional distance encoding (DDE) mechanism from SubGraphRAG [21], extending it from binary relations to \( n \) -ary hyperedges. Formally, for each candidate triple \( \left( {{e}_{h},{f}^{n},{e}_{t}}\right) \in {\mathcal{T}}_{q} \) , we compute its directional encoding in the following steps: + +- One-Hot Initialization: For each candidate triple \( \left( {{e}_{h},{f}^{n},{e}_{t}}\right) \) , we initialize a one-hot indicator for the head entity: + +\[ +{s}_{e}^{\left( 0\right) } = \left\{ \begin{array}{ll} 1, & \text{ if }\exists \left( {{e}_{h},{f}^{n},{e}_{t}}\right) \in {\mathcal{T}}_{q}\text{ such that }e = {e}_{h}, \\ 0, & \text{ otherwise. } \end{array}\right. \tag{4} +\] + +- Bi-directional Feature Propagation: For each layer \( l = 0,\ldots , L \) , we propagate features over the set of derived triples \( {\mathcal{T}}_{q} \) .
Forward propagation simulates how the head entity \( {e}_{h} \) reaches out to the tail entity \( {e}_{t} \) as follows:
+
+\[
+{s}_{e}^{\left( l + 1\right) } = \frac{1}{\left| \left\{ {e}^{\prime } \mid \left( {e}^{\prime },\cdot , e\right) \in {\mathcal{T}}_{q}\right\} \right| }\mathop{\sum }\limits_{{\left( {{e}^{\prime },\cdot , e}\right) \in {\mathcal{T}}_{q}}}{s}_{{e}^{\prime }}^{\left( l\right) }. \tag{5}
+\]
+
+In contrast, backward propagation updates head encodings based on tail-to-head influence:
+
+\[
+{s}_{e}^{\left( r, l + 1\right) } = \frac{1}{\left| \left\{ {e}^{\prime } \mid \left( e,\cdot ,{e}^{\prime }\right) \in {\mathcal{T}}_{q}\right\} \right| }\mathop{\sum }\limits_{{\left( {e,\cdot ,{e}^{\prime }}\right) \in {\mathcal{T}}_{q}}}{s}_{{e}^{\prime }}^{\left( r, l\right) }. \tag{6}
+\]
+
+- Bi-directional Encoding: After \( L \) rounds of propagation, we concatenate the forward and backward encodings to obtain the final vector for each entity \( e \) as follows:
+
+\[
+{s}_{e} = \left\lbrack {{s}_{e}^{\left( 0\right) }\parallel {s}_{e}^{\left( 1\right) }\parallel \cdots \parallel {s}_{e}^{\left( L\right) }\parallel {s}_{e}^{\left( r,1\right) }\parallel \cdots \parallel {s}_{e}^{\left( r, L\right) }}\right\rbrack , \tag{7}
+\]
+
+where \( \parallel \) denotes vector concatenation. Note that the backward propagation starts from \( l = 1 \) , as \( l = 0 \) is shared in both directions.
+
+- Triple Encoding: For each candidate triple \( \left( {{e}_{h},{f}^{n},{e}_{t}}\right) \) , we define its structural proximity encoding as follows:
+
+\[
+\delta \left( {{e}_{h},{f}^{n},{e}_{t}}\right) = \left\lbrack {{s}_{{e}_{h}}\parallel {s}_{{e}_{t}}}\right\rbrack , \tag{8}
+\]
+
+which is passed to a lightweight parametric neural function to compute the plausibility score for each candidate triple \( \left( {{e}_{h},{f}^{n},{e}_{t}}\right) \) given query \( q \) .
+
+3.1.4 Contrastive Plausibility Scoring.
Similarity-based retrieval often introduces noisy or irrelevant triples, needlessly inflating the search space over the hypergraph structure. To mitigate this, we train a lightweight MLP classifier \( {f}_{\theta } \) to estimate the plausibility of each triple candidate and prune uninformative ones.
+
+To this end, the training set is prepared with positive and negative samples. Let \( {P}_{q}^{ * } \) denote the shortest path of triples connecting the topic entity to a correct answer in the hypergraph \( \mathcal{G} \) . The positive samples \( {\mathcal{T}}_{i}^{ + } \) at hop \( i \) consist of triples in \( {P}_{q}^{ * } \) , denoted as \( {\mathcal{T}}_{i}^{ + } = \left\{ \left( {{e}_{h, i},{f}_{i}^{n},{e}_{t, i}}\right) \right\} \) . Negative samples \( {\mathcal{T}}_{i}^{ - } \) consist of all other triples incident to the head entity \( {e}_{h, i} \) at hop \( i \) that are not in \( {P}_{q}^{ * } \) . At each exploration step, only positive triples are expanded at each hop, while negative ones are excluded. Each triple \( \left( {{e}_{h},{f}^{n},{e}_{t}}\right) \) is encoded into a feature vector by concatenating its contextual and structural encodings:
+
+\[
+\mathbf{x} = \left\lbrack {\varphi \left( q\right) \parallel \varphi \left( {e}_{h}\right) \parallel \varphi \left( {f}^{n}\right) \parallel \varphi \left( {e}_{t}\right) \parallel \delta \left( {{e}_{h},{f}^{n},{e}_{t}}\right) }\right\rbrack , \tag{9}
+\]
+
+where \( \varphi \) denotes an embedding model that maps the textual content of the query \( q \) , head entity \( {e}_{h} \) , hyperedge \( {f}^{n} \) , and tail entity \( {e}_{t} \) of the candidate pseudo-binary triple into vector representations.
The classifier outputs a plausibility score \( {f}_{\theta }\left( \mathbf{x}\right) \in \left\lbrack {0,1}\right\rbrack \) , trained using binary cross-entropy as follows:
+
+\[
+\mathcal{L} = - \frac{1}{N}\mathop{\sum }\limits_{{i = 1}}^{N}\left\lbrack {{y}_{i}\log \left( {{f}_{\theta }\left( {\mathbf{x}}_{i}\right) }\right) + \left( {1 - {y}_{i}}\right) \log \left( {1 - {f}_{\theta }\left( {\mathbf{x}}_{i}\right) }\right) }\right\rbrack . \tag{10}
+\]
+
+3.1.5 Adaptive Search. At inference time, we initiate the retrieval process with the initial triples of the topic entities and compute their plausibility scores using the trained MLP, \( {f}_{\theta }\left( \mathbf{x}\right) \) . Triples exceeding a plausibility threshold \( \tau \) are retained, and their tail entities are used as frontier entities in the next hop. This expansion-filtering cycle continues until no new triples satisfy the threshold. However, using a fixed threshold \( \tau \) can be problematic: it may be too strict in sparse hypergraphs, limiting retrieval, or too lenient in dense hypergraphs, leading to an overload of irrelevant triples. To mitigate this, we implement an adaptive thresholding strategy. We initialize with \( {\tau }_{0} = {0.5} \) , allow a maximum of \( {N}_{\max } = 5 \) threshold reductions, and define \( M = {50} \) as the minimum acceptable number of hyperedges per hop. At hop \( i \) , we retrieve the set of triples \( {\mathcal{T}}_{q, \geq {\tau }_{j}} = \left\{ {\left( {{e}_{h},{f}^{n},{e}_{t}}\right) \mid {f}_{\theta }\left( \mathbf{x}\right) \geq {\tau }_{j}}\right\} \) under the current threshold \( {\tau }_{j} \) . If \( \left| {\mathcal{T}}_{q, \geq {\tau }_{j}}\right| < M \) , we iteratively reduce the threshold as follows:
+
+\[
+{\tau }_{j + 1} = {\tau }_{j} - c,\;j = 0,\ldots ,{N}_{\max } - 1, \tag{11}
+\]
+
+where \( c = {0.1} \) is the decay step.
This process continues until \( \left| {\mathcal{T}}_{q, \geq {\tau }_{j}}\right| \geq M \) or the reduction limit is reached. To further adapt to structural variations in the hypergraph, we incorporate a density-aware thresholding policy. Given the density of the hypergraph \( \Delta \left( \mathcal{G}\right) \) and the predefined lower and upper bounds \( {\Delta }_{\text{ lo }} \) and \( {\Delta }_{\text{ up }} \) , we classify the hypergraph into a density mode \( {\mathcal{M}}_{\mathcal{G}} \) that steers the subsequent retrieval strategy, balancing coverage and precision as follows:
+
+\[
+{\mathcal{M}}_{\mathcal{G}} = \left\{ \begin{array}{ll} {\mathcal{M}}_{\text{ low }}, & \Delta \left( \mathcal{G}\right) \leq {\Delta }_{\mathrm{{lo}}}, \\ {\mathcal{M}}_{\text{ mid }}, & {\Delta }_{\mathrm{{lo}}} < \Delta \left( \mathcal{G}\right) \leq {\Delta }_{\mathrm{{up}}}, \\ {\mathcal{M}}_{\text{ high }}, & \Delta \left( \mathcal{G}\right) > {\Delta }_{\mathrm{{up}}}. \end{array}\right. \tag{12}
+\]
+
+After convergence or exhaustion of threshold reduction attempts, the retrieval strategy is adjusted based on the assigned graph density category. For low-density graphs \( \left( {\mathcal{M}}_{\text{ low }}\right) \) , the retriever selects from previously discarded triples those that satisfy the final plausibility threshold. For medium- and high-density graphs \( \left( {\mathcal{M}}_{\text{ mid }}\right. \) and \( \left. {\mathcal{M}}_{\text{ high }}\right) \) , the strategy additionally expands from the tail entities of these newly accepted triples to increase the depth of reasoning. This density-aware adjustment prevents over-retrieval in sparse graphs while enabling deeper and broader exploration in dense graphs. To further control expansion in high-density settings, where the number of candidate hyperedges may become excessive, we impose an upper bound on the number of retrieved triples per hop.
This constraint effectively limits entity expansion, accelerates retrieval, and reduces the inclusion of low-utility information.
+
+3.1.6 Budget-aware Contextualized Generator. After completion of the retrieval process, we organize the selected elements into a structured input for the generator. Following the context layout protocol of HyperGraphRAG [25], we include (i) entities and their associated descriptions, (ii) hyperedges along with their participating entities, and (iii) supporting source text chunks linked to each entity or hyperedge. Due to input length constraints, we prioritize components based on their utility. As shown in the ablation study of HyperGraphRAG, n-ary relational facts (i.e., hyperedges) contribute the most to reasoning performance, followed by entities and then source text. We therefore allocate the token budget accordingly: 50% for hyperedges, 30% for entities, and 20% for source chunks. To further maximize informativeness, we order hyperedges and entities according to their plausibility scores \( {f}_{\theta }\left( \cdot \right) \) , with graph connectivity as a secondary criterion. The selected components are then sequentially filled in the order: hyperedges, entities, and source chunks. Components are filled in priority order, and any unused budget is passed to the next category. The resulting context, together with the original query \( q \) , is then passed to the LLM to generate the final answer as follows:
+
+\[
+\text{Answer} \mathrel{\text{ := }} \operatorname{LLM}\left( {\text{ Context }, q}\right) . \tag{13}
+\]
+
+### 3.2 HyperMemory: Relational Chain Extraction
+
+To improve interpretability and context awareness in path retrieval, we replace naive top- \( k \) heuristics with LLM-guided scoring that leverages the model's parametric memory to assess the salience of hyperedges and entities.
This enables retrieval to be guided by contextual priors and query intent, facilitating more targeted and meaningful relational exploration.
+
+3.2.1 Memory-Guided Beam Retriever. Specifically, we design beam search with width \( w = 3 \) and depth \( d = 3 \) , where \( w \) denotes the number of top-ranked paths retained at each iteration, and \( d \) specifies the maximum number of expansion steps. Following the procedure of the learnable relational chain retriever (Section 3.1), we begin by identifying the set of topic entities \( {\mathcal{E}}_{q} \) from the input query \( q \) using an LLM-based entity extractor. For each topic entity \( {e}_{s} \in {\mathcal{E}}_{q} \) , we retrieve its incident hyperedge set \( {\mathcal{F}}_{{e}_{s}} \) . Each hyperedge \( {f}^{n} \in {\mathcal{F}}_{{e}_{s}} \) is scored for relevance to both \( {e}_{s} \) and \( q \) using a prompt \( {p}_{\text{ edge }} \) :
+
+\[
+{\mathcal{S}}_{\mathcal{F}}\left( {{f}^{n} \mid {e}_{s}, q}\right) \sim \operatorname{LLM}\left( {{p}_{\text{ edge }},{e}_{s},{f}^{n}, q}\right) . \tag{14}
+\]
+
+We retain the top- \( w \) hyperedges, denoted \( {\mathcal{F}}_{{e}_{s}}^{ + } \) , based on the score \( {\mathcal{S}}_{\mathcal{F}}\left( \cdot \right) \) . Next, for each \( {f}^{n} \in {\mathcal{F}}_{{e}_{s}}^{ + } \) , we identify unvisited tail entities \( {e}_{t} \) and score their relevance using a second prompt \( {p}_{\text{ entity }} \) :
+
+\[
+{\mathcal{S}}_{\mathcal{E}}\left( {{e}_{t} \mid {f}^{n}, q}\right) \sim \operatorname{LLM}\left( {{p}_{\text{ entity }},{f}^{n},{e}_{t}, q}\right) . \tag{15}
+\]
+
+Next, each resulting candidate triple \( \left( {{e}_{s},{f}^{n},{e}_{t}}\right) \) receives a weighted composite score as follows:
+
+\[
+\mathcal{S}\left( {{e}_{s},{f}^{n},{e}_{t}}\right) = {\mathcal{S}}_{\mathcal{F}}\left( {{f}^{n} \mid {e}_{s}, q}\right) \cdot {\mathcal{S}}_{\mathcal{E}}\left( {{e}_{t} \mid {f}^{n}, q}\right) .
\tag{16}
+\]
+
+From the current set of candidate triples, we retain the top- \( w \) based on the final triple scorer \( \mathcal{S}\left( \cdot \right) \) . The tail entities of these selected paths define the next expansion frontier. At each depth \( i \) , we evaluate whether the accumulated evidence suffices to answer the query. All retrieved triples are assembled into a contextualized component \( {C}_{i} \) , which is passed to the LLM for an evidence sufficiency check:
+
+\[
+\operatorname{LLM}\left( {{p}_{\text{ ctx }},{C}_{i}, q}\right) \rightarrow \{ \text{ yes, no }\} \text{ , Reason. } \tag{17}
+\]
+
+If the result is yes, we terminate the search and proceed to generation. Otherwise, if \( i < d \) , the search proceeds to the next iteration.
+
+3.2.2 Contextualized Generator. The retrieved entities and hyperedges are organized into a fixed-format context, as defined in Eq. (13). This contextualized evidence Context, combined with the original query \( q \) , is then passed to the LLM to generate the final Answer.
+
+## 4 Experiments
+
+We quantitatively evaluate the effectiveness and efficiency of HyperRetriever against RAG baselines in both in-domain and cross-domain settings. Ablation studies highlight the benefits of adaptive expansion and \( n \) -ary relational chain learning, complemented by qualitative analyses that illustrate the precision and efficiency of the adaptive retrieval process.
+
+### 4.1 Experimental Setup
+
+4.1.1 Datasets. We conduct experiments under both open-domain and closed-domain multi-hop question answering (QA) settings. For in-domain evaluation, we use three widely adopted benchmark datasets: HotpotQA [42], MuSiQue [38], and 2WikiMultiHopQA [16]. To evaluate cross-domain generalization, we adopt the WikiTopics-CLQA dataset [11], which tests zero-shot inductive reasoning over unseen entities and relations at inference time. Comprehensive dataset statistics are summarized in Appendix B.2.
+
+4.1.2 Evaluation Metrics.
We employ four standard metrics to assess performance, aligning with established protocols for each benchmark type. For open-domain QA datasets, where the objective is precise answer generation, we report Exact Match (EM) and F1 scores. For WikiTopics-CLQA, which involves ranking correct entities from a candidate list, we utilize Mean Reciprocal Rank (MRR) and Hits@k to evaluate retrieval fidelity. All metrics are reported as percentages (%), with higher values indicating better performance.
+
+4.1.3 Baselines. To evaluate the effectiveness of our approach, we compare HyperRAG with RAG baselines of varying retrieval granularities, enabling a systematic analysis of how evidence structure affects retrieval effectiveness and answer generation in both open- and closed-domain settings. Specifically, we include: RAPTOR [33], which retrieves tree-structured nodes; HippoRAG [14], which retrieves free-text chunks; ToG [37], which retrieves relational subgraphs; and HyperGraphRAG [25], which retrieves a heterogeneous mixture of entities, relations, and textual spans.
+
+4.1.4 Implementation Details. All baselines and our proposed methods utilize gpt-4o-mini as the core model for both graph construction and question answering. For HyperRetriever, we additionally employ the pretrained text encoder gte-large-en-v1.5 to produce dense embeddings for entities, relations, and queries. With 434M parameters, this GTE-family model achieves strong performance on English retrieval benchmarks, such as MTEB, and offers an efficient balance between inference speed and embedding quality, making it well-suited for semantic subgraph retrieval. All experiments were implemented in Python 3.11.13 with CUDA 12.8 and conducted on a single NVIDIA RTX 3090 (24 GB). Peak GPU memory usage remained within 24 GB due to dynamic allocation.
+
+### 4.2 Open-domain Answering Performance
+
+4.2.1 Setup.
For HyperRetriever, a lightweight MLP \( {f}_{\theta } \) scores the plausibility of candidate hyperedges, enabling aggressive pruning that reduces traversal complexity without compromising reasoning quality. For HyperMemory, we set beam width \( w = 3 \) and depth \( d = 3 \) to balance retrieval coverage against computational cost. Comprehensive prompt definitions for edge scoring \( \left( {p}_{\text{ edge }}\right) \) , entity ranking \( \left( {p}_{\text{ entity }}\right) \) , context evaluation \( \left( {p}_{\text{ ctx }}\right) \) , and generation are provided in the Appendix. + +
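To make the adaptive expansion-filtering loop of Sections 3.1.4-3.1.5 concrete, the sketch below is our own illustration, not the released implementation: `adaptive_filter` and the toy scorer are hypothetical stand-ins for the trained MLP \( {f}_{\theta } \) , reproducing only the threshold decay of Eq. (11) with \( {\tau }_{0} = 0.5 \) , \( c = 0.1 \) , and \( {N}_{\max } = 5 \) .

```python
# Illustrative sketch (not the authors' code) of the adaptive-threshold
# retrieval loop from Sec. 3.1.5: start at tau_0, allow up to n_max
# reductions of step c, and require at least m accepted triples per hop.
# `score` stands in for the trained MLP f_theta; here it is a toy lookup.

def adaptive_filter(triples, score, tau0=0.5, n_max=5, c=0.1, m=50):
    """Return the triples that clear the (adaptively lowered) threshold."""
    tau = tau0
    kept = []
    for _ in range(n_max + 1):       # initial threshold + n_max reductions
        kept = [t for t in triples if score(t) >= tau]
        if len(kept) >= m:           # enough evidence under current threshold
            return kept, tau
        tau -= c                     # relax the threshold (Eq. 11)
    return kept, tau + c             # reduction budget exhausted

# Toy pseudo-binary triples (head, hyperedge_id, tail) with fabricated scores.
triples = [("e%d" % i, "f0", "a%d" % i) for i in range(10)]
scores = {t: 0.15 + 0.05 * i for i, t in enumerate(triples)}
kept, tau = adaptive_filter(triples, scores.get, m=5)  # tau relaxes 0.5 -> 0.4
```

In the full system the frontier would then advance to the tail entities of the retained triples; this sketch demonstrates only the thresholding behavior.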
Each cell reports MRR / Hits@10 (%).

| Topic | RAPTOR | HippoRAG | ToG | HyperGraphRAG | HyperRetriever | HyperMemory | Rel. Gain (%) |
| :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- |
| ART | 3.44 / 4.13 | 8.42 / 9.77 | 2.99 / 3.20 | 17.18 / 21.68 | 19.31 / 24.31 | 15.63 / 19.17 | 12.40 / 12.13 |
| AWARD | 20.57 / 25.13 | 32.80 / 38.65 | 8.70 / 9.35 | 51.64 / 63.43 | 52.66 / 65.28 | 47.34 / 56.98 | 1.98 / 2.93 |
| EDU | 4.94 / 5.90 | 23.82 / 26.37 | 9.09 / 9.49 | 43.44 / 50.05 | 44.79 / 51.63 | 41.68 / 46.95 | 3.11 / 3.16 |
| HEALTH | 18.85 / 22.04 | 25.72 / 29.59 | 7.14 / 7.95 | 31.46 / 37.94 | 32.68 / 39.26 | 27.48 / 33.13 | 3.88 / 3.48 |
| INFRA | 10.95 / 12.79 | 23.88 / 27.11 | 9.87 / 10.67 | 37.18 / 44.82 | 38.92 / 45.77 | 35.77 / 41.69 | 4.68 / 2.12 |
| LOC | 16.55 / 18.68 | 19.88 / 23.08 | 3.45 / 3.83 | 29.92 / 34.38 | 31.80 / 36.85 | 30.73 / 35.95 | 6.28 / 7.18 |
| ORG | 12.00 / 14.54 | 36.20 / 41.70 | 6.61 / 7.33 | 64.68 / 74.89 | 62.87 / 71.21 | 52.26 / 59.84 | -2.80 / -4.91 |
| PEOPLE | 10.74 / 13.10 | 15.39 / 18.28 | 3.90 / 4.40 | 20.67 / 28.10 | 21.62 / 28.48 | 18.96 / 25.29 | 4.60 / 1.35 |
| SCI | 6.84 / 8.66 | 15.62 / 18.86 | 6.87 / 7.28 | 25.92 / 34.54 | 25.15 / 32.30 | 21.50 / 27.53 | -2.97 / -6.49 |
| SPORT | 11.31 / 13.28 | 22.78 / 26.01 | 7.51 / 8.53 | 37.40 / 44.91 | 39.37 / 45.56 | 33.64 / 39.72 | 5.27 / 1.45 |
| TAX | 10.48 / 11.08 | 24.77 / 26.65 | 6.22 / 6.50 | 35.15 / 40.94 | 37.20 / 40.98 | 33.65 / 38.19 | 5.83 / 0.10 |
| AVG | 11.52 / 13.58 | 22.66 / 26.01 | 6.58 / 7.14 | 35.88 / 43.24 | 36.94 / 43.78 | 32.60 / 38.59 | 2.95 / 1.23 |
+ +Table 1: Performance comparison of domain generalization across 11 diverse topics. The "Rel. Gain" column highlights the substantial relative improvement of our approach over the best baseline, averaged across all domains (metrics in %). + +
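As a sanity check on Table 1, the Rel. Gain column can be recomputed from the table entries themselves; the snippet below (our own illustration, not from the paper's codebase) reproduces the ART row.

```python
# Relative gain of HyperRetriever over the strongest baseline, as in the
# Rel. Gain column of Table 1 (ART row shown).
# Baselines: RAPTOR, HippoRAG, ToG, HyperGraphRAG.

def rel_gain(ours, baselines):
    """Relative improvement (%) over the best baseline, rounded to 2 decimals."""
    best = max(baselines)
    return round((ours - best) / best * 100, 2)

art_mrr = rel_gain(19.31, [3.44, 8.42, 2.99, 17.18])   # -> 12.4, matching 12.40
art_hits = rel_gain(24.31, [4.13, 9.77, 3.20, 21.68])  # -> 12.13
```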
Each cell reports EM / F1 (%).

| Model | HotpotQA | MuSiQue | 2WikiMultiHopQA |
| :-- | :-- | :-- | :-- |
| RAPTOR | 35.50 / 41.56 | 15.00 / 16.31 | 22.50 / 22.95 |
| HippoRAG | 49.50 / 55.87 | 14.50 / 17.43 | 30.00 / 30.44 |
| ToG | 10.08 / 11.00 | 2.70 / 2.69 | 5.20 / 5.34 |
| HyperGraphRAG | 51.00 / 42.69 | 22.00 / 20.02 | 42.50 / 30.17 |
| HyperRetriever | 42.50 / 43.65 | 13.50 / 14.15 | 34.00 / 34.06 |
| HyperMemory | 35.50 / 41.51 | 8.00 / 12.96 | 31.50 / 32.56 |
| Rel. Gain (%) | -16.67 / -21.87 | -38.64 / -29.32 | -20.00 / 11.89 |
+
+Table 2: Performance comparison on HotpotQA, MuSiQue, and 2WikiMultiHopQA. Rel. Gain (%) indicates the relative performance gains achieved by our model compared with the best baselines. The best results are bolded, and the second best are underlined.
+
+4.2.2 Results. Table 2 details the Exact Match (EM) and F1 scores across three open-domain QA benchmarks. HyperRetriever consistently outperforms the HyperMemory variant on HotpotQA and MuSiQue, demonstrating superior capability in identifying evidential relational chains. This advantage is attributed to its learnable MLP-based plausibility scorer and density-aware expansion strategy, which affords precise control over retrieval depth. In contrast, HyperMemory relies on the fixed parametric memory of the LLM, rendering it less adaptable to domain-specific relational patterns. When compared to external KG-based RAG baselines, we observe a performance divergence based on graph topology. On HotpotQA and MuSiQue, HyperRetriever exhibits a performance gap (e.g., 38.64% lower EM on MuSiQue), likely because these datasets require the rigid structural guidance of explicit KG priors for cross-document navigation. However, on 2WikiMultiHopQA, HyperRetriever reverses this trend, achieving an 11.89% relative F1 improvement. This suggests that while KG priors aid in sparse settings, HyperRetriever is uniquely effective at exploiting the denser, complex relational contexts found in 2WikiMultiHopQA.
+
+### 4.3 Closed-domain Generalization Performance
+
+To evaluate adaptability to closed-domain \( n \) -ary knowledge graphs, we assess the performance of HyperRAG on the WikiTopics-CLQA dataset (Table 1). The results demonstrate strong generalization across diverse topic-specific hypergraphs. In particular, our learnable variant, HyperRetriever, achieves the highest overall answer precision, with average improvements of 2.95% (MRR) and 1.23% (Hits@10) compared to the second-best baseline, HyperGraphRAG.
These gains are statistically significant \( \left( {p \ll {0.001}}\right) \) , with paired \( t \) -test \( p \) -values of \( {1.46} \times {10}^{-{17}} \) for MRR and \( {2.41} \times {10}^{-6} \) for Hits@10, supporting the empirical reliability of our approach. HyperRetriever secures top performance in 9 out of the 11 categories (for instance, achieving relative gains of 12.40% (MRR) and 12.13% (Hits@10) in the ART domain) and consistently ranks second in the remaining two. This broad efficacy highlights the robustness of HyperRetriever's adaptive retrieval mechanism. Unlike baselines that are sensitive to domain-specific graph density, HyperRetriever's learnable MLP scorer dynamically calibrates its expansion strategy to suit varying \( n \) -ary topologies, ensuring high precision even in complex reasoning tasks. In contrast, our memory-guided variant, HyperMemory, consistently underperforms relative to HyperRetriever. This variant serves as a critical ablation to probe the limitations of an LLM's intrinsic parametric memory for \( n \) -ary retrieval. The results confirm that prompt-based scoring alone, without the explicit structural learning provided by HyperRetriever, is insufficient for multi-hop reasoning in closed domains.
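For readers who want to reproduce a significance check, the snippet below is our own illustration: the reported \( p \) -values are presumably computed over per-question scores, so this coarser domain-level paired \( t \) -test on the Table 1 MRR columns will not match them, but it shows the mechanics with only the standard library.

```python
# Paired t-test sketch over the 11 per-domain MRR values of HyperRetriever
# vs. HyperGraphRAG from Table 1. Illustrative only: the paper's p-values
# are presumably computed per question, so this domain-level statistic is
# coarser and will not match the reported numbers.
import math

hgr  = [17.18, 51.64, 43.44, 31.46, 37.18, 29.92, 64.68, 20.67, 25.92, 37.40, 35.15]
hret = [19.31, 52.66, 44.79, 32.68, 38.92, 31.80, 62.87, 21.62, 25.15, 39.37, 37.20]

def paired_t(xs, ys):
    """t statistic of the paired two-sample t-test (df = n - 1)."""
    diffs = [y - x for x, y in zip(xs, ys)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)

t_stat = paired_t(hgr, hret)  # roughly 2.8 with df = 10
```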
Each cell reports MRR / Hits@10 (%).

| Topic | Full | w/o Entities | w/o Hyperedges | w/o Chunks | w/o Adaptive Search | w/ Binary KG |
| :-- | :-- | :-- | :-- | :-- | :-- | :-- |
| ART | 26.03 / 31.00 | 27.28 / 31.00 | 24.03 / 27.00 | 24.17 / 27.00 | 26.33 / 31.00 | 14.00 / 15.00 |
| AWARD | 56.91 / 70.00 | 43.22 / 61.00 | 55.95 / 69.00 | 55.01 / 66.00 | 52.98 / 66.00 | 48.92 / 53.00 |
| EDU | 49.00 / 56.00 | 43.24 / 52.00 | 47.93 / 52.00 | 42.67 / 47.00 | 47.53 / 53.00 | 38.20 / 42.00 |
| HEALTH | 41.25 / 47.00 | 37.17 / 43.00 | 37.70 / 40.00 | 39.33 / 47.00 | 39.20 / 46.00 | 36.17 / 39.00 |
| INFRA | 34.85 / 43.00 | 35.17 / 43.00 | 30.87 / 39.00 | 38.75 / 44.00 | 35.50 / 45.00 | 30.50 / 32.00 |
| LOC | 38.75 / 42.50 | 44.58 / 47.50 | 37.50 / 40.00 | 33.13 / 37.50 | 41.67 / 47.50 | 39.58 / 42.50 |
| ORG | 46.79 / 58.97 | 58.75 / 65.00 | 45.92 / 55.00 | 53.00 / 60.00 | 38.07 / 45.00 | 47.50 / 47.50 |
| PEOPLE | 14.20 / 22.00 | 21.23 / 28.00 | 13.73 / 19.00 | 20.03 / 26.00 | 13.37 / 20.00 | 19.33 / 22.00 |
| SCI | 25.91 / 36.00 | 18.67 / 22.00 | 24.53 / 32.00 | 26.09 / 38.00 | 21.14 / 32.00 | 24.00 / 27.00 |
| SPORT | 31.04 / 40.00 | 35.83 / 40.00 | 35.00 / 45.50 | 29.58 / 40.00 | 33.33 / 37.50 | 42.08 / 47.50 |
| TAX | 36.25 / 40.00 | 29.17 / 35.00 | 33.54 / 36.25 | 33.13 / 36.25 | 36.88 / 40.00 | 35.42 / 37.50 |
| AVG | 36.45 / 40.59 | 35.85 / 42.50 | 35.15 / 41.34 | 35.90 / 42.61 | 35.64 / 42.91 | 34.15 / 36.82 |
+
+Table 3: Ablation on the Contribution of Context Formation and Adaptive Search. The full model incorporates all components essential for context formation, including entities, hyperedges involved in learnable relational chains, and retrieved chunks. The best results in MRR are bolded, and the best in Hits@10 are underlined.
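Since the ablation discussion distinguishes ranking fidelity (MRR) from candidate inclusion (Hits@k), a minimal reference implementation of the two metrics may help; this is our own illustration of the standard definitions, not the paper's evaluation code.

```python
# Minimal reference implementations (ours, illustrative) of the two ranking
# metrics used throughout: MRR rewards placing the gold answer high in the
# ranking, while Hits@k only checks that it appears among the top k. A gold
# answer that drifts from rank 1 to rank 8 hurts MRR but not Hits@10.

def mrr(rankings):
    """Mean reciprocal rank; each item is the 1-based rank of the gold answer, or None if absent."""
    return sum(0.0 if r is None else 1.0 / r for r in rankings) / len(rankings)

def hits_at_k(rankings, k):
    """Fraction of queries whose gold answer appears within the top k."""
    return sum(1 for r in rankings if r is not None and r <= k) / len(rankings)

ranks = [1, 3, None, 2]            # gold-answer ranks for four toy queries
assert hits_at_k(ranks, 2) == 0.5  # queries 1 and 4 rank within the top 2
```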
| Dimension | RAPTOR [33] | HippoRAG [14] | ToG [37] | HyperGraphRAG [25] | OG-RAG [34] | HyperRetriever / Memory |
| :-- | :-- | :-- | :-- | :-- | :-- | :-- |
| Structure type | Doc tree (summ.) | KG (binary) | KG (binary) | Hypergraph ( \( n \) -ary) | Object graph (mostly bin.) | Hypergraph ( \( n \) -ary) |
| Unit of fact | Passage / summary | Entity-entity edge | Step / subgoal | Hyperedge ( \( n \) -ary fact) | Object-object edge | Hyperedge ( \( n \) -ary fact) |
| Candidate growth | Additive (levels) | Additive on edges | LLM-var. | Additive on hyperedges | Additive on objects | Additive on hyperedges |
| Per-query overhead | Tokens only | \( O\left( {n - k}\right) \) | Var. | \( O{\left( 1\right) }^{ \dagger } \) | \( O\left( 1\right) \) | \( O{\left( 1\right) }^{ \dagger } \) |
| Depth for reasoning chain | Deep | Deep (pairwise) | LLM-var. | Shallow ( \( n \) -ary edges) | Deep (pairwise) | Shallow ( \( n \) -ary edges) |
| Retrieval strategy | Dense tree search | Graph walk + dense | LLM on graph | Static | Object-centric walk | Adaptive / LLM on graph |
| LLM at retrieval | Low-Med | Low | Med-High (LLM) | Low | Low | Low / Med (LLM) |
| Ontology | | | | | \( \checkmark \) | |
+
+Table 4: Method Comparison. HyperRetriever utilizes adaptive search on \( n \) -ary hyperedges, enabling higher-order reasoning with shallow chains and near-constant per-query retrieval overhead \( O\left( 1\right) \) . In contrast, static or object-centric walks on binary graphs entail deeper pairwise chains and materialization cost. \( \dagger \) denotes bounded arity; \( \checkmark \) indicates an ontology requirement.
+
+### 4.4 Ablation Study
+
+To evaluate the effectiveness of our approach, we conduct a series of ablation studies targeting two key aspects: (i) the contribution of individual components to context formation, and (ii) the impact of the adaptive search policy on retrieval performance.
+
+4.4.1 Higher-Order Reasoning Chains. Compared with binary KG RAG, HyperRAG supports higher-order reasoning on \( n \) -ary hypergraphs. An \( n \) -ary hyperedge jointly binds multiple entities and roles, capturing fine-grained dependencies beyond pairwise links. Exploiting this structure yields shallower yet more expressive reasoning chains, enabling the model to surface key evidence without multi-hop traversal. Empirically (Table 3), replacing the \( n \) -ary structure with a binary one lowers the average MRR from 36.45% to 34.15% (a 2.30-point drop) and the average Hits@10 from 40.59% to 36.82% (a 3.77-point drop), indicating gains in both accuracy and efficiency. Additional qualitative examples appear in Appendix C.
+
+4.4.2 Impact of Context Formation. Table 3 presents a component-wise ablation study conducted on a representative \( 1\% \) subset to isolate the contributions of (i) entities, (ii) structural relations (hyperedges), and (iii) textual context. We observe that removing any component consistently degrades Mean Reciprocal Rank (MRR), though Hits@10 exhibits higher variance. This divergence highlights the distinction between ranking fidelity (MRR) and candidate inclusion (Hits@10).
For instance, in the ORG and LOC domains, certain ablated variants maintain competitive Hits@10 scores but suffer sharp declines in MRR. This indicates that while the correct answer remains within the top candidates, the loss of structural or semantic signals causes it to drift down the ranking list, degrading precision. Crucially, hyperedges emerge as the dominant factor in effective context formation. Their exclusion precipitates the most significant performance drops across both metrics, underscoring the necessity of high-order topological structure for reasoning. In contrast, removing entities yields less severe degradation, as entities primarily provide node-level descriptions, whereas hyperedges capture the joint dependencies between them. Text chunks offer complementary unstructured semantics but lack the relational precision of the graph structure. Ultimately, the superior performance of the full model validates the synergistic integration of entity-aware signals, hypergraph topology, and adaptive textual evidence.
+
+4.4.3 Impact of Adaptive Search. Removing the adaptive search component results in a noticeable decline in MRR across most categories, whereas its impact on Hits@10 is minimal and in some cases (e.g., INFRA, LOC) even marginally positive. This pattern suggests that while correct answers remain retrievable among the top 10 candidates, they tend to be ranked lower in the absence of adaptive search, reducing overall ranking precision.
+
+![bo_d6nbbuc601uc73e2hrig_7_219_241_581_353_0.jpg](images/bo_d6nbbuc601uc73e2hrig_7_219_241_581_353_0.jpg)
+
+Figure 3: The visualization shows the efficiency-effectiveness tradeoff in multi-hop QA: retrieval time ( \( x \) -axis), answer quality (Hits@10, \( y \) -axis), and context volume (bubble size, log-scaled by retrieved tokens).
+
+### 4.5 Efficiency Study
+
+4.5.1 Setup.
To assess retrieval efficiency, we draw a stratified 1% sample from each WikiTopics-CLQA category, yielding approximately 1,000 questions evenly distributed across 11 topic domains, and evaluate all baselines on this set. Figure 3 depicts the three-way trade-off among retrieval time ( \( x \) -axis), Hits@10 accuracy ( \( y \) -axis), and context volume (bubble size, logarithmically scaled by retrieved tokens). Models in the upper-left quadrant achieve the best balance between efficiency and effectiveness, combining low latency with high Hits@10 while retrieving compact contexts.
+
+4.5.2 Empirical Evidence. HyperRetriever achieves the shortest retrieval time and the highest Hits@10. Although it retrieves more tokens than some baselines, top performers consistently rely on larger contexts, highlighting a common trade-off between answer quality and retrieval volume. Our empirical findings align with the theoretical analysis in §2.2. HyperRetriever employs adaptive search over \( n \) -ary hyperedges, enabling higher-order reasoning with shallow chains and nearly constant per-query overhead \( O\left( 1\right) \) . In contrast, static or object-centric walks in binary graphs require deeper pairwise chains and incur an event materialization cost \( O\left( {n - k}\right) \) . We further benchmark our approach against five publicly available graph-based RAG systems, covering both \( n \) -ary and binary KG designs, and summarize the comparison in Table 4.
+
+## 5 Related Work
+
+Retrieval-Augmented Generation. RAG fundamentally augments the parametric memory of LLMs with external data, serving as a critical countermeasure against hallucination in knowledge-intensive tasks. The standard pipeline operates by retrieving top- \( k \) document chunks via dense similarity search before conditioning generation on this augmented context [2, 12, 17].
However, conventional dense retrieval methods [6, 20] treat data as flat text, often overlooking the complex structural and relational signals required for deep reasoning. To address this, iterative multi-step retrieval approaches have been proposed [18, 36, 39]. Yet, these methods often suffer from diminishing returns: they increase inference latency and retrieve redundant information that dilutes the context signal. This noise contributes to the "lost-in-the-middle" effect, where finite context windows prevent the LLM from effectively attending to dispersed evidence [24, 41].
+
+Graph-based RAG. Graph-based RAG frameworks incorporate inter-document and inter-entity relationships into retrieval to enhance coverage and contextual relevance [3, 15, 31, 32]. Early approaches queried curated KGs (e.g., WikiData, Freebase) for factual triples or reasoning chains [4, 22, 27, 40], while recent methods fuse KGs with unstructured text [8, 23] or build task-specific graphs from raw corpora [7]. To improve efficiency, LightRAG [13], HippoRAG [14], and MiniRAG [10] adopt graph indexing via entity links, personalized PageRank, or incremental updates [28, 29]. However, KG-based RAGs often face a trade-off between breadth and precision: broader retrieval increases noise, while narrower retrieval risks omitting key evidence. Methods using fixed substructures (e.g., paths, chunks) simplify reasoning [33, 44] but may miss global context, and challenges are amplified by LLM context window limits, vast KG search spaces [18, 30, 37], and the high latency of iterative queries [37]. Moreover, most graph-based RAG methods rely on binary relational facts, limiting the expressiveness and coverage of knowledge. Hypergraph-based representations capture richer \( n \) -ary relational structures [26].
HyperGraphRAG [25] advances this line by leveraging \( n \) -ary hypergraphs, outperforming conventional KG-based RAGs, yet it suffers from noisy retrieval and reliance on dense retrievers. OG-RAG [34] addresses these issues by grounding hyperedge construction and retrieval in domain-specific ontologies, enabling more accurate and interpretable evidence aggregation. However, its dependence on high-quality ontologies constrains scalability in fast-changing or low-resource domains. Most graph-based and hypergraph-based RAG methods still face challenges, particularly due to the use of static or object-centric walks on binary graphs, which entail deeper pairwise chains and higher materialization costs. Table 4 compares existing methods with HyperRAG.
+
+## 6 Conclusion
+
+We introduced HyperRAG, a novel framework that advances multi-hop question answering by shifting the retrieval paradigm from binary triples to \( n \) -ary hypergraphs, featuring two strategies: HyperRetriever, designed for precise, structure-aware evidential reasoning, and HyperMemory, which leverages dynamic, memory-guided path expansion. Empirical results demonstrate that HyperRAG effectively bridges reasoning gaps by enabling shallower, more semantically complete retrieval chains. Notably, HyperRetriever consistently outperforms strong baselines across diverse open- and closed-domain datasets, proving that modeling high-order dependencies is crucial for accurate and interpretable RAG systems.
# Diagnosing Knowledge Conflict in Multimodal Long-Chain Reasoning

Jing Tang \( {}^{1 * } \) Kun Wang \( {}^{2 * } \) Haolang Lu \( {}^{3 * } \) Hongjin Chen \( {}^{3} \) KaiTao Chen \( {}^{3} \) Zhongxiang Sun \( {}^{4} \) Qiankun Li \( {}^{2} \) Lingjuan Lyu \( {}^{5} \) Guoshun Nan \( {}^{3} \) Zhigang Zeng \( {}^{1} \)

jingtang@hust.edu.cn wang.kun@ntu.edu.sg luhaolang@bupt.edu.cn

## Abstract

Multimodal large language models in long chain-of-thought reasoning often fail when different knowledge sources provide conflicting signals. We formalize these failures under a unified notion of knowledge conflict, distinguishing input-level objective conflict from process-level effective conflict. Through probing internal representations, we reveal that: (I) Linear Separability: different conflict types are explicitly encoded as linearly separable features rather than entangled; (II) Depth Localization: conflict signals concentrate in mid-to-late layers, indicating a distinct processing stage for conflict encoding; (III) Hierarchical Consistency: aggregating noisy token-level signals along trajectories robustly recovers input-level conflict types; and (IV) Directional Asymmetry: reinforcing the model's implicit source preference under conflict is far easier than enforcing the opposite source. Our findings provide a mechanism-level view of multimodal reasoning under knowledge conflict and enable principled diagnosis and control of long-CoT failures. Code is available at anonymous link.

## 1. Introduction

Multimodal large language models (MLLMs) (Jin et al., 2025; Caffagni et al., 2024; Zhang et al., 2024a) have made substantial progress in visual understanding (Tong et al., 2024a; Ghatkesar et al., 2025; Ma et al., 2025), textual reasoning (Wang et al., 2024; Du et al., 2025; Mirzadeh et al., 2025), and cross-modal alignment (Yu et al., 2024; Yan et al., 2025; Yu et al., 2025), enabling complex perception-reasoning-decision workflows. A defining capability is long-form reasoning: beyond producing answers, these models can generate extended chains-of-thought (CoT) (Wang et al., 2025b; Yue et al., 2025) that support challenging multi-step tasks. However, recent work increasingly documents failures under mutually contradictory evidence or constraints: models may ignore explicit instructions (Wang et al., 2025a; Zhao et al., 2025), privilege the wrong evidential source (Guan et al., 2024; Liu et al., 2025b), or yield plausible yet goal-inconsistent conclusions (Fanous et al., 2025). These observations suggest that a key bottleneck in multimodal reasoning is not always missing information, but reliable decision-making under conflicting signals.

Building on these observations, prior work (Zhang et al., 2024c; Lu et al., 2024) has characterized abnormal behavior under conflicting signals from several largely independent angles. In retrieval-augmented generation, a central question is whether models remain faithful to retrieved evidence or drift toward parametric priors (Wu et al., 2024). In vision settings with counterfactual or commonsense-violating inputs, MLLMs are often found to underweight visual evidence and default to "reasonable" answers that match world knowledge (Tong et al., 2024b; Liu et al., 2025c). In high-stakes domains, studies further report over-accommodation to user assertions, which can pull predictions away from the underlying evidence (Sharma et al., 2024).
Although these lines of work differ in tasks, datasets, and evaluation criteria, their failure modes are strikingly similar: when information sources disagree, models do not reliably follow the appropriate basis for a decision, and instead exhibit unstable, hard-to-control trade-offs across sources.

In this paper, we take a unified view that these phenomena arise from knowledge conflict in multimodal reasoning. When generating tokens, MLLMs jointly rely on multiple knowledge sources, including visual evidence, textual instructions and contextual constraints, and parametric priors stored in the model weights (Han et al., 2025; Liu et al., 2024a; Karamcheti et al., 2024). When these sources provide inconsistent signals for the same goal, the model must resolve which source to follow. Importantly, the resulting failures are not fabrications from missing knowledge, but incorrect source selection under conflict: the model may have access to competing plausible cues yet follow the wrong basis. Accordingly, our focus is not the act of answer generation itself, but whether conflict-induced failures can be localized, measured, and mechanistically tested.

---

\( {}^{1} \) Huazhong University of Science and Technology \( {}^{2} \) Nanyang Technological University \( {}^{3} \) Beijing University of Posts and Telecommunications \( {}^{4} \) Renmin University of China \( {}^{5} \) Sony AI, Zurich, Switzerland. Correspondence to: Guoshun Nan , Zhigang Zeng .

Preprint. February 17, 2026.

---

Multimodal long-CoT reasoning (Ni et al., 2025) makes this problem sharper by unfolding decisions over many steps, with the internal reasoning state evolving over time. Under this setting, knowledge conflict can be triggered at any point and modality along the trajectory rather than only at the final answer.
Once a step commits to the wrong basis, subsequent steps may continue from that premise in a locally coherent manner, eventually producing a globally incorrect conclusion (Zhang et al., 2024b). More challenging, such deviations are often masked by fluent rationales (Turpin et al., 2023), making it difficult to infer when the conflict emerged, what triggered it, and how it propagated from the final output alone. Understanding and correcting failures in long-CoT therefore requires step-level tools that can expose the underlying conflict dynamics.

In this work:

* We diagnose knowledge conflict dynamics on 7,500+ long-CoT trajectories from an objective conflict benchmark, where effective conflicts are activated in 78-90% of samples.
* Through layer-wise analysis of three models, we identify a depth-dependent conflict encoding stage. Using streaming probes to detect token-level conflict states, we find they exhibit high linear separability (93.2~98.8% AUC, 76.9~97.8% Recall@0.1), revealing them as explicit, decodable features.
* We employ three pluggable methods for intervention. These methods can either steer model outputs toward selected directions, reducing conflict frequency by up to 80%, or suppress high-confidence errors by up to 55%.

## 2. Related Work

Knowledge Conflict. Research on knowledge conflicts has identified three primary sources: conflicts between internal priors and visual information (Liu et al., 2025b; Du et al., 2025) or textual inputs (Zhang et al., 2025a; Su et al., 2024), and conflicts between visual and textual modalities (Deng et al., 2025). Building on these findings, significant efforts have been made to mitigate such conflicts through advanced strategies (Xie et al., 2024; Guo et al., 2024), including knowledge editing (Tan et al., 2024; Zhang et al., 2025d; Cheng et al., 2024; Chen et al., 2025) and retrieval augmentation (Huo et al., 2025; Zhang et al., 2025b; Li et al., 2025).
These approaches have demonstrated potential in enhancing model faithfulness and reliability (Huang et al., 2025b; An et al., 2025; Shi et al., 2024; Zhang et al., 2024d; Lu et al., 2025). Although the above evidence suggests that conflicts are coupled and multi-source, existing solutions remain fragmented across modalities and fail to model conflicts holistically, thereby limiting their applicability in complex settings.

Probe Detection. Investigating internal states via probe detection is a developing field, yet the history of probing in LLMs (Kahana et al., 2025) provides clear precedents. Notably, the evolution of probe detection primarily centers on hallucination and faithfulness (Feng et al., 2025; Yi et al., 2025). Core techniques, such as linear probe generators (Kahana et al., 2025) and propositional probes (Feng et al., 2025), have inspired analogous approaches in watermark identification (Liu et al., 2025a), reward maximization (Li et al., 2024), and combinatorial optimization (Zhang et al., 2025e). However, these approaches predominantly focus on single-modal issues or specific downstream tasks, leaving the detection and localization of multimodal knowledge conflicts largely unexplored. Inspired by this, we introduce a specialized probe detection framework to identify the three sources of knowledge conflicts in MLLMs.

![bo_d6nb7sc601uc73e2hngg_1_901_194_698_547_0.jpg](images/bo_d6nb7sc601uc73e2hngg_1_901_194_698_547_0.jpg)

Figure 1. Overview of Knowledge Sources and Conflict Types. We categorize knowledge into Visual \( \left( {\mathcal{K}}_{\text{ vision }}\right) \), Textual \( \left( {\mathcal{K}}_{\text{ text }}\right) \), and Parametric Prior \( \left( {\mathcal{K}}_{\text{ prior }}\right) \). Knowledge conflicts arise when factual statements from different sources act as incompatible signals.
We define three primary conflict types: Vision-Text \( \left( {\mathcal{C}}_{\mathrm{{VT}}}\right) \), Vision-Prior \( \left( {\mathcal{C}}_{\mathrm{{VP}}}\right) \), and Prior-Text \( \left( {\mathcal{C}}_{\mathrm{{PT}}}\right) \).

## 3. Conflict in Multimodal Reasoning

### 3.1. Knowledge Sources and Pairwise Conflicts

We consider a multimodal long-CoT reasoning task with input \( x = \left( {{X}_{V},{X}_{T}}\right) \), where \( {X}_{V} \) denotes the visual input and \( {X}_{T} \) the textual input. Given a multimodal generative model \( {M}_{\theta } \), reasoning unfolds as a sequence of tokens \( \tau \left( x\right) = \left( {{y}_{1},{y}_{2},\ldots ,{y}_{T}}\right) \), with each token sampled as

\[
{y}_{t} \sim {M}_{\theta }\left( {\cdot \mid x,{y}_{ < t}}\right) . \tag{1}
\]

We denote the internal state at step \( t \) by

\[
{\mathbf{h}}_{t} = {f}_{\theta }\left( {x,{y}_{ < t}}\right) , \tag{2}
\]

where \( {f}_{\theta } \) denotes the model's hidden representation extraction, i.e., the forward pass up to a specified layer.

To analyze how factual inconsistencies arise during reasoning, we abstract the knowledge available to the model into three sources, \( \mathcal{K} = \left\{ {{\mathcal{K}}_{\text{ vision }},{\mathcal{K}}_{\text{ text }},{\mathcal{K}}_{\text{ prior }}}\right\} \).

Table 1. Output-level conflict profile across models (objective conflict subsets). We present statistics of generated trajectories under three types of conflict (model details in Appendix B). Metrics reported include sample count, average CoT length, average conflict spans per sample (spans are contiguous conflict segments identified via an automated LLM annotation pipeline; a span may consist of one or multiple tokens), conflict token density (proportion of conflicting tokens), and sample conflict rate (% of samples exhibiting effective conflict).
Columns are grouped by model (Llama-3.2V-11B-cot, R1-Onevision-7B, Ocean-R1-7B-Instruct) and, within each model, by objective-conflict subset: VP = \( {\mathcal{C}}_{\mathrm{{VP}}}^{o} = 1 \), VT = \( {\mathcal{C}}_{\mathrm{{VT}}}^{o} = 1 \), PT = \( {\mathcal{C}}_{\mathrm{{PT}}}^{o} = 1 \), All.

| Metric | VP | VT | PT | All | VP | VT | PT | All | VP | VT | PT | All |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Samples | 749 | 1012 | 803 | 2564 | 724 | 993 | 769 | 2486 | 640 | 1026 | 807 | 2473 |
| Avg. CoT length (tokens) | 326.79 | 1768.85 | 238.50 | 868.32 | 706.85 | 790.63 | 558.97 | 694.57 | 488.15 | 711.26 | 302.97 | 520.28 |
| Avg. conflict spans per sample | 2.69 | 6.20 | 4.04 | 4.50 | 3.66 | 6.73 | 7.02 | 5.93 | 8.68 | 9.00 | 5.43 | 7.75 |
| Conflict token density (%) | 4.92 | 1.65 | 11.25 | 5.61 | 3.20 | 2.16 | 7.68 | 4.17 | 8.70 | 3.23 | 11.77 | 7.43 |
| Conflict sample ratio (%) | 63.68 | 82.21 | 86.43 | 78.12 | 59.67 | 85.90 | 87.91 | 78.88 | 88.75 | 90.25 | 89.34 | 89.57 |
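The span- and token-level statistics of Table 1 can be recomputed directly from dense token labels. A minimal sketch, assuming an illustrative labeling convention (0 = no conflict, non-zero = some conflict type) and toy trajectories rather than the paper's actual annotation pipeline:

```python
# Hypothetical sketch: Table-1 style statistics from dense token-level labels.
# Labels and trajectories below are toy data, not the paper's annotations.

def spans(labels):
    """Contiguous non-zero segments (conflict spans) in a label sequence."""
    out, start = [], None
    for i, z in enumerate(labels):
        if z != 0 and start is None:
            start = i
        elif z == 0 and start is not None:
            out.append((start, i))
            start = None
    if start is not None:
        out.append((start, len(labels)))
    return out

def trajectory_stats(trajectories):
    """Avg spans per sample, conflict token density (%), sample conflict rate (%)."""
    n = len(trajectories)
    n_spans = sum(len(spans(t)) for t in trajectories)
    n_tok = sum(len(t) for t in trajectories)
    n_conf_tok = sum(sum(1 for z in t if z != 0) for t in trajectories)
    n_conf_samples = sum(1 for t in trajectories if any(t))
    return {
        "avg_spans": n_spans / n,
        "token_density": 100.0 * n_conf_tok / n_tok,
        "sample_rate": 100.0 * n_conf_samples / n,
    }

stats = trajectory_stats([[0, 1, 1, 0, 3], [0, 0, 0, 0, 0]])
```

Here a span is any maximal run of non-zero labels, matching the caption's description of contiguous conflict segments.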
Here, \( {\mathcal{K}}_{\text{ vision }} \) consists of facts supported by the visual input \( {X}_{V} \), \( {\mathcal{K}}_{\text{ text }} \) consists of facts constrained by the textual input \( {X}_{T} \), and \( {\mathcal{K}}_{\text{ prior }} \) denotes parametric prior knowledge implicitly encoded in the model parameters \( \theta \).

For each knowledge source \( {\mathcal{K}}_{ * } \in \mathcal{K} \), we represent its supported factual content as a set of atomic factual statements \( F\left( {\mathcal{K}}_{ * }\right) \), where each element \( \psi \in F\left( {\mathcal{K}}_{ * }\right) \) corresponds to an indivisible factual judgment. We use \( {\psi }_{a} \bot {\psi }_{b} \) to denote that two facts are semantically incompatible, i.e., they cannot simultaneously be true under the given context.

Based on this notion, we define a pairwise knowledge conflict between two sources \( {\mathcal{K}}_{i} \) and \( {\mathcal{K}}_{j}\left( {i \neq j}\right) \) as the set of incompatible fact pairs:

\[
{\mathcal{C}}_{i, j} = \left\{ {\left( {{\psi }_{i},{\psi }_{j}}\right) \mid {\psi }_{i} \in F\left( {\mathcal{K}}_{i}\right) ,{\psi }_{j} \in F\left( {\mathcal{K}}_{j}\right) ,{\psi }_{i} \bot {\psi }_{j}}\right\} . \tag{3}
\]

In this work, we focus on three primary pairwise conflict types induced by the three knowledge sources: Vision-Prior \( \left( {\mathcal{C}}_{\mathrm{{VP}}}\right) \), Vision-Text \( \left( {\mathcal{C}}_{\mathrm{{VT}}}\right) \), and Prior-Text \( \left( {\mathcal{C}}_{\mathrm{{PT}}}\right) \).

### 3.2. Objective vs. Effective Conflict

As illustrated in Figure 1, we distinguish between two related but fundamentally different notions: objective conflict, which is defined at the input level, and effective conflict, which manifests as a process-level state during reasoning.
Objective Conflict describes factual inconsistency induced by the input and the model's parametric priors, independent of any particular reasoning trajectory. Given a conflict type \( {\mathcal{C}}_{i, j} \in \left\{ {{\mathcal{C}}_{\mathrm{{VP}}},{\mathcal{C}}_{\mathrm{{VT}}},{\mathcal{C}}_{\mathrm{{PT}}}}\right\} \), we define a binary variable \( {\mathcal{C}}_{i, j}^{o}\left( x\right) \in \{ 0,1\} \) to indicate whether the input \( x \) exhibits an objective conflict of type \( {\mathcal{C}}_{i, j} \). For example, \( {\mathcal{C}}_{\mathrm{{VP}}}^{o}\left( x\right) = 1 \) indicates that the visual evidence \( {X}_{V} \) contradicts the parametric prior knowledge encoded in \( \theta \) with respect to a specific fact. By definition, \( {\mathcal{C}}_{i, j}^{o}\left( x\right) \) depends only on the factual relations supported by the input \( x \) and the model priors, and does not reference the reasoning process itself.

Importantly, the presence of an objective conflict does not by itself determine whether the model will engage with this conflict during inference. From the input-level specification alone, it is not directly inferable whether, when, or how a given conflict influences the model's internal reasoning dynamics. This gap motivates a process-level notion that captures conflict activation within the model.

Effective Conflict characterizes whether an objective conflict is actually triggered during reasoning and reflected in the model's internal state. Concretely, we use \( {\mathcal{C}}_{i, j}^{e}\left( {t \mid x}\right) \in \{ 0,1\} \) to indicate whether, at reasoning step \( t \), the model relies on mutually incompatible factual information of type \( {\mathcal{C}}_{i, j} \). Here, \( {\mathcal{C}}_{i, j}^{e}\left( {t \mid x}\right) = 1 \) means that the corresponding conflict is active and influences the current reasoning step, as encoded in the internal state at that step.
The relationship between the two notions is asymmetric:

\[
\mathbb{P}\left( {{\mathcal{C}}_{i, j}^{e}\left( {t \mid x}\right) = 1 \mid {\mathcal{C}}_{i, j}^{o}\left( x\right) = 1}\right) < 1. \tag{4}
\]

That is, objective conflict captures whether a conflict exists at the input level, whereas effective conflict captures whether and when that conflict is activated in the model's internal state during reasoning. The former is induced jointly by the input and priors, while the latter is both model-dependent and process-dependent.

Objective conflict data construction. For mechanistic analysis, we construct an objective-conflict benchmark with isolated pairwise conflicts, where each example contains exactly one conflict type (VP, VT, or PT) and is intended to elicit effective conflict states. This setting is designed as a diagnostic stress-test of conflict arbitration under contradiction, rather than an estimate of in-the-wild conflict prevalence. For each input \( x \), we generate a long-CoT trajectory and align the input-level labels \( {\mathcal{C}}_{i, j}^{o}\left( x\right) \) with step-level effective conflict signals \( {\left\{ {\mathcal{C}}_{i, j}^{e}\left( t \mid x\right) \right\} }_{t = 1}^{T} \) inferred from the model outputs. Table 1 reports conflict activation statistics for this benchmark. Full details are provided in Appendix A.

## 4. Probing Conflict from Internal States

In Section 3, we formalize knowledge conflict as an input-level \( {\mathcal{C}}_{i, j}^{o}\left( x\right) \) and a process-level \( {\mathcal{C}}_{i, j}^{e}\left( {t \mid x}\right) \). Moving forward, this section addresses the core question: Is \( {\mathcal{C}}_{i, j}^{e}\left( {t \mid x}\right) \) reflected in the model's internal states, and can it be identified in a streaming manner during generation?

### 4.1. Token-level Probing of Knowledge Conflict

We construct a streaming detector: when generating the \( t \)-th token, it determines whether an effective conflict \( {\mathcal{C}}_{i, j}^{e}\left( {t \mid x}\right) \) is triggered based solely on the hidden state \( {\mathbf{h}}_{t}^{\left( l\right) } \). While prior work has employed probes for binary hallucination detection (Obeso et al., 2025), we extend this to a four-class classification task based on the definition in Section 3.2.

Here, we use \( z = 0 \) as the label indicating that no conflict is triggered (i.e., \( {\mathcal{C}}_{i, j}^{e}\left( {t \mid x}\right) = 0,\forall {\mathcal{C}}_{i, j} \)), while \( z \in \{ 1,2,3\} \) corresponds to the active state of a specific pairwise knowledge conflict \( {\mathcal{C}}_{i, j}^{e}\left( {t \mid x}\right) = 1 \), namely \( {\mathcal{C}}_{\mathrm{{VP}}},{\mathcal{C}}_{\mathrm{{PT}}} \), and \( {\mathcal{C}}_{\mathrm{{VT}}} \).

Formally, we define a probe \( {f}_{\phi } \) that maps hidden states to a probability distribution over conflict labels:

\[
{P}_{\phi }\left( {z \mid {\mathbf{h}}_{t}^{\left( l\right) }}\right) = \operatorname{Softmax}\left( {{f}_{\phi }\left( {\mathbf{h}}_{t}^{\left( l\right) }\right) }\right) ,\; z \in \{ 0,1,2,3\} . \tag{5}
\]

The supervision signal for training \( {f}_{\phi } \) comes from the span-level assertion annotations constructed in Table 1. We project the label of each annotated span to all its constituent tokens to obtain the dense label sequence \( \left\{ {z}_{t}\right\} \).
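With a linear \( {f}_{\phi } \), Eq. (5) reduces to a single matrix product followed by a softmax. A minimal numpy sketch, with a toy hidden size and random, untrained weights standing in for \( \phi \):

```python
# Minimal sketch of the token-level probe in Eq. (5): a linear map from a
# hidden state h_t to a softmax over {0: none, 1: C_VP, 2: C_PT, 3: C_VT}.
# Dimensions and weights are illustrative, not trained parameters.
import numpy as np

rng = np.random.default_rng(0)
d = 16                       # toy hidden-state dimension
W = rng.normal(size=(d, 4))  # untrained stand-in for the probe parameters phi
b = np.zeros(4)

def probe(h):
    """P_phi(z | h_t^(l)) = softmax(f_phi(h)) over the four labels."""
    logits = h @ W + b
    logits = logits - logits.max()  # subtract max for numerical stability
    p = np.exp(logits)
    return p / p.sum()

h_t = rng.normal(size=d)     # stand-in for a layer-l hidden state
p = probe(h_t)
z_hat = int(np.argmax(p))    # streaming conflict prediction for this token
```

In the streaming setting, `probe` would be applied to each hidden state as the token is generated, so detection adds only one small matrix multiply per step.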
Since conflict tokens are extremely sparse in long-CoT, we train the probe using a weighted cross-entropy objective:

\[
{\mathcal{L}}_{\text{ probe }} = - \mathop{\sum }\limits_{t}{w}_{t}\log {P}_{\phi }\left( {{z}_{t} \mid {\mathbf{h}}_{t}^{\left( l\right) }}\right) , \tag{6}
\]

where \( {w}_{t} \) is a sample weight that assigns higher weight to \( z \in \{ 1,2,3\} \) (i.e., tokens where a knowledge conflict \( {\mathcal{C}}_{i, j} \) occurs), preventing the probe from degenerating into predicting only the no-conflict background class. This objective allows the probe to maintain overall stability while remaining sufficiently sensitive to critical conflict-triggering moments. Full training details are provided in Appendix C.

### 4.2. Verifying the Separability of Knowledge Conflicts

We evaluate whether the probe reliably diagnoses knowledge conflicts from internal states. Specifically, we examine the token-level separability of effective conflicts and whether their sample-level aggregation recovers the objective conflict types.

![bo_d6nb7sc601uc73e2hngg_3_159_1535_668_387_0.jpg](images/bo_d6nb7sc601uc73e2hngg_3_159_1535_668_387_0.jpg)

Figure 2. Token-level separability of effective conflict \( {\mathcal{C}}_{i, j}^{e}\left( {t \mid x}\right) \). The left panel shows the confusion matrix over token-level conflict predictions. The right panels decompose performance into binary detection of conflict versus no-conflict, and fine-grained attribution among conflict types. Values denote row-normalized recall.

(I) Separability of Effective Conflicts: Local Signals in Sparse Regimes. We first examine whether the probe can distinguish different types of effective conflicts \( {\mathcal{C}}_{i, j}^{e}\left( {t \mid x}\right) \) from the model's internal states during reasoning.

As shown in Figure 2, the probe demonstrates robust discrimination capabilities.
In the binary detection stage (Stage I), the model achieves a high True Negative rate of 88.7%, effectively filtering out non-conflicting steps. Conversely, a False Negative rate of 46.6% is observed, primarily driven by semantic sparsity within conflict spans, where 67.1% of \( {\mathcal{C}}_{\mathrm{{VP}}} \) tokens are misclassified as non-conflicting due to weak local signals. However, once effective conflict is activated (Stage II), the separability between conflict types sharply increases: \( {\mathcal{C}}_{\mathrm{{PT}}} \) and \( {\mathcal{C}}_{\mathrm{{VT}}} \) achieve near-perfect identification accuracies of 99.4% and 94.8%, respectively. Even \( {\mathcal{C}}_{\mathrm{{VP}}} \), the most subtle type, sees its recognition accuracy jump from 26.6% in the global view to 80.7% in the conditioned view. The minimal off-diagonal confusion (\( < 1\% \) between PT and others) confirms that effective conflict types possess distinct, highly separable internal representations.

Conclusion (Local Effective Conflicts): Even under extreme sparsity and noise, different types of effective knowledge conflicts \( {\mathcal{C}}_{i, j}^{e}\left( {t \mid x}\right) \) give rise to distinct local structures in the model's internal states that can be reliably captured by the probe. This validates the feasibility of streaming diagnosis of effective conflicts while revealing differences in their intrinsic detectability.

![bo_d6nb7sc601uc73e2hngg_3_896_1223_703_419_0.jpg](images/bo_d6nb7sc601uc73e2hngg_3_896_1223_703_419_0.jpg)

Figure 3. Sample-level separability of conflict types. We visualize the t-SNE projection of hidden states at layer 20 (R1-Onevision) and layer 39 (Llama-3.2V). The three conflict categories are colored according to their Objective Conflict labels, pre-defined during dataset construction. The top-right confusion matrices illustrate the sample-level attribution performance.

(II) Alignment to Objective Conflicts: Aggregating Effective Signals.
We next examine whether aggregating local effective conflicts \( {\mathcal{C}}_{i, j}^{e}\left( {t \mid x}\right) \) along a reasoning trajectory recovers the corresponding objective conflict \( {\mathcal{C}}_{i, j}^{o}\left( x\right) \) defined at the input level. This analysis evaluates the robustness of effective conflict signals beyond individual steps.

For each long-CoT trajectory, we aggregate hidden states of activated effective conflicts via mean pooling to obtain a sample-level representation. We visualize these representations using t-SNE (Figure 3), where samples sharing the same objective conflict type form compact clusters that are well separated, indicating consistent global structure.

![bo_d6nb7sc601uc73e2hngg_4_155_190_1446_524_0.jpg](images/bo_d6nb7sc601uc73e2hngg_4_155_190_1446_524_0.jpg)

Figure 4. Cross-layer distribution of conflict signals. Top row: attention-head activation ratio on conflict tokens vs. no-conflict tokens (lines), and their difference (bars), computed using effective conflict labels. Middle/bottom rows: layer-wise probe performance (one-vs-rest AUC and Recall@0.1) for \( {\mathcal{C}}_{\mathrm{{VP}}},{\mathcal{C}}_{\mathrm{{PT}}},{\mathcal{C}}_{\mathrm{{VT}}} \) across three MLLM backbones.

Quantitatively, we infer the objective conflict type by aggregating stepwise effective conflict activations:

\[
{\widehat{\mathcal{C}}}_{\text{ sample }} = \arg \mathop{\max }\limits_{{\mathcal{C}}_{i, j}}\mathop{\sum }\limits_{{t = 1}}^{T}\mathbb{I}\left\lbrack {{\mathcal{C}}_{i, j}^{e}\left( {t \mid x}\right) = 1}\right\rbrack . \tag{7}
\]

Comparing \( {\widehat{\mathcal{C}}}_{\text{ sample }} \) with the ground-truth objective labels \( {\mathcal{C}}_{i, j}^{o}\left( x\right) \) directly tests whether the model's internal conflict aligns with the conflict structure inherent in the input.

As shown in the inset matrices of Figure 3, aggregation substantially enhances separability.
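The attribution rule of Eq. (7) simply counts stepwise activations per conflict type and takes the argmax; a minimal sketch with illustrative label names:

```python
# Sketch of the sample-level attribution rule in Eq. (7): the objective
# conflict type is inferred as the most frequently activated effective
# conflict along the trajectory. Step dicts below are toy data.

CONFLICT_TYPES = ("C_VP", "C_PT", "C_VT")

def aggregate(effective):
    """effective: per-step dicts {type: 0/1}; returns (argmax type, counts)."""
    counts = {c: sum(step[c] for step in effective) for c in CONFLICT_TYPES}
    return max(counts, key=counts.get), counts

steps = [
    {"C_VP": 0, "C_PT": 1, "C_VT": 0},
    {"C_VP": 1, "C_PT": 1, "C_VT": 0},
    {"C_VP": 0, "C_PT": 1, "C_VT": 0},
]
pred, counts = aggregate(steps)
```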
Notably, \( {\mathcal{C}}_{\mathrm{{PT}}} \) achieves a perfect 100.0% on both R1-Onevision and Llama-3.2V, confirming that text-prior conflicts induce unique and stable shifts in internal states. The remaining confusion is largely confined to the visual-conflict types: for instance, 25.1% of \( {\mathcal{C}}_{\mathrm{{VT}}} \) samples in R1-Onevision are misclassified as \( {\mathcal{C}}_{\mathrm{{VP}}} \), and 14.7% of \( {\mathcal{C}}_{\mathrm{{VP}}} \) samples in Llama-3.2V are misidentified as \( {\mathcal{C}}_{\mathrm{{VT}}} \). This overlap is expected, as both categories involve failures in processing visual evidence, leading to partially shared representations.

### 4.3. Cross-Layer Distribution of Conflict Signals

We scan model depth to localize where effective knowledge conflicts are most strongly encoded. Concretely, for each layer \( l \), we train the same token-level probe on hidden states \( {\mathbf{h}}_{t}^{\left( l\right) } \) and evaluate its one-vs-rest AUC / Recall@0.1 for \( \left\{ {{\mathcal{C}}_{\mathrm{{VP}}},{\mathcal{C}}_{\mathrm{{PT}}},{\mathcal{C}}_{\mathrm{{VT}}}}\right\} \).

Beyond probe separability, we also quantify a lightweight mechanistic correlate (Huang et al., 2025a): how attention-head activations differ between conflict and no-conflict token positions. Let \( {\mathcal{A}}^{\left( l\right) } \) denote the set of attention heads at layer \( l \), and let \( {\mathbf{o}}_{t}^{\left( l, a\right) } \) be the output of head \( a \in {\mathcal{A}}^{\left( l\right) } \) at token \( t \). We define token sets using effective conflict signals:

\[
{\mathcal{S}}_{\text{ conf }} = \left\{ {\left( {x, t}\right) \mid \exists \left( {i, j}\right) ,{\mathcal{C}}_{i, j}^{e}\left( {t \mid x}\right) = 1}\right\} , \tag{8}
\]

\[
{\mathcal{S}}_{\text{ nconf }} = \left\{ {\left( {x, t}\right) \mid \forall \left( {i, j}\right) ,{\mathcal{C}}_{i, j}^{e}\left( {t \mid x}\right) = 0}\right\} . \tag{9}
\]

The layer-wise head activation ratio on a token set \( \mathcal{S} \) is

\[
{R}^{\left( l\right) }\left( \mathcal{S}\right) = {\mathbb{E}}_{\left( {x, t}\right) \in \mathcal{S}}\frac{1}{\left| {\mathcal{A}}^{\left( l\right) }\right| }\mathop{\sum }\limits_{{a \in {\mathcal{A}}^{\left( l\right) }}}\mathbb{I}\left\lbrack {{\begin{Vmatrix}{\mathbf{o}}_{t}^{\left( l, a\right) }\end{Vmatrix}}_{2} > \gamma }\right\rbrack , \tag{10}
\]

where \( \gamma \) is a fixed activation threshold (details in Appendix C.3). We then report the activation drift

\[
\Delta {R}^{\left( l\right) } = {R}^{\left( l\right) }\left( {\mathcal{S}}_{\text{ conf }}\right) - {R}^{\left( l\right) }\left( {\mathcal{S}}_{\text{ nconf }}\right) , \tag{11}
\]

which measures how strongly attention activations shift when effective conflicts are triggered.

As shown in Figure 4, both measurements reveal distinct depth-dependent signatures. (I) Probe Separability: In 7B models (R1-Onevision, Ocean-R1), discrimination performance rises in early layers and maximizes in the mid-to-late block (Layers 15-22), where AUC scores for \( {\mathcal{C}}_{\mathrm{{PT}}} \) and \( {\mathcal{C}}_{\mathrm{{VT}}} \) consistently exceed 93%, before declining in the final layers. Llama-3.2V pushes this saturation deeper, maintaining highly robust separability (\( \geq {95}\% \)) as deep as Layer 39. (II) Activation Drift: This aligns with attention shifts. R1-series models show negative drift (suppression) peaking at Layers 18-22, while Llama-3.2V displays positive drift (enhancement) in Layers 30-39. We term these co-located peaks (Layer 20 for 7B, 39 for 11B) the conflict encoding stage, anchoring our analysis.
---

Conclusion (Global Effective Conflicts): By aggregating stepwise effective conflict signals \( {\mathcal{C}}_{i, j}^{e}\left( {t \mid x}\right) \) along the reasoning trajectory, different objective conflict types \( {\mathcal{C}}_{i, j}^{o}\left( x\right) \) become clearly and robustly separable at the sample level. This indicates that effective conflicts are not merely local artifacts, but form consistent global patterns that reliably reflect the underlying input-level objective conflict structure.

---

Table 2. Assessment of conflict probe performance across three VLM backbones. We report AUC and Recall at FPR=0.1 (Rec@0.1) under the One-vs-Rest setting. Gray rows indicate the Span-Max aggregation, which consistently outperforms token-level baselines. Values are presented as percentages (%).
Column groups: AUC (%) then Recall@0.1 (%); within each group, w/o = w/o Conflict, VP = \( {\mathcal{C}}_{\mathrm{{VP}}}^{o} = 1 \), PT = \( {\mathcal{C}}_{\mathrm{{PT}}}^{o} = 1 \), VT = \( {\mathcal{C}}_{\mathrm{{VT}}}^{o} = 1 \).

| Models | Probe | Granularity | AUC w/o | AUC VP | AUC PT | AUC VT | Rec@0.1 w/o | Rec@0.1 VP | Rec@0.1 PT | Rec@0.1 VT |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| (7B) R1-Onevision | Linear | All Token | 81.7±0.1 | 86.3±0.2 | 92.0±0.1 | 94.8±0.2 | 50.0±0.3 | 56.8±0.2 | 75.1±0.1 | 87.3±0.3 |
| | | Span Only | 76.8±0.2 | 82.5±0.1 | 90.8±0.3 | 95.4±0.2 | 35.5±0.2 | 44.5±0.3 | 70.5±0.2 | 88.5±0.1 |
| | | Span-Max | 93.2±0.1 | 94.2±0.2 | 98.6±0.1 | 97.3±0.1 | 81.5±0.2 | 82.4±0.1 | 97.2±0.1 | 93.8±0.2 |
| | MLP | All Token | 95.5±0.1 | 90.4±0.2 | 85.2±0.3 | 94.1±0.1 | 89.1±0.2 | 68.4±0.1 | 62.7±0.2 | 79.3±0.1 |
| | | Span Only | 95.7±0.2 | 86.1±0.3 | 80.3±0.1 | 93.3±0.2 | 89.8±0.1 | 53.0±0.2 | 43.7±0.2 | 76.8±0.3 |
| | | Span-Max | 97.3±0.1 | 94.5±0.1 | 93.2±0.2 | 99.1±0.1 | 93.4±0.2 | 82.4±0.1 | 82.1±0.1 | 98.7±0.2 |
| (7B-Instruct) Ocean-R1 | Linear | All Token | 83.0±0.2 | 90.6±0.1 | 94.2±0.2 | 94.9±0.1 | 53.7±0.3 | 69.4±0.1 | 81.3±0.2 | 85.6±0.1 |
| | | Span Only | 78.5±0.1 | 86.7±0.3 | 90.0±0.2 | 97.6±0.1 | 41.4±0.2 | 52.5±0.2 | 66.6±0.1 | 94.6±0.3 |
| | | Span-Max | 95.0±0.2 | 95.9±0.1 | 98.6±0.1 | 98.8±0.1 | 85.7±0.1 | 87.9±0.2 | 97.1±0.1 | 97.8±0.2 |
| | MLP | All Token | 95.5±0.1 | 92.8±0.1 | 85.0±0.2 | 95.5±0.1 | 87.1±0.2 | 75.6±0.3 | 61.6±0.1 | 85.2±0.2 |
| | | Span Only | 97.8±0.2 | 87.3±0.2 | 79.7±0.1 | 91.7±0.2 | 95.7±0.3 | 53.9±0.1 | 43.3±0.2 | 71.0±0.1 |
| | | Span-Max | 99.2 | 96.5±0.1 | 95.3±0.2 | 98.4±0.1 | 98.9±0.1 | 89.8±0.2 | 87.5±0.1 | 96.1±0.1 |
| (11B-cot) Llama-3.2V | Linear | All Token | 88.7±0.2 | 90.5±0.1 | 96.9±0.2 | 94.5±0.1 | 68.4±0.3 | 67.2±0.2 | 94.4±0.1 | 85.8±0.2 |
| | | Span Only | 79.6±0.2 | 85.8±0.2 | 90.2 | 95.2±0.3 | 43.2±0.1 | 51.1±0.2 | 66.0±0.2 | 88.4 |
| | | Span-Max | 93.9±0.1 | 93.4 | 98.4 | 97.2±0.1 | 83.5±0.2 | 76.9±0.1 | 96.1±0.2 | 93.1±0.1 |
| | MLP | All Token | 95.8±0.2 | 90.7±0.1 | 88.7±0.2 | 96.9±0.1 | 89.4±0.1 | 64.3±0.2 | 70.6±0.3 | 93.7±0.2 |
| | | Span Only | 96.1 | 85.5±0.3 | 79.2±0.2 | 89.2±0.1 | 90.8±0.2 | 46.7±0.1 | 40.5±0.2 | 65.2±0.1 |
| | | Span-Max | 97.2 | 94.5±0.2 | 93.4±0.1 | 97.8 | 93.2±0.1 | 82.3±0.2 | 82.3±0.1 | 94.4±0.2 |
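Our reading of the Span-Max aggregation from the Table 2 caption (an assumption on our part; the exact implementation is deferred to Appendix C.5) is a per-class maximum of token-level probabilities within each span:

```python
# Hypothetical sketch of "Span-Max": a span's score for each class is the
# maximum token-level predicted probability inside the span. The class
# names and probabilities below are illustrative toy values.

def span_max(token_probs, span):
    """token_probs: per-token dicts of class probabilities; span: (i, j)."""
    i, j = span
    window = token_probs[i:j]
    return {c: max(p[c] for p in window) for c in window[0]}

probs = [
    {"none": 0.9, "C_VP": 0.1},
    {"none": 0.3, "C_VP": 0.7},   # peak conflict evidence inside the span
    {"none": 0.6, "C_VP": 0.4},
]
score = span_max(probs, (0, 3))
```

Taking the maximum rather than the mean keeps a span detectable even when only one or two of its tokens carry a strong conflict signal, which is consistent with Span-Max outperforming the token-level baselines in Table 2.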
Conclusion (Layer-level): Layer-scanning reveals that both probe separability and attention drift co-localize in a specific mid-to-late layer band across all three MLLM backbones. This indicates that conflict-related signals are depth-dependent and concentrated in a distinct "conflict encoding stage," bridging early perception and late decoding rather than being uniformly distributed across the network.

### 4.4. Linearity of Conflict Representation

To comprehensively assess the nature of effective conflict signals \( {\mathcal{C}}_{i, j}^{e}\left( {t \mid x}\right) \) encoded in the hidden states \( {\mathbf{h}}_{t}^{\left( l\right) } \) (specifically, whether they are explicitly linear or highly entangled), we conducted experiments on specific layers identified as the "Conflict Encoding Stage" in Section 4.3. We designed two probe architectures with distinct underlying assumptions: (I) Linear Probe \( \left( {f}_{lin}\right) \), consisting of a single projection layer \( \mathbf{W} \in {\mathbb{R}}^{d \times 4} \) (where \( d \) denotes the hidden state dimension), aimed at evaluating the Linear Separability of conflict states. High classification accuracy with a linear mapping would indicate that the model has formed clear, decoupled conflict boundaries at the current layer. (II) MLP Probe \( \left( {f}_{mlp}\right) \), designed to assess Non-linear Entanglement. Recognizing the potential manifold complexity in deep Transformer features, we construct a deep MLP with three dimension-reducing layers \( \left( {{1024} \rightarrow {512} \rightarrow {256}}\right) \) and ReLU activation to capture high-order interaction features.

As shown in Table 2, we report AUC and Recall@0.1 for both probes using "Span-Max" aggregation, which takes the maximum predicted probability across tokens within each span (details in Appendix C.5). The Linear Probe achieves strong performance across all conflict types: AUC reaches 93.2-98.8% and Recall@0.1 reaches 76.9-97.8%.
For \( {\mathcal{C}}_{\mathrm{{PT}}} \), the Linear Probe achieves 98.6% AUC and 96.1-97.2% Recall@0.1; for \( {\mathcal{C}}_{\mathrm{{VP}}} \) and \( {\mathcal{C}}_{\mathrm{{VT}}} \), it reaches 93.4-95.9% AUC and 76.9-87.9% Recall@0.1, comparable to the MLP. The fact that a single linear layer suffices to achieve such performance indicates that for knowledge conflicts, the "features" extracted by LLMs are already explicitly disentangled in the high-dimensional space, and introducing additional nonlinear complexity (MLP) does not yield significant gain.

Conclusion (Linearity): A simple linear probe achieves detection performance comparable to that of a non-linear MLP. This suggests that effective conflicts are not entangled within complex nonlinear manifolds, but rather are explicitly and approximately linearly separable, making real-time detection of conflict states during inference possible.

## 5. Intervening in Knowledge Conflict

Section 4 showed that effective conflicts \( {\mathcal{C}}_{i, j}^{e}\left( {t \mid x}\right) \) are streaming-decodable from internal states and are encoded as linearly separable features in specific mid-to-late layers. Building on this observation, we ask the following: given an input with \( {\mathcal{C}}_{i, j}^{o}\left( x\right) = 1 \), can inference-time interventions bias the model toward a desired knowledge source, or suppress the activation of effective conflicts during generation?

![bo_d6nb7sc601uc73e2hngg_6_152_184_1451_464_0.jpg](images/bo_d6nb7sc601uc73e2hngg_6_152_184_1451_464_0.jpg)

Figure 5. Semantic performance of targeted source control. We evaluate three conflict subsets \( \left( {{\mathcal{C}}_{\mathrm{{VP}}}^{o},{\mathcal{C}}_{\mathrm{{VT}}}^{o},{\mathcal{C}}_{\mathrm{{PT}}}^{o}}\right) \) using judge-based metrics: ASR (Anchor Support Rate, ↑), ARR (Anchor Rejection Rate, ↓), and OER (Obvious Error Rate, ↓).
Forward/Reverse denote intervening toward the truth-anchored (benchmark-reliable) vs. the conflicting source. Arrows indicate relative changes against the baseline. Note that VCD is inapplicable to the non-visual \( {\mathcal{C}}_{\mathrm{{PT}}}^{o} \) subset.

#### 5.1. A unified framework for directional interventions

Two control objectives. We study inference-time control under objectively conflicting inputs and consider two settings. (I) Targeted source control. We choose a target source \( {\mathcal{K}}_{s} \in \left\{ {{\mathcal{K}}_{i},{\mathcal{K}}_{j}}\right\} \) and intervene so that the model follows \( {\mathcal{K}}_{s} \) under conflict. This yields two directions: Forward, which intervenes toward the truth-anchored (benchmark-reliable) source, and Reverse, which enforces the opposite source. (II) Conflict mitigation. We measure whether interventions reduce how often effective conflicts are activated during generation, quantified by the expected fraction of reasoning steps at which a conflict is detected:

\[
{\mathbb{E}}_{x}{\mathbb{E}}_{t}\left\lbrack {\mathbb{I}\left\lbrack {{\mathcal{C}}_{i, j}^{e}\left( {t \mid x}\right) = 1}\right\rbrack }\right\rbrack . \tag{12}
\]

A unified view of directional interventions. Let \( {\ell }_{t} \in {\mathbb{R}}^{\left| \mathcal{V}\right| } \) denote the pre-softmax logits at step \( t \). We view an inference-time intervention as modifying decoding through an additive logit perturbation, applied either directly or implicitly via hidden-state manipulation:

\[
{\widetilde{p}}_{t} = \operatorname{softmax}\left( {{\ell }_{t} + \Delta {\ell }_{t}}\right) ,\;\Delta {\ell }_{t} = \mathcal{I}\left( {x,{y}_{ < t}}\right) . \tag{13}
\]

We consider three instantiations of \( \mathcal{I} \): (I) Visual contrastive decoding (VCD).
VCD applies a logit-level correction (Leng et al., 2023) and is restricted to conflicts involving visual sources (i.e., \( {\mathcal{C}}_{\mathrm{{VP}}}^{o} = 1 \) or \( {\mathcal{C}}_{\mathrm{{VT}}}^{o} = 1 \)). (II) Representation steering. Leveraging the linear separability found in Section 4, we adopt representation steering (Zhang et al., 2025c), which shifts the hidden state at a selected conflict-sensitive layer by a learned direction, i.e., \( {\widetilde{\mathbf{h}}}_{t} = {\mathbf{h}}_{t} + \lambda \mathbf{v} \) (where \( \lambda \) is the steering strength and \( \mathbf{v} \) is the direction vector). (III) Probe-guided control. We use the streaming probe to score candidate continuations, reweighting decoding toward options less likely to trigger conflicts. For the top-\( k \) candidates \( {\mathcal{V}}_{k} \) with base probabilities \( {p}_{t}\left( w\right) \), we apply

\[
{\widetilde{p}}_{t}\left( w\right) \propto {p}_{t}\left( w\right) \exp \left( {\alpha {P}_{t}^{\left( w\right) }}\right) ,\;w \in {\mathcal{V}}_{k}, \tag{14}
\]

where \( {P}_{t}^{\left( w\right) } \) is the probe-predicted probability of the no-conflict state for the continuation committing to token \( w \), and \( \alpha \) controls the strength of the guidance. Full implementation details and hyperparameters are provided in Appendices D.4 and D.5.

### 5.2. Targeted source control: semantic-level evaluation

We evaluate whether targeted interventions successfully bias the model toward a specified knowledge source under objectively conflicting inputs. We adopt an automated assertion-level judge, implemented with a strong off-the-shelf large language model, to assess semantic alignment with the target source.
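The three instantiations in Section 5.1 all fit the additive-logit view of Eq. (13); a minimal numpy sketch of that view, of the steering shift, and of the probe-guided reweighting in Eq. (14) is given below. The logits and probe scores are made-up placeholders, not outputs of any of the evaluated models:

```python
import numpy as np

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def perturbed_decode(logits, delta):
    """Eq. (13): an intervention viewed as an additive logit
    perturbation applied before the softmax."""
    return softmax(logits + delta)

def steer(h, v, lam):
    """Representation steering: shift a hidden state along a learned
    direction v with strength lam (h_tilde = h + lam * v)."""
    return h + lam * v

def probe_guided(p, probe_no_conflict, alpha, k):
    """Eq. (14): reweight the top-k candidates by exp(alpha * P_t^(w)),
    where P_t^(w) is the probe's no-conflict probability for the
    continuation committing to token w."""
    topk = np.argsort(p)[-k:]          # indices of the k largest probs
    q = np.zeros_like(p)
    q[topk] = p[topk] * np.exp(alpha * probe_no_conflict[topk])
    return q / q.sum()

logits = np.array([2.0, 1.0, 0.5, -1.0])     # toy 4-token vocabulary
p = softmax(logits)
probe_scores = np.array([0.1, 0.9, 0.8, 0.2])  # hypothetical probe outputs
p_guided = probe_guided(p, probe_scores, alpha=2.0, k=3)
```

With these toy numbers the baseline prefers token 0, while the guided distribution shifts its mass to token 1, which the hypothetical probe rates least conflict-prone among the top-k candidates.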
The judge extracts factual claims from the model output and verifies each claim against the corresponding truth anchor (image, input text, or world knowledge), producing compact aggregate metrics: ASR (Anchor Support Rate), ARR (Anchor Rejection Rate), and OER (Obvious Error Rate). To validate judge reliability, we conducted human verification on a stratified 10% subset (~1,500 spans), yielding high inter-annotator agreement \( \left( {\kappa = {0.87}}\right) \) and confirming that automated verdicts align closely with human perception of conflict resolution (details in Appendix D.2).

As shown in Figure 5, targeted source control is feasible but exhibits a pronounced directional asymmetry. Across objective-conflict subsets, Forward interventions (intervening toward the truth-anchored source: vision for VP/VT and prior knowledge for PT) reliably improve semantic alignment, whereas Reverse control (forcing reliance on the competing source) often degrades it. We hypothesize that this asymmetry reflects an internal source-reliability prior: when sources disagree, the model resists reversing arbitration away from the source it treats as reliable, even under strong contextual pressure. This asymmetry cannot be explained by construction bias alone: if it were purely a data artifact, we would expect the probe to learn shortcuts to anchor proximity rather than to capture genuine conflict dynamics. Instead, the asymmetry persists across all three architecturally distinct backbones, suggesting it reflects shared instruction-tuning biases that favor user-provided context (Sharma et al., 2024; Zhang et al., 2025c). Under Forward control, probe-guided interventions improve ASR while lowering OER by \( \sim {30}\% \); VCD yields stronger but selective gains on \( {\mathcal{C}}_{\mathrm{{VP}}} \) (ASR +15%, ARR halved). Reverse control remains challenging: most methods regress or show negligible gains.
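Agreement figures like the κ = 0.87 reported for the judge validation are commonly Cohen's kappa, which corrects raw agreement for chance; a self-contained sketch with made-up judge/human verdict labels (not the paper's annotations):

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa between two label sequences:
    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed
    agreement and p_e the agreement expected by chance from the
    two annotators' label marginals."""
    assert len(a) == len(b)
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    p_e = sum((ca[l] / n) * (cb[l] / n) for l in set(ca) | set(cb))
    return (p_o - p_e) / (1 - p_e)

# toy example: judge vs. human verdicts on 8 spans
judge = ["support", "refute", "support", "support",
         "error", "refute", "support", "support"]
human = ["support", "refute", "support", "refute",
         "error", "refute", "support", "support"]
kappa = cohens_kappa(judge, human)
```

Here raw agreement is 7/8, but kappa discounts the agreement expected by chance, landing below the raw rate.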
Mechanistically, the probe primarily suppresses conflict states rather than enforcing weaker-source selection. This highlights a trade-off: VCD is high-gain but direction-sensitive, whereas representation steering reliably reduces errors (ARR/OER) but rarely drives sustained ASR gains.

Table 3. Token-level conflict mitigation under the forward direction. Results are reported on three objective-conflict subsets \( \left( {{\mathcal{C}}_{\mathrm{{VP}}}^{o} = 1}\right. \), \( {\mathcal{C}}_{\mathrm{{VT}}}^{o} = 1 \), and \( {\mathcal{C}}_{\mathrm{{PT}}}^{o} = 1 \)) across three backbones. We report four token-level mitigation metrics: \( \mathbf{{SS}} \uparrow \), \( \mathbf{{CAC}} \downarrow \), \( \mathbf{{CCI}} \downarrow \), and \( \mathbf{{CR}} \downarrow \) (metric definitions in Appendix D.3). VCD is not applicable when \( {\mathcal{C}}_{\mathrm{{PT}}}^{o} = 1 \) and is therefore reported only for the first two subsets.
| | | R1-Onevision-7B | | | | Ocean-R1-7B-Instruct | | | | Llama-3.2V-11B-cot | | | |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Method | Subset | SS↑ | CAC↓ | CCI↓ | CR↓ | SS↑ | CAC↓ | CCI↓ | CR↓ | SS↑ | CAC↓ | CCI↓ | CR↓ |
| baseline | \( {\mathcal{C}}_{\mathrm{VP}}^{o} = 1 \) | 0.94 | 0.04 | 0.70 | 0.03 | 0.89 | 0.07 | 0.71 | 0.06 | 0.94 | 0.04 | 0.45 | 0.02 |
| baseline | \( {\mathcal{C}}_{\mathrm{VT}}^{o} = 1 \) | 0.88 | 0.08 | 0.80 | 0.10 | 0.87 | 0.09 | 0.79 | 0.10 | 0.90 | 0.06 | 0.72 | 0.03 |
| baseline | \( {\mathcal{C}}_{\mathrm{PT}}^{o} = 1 \) | 0.82 | 0.12 | 0.80 | 0.15 | 0.82 | 0.12 | 0.80 | 0.15 | 0.84 | 0.11 | 0.70 | 0.11 |
| VCD | \( {\mathcal{C}}_{\mathrm{VP}}^{o} = 1 \) | \( 0.92^{-0.02} \) | \( 0.05^{+0.01} \) | 0.69 | \( 0.04^{+0.01} \) | 0.90 | \( 0.06^{-0.01} \) | \( 0.69^{-0.01} \) | \( 0.05^{-0.01} \) | 0.85 | \( 0.08^{+0.04} \) | 0.63 | \( 0.06^{+0.05} \) |
| VCD | \( {\mathcal{C}}_{\mathrm{VT}}^{o} = 1 \) | \( 0.90^{+0.01} \) | \( 0.07^{-0.01} \) | 0.79 | \( 0.08^{-0.01} \) | 0.92 | \( 0.05^{-0.03} \) | \( 0.75^{-0.04} \) | \( 0.05^{-0.05} \) | 0.78 | \( 0.12^{+0.06} \) | 0.69 | \( 0.15^{+0.11} \) |
| steering | \( {\mathcal{C}}_{\mathrm{VP}}^{o} = 1 \) | \( 0.92^{-0.02} \) | \( 0.05^{+0.01} \) | 0.69 | \( 0.05^{+0.02} \) | 0.89 | \( 0.07^{+0.00} \) | \( 0.71^{+0.00} \) | \( 0.07^{+0.01} \) | 0.92 | \( 0.05^{+0.01} \) | \( 0.55^{+0.10} \) | \( 0.03^{+0.02} \) |
| steering | \( {\mathcal{C}}_{\mathrm{VT}}^{o} = 1 \) | \( 0.91^{+0.02} \) | \( 0.06^{-0.02} \) | 0.76 | \( 0.07^{-0.03} \) | 0.91 | \( 0.06^{-0.03} \) | \( 0.77^{-0.03} \) | \( 0.06^{-0.04} \) | \( 0.90^{+0.00} \) | \( 0.06^{-0.00} \) | \( 0.67^{-0.04} \) | \( 0.04^{+0.01} \) |
| steering | \( {\mathcal{C}}_{\mathrm{PT}}^{o} = 1 \) | \( 0.77^{-0.06} \) | \( 0.16^{+0.04} \) | 0.76 | \( 0.20^{+0.05} \) | \( 0.82^{+0.00} \) | \( 0.12^{-0.00} \) | 0.80 | \( 0.15^{+0.00} \) | \( 0.84^{+0.00} \) | \( 0.11^{-0.00} \) | \( 0.69^{-0.01} \) | \( 0.12^{+0.01} \) |
| probe-guided | \( {\mathcal{C}}_{\mathrm{VP}}^{o} = 1 \) | \( 0.95^{+0.01} \) | \( 0.03^{-0.01} \) | \( 0.67^{-0.03} \) | \( 0.02^{-0.01} \) | \( 0.92^{+0.03} \) | \( 0.05^{-0.02} \) | 0.66 | \( 0.03^{-0.03} \) | \( 0.94^{+0.01} \) | \( 0.04^{-0.00} \) | \( 0.39^{-0.06} \) | \( 0.02^{-0.00} \) |
| probe-guided | \( {\mathcal{C}}_{\mathrm{VT}}^{o} = 1 \) | \( 0.94^{+0.06} \) | \( 0.04^{-0.04} \) | \( 0.64^{-0.16} \) | \( 0.02^{-0.07} \) | \( 0.93^{+0.06} \) | \( 0.04^{-0.04} \) | 0.72 | \( 0.04^{-0.06} \) | \( 0.92^{+0.02} \) | \( 0.05^{-0.01} \) | \( 0.67^{-0.05} \) | \( 0.03^{-0.00} \) |
| probe-guided | \( {\mathcal{C}}_{\mathrm{PT}}^{o} = 1 \) | \( 0.78^{-0.04} \) | \( 0.10^{-0.02} \) | \( 0.60^{-0.20} \) | \( 0.15^{+0.01} \) | \( 0.87^{+0.04} \) | \( 0.08^{-0.04} \) | 0.72 | \( 0.09^{-0.06} \) | \( 0.87^{+0.04} \) | \( 0.08^{-0.03} \) | 0.63 | \( 0.10^{-0.01} \) |

Superscripts denote changes relative to the corresponding baseline row.
Conclusion (Targeted Source Control). When objective conflicts are present, inference-time interventions exhibit a clear directional asymmetry: biasing the model toward fact-consistent, truth-anchored sources is significantly easier and more reliable than forcing it to rely on fact-inconsistent sources. This suggests that conflict resolution in MLLMs is governed by a stable, source-dependent inductive tendency, which can be strengthened but is difficult to reverse.

### 5.3. Conflict mitigation under the default direction

Semantic evaluation in Section 5.2 demonstrated that, under objectively conflicting inputs, inference-time interventions can bias model outputs toward the truth-anchored source. Here, we pose a complementary process-level question: under the default (Forward) direction, can we reduce the activation of effective conflicts \( {\mathcal{C}}_{i, j}^{e}\left( {t \mid x}\right) \) during generation? We employ token-level mitigation metrics (Support Score (SS), Conflict Rate (CR), Confidence-Adjusted Conflict (CAC), and Conflict Confidence Index (CCI)) to summarize these internal dynamics, as a complement to the independent semantic-correctness evaluation in Figure 5. Table 3 summarizes the token-level mitigation results. We observe that interventions targeting the identified conflict features (probe-guided control) consistently suppress conflict dynamics across backbones. Specifically, on the visually involved subsets \( \left( {\mathcal{C}}_{\mathrm{{VT}}}\right) \), the frequency of conflict activation (CR) decreases significantly (e.g., \( {0.10} \rightarrow {0.02} \) on R1-Onevision). Crucially, even when conflict frequency remains stable (e.g., \( {\mathcal{C}}_{\mathrm{{PT}}} \)), confidence-aware measures reveal deeper suppression (CCI drops by 25%), indicating that the intervention mitigates the intensity of conflicts even if not their occurrence.
In contrast, rigid interventions such as representation steering, and unguided perturbations such as VCD, struggle to generalize. For instance, VCD exacerbated conflict rates fivefold on Llama-3.2V for \( {\mathcal{C}}_{\mathrm{{VT}}} \) \( \left( {{0.03} \rightarrow {0.15}}\right) \). This disparity highlights that effective mitigation requires precise targeting of the conflict-encoding subspaces rather than broad adjustments.

Conclusion (Conflict Mitigation). Guiding the model toward the reliable source attenuates internal conflict dynamics during reasoning, reducing both the intensity and the frequency of effective conflict states. This implies that effective conflict \( {\mathcal{C}}_{i, j}^{e}\left( {t \mid x}\right) \) activation is not an inherent attribute of generation but a plastic internal state that can be suppressed during reasoning.

## 6. Conclusion

In this work, we study failures in multimodal long-CoT reasoning from the perspective of knowledge conflict rather than knowledge absence. By distinguishing objective conflicts from effective conflicts during reasoning, we show that many failures arise from how conflicting knowledge is resolved over time. We find that effective conflicts are encoded as explicit, linearly decodable signals concentrated in the mid-to-late layers of the model. Leveraging these signals, we uncover a pronounced directional asymmetry: guiding the model toward its reliability-aligned source is substantially easier than forcing conflict resolution in the opposite direction, indicating a biased and path-dependent mechanism. Looking forward, we hope this perspective motivates analysis and control methods for richer conflict structures and more complex multimodal reasoning settings.

## Impact Statement

This paper presents work whose goal is to advance the understanding and reliability of MLLMs in long-CoT reasoning scenarios.
By diagnosing knowledge conflicts and their intervention mechanisms, our research contributes to making AI systems more transparent and trustworthy. The diagnostic framework and intervention methods proposed here could help identify and mitigate reasoning failures before deployment, potentially reducing the propagation of misinformation or hallucinated content in real-world applications. We do not foresee specific negative societal consequences that need to be highlighted beyond the general considerations applicable to advancing machine learning capabilities.

diff --git a/参考论文/groundtruth/TruthfulRAG.md b/参考论文/groundtruth/TruthfulRAG.md
new file mode 100644
index 0000000..e71e757
--- /dev/null
+++ b/参考论文/groundtruth/TruthfulRAG.md

# TruthfulRAG: Resolving Factual-level Conflicts in Retrieval-Augmented Generation with Knowledge Graphs

Shuyi Liu, Yuming Shang, Xi Zhang*

Key Laboratory of Trustworthy Distributed Computing and Service (MoE)

Beijing University of Posts and Telecommunications, China

{liushuyi111, shangym, zhangx}@bupt.edu.cn

## Abstract

Retrieval-Augmented Generation (RAG) has emerged as a powerful framework for enhancing the capabilities of Large Language Models (LLMs) by integrating retrieval-based methods with generative models. As external knowledge repositories continue to expand and the parametric knowledge within models becomes outdated, a critical challenge for RAG systems is resolving conflicts between retrieved external information and LLMs' internal knowledge, which can significantly compromise the accuracy and reliability of generated content. However, existing approaches to conflict resolution typically operate at the token or semantic level, often leading to a fragmented and partial understanding of factual discrepancies between LLMs' knowledge and context, particularly in knowledge-intensive tasks.
To address this limitation, we propose TruthfulRAG, the first framework that leverages Knowledge Graphs (KGs) to resolve factual-level knowledge conflicts in RAG systems. Specifically, TruthfulRAG constructs KGs by systematically extracting triples from retrieved content, utilizes query-based graph retrieval to identify relevant knowledge, and employs entropy-based filtering mechanisms to precisely locate conflicting elements and mitigate factual inconsistencies, thereby enabling LLMs to generate faithful and accurate responses. Extensive experiments reveal that TruthfulRAG outperforms existing methods, effectively alleviating knowledge conflicts and improving the robustness and trustworthiness of RAG systems.

## Introduction

Large Language Models (LLMs) have demonstrated impressive performance across diverse natural language understanding and generation tasks (Achiam et al. 2023; Touvron et al. 2023; Yang et al. 2025). Despite their proficiency, LLMs remain ineffective at handling specialized, privacy-sensitive, or time-sensitive knowledge that is not encompassed within their training corpora (Zhang et al. 2024; Huang et al. 2025). As a solution, Retrieval-Augmented Generation (RAG) has emerged as a promising paradigm that enhances the relevance and factuality of generated responses by integrating external knowledge retrieval with the remarkable generative capabilities of LLMs (Lewis et al. 2020; Gao et al. 2023; Fan et al. 2024). However, as RAG systems continuously update their knowledge repositories, the temporal disparity between dynamic external sources and static parametric knowledge within LLMs inevitably leads to knowledge conflicts (Xie et al. 2023; Xu et al. 2024; Shi et al. 2024), which can significantly undermine the accuracy and reliability of the generated content.
![bo_d6nbbd4601uc73e2hqsg_0_930_625_726_730_0.jpg](images/bo_d6nbbd4601uc73e2hqsg_0_930_625_726_730_0.jpg)

Figure 1: The illustration of knowledge conflicts and the differences between existing solutions and TruthfulRAG.

Recent research has begun to investigate the impact of knowledge conflicts on the performance of RAG systems (Chen, Zhang, and Choi 2022; Xie et al. 2023; Tan et al. 2024) and to explore methods for mitigating such conflicts (Wang et al. 2024; Jin et al. 2024; Zhang et al. 2025; Bi et al. 2025). Existing resolution approaches fall into two methodological types: (i) token-level methods, which manage LLMs' preference between internal and external knowledge by adjusting the probability distribution over the output tokens (Jin et al. 2024; Bi et al. 2025); and (ii) semantic-level methods, which resolve conflicts by semantically integrating and aligning knowledge segments from internal and external sources (Wang et al. 2024; Zhang et al. 2025). However, these token-level and semantic-level methods generally employ coarse-grained strategies that rely on fragmented data representations, resulting in insufficient contextual awareness. This may prevent LLMs from accurately capturing complex interdependencies and fine-grained factual inconsistencies, especially in knowledge-intensive conflict scenarios (Han et al. 2024).

---

*Corresponding author.

Copyright © 2026, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

---

To address the above limitations, we propose TruthfulRAG, the first framework that leverages Knowledge Graphs (KGs) to resolve factual-level conflicts in RAG systems. As illustrated in Figure 1, unlike previous studies, TruthfulRAG uses structured triple-based knowledge representations to construct reliable contexts, thereby enhancing the confidence of LLMs in external knowledge and facilitating trustworthy reasoning.
The TruthfulRAG framework comprises three key modules: (a) Graph Construction, which derives structured triples from retrieved external knowledge by identifying entities, relations, and attributes to construct knowledge graphs; (b) Graph Retrieval, which conducts query-based retrieval over the graph to obtain knowledge that exhibits strong factual associations with the input query; and (c) Conflict Resolution, which applies entropy-based filtering techniques to locate conflicting elements and mitigate factual inconsistencies, ultimately forming more reliable reasoning paths and promoting more accurate outputs. The framework integrates seamlessly with existing RAG architectures, enabling the extraction of highly relevant and factually consistent knowledge, effectively eliminating factual-level conflicts and improving generation reliability.

The contributions of this paper are as follows:

- We discover that constructing contexts through textual representations of structured triples can enhance the confidence of LLMs in external knowledge, thereby promoting trustworthy and reliable model reasoning.

- We introduce TruthfulRAG, the first framework that leverages knowledge graphs to resolve factual-level conflicts in RAG systems through systematic triple extraction, query-based graph retrieval, and entropy-based filtering mechanisms.

- We conduct extensive experiments demonstrating that TruthfulRAG outperforms existing methods in mitigating knowledge conflicts while improving the robustness and trustworthiness of RAG systems.

## Methodology

In this section, we provide a detailed introduction to the TruthfulRAG framework.
As illustrated in Figure 2, TruthfulRAG comprises three interconnected modules: (i) Graph Construction, which transforms unstructured retrieved content into structured knowledge graphs through systematic triple extraction; (ii) Graph Retrieval, which employs query-aware graph traversal algorithms to identify semantically relevant reasoning paths; and (iii) Conflict Resolution, which utilizes entropy-based filtering mechanisms to detect and mitigate factual inconsistencies between parametric and external knowledge.

## Graph Construction

The construction of a knowledge graph begins with the conversion of raw information retrieved by the RAG system into structured knowledge representations through systematic entity-relation-attribute extraction.

Given the retrieved content \( C \) for the user's query \( q \), we first perform fine-grained semantic segmentation to partition the content into coherent textual segments \( \mathcal{S} = \left\{ {{s}_{1},{s}_{2},\ldots ,{s}_{m}}\right\} \), where each segment \( {s}_{i} \) represents a semantically coherent unit containing factual information. For each textual segment \( {s}_{i} \in \mathcal{S} \), we employ the generative model \( \mathcal{M} \) of the RAG system to extract a set of structured knowledge triples \( {\mathcal{T}}_{i} = \left\{ {{\mathcal{T}}_{i,1},{\mathcal{T}}_{i,2},\ldots ,{\mathcal{T}}_{i, n}}\right\} \), with each triple \( {\mathcal{T}}_{i, j} = \left( {h, r, t}\right) \) consisting of a head entity \( h \), a relation \( r \), and a tail entity \( t \). This extraction process aims to capture both explicit factual statements and implicit semantic relationships embedded within the original content, thereby ensuring the comprehensiveness and semantic integrity of the knowledge representation.
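The aggregation step can be sketched in a few lines of Python. The triples below are hypothetical placeholders (in TruthfulRAG they would be produced by the RAG system's own generative model):

```python
def build_graph(triples):
    """Aggregate (head, relation, tail) triples into G = (E, R, T_all):
    E collects all head/tail entities, R all relations, and T_all the
    complete triple repository."""
    entities, relations = set(), set()
    for h, r, t in triples:
        entities.update((h, t))
        relations.add(r)
    return entities, relations, set(triples)

# hypothetical triples extracted from two retrieved segments
triples = [
    ("Marie Curie", "won", "Nobel Prize in Physics"),
    ("Marie Curie", "born_in", "Warsaw"),
    ("Nobel Prize in Physics", "awarded_by", "Royal Swedish Academy of Sciences"),
]
E, R, T_all = build_graph(triples)
```

Using sets deduplicates entities and relations that recur across segments, which is what lets the graph filter low-information noise from repetitive retrieved text.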
The triples aggregated from all retrieved content form the foundation for constructing the knowledge graph \( \mathcal{G} \):

\[
\mathcal{G} = \left( {\mathcal{E},\mathcal{R},{\mathcal{T}}_{\text{all}}}\right) \tag{1}
\]

where \( \mathcal{E} = \mathop{\bigcup }\limits_{{i,j}}\left\{ {{h}_{i,j},{t}_{i,j}}\right\} \) represents the entity set, \( \mathcal{R} = \mathop{\bigcup }\limits_{{i,j}}\left\{ {r}_{i,j}\right\} \) denotes the relation set, and \( {\mathcal{T}}_{\text{all}} = \mathop{\bigcup }\limits_{{i,j}}\left\{ {\mathcal{T}}_{i,j}\right\} \) constitutes the complete triple repository. This structured knowledge representation enables the filtering of low-information noise and captures detailed factual associations, thereby providing a clear and semantically enriched foundation for subsequent query-aware knowledge retrieval.

## Graph Retrieval

To acquire knowledge that is strongly aligned with user queries at the factual level, we design a query-aware graph traversal algorithm that identifies critical knowledge paths within the graph, ensuring both semantic relevance and factual consistency in the retrieval process.

Initially, key elements are extracted from the user query \( q \) to serve as references for matching components in the knowledge graph. These elements include the query's target entities, relations, and intent categories, denoted as \( {\mathcal{K}}_{q} \).
Subsequently, semantic similarity matching is employed to identify the top-\( k \) most relevant entities and relations within the knowledge graph:

\[
{\mathcal{E}}_{imp} = \operatorname{TopK}\left( {\left\{ {\operatorname{sim}\left( {e,{\mathcal{K}}_{q}}\right) : e \in \mathcal{E}}\right\}, k}\right) \tag{2}
\]

\[
{\mathcal{R}}_{imp} = \operatorname{TopK}\left( {\left\{ {\operatorname{sim}\left( {r,{\mathcal{K}}_{q}}\right) : r \in \mathcal{R}}\right\}, k}\right) \tag{3}
\]

where \( \operatorname{sim}\left( {\cdot , \cdot }\right) \) is the semantic similarity function computed with dense embeddings, \( {\mathcal{E}}_{imp} \) denotes the set of key entities, and \( {\mathcal{R}}_{imp} \) denotes the set of key relations. From each key entity \( e \in {\mathcal{E}}_{imp} \), we perform a two-hop graph traversal to systematically collect the full set of possible initial reasoning paths \( {\mathcal{P}}_{init} \).

To further select reasoning paths with stronger factual associations, we introduce a fact-aware scoring mechanism that evaluates the relevance of a path \( p \) to the query based on its coverage of key entities and relations:

\[
\operatorname{Ref}\left( p\right) = \alpha \cdot \frac{\left| \left\{ e \in p\right\} \cap {\mathcal{E}}_{imp}\right| }{\left| {\mathcal{E}}_{imp}\right| } + \beta \cdot \frac{\left| \left\{ r \in p\right\} \cap {\mathcal{R}}_{imp}\right| }{\left| {\mathcal{R}}_{imp}\right| } \tag{4}
\]

where \( \alpha \) and \( \beta \) are hyperparameters that control the relative importance of entity and relation coverage, respectively. The top-scored reasoning paths from \( {\mathcal{P}}_{init} \) constitute the core knowledge paths \( {\mathcal{P}}_{super} \):

\[
{\mathcal{P}}_{super} = \operatorname{TopK}\left( {\left\{ {\operatorname{Ref}\left( p\right) : p \in {\mathcal{P}}_{init}}\right\}, K}\right) \tag{5}
\]

![bo_d6nbbd4601uc73e2hqsg_2_147_140_1502_806_0.jpg](images/bo_d6nbbd4601uc73e2hqsg_2_147_140_1502_806_0.jpg)

Figure 2: The overall pipeline of the TruthfulRAG framework.
TruthfulRAG first extracts structured knowledge triples to construct a comprehensive knowledge graph. Subsequently, it employs query-aware graph traversal to identify salient reasoning paths, where each path comprises entities and relationships enriched with associated attributes. Finally, the framework applies entropy-based conflict resolution to detect and filter out corrective paths that challenge parametric misconceptions, thereby alleviating knowledge conflicts between internal and external information and promoting consistent and credible responses.

To construct detailed contextual information, each core reasoning path \( p \in {\mathcal{P}}_{super} \) is represented as a comprehensive contextual structure consisting of three essential components:

\[
p = {\mathcal{C}}_{\text{path}} \oplus {\mathcal{C}}_{\text{entities}} \oplus {\mathcal{C}}_{\text{relations}} \tag{6}
\]

where:

- \( {\mathcal{C}}_{\text{path}} \) represents the complete sequential reasoning path \( {e}_{1}\overset{{r}_{1}}{ \rightarrow }{e}_{2}\overset{{r}_{2}}{ \rightarrow }\cdots \overset{{r}_{n - 1}}{ \rightarrow }{e}_{n} \), capturing the logical progression of entities connected through relational links.

- \( {\mathcal{C}}_{\text{entities}} = \left\{ {\left( {e,{\mathcal{A}}_{e}}\right) : e \in p \cap {\mathcal{E}}_{imp}}\right\} \) encompasses all important entities within the path along with their corresponding attribute descriptions \( {\mathcal{A}}_{e} \), providing thorough entity-specific information for the context.

- \( {\mathcal{C}}_{\text{relations}} = \left\{ {\left( {r,{\mathcal{A}}_{r}}\right) : r \in p \cap {\mathcal{R}}_{imp}}\right\} \) includes all important relations on the path together with their corresponding attributes \( {\mathcal{A}}_{r} \), enriching the semantic and contextual understanding of the relations.
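A minimal sketch of the fact-aware scoring of Eqs. (4)-(5) and of rendering a path's \( {\mathcal{C}}_{\text{path}} \) component as text. The entities, relations, and the values of α, β, and K are illustrative placeholders:

```python
def ref_score(entities, relations, E_imp, R_imp, alpha=0.5, beta=0.5):
    """Eq. (4): score a path by its coverage of the key entities E_imp
    and key relations R_imp."""
    e_cov = len(set(entities) & E_imp) / len(E_imp)
    r_cov = len(set(relations) & R_imp) / len(R_imp)
    return alpha * e_cov + beta * r_cov

def top_paths(paths, E_imp, R_imp, K):
    """Eq. (5): keep the K highest-scoring initial paths as P_super."""
    return sorted(paths,
                  key=lambda p: ref_score(p[0], p[1], E_imp, R_imp),
                  reverse=True)[:K]

def render_path(entities, relations):
    """C_path component of Eq. (6): e1 -r1-> e2 -r2-> ... -> en."""
    out = entities[0]
    for r, e in zip(relations, entities[1:]):
        out += f" -{r}-> {e}"
    return out

E_imp = {"Marie Curie", "Nobel Prize in Physics"}
R_imp = {"won"}
paths = [
    (["Marie Curie", "Nobel Prize in Physics"], ["won"]),
    (["Marie Curie", "Warsaw"], ["born_in"]),
]
P_super = top_paths(paths, E_imp, R_imp, K=1)
context = render_path(*P_super[0])
```

The rendered string is the kind of textual representation that, per the first contribution above, is fed to the LLM in place of raw retrieved passages.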
This formalized representation of knowledge ensures that each extracted reasoning path preserves structural coherence through the entity-relation sequence and reinforces semantic richness via comprehensive attribute information, thereby facilitating more nuanced and context-aware knowledge integration for subsequent conflict resolution.

## Conflict Resolution

To address factual inconsistencies between parametric knowledge and external information, and to ensure that LLMs consistently follow the retrieved knowledge paths toward accurate reasoning, we employ entropy-based model confidence analysis to investigate the influence of conflicting knowledge on prediction uncertainty, thereby systematically identifying and resolving factual conflicts through uncertainty quantification.

We implement conflict detection by comparing model behavior under two distinct conditions: (1) pure parametric generation without access to external context, and (2) retrieval-augmented generation that incorporates structured reasoning paths constructed from the knowledge graph. For parametric generation, we compute the response probability of the LLM as a baseline:

\[
{P}_{\text{param}}\left( {\text{ans} \mid q}\right) = \mathcal{M}\left( q\right) \tag{7}
\]

where ans represents the generated answer and \( \mathcal{M}\left( q\right) \) denotes the response distribution of the LLM based solely on the query \( q \). For retrieval-augmented generation, we incorporate each reasoning path from \( {\mathcal{P}}_{super} \) as contextual information to obtain the model's output probability:

\[
{P}_{\text{aug}}\left( {\text{ans} \mid q, p}\right) = \mathcal{M}\left( {q \oplus p}\right) ,\;\forall p \in {\mathcal{P}}_{super} \tag{8}
\]

where \( \mathcal{M}\left( {q \oplus p}\right) \) represents the response distribution of the LLM conditioned on the query \( q \) and the corresponding reasoning path extracted from the knowledge graph.

Inspired by previous research on probability-based uncertainty estimation (Arora, Huang, and He 2021; Duan et al. 2024), we adopt entropy-based metrics to quantify the model's confidence in the retrieved knowledge:

\[
H\left( {P\left( {\text{ans} \mid \text{context}}\right) }\right) = - \frac{1}{\left| l\right| }\mathop{\sum }\limits_{{t = 1}}^{\left| l\right| }\mathop{\sum }\limits_{{i = 1}}^{k}{pr}_{i}^{\left( t\right) }{\log }_{2}{pr}_{i}^{\left( t\right) } \tag{9}
\]

where \( {pr}_{i}^{\left( t\right) } \) is the probability of the \( i \)-th of the top-\( k \) candidate tokens at position \( t \), and \( \left| l\right| \) denotes the token length of the answer. Accordingly, we obtain \( H\left( {{P}_{\text{param}}\left( {\text{ans} \mid q}\right) }\right) \) for parametric generation and \( H\left( {{P}_{\text{aug}}\left( {\text{ans} \mid q, p}\right) }\right) \) for retrieval-augmented generation incorporating an individual reasoning path \( p \).
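The entropy estimate of Eq. (9) can be sketched directly; the top-k token probabilities below are made-up stand-ins for a confident and a hesitant answer:

```python
import math

def mean_token_entropy(token_topk_probs):
    """Eq. (9): average, over answer positions, of the base-2 Shannon
    entropy of the top-k candidate-token probabilities.
    `token_topk_probs` is a list of per-position probability lists."""
    total = 0.0
    for pr in token_topk_probs:
        total -= sum(p * math.log2(p) for p in pr if p > 0)
    return total / len(token_topk_probs)

# made-up distributions: a confident answer vs. a hesitant one
H_confident = mean_token_entropy([[0.9, 0.05, 0.05], [0.8, 0.1, 0.1]])
H_hesitant = mean_token_entropy([[0.4, 0.3, 0.3], [0.5, 0.3, 0.2]])
delta_H = H_hesitant - H_confident   # a positive entropy gap
```

A positive gap between augmented and parametric entropies is exactly the conflict signal that the threshold τ acts on in the following equations.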
Consequently, we can use the entropy variation under different reasoning paths as a characteristic indicator of knowledge conflict:

\[
\Delta {H}_{p} = H\left( {{P}_{\text{aug}}\left( {\text{ans} \mid q, p}\right) }\right) - H\left( {{P}_{\text{param}}\left( {\text{ans} \mid q}\right) }\right) \tag{10}
\]

where a positive \( \Delta {H}_{p} \) indicates that the retrieved external knowledge intensifies uncertainty in the LLM's reasoning, potentially signaling factual inconsistencies with its parametric knowledge, whereas a negative value suggests that the retrieved knowledge aligns with the LLM's internal understanding, thereby reducing uncertainty. Reasoning paths whose entropy change exceeds a predefined threshold \( \tau \) are classified as corrective:

\[
{\mathcal{P}}_{\text{corrective}} = \left\{ {p \in {\mathcal{P}}_{super} : \Delta {H}_{p} > \tau }\right\} \tag{11}
\]

These identified corrective knowledge paths, which effectively challenge and potentially rectify the LLM's internal misconceptions, are subsequently aggregated to construct the refined contextual input. The final response is then generated by the LLM based on the enriched context:

\[
\text{Response} = \mathcal{M}\left( {q \oplus {\mathcal{P}}_{\text{corrective}}}\right) \tag{12}
\]

This entropy-based conflict resolution mechanism ensures that LLMs consistently prioritize factually accurate external information when generating responses, improving reasoning accuracy and trustworthiness and thereby enhancing the overall robustness of the RAG system.

## Experiments

In this section, we present comprehensive experiments to evaluate the effectiveness of TruthfulRAG in resolving knowledge conflicts and enhancing the reliability of RAG systems. Specifically, we aim to address the following research questions: (1) How does TruthfulRAG perform compared to other methods in terms of factual accuracy?
(2) What is the performance of TruthfulRAG in non-conflicting contexts? (3) To what extent do structured reasoning paths affect the confidence of LLMs compared to raw natural language context? (4) What are the individual contributions of each module within the TruthfulRAG framework?

## Experimental Setup

Datasets We conduct experiments on four datasets that encompass various knowledge-intensive tasks and conflict scenarios. FaithEval (Ming et al. 2025) is designed to assess whether LLMs remain faithful to unanswerable, inconsistent, or counterfactual contexts involving complex logical-level conflicts beyond the entity level. MuSiQue (Trivedi et al. 2022) and SQuAD (Rajpurkar et al. 2016) come from the prior work KRE (Ying et al. 2024); they contain fact-level knowledge conflicts that necessitate compositional multi-hop reasoning, making them particularly suitable for evaluating knowledge integration and conflict resolution in complex reasoning scenarios. RealtimeQA (Kasai et al. 2023) focuses on temporal conflicts, where answers may quickly become outdated, leading to inconsistencies between static parametric knowledge and dynamic external sources.

Evaluated Models We select three representative LLMs across different architectures and model scales to ensure comprehensive evaluations: GPT-4o-mini (Achiam et al. 2023), Qwen2.5-7B-Instruct (Yang et al. 2025), and Mistral-7B-Instruct (Jiang et al. 2024). This selection encompasses both open-source and closed-source models, ensuring that TruthfulRAG is broadly applicable to RAG systems built upon diverse LLM backbones.

Baselines We compare TruthfulRAG against five baseline approaches spanning different methodological categories: (i) Direct Generation requires LLMs to generate responses solely based on their parametric knowledge without any external retrieval.
(ii) Standard RAG represents the conventional retrieval-augmented generation paradigm, where LLMs generate responses using retrieved textual passages directly. (iii) KRE (Ying et al. 2024) serves as a representative prompt optimization method, which enhances reasoning faithfulness by adopting specialized prompting strategies to guide the model in resolving knowledge conflicts. (iv) COIECD (Yuan et al. 2024) represents the decoding manipulation category, which modifies the model's decoding strategy during the inference stage to guide LLMs toward greater reliance on retrieved context rather than parametric knowledge. (v) FaithfulRAG (Zhang et al. 2025) incorporates a self-reflection mechanism that identifies factual discrepancies between parametric knowledge and retrieved context, enabling LLMs to reason over and integrate conflicting facts before generating content.

Evaluation Metrics Following prior studies, we adopt accuracy (ACC) as the primary evaluation metric, measuring the proportion of questions for which the LLM generates correct answers, thereby providing a direct assessment of the factual correctness of the generated responses. To evaluate the method's capability to precisely extract information pertinent to the target answer from retrieved corpora, we introduce the Context Precision Ratio (CPR) metric, which measures the proportion of answer-related content within the processed context:

\[
\mathrm{CPR} = \frac{\left| \mathcal{A}_{\text{gold}} \cap \mathcal{C}_{\text{processed}} \right|}{\left| \mathcal{C}_{\text{processed}} \right|} \tag{13}
\]

where \( \left| \mathcal{A}_{\text{gold}} \cap \mathcal{C}_{\text{processed}} \right| \) denotes the length of the segments of the processed context directly related to the correct answer, and \( \left| \mathcal{C}_{\text{processed}} \right| \) represents the total length of the processed context.
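Under one plausible reading of Eq. (13), CPR can be approximated by measuring segment lengths in characters. The helper below is an illustrative sketch under that assumption, not the paper's implementation:

```python
def context_precision_ratio(gold_segments, processed_context):
    """Eq. (13), approximated at the character level: the fraction of
    the processed context occupied by answer-related segments.

    gold_segments: text spans directly related to the correct answer.
    processed_context: the context string actually passed to the LLM.
    """
    # Count the length of each gold segment that survives in the context.
    related = sum(len(seg) for seg in gold_segments if seg in processed_context)
    return related / len(processed_context)
```

A context that keeps only answer-related material scores close to 1, while a context diluted with irrelevant passages scores lower.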
| Method | LLM | FaithEval | MuSiQue | RealtimeQA | SQuAD | Avg. | Imp. |
| --- | --- | --- | --- | --- | --- | --- | --- |
| w/o RAG | GPT-4o-mini | 4.6 | 15.1 | 43.4 | 11.2 | 18.6 | - |
| w/o RAG | Qwen2.5-7B-Instruct | 4.2 | 19.6 | 40.7 | 11.1 | 18.9 | - |
| w/o RAG | Mistral-7B-Instruct | 6.3 | 13.8 | 29.2 | 11.5 | 15.2 | - |
| w/ RAG | GPT-4o-mini | 61.3 | 72.6 | 67.3 | 73.1 | 68.6 | 50.0 |
| w/ RAG | Qwen2.5-7B-Instruct | 53.1 | 75.2 | 78.7 | 68.3 | 68.8 | 49.9 |
| w/ RAG | Mistral-7B-Instruct | 61.9 | 67.6 | 52.2 | 67.2 | 62.2 | 47.0 |
| KRE | GPT-4o-mini | 50.7 | 34.6 | 47.5 | 65.3 | 49.5 | 30.9 |
| KRE | Qwen2.5-7B-Instruct | 59.6 | 70.7 | 86.7 | 73.7 | 72.7 | 53.8 |
| KRE | Mistral-7B-Instruct | 73.2 | 50.6 | 76.9 | 74.6 | 68.8 | 53.6 |
| COIECD | GPT-4o-mini | 53.9 | 56.4 | 48.7 | 57.6 | 54.2 | 35.6 |
| COIECD | Qwen2.5-7B-Instruct | 62.3 | 69.7 | 78.8 | 70.8 | 70.4 | 51.5 |
| COIECD | Mistral-7B-Instruct | 62.8 | 66.8 | 58.4 | 65.4 | 63.3 | 48.1 |
| FaithfulRAG | GPT-4o-mini | 67.2 | 79.3 | 78.8 | 80.8 | 76.5 | 58.0 |
| FaithfulRAG | Qwen2.5-7B-Instruct | 71.8 | 78.0 | 84.1 | 78.3 | 78.1 | 59.1 |
| FaithfulRAG | Mistral-7B-Instruct | 81.7 | 78.5 | 77.0 | 85.7 | 80.7 | 65.5 |
| TruthfulRAG (Ours) | GPT-4o-mini | 69.5 | 79.4 | 85.0 | 81.1 | 78.8 | 60.2 |
| TruthfulRAG (Ours) | Qwen2.5-7B-Instruct | 73.2 | 79.1 | 82.3 | 78.7 | 78.3 | 59.4 |
| TruthfulRAG (Ours) | Mistral-7B-Instruct | 81.9 | 79.3 | 81.4 | 82.7 | 81.3 | 66.1 |
Table 1: Comparison of ACC between TruthfulRAG and five baselines across four datasets with three representative LLMs. The best result for each backbone LLM within each dataset is highlighted in bold, and the second best is emphasized with an underline. Avg. denotes the arithmetic mean accuracy across the four datasets, while Imp. indicates the average improvement over the corresponding LLM's w/o RAG baseline.

Implementation Details For dense retrieval, cosine similarity is computed using embeddings generated by all-MiniLM-L6-v2. For entropy-based filtering, we set model-specific thresholds \( \tau \) for the entropy variation \( \Delta H_p \): GPT-4o-mini and Mistral-7B-Instruct use \( \tau = 1 \), while Qwen2.5-7B-Instruct adopts a higher threshold of \( \tau = 3 \). All experiments are conducted using NVIDIA V100 GPUs with 32 GB memory. To ensure reproducibility, the temperature for text generation is set to 0, and all Top-\( K \) values are set to 10.

## Results and Analysis

Overall Performance Table 1 presents a comprehensive comparison of TruthfulRAG against five baseline methods across four datasets, evaluating performance in terms of factual accuracy (ACC) using three representative LLMs. To facilitate overall assessment, we additionally report Avg., the arithmetic mean accuracy across the four datasets, and Imp., the average improvement over the corresponding LLM's w/o RAG baseline, serving as a proxy for the number of factual conflicts successfully corrected by the method from the LLM's parametric knowledge.

The results clearly demonstrate that TruthfulRAG consistently achieves superior or competitive performance relative to all baseline approaches. Specifically, it achieves the highest accuracy on FaithEval (81.9%), MuSiQue (79.4%), and RealtimeQA (85.0%), and ranks first or second on SQuAD across all models.
Notably, TruthfulRAG achieves the highest overall performance across all backbone LLMs, attaining both the best average accuracy (Avg.) and the greatest relative improvement (Imp.) compared to all baseline methods, underscoring its ability to mitigate the factual inconsistencies that standard RAG systems struggle with due to unresolved evidence conflicts.

Compared to standard RAG, which exhibits significant variability in accuracy due to unresolved knowledge conflicts, TruthfulRAG achieves improvements ranging from 3.6% to 29.2%. Furthermore, while methods like FaithfulRAG and KRE offer partial gains through semantic alignment or prompt-based mechanisms, they fall short of consistently resolving fine-grained factual discrepancies. In contrast, TruthfulRAG integrates knowledge graph-based reasoning with entropy-guided conflict filtering to identify and resolve contradictory information, thereby substantially enhancing factual reliability. These findings validate the effectiveness of TruthfulRAG in delivering accurate, faithful, and contextually grounded responses across diverse knowledge-intensive tasks.

Performance on Non-Conflicting Contexts To evaluate the robustness of TruthfulRAG in scenarios where retrieved contexts are free from factual conflicts, we conduct experiments on gold-standard datasets in which the retrieved passages are guaranteed to be non-contradictory.

As shown in Table 2, TruthfulRAG consistently outperforms all baseline methods on both the MuSiQue-golden and SQuAD-golden datasets. These findings substantiate that TruthfulRAG not only excels at resolving conflicting information but also maintains superior performance in non-conflicting contexts, demonstrating its broad applicability and effectiveness.
The consistent performance improvements can be attributed to the structured knowledge representation provided by the knowledge graph module, which enables the identification of fine-grained entities and relational links in non-conflicting contexts. This capability facilitates the extraction of query-relevant information and promotes a more comprehensive understanding and integration of factual knowledge by the LLMs. Notably, while methods such as KRE exhibit significant performance degradation in non-conflicting scenarios, TruthfulRAG maintains its robustness across diverse contextual settings. This consistency highlights its practical utility and reliability for real-world RAG applications.
| Dataset | w/o RAG | w/ RAG | KRE | COIECD | FaithfulRAG | TruthfulRAG (Ours) |
| --- | --- | --- | --- | --- | --- | --- |
| MuSiQue-golden | 45.6 | 89.9 | 44.1 (-45.8) | 89.5 (-0.4) | 91.8 (+1.9) | 93.2 (+3.3) |
| SQuAD-golden | 68.7 | 97.9 | 83.2 (-14.7) | 97.1 (-0.8) | 98.1 (+0.2) | 98.3 (+0.4) |
Table 2: Performance comparison on non-conflicting contexts with GPT-4o-mini as the backbone LLM. The best result on each dataset is highlighted in bold. The numbers in parentheses indicate the change in accuracy compared to standard RAG.

![bo_d6nbbd4601uc73e2hqsg_5_169_471_1470_348_0.jpg](images/bo_d6nbbd4601uc73e2hqsg_5_169_471_1470_348_0.jpg)

Figure 3: Comparison of LLM confidence, measured by negative log-probability (logprob) values using GPT-4o-mini, when reasoning with natural language contexts versus structured reasoning paths across four datasets. Lower negative logprob values indicate higher actual log-probability scores and thus increased model confidence in generating correct answers.

Impact of Structured Reasoning Paths To investigate the impact of structured reasoning paths on the confidence of LLMs relative to raw natural language context, we conduct a comprehensive analysis across four datasets. Specifically, we compare the model's confidence when reasoning with retrieved knowledge presented in natural language format or as structured reasoning paths derived through our knowledge graph construction mechanism. To quantify the model's confidence in its predicted answers, we measure the log-probability of the correct answer tokens generated by LLMs and compute the average across all test instances.

As shown in Figure 3, our experimental results reveal a consistent pattern across all evaluated datasets. Structured reasoning paths consistently lead to higher logprob values for correct answers compared to natural language contexts, indicating greater model confidence when reasoning with structured knowledge representations. This empirical evidence demonstrates that transforming unstructured natural language into structured reasoning paths through knowledge graphs significantly strengthens the LLM's confidence in following external retrieved knowledge for inference.
Furthermore, this finding provides crucial insight into the superior performance of TruthfulRAG in both conflicting and non-conflicting scenarios, as the enhanced confidence facilitates more reliable adherence to external knowledge sources, thereby supporting factual consistency and promoting the generation of faithful model outputs.

Ablation Study To comprehensively evaluate the contribution of each component in TruthfulRAG, we conduct systematic ablation experiments by removing key modules from the full framework. Since knowledge graph construction and retrieval are two closely coupled modules, we combine them as an integrated component for ablation evaluation.

As shown in Table 3, the complete TruthfulRAG framework achieves superior performance across all datasets, with accuracy improvements ranging from 6.8% to 17.7% compared to standard RAG, demonstrating that the structured knowledge graph and the conflict resolution mechanism function synergistically to enhance both factual accuracy and contextual precision. The ablation results reveal several critical insights. First, when employing only the filtering mechanism without knowledge graph integration (w/o Knowledge Graph), although accuracy demonstrates modest improvements, CPR exhibits a notable decline across most datasets, particularly on MuSiQue (1.86 to 1.15) and SQuAD (2.71 to 1.97). This phenomenon indicates that LLMs encounter substantial difficulties in effectively extracting relevant information from naturally organized contexts, thereby constraining their ability to achieve higher accuracy. In contrast, when utilizing solely the knowledge graph component without conflict resolution (w/o Conflict Resolution), CPR achieves significant improvements, yet the introduction of extensive structured knowledge simultaneously introduces redundant information, resulting in limited accuracy gains across most datasets.
These findings support our hypothesis that structured knowledge representations facilitate the precise localization of query-relevant information, enabling more targeted and effective information extraction compared to unstructured contexts.
| Method | FaithEval | MuSiQue | RealtimeQA | SQuAD |
| --- | --- | --- | --- | --- |
| Standard RAG | 61.3 / 0.51 | 72.6 / 1.86 | 67.3 / 0.47 | 73.1 / 2.71 |
| w/o Knowledge Graph | 64.8 / 0.52 | 78.9 / 1.15 | 83.2 / 0.23 | 78.8 / 1.97 |
| w/o Conflict Resolution | 69.3 / 0.59 | 77.8 / 2.79 | 84.1 / 1.80 | 78.2 / 2.85 |
| Full Method | 69.5 / 0.56 | 79.4 / 2.25 | 85.0 / 1.54 | 81.1 / 2.56 |
Table 3: Ablation study results of different components in TruthfulRAG with GPT-4o-mini as the backbone LLM. The results are presented in the format ACC / CPR, where ACC denotes accuracy and CPR represents the Context Precision Ratio.

## Related Work

This section reviews existing research on knowledge conflicts in RAG systems, categorizing the literature into two main areas: impact analysis and resolution strategies.

## Impact Analysis of Knowledge Conflicts

Recent studies have extensively explored the influence of knowledge conflicts on the performance of RAG systems (Longpre et al. 2021; Chen, Zhang, and Choi 2022; Xie et al. 2023; Tan et al. 2024; Ming et al. 2025), primarily highlighting differential preferences between parametric knowledge and retrieved external information. Longpre et al. (Longpre et al. 2021) first expose entity-based knowledge conflicts in question answering, revealing that LLMs tend to rely on parametric memory when retrieved passages are perturbed or contain contradictory information. Chen et al. (Chen, Zhang, and Choi 2022) demonstrate that while retrieval-based LLMs predominantly depend on non-parametric evidence when recall is high, their confidence scores fail to reflect inconsistencies among retrieved documents. Xie et al. (Xie et al. 2023) find that LLMs are receptive to single external evidence, yet exhibit strong confirmation bias when presented with both supporting and conflicting information. Tan et al. (Tan et al. 2024) reveal a systematic bias toward self-generated contexts over retrieved ones, attributing this to the higher query-context similarity and semantic incompleteness of retrieved snippets.

Our work aligns with the non-parametric knowledge preference paradigm, aiming to guide LLMs to follow updated and comprehensive external knowledge while correcting temporal and factual errors within internal memory, thereby generating accurate and trustworthy outputs.
## Solutions to Knowledge Conflicts

Current approaches for knowledge conflict resolution can be categorized into token-level and semantic-level methods (Jin et al. 2024; Wang et al. 2024; Bi et al. 2025; Zhang et al. 2025; Wang et al. 2025). Token-level approaches focus on fine-grained intervention during generation. \( \mathrm{CD}^{2} \) (Jin et al. 2024) employs attention weight manipulation to suppress parametric knowledge when conflicts are detected. ASTUTE RAG (Wang et al. 2024) utilizes gradient-based attribution to identify and mask conflicting tokens during inference. These methods achieve precise control, but often suffer from computational overhead and lack semantic awareness of the generated content. Semantic-level approaches operate at higher abstraction levels. CK-PLUG (Bi et al. 2025) develops parameter-efficient conflict resolution through adapter-based architectures that learn to weight parametric versus non-parametric knowledge dynamically. FaithfulRAG (Zhang et al. 2025) externalizes LLMs' parametric knowledge and aligns it with retrieved context, thereby achieving higher faithfulness without sacrificing accuracy. However, these methods primarily address surface-level conflicts without capturing the underlying factual relationships that drive knowledge inconsistencies.

Different from these approaches, TruthfulRAG leverages structured triple-based knowledge representations to precisely identify and resolve factual-level knowledge conflicts arising from complex natural language expressions, thereby ensuring the reliability and consistency of reasoning.

## Conclusion

In this paper, we introduce TruthfulRAG, the first framework that leverages knowledge graphs to address factual-level conflicts in RAG systems.
By integrating systematic triple extraction, query-aware graph retrieval, and entropy-based filtering mechanisms, TruthfulRAG transforms unstructured retrieved contexts into structured reasoning paths that enhance LLMs' confidence in external knowledge while effectively mitigating factual inconsistencies. Our comprehensive experiments demonstrate that TruthfulRAG consistently outperforms existing SOTA methods. These results establish TruthfulRAG as a robust and generalizable solution for improving the trustworthiness and accuracy of RAG systems, with significant implications for knowledge-intensive applications requiring high reliability and precision.