# HypRAG: Hyperbolic Dense Retrieval for Retrieval Augmented Generation
Hiren Madhu \( {}^{1} \) Ngoc Bui \( {}^{1} \) Ali Maatouk \( {}^{1} \) Leandros Tassiulas \( {}^{1} \) Smita Krishnaswamy \( {}^{1} \) Menglin Yang \( {}^{2} \) Sukanta Ganguly \( {}^{3} \) Kiran Srinivasan \( {}^{3} \) Rex Ying \( {}^{1} \)
## Abstract
Embedding geometry plays a fundamental role in retrieval quality, yet dense retrievers for retrieval-augmented generation (RAG) remain largely confined to Euclidean space. However, natural language exhibits hierarchical structure from broad topics to specific entities that Euclidean embeddings fail to preserve, causing semantically distant documents to appear spuriously similar and increasing hallucination risk. To address these limitations, we introduce hyperbolic dense retrieval, developing two model variants in the Lorentz model of hyperbolic space: HyTE-FH, a fully hyperbolic transformer, and HyTE-H, a hybrid architecture projecting pre-trained Euclidean embeddings into hyperbolic space. To prevent representational collapse during sequence aggregation, we introduce the Outward Einstein Midpoint, a geometry-aware pooling operator that provably preserves hierarchical structure. On MTEB, HyTE-FH outperforms equivalent Euclidean baselines, while on RAGBench, HyTE-H achieves up to 29% gains over Euclidean baselines in context relevance and answer relevance using substantially smaller models than current state-of-the-art retrievers. Our analysis also reveals that hyperbolic representations encode document specificity through norm-based separation, with over 20% radial increase from general to specific concepts, a property absent in Euclidean embeddings, underscoring the critical role of geometric inductive bias in faithful RAG systems \( {}^{1} \).
## 1. Introduction
Dense retrieval forms the backbone of retrieval-augmented generation (RAG) systems (Lewis et al., 2020; Fan et al., 2024), where embedding quality directly determines whether generated responses are grounded in evidence or hallucinated. By retrieving relevant documents and conditioning generation on this context, RAG systems produce responses that are more attributable and aligned with verifiable sources (Ni et al., 2025). Yet, despite advances in retrieval architectures, current systems continue to rely on Euclidean embeddings, a choice inherited from standard neural networks rather than from language structure itself.

Figure 1. Hierarchies in Text. (A) Documents naturally organize into branching hierarchies where general topics spawn increasingly specific subtopics. Euclidean spaces distort such hierarchies due to crowding effects, while hyperbolic geometry preserves hierarchical relationships through exponential volume growth. (B) Ricci curvature analysis of document embeddings from strong baselines reveals predominantly negative curvature, indicating tree-like semantic structure.
Natural language inherently exhibits strong hierarchical organization (He et al., 2025b; Robinson et al., 2024), with semantic structure giving rise to locally tree-like neighborhoods. Euclidean spaces struggle to represent such branching hierarchies due to polynomial volume growth (He et al., 2025b), introducing shortcuts between hierarchically distinct regions that distort semantic relationships. In retrieval settings, these distortions can cause semantically distant documents to appear spuriously similar (Radovanovic et al., 2010; Bogolin et al., 2022), degrading retrieval precision (Reimers & Gurevych, 2021): a query about a specific subtopic may retrieve documents from sibling or parent categories that share similarity but lack the required specificity.
To further see why geometry matters for retrieval, consider a query about transformer attention mechanisms (Figure 1A). Relevant documents form a natural hierarchy, from general concepts like NLP, to transformers, to specific components like multi-head attention, inducing tree-like semantic structure. Euclidean embeddings struggle to preserve this organization: representing both broad topics and specialized descendants forces a trade-off between semantic proximity and fine-grained separation, causing neighborhood crowding and distortion. Hyperbolic geometry resolves this tension through exponential volume growth, allowing general concepts to remain compact while specific documents spread outward. To test whether such structure appears empirically, we analyze Ollivier-Ricci curvature (Ni et al., 2019), a measure of local geometry where negative values indicate tree-like branching, on graphs built from MS MARCO document embeddings (Bajaj et al., 2016). Across several strong models (Linq Embed Mistral, LLaMA Nemotron 8B, Qwen3 Embedding 4B), curvature distributions are predominantly negative (Figure 1B), providing empirical evidence that retrieval-relevant embeddings exhibit intrinsic hyperbolic structure and motivating hyperbolic geometry as a natural inductive bias for dense retrieval.
---
\( {}^{1} \) Yale University, USA \( {}^{2} \) Hong Kong University of Science and Technology (Guangzhou), China \( {}^{3} \) NetApp, USA. Correspondence to: Rex Ying <rex.ying@yale.edu>.
Preprint. February 10, 2026.
\( {}^{1} \) The code is available at: https://anonymous.4open.science/r/HypRAG-30C6
---
Recent work has begun exploring hyperbolic geometry for language modeling and RAG systems, though with different focus areas. HELM (He et al., 2025a) introduces a family of hyperbolic language models that operate entirely in hyperbolic space, but these models target text generation rather than retrieval. In the RAG setting, HyperbolicRAG (Cao et al., 2025) projects embeddings into the Poincaré ball to encode hierarchical depth within a static, pre-built knowledge graph, using dual-space retrieval that fuses Euclidean and hyperbolic rankings. However, HyperbolicRAG relies on Euclidean encoders to produce the initial embeddings, leaving the fundamental geometric mismatch unresolved. Moreover, aggregating token embeddings into document representations poses a challenge that existing work in hyperbolic learning does not address (Yang et al., 2024). As we show in Proposition 4.3, naively averaging tokens in Euclidean space before projecting to hyperbolic space causes representations to collapse toward the origin, destroying the hierarchical structure that is meant to be preserved.
To this end, we introduce hyperbolic dense retrieval for RAG, framing embedding geometry as a design choice for improving evidence selection and grounding at the representation level. We study this through two complementary instantiations. First, HyTE-FH (Hyperbolic Text Encoder, Fully Hyperbolic) operates entirely in the Lorentz model of hyperbolic space, enabling end-to-end representation learning. Second, HyTE-H (Hybrid) maps embeddings from off-the-shelf Euclidean encoders into hyperbolic space, allowing us to build on existing pre-trained Euclidean models. The Lorentz model's intrinsic geometry enables parameter-efficient scaling: HyTE-H outperforms Euclidean baselines 2-3x its size, reducing memory footprint in resource-constrained settings. To address the aggregation challenge in both instantiations, we introduce the Outward Einstein Midpoint, a geometry-aware pooling operator that amplifies tokens farther from the origin, provably preserving hierarchical structure during pooling.
Through extensive evaluation on RAGBench, we demonstrate that both hyperbolic variants consistently outperform Euclidean baselines in answer relevancy across multiple datasets, while achieving competitive performance on MTEB. Our experiments validate three key findings: (1) hyperbolic retrieval substantially improves RAG performance, with up to 29% gains over Euclidean baselines in context relevance and answer relevance; (2) hyperbolic models naturally encode concept-level hierarchies in their radial structure, with the fully hyperbolic model achieving a 20.2% radius increase from general to specific concepts, while Euclidean models fail to capture this organization; and (3) our theoretically grounded Outward Einstein Midpoint pooling preserves this hierarchical structure during aggregation.
## 2. Related Works
Text Embeddings and Dense Retrieval. Dense retrieval embeds queries and documents into a shared vector space and ranks candidates by similarity (e.g., dot product or cosine). Transformer bi-encoders (e.g., BERT (Devlin et al., 2019)) are widely used in this context due to their scalability with maximum inner product search (Karpukhin et al., 2020; Reimers & Gurevych, 2019). Most methods train with contrastive objectives using in-batch and hard negatives (Gao et al., 2021; Izacard et al., 2021; Xiong et al., 2021), often following large-scale pretraining plus task-specific fine-tuning (Wang et al., 2022; Li et al., 2023; Nussbaum et al., 2025). More recently, decoder-only embedding models initialize from LLMs to exploit their pretrained linguistic knowledge (Muennighoff et al., 2024; Lee et al., 2024; Zhang et al., 2025). However, most retrievers remain reliant on inner products or distances in Euclidean geometry, an inductive bias often misaligned with the hierarchical structure of language and document collections. We address this gap by introducing hyperbolic geometry for text embeddings to better capture such a hierarchy.
Retrieval Augmented Generation. RAG grounds LLMs in retrieved evidence to improve factuality and access external knowledge (Oche et al., 2025). It typically retrieves top- \( k \) contexts (often via dense retrieval) and conditions generation on them (Lewis et al., 2020). Since the context window is limited, retrieval quality is a key bottleneck for relevance and faithfulness (Friel et al., 2024a). Several methods improve reliability after retrieval: Self-RAG (Asai et al., 2024) and CRAG (Yan et al., 2024) use learned critics to filter or re-rank evidence, while GraphRAG (Han et al., 2024) leverages knowledge graphs for structured subgraph retrieval. These approaches operate downstream of the embedding space and are complementary to our geometric approach. Our goal is to improve RAG upstream by enhancing the retriever representations so that the initial top- \( k \) evidence is more reliable under realistic efficiency constraints.
Hyperbolic Representation Learning. Hyperbolic geometry is primarily known for its ability to better capture hierarchical, tree-like structures (Yang et al., 2023; Peng et al., 2021), which enhances performance in various tasks, including molecular generation (Liu et al., 2019), recommendation (Yang et al., 2021; Li et al., 2021), image retrieval (Khrulkov et al., 2020; Wei et al., 2024; Bui et al., 2025), and knowledge graph embedding (Ganea et al., 2018a; Dhingra et al., 2018). More recently, hyperbolic geometry has also shown promise for multi-modal embedding models (Desai et al., 2023; Ibrahimi et al., 2024; Pal et al., 2024) and foundation models (Yang et al., 2025; He et al., 2025a). In contrast to these works, we study how hyperbolic representations can improve retrieval in RAG systems. Concurrently, Cao et al. (2025) use hyperbolic geometry to improve RAG rankings, but obtain hyperbolic embeddings via a simple projection from Euclidean encoders; by contrast, we build on fully hyperbolic encoders trained end-to-end and address key challenges in this setting, including a theoretically grounded geometry-aware pooling operator for document-level representations.
## 3. Hyperbolic Space Preliminaries
In this section, we review the preliminaries of the Lorentz model of hyperbolic space and introduce the basic building blocks of HyTE-FH.
### 3.1. Lorentz Model of Hyperbolic Space
We represent all embeddings in \( d \) -dimensional hyperbolic space \( {\mathbb{H}}_{K}^{d} \) with constant negative curvature \( K < 0 \) using the Lorentz (hyperboloid) model. In the Lorentz model, hyperbolic space is realized as the upper sheet of a two-sheeted hyperboloid embedded in \( {\mathbb{R}}^{d + 1} \) ,
\[
{\mathbb{H}}_{K}^{d} = \left\{ {\mathbf{x} \in {\mathbb{R}}^{d + 1}\mid \langle \mathbf{x},\mathbf{x}{\rangle }_{L} = \frac{1}{K},{x}_{0} > 0}\right\} ,
\]
where the Lorentzian inner product is defined as \( \langle \mathbf{x},\mathbf{y}{\rangle }_{L} = \; - {x}_{0}{y}_{0} + \mathop{\sum }\limits_{{i = 1}}^{d}{x}_{i}{y}_{i} \) . This formulation admits closed-form expressions for geodesic distances, barycentric operations, and parallel transport, and expresses similarity directly through Lorentzian inner products. The geodesic distance between two points \( \mathbf{x},\mathbf{y} \in {\mathbb{H}}_{K}^{d} \) is given by \( {d}_{K}\left( {\mathbf{x},\mathbf{y}}\right) = \; \frac{1}{\sqrt{-K}}{\cosh }^{-1}\left( {K\langle \mathbf{x},\mathbf{y}{\rangle }_{L}}\right) \) , which is a monotone function of the Lorentzian inner product.
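For concreteness, these quantities can be computed in a few lines. The sketch below is our own illustration (not the paper's released implementation), fixing \( K = -1 \); the `lift` helper, which places a spatial vector on the hyperboloid by solving the constraint for the time coordinate, is an assumption we add for the demo.

```python
import numpy as np

def lorentz_inner(x, y):
    """Lorentzian inner product <x, y>_L = -x0*y0 + sum_{i>=1} x_i*y_i."""
    return -x[0] * y[0] + np.dot(x[1:], y[1:])

def geodesic_distance(x, y, K=-1.0):
    """d_K(x, y) = arccosh(K <x, y>_L) / sqrt(-K) for points on H^d_K."""
    # K<x,y>_L >= 1 on the manifold; the clip guards against rounding error.
    return np.arccosh(np.clip(K * lorentz_inner(x, y), 1.0, None)) / np.sqrt(-K)

def lift(v, K=-1.0):
    """Place a spatial vector v in R^d on H^d_K by solving <x, x>_L = 1/K:
    x0 = sqrt(||v||^2 - 1/K)."""
    return np.concatenate(([np.sqrt(np.dot(v, v) - 1.0 / K)], v))

x, y = lift(np.array([0.3, -0.2])), lift(np.array([1.5, 0.7]))
assert abs(lorentz_inner(x, x) - (-1.0)) < 1e-9  # on-manifold: <x,x>_L = 1/K
assert geodesic_distance(x, x) < 1e-6            # d(x, x) = 0
```

Because the distance is a monotone function of the Lorentzian inner product, ranking by distance and ranking by inner product are interchangeable at retrieval time.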
To support optimization, we make use of exponential and logarithmic maps between the manifold and its tangent spaces. For a point \( \mathbf{x} \in {\mathbb{H}}_{K}^{d} \) , the logarithmic map \( {\log }_{x}\left( \cdot \right) \) maps nearby points to the tangent space \( {T}_{x}{\mathbb{H}}_{K}^{d} \) , while the exponential map \( {\exp }_{x}\left( \cdot \right) \) maps tangent vectors back to the manifold. These operators are used only where necessary for gradient-based updates, ensuring that all representations remain on \( {\mathbb{H}}_{K}^{d} \) and preserving the hierarchical structure induced by negative curvature.
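As an illustration of these maps, the following is a simplified sketch (ours, not the paper's code) of the exponential and logarithmic maps at the origin \( o = (1/\sqrt{-K}, 0, \ldots, 0) \) only; the general base-point versions follow the same pattern.

```python
import numpy as np

def exp_map_origin(v, K=-1.0):
    """Exponential map at the origin o = (1/sqrt(-K), 0, ..., 0).
    v is a tangent vector at o, so v[0] = 0."""
    sk = np.sqrt(-K)
    o = np.zeros_like(v)
    o[0] = 1.0 / sk
    n = np.linalg.norm(v[1:])   # tangent norm at o reduces to the spatial norm
    if n < 1e-12:
        return o
    return np.cosh(sk * n) * o + np.sinh(sk * n) / (sk * n) * v

def log_map_origin(x, K=-1.0):
    """Logarithmic map at the origin: inverse of exp_map_origin."""
    sk = np.sqrt(-K)
    rho = np.arccosh(np.clip(sk * x[0], 1.0, None)) / sk  # distance to origin
    u = x.copy()
    u[0] = 0.0                   # project onto the tangent space at o
    n = np.linalg.norm(u)
    return u if n < 1e-12 else rho * u / n
```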
### 3.2. Hyperbolic Transformer Components
Standard operations cannot be applied directly in hyperbolic space, as they may violate the manifold constraint (Yang et al., 2024). To address this, we introduce hyperbolic components that serve as the building blocks for our embedding model. These operations are performed via a re-centering procedure that applies Euclidean operations in a latent space and maps the result back to the Lorentz model. By doing so, the resulting vector is constructed to satisfy the Lorentz constraint, thereby preserving the hyperbolic structure of representations. We present these operations as follows.
Lorentz Linear Layer. Given curvatures \( {K}_{1},{K}_{2} \), and parameters \( \mathbf{W} \in {\mathbb{R}}^{\left( {n + 1}\right) \times m} \) and \( \mathbf{b} \in {\mathbb{R}}^{m} \) with \( \mathbf{z} = \left| {{\mathbf{W}}^{\top }\mathbf{x} + \mathbf{b}}\right| \), the Lorentzian linear transformation (Yang et al., 2024) is the map \( \operatorname{HLT} : {\mathbb{L}}^{{K}_{1}, n} \rightarrow {\mathbb{L}}^{{K}_{2}, m} \) given by
\[
\operatorname{HLT}\left( {\mathbf{x};\mathbf{W},\mathbf{b}}\right) = \sqrt{\frac{{K}_{2}}{{K}_{1}}} \cdot \left\lbrack {\sqrt{\parallel \mathbf{z}{\parallel }^{2} - 1/{K}_{2}},\mathbf{z}}\right\rbrack
\]
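To make the construction concrete, here is an illustrative sketch for the special case of equal input and output curvature (\( K_1 = K_2 = K \)), where the \( \sqrt{K_2/K_1} \) prefactor reduces to 1: the affine map produces the spatial part, and the time coordinate is solved from the hyperboloid constraint. The function name and the equal-curvature simplification are ours.

```python
import numpy as np

def hlt(x, W, b, K=-1.0):
    """Lorentz linear layer, sketched for K1 = K2 = K: compute the spatial
    part z = W^T x + b, then solve the time coordinate from <y, y>_L = 1/K."""
    z = W.T @ x + b                           # spatial part, shape (m,)
    y0 = np.sqrt(np.dot(z, z) - 1.0 / K)      # -1/K > 0, so the sqrt is defined
    return np.concatenate(([y0], z))

rng = np.random.default_rng(0)
x = np.concatenate(([np.sqrt(6.0)], np.full(4, np.sqrt(5.0 / 4))))  # on H^4
W, b = rng.normal(size=(5, 3)), rng.normal(size=3)
y = hlt(x, W, b)
assert abs(-y[0] ** 2 + y[1:] @ y[1:] + 1.0) < 1e-9  # output stays on H^3
```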
Hyperbolic Layer Normalization. Given token embeddings \( X = {\left\{ {\mathbf{x}}_{i}\right\} }_{i = 1}^{n} \subset {\mathbb{H}}_{K}^{d} \), hyperbolic layer normalization is defined as
\[
\text{ HypLayerNorm }\left( X\right) = \left( {\sqrt{\frac{{K}_{1}}{{K}_{2}}\parallel \mathbf{z}{\parallel }_{2}^{2} - \frac{1}{{K}_{2}}},\sqrt{\frac{{K}_{1}}{{K}_{2}}}\mathbf{z}}\right)
\]
where \( \mathbf{z} = {f}_{\mathrm{{LN}}}\left( {\mathbf{x}}_{i,\left\lbrack {1 : d}\right\rbrack }\right) \), \( {f}_{\mathrm{{LN}}}\left( \cdot \right) \) denotes standard Euclidean LayerNorm applied to the spatial components of the embedding, and \( {K}_{1},{K}_{2} \) are the input and output curvatures, respectively.
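Following the same re-centering recipe as the linear layer, a single-curvature sketch of this operator (ours, with no learnable affine parameters for brevity) looks like:

```python
import numpy as np

def hyp_layer_norm(x, K=-1.0, eps=1e-5):
    """Single-curvature sketch (K1 = K2 = K) of hyperbolic LayerNorm:
    standard LayerNorm on the spatial components, then re-solve the time
    coordinate so the point stays on H^d_K."""
    s = x[1:]
    z = (s - s.mean()) / np.sqrt(s.var() + eps)  # Euclidean LayerNorm on space
    return np.concatenate(([np.sqrt(z @ z - 1.0 / K)], z))
```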
Lorentz Residual Connection. Let \( \mathbf{x}, f\left( \mathbf{x}\right) \in {\mathbb{L}}^{K, n} \) where \( \mathbf{x} \) is an input vector and \( f\left( \mathbf{x}\right) \) is the output of a neural network \( f \). Then, the Lorentzian residual connection (He et al., 2025d) is given by \( \mathbf{x}{ \oplus }_{\mathcal{L}}f\left( \mathbf{x}\right) = {\alpha }_{1}\mathbf{x} + {\alpha }_{2}f\left( \mathbf{x}\right) \), where
\[
{\alpha }_{i} = {w}_{i}/\left( {\sqrt{-K}{\begin{Vmatrix}{w}_{1}\mathbf{x} + {w}_{2}f\left( \mathbf{x}\right) \end{Vmatrix}}_{\mathcal{L}}}\right) ,\text{ for }i \in \{ 1,2\} ,
\]
where \( {\alpha }_{1},{\alpha }_{2} \) are weights parametrized by constants \( \left( {{w}_{1},{w}_{2}}\right) \in {\mathbb{R}}^{2} \smallsetminus \{ \left( {0,0}\right) \} . \)
Hyperbolic Self-Attention. In hyperbolic attention, similarity is governed by hyperbolic geodesic distance (Ganea et al.,2018b). Given token embeddings \( X = {\left\{ {\mathbf{x}}_{i}\right\} }_{i = 1}^{n} \subset \; {\mathbb{H}}_{K}^{d} \) , queries, keys, and values are computed via Lorentz-linear transformations as \( \mathbf{Q} = \operatorname{HLT}\left( {X;{\mathbf{W}}^{Q},{\mathbf{b}}^{Q}}\right) ,\mathbf{K} = \; \operatorname{HLT}\left( {X;{\mathbf{W}}^{K},{\mathbf{b}}^{K}}\right) \) , and \( \mathbf{V} = \operatorname{HLT}\left( {X;{\mathbf{W}}^{V},{\mathbf{b}}^{V}}\right) \) , where HLT \( \left( \cdot \right) \) denotes a linear map in Lorentz space. Attention weights are computed using squared hyperbolic geodesic distances (He et al., 2025c; Chen et al., 2022) as
\[
{\nu }_{i, j} = \frac{\exp \left( {-{d}_{K}^{2}\left( {{\mathbf{q}}_{i},{\mathbf{k}}_{j}}\right) /\sqrt{m}}\right) }{\mathop{\sum }\limits_{{l = 1}}^{n}\exp \left( {-{d}_{K}^{2}\left( {{\mathbf{q}}_{i},{\mathbf{k}}_{l}}\right) /\sqrt{m}}\right) },
\]

Figure 2. HyTE Architecture. A) HyTE-FH Encoder Block, B) HyTE-FH architecture, C) HyTE-H Architecture.
with head dimension \( m \) . This prioritizes geodesic proximity rather than angular similarity. The attended representation is obtained via a Lorentzian weighted midpoint
\[
{\operatorname{Att}}_{\mathcal{L}}{\left( \mathbf{x}\right) }_{i} = \frac{\mathop{\sum }\limits_{{j = 1}}^{n}{\nu }_{i, j}{\lambda }_{j}{\mathbf{v}}_{j}}{\sqrt{-K}{\begin{Vmatrix}\mathop{\sum }\limits_{{j = 1}}^{n}{\nu }_{i, j}{\lambda }_{j}{\mathbf{v}}_{j}\end{Vmatrix}}_{\mathcal{L}}},
\]
where \( {\lambda }_{j} = {v}_{j,0} \) is the Lorentz factor. Unlike Euclidean averaging, this aggregation remains on \( {\mathbb{H}}_{K}^{d} \) and preserves radial structure during contextualization.
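Putting the two pieces together, a minimal sketch of distance-based attention followed by the Lorentzian weighted midpoint might look as follows (our illustration with \( K = -1 \), operating on points already produced by the Q/K/V maps; not the paper's implementation):

```python
import numpy as np

def lorentz_inner_mat(A, B):
    """Pairwise Lorentzian inner products between rows of A and B."""
    return -np.outer(A[:, 0], B[:, 0]) + A[:, 1:] @ B[:, 1:].T

def hyperbolic_attention(Q, Km, V, K=-1.0):
    """Distance-based attention weights plus Lorentzian weighted midpoint.
    Q, Km, V: (n, m+1) arrays of points on the hyperboloid H^m_K."""
    m = Q.shape[1] - 1
    d = np.arccosh(np.clip(K * lorentz_inner_mat(Q, Km), 1.0, None)) / np.sqrt(-K)
    logits = -(d ** 2) / np.sqrt(m)
    nu = np.exp(logits - logits.max(axis=1, keepdims=True))
    nu /= nu.sum(axis=1, keepdims=True)         # attention weights nu_{i,j}
    lam = V[:, 0]                               # Lorentz factors lambda_j
    s = (nu * lam) @ V                          # weighted sums, one per query
    # Lorentz norm of a future-directed timelike vector:
    norms = np.sqrt(s[:, 0] ** 2 - (s[:, 1:] ** 2).sum(axis=1))
    return s / (np.sqrt(-K) * norms)[:, None]   # normalization keeps outputs on H

rng = np.random.default_rng(0)
pts = rng.normal(size=(5, 4))
H = np.concatenate([np.sqrt((pts ** 2).sum(1, keepdims=True) + 1.0), pts], axis=1)
out = hyperbolic_attention(H, H, H)
assert np.allclose(-out[:, 0] ** 2 + (out[:, 1:] ** 2).sum(axis=1), -1.0)
```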
## 4. Method
We now outline our approach to hyperbolic dense retrieval. We begin by introducing the two proposed HyTE architectures, followed by an analysis of why naïve pooling strategies fail in hyperbolic space, and conclude by presenting our geometry-aware aggregation operator.
### 4.1. Architecture
The hyperbolic encoder components described in Section 3 form the building blocks (Figure 2A) of HyTE-FH, our fully hyperbolic transformer (Figure 2B). By operating entirely within hyperbolic geometry, HyTE-FH preserves hierarchical structure throughout token-level contextualization, aggregation, and similarity computation, with semantic abstraction and specificity encoded along radial dimensions. HyTE-H (Figure 2C) instead projects pretrained Euclidean representations into hyperbolic space, which allows hyperbolic geometry to be leveraged with a strong initialization while avoiding the need to train a fully hyperbolic encoder from scratch.
While hyperbolic self-attention enables geometry-consistent contextualization at the token level, dense retrieval requires aggregating variable-length sequences into fixed-dimensional representations. Standard approaches map representations to tangent space, aggregate in Euclidean space, then map back to the manifold (Yang et al., 2024; Desai et al., 2023), but this distorts hierarchical structure encoded in radial depth in both models. In the following subsections, we analyze this failure mode formally and introduce a pooling operator designed to preserve hierarchical information.
### 4.2. Failure of Naïve Hyperbolic Pooling
Naïve pooling strategies that aggregate in Euclidean space (Yang et al., 2024; Desai et al., 2023) systematically contract representations toward the origin. This follows from hyperbolic convexity: for any \( {\left\{ {\mathbf{x}}_{i}\right\} }_{i = 1}^{n} \subset {\mathbb{H}}_{K}^{d} \), the barycenter lies strictly closer to the origin than the maximum-radius point unless all points coincide. Consequently, document-level embeddings lose the radial separation that encodes document specificity through hierarchical depth. To address this failure mode, we first establish notation for projecting ambient vectors onto the hyperboloid and measuring radial depth.
Definition 4.1 (Lorentz Projection). For \( \mathbf{v} \in {\mathbb{R}}^{d + 1} \) with \( \langle \mathbf{v},\mathbf{v}{\rangle }_{L} < 0 \) and \( {v}_{0} > 0 \) , let \( {\Pi }_{K}\left( \mathbf{v}\right) = \frac{\mathbf{v}}{\sqrt{K\langle \mathbf{v},\mathbf{v}{\rangle }_{L}}} \) denote the unique positive rescaling satisfying \( {\left\langle {\Pi }_{K}\left( \mathbf{v}\right) ,{\Pi }_{K}\left( \mathbf{v}\right) \right\rangle }_{L} = 1/K \).
Definition 4.2 (Radial Depth). The radial depth of \( \mathbf{x} \in {\mathbb{H}}_{K}^{d} \) is \( r\left( \mathbf{x}\right) = {x}_{0} \) . Since \( {x}_{0} = \frac{1}{\sqrt{-K}}\cosh \left( {\sqrt{-K}\rho }\right) \) where \( \rho = {d}_{K}\left( {o,\mathbf{x}}\right) \) , ordering by \( {x}_{0} \) is equivalent to ordering by intrinsic hyperbolic distance from the origin.
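These two definitions translate directly into code; the sketch below is our own illustration for \( K = -1 \):

```python
import numpy as np

def lorentz_project(v, K=-1.0):
    """Pi_K(v) = v / sqrt(K <v, v>_L), defined when <v, v>_L < 0 and v0 > 0."""
    inner = -v[0] ** 2 + v[1:] @ v[1:]
    return v / np.sqrt(K * inner)

def radial_depth(x):
    """r(x) = x0, monotone in the hyperbolic distance from the origin."""
    return x[0]

# Rescaling a hyperboloid point off the manifold and projecting recovers it.
x = np.array([np.sqrt(2.0), 1.0, 0.0])          # on H^2 with K = -1
assert np.allclose(lorentz_project(3.0 * x), x)
```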
Semantically, radial depth encodes concept specificity: general concepts should lie near the origin while fine-grained entities should have larger radii. This provides a measurable signature for evaluating whether models learn meaningful hierarchical structure. The simplest aggregation strategy is Euclidean averaging in the ambient space followed by reprojection. However, this approach provably contracts representations toward the origin (Ganea et al., 2018a; Chami et al., 2019), destroying hierarchical structure encoded in radial depth. We formalize this in the following proposition.
Proposition 4.3 (Euclidean Mean Contracts). Let \( {\left\{ {\mathbf{x}}_{i}\right\} }_{i = 1}^{n} \subset {\mathbb{H}}_{K}^{d} \) with \( n \geq 2 \) . Define the Euclidean mean \( \overline{\mathbf{x}} = \frac{1}{n}\mathop{\sum }\limits_{{i = 1}}^{n}{\mathbf{x}}_{i} \) and its projection onto the hyperboloid \( {\mathbf{m}}^{\text{ Euc }} = {\Pi }_{K}\left( \overline{\mathbf{x}}\right) \) . Then, we have
\[
r\left( {\mathbf{m}}^{\text{ Euc }}\right) \leq \frac{1}{n}\mathop{\sum }\limits_{{i = 1}}^{n}r\left( {\mathbf{x}}_{i}\right) ,
\]
with equality if and only if all \( {\mathbf{x}}_{i} \) are identical.
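The contraction is easy to observe numerically. The following is our own illustration (random points, \( K = -1 \)); the projected radius of the Euclidean mean never exceeds the mean of the radii, as Proposition 4.3 states:

```python
import numpy as np

rng = np.random.default_rng(0)

def lift(v):                                   # K = -1 throughout
    return np.concatenate(([np.sqrt(v @ v + 1.0)], v))

pts = np.stack([lift(rng.normal(size=4) * 2.0) for _ in range(8)])
xbar = pts.mean(axis=0)                        # Euclidean mean in ambient space
inner = -xbar[0] ** 2 + xbar[1:] @ xbar[1:]    # <xbar, xbar>_L (negative)
m_euc = xbar / np.sqrt(-inner)                 # Pi_K(xbar) with K = -1
# Proposition 4.3: the projected radius is at most the mean radius.
assert m_euc[0] <= pts[:, 0].mean() + 1e-9
```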

Figure 3. Outward Einstein Midpoint. Size of token shows its contribution towards aggregation.
The proof of this Proposition is available in Appendix A.2. This failure motivates a precise characterization of desirable pooling behavior. We formalize the requirement that pooling should preserve, rather than collapse, radial structure.
Definition 4.4 (Outward Bias). A pooling operator \( \mathcal{P} \) : \( {\left( {\mathbb{H}}_{K}^{d}\right) }^{n} \rightarrow {\mathbb{H}}_{K}^{d} \) is outward-biased if \( r\left( {\mathcal{P}\left( {\left\{ {\mathbf{x}}_{i}\right\} }_{i = 1}^{n}\right) }\right) \geq \bar{r} \) , where \( \bar{r} \) is the weighted mean radius.
A natural alternative is a weighted aggregation scheme in which token contributions are modulated by their relative importance. For example, Zhu et al. (2020) adopt the Einstein midpoint, the canonical barycenter in hyperbolic space (Gulcehre et al., 2019), to emphasize semantically specific tokens during pooling: since points near the boundary receive higher weight via the Lorentz factor \( {\lambda }_{i} = {x}_{i,0} \), more informative content should dominate the aggregate. However, we show this intuition is misleading: the implicit radial weighting is fundamentally insufficient to counteract hyperbolic contraction at the document level.
Proposition 4.5 (Implicit Radial Weighting is Insufficient). The Einstein midpoint weights points by the Lorentz factor \( {\lambda }_{i} = {x}_{i,0} \) , but this weighting grows as \( \exp \left( {\sqrt{-K}\rho }\right) \) while hyperbolic volume grows as \( \exp \left( {\left( {d - 1}\right) \sqrt{-K}\rho }\right) \) . Specifically, for a point \( \mathbf{x} \in {\mathbb{H}}_{K}^{d} \) at hyperbolic distance \( \rho \) from the origin \( o = \left( {1/\sqrt{-K},0,\ldots ,0}\right) \) , we have
\[
{x}_{0} = \frac{1}{\sqrt{-K}}\cosh \left( {\sqrt{-K}\rho }\right) \sim \frac{1}{2\sqrt{-K}}\exp \left( {\sqrt{-K}\rho }\right)
\]
as \( \rho \rightarrow \infty \). Thus, the Lorentz factor weighting undercompensates for the exponential growth of hyperbolic balls at large radii by a factor of \( \exp \left( {\left( {d - 2}\right) \sqrt{-K}\rho }\right) \).
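The asymptotic and the resulting gap are quick to verify numerically (our illustration; the choice \( d = 16 \) is arbitrary):

```python
import numpy as np

K = -1.0
rho = np.array([5.0, 10.0, 20.0])
# Lorentz factor x0 = cosh(sqrt(-K) rho) / sqrt(-K) ...
x0 = np.cosh(np.sqrt(-K) * rho) / np.sqrt(-K)
# ... approaches exp(sqrt(-K) rho) / (2 sqrt(-K)) as rho grows.
approx = np.exp(np.sqrt(-K) * rho) / (2.0 * np.sqrt(-K))
assert np.allclose(x0, approx, rtol=1e-4)
# Ball volume grows like exp((d-1) sqrt(-K) rho), so the weighting lags
# behind by exp((d-2) sqrt(-K) rho): enormous even at moderate d and rho.
d = 16
lag = np.exp((d - 2) * np.sqrt(-K) * rho)
assert lag[0] > 1e30
```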
These results establish that neither Euclidean averaging nor the standard Einstein midpoint satisfies the outward-bias property required for hierarchy-preserving aggregation. This motivates the design of a pooling operator with explicit radial amplification. The proof of this Proposition is available in Appendix A.3.
### 4.3. Outward Einstein Midpoint Pooling
To mitigate radial contraction during aggregation, we introduce the Outward Einstein Midpoint, a geometry-aware pooling operator that explicitly amplifies the contribution of tokens with larger hyperbolic radius. Let \( {\left\{ {\mathbf{x}}_{i}\right\} }_{i = 1}^{n} \subset {\mathbb{H}}_{K}^{d} \) denote a sequence of token embeddings, with optional attention weights \( {w}_{i} \geq 0 \) , and \( {\lambda }_{i} \) denoting the Lorentz factors. We define a radius-dependent weighting function
\[
\phi \left( {x}_{i}\right) = {x}_{i,0}^{p},\;p > 0,
\]
which is monotone in the radial coordinate. The Outward Einstein Midpoint is then given by
\[
{\mathbf{m}}_{K, p}^{\mathrm{{OEM}}} = \frac{\mathop{\sum }\limits_{{i = 1}}^{n}\left( {{w}_{i}\phi \left( {\mathbf{x}}_{i}\right) }\right) {\lambda }_{i}{\mathbf{x}}_{i}}{\mathop{\sum }\limits_{{i = 1}}^{n}\left( {{w}_{i}\phi \left( {\mathbf{x}}_{i}\right) }\right) {\lambda }_{i}},
\]
followed by reprojection onto the hyperboloid \( {\mathbb{H}}_{K}^{d} \) .
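A compact sketch of this operator follows (our illustration for \( K = -1 \); the default \( p = 2 \) is an arbitrary choice for the demo, and \( p = 0 \) recovers the standard Einstein midpoint):

```python
import numpy as np

def outward_einstein_midpoint(X, w=None, p=2.0, K=-1.0):
    """Outward Einstein Midpoint: weight each token by w_i * phi(x_i) with
    phi(x_i) = x_{i,0}^p on top of the Lorentz factor lambda_i = x_{i,0},
    average, then reproject onto H^d_K via Pi_K."""
    w = np.ones(len(X)) if w is None else np.asarray(w, dtype=float)
    lam = X[:, 0]
    coef = w * lam ** p * lam                   # w_i * phi(x_i) * lambda_i
    v = (coef[:, None] * X).sum(axis=0) / coef.sum()
    inner = -v[0] ** 2 + v[1:] @ v[1:]          # timelike: <v, v>_L < 0
    return v / np.sqrt(K * inner)               # reprojection Pi_K

rng = np.random.default_rng(0)
pts = rng.normal(size=(6, 3)) * 2.0
X = np.concatenate([np.sqrt((pts ** 2).sum(1, keepdims=True) + 1.0), pts], axis=1)
m = outward_einstein_midpoint(X)
assert abs(-m[0] ** 2 + m[1:] @ m[1:] + 1.0) < 1e-9   # pooled point stays on H
```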
As shown in Figure 3, by construction, this operator assigns disproportionately higher weight to tokens located farther from the origin, counteracting the contraction inherent to naïve averaging. We now establish theoretical guarantees for the Outward Einstein Midpoint, showing that it systematically improves upon the standard Einstein midpoint in preserving radial structure.
Theorem 4.6 (OEM Pre-Projection Bound). Let \( \widetilde{\mathbf{v}} = \; \mathop{\sum }\limits_{{i = 1}}^{n}{\widetilde{w}}_{i}{\mathbf{x}}_{i} \) where \( {\widetilde{w}}_{i} \propto {w}_{i}{x}_{i,0}^{p + 1} \) are the normalized OEM weights. Then, for \( p \geq 0 \) , we have
\[
{\widetilde{v}}_{0} = \frac{\mathop{\sum }\limits_{{i = 1}}^{n}{w}_{i}{x}_{i,0}^{p + 2}}{\mathop{\sum }\limits_{{i = 1}}^{n}{w}_{i}{x}_{i,0}^{p + 1}} \geq \frac{\mathop{\sum }\limits_{{i = 1}}^{n}{w}_{i}{x}_{i,0}}{\mathop{\sum }\limits_{{i = 1}}^{n}{w}_{i}} = {\bar{r}}_{w}.
\]
We apply Chebyshev's sum inequality to the co-monotonic sequences \( {a}_{i} = {x}_{i,0}^{p + 1} \) and \( {b}_{i} = {x}_{i,0} \) to prove this. Full proof can be found in Appendix A.4. While projection onto \( {\mathbb{H}}_{K}^{d} \) contracts the radial coordinate, the OEM's concentration of weight on high-radius tokens inflates the pre-projection average, counteracting this effect. Theorem 4.6 establishes that OEM increases the pre-projection radial coordinate. The following theorem shows a stronger result: OEM provably dominates the standard Einstein midpoint in preserving radial structure.
Theorem 4.7 (OEM Outward Bias). Let \( {\mathbf{m}}_{K}^{\text{ Ein }} \) denote the standard Einstein midpoint \( \left( {p = 0}\right) \) and \( {\mathbf{m}}_{K, p}^{\text{ OEM }} \) the Outward Einstein Midpoint. Then, for all \( p \geq 1 \) :
\[
r\left( {\mathbf{m}}_{K, p}^{\mathrm{{OEM}}}\right) \geq r\left( {\mathbf{m}}_{K}^{\mathrm{{Ein}}}\right) .
\]
The OEM weights \( {\widetilde{w}}_{i} \propto {w}_{i}{x}_{i,0}^{p + 1} \) concentrate more mass on high-radius points than the Einstein weights \( {w}_{i}{x}_{i,0} \), increasing the pre-projection time component while reducing pairwise dispersion. Full proof in Appendix A.5. Together, these results establish that the Outward Einstein Midpoint provably preserves hierarchical structure during aggregation, in contrast to both Euclidean averaging and the standard Einstein midpoint. We validate this empirically through concept-level hierarchy analysis (Section 5.2), showing that models using OEM pooling maintain monotonically increasing radii across semantic specificity levels, a property absent in Euclidean baselines.
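Both bounds can be checked on a random batch; the sketch below is our own illustration with uniform attention weights \( w_i = 1 \), \( p = 2 \), and \( K = -1 \):

```python
import numpy as np

rng = np.random.default_rng(1)

def lift(v):                                   # K = -1
    return np.concatenate(([np.sqrt(v @ v + 1.0)], v))

def midpoint(X, p):
    """Einstein midpoint for p = 0, Outward Einstein Midpoint for p > 0
    (uniform attention weights)."""
    coef = X[:, 0] ** (p + 1)                  # phi(x_i) * lambda_i = x0^{p+1}
    v = (coef[:, None] * X).sum(axis=0) / coef.sum()
    return v / np.sqrt(v[0] ** 2 - v[1:] @ v[1:])   # Pi_K with K = -1

X = np.stack([lift(rng.normal(size=4) * rng.uniform(0.2, 3.0))
              for _ in range(16)])
# Theorem 4.6: OEM's pre-projection time coordinate dominates the mean radius.
coef = X[:, 0] ** 3                            # p = 2: weights prop. to x0^{p+1}
v_tilde0 = (coef * X[:, 0]).sum() / coef.sum()
assert v_tilde0 >= X[:, 0].mean() - 1e-9
# Theorem 4.7: OEM (p >= 1) is at least as outward as the Einstein midpoint.
assert midpoint(X, p=2.0)[0] >= midpoint(X, p=0.0)[0] - 1e-9
```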
### 4.4. Training Methodology
We train the hyperbolic encoder in three stages, with all objectives operating directly on the Lorentz manifold using geodesic-based similarity.
Stage 1: Hyperbolic Masked Language Modeling. We initialize via masked language modeling (MLM), following the standard BERT objective in hyperbolic space. Contextualization is performed through hyperbolic self-attention, with all intermediate representations on the hyperboloid. Predictions are produced using a Lorentzian multinomial logistic regression (LorentzMLR) (Bdeir et al., 2024) head, which defines class logits via Lorentzian inner products. Only HyTE-FH is trained on MLM, while for HyTE-H we choose a pre-trained Euclidean model as the MLM base to leverage a stronger initialization in low-resource settings.
Stage 2: Unsupervised Contrastive Pre-Training. We fine-tune the resulting MLM model on query-document pairs by minimizing an unsupervised contrastive loss. Similarity is defined as negative geodesic distance \( s\left( {q, d}\right) = - {d}_{K}\left( {q, d}\right) \). The contrastive loss over in-batch negatives is
\[
{\mathcal{L}}_{\text{ ctr }} = - \frac{1}{N}\mathop{\sum }\limits_{{i = 1}}^{N}\log \frac{\exp \left( {s\left( {{\mathbf{q}}_{i},{\mathbf{d}}_{i}}\right) /\tau }\right) }{\mathop{\sum }\limits_{{j = 1}}^{N}\exp \left( {s\left( {{\mathbf{q}}_{i},{\mathbf{d}}_{j}}\right) /\tau }\right) },
\]
where \( \tau > 0 \) is a temperature parameter.
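An in-batch objective of this form, with geodesic similarity, can be sketched as follows (our illustration with \( K = -1 \); the temperature value is arbitrary and this is not the training code):

```python
import numpy as np

def geodesic_dists(Q, D, K=-1.0):
    """Pairwise geodesic distances between rows of Q and D on H^d_K."""
    inner = -np.outer(Q[:, 0], D[:, 0]) + Q[:, 1:] @ D[:, 1:].T
    return np.arccosh(np.clip(K * inner, 1.0, None)) / np.sqrt(-K)

def contrastive_loss(Q, D, tau=0.05, K=-1.0):
    """In-batch contrastive loss with s(q, d) = -d_K(q, d): document i is
    the positive for query i, the rest of the batch are negatives."""
    s = -geodesic_dists(Q, D, K) / tau
    m = s.max(axis=1)                           # stabilized log-sum-exp
    log_z = m + np.log(np.exp(s - m[:, None]).sum(axis=1))
    return float(-(np.diag(s) - log_z).mean())
```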
Stage 3: Supervised Contrastive Learning Fine-tuning. In the final stage of training, we further fine-tune the encoder using supervised contrastive learning on labeled query-document data. Given a query \( {q}_{i} \) , a set of relevant documents \( {\mathcal{D}}_{i}^{ + } \) , and a set of non-relevant documents \( {\mathcal{D}}_{i}^{ - } \) , the supervised contrastive objective encourages the query representation to be closer to all relevant documents than to non-relevant ones
\[
{\mathcal{L}}_{\text{ sup }} = - \frac{1}{N}\mathop{\sum }\limits_{{i = 1}}^{N}\log \frac{\mathop{\sum }\limits_{{{d}^{ + } \in {\mathcal{D}}_{i}^{ + }}}\exp \left( {s\left( {{\mathbf{q}}_{i},{\mathbf{d}}^{ + }}\right) /\tau }\right) }{\mathop{\sum }\limits_{{d \in {\mathcal{D}}_{i}^{ + } \cup {\mathcal{D}}_{i}^{ - }}}\exp \left( {s\left( {{\mathbf{q}}_{i},\mathbf{d}}\right) /\tau }\right) },
\]
|
||||
|
||||
where \( \tau > 0 \) is a temperature parameter. This stage explicitly aligns hyperbolic distances with supervised relevance signals, refining retrieval behavior beyond unsupervised co-occurrence structure.
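
A minimal sketch of this supervised objective for a single query, assuming the similarities \( s(q, d) = -d_K(q, d) \) have already been computed; the mask-based interface and the example scores are hypothetical.

```python
import numpy as np

def sup_contrastive_loss(sim, pos_mask, tau=0.1):
    # sim:      (M,) similarities s(q, d) to all candidate documents
    # pos_mask: (M,) boolean, True for relevant documents D+
    # Returns -log( sum_{d+} exp(s/tau) / sum_{d in D+ u D-} exp(s/tau) ).
    logits = sim / tau
    logits = logits - logits.max()   # numerical stability
    expd = np.exp(logits)
    return -np.log(expd[pos_mask].sum() / expd.sum())

sim = np.array([-0.2, -1.5, -3.0, -4.0])    # hypothetical negative distances
pos = np.array([True, True, False, False])
print(sup_contrastive_loss(sim, pos))
```

Because the numerator sums over a subset of the denominator, the loss is non-negative and shrinks as relevant documents move closer to the query on the manifold.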
Retrieval-Augmented Generation. At inference time, the trained hyperbolic encoder is used to retrieve the top- \( k \) documents \( \mathcal{C} \) for a given query. These retrieved documents are then provided as context to a downstream generative language model. Prompt formatting and generation follow standard practice and are provided in Appendix B. We present runtime and computational complexity in Appendix D.
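
A schematic of this retrieve-then-prompt step might look as follows; the distance function, prompt template, and toy corpus are placeholders (the paper's actual prompt format is given in its Appendix B).

```python
import numpy as np

def retrieve_top_k(query_emb, doc_embs, dist_fn, k=5):
    # Rank documents by distance to the query (smaller = more similar).
    d = dist_fn(query_emb[None, :], doc_embs)
    return np.argsort(d)[:k]

def build_prompt(query, docs, idx):
    # Hypothetical prompt template for the downstream generator.
    context = "\n".join(f"[{i+1}] {docs[j]}" for i, j in enumerate(idx))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

docs = ["Hyperbolic space embeds trees with low distortion.",
        "Euclidean pooling can collapse hierarchy.",
        "Cats are popular pets."]
embs = np.array([[0.1, 0.0], [0.2, 0.1], [5.0, 5.0]])
q = np.array([0.0, 0.0])
idx = retrieve_top_k(q, embs, lambda a, b: np.linalg.norm(a - b, axis=-1), k=2)
print(build_prompt("What embeds trees well?", docs, idx))
```

In the actual system the Euclidean `dist_fn` above would be replaced by the hyperbolic geodesic distance used during training.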
Table 1. Performance on MTEB benchmark. We report mean scores across tasks and task types. HyTE-FH performs best among the three models.
<table><tr><td>Model</td><td>Mean (Task)</td><td>Mean (TaskType)</td></tr><tr><td>EucBERT</td><td>54.11</td><td>51.31</td></tr><tr><td>HyTE-H \( {}^{\text{ Euc }} \)</td><td>54.57</td><td>53.71</td></tr><tr><td>HyTE-FH</td><td>56.41</td><td>53.75</td></tr></table>
## 5. Experiments and Results
### 5.1. Experimental Setup
Datasets. We pre-train our models using publicly available corpora following the data curation and filtering protocols introduced in nomic-embed (Nussbaum et al., 2025). For masked language modeling (MLM), we use the high-quality 2023 Wikipedia dump, which provides broad topical coverage and long-form text suitable for learning general-purpose semantic representations. For contrastive pre-training, we leverage approximately 235 million text pairs curated and filtered as described in (Nussbaum et al., 2025), designed to encourage semantic alignment across paraphrases and related content at scale. Finally, for task-specific fine-tuning, we use the training splits of the BEIR benchmark (Thakur et al., 2021), which comprises a diverse collection of retrieval tasks spanning multiple domains and query styles.
Evaluation Benchmarks. We evaluate our approach on two complementary benchmarks: (1) the Massive Text Embedding Benchmark (MTEB) (Muennighoff et al., 2023) to assess embedding quality across diverse tasks, and (2) RAGBench (Friel et al., 2024b) for end-to-end RAG system evaluation. In MTEB, we particularly use the English part of the benchmark. RAGBench evaluates RAG systems on domain-specific question-answering datasets including CovidQA, Cuad, Emanual, DelucionQA, and ExpertQA.
Baselines. We adopt different baseline strategies for our two models based on their training paradigms. For HyTE-FH, which is pre-trained from scratch, we train a fully Euclidean equivalent called EucBERT using the same architecture and training setup. This controlled comparison isolates the contribution of hyperbolic geometry. We also evaluate HyTE-H \( {}^{\mathrm{{Euc}}} \) , a hybrid hyperbolic model initialized with EucBERT. The three models are evaluated on MTEB and RAGBench. For HyTE-H \( {}^{\text{ bert }} \) , which is fine-tuned with modernbert-base (Warner et al., 2024) as base model, we compare against state-of-the-art embedding models smaller than 500M parameters, including gte-multilingual-base (Zhang et al., 2024), KaLM-embedding-multilingual-mini-v1 (Hu et al., 2025), and embeddinggemma-300m (Vera et al., 2025).
Metrics. For MTEB, we report mean scores across tasks and task types. For RAG evaluation, we measure three key metrics using RAGAS (Es et al., 2024): (1) Faithfulness, which assesses whether generated answers are grounded in the retrieved context; (2) Context Relevance, which measures how relevant the retrieved documents are to the query; and (3) Answer Relevance, which evaluates how well the generated answer addresses the user's question.
Table 2. RAG benchmark results comparing our model variants.
<table><tr><td rowspan="2">Model</td><td colspan="3">Average</td><td colspan="3">CovidQA</td><td colspan="3">Cuad</td><td colspan="3">Emanual</td><td colspan="3">DelucionQA</td><td colspan="3">ExpertQA</td></tr><tr><td>F</td><td>CR</td><td>AR</td><td>F</td><td>CR</td><td>AR</td><td>F</td><td>CR</td><td>AR</td><td>F</td><td>CR</td><td>AR</td><td>F</td><td>CR</td><td>AR</td><td>F</td><td>CR</td><td>AR</td></tr><tr><td>EucBERT</td><td>0.596</td><td>0.798</td><td>0.647</td><td>0.685</td><td>0.863</td><td>0.582</td><td>0.654</td><td>0.644</td><td>0.641</td><td>0.642</td><td>0.646</td><td>0.674</td><td>0.525</td><td>0.968</td><td>0.679</td><td>0.475</td><td>0.872</td><td>0.662</td></tr><tr><td>HyTE-H \( {}^{\text{ Euc }} \)</td><td>0.706</td><td>0.814</td><td>0.739</td><td>0.708</td><td>0.868</td><td>0.668</td><td>0.787</td><td>0.652</td><td>0.710</td><td>0.679</td><td>0.835</td><td>0.814</td><td>0.737</td><td>0.857</td><td>0.773</td><td>0.623</td><td>0.859</td><td>0.728</td></tr><tr><td>HyTE-FH</td><td>0.732</td><td>0.848</td><td>0.765</td><td>0.764</td><td>0.916</td><td>0.694</td><td>0.747</td><td>0.674</td><td>0.752</td><td>0.660</td><td>0.807</td><td>0.704</td><td>0.789</td><td>0.906</td><td>0.861</td><td>0.702</td><td>0.936</td><td>0.814</td></tr></table>
\( \mathrm{F} = \) Faithfulness, \( \mathrm{{CR}} = \) Context Relevance, \( \mathrm{{AR}} = \) Answer Relevance. Best results in bold.
Table 3. RAG benchmark results comparing our hybrid model with state-of-the-art embedding models. HyTE-H demonstrates competitive performance particularly in context relevance and answer relevance.
<table><tr><td rowspan="2">Model</td><td colspan="3">Average</td><td colspan="3">CovidQA</td><td colspan="3">Cuad</td><td colspan="3">Emanual</td><td colspan="3">DelucionQA</td><td colspan="3">ExpertQA</td></tr><tr><td>F</td><td>CR</td><td>AR</td><td>F</td><td>CR</td><td>AR</td><td>F</td><td>CR</td><td>AR</td><td>F</td><td>CR</td><td>AR</td><td>F</td><td>CR</td><td>AR</td><td>F</td><td>CR</td><td>AR</td></tr><tr><td>ModernBert*</td><td>0.617</td><td>0.748</td><td>0.632</td><td>0.656</td><td>0.895</td><td>0.538</td><td>0.632</td><td>0.709</td><td>0.746</td><td>0.567</td><td>0.715</td><td>0.639</td><td>0.655</td><td>0.666</td><td>0.518</td><td>0.575</td><td>0.758</td><td>0.718</td></tr><tr><td>GTE</td><td>0.659</td><td>0.701</td><td>0.650</td><td>0.695</td><td>0.840</td><td>0.538</td><td>0.733</td><td>0.599</td><td>0.779</td><td>0.546</td><td>0.608</td><td>0.686</td><td>0.648</td><td>0.725</td><td>0.549</td><td>0.672</td><td>0.731</td><td>0.698</td></tr><tr><td>Gemma</td><td>0.603</td><td>0.735</td><td>0.684</td><td>0.685</td><td>0.760</td><td>0.497</td><td>0.724</td><td>0.600</td><td>0.778</td><td>0.555</td><td>0.884</td><td>0.687</td><td>0.612</td><td>0.643</td><td>0.705</td><td>0.442</td><td>0.791</td><td>0.755</td></tr><tr><td>KaLM-mini-v1</td><td>0.624</td><td>0.719</td><td>0.591</td><td>0.656</td><td>0.787</td><td>0.528</td><td>0.742</td><td>0.789</td><td>0.716</td><td>0.565</td><td>0.776</td><td>0.616</td><td>0.553</td><td>0.581</td><td>0.573</td><td>0.607</td><td>0.666</td><td>0.522</td></tr><tr><td>HyTE-H \( {}^{\text{ bert }} \)</td><td>0.763</td><td>0.904</td><td>0.832</td><td>0.797</td><td>0.974</td><td>0.755</td><td>0.760</td><td>0.683</td><td>0.804</td><td>0.688</td><td>0.943</td><td>0.899</td><td>0.829</td><td>0.965</td><td>0.871</td><td>0.739</td><td>0.958</td><td>0.834</td></tr></table>
\( \mathrm{F} = \) Faithfulness, \( \mathrm{{CR}} = \) Context Relevance, \( \mathrm{{AR}} = \) Answer Relevance. Best results in bold.
Implementation. We implement all hyperbolic models using HyperCore (He et al., 2025e) and train on NVIDIA H100 GPUs. All three models, HyTE-FH, HyTE-H, and EucBERT, share the same architecture, each containing 149M parameters with 12 transformer layers and 768-dimensional embeddings. For generation and judging, we use Llama-3.1-8B-Instruct (Weerawardhena et al., 2025). For RAG benchmarks, we fix the retrieval context window size to 5 for all models to ensure a controlled comparison; we additionally report ablations with larger context sizes in Appendix Table A3.
### 5.2. Results
MTEB Benchmark. Table 1 reports performance on the MTEB benchmark. HyTE-FH achieves the highest mean score across tasks (56.41), outperforming both EucBERT (54.11) and HyTE-H \( {}^{\mathrm{{Euc}}} \) (54.57). On the task-type mean, HyTE-FH and HyTE-H \( {}^{\mathrm{{Euc}}} \) perform comparably (53.75 and 53.71, respectively), with both surpassing EucBERT (51.31). These results demonstrate that hyperbolic representations not only improve RAG retrieval but also remain competitive on general-purpose embedding benchmarks. We present task-wise results in Table A1.
RAG Benchmark Results. Table 2 presents RAG benchmark results across five datasets. HyTE-FH achieves the best average performance across all three metrics: faithfulness (0.732), context relevance (0.848), and answer relevance (0.765). HyTE-H \( {}^{\mathrm{{Euc}}} \) ranks second overall, with both hyperbolic variants substantially outperforming EucBERT. On individual datasets, HyTE-FH leads on CovidQA, Cuad, DelucionQA, and ExpertQA, while HyTE-H \( {}^{\text{ Euc }} \) achieves the best context and answer relevance on Emanual. These results demonstrate that hyperbolic geometry consistently improves retrieval quality for RAG across diverse domains.
Table 3 reports RAG performance across five datasets. HyTE-H \( {}^{\text{ bert }} \) consistently outperforms strong Euclidean embedding baselines across all metrics, with particularly large gains in context relevance and answer relevance. These improvements indicate that hyperbolic representations are more effective at retrieving structurally relevant evidence, which is critical for downstream generation quality in RAG pipelines. In qualitative case studies shown in Appendix E.1, we observe that Euclidean models frequently fail to retrieve key supporting passages altogether, whereas hyperbolic models recover relevant evidence more reliably, leading to more faithful and contextually grounded answers.
Concept-Level Hierarchy Analysis. A central motivation for hyperbolic embeddings is their capacity to preserve hierarchical relationships (Section 4.2). To understand how models capture document hierarchy, we analyze learned radii (distances from the origin in the Poincaré ball) across five hierarchical levels: from Level 1 (most general, e.g., document-level topics) to Level 5 (most specific, e.g., fine-grained entities). Figure 4 presents these results. The fully hyperbolic model demonstrates clear hierarchical organization, with radii increasing monotonically from Level 1 (2.902) to Level 5 (3.488, +20.2%). This shows the model naturally places general concepts near the origin and specific details toward the boundary, consistent with hyperbolic geometry, where proximity to the origin represents generality. Euclidean models show flat or decreasing distributions: baselines either maintain constant norms across levels or decrease norms by \( {30}\% \) , reflecting an inverted structure. Hybrid models exhibit substantially larger radii from the hyperbolic component; the fine-tuned hybrid increases from 116.9 to 146.7, showing that fine-tuning induces structured hierarchy. We have attached the dataset for this case study in the supplementary material, and the concept-level hierarchy data is available in Appendix C.
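
The radius analysis can be reproduced schematically as follows, assuming one embedding matrix per hierarchy level; the toy data merely illustrates the norm-versus-specificity computation, not the reported numbers.

```python
import numpy as np

def mean_radius(x):
    # Radius = Euclidean norm of each point in the ball, averaged per level.
    return np.linalg.norm(x, axis=-1).mean()

def radial_increase(level_embs):
    # Mean radius per hierarchy level, plus the percent change from the most
    # general level (first) to the most specific level (last).
    radii = [mean_radius(e) for e in level_embs]
    pct = 100.0 * (radii[-1] - radii[0]) / radii[0]
    return radii, pct

rng = np.random.default_rng(1)
# Toy embeddings whose norms grow with specificity (illustration only).
levels = [rng.normal(size=(50, 16)) * (0.05 * (1 + l)) for l in range(5)]
radii, pct = radial_increase(levels)
print([round(r, 3) for r in radii], round(pct, 1))
```

A positive percent change, as reported for HyTE-FH, indicates that specific concepts sit farther from the origin than general ones.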

Figure 4. Empirical validation of hierarchical encoding. Left: Euclidean models show flat or decreasing norms. Middle: HyTE-H demonstrates increasing norms, with fine-tuning enhancing this trend. Right: HyTE-FH achieves a +20.2% total increase from L1 to L5. Bottom: Normalized comparison and percent-change summary highlighting the contrasting behaviors of different geometric approaches.
Ablation Studies. We compare two pooling strategies for aggregating token embeddings into document representations: CLS token pooling and OEM pooling. CLS pooling uses the representation of a special classification token, while OEM pooling performs geometry-aware aggregation directly in hyperbolic space. Table 4 shows that OEM pooling yields higher performance across both mean task and mean task-type metrics on MTEB retrieval tasks, indicating more effective document-level aggregation in the hyperbolic setting. We also show that using geodesic distance in the contrastive objective outperforms the Lorentz inner product (Appendix Table A2), suggesting better alignment of representations on the manifold. Additionally, hyperbolic models maintain strong performance with smaller retrieval budgets, whereas Euclidean baselines require larger context windows to achieve comparable results (Appendix Table A3).
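
For intuition, the standard Einstein midpoint that OEM builds on can be sketched as below, in Klein-model coordinates; the outward, hierarchy-preserving correction that distinguishes OEM is the paper's contribution and is not reproduced here.

```python
import numpy as np

def einstein_midpoint(V):
    # Einstein midpoint of points V (n, d) in the Klein model (||v|| < 1):
    # m = (sum_i gamma_i v_i) / (sum_i gamma_i), with Lorentz factors
    # gamma_i = 1 / sqrt(1 - ||v_i||^2). The result stays inside the ball.
    gamma = 1.0 / np.sqrt(1.0 - np.sum(V * V, axis=-1, keepdims=True))
    return (gamma * V).sum(axis=0) / gamma.sum()

V = np.array([[0.1, 0.0], [0.3, 0.2], [0.0, -0.4]])
m = einstein_midpoint(V)
print(m, np.linalg.norm(m) < 1.0)
```

Unlike naive Euclidean mean pooling, this aggregation respects the manifold, which is what the comparison against CLS pooling in Table 4 probes.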
Table 4. Comparison of pooling strategies on MTEB tasks. OEM pooling leverages hyperbolic geometry for improved performance.
<table><tr><td>Pooling Strategy</td><td>Mean (Task)</td><td>Mean (TaskType)</td></tr><tr><td>CLS Token</td><td>49.33</td><td>48.90</td></tr><tr><td>OEM</td><td>56.41</td><td>53.75</td></tr></table>
## 6. Conclusion
We introduced hyperbolic dense retrieval for RAG, showing that aligning embedding geometry with the hierarchical structure of language improves faithfulness and answer quality. Our approach preserves document-level structure during aggregation through a geometry-aware pooling operator, addressing a key failure mode of Euclidean retrieval pipelines. Across evaluations, we observe consistent gains using models substantially smaller than current state-of-the-art retrievers, highlighting the effectiveness of hyperbolic inductive bias over scale alone. Case studies further show that hyperbolic representations organize documents by specificity through norm-based separation, a property absent in Euclidean embeddings. These findings suggest that embedding geometry is a central design choice for reliable retrieval in RAG systems, with implications for future scalable and multimodal retrieval architectures.
# HyperRAG: Reasoning N-ary Facts over Hypergraphs for Retrieval Augmented Generation

Wen-Sheng Lien (National Yang Ming Chiao Tung University, Hsinchu, Taiwan; vincentlien.ii13@nycu.edu.tw), Yu-Kai Chan (National Yang Ming Chiao Tung University, Hsinchu, Taiwan; ctw33888.ee13@nycu.edu.tw), Hao-Lung Hsiao (National Yang Ming Chiao Tung University, Hsinchu, Taiwan; hlhsiao.cs13@nycu.edu.tw), Bo-Kai Ruan (National Yang Ming Chiao Tung University, Hsinchu, Taiwan; bkruan.ee11@nycu.edu.tw), Meng-Fen Chiang (National Yang Ming Chiao Tung University, Hsinchu, Taiwan; meng.chiang@nycu.edu.tw), Chien-An Chen (E.SUN Bank, Taipei, Taiwan; lukechen-15953@esunbank.com), Yi-Ren Yeh (National Kaohsiung Normal University, Kaohsiung, Taiwan; yryeh@nknu.edu.tw), Hong-Han Shuai (National Yang Ming Chiao Tung University, Hsinchu, Taiwan; hhshuai@nycu.edu.tw)

## Abstract
Graph-based Retrieval-Augmented Generation (RAG) typically operates on binary Knowledge Graphs (KGs). However, decomposing complex facts into binary triples often leads to semantic fragmentation and longer reasoning paths, increasing the risk of retrieval drift and computational overhead. In contrast, \( n \) -ary hypergraphs preserve high-order relational integrity, enabling shallower and more semantically cohesive inference. To exploit this topology, we propose HyperRAG, a framework tailored for \( n \) -ary hypergraphs featuring two complementary retrieval paradigms: (i) HyperRetriever learns structural-semantic reasoning over \( n \) -ary facts to construct query-conditioned relational chains. It enables accurate factual tracking, adaptive high-order traversal, and interpretable multi-hop reasoning under context constraints. (ii) HyperMemory leverages the LLM's parametric memory to guide beam search, dynamically scoring \( n \) -ary facts and entities for query-aware path expansion. Extensive evaluations on WikiTopics (11 closed-domain datasets) and three open-domain QA benchmarks (HotpotQA, MuSiQue, and 2WikiMultiHopQA) validate HyperRAG's effectiveness. HyperRetriever achieves the highest answer accuracy overall, with average gains of 2.95% in MRR and 1.23% in Hits@10 over the strongest baseline. Qualitative analysis further shows that HyperRetriever bridges reasoning gaps through adaptive and interpretable \( n \) -ary chain construction, benefiting both open- and closed-domain QA. Our codes are publicly available at https://github.com/Vincent-Lien/HyperRAG.git.
## CCS Concepts
- Information systems \( \rightarrow \) Retrieval models and ranking; Language models; Question answering.
## Keywords
Hypergraph-based Retrieval-Augmented Generation, N-ary Relational Knowledge Graphs, Multi-hop Question Answering, Memory-Guided Adaptive Retrieval
## ACM Reference Format:
Wen-Sheng Lien, Yu-Kai Chan, Hao-Lung Hsiao, Bo-Kai Ruan, Meng-Fen Chiang, Chien-An Chen, Yi-Ren Yeh, and Hong-Han Shuai. 2026. Hyper-RAG: Reasoning N-ary Facts over Hypergraphs for Retrieval Augmented Generation. In Proceedings of the ACM Web Conference 2026 (WWW '26), April 13-17, 2026, Dubai, United Arab Emirates. ACM, New York, NY, USA, 12 pages. https://doi.org/10.1145/3774904.3792710
## 1 Introduction
Retrieval-Augmented Generation (RAG) has established itself as a critical mechanism for augmenting Large Language Models (LLMs) with non-parametric external knowledge during inference [12, 17, 19, 20]. By dynamically retrieving verifiable information from external corpora without the need for extensive fine-tuning, RAG effectively mitigates intrinsic LLM limitations such as hallucinations and temporal obsolescence. This paradigm has proven particularly transformative for knowledge-intensive tasks, including open-domain question answering (QA), fact verification, and complex information extraction, driving significant innovation across both academia and industry.
Current RAG methodologies broadly fall into three categories: document-based, graph-based, and hybrid approaches. Document-based methods utilize dense vector retrieval to match queries with textual segments, offering scalability but often failing to capture complex structural dependencies [5, 6]. Conversely, graph-based methods leverage Knowledge Graphs (KGs) to explicitly model relationships, enabling multi-hop reasoning over structured data [15, 31]. Hybrid approaches attempt to bridge these paradigms, balancing comprehensiveness with efficiency. However, despite the reasoning potential of graph-based methods, the prevailing reliance on binary KGs presents fundamental topological limitations.

Figure 1: Structural Comparison of (a) Knowledge Graphs and (b) Hypergraphs. For a given question \( q \) , (a) requires 3-hop reasoning over binary facts, while (b) enables single-hop inference via an \( n \) -ary relational fact, yielding a more compact and expressive multi-entity representation.
Traditional graph-based RAG methods predominantly rely on binary knowledge graphs, which suffer from notable limitations when applied to closed-domain question-answering scenarios. Specifically, binary KG approaches encounter two fundamental structural limitations. First, Semantic Fragmentation arises because binary relations limit the expressiveness required to capture complex multi-entity interactions, forcing the decomposition of holistic facts into disjoint triples that fail to represent intricate semantic nuances. Second, this fragmentation leads to Path Explosion, where conventional approaches incur significant computational costs due to the need for deep traversals over the vast binary relation space to reconnect these facts, enabling error propagation and undermining real-world practicality [18, 37]. To address these limitations, recent work advocates hypergraphs for structured retrieval in RAG. Hypergraphs natively encode higher-order ( \( n \) -ary) relations that bind multiple entities and roles, providing a richer semantic substrate than binary graphs [26]. As illustrated in Figure 1, the Path Explosion issue is evident when answering a question grounded on the topic entity "Bruce Seth Green," which requires a 3-hop binary traversal on a standard KG. In contrast, this reduces to a single hop through an \( n \) -ary relation in a hypergraph, yielding a more compact representation. Hypergraphs enable the direct modeling of higher-order relational chains, effectively mitigating Semantic Fragmentation and reducing the reasoning steps required to capture complex dependencies.
Motivated by these insights, we introduce HyperRAG, an innovative retrieval-augmented generation framework designed explicitly for reasoning over \( n \) -ary hypergraphs. HyperRAG integrates two novel adaptive retrieval variants: (i) HyperRetriever, which uses a multilayer perceptron (MLP) to fuse structural and semantic embeddings, constructing query-conditioned relational chains that enable accurate and interpretable evidence aggregation within context and token constraints; and (ii) HyperMemory, which leverages the parametric memory of an LLM to guide beam search, dynamically scoring \( n \) -ary facts and entities for query-adaptive path expansion. By combining higher-order reasoning with shallower yet more expressive chains, HyperRAG locates key evidence without deep multi-hop traversal. Replacing the \( n \) -ary structure with a binary one reduces the average MRR from 36.45% to 34.15% and the average Hits@10 from 40.59% to 36.82% (Table 3), underscoring the contribution of \( n \) -ary reasoning to response quality.
Our key contributions are summarized as follows.
- We propose HyperRAG, a pioneering framework that shifts the graph-RAG paradigm from binary triples to \( n \) -ary hypergraphs, tackling the issues of semantic fragmentation and path explosion.
- We introduce HyperRetriever, a trainable MLP-based retrieval module that fuses structural and semantic signals to extract precise, interpretable evidence chains with low latency.
- We develop HyperMemory, a synergistic retrieval approach that utilizes LLM parametric knowledge to guide symbolic beam search over hypergraphs for complex, query-adaptive reasoning.
- Extensive evaluation across closed-domain and open-domain benchmarks demonstrates that HyperRAG consistently outperforms strong baselines, offering a superior trade-off between retrieval accuracy, reasoning interpretability, and system latency.
## 2 Preliminaries
### 2.1 Background
Definition 2.1 ( \( n \) -ary Relational Knowledge Graph). An \( n \) -ary relational knowledge graph, or hypergraph, represents relational facts involving two or more entities and one or more relations. Formally, following the definition in [43], a hypergraph is defined as \( \mathcal{G} = \left( {\mathcal{E},\mathcal{R},\mathcal{F}}\right) \) , where \( \mathcal{E} \) denotes the set of entities, \( \mathcal{R} \) denotes the set of relations, and \( \mathcal{F} \) the set of \( n \) -ary relational facts (hyperedges). Each \( n \) -ary fact \( {f}^{n} \in \mathcal{F} \) , which consists of two or more entities, is represented as: \( {f}^{n} = {\left\{ {e}_{i}\right\} }_{i = 1}^{n} \) , where \( {\left\{ {e}_{i}\right\} }_{i = 1}^{n} \subseteq \mathcal{E} \) is a set of \( n \) entities with \( n \geq 2 \) .
Unlike binary knowledge graphs, the \( n \) -ary representation inherently captures higher-order relational dependencies among multiple entities. \( n \) -ary relations cannot be faithfully decomposed into combinations of binary relations without losing structural integrity or introducing ambiguity in semantic interpretation [1, 9, 35]. We formalize faithful reduction and show that any straightforward binary scheme violates at least one of: (i) recoverability of the original tuples, (ii) role preservation, or (iii) multiplicity of co-participations. Please refer to Appendix A for more details on the recoverability of role-preserving hypergraph reduction, roles, and multiplicity.
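
A minimal encoding of Definition 2.1 in code, storing each hyperedge as a role-to-entity map so that higher-order facts stay intact; the role names and the toy fact (loosely based on Figure 1's "Bruce Seth Green" example) are illustrative, not drawn from the paper's datasets.

```python
from dataclasses import dataclass, field

@dataclass
class Hypergraph:
    # G = (E, R, F): we keep only the facts F explicitly; entities and
    # relations are implied by the role -> entity maps stored per fact.
    facts: list = field(default_factory=list)

    def add_fact(self, **roles):
        # One n-ary fact binds all its participants in a single record.
        self.facts.append(roles)

    def incident(self, entity):
        # All hyperedges containing the entity, regardless of role.
        return [f for f in self.facts if entity in f.values()]

g = Hypergraph()
g.add_fact(director="Bruce Seth Green", series="V", episode="Breakout")
print(g.incident("Bruce Seth Green"))
```

The key property is that a single `incident` lookup returns the whole multi-entity fact, instead of scattered binary triples that must be rejoined.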
### 2.2 Problem Formulation
Problem (Hypergraph-based RAG). Given a question \( q \) , a hypergraph \( \mathcal{G} \) representing \( n \) -ary relational structures, and a collection of source documents \( \mathcal{D} \) , the goal of hypergraph-based retrieval-augmented generation (RAG) is to generate faithful and contextually grounded answers \( a \) by leveraging salient multi-hop relational chains from \( \mathcal{G} \) and extracting relevant textual evidence from \( \mathcal{D} \) .
Complexity: Native \( n \) -ary Hypergraph Retrieval. Let \( {N}_{e} = \left| \mathcal{E}\right| \) , \( {N}_{f} = \left| \mathcal{F}\right| \) , and \( \bar{n} \) be the average arity. A query binds \( k \) role-typed arguments, \( q = {\left\{ \left( {r}_{i} : {a}_{i}\right) \right\} }_{i = 1}^{k} \) , and asks for the remaining \( n - k \) roles. We maintain sorted posting lists over role incidences, \( \mathcal{P}\left( {r : a}\right) = \; \{ f \in \mathcal{F} : \left( {r : a}\right) \in f\} \) , with length \( d\left( {r : a}\right) \) . To answer \( q \) , the \( n \) -ary based retriever intersects the \( k \) posting lists by hyperedge IDs and reads the missing roles from each surviving hyperedge. Let \( {n}^{ \star } \) be the (max/avg) arity among matches. The running time is given by:

\[
{T}_{\mathrm{{HYP}}}\left( q\right) = O\left( {\mathop{\sum }\limits_{{i = 1}}^{k}d\left( {{r}_{i} : {a}_{i}}\right) + \text{ out }}\right) , \tag{1}
\]

where out is the number of matching facts. In typical schemas, the relation arity is often bounded by a small constant (e.g., triadic, \( n \leq 3 \) ). As a result, for each match the retriever touches exactly one hyperedge record to materialize the unbound roles, yielding per-output overhead \( O\left( 1\right) \) .
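
The posting-list intersection described above can be sketched as follows; the toy index, role names, and facts are illustrative, not the paper's implementation.

```python
def intersect_postings(postings):
    # postings: one sorted list of hyperedge IDs per bound (role: arg) pair.
    # Intersecting the k lists yields the hyperedges matching all bindings;
    # each surviving hyperedge is then read once to materialize unbound roles.
    result = set(postings[0])
    for p in postings[1:]:
        result &= set(p)
    return sorted(result)

facts = {0: {"director": "A", "series": "V"},
         1: {"director": "A", "series": "W"},
         2: {"director": "B", "series": "V"}}
postings = [[0, 1],   # posting list for (director: A)
            [0, 2]]   # posting list for (series: V)
hits = intersect_postings(postings)
print([facts[h] for h in hits])
```

Because every matching fact is a single record, reading the unbound roles costs \( O(1) \) per hit, matching the `out` term in Eq. (1).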
Complexity: Standard Binary KG Retrieval. Suppose each \( n \) - ary fact \( f \) is reified as an event node \( {e}_{f} \) with \( n \) role-typed binary edges (e.g., \( {\operatorname{role}}_{j}\left( {{e}_{f},{a}_{j}}\right) \) ). For each binding \( \left( {{r}_{i} : {a}_{i}}\right) \) , use the list of event IDs posted \( {\mathcal{P}}_{\text{ event }}\left( {{r}_{i} : {a}_{i}}\right) \) and intersect the \( k \) lists to obtain candidate events to mirror the hypergraph intersection. For each surviving \( {e}_{f} \) , follow its remaining \( \left( {n - k}\right) \) role-edges to materialize unbound arguments. Let \( {d}_{\text{ event }}\left( {r : a}\right) = \left| {{\mathcal{P}}_{\text{ event }}\left( {r : a}\right) }\right| \) and let \( {n}^{ \star } \) be the (max/avg) arity over matches. The running time is given by:

\[
{T}_{\mathrm{{BIN}}}\left( q\right) = O\left( {\mathop{\sum }\limits_{{i = 1}}^{k}{d}_{\text{ event }}\left( {{r}_{i} : {a}_{i}}\right) + \text{ out } \cdot \left( {{n}^{ \star } - k}\right) }\right) . \tag{2}
\]

Under a schema-bounded arity, the per-result overhead is up to \( \bar{n} \) role lookups to materialize the remaining arguments. In contrast, the hypergraph returns them from a single record.
Complexity Gap. In a native hypergraph, all arguments of an \( n \) -ary fact co-reside in a single hyperedge record; thus materializing a hit is one read, i.e., \( O\left( 1\right) \) per result under bounded arity. In contrast, in an event-reified binary KG, the fact is split across \( n \) role-typed edges, reachable only via the intermediate event node \( {e}_{f} \) . As a result, materializing requires up to \( \left( {n - k}\right) \) pointer chases, yielding the out \( \cdot \bar{n} \) term, and usually incurs extra indirections and cache misses.
## 3 Methodology
We propose HyperRAG, a novel framework that enhances answer fidelity by integrating reasoning over condensed \( n \) -ary relational facts with textual evidence. As depicted in Figure 2, HyperRAG features two retrieval paradigms: (i) HyperRetriever, which performs adaptive structural-semantic traversal to build interpretable, query-conditioned relational chains; (ii) HyperMemory, which utilizes the parametric knowledge of the LLM to guide symbolic beam search. Both variants ground the generation process in hypergraph structures, ensuring faithful and accurate multi-hop reasoning.

Figure 2: The overall framework of HyperRAG.
### 3.1 HyperRetriever: Relational Chains Learning
The motivation behind learning to extract fine-grained \( n \) -ary relational chains over hypergraph structures stems from two key challenges: (i) the well-documented tendency of LLMs to hallucinate factual content and (ii) the vast combinatorial search space of hypergraphs under limited token and context budgets [25]. To mitigate these challenges, we introduce a lightweight yet expressive retriever that integrates structural and semantic cues to rank salient \( n \) -ary facts aligned with query intent.
3.1.1 Topic Entity Extraction. The purpose of obtaining the topic entities is to ground the query semantics onto the hypergraph \( \mathcal{G} \) . Formally, given a query \( q \) , we prompt an LLM with \( {p}_{\text{ topic }} \) to identify the set of topic entities that appear in \( q \) as follows:

\[
{\mathcal{E}}_{q} = \operatorname{LLM}\left( {{p}_{\text{ topic }}, q}\right)
\]

where \( {\mathcal{E}}_{q} \) denotes the set of extracted entities in the query \( q \) .
3.1.2 Hyperedge Retrieval and Triple Formation. For each extracted topic entity \( {e}_{s} \in {\mathcal{E}}_{q} \) , we retrieve its incident hyperedges from \( \mathcal{F} \) , formally defined as follows:

\[
{\mathcal{F}}_{{e}_{s}} = \left\{ {{f}^{n} \in \mathcal{F} : {e}_{s} \in {f}^{n}}\right\} .
\]

Each hyperedge \( {f}^{n} \in {\mathcal{F}}_{{e}_{s}} \) defines an \( n \) -ary relation over a subset of \( n \) entities. To enable pairwise reasoning, we derive a set of pseudo-binary triples by enumerating ordered entity pairs within each hyperedge for query \( q \) as follows:

\[
{\mathcal{T}}_{q} = \left\{ {\left( {{e}_{h},{f}^{n},{e}_{t}}\right) \mid {f}^{n} \in {\mathcal{F}}_{{e}_{s}},{e}_{h} \in {f}^{n},{e}_{t} \in {f}^{n}}\right\} , \tag{3}
\]

where each pseudo-binary triple \( \left( {{e}_{h},{f}^{n},{e}_{t}}\right) \) consists of a head entity, the originating hyperedge, and a tail entity.
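
Eq. (3) amounts to enumerating ordered entity pairs within each incident hyperedge; the sketch below assumes distinct head and tail entities (which the set notation leaves implicit) and uses toy identifiers.

```python
from itertools import permutations

def pseudo_binary_triples(incident_facts):
    # incident_facts: list of (fact_id, entity_list) for hyperedges in F_{e_s}.
    # Each hyperedge of arity n yields n * (n - 1) ordered (e_h, f^n, e_t)
    # pseudo-binary triples, per Eq. (3).
    triples = []
    for fid, entities in incident_facts:
        triples += [(h, fid, t) for h, t in permutations(entities, 2)]
    return triples

facts = [("f1", ["Bruce Seth Green", "V", "Breakout"])]
print(pseudo_binary_triples(facts))
```

Keeping the hyperedge identifier in the middle slot lets later stages trace every derived pair back to its originating \( n \) -ary fact.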
3.1.3 Structural Proximity Encoding. To capture the structural proximity between entities in the hypergraph, we adapt the directional distance encoding (DDE) mechanism from SubGraphRAG [21], extending it from binary relations to \( n \) -ary hyperedges. Formally, for each candidate triple \( \left( {{e}_{h},{f}^{n},{e}_{t}}\right) \in {\mathcal{T}}_{q} \) , we compute its directional encoding in the following steps:
- One-Hot Initialization: For each entity \( e \) , we initialize a one-hot indicator marking head entities:

\[
{s}_{e}^{\left( 0\right) } = \left\{ \begin{array}{ll} 1, & \text{ if }\exists \left( {{e}_{h},{f}^{n},{e}_{t}}\right) \in {\mathcal{T}}_{q}\text{ such that }e = {e}_{h}, \\ 0, & \text{ otherwise. } \end{array}\right. \tag{4}
\]

- Bi-directional Feature Propagation: For each layer \( l = 0,\ldots , L - 1 \) , we propagate features over the set of derived triples \( {\mathcal{T}}_{q} \) . Forward propagation simulates how the head entity \( {e}_{h} \) reaches out to the tail entity \( {e}_{t} \) as follows:

\[
{s}_{e}^{\left( l + 1\right) } = \frac{1}{\left| \left\{ {e}^{\prime } \mid \left( {e}^{\prime },\cdot , e\right) \in {\mathcal{T}}_{q}\right\} \right| }\mathop{\sum }\limits_{{\left( {{e}^{\prime },\cdot , e}\right) \in {\mathcal{T}}_{q}}}{s}_{{e}^{\prime }}^{\left( l\right) }. \tag{5}
\]

In contrast, backward propagation updates head encodings based on tail-to-head influence:

\[
{s}_{e}^{\left( r, l + 1\right) } = \frac{1}{\left| \left\{ {e}^{\prime } \mid \left( e,\cdot ,{e}^{\prime }\right) \in {\mathcal{T}}_{q}\right\} \right| }\mathop{\sum }\limits_{{\left( {e,\cdot ,{e}^{\prime }}\right) \in {\mathcal{T}}_{q}}}{s}_{{e}^{\prime }}^{\left( r, l\right) }. \tag{6}
\]

- Bi-directional Encoding: After \( L \) rounds of propagation, we concatenate the forward and backward encodings to obtain the final vector for each entity \( e \) as follows:

\[
{s}_{e} = \left\lbrack {{s}_{e}^{\left( 0\right) } \parallel {s}_{e}^{\left( 1\right) } \parallel \cdots \parallel {s}_{e}^{\left( L\right) } \parallel {s}_{e}^{\left( r,1\right) } \parallel \cdots \parallel {s}_{e}^{\left( r, L\right) }}\right\rbrack , \tag{7}
\]

where \( \parallel \) denotes vector concatenation. Note that the backward propagation starts from \( l = 1 \) , as \( l = 0 \) is shared in both directions.
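
The DDE computation in Eqs. (4)-(7) can be sketched as follows, using a scalar encoding per entity for readability (the paper's one-hot vectors generalize this component-wise):

```python
def dde_encodings(triples, entities, L=2):
    """Directional distance encoding (Eqs. 4-7) over pseudo-binary triples.

    Scalar per-entity encodings stand in for the paper's one-hot vectors;
    the propagation rule is identical component-wise."""
    heads = {h for h, _, _ in triples}
    s0 = {e: 1.0 if e in heads else 0.0 for e in entities}   # Eq. (4)

    preds = {e: [h for h, _, t in triples if t == e] for e in entities}
    succs = {e: [t for h, _, t in triples if h == e] for e in entities}

    def propagate(init, neighbors):
        layers = [init]
        for _ in range(L):                                   # Eqs. (5)/(6)
            prev = layers[-1]
            layers.append({
                e: sum(prev[n] for n in neighbors[e]) / len(neighbors[e])
                   if neighbors[e] else 0.0
                for e in entities
            })
        return layers

    fwd = propagate(s0, preds)        # s^(0), ..., s^(L)
    bwd = propagate(s0, succs)[1:]    # s^(r,1), ..., s^(r,L); l = 0 is shared
    return {e: [layer[e] for layer in fwd + bwd] for e in entities}  # Eq. (7)
```

Entities with no predecessors (or successors) receive a zero encoding at that layer, matching the convention that an empty neighborhood contributes no signal.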
- Triple Encoding: For each candidate triple \( \left( {{e}_{h},{f}^{n},{e}_{t}}\right) \) , we define its structural proximity encoding as follows:

\[
\delta \left( {{e}_{h},{f}^{n},{e}_{t}}\right) = \left\lbrack {{s}_{{e}_{h}}\parallel {s}_{{e}_{t}}}\right\rbrack \tag{8}
\]

which is passed to a lightweight parametric neural function to compute the plausibility score for each candidate triple \( \left( {{e}_{h},{f}^{n},{e}_{t}}\right) \) given query \( q \) .
3.1.4 Contrastive Plausibility Scoring. Similarity-based retrieval over the hypergraph often introduces noisy or irrelevant triples. To reduce the search space, we train a lightweight MLP classifier \( {f}_{\theta } \) to estimate the plausibility of each candidate triple and prune uninformative ones.
To this end, the training set is prepared with positive and negative samples. Let \( {P}_{q}^{ * } \) denote the shortest path of triples connecting the topic entity to a correct answer in the hypergraph \( \mathcal{G} \) . The positive samples \( {\mathcal{T}}_{i}^{ + } \) at hop \( i \) consist of triples in \( {P}_{q}^{ * } \) , denoted as \( {\mathcal{T}}_{i}^{ + } = \left\{ \left( {{e}_{h, i},{f}_{i}^{n},{e}_{t, i}}\right) \right\} \) . Negative samples \( {\mathcal{T}}_{i}^{ - } \) consist of all other triples incident to the head entity \( {e}_{h, i} \) at hop \( i \) that are not in \( {P}_{q}^{ * } \) . At each exploration step, only positive triples are expanded, while negative ones are excluded. Each triple \( \left( {{e}_{h},{f}^{n},{e}_{t}}\right) \) is encoded into a feature vector by concatenating its contextual and structural encodings:

\[
\mathbf{x} = \left\lbrack {\varphi \left( q\right) \parallel \varphi \left( {e}_{h}\right) \parallel \varphi \left( {f}^{n}\right) \parallel \varphi \left( {e}_{t}\right) \parallel \delta \left( {{e}_{h},{f}^{n},{e}_{t}}\right) }\right\rbrack , \tag{9}
\]

where \( \varphi \) denotes an embedding model that maps the textual content of the query \( q \) , the head entity \( {e}_{h} \) , the hyperedge \( {f}^{n} \) , and the tail entity \( {e}_{t} \) of the candidate pseudo-binary triple \( \left( {{e}_{h},{f}^{n},{e}_{t}}\right) \) into vector representations. The classifier outputs a plausibility score \( {f}_{\theta }\left( \mathbf{x}\right) \in \left\lbrack {0,1}\right\rbrack \) , trained using binary cross-entropy as follows:

\[
\mathcal{L} = - \frac{1}{N}\mathop{\sum }\limits_{{i = 1}}^{N}\left\lbrack {{y}_{i}\log \left( {{f}_{\theta }\left( {\mathbf{x}}_{i}\right) }\right) + \left( {1 - {y}_{i}}\right) \log \left( {1 - {f}_{\theta }\left( {\mathbf{x}}_{i}\right) }\right) }\right\rbrack . \tag{10}
\]

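
A minimal training sketch under Eq. (10); a single logistic layer stands in for the paper's (unspecified) lightweight MLP, and the feature vectors \( \mathbf{x} \) are assumed to be precomputed as in Eq. (9):

```python
import numpy as np

rng = np.random.default_rng(0)

def bce_loss(p, y):
    """Binary cross-entropy of Eq. (10)."""
    eps = 1e-9
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

class PlausibilityScorer:
    """Logistic stand-in for the lightweight MLP f_theta."""
    def __init__(self, dim):
        self.w = rng.normal(scale=0.01, size=dim)
        self.b = 0.0

    def __call__(self, X):
        return 1.0 / (1.0 + np.exp(-(X @ self.w + self.b)))

    def fit(self, X, y, lr=0.5, steps=200):
        for _ in range(steps):
            p = self(X)
            grad = p - y                        # dL/dlogit for sigmoid + BCE
            self.w -= lr * (X.T @ grad) / len(y)
            self.b -= lr * grad.mean()
        return bce_loss(self(X), y)
```

In the paper's pipeline, positives come from shortest-path triples and negatives from the remaining incident triples at each hop; here labeled feature vectors are assumed to be given.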
3.1.5 Adaptive Search. At inference time, we initiate the retrieval process with the initial triples of the topic entities and compute their plausibility scores using the trained MLP \( {f}_{\theta }\left( \mathbf{x}\right) \) . Triples exceeding a plausibility threshold \( \tau \) are retained, and their tail entities are used as frontier entities in the next hop. This expansion-filtering cycle continues until no new triples satisfy the threshold. However, using a fixed threshold \( \tau \) can be problematic: it may be too strict in sparse hypergraphs, limiting retrieval, or too lenient in dense hypergraphs, leading to an overload of irrelevant triples. To mitigate this, we implement an adaptive thresholding strategy. We initialize with \( {\tau }_{0} = {0.5} \) , allow a maximum of \( {N}_{\max } = 5 \) threshold reductions, and define \( M = {50} \) as the minimum acceptable number of hyperedges per hop. At hop \( i \) , we retrieve the set of triples \( {\mathcal{T}}_{q, \geq {\tau }_{j}} = \left\{ {\left( {{e}_{h},{f}^{n},{e}_{t}}\right) \mid {f}_{\theta }\left( \mathbf{x}\right) \geq {\tau }_{j}}\right\} \) under the current threshold \( {\tau }_{j} \) . If \( \left| {\mathcal{T}}_{q, \geq {\tau }_{j}}\right| < M \) , we iteratively reduce the threshold as follows:

\[
{\tau }_{j + 1} = {\tau }_{j} - c,\;j = 0,\ldots ,{N}_{\max } - 1, \tag{11}
\]

where \( c = {0.1} \) is the decay factor. This process continues until \( \left| {\mathcal{T}}_{q, \geq {\tau }_{j}}\right| \geq M \) or the reduction limit is reached. To further adapt to structural variations in the hypergraph, we incorporate a density-aware thresholding policy. Given the density of the hypergraph \( \Delta \left( \mathcal{G}\right) \) and the predefined lower and upper bounds \( {\Delta }_{\text{ lo }} \) and \( {\Delta }_{\text{ up }} \) , we classify the hypergraph and adjust \( {\tau }_{0} \) accordingly to balance coverage and precision as follows:
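
The threshold-relaxation loop can be sketched as follows; `score_fn` stands in for \( {f}_{\theta } \), and the density-aware adjustment of \( {\tau }_{0} \) in Eq. (12) is omitted for brevity:

```python
def adaptive_retrieve(candidates, score_fn, tau0=0.5, c=0.1, n_max=5, m_min=50):
    """Keep triples with f_theta(x) >= tau; relax tau by c (Eq. 11) until at
    least m_min triples survive or the reduction budget n_max is exhausted."""
    tau = tau0
    kept = [t for t in candidates if score_fn(t) >= tau]
    reductions = 0
    while len(kept) < m_min and reductions < n_max:
        tau -= c                      # Eq. (11): tau_{j+1} = tau_j - c
        reductions += 1
        kept = [t for t in candidates if score_fn(t) >= tau]
    return kept, tau
```

The defaults mirror the paper's settings ( \( {\tau }_{0} = 0.5 \), \( c = 0.1 \), \( {N}_{\max } = 5 \), \( M = 50 \) ); in practice the loop runs once per hop over that hop's candidate triples.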

\[
{\mathcal{M}}_{\mathcal{G}} = \left\{ \begin{array}{ll} {\mathcal{M}}_{\text{ low }}, & \Delta \left( \mathcal{G}\right) \leq {\Delta }_{\mathrm{{lo}}}, \\ {\mathcal{M}}_{\text{ mid }}, & {\Delta }_{\mathrm{{lo}}} < \Delta \left( \mathcal{G}\right) \leq {\Delta }_{\mathrm{{up}}}, \\ {\mathcal{M}}_{\text{ high }}, & \Delta \left( \mathcal{G}\right) > {\Delta }_{\mathrm{{up}}} \end{array}\right. \tag{12}
\]

After convergence or exhaustion of threshold reduction attempts, the retrieval strategy is adjusted based on the assigned graph density category. For low-density graphs \( \left( {\mathcal{M}}_{\text{ low }}\right) \) , the retriever selects from previously discarded triples those that satisfy the final plausibility threshold. For medium and high-density graphs \( \left( {\mathcal{M}}_{\text{ mid }}\right. \) and \( \left. {\mathcal{M}}_{\text{ high }}\right) \) , the strategy additionally expands from the tail entities of these newly accepted triples to increase the depth of reasoning. This density-aware adjustment prevents over-retrieval in sparse graphs while enabling deeper and broader exploration in dense graphs. To further control expansion in high-density settings, where the number of candidate hyperedges may become excessive, we impose an upper bound on the number of retrieved triples per hop. This constraint effectively limits entity expansion, accelerates retrieval, and reduces the inclusion of low-utility information.
3.1.6 Budget-aware Contextualized Generator. After completion of the retrieval process, we organize the selected elements into a structured input for the generator. Following the context layout protocol of HyperGraphRAG [25], we include (i) entities and their associated descriptions, (ii) hyperedges along with their participating entities, and (iii) supporting source text chunks linked to each entity or hyperedge. Due to input length constraints, we prioritize components based on their utility. As shown in the ablation study of HyperGraphRAG, n-ary relational facts (i.e., hyperedges) contribute the most to reasoning performance, followed by entities and then source text. We therefore allocate the token budget accordingly: 50% for hyperedges, 30% for entities, and 20% for source chunks. To further maximize informativeness, we order hyperedges and entities according to their plausibility scores \( {f}_{\theta }\left( \cdot \right) \) , with graph connectivity as a secondary criterion. The selected components are then filled sequentially in priority order (hyperedges, entities, then source chunks), and any unused budget is passed to the next category. The resulting context Context, together with the original query \( q \) , is then passed to the LLM to generate the final answer Answer as:
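
The packing policy above can be sketched as a greedy fill with budget rollover; the token counter and the pre-scored orderings are assumptions of this sketch:

```python
def fill_context(hyperedges, entities, chunks, budget, count_tokens=len):
    """Greedy budget-aware packing: 50% hyperedges, 30% entities, 20% chunks,
    with unused budget rolled over to the next category. Inputs are assumed
    pre-sorted by plausibility score (connectivity as tie-breaker)."""
    shares = [(hyperedges, 0.5), (entities, 0.3), (chunks, 0.2)]
    context, carry = [], 0
    for items, frac in shares:
        quota = int(budget * frac) + carry
        used = 0
        for item in items:
            cost = count_tokens(item)
            if used + cost > quota:
                break  # stop at the first item that does not fit
            context.append(item)
            used += cost
        carry = quota - used  # roll unused budget into the next category
    return context
```

With a real tokenizer, `count_tokens` would count subword tokens; `len` counts characters here purely for illustration.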

\[
\text{Answer} \mathrel{\text{ := }} \operatorname{LLM}\left( {\text{ Context }, q}\right) . \tag{13}
\]

### 3.2 HyperMemory: Relational Chain Extraction
To improve interpretability and context awareness in path retrieval, we replace naive top- \( k \) heuristics with LLM-guided scoring that leverages the model's parametric memory to assess the salience of hyperedges and entities. This enables retrieval to be guided by contextual priors and query intent, facilitating more targeted and meaningful relational exploration.
3.2.1 Memory-Guided Beam Retriever. Specifically, we design a beam search with width \( w = 3 \) and depth \( d = 3 \) , where \( w \) denotes the number of top-ranked paths retained at each iteration, and \( d \) specifies the maximum number of expansion steps. Following the process of the Learnable Relational Chain Retriever, we begin by identifying the set of topic entities \( {\mathcal{E}}_{q} \) from the input query \( q \) using an LLM-based entity extractor. For each topic entity \( {e}_{s} \in {\mathcal{E}}_{q} \) , we retrieve its incident hyperedge set \( {\mathcal{F}}_{{e}_{s}} \) . Each hyperedge \( {f}^{n} \in {\mathcal{F}}_{{e}_{s}} \) is scored for relevance to both \( {e}_{s} \) and \( q \) using a prompt \( {p}_{\text{ edge }} \) :

\[
{\mathcal{S}}_{\mathcal{F}}\left( {{f}^{n} \mid {e}_{s}, q}\right) \sim \operatorname{LLM}\left( {{p}_{\text{ edge }},{e}_{s},{f}^{n}, q}\right) . \tag{14}
\]

We retain the top- \( w \) hyperedges, denoted \( {\mathcal{F}}_{{e}_{s}}^{ + } \) , based on the score \( {\mathcal{S}}_{\mathcal{F}}\left( \cdot \right) \) . Next, for each \( {f}^{n} \in {\mathcal{F}}_{{e}_{s}}^{ + } \) , we identify unvisited tail entities \( {e}_{t} \) and score their relevance using a second prompt \( {p}_{\text{ entity }} \) :

\[
{\mathcal{S}}_{\mathcal{E}}\left( {{e}_{t} \mid {f}^{n}, q}\right) \sim \operatorname{LLM}\left( {{p}_{\text{ entity }},{f}^{n},{e}_{t}, q}\right) . \tag{15}
\]

Next, each resulting candidate triple \( \left( {{e}_{s},{f}^{n},{e}_{t}}\right) \) receives a weighted composite score as follows:

\[
\mathcal{S}\left( {{e}_{s},{f}^{n},{e}_{t}}\right) = {\mathcal{S}}_{\mathcal{F}}\left( {{f}^{n} \mid {e}_{s}, q}\right) \cdot {\mathcal{S}}_{\mathcal{E}}\left( {{e}_{t} \mid {f}^{n}, q}\right) . \tag{16}
\]

From the current set of candidate triples, we retain the top- \( w \) based on the final triple scorer \( \mathcal{S}\left( \cdot \right) \) . The tail entities of these selected paths define the next expansion frontier. At each depth \( i \) , we evaluate whether the accumulated evidence suffices to answer the query. All retrieved triples are assembled into a contextualized component \( {C}_{i} \) , which is passed to the LLM for an evidence sufficiency check:

\[
\operatorname{LLM}\left( {{p}_{\text{ ctx }},{C}_{i}, q}\right) \rightarrow \{ \text{ yes, no }\} ,\text{ Reason. } \tag{17}
\]

If the result is yes, we terminate the search and proceed to generation. Otherwise, if \( i < d \) , the search continues to the next iteration.
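
Putting Eqs. (14)-(17) together, the beam loop can be sketched as follows; the three scoring callbacks (`score_edge`, `score_entity`, `sufficient`) stand in for the LLM prompts \( {p}_{\text{ edge }} \), \( {p}_{\text{ entity }} \), and \( {p}_{\text{ ctx }} \):

```python
def beam_retrieve(query, topic_entities, incident, score_edge, score_entity,
                  sufficient, w=3, d=3):
    """Memory-guided beam retriever: expand top-w hyperedges and tails per
    step, stop when the evidence is judged sufficient or depth d is hit."""
    frontier = set(topic_entities)
    visited = set(topic_entities)
    evidence = []
    for _ in range(d):
        candidates = []
        for e_s in frontier:
            # Eq. (14): keep the top-w hyperedges incident to e_s
            edges = sorted(incident(e_s), key=lambda f: -score_edge(f, e_s, query))[:w]
            for f in edges:
                for e_t in f - visited:  # unvisited tails only, Eq. (15)
                    s = score_edge(f, e_s, query) * score_entity(e_t, f, query)  # Eq. (16)
                    candidates.append((s, (e_s, f, e_t)))
        top = [t for _, t in sorted(candidates, key=lambda x: -x[0])[:w]]
        evidence.extend(top)
        visited |= {e_t for _, _, e_t in top}
        if sufficient(evidence, query):  # Eq. (17): LLM sufficiency check
            break
        frontier = {e_t for _, _, e_t in top}
        if not frontier:
            break
    return evidence
```

In the paper the scores come from LLM prompts rather than deterministic functions; the control flow is the part this sketch is meant to pin down.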
3.2.2 Contextualized Generator. The retrieved entities and hyperedges are organized into a fixed-format context. This contextualized evidence Context, combined with the original query \( q \) , is then passed to the LLM, as in Eq. (13), to generate the final Answer.
## 4 Experiments
We quantitatively evaluate the effectiveness and efficiency of HyperRetriever against RAG baselines in both in-domain and cross-domain settings. Ablation studies highlight the benefits of adaptive expansion and \( n \) -ary relational chain learning, complemented by qualitative analyses that illustrate the precision and efficiency of the adaptive retrieval process.
### 4.1 Experimental Setup
4.1.1 Datasets. We conduct experiments under both open-domain and closed-domain multi-hop question answering (QA) settings. For in-domain evaluation, we use three widely adopted benchmark datasets: HotpotQA [42], MuSiQue [38], and 2WikiMultiHopQA [16]. To evaluate cross-domain generalization, we adopt the WikiTopics-CLQA dataset [11], which tests zero-shot inductive reasoning over unseen entities and relations at inference time. Comprehensive dataset statistics are summarized in Appendix B.2.
4.1.2 Evaluation Metrics. We employ four standard metrics to assess performance, aligning with established protocols for each benchmark type. For open-domain QA datasets, where the objective is precise answer generation, we report Exact Match (EM) and F1 scores. For WikiTopics-CLQA, which involves ranking correct entities from a candidate list, we utilize Mean Reciprocal Rank (MRR) and Hits@k to evaluate retrieval fidelity. All metrics are reported as percentages (%), with higher values indicating better performance.
4.1.3 Baselines. To evaluate the effectiveness of our approach, we compare HyperRAG against RAG baselines with varying retrieval granularities, enabling a systematic analysis of how evidence structure affects retrieval effectiveness and answer generation in both open- and closed-domain settings. Specifically, we include: RAPTOR [33], which retrieves tree-structured nodes; HippoRAG [14], which retrieves free-text chunks; ToG [37], which retrieves relational subgraphs; and HyperGraphRAG [25], which retrieves a heterogeneous mixture of entities, relations, and textual spans.
4.1.4 Implementation Details. All baselines and our proposed methods utilize gpt-4o-mini as the core model for both graph construction and question answering. For HyperRetriever, we additionally employ the pretrained text encoder gte-large-en-v1.5 to produce dense embeddings for entities, relations, and queries. With 434M parameters, this GTE-family model achieves strong performance on English retrieval benchmarks, such as MTEB, and offers an efficient balance between inference speed and embedding quality, making it well-suited for semantic subgraph retrieval. All experiments were implemented in Python 3.11.13 with CUDA 12.8 and conducted on a single NVIDIA RTX 3090 (24 GB). Peak GPU memory usage remained within 24 GB due to dynamic allocation.
### 4.2 Open-domain Answering Performance
4.2.1 Setup. For HyperRetriever, a lightweight MLP \( {f}_{\theta } \) scores the plausibility of candidate hyperedges, enabling aggressive pruning that reduces traversal complexity without compromising reasoning quality. For HyperMemory, we set beam width \( w = 3 \) and depth \( d = 3 \) to balance retrieval coverage against computational cost. Comprehensive prompt definitions for edge scoring \( \left( {p}_{\text{ edge }}\right) \) , entity ranking \( \left( {p}_{\text{ entity }}\right) \) , context evaluation \( \left( {p}_{\text{ ctx }}\right) \) , and generation are provided in the Appendix.
<table><tr><td rowspan="2">Topic</td><td colspan="2">RAPTOR</td><td colspan="2">HippoRAG</td><td colspan="2">ToG</td><td colspan="2">HyperGraphRAG</td><td colspan="2">HyperRetriever</td><td colspan="2">HyperMemory</td><td colspan="2">Rel. Gain (%)</td></tr><tr><td>MRR</td><td>Hits@10</td><td>MRR</td><td>Hits@10</td><td>MRR</td><td>Hits@10</td><td>MRR</td><td>Hits@10</td><td>MRR</td><td>Hits@10</td><td>MRR</td><td>Hits@10</td><td>MRR</td><td>Hits@10</td></tr><tr><td>ART</td><td>3.44</td><td>4.13</td><td>8.42</td><td>9.77</td><td>2.99</td><td>3.20</td><td>17.18</td><td>21.68</td><td>19.31</td><td>24.31</td><td>15.63</td><td>19.17</td><td>12.40</td><td>12.13</td></tr><tr><td>AWARD</td><td>20.57</td><td>25.13</td><td>32.80</td><td>38.65</td><td>8.70</td><td>9.35</td><td>51.64</td><td>63.43</td><td>52.66</td><td>65.28</td><td>47.34</td><td>56.98</td><td>1.98</td><td>2.93</td></tr><tr><td>EDU</td><td>4.94</td><td>5.90</td><td>23.82</td><td>26.37</td><td>9.09</td><td>9.49</td><td>43.44</td><td>50.05</td><td>44.79</td><td>51.63</td><td>41.68</td><td>46.95</td><td>3.11</td><td>3.16</td></tr><tr><td>HEALTH</td><td>18.85</td><td>22.04</td><td>25.72</td><td>29.59</td><td>7.14</td><td>7.95</td><td>31.46</td><td>37.94</td><td>32.68</td><td>39.26</td><td>27.48</td><td>33.13</td><td>3.88</td><td>3.48</td></tr><tr><td>INFRA</td><td>10.95</td><td>12.79</td><td>23.88</td><td>27.11</td><td>9.87</td><td>10.67</td><td>37.18</td><td>44.82</td><td>38.92</td><td>45.77</td><td>35.77</td><td>41.69</td><td>4.68</td><td>2.12</td></tr><tr><td>LOC</td><td>16.55</td><td>18.68</td><td>19.88</td><td>23.08</td><td>3.45</td><td>3.83</td><td>29.92</td><td>34.38</td><td>31.80</td><td>36.85</td><td>30.73</td><td>35.95</td><td>6.28</td><td>7.18</td></tr><tr><td>ORG</td><td>12.00</td><td>14.54</td><td>36.20</td><td>41.70</td><td>6.61</td><td>7.33</td><td>64.68</td><td>74.89</td><td>62.87</td><td>71.21</td><td>52.26</td><td>59.84</td><td>-2.80</td><td>-4.91</td></tr><tr><td>PEOPLE</td><td>10.74</td><td>13.10</td><td>15.39</td><td>18.28</td><td>3.90</td><td>4.40</td><td>20.67</td><td>28.10</td><td>21.62</td><td>28.48</td><td>18.96</td><td>25.29</td><td>4.60</td><td>1.35</td></tr><tr><td>SCI</td><td>6.84</td><td>8.66</td><td>15.62</td><td>18.86</td><td>6.87</td><td>7.28</td><td>25.92</td><td>34.54</td><td>25.15</td><td>32.30</td><td>21.50</td><td>27.53</td><td>-2.97</td><td>-6.49</td></tr><tr><td>SPORT</td><td>11.31</td><td>13.28</td><td>22.78</td><td>26.01</td><td>7.51</td><td>8.53</td><td>37.40</td><td>44.91</td><td>39.37</td><td>45.56</td><td>33.64</td><td>39.72</td><td>5.27</td><td>1.45</td></tr><tr><td>TAX</td><td>10.48</td><td>11.08</td><td>24.77</td><td>26.65</td><td>6.22</td><td>6.50</td><td>35.15</td><td>40.94</td><td>37.20</td><td>40.98</td><td>33.65</td><td>38.19</td><td>5.83</td><td>0.10</td></tr><tr><td>AVG</td><td>11.52</td><td>13.58</td><td>22.66</td><td>26.01</td><td>6.58</td><td>7.14</td><td>35.88</td><td>43.24</td><td>36.94</td><td>43.78</td><td>32.60</td><td>38.59</td><td>2.95</td><td>1.23</td></tr></table>
Table 1: Performance comparison of domain generalization across 11 diverse topics. The "Rel. Gain" column highlights the substantial relative improvement of our approach over the best baseline, averaged across all domains (metrics in %).
<table><tr><td rowspan="2">Model</td><td colspan="2">HotpotQA</td><td colspan="2">MuSiQue</td><td colspan="2">2WikiMultiHopQA</td></tr><tr><td>EM(%)</td><td>F1(%)</td><td>EM(%)</td><td>F1(%)</td><td>EM(%)</td><td>F1(%)</td></tr><tr><td>RAPTOR</td><td>35.50</td><td>41.56</td><td>15.00</td><td>16.31</td><td>22.50</td><td>22.95</td></tr><tr><td>HippoRAG</td><td>49.50</td><td>55.87</td><td>14.50</td><td>17.43</td><td>30.00</td><td>30.44</td></tr><tr><td>ToG</td><td>10.08</td><td>11.00</td><td>2.70</td><td>2.69</td><td>5.20</td><td>5.34</td></tr><tr><td>HyperGraphRAG</td><td>51.00</td><td>42.69</td><td>22.00</td><td>20.02</td><td>42.50</td><td>30.17</td></tr><tr><td>HyperRetriever</td><td>42.50</td><td>43.65</td><td>13.50</td><td>14.15</td><td>34.00</td><td>34.06</td></tr><tr><td>HyperMemory</td><td>35.50</td><td>41.51</td><td>8.00</td><td>12.96</td><td>31.50</td><td>32.56</td></tr><tr><td>Rel. Gain (%)</td><td>-16.67</td><td>-21.87</td><td>-38.64</td><td>-29.32</td><td>-20.00</td><td>11.89</td></tr></table>
Table 2: Performance comparison on HotpotQA, MuSiQue, and 2WikiMultiHopQA. Rel. Gain (%) indicates the relative performance gains achieved by our model compared with the best baselines. The best results are bolded, and the second best are underlined.
4.2.2 Results. Table 2 details the Exact Match (EM) and F1 scores across three open-domain QA benchmarks. HyperRetriever consistently outperforms the HyperMemory variant on HotpotQA and MuSiQue, demonstrating superior capability in identifying evidential relational chains. This advantage is attributed to its learnable MLP-based plausibility scorer and density-aware expansion strategy, which affords precise control over retrieval depth. In contrast, HyperMemory relies on the fixed parametric memory of the LLM, rendering it less adaptable to domain-specific relational patterns. When compared to external KG-based RAG baselines, we observe a performance divergence based on graph topology. On HotpotQA and MuSiQue, HyperRetriever exhibits a performance gap (e.g., 38.64% lower EM on MuSiQue), likely because these datasets require the rigid structural guidance of explicit KG priors for cross-document navigation. However, on 2WikiMultiHopQA, HyperRetriever reverses this trend, achieving an 11.89% relative F1 improvement. This suggests that while KG priors aid in sparse settings, HyperRetriever is uniquely effective at exploiting the denser, complex relational contexts found in 2WikiMultiHopQA.
### 4.3 Closed-domain Generalization Performance
To evaluate adaptability to closed-domain \( n \) -ary knowledge graphs, we evaluate the performance of HyperRAG on the WikiTopics-CLQA dataset (Table 1). The results demonstrate strong generalization across diverse topic-specific hypergraphs. In particular, our learnable variant, HyperRetriever, achieved the highest overall answer precision, with average improvements of 2.95% (MRR) and 1.23% (Hits@10) over the second-best baseline, HyperGraphRAG. These gains are statistically significant \( \left( {p \ll {0.001}}\right) \) , with \( t \) -test \( p \) -values of \( {1.46} \times {10}^{-{17}} \) for MRR and \( {2.41} \times {10}^{-6} \) for Hits@10, suggesting the empirical reliability of our approach. HyperRetriever secures top performance in 9 out of the 11 categories, for instance achieving relative gains of 12.40% (MRR) and 12.13% (Hits@10) in the ART domain, and consistently ranks second in the remaining two. This broad efficacy highlights the robustness of HyperRetriever's adaptive retrieval mechanism. Unlike baselines that are sensitive to domain-specific graph density, HyperRetriever's learnable MLP scorer dynamically calibrates its expansion strategy to suit varying \( n \) -ary topologies, ensuring high precision even in complex reasoning tasks. In contrast, our memory-guided variant, HyperMemory, consistently underperforms relative to HyperRetriever. This variant serves as a critical ablation to probe the limitations of an LLM's intrinsic parametric memory for \( n \) -ary retrieval. The results confirm that prompt-based scoring alone, without the explicit structural learning provided by HyperRetriever, is insufficient for multi-hop reasoning in closed domains.
<table><tr><td rowspan="2">Topic</td><td colspan="2">Full</td><td colspan="2">w/o Entities</td><td colspan="2">w/o Hyperedges</td><td colspan="2">w/o Chunks</td><td colspan="2">w/o Adaptive Search</td><td colspan="2">w/ Binary KG</td></tr><tr><td>MRR</td><td>Hits@10</td><td>MRR</td><td>Hits@10</td><td>MRR</td><td>Hits@10</td><td>MRR</td><td>Hits@10</td><td>MRR</td><td>Hits@10</td><td>MRR</td><td>Hits@10</td></tr><tr><td>ART</td><td>26.03</td><td>31.00</td><td>27.28</td><td>31.00</td><td>24.03</td><td>27.00</td><td>24.17</td><td>27.00</td><td>26.33</td><td>31.00</td><td>14.00</td><td>15.00</td></tr><tr><td>AWARD</td><td>56.91</td><td>70.00</td><td>43.22</td><td>61.00</td><td>55.95</td><td>69.00</td><td>55.01</td><td>66.00</td><td>52.98</td><td>66.00</td><td>48.92</td><td>53.00</td></tr><tr><td>EDU</td><td>49.00</td><td>56.00</td><td>43.24</td><td>52.00</td><td>47.93</td><td>52.00</td><td>42.67</td><td>47.00</td><td>47.53</td><td>53.00</td><td>38.20</td><td>42.00</td></tr><tr><td>HEALTH</td><td>41.25</td><td>47.00</td><td>37.17</td><td>43.00</td><td>37.70</td><td>40.00</td><td>39.33</td><td>47.00</td><td>39.20</td><td>46.00</td><td>36.17</td><td>39.00</td></tr><tr><td>INFRA</td><td>34.85</td><td>43.00</td><td>35.17</td><td>43.00</td><td>30.87</td><td>39.00</td><td>38.75</td><td>44.00</td><td>35.50</td><td>45.00</td><td>30.50</td><td>32.00</td></tr><tr><td>LOC</td><td>38.75</td><td>42.50</td><td>44.58</td><td>47.50</td><td>37.50</td><td>40.00</td><td>33.13</td><td>37.50</td><td>41.67</td><td>47.50</td><td>39.58</td><td>42.50</td></tr><tr><td>ORG</td><td>46.79</td><td>58.97</td><td>58.75</td><td>65.00</td><td>45.92</td><td>55.00</td><td>53.00</td><td>60.00</td><td>38.07</td><td>45.00</td><td>47.50</td><td>47.50</td></tr><tr><td>PEOPLE</td><td>14.20</td><td>22.00</td><td>21.23</td><td>28.00</td><td>13.73</td><td>19.00</td><td>20.03</td><td>26.00</td><td>13.37</td><td>20.00</td><td>19.33</td><td>22.00</td></tr><tr><td>SCI</td><td>25.91</td><td>36.00</td><td>18.67</td><td>22.00</td><td>24.53</td><td>32.00</td><td>26.09</td><td>38.00</td><td>21.14</td><td>32.00</td><td>24.00</td><td>27.00</td></tr><tr><td>SPORT</td><td>31.04</td><td>40.00</td><td>35.83</td><td>40.00</td><td>35.00</td><td>45.50</td><td>29.58</td><td>40.00</td><td>33.33</td><td>37.50</td><td>42.08</td><td>47.50</td></tr><tr><td>TAX</td><td>36.25</td><td>40.00</td><td>29.17</td><td>35.00</td><td>33.54</td><td>36.25</td><td>33.13</td><td>36.25</td><td>36.88</td><td>40.00</td><td>35.42</td><td>37.50</td></tr><tr><td>AVG</td><td>36.45</td><td>40.59</td><td>35.85</td><td>42.50</td><td>35.15</td><td>41.34</td><td>35.90</td><td>42.61</td><td>35.64</td><td>42.91</td><td>34.15</td><td>36.82</td></tr></table>
Table 3: Ablation on the Contribution of Context Formation and Adaptive Search. The full model incorporates all components essential for context formation, including entities, hyperedges involved in learnable relational chains, and retrieved chunks. The best results in MRR are bolded, and the best in Hits@10 are underlined.
<table><tr><td>Dimension</td><td>RAPTOR [33]</td><td>HippoRAG [14]</td><td>ToG [37]</td><td>HyperGraphRAG [25]</td><td>OG-RAG [34]</td><td>HyperRetriever / Memory</td></tr><tr><td>Structure type</td><td>Doc tree (summ.)</td><td>KG (binary)</td><td>KG (binary)</td><td>Hypergraph ( \( n \) -ary)</td><td>Object graph (mostly bin.)</td><td>Hypergraph (n-ary)</td></tr><tr><td>Unit of fact</td><td>Passage / summary</td><td>Entity-entity edge</td><td>Step / subgoal</td><td>Hyperedge ( \( n \) -ary fact)</td><td>Object-object edge</td><td>Hyperedge (n-ary fact)</td></tr><tr><td>Candidate growth</td><td>Additive (levels)</td><td>Additive on edge</td><td>LLM-var.</td><td>Additive on hyperedges</td><td>Additive on objects</td><td>Additive on hyperedges</td></tr><tr><td>Per-query overhead</td><td>Tokens only</td><td>\( O\left( {n - k}\right) \)</td><td>Var.</td><td>\( O{\left( 1\right) }^{ \dagger } \)</td><td>\( O\left( 1\right) \)</td><td>\( O{\left( 1\right) }^{ \dagger } \)</td></tr><tr><td>Depth for reasoning chain</td><td>Deep</td><td>Deep (pairwise)</td><td>LLM-var.</td><td>Shallow \( \left( {n\text{ -ary edges }}\right) \)</td><td>Deep (pairwise)</td><td>Shallow \( \left( {n\text{ -ary edges }}\right) \)</td></tr><tr><td>Retrieval strategy</td><td>Dense tree search</td><td>Graph walk + dense</td><td>LLM on graph</td><td>Static</td><td>Object-centric walk</td><td>Adaptive / LLM on graph</td></tr><tr><td>LLM at retrieval</td><td>Low-Med</td><td>Low</td><td>Med-High (LLM)</td><td>Low</td><td>Low</td><td>Low / Med (LLM)</td></tr><tr><td>Ontology</td><td>✘</td><td>✘</td><td>✘</td><td>✘</td><td>✓</td><td>✘</td></tr></table>
Table 4: Method Comparison. HyperRetriever utilizes adaptive search on \( n \) -ary hyperedges, enabling higher-order reasoning with shallow chains and near-constant per-query retrieval overhead \( O\left( 1\right) \) . In contrast, static or object-centric walks on binary graphs entail deeper pairwise chains and materialization cost. \( \dagger \) denotes bounded arity; \( \checkmark \) indicates an ontology requirement.
### 4.4 Ablation Study
To evaluate the effectiveness of our approach, we conduct a series of ablation studies targeting two key aspects: (i) the contribution of individual components to context formation, and (ii) the impact of the adaptive search policy on retrieval performance.
4.4.1 Higher-Order Reasoning Chains. Compared with binary KG RAG, HyperRAG supports higher-order reasoning on \( n \) -ary hypergraphs. An \( n \) -ary hyperedge jointly binds multiple entities and roles, capturing fine-grained dependencies beyond pairwise links. Exploiting this structure yields shallower yet more expressive reasoning chains, enabling the model to surface key evidence without multi-hop traversal. Empirically (Table 3), replacing the \( n \) -ary structure with a binary one lowers the average MRR from 36.45% to 34.15% (a drop of 2.30 points) and the average Hits@10 from 40.59% to 36.82% (a drop of 3.77 points), indicating gains in both accuracy and efficiency. Additional qualitative examples appear in Appendix C.
4.4.2 Impact of Context Formation. Table 3 presents a componentwise ablation study conducted on a representative \( 1\% \) subset to isolate the contributions of (i) entities, (ii) structural relations (hyperedges), and (iii) textual context. We observe that removing any component consistently degrades Mean Reciprocal Rank (MRR), though Hits@10 exhibits higher variance. This divergence highlights the distinction between ranking fidelity (MRR) and candidate inclusion (Hits@10). For instance, in the ORG and LOC domains, certain ablated variants maintain competitive Hits@10 scores but suffer sharp declines in MRR. This indicates that while the correct answer remains within the top candidates, the loss of structural or semantic signals causes it to drift down the ranking list, degrading precision. Crucially, hyperedges emerge as the dominant factor in effective context formation. Their exclusion precipitates the most significant performance drops across both metrics, underscoring the necessity of high-order topological structure for reasoning. In contrast, removing entities yields less severe degradation, as entities primarily provide node-level descriptions, whereas hyperedges capture the joint dependencies between them. Text chunks offer complementary unstructured semantics but lack the relational precision of the graph structure. Ultimately, the superior performance of the full model validates the synergistic integration of entity-aware signals, hypergraph topology, and adaptive textual evidence.
4.4.3 Impact of Adaptive Search. Removing the adaptive search component results in a noticeable decline in MRR across most categories, whereas its impact on Hits@10 is minimal and in some cases (e.g., INFRA, LOC) even marginally positive. This pattern suggests that while correct answers remain retrievable among the top 10 candidates, they tend to be ranked lower in the absence of adaptive search, reducing overall ranking precision.

|
||||
|
||||
Figure 3: The visualization shows the efficiency-effectiveness tradeoff in multi-hop QA: retrieval time (\( x \)-axis), answer quality (Hits@10, \( y \)-axis), and context volume (bubble size, log-scaled by retrieved tokens).
### 4.5 Efficiency Study
4.5.1 Setup. To assess retrieval efficiency, we draw a stratified 1% sample from each WikiTopics-CLQA category, yielding approximately 1,000 questions evenly distributed across 11 topic domains, and evaluate all baselines on this set. Figure 3 depicts the three-way trade-off among retrieval time (\( x \)-axis), Hits@10 accuracy (\( y \)-axis), and context volume (bubble size, logarithmically scaled by retrieved tokens). Models in the upper-left quadrant achieve the best balance between efficiency and effectiveness, combining low latency with high Hits@10 while retrieving compact contexts.
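The stratified draw described above can be sketched as follows (the field names and seed are illustrative assumptions):

```python
import random
from collections import defaultdict

def stratified_sample(questions, frac=0.01, seed=0):
    """Sample the same fraction from every topic domain, so that all
    categories remain represented in the evaluation subset."""
    rng = random.Random(seed)
    by_topic = defaultdict(list)
    for q in questions:
        by_topic[q["topic"]].append(q)
    sample = []
    for topic, qs in by_topic.items():
        k = max(1, round(frac * len(qs)))  # at least one per topic
        sample.extend(rng.sample(qs, k))
    return sample
```

Sampling per topic rather than globally keeps the 11 domains evenly weighted, which is what makes the per-category efficiency comparison in Figure 3 meaningful.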
4.5.2 Empirical Evidence. HyperRetriever achieves the shortest retrieval time and the highest Hits@10. Although it retrieves more tokens than some baselines, top performers consistently rely on larger contexts, highlighting a common trade-off between answer quality and retrieval volume. Our empirical findings align with the theoretical analysis in §2.2. HyperRetriever employs adaptive search over \( n \)-ary hyperedges, enabling higher-order reasoning with shallow chains and nearly constant per-query overhead \( O\left( 1\right) \). In contrast, static or object-centric walks in binary graphs require deeper pairwise chains and incur an event materialization cost \( O\left( {n - k}\right) \). We further benchmark our approach against five publicly available graph-based RAG systems, covering both \( n \)-ary and binary KG designs, and summarize the comparison in Table 4.
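The two metrics reported throughout §4.4–4.5 follow their standard definitions; a minimal sketch (rank is 1-based, and queries whose answer is missing contribute zero):

```python
def mrr(ranked_lists, gold):
    """Mean Reciprocal Rank: average of 1/rank of the gold answer."""
    total = 0.0
    for ranking, answer in zip(ranked_lists, gold):
        if answer in ranking:
            total += 1.0 / (ranking.index(answer) + 1)
    return total / len(gold)

def hits_at_k(ranked_lists, gold, k=10):
    """Fraction of queries whose gold answer appears in the top k."""
    hit = sum(1 for ranking, answer in zip(ranked_lists, gold)
              if answer in ranking[:k])
    return hit / len(gold)
```

The distinction drawn in §4.4.2 falls directly out of these definitions: an answer that stays inside the top \( k \) but slips from rank 1 to rank 5 leaves Hits@10 untouched while cutting its MRR contribution from 1.0 to 0.2.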
## 5 Related Work
Retrieval-Augmented Generation. RAG fundamentally augments the parametric memory of LLMs with external data, serving as a critical countermeasure against hallucination in knowledge-intensive tasks. The standard pipeline operates by retrieving top- \( k \) document chunks via dense similarity search before conditioning generation on this augmented context [2, 12, 17]. However, conventional dense retrieval methods [6, 20] treat data as flat text, often overlooking the complex structural and relational signals required for deep reasoning. To address this, iterative multi-step retrieval approaches have been proposed [18, 36, 39]. Yet, these methods often suffer from diminishing returns: they increase inference latency and retrieve redundant information that dilutes the context signal. This noise contributes to the "lost-in-the-middle" effect, where finite context windows prevent the LLM from effectively attending to dispersed evidence [24, 41].
Graph-based RAG. Graph-based RAG frameworks incorporate inter-document and inter-entity relationships into retrieval to enhance coverage and contextual relevance [3, 15, 31, 32]. Early approaches queried curated KGs (e.g., WikiData, Freebase) for factual triples or reasoning chains [4, 22, 27, 40], while recent methods fuse KGs with unstructured text [8, 23] or build task-specific graphs from raw corpora [7]. To improve efficiency, LightRAG [13], HippoRAG [14], and MiniRAG [10] adopt graph indexing via entity links, personalized PageRank, or incremental updates [28, 29]. However, KG-based RAGs often face a trade-off between breadth and precision: broader retrieval increases noise, while narrower retrieval risks omitting key evidence. Methods using fixed substructures (e.g., paths, chunks) simplify reasoning [33, 44] but may miss global context, and these challenges are amplified by LLM context window limits, vast KG search spaces [18, 30, 37], and the high latency of iterative queries [37]. Moreover, most graph-based RAG methods rely on binary relational facts, limiting the expressiveness and coverage of knowledge. Hypergraph-based representations capture richer \( n \)-ary relational structures [26]. HyperGraphRAG [25] advances this line by leveraging \( n \)-ary hypergraphs, outperforming conventional KG-based RAGs, yet suffers from noisy retrieval and reliance on dense retrievers. OG-RAG [34] addresses these issues by grounding hyperedge construction and retrieval in domain-specific ontologies, enabling more accurate and interpretable evidence aggregation. However, its dependence on high-quality ontologies constrains scalability in fast-changing or low-resource domains. Most graph-based and hypergraph-based RAG methods still face challenges, particularly due to the use of static or object-centric walks on binary graphs, which entail deeper pairwise chains and higher materialization costs. Table 4 compares existing methods with HyperRAG.
## 6 Conclusion
We introduced HyperRAG, a novel framework that advances multi-hop Question Answering by shifting the retrieval paradigm from binary triples to \( n \)-ary hypergraphs, featuring two strategies: HyperRetriever, designed for precise, structure-aware evidential reasoning, and HyperMemory, which leverages dynamic, memory-guided path expansion. Empirical results demonstrate that HyperRAG effectively bridges reasoning gaps by enabling shallower, more semantically complete retrieval chains. Notably, HyperRetriever consistently outperforms strong baselines across diverse open- and closed-domain datasets, showing that modeling high-order dependencies is crucial for accurate and interpretable RAG systems.