
HypRAG: Hyperbolic Dense Retrieval for Retrieval Augmented Generation

Hiren Madhu¹ Ngoc Bui¹ Ali Maatouk¹ Leandros Tassiulas¹ Smita Krishnaswamy¹ Menglin Yang² Sukanta Ganguly³ Kiran Srinivasan³ Rex Ying¹

Abstract

Embedding geometry plays a fundamental role in retrieval quality, yet dense retrievers for retrieval-augmented generation (RAG) remain largely confined to Euclidean space. However, natural language exhibits hierarchical structure, from broad topics to specific entities, that Euclidean embeddings fail to preserve, causing semantically distant documents to appear spuriously similar and increasing hallucination risk. To address these limitations, we introduce hyperbolic dense retrieval, developing two model variants in the Lorentz model of hyperbolic space: HyTE-FH, a fully hyperbolic transformer, and HyTE-H, a hybrid architecture projecting pre-trained Euclidean embeddings into hyperbolic space. To prevent representational collapse during sequence aggregation, we introduce the Outward Einstein Midpoint, a geometry-aware pooling operator that provably preserves hierarchical structure. On MTEB, HyTE-FH outperforms equivalent Euclidean baselines, while on RAGBench, HyTE-H achieves up to 29% gains over Euclidean baselines in context relevance and answer relevance using substantially smaller models than current state-of-the-art retrievers. Our analysis also reveals that hyperbolic representations encode document specificity through norm-based separation, with over 20% radial increase from general to specific concepts, a property absent in Euclidean embeddings, underscoring the critical role of geometric inductive bias in faithful RAG systems¹.

1. Introduction

Dense retrieval forms the backbone of retrieval-augmented generation (RAG) systems (Lewis et al., 2020; Fan et al., 2024), where embedding quality directly determines whether generated responses are grounded in evidence or hallucinated. By retrieving relevant documents and conditioning generation on this context, RAG systems produce responses that are more attributable and aligned with verifiable sources (Ni et al., 2025). Yet, despite advances in retrieval architectures, current systems continue to rely on Euclidean embeddings, a choice inherited from standard neural networks rather than from language structure itself.


Figure 1. Hierarchies in Text. (A) Documents naturally organize into branching hierarchies where general topics spawn increasingly specific subtopics. Euclidean spaces distort such hierarchies due to crowding effects, while hyperbolic geometry preserves hierarchical relationships through exponential volume growth. (B) Ricci curvature analysis of document embeddings from strong baselines reveals predominantly negative curvature, indicating tree-like semantic structure.

Natural language inherently exhibits strong hierarchical organization (He et al., 2025b; Robinson et al., 2024), with semantic structure giving rise to locally tree-like neighborhoods. Euclidean spaces struggle to represent such branching hierarchies due to polynomial volume growth (He et al., 2025b), introducing shortcuts between hierarchically distinct regions that distort semantic relationships. In retrieval settings, these distortions can cause semantically distant documents to appear spuriously similar (Radovanovic et al., 2010; Bogolin et al., 2022), degrading retrieval precision (Reimers & Gurevych, 2021): a query about a specific subtopic may retrieve documents from sibling or parent categories that share similarity but lack the required specificity.

To further see why geometry matters for retrieval, consider a query about transformer attention mechanisms (Figure 1A). Relevant documents form a natural hierarchy, from general concepts like NLP, to transformers, to specific components like multi-head attention, inducing tree-like semantic structure. Euclidean embeddings struggle to preserve this organization: representing both broad topics and specialized descendants forces a trade-off between semantic proximity and fine-grained separation, causing neighborhood crowding and distortion. Hyperbolic geometry resolves this tension through exponential volume growth, allowing general concepts to remain compact while specific documents spread outward. To test whether such structure appears empirically, we analyze Ollivier-Ricci curvature (Ni et al., 2019), a measure of local geometry where negative values indicate tree-like branching, on graphs built from MS MARCO document embeddings (Bajaj et al., 2016). Across several strong models (Linq Embed Mistral, LLaMA Nemotron 8B, Qwen3 Embedding 4B), curvature distributions are predominantly negative (Figure 1B), providing empirical evidence that retrieval-relevant embeddings exhibit intrinsic hyperbolic structure and motivating hyperbolic geometry as a natural inductive bias for dense retrieval.


¹ Yale University, USA. ² Hong Kong University of Science and Technology (Guangzhou), China. ³ NetApp, USA. Correspondence to: Rex Ying <rex.ying@yale.edu>.

Preprint. February 10, 2026.

¹ The code is available at: https://anonymous.4open.science/r/HypRAG-30C6


Recent work has begun exploring hyperbolic geometry for language modeling and RAG systems, though with different focus areas. HELM (He et al., 2025a) introduces a family of hyperbolic language models that operate entirely in hyperbolic space, but these models target text generation rather than retrieval. In the RAG setting, HyperbolicRAG (Cao et al., 2025) projects embeddings into the Poincaré ball to encode hierarchical depth within a static, pre-built knowledge graph, using dual-space retrieval that fuses Euclidean and hyperbolic rankings. However, HyperbolicRAG relies on Euclidean encoders to produce the initial embeddings, leaving the fundamental geometric mismatch unresolved. Moreover, aggregating token embeddings into document representations poses a challenge that existing work in hyperbolic learning does not address (Yang et al., 2024). As we show in Proposition 4.3, naively averaging tokens in Euclidean space before projecting to hyperbolic space causes representations to collapse toward the origin, destroying the hierarchical structure that is meant to be preserved.

To this end, we introduce hyperbolic dense retrieval for RAG, framing embedding geometry as a design choice for improving evidence selection and grounding at the representation level. We study this through two complementary instantiations. First, HyTE-FH (Hyperbolic Text Encoder, Fully Hyperbolic) operates entirely in the Lorentz model of hyperbolic space, enabling end-to-end representation learning. Second, HyTE-H (Hybrid) maps embeddings from off-the-shelf Euclidean encoders into hyperbolic space, allowing us to build on existing pre-trained Euclidean models. The Lorentz model's intrinsic geometry enables parameter-efficient scaling: HyTE-H outperforms Euclidean baselines 2-3x its size, reducing memory footprint in resource-constrained settings. To address the aggregation challenge in both instantiations, we introduce the Outward Einstein Midpoint, a geometry-aware pooling operator that amplifies tokens farther from the origin, provably preserving hierarchical structure during pooling.

Through extensive evaluation on RAGBench, we demonstrate that both hyperbolic variants consistently outperform Euclidean baselines in answer relevancy across multiple datasets, while achieving competitive performance on MTEB. Our experiments validate three key findings: (1) hyperbolic retrieval substantially improves RAG performance, with up to 29% gains over Euclidean baselines in context relevance and answer relevance; (2) hyperbolic models naturally encode concept-level hierarchies in their radial structure, with the fully hyperbolic model achieving a 20.2% radius increase from general to specific concepts, while Euclidean models fail to capture this organization; and (3) our theoretically grounded Outward Einstein Midpoint pooling preserves this hierarchical structure during aggregation.

2. Related Work

Text Embeddings and Dense Retrieval. Dense retrieval embeds queries and documents into a shared vector space and ranks candidates by similarity (e.g., dot product or cosine). Transformer bi-encoders (e.g., BERT (Devlin et al., 2019)) are widely used in this context due to their scalability with maximum inner product search (Karpukhin et al., 2020; Reimers & Gurevych, 2019). Most methods train with contrastive objectives using in-batch and hard negatives (Gao et al., 2021; Izacard et al., 2021; Xiong et al., 2021), often following large-scale pretraining plus task-specific fine-tuning (Wang et al., 2022; Li et al., 2023; Nussbaum et al., 2025). More recently, decoder-only embedding models initialize from LLMs to exploit their pretrained linguistic knowledge (Muennighoff et al., 2024; Lee et al., 2024; Zhang et al., 2025). However, most retrievers remain reliant on inner products or distances in Euclidean geometry, an inductive bias often misaligned with the hierarchical structure of language and document collections. We address this gap by introducing hyperbolic geometry for text embeddings to better capture this hierarchy.

Retrieval Augmented Generation. RAG grounds LLMs in retrieved evidence to improve factuality and access external knowledge (Oche et al., 2025). It typically retrieves top-\( k \) contexts (often via dense retrieval) and conditions generation on them (Lewis et al., 2020). Since the context window is limited, retrieval quality is a key bottleneck for relevance and faithfulness (Friel et al., 2024a). Several methods improve reliability after retrieval: Self-RAG (Asai et al., 2024) and CRAG (Yan et al., 2024) use learned critics to filter or re-rank evidence, while GraphRAG (Han et al., 2024) leverages knowledge graphs for structured subgraph retrieval. These approaches operate downstream of the embedding space and are complementary to our geometric approach. Our goal is to improve RAG upstream by enhancing the retriever's representations so that the initial top-\( k \) evidence is more reliable under realistic efficiency constraints.

Hyperbolic Representation Learning. Hyperbolic geometry is primarily known for its ability to better capture hierarchical, tree-like structures (Yang et al., 2023; Peng et al., 2021), which enhances performance in various tasks, including molecular generation (Liu et al., 2019), recommendation (Yang et al., 2021; Li et al., 2021), image retrieval (Khrulkov et al., 2020; Wei et al., 2024; Bui et al., 2025), and knowledge graph embedding (Ganea et al., 2018a; Dhingra et al., 2018). More recently, hyperbolic geometry has also shown promise for multi-modal embedding models (Desai et al., 2023; Ibrahimi et al., 2024; Pal et al., 2024) and foundation models (Yang et al., 2025; He et al., 2025a). In contrast to these works, we study how hyperbolic representations can improve retrieval in RAG systems. Concurrently, Cao et al. (2025) use hyperbolic geometry to improve RAG rankings, but obtain hyperbolic embeddings via a simple projection from Euclidean encoders; by contrast, we build on fully hyperbolic encoders trained end-to-end and address key challenges in this setting, including a theoretically grounded, geometry-aware pooling operator for document-level representations.

3. Hyperbolic Space Preliminaries

In this section, we review the preliminaries of the Lorentz model of hyperbolic space and introduce the basic building blocks of HyTE-FH.

3.1. Lorentz Model of Hyperbolic Space

We represent all embeddings in \( d \)-dimensional hyperbolic space \( \mathbb{H}_{K}^{d} \) with constant negative curvature \( K < 0 \) using the Lorentz (hyperboloid) model. In the Lorentz model, hyperbolic space is realized as the upper sheet of a two-sheeted hyperboloid embedded in \( \mathbb{R}^{d+1} \),

\[ \mathbb{H}_{K}^{d} = \left\{ \mathbf{x} \in \mathbb{R}^{d+1} \;\middle|\; \langle \mathbf{x}, \mathbf{x} \rangle_{L} = \frac{1}{K},\; x_{0} > 0 \right\}, \]

where the Lorentzian inner product is defined as \( \langle \mathbf{x}, \mathbf{y} \rangle_{L} = -x_{0}y_{0} + \sum_{i=1}^{d} x_{i}y_{i} \). This formulation admits closed-form expressions for geodesic distances, barycentric operations, and parallel transport, and expresses similarity directly through Lorentzian inner products. The geodesic distance between two points \( \mathbf{x}, \mathbf{y} \in \mathbb{H}_{K}^{d} \) is given by \( d_{K}(\mathbf{x}, \mathbf{y}) = \frac{1}{\sqrt{-K}} \cosh^{-1}\left( K \langle \mathbf{x}, \mathbf{y} \rangle_{L} \right) \), which is a monotone function of the Lorentzian inner product.
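These closed-form expressions are straightforward to verify numerically. The following minimal NumPy sketch (the helper names `lift`, `lorentz_inner`, and `geodesic_dist` are ours, not from the paper) lifts spatial coordinates onto the hyperboloid and evaluates the inner product and geodesic distance defined above:

```python
import numpy as np

def lift(z, K=-1.0):
    """Lift spatial coordinates z in R^d onto the hyperboloid H_K^d by
    solving the constraint <x, x>_L = 1/K for the time coordinate x0."""
    x0 = np.sqrt(np.dot(z, z) - 1.0 / K)   # from -x0^2 + ||z||^2 = 1/K
    return np.concatenate(([x0], z))

def lorentz_inner(x, y):
    """Lorentzian inner product <x, y>_L = -x0*y0 + sum_i x_i*y_i."""
    return -x[0] * y[0] + np.dot(x[1:], y[1:])

def geodesic_dist(x, y, K=-1.0):
    """d_K(x, y) = (1/sqrt(-K)) * arccosh(K * <x, y>_L)."""
    arg = np.clip(K * lorentz_inner(x, y), 1.0, None)  # guard rounding below 1
    return np.arccosh(arg) / np.sqrt(-K)
```

For example, lifting \( (\sinh 1, 0) \) yields the point \( (\cosh 1, \sinh 1, 0) \), which lies at geodesic distance exactly 1 from the origin \( (1, 0, 0) \).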

To support optimization, we make use of exponential and logarithmic maps between the manifold and its tangent spaces. For a point \( \mathbf{x} \in \mathbb{H}_{K}^{d} \), the logarithmic map \( \log_{\mathbf{x}}(\cdot) \) maps nearby points to the tangent space \( T_{\mathbf{x}}\mathbb{H}_{K}^{d} \), while the exponential map \( \exp_{\mathbf{x}}(\cdot) \) maps tangent vectors back to the manifold. These operators are used only where necessary for gradient-based updates, ensuring that all representations remain on \( \mathbb{H}_{K}^{d} \) and preserving the hierarchical structure induced by negative curvature.

3.2. Hyperbolic Transformer Components

Standard operations cannot be applied directly in hyperbolic space, as they may violate the manifold constraint (Yang et al., 2024). To address this, we introduce hyperbolic components that serve as the building blocks for our embedding model. These operations are performed via a re-centering procedure that applies Euclidean operations in a latent space and maps the result back to the Lorentz model. By doing so, the resulting vector is constructed to satisfy the Lorentz constraint, thereby preserving the hyperbolic structure of representations. We present these operations as follows.

Lorentz Linear Layer. Given curvatures \( K_1, K_2 \) and parameters \( \mathbf{W} \in \mathbb{R}^{(n+1) \times m} \) and \( \mathbf{b} \in \mathbb{R}^{m} \), with \( \mathbf{z} = \mathbf{W}^{\top}\mathbf{x} + \mathbf{b} \), the Lorentzian linear transformation (Yang et al., 2024) is the map \( \operatorname{HLT} : \mathbb{L}^{K_1, n} \rightarrow \mathbb{L}^{K_2, m} \) given by

\[ \operatorname{HLT}(\mathbf{x}; \mathbf{W}, \mathbf{b}) = \sqrt{\frac{K_2}{K_1}} \cdot \left[ \sqrt{\|\mathbf{z}\|^{2} - 1/K_2},\; \mathbf{z} \right]. \]
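As a concrete illustration, the constraint-restoring construction can be sketched for the equal-curvature case \( K_1 = K_2 = K \), where the curvature-rescaling prefactor equals 1. This is a simplified sketch under that assumption, not the paper's full implementation:

```python
import numpy as np

def hlt(x, W, b, K=-1.0):
    """Lorentz linear layer sketch for the equal-curvature case K1 = K2 = K.
    x: point in ambient coordinates R^{n+1}; W: (n+1, m); b: (m,)."""
    z = W.T @ x + b                        # Euclidean map applied in ambient coords
    x0 = np.sqrt(np.dot(z, z) - 1.0 / K)   # time coordinate restoring <y, y>_L = 1/K
    return np.concatenate(([x0], z))
```

Regardless of what the linear map does to the spatial part, recomputing the time coordinate from the constraint guarantees the output lies back on the hyperboloid.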

Hyperbolic Layer Normalization. Given token embeddings \( X = \{\mathbf{x}_i\}_{i=1}^{n} \subset \mathbb{H}_{K}^{d} \), hyperbolic layer normalization is defined as

\[ \operatorname{HypLayerNorm}(X) = \left( \sqrt{\frac{K_1}{K_2}\|\mathbf{z}\|_{2}^{2} - \frac{1}{K_2}},\; \sqrt{\frac{K_1}{K_2}}\,\mathbf{z} \right), \]

where \( \mathbf{z} = f_{\mathrm{LN}}(\mathbf{x}_{i,[1:d]}) \), \( f_{\mathrm{LN}}(\cdot) \) denotes standard Euclidean LayerNorm applied to the spatial components of the embedding, and \( K_1, K_2 \) are the input and output curvatures, respectively.

Lorentz Residual Connection. Let \( \mathbf{x}, f(\mathbf{x}) \in \mathbb{L}^{K, n} \), where \( \mathbf{x} \) is an input vector and \( f(\mathbf{x}) \) is the output of a neural network \( f \). Then, the Lorentzian residual connection (He et al., 2025d) is given by \( \mathbf{x} \oplus_{\mathcal{L}} f(\mathbf{x}) = \alpha_1 \mathbf{x} + \alpha_2 f(\mathbf{x}) \), where

\[ \alpha_i = \frac{w_i}{\sqrt{-K}\,\left\| w_1 \mathbf{x} + w_2 f(\mathbf{x}) \right\|_{\mathcal{L}}}, \quad \text{for } i \in \{1, 2\}, \]

and \( \alpha_1, \alpha_2 \) are weights parametrized by constants \( (w_1, w_2) \in \mathbb{R}^{2} \setminus \{(0,0)\} \).

Hyperbolic Self-Attention. In hyperbolic attention, similarity is governed by hyperbolic geodesic distance (Ganea et al., 2018b). Given token embeddings \( X = \{\mathbf{x}_i\}_{i=1}^{n} \subset \mathbb{H}_{K}^{d} \), queries, keys, and values are computed via Lorentz-linear transformations as \( \mathbf{Q} = \operatorname{HLT}(X; \mathbf{W}^{Q}, \mathbf{b}^{Q}) \), \( \mathbf{K} = \operatorname{HLT}(X; \mathbf{W}^{K}, \mathbf{b}^{K}) \), and \( \mathbf{V} = \operatorname{HLT}(X; \mathbf{W}^{V}, \mathbf{b}^{V}) \), where \( \operatorname{HLT}(\cdot) \) denotes a linear map in Lorentz space. Attention weights are computed using squared hyperbolic geodesic distances (He et al., 2025c; Chen et al., 2022) as

\[ \nu_{i,j} = \frac{\exp\left( -d_{K}^{2}(\mathbf{q}_i, \mathbf{k}_j)/\sqrt{m} \right)}{\sum_{l=1}^{n} \exp\left( -d_{K}^{2}(\mathbf{q}_i, \mathbf{k}_l)/\sqrt{m} \right)}, \]


Figure 2. HyTE Architecture. A) HyTE-FH Encoder Block, B) HyTE-FH architecture, C) HyTE-H Architecture.

with head dimension \( m \). This prioritizes geodesic proximity rather than angular similarity. The attended representation is obtained via a Lorentzian weighted midpoint

\[ \operatorname{Att}_{\mathcal{L}}(\mathbf{x})_{i} = \frac{\sum_{j=1}^{n} \nu_{i,j} \lambda_j \mathbf{v}_j}{\sqrt{-K}\,\left\| \sum_{j=1}^{n} \nu_{i,j} \lambda_j \mathbf{v}_j \right\|_{\mathcal{L}}}, \]

where \( \lambda_j = v_{j,0} \) is the Lorentz factor. Unlike Euclidean averaging, this aggregation remains on \( \mathbb{H}_{K}^{d} \) and preserves radial structure during contextualization.
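A minimal NumPy sketch of this attention pattern (unbatched, single head; `hyp_attention` and its helpers are our names, and the spatial dimension stands in for the head dimension) shows that the Lorentzian weighted midpoint keeps every output on the manifold:

```python
import numpy as np

def lorentz_inner(x, y):
    """Lorentzian inner product <x, y>_L."""
    return -x[0] * y[0] + np.dot(x[1:], y[1:])

def hyp_attention(Q, Kmat, V, K=-1.0):
    """Distance-based attention sketch: softmax over negative squared
    geodesic distances, aggregated with the Lorentzian weighted midpoint."""
    n, m = Q.shape[0], Q.shape[1] - 1               # sequence length, spatial dim
    out = np.zeros_like(V, dtype=float)
    for i in range(n):
        inner = np.array([lorentz_inner(Q[i], Kmat[j]) for j in range(n)])
        d2 = np.arccosh(np.clip(K * inner, 1.0, None)) ** 2 / (-K)
        logits = -d2 / np.sqrt(m)
        nu = np.exp(logits - logits.max())
        nu /= nu.sum()                               # attention weights nu_{i,j}
        s = (nu * V[:, 0]) @ V                       # sum_j nu_{i,j} * lambda_j * v_j
        norm = np.sqrt(np.abs(lorentz_inner(s, s)))  # Lorentzian norm of the sum
        out[i] = s / (np.sqrt(-K) * norm)            # rescale back onto H_K^d
    return out
```

Because each row of the output is a positive combination of upper-sheet points rescaled by its Lorentzian norm, it satisfies \( \langle \mathbf{y}, \mathbf{y} \rangle_L = 1/K \) by construction.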

4. Method

We now outline our approach to hyperbolic dense retrieval. We begin by introducing the two proposed HyTE architectures, followed by an analysis of why naïve pooling strategies fail in hyperbolic space, and conclude by presenting our geometry-aware aggregation operator.

4.1. Architecture

The hyperbolic encoder components described in Section 3 form the building blocks (Figure 2A) of HyTE-FH, our fully hyperbolic transformer (Figure 2B). By operating entirely within hyperbolic geometry, HyTE-FH preserves hierarchical structure throughout token-level contextualization, aggregation, and similarity computation, with semantic abstraction and specificity encoded along radial dimensions. HyTE-H (Figure 2C) instead projects pretrained Euclidean representations into hyperbolic space, which allows hyperbolic geometry to be leveraged from a strong initialization while avoiding the need to train a fully hyperbolic encoder from scratch.

While hyperbolic self-attention enables geometry-consistent contextualization at the token level, dense retrieval requires aggregating variable-length sequences into fixed-dimensional representations. Standard approaches map representations to tangent space, aggregate in Euclidean space, then map back to the manifold (Yang et al., 2024; Desai et al., 2023), but this distorts the hierarchical structure encoded in radial depth in both models. In the following subsections, we analyze this failure mode formally and introduce a pooling operator designed to preserve hierarchical information.

4.2. Failure of Naïve Hyperbolic Pooling

Naïve pooling strategies that aggregate in Euclidean space (Yang et al., 2024; Desai et al., 2023) systematically contract representations toward the origin. This follows from hyperbolic convexity: for any \( \{\mathbf{x}_i\}_{i=1}^{n} \subset \mathbb{H}_{K}^{d} \), the barycenter lies strictly closer to the origin than the maximum-radius point unless all points coincide. Consequently, document-level embeddings lose the radial separation that encodes document specificity through hierarchical depth. To address this failure mode, we first establish notation for projecting ambient vectors onto the hyperboloid and measuring radial depth.

Definition 4.1 (Lorentz Projection). For \( \mathbf{v} \in \mathbb{R}^{d+1} \) with \( \langle \mathbf{v}, \mathbf{v} \rangle_{L} < 0 \) and \( v_0 > 0 \), let \( \Pi_{K}(\mathbf{v}) = \frac{\mathbf{v}}{\sqrt{K \langle \mathbf{v}, \mathbf{v} \rangle_{L}}} \) denote the unique positive rescaling satisfying \( \langle \Pi_{K}(\mathbf{v}), \Pi_{K}(\mathbf{v}) \rangle_{L} = 1/K \).

Definition 4.2 (Radial Depth). The radial depth of \( \mathbf{x} \in \mathbb{H}_{K}^{d} \) is \( r(\mathbf{x}) = x_0 \). Since \( x_0 = \frac{1}{\sqrt{-K}} \cosh(\sqrt{-K}\rho) \), where \( \rho = d_{K}(o, \mathbf{x}) \), ordering by \( x_0 \) is equivalent to ordering by intrinsic hyperbolic distance from the origin.

Semantically, radial depth encodes concept specificity: general concepts should lie near the origin while fine-grained entities should have larger radii. This provides a measurable signature for evaluating whether models learn meaningful hierarchical structure. The simplest aggregation strategy is Euclidean averaging in the ambient space followed by reprojection. However, this approach provably contracts representations toward the origin (Ganea et al., 2018a; Chami et al., 2019), destroying hierarchical structure encoded in radial depth. We formalize this in the following proposition.

Proposition 4.3 (Euclidean Mean Contracts). Let \( \{\mathbf{x}_i\}_{i=1}^{n} \subset \mathbb{H}_{K}^{d} \) with \( n \geq 2 \). Define the Euclidean mean \( \overline{\mathbf{x}} = \frac{1}{n} \sum_{i=1}^{n} \mathbf{x}_i \) and its projection onto the hyperboloid \( \mathbf{m}^{\mathrm{Euc}} = \Pi_{K}(\overline{\mathbf{x}}) \). Then, we have

\[ r\left( \mathbf{m}^{\mathrm{Euc}} \right) \leq \frac{1}{n} \sum_{i=1}^{n} r(\mathbf{x}_i), \]

with equality if and only if all \( \mathbf{x}_i \) are identical.
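The contraction is easy to reproduce numerically. The sketch below (our helper names `lift` and `project`) averages a shallow and a deep point in ambient coordinates, reprojects via Definition 4.1, and compares radial depths:

```python
import numpy as np

def lift(z, K=-1.0):
    """Place spatial coordinates z on the hyperboloid H_K^d."""
    return np.concatenate(([np.sqrt(np.dot(z, z) - 1.0 / K)], z))

def project(v, K=-1.0):
    """Lorentz projection of Definition 4.1: v / sqrt(K * <v, v>_L)."""
    inner = -v[0] ** 2 + np.dot(v[1:], v[1:])
    return v / np.sqrt(K * inner)

# A shallow (general) and a deep (specific) point on H_{-1}^2.
xs = [lift(np.array([0.2, 0.0])), lift(np.array([3.0, 1.0]))]
m_euc = project(np.mean(xs, axis=0))        # Euclidean mean, then reprojection
mean_depth = np.mean([x[0] for x in xs])
# r(m_euc) = m_euc[0] falls strictly below the average radial depth,
# illustrating the contraction of Proposition 4.3.
```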


Figure 3. Outward Einstein Midpoint. Size of token shows its contribution towards aggregation.

The proof of this Proposition is available in Appendix A.2. This failure motivates a precise characterization of desirable pooling behavior. We formalize the requirement that pooling should preserve, rather than collapse, radial structure.

Definition 4.4 (Outward Bias). A pooling operator \( \mathcal{P} : (\mathbb{H}_{K}^{d})^{n} \rightarrow \mathbb{H}_{K}^{d} \) is outward-biased if \( r\left( \mathcal{P}(\{\mathbf{x}_i\}_{i=1}^{n}) \right) \geq \bar{r} \), where \( \bar{r} \) is the weighted mean radius.

A natural alternative is a weighted aggregation scheme in which token contributions are modulated by their relative importance. For example, Zhu et al. (2020) adopt the Einstein midpoint, the canonical barycenter in hyperbolic space (Gulcehre et al., 2019), to emphasize semantically specific tokens during pooling: since points near the boundary receive higher weight via the Lorentz factor \( \lambda_i = x_{i,0} \), more informative content should dominate the aggregate. However, we show this intuition is misleading: the implicit radial weighting is fundamentally insufficient to counteract hyperbolic contraction at the document level.

Proposition 4.5 (Implicit Radial Weighting is Insufficient). The Einstein midpoint weights points by the Lorentz factor \( \lambda_i = x_{i,0} \), but this weighting grows as \( \exp(\sqrt{-K}\rho) \) while hyperbolic volume grows as \( \exp((d-1)\sqrt{-K}\rho) \). Specifically, for a point \( \mathbf{x} \in \mathbb{H}_{K}^{d} \) at hyperbolic distance \( \rho \) from the origin \( o = (1/\sqrt{-K}, 0, \ldots, 0) \), we have

\[ x_0 = \frac{1}{\sqrt{-K}} \cosh\left( \sqrt{-K}\rho \right) \sim \frac{1}{2\sqrt{-K}} \exp\left( \sqrt{-K}\rho \right) \]

as \( \rho \rightarrow \infty \). Thus, the Lorentz factor weighting undercompensates for the exponential growth of hyperbolic balls at large radii by a factor of \( \exp((d-2)\sqrt{-K}\rho) \).

These results establish that neither Euclidean averaging nor the standard Einstein midpoint satisfies the outward-bias property required for hierarchy-preserving aggregation. This motivates the design of a pooling operator with explicit radial amplification. The proof of this Proposition is available in Appendix A.3.

4.3. Outward Einstein Midpoint Pooling

To mitigate radial contraction during aggregation, we introduce the Outward Einstein Midpoint, a geometry-aware pooling operator that explicitly amplifies the contribution of tokens with larger hyperbolic radius. Let \( \{\mathbf{x}_i\}_{i=1}^{n} \subset \mathbb{H}_{K}^{d} \) denote a sequence of token embeddings, with optional attention weights \( w_i \geq 0 \), and \( \lambda_i \) denoting the Lorentz factors. We define a radius-dependent weighting function

\[ \phi(\mathbf{x}_i) = x_{i,0}^{p}, \quad p > 0, \]

which is monotone in the radial coordinate. The Outward Einstein Midpoint is then given by

\[ \mathbf{m}_{K,p}^{\mathrm{OEM}} = \frac{\sum_{i=1}^{n} \left( w_i \phi(\mathbf{x}_i) \right) \lambda_i \mathbf{x}_i}{\sum_{i=1}^{n} \left( w_i \phi(\mathbf{x}_i) \right) \lambda_i}, \]

followed by reprojection onto the hyperboloid \( \mathbb{H}_{K}^{d} \).

As shown in Figure 3, by construction, this operator assigns disproportionately higher weight to tokens located farther from the origin, counteracting the contraction inherent to naïve averaging. We now establish theoretical guarantees for the Outward Einstein Midpoint, showing that it systematically improves upon the standard Einstein midpoint in preserving radial structure.
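A compact sketch of the operator (NumPy; `oem_pool` and `project` are our names, with reprojection via Definition 4.1) makes the radial amplification concrete; setting \( p = 0 \) recovers the standard Einstein midpoint for comparison:

```python
import numpy as np

def project(v, K=-1.0):
    """Lorentz projection (Definition 4.1) back onto the hyperboloid."""
    inner = -v[0] ** 2 + np.dot(v[1:], v[1:])
    return v / np.sqrt(K * inner)

def oem_pool(X, w=None, p=2.0, K=-1.0):
    """Outward Einstein Midpoint sketch: token i is weighted by
    w_i * phi(x_i) * lambda_i with phi(x_i) = x_{i,0}^p and Lorentz
    factor lambda_i = x_{i,0}; p = 0 gives the standard Einstein midpoint."""
    X = np.asarray(X, dtype=float)
    w = np.ones(len(X)) if w is None else np.asarray(w, dtype=float)
    lam = X[:, 0]                        # Lorentz factors lambda_i = x_{i,0}
    a = w * lam ** p * lam               # w_i * phi(x_i) * lambda_i
    v = (a[:, None] * X).sum(axis=0) / a.sum()
    return project(v, K)                 # reproject onto H_K^d
```

On a mix of shallow and deep tokens, the pooled point with \( p = 2 \) retains a radial depth at least as large as the \( p = 0 \) midpoint, consistent with Theorem 4.7 below.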

Theorem 4.6 (OEM Pre-Projection Bound). Let \( \widetilde{\mathbf{v}} = \sum_{i=1}^{n} \widetilde{w}_i \mathbf{x}_i \), where \( \widetilde{w}_i \propto w_i x_{i,0}^{p+1} \) are the normalized OEM weights. Then, for \( p \geq 0 \), we have

\[ \widetilde{v}_0 = \frac{\sum_{i=1}^{n} w_i x_{i,0}^{p+2}}{\sum_{i=1}^{n} w_i x_{i,0}^{p+1}} \geq \frac{\sum_{i=1}^{n} w_i x_{i,0}}{\sum_{i=1}^{n} w_i} = \bar{r}_{w}. \]

We apply Chebyshev's sum inequality to the co-monotonic sequences \( a_i = x_{i,0}^{p+1} \) and \( b_i = x_{i,0} \) to prove this; the full proof can be found in Appendix A.4. While projection onto \( \mathbb{H}_{K}^{d} \) contracts the radial coordinate, the OEM's concentration of weight on high-radius tokens inflates the pre-projection average, counteracting this effect. Theorem 4.6 establishes that OEM increases the pre-projection radial coordinate. The following theorem shows a stronger result: OEM provably dominates the standard Einstein midpoint in preserving radial structure.

Theorem 4.7 (OEM Outward Bias). Let \( \mathbf{m}_{K}^{\mathrm{Ein}} \) denote the standard Einstein midpoint (\( p = 0 \)) and \( \mathbf{m}_{K,p}^{\mathrm{OEM}} \) the Outward Einstein Midpoint. Then, for all \( p \geq 1 \):

\[ r\left( \mathbf{m}_{K,p}^{\mathrm{OEM}} \right) \geq r\left( \mathbf{m}_{K}^{\mathrm{Ein}} \right). \]

The OEM weights \( \widetilde{w}_i \propto w_i x_{i,0}^{p+1} \) concentrate more mass on high-radius points than the Einstein weights \( w_i x_{i,0} \), increasing the pre-projection time component while reducing pairwise dispersion. The full proof is in Appendix A.5. Together, these results establish that the Outward Einstein Midpoint provably preserves hierarchical structure during aggregation, in contrast to both Euclidean averaging and the standard Einstein midpoint. We validate this empirically through concept-level hierarchy analysis (Section 5.2), showing that models using OEM pooling maintain monotonically increasing radii across semantic specificity levels, a property absent in Euclidean baselines.

4.4. Training Methodology

We train the hyperbolic encoder in three stages, with all objectives operating directly on the Lorentz manifold using geodesic-based similarity.

Stage 1: Hyperbolic Masked Language Modeling. We initialize via masked language modeling (MLM), following the standard BERT objective in hyperbolic space. Contextualization is performed through hyperbolic self-attention, with all intermediate representations on the hyperboloid. Predictions are produced using a Lorentzian multinomial logistic regression (LorentzMLR) (Bdeir et al., 2024) head, which defines class logits via Lorentzian inner products. Only HyTE-FH is trained with MLM; for HyTE-H we choose a pre-trained Euclidean model as the MLM base to leverage a stronger initialization in low-resource settings.

Stage 2: Unsupervised Contrastive Pre-Training. We fine-tune the resulting MLM model on query-document pairs by minimizing an unsupervised contrastive loss. Similarity is defined as negative geodesic distance, \( s(q, d) = -d_{K}(q, d) \). The contrastive loss over in-batch negatives is

\[ \mathcal{L}_{\mathrm{ctr}} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp\left( s(\mathbf{q}_i, \mathbf{d}_i)/\tau \right)}{\sum_{j=1}^{N} \exp\left( s(\mathbf{q}_i, \mathbf{d}_j)/\tau \right)}, \]

where ( \tau > 0 ) is a temperature parameter.
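Under these definitions, the in-batch objective can be sketched as follows (NumPy; embeddings are rows in ambient Lorentz coordinates, and `geodesic_dist_matrix` / `contrastive_loss` are our names, not the paper's API):

```python
import numpy as np

def geodesic_dist_matrix(Q, D, K=-1.0):
    """Pairwise geodesic distances between query rows and document rows,
    both given in ambient Lorentz coordinates on the hyperboloid."""
    G = np.diag([-1.0] + [1.0] * (Q.shape[1] - 1))     # Lorentzian metric tensor
    inner = Q @ G @ D.T                                # <q_i, d_j>_L for all pairs
    return np.arccosh(np.clip(K * inner, 1.0, None)) / np.sqrt(-K)

def contrastive_loss(Q, D, tau=0.05, K=-1.0):
    """In-batch contrastive loss with s(q, d) = -d_K(q, d): row i's positive
    is document i; all other in-batch documents act as negatives."""
    s = -geodesic_dist_matrix(Q, D, K) / tau
    s = s - s.max(axis=1, keepdims=True)               # numerical stability
    log_probs = s - np.log(np.exp(s).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))
```

Matched query-document pairs on the diagonal receive the highest similarity, so shuffling the documents (breaking the pairing) strictly increases the loss.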

Stage 3: Supervised Contrastive Fine-tuning. In the final stage of training, we further fine-tune the encoder using supervised contrastive learning on labeled query-document data. Given a query \( q_i \), a set of relevant documents \( \mathcal{D}_i^{+} \), and a set of non-relevant documents \( \mathcal{D}_i^{-} \), the supervised contrastive objective encourages the query representation to be closer to all relevant documents than to non-relevant ones:

\[ \mathcal{L}_{\mathrm{sup}} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\sum_{d^{+} \in \mathcal{D}_i^{+}} \exp\left( s(\mathbf{q}_i, \mathbf{d}^{+})/\tau \right)}{\sum_{d \in \mathcal{D}_i^{+} \cup \mathcal{D}_i^{-}} \exp\left( s(\mathbf{q}_i, \mathbf{d})/\tau \right)}, \]

where ( \tau > 0 ) is a temperature parameter. This stage explicitly aligns hyperbolic distances with supervised relevance signals, refining retrieval behavior beyond unsupervised co-occurrence structure.
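For a single query, the multi-positive objective above can be sketched directly (NumPy; `sup_contrastive_loss` and its helper are our names, with documents given in ambient Lorentz coordinates):

```python
import numpy as np

def sup_contrastive_loss(q, D_pos, D_neg, tau=1.0, K=-1.0):
    """Supervised contrastive objective sketch for one query: exp-similarities
    over relevant docs in the numerator, over the union of relevant and
    non-relevant docs in the denominator, with s(q, d) = -d_K(q, d)."""
    G = np.diag([-1.0] + [1.0] * (len(q) - 1))      # Lorentzian metric tensor

    def sim(d):
        return -np.arccosh(max(K * (q @ G @ d), 1.0)) / np.sqrt(-K)

    pos = np.array([np.exp(sim(d) / tau) for d in D_pos])
    neg = np.array([np.exp(sim(d) / tau) for d in D_neg])
    return -np.log(pos.sum() / (pos.sum() + neg.sum()))
```

Swapping the positive and negative sets for a query whose true relevant document is nearby sharply increases the loss, which is exactly the gradient signal that aligns geodesic distance with relevance.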

Retrieval-Augmented Generation. At inference time, the trained hyperbolic encoder is used to retrieve the top-\( k \) documents \( \mathcal{C} \) for a given query. These retrieved documents are then provided as context to a downstream generative language model. Prompt formatting and generation follow standard practice and are provided in Appendix B. We present runtime and computational complexity in Appendix D.

Table 1. Performance on MTEB benchmark. We report mean scores across tasks and task types. HyTE-FH performs best among the three models.

| Model | Mean (Task) | Mean (TaskType) |
| --- | --- | --- |
| EucBERT | 54.11 | 51.31 |
| HyTE-H\( {}^{\text{Euc}} \) | 54.57 | 53.71 |
| HyTE-FH | 56.41 | 53.75 |

5. Experiments and Results

5.1. Experimental Setup

Datasets. We pre-train our models using publicly available corpora following the data curation and filtering protocols introduced in nomic-embed (Nussbaum et al., 2025). For masked language modeling (MLM), we use the high-quality 2023 Wikipedia dump, which provides broad topical coverage and long-form text suitable for learning general-purpose semantic representations. For contrastive pre-training, we leverage approximately 235 million text pairs curated and filtered as described in (Nussbaum et al., 2025), designed to encourage semantic alignment across paraphrases and related content at scale. Finally, for task-specific fine-tuning, we use the training splits of the BEIR benchmark (Thakur et al., 2021), which comprises a diverse collection of retrieval tasks spanning multiple domains and query styles.

Evaluation Benchmarks. We evaluate our approach on two complementary benchmarks: (1) the Massive Text Embedding Benchmark (MTEB) (Muennighoff et al., 2023) to assess embedding quality across diverse tasks, and (2) RAGBench (Friel et al., 2024b) for end-to-end RAG system evaluation. For MTEB, we use the English subset of the benchmark. RAGBench evaluates RAG systems on domain-specific question-answering datasets including CovidQA, Cuad, Emanual, DelucionQA, and ExpertQA.

Baselines. We adopt different baseline strategies for our two models based on their training paradigms. For HyTE-FH, which is pre-trained from scratch, we train a fully Euclidean equivalent called EucBERT using the same architecture and training setup. This controlled comparison isolates the contribution of hyperbolic geometry. We also evaluate HyTE-H ( {}^{\mathrm{{Euc}}} ) , a hybrid hyperbolic model initialized with EucBERT. All three models are evaluated on MTEB and RAGBench. For HyTE-H ( {}^{\text{ bert }} ) , which is fine-tuned with modernbert-base (Warner et al., 2024) as the base model, we compare against state-of-the-art embedding models with fewer than 500M parameters, including gte-multilingual-base (Zhang et al., 2024), KaLM-embedding-multilingual-mini-v1 (Hu et al., 2025), and embeddinggemma-300m (Vera et al., 2025).

Metrics. For MTEB, we report mean scores across tasks and task types. For RAG evaluation, we measure three key metrics using RAGAS (Es et al., 2024): (1) Faithfulness, which assesses whether generated answers are grounded in the retrieved context; (2) Context Relevance, which measures how relevant the retrieved documents are to the query; and (3) Answer Relevance, which evaluates how well the generated answer addresses the user's question.

Table 2. RAG benchmark results comparing our model variants.

| Model | Average (F / CR / AR) | CovidQA | Cuad | Emanual | DelucionQA | ExpertQA |
|---|---|---|---|---|---|---|
| EucBERT | 0.596 / 0.798 / 0.647 | 0.685 / 0.863 / 0.582 | 0.654 / 0.644 / 0.641 | 0.642 / 0.646 / 0.674 | 0.525 / **0.968** / 0.679 | 0.475 / 0.872 / 0.662 |
| HyTE-H \( {}^{\text{ Euc }} \) | 0.706 / 0.814 / 0.739 | 0.708 / 0.868 / 0.668 | **0.787** / 0.652 / 0.710 | **0.679** / **0.835** / **0.814** | 0.737 / 0.857 / 0.773 | 0.623 / 0.859 / 0.728 |
| HyTE-FH | **0.732** / **0.848** / **0.765** | **0.764** / **0.916** / **0.694** | 0.747 / **0.674** / **0.752** | 0.660 / 0.807 / 0.704 | **0.789** / 0.906 / **0.861** | **0.702** / **0.936** / **0.814** |

( \mathrm{F} = ) Faithfulness, ( \mathrm{{CR}} = ) Context Relevance, ( \mathrm{{AR}} = ) Answer Relevance. Best results in bold.

Table 3. RAG benchmark results comparing our hybrid model with state-of-the-art embedding models. HyTE-H demonstrates competitive performance particularly in context relevance and answer relevance.

| Model | Average (F / CR / AR) | CovidQA | Cuad | Emanual | DelucionQA | ExpertQA |
|---|---|---|---|---|---|---|
| ModernBert* | 0.617 / 0.748 / 0.632 | 0.656 / 0.895 / 0.538 | 0.632 / 0.709 / 0.746 | 0.567 / 0.715 / 0.639 | 0.655 / 0.666 / 0.518 | 0.575 / 0.758 / 0.718 |
| GTE | 0.659 / 0.701 / 0.650 | 0.695 / 0.840 / 0.538 | 0.733 / 0.599 / 0.779 | 0.546 / 0.608 / 0.686 | 0.648 / 0.725 / 0.549 | 0.672 / 0.731 / 0.698 |
| Gemma | 0.603 / 0.735 / 0.684 | 0.685 / 0.760 / 0.497 | 0.724 / 0.600 / 0.778 | 0.555 / 0.884 / 0.687 | 0.612 / 0.643 / 0.705 | 0.442 / 0.791 / 0.755 |
| KaLM-mini-v1 | 0.624 / 0.719 / 0.591 | 0.656 / 0.787 / 0.528 | 0.742 / **0.789** / 0.716 | 0.565 / 0.776 / 0.616 | 0.553 / 0.581 / 0.573 | 0.607 / 0.666 / 0.522 |
| HyTE-H \( {}^{\text{ bert }} \) | **0.763** / **0.904** / **0.832** | **0.797** / **0.974** / **0.755** | **0.760** / 0.683 / **0.804** | **0.688** / **0.943** / **0.899** | **0.829** / **0.965** / **0.871** | **0.739** / **0.958** / **0.834** |

( \mathrm{F} = ) Faithfulness, ( \mathrm{{CR}} = ) Context Relevance, ( \mathrm{{AR}} = ) Answer Relevance. Best results in bold.

Implementation. We implement all hyperbolic models using HyperCore (He et al., 2025e) and train on NVIDIA H100 GPUs. All three models, HyTE-FH, HyTE-H, and EucBERT, share the same architecture, each containing 149M parameters with 12 transformer layers and 768-dimensional embeddings. For generation and judging, we use Llama-3.1-8B-Instruct (Weerawardhena et al., 2025). For RAG benchmarks, we fix the retrieval context window size to 5 for all models to ensure a controlled comparison; we additionally report ablations with larger context sizes in Appendix Table A3.

5.2. Results

MTEB Benchmark. Table 1 reports performance on the MTEB benchmark. HyTE-FH achieves the highest mean score across tasks (56.41), outperforming both EucBERT (54.11) and HyTE-H ( {}^{\mathrm{{Euc}}} ) (54.57). On the task-type mean, HyTE-FH and HyTE-H ( {}^{\mathrm{{Euc}}} ) perform comparably (53.75 and 53.71, respectively), with both surpassing EucBERT (51.31). These results demonstrate that hyperbolic representations not only improve RAG retrieval but also remain competitive on general-purpose embedding benchmarks. We present task-wise results in Table A1.

RAG Benchmark Results. Table 2 presents RAG benchmark results across five datasets. HyTE-FH achieves the best average performance across all three metrics: faithfulness (0.732), context relevance (0.848), and answer relevance (0.765). HyTE-H ( {}^{\mathrm{{Euc}}} ) ranks second overall, with both hyperbolic variants substantially outperforming EucBERT. On individual datasets, HyTE-FH leads on CovidQA, Cuad, DelucionQA, and ExpertQA, while HyTE-H ( {}^{\text{ Euc }} ) achieves the best context and answer relevance on Emanual. These results demonstrate that hyperbolic geometry consistently improves retrieval quality for RAG across diverse domains.

Table 3 reports RAG performance across five datasets. HyTE-H ( {}^{\text{ bert }} ) consistently outperforms strong Euclidean embedding baselines across all metrics, with particularly large gains in context relevance and answer relevance. These improvements indicate that hyperbolic representations are more effective at retrieving structurally relevant evidence, which is critical for downstream generation quality in RAG pipelines. In qualitative case studies shown in Appendix E.1, we observe that Euclidean models frequently fail to retrieve key supporting passages altogether, whereas hyperbolic models recover relevant evidence more reliably, leading to more faithful and contextually grounded answers.

Concept-Level Hierarchy Analysis. A central motivation for hyperbolic embeddings is their capacity to preserve hierarchical relationships (Section 4.2). To understand how models capture document hierarchy, we analyze learned radii (distances from the origin in the Poincaré ball) across five hierarchical levels, from Level 1 (most general, e.g., document-level topics) to Level 5 (most specific, e.g., fine-grained entities). Figure 4 presents these results. The fully hyperbolic model demonstrates clear hierarchical organization, with radii increasing monotonically from Level 1 (2.902) to Level 5 (3.488, +20.2%). The model thus naturally places general concepts near the origin and specific details toward the boundary, consistent with hyperbolic geometry, where proximity to the origin represents generality. Euclidean models show flat or decreasing distributions: baselines either maintain constant norms across levels or decrease them by 30%, reflecting an inverted structure. Hybrid models exhibit substantially larger radii in their hyperbolic component, and the fine-tuned hybrid increases from 116.9 to 146.7, showing that fine-tuning induces structured hierarchy. The concept-level hierarchy data for this case study is provided in Appendix C and attached in the supplementary material.
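The radius statistic in this analysis can be reproduced with a short sketch. We assume the standard curvature -1 Lorentz-to-Poincaré diffeomorphism; the helper names are illustrative, not the paper's tooling.

```python
import numpy as np

def poincare_radii(X_lorentz):
    """Distance from the origin in the Poincare ball for Lorentz points
    (curvature -1). Map: y = x_{1:} / (1 + x_0); then d(0, y) = 2 artanh(||y||),
    which coincides with the Lorentz origin distance arccosh(x_0).
    """
    y = X_lorentz[:, 1:] / (1.0 + X_lorentz[:, :1])
    r = np.linalg.norm(y, axis=1)
    return 2.0 * np.arctanh(np.clip(r, 0.0, 1.0 - 1e-9))

def radial_increase(r_general, r_specific):
    """Percent radial increase from general to specific concept levels."""
    return 100.0 * (r_specific.mean() - r_general.mean()) / r_general.mean()
```

Plugging in the reported Level 1 and Level 5 means (2.902 and 3.488) recovers the +20.2% figure quoted above.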


Figure 4. Empirical validation of hierarchical encoding. Left: Euclidean models show flat or decreasing norms. Middle: HyTE-H demonstrates increasing norms, with fine-tuning enhancing this trend. Right: HyTE-FH achieves a +20.2% total increase from L1 to L5. Bottom: Normalized comparison and percent-change summary highlighting the contrasting behaviors of different geometric approaches.

Ablation Studies. We compare two pooling strategies for aggregating token embeddings into document representations: CLS token pooling and OEM pooling. CLS pooling uses the representation of a special classification token, while OEM pooling performs geometry-aware aggregation directly in hyperbolic space. Table 4 shows that OEM pooling yields higher performance across both mean task and mean task-type metrics on MTEB retrieval tasks, indicating more effective document-level aggregation in the hyperbolic setting. We also show that using geodesic distance in the contrastive objective outperforms the Lorentz inner product (Appendix Table A2), suggesting better alignment of representations on the manifold. Additionally, hyperbolic models maintain strong performance with smaller retrieval budgets, whereas Euclidean baselines require larger context windows to achieve comparable results (Appendix Table A3).
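To make the pooling comparison concrete, the sketch below implements the classical Einstein midpoint in the Klein ball, the gamma-weighted mean underlying OEM; the paper's outward correction is its own contribution and is not reproduced here.

```python
import numpy as np

def einstein_midpoint(V):
    """Einstein midpoint of points in the Klein ball (||v|| < 1):
    a Lorentz-factor-weighted mean that stays inside the ball.
    This is the standard operator, without the paper's outward correction.
    """
    # Lorentz factor gamma_i = 1 / sqrt(1 - ||v_i||^2), shape (n, 1)
    gamma = 1.0 / np.sqrt(1.0 - np.sum(V**2, axis=1, keepdims=True))
    return (gamma * V).sum(axis=0) / gamma.sum()
```

Unlike a plain Euclidean mean, the Lorentz-factor weighting gives points near the boundary (high specificity, large gamma) more influence, which is why geometry-aware pooling behaves differently from CLS pooling in hyperbolic space.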

Table 4. Comparison of pooling strategies on MTEB tasks. OEM pooling leverages hyperbolic geometry for improved performance.

| Pooling Strategy | Mean (Task) | Mean (TaskType) |
|---|---|---|
| CLS Token | 49.33 | 48.90 |
| OEM | 56.41 | 53.75 |

6. Conclusion

We introduced hyperbolic dense retrieval for RAG, showing that aligning embedding geometry with the hierarchical structure of language improves faithfulness and answer quality. Our approach preserves document-level structure during aggregation through a geometry-aware pooling operator, addressing a key failure mode of Euclidean retrieval pipelines. Across evaluations, we observe consistent gains using models substantially smaller than current state-of-the-art retrievers, highlighting the effectiveness of hyperbolic inductive bias over scale alone. Case studies further show that hyperbolic representations organize documents by specificity through norm-based separation, a property absent in Euclidean embeddings. These findings suggest that embedding geometry is a central design choice for reliable retrieval in RAG systems, with implications for future scalable and multimodal retrieval architectures.