Revise the innovation points

龙澳
2026-04-03 18:56:07 +08:00
parent da913b6ccc
commit 19255963d6
10 changed files with 237 additions and 189 deletions

@@ -54,34 +54,25 @@ Retrieval Augmented Generation, Planetary Remote Sensing, Hypergraph, Hyperbolic
Large Language Models (LLMs) have emerged as powerful tools for natural language understanding and generation \cite{Cai25LLM}, and Retrieval Augmented Generation (RAG) has been established as a standard paradigm for grounding LLM responses in external knowledge bases \cite{Lewis20RAG}. By dynamically retrieving relevant documents and conditioning generation on retrieved context, RAG effectively mitigates the hallucination problem inherent in LLMs and enables knowledge-intensive question answering \cite{Zhou24hallucination}. The synergy between LLMs and Knowledge Graphs (KGs) has further advanced retrieval performance through structured knowledge representation, achieving notable improvements in multi-hop reasoning, credibility assessment, and interpretability \cite{Pan24KGandLLM}.
Nevertheless, deploying RAG systems for planetary science knowledge retrieval introduces domain-specific complexities that fundamentally challenge existing frameworks. Unlike conventional multi-source retrieval scenarios (e.g., integrating flight records, financial reports, or web documents), planetary observation data possesses two distinctive characteristics. First, all data sources are spatially grounded: each observation is anchored to a specific spatial footprint on the Martian surface, a temporal acquisition window parameterized by Solar Longitude ($L_s$), and instrument-specific parameters such as spectral bands and spatial resolution. The relevance between two observations is therefore governed not merely by textual semantic similarity, but primarily by physical spatial proximity, temporal co-occurrence, and cross-resolution complementarity. Second, inter-source inconsistencies in planetary science are not exclusively indicative of data errors or model hallucinations; rather, they frequently arise as inherent consequences of multi-platform, multi-scale observation and may encode critical scientific discoveries — such as subsurface geological evolution revealed by discrepancies between orbital spectroscopy and in-situ drilling results.
Recent advances in multi-source RAG, exemplified by MultiRAG \cite{Wu25MultiRAG}, have made significant progress in addressing data sparsity and inter-source inconsistency through multi-source line graphs and multi-level confidence computation. However, when confronted with planetary spatial data, these methods encounter two structural bottlenecks that cannot be resolved through parameter tuning alone.
Building upon the analysis of existing multi-source RAG limitations [14]-[16] in the context of planetary science, we identify the following failure modes that are unique to spatially grounded, physically observed multi-source data:
\begin{enumerate}
\item Spatial topology distortion: When multi-source observations share no common textual entities but are spatially co-located, discrete line graphs fail to establish connectivity, resulting in fragmented retrieval.
\item Scale hierarchy collapse: Observations at different spatial resolutions (e.g., 0.3 m vs. 460 m) exhibit a natural hierarchical containment structure that flat graph topologies cannot represent, leading to loss of cross-resolution context during aggregation.
\item Scientifically valuable conflict suppression: Confidence-based conflict filtering indiscriminately eliminates disagreeing nodes, destroying observational evidence that may indicate genuine geological phenomena such as subsurface mineral heterogeneity.
\end{enumerate}
These failure modes trace back to two fundamental scientific problems:
% TODO: still need to add a transition introducing multi-source data
Nevertheless, deploying RAG systems for planetary science knowledge retrieval introduces domain-specific complexities that fundamentally challenge existing frameworks. Recent advances in multi-source RAG, exemplified by MultiRAG \cite{Wu25MultiRAG}, have made significant progress through multi-source line graphs and multi-level confidence computation. However, when confronted with planetary spatial data, these methods encounter two fundamental problems that cannot be resolved through parameter tuning alone:
\begin{enumerate}
\item \textbf{Problem 1: Discrete Representation Failure for Continuous Spatiotemporal Topology.} Existing multi-source knowledge aggregation methods, such as multi-source line graphs [14], rely on discrete text entities and explicit semantic associations to construct graph topology. However, planetary science data is intrinsically embedded in continuous Euclidean physical space. Attempting to encode continuous spatial proximity and directional relationships within traditional discrete graph structures inevitably triggers an edge explosion problem: $k$ co-located spatial entities require $\binom{k}{2} = O(k^2)$ pairwise spatial proximity edges, thereby destroying the optimizations that existing graph models achieve for data sparsity. The discrete logical graph structure thus constitutes a structural bottleneck constraining planetary spatial reasoning capabilities, unable to bridge the chasm between physical continuity and semantic discreteness.
\item \textbf{Problem 2: Fundamental Conflict Between Scientific Cognitive Divergence and Traditional De-Falsification Mechanisms.} The core assumption underlying existing multi-source RAG frameworks is that inter-source data inconsistency typically originates from misinformation or model hallucinations, and these frameworks therefore rely on multi-level confidence computation to eliminate conflicting nodes [14], [17]. However, in deep-space exploration scenarios, the absence of absolute ground truth means that different observation platforms (e.g., orbiters versus rovers), due to differences in observation scale, penetration depth, and instrumental principles, often produce significantly conflicting results for the same target region. For instance, orbital spectrometers may detect surface hydrated minerals while in-situ drilling reveals no anomaly. This conflict arises not from data error but from the inherent multi-dimensional nature of scientific observation, and it may harbor clues to major discoveries such as geological evolution. Applying existing conflict-filtering mechanisms indiscriminately would cause severe over-smoothing, uniformly suppressing high-value scientific anomalies and fundamentally violating the epistemological principle of deep-space exploration: preserving controversy and enabling multi-source corroboration for knowledge discovery.
\item \textbf{The Spatial Topology Loss Problem.} Conventional multi-source retrieval systems judge relevance by textual semantic similarity. Planetary observations are different. Each observation is tied to a spatial footprint on the surface, a time window, and a set of instrument parameters. Two observations are relevant to each other mainly because they are spatially close, temporally overlapping, or captured at complementary resolutions. Existing methods such as multi-source line graphs \cite{Wu25MultiRAG} build graph topology from discrete text entities. This design creates a mismatch with continuous spatial data: $k$ co-located entities need $\binom{k}{2} = O(k^2)$ pairwise edges to represent their spatial relationships. The resulting edge explosion removes the sparsity that these graph models rely on. In short, the discrete graph structure cannot bridge the gap between physical continuity and semantic discreteness.
\item \textbf{The Conflict Over-Smoothing Problem.} Existing multi-source RAG frameworks treat inter-source inconsistency as misinformation or hallucination. They use confidence scores to remove conflicting nodes \cite{Wu25MultiRAG}, \cite{Wang25Astute}. In planetary science, however, different platforms naturally produce different measurements for the same target. An orbiter and a rover observe at different scales, depths, and wavelengths. For example, an orbital spectrometer may detect hydrated minerals on the surface, while an in-situ drill finds olivine-carbonate assemblages below. This conflict does not come from data error. It reflects geological evolution across depth. If we apply uniform conflict filtering, the system suppresses these scientifically valuable signals together with genuine noise. This over-smoothing violates a core principle of deep-space exploration: observational disagreements should be preserved, because they may lead to new discoveries through multi-source comparison.
\end{enumerate}
To address these two fundamental challenges, we propose AreoRAG, a novel framework specifically designed for multi-source planetary spatial data retrieval augmented generation. AreoRAG introduces two synergistic innovations. First, to resolve Problem 1, we construct a Hyperbolic Spatial Hypergraph (HySH) that employs $n$-ary spatial observation hyperedges to bind co-located multi-source observations into single high-order facts, reducing edge complexity from $O(k^2)$ to $O(k)$. These hyperedges are embedded in hyperbolic space via the Lorentz model, where the exponential volume growth of negative-curvature geometry naturally accommodates the hierarchical scale structure of planetary observations: coarse-resolution global data resides near the origin while fine-resolution local data extends toward the boundary. Second, to resolve Problem 2, we develop a Physics-Informed Conflict Triage (PICT) mechanism that replaces the uniform conflict-filtering paradigm with a differentiated triage approach. PICT detects inter-source conflicts through cross-source interaction entropy, classifies each conflict into one of four physically grounded categories (noise, instrument-inherent, scale-dependent, temporal-evolution), and applies category-specific confidence recalibration, filtering genuine noise while provably preserving and even boosting the confidence of scientifically valuable observational disagreements. Together, HySH provides spatially faithful multi-source evidence to PICT, while PICT feeds back triage results to prioritize scientifically interesting regions in subsequent retrieval, forming a tightly coupled framework.
To address these two challenges, we propose AreoRAG, a framework designed for multi-source planetary spatial data retrieval augmented generation. AreoRAG introduces two innovations. We first construct a \textbf{Hyperbolic Spatial Hypergraph (HySH)} to resolve the spatial topology loss problem. HySH uses $n$-ary spatial observation hyperedges to group co-located multi-source observations into single high-order facts. This design reduces edge complexity from $O(k^2)$ to $O(k)$. We embed these hyperedges in hyperbolic space via the Lorentz model. The exponential volume growth of negative-curvature geometry naturally fits the hierarchical scale structure of planetary observations. Coarse-resolution global data resides near the origin, while fine-resolution local data extends toward the boundary. To resolve the conflict over-smoothing problem, we develop a \textbf{Physics-Informed Conflict Triage (PICT)} mechanism. PICT replaces uniform conflict filtering with a differentiated triage strategy. It first detects inter-source conflicts through cross-source interaction entropy. Then it classifies each conflict into one of four physically grounded categories: noise, instrument-inherent, scale-dependent, and temporal-evolution. Finally, it applies category-specific confidence recalibration, filtering genuine noise while provably preserving scientifically valuable observational disagreements. The two modules form a tightly coupled loop. HySH provides spatially faithful multi-source evidence to PICT, while PICT feeds back triage results to prioritize scientifically interesting regions in subsequent retrieval.
The contributions of this paper are summarized as follows:
\begin{enumerate}
\item Hyperbolic Spatial Hypergraph Construction: We introduce HySH, a knowledge construction module that employs $n$-ary spatial observation hyperedges embedded in hyperbolic space to achieve unified spatiotemporal representation of multi-source planetary data. By coupling spatial resolution with hyperbolic radial depth via the Lorentz model, HySH faithfully preserves the hierarchical scale structure of planetary observations while eliminating edge explosion through high-order relational encoding. A resolution-aware Spatial Outward Einstein Midpoint (Spatial OEM) aggregation operator is further proposed to prevent hierarchical collapse during cross-resolution evidence fusion, with a formal guarantee of outward bias.
\item Physics-Informed Conflict Triage: We propose PICT, a retrieval module that fundamentally redefines the role of inter-source conflict in RAG systems. Through cross-source interaction entropy for conflict detection, a physically grounded four-category conflict classification informed by observation geometry, and differentiated confidence recalibration, PICT provably prevents the over-smoothing of scientifically valuable disagreements (Anti-Over-Smoothing Guarantee) while maintaining noise-filtering capability. To the best of our knowledge, this is the first conflict-handling mechanism in RAG that explicitly distinguishes between erroneous inconsistency and scientifically meaningful observational divergence.
\item Integrated Framework and Experimental Validation: We design the AreoRAG Prompting (ARP) algorithm that integrates HySH and PICT through three explicit coupling points: spatial alignment as a prerequisite for interaction entropy computation, radial depth difference as a resolution disparity signal for conflict classification, and triage-driven retrieval priority feedback. Extensive experiments on multi-source planetary observation datasets demonstrate that AreoRAG significantly outperforms existing multi-source RAG methods in both retrieval fidelity and scientific faithfulness, with particular advantages in scenarios involving cross-resolution reasoning and observation-grounded conflict preservation.
\item{We propose a Hyperbolic Spatial Hypergraph (HySH) construction module for multi-source planetary data that combines the $n$-ary hyperedge representation from hypergraph-based RAG \cite{placeholder_HyperRAG} with the Lorentz-model hyperbolic embedding from hyperbolic knowledge graph methods \cite{placeholder_HypRAG}. HySH couples spatial resolution with hyperbolic radial depth so that the hierarchical scale structure of planetary observations is preserved, while edge complexity is reduced from $O(k^2)$ to $O(k)$. We further propose a resolution-aware Spatial Outward Einstein Midpoint (Spatial OEM) aggregation operator with a formal guarantee of outward bias.}
\item{We propose a Physics-Informed Conflict Triage (PICT) mechanism for multi-source retrieval that adapts the entropy-based conflict detection from \cite{placeholder_TruthfulRAG} and builds on the finding in \cite{placeholder_Diagnosing} that knowledge conflicts are linearly separable. PICT classifies each inter-source conflict into one of four physically grounded categories (noise, instrument-inherent, scale-dependent, temporal-evolution) and applies category-specific confidence recalibration. We provide a formal Anti-Over-Smoothing Guarantee showing that scientifically valuable disagreements are preserved. To the best of our knowledge, this is the first conflict-handling mechanism in RAG that explicitly distinguishes erroneous inconsistency from scientifically meaningful observational divergence.}
\item{We design the AreoRAG Prompting (ARP) algorithm that integrates HySH and PICT through three coupling points: spatial alignment as a prerequisite for interaction entropy computation, radial depth difference as a resolution disparity signal for conflict classification, and triage-driven retrieval priority feedback. Experiments on three Mars observation datasets show that AreoRAG outperforms existing multi-source RAG methods in both retrieval accuracy and conflict preservation.}
\end{enumerate}
\section{Preliminary}