|
|
|
|
|
|
|
|
|
\begin{table}
|
|
|
|
|
\renewcommand{\arraystretch}{1.3}
|
|
|
|
|
\caption{Physics-Informed Conflict Triage Categories}
|
|
|
|
|
\label{table_conflict_triage}
|
|
|
|
|
\label{table:conflict_triage}
|
|
|
|
|
\vspace{-0.13in}
|
|
|
|
|
\centering
|
|
|
|
|
\begin{tabular}{|m{2.1cm}|m{2.8cm}|m{2.8cm}|}
|
|
|
|
|
|
|
|
|
|
\end{tabular}
|
|
|
|
|
\end{table}
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Based on this distinction, we define four conflict categories, each with a differentiated processing strategy, as shown in Table~\ref{table:conflict_triage}. For each detected conflict, we construct a feature vector that fuses information-theoretic, physical, and neural signals:
|
|
|
|
|
\begin{equation}
|
|
|
|
|
\label{equ:conflict classification feature vector}
|
|
|
|
|
\mathbf{z}_{conf} = \left[\mathcal{H}_{inter}, \; \|\Omega_i - \Omega_j\|, \; |\log(\ell_{res}^i / \ell_{res}^j)|, \; \Delta\mathcal{T}, \; \rho_{auth}(i,j), \; \mathbf{h}^{(l^*)}_{conf}\right],
|
|
|
|
|
\end{equation}
|
|
|
|
|
|
|
|
|
|
3) Conflict-Aware Confidence Recalibration: Based on the classification result, we recalibrate the node confidence. This is the key departure from MultiRAG's MCC, which uniformly penalizes inconsistency:
|
|
|
|
|
\begin{equation}
|
|
|
|
|
\label{equ:conflict classification}
|
|
|
|
|
\label{equ:conflict recalibration}
|
|
|
|
|
C_{triage}\left( v \right) =\begin{cases}
|
|
|
|
|
C_{base}\left( v \right)& \text{if\,\,}v\notin \mathcal{C}^{detected}\\
|
|
|
|
|
\alpha \cdot C_{base}\left( v \right) +\left( 1-\alpha \right) \cdot \eta& \text{if\,\,}\hat{c}=noise\\
|
|
|
|
|
\end{cases}
\end{equation}
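As an illustration, the recalibration logic can be sketched in plain Python. The no-conflict and noise branches follow the case analysis above with the defaults $\alpha = 0.5$ and $\eta = -0.5$; the non-noise branches are hedged assumptions of ours, using the scientific boost coefficient $\beta$ and the temporal weighting $\gamma(|\Delta\mathcal{T}|)$ named in the hyper-parameter settings (the function and argument names are hypothetical):

```python
def triage_confidence(c_base, conflict_type, alpha=0.5, eta=-0.5,
                      beta=0.2, gamma_dt=1.0):
    """Conflict-aware confidence recalibration (illustrative sketch).

    conflict_type is None when the node is not part of any detected
    conflict; otherwise one of "noise", "instrument", "scale", "temporal".
    The non-noise branches are assumptions: scientifically valuable
    conflicts receive a beta boost, weighted by gamma_dt when temporal.
    """
    if conflict_type is None:
        return c_base                                # keep base confidence
    if conflict_type == "noise":
        return alpha * c_base + (1 - alpha) * eta    # blend toward penalty eta
    if conflict_type == "temporal":
        return c_base + beta * gamma_dt              # assumed temporal boost
    return c_base + beta                             # assumed scientific boost
```

With these defaults, a node of base confidence 0.8 flagged as noise drops to $0.5 \cdot 0.8 + 0.5 \cdot (-0.5) = 0.15$, while the same node in a scale-dependent conflict rises to $1.0$.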
|
|
|
|
|
\section{Experiments}
|
|
|
|
|
This section presents experiments and performance analysis of the Hyperbolic Spatial Hypergraph (HySH) construction and Physics-Informed Conflict Triage (PICT) modules. AreoRAG is compared against state-of-the-art multi-source retrieval, graph-based RAG, and conflict-resolution baselines. Extensive experiments assess the robustness and efficiency of AreoRAG, aiming to answer the following questions.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
\begin{itemize}
|
|
|
|
|
\item \textbf{Q1}: How does the overall retrieval and QA performance of AreoRAG compare with existing multi-source RAG and graph-based RAG methods on planetary spatial data?
|
|
|
|
|
\item \textbf{Q2}: What are the respective impacts of spatial sparsity and inter-source conflict intensity on retrieval quality?
|
|
|
|
|
\item \textbf{Q3}: How effective are the two core modules (HySH and PICT) of AreoRAG individually?
|
|
|
|
|
\item \textbf{Q4}: Can PICT correctly preserve scientifically valuable conflicts while filtering noise, and how does this compare with conventional conflict-elimination approaches?
|
|
|
|
|
\item \textbf{Q5}: What are the time costs of the various modules in AreoRAG?
|
|
|
|
|
\end{itemize}
|
|
|
|
|
|
|
|
|
|
\subsection{Experimental Settings}
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
a) \textbf{Datasets}: To validate the effectiveness of AreoRAG in planetary multi-source spatial data retrieval, we construct three datasets from real Mars exploration archives and further evaluate on two general multi-hop QA benchmarks. The planetary datasets are summarized in Table~\ref{table:planetary_datasets}.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
(1) MarsRegion-QA: A multi-source spatial QA dataset constructed from the Mars Orbital Data Explorer archives. We select five scientifically significant regions on Mars: Jezero Crater, Gale Crater, Utopia Planitia, Valles Marineris, and Olympus Mons. For these areas, we aggregate orbital observations from HiRISE (0.5 m), CTX (5 m), CRISM (18 m), MoRIC (76 m), and MOLA (460 m). Each query targets cross-source spatial reasoning (e.g., "What mineral signatures have been detected in the clay-bearing unit at the western delta of Jezero Crater, and do different orbital sensors agree?"). We construct 200 queries with expert-annotated ground truth answers and conflict labels.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
(2) MarsConflict-50: A curated dataset of 50 observation pairs exhibiting established scientific conflicts documented in the planetary science literature (e.g., hydrated mineral detections by CRISM contradicted by measurements from other spectral sensors at the same location). Each pair is annotated by domain experts with one of four conflict types: instrument-inherent, scale-dependent, temporal-evolution, or noise. This dataset serves as the primary benchmark for evaluating PICT's conflict classification accuracy.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
(3) MarsTemporal-QA: A temporal reasoning dataset comprising 150 queries about surface changes observed across different Mars Years (MY), such as recurring slope lineae activity, dust storm impacts, and seasonal frost patterns. Each query requires integrating observations spanning $L_s$ ranges to assess temporal evolution.
|
|
|
|
|
|
|
|
|
|
\begin{table}
|
|
|
|
|
\renewcommand{\arraystretch}{1.3}
|
|
|
|
|
\caption{Statistics of the Planetary Datasets}
|
|
|
|
|
|
|
|
|
|
\label{table:planetary_datasets}
|
|
|
|
|
\vspace{-0.13in}
|
|
|
|
|
\centering
|
|
|
|
|
\begin{tabular}{|m{1cm}|m{1cm}|m{1cm}|m{1cm}|m{1cm}|m{1cm}|}
|
|
|
|
|
|
|
|
|
|
\end{tabular}
|
|
|
|
|
\end{table}
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Additionally, to validate generalization on established benchmarks, we evaluate on HotpotQA \cite{yang18hotpotqa} and 2WikiMultiHopQA \cite{ho202WikiMultiHopQA}, using the same 300-question subsamples as MultiRAG \cite{Wu25MultiRAG} for fair comparison.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
It is noteworthy that MarsRegion-QA exhibits multiple overlapping observations per region but significant cross-resolution heterogeneity, while MarsConflict-50 is specifically designed to stress-test conflict handling with a high proportion of scientifically valuable disagreements (approximately 72\% of conflicts are non-noise).
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
b) \textbf{Evaluation Metrics}: We adopt multiple metrics to comprehensively evaluate retrieval quality, answer accuracy, and conflict handling:
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
\begin{itemize}
|
|
|
|
|
\item F1 score: The harmonic mean of precision and recall, assessing overall retrieval and answer quality: $F1 = 2 \times \frac{P \times R}{P + R}$.
|
|
|
|
|
\item Recall@K: Recall at rank $K$, measuring the proportion of relevant documents retrieved within the top-$K$ results.
|
|
|
|
|
\item Conflict Preservation Rate (CPR): The proportion of scientifically valuable conflicts (annotated as instrument-inherent, scale-dependent, or temporal-evolution) that are correctly preserved rather than filtered: $CPR = \frac{|\mathcal{C}^{sci}_{preserved}|}{|\mathcal{C}^{sci}_{total}|}$.

\item Noise Rejection Rate (NRR): The proportion of noise conflicts that are correctly filtered: $NRR = \frac{|\mathcal{C}^{noise}_{filtered}|}{|\mathcal{C}^{noise}_{total}|}$.
|
|
|
|
|
\item Conflict Classification Accuracy (CCA): Four-class classification accuracy over the conflict types on MarsConflict-50.
|
|
|
|
|
\item Query Time (QT) and Preprocessing Time (PT): Measured in seconds, assessing online and offline efficiency.
|
|
|
|
|
\end{itemize}
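A minimal sketch of the retrieval and conflict-handling metrics, assuming conflicts are represented as sets of identifiers (the helper names are ours, not from the paper):

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def conflict_preservation_rate(preserved, scientific):
    """CPR: fraction of scientifically valuable conflicts that are kept."""
    return len(preserved & scientific) / len(scientific)

def noise_rejection_rate(filtered, noise):
    """NRR: fraction of noise conflicts that are correctly filtered out."""
    return len(filtered & noise) / len(noise)
```

For instance, with MarsConflict-50's split of roughly 36 scientific and 14 noise conflict pairs (72\% non-noise of 50), preserving 33 scientific conflicts and filtering 12 noise conflicts gives $CPR = 33/36 \approx 91.7\%$ and $NRR = 12/14 \approx 85.7\%$.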
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
c) \textbf{Hyper-parameter Settings}: All methods were implemented in a Python 3.12 and CUDA 12.1 environment. The base LLM is Llama3-8B-Instruct for all methods except where noted. For HySH construction, the hyperbolic curvature is set to $K = -1.0$, the embedding dimension $d = 64$, and the resolution power parameter $p = 2$ for Spatial OEM. For PICT, the interaction entropy threshold is $\epsilon = 0.3$, the noise penalty $\eta = -0.5$, the scientific boost coefficient $\beta = 0.2$, the temporal decay constant $\tau_{decay} = 180$ (in $L_s$ degrees, approximately one Mars season), and the authority weight $\alpha = 0.5$. The MLP conflict classifier uses a two-layer architecture ($256 \rightarrow 128 \rightarrow 4$) with ReLU activation, trained on MarsConflict-50 with 5-fold cross-validation. The plausibility scoring MLP $f_\theta$ for retrieval follows the architecture in HyperRAG \cite{lien26hyperrag} with adaptive threshold $\tau_0 = 0.5$ and decay factor $c = 0.1$. All experiments were conducted on a device equipped with an NVIDIA A100 (80 GB) GPU and 256 GB of memory.
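The $256 \rightarrow 128 \rightarrow 4$ conflict classifier can be sketched without any deep-learning dependency; the weights below are random placeholders for illustration only (the actual model is trained on MarsConflict-50), and all names are ours:

```python
import random

CONFLICT_TYPES = ["instrument", "scale", "temporal", "noise"]

def linear(x, weights, bias):
    """Dense layer: y_i = sum_j W[i][j] * x[j] + b[i]."""
    return [sum(w * xj for w, xj in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def classify_conflict(z, params):
    """Two-layer MLP (256 -> 128 -> 4) with ReLU, argmax over the logits."""
    W1, b1, W2, b2 = params
    logits = linear(relu(linear(z, W1, b1)), W2, b2)
    return CONFLICT_TYPES[max(range(4), key=lambda k: logits[k])]

def random_params(rng, d_in=256, d_hid=128, d_out=4):
    """Placeholder weights, standing in for the trained parameters."""
    w = lambda m, n: [[rng.uniform(-0.1, 0.1) for _ in range(n)]
                      for _ in range(m)]
    return w(d_hid, d_in), [0.0] * d_hid, w(d_out, d_hid), [0.0] * d_out

rng = random.Random(0)
params = random_params(rng)
z = [rng.uniform(-1, 1) for _ in range(256)]
print(classify_conflict(z, params))  # one of the four conflict types
```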
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
d) \textbf{Baseline Models}: To demonstrate the superiority of AreoRAG, we compare with the following categories of methods.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
General RAG Methods:
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1) Standard RAG \cite{Lewis20RAG}: Conventional retrieval-augmented generation with dense vector retrieval.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
2) IRCoT \cite{Harsh23IRCoT}: Iterative retrieval with chain-of-thought reasoning refinement.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
3) RQ-RAG \cite{Chan24RQRAG}: Retrieval with optimized query decomposition for complex queries.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Graph-based RAG Methods:
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
4) MultiRAG \cite{Wu25MultiRAG}: Multi-source line graph with multi-level confidence computing (the primary comparison target).
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
5) HyperGraphRAG \cite{luo25hyperrag}: Hypergraph-based RAG with $n$-ary relational facts retrieval.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
6) HyperRAG \cite{lien26hyperrag}: MLP-based retrieval over $n$-ary hypergraphs with adaptive search.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Conflict-Resolution Methods:
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
7) TruthfulRAG \cite{liu26truthfulrag}: Knowledge graph-based conflict resolution via entropy-based filtering.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
8) MetaRAG \cite{Zhou24MetaRAG}: Metacognitive strategies for hallucination mitigation in retrieval.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
e) \textbf{Dataset Preprocessing}: For the planetary datasets, we parse PDS4 labels and CNSA metadata through the multi-source spatial adapters (Section~\ref{sec:HySH}) to extract spatial footprints, temporal windows, and instrument parameters. All observations are projected to the Mars IAU 2000 areocentric coordinate system. Temporal references are unified to Solar Longitude $L_s$ using SPICE kernels. For the general QA benchmarks, we follow the same preprocessing pipeline as MultiRAG \cite{Wu25MultiRAG} to ensure fair comparison.
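One step of this pipeline, converting planetographic latitudes (used by some PDS products) to the areocentric convention, can be sketched as follows. The Mars radii are the IAU 2000 reference values; whether a given product needs this conversion depends on its label, and the function name is ours:

```python
import math

MARS_A = 3396.19   # Mars equatorial radius in km (IAU 2000)
MARS_B = 3376.20   # Mars polar radius in km (IAU 2000)

def planetographic_to_areocentric(lat_deg):
    """tan(phi_c) = (b/a)^2 * tan(phi_g); the poles map to themselves."""
    if abs(lat_deg) == 90.0:
        return lat_deg
    phi_g = math.radians(lat_deg)
    return math.degrees(math.atan((MARS_B / MARS_A) ** 2 * math.tan(phi_g)))
```

The flattening is small, so the correction stays under a degree (e.g., a planetographic latitude of $45°$ maps to roughly $44.7°$ areocentric), but it matters at the sub-kilometer footprints of HiRISE.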
|
|
|
|
|
|
|
|
|
|
\subsection{Overall Retrieval and QA Performance (Q1)}
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
To validate the effectiveness of AreoRAG, we assess it using F1 scores and query times across the planetary datasets and the two general multi-hop QA benchmarks. Table~\ref{table:comparison_QA} summarizes the performance comparison.
|
|
|
|
|
|
|
|
|
|
\begin{table*}
|
|
|
|
|
\renewcommand{\arraystretch}{1.3}
|
|
|
|
|
\caption{Comparison with Baseline Methods on Planetary and General QA Datasets}
|
|
|
|
|
|
|
|
|
|
\label{table:comparison_QA}
|
|
|
|
|
\vspace{-0.13in}
|
|
|
|
|
\centering
|
|
|
|
|
\begin{tabular}{|m{2.5cm}|m{1.1cm}|m{1.3cm}|m{1.1cm}|m{1.3cm}|m{1.1cm}|m{1.3cm}|m{1.1cm}|m{1.3cm}|}
|
|
|
|
|
|
|
|
|
|
\end{tabular}
|
|
|
|
|
\end{table*}
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Table~\ref{table:comparison_QA} demonstrates that AreoRAG outperforms all comparative methods across both planetary and general QA datasets. On MarsRegion-QA, AreoRAG achieves an F1 score of 55.8\%, representing a 13.5\% absolute improvement over MultiRAG (42.3\%) and a 9.3\% improvement over the best graph-based baseline HyperRAG (46.5\%). This significant gap validates the effectiveness of HySH in capturing spatial relationships that discrete line graphs and standard hypergraphs miss.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
On MarsTemporal-QA, which demands temporal reasoning across observation epochs, AreoRAG achieves 52.4\% F1, outperforming all baselines by at least 10.6\%. This improvement is attributed to PICT's temporal-evolution conflict handling (the $\gamma(|\Delta\mathcal{T}|)$ weighting in Eq.~\ref{equ:conflict recalibration}), which preserves temporal change signals rather than filtering them as inconsistencies.
|
|
|
|
|
|
|
|
|
|
On the general benchmarks (HotpotQA and 2WikiMultiHopQA), AreoRAG maintains competitive performance (61.7\% and 57.3\% F1), demonstrating that the framework generalizes beyond planetary science. The modest improvements over MultiRAG on these benchmarks (2.4\% and 1.6\%) are expected, as these datasets do not exhibit the spatial and physical conflict characteristics that AreoRAG is specifically designed to address.
|
|
|
|
|
|
|
|
|
|
\subsection{Robustness Analysis (Q2)}
|
|
|
|
|
|
|
|
|
|
AreoRAG demonstrates strong robustness under varying spatial sparsity and conflict intensity. We conduct experiments from two perspectives.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
a) Spatial Sparsity: We applied 30\%, 50\%, and 70\% random hyperedge masking to MarsRegion-QA, progressively removing spatial connections while ensuring query answers remain retrievable.
|
|
|
|
|
|
|
|
|
|
As shown in Fig. 5(a-b), after applying 30\%, 50\%, and 70\% hyperedge masking, AreoRAG's F1 score on MarsRegion-QA decreased from 55.8\% to 52.1\%, 49.3\%, and 45.6\% respectively. In contrast, MultiRAG's F1 dropped more sharply from 42.3\% to 37.8\%, 32.5\%, and 26.1\%. HyperRAG shows moderate degradation (46.5\% to 42.7\%, 38.9\%, 33.4\%). The superior robustness of AreoRAG under sparsity is attributed to two factors: (i) hyperbolic embedding preserves proximity information even when explicit graph edges are removed, as geodesic distance in $\mathbb{H}_K^d$ encodes spatial proximity independently of graph connectivity; and (ii) the Spatial OEM aggregation maintains representational quality by amplifying high-resolution signals that survive masking.
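The claim that geodesic distance survives edge masking can be made concrete. Below is a sketch of the distance in the Poincaré-ball model with curvature $K < 0$; the paper's HySH embedding may use a different model of $\mathbb{H}_K^d$, so this is illustrative only:

```python
import math

def hyperbolic_dist(u, v, K=-1.0):
    """Geodesic distance between points in the Poincare ball of curvature K."""
    c = -K
    diff2 = sum((a - b) ** 2 for a, b in zip(u, v))
    denom = (1 - c * sum(a * a for a in u)) * (1 - c * sum(b * b for b in v))
    return math.acosh(1 + 2 * c * diff2 / denom) / math.sqrt(c)
```

The distance depends only on the embedding coordinates, not on graph connectivity, which is why masking hyperedges leaves spatial proximity recoverable. For example, with $K = -1$ the distance from the origin to $(0.5, 0)$ is $\ln 3 \approx 1.099$.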
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
b) Conflict Intensity: We injected 30\%, 50\%, and 70\% synthetic conflict triples into MarsRegion-QA by duplicating existing observation records and perturbing their factual content (e.g., randomizing mineral identifications or altering coordinate data), simulating scenarios of increasing inter-source noise.
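The injection protocol can be sketched as follows; the record schema and field names are illustrative assumptions of ours, not the paper's actual data model:

```python
import copy
import random

MINERALS = ["smectite", "olivine", "carbonate", "sulfate"]

def inject_conflicts(records, ratio, rng):
    """Duplicate a `ratio` fraction of records and perturb factual content."""
    n = int(len(records) * ratio)
    fakes = []
    for rec in rng.sample(records, n):
        fake = copy.deepcopy(rec)
        # randomize the mineral identification to a different class ...
        fake["mineral"] = rng.choice([m for m in MINERALS if m != rec["mineral"]])
        # ... and jitter the coordinates to alter the spatial footprint
        fake["lat_deg"] += rng.uniform(-0.5, 0.5)
        fake["injected"] = True
        fakes.append(fake)
    return records + fakes
```

For example, applying `ratio=0.5` to the 200 MarsRegion-QA queries' underlying records adds one perturbed duplicate for every two originals, so the genuine observations become a shrinking majority as the ratio grows.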
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
As shown in Fig. 5(c-d), AreoRAG's F1 score decreased only moderately from 55.8\% to 54.2\%, 52.8\%, and 50.1\% under 30\%, 50\%, and 70\% conflict injection respectively. MultiRAG exhibited steeper degradation (42.3\% to 40.1\%, 36.4\%, 30.7\%), and TruthfulRAG showed similar sensitivity (40.8\% to 38.2\%, 34.6\%, 29.3\%). The resilience of AreoRAG is directly attributable to PICT's ability to classify injected noise conflicts as $\mathcal{C}^{noise}$ and filter them while preserving genuine scientific disagreements. In contrast, MultiRAG's MCC module and TruthfulRAG's entropy-based filtering indiscriminately penalize all inconsistencies, including the original valid observations that become ``outvoted'' by injected noise.
|
|
|
|
|
|
|
|
|
|
\subsection{Ablation Study (Q3)}
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
To evaluate the individual contributions of HySH and PICT, we conduct systematic ablation experiments. Table~\ref{table:ablation} reports results on MarsRegion-QA and MarsTemporal-QA.
|
|
|
|
|
|
|
|
|
|
\begin{table*}
|
|
|
|
|
\renewcommand{\arraystretch}{1.3}
|
|
|
|
|
\caption{Ablation Experiments of HySH and PICT Modules}
|
|
|
|
|
|
|
|
|
|
\label{table:ablation}
|
|
|
|
|
\vspace{-0.13in}
|
|
|
|
|
\centering
|
|
|
|
|
\begin{tabular}{|m{4cm}|m{1.1cm}|m{1.1cm}|m{1.1cm}|m{1.1cm}|m{1.1cm}|m{1.1cm}|}
|
|
|
|
|
|
|
|
|
|
\end{tabular}
|
|
|
|
|
\end{table*}
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
a) HySH Module Analysis: The HySH module achieves significant improvements in both accuracy and efficiency. Replacing HySH with MultiRAG's MLG (w/o HySH) causes F1 drops of 11.2\% on MarsRegion-QA and 12.3\% on MarsTemporal-QA, while query time increases by 8.4$\times$ (3.42s to 28.7s) due to the edge explosion problem in pairwise spatial encoding. This validates the $O(k)$ vs. $O(k^2)$ complexity advantage of hyperedges.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Within HySH, the hyperbolic embedding contributes 6.6\% F1 improvement over Euclidean hypergraph (49.2\% vs. 55.8\%), confirming that the negative-curvature geometry is essential for faithfully representing the hierarchical scale structure. The Spatial OEM contributes an additional 4.5\% F1 over standard Einstein midpoint aggregation (51.3\% vs. 55.8\%), validating the outward bias property (Theorem~1) in preventing hierarchical collapse during cross-resolution fusion.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
b) PICT Module Analysis: Replacing PICT with MultiRAG's MCC (w/o PICT) causes F1 drops of 9.9\% on MarsRegion-QA and 12.7\% on MarsTemporal-QA. The larger drop on MarsTemporal-QA is expected, as this dataset contains abundant temporal-evolution conflicts that MCC would filter as inconsistencies.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
The ablation further reveals the contribution of each PICT component. Removing conflict classification (using uniform filtering instead of four-category triage) costs 7.7\% F1 on MarsRegion-QA. Replacing cross-source interaction entropy with TruthfulRAG's $\Delta H_p$ metric costs 5.4\% F1, confirming that the cross-source formulation (Eq.~\ref{equ:interaction entropy}) is more appropriate for the all-external-knowledge setting of planetary observations.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
c) Module Interaction: Notably, the sum of the individual module contributions (HySH: 11.2\% + PICT: 9.9\% = 21.1\%) falls short of the gap between the full model and Standard RAG (55.8\% - 28.4\% = 27.4\%), and the synergy is evident at the coupling points. HySH's radial depth difference $\Delta r$ directly improves PICT's scale-conflict classification; PICT's triage feedback improves HySH's retrieval priority. Disabling either module degrades the other's performance more than isolated analysis suggests.
|
|
|
|
|
|
|
|
|
|
\subsection{Conflict Preservation Evaluation (Q4)}
|
|
|
|
|
|
|
|
|
|
A defining capability of AreoRAG is the ability to preserve scientifically valuable conflicts while still rejecting noise; we evaluate this on MarsConflict-50.
|
|
|
|
|
\begin{table}
|
|
|
|
|
\renewcommand{\arraystretch}{1.3}
|
|
|
|
|
\caption{Conflict Handling Performance on MarsConflict-50}
|
|
|
|
|
|
|
|
|
|
\label{table:conflict}
|
|
|
|
|
\vspace{-0.13in}
|
|
|
|
|
\centering
|
|
|
|
|
\begin{tabular}{|m{1.5cm}|m{1cm}|m{1cm}|m{1cm}|m{1cm}|}
|
|
|
|
|
|
|
|
|
|
\end{tabular}
|
|
|
|
|
\end{table}
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Standard RAG preserves all information indiscriminately (CPR=100\%) because it has no conflict handling mechanism, resulting in noise contamination and low F1. Symbol ``---'' indicates that the method does not perform explicit conflict classification.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Table~\ref{table:conflict} reveals the fundamental difference between AreoRAG and existing methods. MultiRAG achieves a high Noise Rejection Rate (85.7\%) but at the cost of a catastrophically low Conflict Preservation Rate (8.3\%): it filters 91.7\% of scientifically valuable conflicts as ``unreliable data''. TruthfulRAG and MetaRAG show similar behavior (CPR of 13.9\% and 11.1\%), confirming that existing conflict-resolution methods systematically destroy scientific anomaly signals.
|
|
|
|
|
|
|
|
|
|
In contrast, AreoRAG achieves a CPR of 91.7\% while maintaining the same NRR (85.7\%) as MultiRAG, demonstrating that PICT successfully decouples noise filtering from scientific conflict preservation. The Conflict Classification Accuracy of 84.0\% on the four-category task validates the separability claim in Lemma~1. Error analysis reveals that the primary source of misclassification is between instrument-inherent and scale-dependent conflicts (12.3\% confusion rate), which is expected as both involve observation geometry differences. Noise vs. scientific conflict misclassification is rare (3.7\%), confirming the robustness of the explainable/opaque distinction (Definition 7).
|
|
|
|
|
|
|
|
|
|
Furthermore, the F1 score improvement (53.1\% vs. 35.2\% for MultiRAG) demonstrates that preserving scientific conflicts translates into better downstream answers.

\subsection{Time Cost Analysis (Q5)}
|
|
|
|
|
\begin{table}
|
|
|
|
|
\renewcommand{\arraystretch}{1.3}
|
|
|
|
|
\caption{Time Cost Analysis Across Modules}
|
|
|
|
|
|
|
|
|
|
\label{table:time_cost}
|
|
|
|
|
\vspace{-0.13in}
|
|
|
|
|
\centering
|
|
|
|
|
\begin{tabular}{|m{2cm}|m{1cm}|m{1cm}|m{1cm}|m{1cm}|}
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
AreoRAG's query time (3.42s on MarsRegion-QA) is competitive with HyperRAG (2.95s) and substantially faster than MultiRAG (4.87s) and TruthfulRAG (5.62s). The faster online query is attributable to the $O(k)$ hyperedge traversal complexity and the lightweight MLP-based plausibility scoring, which avoids the expensive mutual information entropy computation required by MultiRAG's MCC at query time.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
The preprocessing time (86.5s) is higher than MultiRAG (15.2s) due to the hyperbolic embedding computation (Eqs.~\ref{equ:embedding mapping} and~\ref{equ:Spatial Scale-Curvature Correspondence}), but lower than HyperRAG (142.7s) because we do not require the full contrastive training pipeline. Importantly, HySH construction is a one-time offline cost amortized across all queries. The PICT module adds minimal online overhead: the conflict classifier (Eq.~\ref{equ:conflict classification}) requires $<$0.1s per detected conflict pair, and the interaction entropy computation (Eq.~\ref{equ:interaction entropy}) adds approximately 0.8s per query through parallel LLM forward passes.
|
|
|
|
|
|
|
|
|
|
\subsection{Case Study}
|
|
|
|
|
|
|
|
|
|
|