What are statistical embeddings?

They represent each numerical dataset with structured exploratory data analysis (EDA) descriptors and map those descriptors into a shared vector space using a pre-trained sentence transformer, so datasets can be compared without shared column names.
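As a rough illustration of the descriptor step, the sketch below summarizes each numeric column as a short text string that omits the column name. The choice of statistics, the string format, and the function names `describe_column` and `describe_dataset` are assumptions made for illustration, not the descriptors used in the paper.

```python
import numpy as np

def describe_column(values):
    """Summarize one numeric column as a text descriptor (hypothetical format)."""
    x = np.asarray(values, dtype=float)
    x = x[~np.isnan(x)]  # ignore missing values
    mean, std = x.mean(), x.std()
    # Standardized third moment; defined as 0 for a constant column.
    skew = 0.0 if std == 0 else float(np.mean(((x - mean) / std) ** 3))
    q25, q50, q75 = np.percentile(x, [25, 50, 75])
    return (f"mean={mean:.3g} std={std:.3g} skew={skew:.3g} "
            f"min={x.min():.3g} q25={q25:.3g} median={q50:.3g} "
            f"q75={q75:.3g} max={x.max():.3g}")

def describe_dataset(table):
    """Return one descriptor string per column; column names are never used."""
    columns = np.asarray(table, dtype=float).T
    return [describe_column(col) for col in columns]

if __name__ == "__main__":
    for line in describe_dataset([[1, 10], [2, 20], [3, 30]]):
        print(line)
```

Each resulting string could then be passed to a pre-trained sentence transformer (for example, the `encode` method of a `SentenceTransformer` model from the sentence-transformers library) to obtain vectors in a shared space. That step is left out here so the sketch runs with NumPy alone.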

Why does this matter?

It enables retrieval and alignment of heterogeneous numerical tables for LLMs, supporting RAG pipelines and transfer learning in finance and healthcare.

What should we watch next?

Future LLM applications in scientific computing are likely to depend on methods that capture the intrinsic statistical structure of data, bridging raw numerical tables and language models.

Statistical Embeddings: Similarity Retrieval and Interpretable Alignment for Numerical Tabular Datasets

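Large language models lack a native mechanism for handling heterogeneous numerical tabular data. To address this, the study proposes a new method based on statistical embeddings. The method characterizes each dataset with structured exploratory data analysis (EDA) descriptors and maps them into a shared vector space using a pre-trained sentence transformer. Its core innovation is the use of canonical correlation analysis (CCA) and its penalized variants to quantify cross-dataset similarity and to recover sparse, interpretable variable-level correspondences without requiring shared variable names. Validated on 15 datasets spanning general-purpose benchmarks, materials informatics, and nuclear-grade graphite characterization, the method achieves a retrieval P@1 of 0.9 and remains robust under embedding ablations and differential privacy budgets.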
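To make the CCA step concrete, here is a minimal sketch of classical (unpenalized) canonical correlation analysis in NumPy, computed with the QR-plus-SVD approach. It assumes two matrices with the same number of rows and full column rank. How the paper arranges its embeddings into such matrices is not stated in this summary, and the penalized variants that yield sparse variable-level correspondences are not implemented here. The function name `canonical_correlations` is hypothetical.

```python
import numpy as np

def canonical_correlations(X, Y):
    """Canonical correlations between two views with the same number of rows.

    Assumes both centered matrices have full column rank (an assumption of
    this sketch, not a statement about the paper's data).
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    # Center each column so the result reflects correlation, not raw offsets.
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # Orthonormal bases for the column spaces of the two views.
    Qx, _ = np.linalg.qr(Xc)
    Qy, _ = np.linalg.qr(Yc)
    # Singular values of Qx^T Qy are the canonical correlations.
    rho = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return np.clip(rho, 0.0, 1.0)  # guard against rounding slightly above 1

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))
    A = np.array([[2.0, 0.0, 1.0], [0.0, 1.0, 0.0], [1.0, 0.0, 3.0]])
    print(canonical_correlations(X, X @ A))                       # linearly related views
    print(canonical_correlations(X, rng.normal(size=(200, 3))))  # unrelated views
```

Two views that are invertible linear transformations of each other give canonical correlations close to 1, while independent noise gives much smaller values. A single similarity score could be formed by averaging the correlations, though that aggregation is an assumption of this sketch rather than the paper's stated definition.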

Sources

arXiv