TY - GEN
T1 - Supervised Ensemble-based Causal DAG Selection
AU - Mio, Corrado
AU - Lin, Jianyi
AU - Damiani, Ernesto
AU - Gianini, Gabriele
N1 - Publisher Copyright: Copyright © 2025 held by the owner/author(s).
PY - 2025/5/14
Y1 - 2025/5/14
N2 - Causal Discovery (CD) identifies cause-and-effect relationships from data using statistical learning. Several CD algorithms have been proposed relying on different assumptions, e.g. about the statistical relations among variables. However, which assumptions actually hold for a specific case study is not known a priori. Given a dataset obtained by sampling the joint distribution of all variables of a generative causal model, in general each algorithm could reconstruct a different Directed Acyclic Graph (DAG): some will be closer to the ground truth (GT) DAG than others, depending also on the applicability of the respective assumptions to the case study. As a consequence, given a collection of heterogeneous case studies, a hypothetical GT-aware oracle, able to select the best DAG out of the set of reconstructed DAGs, will outclass the average performance of the individual algorithms of the ensemble. In this work, we propose a supervised approach, relying on multilabel classification, to select the DAGs closest to GT by only comparing the topologies of the reconstructed DAGs. We carried out the study on a wide synthetic dataset of causal models, sampling DAG topologies with up to ten vertices, and using a representative set of linear and non-linear statistical dependencies. Whereas the best individual CD algorithm yields, on average, a distance from GT three times larger than the oracle, our algorithm features an average distance from GT only about 10% larger than the oracle.
AB - Causal Discovery (CD) identifies cause-and-effect relationships from data using statistical learning. Several CD algorithms have been proposed relying on different assumptions, e.g. about the statistical relations among variables. However, which assumptions actually hold for a specific case study is not known a priori. Given a dataset obtained by sampling the joint distribution of all variables of a generative causal model, in general each algorithm could reconstruct a different Directed Acyclic Graph (DAG): some will be closer to the ground truth (GT) DAG than others, depending also on the applicability of the respective assumptions to the case study. As a consequence, given a collection of heterogeneous case studies, a hypothetical GT-aware oracle, able to select the best DAG out of the set of reconstructed DAGs, will outclass the average performance of the individual algorithms of the ensemble. In this work, we propose a supervised approach, relying on multilabel classification, to select the DAGs closest to GT by only comparing the topologies of the reconstructed DAGs. We carried out the study on a wide synthetic dataset of causal models, sampling DAG topologies with up to ten vertices, and using a representative set of linear and non-linear statistical dependencies. Whereas the best individual CD algorithm yields, on average, a distance from GT three times larger than the oracle, our algorithm features an average distance from GT only about 10% larger than the oracle.
KW - causal discovery
KW - D-separation based distance
KW - ensemble approach
KW - model selection
KW - multi-label classification
KW - structural hamming distance
KW - structural intervention distance
UR - https://www.scopus.com/pages/publications/105006443858
U2 - 10.1145/3672608.3707709
DO - 10.1145/3672608.3707709
M3 - Conference contribution
AN - SCOPUS:105006443858
T3 - Proceedings of the ACM Symposium on Applied Computing
SP - 622
EP - 629
BT - 40th Annual ACM Symposium on Applied Computing, SAC 2025
T2 - 40th Annual ACM Symposium on Applied Computing, SAC 2025
Y2 - 31 March 2025 through 4 April 2025
ER -