Scientific Machine Learning

What is Scientific Machine Learning?

Scientific machine learning (SciML) is an emerging discipline within the data science community. SciML seeks to address domain-specific data challenges and extract insights from scientific data sets through innovative methodological solutions. It draws on tools from both machine learning and scientific computing to develop new methods for scalable, domain-aware, robust, reliable, and interpretable learning and data analysis, and will be critical in driving the next wave of data-driven scientific discovery in the physical and engineering sciences.

Like scientific computing, SciML is multidisciplinary and leverages expertise from applied and computational mathematics, computer science, and physical science.


Why Scientific Machine Learning?

Innovations in machine learning (ML) and "big data" are beginning to drive advances in scientific disciplines such as the Earth sciences [1], but the full potential of these techniques for data-driven discovery has yet to be realized. One barrier to data-driven discovery is that existing methods often do not meet the needs of scientific users. Application-agnostic algorithms, or those designed for more traditional ML applications such as image or natural language processing, typically cannot be applied directly to scientific data sets and require non-trivial, task-specific modifications. In other cases, the models or outputs do not provide the insights or guarantees required for scientific applications.

Consider the following:

  • In many applications only limited or low-quality labels are available, while massive unlabeled (and often class-imbalanced) data sets are common.
  • In discovery-oriented tasks, ground truth is unknown and benchmark data sets are unavailable.
  • Scientific data are often high-dimensional, noisy, heterogeneous, low-signal-to-noise, and multiscale.
  • Models should respect or incorporate physical laws, constraints, and other scientific domain knowledge.
  • Robust methods and an ability to quantify uncertainty are required for scientific rigor.
  • Extracting new scientific insights from data requires human-interpretable models or outputs.


Research to advance data-driven discovery in the Earth and physical sciences

A non-exhaustive list of research topics in scientific machine learning:

  • Big data & small labels.
    • Methods for unsupervised learning, semi-supervised learning, positive-unlabeled learning, active learning, or weakly-supervised learning that account for biases in labeling and make realistic assumptions about the label-generating process.
  • Leveraging non-traditional / low-cost data sources, data fusion.
    • Extracting insights from multiple sensors or sources that produce larger quantities of lower-quality (noisy, heterogeneous, unstructured, high-uncertainty) data.
  • Robust and reliable learning.
    • Uncertainty quantification, stability analysis, validation, performance metrics, and reproducibility, especially in high-stakes or safety-critical applications.
  • Domain-aware and physics-informed learning.
    • Hybrid models that include both data-driven and domain-aware components.
  • Enhancing modeling and simulation capabilities with machine learning.
  • Novelty detection in large data sets.
  • Interpretable learning.
  • Algorithms for streaming data.
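
As a rough illustration of the hybrid-model idea listed above, the sketch below pairs a domain-aware component with a data-driven correction. Everything in it is an assumption made for illustration: the free-fall law as the "physics", the cubic drag-like term in the synthetic data, the polynomial residual fit, and the function names `physics_model`, `fit_residual`, and `hybrid_predict`. It is one simple way such a model might be put together, not a method taken from the reference.

```python
import numpy as np

def physics_model(t, g=9.81):
    """Domain-aware component: ideal free-fall distance, ignoring drag."""
    return 0.5 * g * t**2

def fit_residual(t, y_obs, degree=3):
    """Data-driven component: polynomial fit to what the physics leaves unexplained."""
    residual = y_obs - physics_model(t)
    return np.polyfit(t, residual, degree)

def hybrid_predict(t, coeffs):
    """Hybrid prediction: physics baseline plus the learned correction."""
    return physics_model(t) + np.polyval(coeffs, t)

# Synthetic "observations" (hypothetical): free fall with an invented
# drag-like deficit, plus Gaussian measurement noise.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 2.0, 50)
y_true = physics_model(t) - 0.8 * t**3
y_obs = y_true + rng.normal(scale=0.05, size=t.shape)

coeffs = fit_residual(t, y_obs)

# Compare root-mean-square error against the noise-free signal.
err_physics = np.sqrt(np.mean((physics_model(t) - y_true) ** 2))
err_hybrid = np.sqrt(np.mean((hybrid_predict(t, coeffs) - y_true) ** 2))
print(f"physics-only RMSE: {err_physics:.3f}, hybrid RMSE: {err_hybrid:.3f}")
```

In this toy setting the physics term carries the known structure and the fitted term only has to explain the remaining discrepancy, which is usually an easier learning problem than fitting the full signal from scratch.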

References

  1. Bergen et al. (2019). Machine learning for data-driven discovery in solid Earth geoscience. Science. DOI: 10.1126/science.aau0323