Today’s meteorologists no longer ask themselves, “Will it rain tomorrow?” but rather, “What is the probability that it will rain tomorrow?” In other words, weather forecasting has moved beyond simple point projections and has largely shifted to probabilistic predictions, where forecast uncertainty is quantified through quantiles or entire probability distributions. Probabilistic forecasting was also the subject of my previous blog post, where the article under discussion explored the intricacies of proper scoring rules, the metrics that allow us to compare and rank these more complex distributional forecasts. In this blog post, we explore facets of an even more basic consideration: how can one be sure that probabilistic forecasts make sense and actually align with the data that ended up being observed? This ‘alignment’ between forecasted probabilities and observations is referred to as probabilistic calibration. Put more concretely, if a precipitation forecasting model is calibrated, then across all the occasions on which it gives an 80% chance of rain, one would expect to see rain on approximately 80% of them.
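The 80% example above can be checked empirically by grouping forecasts into probability bins and comparing each bin's average forecast with how often the event actually occurred. The following is a minimal sketch of that idea using only the Python standard library; the function name `reliability_table`, the number of bins, and the synthetic forecaster are illustrative assumptions, not the method of the article under discussion.

```python
import random

def reliability_table(probs, outcomes, n_bins=5):
    """Bin forecasts by probability and compare the mean forecast
    with the observed event frequency in each non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))
    table = []
    for members in bins:
        if not members:
            continue
        mean_forecast = sum(p for p, _ in members) / len(members)
        observed_freq = sum(y for _, y in members) / len(members)
        table.append((mean_forecast, observed_freq, len(members)))
    return table

rng = random.Random(0)
# Synthetic (assumed) forecaster that is calibrated by construction:
# the event occurs with exactly the stated probability.
probs = [rng.random() for _ in range(20000)]
outcomes = [1 if rng.random() < p else 0 for p in probs]

for mean_forecast, observed_freq, count in reliability_table(probs, outcomes):
    print(f"forecast ~{mean_forecast:.2f}  observed {observed_freq:.2f}  (n={count})")
```

For a calibrated forecaster the two columns should roughly agree in every bin; systematic gaps (say, forecasts near 0.8 paired with an observed frequency near 0.6) would suggest overconfidence.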
Nobel laureate Niels Bohr is famously quoted as saying, “Prediction is very difficult, especially if it’s about the future.” Forecasting, whether a science or perhaps an art, is no easy task and is subject to a large amount of uncertainty. For this reason, practitioners interested in prediction have increasingly migrated to probabilistic forecasting, where an entire distribution is given as the forecast instead of a single number, thus quantifying the inherent uncertainty. In such a setting, traditional metrics for assessing and comparing predictive performance, such as mean squared error (MSE), are no longer appropriate, because they evaluate a single point forecast rather than a distribution. Instead, proper scoring rules are utilized to evaluate and rank forecast methods. A scoring rule is a function that takes a predictive distribution along with an observed value and outputs a real number called the score. Under the convention that higher scores are better, such a rule is said to be proper if the expected score is maximized when the predictive distribution is the same as the distribution from which the observation was drawn.
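To make the definition of propriety concrete, here is a small sketch for the simplest case, a binary event. It compares the logarithmic score (a standard proper rule) with the linear score (a standard example of an improper rule) by searching a grid of candidate forecasts for the one with the highest expected score. The true probability `q = 0.3` and the grid are assumptions chosen purely for illustration.

```python
import math

def log_score(p, y):
    """Logarithmic score: log of the probability assigned to what happened."""
    return math.log(p if y == 1 else 1.0 - p)

def linear_score(p, y):
    """Linear score: the probability assigned to what happened (improper)."""
    return p if y == 1 else 1.0 - p

def expected_score(p, q, score):
    """Expected score of forecast p when the event truly occurs with probability q."""
    return q * score(p, 1) + (1.0 - q) * score(p, 0)

q = 0.3  # assumed true probability of the event
candidates = [i / 100 for i in range(1, 100)]  # forecasts 0.01, 0.02, ..., 0.99

best_log = max(candidates, key=lambda p: expected_score(p, q, log_score))
best_linear = max(candidates, key=lambda p: expected_score(p, q, linear_score))
print("log score prefers   ", best_log)
print("linear score prefers", best_linear)
```

Under the logarithmic score the honest forecast `p = q` wins, whereas the linear score rewards pushing the forecast toward the extreme on the more likely side, which is exactly the behavior a proper rule rules out.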
Pinpointing Causality across Time and Geography: Uncovering the Relationship between Airstrikes and Insurgent Violence in Iraq
“Correlation is not causation”, as the saying goes, yet sometimes it can be, if certain assumptions are met. Describing those assumptions and developing methods to estimate causal effects, not just correlations, is the central concern of the causal inference field. Broadly speaking, causal inference seeks to measure the effect of a treatment on an outcome. This treatment can be an actual medicine or something more abstract like a policy. Much of the literature in this space focuses on relatively simple treatments/outcomes and uses data which doesn’t exhibit much dependency. As an example, clinicians often want to measure the effect of a binary treatment (received the drug or not) on a binary outcome (developed the disease or not). The data used to answer such questions is typically patient-level data where the patients are assumed to be independent from each other. To be clear, these simple setups are enormously useful and describe commonplace causal questions.
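As a rough illustration of the simple setup described above (binary treatment, binary outcome, independent patients), the sketch below simulates hypothetical potential outcomes and shows that, under randomized assignment, the difference in observed disease risk between treated and untreated patients approximates the average treatment effect. All numbers (the risks of 0.30 and 0.20, the sample size) are made-up assumptions for the simulation, not results from any study.

```python
import random

rng = random.Random(1)
n = 50000

# Hypothetical potential outcomes for each patient:
# y0 = develops the disease if untreated, y1 = develops the disease if treated.
y0 = [1 if rng.random() < 0.30 else 0 for _ in range(n)]
y1 = [1 if rng.random() < 0.20 else 0 for _ in range(n)]
true_ate = sum(a - b for a, b in zip(y1, y0)) / n

# Randomized, independent assignment; only one potential outcome is ever observed.
treated = [rng.random() < 0.5 for _ in range(n)]
observed = [a if t else b for a, b, t in zip(y1, y0, treated)]

n_treated = sum(treated)
risk_treated = sum(y for y, t in zip(observed, treated) if t) / n_treated
risk_control = sum(y for y, t in zip(observed, treated) if not t) / (n - n_treated)
risk_difference = risk_treated - risk_control

print(f"average treatment effect in the simulated population: {true_ate:.3f}")
print(f"estimated risk difference from observed data:         {risk_difference:.3f}")
```

The agreement between the two numbers relies on the assumptions built into the simulation: assignment is random and one patient's treatment has no influence on another patient's outcome.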
Companies often want to test the impact of one design decision over another. For example, Google might want to compare the current ranking of search results (version A) with an alternative ranking (version B) and evaluate how the modification would affect users’ decisions and click behavior. An experiment to determine this impact on users is known as an A/B test, and many methods have been designed to measure the ‘treatment’ effect of the proposed change. However, these classical methods typically assume that changing one person’s treatment will not affect other people’s outcomes (known as the Stable Unit Treatment Value Assumption or SUTVA). In the Google example, this is typically a valid assumption: showing one user different search results shouldn’t impact another user’s click behavior. But in some situations, SUTVA is violated, and new methods must be introduced to properly measure the effect of design changes.
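For reference, one classical analysis of an A/B test that relies on SUTVA is a simple difference in click-through rates with a normal-approximation confidence interval. The sketch below is a minimal version of that idea; the function name `ab_test` and the click counts are hypothetical and chosen only for illustration.

```python
import math

def ab_test(clicks_a, users_a, clicks_b, users_b, z=1.96):
    """Difference in click-through rates (B minus A) with an approximate
    95% confidence interval, treating users as independent units."""
    rate_a = clicks_a / users_a
    rate_b = clicks_b / users_b
    effect = rate_b - rate_a
    # Standard error of a difference of two independent proportions.
    se = math.sqrt(rate_a * (1 - rate_a) / users_a
                   + rate_b * (1 - rate_b) / users_b)
    return effect, (effect - z * se, effect + z * se)

# Hypothetical counts: 10,000 users per arm.
effect, (low, high) = ab_test(clicks_a=1200, users_a=10000,
                              clicks_b=1320, users_b=10000)
print(f"estimated effect: {effect:.4f}, 95% CI: ({low:.4f}, {high:.4f})")
```

The standard error formula treats every user's outcome as independent of everyone else's assignment. When users interact, as on a social network or a marketplace, that independence fails and both the estimate and the interval can be misleading.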