For centuries, hypothesis testing has been one of the fundamental inferential tools in statistics, guiding the scientific community in confirming or challenging its beliefs. The p-value is a famous and near-universal metric for deciding whether or not to reject a null hypothesis H0, which essentially denotes the default belief held before any experimental data are seen.
If you’re old enough to remember ‘flip phones,’ then you might remember the first time phones had cameras. Fast forward 10 to 20 years: phone cameras now have front and back lenses with incredible resolution and the latest image processing technology. Now, imagine taking a picture of a dog with a flip phone from the early 2000s and with another phone released in 2023. The dog remains the same, but the two images differ vastly. In medical imaging, this is known as domain shift: the equipment and the operator used to capture the same object differ. In medicine specifically, hospitals use equipment of different brands and specifications, acquired from various vendors depending on their resources and budget.
Complex polynomials are one of the oldest and most fundamental objects of study in mathematics, and are ubiquitous in applications.
As the fields of statistics and data science have grown, the importance of reproducible research, and of easing the “replication crisis,” has become increasingly apparent. The inability to reproduce scientific results when using the same data and code may lead to a lack of confidence in the validity of research and can make it difficult to build on and advance scientific knowledge.
Pinpointing Causality across Time and Geography: Uncovering the Relationship between Airstrikes and Insurgent Violence in Iraq
“Correlation is not causation”, as the saying goes, yet sometimes it can be, if certain assumptions are met. Describing those assumptions and developing methods to estimate causal effects, not just correlations, is the central concern of the causal inference field. Broadly speaking, causal inference seeks to measure the effect of a treatment on an outcome. This treatment can be an actual medicine or something more abstract, like a policy. Much of the literature in this space focuses on relatively simple treatments and outcomes and uses data that exhibit little dependence. As an example, clinicians often want to measure the effect of a binary treatment (received the drug or not) on a binary outcome (developed the disease or not). The data used to answer such questions are typically patient-level data in which the patients are assumed to be independent of one another. To be clear, these simple setups are enormously useful and describe commonplace causal questions.
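As a minimal sketch of the simple setup described above (binary treatment, binary outcome, independent patients), the effect can be summarized as a difference in outcome rates between the two groups. The function name and the patient counts below are hypothetical. Reading the difference as a causal effect also assumes that treatment was assigned at random, which the passage does not state.

```python
def risk_difference(treated_events, treated_total, control_events, control_total):
    """Difference in outcome rates between treated and control groups.

    Interpreting this as a causal effect assumes randomized treatment
    assignment and independent patients (assumptions, not guarantees).
    """
    risk_treated = treated_events / treated_total
    risk_control = control_events / control_total
    return risk_treated - risk_control


# Hypothetical counts: 12 of 100 treated patients developed the disease,
# compared with 30 of 100 untreated patients.
rd = risk_difference(12, 100, 30, 100)
print(f"Estimated risk difference: {rd:.2f}")
```

A negative value means the disease was less frequent among treated patients in this made-up sample.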
Explainable groupings in the face of noisy, high-dimensional madness: Wild ambitions tamed through features’ salience
Whatever your exact interests in data, model-building frequently comes with other, inseparable responsibilities. Consider two crucial ones:
a. checking how well your model did: the less frequently you make big, bad decisions, the happier you are. In a regression problem, such a decision might be predicting someone’s salary to be $95,000 when the real figure is, say, $70,000; in a classification setting, it might be saying a customer will buy a product when, in fact, she won’t. Unsurprisingly, these accuracy measures are often used to guide the model-building process.
b. explaining how you arrived at a prediction: this involves unpacking or interpreting the $95,000. Due to his experience, the person makes $10,000 more than the average; due to his education, $20,000 more; but due to his state of residence, $5,000 less than the average; and so on. These ups and downs add up to the final predicted value.
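The additive bookkeeping in item b can be sketched in a few lines of Python. The baseline average of $70,000 is an assumption, since the text does not state the average salary. It is chosen so that the three listed contributions, taken as the only ones, sum to the $95,000 prediction. The variable and function names are illustrative.

```python
# Hypothetical baseline: the average salary is not given in the text.
# 70_000 is assumed so that the listed contributions reach $95,000.
baseline = 70_000

# Per-feature contributions taken from the example in item b.
contributions = {
    "experience": 10_000,
    "education": 20_000,
    "state_of_residence": -5_000,
}


def explain(baseline, contributions):
    """Return the prediction as the baseline plus the sum of contributions."""
    return baseline + sum(contributions.values())


prediction = explain(baseline, contributions)

for feature, amount in contributions.items():
    print(f"{feature:>20}: {amount:+,}")
print(f"{'prediction':>20}: {prediction:,}")
```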
Large volumes of data are pouring in every day from scientific projects such as the experiments at CERN and the Sloan Digital Sky Survey. Data are coming in so fast that researchers struggle to keep pace with the analysis and are increasingly developing automated analysis methods to aid in this herculean task. As a first step, it is now commonplace to perform dimension reduction, which reduces a large number of measurements to a set of key values that are easier to visualize and interpret.
In statistical modeling and analysis, constructing a confidence interval for a parameter of interest is an important inferential task: it quantifies the uncertainty around the parameter estimate. For instance, the true average lifetime of a cell phone can be a parameter of interest, one that is unknown to both manufacturers and consumers. A confidence interval for it can help manufacturers determine an appropriate warranty period and communicate the device’s reliability and quality to consumers. Unfortunately, exact methods for building confidence intervals are often unavailable in practice, and approximate procedures are employed instead.
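One common approximate procedure, shown here only as an illustration, is the normal-approximation interval for a mean: the sample mean plus or minus a standard normal quantile times the standard error. The lifetimes below are made-up numbers and the function name is hypothetical. The approximation leans on the central limit theorem, so it can be rough for small or heavily skewed samples.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def approx_mean_ci(sample, level=0.95):
    """Normal-approximation confidence interval for the population mean."""
    n = len(sample)
    center = mean(sample)
    standard_error = stdev(sample) / sqrt(n)  # sample standard deviation / sqrt(n)
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% level
    return center - z * standard_error, center + z * standard_error


# Hypothetical cell phone lifetimes, in years.
lifetimes_years = [2.1, 3.4, 2.8, 4.0, 3.1, 2.5, 3.7, 2.9, 3.3, 3.0]
lower, upper = approx_mean_ci(lifetimes_years)
print(f"Approximate 95% interval: ({lower:.2f}, {upper:.2f}) years")
```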