Tag Archives: data

Figure: (a) Traditional data splitting: the full dataset X is divided into a selection dataset X_1 (a portion a of the data, used with any procedure to choose a model) and an inference dataset X_2 (the remaining portion 1−a, used to test the chosen model). (b) Data fission: a randomization scheme splits the data into f(X), from which any procedure may choose a model; the full dataset is then revealed at the inference stage, where p(g(X) | f(X)) is used for inference.

Title: Data Fission: Splitting a Single Data Point
Authors & Year: J. Leiner, B. Duan, L. Wasserman, and A. Ramdas (2023)
Journal: Journal of the American Statistical Association [DOI: 10.1080/01621459.2023.2270748]
Review Prepared by David Han

Why Split the Data?

In statistical analysis, a common practice is to split a dataset into two (or more) parts, typically one for model development and the other for model evaluation/validation. A new method called data fission offers a more efficient approach. Imagine you have a single data point and want to divide it into two pieces that cannot be understood separately but fully reveal the original data when combined. Adding and subtracting random noise creates two such parts: each contains unique information, and together they provide a complete picture. This technique is useful for making inferences after selecting a statistical model, allowing for better flexibility and accuracy compared to traditional data splitting…
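To make the "add and subtract noise" idea concrete, here is a minimal sketch of the Gaussian case: for data with known variance, adding and subtracting independent noise of the same variance yields two independent pieces that jointly reconstruct the original observation. The parameter names and the numerical check below are illustrative, not code from the paper.

```python
# Gaussian data fission sketch (illustrative; assumes sigma is known).
# For X ~ N(mu, sigma^2), draw independent Z ~ N(0, sigma^2) and form
#   f(X) = X + tau * Z   (revealed for model selection)
#   g(X) = X - Z / tau   (held back for inference)
# f(X) and g(X) are independent, and X = (f + tau^2 * g) / (1 + tau^2).
import numpy as np

rng = np.random.default_rng(0)

mu, sigma, tau, n = 1.0, 2.0, 1.0, 100_000
X = rng.normal(mu, sigma, size=n)      # original data
Z = rng.normal(0.0, sigma, size=n)     # external randomization

f_X = X + tau * Z                      # selection piece
g_X = X - Z / tau                      # inference piece

# Independence check: sample correlation should be near zero.
print("corr(f, g):", np.corrcoef(f_X, g_X)[0, 1])

# The two pieces jointly recover the original data point.
X_recon = (f_X + tau**2 * g_X) / (1 + tau**2)
print("max reconstruction error:", np.max(np.abs(X - X_recon)))
```

The tuning parameter tau controls how much information goes into the selection piece versus the inference piece, playing the role that the split fraction a plays in ordinary data splitting.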

Read more

One of the key goals of science is to create theoretical models that are useful for describing the world we see around us. However, no model is perfect. The inability of models to replicate observations is often called the “synthetic gap.” For example, it may be too computationally expensive to include a known effect or to vary a large number of known parameters. Or, there may be unknown instrumental effects associated with variability in conditions during data acquisition.

Read more

In responding to a pandemic, time is of the essence. As the COVID-19 pandemic has raged on, it has become evident that complex decisions must be made as quickly as possible, and quality data and statistics are necessary to drive the solutions that can prevent mass illness and death. Therefore, it is essential to outline a robust and generalizable statistical process that can not only help to diminish the current COVID-19 pandemic but also assist in the prevention of potential future pandemics. 

Read more
