How Nyholt’s Method Makes Scientific Testing More Reliable

by AB Test, April 1st, 2025

Too Long; Didn't Read

This appendix explores Nyholt’s method for improving the efficiency of multiple hypothesis testing. By adjusting for the covariance structure of the tests, it brings type I error rates and power closer to their intended levels while retaining confidence intervals and analytical sample size calculations.


Abstract and 1 Introduction

1.1 Related literature

  2. Types of Metrics and Their Hypotheses and 2.1 Types of metrics

    2.2 Hypotheses for different types of metrics

  3. Type I and Type II Error Rates for Decision Rules including Superiority and Non-Inferiority Tests

    3.1 The composite hypotheses of superiority and non-inferiority tests

    3.2 Bounding the type I and type II error rates for UI and IU testing

    3.3 Bounding the error rates for a decision rule including both success and guardrail metrics

    3.4 Power corrections for non-inferiority testing

  4. Extending the Decision Rule with Deterioration and Quality Metrics

  5. Monte Carlo Simulation Study

    5.1 Results

  6. Discussion and Conclusions


APPENDIX A: IMPROVING THE EFFICIENCY OF PROPOSITION 4.1 WITH ADDITIONAL ASSUMPTIONS

APPENDIX B: EXAMPLES OF GLOBAL FALSE AND TRUE POSITIVE RATES

APPENDIX C: A NOTE ON SEQUENTIAL TESTING OF DETERIORATION

APPENDIX D: USING NYHOLT’S METHOD OF EFFECTIVE NUMBER OF INDEPENDENT TESTS


Acknowledgments and References


APPENDIX D: USING NYHOLT’S METHOD OF EFFECTIVE NUMBER OF INDEPENDENT TESTS

As discussed above, the absence of assumptions about the data-generating process implies that under many processes the decision rule will have a type I error rate below the nominal level and power above the intended level. Although there are many alternatives to Bonferroni-type corrections, we have two conditions that we would like any design to satisfy. First, the individual metric results should have confidence intervals that map to the decision rule outcome. Second, the design should allow the required sample size to be calculated analytically. These conditions rule out most alternative multiple-correction methods, such as [7] and [8]. Very few methods provide confidence intervals, and even fewer can be integrated into the sample size calculation, due to their reliance on the p-values.
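As a minimal sketch of the first condition, Bonferroni-adjusted confidence intervals map one-to-one to the test decisions: a metric’s null hypothesis is rejected exactly when its adjusted interval excludes the null value. The function name and the numbers below are illustrative, not from the paper:

```python
from statistics import NormalDist

def bonferroni_ci(estimates, std_errors, alpha=0.05):
    """Confidence intervals at the Bonferroni-adjusted level alpha/m.

    Rejecting a metric's null hypothesis exactly when its adjusted CI
    excludes the null value reproduces the Bonferroni test decision,
    so the intervals map one-to-one to the decision-rule outcome.
    """
    m = len(estimates)
    # Two-sided normal quantile at the adjusted per-test level alpha/m.
    z = NormalDist().inv_cdf(1 - alpha / (2 * m))
    return [(e - z * s, e + z * s) for e, s in zip(estimates, std_errors)]

# Three metrics (illustrative effects and standard errors):
# reject H0 (no effect) iff the adjusted CI excludes 0.
cis = bonferroni_ci([0.10, 0.02, -0.01], [0.03, 0.03, 0.03])
reject = [lo > 0 or hi < 0 for lo, hi in cis]  # [True, False, False]
```

Because the adjusted intervals use the same critical value as the adjusted tests, they also plug directly into a closed-form sample size formula, which is the second condition above.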


One method that fulfills our conditions is Nyholt’s method for calculating the so-called effective number of independent tests in a set of tests with an arbitrary covariance structure [14]. This method is frequently applied in high-dimensional genome testing and has been refined by several authors, see e.g. [11, 5]. Nyholt’s method is simple: the effective number of independent tests is given by

M_eff = 1 + (M − 1)(1 − Var(λ_obs) / M),

where M is the number of tests and λ_obs are the eigenvalues of the M × M correlation matrix of the test statistics.
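A sketch of the computation, assuming z-test statistics with a known correlation matrix (the helper name is ours, not the paper’s):

```python
import numpy as np

def effective_number_of_tests(corr):
    """Nyholt's (2004) effective number of independent tests:

        M_eff = 1 + (M - 1) * (1 - Var(lambda) / M),

    where lambda are the eigenvalues of the M x M correlation matrix.
    The variance uses denominator M - 1, matching Nyholt's definition
    (the mean eigenvalue of a correlation matrix is exactly 1).
    """
    corr = np.asarray(corr, dtype=float)
    m = corr.shape[0]
    eigenvalues = np.linalg.eigvalsh(corr)
    return 1 + (m - 1) * (1 - np.var(eigenvalues, ddof=1) / m)

# Independent tests: M_eff equals the number of tests.
print(effective_number_of_tests(np.eye(3)))        # 3.0
# Perfectly correlated tests: M_eff collapses to 1.
print(effective_number_of_tests(np.ones((3, 3))))  # 1.0
```

The correction then proceeds by using M_eff in place of M, e.g. a per-test level of α/M_eff in a Bonferroni-style correction.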
D.0.1 Monte Carlo simulation results with the Nyholt method

Below, the tables from Section 5 are repeated together with the results using Nyholt’s method as described above. As expected, the method improves efficiency: under certain covariance structures, the type I error rates and power are closer to the intended rates.
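To illustrate the efficiency gain, here is a small Monte Carlo sketch (not the paper’s simulation design) comparing the familywise type I error of the plain Bonferroni correction α/M with the Nyholt-corrected α/M_eff for correlated z-tests; the equicorrelated structure and ρ = 0.9 are illustrative choices:

```python
import numpy as np
from statistics import NormalDist

def effective_tests(corr):
    """Nyholt's effective number of independent tests for a correlation matrix."""
    eig = np.linalg.eigvalsh(np.asarray(corr, dtype=float))
    m = len(eig)
    return 1 + (m - 1) * (1 - np.var(eig, ddof=1) / m)

def familywise_error(corr, per_test_alpha, n_sim=100_000, seed=7):
    """Monte Carlo familywise type I error: the family is rejected whenever
    any of the correlated z-statistics exceeds the two-sided critical value."""
    rng = np.random.default_rng(seed)
    z = rng.multivariate_normal(np.zeros(len(corr)), corr, size=n_sim)
    crit = NormalDist().inv_cdf(1 - per_test_alpha / 2)
    return float(np.mean(np.any(np.abs(z) > crit, axis=1)))

# Three equicorrelated metrics with rho = 0.9 (illustrative choice).
rho, m, alpha = 0.9, 3, 0.05
corr = np.full((m, m), rho) + (1 - rho) * np.eye(m)
m_eff = effective_tests(corr)                         # ~1.38 here
fwer_bonferroni = familywise_error(corr, alpha / m)   # conservative, well below 0.05
fwer_nyholt = familywise_error(corr, alpha / m_eff)   # closer to the nominal 0.05
```

Under strong correlation the plain Bonferroni correction is markedly conservative, while the Nyholt-corrected level recovers a familywise rate near the nominal 5%, which is the efficiency improvement the tables below quantify.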

ACKNOWLEDGMENTS

The authors would like to thank Bob Wilson for feedback on earlier drafts of this paper.


TABLE 10. Simulation results of the type I error rates under the null hypotheses of the non-inferiority and the superiority tests, respectively. The rejection rates of the deterioration, non-inferiority, and superiority tests are presented along with the rejection rate of Decision Rule 2.


TABLE 11. Simulation results of the rejection rates under the alternative hypothesis of the non-inferiority and the null hypothesis of the superiority tests. The rejection rates of the deterioration, non-inferiority, and superiority tests are presented along with the rejection rate of Decision Rule 2.

REFERENCES

[1] BERGER, J. (2013). Statistical decision theory: foundations, concepts, and methods. Springer Science & Business Media.


[2] DMITRIENKO, A., OFFEN, W. W. and WESTFALL, P. H. (2003). Gatekeeping strategies for clinical trials that do not require all primary effects to be significant. Statistics in Medicine 22 2387-2400.


[3] DMITRIENKO, A., TAMHANE, A. C. and BRETZ, F. (2009). Multiple testing problems in pharmaceutical statistics. CRC Press.


[4] FABIJAN, A., GUPCHUP, J., GUPTA, S., OMHOVER, J., QIN, W., VERMEER, L. and DMITRIEV, P. (2019). Diagnosing Sample Ratio Mismatch in Online Controlled Experiments: A Taxonomy and Rules of Thumb for Practitioners. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. KDD ’19 2156–2164. Association for Computing Machinery, New York, NY, USA. https://doi.org/10.1145/3292500.3330722


[5] GALWEY, N. W. (2009). A new measure of the effective number of tests, a practical tool for comparing families of non-independent significance tests. Genetic Epidemiology: The Official Publication of the International Genetic Epidemiology Society 33 559–568.


[6] GUILBAUD, O. (2014). Sharper Confidence Intervals for Hochberg-and Hommel-Related Multiple Tests Based On an Extended Simes Inequality. Statistics in Biopharmaceutical Research 6 123–136.


[7] HOLM, S. (1979). A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics 6 65–70.


[8] HOMMEL, G. (1988). A stagewise rejective multiple test procedure based on a modified Bonferroni test. Biometrika 75 383–386.


TABLE 12. Simulation results of the type I error rate under the alternative hypotheses of the non-inferiority and the superiority tests, respectively. The rejection rates of the deterioration, non-inferiority, and superiority tests are presented along with the rejection rate of Decision Rule 2.


[9] KONG, L., KOHBERGER, R. C. and KOCH, G. G. (2004). Type I Error and Power in Noninferiority/Equivalence Trials with Correlated Multiple Endpoints: An Example from Vaccine Development Trials. Journal of Biopharmaceutical Statistics 14 893-907. PMID: 15587971. https://doi.org/10.1081/BIP-200035454


[10] LEHMACHER, W., WASSMER, G. and REITMEIR, P. (1991). Procedures for Two-Sample Comparisons with Multiple Endpoints Controlling the Experimentwise Error Rate. Biometrics 47 511–521.


[11] LI, J. and JI, L. (2005). Adjusting multiple testing in multilocus analyses using the eigenvalues of a correlation matrix. Heredity 95 221–227.


[12] LIU, S. and LIU, A. (2022). The Lifecycle of Developing Overall Evaluation Criterion in AB Testing for Netflix Messaging. In Proceedings of the 26th International Conference on Evaluation and Assessment in Software Engineering 266–267.


[13] NEUHÄUSER, M. (2006). How to deal with multiple endpoints in clinical trials. Fundamental & Clinical Pharmacology 20 515-523.


[14] NYHOLT, D. R. (2004). A simple correction for multiple testing for single-nucleotide polymorphisms in linkage disequilibrium with each other. The American Journal of Human Genetics 74 765–769.


[15] O’BRIEN, P. C. (1984). Procedures for Comparing Samples with Multiple Endpoints. Biometrics 40 1079–1087.


[16] OFFEN, W., CHUANG-STEIN, C., DMITRIENKO, A., LITTMAN, G., MACA, J., MEYERSON, L., MUIRHEAD, R., STRYSZAK, P., BADDY, A., CHEN, K., COPLEY-MERRIMAN, K., DERE, W., GIVENS, S., HALL, D., HENRY, D., JACKSON, J. D., KRISHEN, A., LIU, T., RYDER, S., SANKOH, A. J., WANG, J. and YEH, C.-H. (2007). Multiple Co-primary Endpoints: Medical and Statistical Solutions: A Report from the Multiple Endpoints Expert Team of the Pharmaceutical Research and Manufacturers of America. Drug Information Journal 41 31–46. https://doi.org/10.1177/009286150704100105


Authors:

(1) Mårten Schultzberg, Experimentation Platform team, Spotify, Stockholm, Sweden;

(2) Sebastian Ankargren, Experimentation Platform team, Spotify, Stockholm, Sweden;

(3) Mattias Frånberg, Experimentation Platform team, Spotify, Stockholm, Sweden.


This paper is available on arxiv under CC BY 4.0 DEED license.

