FastPval: A fast and memory efficient program to calculate very low P-values from empirical distribution

Mulin Jun Li; Pak Chung Sham; Junwen Wang

doi:10.1093/bioinformatics/btq540

Research output: Contribution to journal › Article › peer-review

17 Scopus citations

Abstract

Motivation: Resampling methods, such as permutation and bootstrap, have been widely used to generate an empirical distribution for assessing the statistical significance of a measurement. However, obtaining a very low P-value requires a large number of resamples, at which point computing speed, memory and storage consumption become bottlenecks and the computation sometimes becomes impossible, even on a computer cluster. Results: We have developed a multi-stage P-value calculating program called FastPval that can efficiently calculate very low (down to 10⁻⁹) P-values from a large number of resampled measurements. With only two input files and a few parameter settings from the user, the program can compute P-values from an empirical distribution very efficiently, even on a personal computer. When tested on the order of 10⁹ resampled data, our method uses only 52.94% of the time used by the conventional method, implemented by standard quicksort and binary search algorithms, and consumes only 0.11% of the memory and storage. Furthermore, our method can be applied to extra-large datasets that the conventional method fails to process. The accuracy of the method was tested on data generated from Normal, Poisson and Gumbel distributions and was found to be no different from the exact ranking approach.
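For orientation, the conventional baseline mentioned in the abstract (sort the resampled values, then binary-search each observed value) can be sketched roughly as below. This is a minimal illustration and not the FastPval implementation: the function name `empirical_p_values`, the upper-tail convention, and the plain `count / n` estimator are assumptions made here, and Python's built-in sort stands in for quicksort.

```python
from bisect import bisect_left


def empirical_p_values(resampled, observed):
    """Upper-tail empirical P-values via sort + binary search.

    Assumption: P(x) = (# resampled values >= x) / n. The abstract does
    not state the exact estimator, so this convention is illustrative.
    """
    # Sort once; Python's sorted() is Timsort, used here in place of quicksort.
    null_sorted = sorted(resampled)
    n = len(null_sorted)
    p_values = []
    for x in observed:
        # Index of the first resampled value >= x, so n - idx values are >= x.
        idx = bisect_left(null_sorted, x)
        p_values.append((n - idx) / n)
    return p_values


# Toy data (hypothetical values, for illustration only).
null = [0.1, 0.4, 0.4, 0.9, 1.3, 2.0, 2.2, 3.5]
print(empirical_p_values(null, [2.1, 0.4, 5.0]))  # [0.25, 0.875, 0.0]
```

With this estimator the smallest non-zero P-value is 1/n, so resolving P-values near 10⁻⁹ needs on the order of 10⁹ resampled values held in sorted form, which is the memory and storage cost the abstract describes.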

Original language	English (US)
Pages (from-to)	2897-2899
Number of pages	3
Journal	Bioinformatics
Volume	26
Issue number	22
DOIs	https://doi.org/10.1093/bioinformatics/btq540
State	Published - Nov 2010

ASJC Scopus subject areas

Statistics and Probability
Biochemistry
Molecular Biology
Computer Science Applications
Computational Theory and Mathematics
Computational Mathematics

Access to Document

10.1093/bioinformatics/btq540

Cite this

@article{13c5784fe96c48478a9e97784d1c603d,

title = "FastPval: A fast and memory efficient program to calculate very low P-values from empirical distribution",

abstract = "Motivation: Resampling methods, such as permutation and bootstrap, have been widely used to generate an empirical distribution for assessing the statistical significance of a measurement. However, obtaining a very low P-value requires a large number of resamples, at which point computing speed, memory and storage consumption become bottlenecks and the computation sometimes becomes impossible, even on a computer cluster. Results: We have developed a multi-stage P-value calculating program called FastPval that can efficiently calculate very low (down to $10^{-9}$) P-values from a large number of resampled measurements. With only two input files and a few parameter settings from the user, the program can compute P-values from an empirical distribution very efficiently, even on a personal computer. When tested on the order of $10^{9}$ resampled data, our method uses only 52.94% of the time used by the conventional method, implemented by standard quicksort and binary search algorithms, and consumes only 0.11% of the memory and storage. Furthermore, our method can be applied to extra-large datasets that the conventional method fails to process. The accuracy of the method was tested on data generated from Normal, Poisson and Gumbel distributions and was found to be no different from the exact ranking approach.",

author = "Li, {Mulin Jun} and Sham, {Pak Chung} and Wang, Junwen",

note = "Funding Information: Funding: Internal funds from the CRCG and the Genomic SRT of the University of Hong Kong; GRF 778609M and AoE M-04/04 from the Research Grants Council of Hong Kong.",

year = "2010",

month = nov,

doi = "10.1093/bioinformatics/btq540",

language = "English (US)",

volume = "26",

pages = "2897--2899",

journal = "Bioinformatics",

issn = "1367-4803",

publisher = "Oxford University Press",

number = "22",

}

TY - JOUR

T1 - FastPval

T2 - A fast and memory efficient program to calculate very low P-values from empirical distribution

AU - Li, Mulin Jun

AU - Sham, Pak Chung

AU - Wang, Junwen

N1 - Funding Information: Funding: Internal funds from the CRCG and the Genomic SRT of the University of Hong Kong; GRF 778609M and AoE M-04/04 from the Research Grants Council of Hong Kong.

PY - 2010/11

Y1 - 2010/11

AB - Motivation: Resampling methods, such as permutation and bootstrap, have been widely used to generate an empirical distribution for assessing the statistical significance of a measurement. However, obtaining a very low P-value requires a large number of resamples, at which point computing speed, memory and storage consumption become bottlenecks and the computation sometimes becomes impossible, even on a computer cluster. Results: We have developed a multi-stage P-value calculating program called FastPval that can efficiently calculate very low (down to 10^-9) P-values from a large number of resampled measurements. With only two input files and a few parameter settings from the user, the program can compute P-values from an empirical distribution very efficiently, even on a personal computer. When tested on the order of 10^9 resampled data, our method uses only 52.94% of the time used by the conventional method, implemented by standard quicksort and binary search algorithms, and consumes only 0.11% of the memory and storage. Furthermore, our method can be applied to extra-large datasets that the conventional method fails to process. The accuracy of the method was tested on data generated from Normal, Poisson and Gumbel distributions and was found to be no different from the exact ranking approach.

UR - http://www.scopus.com/inward/record.url?scp=78149251209&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=78149251209&partnerID=8YFLogxK

U2 - 10.1093/bioinformatics/btq540

DO - 10.1093/bioinformatics/btq540

M3 - Article

C2 - 20861029

AN - SCOPUS:78149251209

SN - 1367-4803

VL - 26

SP - 2897

EP - 2899

JO - Bioinformatics

JF - Bioinformatics

IS - 22

ER -
