GWAStic Documentation

Effect size and P-value

Effect size

For the machine learning methods XGBoost and Random Forest, effect size is calculated from feature importance: a score of how much each feature (in GWAS, a genetic variant) contributes to predicting the target outcome. Both models are tree-based and assign importance to a feature according to how much it helps improve predictions.

In Random Forest, feature importance is typically calculated in one of two ways: Gini importance or permutation importance. Gini importance measures how much splits on a feature reduce impurity across the decision trees. Permutation importance measures how much model accuracy drops when the feature's values are randomly shuffled; the larger the drop, the more important the feature.

In XGBoost, feature importance can be calculated in three ways: gain, cover, and frequency. Gain measures how much splits on a feature improve the model's training objective, cover reflects how many observations are affected by splits on the feature, and frequency (called weight in the XGBoost Python API) counts how often the feature is used for splits in the trees.
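The permutation importance idea described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not GWAStic's implementation: the data are simulated, `toy_model` is a hand-written rule standing in for a trained Random Forest, and the function names are hypothetical.

```python
import random

def accuracy(model, rows, labels):
    """Fraction of rows for which model(row) equals the true label."""
    correct = sum(1 for row, y in zip(rows, labels) if model(row) == y)
    return correct / len(rows)

def permutation_importance(model, rows, labels, n_repeats=20, seed=0):
    """Mean drop in accuracy after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    importances = []
    for j in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            # Shuffle column j only; all other columns stay untouched.
            column = [row[j] for row in rows]
            rng.shuffle(column)
            shuffled = [row[:j] + [v] + row[j + 1:] for row, v in zip(rows, column)]
            drops.append(baseline - accuracy(model, shuffled, labels))
        importances.append(sum(drops) / n_repeats)
    return importances

# Simulated data (assumption): feature 0 is a genotype coded 0/1/2 that
# fully determines the label, feature 1 is pure noise.
data_rng = random.Random(42)
rows = [[data_rng.choice([0, 1, 2]), data_rng.choice([0, 1, 2])] for _ in range(200)]
labels = [1 if row[0] >= 1 else 0 for row in rows]

def toy_model(row):
    """Hypothetical stand-in for a trained classifier; it only uses feature 0."""
    return 1 if row[0] >= 1 else 0

importances = permutation_importance(toy_model, rows, labels)
print(importances)
```

Shuffling the noise feature leaves accuracy unchanged, so its importance is zero, while shuffling the informative feature produces a clear drop. With a real model the same loop would call the fitted model's prediction function instead of `toy_model`.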

When the feature importance values from XGBoost and Random Forest are combined, their sum (or mean or median) gives an overall effect size for each feature. This value reflects how much the feature contributes to the models' prediction performance; unlike a regression coefficient, it does not indicate the direction of the effect.
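As a minimal sketch of this combination step, the snippet below aggregates per-feature importance scores from two models by sum, mean, or median. The SNP names and scores are made up for illustration, `combine_importances` is a hypothetical helper rather than a GWAStic function, and the scores from both models are assumed to be on a comparable scale.

```python
from statistics import mean, median

def combine_importances(per_model_scores, how="sum"):
    """Aggregate importance scores per feature across several models.

    per_model_scores: list of dicts mapping feature name -> importance.
    All dicts are assumed to contain the same feature names.
    """
    aggregators = {"sum": sum, "mean": mean, "median": median}
    aggregate = aggregators[how]
    features = per_model_scores[0].keys()
    return {f: aggregate([scores[f] for scores in per_model_scores]) for f in features}

# Made-up importance scores (assumption), one dict per model.
rf_scores = {"snp_1": 0.5, "snp_2": 0.25, "snp_3": 0.25}
xgb_scores = {"snp_1": 0.75, "snp_2": 0.25, "snp_3": 0.0}

effect_sum = combine_importances([rf_scores, xgb_scores], how="sum")
effect_mean = combine_importances([rf_scores, xgb_scores], how="mean")
effect_median = combine_importances([rf_scores, xgb_scores], how="median")
print(effect_sum)
```

With only two models the mean and the median coincide; they differ once three or more sets of scores are combined.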

P-value

In contrast to effect sizes based on feature importance, p-values in GWAS, such as those produced by FaST-LMM, come from a statistical test. GWAS evaluates the statistical association between each genetic variant and a phenotype. The p-value is the probability of observing an association at least as strong as the one found, assuming there is no real association (the null hypothesis). A low p-value indicates strong evidence that the variant is associated with the trait, while a high p-value means the data show no significant association. FaST-LMM computes these p-values from a linear mixed model, which accounts for population structure and relatedness among individuals.
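The definition of a p-value given above can be illustrated with a simple permutation test on a single variant. This is deliberately not the FaST-LMM method: it uses no linear mixed model and makes no correction for population structure or relatedness. It only shows what "at least as strong, assuming no real association" means. The data are simulated and the function names are hypothetical.

```python
import random

def correlation(xs, ys):
    """Pearson correlation between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def permutation_p_value(genotypes, phenotypes, n_permutations=2000, seed=0):
    """Share of shuffled datasets whose association is at least as strong
    as the observed one. Shuffling phenotypes simulates 'no real association'."""
    rng = random.Random(seed)
    observed = abs(correlation(genotypes, phenotypes))
    shuffled = list(phenotypes)
    at_least_as_strong = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if abs(correlation(genotypes, shuffled)) >= observed:
            at_least_as_strong += 1
    # Add-one correction so the estimate is never exactly zero.
    return (at_least_as_strong + 1) / (n_permutations + 1)

# Simulated data (assumption): genotypes coded 0/1/2 for 100 individuals.
data_rng = random.Random(1)
genotypes = [data_rng.choice([0, 1, 2]) for _ in range(100)]
associated_trait = [g + data_rng.gauss(0, 0.5) for g in genotypes]
unrelated_trait = [data_rng.gauss(0, 1) for _ in genotypes]

p_associated = permutation_p_value(genotypes, associated_trait)
p_unrelated = permutation_p_value(genotypes, unrelated_trait)
print(p_associated, p_unrelated)
```

The trait that was simulated to depend on the genotype gets a very small p-value, because almost no shuffled dataset shows an association as strong as the observed one. A model-based test such as the one in FaST-LMM reaches its p-values analytically rather than by shuffling.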


Last updated 8 months ago