
Train, Test, and Validation Sets

Understanding Dataset Splits: Train, Test, and Validation Sets


Last updated 8 months ago

The training data set is the portion of data used to train a machine learning model, allowing it to learn patterns and relationships.

The test data set is a separate, unseen portion used to evaluate the model's performance on new data.

A validation data set is a portion of the original data that is set aside during model training and not used in the training process. It serves as "unseen" data to evaluate the model's performance and accuracy, helping to fine-tune hyperparameters and prevent overfitting before final testing on the actual test data.

In machine learning, data is typically split into three subsets: training, validation, and test sets, each serving a unique purpose.

The model learns patterns and relationships from the training data set and is then scored on the test data set, a separate portion it has not seen. A single split can give a misleading estimate, so in practice cross-validation is often employed to assess model accuracy more reliably. The data are randomly split into training and test sets several times, a model is trained on each training subset, and each model is evaluated on its corresponding test set. Comparing results across the splits shows whether performance is consistent and helps detect overfitting. The number of cross-validation splits can be adjusted in the settings using the "Nr. of models" parameter.
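The repeated random splitting described above can be sketched in plain Python. This is a minimal illustration, not GWAStic's implementation: the function names, the toy mean-predictor "model", the rounding rule, and the phenotype values are all assumptions made for the example, and `n_models` simply mirrors the role of the "Nr. of models" setting.

```python
import random
from statistics import mean


def random_splits(n_samples, test_fraction, n_models, seed=0):
    """Yield (train_idx, test_idx) pairs for n_models independent random splits."""
    rng = random.Random(seed)
    # Assumed rounding rule; always keep at least one test sample.
    n_test = max(1, round(n_samples * test_fraction))
    for _ in range(n_models):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        yield indices[n_test:], indices[:n_test]


def cross_validate(phenotype, test_fraction=0.3, n_models=5, seed=0):
    """Return one test-set mean squared error per random split."""
    errors = []
    splits = random_splits(len(phenotype), test_fraction, n_models, seed)
    for train_idx, test_idx in splits:
        # Toy "model": predict the training-set mean for every test sample.
        prediction = mean(phenotype[i] for i in train_idx)
        errors.append(mean((phenotype[i] - prediction) ** 2 for i in test_idx))
    return errors


# Made-up phenotype values, for illustration only.
phenotype = [4.1, 5.3, 3.8, 6.0, 5.5, 4.7, 5.1, 3.9, 6.2, 4.4]
errors = cross_validate(phenotype, test_fraction=0.3, n_models=5)
print(f"mean test MSE over {len(errors)} splits: {mean(errors):.3f}")
```

A large spread between the per-split errors suggests that a single train/test split would have given an unreliable estimate.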

The validation data set is a portion of the original data that is set aside before model training and is never used in the training process. Because the model has not seen these samples, the validation accuracy reflects how well it generalizes, which is useful when tuning hyperparameters and checking for overfitting. Users must create the validation set themselves by excluding certain phenotypes from the original dataset, so that it remains independent of the training set and can properly validate the model.
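One way to prepare such a hold-out set is sketched below. It assumes the phenotypes are available as a mapping from sample ID to value and interprets "excluding phenotypes" as removing the records of randomly chosen samples. The function name, the data layout, and the sample IDs are hypothetical and do not reflect GWAStic's file format or API.

```python
import random


def hold_out_validation(phenotypes, validation_fraction=0.2, seed=42):
    """Split {sample_id: value} into (training_table, validation_truth)."""
    rng = random.Random(seed)
    sample_ids = sorted(phenotypes)
    # Assumed rounding rule; always hold out at least one sample.
    n_val = max(1, round(len(sample_ids) * validation_fraction))
    validation_ids = set(rng.sample(sample_ids, n_val))
    # Records passed on for training and testing: held-out samples are removed.
    training_table = {s: v for s, v in phenotypes.items() if s not in validation_ids}
    # True values kept aside to check predictions after training.
    validation_truth = {s: phenotypes[s] for s in validation_ids}
    return training_table, validation_truth


# Made-up sample IDs and phenotype values, for illustration only.
phenotypes = {f"sample_{i:02d}": 4.0 + 0.25 * i for i in range(10)}
training_table, validation_truth = hold_out_validation(phenotypes)
print(len(training_table), "samples kept,", len(validation_truth), "held out")
```

Fixing the seed keeps the held-out samples the same across runs, so validation results from different models stay comparable.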

In GWAStic, the training and test datasets are generated automatically. In the settings menu, users can adjust the test/train split percentage to suit their needs. The validation set, in contrast, is not created automatically: users must manage it manually by excluding phenotypes from the training set to ensure accurate model validation.
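To see how the three subsets relate in size, the following sketch computes sample counts from a total, a manually held-out validation count, and a test fraction. The function name and the rounding rule are assumptions for illustration; GWAStic may compute its split sizes differently.

```python
def split_sizes(n_total, n_validation, test_fraction):
    """Return sample counts per subset; the rounding rule is an assumption."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1 (exclusive)")
    if not 0 <= n_validation < n_total:
        raise ValueError("n_validation must be non-negative and below n_total")
    # Validation samples are removed first; the rest is split into train/test.
    n_remaining = n_total - n_validation
    n_test = round(n_remaining * test_fraction)
    return {
        "train": n_remaining - n_test,
        "test": n_test,
        "validation": n_validation,
    }


sizes = split_sizes(n_total=500, n_validation=50, test_fraction=0.3)
print(sizes)
```

With 500 samples, 50 held out for validation, and a 30% test share, this leaves 315 samples for training and 135 for testing.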

Training and validation strategy for genomic prediction models.