# Train, Test, and Validation Sets

{% hint style="info" %}
The <mark style="color:red;">**training data set**</mark> is the portion of data used to train a machine learning model, allowing it to learn patterns and relationships.
{% endhint %}

{% hint style="info" %}
The <mark style="color:red;">**test data set**</mark> is a separate, unseen portion used to evaluate the model's performance on new data.
{% endhint %}

{% hint style="info" %}
A <mark style="color:red;">**validation data set**</mark> is a portion of the original data that is set aside during model training and not used in the training process. It serves as "unseen" data to evaluate the model's performance and accuracy, helping to fine-tune hyperparameters and prevent overfitting before final testing on the actual test data.
{% endhint %}

In machine learning, data is typically split into three subsets: training, validation, and test sets, each serving a unique purpose.

The **training data set** is the portion of the data used to train a machine learning model, allowing it to learn patterns and relationships. The **test data set** is a separate, unseen portion used to evaluate the model's performance on new data. In practice, **cross-validation** is often employed to assess model accuracy more reliably. The data is repeatedly split into different training and test subsets; a model is trained on each training subset and evaluated on the corresponding test subset. Averaging the results over these splits gives a more stable estimate of performance and helps detect overfitting. The number of cross-validation splits can be adjusted in the settings using the "Nr. of models" parameter.
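As an illustrative sketch only (not GWAStic's internal code), the split-and-cross-validate workflow described above might look as follows in a scikit-learn-style setup. The data, model choice, and the `n_models` variable (mirroring the "Nr. of models" parameter) are assumptions for demonstration:

```python
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))                    # e.g. genotype matrix (samples x markers)
y = X[:, 0] * 2.0 + rng.standard_normal(100) * 0.1   # e.g. phenotype values

# Hold out 20% as the test set (adjustable, like the test/train split setting)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Cross-validation on the training data: one score per split
n_models = 5  # number of cross-validation splits
scores = cross_val_score(Ridge(), X_train, y_train, cv=n_models)
print(f"Mean CV score (R^2): {scores.mean():.3f}")
```

Each of the `n_models` folds acts once as the held-out portion, so every training sample contributes to both fitting and evaluation across the splits.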

The **validation data set** is a portion of the original data that is set aside during model training and is not used in the training process. It serves as "unseen" data to evaluate the model’s performance, particularly for tuning hyperparameters and preventing overfitting. The **validation accuracy** reflects how well the model generalizes to unseen data. Users must carefully manage the validation set by excluding certain phenotypes from the original dataset, ensuring that it remains independent from the training set to properly validate the model.
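A minimal sketch of the manual step described above: holding out a validation set by removing selected phenotypes before training, so the model never sees them. The sample IDs, column names, and selection rule here are hypothetical, not part of GWAStic:

```python
import pandas as pd

# Hypothetical phenotype table (sample IDs and values are made up)
pheno = pd.DataFrame({
    "sample_id": [f"line_{i}" for i in range(10)],
    "phenotype": [4.2, 3.8, 5.1, 4.9, 3.5, 4.0, 5.3, 4.4, 3.9, 5.0],
})

# Reserve every 5th sample for validation; dropping these rows from the
# phenotype file used for training keeps the validation set independent.
validation = pheno.iloc[::5]
training = pheno.drop(validation.index)

print(len(training), len(validation))  # prints: 8 2
```

After training, predictions for the excluded samples can be compared against their true phenotypes to compute the validation accuracy.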

In the context of **GWAStic**, the training and test datasets will be generated automatically. In the settings menu, users can adjust the percentage of the test/train split size based on their needs, allowing for flexibility in the data preparation process. Additionally, users must manually manage the validation set by excluding phenotypes from the training set to ensure accurate model validation.

<figure><img src="/files/4k3rbHJsAqkW3RhvbpNW" alt=""><figcaption><p>Training and validation strategy for genomic prediction models.</p></figcaption></figure>


