🔄 Genomic Prediction

Start Genomic Prediction

Genomic Prediction interface

Any accessions included in the genotypic file without corresponding phenotypic data will automatically receive predicted phenotypes.

Please note that the prediction accuracy reported here may be inflated by overfitting and can overestimate the model's true accuracy. This is expected in the model tuning and comparison phase of the software, which lets users explore different model configurations and hyperparameters. The purpose of this phase is not to validate the models but to compare their relative performance during training. Users can set hyperparameters freely and fine-tune them based on the performance metrics reported at this stage.

For a robust assessment of model performance, an independent validation set should be used. We recommend that users provide a separate validation dataset to test the final trained models and assess their performance.
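One way to prepare such an independent validation set is to hold out a random subset of phenotyped accessions before training. The following is a minimal sketch only; the function name, the default fraction, and the fixed seed are illustrative assumptions and are not part of the software.

```python
import random


def split_train_validation(sample_ids, validation_fraction=0.2, seed=0):
    """Randomly split sample IDs into (training, validation) lists.

    Hypothetical helper for illustration; not a function of the software.
    """
    ids = list(sample_ids)
    if len(ids) < 2:
        raise ValueError("need at least two samples to split")
    if not 0.0 < validation_fraction < 1.0:
        raise ValueError("validation_fraction must be between 0 and 1")
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    rng.shuffle(ids)
    # Hold out at least one sample, but always leave one for training.
    n_validation = min(len(ids) - 1, max(1, round(len(ids) * validation_fraction)))
    return ids[n_validation:], ids[:n_validation]


# Example with the accession IDs from the phenotype example on this page.
all_ids = ["5837", "6009", "6898", "6900", "6901", "6903"]
train_ids, validation_ids = split_train_validation(all_ids, validation_fraction=0.33)
```

The training IDs would go into the phenotype file used for model tuning, while the validation phenotypes are kept aside and compared with the predictions only after the final model has been trained.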

  1. Choose BED: Click to select a BED file containing genotype data for genomic prediction.

  2. Choose phenotype: Click to select a file with phenotype data that will be used in the genomic prediction analysis.

  3. Algorithm dropdown: Select the algorithm for genomic prediction.

  4. Run button: Initiate the genomic prediction process with the selected BED file, phenotype data, and algorithm.
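As noted above, accessions present in the genotype file but absent from the phenotype file receive predicted phenotypes. The sketch below illustrates that partitioning with plain Python; the function name and data structures are assumptions made for illustration and do not reflect the software's internals.

```python
def partition_accessions(genotype_ids, phenotypes):
    """Split genotyped accessions by whether a phenotype is available.

    genotype_ids: iterable of accession IDs found in the genotype file.
    phenotypes:   dict mapping accession ID -> observed phenotype value.
    Returns (training_ids, ids_to_predict), preserving input order.
    """
    training_ids = [acc for acc in genotype_ids if acc in phenotypes]
    ids_to_predict = [acc for acc in genotype_ids if acc not in phenotypes]
    return training_ids, ids_to_predict


# Hypothetical example: two genotyped accessions have no phenotype record.
genotyped = ["5837", "6009", "6898", "6900", "7001", "7002"]
observed = {"5837": 1.0, "6009": 1.0, "6898": 1.0, "6900": 0.0}
training_ids, ids_to_predict = partition_accessions(genotyped, observed)
```

Here `training_ids` would be used to fit the model, and `ids_to_predict` lists the accessions whose phenotypes would be predicted.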

Correlation plot

A regression plot with correlation analysis is used in genomic prediction to assess the accuracy of prediction models. The plot compares predicted values against observed values, where a strong linear relationship indicates high predictive ability. The correlation coefficient quantifies this relationship, with values close to 1.0 signaling strong predictive accuracy.
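For readers who want to reproduce the correlation coefficient outside the interface, the sketch below computes Pearson's r between observed and predicted values. It assumes the plot reports the Pearson correlation, which the text above does not state explicitly; the function name and the example values are illustrative.

```python
import math


def pearson_r(observed, predicted):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(observed)
    if n != len(predicted) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_obs = sum(observed) / n
    mean_pred = sum(predicted) / n
    # Sums of cross-products and squared deviations from the means.
    cov = sum((o - mean_obs) * (p - mean_pred) for o, p in zip(observed, predicted))
    ss_obs = sum((o - mean_obs) ** 2 for o in observed)
    ss_pred = sum((p - mean_pred) ** 2 for p in predicted)
    if ss_obs == 0 or ss_pred == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(ss_obs * ss_pred)


# Made-up values for illustration only.
observed_values = [1.0, 2.0, 3.0, 4.0]
predicted_values = [1.1, 1.9, 3.2, 3.8]
r = pearson_r(observed_values, predicted_values)
```

A value of `r` near 1.0 corresponds to points lying close to a rising straight line in the plot.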


Bland-Altman plot (predicted vs real values)

Bland-Altman plot

Data points (blue dots): Each blue dot represents a pair of predicted and real values for an individual instance. The x-coordinate of a dot is the mean of the predicted and real values for that instance, and the y-coordinate is their difference (Predicted_value - Real_value). This arrangement shows how much each prediction deviates from the real value and whether this deviation is consistent across different levels of measurement.

Mean difference (red dashed line): This line represents the average difference between the predicted and actual values. Ideally, in a perfect prediction scenario, this line would be at zero, indicating no difference between predicted and real values. The position of this line (above or below zero) can indicate a systematic bias in the predictions; for instance, if it is above zero, the predictions are generally higher than the actual values.

Limits of agreement (green dashed lines): These lines mark the bounds within which most differences between predicted and actual values lie. They are calculated as the mean difference plus and minus 1.96 times the standard deviation of the differences. This assumes that the differences are normally distributed, so roughly 95% of the data points should lie between the lines. The wider the gap between the two lines, the greater the variability in the differences.
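The quantities described above can be computed directly from paired predicted and real values. The sketch below is a minimal illustration; the function name is hypothetical, and the use of the sample standard deviation (dividing by n - 1) is an assumption, since the text does not say which estimator the software uses.

```python
import math


def bland_altman(predicted, real):
    """Return the points and summary lines of a Bland-Altman plot."""
    if len(predicted) != len(real) or len(predicted) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    # x-coordinates: mean of each pair; y-coordinates: predicted - real.
    means = [(p + r) / 2 for p, r in zip(predicted, real)]
    differences = [p - r for p, r in zip(predicted, real)]
    n = len(differences)
    mean_difference = sum(differences) / n
    # Sample standard deviation of the differences (assumed estimator).
    sd = math.sqrt(sum((d - mean_difference) ** 2 for d in differences) / (n - 1))
    return {
        "means": means,
        "differences": differences,
        "mean_difference": mean_difference,
        "lower_limit": mean_difference - 1.96 * sd,
        "upper_limit": mean_difference + 1.96 * sd,
    }


# Made-up values: every prediction is above the real value (positive bias).
stats = bland_altman(predicted=[2.0, 4.0, 6.0], real=[1.0, 2.0, 3.0])
```

In this example the differences are 1, 2, and 3, so the mean difference is 2 and the standard deviation is 1, which places the limits of agreement at 2 ± 1.96.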

Supported file formats

Genotypic files

VCF (including vcf.gz) and PLINK BED/BIM/FAM formats are supported for all GWAS methods. VCF files must be converted to BED/BIM/FAM format.
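If the conversion has to be done by hand, one common route is PLINK 1.9, whose `--vcf`, `--make-bed`, and `--out` options write a BED/BIM/FAM fileset. The sketch below only builds and optionally runs that command from Python; it assumes a `plink` executable is on the PATH, the file names are placeholders, and it is not necessarily how the software performs the conversion.

```python
import subprocess


def build_plink_vcf_to_bed_command(vcf_path, out_prefix, plink_executable="plink"):
    """Build the PLINK 1.9 argument list that converts a VCF to BED/BIM/FAM."""
    return [plink_executable, "--vcf", vcf_path, "--make-bed", "--out", out_prefix]


def convert_vcf_to_bed(vcf_path, out_prefix, plink_executable="plink"):
    """Run PLINK; writes out_prefix.bed, out_prefix.bim and out_prefix.fam."""
    command = build_plink_vcf_to_bed_command(vcf_path, out_prefix, plink_executable)
    subprocess.run(command, check=True)  # raises CalledProcessError on failure


# Placeholder file names for illustration; nothing is executed here.
command = build_plink_vcf_to_bed_command("example.vcf.gz", "example")
```

Calling `convert_vcf_to_bed("example.vcf.gz", "example")` would then produce the three files expected by the Choose BED step.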

VCF example file

Phenotypic files

Phenotypic data must be provided as a three-column, space-delimited text file:

5837 5837 1
6009 6009 1
6898 6898 1
6900 6900 0
6901 6901 0
6903 6903 1
Phenotype Example File
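A file in this layout can be checked before upload with a few lines of Python. The sketch below assumes the three columns are family ID, individual ID, and phenotype value, following the PLINK phenotype-file convention; the documentation above does not name the columns, so treat that reading and the function name as assumptions.

```python
def parse_phenotypes(text):
    """Parse space-delimited phenotype lines into {(family_id, individual_id): value}."""
    phenotypes = {}
    for line_number, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 3:
            raise ValueError(
                f"line {line_number}: expected 3 columns, found {len(fields)}"
            )
        family_id, individual_id, value = fields
        phenotypes[(family_id, individual_id)] = float(value)
    return phenotypes


# First rows of the example above.
example_text = """5837 5837 1
6009 6009 1
6900 6900 0
"""
parsed = parse_phenotypes(example_text)
```

A line with a missing or extra column raises a `ValueError` that names the offending line, which makes formatting problems easy to locate.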
