Many GWAS provide their data in supplemental tables, and these data are available on GRASP, the GWAS Catalog, and GWASdb. GRASP is the most comprehensive of the three, with over 2 million SNPs compared to closer to 20k for the other two. Unfortunately, it is also the only one of these databases that does not include risk allele data.
I asked the GRASP mailing list at [email protected] whether there is any way to get risk allele data from GRASP, and they confirmed that there is not.
Getting Risk Alleles from GRASP
To get risk alleles from GRASP, the fastest way is to pull the PubMed ID and paper location of the study for the rsID/trait of interest. If you need to query many SNPs, the most efficient approach is to use grasp.readthedocs.io to dump a pandas dataframe of the SNPs that includes only rsID, trait, PubMed ID, and paper location. You can then get the paper directly from PubMed by prepending https://www.ncbi.nlm.nih.gov/pubmed/ to the ID and following the links to the tables.
Important: If the paper_loc field is 'FullData', you can get the data file from the GRASP updates site.
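As a rough illustration of the lookup step, the sketch below turns a list of SNP records into one entry per paper, with the PubMed URL built by prepending the base address and a flag for 'FullData' locations. The record layout (plain dicts with rsid, trait, pmid, and paper_loc keys) and all values are made up for the example; this is not the output format of any particular GRASP tool.

```python
PUBMED_BASE = "https://www.ncbi.nlm.nih.gov/pubmed/"


def paper_lookup(records):
    """Return one entry per distinct (pmid, paper_loc) pair.

    `records` is assumed to be an iterable of dicts with at least the
    keys 'pmid' and 'paper_loc' (a hypothetical layout for this sketch).
    """
    seen = set()
    papers = []
    for rec in records:
        key = (rec["pmid"], rec["paper_loc"])
        if key in seen:
            continue  # several SNPs often point at the same table
        seen.add(key)
        papers.append({
            "pmid": rec["pmid"],
            "url": PUBMED_BASE + str(rec["pmid"]),
            "paper_loc": rec["paper_loc"],
            # 'FullData' means the file lives on the GRASP updates site
            "from_grasp_updates": rec["paper_loc"] == "FullData",
        })
    return papers


# Placeholder records, not real SNPs or PubMed IDs.
records = [
    {"rsid": "rs0000001", "trait": "trait A", "pmid": 10000001, "paper_loc": "Table 2"},
    {"rsid": "rs0000002", "trait": "trait A", "pmid": 10000001, "paper_loc": "Table 2"},
    {"rsid": "rs0000003", "trait": "trait B", "pmid": 10000002, "paper_loc": "FullData"},
]

for paper in paper_lookup(records):
    print(paper["url"], paper["paper_loc"], paper["from_grasp_updates"])
```

Deduplicating on the PubMed ID and paper location keeps the list of tables to open by hand as short as possible.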
Tables can be extracted from PDFs using tabula, and from there they can usually be copied directly into vim and then into LibreOffice.
Because the data will be intersected with the GRASP database, keep only the minimal set of columns needed to determine the risk allele, plus enough information (rsID, trait, and PMID) to do the intersection with GRASP.
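The intersection itself can be as simple as a join on the three identifying fields. The sketch below uses plain dicts and a lookup keyed on (rsid, trait, pmid); the field names, the risk_allele column, and the sample values are all assumptions made for illustration.

```python
def intersect_with_grasp(curated, grasp_rows):
    """Attach curated risk alleles to GRASP rows sharing (rsid, trait, pmid).

    Both inputs are assumed to be lists of dicts. Key values must have
    matching types on both sides (e.g. pmid as int in both).
    """
    risk = {
        (c["rsid"], c["trait"], c["pmid"]): c["risk_allele"]
        for c in curated
    }
    merged = []
    for row in grasp_rows:
        key = (row["rsid"], row["trait"], row["pmid"])
        if key in risk:
            # Copy the GRASP row and add the curated risk allele
            merged.append({**row, "risk_allele": risk[key]})
    return merged


# Placeholder data, not real SNPs or studies.
curated = [
    {"rsid": "rs0000001", "trait": "trait A", "pmid": 10000001, "risk_allele": "A"},
]
grasp_rows = [
    {"rsid": "rs0000001", "trait": "trait A", "pmid": 10000001, "pvalue": 1e-9},
    {"rsid": "rs0000002", "trait": "trait A", "pmid": 10000001, "pvalue": 3e-8},
]

print(intersect_with_grasp(curated, grasp_rows))
```

If both tables are already pandas dataframes, the same result should come from DataFrame.merge with on set to the same three columns.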
Getting raw data from GRASP updates
GRASP hasn’t been updated since 2013, but the raw results from several papers are available on the GRASP updates webpage. Before downloading anything, check the readmes to make sure the study of interest has sufficient data to get the risk allele.
Getting Risk Alleles From GWAS Tables
In many cases it just isn’t possible to use data from a published GWAS to find the risk allele. Many studies do not publish enough information, either skipping allele information altogether and publishing only p-values, or publishing an odds ratio or beta without stating how it was calculated.
However, there are a few tricks/heuristics that can be used to get the data:
- Sometimes the way the OR was calculated (e.g. with respect to the minor allele) is given in the methods or table description. The allele they identify as the coded allele is then allele1 for the OR calculation and thus the risk allele is the coded allele if the OR is greater than 1.
- A large number of studies provide an allele1, allele2, and OR. In these cases the OR was almost always calculated with respect to allele1, and thus the risk allele is allele1 if the OR is greater than 1. However, as the authors give no explicit information in this case, you need to treat this data with a large grain of salt: you could easily have it backwards.
- Many studies give the MAF for both cases and controls along with the minor allele, sometimes with an OR as well. In this case the risk allele is the minor allele if the MAF is higher in cases than in controls. When studies report both the MAFs and the OR, in my experience of a few dozen studies, the MAF comparison and the OR direction always match, which gives me more confidence in the previous heuristic (allele1, allele2, and OR).
- Sometimes studies do not provide the minor allele, but just provide the MAFs. In these cases you can still get the risk allele if you can query the SNP on dbSNP. The key is that the population studied should be the same as the population data in dbSNP, and dbSNP should have only one alternate allele for that SNP. Sometimes there are multiple alternate alleles; in that case you can never be confident that you have the right minor allele. Where there is clearly one major and one minor allele for your population, though, you can use that to pick the risk allele by comparing the case and control MAFs as described above.
- If the effect size is given and explicitly stated to be for one allele (usually A1), then a negative effect means A2 is the risk allele; A1 would thus be the risk allele given a positive effect.
- Sometimes a direction column is given, usually in meta-analyses. This will have a format something like '++', '+-', or '--'. This tells you the direction of effect in each of the meta-analyzed sub-components. Thus, a straight run of pluses means A1 is the risk allele, a straight run of minuses means A2 is the risk allele, and anything else requires more complex interpretation.
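The heuristics above can be written down as a handful of small functions. This is only a sketch under the assumptions stated in the list: the OR or effect is taken to be relative to the first allele, the SNP is treated as biallelic (so if one allele is not the risk allele, the other is), and every function returns None when the data give no clear direction. The function and parameter names are made up for the example.

```python
def risk_from_or(allele1, allele2, odds_ratio):
    """Risk allele from an OR assumed to be computed for allele1."""
    if odds_ratio > 1:
        return allele1
    if odds_ratio < 1:
        return allele2
    return None  # OR of exactly 1 gives no direction


def risk_from_maf(minor, major, maf_cases, maf_controls):
    """Risk allele from minor allele frequencies in cases vs controls."""
    if maf_cases > maf_controls:
        return minor
    if maf_cases < maf_controls:
        return major
    return None  # identical frequencies give no direction


def risk_from_effect(a1, a2, effect):
    """Risk allele from an effect size explicitly stated to be for a1."""
    if effect > 0:
        return a1
    if effect < 0:
        return a2
    return None


def risk_from_direction(a1, a2, direction):
    """Risk allele from a meta-analysis direction string such as '++'."""
    signs = set(direction)
    if signs == {"+"}:
        return a1
    if signs == {"-"}:
        return a2
    return None  # mixed, empty, or unknown symbols need manual review


print(risk_from_or("A", "G", 1.3))
print(risk_from_maf("T", "C", 0.21, 0.25))
print(risk_from_effect("A", "G", -0.05))
print(risk_from_direction("A", "G", "+-"))
```

Returning None instead of guessing makes it easy to collect the ambiguous SNPs afterwards and go back to the paper for them.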