Balding Skips a Generation on Your Mother’s Side

Grandpa Foss had a solid head of hair into his 90s. Grandpa Roberts also had a solid shock of black hair. Despite this, my father and most of my uncles had typical male-pattern baldness. This didn’t worry me as a teenager, though. After all, we all know that hair loss skips a generation; I heard this from my Mom when I was younger. Despite the infallibility of this scientific truth, I started losing my hair when I was 23.

For Christmas 2016 I got a 23andMe kit for myself. All my friends were doing it and having a fun time. When I got my results back, nothing was that surprising. What I liked the most about 23andMe was the ability to download my results and play with them on my own. I read this blog post a while back about analyzing your genetic results using the fantastic Bioconductor packages for R. I decided to give it a modest try.

It was fun. By fun I mean innocent and meaningless; I didn’t read too much into it because of my sophomoric knowledge of bioinformatics. Despite that, I decided to write about it and explain why the ‘baldness-skips-a-generation’ truism is in fact wrong. I’m about to rewrite the science of baldness here!

First, getting your 23andMe data is super easy. Follow the instructions on their website and then unzip the file you download to a location on your hard drive. They describe their data as:

The 23andMe genotyping platform detects single nucleotide polymorphisms (SNPs). A SNP is a DNA location, or “marker,” in the genome that has been shown to vary among people in terms of the DNA base or bases. There are four DNA bases: adenine (A), thymine (T), guanine (G), and cytosine (C). So, for example, at the same genomic location, you might have a C and someone else might have a T. These DNA base differences are known as “variants.”

The data is in a simple format: a tab-delimited file with four columns. RSID is the reference SNP ID, the unique identifier assigned to each SNP. Chrom is the chromosome, which runs from 1 to 22 plus X, Y and MT (for mitochondrial). Position is the SNP’s location on that chromosome. Finally, GenoType is, well, the genotype: the DNA bases found at that position. In your 23andMe data you will usually have two bases at every location because you inherit one copy from your mother and one from your father. Men carry a single X and a single Y, so their genotypes on those chromosomes can be a single letter.

Here is what the file format looks like:

Once you’ve downloaded the data, the process to load it into R is simple.

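# Read the tab-delimited 23andMe export; read.table() skips '#' comment lines by default.
# Caveat: if the column-name line also starts with '#', header = TRUE consumes the
# first SNP row instead, and header = FALSE would keep it.
me <- read.table(file = "D:/Data/23andme/genome_Greg_Foss_v4_Full_20180317144539.txt",
                 sep = "\t", header = TRUE,
                 colClasses = c("character", "character", "numeric", "character"),
                 col.names = c("rsid", "chromosome", "position", "genotype"))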
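
One detail of that call is worth checking against your own file: read.table() treats every line beginning with # as a comment. The sketch below assumes the export prefixes its column-name line with # too. The sample rows are invented (placeholder rsids, positions and genotypes), not real data. Under that assumption, header = TRUE silently uses the first SNP as the header row, while header = FALSE keeps it.

```r
# Hypothetical miniature of a 23andMe export; every value below is invented.
sample_txt <- paste(
  "# This data file generated by 23andMe",
  "# rsid\tchromosome\tposition\tgenotype",
  "rs0000001\t1\t1000\tAA",
  "rs0000002\t1\t2000\tCT",
  "rs0000003\tX\t3000\tG",
  sep = "\n"
)

cols <- c("rsid", "chromosome", "position", "genotype")
classes <- c("character", "character", "numeric", "character")

# header = TRUE: the first non-comment line is consumed as the header.
with_header <- read.table(text = sample_txt, sep = "\t", header = TRUE,
                          colClasses = classes, col.names = cols)

# header = FALSE: every non-comment line is kept as data.
without_header <- read.table(text = sample_txt, sep = "\t", header = FALSE,
                             colClasses = classes, col.names = cols)

print(nrow(with_header))     # 2 rows: rs0000001 was swallowed as the header
print(nrow(without_header))  # 3 rows
```

If the two row counts differ by one on your own file as well, header = FALSE is the safer setting.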

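If we look at the data, the first thing we notice is that some genotypes are ‘--’. What is that? In the raw download, a double dash normally marks a no-call: a position where the genotype could not be determined.

The 23andMe site also describes a related special case, insertions and deletions: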

Occasionally, for some SNPs on the 23andMe platform, your genotype may be reported as an insertion or deletion (–) of DNA bases instead of just a simple variant pair. Depending on the genomic location, either an insertion or deletion could represent the typical version of the SNP. In other words, there are some markers in which having an extra base (insertion) is the typical variant and having a deletion is the less common variant. Conversely, there are some places in the genome where an insertion is rare, making a deletion the typical variant at that location.

Now we should load data from Bioconductor and do some matching and graphing and blah blah… I’m going to do that, but I want to take a little tangential journey first. A word of caution: do not click on this SNPedia link. Don’t do it… this rabbit hole goes a long, long way.

This is a site where you can look up SNPs and research that has been done in relation to them.

Just enter the trait to research in the search bar. 

It will give you a list of relevant SNPs and other useful information, including research papers. It is simple to use RStudio to just look up the traits. No programming necessary (too bad, but no worries, we’ll get to more programming in a bit).

The first one is rs6152, which is located (according to the documentation) on the X chromosome. The risk allele is G. Let’s see if that’s what I got. Enter ‘rs6152’ into the search field in your RStudio table viewer.

Well, that’s odd. I’m not at risk? According to further documentation on the site, men with the A allele also develop baldness as they age. There are three other SNPs to look up.

The entry for rs2223841 says that men with the risk allele, A, are more likely to go bald before 40, and men with G are less likely.



I’m not very good at this. I have a ‘T’.

The entry for rs1160312 says that a combination of ‘AG’ increases the risk of baldness in Caucasian men by 1.6 times.



I got that one alright.

Finally, the entry for rs2180439 says that a combination of ‘TT’ increases the risk of baldness by 2 times.



Got that one too. Not an exact science but an entertaining process.
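
The same four lookups can also be done in code instead of through the table viewer. This is only a sketch: me_demo is a hypothetical stand-in for the full data frame, filled with the genotypes mentioned above (the ‘A’ for rs6152 is inferred from the “not at risk” remark) plus one invented filler row.

```r
# Stand-in for the full `me` data frame. Genotypes are the ones mentioned in
# the text (rs6152 is inferred as "A"); rs0000001 is an invented filler row.
me_demo <- data.frame(
  rsid = c("rs0000001", "rs6152", "rs2223841", "rs1160312", "rs2180439"),
  genotype = c("CC", "A", "T", "AG", "TT"),
  stringsAsFactors = FALSE
)

# The four baldness-related SNPs discussed in the text.
baldness_snps <- c("rs6152", "rs2223841", "rs1160312", "rs2180439")

# Keep only the rows whose rsid is in the list of interest.
baldness_hits <- me_demo[me_demo$rsid %in% baldness_snps, ]
print(baldness_hits)
```

With the real data, swapping me_demo for me gives the same result as four separate searches.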

Now to Bioconductor for R. This is a robust and well-documented collection of packages. To install it, follow the instructions on their site. Installation works differently from most other R packages, so make sure to read the instructions. What is fantastic is that Bioconductor includes a package called gwascat. It is an interface to the National Human Genome Research Institute’s catalog of genome-wide association studies.

After installing Bioconductor you will want to take some time to read the vignettes using the command browseVignettes("gwascat"). Warning… another rabbit hole!


Once you’ve entertained yourself enough with SNPedia and the vignette for gwascat, we can move on. There is one more piece of documentation to peruse, and that is the manual for the gwascat package. This will give us the data for ‘gwdf_2014_09_08’. In this data are 17832 observations with such interesting things as SNPs, PubMed links, disease traits, risk alleles and more. Because we want to peruse our risks and make ourselves paranoid, we will merge our 23andMe data with the SNPs, the PubMed link, Disease/Trait and risk allele data from gwascat.

First, we load our libraries.

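library(gwascat)  # GWAS catalog data (gwdf_2014_09_08)
library(dplyr)    # %>% and filter()
library(ggplot2)  # ggplot() and geom_bar()
library(stringr)  # str_sub()
library(DT)       # datatable() and JS()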

Next, we load our 23andMe data.

me <- read.table(file = "D:/Data/23andme/genome_Greg_Foss_v4_Full_20180317144539.txt",
                 sep = "\t", header = TRUE,
                 colClasses = c("character", "character", "numeric", "character"),
                 col.names = c("rsid", "chromosome", "position", "genotype"))

To get a visual representation of our chromosome distribution we can call on ggplot.

ggplot(me) +
  geom_bar(aes(x = ordered(chromosome, levels = c(1:22, "X", "Y", "MT"))))

We can perform a few standard glancing routines to look at our overall data and structure.


We can see that I have 601,885 observations.

I’m curious about my genotypes, so I can run a simple table() call and see how many of each combination I have.
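
A minimal sketch of that kind of tally follows. The genotype vector is invented for illustration; with the real data it would be me$genotype.

```r
# Invented genotype calls standing in for me$genotype.
genotypes <- c("CC", "GG", "CT", "AG", "CC", "--", "GG", "CC", "T")

# Count each distinct genotype and list the most frequent first.
genotype_counts <- sort(table(genotypes), decreasing = TRUE)
print(genotype_counts)

# The name of the most common genotype in this toy vector.
most_common <- names(genotype_counts)[1]
print(most_common)
```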


It seems that CC and GG are the most common combinations.

Pulling the data in from the gwascat package is easy.


The first thing we’ll do is roughly filter down to the studies whose samples are of European ancestry, because that’s what I am.

filtered_gwdf <- gwdf_2014_09_08 %>% filter(grepl('Europe', `Initial Sample Size`))

The next step is to just take the columns we are interested in. These are Link, Disease/Trait, SNPs and Strongest SNP-Risk Allele. First, let’s rename ‘Strongest SNP-Risk Allele’ to something without spaces or hyphens so we don’t have issues later.

names(filtered_gwdf)[names(filtered_gwdf)=="Strongest SNP-Risk Allele"] <- "RiskAllele"

We also just want a few of the columns at this time, not all of them.

gwdf <- filtered_gwdf[ , which(names(filtered_gwdf) %in% c("Link", "Disease/Trait", "SNPs", "RiskAllele"))]

Let’s clean up RiskAllele so it only has the allele and not the SNP.

gwdf$RiskAllele <- str_sub(gwdf$RiskAllele, -1)
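
Taking the last character works when every entry ends in a single base. As a hedged alternative, the sketch below uses base R’s sub() rather than str_sub() to keep everything after the last hyphen, and it turns entries with no reported allele into NA. The input strings are invented examples of the ‘rsid-allele’ format, and the assumption that unreported alleles appear as ‘?’ should be checked against the actual column.

```r
# Invented examples of the "rsid-allele" format; "?" marks an unreported allele.
strongest <- c("rs0000001-A", "rs0000002-T", "rs0000003-?")

# Drop everything up to and including the last hyphen.
risk_allele <- sub(".*-", "", strongest)

# Treat unreported alleles as missing instead of matching on "?".
risk_allele[risk_allele == "?"] <- NA
print(risk_allele)
```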

Now that our gwascat looks pretty, just merge it with the 23andMe data.

me_gwdf <- merge(gwdf, me, by.x="SNPs", by.y="rsid")

What we want to see is whether we have the risk allele from both our parents. So, compare the first and second positions of the genotype column with the risk allele column and store the matches in a data frame titled ‘risks’.

Note: I understand how ‘wrong’ this is. For instance, in SNPedia there is a reference to baldness that states “The risk allele is assumed to be rs1160312-A. Note that a haplotype consisting of rs201571-T – rs6036025-G has a perfect correlation (r2=1) with rs1160312-A, at least in the Caucasian population studied.” What I ‘really’ want to do is mine these correlating SNPs and run them against my personal data. However, that is a task for another time; for now we’ll keep it simple.

risks <- me_gwdf %>%
  filter(substr(genotype, 1, 1) == RiskAllele &
         substr(genotype, 2, 2) == RiskAllele)
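
The filter above keeps only SNPs where both copies match the risk allele. A variation, sketched here in base R on invented rows (placeholder rsids and genotypes, hypothetical names gwdf_demo and me_demo), counts how many copies match, so one-copy carriers are visible too. It is an illustration of the comparison, not a replacement for the approach used in this post.

```r
# Invented catalog rows: one risk allele per SNP.
gwdf_demo <- data.frame(
  SNPs = c("rs0000001", "rs0000002", "rs0000003"),
  RiskAllele = c("A", "T", "G"),
  stringsAsFactors = FALSE
)

# Invented personal genotypes; rs0000004 has no catalog entry.
me_demo <- data.frame(
  rsid = c("rs0000001", "rs0000002", "rs0000003", "rs0000004"),
  genotype = c("AA", "CT", "CC", "GG"),
  stringsAsFactors = FALSE
)

# Inner join on the SNP identifier, as in the merge() call used earlier.
merged <- merge(gwdf_demo, me_demo, by.x = "SNPs", by.y = "rsid")

# Count how many of the two genotype positions equal the risk allele (0, 1 or 2).
first_base <- substr(merged$genotype, 1, 1)
second_base <- substr(merged$genotype, 2, 2)
merged$risk_copies <- (first_base == merged$RiskAllele) +
  (second_base == merged$RiskAllele)

# Same subset the filter in the text keeps: risk allele on both copies.
both_copies <- merged[merged$risk_copies == 2, ]
print(merged)
```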

For our final output, we want to be able to click on the link and go read the PubMed study abstract, so we must structure the link as an HTML link.

risks$Link <- paste0("<a href='", risks$Link, "'>PMID</a>")

Finally, pull the whole thing into a DataTable so we can peruse it easily and make ourselves completely worried about our future health.

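# escape = FALSE lets the HTML links render; the callback styles the table header.
datatable(risks, escape = FALSE, options = list(
  initComplete = JS("
    function(settings, json) {
      $(this.api().table().header()).css({
        'background-color': '#000',
        'color': '#fff'
      });
    }")
))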

Our result is a nice clean table that we can search and filter easily. What I’m curious about is SNP rs10772939. It has something to do with how my economic and political preferences might have some relationship to my genes! Yep, I’ve provided multiple rabbit holes on this one.

And of course, why we started this whole thing…to find out why I started balding at 23. By the way…thanks a lot Mom.
