New tool maximises the power of deep learning in genomics



Scientists have developed a deep learning tool that could help to accelerate the process of predicting and detecting disease-driving mutations in genes.

The universal programming tool, known as Janggu, streamlines the time-consuming process required for analysing genomics data and allows scientists to utilise deep learning to speed up their research.

Deep learning models involve algorithms sorting through massive amounts of data and finding relevant features or patterns.

While deep learning is a very powerful tool, its use in genomics has been limited to date.

The first scientific papers on deep learning in genomics were published in 2015. These studies mostly worked with fixed types of data and could answer only a single specific question.

Before researchers could begin their analysis, they spent a lot of time formatting and preparing huge data sets to feed into deep learning models.

Swapping or adding new data often required starting from scratch and extensive programming efforts.

Developed by researchers from the Max Delbrück Center for Molecular Medicine in the Helmholtz Association (MDC), Janggu simplifies this process by automatically converting a wide range of genomics data into the appropriate format for analysis by deep learning models.

Lead author of the research paper, Wolfgang Kopp, says: “For two or three years after the first publications were published, it was difficult for someone wanting to do their own research to get started, even if they were a very good programmer.

“This is because of what are called deep learning libraries, which do all the mathematics and reveal the information. Before you can use these libraries, you have to convert the genomic data set into a different format before they can work with it.

“Essentially, there was a gap between the raw file formats and the deep learning libraries. This meant that anyone who wanted to use deep learning had to do a lot of reinventing the wheel.”
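To give a sense of the conversion Kopp describes, the sketch below shows one common way a DNA string can be turned into numbers before a deep learning library can use it, known as one-hot encoding. It is a hand-written illustration only, not Janggu's own code or API; the function name `one_hot_encode` and the choice to map unknown bases to an all-zero row are assumptions made for this example.

```python
# Illustrative sketch only: a hand-rolled one-hot encoder, not Janggu's API.
ALPHABET = "ACGT"  # column order of the encoding (an assumption for this example)


def one_hot_encode(sequence):
    """Return one row per base, with a single 1 in the column for that base.

    Bases outside ACGT (for example N) become an all-zero row.
    """
    index = {base: i for i, base in enumerate(ALPHABET)}
    rows = []
    for base in sequence.upper():
        row = [0] * len(ALPHABET)
        if base in index:
            row[index[base]] = 1
        rows.append(row)
    return rows


# Print each base next to its numeric row.
for base, row in zip("ACGTN", one_hot_encode("ACGTN")):
    print(base, row)
```

Real data sets add further steps on top of this, such as reading file formats and selecting genomic regions, which is the kind of repetitive work the article says Janggu is meant to take over.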

The group of scientists aimed to create a standardised method for reading the information from genomic data and plugging it into genomic deep learning libraries.

Janggu allows researchers to input data directly and start modelling without the need for excessive pre-processing.

This is predicted to save computational biologists valuable time and increase the turnover of hypothesis testing within the field.

Kopp says: “Before now, genomic research required software engineers to spend a lot of time on the technical side before researchers could actually address the biological question.

“This was taking a lot of time and was dependent on the programming expertise of the people carrying out the research. By bridging this gap, researchers can now start by addressing the biological hypothesis, rather than spending valuable time on the technical aspects of pre-processing.”

One of the main advantages of deep learning in genomics is its ability to read and understand DNA sequences to predict and detect disease-driving mutations.

For example, models can be used to pinpoint particular mutations that have the potential to drive cancer.

Kopp explains: “The DNA sequence is a string of four letters. It is a 3 billion letter sequence of ACGT. If you were to look at this sequence you wouldn’t be able to make sense of it.

“Deep learning models can be seen as a mini brain that can be trained to understand what information is contained in the sequence. In other words, it can read the DNA sequence like you would read a book; it can see if there is a word that means something.

“You can train it to predict from this DNA sequence if there is a regulatory potential or a mutation at a specific site which could drive a disease like cancer.”
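The idea of spotting a "word" in the sequence, and of seeing how a mutation changes it, can be illustrated with a toy example. The sketch below slides a fixed six-letter pattern along a sequence and counts matching letters in each window, then compares a reference sequence with a copy carrying a single changed letter. Everything here is hypothetical and chosen for illustration: the pattern, both sequences and the function names. A trained deep learning model would learn its own patterns from data rather than use a hand-written one.

```python
# Toy sketch only: a fixed pattern stands in for what a trained model would learn.
MOTIF = "TATAAA"  # hypothetical "word" chosen for this example


def window_scores(sequence, motif=MOTIF):
    """Slide the motif along the sequence and count matching letters per window."""
    width = len(motif)
    return [
        sum(1 for a, b in zip(sequence[start:start + width], motif) if a == b)
        for start in range(len(sequence) - width + 1)
    ]


def best_score(sequence, motif=MOTIF):
    """Highest match count over all windows (0 if the sequence is too short)."""
    return max(window_scores(sequence, motif), default=0)


reference = "GGCTATAAAGGC"  # made-up sequence containing the motif
variant = "GGCTATCAAGGC"    # same sequence with one letter changed inside the motif

# A lower best score for the variant suggests the change disrupts the "word".
print(best_score(reference), best_score(variant))
```

The drop in score between the two sequences is a crude stand-in for the kind of difference a model might flag when assessing whether a mutation disrupts a meaningful stretch of DNA.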

Other research groups are developing techniques for interpreting what deep learning models learn from genomic data.

Kopp says: “Deep learning models can read the DNA sequence and understand words that are important in the sequence. There are then techniques for highlighting important words and identifying words that might play a role in a mutation.

“This is an active research field and there is still room for improving how we understand what these models have learned. The better we understand what the model has learned, the better we can understand what’s contained in the DNA sequence.”

Janggu is publicly available and can be installed into a Python environment using the source code within the research paper. The full paper can be viewed on the Nature website.

