Genome Sequencing in the Cloud


Without the cloud it is very difficult to store and share the huge volumes of data needed for genome sequencing. Given the sensitivity of human genome data in terms of ownership and privacy, security and compliance are paramount.

Introduction

You might be wondering what genomes and the cloud have to do with each other. Quite a lot, as it turns out: without the cloud it is very difficult to store and share the huge volumes of data that genome sequencing produces. Given the sensitivity of human genome data in terms of ownership and privacy, security and compliance are paramount, something that modern cloud systems, like those at Safe Swiss Cloud, are well positioned to address. Containers orchestrated by Kubernetes are also gaining ground in bioinformatics, since they provide standardised application workflows and resource scaling, which simplifies collaboration across research groups. This post also touches on storage formats, object storage, the role of machine learning, and databases.

What is a Genome?

A genome is all the genetic information of an organism and consists of nucleotide sequences of DNA. The human genome includes both protein-coding genes and noncoding DNA and is stored in cell nuclei and mitochondria. DNA consists of two strands twisted to form a double helix. Each strand comprises a string of bases held together by a sugar-phosphate backbone. There are four possible bases, abbreviated A, T, G, and C (adenine, thymine, guanine, and cytosine). The bases on the two strands line up in pairs, an A opposite a T and a G opposite a C. Thus, if the sequence of bases on one strand is known, the other is determined. Establishing the sequence of DNA is key for, e.g., the diagnosis and treatment of diseases and for epidemiological studies. There are about 3 billion base pairs in one copy of the human genome, or about 6 billion in a typical (diploid) cell, which carries two copies. The sequencing data for one genome equates to a computer file size of about 100 GB when additional attributes such as descriptions and data quality are included.

Figure 1: A nucleotide showing the four possible bases A, G, C and T
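
The base-pairing rule described above (A opposite T, G opposite C) can be written down in a few lines. The following is a minimal Python sketch, not taken from any particular library; the function names are our own. Because the two strands run in opposite directions, the partner strand is usually reported as the reverse complement.

```python
# Pairing rule from the text: A pairs with T, G pairs with C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(strand):
    """Return the base that sits opposite each base of the given strand."""
    return "".join(COMPLEMENT[base] for base in strand.upper())


def reverse_complement(strand):
    """Return the partner strand read in its own direction."""
    return complement(strand)[::-1]


if __name__ == "__main__":
    print(complement("GATTACA"))
    print(reverse_complement("GATTACA"))
```

The sketch only handles the four standard bases; ambiguity codes such as N would need an extra entry in the lookup table.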

Next Generation Sequencing (NGS)

Sequencing can be carried out using, for example, Next Generation Sequencing (NGS). The technology is used to determine the order of nucleotides in entire genomes (Whole Genome Sequencing, WGS) or in targeted regions of DNA or RNA. The basic NGS process involves fragmenting the DNA/RNA into multiple pieces, adding adapters to form a library, sequencing the library, and reassembling the resulting reads into a genomic sequence.
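
To make the "fragment, then reassemble" idea concrete, here is a toy Python sketch. It is only an illustration under strong assumptions: reads are error-free, they overlap by a fixed amount, and the sequence contains no repeats. Real pipelines typically align reads to a reference genome or use graph-based assemblers, and the function names below are our own.

```python
def fragment(sequence, read_length, step):
    """Cut a sequence into overlapping reads (a stand-in for library preparation).

    Trailing bases are dropped if they do not fill a complete read.
    """
    return [sequence[i:i + read_length]
            for i in range(0, len(sequence) - read_length + 1, step)]


def overlap(left, right, min_len):
    """Length of the longest suffix of `left` that is a prefix of `right`."""
    for n in range(min(len(left), len(right)), min_len - 1, -1):
        if left.endswith(right[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len):
    """Repeatedly merge the two contigs with the longest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop duplicate reads, keep order
    while len(contigs) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, left in enumerate(contigs):
            for j, right in enumerate(contigs):
                if i == j:
                    continue
                n = overlap(left, right, min_len)
                if n > best_len:
                    best_len, best_i, best_j = n, i, j
        if best_len == 0:
            break  # nothing overlaps any more; return the separate contigs
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


if __name__ == "__main__":
    reads = fragment("GATTACAGGCTCCATG", read_length=8, step=4)
    print(reads)
    print(greedy_assemble(reads, min_len=4))
```

The greedy strategy is quadratic in the number of reads per merge, which is fine for a demonstration but far too slow for the hundreds of millions of reads produced by a real sequencing run.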

Sequencing File Formats

Along with FASTA and SAM, FASTQ is one of the de facto file formats for nucleotide sequencing. It is a text file containing the nucleotide sequence (bases A, C, T or G) and a corresponding quality score (Q) for each base. Each quality score is a Phred score encoded as a single ASCII character, running from 0x21 (lowest quality; ‘!’ in ASCII) to 0x7e (highest quality; ‘~’ in ASCII). The Phred score is logarithmic: Q = -10 · log10(P), where P is the probability that the base call is wrong. As an example, a Q value of 20 represents a 99% chance of the base call being correct. An example FASTQ file fragment is shown below:

@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
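
The fragment above can be decoded with a short Python sketch. It assumes the common layout of four lines per record (identifier, sequence, "+" separator, quality string) and the ASCII offset of 33 implied by the 0x21 lower bound; the function names are our own and not part of any library.

```python
RECORD = """@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
"""


def parse_fastq(text):
    """Yield (identifier, sequence, quality) for each four-line FASTQ record."""
    lines = [line.strip() for line in text.strip().splitlines()]
    if len(lines) % 4 != 0:
        raise ValueError("expected four lines per FASTQ record")
    for i in range(0, len(lines), 4):
        header, sequence, separator, quality = lines[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        yield header[1:], sequence, quality


def phred_scores(quality, offset=33):
    """Convert each quality character to its integer Phred score Q."""
    return [ord(char) - offset for char in quality]


def error_probabilities(quality, offset=33):
    """Convert each quality character to P = 10^(-Q/10), the chance of a wrong call."""
    return [10 ** (-q / 10) for q in phred_scores(quality, offset)]


if __name__ == "__main__":
    for name, sequence, quality in parse_fastq(RECORD):
        scores = phred_scores(quality)
        print(name, len(sequence), "bases, Q range", min(scores), "to", max(scores))
```

In this record the first base has quality character '!' (Q = 0, no confidence at all) and the last has '5' (Q = 20, the 99% case mentioned above).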

Data Visiting in the Cloud

According to Tanjo et al. (Journal of Human Genetics, 2021), “For effective genome data sharing and analysis, not only the security and legal compliance issues should be addressed, but also researchers need to deal with the recent data explosions and be familiar with the large-scale computational and networking infrastructures”. Using a certified cloud provider helps a researcher with compliance. Rather than spending weeks or months downloading data to their own servers, researchers are increasingly employing a “data visiting” strategy, whereby the data stays on commercial cloud platforms and the analysis is brought to the data.
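
A back-of-the-envelope calculation shows why downloading can take weeks or months. The Python sketch below uses the figure of about 100 GB per genome quoted earlier in this post; the cohort sizes and link speeds are illustrative assumptions, and the calculation assumes an ideal, fully sustained transfer rate.

```python
def transfer_days(genomes, gb_per_genome=100.0, link_gbit_per_s=1.0):
    """Estimate the days needed to download a cohort over a dedicated link.

    Uses decimal units (1 GB = 8e9 bits) and ignores protocol overhead.
    """
    total_bits = genomes * gb_per_genome * 1e9 * 8
    seconds = total_bits / (link_gbit_per_s * 1e9)
    return seconds / 86_400


if __name__ == "__main__":
    for genomes, speed in [(1_000, 1.0), (1_000, 0.1), (10_000, 1.0)]:
        days = transfer_days(genomes, link_gbit_per_s=speed)
        print(f"{genomes} genomes at {speed} Gbit/s: about {days:.0f} days")
```

Under these assumptions, 1,000 genomes take roughly nine days at a sustained 1 Gbit/s and about three months at 100 Mbit/s, before any time is spent on the analysis itself.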

Running Workflows using Containers

Traditionally, research group workflows are described in programming languages and custom build tools, which makes them nearly impossible to execute on different types of computing resource without modification. This makes efficient collaboration between research groups difficult. The problem can largely be solved using container technology, where the maintainer of the application container images has overall responsibility for their correct operation. Containers running in Kubernetes have the added advantage that the compute resources can automatically scale and adapt to the size of the ongoing analysis. Workflows written in WDL (Workflow Description Language), executed by Cromwell (a WDL-based workflow management system) and shared through Terra (a biomedical cloud sharing platform), directed at a cloud / Kubernetes pipeline API, are among the latest solutions for scalability and collaboration.
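
As a rough illustration of how a containerised analysis step might be handed to Kubernetes, the Python sketch below builds a Job manifest as a plain dictionary and prints it as JSON (which kubectl accepts alongside YAML). The image name, command and resource figures are hypothetical placeholders, not a recommended configuration, and this is not how Terra or Cromwell submit work internally.

```python
import json


def genomics_job(name, image, command, shards, cpu="4", memory="16Gi"):
    """Build a Kubernetes batch/v1 Job manifest that runs one pod per data shard."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            # One completion per shard, all allowed to run at the same time.
            "completions": shards,
            "parallelism": shards,
            # Indexed mode gives each pod a JOB_COMPLETION_INDEX environment
            # variable, which a worker can use to pick its shard.
            "completionMode": "Indexed",
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{
                        "name": "worker",
                        "image": image,
                        "command": command,
                        "resources": {"requests": {"cpu": cpu, "memory": memory}},
                    }],
                }
            },
        },
    }


if __name__ == "__main__":
    manifest = genomics_job(
        name="align-sample-001",                       # hypothetical job name
        image="registry.example.org/aligner:1.0",      # hypothetical image
        command=["align", "--input", "/data/sample"],  # hypothetical command
        shards=8,
    )
    print(json.dumps(manifest, indent=2))
```

Because every research group pulls the same image, the analysis step behaves the same wherever the cluster runs; changing the number of shards changes how many pods are requested, and a cluster with autoscaling enabled can add nodes to accommodate them.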

Machine Learning

Yang et al. (2020) summarised four typical applications of machine learning to DNA sequence data: DNA sequence alignment, classification, clustering, and pattern mining.

For example, the tfio.genome module of TensorFlow I/O provides commonly used genomics IO functionality, such as reading several genomics file formats, together with some common operations for preparing the data (e.g. one-hot encoding or parsing Phred quality scores into probabilities).
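
One-hot encoding, mentioned above as a typical preparation step, turns each base into a small numeric vector that a model can consume. The following is a plain-Python sketch of the idea, not the tfio.genome API; the column order A, C, G, T and the all-zero vector for unknown bases are our own choices.

```python
BASES = "ACGT"  # column order of the encoding (an arbitrary but fixed choice)


def one_hot(sequence):
    """Encode each base as a 4-element vector; unknown bases (e.g. N) become all zeros."""
    return [[1 if base == candidate else 0 for candidate in BASES]
            for base in sequence.upper()]


if __name__ == "__main__":
    for base, vector in zip("GATN", one_hot("GATN")):
        print(base, vector)
```

A sequence of length n becomes an n × 4 matrix, which is the shape most sequence models expect as input.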


Object Storage

Object storage in the cloud is a low-cost storage alternative that copes well with bioinformatic data growth, long-term data preservation and random access to data. It can be used as a local repository for a single application, or it can be shared and used by many clients. Access to object storage is secured via HTTPS, and the stored objects can optionally be encrypted.

Databases

In bioinformatics, databases are usually categorised as primary, secondary and composite.

  • Primary databases are populated with experimentally derived data such as nucleotide sequence, protein sequence or macromolecular structure. Experimental results are submitted directly into the database by researchers, and the data are essentially archival in nature. Once given a database accession number, the data in primary databases are immutable and form part of the scientific record.
  • Secondary databases comprise data derived from the results of analysing primary data. They are often referred to as curated databases.
  • Composite databases take their initial data from primary databases. The data is first compared, then filtered on the desired criteria and merged, which helps in searching sequences rapidly.

About Safe Swiss Cloud


Safe Swiss Cloud is a leading Swiss-based cloud computing provider for customers with strong compliance and data privacy requirements:

  • More about Kubernetes at Safe Swiss Cloud
  • More about Open Cloud at Safe Swiss Cloud
  • More about Object Storage at Safe Swiss Cloud
  • More about Safe Swiss Cloud support packages

References

  1. https://www.nature.com/articles/s10038-020-00862-1
  2. https://medicalfuturist.com/the-genomic-data-challenges-of-the-future/
  3. https://www.frontiersin.org/articles/10.3389/fbioe.2020.01032/full
  4. https://www.ebi.ac.uk/training/online/courses/bioinformatics-terrified/what-makes-a-good-bioinformatics-database/primary-and-secondary-databases/

About the Author

David Poole

CTO / CSO | Chief Technical Officer / Chief Security Officer

David has nearly 30 years' experience in the IT industry, principally in banking and mobile technology. David’s motto is “get the job done”. David gained a Ph.D. in Physics (solid state) from Cambridge University in 1982. He also has a Master’s in electronics from Birmingham University.

Other interests: Art, karate and weight training
