The DNA Data Bank of Japan: Supercomputing the Code of Life

In the heart of Japan, a quiet revolution is underway: an effort to manage the avalanche of genetic information that is shaping the future of medicine, evolution, and biology itself.

More Than a Library: What is the DDBJ?

Imagine a library that, instead of storing books, archives the genetic codes of every known living organism—from humans and mice to bacteria and viruses.

This is the essence of the DDBJ. Located at the National Institute of Genetics in Japan, the DDBJ is one of the three pillars of the International Nucleotide Sequence Database Collaboration, alongside GenBank in the United States and the EMBL Data Library in Europe 3 . These three organizations collectively form a comprehensive global repository for publicly available DNA sequencing data. Every day, they exchange the data collected and processed at each bank, ensuring that the quality and quantity of information are equivalent worldwide 3 .

Global Repository

Part of the International Nucleotide Sequence Database Collaboration with GenBank and EMBL 3 .

Dynamic Hub

Processes, annotates, and publicizes DNA sequences submitted by scientists worldwide.

Research Foundation

Supports medical breakthroughs and anthropological discoveries with raw genomic data.

The Data Deluge: Why a Supercomputer is Essential

"We at DDBJ have been flooded with mass submissions of DNA sequence data" 3 .

The turn of the 21st century brought explosive growth in genomic data, and the DDBJ found itself at the center of it. Growth in 1999 was so dramatic that the amount of data submitted in that single year exceeded the total from the preceding ten years combined 3 .

This deluge was driven by ambitious Japanese genome projects, such as the human genome project and large-scale mouse cDNA initiatives 3 . One notable example was the submission of 175,734 complete mouse cDNAs from a single team at RIKEN, totaling over 46 million base pairs of sequence 3 .

[Figure: Exponential Growth of Genomic Data]

DDBJ's Response to the Data Explosion

| Challenge | Solution Developed by DDBJ | Impact |
|---|---|---|
| Massive data submission from genome projects | Mass Submission System (MSS) and Mass Submission Tool (MST) | Streamlined the process of submitting and processing large datasets 3 |
| Processing thousands of new entries daily | Enhanced data processing pipelines | Increased processing capacity to 20,000 entries per day in 2000, a four-fold increase from 1998 3 |
| Finding relevant data in a giant database | Species-oriented retrieval tools | Allowed researchers to search within data from any one of over 50,700 species, drastically reducing search times 3 |
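
To see why species-oriented retrieval cuts search times, consider a minimal Python sketch. It assumes nothing about how the DDBJ actually implements its tools: the records, accession labels, and function names below are hypothetical and exist only for illustration. The idea is simply that grouping entries by species lets a query scan one bucket instead of the whole archive.

```python
from collections import defaultdict

# Hypothetical records: (accession, species, sequence).
# These are made-up illustration values, not real DDBJ entries.
RECORDS = [
    ("X0001", "Mus musculus", "ATGGCC"),
    ("X0002", "Homo sapiens", "ATGTTT"),
    ("X0003", "Mus musculus", "GGCATG"),
    ("X0004", "Escherichia coli", "TTGACA"),
]


def build_species_index(records):
    """Group records by species so a query only touches one bucket."""
    index = defaultdict(list)
    for accession, species, sequence in records:
        index[species].append((accession, sequence))
    return index


def search(index, species, motif):
    """Return accessions of one species whose sequence contains the motif."""
    # .get() avoids creating an empty bucket for an unknown species.
    return [acc for acc, seq in index.get(species, []) if motif in seq]


index = build_species_index(RECORDS)
mouse_hits = search(index, "Mus musculus", "ATG")
print(mouse_hits)
```

With over 50,700 species in the database, a search restricted to one species in this way skips the vast majority of entries before any sequence comparison happens.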

A Deep Dive into the Supercomputing Workflow

To understand how a supercomputer is used in genomics, we can look to a comparable workflow implemented on the Cray XE6 supercomputer, known as "Beagle," at Argonne National Laboratory. The "MegaSeq" workflow developed there tackles the same kind of large-scale data challenge that the DDBJ faces 7 . Analyzing a whole genome is a marathon of computational steps, each demanding immense power and precision.

The Step-by-Step Journey of a Genome

Alignment

Finding the right place on the map: tools like BWA match billions of short reads to a reference genome 7 .

Clean-Up

Marking duplicates, indel realignment, and base quality recalibration to polish the data 7 .

Variant Calling

Scanning the entire genome to identify genetic variations like SNPs and structural changes 7 .

Annotation

Understanding the potential consequences of variants on genes and proteins 7 .
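
The four stages above can be illustrated with a deliberately tiny Python sketch. This is not how BWA, Picard, GATK, or snpEff work internally: the aligner here is a brute-force mismatch counter, duplicates are detected by identical position and sequence, and the reference, reads, and `toy_gene` region are all invented for the example. It only shows how the output of each stage feeds the next.

```python
from collections import Counter

REFERENCE = "ACGTACGTTAGCCGATAGGCTA"  # made-up reference sequence
# Made-up reads: the second duplicates the first; the last three carry
# a G -> T change at reference position 10 (0-based).
READS = ["ACGTTAGC", "ACGTTAGC", "GTTATCCG", "TATCCGAT", "ATCCGATA"]
REGIONS = {"toy_gene": (8, 16)}  # hypothetical annotation, half-open interval


def align(read, reference):
    """Toy alignment: return the offset with the fewest mismatches."""
    def mismatches(offset):
        window = reference[offset:offset + len(read)]
        return sum(a != b for a, b in zip(read, window))
    return min(range(len(reference) - len(read) + 1), key=mismatches)


def mark_duplicates(alignments):
    """Toy clean-up: keep one copy of reads with equal position and sequence."""
    seen, unique = set(), []
    for item in alignments:
        if item not in seen:
            seen.add(item)
            unique.append(item)
    return unique


def call_variants(alignments, reference, min_support=2):
    """Toy variant calling: report well-supported non-reference bases."""
    pileup = {}
    for pos, read in alignments:
        for i, base in enumerate(read):
            pileup.setdefault(pos + i, Counter())[base] += 1
    variants = []
    for pos in sorted(pileup):
        base, count = pileup[pos].most_common(1)[0]
        if base != reference[pos] and count >= min_support:
            variants.append((pos, reference[pos], base))
    return variants


def annotate(variants, regions):
    """Toy annotation: label each variant with the region that contains it."""
    labelled = []
    for pos, ref, alt in variants:
        names = [n for n, (start, end) in regions.items() if start <= pos < end]
        labelled.append((pos, ref, alt, names[0] if names else "intergenic"))
    return labelled


aligned = [(align(read, REFERENCE), read) for read in READS]
unique = mark_duplicates(aligned)
variants = call_variants(unique, REFERENCE)
annotated = annotate(variants, REGIONS)
print(annotated)
```

On a real genome each of these steps runs over billions of reads, which is why the workflow is spread across many processors rather than run in a single loop like this.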

Supercomputing Power Behind the MegaSeq Workflow (Cray XE6)

| Component | Specification | Role in Genome Analysis |
|---|---|---|
| Compute Nodes | 726 nodes | Provide the physical processors to execute computations in parallel 7 |
| Cores per Node | 24 cores (AMD "Magny-Cours") | Allows each node to handle 24 tasks simultaneously 7 |
| Memory | 32 GB per node | Stores the massive amounts of genetic data being actively processed 7 |
| File System | Parallel Lustre file system | Enables all nodes to access all data at any time without bottlenecks, a critical feature for large-scale analysis 7 |
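
The figures in the table translate into aggregate capacity with simple arithmetic. The short Python sketch below works it out; the node, core, and memory values come from the table, while the `nodes_needed` helper and the 1,000-task workload are hypothetical and only illustrate how independent tasks might be spread across nodes.

```python
NODES = 726
CORES_PER_NODE = 24
MEMORY_PER_NODE_GB = 32

total_cores = NODES * CORES_PER_NODE            # 726 * 24
total_memory_gb = NODES * MEMORY_PER_NODE_GB    # 726 * 32
memory_per_core_gb = MEMORY_PER_NODE_GB / CORES_PER_NODE


def nodes_needed(tasks, cores_per_node=CORES_PER_NODE):
    """Whole nodes required to give every task its own core."""
    return -(-tasks // cores_per_node)  # ceiling division


print(f"Total cores:       {total_cores:,}")
print(f"Total memory (GB): {total_memory_gb:,}")
print(f"Memory per core:   {memory_per_core_gb:.2f} GB")
# Hypothetical workload: 1,000 independent single-core tasks.
print(f"Nodes for 1,000 tasks: {nodes_needed(1000)}")
```

Roughly 1.33 GB of memory per core is a tight budget for genome analysis, which is one reason such workflows have to be careful about how much data each task loads at once.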

The Scientist's Toolkit: Key Instruments and Software

The work of the DDBJ and genomic scientists worldwide relies on a combination of laboratory instruments and software tools. The following are some of the essential components of a typical supercomputer-powered genomics workflow.

BWA

Type: Software Algorithm

Aligns billions of short DNA sequences to a reference genome with high speed and accuracy 7

Picard Tools

Type: Software Toolkit

Provides a set of command-line tools for manipulating sequencing data, including sorting and marking duplicates 7

GATK

Type: Software Toolkit

An industry-standard toolkit developed by the Broad Institute for variant discovery and genotyping 7

SAM/BAM Tools

Type: Software Toolkit

Utilities for post-processing alignments, including compression, sorting, and indexing 7

DNA Sequencer

Type: Laboratory Instrument

The physical machine that generates the raw short-read sequence data from DNA samples 7

snpEff

Type: Software Algorithm

Annotates and predicts the effects of genetic variants on genes and proteins 7
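
Several of the tools above exchange data as SAM records (or BAM, the compressed binary form). As a rough illustration of what they operate on, the Python sketch below parses one SAM line and checks a few bits of its FLAG field. The field order and flag values follow the SAM format specification; the example read, the `SamRecord` type, and the `parse_sam_line` helper are invented for this sketch and are not part of any toolkit listed here.

```python
from typing import NamedTuple

# Bit values defined by the SAM format specification.
FLAG_UNMAPPED = 0x4
FLAG_REVERSE = 0x10
FLAG_DUPLICATE = 0x400


class SamRecord(NamedTuple):
    qname: str   # read name
    flag: int    # bitwise flags
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # alignment description
    seq: str     # read sequence


def parse_sam_line(line):
    """Parse the mandatory tab-separated fields this sketch needs."""
    f = line.rstrip("\n").split("\t")
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]), f[5], f[9])


# Made-up alignment line: flag 1040 = 0x400 (duplicate) + 0x10 (reverse strand).
EXAMPLE = "read1\t1040\tchr1\t101\t60\t8M\t*\t0\t0\tACGTTAGC\tIIIIIIII\n"

record = parse_sam_line(EXAMPLE)
is_duplicate = bool(record.flag & FLAG_DUPLICATE)
is_reverse = bool(record.flag & FLAG_REVERSE)
is_unmapped = bool(record.flag & FLAG_UNMAPPED)
print(record.qname, record.pos, is_duplicate, is_reverse, is_unmapped)
```

A duplicate-marking step sets the duplicate bit on redundant reads rather than deleting them, so later stages can decide whether to skip flagged records.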

The Future Written in Code

The work of the DDBJ and its supercomputer is never done. It is the backbone of a rapidly accelerating scientific journey.

DNA Data Storage

Researchers are exploring futuristic concepts like using DNA itself as a medium for ultra-dense, long-term data storage—a potential solution to the world's exploding data generation problem 5 .
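
The density of DNA as a storage medium comes from the fact that each base can carry two bits. The Python sketch below shows the simplest possible mapping, assuming 00, 01, 10, and 11 correspond to A, C, G, and T. That mapping is an assumption made for illustration; practical schemes add error correction and avoid problematic sequences such as long runs of the same base, none of which is attempted here.

```python
BASES = "ACGT"  # assumed mapping: 00 -> A, 01 -> C, 10 -> G, 11 -> T


def encode(data: bytes) -> str:
    """Turn each byte into four bases, most significant bit pair first."""
    return "".join(
        BASES[(byte >> shift) & 0b11]
        for byte in data
        for shift in (6, 4, 2, 0)
    )


def decode(strand: str) -> bytes:
    """Invert encode(): rebuild one byte from every four bases."""
    out = bytearray()
    for i in range(0, len(strand), 4):
        byte = 0
        for base in strand[i:i + 4]:
            byte = (byte << 2) | BASES.index(base)
        out.append(byte)
    return bytes(out)


strand = encode(b"DDBJ")
print(strand)
print(decode(strand))
```

Four bases per byte means a strand only four times as long as the byte count, with no error protection, which is the theoretical upper bound that real encodings trade away for reliability.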

Global Impact

The database has become the starting point for thousands of studies, enabling discoveries that are rewriting human history 4 and pushing the boundaries of technology.

The silent work of the supercomputer at the DNA Data Bank of Japan is more than just data management; it is an essential service to science and humanity. By providing the tools to navigate the jungle of genetic data, it ensures that this priceless information does not lose its value and continues to illuminate the mysteries of life itself 3 .

References
