Petascale genomics
The advent of next-generation DNA sequencing technologies is revolutionizing life sciences research by routinely generating extremely large datasets. Tom White explains how big data tools such as Hadoop, originally developed to handle large-scale Internet data, help scientists manage this new scale of data and answer a host of questions that were previously out of reach.
Talk Title | Petascale genomics |
Speakers | Tom White (Cloudera) |
Conference | Strata + Hadoop World |
Conf Tag | Making Data Work |
Location | London, United Kingdom |
Date | June 1-3, 2016 |
URL | Talk Page |
Slides | Talk Slides |
The advent of next-generation DNA sequencing technologies is poised to revolutionize the way life sciences research is practiced. These technologies are scaling significantly faster than Moore’s law and promise to catapult life sciences research and the biotech industry into the realm of big data. However, bioinformatics and data management in the life sciences have been slow to adopt the big data technologies pioneered by the Internet industry (e.g., at Google and Facebook), in part because these tools have only recently become necessary.

Tom White reviews several ways in which distributed computing tools (e.g., the Hadoop ecosystem) can significantly advance the state of the art in life sciences research, including scaling genome-wide association studies to find connections between genes and traits, integrating the many public databases at scale, and assembling genome sequences from short snippets for use in cancer genomics. Tom also covers the new ADAM project for rebooting genomics ETL on top of Spark and the Eggo project for providing public datasets in Parquet format.