November 17, 2019

Petascale genomics

The advent of next-generation DNA sequencing technologies is revolutionizing life sciences research by routinely generating extremely large datasets. Tom White explains how big data tools developed to handle large-scale Internet data (like Hadoop) help scientists effectively manage this new scale of data and also enable addressing a host of questions that were previously out of reach.


Talk Title	Petascale genomics
Speakers	Tom White (Cloudera)
Conference	Strata + Hadoop World
Conf Tag	Making Data Work
Location	London, United Kingdom
Date	June 1-3, 2016

The advent of next-generation DNA sequencing technologies is poised to revolutionize the way life sciences research is practiced. These new technologies are scaling significantly faster than Moore’s law and promise to catapult life sciences research and the biotech industry into the realm of big data. However, bioinformatics and data management in the life sciences have been slow to adopt the latest big data technologies pioneered by the Internet industry (e.g., Google and Facebook), in part because these tools are only beginning to become necessary today. Tom White reviews several ways in which distributed computing tools (e.g., the Hadoop ecosystem) can be used to significantly advance the state of the art in life sciences research, including scaling genome-wide association studies to find connections between your genes and your traits, large-scale data integration of the large number of public databases, and assembling genome sequences from short snippets for use in cancer genomics. Tom also covers the new ADAM project for rebooting genomics ETL on top of Spark and the Eggo project for providing Parquet-formatted public datasets.

Architecting HBase in the field

October 28, 2019

Most already know HBase, but many don't know that it can be coupled with other tools from the ecosystem to increase efficiency. Jean-Marc Spaggiari and Kevin O'Dell walk attendees through some real-life HBase use cases and demonstrate how they have been efficiently implemented.

IoT in the enterprise: A look at Intel (IoT) Inside

October 23, 2019

Moty Fania shares Intel's IT experience implementing an on-premises big data IoT platform for internal use cases. This platform was built on top of several open source technologies and enables highly scalable stream analytics with a stack of algorithms such as multisensor change detection, anomaly detection, and more.

Python scalability: A convenient truth

October 21, 2019

Despite Python's popularity throughout the data engineering and data science workflow, the principles behind its performance and scaling behavior are less well understood. Travis Oliphant explains best practices and modern tools to scale Python to larger-than-memory and distributed workloads without sacrificing its ease of use or being forced to adopt heavyweight frameworks.

Sightseeing, venues, and friends: Predictive analytics with Spark ML and Cassandra

November 17, 2019

Which venues have similar visiting patterns? How can we detect when a user is on vacation? Can we predict which venues will be favorited by users by examining their friends' preferences? Natalino Busa explains how these predictive analytics tasks can be accomplished by using Spark SQL, Spark ML, and just a few lines of Scala code.

Simple, fast, and flexible risk aggregation in Hadoop

November 17, 2019

Value at risk (VaR) is a widely used risk measure. VaR is not simply additive, which makes it hard to report at any aggregate level, because traditional database aggregation functions such as SUM don't work on it. Deenar Toraskar explains how Hive complex data types and user-defined functions can be used to provide simple, fast, and flexible VaR aggregation.
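
To see why summing per-desk VaR figures fails, here is a minimal Python sketch. It is not taken from the talk. It assumes a historical-simulation style of VaR, in which each position carries a vector of scenario profit-and-loss values; the function names, the 90% confidence level, and the sample numbers are all hypothetical. The idea it illustrates is that the vectors are added element by element first, and VaR is computed only on the combined vector. An array-typed column plus a user-defined function could play the same role in Hive, though the abstract does not say exactly how the talk does it.

```python
def value_at_risk(pnl, confidence_pct=90):
    """Historical-simulation VaR: the loss not exceeded in confidence_pct% of scenarios.

    Uses integer arithmetic for the rank to avoid floating-point rounding.
    """
    losses = sorted(-x for x in pnl)                     # losses, smallest to largest
    rank = -(-confidence_pct * len(losses) // 100)       # ceiling division
    return losses[rank - 1]


def add_vectors(a, b):
    """Element-wise sum of two scenario P&L vectors of equal length."""
    return [x + y for x, y in zip(a, b)]


# Hypothetical P&L per scenario for two desks (10 scenarios each).
desk_a = [-8, -2, 1, 2, 3, 1, 2, 3, 1, 2]
desk_b = [1, 2, -8, -2, 3, 1, 2, 3, 1, 2]

var_a = value_at_risk(desk_a)
var_b = value_at_risk(desk_b)
var_combined = value_at_risk(add_vectors(desk_a, desk_b))

print("sum of desk VaRs:", var_a + var_b)
print("VaR of combined book:", var_combined)
```

In this made-up data, each desk has a 90% VaR of 2, so a naive SUM reports 4, while the combined book has a 90% VaR of 7. The aggregate has to be rebuilt from the underlying vectors at whatever level is being reported.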

Big data-fueled feedback loops leveraging streaming data in SDN/NFV

October 27, 2019

Software-defined networking (SDN) and network functions virtualization (NFV) hold tremendous potential to enable efficiency and flexibility in service delivery, but SDN/NFV environments are also highly complex and multilayered. Matt Olson explains why effective support for SDN/NFV services requires leveraging the tremendous amount of service and data streaming from the platform.