Python scalability: A convenient truth
Despite Python's popularity throughout data engineering and data science workflows, the principles behind its performance and scaling behavior are less well understood. Travis Oliphant explains best practices and modern tools for scaling Python to larger-than-memory and distributed workloads without sacrificing its ease of use or being forced to adopt heavyweight frameworks.
Talk Title | Python scalability: A convenient truth |
Speakers | Travis Oliphant (Continuum Analytics) |
Conference | Strata + Hadoop World |
Conf Tag | Big Data Expo |
Location | San Jose, California |
Date | March 29-31, 2016 |
URL | Talk Page |
Slides | Talk Slides |
Python is the fastest-growing data science language and is used in production at many Fortune 500 companies for everything from software engineering to data engineering to rapid analytics. Despite its easy-to-learn nature and simple syntax, Python packs a surprising amount of power and performance right out of the box. For instance, many of the newest innovations in the big data ecosystem, such as columnar storage, dataflow programming, and stream processing, can be expressed in a relatively straightforward manner in Python.

Unfortunately, it is also very easy to use Python in ways that impede its ability to scale. Many Hadoop practitioners, for example, fail to consider the implications of serialization overhead when interfacing with tools like R and Python. Others may simply be unaware of Python's facilities for managing multicore and larger-than-memory workloads and assume that they have to move to complex distributed computing the instant they hit a memory barrier.

Travis Oliphant covers the basic concepts that lie at the heart of Python’s scalability and power and defuses myths about its performance limits. He looks at a few of the common antipatterns that tend to crop up as people integrate Python with Hadoop and Spark and take it into production deployment environments. He also demonstrates real-world code and scenarios in which orders-of-magnitude performance improvements can be achieved by using better data-management techniques and artfully applying modern Python libraries for performance.
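As a rough illustration of the larger-than-memory point in the abstract, the sketch below computes a mean over a binary file by streaming it in fixed-size chunks, so only one chunk is held in memory at a time. It is not taken from the talk. The file layout (raw native-endian float64 values) and the names `write_floats` and `chunked_mean` are assumptions made for this example, and it uses only the standard library rather than the performance libraries the talk refers to.

```python
import array
import os
import tempfile


def write_floats(path, values):
    """Hypothetical helper: write values to path as raw native float64."""
    with open(path, "wb") as f:
        array.array("d", values).tofile(f)


def chunked_mean(path, chunk_items=1_000_000):
    """Mean of a raw float64 file, reading at most chunk_items values at a time.

    Assumes the file size is a whole multiple of the float64 item size.
    """
    item_size = array.array("d").itemsize
    total = 0.0
    count = 0
    with open(path, "rb") as f:
        while True:
            buf = f.read(chunk_items * item_size)
            if not buf:
                break  # end of file
            chunk = array.array("d")
            chunk.frombytes(buf)
            # Only running totals are kept; the chunk is discarded afterwards.
            total += sum(chunk)
            count += len(chunk)
    return total / count if count else float("nan")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        demo_path = os.path.join(tmp, "values.bin")
        write_floats(demo_path, range(10))
        # A tiny chunk size forces several reads, mimicking a file that
        # would not fit in memory all at once.
        print(chunked_mean(demo_path, chunk_items=3))
```

The same pattern of keeping a small running state while iterating over chunks extends to other reductions such as sums, counts, minima, and maxima.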