October 21, 2019

Python scalability: A convenient truth

Despite Python's popularity throughout the data engineering and data science workflow, the principles behind its performance and scaling behavior are less well understood. Travis Oliphant explains best practices and modern tools for scaling Python to larger-than-memory and distributed workloads without sacrificing its ease of use or adopting heavyweight frameworks.


Talk Title	Python scalability: A convenient truth
Speakers	Travis Oliphant (Continuum Analytics)
Conference	Strata + Hadoop World
Conf Tag	Big Data Expo
Location	San Jose, California
Date	March 29-31, 2016

Python is the fastest-growing data science language and is used in production at many of the Fortune 500 companies for everything from software engineering to data engineering to rapid analytics. Despite its easy-to-learn nature and simple syntax, Python packs a surprising amount of power and performance right out of the box. For instance, many of the newest innovations in the big data ecosystem, such as columnar storage, dataflow programming, and stream processing, can all be expressed in a relatively straightforward manner using Python.

Unfortunately, it is also very easy to use Python in ways that impede its ability to scale. For instance, many Hadoop practitioners fail to consider the serialization overhead incurred when interfacing with tools like R and Python. Others may simply be unaware of the facilities in Python for managing multicore and larger-than-memory workloads and assume that they have to move to complex distributed computing the instant they hit a memory barrier.

Travis Oliphant covers the basic concepts that lie at the heart of Python’s scalability and power and defuses myths about its performance limits. He looks at a few of the common antipatterns that tend to crop up as people integrate Python with Hadoop and Spark and take it into production deployment environments. He also demonstrates real-world code and scenarios in which orders-of-magnitude performance improvements can be achieved by using better data-management techniques and artfully applying modern Python libraries for performance.
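As a rough illustration of the point about larger-than-memory workloads, the sketch below aggregates a CSV stream in fixed-size chunks using only the standard library, so that only one chunk of rows is held in memory at a time. It is not taken from the talk; the function names (`iter_chunks`, `sum_by_key`), the column names, and the sample data are all hypothetical.

```python
import csv
import io
from collections import defaultdict
from itertools import islice


def iter_chunks(rows, chunk_size):
    """Yield lists of at most chunk_size items from any iterable."""
    rows = iter(rows)
    while True:
        chunk = list(islice(rows, chunk_size))
        if not chunk:
            return
        yield chunk


def sum_by_key(lines, key_field, value_field, chunk_size=1000):
    """Sum value_field per key_field, reading the input one chunk at a time."""
    totals = defaultdict(float)
    reader = csv.DictReader(lines)  # lazily parses rows; nothing is read up front
    for chunk in iter_chunks(reader, chunk_size):
        for row in chunk:
            totals[row[key_field]] += float(row[value_field])
    return dict(totals)


# Hypothetical sample data; in practice `lines` would be an open file handle.
sample = io.StringIO("user,bytes\na,10\nb,5\na,2.5\n")
result = sum_by_key(sample, "user", "bytes", chunk_size=2)
print(result)
```

Because the running totals are much smaller than the input, memory use is bounded by the chunk size rather than the file size. Each chunk is also a natural unit of work that could be handed to a worker pool if multicore processing were needed.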

code management performance spark ecosystem data engineering hadoop data science analytics big data programming python

Transforming Telefónica

October 19, 2019

Increasing competition and technological change are pushing the telco industry toward a new model of analytics. Telefónica has been at the forefront of this change, driving its business transformation into a digital telco. John Belchamber and Arturo Canales tell the story of that transformation and detail the pitfalls and challenges faced by teams looking to follow a similar journey.

Scala and the JVM as a big data platform: Lessons from Apache Spark

October 21, 2019

The success of Apache Spark is bringing developers to Scala. For big data, the JVM uses memory inefficiently, causing significant GC challenges. Spark's Project Tungsten fixes these problems with custom data layouts and code generation. Dean Wampler gives an overview of Spark, explaining ongoing improvements and what we should do to improve Scala and the JVM for big data.

Scalable schema management for Hadoop and Spark applications

October 21, 2019

Schema plays a key role in the Hadoop architecture at Uber. Kelvin Chu and Evan Richards explain why schema is important and how it can make your Hadoop and Spark applications more reliable and efficient.

Real-time Hadoop: What an ideal messaging system should bring to Hadoop

October 21, 2019

Application messaging isn't new; solutions include IBM MQ, RabbitMQ, and ActiveMQ. Apache Kafka is a high-performance, high-scalability alternative that integrates well with Hadoop. Can modern distributed messaging systems like Kafka replace these legacy systems, or are they purely complementary? Ted Dunning outlines Kafka's architectural benefits and tradeoffs to find the answer.

Self-serve analytics journey at Celtra: Snowflake, Spark, and Databricks

October 20, 2019

Celtra provides a platform for customers like Porsche and Fox to create, track, and analyze digital display advertising. Celtra's platform processes billions of ad events daily to give analysts fast and easy access to reports and ad hoc analytics. Grega Kešpret outlines Celtra's data-pipeline challenges and explains how it solved them by combining Snowflake's cloud data warehouse with Spark.

Toppling the mainframe: Enterprise-grade streaming under 2 ms on Hadoop

October 19, 2019

What if we have reached the point where open source can handle massively difficult streaming problems with enterprise-grade durability? Ilya Ganelin presents Capital One's novel solution for real-time decisioning on Apache Apex. Ilya shows how Apex provides unique capabilities that ensure less than 2 ms latency in an enterprise-grade solution on Hadoop.