PySpark Coding Practices: Lessons Learned
Alex Gillmor and Shafi Bashar, Machine Learning Engineers
May 14, 2018
In our previous post, we discussed how we used PySpark to build a large-scale distributed machine learning model. In this post, we will describe our experience and some of the lessons learned while deploying PySpark code in a production environment.
Yelp’s systems have robust testing in place. It’s a hallmark of our engineering. It allows us to push code confidently and forces engineers to design code that is testable and modular. Broadly speaking, we found the resources for working with PySpark in a large development environment and efficiently testing PySpark code to be a little sparse.
By design, a lot of PySpark code is very concise and readable. As such, it might be tempting for developers to forgo best practices but, as we learned, this can quickly become unmanageable. In the process of bootstrapping our system, our developers were asked to push code from prototype to production very quickly, and the code was a little weak on testing.
As our project grew, these decisions were compounded by other developers hoping to leverage PySpark and the codebase. We quickly found ourselves needing patterns in place to build testable, maintainable code that was frictionless for other developers to work with and to get into production.
This was further complicated by the fact that across our various environments PySpark was not easy to install and maintain. Early iterations of our workflow depended on running notebooks against individually managed development clusters without a local environment for testing and development.
Test Something by Mocking Everything?
Our workflow was streamlined with the introduction of the PySpark module into the Python Package Index (PyPI). Before that release, in an effort to have some tests without a local PySpark installation, we did what seemed reasonable in a codebase with a complex dependency and no tests: we implemented some tests using mocks.
However, this quickly became unmanageable, especially as more developers began working on our codebase. As a result, developers spent way too much time reasoning about opaque and heavily mocked tests.
Our initial PySpark use was very ad hoc; we only had PySpark on EMR environments and we were pushing to produce an MVP. To formalize testing and development, we needed a PySpark package available in all of our environments. PySpark was made available in PyPI in May 2017. With PySpark available in our development environment, we were able to start building a codebase with fixtures that fully replicated PySpark functionality.
One element of our workflow that helped development was the unification and creation of PySpark test fixtures for our code. You can start with a small, consistent set of fixtures and then find they encompass quite a bit of data to satisfy the logical requirements of your code. In our service, the testing framework is pytest.
And similarly a data fixture built on top of this looks like:
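The shape of such a fixture might be as follows; the column names and rows are hypothetical stand-ins for the real business table:

```python
import pytest


@pytest.fixture
def business_table_data(spark_session):
    # A small sample standing in for the business table
    # (schema and values here are illustrative).
    rows = [
        ('business_1', 'San Francisco', 4.5, 120),
        ('business_2', 'Toronto', 3.0, 4),
    ]
    return spark_session.createDataFrame(
        rows, ['business_id', 'city', 'rating', 'review_count'])
```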
business_table_data is a representative sample of our business table.
And an example of a simple business logic unit test looks like:
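A sketch of such a test, with hypothetical domain logic kept as a pure Python function so the test needs no Spark at all:

```python
def is_eligible_business(business):
    # Hypothetical domain logic: pure Python, so it can be unit tested
    # without a SparkSession.
    return business['review_count'] >= 10 and business['rating'] >= 3.5


def test_is_eligible_business():
    assert is_eligible_business({'review_count': 120, 'rating': 4.5})
    assert not is_eligible_business({'review_count': 4, 'rating': 3.0})
```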
While this is a simple example, having a framework like this in place is arguably as important for structuring the code as it is for verifying that the code works correctly.
PySpark Coding Conventions
As often happens, once you develop a testing pattern, a corresponding set of practices falls into place. We love Python at Yelp, but it doesn't provide the structure that languages with strong type systems, like Scala or Java, do. And, while we're all adults here, we have found the following general patterns particularly useful when coding in PySpark.
Separate your data loading and saving from any domain or business logic. We load the data at the top level of our batch jobs into Spark data primitives (an RDD or DataFrame). For most use cases, we save these Spark data primitives back to S3 at the end of our batch jobs. Any further data extraction, transformation, or piece of domain logic should operate on these primitives. We try to encapsulate as much of our logic as possible in pure Python functions, with the tried-and-true patterns of testing, SRP, and DRY.
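A minimal sketch of that separation, with hypothetical function names, columns, and paths: I/O happens only in the top-level batch entry point, and the transformation is a pure function of the DataFrame it receives.

```python
def select_top_rated(business_df, min_reviews=10):
    # Domain logic: a pure function of its input DataFrame, with no I/O,
    # so it can be tested against a small fixture DataFrame.
    return (business_df
            .where(business_df.review_count >= min_reviews)
            .orderBy(business_df.rating, ascending=False))


def run_batch(spark, input_path, output_path):
    # Load and save only at the top level of the batch job.
    business_df = spark.read.parquet(input_path)
    select_top_rated(business_df).write.parquet(output_path)
```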
Be clear in notation. We make sure to denote which Spark primitives we are operating on within their names. Spark provides a lot of design paradigms, so we clearly name entry primitives, such as spark_context, and similarly postfix data objects with their types, as in bar_df for a DataFrame. We apply this pattern broadly in our codebase. A pattern we're a little less strict on is prefixing functions with the operations they apply: a signature like map_filter_out_past_viewed_businesses(...) signals that the function applies map and filter operations.
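As a sketch of that last convention (the argument names and the event shape are hypothetical), such a function might look like:

```python
def map_filter_out_past_viewed_businesses(view_events_rdd, past_viewed_ids):
    # The leading verbs name the Spark operations applied: a map to extract
    # the business id, then a filter to drop already-viewed businesses.
    # The _rdd postfix on the argument marks the Spark primitive involved.
    past = set(past_viewed_ids)
    return (view_events_rdd
            .map(lambda event: event['business_id'])
            .filter(lambda business_id: business_id not in past))
```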
Do as much testing as possible in unit tests, and have integration tests that are sane to maintain. Hopefully it's now a bit clearer how we structure unit tests inside our codebase. However, we have noticed that complex integration tests can lead to a pattern where developers fix tests without paying close attention to the details of the failure. So what we've settled on is maintaining the test pyramid: integration tests as needed, plus a top-level integration test with very loose bounds that acts mainly as a smoke test that our overall batch job works.
In moving fast from a minimum viable product to a larger scale production solution we found it pertinent to apply some classic guidance on automated testing and coding standards within our PySpark repository. These best practices worked well as we built our collaborative filtering model from prototype to production and expanded the use of our codebase within our engineering organization.
We would like to thank the following for their feedback and review: Eric Liu, Niloy Gupta, Srivathsan Rajagopalan, Daniel Yao, Xun Tang, Chris Farrell, Jingwei Shen, Ryan Drebin, Tomer Elmalem.
Passionate about large scale data problems and high quality software engineering?
We're hiring! Check out our current job openings. We’d like to hear from you!