Open sourcing spark-redshift-community
- Luca Giovagnoli, Software Engineer
- Oct 25, 2019
At Yelp, we are heavy users of both Spark and Redshift. We’re excited to announce spark-redshift-community, a fork of Databricks’ original spark-redshift project.
spark-redshift is a Scala package that uses Amazon S3 to efficiently read data from AWS Redshift into Spark DataFrames and write it back. After the open source effort was abandoned in 2017, the community struggled to keep dependencies up to date and to fix bugs. Progress came to a complete halt with the release of Spark 2.4, which was incompatible with the latest spark-redshift. Developers looking for a solution turned to online threads on sites like Stack Overflow and GitHub, but the answers there offered nothing close to even a simple workaround.
At Yelp, it was only a matter of time before we jumped into action. The inability to upgrade Spark from 2.3.3 to 2.4 meant that:
- We could not use highly sought-after features from Spark 2.4.
- Our move to Kubernetes was endangered. In order to move our infrastructure to run on Kubernetes, we needed Spark 2.4:
“Spark can run on clusters managed by Kubernetes. This feature makes use of native Kubernetes scheduler that has been added to Spark [2.4].” 1
The spark-snowflake open source project is a stable spark-redshift fork for Snowflake. We considered adapting spark-snowflake to work with Redshift, but the time estimate was higher than for forking and upgrading the original spark-redshift. At Databricks’ suggestion, we did exactly that.
We focused on porting the functionality that we use the most, like performant reads from Redshift. Given the timeline and workload, we had to make tradeoffs and support only a subset of features. While some made the cut (reading from Redshift, parsing of various data types, implementing an InMemoryS3AFileSystem for testing), others didn’t (Postgres driver support, AWS IAM Authentication, some SaveMode options). We have already seen great internal adoption, and several teams are now unblocked in their move to Spark 2.4.
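For a sense of what the supported read path looks like from PySpark, here is a minimal sketch rather than an official example. It assumes the fork registers its data source under the name `io.github.spark_redshift_community.spark.redshift` and accepts the `url`, `dbtable`, `tempdir`, and `forward_spark_s3_credentials` options documented for the original spark-redshift. Check the project README for the version you install. The helper names `redshift_read_options` and `read_redshift_table` are hypothetical, and the JDBC URL, table, and bucket in the usage note are placeholders.

```python
# Data source name assumed for the community fork; verify it against the README.
REDSHIFT_FORMAT = "io.github.spark_redshift_community.spark.redshift"


def redshift_read_options(jdbc_url, table, s3_tempdir):
    """Build the reader options for loading one Redshift table (hypothetical helper)."""
    return {
        "url": jdbc_url,        # JDBC URL of the Redshift cluster
        "dbtable": table,       # table to read
        "tempdir": s3_tempdir,  # S3 location used to stage the data in transit
        # Assumption: reuse the S3 credentials Spark is already configured with.
        "forward_spark_s3_credentials": "true",
    }


def read_redshift_table(spark, jdbc_url, table, s3_tempdir):
    """Return a DataFrame for `table`; `spark` is an existing SparkSession."""
    options = redshift_read_options(jdbc_url, table, s3_tempdir)
    return spark.read.format(REDSHIFT_FORMAT).options(**options).load()
```

With a running SparkSession and the package on the classpath, a call might look like `read_redshift_table(spark, "jdbc:redshift://<host>:5439/<db>?user=<user>&password=<password>", "my_table", "s3a://<bucket>/tmp/")`. Keeping the options in a plain dictionary makes them easy to inspect or unit test without a cluster.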
Our plans for the future include supporting the project by focusing on the features we use the most, in the hope that the community will carry forward the features they find useful. spark-redshift-community is an edition for the community. Any support in the form of GitHub issues or pull requests is very welcome.