Introducing Pyleus: An Open-source Framework for Building Storm Topologies in Pure Python
-
Patrick L., Software Engineer
- Oct 15, 2014
Yelp loves Python, and we use it at scale to power our websites and process the huge amount of data we produce.
Pyleus is a new open-source framework that aims to do for Storm what mrjob, another open-source Yelp project, does for Hadoop: let developers process large amounts of data in pure Python and iterate quickly, spending more time solving business-related problems and less time concerned with the underlying platform.
First, a brief introduction to Storm. From the project’s website, “Apache Storm is a free and open source distributed realtime computation system. Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing.”
A Pyleus topology consists of, at minimum, a YAML file describing the structure of the topology, declaring each component and how tuples flow between them. The pyleus
command-line tool builds a self-contained Storm JAR which can be submitted to any Storm cluster.
When it comes to massive data processing demos, “word count” is a classic. Since Storm operates on “unbounded streams of data”, we can’t calculate a total count for each unique word, since the input stream could continue indefinitely. Instead, our topology will maintain a monotonically increasing counter for each unique word, and log the value each time we see it.
So how can you build a word count Storm topology using Pyleus? All you need is a pyleus_topology.yaml
and a few Python components.
For a simple demonstration, you need to know about the three core concepts in Storm:
- A tuple is the unit of data within a Storm topology, flowing into and out of processing components.
- Spouts are components that feed tuples into a topology. Usually, a spout consumes data from an external source, like Kafka or Kinesis, then emits records as tuples.
- Bolts subscribe to the output streams of one or more other spouts and bolts, do some processing, then emit tuples of their own.
This topology has three components: A spout that emits a random line of “lorem ipsum” text, a bolt that splits lines into words, and bolt that counts and logs occurrences of the same word.
Here are the contents of pyleus_topology.yaml
:
The spout configuration is self-explanatory, but the bolts must indicate the tuple streams to which they are to subscribe. The split-words
bolt subscribes to line-spout
with a shuffle_grouping
—this means that tuples emitted from line-spout
should be evenly and randomly distributed amongst all instances of split-words
, be there one, five, or fifty.
count-words
, however uses a fields_grouping
on the ‘word’ field. This forces all tuples emitted from split-words
with the same value of ‘word’ to go to the same instance of count-words
. This allows the code in word_count.count_words
to make the assumption that it will “see” all occurrences of the same word within the same process.
word_count/line_spout.py
:
word_count/count_words.py
:
word_count/split_words.py
is left as an exercise for the reader. (Or, you could just check out the full example on GitHub)
Now, running pyleus build in this directory will produce a file word_count.jar
which you can submit to a Storm cluster with pyleus submit
, or run locally for testing with pyleus local
.
The code for a Pyleus topology can be very simple, but one feature in particular we are excited about is built-in virtualenv integration. Simply include a requirements.txt
file alongside your pyleus_topology.yaml
, and pyleus build
will build a virtualenv that your code can use and embed it in the JAR. You can even re-use packaged Pyleus components, and refer to them directly in pyleus_topology.yaml
!
Another team at Yelp has already developed a Pyleus spout for consuming data from an internal source and built a Python package for it. Now, others within the company can add one line to their requirements.txt
and use the spout in their pyleus_topology.yaml
without writing a single line of code.
Pyleus is beta software, but feedback and pull requests are happily accepted.
Get started using Pyleus by installing it with pip install pyleus
, then check out the source on GitHub for more examples.