Yelp loves Python, and we use it at scale to power our websites and process the huge amount of data we produce.

Pyleus is a new open-source framework that aims to do for Storm what mrjob, another open-source Yelp project, does for Hadoop: let developers process large amounts of data in pure Python and iterate quickly, spending more time solving business-related problems and less time concerned with the underlying platform.

First, a brief introduction to Storm. From the project’s website, “Apache Storm is a free and open source distributed realtime computation system. Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing.”

A Pyleus topology consists of, at minimum, a YAML file describing the structure of the topology, declaring each component and how tuples flow between them. The pyleus command-line tool builds a self-contained Storm JAR which can be submitted to any Storm cluster.

When it comes to massive data processing demos, “word count” is a classic. Since Storm operates on “unbounded streams of data”, we can’t calculate a total count for each unique word, since the input stream could continue indefinitely. Instead, our topology will maintain a monotonically increasing counter for each unique word, and log the value each time we see it.

So how can you build a word count Storm topology using Pyleus? All you need is a pyleus_topology.yaml and a few Python components.

For a simple demonstration, you need to know about the three core concepts in Storm:

  • A tuple is the unit of data within a Storm topology, flowing into and out of processing components.
  • Spouts are components that feed tuples into a topology. Usually, a spout consumes data from an external source, like Kafka or Kinesis, then emits records as tuples.
  • Bolts subscribe to the output streams of one or more other spouts and bolts, do some processing, then emit tuples of their own.

This topology has three components: A spout that emits a random line of “lorem ipsum” text, a bolt that splits lines into words, and bolt that counts and logs occurrences of the same word.

Here are the contents of pyleus_topology.yaml:

The spout configuration is self-explanatory, but the bolts must indicate the tuple streams to which they are to subscribe. The split-words bolt subscribes to line-spout with a shuffle_grouping—this means that tuples emitted from line-spout should be evenly and randomly distributed amongst all instances of split-words, be there one, five, or fifty.

count-words, however uses a fields_grouping on the ‘word’ field. This forces all tuples emitted from split-words with the same value of ‘word’ to go to the same instance of count-words. This allows the code in word_count.count_words to make the assumption that it will “see” all occurrences of the same word within the same process.



word_count/ is left as an exercise for the reader. (Or, you could just check out the full example on GitHub)

Now, running pyleus build in this directory will produce a file word_count.jar which you can submit to a Storm cluster with pyleus submit, or run locally for testing with pyleus local.

The code for a Pyleus topology can be very simple, but one feature in particular we are excited about is built-in virtualenv integration. Simply include a requirements.txt file alongside your pyleus_topology.yaml, and pyleus build will build a virtualenv that your code can use and embed it in the JAR. You can even re-use packaged Pyleus components, and refer to them directly in pyleus_topology.yaml!

Another team at Yelp has already developed a Pyleus spout for consuming data from an internal source and built a Python package for it. Now, others within the company can add one line to their requirements.txt and use the spout in their pyleus_topology.yaml without writing a single line of code.

Pyleus is beta software, but feedback and pull requests are happily accepted.

Get started using Pyleus by installing it with pip install pyleus, then check out the source on GitHub for more examples.

Back to blog