Yelp’s engineering team loves Docker. We’re already using it for a growing number of projects internally, but there are applications that Docker isn’t a great fit for, such as providing independent (VM-like) containers on shared hardware for people to interactively ssh into. You can, of course, run sshd inside a Docker container, but you can only run one sshd per port. If you want multiple users to be able to ssh into the same server without having a custom port allocated per user, you’re out of luck.
To solve this problem we built dockersh, a user shell for isolated, containerized environments in Docker. It’s available as a container on Docker Hub with a one-step install. This is a fully functional and usable implementation that you can play with. I’m already using it on some of my home servers to separate people’s screen/tmux sessions for irc into separate containers.
Originally I solved this by using the ssh ForceCommand setting so that users would log into the host and then immediately be forced to ssh to a Docker container. This was not ideal, as key management and mapping each user’s ForceCommand setting were complex.
After discussing this problem with a couple of my colleagues, and some inspiration from a recent blog post about sshd in containers, we had a rough idea of what we wanted to experiment with: making a shell which located a user inside a container using nsenter!
In brief, we planned to write a utility which would:
Be invoked by the login system as a user shell (when a user sshs into the host)
Start up a Docker container for the user (if it wasn’t already running), with the user’s home directory mounted
Lastly, exec nsenter to give the user a shell inside this container
Theoretically, this could give us isolated environments, as each user would have their own network stack, process and memory namespaces, etc. If subsequent ssh sessions just enter the pre-existing container, it would look (to the user) like they had their own dedicated machine.
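The three steps above can be sketched roughly as follows. This is only an illustration of the flow, not dockersh's actual code (which is written in Go): the container naming scheme, the `ubuntu` image and the helper names are all assumptions, and the functions only build the command lines rather than running them.

```python
def container_name(user):
    # Hypothetical naming scheme: one long-lived container per user.
    return f"{user}_dockersh"


def docker_run_argv(user, image="ubuntu"):
    """Command line to start a detached container with the user's home mounted."""
    home = f"/home/{user}"
    return [
        "docker", "run", "-d",
        "--name", container_name(user),
        "-v", f"{home}:{home}",
        # Keep the container alive so later ssh sessions can enter it.
        image, "tail", "-f", "/dev/null",
    ]


def docker_pid_argv(user):
    """Command line that prints the PID of the container's init process."""
    return ["docker", "inspect", "--format", "{{.State.Pid}}", container_name(user)]


def nsenter_argv(pid, shell="/bin/bash"):
    """Command line that joins the container's namespaces and starts a shell."""
    return [
        "nsenter", "--target", str(pid),
        "--mount", "--uts", "--ipc", "--net", "--pid",
        shell,
    ]


def login_steps(user, container_running):
    """Commands to run before the final nsenter, in order."""
    steps = []
    if not container_running:
        steps.append(docker_run_argv(user))
    steps.append(docker_pid_argv(user))
    return steps
```

A real login shell would run the commands from `login_steps`, read the PID that `docker inspect` prints, and then replace itself with the `nsenter_argv(pid)` command (for example via `os.execvp`), so that the user's session ends when the shell inside the container exits.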
We thought we could hack together a prototype of this pretty quickly. Turns out we were right; with a couple of patches and a custom nsenter version, we were in business, at least the ‘terrible but kinda functional prototype’ business, perfect for our upcoming hackathon.
We decided to rewrite the prototype version in Go, which was a first for me. At the time, this was mostly an excuse to play with Go but it quickly proved to be a very sound decision. We were able to make use of some excellent libraries, like libcontainer, which did a lot of the heavy lifting for us.
dockersh can be explicitly invoked for testing, but more usually it will be set up as an ssh ForceCommand, or added to /etc/shells and used as the shell for users in /etc/passwd. It reads a global config file and per-user configuration files, and then sets up a container and invokes a real shell process for the user. This user shell process inside their container should be more secure than access to the host, as the container’s kernel namespaces and cgroups are applied to the new shell, in addition to dropping additional ‘capabilities’ for the process including SUID and SGID bits.
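To make the "global config plus per-user config" idea concrete, here is a minimal sketch of layered settings. The INI layout, the `[dockersh]` section name, the keys and the rule that per-user values simply override global ones are all assumptions made for illustration; the real file format and precedence rules may differ.

```python
import configparser

# Hypothetical built-in defaults; the keys are illustrative only.
DEFAULTS = {"image": "busybox", "shell": "/bin/sh", "mount_home": "true"}


def load_settings(global_text, user_text=""):
    """Merge defaults, the global config and an optional per-user config.

    Later sources win, so per-user values override global ones here.
    """
    parser = configparser.ConfigParser()
    parser.read_dict({"dockersh": DEFAULTS})
    parser.read_string(global_text)
    if user_text:
        parser.read_string(user_text)
    return dict(parser["dockersh"])
```

For example, a global file that sets `image = ubuntu` combined with a per-user file that sets `shell = /bin/zsh` yields a container based on `ubuntu` with `/bin/zsh` as the shell, while `mount_home` keeps its default.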
The utility we have now is just a start – I’ve already got a list of future improvements and further work (in the issue tracker), and I encourage you to have a play with the project and let us know what you think.
Yelp is celebrating its 10th anniversary this year. That’s right, a decade of marvelous reviews from Yelpers all around the world. What better way to celebrate the big 1-0 than to build a tool that would take a sneak peek into over 61 million reviews from our community and let you discover real world trends in cities all over the globe?
As announced in our official blog, Yelp Trends is a fun way to visualize how often specific words are used in reviews and how that frequency has developed over the past 10 years. From popular trends in the culinary world to popular slang terms to what’s hot in fitness, users are encouraged to explore the world from the local communities’ perspective.
Yelp Trends started as a hackathon project when a group of engineers indexed words used in reviews into Elasticsearch. We leveraged its powerful facet queries in the backend and built the UI to graph the normalized review frequencies.
At Yelp, Elasticsearch is an important part of our search infrastructure, as it provides a robust, distributed platform that is easy to integrate with other services. On top of Elasticsearch we also have our own custom frameworks for creating clients as well as indexers called Apollo and ElasticIndexer, respectively.
Apollo (of course named after Apollo Creed) allows us to quickly build Elasticsearch clients with a fixed interface and also provides us with many default features such as monitoring and a managed infrastructure. Our Apollo client queries the reviews index and returns a JSON-formatted time series of the queried term’s frequency.
What does a request out to Elasticsearch look like in Apollo? It is not much different from the regular JSON you would send, just expressed as Python data structures.
This sample shows how we built the request to correctly search through all of the indexed data for relevant results. We have three different filters here, all being selected with ‘and’ to make sure that each data point from Elasticsearch matches all three requirements: the city (San Francisco here), the restaurants, and the review language. These all get applied to the phrase we’re matching on, pizza.
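As a rough sketch of the filters described above, such a request could be built like this. It uses the Elasticsearch 1.x-era `filtered` query and `and` filter; the field names (`text`, `city`, `category`, `language`) and the function name are assumptions, not the actual Apollo request or the real index schema.

```python
import json


def build_trend_request(phrase, city, category, language):
    """Build a query matching `phrase` in reviews that pass all three filters."""
    return {
        "query": {
            "filtered": {
                # The phrase we are matching on, e.g. "pizza".
                "query": {"match_phrase": {"text": phrase}},
                # Every document must satisfy all three term filters.
                "filter": {
                    "and": [
                        {"term": {"city": city}},
                        {"term": {"category": category}},
                        {"term": {"language": language}},
                    ]
                },
            }
        }
    }


request = build_trend_request("pizza", "San Francisco", "restaurants", "en")
# The same structure serializes directly to the JSON body Elasticsearch expects.
body = json.dumps(request)
```

Because the request is plain dictionaries and lists, it can be composed, inspected and unit tested like any other Python value before it is serialized.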
After we have all of this data, we need to structure it in a manner that makes it easy for us to visualize. Using facets we’re able to have Elasticsearch format the data into an easily consumable format.
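To illustrate, a date histogram facet can bucket matching reviews by month, and the counts can then be normalized against the total number of reviews in each bucket. This is a sketch under assumptions: the `time_created` field name, the facet name and the normalization rule (matching reviews divided by all reviews in the same month) are illustrative, and the entry shape follows the Elasticsearch 1.x facet response (`time` and `count` per entry).

```python
def monthly_facet(field="time_created"):
    """Facet definition that buckets documents by calendar month."""
    return {"by_month": {"date_histogram": {"field": field, "interval": "month"}}}


def normalize(matching_entries, total_entries):
    """Turn raw monthly counts into a share of all reviews for that month.

    Both arguments are lists of {"time": ..., "count": ...} facet entries.
    Months with no reviews at all are skipped to avoid dividing by zero.
    """
    totals = {entry["time"]: entry["count"] for entry in total_entries}
    series = []
    for entry in matching_entries:
        total = totals.get(entry["time"], 0)
        if total:
            series.append({"time": entry["time"], "frequency": entry["count"] / total})
    return series
```

Normalizing this way keeps a word from appearing to trend upward simply because the overall number of reviews grew over the years.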
ElasticIndexer, the other framework that complements Apollo, is an indexing pipeline for loading Yelp data into Elasticsearch. Building a new index can take several days for some of our larger indices, so to help us avoid doing this, ElasticIndexer constantly monitors database tables for changes and re-indexes documents as they are modified or added. It also has the ability to determine field dependencies, which enables us to re-index only the fields that actually change when the database changes.
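The field dependency idea can be sketched as a mapping from each indexed field to the database columns it is derived from. The table, column and field names below are hypothetical, and this is not ElasticIndexer's actual interface; it only shows how knowing the dependencies lets an indexer touch just the affected fields.

```python
# Hypothetical mapping: index field -> database columns it is computed from.
FIELD_DEPENDENCIES = {
    "text_en": {"review.text", "review.language"},
    "rating": {"review.stars"},
    "business_name": {"business.name"},
}


def fields_to_reindex(changed_columns, dependencies=FIELD_DEPENDENCIES):
    """Return the index fields affected by a set of changed database columns."""
    changed = set(changed_columns)
    return sorted(
        field for field, columns in dependencies.items() if columns & changed
    )
```

With this mapping, a change to `review.stars` alone would only require updating the `rating` field rather than rebuilding the whole document.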
Leveraging both Apollo and ElasticIndexer, the hackathon project was at first branded Wordtime. We used AngularJS, D3.js and Rickshaw, and adopted design templates from Bootstrap. Rickshaw provided the framework to display interactive graphs drawn with SVG, which are highly customizable and easily styled using standard CSS techniques.
It quickly became obvious that the tool was addictive and that people enjoyed trying out new examples. That’s when some of the other teams learned about the project and, with the anniversary in mind, we decided to productionize the tool.
Wordtime originally supported English reviews only, while for Yelp Trends we wanted to add support for other review languages as well (through specialized search analyzers). While we could have indexed the review text of reviews in other languages in the same field we use for English, this would not have worked well: Elasticsearch only allows a fixed analyzer per field, and an English analyzer would not work well for a completely different language such as Japanese. We ended up adding new fields for each language and then reindexed the entire review corpus over a two-day period. Our final reviews index grew to over 150GB, but queries still took under a second.
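A one-field-per-language mapping might look roughly like the sketch below. The field naming scheme (`text_<language>`) and the set of languages are assumptions; `english`, `german` and `french` are built-in Elasticsearch language analyzers, while `kuromoji` for Japanese comes from a separate analysis plugin, and `"type": "string"` reflects the 1.x-era mapping syntax.

```python
# Hypothetical language -> analyzer table.
ANALYZERS = {
    "en": "english",
    "de": "german",
    "fr": "french",
    "ja": "kuromoji",  # provided by the Japanese (kuromoji) analysis plugin
}


def review_text_mapping(analyzers=ANALYZERS):
    """Mapping fragment with one text field, and one fixed analyzer, per language."""
    return {
        "properties": {
            f"text_{language}": {"type": "string", "analyzer": analyzer}
            for language, analyzer in analyzers.items()
        }
    }


def text_field_for(language):
    """Pick the field to index into or query, based on the review's language."""
    if language not in ANALYZERS:
        raise ValueError(f"unsupported review language: {language}")
    return f"text_{language}"
```

At index time each review's text goes only into the field for its own language, and at query time the language filter determines which field the phrase is matched against.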
Yelp Trends is an inspiring tool! Give it a try yourself. We can’t wait to see what trends you will uncover in your city. And don’t forget to share your findings with us. Have fun!
We’ve got a busy month ahead of us with several great events being held at Yelp HQ. We hope everyone was able to catch our security team over at BSides LV this past week where one of our engineers, Ioannis, presented on honeypots and ran a workshop teaching people how to set up one of their own.
From panel discussions to hands-on workshops on everything from fashionable tech to Android development and growth hacking, there’s plenty to keep you busy at Yelp HQ. We’ve got six great events this month, and spaces are going fast. Make sure to sign up at the event links below. Hope to see you at Yelp!
People say that security is hard and that’s exactly why we have a dedicated security team here at Yelp! We place tremendous importance on securing our environment, our employees and the millions of visitors who trust Yelp every month.
Information Security is not a solo endeavor. You have to exchange information with fellow security engineers and researchers, get informed of new vulnerabilities and threats and build a “web of trust” containing security practitioners that you can count on. “Community” is the keyword in this case. This is why we are officially sponsoring Security BSides Las Vegas 2014!
BSides is a great community-driven security convention held in Las Vegas August 5th and 6th, at Tuscany Suites & Casino. Our own security team will be there and would be more than happy to meet and exchange GPG keys, errr… ideas and knowledge!
It’s also my great pleasure to have been selected to conduct a four-hour workshop for 28 lucky participants on one of my favorite research topics: honeypots! Unfortunately, all the available spots were filled within the first few days, but make sure to catch me at BSides if you are interested!
In the field of computer security, honeypots are systems aimed at deceiving malicious users or software that launch attacks against the servers and network infrastructure of various organizations. Essentially, they are systems running fake or emulated services with security holes that are open for exploitation. Everything that an attacker or malware does can be recorded for further analysis. Thus, honeypots can be deployed as protection mechanisms for an organization’s real systems, or as research units to study and analyze the methods employed by human hackers or malware.
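To give a feel for the idea, here is a deliberately tiny sketch of a low-interaction honeypot: it pretends to be an SSH server by sending a banner, records whatever the client sends first, and hangs up. The banner string, port and log format are assumptions chosen for illustration; real honeypots emulate far more of the protocol and store their captures much more carefully.

```python
import socket
import time

# Fake service banner; the version string is made up for the example.
FAKE_BANNER = b"SSH-2.0-OpenSSH_6.6\r\n"


def handle_connection(conn, peer, log, max_bytes=1024):
    """Send the fake banner, record the client's first message, then hang up."""
    conn.sendall(FAKE_BANNER)
    conn.settimeout(2.0)
    try:
        data = conn.recv(max_bytes)
    except socket.timeout:
        data = b""
    log.append({"time": time.time(), "peer": peer, "data": data})
    conn.close()


def serve(host="127.0.0.1", port=2222, log=None):
    """Accept connections forever, logging each one (blocks the caller)."""
    log = [] if log is None else log
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
        server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        server.bind((host, port))
        server.listen()
        while True:
            conn, peer = server.accept()
            handle_connection(conn, peer, log)
```

Each log entry records when the connection happened, where it came from and what was sent, which is the raw material for the kind of analysis described here.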
At the BSides workshop, we will talk at length about the use cases and the value of honeypots, what problems they solve (or create), how to get the best out of the available deployment scenarios, what you can do with the data you can capture and how to get a better understanding of them.
This will be followed by a hands-on portion where participants will create and test several research honeypots by manually deploying and testing in real time. One honeypot system will undertake the role of a web trap for attackers who target the SSH service in order to gain illegal server access. SSH is the most common way sysadmins manage their systems and it’s always an easy entry point if public key authentication is not in place. Another one will undertake the role of a malware collector, a device usually deployed by malware analysts and anti-virus companies to gather and securely store malicious binary samples.
We will also talk about post-capturing activities and further analysis techniques. I will present some useful visualization tools, plus a honeypot bundle Linux distribution that contains many pre-configured versions of the aforementioned honeypots and tools, which can make the deployment of honeypots in small or large networks an easy task. The latter is a project by me called HoneyDrive and you can find the latest version (released only a few days ago) here: http://sourceforge.net/projects/honeydrive/
Do you think all of these sound interesting? We surely do! If you want to be part of a security team in one of the most exciting companies to work for, take a look at our careers page. We are currently hiring security engineers in our San Francisco, New York and London offices!
The Yelp Dataset Challenge provides the academic community with a real-world dataset over which to apply their research. We encourage students to take advantage of this wealth of data to develop and extend their own research in data science and machine learning. Students who submit their research are eligible for cash awards and incentives for publishing and presenting their findings.
The most recent Yelp Dataset Challenge (our third round) opened in February 2014, giving students access to our Phoenix Academic Dataset, with reviews and businesses from the greater Phoenix metro area. In the fourth round, open now, we are expanding the dataset to include data from four new cities from around the world. We are also opening up the challenge to international students; see the terms and conditions for more information.
We are proud to announce that we are extending the popular Phoenix Academic Dataset to include four new cities! By adding a diverse set of cities we hope to encourage students to compare and contrast the different aspects of each city and find new insights about what makes each city unique. The dataset is comprised of reviews, businesses and user information from:
Business Attributes – 320,002 (+208,441 new attributes!)
Check-in Sets – 31,617 (+20,183 new check-in sets!)
Tips – 403,210 (+289,217 new tips!)
Users – 252,898 (+182,081 new users!)
User Connections – 955,999 (+804,482 new edges!)
Reviews – 1,125,458 (+790,436 new reviews!)
Round 4 is Now Live
Along with the updated dataset, we’re also happy to announce the next iteration of the Yelp Dataset Challenge. The challenge will be open to students around the world and will run from August 1, 2014 to December 31, 2014. See the website for the full terms and conditions. This data can be used to train a myriad of models and extend research in many fields. So download the dataset now and start using it right away!