How we bring LLM intelligence to millions of daily searches at Yelp.

From the moment a user enters a search query to when we present a list of results, understanding the user’s intent is crucial for meeting their needs. Are they looking for a general category of business for that evening, a particular dish or service, or one specific business nearby? Does the query contain nuanced location or attribute information? Is the query misspelled? Is their phrasing unusual, so that it might not align well with our business data? All of the above questions represent Natural Language Understanding tasks where Large Language Models (LLMs) might well do better than traditional techniques. In this blog post, we detail our development process and the steps we’ve taken at Yelp to enhance our query understanding using LLMs, from ideation to full-scale rollouts in production.

Introduction

Yelp has integrated Large Language Models (LLMs) into a wide array of features, from creating business summaries [1] that highlight what a business is best known for based on first-hand reviews, to Yelp Assistant [2], which intelligently guides consumers through the process of requesting a quote from a service provider with personalized, relevant questions. Among these applications, query understanding was the pioneering project and has become the most refined, laying the groundwork for Yelp’s innovative use of LLMs to improve user search experiences. In particular, query understanding tasks such as spelling correction, segmentation, canonicalization, and review highlighting all share a few common and advantageous features: (1) they can be cached at the query level, (2) the amount of text to be read and generated is relatively low, and (3) the query distribution follows a power law - a small number of queries are very popular. These features make query understanding a particularly efficient ground for applying LLMs.

In this post, we will discuss our generic approach for leveraging LLMs across these query understanding tasks. To showcase this approach, we use the following two running examples:

  • Query Segmentation: Given a query, we want to segment and label semantic parts of that query. For example, “pet-friendly sf restaurants open now” might have the following segmentation: {topic} pet-friendly {location} sf {topic} restaurants {time} open now. This can be used to further refine the search location when suitable, implicitly rewriting the geographic bounding box (geobox) to match the user’s intent.
  • Review Highlights: Given a query, we want a creatively expanded list of phrases to match on – particularly to help us find interesting “review snippets” for each business. Review snippets help the user see how each shown business is relevant to their search query. For example, if a user searched for “dinner before a broadway show,” bolding the phrase “pre-show dinner” in a short review snippet is a very helpful hint for their decision making.

Yelp had older pre-LLM systems for both of these tasks, but they were fragmented (i.e. several different systems stitched together) and often lacked intelligence, leaving room for improvement to provide an exceptional user experience. As we progress, we’ll continue to refer to these examples to highlight our path from conceptualization to full-scale production rollouts.

[Figure 1] Generic approach for leveraging LLMs across two query understanding tasks. We formulate and scope the use case for LLMs, build and validate a small proof of concept, and then aggressively scale if the POC indicates a positive impact.

Formulation

In this step, our initial goals are to: (1) determine if an LLM is the appropriate tool for the problem, (2) define the ideal scope and output format for the task, and (3) assess the feasibility of combining multiple tasks into a single prompt. Here, we also consider the potential for Retrieval Augmented Generation (RAG) to assist in the task, by identifying extra information (besides the query text) that could help the model make better decisions. This typically entails quick prototyping with the most powerful LLM available to us, such as the latest stable GPT-4 model, and creating many iterations of the prompt. At this stage, we also welcome changes to the task’s formulation itself, as we gain a deeper understanding of how the LLM perceives the task.
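
As a rough illustration of what this prototyping loop looks like in practice, here is a minimal sketch using the OpenAI Python client; the model name, system prompt, and helper function are illustrative assumptions rather than our production setup.

# Minimal prototyping sketch (illustrative assumptions: model name, prompt text,
# and helper are placeholders, not Yelp's production setup).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SYSTEM_PROMPT = """You segment Yelp search queries into labeled parts.
Use the tags {topic}, {name}, {location}, {time}, {question}, {none}.
Mark corrected spans with [spell corrected - high]."""

def prototype_segmentation(query: str, model: str = "gpt-4o") -> str:
    """Run one prompt iteration against a powerful model and return the raw text."""
    response = client.chat.completions.create(
        model=model,
        temperature=0.0,  # deterministic output makes prompt iteration easier
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": query},
        ],
    )
    return response.choices[0].message.content

print(prototype_segmentation("pet-friendly sf restaurants open now"))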

Query Segmentation

Compared to traditional Named Entity Recognition techniques, LLMs excel at segmentation tasks and are flexible enough to allow for easy customization of the individual classes. After several iterations, we settled on six classes for query segmentation: topic, name, location, time, question, and none. This involved a number of small but important decisions:

  1. Our legacy models had several subclasses all akin to “topic,” but keeping them would have required the LLM to understand intricate details of our internal taxonomy that are both unintuitive and subject to change, so we collapsed them into a single “topic” class.
  2. We introduced a new “question” tag for searches that want an answer beyond just “a list of businesses.” For example, the query “magic kingdom upcoming events” might be classified as {name} magic kingdom {question} upcoming events.
  3. We aligned the model outputs with potential downstream applications that can benefit from a more intelligent labeling of these tags, such as implicit location rewrite, improved name intent detection, and more accurate auto-enabled filters.

Few-shot examples within query segmentation prompt:

1) chicago riverwalk hotels
   => {location} chicago riverwalk {topic} hotels
2) grand chicago riverwalk hotel
   => {name} grand chicago riverwalk hotel
3) healthy fod near me
   => {topic} healthy food {location} near me [spell corrected - high]

We also took note that spell correction is not only a prerequisite for segmentation, but also a conceptually related task. Throughout the process we learned that spell correction and segmentation can be handled together by a sufficiently powerful model, so we added a meta tag to mark spell-corrected sections and combined the two tasks into a single prompt. On the RAG side, we augment the input query text with the names of businesses that have been viewed for that query. This helps the model distinguish the many facets of business names from common topics, locations, and misspellings, which is highly useful for both segmentation and spell correction (and was another reason for combining the two tasks).

RAG examples (using business names):

1) barber open sunday [Fade Masters, Doug's Barber Shop]
   => {topic} barber {time} open sunday
2) buon cuon [Banh Cuon Tay Ho, Phuong Nga Banh Cuon]
   => {topic} banh cuon [spell corrected - high]
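
To make the tagged output shown in these examples usable downstream, it has to be parsed into structured segments. Below is a simplified, assumed sketch of such a parser; the real production parsing and confidence handling may differ.

import re
from dataclasses import dataclass

# Tags produced by the segmentation prompt described above.
TAGS = ("topic", "name", "location", "time", "question", "none")

@dataclass
class Segment:
    tag: str
    text: str

def parse_segmentation(output: str):
    """Parse e.g. '{topic} banh cuon [spell corrected - high]' into a list of
    Segment objects plus a flag indicating whether a spell correction was applied."""
    spell_corrected = bool(re.search(r"\[spell corrected - \w+\]", output))
    output = re.sub(r"\[spell corrected - \w+\]", "", output).strip()

    # Split on {tag} markers, keeping the tag names:
    # ['', 'topic', ' barber ', 'time', ' open sunday']
    parts = re.split(r"\{(" + "|".join(TAGS) + r")\}", output)
    segments = [
        Segment(tag=tag, text=text.strip())
        for tag, text in zip(parts[1::2], parts[2::2])
        if text.strip()
    ]
    return segments, spell_corrected

# parse_segmentation("{topic} barber {time} open sunday")
# -> ([Segment(tag='topic', text='barber'), Segment(tag='time', text='open sunday')], False)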

Review Highlights

LLMs also excel at creative tasks by expanding on concepts using their world knowledge. In this task, we used the LLM to generate terms that are suitable to be highlighted, and we agreed on a low bar for inclusion – opting to include any phrase that would be better than showing no snippet at all.

The hardest part of this task was devising great examples of phrase lists. Using only the words in the query text, we have very limited options as to what to highlight in the review snippet. Beyond that, there are many subtleties within this complex task that make it difficult for traditional text similarity models to solve, such as:

  1. Understanding what a query means in the context of Yelp - e.g., reservations and pickups, “food near me” searches, and/or Yelp Guaranteed searches for services.
  2. If a user searches for seafood, it would be too limited to only consider reviews containing the “seafood” term. However, we can highlight adjacent terms such as “fresh fish,” “fresh catch,” “salmon roe,” “shrimp,” etc., which are interesting and sufficiently relevant to the business.
  3. In contrast, we might also need to go up the semantic tree when appropriate and be more general, expanding searches like “vegan burritos” to “vegan,” “vegan options,” and so on.
  4. Generating multi-word or casual phrases like “watch the game,” which are highly relevant to searches like “best bar to watch Lions games.”
  5. Being cognizant of whether phrases generated are likely to produce spurious matches in actual reviews, or what to prioritize when a query contains multiple concepts, such as “ayce korean bbq for under $10 near me.”

In essence, the way we define phrase expansions requires critical reasoning to resolve such subtleties, and we taught the LLMs to replicate that thought process through carefully curated examples.

On the RAG side, we enhanced the raw input query text with the most relevant business categories with respect to that query (from our in-house predictive model). This helps the LLM generate more relevant phrases for our needs, especially for searches with a non-obvious topic (like the name of a specific restaurant) or ambiguous searches (like “pool” - swimming vs. billiards).

Evolution of curated examples within the review highlighting prompt:

May 2022
Query: healthy food
-> Key concepts: healthy food, healthy, organic

March 2023
healthy food
-> healthy food, healthy, organic, low calorie, low carb

September 2023
healthy food
-> healthy food, healthy options, healthy | nutritious, organic, low calorie, low carb, low fat, high fiber | fresh, plant-based, superfood

December 2023 (with RAG)
search: healthy food, categories: [healthmarkets, vegan, vegetarian, organicstores]
-> healthy food, healthy options, healthy | nutritious, organic, low calorie, low carb, low fat, high fiber | fresh, plant-based, superfood
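
In the later prompt versions above, the output groups phrases with “|” separators, roughly from the closest matches to looser but still relevant ones (our reading of the examples). Below is a simplified, assumed sketch of parsing that output and using it to pick a phrase to bold in a review; the real matching pipeline is more involved.

def parse_highlight_phrases(output: str):
    """Parse 'a, b | c, d | e' into ordered groups of candidate phrases."""
    groups = []
    for group in output.split("|"):
        phrases = [p.strip().lower() for p in group.split(",") if p.strip()]
        if phrases:
            groups.append(phrases)
    return groups

def best_matching_phrase(review_text: str, groups):
    """Return the first phrase (in group-priority order) found in the review text."""
    text = review_text.lower()
    for group in groups:
        for phrase in group:
            if phrase in text:
                return phrase
    return None

# groups = parse_highlight_phrases(
#     "healthy food, healthy options | nutritious, organic | fresh, plant-based")
# best_matching_phrase("Great spot with organic bowls and fresh juices", groups)
# -> "organic"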

Proof of Concept

After formulating the task and defining our input/output formats, our focus shifts to building a proof of concept to demonstrate the effectiveness of the new approach in practice. Up to this point, we had been iterating on our ideas using the most powerful LLM available, which typically entails significant latency and cost. However, such a setup is not conducive to a real-time system dealing with a vast array of distinct queries.

To address this challenge, we leverage the fact that the distribution of query frequencies approximately follows a power law. By caching (pre-computing) high-end LLM responses for only the head queries above a certain frequency threshold, we can cover a substantial portion of the traffic and run a quick experiment without incurring significant cost or latency. We then integrated the cached LLM responses into the existing system and performed offline and online (A/B) evaluations.
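
As a rough sketch of this idea (the threshold and function names are hypothetical), one can estimate traffic coverage from query frequency counts and pre-compute LLM responses only for the head:

from collections import Counter

def select_head_queries(query_counts: Counter, min_count: int = 50):
    """Pick head queries above a frequency threshold and report the share of
    total traffic they cover; with a power-law distribution this share is large."""
    total = sum(query_counts.values())
    head = [q for q, count in query_counts.items() if count >= min_count]
    coverage = sum(query_counts[q] for q in head) / total
    return head, coverage

# Offline batch job (hypothetical): pre-compute expensive LLM responses for head
# queries only, then serve them from a key/value cache at search time.
# head, coverage = select_head_queries(daily_query_counts, min_count=50)
# cache = {q: run_expensive_llm_prompt(q) for q in head}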

Query Segmentation

To evaluate offline, we observed the impact of the new segmentation on downstream tasks as well as on specialized datasets. We compared the accuracy of LLM-provided segmentation with the status-quo system on human-labeled datasets of name match and location intent. Among the different applications of this segmentation signal, we were able to (a) leverage token probabilities for {name} tags to improve our query-to-business-name matching and ranking system and (b) achieve online metric wins with implicit location rewrite using the {location} tags.

Original Query Text           | Original Location | Rewritten Location (only used by search backend)
Restaurants near Chase Center | San Francisco, CA | 1 Warriors Way, San Francisco, CA 94158
Ramen Upper West Side         | New York, NY      | Upper West Side, Manhattan, NY
Epcot restaurants             | Orlando, FL       | Epcot, Bay Lake, FL

[Figure 2] Side-by-side search results (“Status Quo” vs. “Rewritten”): a better understanding of location intent lets us return more relevant results to users. One of our POCs leverages query segmentation to implicitly rewrite the text within location boxes to a refined location within 30 miles of the user’s search when we have high confidence in the location intent. For example, the segmentation “epcot restaurants => {location} epcot {topic} restaurants” helps us understand the user’s intent of finding businesses within the Epcot theme park at Walt Disney World. By implicitly rewriting the location text from “Orlando, FL” to “Epcot” in the search backend, our geolocation system was able to narrow down the search geobox to the relevant latitude/longitude.
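
For illustration, a simplified sketch of that rewrite rule is shown below; the confidence threshold, the injected geocode and distance helpers, and the Segment structure are assumptions, not our production geolocation logic.

def maybe_rewrite_location(segments, original_location, location_confidence,
                           geocode, distance_miles,
                           confidence_threshold=0.9, max_distance_miles=30.0):
    """Rewrite the search location to the query's {location} segment when we are
    confident about it and it stays close to the user's original search area."""
    location_text = " ".join(s.text for s in segments if s.tag == "location")
    if not location_text or location_confidence < confidence_threshold:
        return original_location
    candidate = geocode(location_text)     # e.g. "epcot" -> Epcot, Bay Lake, FL
    original = geocode(original_location)  # e.g. "Orlando, FL"
    if candidate and distance_miles(candidate, original) <= max_distance_miles:
        return candidate  # only the search backend sees the rewritten location
    return original_location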

Review Highlights

Offline evaluation of the quality of generated phrases is subjective and requires very strong human annotators with good product, qualitative, and engineering understanding. We cross-checked the opinions of human annotators and also applied quantitative checks on the snippets, such as measuring how common the generated phrases are. After a thorough review, we carried out online A/B experiments using the new highlight phrases.

[Figure 3] A/B evaluations of the POC show high impact for snippets. Online evaluations showed that a better understanding of users’ intent can lead to impactful metric wins. By highlighting the phrase relevant to the user’s query, we increased Session / Search CTR across our platforms. Further iteration from GPT-3 to GPT-4 also improved Search CTR on top of previous gains. The impact was also higher for less common queries in the tail query range. A large portion of the wins came from incremental quality improvements as we addressed all of the nuances listed above for the task.

Scaling Up

If an online experiment for the proof of concept indicates a meaningful positive impact, it’s time to improve the model and expand its utilization to a larger volume of queries. However, scaling to millions of queries (or to a real-time model, in order to support never-before-seen queries) poses cost and infrastructure challenges. For example, we’re building out new signal datastores to support larger pre-computed signals. And though we’d love for next-gen technology to work on all searches, the investment can get harder to justify. In particular, given the power-law distribution of queries, scaling up to millions of queries may require a disproportionately high investment to achieve only a marginal increase in traffic coverage.

Furthermore, as queries get further into the long tail, understanding user intent also becomes more challenging. So to scale up effectively, we need a more precise model that is also more cost-effective. At the moment, we’ve landed on a multi-step process for scaling from the prototype stage to a model that serves 100% of traffic:

  1. Iterate on the prompt using the “expensive” model (GPT-4/o1). This mainly involves testing the prompt against real or contrived example inputs, looking for errors that could be teachable moments, and then augmenting the examples in the prompt. One approach we used to narrow down our search for problematic responses was tracking query-level metrics to find queries that have nontrivial traffic and whose metrics are clearly worse than the status quo.

  2. Create a golden dataset for fine tuning smaller models. We ran the GPT-4 prompt on a representative sample of input queries. The sample size should be large (but not unmanageably so, since quality > quantity) and it should cover a diverse distribution of inputs. For newer and more complex tasks that require logical reasoning, we have begun using o1-mini and o1-preview in some use cases, depending on the difficulty of the task.

  3. Improve the quality of the dataset if possible, prior to using it for fine tuning. With hard work here, it can be possible (for many tasks) to improve upon GPT-4’s raw output. Try to isolate sets of inputs that are likely to have been mislabeled and target these for human re-labeling or removal.

  4. Fine-tune a smaller model (GPT-4o mini) that we can run offline at the scale of tens of millions of queries, and utilize this as a pre-computed cache to support the vast bulk of all traffic (a sketch of the fine-tuning data format follows this list). Because fine-tuned query understanding models only require very short inputs and outputs, we have seen up to a 100x savings in cost compared to using a complex GPT-4 prompt directly.

  5. Optionally, fine-tune an even smaller model that is cheaper and faster, to run in real time only for long-tail queries. Specifically, at Yelp we have used BERT and T5 to serve as our real-time models. These models are optimized for speed and efficiency, allowing us to process user queries rapidly and accurately during the complete rollout phase. As the cost and latency of LLMs improve, as seen with GPT-4o mini and smaller prompts, real-time calls to OpenAI’s fine-tuned models may also become achievable in the near future.
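
As a minimal sketch of what the fine-tuning data for step 4 could look like (the file name, system prompt, and examples are illustrative; OpenAI’s chat fine-tuning expects one JSON object per line with a messages array):

import json

SYSTEM_PROMPT = "Segment the Yelp search query into labeled parts."  # illustrative

def write_finetune_dataset(examples, path="segmentation_finetune.jsonl"):
    """Convert (query, labeled_output) pairs from the golden dataset into
    OpenAI chat-format JSONL for fine-tuning a smaller model."""
    with open(path, "w") as f:
        for query, labeled_output in examples:
            record = {
                "messages": [
                    {"role": "system", "content": SYSTEM_PROMPT},
                    {"role": "user", "content": query},
                    {"role": "assistant", "content": labeled_output},
                ]
            }
            f.write(json.dumps(record) + "\n")

write_finetune_dataset([
    ("chicago riverwalk hotels",
     "{location} chicago riverwalk {topic} hotels"),
    ("healthy fod near me",
     "{topic} healthy food {location} near me [spell corrected - high]"),
])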

Review Highlights

After fine-tuning our model and validating the responses on a diverse and random test set, we scaled to 95% of traffic by pre-computing snippet expansions for those queries using OpenAI’s batch calls. The generated outputs were quality-checked and uploaded to our query understanding datastores. Cache-based systems such as key/value DBs were used to improve retrieval latency, given the power-law distribution of search queries. With this pre-computed signal in place, we further leveraged the “common sense” knowledge embedded in the generated phrases for other downstream tasks. For instance, we used CTR signals for the relevant expanded phrases to further refine our ranking models, and additionally used the phrases (averaged over business categories) as heuristics to get highlight phrases for the remaining 5% of traffic not covered by our pre-computations, as sketched below.
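
As an illustration of that category-averaged fallback (the data structures and aggregation here are simplified assumptions; the production heuristic may differ):

from collections import defaultdict

def build_category_phrase_index(cached_expansions, query_categories):
    """Aggregate pre-computed highlight phrases by business category, so that
    uncached (tail) queries can reuse phrases from their predicted categories."""
    phrases_by_category = defaultdict(list)
    for query, phrases in cached_expansions.items():
        for category in query_categories.get(query, []):
            phrases_by_category[category].extend(phrases)
    return phrases_by_category

def fallback_phrases(predicted_categories, phrases_by_category, top_k=10):
    """Heuristic highlight phrases for a query not covered by the pre-computed cache."""
    counts = defaultdict(int)
    for category in predicted_categories:
        for phrase in phrases_by_category.get(category, []):
            counts[phrase] += 1
    return sorted(counts, key=counts.get, reverse=True)[:top_k]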

[Figure 4] Example of tail query in our review highlights system. For the query “dinner before a broadway show,” the model outputs a creative list of phrases that can be used to match relevant and interesting phrases within real reviews written by real users. This not only enhances user trust by aligning with their search intentions, but also enables quick decision-making by allowing users to easily assess the experiences of others and determine the business that can meet their needs.

Future Work

The deeper integration of LLMs into our search systems holds great potential for transforming user search experiences. As the landscape of LLMs evolves, we continue to adapt to new capabilities, which can unlock new ways to use our content. For some search tasks that require complex logical reasoning, we’re starting to see large benefits in the quality of outputs generated by the latest reasoning models compared to the previous generative models. As we aim to develop even more advanced use cases, considering the trade-offs, we will continue to follow a multi-step validation and gradual scaling strategy. By staying agile and responsive to these new advancements, we can better showcase and highlight the best and most authentic content from our data, enhancing the overall user experience within the app.

Conclusion

LLMs hold immense potential for transforming user search experiences. To realize these possibilities, a strategic approach involving ideation, proof-of-concept testing, and full-scale production rollout is essential. This requires continuous iteration and adaptability to new advances in foundation models, as some query understanding tasks may require more complex logical reasoning while others require a deeper knowledge base. Thus far, Yelp has successfully leveraged LLMs and our depth of content to improve the user experience and bring greater value to our business. We remain committed to staying at the forefront of LLM advancements and rapidly adapting these innovations to our use cases.

For more information on this topic, check out our more detailed talks at Haystack this year [3, 4].

Acknowledgement

The authors would like to acknowledge the Search Quality team for their exceptional contributions to this initiative, especially Cem Aksoy, Akshat Gupta, Alexander Levin, Brian Johnson, Arthur Cruz de Araujo, and Ashwani Braj. This blog reflects the collaborative effort and technical expertise that each member has brought to the table. Your dedication and innovative approach have been crucial in advancing our engineering goals. Thank you!

Footnotes
