RAG Recommendations That Actually Work
Building a retrieval-augmented recommendation engine over 10,000+ listings with OpenAI embeddings and Pinecone — what to optimize and what to skip.
Retrieval-Augmented Generation gets pitched as a chatbot trick. It's actually a great recommendation primitive. Here's the shape of an engine I built over 10,000+ service listings.
Embed once, query forever
Generate an embedding for every listing once with text-embedding-3-small and store the vectors in Pinecone. At query time you embed only the user's query, so fuzzy semantic search becomes a fast nearest-neighbor lookup.
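Conceptually, that lookup ranks stored vectors by their similarity to the query vector. Here is a brute-force sketch of the idea using cosine similarity. The listing ids and 3-dimensional vectors are made up for illustration, and Pinecone answers the same question with an approximate index rather than a full scan like this one.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: list[float], store: dict[str, list[float]], k: int) -> list[str]:
    """Return the ids of the k stored vectors most similar to the query.

    This scans every vector (O(n) per query); a vector database trades a
    little exactness for speed by using an approximate index instead.
    """
    ranked = sorted(store, key=lambda item_id: cosine(query, store[item_id]), reverse=True)
    return ranked[:k]


# Toy 3-dimensional "embeddings"; text-embedding-3-small vectors have 1536 dimensions.
store = {
    "plumber-1": [0.9, 0.1, 0.0],
    "electrician-7": [0.1, 0.9, 0.1],
    "plumber-2": [0.8, 0.2, 0.1],
}
print(top_k([1.0, 0.0, 0.0], store, k=2))  # ['plumber-1', 'plumber-2']
```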
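```python
from openai import OpenAI
from pinecone import Pinecone
client = OpenAI()  # reads OPENAI_API_KEY from the environment
index = Pinecone().Index("listings")  # placeholder index name; reads PINECONE_API_KEY
# listing_id, listing_text and category come from the listing being indexed
vector = client.embeddings.create(
    model="text-embedding-3-small",
    input=listing_text,
).data[0].embedding
index.upsert([(listing_id, vector, {"category": category})])
```
Context beats cleverness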
The biggest accuracy gains didn't come from a fancier model — they came from what I embedded. Concatenating the title, description, category, and a few structured attributes into one normalized string outperformed any prompt tweak.
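One way to build that single string is sketched below. The field names (`title`, `description`, `category`, `attributes`), the `key: value` layout joined with ` | `, and the normalization steps (NFKC, whitespace collapsing, lowercasing) are assumptions for illustration, not the exact format used in this engine.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Unicode-normalize (NFKC), collapse runs of whitespace, and lowercase."""
    text = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", text).strip().lower()


def listing_to_text(listing: dict) -> str:
    """Flatten one listing into a single normalized string for embedding.

    Field names are hypothetical; adapt them to your own schema.
    """
    parts = [
        f"title: {listing.get('title', '')}",
        f"description: {listing.get('description', '')}",
        f"category: {listing.get('category', '')}",
    ]
    # Sort attributes so the same listing always yields the same string.
    for key, value in sorted(listing.get("attributes", {}).items()):
        parts.append(f"{key}: {value}")
    return normalize(" | ".join(parts))


listing = {
    "title": "Emergency  Plumber",
    "description": "24/7 leak repair\nand drain cleaning.",
    "category": "Plumbing",
    "attributes": {"service_area": "Austin, TX", "licensed": True},
}
print(listing_to_text(listing))
```

Sorting the attributes matters more than it looks: if the same listing can serialize in two orders, re-embedding it produces a slightly different vector for no semantic reason.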
Filter in the vector store, not after
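Pinecone metadata filters let you constrain results by category, location, or availability during retrieval. Filtering after the fact wastes recall and latency: if you fetch the top 10 and then discard listings outside the requested category, you may be left with only a few matches or none, and fetching more candidates to compensate costs extra time.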
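In Pinecone the filter travels with the query itself, for example `index.query(vector=query_vector, top_k=10, filter={"category": {"$eq": "plumbing"}}, include_metadata=True)`. The toy comparison below (made-up ids, 2-dimensional vectors, and a hypothetical `matches` helper supporting only equality) shows why the order of operations matters: with the same `k`, filtering first returns two relevant listings while filtering afterwards returns none.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def matches(meta: dict, where: dict) -> bool:
    """True if every field in `where` equals the listing's metadata value."""
    return all(meta.get(field) == value for field, value in where.items())


def filter_then_rank(query: list[float], items: list[dict], k: int, where: dict) -> list[str]:
    """Filter during retrieval: only matching listings compete for the top k."""
    pool = [it for it in items if matches(it["meta"], where)]
    pool.sort(key=lambda it: cosine(query, it["vec"]), reverse=True)
    return [it["id"] for it in pool[:k]]


def rank_then_filter(query: list[float], items: list[dict], k: int, where: dict) -> list[str]:
    """Filter after retrieval: take the global top k, then drop non-matches."""
    top = sorted(items, key=lambda it: cosine(query, it["vec"]), reverse=True)[:k]
    return [it["id"] for it in top if matches(it["meta"], where)]


items = [
    {"id": "a", "vec": [1.0, 0.0], "meta": {"category": "cleaning"}},
    {"id": "b", "vec": [0.95, 0.05], "meta": {"category": "cleaning"}},
    {"id": "c", "vec": [0.9, 0.1], "meta": {"category": "plumbing"}},
    {"id": "d", "vec": [0.5, 0.5], "meta": {"category": "plumbing"}},
]
where = {"category": "plumbing"}
print(filter_then_rank([1.0, 0.0], items, k=2, where=where))  # ['c', 'd']
print(rank_then_filter([1.0, 0.0], items, k=2, where=where))  # []
```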
Keep the LLM for the last mile
Use vector similarity to get the top-N candidates, then let the LLM re-rank or explain — not search from scratch. Cheaper, faster, and far easier to evaluate.
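A minimal sketch of that last mile, assuming the shortlist has already come back from vector search. `ask_llm` is a hypothetical hook for whatever chat-completion call you use, the prompt wording is illustrative, and a stub stands in for the model so the example runs offline.

```python
from typing import Callable


def build_prompt(request: str, candidates: list[dict]) -> str:
    """Show the LLM only the vector-search shortlist, one listing per line."""
    listing_lines = "\n".join(f"[{c['id']}] {c['text']}" for c in candidates)
    return (
        f"User request: {request}\n"
        "Rank these listings from most to least relevant. "
        "Reply with listing ids only, one per line.\n"
        f"{listing_lines}"
    )


def rerank(request: str, candidates: list[dict], ask_llm: Callable[[str], str]) -> list[str]:
    """Re-order vector-search candidates using an LLM's ranking.

    Ids the model invents are dropped, duplicates are ignored, and candidates
    it omits keep their original vector-search order at the end.
    """
    allowed = [c["id"] for c in candidates]
    reply = ask_llm(build_prompt(request, candidates))
    ranked: list[str] = []
    for line in reply.splitlines():
        item_id = line.strip().strip("[]")
        if item_id in allowed and item_id not in ranked:
            ranked.append(item_id)
    return ranked + [i for i in allowed if i not in ranked]


candidates = [
    {"id": "p1", "text": "Emergency plumber, 24/7"},
    {"id": "p2", "text": "Drain cleaning, weekdays only"},
    {"id": "p3", "text": "Water heater installation"},
]


def fake_llm(prompt: str) -> str:
    """Stub model reply: includes one hallucinated id ("x9") to show it is ignored."""
    return "p2\nx9\np1"


print(rerank("clogged kitchen sink", candidates, fake_llm))  # ['p2', 'p1', 'p3']
```

Constraining the reply to ids from the shortlist is what keeps this stage easy to evaluate: the output is always an ordering of known candidates, so it can be scored against labeled relevance data like any other ranker.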
RAG isn't magic. It's a good index plus disciplined inputs. Get those right and the rest is tuning.