Embedding AI Features: The Messy Middle Between POC and MVP
This article is a summary of my experience as a PM designing an artificial intelligence (AI)-based capability for an existing product, from the proof-of-concept (POC) stage to a minimum viable product (MVP).
When you are adding an AI feature to an existing product, you can either:
- introduce entirely new functionality, or
- extend existing functionality with AI capabilities.
In my case, it is the latter: we aim to introduce AI search to enhance the user experience. It sounds pretty trivial, I acknowledge that. The challenge is that search is the critical functionality of this product.
Therefore, the new search method has to provide users with more value than the existing one. If AI search works only "just good enough" next to the traditional one, it will be considered a failure.
The improvement must be substantial, because of …
The Price
Product managers who have worked on products built primarily on third-party integrations already treat API costs for services like Google Maps and other providers as a key variable. Now, with the advent of AI, the rest of the PM world is becoming familiar with this concept.
An existing, mature functionality has a low maintenance cost. Introducing a new, AI-powered way of using it adds a permanent cost for accessing Large Language Models (LLMs, further referred to simply as the Model). Whether your organization runs its own infrastructure or uses an external provider such as AWS or Azure, there is still a bill to pay.
In my case, I can't charge users an additional fee to cover AI costs. The only justification would be that users become more efficient with AI search, thereby increasing productivity and decreasing their time-to-market with our core system.
That means the value delivered by AI must exceed the costs. In my case, value is difficult to measure, especially at the beginning. My job is to keep AI costs as low as possible while still delivering enough value to justify them and to outperform the existing functionality.
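To make that trade-off concrete, here is a rough back-of-the-envelope calculation of what a single AI search request could cost. All token counts, prices, and volumes below are purely illustrative assumptions, not our real numbers; the code is in Java only because that is our backend stack.

```java
// Back-of-the-envelope cost model for an LLM-backed search feature.
// All figures below are illustrative assumptions, not real prices or volumes.
public class AiSearchCostEstimate {

    // Hypothetical per-token prices (USD per 1M tokens) for an older, cheaper model.
    static final double INPUT_PRICE_PER_M = 0.50;
    static final double OUTPUT_PRICE_PER_M = 1.50;

    public static void main(String[] args) {
        long inputTokensPerQuery = 2_000;   // system prompt + context + user query (assumed)
        long outputTokensPerQuery = 300;    // model answer (assumed)
        long queriesPerMonth = 100_000;     // assumed usage volume

        double costPerQuery =
                inputTokensPerQuery / 1_000_000.0 * INPUT_PRICE_PER_M
              + outputTokensPerQuery / 1_000_000.0 * OUTPUT_PRICE_PER_M;

        double costPerMonth = costPerQuery * queriesPerMonth;

        System.out.printf("Cost per query: $%.5f%n", costPerQuery);
        System.out.printf("Cost per month: $%.2f%n", costPerMonth);
        // The question for the PM: does the productivity gain per query exceed this number?
    }
}
```

The biggest lever in that little formula is the per-token price, which depends on the model you pick.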
That leads us…
Model Selection
The irony here is that you need to select a model first, without knowing whether it is the best or the most cost-efficient option. Relying on benchmarks helps to make the first choice.
There is a natural inclination to consider open-source LLMs. They make it easy to kick off a POC and follow up with an MVP.
A question you should keep in mind: "How do we take that Model to production?" Running on-premise infrastructure within your organization requires not only purchase and maintenance budgets, but also the expertise to support and scale it effectively. A cloud provider hosting an OSS model might be a better choice for the start.
Paid LLMs are generally better than their OSS counterparts, even though the gap is closing. A year ago, Llama 3.1 did not perform well with structured data such as JSON (although text from PDFs was fine). Newer releases have improved considerably.
Also, a year ago, the context window (i.e., how many tokens the model can consume) was a concern. Now it is less of a problem, with 200K tokens or more available in popular models.
So my strategy is to go with a paid LLM, but an older version, from a cloud provider. I get a cheaper API price along with good performance and stability. However, I am constantly looking for lower-cost alternatives.
LLMs keep getting cheaper, and their older versions become increasingly affordable. I don't believe it will last forever; prices will eventually rise. The task is to avoid vendor lock-in and stay flexible enough to switch models easily.
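One way to keep that flexibility is to hide the provider behind your own thin interface, so swapping models becomes a wiring or configuration change rather than a rewrite. A minimal sketch, assuming a hypothetical ChatCompletionClient of our own (Spring AI's abstractions play a similar role, but the snippet stays deliberately generic):

```java
// A thin, provider-agnostic abstraction so the rest of the code never depends
// on a specific vendor SDK. Names below are hypothetical, not a real library.
public interface ChatCompletionClient {
    String complete(String systemPrompt, String userMessage);
}

// One implementation per provider/model; swapping is a wiring/config change.
class OpenAiCompatibleClient implements ChatCompletionClient {
    private final String baseUrl;
    private final String apiKey;
    private final String model;

    OpenAiCompatibleClient(String baseUrl, String apiKey, String model) {
        this.baseUrl = baseUrl;
        this.apiKey = apiKey;
        this.model = model;
    }

    @Override
    public String complete(String systemPrompt, String userMessage) {
        // Call the provider's HTTP API here (omitted); only this class knows the details.
        throw new UnsupportedOperationException("provider call omitted in this sketch");
    }
}

// Business code depends only on the interface, never on the vendor.
class AiSearchService {
    private final ChatCompletionClient llm;

    AiSearchService(ChatCompletionClient llm) {
        this.llm = llm;
    }

    String search(String query) {
        return llm.complete("You are a search assistant for our product.", query);
    }
}
```

An interface alone only removes the mechanical friction; knowing whether the new, cheaper model is actually good enough is a different problem.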
To achieve that, you need…
Testing
The main lesson I learned from the last six months: you need to think about testing long before you deliver the first pieces of AI functionality. We built the initial capabilities first and only then thought about how we were going to test them.
After a few attempts, I vibe-coded a thing I called the "evaluation framework," which let me run a conversation with an LLM and then score its responses with another LLM against a somewhat subjective metric. It was a clunky solution, later re-vibe-coded into something more solid.
The team also asked me not to call it a framework, but I insisted that "framework" is a good name for something you are not sure what to call. Just call any unknown and doubtful thing a "framework" and people will love it.
Now we have a tool that lets us run test cases across various models and evaluate their responses. I wish we had had such a solution from the start; it would have saved us a lot of time on prompt engineering and model switching.
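For the curious, the core of such an evaluation loop is small: run each test case against the model under test, then ask a second model to score the answer against an expectation. This is a minimal sketch of the pattern, not our actual tool; ChatCompletionClient is the same hypothetical interface from the earlier sketch.

```java
import java.util.List;

// Minimal LLM-as-judge evaluation loop: the model under test answers,
// a second model scores the answer against an expectation. A sketch, not our real tool.
class EvaluationCase {
    final String userPrompt;
    final String expectation; // what a good answer should contain, in plain words

    EvaluationCase(String userPrompt, String expectation) {
        this.userPrompt = userPrompt;
        this.expectation = expectation;
    }
}

class EvaluationRunner {
    private final ChatCompletionClient candidate; // model under test
    private final ChatCompletionClient judge;     // scoring model

    EvaluationRunner(ChatCompletionClient candidate, ChatCompletionClient judge) {
        this.candidate = candidate;
        this.judge = judge;
    }

    void run(List<EvaluationCase> cases) {
        for (EvaluationCase c : cases) {
            String answer = candidate.complete("You are our AI search assistant.", c.userPrompt);

            String verdict = judge.complete(
                "You are a strict evaluator. Score the answer from 1 to 5 against the expectation "
                    + "and reply with the score and one sentence of reasoning.",
                "Question: " + c.userPrompt
                    + "\nExpectation: " + c.expectation
                    + "\nAnswer: " + answer);

            System.out.println(c.userPrompt + " -> " + verdict);
        }
    }
}
```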
You might wonder why we are building our own…
Tooling
Where there is gold digging, there are people selling shovels, and sometimes a shovel costs as much as the gold. There is so much AI tooling out there that keeping up with every tool on the market would be another full-time job.
We proceed with building our tools because:
- Free tools are mostly (raw) shit
- We weren't sure the paid tools weren't just expensive shit
- In an enterprise, it takes time to onboard a new vendor. And we had a tight timeline.
- We can vibecode what we need pretty quickly.
We have built the previously mentioned evaluation framework and an admin tool to manage prompts and configure models for various workflows. These are internal tools we are not going to share outside the team. This is an excellent case where vibecoding helps us increase productivity in line with our specific requirements and vision.
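To give a flavor of what such an admin tool manages, the configuration boils down to something like the structure below: which model, which parameters, and which prompt version each workflow step uses. The shape and field names are illustrative, not our actual schema.

```java
// Illustrative shape of what an admin tool can manage per workflow step:
// which model, which parameters, and which prompt version to use. Not our actual schema.
record WorkflowStepConfig(
        String workflow,       // e.g. "ai-search"
        String step,           // e.g. "query-understanding"
        String provider,       // e.g. "azure-openai" (hypothetical identifier)
        String model,          // e.g. an older, cheaper model version
        double temperature,
        int maxOutputTokens,
        String promptId,       // reference to a versioned prompt stored by the admin tool
        int promptVersion
) {}
```

The idea is that keeping this data outside the code turns model switching and prompt tweaks into configuration changes rather than releases.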
That all started with a question about how we would manage…
Prompts
Prompt engineering (and the broader term, "context engineering") has been a game-changer in how we approach the development process. We have more or less separated good old development from writing prompts.
This is how our team dynamics work:
- 1 Dev working on the UI
- 1 Dev building the backend around the LLM
- 1 BA working on requirements and prompts
- 1 QA testing the outcome
- 1 me as PM doing various stuff (that is a topic for another article)
It wasn't possible for us to fix everything with prompts, so we developed logic to "guide" our workflow through various tools and sub-agents. We added writing prompts to our Definition of Done. That caused multiple situations where development had been ready for some time, but we were stuck writing "the right" prompt to proceed with testing.
Sometimes changes in prompts broke already working test cases. Prompts tend to grow bigger and become unmanageable. Many times we were stuck wondering why things stopped working after a change in another, seemingly unrelated part of the prompt. We went back and forth, trying various techniques, approaches, models, and even architectures.
That made estimation difficult. At first, we stopped providing estimates in story points altogether, and only after some time did development get going. We were learning Spring AI, how to build a chatbot UX from scratch, how to write prompts, how to test, and how to manage all of that.
We ended up estimating only the "dev" part, excluding the "prompt" part from estimation entirely but time-boxing it within a sprint.
The precisely defined scope of the MVP helped us stay on track. We were confident we could deliver it, and our approach, with a relaxed process, paid off.
We ended up with "divide and conquer": many small, precise prompts tied to individual steps of the workflow, plus a few tools; a rough sketch of that structure follows below.
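Here is roughly what that decomposition can look like, with each workflow step owning its own small prompt instead of one giant system prompt. The step names and prompt texts are hypothetical, and ChatCompletionClient is the same made-up interface as before.

```java
import java.util.List;
import java.util.Map;

// "Divide and conquer": each workflow step owns its own small, focused prompt
// instead of one giant system prompt. Step names and prompt texts are hypothetical.
class PromptedWorkflow {
    private final ChatCompletionClient llm;

    // One concise prompt per step; changing one step no longer risks breaking the others.
    private final Map<String, String> stepPrompts = Map.of(
        "query-understanding", "Extract the search intent and key filters from the user query.",
        "retrieval-ranking",   "Given the intent and candidate records, pick the 5 most relevant.",
        "answer-formatting",   "Summarize the selected records for the user in two short paragraphs."
    );

    PromptedWorkflow(ChatCompletionClient llm) {
        this.llm = llm;
    }

    String run(String userQuery, List<String> candidateRecords) {
        String intent = llm.complete(stepPrompts.get("query-understanding"), userQuery);
        String ranked = llm.complete(stepPrompts.get("retrieval-ranking"),
                "Intent: " + intent + "\nCandidates: " + String.join("\n", candidateRecords));
        return llm.complete(stepPrompts.get("answer-formatting"), ranked);
    }
}
```

From the outside, such prompt management might look complicated, but it gives us at least a glimpse of control over the…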
Black box
At the moment of greatest despair, remember that an LLM is a non-deterministic thing that is never fully under your control. That is another game-changer all team members need to adapt to.
There are good days when the selected LLM performs well, and there are bad days. The bad days are usually the ones when you need to demo your AI feature.
The outcome is not always consistent. That is the price we pay. There are techniques to achieve more consistent results, but they increase the cost, because you need to call the LLM several times and compare the results (probably using yet another LLM).
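For illustration, one of the simpler consistency techniques looks like this: ask the same question several times and keep the answer that appears most often (a variant compares candidates with a judge model instead). Again a sketch on top of the hypothetical client interface, and note how every extra sample multiplies the cost.

```java
import java.util.HashMap;
import java.util.Map;

// Simple self-consistency: ask the same question several times and keep the
// most frequent answer. Every extra sample multiplies the per-request cost.
class SelfConsistentClient {
    private final ChatCompletionClient delegate;
    private final int samples;

    SelfConsistentClient(ChatCompletionClient delegate, int samples) {
        this.delegate = delegate;
        this.samples = samples;
    }

    String complete(String systemPrompt, String userMessage) {
        Map<String, Integer> votes = new HashMap<>();
        for (int i = 0; i < samples; i++) {
            // Trimming makes exact-match voting slightly less brittle.
            String answer = delegate.complete(systemPrompt, userMessage).trim();
            votes.merge(answer, 1, Integer::sum);
        }
        return votes.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElseThrow();
    }
}
```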
Sometimes that black box surprises you with use cases you did not initially anticipate. Chasing them might sound like classic "scope creep," and it is all the more tempting because the price of an undocumented feature is just a few new lines in the system prompt.
This is where a product vision is required to keep everything on track.
And having the right data in your context matters most. While we were developing our AI search, "prompt engineering" alone came to be seen as no longer sufficient, and the newly emerging "context engineering" drew attention.
That is a whole topic in itself. My point here is that you need to prepare (preprocess) the structured data you want to use, to reduce the noise that could mislead the LLM.
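A concrete (and simplified) example of what that preprocessing can mean: before a record goes into the context, strip the fields the model does not need so they cannot mislead it or waste tokens. The field names here are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Before a record goes into the LLM context, keep only the fields that matter
// for search and drop internal/noisy ones. Field names are hypothetical.
class ContextPreprocessor {

    private static final List<String> RELEVANT_FIELDS =
            List.of("id", "title", "description", "category", "status");

    Map<String, Object> toContextRecord(Map<String, Object> rawRecord) {
        Map<String, Object> cleaned = new LinkedHashMap<>();
        for (String field : RELEVANT_FIELDS) {
            Object value = rawRecord.get(field);
            if (value != null && !String.valueOf(value).isBlank()) {
                cleaned.put(field, value); // skip nulls and blanks: less noise, fewer tokens
            }
        }
        // Audit fields, internal flags, and verbose metadata never reach the prompt.
        return cleaned;
    }
}
```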
Summary
Here is a summary of what I am trying to share with you:
- A cheaper but good-enough LLM is the better long-term choice.
- Be ready to change LLMs frequently.
- Start thinking about testing from the start.
- Relax the processes while you are learning.
- Context is king: it is still the same good old "garbage in, garbage out".
- Don't invent your own chatbot UX. I didn't mention it before, but beware: use existing UI frameworks. Designing chatbots is not as straightforward as it may seem.