Vitoria Lima

Agentic Workflows: an agent's recipe

Let's cook

NYC, September 4th 2025

Breakfast Burritos
Halibut with Coconut Sauce
Carbonara

Ingredients:

The task: What is the purpose of this?

Use case discovery

Every recipe starts with knowing what dish you're making. In agentic workflows, that means defining the purpose. Are you trying to automate classification? Build a decision system? Or just streamline a recurring task? Until you know the fundamental "why," the "how" won't stick.

Think of this as mise en place: you can't automate what you don't understand. If you can't do the task manually—or at least explain how it's done—you'll struggle to delegate it to an agent. Start with use case discovery, and every other choice in your workflow will follow naturally.

Prompt engineering

A prompt is not just an instruction—it's the DNA of your agent. You have two main strands: the system prompt, which sets the agent's role, guardrails, and standing instructions, and the user (task) prompt, which carries the specific request for each run.

Together, they frame how the agent interprets and executes tasks. A weak system prompt leaves your agent rudderless; a well-crafted one turns it into a sous-chef who anticipates your next move.
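As a rough sketch using the OpenAI Python SDK (the model name, persona, and task text here are placeholders, not a recommendation), the two strands map directly onto the system and user messages:

from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",  # placeholder model choice
    messages=[
        # System prompt: the agent's standing role and guardrails
        {"role": "system", "content": "You are a contracts analyst. Answer only from the provided text."},
        # User prompt: the specific task for this run
        {"role": "user", "content": "Summarize the termination clauses in the attached contract."},
    ],
)
print(response.choices[0].message.content)

Everything downstream inherits from these two messages, which is why a sloppy system prompt shows up as sloppy behavior everywhere else.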

Context engineering

Info Retrieval ≠ Info Extraction

This is where many practitioners slip. Retrieving information is not the same as extracting it.

Information retrieval answers: "Which section of this long text is relevant to my question?" Think of scanning 1984 to find only the passages about the symbolism of rats. You don't need the whole book, just the relevant slices. Retrieval thrives on chunking, cosine similarity, and RAG pipelines. There's a lot of literature on this: chunking overlap, vector search, knowledge graphs—you name it. But notice that retrieval always assumes the same thing: the text is already in a form you can query, and the only real problem is which part is relevant.
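To make the retrieval half concrete, here's a minimal sketch of chunking with overlap plus cosine similarity. It uses TF-IDF vectors purely to stay self-contained; in a real pipeline you'd swap in an embedding model and a vector store, and the chunk sizes here are arbitrary.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def chunk(text, size=500, overlap=100):
    # Overlapping character windows so relevant passages aren't split cleanly in half
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def retrieve(query, chunks, top_k=3):
    # Vectorize chunks and the query in the same space, then rank by cosine similarity
    vectorizer = TfidfVectorizer().fit(chunks)
    chunk_vecs = vectorizer.transform(chunks)
    query_vec = vectorizer.transform([query])
    scores = cosine_similarity(query_vec, chunk_vecs)[0]
    ranked = sorted(zip(scores, chunks), key=lambda pair: pair[0], reverse=True)
    return [c for _, c in ranked[:top_k]]

# e.g. retrieve("symbolism of rats", chunk(full_text_of_1984))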

Context Engineering Example

Information extraction, however, answers a completely different question: "What is inside this file, PDF, or document in the first place?"

And here's the kicker: extraction is much harder. Look at the veggie-text images I included above. The phrase "veggies es bonus" is supposed to continue with "vobis," but depending on how the PDF text is laid out, you might accidentally pull in "gumbo beet greens turnip greens" or "shallot courgette latke endive cauliflower sea." What looks obvious to a human reader is anything but obvious to a machine.

Naively calling something like:

import fitz  # PyMuPDF
text = "".join(page.get_text() for page in fitz.open("my.pdf"))

won't guarantee the semantic string you get back is correct. It won't guarantee pagination, sentence flow, or logical grouping. And if the layout is column-based or table-structured, your extraction might scramble everything—just like the veggie examples show. Sometimes you end up with the wrong column stitched in, or fragments cut off mid-sentence.
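One mitigation, sketched below with PyMuPDF's block-level output, is to pull positioned text blocks and reorder them yourself instead of trusting the default reading order. The column heuristic here (bucketing by x-coordinate) is my own crude illustration, not a general solution, but it shows why extraction becomes a real engineering problem.

import fitz  # PyMuPDF

def extract_blocks(path):
    # Each block is (x0, y0, x1, y1, text, block_no, block_type); sort by
    # column position first, then vertical position, so a two-column page
    # doesn't interleave left and right text line by line.
    for page in fitz.open(path):
        blocks = [b for b in page.get_text("blocks") if b[6] == 0]  # keep text blocks only
        blocks.sort(key=lambda b: (round(b[0] / 100), b[1]))
        yield "\n".join(b[4].strip() for b in blocks)

# e.g. pages = list(extract_blocks("my.pdf"))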

Extraction hurdles go way beyond pagination. You're dealing with multi-column layouts, tables, headers and footers, footnotes, and scanned pages where the text isn't even text yet.

The problem is hard enough that entire companies, like Structify and Extend, exist just to make it tractable.

So here's the newsflash: retrieval alone is not enough. If you're working with structured but messy data—like PDFs, contracts, regulatory filings—you will need serious investment in extraction. For agentic workflows or RAG systems, the formula is simple but uncompromising: you need both retrieval and extraction. It's not an either/or. And to make sure it actually works, you need proper evals—otherwise you're just hoping your extraction aligns with reality.

Model choice

Choosing a model isn't just about picking the "biggest brain." It's about trade-offs between latency, cost, and depth of reasoning.

If latency matters, you might reach for a fast 4o; if reasoning is king, maybe o3 with a higher reasoning config. It's an architectural choice: do you want shallow, parallelized subtasks that finish quickly, or a single deep reasoning thread that unpacks every detail? The rabbit hole can be as deep—or as wide—as you decide.
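Here's a sketch of what that choice can look like in code, assuming a simple router in front of your agent. The model names, field names, and thresholds are illustrative stand-ins, not a recommendation.

def pick_model(task):
    # Multi-step analysis goes to a reasoning model with a bigger thinking budget;
    # latency-sensitive, shallow tasks go to a fast general model.
    if task.get("needs_deep_reasoning"):
        return {"model": "o3", "reasoning_effort": "high"}
    if task.get("latency_budget_ms", 10_000) < 2_000:
        return {"model": "gpt-4o-mini"}
    return {"model": "gpt-4o"}

# e.g. pick_model({"needs_deep_reasoning": True})
# -> {"model": "o3", "reasoning_effort": "high"}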

Output design

Once your agent cooks something up, how do you plate it? Free-form prose, structured JSON, a function call, a row in a database?

The shape of your output is not cosmetic—it's what determines whether your agent integrates smoothly or clogs the pipeline.
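One way to pin that shape down, sketched here with Pydantic's v2 API (the schema fields are hypothetical), is to validate whatever the agent returns before it touches the rest of the pipeline:

from pydantic import BaseModel, ValidationError

class TicketDecision(BaseModel):
    # Hypothetical output schema for a classification agent
    category: str
    confidence: float
    escalate: bool

def parse_agent_output(raw_json: str) -> TicketDecision | None:
    try:
        return TicketDecision.model_validate_json(raw_json)
    except ValidationError:
        # Malformed output never reaches downstream systems
        return None

A schema like this doubles as documentation: anyone wiring your agent into their system knows exactly what to expect.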

Eval

This is the piece most people try to skip—and it's the piece you can't live without.

Before you celebrate your workflow, you need to test it. That means building a ground-truth eval set, running your workflow against it, and measuring where it falls short.

Without this loop, you're guessing. With it, you're iterating. As I like to say, "Bestie, if you want to slay, please do the following": don't just eyeball the vibes—build a proper eval set, run it, and close the feedback loop. Otherwise, your agent is cooking blind.
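Here's the minimal version of that loop, as a sketch: a handful of labeled examples, one pass over the workflow, and a score you can track between iterations. The example cases are made up, and run_agent is a placeholder for whatever your workflow actually calls.

eval_set = [
    {"input": "Invoice #123 overdue by 30 days", "expected": "collections"},
    {"input": "Password reset not working", "expected": "support"},
]

def run_eval(run_agent, eval_set):
    # Compare agent output to labeled expectations and report accuracy plus failures
    results = [(case, run_agent(case["input"])) for case in eval_set]
    correct = sum(1 for case, output in results if output == case["expected"])
    score = correct / len(eval_set)
    failures = [(case["input"], output) for case, output in results if output != case["expected"]]
    return score, failures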


This post focused on corpus data. Image and audio pipelines are another recipe entirely—one I haven't cooked yet.
