Exponential vs S-Curve
GPT-5 has been out for a few days now, and apart from marketing hype, the response has been a resounding "meh". Some say it's great for their particular use case, others say it's mediocre at best.
In the endless cycle of debating which company's AI model is currently the best, we can get the impression that huge strides are being made. The whole "accelerationist" movement tells us to expect exponential growth of model capabilities, just because the early steps (GPT-1 to 2 to 3 and 4) were so monumental. They'd tell us that, before we knew it, the AI would design better versions of itself and then we'd really be off to the races, towards the so-called singularity, super-human intelligence and, depending on your mood, annihilation or abundance for all.
Well, like many other promises of endless growth, this one doesn't quite seem to pan out. Instead, progress levels off into incremental gains and diminishing returns. Medicine in the 1900s made tremendous strides and made it look like life expectancy was on an exponential growth curve, but that didn't mean life expectancy would grow indefinitely (don't tell Peter Thiel or Ray Kurzweil I said that, though). There are natural limits and constraints.
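To see the difference in miniature, here's a tiny sketch in Python (the numbers are made up purely for illustration): an exponential curve and an S-shaped (logistic) curve look nearly identical early on, which is exactly what makes the early GPT leaps so misleading, and then one of them flattens out.

```python
import math

def exponential(t: float, rate: float = 1.0) -> float:
    # Compounds forever: growth proportional to current size.
    return math.exp(rate * t)

def s_curve(t: float, rate: float = 1.0, ceiling: float = 100.0) -> float:
    # Logistic curve: looks exponential early on, then flattens out near the ceiling.
    return ceiling / (1 + (ceiling - 1) * math.exp(-rate * t))

for t in range(0, 11, 2):
    print(f"t={t:2d}  exponential={exponential(t):10.1f}  s-curve={s_curve(t):6.1f}")
```

Early on, the two columns track each other closely; later, one explodes while the other settles near its ceiling.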
So, what does that mean? It means it's crunch time for engineers. We can't just sit around with the same old systems and same old prompts and just wait for better models. Models will keep getting better, but not at a rate that excuses laziness. Now's the time to tweak the prompts, the model selection, the vector databases and the scaffolding. Now's also the time to be less forgiving of products and tools that seemed to bank too hard on vast improvements in bare LLM capabilities. If it's nowhere near useful right now, don't let them tell you to "just wait until GPT-6 hits".
It's okay for bare LLM progress to slow down. It's not like in classical software engineering we write low-performance software and then say, "oh well, we'll just wait for the next version of PostgreSQL to make our bad queries execute faster". (Though there was that glorious time in the 90s where CPU speed doubled every time you blinked...)
Long story short, GPT-5 was underwhelming, but that fact itself is also underwhelming. Let's just get back to the actual work of engineering working solutions to real problems.
The Eagle’s Call
Back from vacation, here's a nature-inspired newsletter edition with a fun nature fact: The sound many movies use for a bald eagle's cry is actually the cry of the red-tailed hawk (Red-tailed hawk - Wikipedia). You know, the high-pitched, long and echo-y scream. Real eagle calls sound more like dolphin chatter (Bald eagle - Wikipedia).
Because movies use the wrong call so much, those of us who don't live in an area with abundant bald eagles end up thinking that's their real sound. The consequences of this particular misunderstanding are benign, but it points to a certain danger in being presented with plausible but wrong information, a danger AI has a chance to amplify. The analogy goes like this:
I have no idea what a bald eagle sounds like because I've never heard one in the wild.
I come across a movie with the wrong eagle sound.
It sounds plausible enough. Powerful and piercing. I now think that that's what eagles sound like.
With AI:
I have no idea how to write good marketing copy because I'm not a marketing specialist.
I ask AI to write me some good marketing copy.
To me, it sounds good enough. Punchy and engaging. I now think AI makes marketers and copywriters obsolete.
Just because the AI result seems good to me doesn't mean it's actually good (unless it is meant purely for my own consumption). It takes an expert at a given craft to judge whether the result is truly good. Which reiterates the point: In the hands of an expert, AI can be a great productivity boon. In the hands of a hack, it can be a dangerous delusion.
The Review Trap
Consider this scenario: An AI tool generates some output. Maybe it's code. Perhaps it's a marketing communication to be sent out. A human reviews it before it gets sent out. This works and makes sense in scenarios where reviewing is faster than creating. Luckily, there are many such scenarios. Writing a good joke can take a comedian hours or days. Judging whether a joke is funny happens instantly. The same dynamics apply to writing and many creative endeavours. In such a scenario, the AI does not need to get it 100% right in one shot. The review will be brief, and if the output is 90% accurate, the subsequent changes will be rapid.
But there are scenarios where this dynamic doesn't apply, and that's reviews themselves. To judge whether an AI-generated review is correct, you have to do the whole review yourself. Otherwise, how can you be sure the AI didn't miss something critical? So in this scenario, where you want to use AI-generated reviews (or summaries, or style-transformations, ...), you're not better off unless you have complete trust in the tool's accuracy. That immediately sets a much higher bar for AI tools used in these scenarios.
In such a scenario, your current best bet is to use the AI review as a jumping-off point for your own thorough review: Let's say you use GitHub's Copilot Review functionality to provide an initial review of suggested code changes. Great. Take a look at those findings and then check the code thoroughly yourself, looking specifically for subtle things the AI might have missed. Just don't trust them blindly. And when thinking about AI use-cases in your org, pay attention to which scenario you're dealing with, generation or review, and don't fall into the trap of automating something that needs to be double-checked by a human anyway.
P.S.: It's a beautiful summer here, and I'm off on another camping trip, returning in mid-August with more AI-related posts.
GRR Martin Does Not Need a Faster Typewriter
Fans of the Game of Thrones book series have been waiting over a decade now for the next installment. Distractions like the TV show certainly didn't help in the writing process. Either way, I don't know what exactly the author, G.R.R. Martin, needs to finally finish "Winds of Winter", but it's definitely not a faster typewriter. The bottlenecks show up elsewhere, and the speed of typing is just small noise in the grand scheme.
When considering AI and its potential to accelerate your organization, keep this in mind: It's a comprehensive system you're trying to optimize, not just a single component. In many traditional software organizations, for example, any changes to production code undergo multiple stages of review and quality gates. If you can suddenly generate code at double the speed, that won't matter if you keep the speed of reviews the same. Conversely, you can probably speed up your delivery by a significant factor if you optimize these handoffs and gates first, before implementing an advanced AI solution.
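A quick back-of-the-envelope sketch, with hypothetical numbers, shows why the bottleneck dominates:

```python
# Hypothetical numbers: hours per change spent writing code vs. sitting in review and quality gates.
coding_hours = 10
review_hours = 30

baseline = coding_hours + review_hours            # 40 hours end to end

# Double the coding speed, leave reviews untouched:
faster_coding = coding_hours / 2 + review_hours   # 35 hours: only ~12% faster overall

# Halve the review and handoff time instead:
faster_reviews = coding_hours + review_hours / 2  # 25 hours: ~38% faster overall

print(baseline, faster_coding, faster_reviews)
```

The exact numbers don't matter; the point is that speeding up the part that isn't the bottleneck barely moves the total.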
I'm starting to see a pattern: AI brings outsized benefits to organizations that, even before AI, were agile, nimble, and well-organized, whereas AI will struggle and spin its wheels in an organization that's dysfunctional, brittle, and messy. My hope is that the promised benefits from AI will serve as a sufficient wake-up call for organizations to clean up their act.
Onboarding for AI
In an old episode of his podcast, productivity expert Cal Newport tackles a listener question: "We're drowning in email at work and everything's a chaotic mess. Should we bring in an assistant to help with that?"
Makes sense on the surface. There's too much to do, and it prevents you from focusing on your core work, so why not bring in help? But Cal cautions against it: If your organization is beset by chaos and lacks clear processes, throwing another person into the mix does not help. Instead, he suggests first getting clarity: What needs to happen, and how? Write your Standard Operating Procedures (SOPs). Once you've got those, you can reassess: it might be that standardizing your processes has brought enough sanity to the organization that further assistance is no longer required. If that's not the case, and there's still too much work you'd rather not do yourself, then at least a new helping hand has everything they need for maximum success.
In short, don't just hire someone, hand them the keys to your inbox and say, "good luck, kid."
And now to AI
The same holds true, even more so, for AI tools. The AI equivalent of letting a poor unsuspecting soul loose on your inbox and docs would be to just hook up ChatGPT or Claude to your Gmail and Google Drive and claim victory. Maybe that's enough for a few simple workflow enhancements. But chances are you'll get much better results if you identify and map out the process you want to delegate to the AI. The numerous benefits include:
You'll be forced to articulate what "good" looks like for each of those tasks, and lay out which information sources must be consulted.
You can set up concrete evals that let you iterate and experiment with different AI tools, prompts, parameters, etc., to get hard data on what works and what doesn't (see the sketch after this list).
You can identify and isolate those parts of the workflow that are deterministic and can therefore be more effectively handled by non-AI software solutions, and rely on the slower and more expensive AI tools for the parts that only they can perform.
AI lacks human-level discernment about what information truly matters in a given context, so throwing all the data at it, all the time, can degrade performance compared to providing only the context that's required for the task.
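On the evals point above: here's a minimal sketch of what such a harness could look like. The test cases, prompt variants, and the `ask_model` wrapper are hypothetical placeholders for whatever tool you're evaluating, and a simple keyword check stands in for a real scoring function:

```python
# Hypothetical test cases: an input document plus facts a good answer must contain.
TEST_CASES = [
    {"document": "Invoice #123, terms net 30, due 2024-07-01 ...",
     "must_mention": ["net 30", "2024-07-01"]},
    {"document": "Invoice #456, terms net 60, due 2024-09-15 ...",
     "must_mention": ["net 60", "2024-09-15"]},
]

PROMPT_VARIANTS = {
    "terse": "Extract the payment terms and the due date.",
    "detailed": ("You are an accounts-payable assistant. "
                 "State the payment terms and the due date exactly as written."),
}

def ask_model(prompt: str, document: str) -> str:
    # Placeholder: swap in a call to whichever AI tool you're evaluating.
    return ""

def run_evals() -> None:
    for name, prompt in PROMPT_VARIANTS.items():
        passed = sum(
            all(fact.lower() in ask_model(prompt, case["document"]).lower()
                for fact in case["must_mention"])
            for case in TEST_CASES
        )
        print(f"{name}: {passed}/{len(TEST_CASES)} cases passed")

run_evals()
```

Swap in real cases from your own workflow and the same loop tells you, with hard numbers, which tool and prompt combination actually delivers.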
Just as a human assistant needs proper onboarding documentation to find success in their role, so do AI assistants need help and guidance to do their best.
AI Adoption for the Skeptical
A contact in the mining industry recently shared something fascinating with me: "I've got this client, they've always been anti-tech, but now they feel they need to do something with AI."
This is happening everywhere. Industries that spent decades perfecting their healthy skepticism of technology vendors are suddenly worried they're missing out. And honestly? That skepticism might be their biggest advantage.
Here's what I've learned: The companies that succeed with AI aren't the ones making headlines. They're not spending millions on "digital transformation initiatives" or hiring armies of consultants to build proof-of-concepts that gather dust.
Instead, they're asking better questions:
Where do our engineers waste hours searching through old reports?
Which compliance tasks eat up days but follow predictable patterns?
What knowledge is walking out the door when our veterans retire?
The $75K Pilot Beats the $2M Transformation
Big consultancies will tell you AI requires fundamental transformation. New systems! New processes! New everything! (New invoices!)
But generative AI actually works pretty well with your existing mess. Those thousands of PDFs collecting digital dust? That equipment manual from 1987? The handwritten inspection notes? Modern AI can (probably) read all of it.
You don't need perfect data. You need a specific problem.
Start Where It Doesn't Hurt (Much)
The best entry point? Read-only applications. Let AI search and summarize before it creates. Think of it as hiring a really fast intern who's read every document your company ever produced:
"What were the soil conditions in Block 7 in 2019?"
"Show me all safety incidents involving conveyor belts"
"Which permits mentioned groundwater contamination?"
Nobody's job is threatened. Nothing breaks if it's wrong (if you actually read what the AI digs up, of course). But suddenly, answers that took hours take seconds.
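To show how unglamorous that first step can be, here's a minimal read-only sketch: a naive keyword search over a folder of already-extracted text files. The folder name and query are made up, and a real system would add better ranking and an AI summarization step on top, but nothing here writes anything back:

```python
from pathlib import Path

def search_reports(folder: str, query: str, top_k: int = 5):
    """Rank text files by how often they mention the query terms. Purely read-only."""
    terms = query.lower().split()
    scored = []
    for path in Path(folder).glob("**/*.txt"):
        text = path.read_text(errors="ignore").lower()
        score = sum(text.count(term) for term in terms)
        if score > 0:
            scored.append((score, path))
    return sorted(scored, key=lambda item: item[0], reverse=True)[:top_k]

# Hypothetical usage: a folder of extracted inspection reports.
for score, path in search_reports("reports/", "conveyor belt safety incident"):
    print(score, path)
```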
The Trust Ladder
Once people see AI finding information faster than their best document wizard, you can climb the ladder:
Search (no risk, high value)
Summarize (low risk, saves time)
Draft (medium risk, human reviews everything)
Integrate (only after proving value)
Most valuable applications never need to go past step 3. And that's fine.
Why Traditional Industries Have an Edge
That "anti-tech" instinct? It's actually perfect for AI adoption. You won't fall for hype. You'll demand ROI. You'll ask uncomfortable questions about what happens when it hallucinates.
Your skepticism forces vendors to prove value early on, not only after a big transformation for $2M. Your caution means you'll start small, fail fast, and scale what works.
The mining executive who said, "We need to do something with AI"? They're right. But that something should be specific, measurable, and boringly practical. Leave the moonshots to companies with venture capital to burn.
What's the most annoying document search in your organization? That's where your AI journey should start.
AI, the Genie
When talking about AI tools, they're often called "assistants". That word conjures up someone eminently capable, who will do their best to fulfil the requests made of them in letter and spirit. This is aspirational. Kent Beck, legendary programmer and co-author of the original Agile Manifesto, has picked a more apt metaphor: that of a genie.
In mythology, genies have to grant the wishes of their captors. But they're more than happy to grant them in letter only, and cause much mischief within those parameters. In coding tasks, that manifests itself in obviously nonsensical behaviour like, "I can make the tests pass by removing them!"
Armed with this metaphor, we can make better choices in using these tools: How do we instruct the genie so that it can't mess things up for us? How can we design a system where unleashing a genie behind the scenes ultimately creates something helpful and aligned? How can we build mechanisms of trust and verification into the system?
The genie won't go back in the bottle, so we might as well make the most of the wishes it grants us.
AI Slows You Down—Or does it?
As reported in a recent study, Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity - METR, developers using AI were, on average, 19% slower than those not using it. What's more, this observed result was contrary to how the developers felt about AI tools, both during and after the study. In short, developers felt 20% more productive but were actually 19% less so.
Take that, vibe coders! Robots 0, humans 1.
Or...?
I want to unpack two parts here. One is the discrepancy between experienced versus actual productivity, the other a larger view on what such results mean.
Motion is better than idleness
Once, coming back from a weekend trip, I saw Google Maps show a lengthy slowdown on the typical highway route. It cheerfully told me that this was still the best route, despite a 30-minute delay. With tired kids in the back seats, I did not want to be stuck in a jam, though, so I opted for a route that took 45 minutes longer but allowed me to keep moving. My hypothesis is that, with an AI tool at your fingertips, you're never at a standstill. Who cares that you have to prompt and re-prompt it to get to your desired outcome if at least it feels like code is being written?
The takeaway here: Feelings matter, since we want to keep our tool-users happy, but in the end, they must be cross-checked against real data.
Adoption and the initial drop
You do things a certain way, you get good at it, and eventually you reach a peak. To then ascend to another, even higher peak, you first have to come down, at least partway, from your current peak. This effect is near-universal. Whether it's your golf swing, tennis serve, chess opening repertoire, programming style/language, any change that will ultimately make you better results in an initial dip. Given how early we still are with AI coding tools, I'm not surprised that we observe initially decreased performance. As the tools and our knowledge of how to use them improve, we should come out ahead. (The study, incidentally, shows one outlier who was vastly more efficient when using AI. Maybe that individual was a bit ahead of the general learning curve.)
The takeaway here: Whatever change you introduce, with the hope of making things better, will initially make things worse. Prepare yourself and your team for it and have clear expectations about how to assess progress.
How to Kill Your AI Initiative Before it Starts
In last week's podcast interview, one point that came up was the idea that "market validation" applies to internal AI initiatives just as much as it applies to outward-facing product development. You, as a business leader, might have a great idea for tools and enhancements, but that will fall flat if the (internal) market does not adopt it.
Sure, you can mandate from the top that all employees must use a given tool. However, if your people perceive too much friction or if you fail to gain buy-in and present a clear ROI (return on time and effort invested), adoption will be perfunctory at best and actively undermined at worst.
I've heard of concrete examples of this: AI-assisted work scheduling software was rolled out with considerable fanfare and promised significant savings, only to be undermined by employees who didn't want to lose out on the now-eliminated overtime bonus payments. Another example, from older times, is the outright revolt of manufacturing workers when automation threatened to take their jobs. Your economic reasons notwithstanding, people can't be expected to embrace something they perceive as an existential threat: If the CEO of a company publicly talks about introducing AI to replace his workforce, don't be surprised if the workforce doesn't comply; this is the equivalent of saying, "We're firing you and will replace you with someone cheaper, but can you please make sure to onboard them properly so they know everything you know about this critical area?"
Both for technical and ethical reasons, I'm much more enthusiastic about models of working where AI enhances what humans are capable of. If you build something for humans, it pays to build it with humans.
The Role of Agility in Getting AI Investments Right
The amazing Yuval Yeret had me on his podcast "Scaling w/ Agility". We discuss agile, AI, and their combination, exploring where both can go off the rails.
From Yuval's episode description:
In this episode of the Scaling With Agility podcast, host Yuval Yeret welcomes Clemens Adolphs, co-founder of AIce Labs, a company specializing in helping organizations successfully implement AI initiatives. Their conversation dives into the intersection of AI and agility, exploring how to avoid the all-too-common proof-of-concept trap and instead focus on delivering genuine outcomes. From internal market validation to adapting agile methods to the unique context of AI projects, this discussion is essential listening for leaders who want to scale AI initiatives without falling into process theater.
Notable Quotes
“The proof of concept is where AI projects often go to die. You need to design for internal market validation early on.” — Clemens Adolphs
“Agile is not a one-size-fits-all recipe—especially when you’re dealing with the inherent uncertainty of AI.” — Yuval Yeret
“Metrics are useful, but if they’re not tied to actual adoption or impact, you’re just performing success, not achieving it.” — Clemens Adolphs
Check out the full episode here or wherever you get your podcasts.
The AI-Human Feedback Loop
You might have already seen Andrej Karpathy's recent keynote to the 2025 AI Startup School —if not, catch it here—but what stood out for me was that he puts a strong emphasis on partial autonomy where the AI does some of the work for the human while the human provides verification and guidance.
I've written before about the types of tasks that are good for AI automation, with Guided Intelligence—a task that is highly dependent on specific context and very open-ended—being the toughest of them.
High-context, open-ended tasks are precisely those where, with the current state-of-the-art models, we can hope at best for this partial autonomy. To make such a symbiotic relationship between human and AI work, Karpathy points out that the cycle between prompting the AI and getting outputs must be short.
A bad example is an AI coding agent that, after a single prompt, drops 1000 lines of code on you to review and verify. That creates lots of friction and slows you down immensely. It also gives the agent ample time running at top speed in the wrong direction.
A good example is the way Claude Code splits a prompt into several small tasks, asks if those make sense, then presents small code snippets to accept or refine. Instead of firing and forgetting, we are rapid-firing.
So, when designing an AI tool that's meant to assist human workers in a complex task, don't ask "how can we get everything right in one shot?" Don't even ask "how can we do as much work in one shot as possible and then iterate?" Instead, ask "how can we create a tool that affords a low-friction, fast feedback cycle?"
Is anyone NOT looking into AI these days?
Turns out, yes. According to an opinion piece in the Communications of the ACM, a survey among business executives turned up some shocking facts. I won't rehash the whole article here, just a few juicy quotes:
Most companies are struggling to articulate AI plans
This is an extraordinary—and dangerous—"wait-and-see" mind-set.
Despite all this publicity [about GenAI], only 20% of companies defined AI initiatives as high priority, and more than 47% defined them as insufficient or "unknown".
77% said that [they] had only looked at GenAI briefly or not at all.
To those of us surrounded by AI through our work, this feels baffling. In my immediate bubble, it seems like everybody and their dog is either building a custom GPT or vibe-coding a business. Yet this survey reveals a vast number of companies that look at this technology and shrug their shoulders.
Now, I don't want to pull the FOMO (Fear of Missing Out) card. Beware the hype. By all means, do things sensibly and measuredly. But do things. And don't treat it like a box-checking exercise where you bring in an expensive consultancy to run a one-day workshop on the usage of GenAI for your employees and call it a day.
I've assumed it as a given that every company would want to do something with AI, i.e., articulate an AI plan and explore where it might speed up critical workflows, take low-value drudgery away from high-paid specialists, finally make use of the vast amounts of messy data collected by the business, or any of the other emerging use-cases. It turns out that many companies haven't yet reached that level. So maybe it is time to play the FOMO card after all? Because these companies are missing out.
So here's a humble ask: If you know someone who runs a business, or a department inside a business, and they fit the profile of this post—haven't looked into AI, or don't know where to start—kindly connect them with me because I'd love to ask them questions to refine my understanding of how business leaders are thinking about AI.
AI and Useless Tasks
If I use AI to turn a single bullet point into a long email I can pretend I wrote, and you, as the recipient of that way-too-long email, use AI to turn it back into a single bullet point, how productive was the use of AI in that scenario?
(Hat tip for the joke to Tom Fishburne: AI Written, AI Read cartoon - Marketoonist | Tom Fishburne)
In this scenario, AI indeed saved me the time to write a long email, and it saved you the time to summarize it. 10 minutes of writing for me, 5 minutes of summarizing for you, a total of 15 minutes of human time saved. Hooray! Multiplied over the dozens of emails we send and receive each day, the potential savings are huge.
Or I could just have sent you the single bullet point in the first place.
Maybe this exact example is a tad contrived. Fair. But how many business processes that look like prime candidates for AI automation shouldn't even exist in the first place? I like the vision of AI taming bureaucracy and reclaiming our time for the things that matter and that we're good at. But before you go through the expensive process of developing such an AI solution, ask yourself if you couldn't just get rid of (or at least vastly simplify) the process entirely.
AI Spam: Technical Success, Business Failure?
Here's a great example of an AI product that works great on a technical level but utterly fails at the business level: "Hyperpersonalized" sales emails courtesy of AI.
While going through my spam folder (since Google occasionally misclassifies emails), I came across a fun specimen. Someone had built a tool that examines company websites, identifies the relevant people and information about them, and then feeds that into an LLM with the task of generating "personalized" opening and post-script statements to be slapped around a generic sales email. In my case, what I got was
First of all, congratulations on your incredible journey from VortX Labs to becoming the Co-Founder of Aice Labs! I truly admire how you transform businesses with practical AI solutions, making a real difference in Vancouver's industry.
for the opener and then for the closer, I got
P.S. I really admire how Aice Labs makes AI practical and impactful for businesses. By the way, is the "Dungeness Crab" at "Blue Water" truly a must-try like I hear? Maybe we can meet there someday!
I can feel the sleazy used-car-salesman energy oozing right out of that. I'm not a sales expert, but I have a strong suspicion that if your sales emails aren't effective, it's probably due to what you're offering, rather than your opening line not being a sycophantic paraphrasing of the recipient's bio or lack of local flair in the P.S.
I'm sure the people building the tool paid attention to ensuring it worked technically: Does it correctly identify names, titles, backstory, and location? Does it successfully pull in some locally relevant info? (I genuinely wonder how they're pulling that off, on a technical level. Have an AI agent look for the top restaurants in the recipient's city and talk about a high-rated item on their menu?)
But I'm certainly not going to buy whatever they're selling.
So what's the greater lesson here? Maybe it's Dr. Ian Malcolm's quote from Jurassic Park:
Your [prompt engineers] were so preoccupied with whether or not they could, they didn't stop to think if they should.
"The Computer Doesn't Do What I Tell It To!"
As a teenager, I was often called on to provide basic tech support for friends and family. They'd complain that the computer wouldn't "listen" to them. I'd chuckle at that because the problem likely arose because the computer did do exactly as told. Barring outright bugs, old-school software is deterministic. Clicks and keystrokes have pre-programmed results you can rely on.
Not so with generative AI. Non-determinism lurks everywhere.
There is the inherent randomness of the generated output: a typical chat model answers differently when you repeat the same question. When accessing a model programmatically, this randomness can be turned off, but other sources of nondeterminism remain.
Even with the randomness set to zero, when the model is fed raw user input, slight variations in that input can lead to different outcomes: whether the user wrote "Can you write me a poem about cats?" or "Make up a poem about cats!" might produce very different outputs.
And then, there is of course the unpredictable way the model handles a large input, or a complex request. You might ask for a piece of writing in a particular format and it's anyone's guess whether you get it or not. So in that case, the computer really didn't do what you told it to.
What can AI engineers do here?
Aim narrow, accept wide. That's good general advice for user experience. The more you have to tell your users exactly how to use the system, the less awesome they'll feel about themselves and the experience. On the flipside, the more you can make the tool's output predictable, or at least non-perplexing, the better.
Safeguards around user input and post-processing of model output, to file off the rough edges. This could be traditional preprocessing (like removing all special characters and punctuation from an input where it shouldn't influence the actual output) or it could be yet another AI step.
Tune the model's randomness. Don't just accept the default parameter; it might not be the best, and neither might setting it to zero. Experiment and evaluate. (A sketch of the first two points follows below.)
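As a minimal sketch of the first two points, assuming an OpenAI-style chat completions client (the model name, normalization rules, and parameter values are illustrative, not recommendations):

```python
import re
from openai import OpenAI

client = OpenAI()  # assumes an API key is configured in the environment

def normalize(user_input: str) -> str:
    # File off rough edges that shouldn't change the meaning of the request:
    # collapse whitespace, drop trailing punctuation and stray quotes.
    text = re.sub(r"\s+", " ", user_input).strip()
    return text.strip(' "\'!?.')

def ask(user_input: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",   # placeholder model name
        temperature=0,         # reduces, but does not eliminate, output randomness
        seed=42,               # best-effort reproducibility, not a guarantee
        messages=[
            {"role": "system", "content": "Answer in exactly three bullet points."},
            {"role": "user", "content": normalize(user_input)},
        ],
    )
    return response.choices[0].message.content

# "Can you write me a poem about cats?" and "Make up a poem about cats!!"
# now reach the model in a more uniform shape.
```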
And if all that doesn't help, just put "vibe" in front of your product name and all is forgiven ;)
AI in Medicine - Promise and Peril
I recently came across Rachel Thomas's talk on AI in medicine. She co-founded fast.ai (a great learning resource for all things deep learning) and has recently taken a strong interest in the intersection of AI and the medical field.
One line from the talk that hit me hard:
It can be exciting to hear about AI that can read MRIs accurately, but that’s not going to help patients whose doctors won’t take their symptoms seriously enough to order an MRI in the first place.
AI in medicine has so much promise; it's worth getting it right. There are the obvious technical pitfalls, such as data quality issues. But as the quote points out, there are also more pernicious inherent biases that, if we don't get it right, will be perpetuated instead of alleviated with AI.
Here are some points on how we would tackle an AI project in healthcare:
Relentless focus on the final desired outcome, which in my book means better patient health outcomes.
Brutal honesty about which metrics (accuracy, recall, F1 score, etc.) actually matter toward that end goal, and at what threshold, so we're not just chasing metrics for their own sake (see the sketch at the end of this section).
Before writing any code or putting together any wireframes or prototypes, understand exactly the current workflow, how the new product will be integrated, by whom, and what conflicting incentives they might have.
Conduct a pre-mortem: Ask, "It's now six months later and the project was a total disaster. What happened?" Then develop defensively to prevent those things.
In particular, consider the following hypothetical future scenario: "Our well-meaning AI product had this terrible unintended side effect. What could we have done to prevent it?"
All of that helps ensure that AI gets used for good, and it's the non-negotiable groundwork for a successful initiative.
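To make the metrics point concrete, here's a tiny sketch with hypothetical numbers: the threshold is tied to an explicit patient-safety requirement rather than to a leaderboard-style headline score.

```python
def recall(tp: int, fn: int) -> float:
    return tp / (tp + fn)

def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp)

# Hypothetical validation counts for a triage model that flags scans for follow-up review.
tp, fn, fp = 92, 8, 40

# Threshold agreed on with the care team: miss at most 2% of cases that need follow-up.
REQUIRED_RECALL = 0.98

print(f"Recall: {recall(tp, fn):.1%}, Precision: {precision(tp, fp):.1%}")
if recall(tp, fn) < REQUIRED_RECALL:
    print("Fails the patient-safety threshold, however good the headline accuracy looks.")
```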
Leverage: A Physicist’s Rant
In business speak, to leverage means to use. You might as well employ, deploy, or utilize. Just say "use", then.
However, in physics, "leverage" has a precise meaning. It's the principle behind clever contraptions such as
the lever (duh)
pulleys
gears
These systems have one thing in common: They translate one physical situation into another and let you choose advantageous trade-offs. For example, with a 3:1 pulley, you can triple the force you can apply to something, but you pay for it by having to travel further: To lift something 1 foot with a 3:1 pulley, you must pull on the other end for 3 feet.
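The bookkeeping behind that trade-off, sketched as a quick bit of math: the work (force times distance) has to balance on both sides, so whatever you gain in force you pay back in distance.

```latex
W_{\text{in}} = W_{\text{out}}
\quad\Longrightarrow\quad
F_{\text{in}}\, d_{\text{in}} = F_{\text{out}}\, d_{\text{out}}
\quad\Longrightarrow\quad
F_{\text{out}} = 3\,F_{\text{in}} \ \text{requires}\ d_{\text{in}} = 3\,d_{\text{out}}
```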
When businesses speak of leverage, this trade-off discussion is missing. To say, "We're leveraging AI" sounds like you get something for nothing. But just like in the physical context, you rarely get a free lunch. Instead, you're making an advantageous trade-off. So, what is it you're trading in? And is it truly advantageous?
You can use AI to generate code, or writing, faster. But can you keep quality the same?
Using chatbots in customer service lets you scale up massively, but you trade in the human touch and connections.
It's essential to articulate what you're trading away and whether you're okay with it. Leverage only makes sense if what you gain is more valuable than what you lose.
You’re Not Hiring a Calculator…
Imagine you're hiring an accountant and worrying about making sure they can't use a calculator in the interview, because all you're going to ask them to do is add numbers in their head, and using a calculator would be cheating. Pointless, right?
It seems candidates and companies are locked in an arms race of AI sophistication, especially in tech: As it turns out, AI coding assistants are really good at the types of puzzles hiring managers like to use in tech assessments: "You're given two lists. Both are sorted. What's the fastest way to find their collective median?" and the like.
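For reference, that puzzle is exactly the kind of thing an AI assistant answers instantly. Here's a simple merge-based sketch (there's a cleverer logarithmic-time approach, but this conveys the idea):

```python
def median_of_two_sorted(a: list[float], b: list[float]) -> float:
    # Merge the two sorted lists, then take the middle element (or average the middle two).
    merged, i, j = [], 0, 0
    while i < len(a) or j < len(b):
        if j >= len(b) or (i < len(a) and a[i] <= b[j]):
            merged.append(a[i]); i += 1
        else:
            merged.append(b[j]); j += 1
    n, mid = len(merged), len(merged) // 2
    return merged[mid] if n % 2 else (merged[mid - 1] + merged[mid]) / 2

print(median_of_two_sorted([1, 3, 5], [2, 4]))  # 3
print(median_of_two_sorted([1, 2], [3, 4]))     # 2.5
```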
So now we have candidates developing tools that sit in your browser and feed interview questions straight into ChatGPT, and hiring managers wondering if it's time to bring back the in-person (as in, live, not via Zoom) interview.
That's misguided. We can't celebrate the productivity gains that AI enables in those who know how to use it, and then freak out when candidates use AI in the hiring process. Why not design the whole process so that only those who can produce great results with AI pass?
Take writing. Yes, AI can generate lucid and passable responses. But let's say you're actually hiring for a position where someone has to produce writing. Marketing copy for your website, for example. Why bother hiring someone? Couldn't you just ask ChatGPT to "write my marketing copy, please"? Why? No, really. Why aren't you saving all that money and instead spending just five minutes each day prompting the AI for the writing you need? Or hiring a part-time teenager to copy-paste your prompts into ChatGPT and copy-paste the answer to your website? Maybe because there's skill, taste, and discernment required beyond that? (I know for sure that I couldn't just prompt my way to a Pulitzer Prize.)
So then, in hiring, make it an "open-book" exam. AI explicitly allowed. Just raise the bar for the outcome you want to see. That instantly defeats the point of using AI to "cheat." So ask for spicy takes, strong opinions, a human explanation for why this is good and that is bad. (Try asking ChatGPT something like "which is the worst frontend framework" and you get some "on the one hand, on the other hand, to be fair, in the end..." wishy-washy position. A real human will be happy to go off on a fun rant.)
AI Theatre
Maybe it's because Vancouver's "Bard on the Beach" festival is kicking off, or because agile expert Yuval Yeret recently wrote a great post about Agile Theatre, but I've been thinking about theatre a bit.
As an art form, theatre is great. But when applied to anything else, the term is derogatory:
The security theatre we endure at airports. It causes a lot of hassle but doesn't demonstrably keep us any safer.
The Agile/Strategy/OKR theatre that Yuval writes about. Companies adopt a methodology's outward trappings and props without buying into its deeper insights.
And so, naturally, I think of AI theatre and forms that can take:
Press-Release-Driven Development: "Look at us, we're so innovative. We use AI!" Yet the splashy demo never gets used in production.
FOMO-Driven Marketing: "Don't get left behind! AI is coming for you, your job, your company, your whole industry. Buy our consulting services NOW."
Chasing the wrong thing: Obsessing over which model tops which leaderboard or which system impressed mathematicians instead of focusing on tangible outcomes and benefits in the real world.
Wild extrapolations, prognostications, and tangential philosophical debates. It's fun. And it's safe, because you're not putting anything immediately falsifiable out there.
It can be entertaining to watch, but the real work happens quietly and with a much more pragmatic focus.
Benchmarks Considered Boring
I think about and write about AI and work with it daily. I've also repeatedly mentioned the importance of objective evaluations for your AI tools. So you might assume that I'm constantly checking the various leaderboards and benchmarks:
Who's on top of Chatbot Arena?
Is Opus-4 or o4-mini better at the FrontierMath benchmark?
"Gemini 2.5 Pro just topped SimpleBench! Here's why that's a big deal."
Honestly, I never really cared about them. Sure, having a rough sense of who's coming out with new models demonstrating generally improved capabilities is good. For that, it's enough to keep a very loose finger on the pulse. If at any point a model makes a really big splash, you're guaranteed to hear about it. No need to compulsively refresh the AI Benchmarking Dashboard.
But beyond larger trends, much of the leaderboard news is just noise. OpenAI comes out with its new model and tops the leaderboard. "This is why OpenAI is the leader and Google missed the boat", the pundits proclaim. Next week, Google comes out and it's all "Oh baby, Google is so back!".
Besides, the leaderboard position is a poor basis for making an informed choice:
The best model at a benchmark might fare poorly on the specific task you intend to use it for.
Or maybe it's "the best" in output quality, but you can't use it for your purpose due to... cost, deployment considerations, compliance issues, latency, etc
Instead of obsessing over rankings that collapse a multi-faceted decision into a single number, you'd do best to consider all these tradeoffs holistically and then, combined with a benchmark custom-built for your use case, make an informed decision.