At the NYC Generative AI Summit, experts from Wayfair, Morgan & Morgan, and Prolific came together to explore one of AI’s most pressing questions: how do we balance the power of automation with the necessity of human judgment? 

From enhancing customer service at scale to navigating the complexity of legal workflows and optimizing human data pipelines, the panelists shared real-world insights into deploying AI responsibly. In a field moving at breakneck speed, this discussion was an opportunity to examine how we can build AI systems that are effective, ethical, and enduring.

From support to infrastructure: Evolving with generative AI

Generative AI is reshaping industries at a pace few could have predicted. And at Wayfair, that pace is playing out in real time. Vaidya Chandrasekhar, who leads pricing, competitive intelligence, and catalog ML algorithms at the company, shared how their approach to generative AI has grown from practical customer support tools to foundational infrastructure transformation.

Early experiments started with agent assistance, particularly in customer service. These included summarizing issue histories and providing real-time support to customer-facing teams: the kind of use cases many companies have treated as their entry point into generative AI.

From there, Wayfair moved into more technical territory. One significant area has been technology transformation: shifting from traditional SQL stored procedures toward more dynamic systems. 

“We’ve been asking questions like: if you’re selecting specific data points and trying to understand your data model’s ontology, what would that look like as a GraphQL query?” Vaidya explained. While not all scenarios fit the model, roughly 60-70% of use cases have proven viable.
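
To make that concrete, here is a minimal sketch of the pattern (illustrative only, not Wayfair's actual code): a hypothetical `call_llm` helper is asked to recast a stored-procedure-style data request as a GraphQL query against a toy product schema.

```python
# A minimal sketch of the SQL-to-GraphQL pattern described above (illustrative,
# not Wayfair's code). `call_llm` is a hypothetical stand-in for whatever
# model client is actually in use.

PRODUCT_SCHEMA = """
type Query {
  products(category: String): [Product!]!
}

type Product {
  sku: ID!
  name: String!
  category: String
  price: Float
}
"""

def request_to_graphql(call_llm, request_description: str) -> str:
    """Ask the model to recast a stored-procedure-style request as a GraphQL query."""
    prompt = (
        "Given this GraphQL schema:\n"
        f"{PRODUCT_SCHEMA}\n"
        "Rewrite the following data request as a single GraphQL query. "
        "Return only the query.\n\n"
        f"Request: {request_description}"
    )
    return call_llm(prompt)

# Example request that might previously have lived in a stored procedure:
#   "Fetch sku, name, and price for all products in the 'sofas' category."
# A plausible model output:
#   query { products(category: "sofas") { sku name price } }
```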

Perhaps the most transformative application is in catalog enrichment, which is at the core of Wayfair’s operations. Generative AI is being used to enhance and accelerate how product data is organized and surfaced. And in such a fast-moving environment, agility is key. 

“Just this morning, we were speaking with our CEO. What the plan was two months ago is already shifting,” Vaidya noted. “We’re constantly adapting to keep pace with what’s possible.”

The company is firmly positioned at the edge of change, continuously testing how emerging tools can bring efficiency, clarity, and value to both internal workflows and customer experiences.

Building AI inside the nation’s largest injury law firm

When most people think of personal injury law firms, they don’t picture teams of software engineers writing AI tools. But that’s exactly what’s happening at Morgan & Morgan, the largest injury law firm in the United States. 

Paras Chaudhary, Software Engineering Lead at Morgan & Morgan, often gets surprised reactions when he explains what he does. “They wonder what engineers are even doing there,” he said.

The answer? Quite a lot, and increasingly, that work involves generative AI.

Law firms, by nature, have traditionally been slow to adopt new technologies. The legal profession values precedent, structure, and methodical processes, qualities that don’t always pair easily with the fast-evolving world of AI. 

But Morgan & Morgan is taking a different approach. With the resources to invest in an internal engineering team, they’re working to lead the charge in legal AI adoption.

The focus isn’t on replacing lawyers, but empowering them. “I hate the narrative that AI will replace people,” Paras emphasized. “What we’re doing is building tools that make attorneys’ lives easier: tools that help them do more, and do it better.”

Of course, introducing new technology into a non-technical culture comes with its own challenges. Getting attorneys, many of whom have been doing things the same way for a decade or more, to adopt unfamiliar tools isn’t always easy. 

“It’s been an uphill battle,” he admitted. “Engineering in a non-tech firm is hard enough. When your users are lawyers who love their ways of doing things, it’s even tougher.”

Despite the resistance, the team has had measurable success in deploying generative AI internally. And equally important, they’ve learned from their failures. The journey has been anything but flashy, but it’s quietly reshaping how legal work can be done at scale.

Human data's evolving role in AI: From volume to precision

While much of the conversation around generative AI focuses on model architecture and compute power, Sara Saab, VP of Product at Prolific, brought a vital perspective to the panel: the role of human data in shaping AI systems. Prolific positions itself as a human data platform, providing human-in-the-loop workflows at various stages of model development, from training to post-deployment.

“This topic is really close to my heart,” Sara shared, reflecting on how drastically the human data landscape has shifted over the last few years.

Back in the early days of ChatGPT, large datasets were the core currency. “There was an arms race,” she explained, “where value was all about having access to massive amounts of training data.” 

But in today’s AI development pipeline, that’s no longer the case. Many of those large datasets have been distilled down, commoditized, or replaced by open-source alternatives used for benchmarking.

The industry’s focus has since shifted. In 2023 and into 2024, efforts moved toward fine-tuning, both supervised and unsupervised, and toward retrieval-augmented generation (RAG) approaches. Human feedback became central through techniques like RLHF (reinforcement learning from human feedback), though even those methods have begun to evolve.

“AI is very much a white-paper-driven industry,” Sara noted. “Every time a new paper drops, everyone starts doing everything differently.” Innovations like rejection sampling and reinforcement learning with verifiable rewards (RLVR) began to reduce the need for heavy fine-tuning, at least on the surface. But peel back the layers, she argued, and humans are still deeply embedded in the loop.

Today, the emphasis is increasingly on precise, expert-curated datasets, the kind that underpin synthetic data generators, oracle solvers, and other sophisticated human-machine orchestrations. These systems are emerging as critical to the next generation of model training and evaluation.

At the same time, foundational concerns around alignment, trust, and safety are rising to the surface. Who defines the benchmarks on which models are evaluated? Who assures their quality?

“We look at leaderboards with a lot of interest,” Sara said. “But we also ask: who’s behind those benchmarks, and what are we actually optimizing for?”

It’s a timely reminder that while the tooling and terminology may shift rapidly, the human element, in all its philosophical, ethical, and practical complexity, remains central to the future of AI.

Human oversight in AI: The power of boring integration

As large language models (LLMs) become increasingly capable, the question of human oversight becomes more complex. How do you keep humans meaningfully in the loop when models are doing more of the heavy lifting, and doing it well? 

For Paras, the answer isn’t flashy tools or complex interfaces. It’s simplicity, even if that means embracing the boring.

“Our workflows aren’t fancy because they didn’t need to be,” he explained. “Most of the human-in-the-loop flow at the firm is based on approval mechanisms. When the model extracts ultra-critical information, a human reviews it to confirm whether it makes sense or not.”
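
The plumbing behind an approval mechanism like that can be very simple. Below is a minimal, hypothetical sketch of the idea, not Morgan & Morgan's actual system: model-extracted values flagged as critical are parked in a review queue until a person confirms them, while low-stakes fields pass straight through.

```python
# Minimal human-in-the-loop approval gate (illustrative, with invented field names).
# Values the model extracts for critical fields are parked in a review queue;
# nothing critical reaches the case record until a person confirms it.

CRITICAL_FIELDS = {"incident_date", "policy_number", "settlement_amount"}

def route_extraction(field: str, value: str, review_queue: list, record: dict) -> None:
    """Send critical extractions to human review; apply low-stakes fields directly."""
    if field in CRITICAL_FIELDS:
        review_queue.append({"field": field, "value": value, "status": "pending_review"})
    else:
        record[field] = value

def apply_review(item: dict, approved: bool, record: dict) -> None:
    """Called when a reviewer confirms or rejects a queued value."""
    if approved:
        record[item["field"]] = item["value"]
    item["status"] = "approved" if approved else "rejected"
```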

To drive adoption among lawyers, notoriously resistant to change, Paras applied what he calls the “radio sandwiching” approach. “Radio stations introduce new songs by sandwiching them between two tracks you already like. That way, the new stuff feels familiar and your alarms don’t go off,” he said. “That’s what we had to do. We disguised the cool AI stuff as the boring workflows people already knew.”

At Morgan & Morgan, that meant integrating AI into the firm’s existing Salesforce infrastructure, not building new tools or expecting users to learn new platforms. “All our attorney workflows are based in Salesforce,” Paras explained. “So we piped our AI outputs right into Salesforce, whether it was case data or something else. That was the only way to get meaningful adoption.”

When asked if this made Salesforce an annotation platform, Paras didn’t hesitate. “Exactly. It works. Do what you have to do. Don’t get stuck on whether it looks sexy. That’s not the point.”

Vaidya echoed the sentiment. “I agree with a lot of what Paras said,” he noted. “I’d frame it slightly differently: it’s about understanding where machine intelligence kicks in, and where human judgment still matters. You’re always negotiating that balance. But yes, integrating into existing, familiar workflows is essential.”

As AI systems evolve, the methods for keeping humans involved might not always be elegant. But as this panel made clear, pragmatism often beats perfection when it comes to real-world deployment.

Orchestrating intelligence: How humans and AI learn to work together

As the conversation turned to orchestration, the complex collaboration between humans and machines, Paras offered a grounded view shaped by hard-earned experience.

“I did walk right into that,” he joked as the question was directed his way. But his answer made it clear he’s thought deeply about this dynamic.

For Paras, orchestration isn’t about building futuristic autonomy. It’s about defining roles and designing practical workflows. “There are definitely some tasks machines can handle on their own,” he said. “But the majority of the work we do involves figuring out which parts to automate and where humans still need to make decisions.”

He emphasized that the key is not treating the system as a black box, but instead fostering a loop in which humans improve the AI by correcting, contextualizing, and even retraining it over time. “The job of humans is to continue evolving these machines,” he said. “They don’t get better on their own.”

Paras also highlighted the importance of being able to pause and escalate AI systems when needed, especially when the model encounters something novel or ambiguous. He gave the example of defining a new item like “angular stemless glass.”

“You don’t want the model to just make it up and run with it,” he said. “You want it to scrape the internet, make its best guess, and then ask a human: ‘Is this right?’”

That ability for the system to admit uncertainty is central to how Paras thinks about orchestration. “It’s like hiring someone new,” he said. “The smartest people still need to know when to raise their hand and say, I’m not sure about this. That’s the critical skill we need to train into our AI.”

Why determinism matters in AI orchestration

As the panel discussion on orchestration continued, Paras offered a grounded counterpoint to the rising excitement around agentic systems and autonomous AI decision-making.

“If I didn’t believe in a hybrid world between humans and machines, I wouldn’t be sitting here,” he said. “But let me be clear: at our firm, we have no interest in dabbling with agent tech.”

While many startups and venture-backed companies are chasing autonomous agents that can reason, plan, and act independently, Paras argued that this kind of complexity introduces too much uncertainty. “Agent tech creates too many steps, too many potential points of failure. And when the probability of failure multiplies across those steps, the overall chance of success drops.”
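
To put rough, illustrative numbers on that compounding effect: if each of five chained steps succeeds 90% of the time, the whole chain completes correctly only about 59% of the time (0.9⁵ ≈ 0.59), and a ten-step chain drops to roughly 35%.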

Instead, Paras advocated for an orchestration model grounded in determinism, where workflows are tightly scoped, predictable, and easily governed by clear logic.

“I love it when orchestration is deterministic,” he said. “That could mean a simple if/else statement. It could mean a human approver. What matters is that the system behaves in a way that’s traceable, testable, and reliable.”
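
In code, that kind of deterministic orchestration can be as plain as a routing function. The sketch below is purely illustrative (the thresholds, field names, and document types are invented, not Morgan & Morgan's rules): hard-coded if/else checks decide whether a model output is applied automatically, escalated to a human approver, or rejected, so every path is traceable and testable.

```python
# Illustrative deterministic router (invented thresholds and document types,
# not an actual firm's rules). Plain if/else checks make every path traceable.

AUTO_APPROVE_LIMIT = 1_000                      # dollar threshold for auto-apply
KNOWN_DOC_TYPES = {"medical_record", "police_report", "demand_letter"}

def route(output: dict) -> str:
    """Return 'auto_apply', 'human_review', or 'reject' for a model output."""
    if output.get("doc_type") not in KNOWN_DOC_TYPES:
        return "human_review"                   # novel or ambiguous input: escalate
    if output.get("confidence", 0.0) < 0.8:
        return "human_review"                   # model admits uncertainty: escalate
    if output.get("amount", 0) > AUTO_APPROVE_LIMIT:
        return "human_review"                   # high stakes: a person signs off
    if not output.get("fields"):
        return "reject"                         # nothing usable was extracted
    return "auto_apply"
```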

At Morgan & Morgan, where the stakes are high and the work is bound by legal and procedural constraints, this type of orchestration isn’t just a preference, it’s a necessity. “We’re not in a startup trying to sell a dream,” Paras pointed out. “We’re in a firm where outcomes matter, and we need to know the system will work as expected.”

That pragmatic approach may not sound flashy, but it's exactly what’s enabling his team to make real, measurable progress. By prioritizing reliability over autonomy, they’re proving that impactful AI doesn’t always need to be cutting-edge; it needs to be dependable.

The broader conversation circled back to how the most valuable AI systems are the ones that know when they don’t know and are built to ask for help.

Human limits and AI benchmarks

As the discussion shifted toward AI orchestration, Sara paused to reflect on a subtler but essential thread: the limits of human understanding and how they shape the systems we build.

“Chain-of-thought reasoning and explainability in AI are fascinating,” she said. “But what’s just as fascinating is that humans aren’t always that explainable either. We often don’t know why we know something.”

That tension between human intuition and machine logic quickly leads to deeper questions. “Whenever I talk about these topics, we’re always two questions away from a philosophy lecture,” Sara joked. “What are the limits of human intelligence? Who quality-assures our own thinking? Are some of the world’s most unsolvable math problems even well formulated?”

These aren’t just abstract musings. In the context of large language models (LLMs), they expose a critical challenge: Can we ever be sure that models are doing what we expect them to do?

This thread naturally led Sara into a critique of how the industry measures performance. “Right now, we’re living in a leaderboard-driven moment,” she said. “Top-of-the-leaderboard has become a kind of default OKR, a stand-in for state-of-the-art.”

But that raises deeper concerns about accountability and meaning. What does it really mean for a model to be aligned, safe, or trustworthy? And perhaps more importantly, who decides?

“I’m always curious and a bit skeptical about who’s defining and scoring these benchmarks,” Sara added. “Who’s grounding the definitions of concepts like ‘verbosity’ or ‘alignment’? What counts as success, and who gets to say?”

These questions aren’t just philosophical; they’re foundational. As AI systems become more central to how decisions are made, the frameworks we use to evaluate them will increasingly shape what we build, what we trust, and what we ignore.

Sara’s insight served as a quiet but powerful reminder: in the rush toward smarter models and more automation, human judgment, with all its limits, still defines the boundaries of AI progress.

The moving target of AI benchmarks and human judgment

As the panel delved deeper into the topic of alignment and accountability, one question emerged front and center: Who gets to define the benchmarks that guide AI development? And perhaps more importantly, are those benchmarks grounded in human understanding, or just technical performance?

The challenge, according to Paras, lies in the fact that alignment is not static.

“It depends,” he said. “Alignment is always evolving.” From his perspective, the most important factor is recognizing where and how human input should be embedded in the process.

Paras pointed to nuanced judgment as a key domain where humans remain indispensable. “You might have alignment today, but taste changes. Priorities shift. What was acceptable last month might feel outdated next quarter,” he explained. “LLMs are like snapshots; they reflect a frozen point in time. Humans bring the real-time context that models simply can’t.”

He also emphasized the limits of what AI models can process. “You can’t pass in everything to an LLM,” he noted. “Some of the most valuable context, like institutional knowledge, soft cues, and ethical boundaries, lives outside the prompt window. That’s where human judgment steps in.”

This makes benchmark-setting especially tricky. As use cases become more complex and cultural expectations continue to evolve, the metrics we use to measure alignment, safety, or usefulness must evolve too. And that evolution, Paras argued, has to be guided by humans: not just product teams or model architects, but people with a deep understanding of the problem domain.

“It’s not a perfect science,” he admitted. “But as long as we keep humans close to the loop, especially where the stakes are high, we can keep grounding those benchmarks in reality.”

In short, defining success in AI is a constant process of recalibration, driven by human judgment, values, and the ever-shifting landscape of what we expect machines to do.

The unsolved challenge of human representation in AI

As the panel explored the complexities of benchmarking and alignment, Sara turned the spotlight onto a fundamental and unresolved challenge: human representation in AI systems.

“Humans aren’t consistent with each other,” she began. “At Prolific, we care deeply about sourcing data from representative populations, but that creates tension. The more diverse your data sources are, the more disagreement you get on the ground truth. And that’s a really hard problem; I don’t think anyone has solved it yet.”

Most human-in-the-loop pipelines today rely on contributors from technologically advanced regions, creating a skewed perspective in what AI systems learn and reinforce. While it may be more convenient and accessible, the trade-off is systems that reflect a narrow slice of humanity and fail to generalize across cultures, languages, or values.

Paras expanded on that point by reminding the group of what LLMs really are at their core: “stochastic parrots.”

“They learn by mimicking human language,” he said. “So, if humans are biased, and we are, models will be biased too, often in the same ways or worse.” He drew a parallel to broader democratic ideals. “We all believe in democracy, but how many people actually feel represented by the people they vote for? If we haven’t figured out representation for humans, how can we expect to figure it out for language models?”

That philosophical thread (the limits of objectivity, the challenge of consensus) keeps resurfacing in AI conversations, and with good reason. As Paras put it, “Almost every problem in AI eventually becomes a philosophical question.”

Vaidya added a practical layer to the discussion, drawing on his experience with AI-generated content. Even when a model produces something that’s technically accurate or politically correct, that doesn’t mean it fits the intended use. “You have to ask: is this aligned with the tone, context, and audience we’re targeting?” he said.

Vaidya emphasized the value of multi-perspective prompting, asking the model to generate outputs as if different personas were viewing the same content. “What would this look like to a middle-aged person? What would a kid want to see? If the answers are wildly different, it’s a signal to bring in a human reviewer.”
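
One lightweight way to implement that check, sketched here under assumed names rather than as Wayfair's actual tooling, is to run the same content through several persona prompts (via a hypothetical `call_llm` client) and flag it for a human reviewer when the verdicts diverge.

```python
# Illustrative multi-perspective prompting check (not Wayfair's tooling).
# The same content is judged through several persona prompts; if the verdicts
# diverge, a human reviewer takes over. `call_llm` is a hypothetical client.

PERSONAS = [
    "a middle-aged shopper furnishing a family home",
    "a college student on a tight budget",
    "a parent shopping for a child's bedroom",
]

def needs_human_review(call_llm, product_copy: str) -> bool:
    """Return True when persona-level judgments of the copy disagree."""
    verdicts = []
    for persona in PERSONAS:
        prompt = (
            f"You are {persona}. Does this product description read as "
            f"appropriate and appealing to you? Answer YES or NO.\n\n{product_copy}"
        )
        verdicts.append(call_llm(prompt).strip().upper().startswith("YES"))
    return len(set(verdicts)) > 1
```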

In short, representation in AI is about surfacing variability, noticing it, and knowing when to intervene. And as all three panelists acknowledged, this challenge is still very much in progress.

Will humans always be in the loop?

As the panel drew to a close, the moderator posed a final, essential question: Will humans always be part of the AI loop? And if not, where might they be phased out?

It’s a question that sits at the heart of current debates around automation, accountability, and the future of work, and one that Paras didn’t shy away from.

“I hope we’re a part of the process and that this doesn’t turn into a Terminator situation anytime soon,” he joked. But humor aside, Paras emphasized that we’re still searching for an equilibrium between human judgment and machine autonomy. “We’re not there yet,” he said, “but we’re getting closer. As we build more of these systems, we’ll naturally find that balance.”

Paras pointed to a few specific use cases where agentic AI systems, ones that can act autonomously without human intervention, have started to show real promise.

“Research and code generation are the two strongest examples so far,” he noted. “If you pull out the human for a while, those agents still manage to perform reasonably well.”

But beyond those narrow domains, full autonomy still raises red flags.

“The truth is, even if AI can technically handle something, we still need a human in the loop, not because we can do it better, but because we need accountability,” Paras explained. “We need someone to point the finger at when things go wrong.”

This is why, despite years of development in AI and machine learning, fields like law and medicine have remained cautious adopters. “It’s not that the technology isn’t there,” Paras said. “It’s that when things go south, someone has to be responsible.”

And that need for traceability, interpretability, and, yes, someone to blame is unlikely to disappear anytime soon.

In a world that increasingly leans on AI to make decisions, keeping humans in the loop may be less about capability and more about ethics, governance, and trust. And for now, that role remains irreplaceable.

Final thoughts: Accountability and skill loss in an AI-driven future

As the conversation on human oversight neared its conclusion, Vaidya added a final and urgent perspective: we may be underestimating what we lose when we over-automate.

“One thing to keep in mind,” he said, “is that when we talk about AI performance today, we’re often comparing the entry-level output of a model with the peak performance of a human.”

That’s a flawed baseline, he argued, because while the best human output has a known ceiling, AI capability is continuing to grow rapidly. “What models can do today compared to just six months ago is mind-boggling,” he added. “And it’s only accelerating.”

But Vaidya’s deeper concern wasn’t just about the rate of improvement; it was about the risk of atrophy.

“I was just chatting with someone outside,” he shared. “They said people are going to forget how to write. And that stuck with me.”

The fear is that humans will lose foundational skills before we’ve built the safeguards to do those tasks well through automation. “That’s the danger,” Vaidya said. “We’re handing over control while our own capabilities fade, and without proper checks, we won’t notice until it’s too late.”

To avoid that future, Vaidya made a call to action for leaders and organizations: to treat human skill preservation and accountability mechanisms as part of responsible AI adoption.

“It’s on us to design for that,” he said. “We need systems, formal or informal, that ensure we retain critical human capabilities even as we scale what machines can do.”

As AI continues to evolve, this perspective added a final layer of nuance to the panel’s core message: progress doesn’t just mean doing more, it means knowing what to protect along the way.