From vision to reality: charting the current state of AI assistants with Doug Imbruce
Nearly a decade ago, Google was making bold strides in automated technologies. They poured significant resources into their AI assistant, ‘Google Now,’ acquired pioneering companies like DeepMind and Nest Labs, and introduced Android Auto. They’d already clocked in millions of miles of testing their self-driving cars. With a strong focus on anticipatory experiences, Google’s vision seemed to be taking shape.
At this time, I couldn’t help but wonder where it was all leading. To explore this curiosity, I embarked on an endeavor to envision the future. I conceptualized an idea for a product called Google Buddy — an omnipresent AI assistant that seamlessly integrated into our lives, connecting with users on a deep, personal level. I decided to storyboard the idea as a TV spot as a means to discover the use cases and communicate this possible future.
Now here we are, a decade later, and I wonder how closely reality has aligned with this visionary concept. To help me answer whether we are on the brink of realizing these remarkable AI-centered user experiences, and to examine the technological and societal challenges that lie in the path, I enlisted Doug Imbruce, a serial entrepreneur and a true trailblazer in the field of AI.
I’ve had the pleasure of knowing Doug since 2008, and his entrepreneurial journey has been remarkable to watch. With multiple successful acquisitions under his belt, including Podz, acquired by Spotify in 2021, and Qwiki, acquired by Yahoo in 2013, Doug’s expertise in AI has propelled him to the forefront of innovation.
The following is my interview with Doug:
Doug, you’ve been working with AI since before the emergence of the large language model and long before the generative pre-trained transformer, or GPT, was a household name. Can you highlight your journey as an entrepreneur and the insights that led you to leverage AI so early on?
In 2008, I was on my way to the airport and googling information on “Buenos Aires” with a first-generation iPhone. After using that device for the first few weeks, two things were clear. One, that the future had arrived in the palm of my hand; and two, that the traditional search experience wasn’t adequate on a 10x smaller form factor. So I conceptualized a way to automatically transform search results into narrated “answers” for easy viewing on mobile.
Ultimately, executing this vision led me to Silicon Valley, where I built a team that created a product called Qwiki early in the AI adoption curve. We were among the first companies to pioneer “machine created media”, now referred to as generative AI. The big idea, creating content on the fly in response to web-scale search, proved too challenging, so we settled on a smaller corpus of information (the photos in your iPhone’s camera roll) and created an app that leveraged rudimentary AI to generate short stories to share. In 2013, the product seemed like magic. The machines are making videos!
Fast forward to today, and it’s clear we’re ready to unleash the full creative potential of AI. The state of the art is remarkable.
I’d like to use the “Love in the Connected Age” storyboards as a springboard to examine the current state of AI and how the space is evolving.
In our opening scene, we see our protagonist, Guy, facing eviction from his apartment by an irate life partner, Mako. Seeking an escape, Guy turns to his AI assistant, Buddy, and exclaims, “Get me out of here!” Buddy, with its ever-helpful demeanor, responds, “Where would you like to go?” And Guy, devoid of a specific destination, simply replies, “I don’t know. Anywhere. You’re in control.”
From your vantage point, how close are we to realizing this level of seamless and intuitive engagement with AI?
Given your “always on” assistant here has persistent access to the user’s voice, this is totally achievable. Sentiment analysis was actually the first piece of what we built at Podz, with a small team of three data scientists, and it was uncannily accurate in using simple indicators like laughter or pace of speech to elevate the most interesting portions of lengthy audio files.
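As a rough illustration of how simple indicators like laughter and pace of speech could surface the most interesting portions of a long recording, here is a minimal sketch. It is not the Podz system: the `Segment` fields, the scoring formula, and the `laughter_weight` parameter are all hypothetical, and the laughter counts are assumed to come from some upstream detector.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start_s: float         # segment start time, in seconds
    end_s: float           # segment end time, in seconds
    word_count: int        # words spoken in the segment (from a transcript)
    laughter_events: int   # laughs detected in the segment (assumed upstream detector)


def pace(segment: Segment) -> float:
    """Words per second; 0.0 for an empty or zero-length segment."""
    duration = segment.end_s - segment.start_s
    return segment.word_count / duration if duration > 0 else 0.0


def score(segment: Segment, baseline_pace: float, laughter_weight: float = 1.0) -> float:
    """Hypothetical interest score: faster-than-usual speech plus weighted laughter."""
    pace_lift = max(0.0, pace(segment) - baseline_pace)
    return pace_lift + laughter_weight * segment.laughter_events


def top_segments(segments: list[Segment], k: int = 1) -> list[Segment]:
    """Return the k highest-scoring segments, best first."""
    if not segments:
        return []
    # Use the average pace across the recording as the speaker's baseline.
    baseline = sum(pace(s) for s in segments) / len(segments)
    return sorted(segments, key=lambda s: score(s, baseline), reverse=True)[:k]
```

A production system would learn these weights from data rather than hard-code them; the sketch only shows that two cheap signals are enough to rank segments.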
It’s relatively straightforward to understand the emotional range of any person’s audio. All you need to achieve the outcome outlined here is a significant amount of input in the form of the subject’s voice, and the resources to train a model that corresponds to it. Also, given all the pre-existing work in the space, understanding “get me out of here” as a prompt to travel somewhere special is also straightforward. But I think the technical potential of creating this product is way less interesting than debating the quality of the user experience.
Given what you have storyboarded, I’m curious as to whether or not people will be comfortable with their AI interface being so emotionally responsive, or even have a desire for AI to cross that personal threshold. ChatGPT, for example, already provides plenty of utility without this extra, forced and very intimate level of engagement. It’s a topic explored in the incredible Spike Jonze movie “Her”, where the main character, Theodore, is so attached to the AI personality of his OS that he completely devalues human relationships. Theodore learns you want devices to behave more like your doctor, operating at a clinical distance, for fear of over-attachment. Seems like a reasonable conclusion.
As I crafted this scene, I envisioned an AI agent capable of learning and adapting to Guy’s individual idioms and expressions. It’s as if Buddy had become familiar with Guy over time like a true friend would. Are we currently at a stage where AI can effectively learn and incorporate our individual turns of phrase and possibly mirror our individual linguistic styles?
We’re not just closer to the vision in your ad, we’re there. A number of companies have created generative AI platforms that replicate human voices — top of mind are the various videos circulating online of fan-created songs using the voices of Drake, Kanye, etc. Another example of AI voice generation at global scale is the AI DJ Spotify introduced where a trademark, high-energy voice speaks just to you… recommending songs and mixing in tidbits of nostalgic facts just like a traditional radio DJ, but with the content customized on a user-by-user basis.
Spotify’s AI model is based on recording professional voice actor audio, deriving phonemes (units of speech) and spectrograms (representations of audio) and replicating them. Reverse engineering this approach to recognize the emotional content embedded in a given user’s audio would be a very straightforward next step. Again, the only roadblock would be training the model to recognize the nuances of each individual voice, which would be quite expensive at scale.
That said, the basic responsiveness you suggest in these storyboards might not even require a custom model. You’d be surprised how similar people are in their tone when they are upset, bothered, or frustrated. That’s what we learned from creating our LLM at Podz after analyzing over a million minutes of spoken word audio. People are not so unique in how they express their fundamental emotions.
Another aspect to consider is how Buddy, the AI assistant, could gather additional cues to understand Guy’s intended meaning. Could Buddy piece together semantic understanding by analyzing tone of voice, interpreting physical posture, or even triangulating meaning from a combination of various clues? Are there specific challenges we still need to address to enable AI to truly understand users on a multi-dimensional level?
AI is the result of brute force data collection — any desired output for which you can reliably collect input, you can predict and replicate. This was the breakthrough of the Transformer, a concept invented by Google that applied a “fill in the blank” approach to AI. The models generate output by making their best guess at what comes next using pre-existing examples. Given the car’s OS has access to the subject of your storyboards for hours a day, a predictive AI model navigating the subject to a relevant destination based on various inputs would be a straightforward feature to implement and build. After all, the model will be aware of almost every trip the subject has made in the past. All it has to do is correctly guess where he’s headed next given this additional emotional vector as input.
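To make the "best guess at what comes next using pre-existing examples" idea concrete, here is a deliberately tiny sketch of destination prediction. It is a plain frequency count over past trips, not a Transformer, and everything in it (the `DestinationPredictor` class, the mood and time-of-day labels) is a hypothetical stand-in for the richer inputs a real assistant would have.

```python
from collections import Counter, defaultdict


class DestinationPredictor:
    """Guess the next destination from past trips, conditioned on context."""

    def __init__(self):
        # Trip counts per (mood, time_of_day) context, plus overall counts as a fallback.
        self._by_context = defaultdict(Counter)
        self._overall = Counter()

    def record_trip(self, mood, time_of_day, destination):
        self._by_context[(mood, time_of_day)][destination] += 1
        self._overall[destination] += 1

    def predict(self, mood, time_of_day):
        """Most frequent destination for this context, else the overall favorite."""
        counts = self._by_context.get((mood, time_of_day))
        if not counts:
            counts = self._overall  # unseen context: fall back to all trips
        if not counts:
            return None  # no trip history at all
        return counts.most_common(1)[0][0]


predictor = DestinationPredictor()
predictor.record_trip("upset", "morning", "golf course")
predictor.record_trip("upset", "morning", "golf course")
predictor.record_trip("upset", "morning", "beach")
predictor.record_trip("upset", "night", "bar")
print(predictor.predict("upset", "morning"))
```

The emotional signal here is just one more key in the lookup. A learned model would replace the counting with a trained predictor, but the shape of the problem (history plus context in, best guess out) stays the same.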
With so much data on an individual’s daily behavior, additional use cases are only limited by the imagination of the team operating this platform. The interesting piece here is all the creative decision making: how to respond to which sentiment using what effective statements? Do all people respond to similar situations in similar ways? What about correcting for differing cultural norms, times of day, and other factors? For example — a fight with my girlfriend in the morning probably leads me to a golf course, at night — probably the bar! This is where the magic of product management becomes really important. There is so much raw material to exploit creatively, the use cases are virtually limitless.
As the scene unfolds, Guy grants Buddy agency. When asked for a destination, Guy says “I don’t know. Anywhere, you’re in control,” empowering the AI with the decision-making. What would be required to grant an AI agent such control, both technically and sociologically?
AI will be adopted along the path of least resistance — it will improve the accuracy of suggestions via reinforcement loops until it can predict any and all relevant outcomes, and the outcomes that have commercial potential will be prioritized by the companies building the software. As these outcomes converge on consumer wants, needs, and desires, eventually people will be comfortable with ceding decision making to AI.
As far as this particular situation goes, a prerequisite here seems to be that AI has solved all other problems for the protagonist, to the point where he’s comfortable ceding an intuitive, emotional decision to his OS. But there’s a caveat here that’s important. If the AI in this example is wrong, say it takes him to a bar instead of the beach, there isn’t much downside. But in much more critical decision-making, you would hope regulations or other incentives exist that require the AI to present options or confirm its decisions with a human-in-the-loop. You see this in the practice of radiology, for example. Even though it’s been demonstrated that machines make fewer diagnostic errors than humans in this instance, the number of radiologists employed hasn’t decreased. This is because the medical community is simply unwilling to cede full control to machines.
So in the matter of life and death outcomes, in my lifetime at least, I anticipate there will always be a human monitor or interpreter required.
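The distinction drawn above, between low-stakes guesses the AI can act on and critical decisions that need a human-in-the-loop, can be sketched as a simple gating rule. The `Action` names, the `high_stakes` flag, and the 0.9 threshold are illustrative assumptions, not a description of any real product or regulation.

```python
from enum import Enum


class Action(Enum):
    ACT = "act automatically"
    OFFER_OPTIONS = "present options"
    REQUIRE_HUMAN = "require human sign-off"


def decide(confidence: float, high_stakes: bool, auto_threshold: float = 0.9) -> Action:
    """Pick how much control the AI takes for one decision."""
    if high_stakes:
        # Life-and-death outcomes always go to a human, however confident the model is.
        return Action.REQUIRE_HUMAN
    if confidence >= auto_threshold:
        # Low stakes and high confidence: just drive to the beach.
        return Action.ACT
    # Low stakes but uncertain: let the user choose among suggestions.
    return Action.OFFER_OPTIONS
```

For example, `decide(0.95, high_stakes=False)` yields `Action.ACT`, while `decide(0.99, high_stakes=True)` still yields `Action.REQUIRE_HUMAN`.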
As the story progresses, Buddy takes the forlorn hero on their journey. The AI assistant intuitively selects moody tracks to complement the somber mood of our protagonist as it chauffeurs him to an unknown destination. Finally, Buddy announces their arrival and presents surf conditions on the car’s heads-up display. All along, Buddy demonstrates deep integration with the car’s operating system. Can you envision a future where companies like Google or Apple collaborate with automotive manufacturers like Ford to deeply integrate their AI assistants into the vehicle?
Car companies are terrible at building software. Most manufacturers are ceding control of the user-facing software layer to Apple, with CarPlay, or Google, with Android Auto. The only holdout will obviously be Tesla, with rumors of Tesla actually swimming downstream to introduce a mobile device to compete with Android and iPhone — could be potentially interesting.
Apple and Google are the most valuable companies in the world for a reason: as AI becomes the default interface, replacing your phone’s home screen or laptop’s keyboard as the primary means of input into your computing device, the aperture for customer engagement closes even further, and whoever has access to that transactional threshold wins. Integrated AI products will have a big role to play in your car, where we obviously spend a lot of our time, and will soon be doing so passively. Car manufacturers will be irrelevant in this new world. They have lost any opportunity to directly engage with their customers. It could be less than a decade before the introduction of Apple or Google cars overwhelms the traditional auto industry, and those two companies share the podium with Tesla in terms of how people are getting around. The car is just another piece of hardware.
Buddy’s choice of destination aims to uplift Guy’s spirits. To accomplish this, Buddy must possess knowledge about Guy, including his love for surfing and his preferred surf conditions, and cross-reference that with information on surf conditions from an entity like Surfline to find the right beach with the ideal conditions. Buddy is also aware that Guy’s mood tends to improve after a surf session. Are we progressing towards a future with such comprehensive personal graphs that encompass individual preferences, interests, and emotional states? Can we combine this personal graph with the capabilities of an AI agent and other data sources to proactively bring joy and enhance someone’s day?
Moments of surprise and delight — joy! — are the primary elements of any consumer technology product. We often talk about the “time to magic” in a product — a user experience that results in a “magic moment” within 5 or 10 seconds of setup. Given the data most of these large companies have — including your search history, content preferences, relationships, retargeting profile — what can be achieved now with model training provides an almost unlimited supply of potential serendipity.
The potential is so vast given the amount of data. My favorite conversation on this subject involves friends outside the tech industry who have the fear that their device is being used by various apps to eavesdrop on them. “I was just discussing visiting Italy with my wife and now I’m seeing ads for a hotel in Positano!” There are obviously some bad actors that abuse microphone access, but it turns out humans are very predictable when you have a powerful machine aware of all the articles we read, photographs we like, people we talk to, and things we buy.
Next, we see Guy surfing while Buddy records the session on the dashcam. This scenario raises some interesting questions about the future of AI, privacy and surveillance.
Vanity is a wonderful motivator to encourage human beings to sacrifice privacy for utility. Social media is already essentially an always-on recording device for many, and I think social expectations have changed so much in the last 10 years (OnlyFans became an acceptable side hustle) that the bar for behavior is dropping and the expectation for transparency is rapidly increasing. A working person’s resume is now their Twitter feed, and the passive generation of high-quality content will be necessary to feed the social media machine.
I think the changing hardware form factor will obviously play a critical role. The only future technology you missed describing in this story was the protagonist wearing a pair of goggles that records a first-person view of their surf outing. We saw a demo of this feature just last week when Apple unveiled their Vision Pro product. In a few years, when Vision Pro is at the price point and form factor where it makes sense to use it during a physical activity, you will absolutely see passive content creation and publishing become the norm.
As we discussed, when publishing recorded content, AI will present options for sharing until the accuracy is so high that the user trusts it to make full scale publishing decisions. Personal publishing is already so ubiquitous I doubt this will be perceived as anything more than a convenience, and certainly won’t be controversial.
After Buddy posts a highlight video of Guy’s surf session to social media, Buddy displays the message “5pm Let’s roll.” Buddy has a plan in mind and is actively keeping Guy on track. This particular interaction aligns closely with the typical consumer vision of an AI assistant: optimizing schedules and ensuring users stay punctual. However, despite our aspirations, we have yet to fully achieve this level of AI assistance.
Of all the problems being presented by humans and potentially solved by AI, sorting an individual’s schedule is among the most straightforward. What’s interesting here is that there’s a counterparty involved: your protagonist is probably headed to meet up with friends on the weekend, or maybe to see his therapist. So the AI begins to act on the user’s behalf, spinning up an agent to interact with another agent to negotiate an outcome. In this case, the outcome is a desired time and place to meet.
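A minimal sketch of two agents negotiating a meeting time might look like the following. The `SchedulingAgent` class and its propose/accept exchange are hypothetical simplifications: real calendaring agents would also weigh preferences, locations, and travel time.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SchedulingAgent:
    owner: str
    # Hours of the day (24-hour clock) when the owner is free.
    free_hours: set[int] = field(default_factory=set)

    def propose(self) -> list[int]:
        """Offer the owner's free hours, earliest first."""
        return sorted(self.free_hours)

    def accepts(self, hour: int) -> bool:
        """Accept a proposal only if the owner is free at that hour."""
        return hour in self.free_hours


def negotiate(initiator: SchedulingAgent, counterparty: SchedulingAgent) -> Optional[int]:
    """Return the earliest hour both agents agree on, or None if there is none."""
    for hour in initiator.propose():
        if counterparty.accepts(hour):
            return hour
    return None


guy = SchedulingAgent("Guy", {15, 17, 19})
friend = SchedulingAgent("Friend", {9, 17, 19})
print(negotiate(guy, friend))
```

Here the two agents settle on hour 17, which lines up with the storyboard's "5pm Let's roll." Neither agent sees the other's full calendar; each only answers yes or no to a proposed hour.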
We’re only a few months away from having ChatGPT plugins execute versions of this with live information, and I know AI calendaring in particular has been a solved problem for years, so this piece of your vision is extremely real. The machines are talking to machines! It’s happening.
As the story culminates, Buddy takes Guy to their next destination. With a touch of human-like anticipation, Buddy posits, “I think you’re going to like this.” It’s a humble, slightly coy expression, and Buddy is keeping its surprise close to its chest. Can we truly achieve such human-like moments with AI? What are the possibilities, challenges, and potential pitfalls of achieving human-like moments with AI while avoiding unsettling or bizarre interactions?
Anyone who has used ChatGPT has probably noticed the overly formal tone, almost as if you’re speaking to a first grader’s impression of the overly cautious and absurdly formal maître d’ at a fancy restaurant. This was clearly an editorial decision by the OpenAI team, and I am super curious as to the training data they used to establish that peculiar tone. As of now, AI assistants cannot mimic human emotion and engagement to the level where they are indistinguishable from human beings. Think about how few actual human beings can perform in conversational settings to meet the standard of successfully engaging with a large group of people: the list would be short and include a few popular politicians, podcasters, broadcasters, etc.
But this gap will close. All it takes is more data and more computing power to begin to predict and mimic the nuances of even the most capable conversationalists, and one bright day, you’ll see a tweet and use a product where the AI is indistinguishable from a human. That’s when these questions get really interesting, and where we really will need more personalization, because what’s unsettling, or bizarre, or uncomfortable to one person, is not to another. Each audience is unique. Not to mention the regulatory response required to manage the lack of perceivable reality.
In the final frames of our storyboard, Guy is enjoying Buddy’s selected drive-in movie (the AI assistant’s choice turned out to be a good one) when Guy receives a notification on his heads-up display that a friend is nearby. As it turns out, the friend is Mako, and a message from Buddy flashes on Guy’s window, imploring, “Time to make up?”
This suggests that Buddy not only understands Guy’s social connections but also interacts with other AI assistants, such as Mako’s own ‘Buddy.’ This was an orchestrated encounter, not a mere coincidence, meticulously planned by the Buddy AI system. The Buddy system has an understanding of Guy and Mako’s relationship dynamics, including the duration of a typical fight. By arranging their meeting at the right time, Buddy achieves a successful outcome — their reconciliation.
On one hand, this showcases the immense potential of AI as an assistant who can improve our moods and facilitate healthy interpersonal relationships. Imagine having an AI companion capable of not only understanding our emotions but also proactively guiding us towards reconciliation after a disagreement. However, on the other hand, ethical questions arise. We must carefully consider the ethics of AI systems that manipulate human behavior. While it may seem positive in the context of helping individuals mend their relationships, we need to establish clear boundaries. How do we determine when we have crossed the line as designers and developers? When does AI manipulation become too intrusive or potentially harmful?
Many brighter minds than me have pondered AI ethics. Maybe AI developers need their own version of the Hippocratic Oath. In general, we won’t know how far to push these use cases until we’ve broken something. That’s usually how technology finds its limits. Hopefully that doesn’t result in Skynet and nuclear armageddon.
I think it’s equally terrifying and compelling to offer full deterministic control to AI. But it’s also remarkable to consider how much of our lives is already controlled and dictated by technology: every time you have surgery or fly in a plane you’re largely counting on a computer to keep you alive, and a computer dictates if you’re qualified to buy a house or even qualify for a job. Even something as trivial as what food to order for take-out is decided by AI, which has advantages and disadvantages. For example, when I fall into bad patterns with eating, Uber recommends I order one pizza after another instead of suggesting I revert to my healthier food choices. How dangerous is that over one, two, three years? Just ask my cardiologist!
Sooner than we think, we’ll realize we’ve ceded a large portion of our day-to-day control to an AI model we’re constantly training via reinforcement of our existing decisions. So until human beings start making better decisions, the AI won’t guide us toward them, unless some entity specifically intends that outcome. And then, who is to say what is a better decision? For Uber, whatever creates the most revenue is the right outcome. In the early days, it will come down to the profit potential of the companies creating these experiences. Or regulatory involvement.
For better or worse, technology is a mirror of our individual behavior and collectively, of our culture. We get the heroes we deserve, right?
Thank you for joining me on this exploration of the story and its implications. Before we wrap up, I’m curious to know if you can share anything about what you’re up to lately and what exciting ventures lie ahead for you?
This is the first time in my life I’ve really enjoyed just being a fan — being able to watch these incredible developments come online and cheer from the sidelines. I’ve also spent a lot of time since the Podz acquisition getting healthy and reconnecting with friends: all the stuff you lose when you’re working 100 hours a week on your startup. But with Spotify recently launching the Podz platform and product into the Spotify app, our work has reached about 250,000,000 people, and that’s encouraging me to seek new opportunities for impact. It was really remarkable to experience. Daniel Ek labeled it the “largest update in the platform’s history”.
As I get older, I’m seeing more and more friends and family suffer from health-related issues, and I think that’s the next category ripe for disruption, so I’d like to find a way to add value there. With Qwiki, we positively impacted millions of people’s emotional lives by helping them easily share important memories; at Podz, we helped folks discover engaging, thoughtful audio creators and hopefully improved their intellectual lives; and now I want to find a way to help millions of people feel 10x better physically.
Really enjoyed connecting with you today! This was fun.
— — —
Are you someone thinking deeply about this space? Interested in being part of our next discussion? Reach out to join us for our next chat.