How machine learning algorithms figure out what you should watch next

Curation at scale means processing a lot of data with a good algorithm.


You sit down in front of your television or flip to the streaming app on your smartphone. What do you choose to watch?

Determining what shows and movies ended up in front of a viewer used to be a very manual, human-led process. An individual would see what content was available, figure out what demographics watched when, and schedule shows and movies in time slots likely to have the right viewers.

With a streaming service, however, there are no schedules. Everything is available anytime. Getting the right shows in front of the viewer when they’re ready to watch becomes the central problem.

What was once a purely human process has now evolved thanks to advancements in machine learning technology. At Warner Bros. Discovery, we’ve been using machine learning to surface the movies and shows that will most resonate with our viewers. Our editorial teams have long picked what they thought were the best programs from our libraries, but one person’s favorite won’t always appeal to another person. So, like a lot of industries, we’ve turned to machine learning and user data to make our digital experiences better.

Our goal is always to make our viewers’ experiences easier and simpler so they find the content that they want to watch quickly. No one in the industry has fully cracked this problem, which is what makes it so exciting.

In this article, we’ll talk about what we’re doing with ML to ensure that your new favorite show is waiting for you when you start up Discovery+ or HBO Max.

Moving from a human process to a machine process

At its simplest, recommendation is based on patterns. If you like science fiction, you’re likely to watch more science fiction movies. In our studies, we found that the average viewer sticks to five or six genres. They aren’t the same genres for every viewer, so coming up with a generic navigation sort (even an alphabetical one) can be difficult. You could just surface the most popular programs, but then you’d neglect your long-tail content.

The simplest automation we can do is ensure that a user’s favorite genres are easiest to access. We did this both on the browse page, where a user clicks into a list of TV shows and movies and sees the genres available, and on the user’s home page. The construction of that home page needs to be personalized so that the user isn’t scrolling and scrolling to get to the genre of shows that they watch all the time.
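As a rough illustration of that idea, here is a minimal Python sketch that moves a viewer's most-watched genres to the front of a genre list. The function name, the genre labels, and the use of raw watch counts are all assumptions made for the example; the article does not describe how the production system actually orders genres.

```python
from collections import Counter

def order_genres(all_genres, watch_history):
    """Return all_genres with the viewer's most-watched genres first.

    watch_history is a list of genre labels, one per program watched
    (a hypothetical input format chosen for this example).
    """
    counts = Counter(watch_history)
    # sorted() is stable, so ties (including unwatched genres, which
    # count as 0) keep the order they had in all_genres.
    return sorted(all_genres, key=lambda genre: -counts[genre])

catalog_genres = ["Action", "Comedy", "Documentary", "Drama", "Sci-Fi"]
history = ["Sci-Fi", "Documentary", "Sci-Fi", "Drama", "Sci-Fi", "Documentary"]
print(order_genres(catalog_genres, history))
```

Genres the viewer has never watched stay in their default order at the end, so nothing disappears from navigation; it is just further down.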

A human editor would go through those genres and pick the movies or shows they think are the best: the gems. But a single editor, no matter how great their taste, won’t be able to pick winners for everyone. We capture data on users’ histories, the interactions they make on the site, and various other signals that tell us what they are interested in. We use deep learning algorithms that run these histories through sequence-based models to estimate the probability that a viewer will want to watch any given show. We then rank the content by how likely it is to appeal to the viewer and send that ranking to them. Those are their gems, based on the data they are supplying us.
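The ranking step itself is simple once a model produces scores. The sketch below is a toy stand-in, not the sequence-based deep learning model described above: it assumes each viewer and each title is already represented by a small vector, turns their dot product into a probability with a sigmoid, and sorts by that probability. The vectors, title names, and function names are all hypothetical.

```python
import math

def sigmoid(x):
    """Squash a real-valued score into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def rank_titles(viewer_vector, title_vectors, top_k=3):
    """Score every title for one viewer and return the top_k as
    (title, probability) pairs, most likely first."""
    scored = []
    for title, vector in title_vectors.items():
        # Dot product as a stand-in for a real model's output logit.
        logit = sum(v * t for v, t in zip(viewer_vector, vector))
        scored.append((title, sigmoid(logit)))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Hypothetical two-dimensional vectors, invented for the example.
viewer = [1.0, -0.5]
catalog = {
    "Space Saga": [2.0, 0.0],
    "Baking Duel": [-1.0, 1.0],
    "Robot Noir": [1.0, -1.0],
    "Quiet Drama": [0.0, 0.5],
}
for title, probability in rank_titles(viewer, catalog):
    print(f"{title}: {probability:.2f}")
```

In a real pipeline the scores would come from a trained model, but the final step is the same: sort by predicted probability and send the top of the list to the client.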

Of course, we don’t just want to serve you the content that you already like. Human editors are very good at finding a wider group of connections between media. They’ll recommend something not because the metadata says there’s an action sequence here and a romantic sequence here, but because the editor is connecting dots that may not be easily translatable into labels. You enjoyed a film from this director; maybe you’ll like their work in a genre you don’t typically explore. Pandora tried this model for music by having human editors explicitly build in links between songs.

This web of connections that creates diverse content and delightful experiences is what we’re actively exploring now, except that we are trying to use our ML program to infer those connections. Whether that’s connecting watch patterns, looking at metadata, or extracting cues from the content itself, we want to create a richer pool of content than what would be available just from genre signals.

Warner Bros. Discovery is shifting overall from being very heavily editorial and human-driven to more ML-heavy. One of the places where we recently made inroads into the editorial culture is what we call the hero panel, the big panel at the top that shows a single preview for a featured show. Our editors have traditionally picked what goes there: no machine, just a constantly rotating set of picks. Right now, we’re turning this into a machine learning problem, trying to figure out how to personalize that space with a constantly rotating set of programs relevant to the person viewing it.

The machines that recommend you movies

There are a lot of options and tools for creating ML solutions today. We’re mostly an AWS shop, and we started our ML journey using a lot of their services, including SageMaker for model training and deployment pipelines. We used Amazon Personalize for our initial recommendation engines; it let us get started quickly and worked very well on most problems.

Now we’re building our own models in TensorFlow. If you want richer evaluation frameworks, faster turnaround times, and more control over the learning techniques and algorithms used, that’s the next step. Our custom models perform as well as, if not better than, what the industry and AWS provide. And we’re looking to build ML pipelines that serve our specific use cases without relying on these generic frameworks.

We’re not looking to reinvent the wheel; there are a lot of open-source technologies and enterprise solutions that we’re considering adding to our stack. We’re looking at technologies like Feast for the feature store, inference engines like KServe, and MLflow to manage our experiments and deployment pipeline. With our custom tooling and the excellent open-source technologies on the market, we can design ML solutions that handle our particular use cases.

In fact, ML tooling in general has come a very long way. The bar for getting started has been lowered so much over the last decade that you can build a very effective ML pipeline just using out-of-the-box tools. With hardware advances and the algorithms you can leverage, you can bootstrap a very effective solution that makes inferences in under a millisecond.

If you want to develop a richer evaluation framework and go deeper into your training data sets, that’s when you can start diving into customization. We’ve been developing our own models and pipelines to give us more control over the learning techniques and enable faster turnaround times on our datasets. Then we can build on the solutions we’ve bootstrapped.

Of course, the tooling, algorithms, and models aren’t the hardest parts of machine learning. The data is.

The real issue is the data

ML code is a small part of a larger puzzle: the data. Combing through a massive pile of data and metadata to determine features and decide how to apply semantics is both difficult and essential. If you’ve ever gone through an ML tutorial, the data is provided to you. But in real applications, the data is never as high-quality as you’d like. You end up wrestling the data into shape for your models and only then training them, and the data management part is where much of our time is spent.

Some of the open-source tools are so good that you could write two lines of code in TensorFlow and have yourself an ML application. But then you need to deploy it, and when you deploy in a real business scenario, you need to run through a series of checklists. The pipeline needs to operate in real time, scale quickly, be maintainable, and remain transparent enough for us to assess whether we’re following the right signals and encouraging users in a healthy direction.

Take a simple signal: watch time. If a viewer watches more of a program, they probably like it, and we can use that to infer other programs that they might like. Pretty straightforward. But that data needs to flow back from the viewer to our systems. The content streams to the client, often buffering more than needed to prevent interruptions. For our recommendations to serve accurate content, this data needs to flow back in nearly real time. If the viewer hates a show and clicks back to the home page, that page needs to be ready to refresh with new recommendations.

This ends up being petabytes of data on a daily basis, and this data needs to be aggregated and passed to our backend systems. That data coming from the client does not come in an easily consumable format, so massaging it into a format that could be aggregated and fed into our models was one of the most challenging tasks we faced.
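To make the aggregation step concrete, here is a small Python sketch that rolls raw playback events up into a fraction-watched signal per viewer and program. The event fields (`viewer_id`, `program_id`, `seconds_watched`) and the durations table are hypothetical; the article does not describe the real client payload, and a production version would run as a streaming job over far more data rather than as an in-memory loop.

```python
from collections import defaultdict

def fraction_watched(events, durations):
    """Sum watch time per (viewer, program) and convert it to a
    fraction of the program's duration, capped at 1.0."""
    totals = defaultdict(float)
    for event in events:
        key = (event["viewer_id"], event["program_id"])
        totals[key] += event["seconds_watched"]
    return {
        # key[1] is the program_id; the cap handles rewatching.
        key: min(seconds / durations[key[1]], 1.0)
        for key, seconds in totals.items()
    }

# Hypothetical program lengths in seconds.
durations = {"p1": 3600, "p2": 1800}
events = [
    {"viewer_id": "v1", "program_id": "p1", "seconds_watched": 600},
    {"viewer_id": "v1", "program_id": "p1", "seconds_watched": 1200},
    {"viewer_id": "v1", "program_id": "p2", "seconds_watched": 90},
    {"viewer_id": "v2", "program_id": "p1", "seconds_watched": 4000},
]
signals = fraction_watched(events, durations)
for key, fraction in sorted(signals.items()):
    print(key, round(fraction, 2))
```

A low fraction, like the viewer who left the second program after 90 seconds, is the kind of signal that should reach the home page quickly so it can refresh with different recommendations.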

But percentage watched is a pretty basic metric, and it doesn’t tell us a whole lot about what the viewer liked about the program. One of our big metrics is content return on investment: how much viewership a program is getting based on our investment in it. Part of what we want from the signals that viewers send back to us is the ability to better understand the content of the videos themselves without relying on a human curator. We’re only scratching the surface of extracting metadata and features from videos, and are actively trying to determine if there is more we can learn about our content from ML.

Machine learning is always changing, as are our algorithms, so as we update models and iterate based on our data, we need a good way to evaluate whether the models and our changes are getting us the results that we want. We run a lot of experiments: side-by-side evaluations of models against various target metrics. As users interact with shows, genres, or sections of the app, we want to feed that information back into our models.
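One common way to run a side-by-side comparison offline is to replay held-out viewing data against each model's recommendations and compute the same metric for both. The sketch below uses hit rate at k (how often the program a viewer actually watched next appears in the model's top k). This metric is an assumption chosen for illustration; the article does not say which target metrics are used, and the viewer IDs, titles, and model names here are invented.

```python
def hit_rate_at_k(recommendations, watched_next, k=3):
    """Fraction of viewers whose next-watched program appears in
    the first k recommendations made for them."""
    hits = 0
    for viewer, actual in watched_next.items():
        if actual in recommendations.get(viewer, [])[:k]:
            hits += 1
    return hits / len(watched_next)

# Held-out ground truth: what each viewer actually watched next.
watched_next = {"v1": "B", "v2": "G", "v3": "E", "v4": "A"}

# Ranked recommendations from two hypothetical models.
model_a = {
    "v1": ["A", "B", "C"],
    "v2": ["D", "E", "F"],
    "v3": ["A", "C", "E"],
    "v4": ["B", "D", "F"],
}
model_b = {
    "v1": ["B", "A", "D"],
    "v2": ["G", "D", "E"],
    "v3": ["A", "B", "C"],
    "v4": ["A", "C", "E"],
}

for name, recs in [("model_a", model_a), ("model_b", model_b)]:
    print(name, hit_rate_at_k(recs, watched_next))
```

An offline number like this is only one view of a model; it would normally sit alongside other metrics and live experiments before any change ships.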

The risk is always that we’re biasing too heavily on one metric or another. If our sole metric were watch time, then the algorithms would optimize for that, and those numbers would go up. But are the viewers picking content that is meaningful to them? Are we directing them to videos that they like, or are we just throwing a bunch of content at them until something sticks? Leaning too heavily on a single metric can cause you to neglect your overall macro health, which may have unintended second-order consequences for the rest of your content.

Watching what you watch

Warner Bros. Discovery has a content library that spans almost a hundred years, and we want to get our programs in front of people who will love them. Our ML program is trying to use the signals that viewers give us in order to give them their next favorite show.

If you’re interested in being part of the next generation of ML-powered recommendation engines, we’re hiring.

The Stack Overflow blog is committed to publishing interesting articles by developers, for developers. From time to time that means working with companies that are also clients of Stack Overflow’s through our advertising, talent, or teams business. When we publish work from clients, we’ll identify it as Partner Content with tags and by including this disclaimer at the bottom.
