Measuring gender bias in films using AI

Comparing our results with Google’s

Published Jul 3, 2018


Inspired by the 2017 Google/GD-IQ study on gender bias in the 300 top-grossing US movies of 2014–2016, we used Demografy to measure the share of women in the main cast of the same movies by analyzing cast names. Although the two studies use different technologies and measure slightly different indicators, we got very close results and decided to publish them.

Demografy is a GDPR-friendly SaaS platform that uses AI-based noninvasive technology to predict demographic data from names alone. It can be used to get demographic insights or to fill in missing demographic data in marketing lists.

Unlike traditional solutions, it does not require businesses to know or disclose their customers' or prospects' addresses, emails or other sensitive information. This makes Demografy private by design and lets businesses get 100% coverage of any list, since all they need to know is names.

Difference in measured indicators

First of all, we'd like to highlight the difference between the two studies.

  • Google measured the share of female on-screen time by analyzing video frames with detected human faces
  • Demografy measured the share of women in the cast by analyzing the names of the first 10 cast members, assuming that the first 10 actors in the cast list play the main characters in the movie

Although the two indicators differ and we did not expect a very close match, they are related closely enough to be compared.

For readers interested mainly in the results, we start with the key findings and then describe our methodology in more detail.

Key Findings


This study was not intended as a precise accuracy evaluation, because of the differences in measured indicators mentioned above. However, it independently confirmed all key findings of Google's research and showed the same pattern of female representation across genres, MPAA ratings and Oscar-winning vs all movies. Below is a comparison of the results found in Google's and Demografy's studies:


As in our previous study measuring Twitter demographics, we wanted to apply Demografy technology to a real-world task whose results can be cross-verified against a reputable source. Google's 2017 study was a perfect choice for this purpose.

About Google's study. Google's research offers hard data on gender disparities in films. It compares the screen time of male and female characters in movies. With support from, the Geena Davis Institute on Gender in Media teamed up with Google to develop software that accurately measures how often we see and hear women on-screen. Dubbed the Geena Davis Inclusion Quotient (GD-IQ), the algorithm uses machine learning to recognize the patterns required to detect different characters on-screen, determine their gender, and calculate how often and for how long they spoke in relation to one another.


In this research we measured the gender distribution of the cast in the top-grossing US movies of 2014–2016 and compared it to Google's 2017 research. For this purpose we collected cast information for the same movies that were used in Google's research.

Collecting data

  1. We took the initial list of movies from Google's research paper. The paper does not list movie titles but provides enough other data to identify each movie. So we manually identified the 300 movies and compiled them into a spreadsheet based on year, MPAA rating, gross earnings and list of genres.
  2. Then we aggregated cast information and other data for each movie. For this purpose we implemented a simple console application. The application used two free movie APIs, OMDb API (The Open Movie Database) and TMDb (The Movie Database), to collect extra movie data based on movie title and IMDb ID. The collected data included the following key information for each movie: MPAA rating, list of genres, IMDb ID, list of awards (including Oscars) and cast. TMDb was used to collect the cast and the list of genres, while OMDb API was used to collect the MPAA rating and the list of awards using the previously collected IMDb ID.
  3. All collected data was stored in an SQL database as a graph in which each cast member is associated with one or more movies and the respective characters.
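The storage step above can be sketched as follows. This is a minimal illustration only, assuming SQLite and a hypothetical schema; the table names, column names and the `open_db` and `store_movie` helpers are our assumptions for the example, not the schema actually used in the study.

```python
import sqlite3

# Hypothetical schema: movies and people are the nodes of the graph,
# cast_members holds the edges (who played which character in which movie).
SCHEMA = """
CREATE TABLE IF NOT EXISTS movies (
    imdb_id     TEXT PRIMARY KEY,
    title       TEXT NOT NULL,
    year        INTEGER,
    mpaa_rating TEXT,
    won_oscar   INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE IF NOT EXISTS movie_genres (
    imdb_id TEXT NOT NULL REFERENCES movies(imdb_id),
    genre   TEXT NOT NULL,
    PRIMARY KEY (imdb_id, genre)
);
CREATE TABLE IF NOT EXISTS people (
    person_id INTEGER PRIMARY KEY,
    name      TEXT NOT NULL,
    gender    TEXT              -- filled in later by the gender-detection step
);
CREATE TABLE IF NOT EXISTS cast_members (
    imdb_id    TEXT NOT NULL REFERENCES movies(imdb_id),
    person_id  INTEGER NOT NULL REFERENCES people(person_id),
    character  TEXT NOT NULL,
    cast_order INTEGER NOT NULL, -- position in the cast list, 0 = first billed
    PRIMARY KEY (imdb_id, person_id, character)
);
"""


def open_db(path=":memory:"):
    """Open a database and make sure the tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def store_movie(conn, movie, cast):
    """Store one movie, its genres and its cast (kept in billing order)."""
    conn.execute(
        "INSERT OR REPLACE INTO movies VALUES (?, ?, ?, ?, ?)",
        (movie["imdb_id"], movie["title"], movie["year"],
         movie["mpaa_rating"], int(movie["won_oscar"])),
    )
    conn.executemany(
        "INSERT OR IGNORE INTO movie_genres VALUES (?, ?)",
        [(movie["imdb_id"], genre) for genre in movie["genres"]],
    )
    for order, member in enumerate(cast):
        # A person is stored once, even if they appear in several movies.
        conn.execute(
            "INSERT OR IGNORE INTO people (person_id, name) VALUES (?, ?)",
            (member["person_id"], member["name"]),
        )
        conn.execute(
            "INSERT OR IGNORE INTO cast_members VALUES (?, ?, ?, ?)",
            (movie["imdb_id"], member["person_id"], member["character"], order),
        )
    conn.commit()
```

Keeping people and movies in separate tables, linked through `cast_members`, is what lets one actor be associated with several movies and characters without duplicating the person record.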

Demografying cast information

After collecting the data we used Demografy to detect the gender of each cast member using only his or her name. Each detected gender was saved to the database. At this point we had all the necessary data stored in the SQL database. For the purposes of the research, we then ran SQL queries to select lists of cast for the following movie groups:

  • By MPAA rating: PG, PG-13, R
  • By whether movie won any Oscars: Oscar-winning movies (those having at least one Oscar among awards), all movies
  • By movie genre (movies containing one of the following genres in their list of genres): Horror, Romance, Comedy, Sci-Fi, Drama, Biography, Action, Crime
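The group selections listed above could be expressed as queries along these lines. This is only a sketch under assumptions: the simplified tables, the `won_oscar` flag and the `cast_for_group` helper are hypothetical and are not the queries actually run in the study.

```python
import sqlite3

# Hypothetical, simplified tables with only the columns the queries need.
GROUP_SCHEMA = """
CREATE TABLE movies (imdb_id TEXT PRIMARY KEY, mpaa_rating TEXT, won_oscar INTEGER);
CREATE TABLE movie_genres (imdb_id TEXT, genre TEXT);
CREATE TABLE cast_members (imdb_id TEXT, name TEXT, gender TEXT, cast_order INTEGER);
"""

BASE_QUERY = """
SELECT c.imdb_id, c.name, c.gender, c.cast_order
FROM cast_members AS c
JOIN movies AS m ON m.imdb_id = c.imdb_id
"""

# One WHERE clause per kind of movie group.
GROUP_FILTERS = {
    "all": "",
    "rating": " WHERE m.mpaa_rating = ?",
    "oscar": " WHERE m.won_oscar = 1",
    # A movie belongs to a genre group if the genre appears in its genre list.
    "genre": " WHERE EXISTS (SELECT 1 FROM movie_genres AS g"
             " WHERE g.imdb_id = m.imdb_id AND g.genre = ?)",
}


def cast_for_group(conn, kind, value=None):
    """Return cast rows for one movie group, ordered by movie and billing."""
    where = GROUP_FILTERS[kind]
    params = (value,) if "?" in where else ()
    sql = BASE_QUERY + where + " ORDER BY c.imdb_id, c.cast_order"
    return conn.execute(sql, params).fetchall()
```

For example, `cast_for_group(conn, "rating", "PG-13")` would return the cast of all PG-13 movies, and `cast_for_group(conn, "genre", "Sci-Fi")` the cast of every movie whose genre list contains Sci-Fi. Because a movie can have several genres, the same movie may appear in more than one genre group.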

For each list of cast we applied the following pre-processing steps:

  • Filtered out all non-actors, leaving only actors with associated characters in each movie
  • Selected only the first 10 characters, assuming that they represent the main characters in the respective movies
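The two pre-processing steps above can be sketched in a few lines. This is a minimal illustration, assuming each credit is a dictionary with `name`, `character` and `order` keys (modeled on the shape of TMDb cast entries); the `main_cast` helper is hypothetical and not the code used in the study.

```python
def main_cast(credits, limit=10):
    """Keep credits that have a character name, then take the first `limit`.

    `credits` is a list of dicts with "name", "character" and "order" keys,
    where a lower "order" means an earlier position in the cast list.
    """
    # Step 1: drop entries without an associated character (non-actors).
    actors = [c for c in credits if (c.get("character") or "").strip()]
    # Step 2: keep billing order and cut the list at the first `limit` entries.
    actors.sort(key=lambda c: c["order"])
    return actors[:limit]
```

The cut-off of 10 is the assumption stated in the text: the first 10 cast members are treated as the main characters, which will not hold equally well for every movie.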

Then we calculated the share of women in each list and compared it with the corresponding figure in Google's study.
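As a rough sketch, the share of women in a cast list could be computed as below. The label values `"female"` and `"male"` and the choice to skip cast members without a predicted gender are assumptions made for this example; the article does not say how such cases were handled.

```python
from collections import Counter


def female_share(genders):
    """Return the percentage of women among cast members with a known gender.

    `genders` is an iterable of predicted labels, one per cast member.
    Any label other than "female" or "male" is ignored (an assumption).
    """
    counts = Counter(g for g in genders if g in ("female", "male"))
    total = counts["female"] + counts["male"]
    if total == 0:
        raise ValueError("no cast members with a predicted gender")
    return 100.0 * counts["female"] / total
```

For a group of movies, the input would be the predicted genders of all selected cast members across the movies in that group.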


Although we measured slightly different indicators, we found a very similar share of female characters and the same pattern across genres, MPAA ratings and Oscar-winning vs all movies. All movie groups have an error rate of less than 3.5%, except the Horror, Biography and Crime genres.
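One way to read the error rate above is as the absolute difference, in percentage points, between the two shares for each movie group; the article does not state the exact formula, so this reading is an assumption. The sketch below uses a hypothetical `compare_shares` helper, and any numbers passed to it in an example are placeholders rather than results from either study.

```python
def compare_shares(ours, reference, threshold=3.5):
    """Compare two sets of per-group shares given in percent.

    Returns a list of (group, absolute_difference, within_threshold) tuples,
    one per group in `ours`, sorted by group name.
    """
    rows = []
    for group in sorted(ours):
        diff = abs(ours[group] - reference[group])
        rows.append((group, round(diff, 2), diff < threshold))
    return rows
```

Calling it with two dictionaries keyed by group name, such as MPAA rating or genre, yields one row per group and flags the groups whose difference reaches the threshold.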
