MovieData

Tech Stack: Python Pandas Matplotlib Seaborn Sklearn
Tags: Python Pet

Project description: Personal Python data analytics notebook from August 2025

Introduction

Do higher film budgets lead to more box office revenue? Let’s find out whether there is a relationship, using the IMDB 2023 Dataset from Kaggle.

Import Statements

import pandas as pd
import matplotlib.pyplot as plt

import seaborn as sns
from sklearn.linear_model import LinearRegression

Notebook Presentation

pd.options.display.float_format = '{:,.0f}'.format
pd.set_option('display.width', 400)
pd.set_option('display.max_columns', 10)

Read the Data

df = pd.read_csv('imdb_data.csv')

Explore and Clean the Data

#df.shape
#df.head()
#df.tail()
df.info()
df.sample(5)
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 3348 entries, 0 to 3347
    Data columns (total 12 columns):
     #   Column          Non-Null Count  Dtype
    ---  ------          --------------  -----
     0   id              3348 non-null   object
     1   primaryTitle    3348 non-null   object
     2   originalTitle   3348 non-null   object
     3   isAdult         3348 non-null   int64
     4   runtimeMinutes  3348 non-null   int64
     5   genres          3348 non-null   object
     6   averageRating   3348 non-null   float64
     7   numVotes        3348 non-null   int64
     8   budget          3348 non-null   int64
     9   gross           3297 non-null   float64
     10  release_date    3343 non-null   object
     11  directors       3348 non-null   object
    dtypes: float64(2), int64(4), object(6)
    memory usage: 314.0+ KB
          id         primaryTitle         originalTitle        isAdult  runtimeMinutes  numVotes    budget        gross  release_date       directors
     833  tt0134273  8MM                  8MM                        0             123    139984  40000000   96,618,699  February 19, 1999  Joel Schumacher
    1641  tt0454841  The Hills Have Eyes  The Hills Have Eyes        0             107    180399  15000000   70,009,308  March 10, 2006     Alexandre Aja
    2620  tt1821694  RED 2                RED 2                      0             116    178197  84000000  148,075,565  July 18, 2013      Dean Parisot
    1161  tt0294870  Rent                 Rent                       0             135     55315  40000000   31,670,620  November 23, 2005  Chris Columbus
    1332  tt0362120  Scary Movie 4        Scary Movie 4              0              83    127234  45000000  178,262,620  April 12, 2006     David Zucker

5 rows × 12 columns

Cleanup and conversions

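  • convert release_date strings such as ‘September 18, 2019’ to datetime
  • leave averageRating as is: it is already float64, so no numeric conversion is needed
  • delete the isAdult column, as we won’t need it
  • drop rows with a missing gross or release_date
  • fix a single mistyped budget value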
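df.drop(columns=['isAdult'], inplace=True)  # not used in this analysis
df.dropna(inplace=True)  # drops 56 rows: 51 with missing gross, 5 with missing release_date

df['release_date'] = pd.to_datetime(df['release_date'], format='mixed')
# averageRating is already float64, so this conversion is not needed:
#df['averageRating'] = pd.to_numeric(df['averageRating'], errors='coerce').astype('float64')
df.loc[df.budget == 18, 'budget'] = 18000000  # fix a single value entered as 18 instead of 18,000,000

df['gross'] = pd.to_numeric(df['gross'], errors='coerce')  # safeguard only: gross is already float64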

df.head(5)
       id         primaryTitle    originalTitle   runtimeMinutes  genres                      numVotes    budget       gross  release_date  directors
    0  tt0035423  Kate & Leopold  Kate & Leopold             118  Comedy,Fantasy,Romance         87925  48000000  76,019,048  2001-12-11    James Mangold
    1  tt0065421  The Aristocats  The AristoCats              78  Adventure,Animation,Comedy    111758   4000000  35,459,543  1970-12-11    Wolfgang Reitherman
    2  tt0065938  Kelly’s Heroes  Kelly’s Heroes             144  Adventure,Comedy,War           52628   4000000   5,200,000  1970-01-01    Brian G. Hutton
    3  tt0066026  M*A*S*H         M*A*S*H                    116  Comedy,Drama,War               75784   3500000  81,600,000  1970-01-25    Robert Altman
    4  tt0066206  Patton          Patton                     172  Biography,Drama,War           106476  12000000  61,749,765  1970-02-04    Franklin J. Schaffner

5 rows × 11 columns
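Before a blanket dropna(), it can help to count the missing values per column so you know what is being thrown away. The following is a minimal sketch on a small made-up frame; the name toy and all of its values are illustrative assumptions, not rows from the dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real dataset (all values are made up)
toy = pd.DataFrame({
    'primaryTitle': ['A', 'B', 'C', 'D'],
    'budget': [10, 20, 30, 40],
    'gross': [15.0, np.nan, 45.0, 60.0],
    'release_date': ['May 1, 2000', 'June 2, 2001', None, 'July 3, 2003'],
})

# Missing values per column, before dropping anything
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Number of rows with at least one missing value, i.e. what dropna() removes
rows_with_missing = toy.isna().any(axis=1).sum()
print(f'{rows_with_missing} of {len(toy)} rows would be dropped')

cleaned = toy.dropna()
```

On the real data, the same two checks should agree with the df.info() output above: 51 missing gross values and 5 missing release dates.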

Descriptive Statistics

df.describe()
           runtimeMinutes  averageRating   numVotes       budget          gross                   release_date
    count           3,292          3,292      3,292        3,292          3,292                           3292
    mean              113              7    217,090   50,468,636    168,264,559  2005-11-07 19:06:03.061968512
    min                63              1     50,004        6,000            210            1970-01-01 00:00:00
    25%                98              6     79,396   15,000,000     36,283,303            1999-07-19 12:00:00
    50%               109              7    129,715   32,000,000     88,434,290            2007-09-29 00:00:00
    75%               124              7    249,003   68,000,000    200,995,146            2014-01-20 00:00:00
    max               229              9  2,817,283  356,000,000  2,923,706,026            2023-10-25 00:00:00
    std                20              1    249,472   51,786,917    236,752,803                              -
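  • The average film costs about $50m to make and earns more than 3x that (about $168m) in worldwide revenue.
  • At the 25th percentile the ratio is lower: a $15m budget against $36m in revenue, or roughly 2.4x. Quartiles are computed per column, so this compares the two distributions, not individual films.
  • The lowest budget is $6,000, for a film that earned $126,052. The $210 minimum gross belongs to a different film.
  • The highest production budget is $356,000,000 and the highest worldwide revenue is $2,923,706,026, which would be about 8x the budget if both figures belong to the same film!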

I believe it should be Avatar by James Cameron, but let’s check, and also look at the film with the lowest budget.

df[df.budget.isin([6000, 356000000])]
          id         primaryTitle       originalTitle      runtimeMinutes  genres                  numVotes     budget          gross  release_date  directors
     878  tt0154506  Following          Following                      69  Crime,Mystery,Thriller     99219       6000        126,052  1998-04-24    Christopher Nolan
    3055  tt4154796  Avengers: Endgame  Avengers: Endgame             181  Action,Adventure,Drama   1224453  356000000  2,799,439,100  2019-04-18    Anthony Russo, Joe Russo

2 rows × 11 columns

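Surprise: the film with the highest budget is actually Avengers: Endgame! Its $2.80B gross is below the $2.92B maximum, so the top-grossing film is a different one. The lowest-budget film is Following by Christopher Nolan, released in 1998. Never heard of it, but OK. Interestingly, it also made 21x its budget.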
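The mix-up comes from reading describe() row by row: each column's maximum is computed independently, so the max budget and the max gross need not come from the same film. A small sketch with made-up numbers (toy and its values are illustrative assumptions) shows how idxmax can look up the row behind each maximum separately.

```python
import pandas as pd

# Made-up films: the most expensive one is not the top earner
toy = pd.DataFrame({
    'primaryTitle': ['Film A', 'Film B', 'Film C'],
    'budget': [300, 200, 1],
    'gross': [900, 1000, 20],
})

# idxmax returns the index label of the row holding the column maximum
top_budget = toy.loc[toy['budget'].idxmax()]
top_gross = toy.loc[toy['gross'].idxmax()]

print('Highest budget:', top_budget['primaryTitle'])
print('Highest gross: ', top_gross['primaryTitle'])

# A per-film ratio avoids mixing numbers from different rows
toy['gross_to_budget'] = toy['gross'] / toy['budget']
```

Applied to the notebook's df, the same two lookups would name the highest-budget and highest-grossing films directly.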

Films that Lost Money

Of course, not all films are successful. What percentage of films had production costs that exceeded their worldwide gross revenue?

money_losing = df.loc[df.budget > df.gross]
print(len(money_losing)/len(df))

0.14975698663426487

About 15.0% of films do not recoup their budget at the worldwide box office. That seems quite low, but this dataset doesn’t include domestic revenue, so films that were never released worldwide are not even included.
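The share can also be computed as the mean of a boolean mask, and the `.1%` format spec takes care of the rounding. This is a minimal sketch on made-up numbers (the toy frame is an assumption for illustration); the last line formats the ratio printed above.

```python
import pandas as pd

# Made-up budgets and grosses: only the first film loses money
toy = pd.DataFrame({'budget': [10, 20, 30, 40],
                    'gross': [5, 60, 90, 120]})

# True counts as 1 and False as 0, so the mean is the share of money losers
loss_share = (toy['budget'] > toy['gross']).mean()
print(f'{loss_share:.1%} of toy films lost money')

# The ratio from the notebook output, formatted the same way
notebook_share = f'{0.14975698663426487:.1%}'
print(notebook_share)
```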

Most common genres

genres = df['genres'].str.get_dummies(sep=',')
genre_counts = genres.sum().sort_values(ascending=False)  # number of films per genre tag
plt.figure(figsize=(9, 9))
plt.pie(genre_counts,
        labels=genre_counts.index,
        autopct='%1.1f%%',
        colors=sns.color_palette("Paired"))

plt.title('Most common genres', fontweight = 'bold')

plt.tight_layout()
plt.show()

Genre with the most gross

df['genres'] = df['genres'].str.split(',')
df_exploded = df.explode('genres')
genre_gross = df_exploded.groupby('genres')['gross'].max().sort_values(ascending=False).head(10)

genre_names = genre_gross.index
max_gross = genre_gross.values

plt.figure(figsize=(12, 8))
sns.barplot(x=max_gross, y=genre_names, hue=max_gross, palette="viridis", legend=False, orient='h')
plt.ylabel('Genre')
plt.xlabel('Maximum Gross, $billions')
plt.title('Maximum Gross by Genre')
plt.show()

    genres
    Action      2,923,706,026
    Fantasy     2,923,706,026
    Adventure   2,923,706,026
    Drama       2,799,439,100
    Romance     2,264,743,305
    Sci-Fi      2,071,310,218
    Animation   1,663,075,401
    Crime       1,515,341,399
    Thriller    1,515,341,399
    Comedy      1,453,683,476
    Name: gross, dtype: float64

The most common genres (Adventure, Action, Drama) also have the highest maximum gross values. Interestingly, Fantasy takes Comedy’s place near the top, even though it makes up only about 3.7% of the genre pie.
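Because the ranking uses max(), a single blockbuster tagged with several genres lifts all of them at once, which is how a rare genre can rank high. The sketch below, on a made-up frame (toy and its values are assumptions), puts count, max and sum side by side so the effect is visible.

```python
import pandas as pd

# Made-up films; 'Film A' is a hit tagged with two genres
toy = pd.DataFrame({
    'primaryTitle': ['Film A', 'Film B', 'Film C'],
    'genres': ['Action,Fantasy', 'Comedy', 'Comedy,Action'],
    'gross': [900, 100, 200],
})

# One row per (film, genre) pair, without modifying the original frame
exploded = toy.assign(genres=toy['genres'].str.split(',')).explode('genres')

# count = films tagged with the genre, max = best single film, sum = total gross
summary = exploded.groupby('genres')['gross'].agg(['count', 'max', 'sum'])
print(summary.sort_values('max', ascending=False))
```

Here Fantasy appears only once but ties Action on max, while Comedy, with two films, has the lowest max.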

Directors with most films

top_directors = df["directors"].value_counts().head(10)

plt.figure(figsize=(12, 6))
sns.barplot(x=top_directors.values, y=top_directors.index, hue=top_directors.index, palette="viridis", legend=False, orient='h')
plt.title('Top 10 Directors by Number of Films Directed')
plt.xlabel('Number of Films')
plt.ylabel('Director')
plt.show()

Directors with max gross

dir_gross=df.groupby('directors')['gross'].max().sort_values(ascending=False).head(10)

director_names = dir_gross.index
max_gross = dir_gross.values

plt.figure(figsize=(10, 6))
sns.barplot(x=max_gross, y=director_names, hue=director_names, palette="viridis", legend=False, orient='h')
plt.ylabel('Directors')
plt.xlabel('Maximum Gross')
plt.title('Maximum Gross by Director (Top 10)')
plt.xticks(rotation=45, ha='right')
plt.show()

Budget vs Revenue (Seaborn Bubble Charts)

plt.figure(figsize=(8,4), dpi=200)
with sns.axes_style('darkgrid'):
    ax = sns.scatterplot(data=df,
                         x='budget',
                         y='gross',
                         hue=('averageRating'),
                         palette="viridis",
                         legend=False,
                         size=('gross'))

    ax.set(ylim=(0, 3000000000),
           xlim=(0, 450000000),
           ylabel='Revenue in $ billions',
           xlabel='Budget in $100 millions')

plt.show()

A bigger budget seems to correspond to higher revenue. Budgets above $100m also tend to cluster at round figures such as $150m and $200m.
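One way to check the clustering without squinting at the chart is to count how often each large budget figure occurs. This is only a sketch on made-up budgets (the toy frame is an assumption); on the real data the same value_counts call would run on df.

```python
import pandas as pd

# Made-up budgets, mostly round figures above $100m
toy = pd.DataFrame({'budget': [150_000_000, 150_000_000, 200_000_000,
                               137_500_000, 200_000_000, 150_000_000,
                               40_000_000]})

# Keep only budgets above $100m, then count how often each exact figure occurs
big = toy.loc[toy['budget'] > 100_000_000, 'budget']
common = big.value_counts().head(3)
print(common)
```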

Movie Releases over Time

plt.figure(figsize=(8,4), dpi=200)
with sns.axes_style('darkgrid'):
    ax = sns.scatterplot(data=df,
                         x='release_date',
                         y='gross',
                         hue=('averageRating'),
                         legend=True,
                         palette="viridis",
                         size=('gross'))

    ax.set(ylim=(0, 3000000000),
           xlim=(df.release_date.min(), df.release_date.max()),
           ylabel='Revenue in $ billions',
           xlabel='Release date')

plt.show()

We clearly see a positive trend: revenue increases over time.
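The trend can also be checked numerically by grouping on release year. A minimal sketch with made-up dates and grosses follows (toy is an assumption; the notebook's df already has release_date as datetime after cleanup).

```python
import pandas as pd

# Made-up releases spread over three years
toy = pd.DataFrame({
    'release_date': pd.to_datetime(['1990-03-01', '1990-07-15', '2005-05-20',
                                    '2005-11-02', '2019-04-18']),
    'gross': [50, 70, 200, 300, 900],
})

# Median and maximum gross per release year
by_year = toy.groupby(toy['release_date'].dt.year)['gross'].agg(['median', 'max'])
print(by_year)
```

The median is less sensitive to a handful of blockbusters than the maximum, so comparing the two columns shows whether the trend holds for typical films too.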

Seaborn Regression Plots

plt.figure(figsize=(8,4), dpi=200)
with sns.axes_style('darkgrid'):
  ax = sns.regplot(data=df,
                   x='budget',
                   y='gross',
                   color='#2f4b7c',
                   scatter_kws = {'alpha': 0.3},
                   line_kws = {'color': '#ff7c43'})

  ax.set(ylim=(0, 3000000000),
         xlim=(0, 450000000),
         ylabel='Revenue in $ billions',
         xlabel='Budget in $100 millions')

plt.show()

We also see that a film with a $150 million budget is predicted to make slightly under $500 million by our regression line. All in all, we can be fairly confident that there is a relationship between a film’s budget and its worldwide revenue.
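By default regplot draws a straight least-squares line, so a point on that line can be reproduced with a degree-1 fit. The sketch below uses np.polyfit on made-up data that lies exactly on a known line; the arrays and the 25.0 query value are assumptions for illustration.

```python
import numpy as np

# Made-up data lying exactly on gross = 5 + 3 * budget
budget = np.array([10.0, 20.0, 30.0, 40.0])
gross = np.array([35.0, 65.0, 95.0, 125.0])

# Degree-1 least-squares fit; coefficients come back highest degree first
slope, intercept = np.polyfit(budget, gross, deg=1)

# Read a prediction off the fitted line, as done by eye from the plot
predicted = intercept + slope * 25.0
print(slope, intercept, predicted)
```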

Own Regression with scikit-learn

$ REVENUE = \theta_0 + \theta_1 \cdot BUDGET $

regression = LinearRegression()

# Explanatory variable (feature); double brackets keep X two-dimensional
X = df[['budget']]

# Response variable (target)
y = df[['gross']]

regression.fit(X, y)

# y is a one-column DataFrame, so intercept_ has shape (1,) and coef_ has shape (1, 1)
theta0 = regression.intercept_[0]  # y-intercept
theta1 = regression.coef_[0, 0]    # slope, as a scalar
r2 = regression.score(X, y)        # R-squared

print(f"Y-Intercept (theta0) is {theta0}")
print(f"Slope coefficient (theta1) is {theta1:.8f}")
print(f"R-squared is {r2}")

Y-Intercept (theta0) is 7157763.965588003
Slope coefficient (theta1) is 3.19221611
R-squared is 0.48756712843206695

  • Y-intercept (theta0) is the estimated revenue for a film with a budget of $0: about $7.2m here
  • Slope (theta1) tells us that every extra $1 of budget is associated with about $3.19 more revenue
  • R-squared of 0.49 means that our model explains about 49% of the variance in movie revenue. That’s actually pretty decent, considering we’ve got the simplest possible model, with only one explanatory variable.
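For a one-variable linear regression with an intercept, R-squared equals the squared Pearson correlation between feature and target, so the reported 0.4876 corresponds to a correlation of roughly 0.70. The sketch below checks the identity on made-up data (toy and its values are assumptions) and also confirms that the intercept is simply the prediction at a budget of 0.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up, noisy but roughly linear data
toy = pd.DataFrame({'budget': [10, 20, 30, 40, 50],
                    'gross': [30, 70, 80, 140, 150]})

model = LinearRegression().fit(toy[['budget']], toy['gross'])

# R-squared from scikit-learn versus squared Pearson correlation from pandas
r2 = model.score(toy[['budget']], toy['gross'])
pearson_r = toy['budget'].corr(toy['gross'])
print(r2, pearson_r ** 2)

# The intercept is the model's prediction for a budget of 0
at_zero = model.predict(pd.DataFrame({'budget': [0]}))[0]
print(at_zero, model.intercept_)
```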

Model Prediction

We just estimated the slope and intercept! Remember that our Linear Model has the following form:

$ \widehat{REVENUE} = \theta_0 + \theta_1 \cdot BUDGET $

budget = 350000000
revenue_estimate = theta0 + regression.coef_[0,0] * budget
revenue_estimate = round(revenue_estimate, -6)  # round to the nearest million
print(f'The estimated revenue for a $350m film is around ${revenue_estimate:,.0f}.')

The estimated revenue for a $350m film is around $1,124,000,000.

So for a $350m budget, the model estimates about $1.12 billion in revenue.
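Written out by hand with the fitted values printed earlier (intercept rounded to the nearest dollar, slope to eight decimals), the estimate works out as follows. The figures are approximate because of that rounding.

```latex
\begin{aligned}
\widehat{REVENUE} &= \theta_0 + \theta_1 \cdot BUDGET \\
                  &\approx 7{,}157{,}764 + 3.19221611 \times 350{,}000{,}000 \\
                  &\approx 7{,}157{,}764 + 1{,}117{,}275{,}639 \\
                  &\approx 1{,}124{,}433{,}400 \;\approx\; \$1.12\ \text{billion}
\end{aligned}
```

Note that $350m sits close to the largest budget in the data ($356m), so this prediction is near the edge of what the model has seen.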

That’s it, thanks for watching!