MovieData

Tech Stack:

Tags: Python Pet

**Project description:** Personal python data analytics notebook from August 2025

Tech Stack: Python, Pandas, Matplotlib, Seaborn, Sklearn

Introduction

Do higher film budgets lead to more box office revenue? Let’s find out if there’s a relationship using IMDB 2023 Dataset from Kaggle.

Import Statements

import pandas as pd
import matplotlib.pyplot as plt

import seaborn as sns
from sklearn.linear_model import LinearRegression

Notebook Presentation

pd.options.display.float_format = '{:,.0f}'.format
pd.set_option('display.width', 400)
pd.set_option('display.max_columns', 10)

Read the Data

df = pd.read_csv('imdb_data.csv')

Explore and Clean the Data

#df.shape
#df.head()
#df.tail()
df.info()
df.sample(5)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3348 entries, 0 to 3347
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   id              3348 non-null   object
 1   primaryTitle    3348 non-null   object
 2   originalTitle   3348 non-null   object
 3   isAdult         3348 non-null   int64
 4   runtimeMinutes  3348 non-null   int64
 5   genres          3348 non-null   object
 6   averageRating   3348 non-null   float64
 7   numVotes        3348 non-null   int64
 8   budget          3348 non-null   int64
 9   gross           3297 non-null   float64
 10  release_date    3343 non-null   object
 11  directors       3348 non-null   object
dtypes: float64(2), int64(4), object(6)
memory usage: 314.0+ KB

	id	primaryTitle	originalTitle	runtimeMinutes	...	numVotes	budget	gross	release_date	directors
833	tt0134273	8MM	8MM	123	...	139984	40000000	96,618,699	February 19, 1999	Joel Schumacher
1641	tt0454841	The Hills Have Eyes	The Hills Have Eyes	107	...	180399	15000000	70,009,308	March 10, 2006	Alexandre Aja
2620	tt1821694	RED 2	RED 2	116	...	178197	84000000	148,075,565	July 18, 2013	Dean Parisot
1161	tt0294870	Rent	Rent	135	...	55315	40000000	31,670,620	November 23, 2005	Chris Columbus
1332	tt0362120	Scary Movie 4	Scary Movie 4	83	...	127234	45000000	178,262,620	April 12, 2006	David Zucker

5 rows × 12 columns

Cleanup and conversions

convert ‘September 18, 2019’ dates to datetime
convert averageRating to numeric
delete column isAdult as we wont need it
drop empty gross
fix singlevalue

df.drop(columns=['isAdult'], inplace=True) # not used
df.dropna(inplace=True) # 51 films have falsely 0 gross

df['release_date'] = pd.to_datetime(df['release_date'], format='mixed')
#df['averageRating'] = pd.to_numeric(df['averageRating'], errors='coerce').astype('float64')
df.loc[df.budget == 18, 'budget'] = 18000000 # fix single value

df['gross'] = pd.to_numeric(df['gross'], errors='coerce')

df.head(5)

	id	primaryTitle	originalTitle	runtimeMinutes	genres	...	numVotes	budget	gross	release_date	directors
0	tt0035423	Kate & Leopold	Kate & Leopold	118	Comedy,Fantasy,Romance	...	87925	48000000	76,019,048	2001-12-11	James Mangold
1	tt0065421	The Aristocats	The AristoCats	78	Adventure,Animation,Comedy	...	111758	4000000	35,459,543	1970-12-11	Wolfgang Reitherman
2	tt0065938	Kelly's Heroes	Kelly's Heroes	144	Adventure,Comedy,War	...	52628	4000000	5,200,000	1970-01-01	Brian G. Hutton
3	tt0066026	MAS*H	MAS*H	116	Comedy,Drama,War	...	75784	3500000	81,600,000	1970-01-25	Robert Altman
4	tt0066206	Patton	Patton	172	Biography,Drama,War	...	106476	12000000	61,749,765	1970-02-04	Franklin J. Schaffner

5 rows × 11 columns

Descriptive Statistics

df.describe()

	runtimeMinutes	averageRating	numVotes	budget	gross	release_date
count	3,292	3,292	3,292	3,292	3,292	3292
mean	113	7	217,090	50,468,636	168,264,559	2005-11-07 19:06:03.061968512
min	63	1	50,004	6,000	210	1970-01-01 00:00:00
25%	98	6	79,396	15,000,000	36,283,303	1999-07-19 12:00:00
50%	109	7	129,715	32,000,000	88,434,290	2007-09-29 00:00:00
75%	124	7	249,003	68,000,000	200,995,146	2014-01-20 00:00:00
max	229	9	2,817,283	356,000,000	2,923,706,026	2023-10-25 00:00:00
std	20	1	249,472	51,786,917	236,752,803	-

the average film costs about $50m to make and earns more than 3x (or $168m) in worldwide revenue.
25% are also profitable but only at around 2x budget rate.
The lowest budget was $6,000 with the revenue of $126,052
The highest production budget was $356,000,000 with highest worldwide revenue $2,923,706,026 or (8x the budget)!

I believe it should be Avatar by James Cameron, but lets check it out and also see the one with the lowest budget.

df[df.budget.isin([6000, 356000000])]

	id	primaryTitle	originalTitle	runtimeMinutes	genres	...	numVotes	budget	gross	release_date	directors
878	tt0154506	Following	Following	69	Crime,Mystery,Thriller	...	99219	6000	126,052	1998-04-24	Christopher Nolan
3055	tt4154796	Avengers: Endgame	Avengers: Endgame	181	Action,Adventure,Drama	...	1224453	356000000	2,799,439,100	2019-04-18	Anthony Russo, Joe Russo

2 rows × 11 columns

Surprise, it’s now actually Avengers: Endgame! And the lowest budget is Following by Christopher Nolan, released in 1998… Never heard of it, but ok. Interesting is that it also made x21 the budget.

Films that Lost Money

Of course not all films are successfull, lets find out what is the percentage of films where the production costs exceeded the worldwide gross revenue?

money_losing = df.loc[df.budget > df.gross]
print(len(money_losing)/len(df))

0.14975698663426487

14.9% of films do not recoup their budget at the worldwide box office. Seems quite low but this dataset doesn’t include domestic revenue, so films that were never released worldwide are not even included.

Most common genres

genres = df['genres'].str.get_dummies(sep = ',')
plt.figure(figsize = (9,9))
plt.pie(genres.sum().sort_values(ascending = False),
        labels = genres.sum().sort_values(ascending = False).index,
        autopct='%1.1f%%',
        colors = sns.color_palette("Paired"))

plt.title('Most common genres', fontweight = 'bold')

plt.tight_layout()
plt.show()

Genre with the most gross

df['genres'] = df['genres'].str.split(',')
df_exploded = df.explode('genres')
genre_gross = df_exploded.groupby('genres')['gross'].max().sort_values(ascending=False).head(10)

genre_names = genre_gross.index
max_gross = genre_gross.values

plt.figure(figsize=(12, 8))
sns.barplot(x=max_gross, y=genre_names, hue=max_gross, palette="viridis", legend=False, orient='h')
plt.ylabel('Genre')
plt.xlabel('Maximum Gross, $billions')
plt.title('Maximum Gross by Genre')
plt.show()

genres
Action      2,923,706,026
Fantasy     2,923,706,026
Adventure   2,923,706,026
Drama       2,799,439,100
Romance     2,264,743,305
Sci-Fi      2,071,310,218
Animation   1,663,075,401
Crime       1,515,341,399
Thriller    1,515,341,399
Comedy      1,453,683,476
Name: gross, dtype: float64

Most common genres: Adventure, Action, Drama also have max gross values. Interesting that Comedy gets replaced by Fantasy, which is only about 3.7%

Directors with most films

top_directors = df["directors"].value_counts().head(10)

plt.figure(figsize=(12, 6))
sns.barplot(x=top_directors.values, y=top_directors.index, hue=top_directors.index, palette="viridis", legend=False, orient='h')
plt.title('Top 10 Directors by Number of Films Directed')
plt.xlabel('Number of Films')
plt.ylabel('Director')
plt.show()

Directors with max gross

dir_gross=df.groupby('directors')['gross'].max().sort_values(ascending=False).head(10)

director_names = dir_gross.index
max_gross = dir_gross.values

plt.figure(figsize=(10, 6))
sns.barplot(x=max_gross, y=director_names, hue=director_names, palette="viridis", legend=False, orient='h')
plt.ylabel('Directors')
plt.xlabel('Maximum Gross')
plt.title('Maximum Gross by Director (Top 10)')
plt.xticks(rotation=45, ha='right')
plt.show()

Budget vs Revenue (Seaborn Bubble Charts)

plt.figure(figsize=(8,4), dpi=200)
with sns.axes_style('darkgrid'):
    ax = sns.scatterplot(data=df,
                         x='budget',
                         y='gross',
                         hue=('averageRating'),
                         palette="viridis",
                         legend=False,
                         size=('gross'))

    ax.set(ylim=(0, 3000000000),
           xlim=(0, 450000000),
           ylabel='Revenue in $ billions',
           xlabel='Budget in $100 millions')

plt.show()

Bigger budget seems to correspond to higher revenue. And also budgets above $100m tend to stick to fix sums like 150, 200

Movie Releases over Time

plt.figure(figsize=(8,4), dpi=200)
with sns.axes_style('darkgrid'):
    ax = sns.scatterplot(data=df,
                         x='release_date',
                         y='gross',
                         hue=('averageRating'),
                         legend=True,
                         palette="viridis",
                         size=('gross'))

    ax.set(ylim=(0, 3000000000),
           xlim=(df.release_date.min(), df.release_date.max()),
           ylabel='Revenue in $ billions',
           xlabel='Budget in $100 millions')

plt.show()

We clearly see a positive trend of budgets/revenue increasing over time

Seaborn Regression Plots

plt.figure(figsize=(8,4), dpi=200)
with sns.axes_style('darkgrid'):
  ax = sns.regplot(data=df,
                   x='budget',
                   y='gross',
                   color='#2f4b7c',
                   scatter_kws = {'alpha': 0.3},
                   line_kws = {'color': '#ff7c43'})

  ax.set(ylim=(0, 3000000000),
         xlim=(0, 450000000),
         ylabel='Revenue in $ billions',
         xlabel='Budget in $100 millions')

We also see that a film with a $150 million budget is predicted to make slightly under $500 million by our regression line. All in all, we can be pretty confident that there does indeed seem to be a relationship between a film’s budget and that film’s worldwide revenue.

Own Regression with scikit-learn

$ REVENUE = \theta _0 + \theta _1 * BUDGET$

regression = LinearRegression()
# Explanatory Variable or Feature
X = pd.DataFrame(df, columns=['budget'])

# Response Variable or Target
y = pd.DataFrame(df, columns=['gross'])
regression.fit(X, y)

# R-squared
regression.score(X, y)
theta0 = regression.intercept_[0] # y-intercept
theta1 = regression.coef_[0] # slope
r2 = regression.score(X,y) # r-squared

print(f"Y-Intercept (theta0) is {theta0}")
print(f"Slope coefficient(theta1) is {theta1}")
print(f"R-squared is {r2}")

Y-Intercept (theta0) is 7157763.965588003
Slope coefficient(theta1) is [3.19221611]
R-squared is 0.48756712843206695

Y-intercept (theta0) tells us the estimated revenue for a given budget
Slope (theta1) tells us that for every extra $1 in the budget, movie revenue increases by $3.19
R-squared 0.48 means that our model explains about 48% of the variance in movie revenue. That’s actually pretty decent, considering we’ve got the simplest possible model, with only one explanatory variable.

Model Prediction

We just estimated the slope and intercept! Remember that our Linear Model has the following form:

$ REV \hat ENUE = \theta _0 + \theta _1 BUDGET$

budget = 350000000
revenue_estimate = theta0 + regression.coef_[0,0] * budget
revenue_estimate = round(revenue_estimate, -6)
print(f'The estimated revenue for a $350m film is around ${revenue_estimate:.10}.')

The estimated revenue for a $350m film is around $1.124e+09.

So for a $350M we estimate $1.12B

That’s it, thanks for watching!

Table of Contents

Other Python Projects