MovieData
**Project description:** Personal python data analytics notebook from August 2025
Tech Stack: Python, Pandas, Matplotlib, Seaborn, Sklearn
Introduction
Do higher film budgets lead to more box office revenue? Let’s find out if there’s a relationship using IMDB 2023 Dataset from Kaggle.
Import Statements
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
Notebook Presentation
pd.options.display.float_format = '{:,.0f}'.format
pd.set_option('display.width', 400)
pd.set_option('display.max_columns', 10)
Read the Data
df = pd.read_csv('imdb_data.csv')
Explore and Clean the Data
#df.shape
#df.head()
#df.tail()
df.info()
df.sample(5)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3348 entries, 0 to 3347
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 3348 non-null object
1 primaryTitle 3348 non-null object
2 originalTitle 3348 non-null object
3 isAdult 3348 non-null int64
4 runtimeMinutes 3348 non-null int64
5 genres 3348 non-null object
6 averageRating 3348 non-null float64
7 numVotes 3348 non-null int64
8 budget 3348 non-null int64
9 gross 3297 non-null float64
10 release_date 3343 non-null object
11 directors 3348 non-null object
dtypes: float64(2), int64(4), object(6)
memory usage: 314.0+ KB
| id | primaryTitle | originalTitle | isAdult | runtimeMinutes | ... | numVotes | budget | gross | release_date | directors | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 833 | tt0134273 | 8MM | 8MM | 0 | 123 | ... | 139984 | 40000000 | 96,618,699 | February 19, 1999 | Joel Schumacher |
| 1641 | tt0454841 | The Hills Have Eyes | The Hills Have Eyes | 0 | 107 | ... | 180399 | 15000000 | 70,009,308 | March 10, 2006 | Alexandre Aja |
| 2620 | tt1821694 | RED 2 | RED 2 | 0 | 116 | ... | 178197 | 84000000 | 148,075,565 | July 18, 2013 | Dean Parisot |
| 1161 | tt0294870 | Rent | Rent | 0 | 135 | ... | 55315 | 40000000 | 31,670,620 | November 23, 2005 | Chris Columbus |
| 1332 | tt0362120 | Scary Movie 4 | Scary Movie 4 | 0 | 83 | ... | 127234 | 45000000 | 178,262,620 | April 12, 2006 | David Zucker |
5 rows × 12 columns
Cleanup and conversions
- convert ‘September 18, 2019’ dates to datetime
- convert averageRating to numeric
- delete column isAdult as we wont need it
- drop empty gross
- fix singlevalue
df.drop(columns=['isAdult'], inplace=True) # not used
df.dropna(inplace=True) # 51 films have falsely 0 gross
df['release_date'] = pd.to_datetime(df['release_date'], format='mixed')
#df['averageRating'] = pd.to_numeric(df['averageRating'], errors='coerce').astype('float64')
df.loc[df.budget == 18, 'budget'] = 18000000 # fix single value
df['gross'] = pd.to_numeric(df['gross'], errors='coerce')
df.head(5)
| id | primaryTitle | originalTitle | runtimeMinutes | genres | ... | numVotes | budget | gross | release_date | directors | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | tt0035423 | Kate & Leopold | Kate & Leopold | 118 | Comedy,Fantasy,Romance | ... | 87925 | 48000000 | 76,019,048 | 2001-12-11 | James Mangold |
| 1 | tt0065421 | The Aristocats | The AristoCats | 78 | Adventure,Animation,Comedy | ... | 111758 | 4000000 | 35,459,543 | 1970-12-11 | Wolfgang Reitherman |
| 2 | tt0065938 | Kelly's Heroes | Kelly's Heroes | 144 | Adventure,Comedy,War | ... | 52628 | 4000000 | 5,200,000 | 1970-01-01 | Brian G. Hutton |
| 3 | tt0066026 | M*A*S*H | M*A*S*H | 116 | Comedy,Drama,War | ... | 75784 | 3500000 | 81,600,000 | 1970-01-25 | Robert Altman |
| 4 | tt0066206 | Patton | Patton | 172 | Biography,Drama,War | ... | 106476 | 12000000 | 61,749,765 | 1970-02-04 | Franklin J. Schaffner |
5 rows × 11 columns
Descriptive Statistics
df.describe()
| runtimeMinutes | averageRating | numVotes | budget | gross | release_date | |
|---|---|---|---|---|---|---|
| count | 3,292 | 3,292 | 3,292 | 3,292 | 3,292 | 3292 |
| mean | 113 | 7 | 217,090 | 50,468,636 | 168,264,559 | 2005-11-07 19:06:03.061968512 |
| min | 63 | 1 | 50,004 | 6,000 | 210 | 1970-01-01 00:00:00 |
| 25% | 98 | 6 | 79,396 | 15,000,000 | 36,283,303 | 1999-07-19 12:00:00 |
| 50% | 109 | 7 | 129,715 | 32,000,000 | 88,434,290 | 2007-09-29 00:00:00 |
| 75% | 124 | 7 | 249,003 | 68,000,000 | 200,995,146 | 2014-01-20 00:00:00 |
| max | 229 | 9 | 2,817,283 | 356,000,000 | 2,923,706,026 | 2023-10-25 00:00:00 |
| std | 20 | 1 | 249,472 | 51,786,917 | 236,752,803 | - |
- the average film costs about $50m to make and earns more than 3x (or $168m) in worldwide revenue.
- 25% are also profitable but only at around 2x budget rate.
- The lowest budget was $6,000 with the revenue of $126,052
- The highest production budget was $356,000,000 with highest worldwide revenue $2,923,706,026 or (8x the budget)!
I believe it should be Avatar by James Cameron, but lets check it out and also see the one with the lowest budget.
df[df.budget.isin([6000, 356000000])]
| id | primaryTitle | originalTitle | runtimeMinutes | genres | ... | numVotes | budget | gross | release_date | directors | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 878 | tt0154506 | Following | Following | 69 | Crime,Mystery,Thriller | ... | 99219 | 6000 | 126,052 | 1998-04-24 | Christopher Nolan |
| 3055 | tt4154796 | Avengers: Endgame | Avengers: Endgame | 181 | Action,Adventure,Drama | ... | 1224453 | 356000000 | 2,799,439,100 | 2019-04-18 | Anthony Russo, Joe Russo |
2 rows × 11 columns
Surprise, it’s now actually Avengers: Endgame! And the lowest budget is Following by Christopher Nolan, released in 1998… Never heard of it, but ok. Interesting is that it also made x21 the budget.
Films that Lost Money
Of course not all films are successfull, lets find out what is the percentage of films where the production costs exceeded the worldwide gross revenue?
money_losing = df.loc[df.budget > df.gross]
print(len(money_losing)/len(df))
0.14975698663426487
14.9% of films do not recoup their budget at the worldwide box office. Seems quite low but this dataset doesn’t include domestic revenue, so films that were never released worldwide are not even included.
Most common genres
genres = df['genres'].str.get_dummies(sep = ',')
plt.figure(figsize = (9,9))
plt.pie(genres.sum().sort_values(ascending = False),
labels = genres.sum().sort_values(ascending = False).index,
autopct='%1.1f%%',
colors = sns.color_palette("Paired"))
plt.title('Most common genres', fontweight = 'bold')
plt.tight_layout()
plt.show()
Genre with the most gross
df['genres'] = df['genres'].str.split(',')
df_exploded = df.explode('genres')
genre_gross = df_exploded.groupby('genres')['gross'].max().sort_values(ascending=False).head(10)
genre_names = genre_gross.index
max_gross = genre_gross.values
plt.figure(figsize=(12, 8))
sns.barplot(x=max_gross, y=genre_names, hue=max_gross, palette="viridis", legend=False, orient='h')
plt.ylabel('Genre')
plt.xlabel('Maximum Gross, $billions')
plt.title('Maximum Gross by Genre')
plt.show()
genres
Action 2,923,706,026
Fantasy 2,923,706,026
Adventure 2,923,706,026
Drama 2,799,439,100
Romance 2,264,743,305
Sci-Fi 2,071,310,218
Animation 1,663,075,401
Crime 1,515,341,399
Thriller 1,515,341,399
Comedy 1,453,683,476
Name: gross, dtype: float64
Most common genres: Adventure, Action, Drama also have max gross values. Interesting that Comedy gets replaced by Fantasy, which is only about 3.7%
Directors with most films
top_directors = df["directors"].value_counts().head(10)
plt.figure(figsize=(12, 6))
sns.barplot(x=top_directors.values, y=top_directors.index, hue=top_directors.index, palette="viridis", legend=False, orient='h')
plt.title('Top 10 Directors by Number of Films Directed')
plt.xlabel('Number of Films')
plt.ylabel('Director')
plt.show()
Directors with max gross
dir_gross=df.groupby('directors')['gross'].max().sort_values(ascending=False).head(10)
director_names = dir_gross.index
max_gross = dir_gross.values
plt.figure(figsize=(10, 6))
sns.barplot(x=max_gross, y=director_names, hue=director_names, palette="viridis", legend=False, orient='h')
plt.ylabel('Directors')
plt.xlabel('Maximum Gross')
plt.title('Maximum Gross by Director (Top 10)')
plt.xticks(rotation=45, ha='right')
plt.show()
Budget vs Revenue (Seaborn Bubble Charts)
plt.figure(figsize=(8,4), dpi=200)
with sns.axes_style('darkgrid'):
ax = sns.scatterplot(data=df,
x='budget',
y='gross',
hue=('averageRating'),
palette="viridis",
legend=False,
size=('gross'))
ax.set(ylim=(0, 3000000000),
xlim=(0, 450000000),
ylabel='Revenue in $ billions',
xlabel='Budget in $100 millions')
plt.show()
Bigger budget seems to correspond to higher revenue. And also budgets above $100m tend to stick to fix sums like 150, 200
Movie Releases over Time
plt.figure(figsize=(8,4), dpi=200)
with sns.axes_style('darkgrid'):
ax = sns.scatterplot(data=df,
x='release_date',
y='gross',
hue=('averageRating'),
legend=True,
palette="viridis",
size=('gross'))
ax.set(ylim=(0, 3000000000),
xlim=(df.release_date.min(), df.release_date.max()),
ylabel='Revenue in $ billions',
xlabel='Budget in $100 millions')
plt.show()
We clearly see a positive trend of budgets/revenue increasing over time
Seaborn Regression Plots
plt.figure(figsize=(8,4), dpi=200)
with sns.axes_style('darkgrid'):
ax = sns.regplot(data=df,
x='budget',
y='gross',
color='#2f4b7c',
scatter_kws = {'alpha': 0.3},
line_kws = {'color': '#ff7c43'})
ax.set(ylim=(0, 3000000000),
xlim=(0, 450000000),
ylabel='Revenue in $ billions',
xlabel='Budget in $100 millions')
We also see that a film with a $150 million budget is predicted to make slightly under $500 million by our regression line. All in all, we can be pretty confident that there does indeed seem to be a relationship between a film’s budget and that film’s worldwide revenue.
Own Regression with scikit-learn
$ REVENUE = \theta _0 + \theta _1 * BUDGET$
regression = LinearRegression()
# Explanatory Variable or Feature
X = pd.DataFrame(df, columns=['budget'])
# Response Variable or Target
y = pd.DataFrame(df, columns=['gross'])
regression.fit(X, y)
# R-squared
regression.score(X, y)
theta0 = regression.intercept_[0] # y-intercept
theta1 = regression.coef_[0] # slope
r2 = regression.score(X,y) # r-squared
print(f"Y-Intercept (theta0) is {theta0}")
print(f"Slope coefficient(theta1) is {theta1}")
print(f"R-squared is {r2}")
Y-Intercept (theta0) is 7157763.965588003
Slope coefficient(theta1) is [3.19221611]
R-squared is 0.48756712843206695
- Y-intercept (theta0) tells us the estimated revenue for a given budget
- Slope (theta1) tells us that for every extra $1 in the budget, movie revenue increases by $3.19
- R-squared 0.48 means that our model explains about 48% of the variance in movie revenue. That’s actually pretty decent, considering we’ve got the simplest possible model, with only one explanatory variable.
Model Prediction
We just estimated the slope and intercept! Remember that our Linear Model has the following form:
$ REV \hat ENUE = \theta _0 + \theta _1 BUDGET$
budget = 350000000
revenue_estimate = theta0 + regression.coef_[0,0] * budget
revenue_estimate = round(revenue_estimate, -6)
print(f'The estimated revenue for a $350m film is around ${revenue_estimate:.10}.')
The estimated revenue for a $350m film is around $1.124e+09.
So for a $350M we estimate $1.12B
That’s it, thanks for watching!