Whether this is the tour the force, a great number or peculiar gadgets or glasses of cold vodka Martni, there are definitely many aspects that constitute a great James Bond movie. Having watched A spy who loved me (1977) recently, I’ve begun to wonder if that could be somehow figured out with the help of machine learning. This is my small summer data science project that aims to predict the average IMDB score for the newest 007 movie No time to die and see what contributes to the score.
More formally, we will be looking at a regression problem for a small dataset (24 items) and training a random forest with cross validation. Then we will examine the break down plot for a previously unseen example - the new movie. I will be using caret package for the model and DALEX for explanation.
data(JamesBond)
df <- JamesBond %>% select(-US_Gross, -World_Gross, -US_Adj, -Budget, -Avg_User_Rtn_Tom) %>%
relocate(Avg_User_IMDB, .after = last_col())
I will be using the James Bond dataset from the HoRM package. See resources
Here we have the data of the previous 24 movies. I’m going to use 12 ouf
of the 18 available features, namely: Year, Movie, Bond, World_Adj,
Budget_Adj, Film_Length, Conquests, Martinis, BJB, Kills_Bond,
Kills_Others, Top_100, Avg_User_IMDB.
To clear out the feature names:
- World_Adj The film’s 2013-adjusted worldwide gross
(in 1000’s of U.S. dollars).
- Budget_Adj The film’s 2013-adjusted budget (in 1000’s
of U.S. dollars).
- Avg_User_IMDB The average user rating on IMDB
(www.imdb.com).
- Conquests The number of “conquests” by Bond in the
film.
- Martinis The number of martinis Bond drank in the
film.
- BJB The number of times Bond stated “Bond. James
Bond.” in the movie.
- Kills_Bond The number of people killed by Bond.
- Kills_Others The number of people killed in the film
by people other than Bond.
- Top_100 An indicator where a value of 1 means the
title song within the top 100 on the UK Singles Chart and the U.S.
Billboard Hot 100 and a value of 0 means it did not.
kable(df)
Year | Movie | Bond | World_Adj | Budget_Adj | Film_Length | Conquests | Martinis | BJB | Kills_Bond | Kills_Others | Top_100 | Avg_User_IMDB |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1962 | Dr. No | Sean Connery | 457928 | 7688 | 110 | 3 | 2 | 1 | 4 | 8 | 0 | 7.3 |
1963 | From Russia with Love | Sean Connery | 598624 | 15174 | 115 | 4 | 0 | 0 | 11 | 16 | 0 | 7.5 |
1964 | Goldfinger | Sean Connery | 935404 | 22468 | 110 | 2 | 1 | 2 | 9 | 68 | 1 | 7.8 |
1965 | Thunderball | Sean Connery | 1040693 | 66333 | 130 | 3 | 0 | 0 | 20 | 90 | 1 | 7.0 |
1967 | You Only Live Twice | Sean Connery | 775740 | 66035 | 117 | 3 | 1 | 0 | 21 | 175 | 1 | 6.9 |
1969 | On Her Majesty’s Secret Service | George Lazenby | 518736 | 50608 | 142 | 3 | 1 | 2 | 5 | 37 | 0 | 6.8 |
1971 | Diamonds Are Forever | Sean Connery | 664969 | 41274 | 120 | 1 | 0 | 1 | 7 | 42 | 1 | 6.7 |
1973 | Live and Let Die | Roger Moore | 846046 | 36603 | 121 | 3 | 0 | 1 | 8 | 5 | 1 | 6.8 |
1974 | The Man with the Golden Gun | Roger Moore | 459623 | 32965 | 125 | 2 | 0 | 2 | 1 | 5 | 0 | 6.7 |
1977 | The Spy Who Loved Me | Roger Moore | 710290 | 53636 | 125 | 3 | 1 | 1 | 31 | 116 | 1 | 7.1 |
1979 | Moonraker | Roger Moore | 672514 | 99134 | 126 | 3 | 1 | 1 | 12 | 69 | 0 | 6.2 |
1981 | For Your Eyes Only | Roger Moore | 498812 | 71514 | 127 | 2 | 0 | 2 | 18 | 36 | 1 | 6.8 |
1983 | Octopussy | Roger Moore | 437059 | 64102 | 131 | 2 | 0 | 1 | 15 | 43 | 1 | 6.5 |
1985 | A View to a Kill | Roger Moore | 329322 | 64730 | 131 | 4 | 0 | 2 | 5 | 57 | 1 | 6.2 |
1987 | The Living Daylights | Timothy Dalton | 390758 | 81749 | 130 | 2 | 2 | 1 | 13 | 29 | 0 | 6.7 |
1989 | License to Kill | Timothy Dalton | 292392 | 78637 | 133 | 2 | 1 | 1 | 10 | 13 | 0 | 6.5 |
1995 | Goldeneye | Pierce Brosnan | 542985 | 91404 | 130 | 2 | 1 | 1 | 47 | 25 | 0 | 7.2 |
1997 | Tomorrow Never Dies | Pierce Brosnan | 491098 | 159117 | 119 | 3 | 1 | 1 | 30 | 24 | 0 | 6.4 |
1999 | The World Is Not Enough | Pierce Brosnan | 504091 | 188130 | 128 | 3 | 1 | 2 | 27 | 43 | 0 | 6.3 |
2002 | Die Another Day | Pierce Brosnan | 557433 | 183255 | 133 | 2 | 2 | 1 | 31 | 20 | 1 | 6.0 |
2006 | Casino Royale | Daniel Craig | 686784 | 117465 | 144 | 2 | 3 | 1 | 11 | 11 | 1 | 7.9 |
2008 | Quantum of Solace | Daniel Craig | 638035 | 248014 | 106 | 1 | 6 | 0 | 16 | 15 | 1 | 6.7 |
2012 | Skyfall | Daniel Craig | 1120980 | 202240 | 143 | 3 | 1 | 1 | 26 | 26 | 1 | 7.8 |
2015 | Spectre | Daniel Craig | 864553 | 240803 | 148 | 3 | 1 | 1 | 30 | 205 | 1 | 6.8 |
In this section we will look at some interesting features. Hover over the bullet marker to get extra information about the movie.
plot_df <- JamesBond
plot_df$Movie <- factor (plot_df$Movie , levels = plot_df$Movie) # we want to preserve the dataframe's chronological movie ordering
m <- list(
l = 100,
r = 50,
b = 80,
t = 150,
pad = 2
)
fig <- plot_ly(plot_df, x = ~Year, y = ~Martinis, type= 'scatter', mode= 'lines+markers',
line = list(color = 'rgba(49,130,189, 1)', width = 4),
marker = list(color = 'rgba(49,130,189, 1)', size = 12),
hoverinfo = 'text',
text = ~paste(Movie, Year)) %>%
layout(title = "Martini drinks Bond consumes in the movie",
font=list(size = 20),
margin = m,
xaxis = list(
showgrid = FALSE,
zeroline = FALSE,
tickfont = list(size = 16)
),
yaxis = list(
showgrid = FALSE,
zeroline = TRUE
))
fig
plot_df <- JamesBond
plot_df$Movie <- factor (plot_df$Movie , levels = plot_df$Movie)
fig2 <- plot_ly(plot_df, x = ~Year, y = ~Kills_Bond, name = 'By Bond', type= 'scatter', mode= 'lines+markers',
line = list(color = 'rgba(166,4,4,1)', width = 4),
marker = list(color = 'rgba(166,4,4,1)', size = 12),
hoverinfo = 'text',
text = ~paste(Movie, Year, "<br>Kills:", Kills_Bond)) %>%
add_trace(y = ~Kills_Others, type = 'scatter', mode= 'lines+markers',
name = "By Others",
line = list(color = 'rgba(192,172,182,1)', width = 4),
marker = list(color = 'rgba(192,172,182,1)', size = 12),
hoverinfo = 'text',
text = ~paste(Movie, Year, "<br>Kills:", Kills_Others)) %>%
layout(title = "Kills in 007 movies",
font=list(size = 20),
margin = m,
xaxis = list(
showgrid = FALSE,
zeroline = FALSE,
tickfont = list(size = 16)
),
yaxis = list(
title.text = '',
showgrid = FALSE,
zeroline = TRUE
))
fig2
The most correlated feature with the average IMDB score is the World gross, which should be expected. It is interesting to note that Kills, Film Length and Budget are growing in the newer movies.
fig_corr <- JamesBond %>% select(-US_Gross, -World_Gross, -US_Adj, -Budget, -Avg_User_Rtn_Tom, -Movie, -Bond) %>%
relocate(Avg_User_IMDB, .after = last_col()) %>%
cor() %>%
ggcorrplot(method = 'square', type = 'upper', outline.color = 'white', lab_size = 8,
title= "Feature Correlations",
legend.title = "Corelation",
tl.cex = 18,
lab=TRUE) +
theme( text = element_text(size = 22))
fig_corr
It is time to train our regressor. We will use random forest with cross-validation. Next, we will create a DALEX explainer to examine the Feature Importance.
data(JamesBond)
df <- JamesBond %>% select(-Movie, -US_Gross, -World_Gross, -US_Adj, -Budget, -Avg_User_Rtn_Tom) %>%
relocate(Avg_User_IMDB, .after = last_col())
train_control <- trainControl(method = "cv", number = 8, savePredictions = "all")
model_rf <- train(Avg_User_IMDB~. , data = df, method = "rf", ntree = 100, trControl = train_control)
explainer_rf <- DALEX::explain(model = model_rf,
label = "rf",
verbose = FALSE,
y = df$Avg_User_IMDB
)
print(model_performance(explainer_rf))
## Measures for: regression
## mse : 0.05157403
## rmse : 0.2270992
## r2 : 0.7990078
## mad : 0.1360167
##
## Residuals:
## 0% 10% 20% 30% 40% 50%
## -0.38593333 -0.30377500 -0.19552000 -0.12311000 -0.07096667 -0.03891667
## 60% 70% 80% 90% 100%
## -0.01014000 0.07145333 0.13466333 0.31706333 0.49035000
plot(model_parts(explainer_rf, loss_function = loss_root_mean_square)) + theme(text = element_text(size = 20))
I found the necessary statistics from No time to die (2021). As some were hard to find, I assumed the values of Martinis and BJBs. The gross and budget were adjusted to 2013 for consistency:
no_time_to_die <- data.frame(2021, "Daniel Craig", 608510000, 214930000, 163, 0, 1, 1, 66, 109, 1, 0)
colnames(no_time_to_die) <- colnames(df)
df <- rbind(df, no_time_to_die)
kable(no_time_to_die)
Year | Bond | World_Adj | Budget_Adj | Film_Length | Conquests | Martinis | BJB | Kills_Bond | Kills_Others | Top_100 | Avg_User_IMDB |
---|---|---|---|---|---|---|---|---|---|---|---|
2021 | Daniel Craig | 608510000 | 214930000 | 163 | 0 | 1 | 1 | 66 | 109 | 1 | 0 |
predict(explainer_rf, tail(df, n=1))
## 25
## 7.347983
pb = predict_parts(explainer = explainer_rf, tail(df, n=1), type= "break_down")
plot(pb)
Let’s examine the contribution of each feature of our example, the 25th 007 movie. What influances the final score the most is the world gross (+), year (+), length (+) and budget (-). My favorite features, introductions and Martinis, are unfortunately not as significant, but they are an advantage (who would have thought!).
We are very close to the true average IMDB rating! (https://www.imdb.com/title/tt2382320/, August 2022).
Remark: This is just a toy example with a small dataset and it should be further tested on more examples to asses the model’s quality. Though it is definitely not enough to consider the rating of a movie based on the number of drinks consumed or main character’s introductions (even if this is Bond, James Bond), I hope this project offers quality entertainment and pleasing visualizations.
From Poland with love,
Kacper Dobek
https://github.com/kapiblue
Inspiration: https://betaandbit.github.io/RML/
James Bond R dataset https://search.r-project.org/CRAN/refmans/HoRM/html/JamesBond.html
How to use DALEX with caret https://htmlpreview.github.io/?https://github.com/ModelOriented/DALEX-docs/blob/master/vignettes/DALEX_caret.html
R Markdown Cookbook https://bookdown.org/yihui/rmarkdown-cookbook/
EMA book https://ema.drwhy.ai/
Daniel Craig’s retrospective https://youtu.be/2oZdJrph3RA