Before the release of “Star Wars: The Force Awakens”, the team at FiveThirtyEight wanted to answer some questions about the Star Wars franchise. In particular, they wanted to know: which movie is the best in the franchise?
To collect data addressing this question, they surveyed Star Wars fans using the online tool SurveyMonkey. They received 835 total responses, which you can download from their GitHub repository.
For this post, we will explore the data set in a Jupyter notebook. The following code reads the data into a pandas DataFrame.
# Import the required modules
import pandas as pd
import numpy as np
# Read data into python
star_wars = pd.read_csv("star_wars.csv", encoding="ISO-8859-1")
We need to specify an encoding because the file contains some bytes that are not valid in utf-8, the default encoding that pandas assumes when reading a CSV.
You can read more about character encodings on developer Joel Spolsky’s blog.
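To make the encoding issue concrete, here is a minimal sketch using only the standard library. The byte string is a made-up example (it is not taken from the survey file); it only illustrates why bytes written as ISO-8859-1 can fail to decode as utf-8.

```python
# Hypothetical example: "café" stored as ISO-8859-1 (Latin-1) bytes.
raw = "café".encode("ISO-8859-1")  # the "é" becomes the single byte 0xE9

# Decoding the same bytes as utf-8 fails, because 0xE9 on its own
# is not a complete utf-8 sequence.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Decoding with the encoding the bytes were written in works.
text = raw.decode("ISO-8859-1")
print(utf8_ok, text)
```

Passing `encoding="ISO-8859-1"` to `pd.read_csv()` asks pandas to decode the file the same way.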
star_wars.head(10) # Exploring the first ten rows of the dataframe
RespondentID | Have you seen any of the 6 films in the Star Wars franchise? | Do you consider yourself to be a fan of the Star Wars film franchise? | Which of the following Star Wars films have you seen? Please select all that apply. | Unnamed: 4 | Unnamed: 5 | Unnamed: 6 | Unnamed: 7 | Unnamed: 8 | Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film. | Unnamed: 10 | Unnamed: 11 | Unnamed: 12 | Unnamed: 13 | Unnamed: 14 | Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her. | Unnamed: 16 | Unnamed: 17 | Unnamed: 18 | Unnamed: 19 | Unnamed: 20 | Unnamed: 21 | Unnamed: 22 | Unnamed: 23 | Unnamed: 24 | Unnamed: 25 | Unnamed: 26 | Unnamed: 27 | Unnamed: 28 | Which character shot first? | Are you familiar with the Expanded Universe? | Do you consider yourself to be a fan of the Expanded Universe?Âæ | Do you consider yourself to be a fan of the Star Trek franchise? | Gender | Age | Household Income | Education | Location (Census Region) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | nan | Response | Response | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | Han Solo | Luke Skywalker | Princess Leia Organa | Anakin Skywalker | Obi Wan Kenobi | Emperor Palpatine | Darth Vader | Lando Calrissian | Boba Fett | C-3P0 | R2 D2 | Jar Jar Binks | Padme Amidala | Yoda | Response | Response | Response | Response | Response | Response | Response | Response | Response |
1 | 3.29288e+09 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 3 | 2 | 1 | 4 | 5 | 6 | Very favorably | Very favorably | Very favorably | Very favorably | Very favorably | Very favorably | Very favorably | Unfamiliar (N/A) | Unfamiliar (N/A) | Very favorably | Very favorably | Very favorably | Very favorably | Very favorably | I don’t understand this question | Yes | No | No | Male | 18-29 | nan | High school degree | South Atlantic |
2 | 3.29288e+09 | No | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | Yes | Male | 18-29 | $0 - $24,999 | Bachelor degree | West South Central |
3 | 3.29277e+09 | Yes | No | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | nan | nan | nan | 1 | 2 | 3 | 4 | 5 | 6 | Somewhat favorably | Somewhat favorably | Somewhat favorably | Somewhat favorably | Somewhat favorably | Unfamiliar (N/A) | Unfamiliar (N/A) | Unfamiliar (N/A) | Unfamiliar (N/A) | Unfamiliar (N/A) | Unfamiliar (N/A) | Unfamiliar (N/A) | Unfamiliar (N/A) | Unfamiliar (N/A) | I don’t understand this question | No | nan | No | Male | 18-29 | $0 - $24,999 | High school degree | West North Central |
4 | 3.29276e+09 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 5 | 6 | 1 | 2 | 4 | 3 | Very favorably | Very favorably | Very favorably | Very favorably | Very favorably | Somewhat favorably | Very favorably | Somewhat favorably | Somewhat unfavorably | Very favorably | Very favorably | Very favorably | Very favorably | Very favorably | I don’t understand this question | No | nan | Yes | Male | 18-29 | $100,000 - $149,999 | Some college or Associate degree | West North Central |
5 | 3.29273e+09 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 5 | 4 | 6 | 2 | 1 | 3 | Very favorably | Somewhat favorably | Somewhat favorably | Somewhat unfavorably | Very favorably | Very unfavorably | Somewhat favorably | Neither favorably nor unfavorably (neutral) | Very favorably | Somewhat favorably | Somewhat favorably | Very unfavorably | Somewhat favorably | Somewhat favorably | Greedo | Yes | No | No | Male | 18-29 | $100,000 - $149,999 | Some college or Associate degree | West North Central |
6 | 3.29272e+09 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 1 | 4 | 3 | 6 | 5 | 2 | Very favorably | Very favorably | Very favorably | Very favorably | Very favorably | Neither favorably nor unfavorably (neutral) | Very favorably | Neither favorably nor unfavorably (neutral) | Somewhat favorably | Somewhat favorably | Somewhat favorably | Somewhat favorably | Neither favorably nor unfavorably (neutral) | Very favorably | Han | Yes | No | Yes | Male | 18-29 | $25,000 - $49,999 | Bachelor degree | Middle Atlantic |
7 | 3.29268e+09 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 6 | 5 | 4 | 3 | 1 | 2 | Very favorably | Very favorably | Somewhat favorably | Somewhat favorably | Very favorably | Very favorably | Very favorably | Very favorably | Very favorably | Somewhat favorably | Very favorably | Somewhat unfavorably | Somewhat favorably | Very favorably | Han | Yes | No | No | Male | 18-29 | nan | High school degree | East North Central |
8 | 3.29266e+09 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 4 | 5 | 6 | 3 | 2 | 1 | Very favorably | Somewhat favorably | Very favorably | Neither favorably nor unfavorably (neutral) | Very favorably | Very unfavorably | Somewhat unfavorably | Neither favorably nor unfavorably (neutral) | Somewhat favorably | Somewhat favorably | Somewhat favorably | Very unfavorably | Somewhat unfavorably | Very favorably | Han | No | nan | Yes | Male | 18-29 | nan | High school degree | South Atlantic |
9 | 3.29265e+09 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 5 | 4 | 6 | 2 | 1 | 3 | Very favorably | Somewhat unfavorably | Somewhat favorably | Somewhat favorably | Somewhat favorably | Very favorably | Very favorably | Very favorably | Very favorably | Neither favorably nor unfavorably (neutral) | Somewhat favorably | Very unfavorably | Somewhat unfavorably | Somewhat favorably | Han | No | nan | No | Male | 18-29 | $0 - $24,999 | Some college or Associate degree | South Atlantic |
The data has several columns, including:
- RespondentID - Anonymized ID for the respondent
- Gender - The respondent’s gender
- Age - The respondent’s age
- Household Income - The respondent’s income
- Education - The respondent’s education level
- Location (Census Region) - The respondent’s location
- Have you seen any of the 6 films in the Star Wars franchise? - Has a Yes or No response
- Do you consider yourself to be a fan of the Star Wars film franchise? - Has a Yes or No response
There are several other columns containing answers to questions about the Star Wars movies. For some questions, such as Which of the following Star Wars films have you seen? Please select all that apply., respondents had to check multiple boxes. This type of data is difficult to represent in a typical columnar format. As a result, this data set needs a lot of cleaning.
The first thing we will do is remove any rows where RespondentID is NaN. To accomplish this, we will use the pandas.notnull() function.
star_wars = star_wars[pd.notnull(star_wars['RespondentID'])]
star_wars.head(10)
RespondentID | Have you seen any of the 6 films in the Star Wars franchise? | Do you consider yourself to be a fan of the Star Wars film franchise? | Which of the following Star Wars films have you seen? Please select all that apply. | Unnamed: 4 | Unnamed: 5 | Unnamed: 6 | Unnamed: 7 | Unnamed: 8 | Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film. | Unnamed: 10 | Unnamed: 11 | Unnamed: 12 | Unnamed: 13 | Unnamed: 14 | Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her. | Unnamed: 16 | Unnamed: 17 | Unnamed: 18 | Unnamed: 19 | Unnamed: 20 | Unnamed: 21 | Unnamed: 22 | Unnamed: 23 | Unnamed: 24 | Unnamed: 25 | Unnamed: 26 | Unnamed: 27 | Unnamed: 28 | Which character shot first? | Are you familiar with the Expanded Universe? | Do you consider yourself to be a fan of the Expanded Universe?Âæ | Do you consider yourself to be a fan of the Star Trek franchise? | Gender | Age | Household Income | Education | Location (Census Region) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 3.29288e+09 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 3 | 2 | 1 | 4 | 5 | 6 | Very favorably | Very favorably | Very favorably | Very favorably | Very favorably | Very favorably | Very favorably | Unfamiliar (N/A) | Unfamiliar (N/A) | Very favorably | Very favorably | Very favorably | Very favorably | Very favorably | I don’t understand this question | Yes | No | No | Male | 18-29 | nan | High school degree | South Atlantic |
2 | 3.29288e+09 | No | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | Yes | Male | 18-29 | $0 - $24,999 | Bachelor degree | West South Central |
3 | 3.29277e+09 | Yes | No | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | nan | nan | nan | 1 | 2 | 3 | 4 | 5 | 6 | Somewhat favorably | Somewhat favorably | Somewhat favorably | Somewhat favorably | Somewhat favorably | Unfamiliar (N/A) | Unfamiliar (N/A) | Unfamiliar (N/A) | Unfamiliar (N/A) | Unfamiliar (N/A) | Unfamiliar (N/A) | Unfamiliar (N/A) | Unfamiliar (N/A) | Unfamiliar (N/A) | I don’t understand this question | No | nan | No | Male | 18-29 | $0 - $24,999 | High school degree | West North Central |
4 | 3.29276e+09 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 5 | 6 | 1 | 2 | 4 | 3 | Very favorably | Very favorably | Very favorably | Very favorably | Very favorably | Somewhat favorably | Very favorably | Somewhat favorably | Somewhat unfavorably | Very favorably | Very favorably | Very favorably | Very favorably | Very favorably | I don’t understand this question | No | nan | Yes | Male | 18-29 | $100,000 - $149,999 | Some college or Associate degree | West North Central |
5 | 3.29273e+09 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 5 | 4 | 6 | 2 | 1 | 3 | Very favorably | Somewhat favorably | Somewhat favorably | Somewhat unfavorably | Very favorably | Very unfavorably | Somewhat favorably | Neither favorably nor unfavorably (neutral) | Very favorably | Somewhat favorably | Somewhat favorably | Very unfavorably | Somewhat favorably | Somewhat favorably | Greedo | Yes | No | No | Male | 18-29 | $100,000 - $149,999 | Some college or Associate degree | West North Central |
6 | 3.29272e+09 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 1 | 4 | 3 | 6 | 5 | 2 | Very favorably | Very favorably | Very favorably | Very favorably | Very favorably | Neither favorably nor unfavorably (neutral) | Very favorably | Neither favorably nor unfavorably (neutral) | Somewhat favorably | Somewhat favorably | Somewhat favorably | Somewhat favorably | Neither favorably nor unfavorably (neutral) | Very favorably | Han | Yes | No | Yes | Male | 18-29 | $25,000 - $49,999 | Bachelor degree | Middle Atlantic |
7 | 3.29268e+09 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 6 | 5 | 4 | 3 | 1 | 2 | Very favorably | Very favorably | Somewhat favorably | Somewhat favorably | Very favorably | Very favorably | Very favorably | Very favorably | Very favorably | Somewhat favorably | Very favorably | Somewhat unfavorably | Somewhat favorably | Very favorably | Han | Yes | No | No | Male | 18-29 | nan | High school degree | East North Central |
8 | 3.29266e+09 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 4 | 5 | 6 | 3 | 2 | 1 | Very favorably | Somewhat favorably | Very favorably | Neither favorably nor unfavorably (neutral) | Very favorably | Very unfavorably | Somewhat unfavorably | Neither favorably nor unfavorably (neutral) | Somewhat favorably | Somewhat favorably | Somewhat favorably | Very unfavorably | Somewhat unfavorably | Very favorably | Han | No | nan | Yes | Male | 18-29 | nan | High school degree | South Atlantic |
9 | 3.29265e+09 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 5 | 4 | 6 | 2 | 1 | 3 | Very favorably | Somewhat unfavorably | Somewhat favorably | Somewhat favorably | Somewhat favorably | Very favorably | Very favorably | Very favorably | Very favorably | Neither favorably nor unfavorably (neutral) | Somewhat favorably | Very unfavorably | Somewhat unfavorably | Somewhat favorably | Han | No | nan | No | Male | 18-29 | $0 - $24,999 | Some college or Associate degree | South Atlantic |
10 | 3.29264e+09 | Yes | No | nan | Star Wars: Episode II Attack of the Clones | nan | nan | nan | nan | 1 | 2 | 3 | 4 | 5 | 6 | Neither favorably nor unfavorably (neutral) | Very favorably | Very favorably | Very favorably | Very favorably | Somewhat unfavorably | Very favorably | Somewhat unfavorably | Somewhat unfavorably | Very favorably | Very favorably | Very favorably | Somewhat unfavorably | Very favorably | I don’t understand this question | No | nan | No | Male | 18-29 | $25,000 - $49,999 | Some college or Associate degree | Pacific |
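The row filter used above can be tried on a small frame. The sketch below uses a made-up three-row table (the column values are illustrative, not survey data) and also shows `DataFrame.dropna(subset=...)`, an equivalent alternative that the original code does not use.

```python
import numpy as np
import pandas as pd

# Toy frame: the first row mimics the extra header row with a missing ID.
toy = pd.DataFrame({
    "RespondentID": [np.nan, 101.0, 102.0],
    "Gender": ["Response", "Male", "Female"],
})

# Keep only rows whose RespondentID is not null (same pattern as above).
cleaned = toy[pd.notnull(toy["RespondentID"])]

# Alternative with the same result: drop rows with a null in that column.
also_cleaned = toy.dropna(subset=["RespondentID"])

print(cleaned)
```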
Let’s have a look at the following two columns in the dataset:
- Have you seen any of the 6 films in the Star Wars franchise?
- Do you consider yourself to be a fan of the Star Wars film franchise?
Both represent Yes/No questions. They can also be NaN where a respondent chooses not to answer a question. To see all of the unique values in a column, along with the number of times each value appears, we can use pandas.Series.value_counts().
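As a minimal sketch of that call, the example below counts the values of a small made-up Series (not the survey column). Passing `dropna=False` includes missing answers in the tally, which `value_counts()` leaves out by default.

```python
import numpy as np
import pandas as pd

# Made-up answers standing in for one Yes/No survey column.
answers = pd.Series(["Yes", "No", "Yes", np.nan, "Yes"])

# Count each distinct value, including NaN.
counts = answers.value_counts(dropna=False)
print(counts)

# On the real data the same call would be, for example:
# star_wars['Have you seen any of the 6 films in the Star Wars franchise?'].value_counts(dropna=False)
```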
Both columns are currently string type, because the main values they contain are Yes and No. We can make the data a bit easier to analyze down the road by converting each column to a Boolean having only the values True, False and NaN. Booleans are easier to work with because we can select the rows that are True or False without having to do a string comparison.
We can use the pandas.Series.map() method on series objects to perform the conversion.
# A dictionary to define the mapping from each value in the series to a new value
yes_no = {
"Yes": True,
"No": False
}
# Convert both the columns mentioned above to boolean type using series.map function
star_wars['Have you seen any of the 6 films in the Star Wars franchise?'] = star_wars['Have you seen any of the 6 films in the Star Wars franchise?'].map(yes_no)
star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?'] = star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?'].map(yes_no)
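To see what the mapping does to individual values, here is a small sketch on a made-up Series. It reuses the `yes_no` dictionary from above; the `answers` data is illustrative only. Values that have no key in the dictionary, including NaN, come out as NaN.

```python
import numpy as np
import pandas as pd

yes_no = {"Yes": True, "No": False}

# Made-up answers, including one missing response.
answers = pd.Series(["Yes", "No", np.nan, "Yes"])

# Each value is looked up in the dictionary; NaN has no key and stays NaN.
converted = answers.map(yes_no)
print(converted)

# Selecting the True rows no longer needs a string comparison.
seen = converted[converted == True]
```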
The next six columns represent a single checkbox question. The respondent checked off a series of boxes in response to the question, Which of the following Star Wars films have you seen? Please select all that apply.
The columns for this question are:
- Which of the following Star Wars films have you seen? Please select all that apply. - Whether or not the respondent saw Star Wars: Episode I The Phantom Menace.
- Unnamed: 4 - Whether or not the respondent saw Star Wars: Episode II Attack of the Clones.
- Unnamed: 5 - Whether or not the respondent saw Star Wars: Episode III Revenge of the Sith.
- Unnamed: 6 - Whether or not the respondent saw Star Wars: Episode IV A New Hope.
- Unnamed: 7 - Whether or not the respondent saw Star Wars: Episode V The Empire Strikes Back.
- Unnamed: 8 - Whether or not the respondent saw Star Wars: Episode VI Return of the Jedi.
For each of these columns, if the value in a cell is the name of the movie, that means the respondent saw the movie. If the value is NaN, the respondent either didn’t answer or didn’t see the movie. We’ll assume that they didn’t see the movie.
We’ll need to convert each of these columns to a Boolean, then rename the column to something more intuitive. We can convert the values the same way we did earlier, except that we’ll need to include the movie title and NaN in the mapping dictionary.
# Mapping dictionary
movie_mapping = {
"Star Wars: Episode I The Phantom Menace": True,
np.nan: False,
"Star Wars: Episode II Attack of the Clones": True,
"Star Wars: Episode III Revenge of the Sith": True,
"Star Wars: Episode IV A New Hope": True,
"Star Wars: Episode V The Empire Strikes Back": True,
"Star Wars: Episode VI Return of the Jedi": True
}
# Converting the values in a loop using the mapping dictionary
for col in star_wars.columns[3:9]:
    star_wars[col] = star_wars[col].map(movie_mapping)
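Since any non-null cell in these columns means the film was seen, the same conversion could also be sketched with `DataFrame.notna()`. This is an alternative to the dictionary approach above, not the method the post uses, and the toy frame and its column names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame: a movie title means "seen", NaN means "not seen or no answer".
toy = pd.DataFrame({
    "seen_a": ["Star Wars: Episode I The Phantom Menace", np.nan,
               "Star Wars: Episode I The Phantom Menace"],
    "seen_b": [np.nan, np.nan, "Star Wars: Episode II Attack of the Clones"],
})

# True wherever a cell holds any value, False wherever it is NaN.
as_bool = toy.notna()
print(as_bool)
```

One difference worth keeping in mind: `notna()` treats every non-null string as True, while the dictionary mapping would turn an unexpected string into NaN, which can help surface dirty values.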
After calling the map() method, we need to rename the columns to better reflect what they represent. To accomplish this, we will use the pandas.DataFrame.rename() method. The df.rename() method works a lot like map(). We pass it a dictionary that maps the current column names to new ones:
star_wars = star_wars.rename(columns={
'Which of the following Star Wars films have you seen? Please select all that apply.': 'seen_1',
'Unnamed: 4': 'seen_2', 'Unnamed: 5': 'seen_3', 'Unnamed: 6' : 'seen_4', 'Unnamed: 7' : 'seen_5', 'Unnamed: 8' : 'seen_6'})
The pandas.DataFrame.rename() method will only rename the columns we specify in the dictionary, and won’t change the names of other columns. The code above will rename the Which of the following Star Wars films have you seen? Please select all that apply. column to seen_1, Unnamed: 4 to seen_2, and so on.
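When the new names follow a pattern, the rename dictionary can also be built programmatically instead of typed out. The sketch below does this on an empty toy frame with shortened, made-up column names; it is one possible variation, not the code used in this post.

```python
import pandas as pd

# Empty toy frame with placeholder column names.
toy = pd.DataFrame(columns=["RespondentID", "Which films?", "Unnamed: 4", "Unnamed: 5"])

# Map each of the selected columns to seen_1, seen_2, ...
old_names = toy.columns[1:4]
new_names = {old: f"seen_{i}" for i, old in enumerate(old_names, start=1)}

# Columns missing from the dictionary (RespondentID here) keep their names.
renamed = toy.rename(columns=new_names)
print(list(renamed.columns))
```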
The next six columns ask the respondent to rank the Star Wars movies in order of preference, from most favorite to least favorite. 1 means the film was the most favorite, and 6 means it was the least favorite. Each of the following columns can contain the value 1, 2, 3, 4, 5, 6, or NaN:
- Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film. - How much the respondent liked Star Wars: Episode I The Phantom Menace.
- Unnamed: 10 - How much the respondent liked Star Wars: Episode II Attack of the Clones.
- Unnamed: 11 - How much the respondent liked Star Wars: Episode III Revenge of the Sith.
- Unnamed: 12 - How much the respondent liked Star Wars: Episode IV A New Hope.
- Unnamed: 13 - How much the respondent liked Star Wars: Episode V The Empire Strikes Back.
- Unnamed: 14 - How much the respondent liked Star Wars: Episode VI Return of the Jedi.
Fortunately, these columns don’t require a lot of cleanup. We’ll need to convert each column to a numeric type, though, then rename the columns so that we can tell what they represent more easily.
We can do the numeric conversions using pandas.DataFrame.astype() on dataframes.
star_wars[star_wars.columns[9:15]] = star_wars[star_wars.columns[9:15]].astype(float)
The code above will convert column 9 up to but not including column 15 to the float data type. We will also give each column a more descriptive name that better reflects what it represents.
star_wars = star_wars.rename(columns = {'Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.' : 'ranking_1', 'Unnamed: 10' : 'ranking_2', 'Unnamed: 11' : 'ranking_3', 'Unnamed: 12' : 'ranking_4', 'Unnamed: 13' : 'ranking_5', 'Unnamed: 14' : 'ranking_6'})
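The float conversion can be checked on a small frame. The sketch below uses made-up string rankings with missing values; float is a convenient target type because it can hold NaN. It also shows `pd.to_numeric(..., errors="coerce")` as an optional alternative for columns that might contain non-numeric text, which is an assumption about dirtier data rather than something this data set needed.

```python
import numpy as np
import pandas as pd

# Made-up rankings stored as strings, with missing answers.
toy = pd.DataFrame({
    "rank_a": ["3", "1", np.nan],
    "rank_b": ["2", np.nan, "6"],
})

# Convert every column to float; NaN stays NaN.
as_float = toy.astype(float)
print(as_float.dtypes)

# Alternative for messy input: unparseable values become NaN instead of raising.
coerced = pd.to_numeric(pd.Series(["3", "n/a"]), errors="coerce")
```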
Now that we’ve cleaned up the ranking columns, we can find the highest-ranked movie more quickly. To do this, we take the mean of each of the ranking columns using the pandas.DataFrame.mean() method and assign it to rank. Because 1 is the best possible rank, a lower mean indicates a better-liked movie.
rank = star_wars[star_wars.columns[9:15]].mean()
We will visualize the mean ranking of each movie using a bar chart. To plot this graph, we will use an open source graphing library called plotly. The link provides a brief description of plotly and how to install it for Jupyter notebook.
# Loading the required module.
import plotly.graph_objects as go
# Set the colors of the bars
colors = ['lightslategray',] * 6
# Highlight the bar that has the lowest ranking (or is the most well liked)
colors[np.argmin(rank.values)] = 'crimson'
fig = go.Figure(data=[go.Bar(
    x=rank.index,  # Set the x-axis labels (Index.base is not available in recent pandas versions)
    y=np.around(rank.values, 1),  # Set the y-axis values, rounded to one decimal place
marker_color=colors # marker color can be a single color value or an iterable
)], layout=go.Layout(
xaxis=dict(showgrid=False))) # Remove the vertical grids
# Set the title for the plot
fig.update_layout(title_text='Ranking of Star Wars Movies')
# Set the number of ticks on the y-axis
fig.update_yaxes(nticks=5)
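The highlighting logic in the plot can be checked on its own, without plotly. The sketch below uses made-up mean rankings (not the survey results) to show how `np.argmin` picks the position of the bar to color, and how `Series.idxmin()` returns the matching label.

```python
import numpy as np
import pandas as pd

# Made-up mean rankings, for illustration only (lower is better).
rank = pd.Series(
    [3.0, 4.0, 5.0, 2.5, 1.5, 2.0],
    index=[f"ranking_{i}" for i in range(1, 7)],
)

# One default color per bar, then recolor the bar with the lowest mean.
colors = ["lightslategray"] * len(rank)
best = int(np.argmin(rank.values))  # position of the smallest mean
colors[best] = "crimson"

best_label = rank.idxmin()  # label of the smallest mean
print(best, best_label, colors)
```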
So far, we’ve cleaned up the data, renamed several columns, and computed the average ranking of each movie. As we suspected, it looks like the “original” movies are ranked much more highly than the newer ones. The $5^{th}$ movie in the franchise, Star Wars: Episode V The Empire Strikes Back, has the lowest mean ranking, which makes it the most highly ranked movie.
Earlier in this project, we cleaned up the seen columns and converted their values to the Boolean type. When we call methods like pandas.DataFrame.sum() or mean(), they treat Booleans like integers. They consider True as 1, and False as 0. That means we can figure out how many people have seen each movie just by taking the sum of the column. As we did for the rankings, let us compute the sum of each seen column and assign it to most_seen.
most_seen = star_wars[star_wars.columns[3:9]].sum()
# Set the colors of the bars
colors = ['lightslategray',] * 6
# Highlight the bar with the highest view count (the most-seen movie)
colors[np.argmax(most_seen.values)] = 'crimson'
fig = go.Figure(data=[go.Bar(
    x=most_seen.index,  # Set the x-axis labels
    y=most_seen.values,  # Set the y-axis values (number of respondents who saw each movie)
    marker_color=colors  # marker color can be a single color value or an iterable
)], layout=go.Layout(
    xaxis=dict(showgrid=False)))  # Remove the vertical grids
# Set the title for the plot
fig.update_layout(title_text='Views of each movie in the Star Wars Franchise')
# Set the number of ticks on the y-axis
fig.update_yaxes(nticks=5)
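The Boolean arithmetic behind most_seen can be sketched on a tiny made-up frame: `sum()` counts the True values in each column, and `mean()` gives the share of True values.

```python
import pandas as pd

# Made-up Boolean columns for four respondents.
seen = pd.DataFrame({
    "seen_1": [True, False, True, True],
    "seen_2": [False, False, True, False],
})

totals = seen.sum()   # True counts as 1, False as 0
shares = seen.mean()  # fraction of respondents with True
print(totals)
print(shares)
```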
It appears that the original movies were seen by more respondents than the newer movies in the franchise. This reinforces what we saw in the rankings, where the original movies seem to be more popular. Next, we will split the dataframe into two groups based on a binary column such as Gender by creating one subset for each of its values.
males = star_wars[star_wars["Gender"] == "Male"]
females = star_wars[star_wars["Gender"] == "Female"]
The subsets will allow us to compute the most viewed movie, the highest-ranked movie, and other statistics separately for each group.
males_rank = males[males.columns[9:15]].mean()
females_rank = females[females.columns[9:15]].mean()
males_most_seen = males[males.columns[3:9]].sum()
females_most_seen = females[females.columns[3:9]].sum()
fig = go.Figure(data=[
    go.Bar(name='Male', x=males_rank.index, y=np.around(males_rank.values, 1)),
    go.Bar(name='Female', x=females_rank.index, y=np.around(females_rank.values, 1))
], layout=go.Layout(
xaxis=dict(showgrid=False),
yaxis = dict(showgrid=False)))
fig.update_layout(title_text='Ranking of Star Wars Movies')
# Change the bar mode
fig.update_layout(barmode='group')
fig.update_yaxes(nticks=5)
fig.show()
fig = go.Figure(data=[
    go.Bar(name='Male', x=males_most_seen.index, y=males_most_seen.values),
    go.Bar(name='Female', x=females_most_seen.index, y=females_most_seen.values)
], layout=go.Layout(
xaxis=dict(showgrid=False),
yaxis = dict(showgrid=False)))
fig.update_layout(title_text='Views of each movie in the Star Wars Franchise')
# Change the bar mode
fig.update_layout(barmode='group')
fig.update_yaxes(nticks=5)
fig.show()
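The per-gender statistics could also be computed in one step with `DataFrame.groupby()` instead of building the subsets by hand. The sketch below is an alternative to the approach used above, shown on a made-up frame with one ranking column and one seen column; rows with a missing Gender are left out of the groups by default.

```python
import numpy as np
import pandas as pd

# Made-up data, for illustration only.
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female", np.nan],
    "ranking_1": [4.0, 2.0, 6.0, 3.0, 1.0],
    "seen_1": [True, True, False, True, True],
})

# Mean ranking and number of views per gender.
by_gender_rank = toy.groupby("Gender")["ranking_1"].mean()
by_gender_seen = toy.groupby("Gender")["seen_1"].sum()
print(by_gender_rank)
print(by_gender_seen)
```

The same pattern would extend to segmenting by other columns, such as Age or Education.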
Interestingly, more men watched episodes 1-3, but men liked them far less than women did. We could also segment this dataframe based on other columns to find interesting patterns. Another path to explore would be to determine which characters the respondents liked or disliked the most. However, I will save these tasks for the future.