Data Preprocessing using pandas and Data Visualization with matplotlib

Cleaning and preprocessing are arguably the most important steps in any data analysis or machine learning workflow. Here I will go through some common transformations and preprocessing procedures in Python using pandas; similar results can be achieved with the tidyverse in R.

Taking a look at the Dataset

This dataset was obtained from Kaggle. It contains 822 rows with information about popular movies, including title, content rating, genre, duration, actors, and star rating. We're going to focus heavily on cleaning the actors_list column in this project.
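A first look at the data can be sketched roughly as follows. The three sample rows and the column names (star_rating, title, content_rating, genre, duration) are assumptions based on the fields described above; only actors_list is named in this post. In practice you would pass the path of the downloaded CSV to pd.read_csv instead of the inline sample.

```python
import io

import pandas as pd

# Hypothetical three-row sample standing in for the Kaggle CSV.
# Column names other than actors_list are assumptions.
sample_csv = io.StringIO('''star_rating,title,content_rating,genre,duration,actors_list
9.3,The Shawshank Redemption,R,Crime,142,"[u'Tim Robbins', u'Morgan Freeman', u'Bob Gunton']"
9.2,The Godfather,R,Crime,175,"[u'Marlon Brando', u'Al Pacino', u'James Caan']"
8.9,Pulp Fiction,R,Crime,154,"[u'John Travolta', u'Uma Thurman', u'Samuel L. Jackson']"
''')

df = pd.read_csv(sample_csv)

print(df.shape)         # (rows, columns)
print(df.dtypes)        # which columns are numeric and which are text
print(df.isna().sum())  # missing values per column
print(df.head())        # first few rows
```

Checking the shape, dtypes, and missing values up front makes it easier to spot which columns need cleaning before any plotting.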

Looking Closer at the actors_list column

Taking a closer look at the actors_list column, I noticed some odd patterns. Each name is wrapped in quotes ('name') and preceded by a u (u'Morgan Freeman'). Let's go ahead and remove these characters as a first step in cleaning up the list.
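One way to strip these characters is with pandas string methods and a couple of regular expressions. This is a minimal sketch on a one-row sample, not necessarily the exact code used for this project; removing the square brackets as well is my assumption about the desired end result.

```python
import pandas as pd

# One-row sample in the raw format described above.
df = pd.DataFrame({
    "title": ["The Shawshank Redemption"],
    "actors_list": ["[u'Tim Robbins', u'Morgan Freeman', u'Bob Gunton']"],
})

df["actors_list"] = (
    df["actors_list"]
    # Drop a u only when it starts a token and sits directly before a quote.
    .str.replace(r"\bu(?=['\"])", "", regex=True)
    # Drop the quotes and the surrounding square brackets.
    .str.replace(r"['\"\[\]]", "", regex=True)
    .str.strip()
)

print(df["actors_list"].iloc[0])
# Tim Robbins, Morgan Freeman, Bob Gunton
```

Note that removing every quote character would also remove apostrophes inside names (for example O'Toole), so a real cleanup may need a more careful pattern.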

Splitting columns and Removing the original

Next, I wanted to split the actors_list column into three separate actor name columns for easier analysis, then remove the original combined column. Let's take a look at the data now. Much better, right?
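The split and drop can be sketched like this, assuming the column has already been cleaned into a comma-separated string and that every row lists exactly three actors. The new column names actor_1, actor_2, and actor_3 are hypothetical.

```python
import pandas as pd

# Sample rows with actors_list already cleaned into "name, name, name".
df = pd.DataFrame({
    "title": ["The Shawshank Redemption", "The Godfather"],
    "actors_list": [
        "Tim Robbins, Morgan Freeman, Bob Gunton",
        "Marlon Brando, Al Pacino, James Caan",
    ],
})

# expand=True returns one column per piece of the split string.
actor_cols = ["actor_1", "actor_2", "actor_3"]  # hypothetical names
df[actor_cols] = df["actors_list"].str.split(", ", expand=True)

# The combined column is no longer needed.
df = df.drop(columns="actors_list")

print(df)
```

With one actor per column, it becomes straightforward to filter or group by an individual name.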

Visualization One: Heatmap correlation between Star Rating and Duration

One way to explore numerical data is to compute correlations and display them as a heatmap. For this visualization I looked specifically at duration against star rating to see whether the two are correlated. In this case the correlation is weak, so duration does not appear to have much of a linear relationship with star rating.
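A correlation heatmap along these lines can be sketched with plain matplotlib. The numbers below are made-up sample values, not rows from the Kaggle data, and the column names star_rating and duration are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up sample values for illustration only.
df = pd.DataFrame({
    "star_rating": [8.1, 8.4, 7.9, 8.8, 8.3, 7.6],
    "duration": [120, 95, 142, 130, 101, 160],
})

# Pairwise Pearson correlation between the two numeric columns.
corr = df[["star_rating", "duration"]].corr()

fig, ax = plt.subplots()
im = ax.imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns)
ax.set_yticks(range(len(corr.index)))
ax.set_yticklabels(corr.index)

# Write each coefficient inside its cell.
for i in range(corr.shape[0]):
    for j in range(corr.shape[1]):
        ax.text(j, i, f"{corr.iloc[i, j]:.2f}", ha="center", va="center")

fig.colorbar(im, ax=ax, label="Pearson correlation")
ax.set_title("Correlation between star rating and duration")
fig.tight_layout()
```

Fixing the color scale to the range -1 to 1 keeps the colors comparable across datasets. A coefficient near zero only rules out a strong linear relationship, not every kind of dependence.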


Visualization Two: Sorted Bar Plot Ranking the Top 10 Movies by Star Rating

Another way to visualize the data is by ranking movies against each other by star rating. This is a great way to see which movies are the most acclaimed by audiences and a good start to further analysis on the top n movies. In this case I plotted the top 10 movies with the highest star rating. Looks like The Shawshank Redemption is a fan favorite!
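A sorted bar plot of the top 10 can be sketched as follows. The titles and ratings are placeholder values rather than the real data, and the column names title and star_rating are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder titles and ratings for illustration only.
df = pd.DataFrame({
    "title": [f"Movie {c}" for c in "ABCDEFGHIJKL"],
    "star_rating": [8.1, 9.3, 8.5, 7.9, 9.0, 8.8, 8.2, 8.7, 7.6, 8.9, 8.4, 8.0],
})

# nlargest returns the 10 highest-rated rows, already sorted descending.
top10 = df.nlargest(10, "star_rating")

fig, ax = plt.subplots(figsize=(8, 5))
ax.barh(top10["title"], top10["star_rating"])
ax.invert_yaxis()  # put the highest-rated movie at the top
ax.set_xlabel("Star rating")
ax.set_title("Top 10 movies by star rating")
fig.tight_layout()
```

Horizontal bars leave room for long movie titles, which is why barh is used here instead of bar.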


Visualization Three: Distribution of Genres

One final visualization looks at the distribution of genres in our data. This is a useful starting point for exploring the filmographies of different actors. In this case, Drama is by far the most common genre in this dataset.
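Counting genres and plotting them can be sketched as below. The genre values are a made-up sample, and the column name genre is an assumption.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up sample of genre labels for illustration only.
df = pd.DataFrame({
    "genre": ["Drama", "Comedy", "Drama", "Action",
              "Drama", "Crime", "Comedy", "Drama"],
})

# value_counts sorts genres from most to least frequent.
genre_counts = df["genre"].value_counts()

fig, ax = plt.subplots()
ax.bar(genre_counts.index, genre_counts.values)
ax.set_xlabel("Genre")
ax.set_ylabel("Number of movies")
ax.set_title("Distribution of genres")
ax.tick_params(axis="x", rotation=45)
fig.tight_layout()
```

Because value_counts already sorts by frequency, the bars come out ordered from the most common genre to the least common one.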
