Baby Steps in Exploratory Data Analysis (EDA) with Python.
The first thing I encounter with data is noise. Data is not perfect, hence the need for cleaning and munging. I remember being a bit overwhelmed, with both excitement and anxiety, over how to deal with the sheer amount of data I received. It is easy to get confused when dealing with multiple dataframes. One good way to deal with that is to keep track of a few questions we want to answer and then take small steps towards that end goal.
Here we shall discuss the baby steps of loading, checking, cleaning, and munging data. This process is also known as Exploratory Data Analysis (EDA). There will be more analysis and visualizations, but our main focus will be cleaning and munging.
For the purpose of this exercise, we will first download a free dataset of Amazon's top 50 best-selling books per year from Kaggle.
Next, we need a notebook environment such as Jupyter Notebook, or an Integrated Development Environment (IDE) such as PyCharm.
A quick summary of the steps we shall cover here includes:
- Importing libraries
- Reading CSV and TSV files
- Previewing the dataframe
- Finding unique values and dropping duplicates
- Extracting rows based on conditions
- Working with Groupby
Importing libraries
Think of a library as a little toolbox. When you need to drive in a nail, you grab a hammer. Someone has already created the hammer, so we do not need to make one from scratch.
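For this walkthrough the toolbox is pandas. A minimal sketch of the import, using the conventional alias pd that the later snippets rely on (printing the version is just an optional sanity check, not something the original steps require):

```python
# pandas is the data manipulation toolbox used throughout this walkthrough
import pandas as pd

# optional sanity check: confirm the library loaded by printing its version
print(pd.__version__)
```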
Read CSV or TSV files
We shall load the CSV or TSV files into pandas, a data manipulation tool.
Amz_50Best = pd.read_csv('bestsellers with categories.csv')
# we could also read a TSV file with read_csv by passing a tab as the separator
# dataframe = pd.read_csv('file.tsv', sep='\t')
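To see the tab separator in action without downloading anything, here is a small sketch that reads a TSV from an in-memory string. The sample rows and the names tsv_text and sample_df are made up for illustration; they are not part of the Kaggle dataset.

```python
import io

import pandas as pd

# hypothetical tab-separated sample, standing in for a real .tsv file
tsv_text = "Name\tAuthor\tPrice\nBook A\tAuthor X\t8\nBook B\tAuthor Y\t12\n"

# read_csv accepts any file-like object; sep='\t' marks the columns as tab-separated
sample_df = pd.read_csv(io.StringIO(tsv_text), sep="\t")

print(sample_df.shape)
print(sample_df.columns.tolist())
```

With a real file, the only change is passing the file path instead of the StringIO object.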
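After assigning the dataframe to a variable (Amz_50Best), we can quickly preview it with .head() or .tail(). Personally, I like to use .head(-5) in a notebook. Note that .head(-5) returns every row except the last 5; the notebook then truncates the long output and displays the first 5 and last 5 rows of that result. The final 5 rows of the dataframe itself are left out, so use .tail() to see those.
# preview the top and bottom entries (all rows except the last 5)
Amz_50Best.head(-5)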
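The difference between these preview calls is easier to see on a tiny dataframe. The sketch below uses a hypothetical 20-row dataframe named demo rather than the real book list, and builds an explicit top-and-bottom preview with pd.concat, which is one possible approach and not necessarily what the original notebook did.

```python
import pandas as pd

# hypothetical 20-row dataframe standing in for Amz_50Best
demo = pd.DataFrame({"Reviews": range(20)})

top5 = demo.head()              # first 5 rows (the default n is 5)
bottom5 = demo.tail()           # last 5 rows
all_but_last5 = demo.head(-5)   # every row except the last 5

# an explicit top-and-bottom preview that does not rely on display truncation
preview = pd.concat([demo.head(), demo.tail()])
print(preview)
```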
A quick check for null values and data types is the .info() function. Another quick way is the isna() function. From the summary below, we know that there are 550 entries in the dataframe, and it is great that we don't have any null values!
# this function creates a summary of the dataframe
Amz_50Best.info()
RangeIndex: 550 entries, 0 to 549
Data columns (total 7 columns):
Name 550 non-null object
Author 550 non-null object
User Rating 550 non-null float64
Reviews 550 non-null int64
Price 550 non-null int64
Year 550 non-null int64
Genre 550 non-null object
dtypes: float64(1), int64(3), object(3)
memory usage: 30.2+ KB
# another way to check whether there are missing values;
# this creates a table of booleans
Amz_50Best.isna()
# combining it with .sum() gives a count of missing values per column
Amz_50Best.isna().sum()
# not needed for this dataframe,
# but if we need to fill the missing values
Amz_50Best.fillna(0, inplace=True)
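Since the book list has no missing values, the calls above have nothing to show. Here is a minimal sketch on a made-up dataframe (the name sample and its rows are hypothetical) that does contain gaps. It uses fillna without inplace=True, which returns a new dataframe and leaves the original untouched; that is a design choice for the example, not a correction of the step above.

```python
import pandas as pd

nan = float("nan")

# hypothetical sample with two missing values
sample = pd.DataFrame({
    "Name": ["Book A", "Book B", "Book C"],
    "User Rating": [4.7, nan, 4.5],
    "Price": [8.0, 12.0, nan],
})

# count missing values per column
missing_per_column = sample.isna().sum()
print(missing_per_column)

# fillna returns a new dataframe here; sample itself still has its gaps
filled = sample.fillna(0)
print(filled)
```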
Finding Unique & Drop Duplicates
Looking at the book list, we notice that there are duplicates among the book titles. That is because some of the books appear on the list more than once, often in different years. Say we are asked to compile a unique list of the books, and we also want to create a new dataframe that drops all the duplicated books.
Phew. I hope I haven't lost you yet. Think baby steps. It’s not a big step, but it’s a start, and many more come with practice. Sometimes we fall, take a break, and then get back up and code on.
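One way to do both tasks is with unique() and drop_duplicates(). The sketch below runs on a small made-up dataframe called books, so the titles, authors, and years are placeholders; on the real data the same calls would be applied to Amz_50Best and its Name column.

```python
import pandas as pd

# hypothetical sample: "Book A" appears in two different years
books = pd.DataFrame({
    "Name": ["Book A", "Book B", "Book A", "Book C"],
    "Author": ["Author X", "Author Y", "Author X", "Author Z"],
    "Year": [2018, 2018, 2019, 2019],
})

# array of distinct titles, in order of first appearance
unique_titles = books["Name"].unique()

# number of distinct titles
n_titles = books["Name"].nunique()

# new dataframe keeping only the first row found for each title
unique_books = books.drop_duplicates(subset="Name", keep="first")
print(unique_books)
```

Passing subset="Name" matters here: without it, drop_duplicates only removes rows that match in every column, and the two "Book A" rows differ in Year.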
Extracting rows based on conditions
What if we were asked to find books written by a certain author, or books with a certain number of reviews? Often we do not want the whole dataframe, so let’s see how we can slice it to find some answers.
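Both questions can be answered with boolean conditions inside square brackets. This is a sketch on a hypothetical dataframe; the author names and review thresholds are invented for illustration.

```python
import pandas as pd

# hypothetical sample using the same column names as the book list
books = pd.DataFrame({
    "Name": ["Book A", "Book B", "Book C", "Book D"],
    "Author": ["Author X", "Author Y", "Author X", "Author Z"],
    "Reviews": [1500, 300, 9000, 12000],
})

# rows written by one author
by_author = books[books["Author"] == "Author X"]

# rows with more than 1000 reviews
popular = books[books["Reviews"] > 1000]

# combine conditions with & (and) or | (or); each condition needs its own parentheses
popular_by_author = books[(books["Author"] == "Author X") & (books["Reviews"] > 5000)]
print(popular_by_author)
```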
Working with Groupby
The groupby() operation in pandas allows us to split a dataframe into groups, apply a function to each group, and combine the results. This is useful when we want to count, find the average, or apply other aggregate functions to the dataframe. A good strategy is to first predict what data structure we want to get before creating it.
Kudos if you got this far. These sum up the baby steps into EDA. Beyond these, there are many techniques for making beautiful graphs and visualizations, and tweaking is often necessary to show useful information. As a bonus, here are a couple of graphs we could get from the above work. All the best in your EDA journey.
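As a sketch of predicting the data structure first: a single aggregate over one column gives back a Series indexed by group, while several named aggregates give back a DataFrame. The dataframe below is made up, and the result column names mean_price and max_price are hypothetical labels chosen for this example.

```python
import pandas as pd

# hypothetical sample using column names from the book list
books = pd.DataFrame({
    "Genre": ["Fiction", "Non Fiction", "Fiction", "Non Fiction"],
    "User Rating": [4.8, 4.4, 4.6, 4.6],
    "Price": [10, 20, 14, 8],
})

# number of rows per genre: a Series indexed by Genre
books_per_genre = books.groupby("Genre").size()

# average rating per genre: also a Series indexed by Genre
mean_rating = books.groupby("Genre")["User Rating"].mean()

# several aggregates at once: a DataFrame with one row per genre
summary = books.groupby("Genre").agg(
    mean_price=("Price", "mean"),
    max_price=("Price", "max"),
)
print(summary)
```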