Baby Steps in Exploratory Data Analysis (EDA) with Python.

The first thing I encounter with data is noise. Data is not perfect, hence the need for cleaning and munging. I remember being a bit overwhelmed with excitement and anxiety over how to deal with the sheer amount of data I received. It is easy to get confused when dealing with multiple dataframes. One good way to deal with that is to keep track of a few questions we want to answer and then take small steps toward that end goal.

Here we shall discuss the baby steps of loading, checking, cleaning, and munging data. These steps are part of what is known as Exploratory Data Analysis (EDA). EDA also involves more analysis and visualization, but our main focus will be cleaning and munging.

For the purpose of this exercise, we will first download a free dataset of Amazon's top 50 best-selling books per year from Kaggle.

Next, we need a notebook environment such as Jupyter Notebook, or an Integrated Development Environment (IDE) such as PyCharm.

A quick summary of the steps we shall cover here includes:

  • Importing libraries
  • Reading CSV and TSV files
  • Previewing the dataframe
  • Finding unique values and dropping duplicates
  • Extracting rows based on conditions
  • Working with Groupby

Import Library

Think of the library as a little toolbox. When you need to drive in a nail you grab a hammer. Someone has already created the hammer and we will not need to make one from scratch.
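
As a minimal sketch of this step, assuming pandas is already installed in your environment, the import looks like this. The alias pd is only a widely used convention, and the tiny dataframe is a made-up check that the import worked.

```python
# pandas is our "toolbox" for tabular data; pd is the conventional alias.
import pandas as pd

# Quick sanity check: build a tiny one-column dataframe (made-up content).
check = pd.DataFrame({"tool": ["hammer"]})
print(check)
```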

Read CSV or TSV files

We shall load the CSV or TSV files into pandas, a data manipulation library.
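
Here is a rough sketch of the loading step. To keep it runnable without the downloaded file, it reads from an in-memory string; the column names and rows are made up for illustration and may not match the real Kaggle file. With a real file you would pass its path to pd.read_csv() instead.

```python
import io
import pandas as pd

# Stand-in for the downloaded file (hypothetical columns and rows).
# With a real file, pass its path instead of io.StringIO(...).
csv_text = """Name,Author,User Rating,Reviews,Price,Year,Genre
Book A,Author X,4.7,17350,8,2016,Non Fiction
Book B,Author Y,4.6,2052,22,2011,Fiction
"""
Amz_50Best = pd.read_csv(io.StringIO(csv_text))

# A TSV file is read the same way, with a tab as the separator.
tsv_text = csv_text.replace(",", "\t")
Amz_50Best_tsv = pd.read_csv(io.StringIO(tsv_text), sep="\t")

print(Amz_50Best)
```

Both calls produce the same dataframe here; the only difference is the sep argument.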

Preview Dataframe

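After assigning the dataframe to a variable (Amz_50Best), we can quickly preview it with .head() or .tail(). Personally, I like to use .head(-5) in a notebook. It returns every row except the last five, and because the notebook truncates long output, I see the first five rows plus the last five of the rows that remain (not the very last five rows of the dataframe).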
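
The preview calls can be sketched as follows, using a made-up 100-row stand-in for Amz_50Best. Note that a negative argument to .head() drops rows from the end rather than selecting them.

```python
import pandas as pd

# Hypothetical stand-in for Amz_50Best: 100 numbered rows.
Amz_50Best = pd.DataFrame({
    "Name": [f"Book {i}" for i in range(100)],
    "Year": [2009 + i % 11 for i in range(100)],
})

first_five = Amz_50Best.head()       # first 5 rows (default n=5)
last_five = Amz_50Best.tail()        # last 5 rows
all_but_last = Amz_50Best.head(-5)   # every row except the last 5

print(first_five)
print(last_five)
print(len(all_but_last))
```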

Another quick check for null values and data types is the .info() method. The .isna() method is another quick way to find null values. From the summary, we know that there are 550 entries in the dataframe, and it is great that we don't have any null values!
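
A small sketch of both checks, on a made-up dataframe that contains one missing value on purpose so the result is not all zeros:

```python
import pandas as pd

# Hypothetical data with one deliberately missing review count.
Amz_50Best = pd.DataFrame({
    "Name": ["Book A", "Book B", "Book C"],
    "Reviews": [17350, 2052, None],
})

# Prints the number of entries, non-null counts, and dtype of each column.
Amz_50Best.info()

# isna() marks each cell as missing or not; sum() counts them per column.
null_counts = Amz_50Best.isna().sum()
print(null_counts)
```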

Finding Unique & Drop Duplicates

Looking at the book list, we notice that there are duplicates in the book titles. That is because some of the books made the best-seller list more than once, often in different years. Say we were told to compile a unique list of the books, and we also wanted to create a new dataframe that drops all the duplicated books.
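
One possible way to do both, sketched on made-up rows where "Book A" appears in two years. The column name Name is an assumption, and keeping the first occurrence of each title is just one reasonable choice.

```python
import pandas as pd

# Hypothetical data: "Book A" appears in two different years.
Amz_50Best = pd.DataFrame({
    "Name": ["Book A", "Book B", "Book A", "Book C"],
    "Year": [2015, 2015, 2016, 2016],
})

# Distinct titles, in order of first appearance, and how many there are.
unique_titles = Amz_50Best["Name"].unique()
n_unique = Amz_50Best["Name"].nunique()

# New dataframe that keeps only the first row for each title.
Amz_unique = (
    Amz_50Best.drop_duplicates(subset="Name", keep="first")
    .reset_index(drop=True)
)

print(unique_titles)
print(Amz_unique)
```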

Phew. I hope I haven't lost you guys yet. Think baby steps. It’s not a big step, but it’s a start, and many more come with practice. Sometimes we fall, take a break, and then get back up and code on.

Extracting rows based on conditions

What if we were told to find books written by a certain author? Or books with a certain number of reviews? Often we do not want the whole dataframe, so let’s see how we can slice it to find some answers.
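
A minimal sketch of this kind of filtering with boolean masks. The column names (Author, Reviews), the author name, and the 10,000-review threshold are all made up for illustration.

```python
import pandas as pd

# Hypothetical data for illustration.
Amz_50Best = pd.DataFrame({
    "Name": ["Book A", "Book B", "Book C", "Book D"],
    "Author": ["Author X", "Author Y", "Author X", "Author Z"],
    "Reviews": [17350, 2052, 900, 25000],
})

# Rows written by one author.
by_author = Amz_50Best[Amz_50Best["Author"] == "Author X"]

# Rows with more than 10,000 reviews.
popular = Amz_50Best[Amz_50Best["Reviews"] > 10000]

# Combine conditions with & (and) or | (or); each needs its own parentheses.
both = Amz_50Best[
    (Amz_50Best["Author"] == "Author X") & (Amz_50Best["Reviews"] > 10000)
]

print(by_author)
print(popular)
print(both)
```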

Working with Groupby

The groupby() operation in pandas allows us to split a dataframe into groups, apply a function to each group, and combine the results. This is useful when we want to count, find the average, or apply other aggregate functions to the dataframe. A good strategy is to first predict what data structure we expect to get back before creating it.
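
The split, apply, and combine idea can be sketched like this, again on made-up rows with assumed column names. Selecting a single column after groupby() gives back a Series indexed by the group key, while .agg() with named outputs gives back a dataframe.

```python
import pandas as pd

# Hypothetical data for illustration.
Amz_50Best = pd.DataFrame({
    "Name": ["Book A", "Book B", "Book C", "Book D"],
    "Author": ["Author X", "Author Y", "Author X", "Author Z"],
    "User Rating": [5.0, 4.6, 4.0, 4.0],
})

# Count of books per author: a Series indexed by Author.
books_per_author = Amz_50Best.groupby("Author")["Name"].count()

# Average rating per author: also a Series indexed by Author.
avg_rating = Amz_50Best.groupby("Author")["User Rating"].mean()

# Several aggregates at once: a dataframe with one row per author.
summary = Amz_50Best.groupby("Author").agg(
    books=("Name", "count"),
    mean_rating=("User Rating", "mean"),
)

print(books_per_author)
print(avg_rating)
print(summary)
```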

Kudos if you got this far. These sum up the baby steps into EDA. Beyond these, there are many techniques for making beautiful graphs and visualizations, and tweaking is often necessary to show useful information. As a bonus, here are a couple of graphs we could get from the above work. All the best in your EDA journey.

Timothy W Lim
