Introduction to Pandas Data Analysis

This page is still being developed; please bear with us.

What is pandas in Python?

Pandas is a Python library developed for data science, data analysis, data manipulation, and data visualisation. It is free, open-source software. Its features are built around two core data structures: Series and DataFrame. Before either of these can be used, we must first make sure pandas is installed in our Python environment, because both are accessed through the pandas package. If pandas has not been installed, you need to download and install it from your terminal, using the pip program that comes with Python. This article assumes you are working on Windows. Linux, macOS, and other operating systems are not covered in the installation process, but the syntax for creating, accessing, and manipulating data through Series and DataFrame is the same on every platform.

How to install pandas on Windows

The pip program downloads and installs pandas in a single step. Go to your command prompt or terminal and type:

pip install pandas

If your Python was installed through Anaconda, manage pandas through Anaconda instead. Anaconda normally includes pandas already; if it is missing, it can be installed with conda install pandas. Once everything is set up and running, using pandas is quite simple.
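Once the installation finishes, a quick way to confirm that pandas is available is to import it and print its version. This is only a small check; the version number you see depends on what was installed on your machine, and pd is simply the conventional short alias for pandas.

```python
import pandas as pd

# If this import succeeds, pandas is installed in the current Python environment.
installed_version = pd.__version__
print("pandas version:", installed_version)
```

If the import fails with ModuleNotFoundError, pandas is not installed in the Python environment you are running.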

To use pandas, you first need to understand its two primary components: Series and DataFrame. A Series is essentially a single column, and a DataFrame is a two-dimensional table made up of a collection of Series. Because Series and DataFrame both reside in pandas, pandas must be imported at the top of your program before either can be created.

Importing pandas

Follow the step below to import pandas.

import pandas

Having imported the pandas library at the top of our program, it is time to practise the data analysis that pandas offers. Let us start with Series and DataFrame. Take a close look at the Series and DataFrame below: the first Series holds data about oranges in a grocery, while the second holds data about mangoes. The third table is a DataFrame that combines the data in the two Series. Notice that what is common to all the tables is the index, numbered from zero to three (0 - 3). This number represents the position of each entry in the table.

Explanation

For clarity, let us walk through the tables below. Suppose we have a shop that sells mangoes and oranges, and the fruits are arranged serially on stacks. In table 1, the stacks are numbered from 0 to 3: stack 0 holds 3 oranges, stack 1 holds 2, stack 2 holds 4, and the last stack, stack 3, holds 0 (nothing is kept there). We can represent this programmatically using a pandas Series.

Series +

	Orange
0	3
1	2
2	4
3	0

Series =

	Mango
0	1
1	3
2	0
3	5

DataFrame

	Orange	Mango
0	3	1
1	2	3
2	4	0
3	0	5

Series - DataFrame Interaction
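The three tables above can be reproduced with a short sketch. It builds one Series per fruit and then places them side by side with pd.concat, which is one of several ways to combine Series; the variable names orange, mango, and stock are chosen here only for illustration.

```python
import pandas as pd

# One Series per fruit; the name becomes the column label later on.
orange = pd.Series([3, 2, 4, 0], name="Orange")
mango = pd.Series([1, 3, 0, 5], name="Mango")

# axis=1 places the Series side by side as columns of one DataFrame.
stock = pd.concat([orange, mango], axis=1)

print(orange)
print(mango)
print(stock)
```

Both Series share the same implicit index (0 - 3), which is why their rows line up when they are combined.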

Pandas Series

What is a pandas Series, how important is it to the study of data analysis, and how does it relate to DataFrame? These questions are answered in the Series article on this website, because this article is devoted entirely to DataFrame. Series and DataFrame are quite similar in operation, and many tasks that can be done with one are also possible with the other, for example calculating the total, mean, median, and mode of a data distribution.
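To illustrate that the same summary methods work on both structures, here is a minimal sketch using the grocery figures from this article. The variable names are illustrative only.

```python
import pandas as pd

stock = pd.DataFrame({
    "orange": [3, 2, 4, 0],
    "mango": [1, 3, 0, 5],
})

# On a single column (a Series) each method returns one value.
orange_total = stock["orange"].sum()      # 3 + 2 + 4 + 0 = 9
orange_mean = stock["orange"].mean()      # 9 / 4 = 2.25
orange_median = stock["orange"].median()  # middle of 0, 2, 3, 4 = 2.5

# mode() returns a Series, because several values can tie for most frequent.
orange_mode = stock["orange"].mode()

# On the whole DataFrame the same methods return one value per column.
totals = stock.sum()
means = stock.mean()

print(orange_total, orange_mean, orange_median)
print(orange_mode)
print(totals)
print(means)
```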

How to Create DataFrame

There are many ways to create a DataFrame, but I am going to limit this article to two: creating a DataFrame from scratch using a simple Python dictionary, and importing data from a CSV file. Let us use the grocery example of mangoes and oranges cited above to create the DataFrame.

step 1. import pandas
step 2. create a column for each fruit - orange and mango
step 3. the rows are labelled by index numbers; pandas generates these implicitly, so you don't have to do anything here for now.

import pandas as pd
# Create your data, using a Python dictionary to load the number of fruits you want to show.

data = {
'orange' : [3, 2, 4, 0],
'mango' : [1, 3, 0, 5]
}

Great! You have successfully created your data using a Python dictionary. The next step is to create your DataFrame and pass your data into it. See it below.

stock = pd.DataFrame(data)

Great! The DataFrame has been created successfully. The next thing is to display your data. See below:

print(stock)

The whole process is now complete. For clarity, let us see the complete code in action, followed by the display of our DataFrame to see what the table looks like.

import pandas as pd

data = {
'orange' : [3, 2, 4, 0],
'mango' : [1, 3, 0, 5]
}
stock = pd.DataFrame(data)
print(stock)

When this program is run, the output is shown below. Based on the output, we will be able to carry out some data analysis on the DataFrame, which can help us make sound business decisions.

Output

	orange	mango
0	3	1
1	2	3
2	4	0
3	0	5

Great! Isn't it? We have just displayed the content of the DataFrame. It has four rows and two columns, orange and mango, plus an index on the left. While the orange and mango columns were explicitly declared by the programmer, the index is added implicitly by pandas and labelled 0 - 3. Later in this article we will see how to hide this index or substitute it with labels of our own. Meanwhile, before we start our data analysis proper, the other method of creating a DataFrame needs to be covered as well: loading data from a CSV file using the pd.read_csv() function.
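A short sketch can confirm how pandas itself counts the rows, columns, and index of this table. It rebuilds the same grocery DataFrame and inspects it with the shape, columns, and index attributes.

```python
import pandas as pd

stock = pd.DataFrame({
    "orange": [3, 2, 4, 0],
    "mango": [1, 3, 0, 5],
})

# shape is (number of rows, number of data columns); the index is not counted.
n_rows, n_cols = stock.shape
column_names = list(stock.columns)
index_labels = list(stock.index)

print(n_rows, n_cols)
print(column_names)
print(index_labels)
```

The index labels are kept separately from the data columns, which is why pandas reports two columns here even though three appear on screen.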

Pandas Read CSV File

What is a CSV file? A CSV (comma-separated values) file is a plain text file in which each entry is separated from the next by a comma. It is a simple way to store large data sets, and it can be read not only by Python but by most programming languages. This portability is one of the strengths of CSV for data science and data analysis, and it makes the format an essential tool for data scientists and analysts. To load a CSV file into a DataFrame, use the command below:

import pandas as pd

df = pd.read_csv('name of the csv file')
print(df)

Let us see what is happening in the code above. The program contains three lines of code. Line one imports the pandas package into our program, line two reads the CSV file into a DataFrame and assigns it to the variable df, and the last line prints the content of the DataFrame.
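If you do not have a CSV file at hand, the same three steps can be tried on CSV text held in memory. The sketch below is an illustration only: the csv_text string is made-up sample content based on the grocery example, and io.StringIO is used so that pd.read_csv can treat the string as if it were a file.

```python
import io

import pandas as pd

# Sample CSV content: a header line followed by one line per row.
csv_text = "orange,mango\n3,1\n2,3\n4,0\n0,5\n"

# read_csv accepts a file path or a file-like object such as StringIO.
df = pd.read_csv(io.StringIO(csv_text))
print(df)
```

The first line of the CSV text becomes the column names, and every following line becomes one row of the DataFrame.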

For the purpose of this study, we have created a sample CSV file named 'insolation-temperature.csv'. See it below:

import pandas as pd

df = pd.read_csv('insolation-temperature.csv')
print(df)

The output of this code is displayed below.

Output

	Month	Solar Insolation (kw/m2)	Earth temperature (0c)
0	January	6.06	24.5
1	February	5.88	23.8
2	March	5.69	23.1
3	April	5.47	21.8
4	May	5.00	20.0
5	June	4.53	18.2
6	July	4.73	18.6
7	August	5.59	21.9
8	September	6.38	26.1
9	October	6.51	27.4
10	November	6.25	26.8
11	December	5.84	24.6
12	Annual	5.66	23.0

Great! We have successfully pulled our CSV file of monthly solar insolation and earth temperature into the pandas environment and displayed it as well. Next, we will demonstrate how pandas can read a JSON file into a DataFrame. We will use the same average monthly solar insolation and earth temperature data from the CSV file above to demonstrate this process.

Pandas Read JSON File

JSON is a widely used format for storing, retrieving, and exchanging data in data science and data analysis. What is a JSON file? A JSON file is just a text file, like the CSV file treated above. Unlike CSV, however, JSON stores data as objects (key-value pairs), which has made it popular across programming languages.

How to Use a JSON File in Pandas

There are two ways JSON data can be used in pandas. The first is to load an existing JSON file into a DataFrame with the pd.read_json() function. This gives pandas access to the file's contents for further manipulation. See how to do this below:

import pandas as pd
df = pd.read_json('name of the file.json')

This creates a DataFrame from the named .json file for reading and analysis, in the same way the CSV file was loaded, read, and displayed. We do not need to repeat that process because it has already been shown above.
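For readers without a .json file to hand, the sketch below applies pd.read_json to JSON text held in memory. The json_text string is made-up sample content based on the grocery example, with each key holding one column of values, and io.StringIO lets the string stand in for a file.

```python
import io

import pandas as pd

# Sample JSON content: each key is a column name, each list is that column's values.
json_text = '{"orange": [3, 2, 4, 0], "mango": [1, 3, 0, 5]}'

# read_json accepts a file path or a file-like object such as StringIO.
df = pd.read_json(io.StringIO(json_text))
print(df)
```

Real JSON files can be laid out in other ways (for example, as a list of row objects), in which case the orient parameter of pd.read_json may be needed.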

Let us look at the second method: building the JSON data as a Python dictionary. As mentioned above, a JSON object has much the same format as a Python dictionary, which means that if your JSON is not in a file, you can write it as a Python dictionary and load it into the DataFrame. Let us see how to do this now.

import pandas as pd

data = {
"Month": [
"January",
"February",
"March",
"April",
"May",
"June",
"July",
"August",
"September",
"October",
"November",
"December",
"Annual"
],
"Solar Insolation (kw/m2)":[
6.06,
5.88,
5.69,
5.47,
5.00,
4.53,
4.73,
5.59,
6.38,
6.51,
6.25,
5.84,
5.66
],
"Earth temperature (0c)" : [
24.5,
23.8,
23.1,
21.8,
20.0,
18.2,
18.6,
21.9,
26.1,
27.4,
26.8,
24.6,
23.0
],
}

df = pd.DataFrame(data)  # This loads data into the DataFrame
print(df)  # This displays the data

Output

	Month	Solar Insolation (kw/m2)	Earth temperature (0c)
0	January	6.06	24.5
1	February	5.88	23.8
2	March	5.69	23.1
3	April	5.47	21.8
4	May	5.00	20.0
5	June	4.53	18.2
6	July	4.73	18.6
7	August	5.59	21.9
8	September	6.38	26.1
9	October	6.51	27.4
10	November	6.25	26.8
11	December	5.84	24.6
12	Annual	5.66	23.0

NOTE: Looking at the output tables, it can be seen that every result has an index added implicitly by the DataFrame. We can hide this index or replace it with custom labels of our own if the need arises. I will explain how to do this as we progress in this study.
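As a brief preview of the note above, here is a minimal sketch of both options, using only the first three months of the insolation data to keep it short. It is one possible approach rather than the only one: set_index replaces the numeric index with the Month column, and to_string(index=False) produces the table text without any index.

```python
import pandas as pd

# Shortened copy of the article's data (first three months only).
df = pd.DataFrame({
    "Month": ["January", "February", "March"],
    "Solar Insolation (kw/m2)": [6.06, 5.88, 5.69],
    "Earth temperature (0c)": [24.5, 23.8, 23.1],
})

# Option 1: use the Month column as the row labels instead of 0, 1, 2.
by_month = df.set_index("Month")
print(by_month)

# Option 2: keep the DataFrame as it is, but display it without the index.
text_without_index = df.to_string(index=False)
print(text_without_index)
```

With Month as the index, a row can be looked up by name, for example by_month.loc["March"].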