Pandas is a Python library developed for data science, data analysis, and data visualisation. It is free software for the Python programming language, built for these purposes and for manipulating data. These capabilities rest on its two built-in data structures, Series and DataFrame. Before either of them can be used in Python, we must first make sure pandas is installed; only then can we call on whichever data structure we need for our analysis. If pandas has not been installed, you need to download and install it from your terminal, using the pip program that comes with Python. This article assumes you are on a Windows computing environment. Linux, macOS, and other operating systems are not covered in the installation process, but the syntax for creating, accessing, and manipulating data through Series and DataFrame is the same as on Windows.
The following step downloads and installs pandas. Go to your Command Prompt or terminal and type: pip install pandas
If your Python was installed through Anaconda, you need to install pandas through Anaconda instead. Once everything is set up and running, using pandas is quite simple.
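Before going further, it can help to confirm that pandas is visible to your Python installation. The following is a minimal sketch that uses only the standard library; the variable name pandas_installed is made up for illustration.

```python
import importlib.util

# find_spec returns None when the named package cannot be found
pandas_installed = importlib.util.find_spec("pandas") is not None

if pandas_installed:
    print("pandas is installed")
else:
    print("pandas is not installed; install it with pip or Anaconda first")
```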
To use pandas, you first need to understand its two primary components (data structures): Series and DataFrame. A Series is essentially a column, and a DataFrame is a two-dimensional table made up of a collection of Series. To create a Series or a DataFrame, we need to import the Python library that houses them, which is pandas, so pandas must be imported at the top of your program code.
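The import described above can be sketched as follows. The alias pd is a widely used convention rather than a requirement.

```python
# Import pandas at the top of the program; "pd" is the conventional short alias
import pandas as pd

# Both data structures live inside the pandas package
print(pd.Series)
print(pd.DataFrame)
```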
Having imported the pandas library at the top of our program, it is time to practise the data analysis that pandas offers. Let us start with Series and DataFrame. Take a close look at the Series and DataFrame tables below: the first Series table contains data about oranges, while the second Series table contains data about mangoes in a grocery. The third table is a DataFrame that combines the data in tables one and two. You will also notice that what is common to all the tables is the index number from zero to three (0 - 3). This number represents the address of each entry in the table.
For clarity's sake, let us walk through a simple reading of the tables below. Suppose we have a shop that sells mangoes and oranges. The fruits are arranged serially on stacks. In table 1, the stacks are numbered serially from 0 - 3: stack 0 holds 3 fruits, stack 1 holds 2 fruits, stack 2 holds 4 fruits, and the last stack, stack 3, holds 0 (nothing is kept there). We can represent this programmatically using a pandas Series.
Index | Orange |
---|---|
0 | 3 |
1 | 2 |
2 | 4 |
3 | 0 |
Index | Mango |
---|---|
0 | 1 |
1 | 3 |
2 | 0 |
3 | 5 |
Index | Orange | Mango |
---|---|---|
0 | 3 | 1 |
1 | 2 | 3 |
2 | 4 | 0 |
3 | 0 | 5 |
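The three tables above can be sketched in code as two Series and one DataFrame. This is only one possible way to build them; the variable names orange, mango, and fruits are chosen here for illustration.

```python
import pandas as pd

# Each Series is one column; pandas assigns the index 0 - 3 automatically
orange = pd.Series([3, 2, 4, 0], name="Orange")
mango = pd.Series([1, 3, 0, 5], name="Mango")

# Placing the two Series side by side gives the combined DataFrame
fruits = pd.concat([orange, mango], axis=1)

print(orange)
print(mango)
print(fruits)
```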
What is a pandas Series, how important is it to the study of data analysis, and how does it relate to DataFrame? All these questions are answered in the Series article on this website, because this article is devoted entirely to DataFrame. Series and DataFrame are quite similar in operation; many tasks that can be done with one are possible with the other, for example calculating the total, mean, median, and mode of a data distribution.
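As a rough illustration of that similarity, the sketch below calls the same summary methods on a Series and on a DataFrame, using the fruit figures from the tables above. The variable names are assumptions made for this example.

```python
import pandas as pd

orange = pd.Series([3, 2, 4, 0], name="Orange")
df = pd.DataFrame({"Orange": [3, 2, 4, 0], "Mango": [1, 3, 0, 5]})

# On a Series, each of these methods returns a single number
print(orange.sum())
print(orange.mean())
print(orange.median())

# On a DataFrame, the same methods return one result per column
print(df.sum())
print(df.mean())
print(df.median())

# mode() returns the most frequent value(s); several values can tie
print(orange.mode())
print(df.mode())
```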
There are many ways to create a DataFrame, but I am going to limit DataFrame creation to just two: creating a DataFrame from scratch using a simple Python dictionary, and importing data from a CSV file.
Let us use the grocery example of mangoes and oranges cited above to create the DataFrame.
Step 1. Import pandas.
Step 2. Create a column for each fruit, Orange and Mango.
Step 3. The rows, which will be labelled by index numbers. Pandas generates these implicitly, so you don't have to do anything here for now.
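The three steps above can be sketched as follows. The variable name data is an assumption; the figures come from the fruit tables earlier in the article.

```python
import pandas as pd  # step 1

# Step 2: one dictionary key per column, with a list of values for each fruit
data = {
    "Orange": [3, 2, 4, 0],
    "Mango": [1, 3, 0, 5],
}

# Step 3: no index is written here; pandas will number the rows itself
print(data)
```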
Great! You have successfully created your data using a Python dictionary. The next step is to create your DataFrame and pass your data into it.
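A minimal sketch of that step, assuming the dictionary is stored in a variable called data and the result in df:

```python
import pandas as pd

data = {
    "Orange": [3, 2, 4, 0],
    "Mango": [1, 3, 0, 5],
}

# Pass the dictionary to the DataFrame constructor;
# the keys become column names and pandas adds the index
df = pd.DataFrame(data)
```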
Great! The DataFrame has been created successfully. The next thing is visualization of the DataFrame, that is, displaying your data.
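Displaying the DataFrame can be as simple as passing it to print(). The sketch below rebuilds the same df so that it runs on its own.

```python
import pandas as pd

df = pd.DataFrame({"Orange": [3, 2, 4, 0], "Mango": [1, 3, 0, 5]})

# print() shows the index on the left, followed by one column per fruit
print(df)
```

In a notebook environment, typing df on the last line of a cell displays the same table without calling print().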
The whole process is now complete. For clarity's sake, let us see the complete code in action, followed by the display of our DataFrame to see what is going on in the table.
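Putting the steps together, the complete program might look like the sketch below. The names data and df are assumptions carried over from the step-by-step description.

```python
# Step 1: import pandas
import pandas as pd

# Step 2: create a column for each fruit
data = {
    "Orange": [3, 2, 4, 0],
    "Mango": [1, 3, 0, 5],
}

# Step 3: create the DataFrame; pandas supplies the index 0 - 3
df = pd.DataFrame(data)

# Display the DataFrame
print(df)
```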
When this program is run, the output is shown below. Based on the output, we can carry out some data analysis on what we see in the DataFrame table, which will help us make decisive business decisions to improve the company's profile.
Output
Index | Orange | Mango |
---|---|---|
0 | 3 | 1 |
1 | 2 | 3 |
2 | 4 | 0 |
3 | 0 | 5 |
Great, isn't it? We have just displayed the content of the DataFrame. It shows four rows and three columns: the index, Orange, and Mango. While the Orange and Mango columns were explicitly declared by the programmer, the index column is added implicitly by pandas and labelled 0 - 3. Later in this article we will see how to remove this index column from our table or replace it with our own labels if we so desire. Meanwhile, before we start our data analysis proper, the other method of creating a DataFrame needs to be covered as well: loading CSV file data into the DataFrame using the read_csv() function.
What is a CSV file? A CSV (comma-separated values) file is a plain text file in which each entry is separated from the next by a comma. This type of file is a simple way to store a large data set, and it can be read not only by Python but by virtually every programming language. This is one of the beauties of the CSV file when it comes to data science and data analysis, and it is what makes it an essential tool for data scientists and analysts. To load a CSV file into a DataFrame, use the pandas read_csv() function.
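A minimal sketch of loading a CSV file follows. The three statements that matter are the import, the read_csv() call, and the print(). The pathlib setup in between is an addition that writes a two-row sample file, with figures taken from this article's sample data, so that the example runs on its own. The variable name csv_path is an assumption.

```python
import pandas as pd
from pathlib import Path

# Setup only: write a small sample file if none exists yet,
# so that this sketch can run without any other preparation
csv_path = Path("insolation-temperature.csv")
if not csv_path.exists():
    csv_path.write_text(
        "Month,Solar Insolation (kW/m²),Earth temperature (°C)\n"
        "January,6.06,24.5\n"
        "February,5.88,23.8\n",
        encoding="utf-8",
    )

# Load the CSV file into a DataFrame and display it
df = pd.read_csv(csv_path)
print(df)
```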
Loading the file takes three lines of code. Line one imports the pandas package into our program, line two creates the df variable and loads the CSV file into it, and the last line prints the content of our DataFrame.
For the purpose of this study, we have created a sample CSV file named 'insolation-temperature.csv'. See it below:
Index | Month | Solar Insolation (kW/m²) | Earth temperature (°C) |
---|---|---|---|
0 | January | 6.06 | 24.5 |
1 | February | 5.88 | 23.8 |
2 | March | 5.69 | 23.1 |
3 | April | 5.47 | 21.8 |
4 | May | 5.00 | 20.0 |
5 | June | 4.53 | 18.2 |
6 | July | 4.73 | 18.6 |
7 | August | 5.59 | 21.9 |
8 | September | 6.38 | 26.1 |
9 | October | 6.51 | 27.4 |
10 | November | 6.25 | 26.8 |
11 | December | 5.84 | 24.6 |
12 | Annual | 5.66 | 23.0 |
Let us look at a second way of getting the same data into pandas, that is, by writing it out as a Python dictionary. A JSON object has the same format as a Python dictionary, which means that if your data is not in a file, you can write it as a Python dictionary and load it into the DataFrame. Let us see how to create this now.
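The dictionary approach can be sketched as follows, using the figures from the sample file above. The variable names and the exact spelling of the column names are assumptions made for this example.

```python
import pandas as pd

# One key per column; every list has 13 entries (12 months plus the annual row)
data = {
    "Month": [
        "January", "February", "March", "April", "May", "June", "July",
        "August", "September", "October", "November", "December", "Annual",
    ],
    "Solar Insolation (kW/m²)": [
        6.06, 5.88, 5.69, 5.47, 5.00, 4.53, 4.73,
        5.59, 6.38, 6.51, 6.25, 5.84, 5.66,
    ],
    "Earth temperature (°C)": [
        24.5, 23.8, 23.1, 21.8, 20.0, 18.2, 18.6,
        21.9, 26.1, 27.4, 26.8, 24.6, 23.0,
    ],
}

# Load the dictionary into a DataFrame and display it
df = pd.DataFrame(data)
print(df)
```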
Output
Index | Month | Solar Insolation (kW/m²) | Earth temperature (°C) |
---|---|---|---|
0 | January | 6.06 | 24.5 |
1 | February | 5.88 | 23.8 |
2 | March | 5.69 | 23.1 |
3 | April | 5.47 | 21.8 |
4 | May | 5.00 | 20.0 |
5 | June | 4.53 | 18.2 |
6 | July | 4.73 | 18.6 |
7 | August | 5.59 | 21.9 |
8 | September | 6.38 | 26.1 |
9 | October | 6.51 | 27.4 |
10 | November | 6.25 | 26.8 |
11 | December | 5.84 | 24.6 |
12 | Annual | 5.66 | 23.0 |