Sorting a dataset by a criteria is an important operation. It is an interesting way to start analyzing different datasets by playing with the data around.
If you are migrating from SQL this would be the equivalent of the “ORDER BY” clause.
For this example we are going to use “cop shooting dataset” that can be downloaded from the following url https://www.kaggle.com/andrewmvd/police-deadly-force-usage-us.
First we are going to import pandas library
import pandas as pd
Then we are going to read file that we loaded. Change the path to math with your file Directory.
police_shootings = pd.read_csv("H:\Datasets\datasets_686757_1204345_fatal-police-shootings-data.csv")
As a result of that now we have a dataFrame called “police_shootings”, Which we are going to explore.
After checking the columns, we are going to obtain the top 5 values that we have on the dataframe.
police_shootings.head(n = 5)
2. Sorting Syntax
Sorting a dataset based on columns syntax is pretty simple.
DataFrame.sort_values(by, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last', ignore_index=False, key=None)
The most used parameters for sort_values are
by : str or list of str
- Name or list of names to sort by.
- Specifies the index or column level names.
ascending: bool or list of bool, default True
- Sort ascending vs. descending.
- Specify list for multiple sort orders. If this is a list of bools, must match the length of the “by”.
Inplace: bool, default False
- If True, perform operation in-place without creating a new dataset.
2.1. Sorting by column names.
For this examples we are going to use the data extracted from the “police deadly force usage” dataset.
We are going to start exploring the dataset, after exploring the dataset we would like to sort the dataframe by date.
For example, We would like to know which the first is and the last date that we have included on it.
Subsequently, we are going to select date, gender, age, race and city from the current dataframe and store it in another dataframe.
police_shooting_col = police_shootings[['date','gender','age','race','city']]
After that we are going to sort that dataframe by date.
As you can see values got sorted ascending. If you don’t specify the sorting order, default behavior is ascending (Going from lower to higher).
All the other columns are sorted based on this order.
You can establish the order of the following columns by adding ascending in the sort_values clause.
This clause is going to return the same information as the one above.
By changing the ascending value for ascending=”False” we are going to obtain the following results.
police_shooting_col.sort_values('date', ascending = False)
2.2 Sorting multiple columns
If we want to order this databaset by two columns, we would need to add a list instead of the name of a single column. For example, if we want to order our dataframe by date and age we would have the following results.
police_shooting_col.sort_values(['date','age'], ascending = [False,False])
As you can see the first orderded column is date and then age column is ordered.
In other words, in this dataframe we have date ordered in a descending order. As you can see in the second and third column we have the same date ‘2020-05-25′ and ages ordered in a descending order for that date, ’48’ and ’35’.
To sum up this part, the first column that is going to be ordered is date and if date rows are duplicated then age is going to be sorted.
When ordering a dataframe order of the columns affect the final result. For example
police_shooting_col.sort_values(['age','date'], ascending = [False,False])
On this example we are ordering first age on a descending way. Then we order date based on the age.
Similar to the other example if we have duplicated ages, date is going to be ordered on a descending way for those ages.
If you want you can also order one column in an ascending order and the other one in a descending and vice versa.
police_shooting_col.sort_values(['age','date'], ascending = [True,False])
police_shooting_col.sort_values(['age','date'], ascending = [False,True])
Results may vary depending on what you are looking for.
3. Sorting results by index.
There are other ways to sort dataframes, one way consist in sorting it by index.
Syntax is almost the same but instead of using “sort_values” we are going to use “sort_index”.