Pandas Analyzing Data

Analyzing data is one of the most critical steps in any data science or analytics workflow. Pandas offers a wide range of functions and methods to explore, summarize, and understand data efficiently. These tools help in deriving meaningful insights and preparing data for further analysis. Let’s dive into some of the essential features of Pandas for data analysis.

Generating Summary Statistics

You can use the describe() method to generate summary statistics for the numerical columns in your DataFrame. For each column it reports the count, mean, standard deviation, minimum, maximum, and the 25th, 50th, and 75th percentiles. Here’s an example:

import pandas as pd

# Create a sample DataFrame
data = {
    "City": ["Chennai", "Bengaluru", "Hyderabad", "Mumbai", "Delhi"],
    "Population (Millions)": [7.09, 8.44, 6.81, 12.44, 16.79],
    "Area (sq km)": [426, 741, 650, 603, 1484]
}

df = pd.DataFrame(data)

# Generate summary statistics
print(df.describe())

Output

       Population (Millions)  Area (sq km)
count               5.000000      5.000000
mean               10.314000    780.800000
std                 4.261529    409.474908
min                 6.810000    426.000000
25%                 7.090000    603.000000
50%                 8.440000    650.000000
75%                12.440000    741.000000
max                16.790000   1484.000000

Explanation: By default, describe() summarizes only the numerical columns, so City is omitted. The mean and std (sample standard deviation) describe the center and spread of each column, while min, max, and the percentiles show how the values are distributed. The 50% row is the median.
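To see where these numbers come from, the same statistics can be computed one at a time. The following is a minimal sketch on the sample data above, rebuilt here so the block runs on its own; the variable name pop is illustrative only.

```python
import pandas as pd

# Rebuild the sample DataFrame from the example above
df = pd.DataFrame({
    "City": ["Chennai", "Bengaluru", "Hyderabad", "Mumbai", "Delhi"],
    "Population (Millions)": [7.09, 8.44, 6.81, 12.44, 16.79],
    "Area (sq km)": [426, 741, 650, 603, 1484],
})

pop = df["Population (Millions)"]  # illustrative variable name

# Each row of describe() corresponds to a single Series method
print("count:", pop.count())
print("mean: ", round(pop.mean(), 3))
print("std:  ", round(pop.std(), 3))   # sample standard deviation (ddof=1)
print("50%:  ", pop.quantile(0.50))    # the median

# include="all" also summarizes non-numeric columns such as City
print(df.describe(include="all"))
```

With include="all", the summary gains rows such as unique, top, and freq for the text column, and numeric-only rows show NaN for it.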

Inspecting Data

Pandas provides methods and attributes to quickly inspect the structure and content of your dataset. Use head() to preview the first few rows, info() to check column types and non-null counts, and the shape attribute to see the dimensions of the DataFrame. Here’s an example:

# Inspect the first few rows
print(df.head())

# Print information about the DataFrame
# (info() prints its report directly and returns None, so it is not wrapped in print())
df.info()

# Get the shape of the DataFrame
print("Shape:", df.shape)

Output

Head Output:

        City  Population (Millions)  Area (sq km)
0    Chennai                   7.09           426
1  Bengaluru                   8.44           741
2  Hyderabad                   6.81           650
3     Mumbai                  12.44           603
4      Delhi                  16.79          1484

Info Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
 #   Column                 Non-Null Count  Dtype
---  ------                 --------------  -----
 0   City                   5 non-null      object
 1   Population (Millions)  5 non-null      float64
 2   Area (sq km)           5 non-null      int64
dtypes: float64(1), int64(1), object(1)
memory usage: 248.0+ bytes

Shape Output:

(5, 3)

Explanation: The head() method returns the first 5 rows of the DataFrame by default, info() prints the column types, non-null counts, and memory usage, and the shape attribute holds the dimensions of the DataFrame as (rows, columns).
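A few variations on these calls are sketched below on the same sample data, rebuilt so the block is self-contained. The names n_rows, n_cols, buffer, and report are illustrative and not part of Pandas.

```python
import io
import pandas as pd

# Rebuild the sample DataFrame from the example above
df = pd.DataFrame({
    "City": ["Chennai", "Bengaluru", "Hyderabad", "Mumbai", "Delhi"],
    "Population (Millions)": [7.09, 8.44, 6.81, 12.44, 16.79],
    "Area (sq km)": [426, 741, 650, 603, 1484],
})

# head() accepts the number of rows to return; the default is 5
print(df.head(2))

# shape is an attribute (no parentheses) and unpacks into rows and columns
n_rows, n_cols = df.shape
print(f"{n_rows} rows, {n_cols} columns")

# info() prints its report and returns None; pass buf= to capture the text
buffer = io.StringIO()
df.info(buf=buffer)
report = buffer.getvalue()
print(report)

# dtypes returns just the column types as a Series
print(df.dtypes)
```

Capturing the info() report in a buffer is useful when the text needs to go to a log file rather than the console.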

Key Takeaways

  • Summary Statistics: Use describe() to get an overview of numerical data.
  • Data Inspection: The head() and info() methods and the shape attribute help quickly inspect the dataset’s structure and content.
  • Efficient Analysis: These tools allow you to understand the dataset and prepare it for further analysis.