Pandas Analyzing Data
Analyzing data is one of the most critical steps in any data science or analytics workflow. Pandas offers a wide range of functions and methods to explore, summarize, and understand data efficiently. These tools help in deriving meaningful insights and preparing data for further analysis. Let’s dive into some of the essential features of Pandas for data analysis.
Generating Summary Statistics
You can use the describe() method to generate summary statistics for the numerical columns in your DataFrame. This method provides useful metrics such as count, mean, standard deviation, minimum, maximum, and percentiles. Here’s an example:
import pandas as pd
# Create a sample DataFrame
data = {
"City": ["Chennai", "Bengaluru", "Hyderabad", "Mumbai", "Delhi"],
"Population (Millions)": [7.09, 8.44, 6.81, 12.44, 16.79],
"Area (sq km)": [426, 741, 650, 603, 1484]
}
df = pd.DataFrame(data)
# Generate summary statistics
print(df.describe())
Output
Statistic | Population (Millions) | Area (sq km) |
---|---|---|
count | 5.000000 | 5.000000 |
mean | 10.314000 | 780.800000 |
std | 4.261529 | 409.474908 |
min | 6.810000 | 426.000000 |
25% | 7.090000 | 603.000000 |
50% | 8.440000 | 650.000000 |
75% | 12.440000 | 741.000000 |
max | 16.790000 | 1484.000000 |
Explanation: The describe() method generates summary statistics for the numerical columns in the DataFrame, so the text column City is left out. Metrics such as mean, std (standard deviation), and the percentiles (25%, 50%, 75%) show where the values are centered and how widely they are spread.
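To see where these numbers come from, the sketch below recomputes a few of them for one column with the corresponding Series methods and looks up a single value in the describe() result. It rebuilds the same sample DataFrame so that it can run on its own; the variable names stats and pop are only illustrative. Note that std() in Pandas computes the sample standard deviation (ddof=1) by default.

```python
import pandas as pd

# Same sample data as in the example above
df = pd.DataFrame({
    "City": ["Chennai", "Bengaluru", "Hyderabad", "Mumbai", "Delhi"],
    "Population (Millions)": [7.09, 8.44, 6.81, 12.44, 16.79],
    "Area (sq km)": [426, 741, 650, 603, 1484],
})

stats = df.describe()                # by default only numeric columns are summarized
pop = df["Population (Millions)"]    # illustrative variable name

# Each row of describe() corresponds to a Series method
print(pop.count())          # number of non-null values
print(pop.mean())           # arithmetic mean
print(pop.std())            # sample standard deviation (ddof=1 by default)
print(pop.quantile(0.25))   # 25th percentile, linearly interpolated
print(pop.median())         # same value as the 50% row

# Look up a single statistic in the describe() result
print(stats.loc["mean", "Population (Millions)"])

# Text columns are included only when you ask for them
print(df.describe(include="all"))
```

With include="all", the City column also appears, with statistics such as unique, top, and freq, while the numeric-only rows show NaN for it.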
Inspecting Data
Pandas provides functions to quickly inspect the structure and content of your dataset. Use head() to preview the first few rows, info() to check column types, and shape to see the dimensions of the DataFrame. Here’s an example:
# Inspect the first few rows
print(df.head())
# Get information about the DataFrame (info() prints its report itself and returns None)
df.info()
# Get the shape of the DataFrame
print("Shape:", df.shape)
Output
Head Output:
City | Population (Millions) | Area (sq km) |
---|---|---|
Chennai | 7.09 | 426 |
Bengaluru | 8.44 | 741 |
Hyderabad | 6.81 | 650 |
Mumbai | 12.44 | 603 |
Delhi | 16.79 | 1484 |
Info Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
 #   Column                 Non-Null Count  Dtype
---  ------                 --------------  -----
 0   City                   5 non-null      object
 1   Population (Millions)  5 non-null      float64
 2   Area (sq km)           5 non-null      int64
dtypes: float64(1), int64(1), object(1)
memory usage: 248.0 bytes
Shape Output:
(5, 3)
Explanation: The head() method previews the first 5 rows of the DataFrame by default, info() reports column types and non-null counts, and the shape attribute returns the dimensions of the DataFrame as (rows, columns).
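A few variations on these inspection tools are sketched below: head() accepts a row count, tail() shows the last rows, dtypes lists the column types without the extra detail of info(), and shape can be unpacked into two variables. The DataFrame is rebuilt so that the snippet runs on its own, and names such as first_two, n_rows, and n_cols are only illustrative.

```python
import pandas as pd

# Same sample data as in the example above
df = pd.DataFrame({
    "City": ["Chennai", "Bengaluru", "Hyderabad", "Mumbai", "Delhi"],
    "Population (Millions)": [7.09, 8.44, 6.81, 12.44, 16.79],
    "Area (sq km)": [426, 741, 650, 603, 1484],
})

# Pass a number to head() or tail() to control how many rows are returned
first_two = df.head(2)
last_two = df.tail(2)
print(first_two)
print(last_two)

# dtypes is a compact alternative to info() when you only need column types
print(df.dtypes)

# shape is an attribute (no parentheses) holding a (rows, columns) tuple
n_rows, n_cols = df.shape
print(f"{n_rows} rows, {n_cols} columns")

# info() prints its report and returns None, so call it without print()
result = df.info()
```

Because info() returns None, wrapping it in print() would add a stray "None" line after the report.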
Key Takeaways
- Summary Statistics: Use describe() to get an overview of numerical data.
- Data Inspection: Tools like head(), info(), and shape help you quickly inspect the dataset’s structure and content.
- Efficient Analysis: These tools allow you to understand the dataset and prepare it for further analysis.