In this post, we will see what Pandas is and how we can use it in our Python projects.
Pandas is an open-source Python library that provides high-performance, easy-to-use data structures and data analysis tools.
Built on top of NumPy, it is particularly well-suited for handling structured (tabular) data and offers fast, flexible tools for data cleaning, transformation, and analysis. Pandas is widely used in data science, finance, academia, and many other fields where data manipulation is crucial.
Key Highlights of Pandas
- DataFrame: The core data structure, a table with labeled rows and columns.
- Fast & Efficient: Built on top of NumPy, it’s optimized for high-performance data operations.
- Rich Ecosystem: Strong integration with libraries like Matplotlib, NumPy, SciPy, and others in the Python data stack.
Why and When Should We Use Pandas?
- Working with tabular data: If our data naturally fits into rows and columns (like spreadsheets, database tables, or CSV files), Pandas’ DataFrame object will make our life much easier than trying to manage nested lists or dictionaries.
- Data cleaning: With Pandas, we can easily handle missing data, filter outliers, and transform data into a usable format.
- Data analysis: When we need to aggregate, group, pivot, or perform statistical operations on our data, Pandas provides concise methods that would otherwise require many lines of custom code.
- Integration with other data science tools: Pandas integrates smoothly with visualization libraries like Matplotlib and Seaborn, machine learning frameworks, and other Python data tools.
Let’s see some core functionalities of Pandas through practical examples. Before we begin, make sure Pandas is installed in our environment:
pip install pandas
Creating a DataFrame
import pandas as pd
# A dictionary of lists, where keys will become column names
data = {
    'Name': ['Dorian', 'Marco', 'Laura'],
    'Age': [32, 21, 43],
    'City': ['Amsterdam', 'London', 'Rome']
}
# Create a DataFrame from this data
df = pd.DataFrame(data)
# Display the DataFrame
print(df)
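A DataFrame can also be built directly from a list of row dictionaries instead of a dictionary of columns; a minimal sketch using the same data:
import pandas as pd
# Each dictionary is one row; its keys become the column names
rows = [
    {'Name': 'Dorian', 'Age': 32, 'City': 'Amsterdam'},
    {'Name': 'Marco', 'Age': 21, 'City': 'London'},
    {'Name': 'Laura', 'Age': 43, 'City': 'Rome'}
]
df_from_rows = pd.DataFrame(rows)
print(df_from_rows)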

Indexing and Filtering Rows
import pandas as pd
# A dictionary of lists, where keys will become column names
data = {
    'Name': ['Dorian', 'Marco', 'Laura'],
    'Age': [32, 21, 43],
    'City': ['Amsterdam', 'London', 'Rome']
}
df = pd.DataFrame(data)
# Filter rows where Age > 28
filtered_df = df[df['Age'] > 28]
print("Filtered DataFrame where Age > 28:")
print(filtered_df)
# Select specific columns
name_city_df = df[['Name', 'City']]
print("\nDataFrame with only Name and City columns:")
print(name_city_df)
df_indexed = df.set_index('Name')
print("\nDataFrame with Name as index:")
print(df_indexed)
# Using query() for more readable filtering
over_40 = df.query('Age > 40')
print("\nFiltered DataFrame using query() where Age > 40:")
print(over_40)
# Add 5 years to everyone's age
df['Age'] = df['Age'] + 5
df_indexed = df.set_index('Name')
print("\nDataFrame after adding 5 years to Age with Name as index:")
print(df_indexed)
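Beyond boolean masks and query(), rows can also be selected by label with loc or by position with iloc; a minimal sketch using the DataFrames defined above:
# Select a row by its index label (Name is the index in df_indexed)
print(df_indexed.loc['Dorian'])
# Select a row by its integer position
print(df.iloc[0])
# Select a slice of rows and a single column at once
print(df.loc[0:1, 'Age'])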

Merging and Joining DataFrames
import pandas as pd
# Define two separate DataFrames
employees = pd.DataFrame({
    'EmployeeID': [1, 2, 3],
    'Name': ['Roberto', 'Lucas', 'Anna']
})
salaries = pd.DataFrame({
    'EmployeeID': [1, 2, 3],
    'Salary': [70000, 65000, 80000]
})
# Merge on a common key, 'EmployeeID'
merged_df = pd.merge(employees, salaries, on='EmployeeID')
print(merged_df.sort_values(by='Salary', ascending=False))
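By default merge() performs an inner join, keeping only the keys present in both DataFrames. When one side can have rows without a match, a left join keeps them and fills the missing columns with NaN; a minimal sketch, where the extra employee is made up purely for illustration:
# A hypothetical employee with no salary record yet
new_row = pd.DataFrame({'EmployeeID': [4], 'Name': ['Giulia']})
employees_extra = pd.concat([employees, new_row], ignore_index=True)
# Left join: keep every employee, even without a matching salary (Salary becomes NaN)
left_merged = pd.merge(employees_extra, salaries, on='EmployeeID', how='left')
print(left_merged)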

Grouping and Aggregation
import pandas as pd
# Define two separate DataFrames
employees = pd.DataFrame({
    'EmployeeID': [1, 2, 3],
    'Name': ['Roberto', 'Lucas', 'Anna'],
    'Office': ['Rome', 'London', 'Rome']
})
salaries = pd.DataFrame({
    'EmployeeID': [1, 2, 3],
    'Salary': [70000, 65000, 80000]
})
# Merge on a common key, 'EmployeeID'
merged_df = pd.merge(employees, salaries, on='EmployeeID')
# Group by Office and compute the average salary per office
grouped = merged_df.groupby('Office')['Salary'].mean()
print(grouped)
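groupby() is not limited to a single statistic; agg() can compute several at once, returning one column per function. A minimal sketch on the same merged_df:
# Average, maximum, and count of salaries per office in one pass
summary = merged_df.groupby('Office')['Salary'].agg(['mean', 'max', 'count'])
print(summary)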

Dealing with Missing Data
import pandas as pd
import numpy as np
data = {
    'Name': ['Dorian', 'Marco', 'Laura'],
    'Age': [32, np.nan, 43],
    'Office': ['Amsterdam', 'London', None]
}
df_start = pd.DataFrame(data)
print("Original DataFrame:")
print(df_start)
# Check for missing values
print("\nMissing values in the DataFrame:")
print(df_start.isnull())
# Drop rows with any missing values
print("\nDataFrame after dropping rows with missing values:")
df_dropped = df_start.dropna()
print(df_dropped)
# Fill missing values with a placeholder
print("\nDataFrame after filling missing values:")
df_filled = df_start.fillna({'Age': 0, 'Office': 'Unknown'})
print(df_filled)
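Instead of a fixed placeholder, missing numeric values are often filled with a statistic such as the column mean; a minimal sketch on the same df_start:
# Fill the missing Age with the mean of the known ages (mean() skips NaN by default)
df_mean_filled = df_start.fillna({'Age': df_start['Age'].mean()})
print("\nDataFrame after filling the missing Age with the column mean:")
print(df_mean_filled)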

Reading Data from a CSV File
import pandas as pd
# Read data from a CSV file named 'employees.csv'
df_employees = pd.read_csv('employees.csv')
# Print the first 5 rows of the DataFrame
print(df_employees.head())
# Print the summary info (column types, non-null counts, etc.)
# Note: info() prints directly and returns None, so we call it without print()
df_employees.info()
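The reverse direction works the same way: to_csv() writes a DataFrame back to disk. A minimal sketch, where the output file name is just an example:
# Write the DataFrame to a new CSV file, without the index column
df_employees.to_csv('employees_clean.csv', index=False)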
Working with Excel Files
import pandas as pd
# By default, read_excel() will read the first sheet of the Excel file.
# If we need to specify a particular sheet, we can pass sheet_name='SheetName'.
df = pd.read_excel('data.xlsx')
# Now df contains all rows and columns from the sheet we read.
# We can inspect the DataFrame by printing it or calling .head()
print(df.head())
# Now we have all the Excel data in a DataFrame, ready for filtering,
# grouping, merging, or any other Pandas operations we have seen above.
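Writing back to Excel is symmetric, and a specific sheet can be read by name. Note that handling .xlsx files typically requires the openpyxl package. A minimal sketch, where the sheet and output file names are just examples:
# Read a specific sheet by name (hypothetical sheet 'Sales') instead of the first one
df_sales = pd.read_excel('data.xlsx', sheet_name='Sales')
# Write the DataFrame to a new Excel file (hypothetical name), without the index column
df_sales.to_excel('report.xlsx', sheet_name='Summary', index=False)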
Data Visualization
import pandas as pd
import matplotlib.pyplot as plt
employees = pd.DataFrame({
    "EmployeeID": [1, 2, 3],
    "Name": ["Roberto", "Lucas", "Anna"],
    "Salary": [70000, 65000, 80000]
})
# ------------------------------------------------------------------
# 1. Bar chart – salary per employee
# ------------------------------------------------------------------
plt.figure()
plt.bar(employees["Name"], employees["Salary"])
plt.title("Salary per Employee")
plt.xlabel("Employee")
plt.ylabel("Salary")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
# ------------------------------------------------------------------
# 2. Line plot – “trend” of salaries in EmployeeID order
# (sort first so the plot follows the natural ID sequence)
# ------------------------------------------------------------------
employees_sorted = employees.sort_values("EmployeeID")
plt.figure()
plt.plot(employees_sorted["Name"], employees_sorted["Salary"], marker="o")
plt.title("Salary Trend by Employee")
plt.xlabel("Employee")
plt.ylabel("Salary")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
# ------------------------------------------------------------------
# 3. Histogram – distribution of salaries
# ------------------------------------------------------------------
plt.figure()
plt.hist(employees["Salary"], bins=5)
plt.title("Distribution of Salaries")
plt.xlabel("Salary")
plt.ylabel("Frequency")
plt.tight_layout()
plt.show()
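For quick exploration, Pandas also exposes a plotting shortcut that wraps Matplotlib, so a simple chart does not need explicit plt.bar() calls; a minimal sketch with the same employees DataFrame:
# Pandas' built-in plotting wrapper around Matplotlib
employees.plot(kind="bar", x="Name", y="Salary", title="Salary per Employee")
plt.tight_layout()
plt.show()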


