Pandas

Introduction

Pandas is the most widely used Python library for working with tabular data. It provides two core data structures — Series (a single column) and DataFrame (a table of rows and columns) — along with a comprehensive set of tools for loading, cleaning, transforming, and analysing data.

Install it first:

pip install pandas
# or with uv
uv add pandas

Importing

import pandas as pd

Creating a DataFrame

import pandas as pd

data = {
    "firstname": ["Alice", "Bob", "Charlie", "Diana"],
    "department": ["Engineering", "Finance", "Engineering", "HR"],
    "salary": [80000, 65000, 75000, 55000],
}

df = pd.DataFrame(data)
print(df)

Output

  firstname   department  salary
   Alice  Engineering   80000
     Bob      Finance   65000
 Charlie  Engineering   75000
   Diana           HR   55000

Reading Data

# From CSV
df = pd.read_csv("employees.csv")

# From JSON
df = pd.read_json("employees.json")

# First look at the data
print(df.head())        # first 5 rows
print(df.tail(3))       # last 3 rows
print(df.shape)         # (rows, columns)
print(df.columns)       # column names
print(df.dtypes)        # data types
print(df.info())        # summary of structure and nulls
print(df.describe())    # statistics for numeric columns

Selecting Data

# Select a single column (returns a Series)
print(df["salary"])

# Select multiple columns (returns a DataFrame)
print(df[["firstname", "salary"]])

# Select rows by condition
high_earners = df[df["salary"] > 70000]

# Multiple conditions
engineers = df[(df["department"] == "Engineering") & (df["salary"] > 70000)]

# Select by row index — iloc uses integer positions
print(df.iloc[0])       # first row
print(df.iloc[0:3])     # first three rows

# Select by label — loc uses index labels
print(df.loc[0, "salary"])

Adding and Modifying Columns

# New column from calculation
df["tax"] = df["salary"] * 0.2
df["net_salary"] = df["salary"] - df["tax"]

# Update values conditionally
df.loc[df["department"] == "Engineering", "salary"] = df["salary"] * 1.05

Handling Missing Data

print(df.isnull().sum())         # count nulls per column
df_clean = df.dropna()           # remove rows with any null
df_filled = df.fillna(0)         # replace nulls with 0
df["salary"] = df["salary"].fillna(df["salary"].mean())

Grouping and Aggregation

# Group by department, calculate mean salary
dept_summary = df.groupby("department")["salary"].mean()
print(dept_summary)

# Multiple aggregations
summary = df.groupby("department").agg(
    headcount=("firstname", "count"),
    avg_salary=("salary", "mean"),
    total_salary=("salary", "sum"),
)
print(summary)

Sorting

df_sorted = df.sort_values("salary", ascending=False)
df_sorted = df.sort_values(["department", "salary"], ascending=[True, False])

Writing Data

df.to_csv("output.csv", index=False)
df.to_json("output.json", orient="records", indent=2)

Practice Exercises

Load the employees.csv you created in Grade 5 into a DataFrame. Print its shape and column names.
Filter the DataFrame to only show employees in Engineering.
Add a tax column (20% of salary) and a net_salary column.
Group by department and calculate the average and total salary per department.
Sort by salary descending and write the result to a new CSV file.

Pandas

Introduction​

Importing​

Creating a DataFrame​

Reading Data​

Selecting Data​

Adding and Modifying Columns​

Handling Missing Data​

Grouping and Aggregation​

Sorting​

Writing Data​

Practice Exercises​