Home Data Analysis with Python

Data Analysis with Python

by Bernard Baah

Introduction to Pandas

Pandas is an open-source Python library providing high-performance, easy-to-use data structures, and data analysis tools. It’s an essential tool for data scientists and analysts working in Python.

Installation

First, ensure that Python and pip (Python’s package installer) are installed on your system. You can install Pandas using pip:

pip install pandas

Basic Concepts

Data Structures: Series and DataFrame

  • Series: A one-dimensional labeled array capable of holding any data type.
  • DataFrame: A two-dimensional labeled data structure with columns of potentially different types.

Importing Pandas

Before you can use Pandas, you need to import it. The library is typically imported using the alias pd:

import pandas as pd

Creating DataFrames

You can create a DataFrame from various data sources, such as a list, dictionary, or even a CSV file.

# Creating a DataFrame from a dictionary
data = {
‘Name’: [‘John’, ‘Anna’, ‘Peter’, ‘Linda’],
‘Age’: [28, 34, 29, 32],
‘City’: [‘New York’, ‘Paris’, ‘Berlin’, ‘London’]
}
df = pd.DataFrame(data)

print(df)

Reading Data

Pandas supports reading data from many sources including CSV files, Excel files, SQL databases, and more.

# Reading data from a CSV file
df = pd.read_csv(‘path/to/your/file.csv’)

Data Manipulation

Viewing Data

  • df.head(n) – Displays the first n rows of the DataFrame.
  • df.tail(n) – Displays the last n rows of the DataFrame.

Selecting Data

  • Select a column: df['column_name']
  • Select rows by position: df.iloc[position]
  • Select rows by index: df.loc[index]

Filtering Data

You can use conditions to filter rows.

# Filtering data
filtered_df = df[df[‘Age’] > 30]

Adding and Dropping Columns

  • Add a new column: df['new_column'] = value
  • Drop a column: df.drop('column_name', axis=1, inplace=True)

Basic Data Analysis

Pandas provides several methods to perform basic data analysis:

  • df.describe() – Provides a quick overview of the numerical data in your DataFrame.
  • df.mean() – Calculates the mean of each numerical column.
  • df.corr() – Calculates the correlation between columns.

Visualizing Data

Pandas can also be used in conjunction with libraries like Matplotlib and Seaborn for data visualization.

import matplotlib.pyplot as plt

# Plotting a histogram
df[‘Age’].plot(kind=’hist’)
plt.show()

Example 1: Working with Time Series Data

Time series data is common in many fields, such as finance, economics, and even in web analytics. Here’s how you can handle time series data in Pandas:

import pandas as pd

# Creating a time series dataset
dates = pd.date_range(‘20230101’, periods=6)
df = pd.DataFrame(data = range(6), index=dates, columns=[‘Value’])

print(df)

# Resampling – aggregating data by month
monthly_resampled_data = df.resample(‘M’).mean()
print(monthly_resampled_data)

This example demonstrates creating a time series data set, indexing it with dates, and then resampling it by month to aggregate the data.

Example 2: Merging DataFrames

Merging data is a common operation when dealing with relational data. Here’s how you can merge two DataFrames:

import pandas as pd

# Creating two DataFrames
df1 = pd.DataFrame({
‘EmployeeID’: [‘1’, ‘2’, ‘3’],
‘Name’: [‘John Doe’, ‘Jane Smith’, ‘Emily Davis’]
})

df2 = pd.DataFrame({
‘EmployeeID’: [‘1’, ‘2’, ‘4’],
‘Position’: [‘Data Scientist’, ‘Software Engineer’, ‘Product Manager’]
})

# Merging DataFrames
merged_df = pd.merge(df1, df2, on=’EmployeeID’)
print(merged_df)

This example merges two DataFrames on the EmployeeID column, which is common to both DataFrames, showcasing a simple inner join operation.

Example 3: Handling Missing Data

Handling missing data is crucial in data preprocessing. Pandas provides convenient methods to deal with it:

import pandas as pd
import numpy as np

# Creating a DataFrame with missing values
df = pd.DataFrame({
‘A’: [1, 2, np.nan],
‘B’: [5, np.nan, np.nan],
‘C’: [1, 2, 3]
})

# Filling missing values with a placeholder
df.fillna(value=0, inplace=True)
print(df)

# Dropping any rows with missing values
# df.dropna(inplace=True)

This snippet demonstrates how to handle missing data by either filling NaN values with a placeholder or dropping rows with missing values.

Example 4: Grouping and Aggregating Data

Grouping data based on certain criteria and then applying aggregation functions is a powerful data analysis technique.

import pandas as pd

# Creating a DataFrame

data = {
‘Company’: [‘Google’, ‘Google’, ‘Microsoft’, ‘Microsoft’, ‘Facebook’, ‘Facebook’],
‘Employee’: [‘Sam’, ‘Charlie’, ‘Amy’, ‘Vanessa’, ‘Carl’, ‘Sarah’],
‘Sales’: [200, 120, 340, 124, 243, 350]
}
df = pd.DataFrame(data)

# Grouping by company and calculating total sales
grouped = df.groupby(‘Company’).sum()
print(grouped)

 

Example 5: Advanced Grouping with Aggregations

Pandas allows for sophisticated grouping operations, combining them with various aggregations to summarize data effectively.

import pandas as pd

# Sample dataset
data = {
‘Department’: [‘Sales’, ‘Sales’, ‘HR’, ‘HR’, ‘IT’, ‘IT’],
‘Employee’: [‘John’, ‘Jane’, ‘Mike’, ‘Sally’, ‘Emma’, ‘Tom’],
‘Salary’: [70000, 80000, 50000, 60000, 75000, 82000],
‘Join Date’: [‘2019-01-23’, ‘2020-02-10’, ‘2018-05-14’, ‘2019-07-02’, ‘2020-11-29’, ‘2021-01-30’]
}

df = pd.DataFrame(data)

# Converting ‘Join Date’ to datetime
df[‘Join Date’] = pd.to_datetime(df[‘Join Date’])

# Grouping by Department and applying multiple aggregation functions to Salary
aggregated_df = df.groupby(‘Department’)[‘Salary’].agg([‘mean’, ‘min’, ‘max’])
print(aggregated_df)

This example demonstrates grouping the DataFrame by the Department column and then applying multiple aggregation functions to the Salary column to find the mean, minimum, and maximum salaries within each department.

Example 6: Filtering after Grouping

Filtering groups based on certain criteria after you’ve grouped data can be very useful for narrowing down your data analysis to specific conditions.

# Filtering to find departments where the average salary is over 65000
high_pay = aggregated_df[aggregated_df[‘mean’] > 65000]
print(high_pay)

This snippet filters the previously aggregated DataFrame to only show departments where the average salary is greater than $65,000.

Example 7: Grouping and Filtering with Complex Conditions

Pandas allows for complex grouping and filtering operations, including using custom functions to apply more intricate conditions.

# Group by Department and filter out groups with less than 2 employees
filtered_df = df.groupby(‘Department’).filter(lambda x: len(x) > 1)
print(filtered_df)

Here, the data is grouped by Department, and only those departments with more than one employee are retained. This kind of operation is useful for filtering data based on the size of the group.

Example 8: Grouping by Multiple Columns

You can also group data by multiple columns, which allows for a more granular analysis of the dataset.

# Adding a ‘Year’ column for the year of the join date
df[‘Year’] = df[‘Join Date’].dt.year

# Grouping by both Department and Year
grouped_multiple = df.groupby([‘Department’, ‘Year’]).mean()
print(grouped_multiple)

This example groups the data first by Department and then by Year (extracted from the Join Date column), calculating the average salary for each group. It demonstrates how to perform a more detailed analysis by grouping data based on multiple criteria

Example 9: Grouping with Custom Aggregate Functions

Pandas allows you to define custom aggregate functions, which can be applied to grouped data for more specific analysis.

import pandas as pd

# Sample dataset
data = {
‘Category’: [‘A’, ‘A’, ‘B’, ‘B’, ‘C’, ‘C’],
‘Values’: [100, 150, 200, 250, 300, 350]
}
df = pd.DataFrame(data)

# Custom aggregation function to compute the range (max – min)
def range_func(group):
return group.max() – group.min()

# Applying the custom function to groups
range_df = df.groupby(‘Category’)[‘Values’].agg(range_func).reset_index(name=’Range’)
print(range_df)

This example demonstrates how to apply a custom function to grouped data, calculating the range of values within each category.

Example 10: Conditional Filtering with query()

The query() method provides a more readable way to perform conditional filtering, especially useful for complex conditions.

import pandas as pd
import numpy as np

# Sample DataFrame
df = pd.DataFrame({
‘Name’: [‘John’, ‘Anna’, ‘Peter’, ‘Linda’],
‘Age’: [28, 34, 29, 32],
‘Salary’: [50000, 60000, np.nan, 52000]
})

# Using query() to filter data
filtered_df = df.query(‘Age > 30 & Salary > 50000’)
print(filtered_df)

This snippet filters the DataFrame for individuals over 30 years of age with a salary greater than 50,000, showcasing how query() can simplify complex filtering logic.

Example 11: Complex Grouping and Sorting

Combining grouping with sorting can reveal patterns or trends within specific segments of the data.

This example calculates the average values for each category, then sorts the results in descending order to quickly identify the categories with the highest average values.# Grouping by ‘Category’ and calculating average values, then sorting by ‘Values’
grouped_sorted = df.groupby(‘Category’)[‘Values’].mean().sort_values(ascending=False).reset_index()
print(grouped_sorted)

Example 12: Multi-level Indexing in Grouping

When grouping by multiple columns, you can work with multi-level indices, which allows for more detailed analysis at different levels of granularity.

# Sample data including a ‘Subcategory’ column
data = {
‘Category’: [‘A’, ‘A’, ‘A’, ‘B’, ‘B’, ‘B’],
‘Subcategory’: [‘X’, ‘X’, ‘Y’, ‘X’, ‘Y’, ‘Y’],
‘Values’: [10, 20, 15, 25, 30, 35]
}
df = pd.DataFrame(data)

# Grouping by multiple columns
multi_level_df = df.groupby([‘Category’, ‘Subcategory’]).sum()
print(multi_level_df)

# Accessing a specific group
print(multi_level_df.loc[(‘A’, ‘X’)])

This example demonstrates multi-level grouping and how to access data within a specific group, offering a nuanced approach to analyzing hierarchical data.

Example 13: Handling Missing Data

Pandas provides multiple options for handling missing data, including filling missing values with a specific value, or interpolating the values based on other data points.

import pandas as pd
import numpy as np

# Creating a DataFrame with missing values
df = pd.DataFrame({
‘A’: [1, 2, np.nan, 4, 5],
‘B’: [5, np.nan, np.nan, 8, 9],
‘C’: [10, 11, 12, np.nan, 14]
})

# Filling missing values with a specific value
df_filled = df.fillna(value=0)

# Interpolating missing values
df_interpolated = df.interpolate()

print(“Filled missing values:\n”, df_filled)
print(“\nInterpolated missing values:\n”, df_interpolated)

Example 14: Removing Duplicates

Removing duplicate records is essential to ensure the integrity of your data analysis. Pandas makes it easy to remove duplicates.

# Sample DataFrame with duplicate rows
df = pd.DataFrame({
‘A’: [‘foo’, ‘bar’, ‘foo’, ‘bar’, ‘foo’, ‘bar’, ‘foo’, ‘foo’],
‘B’: [‘one’, ‘one’, ‘two’, ‘three’, ‘two’, ‘two’, ‘one’, ‘three’],
‘C’: [1, 2, 3, 4, 5, 6, 7, 8],
‘D’: [9, 10, 11, 12, 13, 14, 15, 16]
})

# Removing duplicate rows
df_no_duplicates = df.drop_duplicates()

print(df_no_duplicates)

Example 15: Replacing Values

In some cases, you might need to replace values in your dataset, either due to errors or to standardize the data.

# Replacing values in a DataFrame
df = pd.DataFrame({
‘A’: [1, 2, 3, 4, 5],
‘B’: [5, 6, 7, 8, 9],
‘C’: [‘one’, ‘two’, ‘three’, ‘four’, ‘five’]
})

# Replacing a value
df_replaced = df.replace(5, 500)

print(df_replaced)

Data cleaning is a critical step in the data analysis process, ensuring the accuracy and completeness of the data before any analysis is performed. Below are examples that cover various data cleaning tasks using Pandas, suitable for enhancing the curriculum of your IT training programs.

Example 13: Handling Missing Data

Pandas provides multiple options for handling missing data, including filling missing values with a specific value, or interpolating the values based on other data points.

import pandas as pd
import numpy as np

# Creating a DataFrame with missing values
df = pd.DataFrame({
‘A’: [1, 2, np.nan, 4, 5],
‘B’: [5, np.nan, np.nan, 8, 9],
‘C’: [10, 11, 12, np.nan, 14]
})

# Filling missing values with a specific value
df_filled = df.fillna(value=0)

# Interpolating missing values
df_interpolated = df.interpolate()

print(“Filled missing values:\n”, df_filled)
print(“\nInterpolated missing values:\n”, df_interpolated)

Example 14: Removing Duplicates

Removing duplicate records is essential to ensure the integrity of your data analysis. Pandas makes it easy to remove duplicates.

# Sample DataFrame with duplicate rows
df = pd.DataFrame({
‘A’: [‘foo’, ‘bar’, ‘foo’, ‘bar’, ‘foo’, ‘bar’, ‘foo’, ‘foo’],
‘B’: [‘one’, ‘one’, ‘two’, ‘three’, ‘two’, ‘two’, ‘one’, ‘three’],
‘C’: [1, 2, 3, 4, 5, 6, 7, 8],
‘D’: [9, 10, 11, 12, 13, 14, 15, 16]
})

# Removing duplicate rows
df_no_duplicates = df.drop_duplicates()

print(df_no_duplicates)

Example 15: Replacing Values

In some cases, you might need to replace values in your dataset, either due to errors or to standardize the data.

# Replacing values in a DataFrame
df = pd.DataFrame({
‘A’: [1, 2, 3, 4, 5],
‘B’: [5, 6, 7, 8, 9],
‘C’: [‘one’, ‘two’, ‘three’, ‘four’, ‘five’]
})

# Replacing a value
df_replaced = df.replace(5, 500)

print(df_replaced)

Example 16: Normalizing Data

Data normalization is a technique often applied as part of the data preprocessing step to standardize the ranges of independent variables or features of data.

# Example DataFrame
df = pd.DataFrame({
‘A’: [1, 2, 3],
‘B’: [100, 200, 300],
‘C’: [1000, 2000, 3000]
})

# Normalizing the DataFrame
df_normalized = (df – df.min()) / (df.max() – df.min())

print(df_normalized)

Example 17: Handling Outliers

Outliers can significantly skew your data analysis. Identifying and handling outliers is therefore crucial.

# Identifying outliers with IQR
df = pd.DataFrame({
‘Data’: [10, 12, 12, 13, 12, 11, 14, 13, 15, 102, 12, 14, 14, 17, 11, 13, 12, 10, 3, 12, 14, 13, 12, 15, 10, 15, 12, 10, 14, 13, 15]
})

Q1 = df[‘Data’].quantile(0.25)
Q3 = df[‘Data’].quantile(0.75)
IQR = Q3 – Q1

# Defining outliers
outliers = df[(df[‘Data’] < (Q1 – 1.5 * IQR)) | (df[‘Data’] > (Q3 + 1.5 * IQR))]

# Handling outliers
df_no_outliers = df[~((df[‘Data’] < (Q1 – 1.5 * IQR)) | (df[‘Data’] > (Q3 + 1.5 * IQR)))]

print(“Outliers:\n”, outliers)
print(“\nData without outliers:\n”, df_no_outliers)

Example 18: Converting Data Types

Sometimes, you’ll need to convert the data type of a column to perform certain operations or analyses correctly.

import pandas as pd

# Sample DataFrame
df = pd.DataFrame({
‘A’: [‘1’, ‘2’, ‘3’],
‘B’: [‘4.5’, ‘5.5’, ‘6.5’],
‘C’: [‘7’, ‘8’, ‘nine’]
})

# Converting columns to numeric, errors=’coerce’ will convert non-convertible values to NaN
df[‘A’] = pd.to_numeric(df[‘A’])
df[‘B’] = pd.to_numeric(df[‘B’])
df[‘C’] = pd.to_numeric(df[‘C’], errors=’coerce’)

print(df.dtypes)
print(df)

Example 19: Renaming Columns

Renaming columns can be necessary for clarity, consistency, or to match specific formatting requirements.

# Renaming columns for clarity or consistency
df.rename(columns={‘A’: ‘Column1’, ‘B’: ‘Column2’, ‘C’: ‘Column3’}, inplace=True)

print(df.columns)

Example 20: Dropping Columns and Rows

Dropping unnecessary columns and rows helps focus the analysis on relevant data.

# Dropping a column
df.drop(‘Column3’, axis=1, inplace=True)

# Dropping a row by index
df.drop(index=1, inplace=True)

print(df)

Example 19: Renaming Columns

Renaming columns can be necessary for clarity, consistency, or to match specific formatting requirements.

# Renaming columns for clarity or consistency
df.rename(columns={‘A’: ‘Column1’, ‘B’: ‘Column2’, ‘C’: ‘Column3’}, inplace=True)

print(df.columns)

Example 20: Dropping Columns and Rows

Dropping unnecessary columns and rows helps focus the analysis on relevant data.

# Dropping a column
df.drop(‘Column3’, axis=1, inplace=True)

# Dropping a row by index
df.drop(index=1, inplace=True)

print(df)

Example 21: Filling Missing Data with Forward/Backward Fill

In time series data, you might want to fill missing values by carrying forward the previous value or backfilling the next value.

# Sample DataFrame with missing values
df = pd.DataFrame({
‘Time’: pd.date_range(‘20240101′, periods=6, freq=’D’),
‘Value’: [1, np.nan, np.nan, 4, np.nan, 6]
})

# Forward fill
df[‘Forward Fill’] = df[‘Value’].fillna(method=’ffill’)

# Backward fill
df[‘Backward Fill’] = df[‘Value’].fillna(method=’bfill’)

print(df)

Example 22: Detecting and Filtering Outliers

Detecting and filtering outliers is often necessary to prevent skewed analysis.

# Using Z-score to detect outliers
from scipy import stats
import numpy as np

df = pd.DataFrame({
‘Data’: np.random.normal(0, 1, 100).tolist() + [10, -10, 12]
})

df[‘Z-score’] = np.abs(stats.zscore(df[‘Data’]))

# Filtering outliers, typically Z-score > 3 are considered outliers
df_no_outliers = df[df[‘Z-score’] < 3]

print(df_no_outliers)

Example 23: String Operations for Text Data

Cleaning text data often involves string operations, such as stripping whitespace, lowercasing, or finding and replacing substrings.

# Sample DataFrame with text data
df = pd.DataFrame({
‘Text’: [‘ Example text ‘, ‘ANOTHER TEXT’, ‘third-Text ‘, ‘FoUrtH_tExT’]
})

# Stripping whitespace
df[‘Text’] = df[‘Text’].str.strip()

# Lowercasing
df[‘Text’] = df[‘Text’].str.lower()

# Replacing dashes and underscores with spaces
df[‘Text’] = df[‘Text’].str.replace(‘-‘, ‘ ‘).str.replace(‘_’, ‘ ‘)

print(df)

Example 24: Splitting Columns

Often, dataset columns contain combined information that needs to be split into separate columns for better analysis.

import pandas as pd

# Sample DataFrame with combined information in one column
df = pd.DataFrame({
‘Name_Age’: [‘John_34’, ‘Doe_45’, ‘Jane_29’]
})

# Splitting the ‘Name_Age’ column into ‘Name’ and ‘Age’ columns
df[[‘Name’, ‘Age’]] = df[‘Name_Age’].str.split(‘_’, expand=True)

# Converting ‘Age’ to numeric type
df[‘Age’] = pd.to_numeric(df[‘Age’])

print(df)

Example 25: Aggregating Text Data

When dealing with text data, you might need to aggregate strings in a group, such as concatenating names or comments.

# Sample DataFrame with text data
df = pd.DataFrame({
‘Group’: [‘A’, ‘A’, ‘B’, ‘B’, ‘C’],
‘Comments’: [‘Good’, ‘Better’, ‘Best’, ‘Worse’, ‘Bad’]
})

# Aggregating text data by group
aggregated_comments = df.groupby(‘Group’)[‘Comments’].agg(‘ ‘.join).reset_index()

print(aggregated_comments)

Example 26: Dealing with Date and Time

Date and time data often require cleaning and formatting to be useful for analysis or as features in a model.

# Sample DataFrame with inconsistent date formats
df = pd.DataFrame({
‘DateTime’: [‘2024-01-01’, ‘1/2/2024’, ‘January 3, 2024’, ‘2024.01.04’, ‘2024/01/05’]
})

# Converting to consistent datetime format
df[‘DateTime’] = pd.to_datetime(df[‘DateTime’])

print(df)

Example 27: Normalizing Text Data

Text data often comes in various forms. Normalizing text to a standard form can be crucial for text analysis or natural language processing tasks.

# Sample DataFrame with text in various cases and spacing
df = pd.DataFrame({
‘Text’: [‘ FIRST entry’, ‘Second Entry ‘, ‘THIRD\tENTRY’]
})

# Normalizing text: strip spaces, replace tabs, and lowercase
df[‘Normalized Text’] = df[‘Text’].str.lower().str.replace(‘\s+’, ‘ ‘, regex=True).str.strip()

print(df[[‘Text’, ‘Normalized Text’]])

# Sample DataFrame with text in various cases and spacing
df = pd.DataFrame({
‘Text’: [‘ FIRST entry’, ‘Second Entry ‘, ‘THIRD\tENTRY’]
})

# Normalizing text: strip spaces, replace tabs, and lowercase
df[‘Normalized Text’] = df[‘Text’].str.lower().str.replace(‘\s+’, ‘ ‘, regex=True).str.strip()

print(df[[‘Text’, ‘Normalized Text’]])

Example 28: Encoding Categorical Variables

For machine learning models, categorical variables need to be encoded as numerical data. Pandas offers a straightforward way to achieve this.

# Sample DataFrame with categorical variable
df = pd.DataFrame({
‘Category’: [‘A’, ‘B’, ‘A’, ‘C’, ‘B’, ‘A’, ‘C’]
})

# Encoding categorical variables using get_dummies()
df_encoded = pd.get_dummies(df, columns=[‘Category’])

print(df_encoded)

Example 29: Cleaning and Tokenizing Text

When preparing text data for analysis or machine learning, cleaning and tokenizing the text can be essential steps.

# Assuming installation of NLTK: pip install nltk
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize

# Sample DataFrame with text
df = pd.DataFrame({
‘Text’: [‘This is a sample sentence.’, ‘Another sample, with punctuation!’]
})

# Function to clean and tokenize text
def clean_and_tokenize(text):
# Lowercasing and removing punctuation
tokens = word_tokenize(text.lower())
# Filtering out non-alphabetic tokens
words = [word for word in tokens if word.isalpha()]
return words

# Applying the function to the text column
df[‘Tokens’] = df[‘Text’].apply(clean_and_tokenize)

print(df[[‘Text’, ‘Tokens’]])

# Assuming installation of NLTK: pip install nltk
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize

# Sample DataFrame with text
df = pd.DataFrame({
‘Text’: [‘This is a sample sentence.’, ‘Another sample, with punctuation!’]
})

# Function to clean and tokenize text
def clean_and_tokenize(text):
# Lowercasing and removing punctuation
tokens = word_tokenize(text.lower())
# Filtering out non-alphabetic tokens
words = [word for word in tokens if word.isalpha()]
return words

# Applying the function to the text column
df[‘Tokens’] = df[‘Text’].apply(clean_and_tokenize)

print(df[[‘Text’, ‘Tokens’]])

Example 30: Filling Missing Time Series Data

Filling missing values in time series data can be approached differently compared to regular data due to the sequential nature of the data.

# Sample time series DataFrame with missing values
df = pd.DataFrame({
‘Date’: pd.date_range(start=’2024-01-01′, periods=6, freq=’D’),
‘Value’: [1, np.nan, np.nan, 4, np.nan, 6]
}).set_index(‘Date’)

# Filling missing values using interpolation
df[‘Interpolated’] = df[‘Value’].interpolate(method=’time’)

print(df)

Example 31: Removing Rows Based on Column Threshold

Remove rows that have less than a certain number of non-NA values.

import pandas as pd
import numpy as np

# Creating a DataFrame with missing values
df = pd.DataFrame(np.random.rand(5, 3), columns=[‘A’, ‘B’, ‘C’])
df.iloc[1:3, 1] = np.nan

# Removing rows with less than 2 non-NA values
df_thresh = df.dropna(thresh=2)

print(df_thresh)

Example 32: Filling Missing Values with Column Mean

Replace missing values with the mean of their respective column.

# Assuming df is defined as before
df[‘B’] = df[‘B’].fillna(df[‘B’].mean())

print(df)

Example 33: Categorical Data Ordinal Encoding

Convert categorical data into an ordinal encoding when the order matters.

# Creating a DataFrame with categorical data
df = pd.DataFrame({
‘Grade’: [‘Excellent’, ‘Good’, ‘Poor’, ‘Good’, ‘Excellent’]
})

# Mapping categories to numbers considering order
grade_map = {‘Poor’: 1, ‘Good’: 2, ‘Excellent’: 3}
df[‘Grade’] = df[‘Grade’].map(grade_map)

print(df)

Example 34: Resetting Index After Row Operations

Reset the DataFrame index after operations like sorting or filtering that may leave gaps in the index.

# Filtering and then resetting index
df_filtered = df[df[‘Grade’] > 1].reset_index(drop=True)

print(df_filtered)

Example 35: Applying a Function to Clean Data

Apply a custom function to a column to perform complex data cleaning operations.

# Example function to clean text data
def clean_text(text):
return text.strip().lower().replace(” “, “_”)

df[‘Cleaned’] = df[‘Grade’].apply(lambda x: clean_text(str(x)))

print(df)

Example 36: Combining str Methods with applymap()

Use str methods combined with applymap() for element-wise operations on a DataFrame of strings.

# Sample DataFrame with text
df = pd.DataFrame({
‘Text1’: [‘ Python’, ‘Pandas ‘, ‘ Data ‘],
‘Text2’: [‘Cleaning’, ‘ Techniques’, ‘Examples’]
})

df = df.applymap(lambda x: x.strip().lower())

print(df)

Example 37: Filtering Rows Using str.contains

Filter rows based on whether a column’s text contains a specific substring.

# Filtering rows where Text1 contains ‘python’ or ‘data’
df_filtered = df[df[‘Text1’].str.contains(‘python|data’)]

print(df_filtered)

Example 38: Expanding and Collapsing Lists

Expand lists into rows or columns, and collapse rows or columns into lists.

# Expanding lists in a column into multiple rows
df = pd.DataFrame({
‘Data’: [[1, 2], [3, 4], [5, 6]]
})

df_expanded = df.explode(‘Data’)

# Collapsing rows into a list
df_collapsed = df_expanded.groupby(level=0).agg(list)

print(df_expanded)
print(df_collapsed)

Example 39: Deduplicating Data Based on a Subset of Columns

Remove duplicate rows based on specific columns, keeping the first occurrence.

# Creating a DataFrame with potential duplicate rows
df = pd.DataFrame({
‘A’: [‘foo’, ‘bar’, ‘foo’, ‘bar’],
‘B’: [‘one’, ‘one’, ‘two’, ‘three’],
‘C’: [1, 2, 3, 4]
})

# Removing duplicates based on columns A and B
df_deduplicated = df.drop_duplicates(subset=[‘A’, ‘B’])

print(df_deduplicated)

Example 40: Converting Between Time Zones

Convert datetime objects from one time zone to another.

# Sample DataFrame with datetime in UTC
df = pd.DataFrame({
‘DateTimeUTC’: pd.date_range(‘2024-01-01′, periods=3, freq=’D’, tz=’UTC’)
})

# Converting from UTC to Eastern Time
df[‘DateTimeEastern’] = df[‘DateTimeUTC’].dt.tz_convert(‘US/Eastern’)

print(df)

Example 41: Splitting a String Column into Multiple Columns

Sometimes, data comes in a single string column that you want to split into multiple columns based on a delimiter.

import pandas as pd

df = pd.DataFrame({
‘Name_Score’: [‘John Doe_90’, ‘Jane Smith_85’, ‘Alex Lee_88’]
})

# Splitting the ‘Name_Score’ column into ‘Name’ and ‘Score’ columns
df[[‘Name’, ‘Score’]] = df[‘Name_Score’].str.split(‘_’, expand=True)
df[‘Score’] = pd.to_numeric(df[‘Score’])

print(df)

Example 42: Converting Strings to Title Case

When dealing with text data, you might need to standardize the case formatting.

df[‘Name’] = df[‘Name’].str.title()

print(df)

Example 43: Removing Leading and Trailing Characters from String Columns

Cleaning up strings in DataFrame columns by removing leading and trailing spaces or specific characters.

df[‘Name’] = df[‘Name’].str.strip()

# Assuming another column ‘Remarks’ with extra spaces
df[‘Remarks’] = [‘ Excellent ‘, ‘ Good ‘, ‘ Average ‘]
df[‘Remarks’] = df[‘Remarks’].str.strip()

print(df)

Example 44: Replacing Missing Values with the Median

For numerical columns, replacing missing values with the median can sometimes be more appropriate than the mean.

df = pd.DataFrame({
‘Scores’: [88, 92, np.nan, 85, 90, np.nan, 87]
})

df[‘Scores’].fillna(df[‘Scores’].median(), inplace=True)

print(df)

Example 45: Converting Boolean Columns to Integer

You might need to convert boolean values to integers (1 and 0) for analysis or machine learning models.

df = pd.DataFrame({
‘Passed’: [True, False, True, False]
})

df[‘Passed’] = df[‘Passed’].astype(int)

print(df)

Example 46: Extracting Date Components

Extracting year, month, and day into separate columns can be useful for further analysis or feature engineering.

df = pd.DataFrame({
‘DateTime’: pd.to_datetime([‘2024-01-01’, ‘2024-04-15’, ‘2024-07-20’])
})

df[‘Year’] = df[‘DateTime’].dt.year
df[‘Month’] = df[‘DateTime’].dt.month
df[‘Day’] = df[‘DateTime’].dt.day

print(df)

Example 47: Binning Numerical Data

Creating categorical bins for numerical data can simplify analysis and visualization.

df = pd.DataFrame({
‘Age’: [22, 25, 19, 35, 29, 40]
})

df[‘Age Group’] = pd.cut(df[‘Age’], bins=[0, 20, 30, 40], labels=[‘0-20′, ’21-30′, ’31-40’])

print(df)

Example 48: Dropping Rows with Any or All Columns Empty

Sometimes, you’ll want to remove rows that are entirely empty or have any missing values.

df = pd.DataFrame({
‘A’: [1, np.nan, 3],
‘B’: [4, np.nan, np.nan]
})

df.dropna(how=’all’, inplace=True) # Drops rows where all columns are NaN
df.dropna(how=’any’, inplace=True) # Drops rows where any column is NaN

print(df)

Example 49: Mapping Values to Categories

Map numerical or string values to categories for readability or analysis.

df = pd.DataFrame({
‘Rating’: [1, 2, 3, 4, 5]
})

rating_map = {1: ‘Poor’, 2: ‘Fair’, 3: ‘Good’, 4: ‘Very Good’, 5: ‘Excellent’}
df[‘Rating Category’] = df[‘Rating’].map(rating_map)

print(df)

Example 50: Aggregating Data After Grouping

After grouping data, you might want to perform multiple aggregation operations for deeper insights.

df = pd.DataFrame({
‘Department’: [‘Sales’, ‘HR’, ‘IT’, ‘Sales’, ‘HR’, ‘IT’],
‘Employee’: [‘John’, ‘Doe’, ‘Jane’, ‘Smith’, ‘Alex’, ‘Lee’],
‘Salary’: [70000, 60000, 80000, 75000, 58000, 90000]
})

aggregated = df.groupby(‘Department’)[‘Salary’].agg([‘mean’, ‘min’, ‘max’])

print(aggregated)

Example 51: Converting Category Types for Efficiency

For large datasets with categorical data, converting columns to category type can save memory and speed up operations.

import pandas as pd

df = pd.DataFrame({
‘Product’: [‘Apple’, ‘Banana’, ‘Carrot’, ‘Apple’, ‘Banana’, ‘Carrot’]
})

df[‘Product’] = df[‘Product’].astype(‘category’)

print(df.dtypes)

Example 52: Applying Conditions to Subset Data

Applying conditions to subset data is essential for focusing analysis on relevant segments.

df = pd.DataFrame({
‘Sales’: [150, 200, 250, 300, 350, 400],
‘Region’: [‘East’, ‘West’, ‘East’, ‘West’, ‘East’, ‘West’]
})

# Get data where Sales are greater than 300
high_sales_df = df[df[‘Sales’] > 300]

print(high_sales_df)

Example 53: Correcting Typos in Data Entries

Correcting typos or inconsistent entries is a common need in text data cleaning.

df = pd.DataFrame({
‘Name’: [‘John’, ‘Jahn’, ‘Jane’, ‘Jone’, ‘John’]
})

# Correcting typos in the ‘Name’ column
corrections = {
‘Jahn’: ‘John’,
‘Jone’: ‘Jane’
}
df[‘Name’] = df[‘Name’].replace(corrections)

print(df)

Example 54: Standardizing Date Formats

Handling different date formats to ensure consistency in time series data.

df = pd.DataFrame({
‘Date’: [’01-01-2024′, ‘2024/02/01’, ‘March 3, 2024’]
})

# Standardizing date formats
df[‘Date’] = pd.to_datetime(df[‘Date’])

print(df)

Example 56: Aggregating and Transforming Data Simultaneously

Use transform() to apply an aggregation function but keep the original DataFrame’s shape.

df = pd.DataFrame({
‘Group’: [‘A’, ‘A’, ‘B’, ‘B’, ‘C’, ‘C’],
‘Data’: [100, 150, 200, 250, 300, 350]
})

df[‘Average’] = df.groupby(‘Group’)[‘Data’].transform(‘mean’)

print(df)

Example 57: Handling Time Zones in DateTimes

Converting DateTime objects to handle time zones properly.

df = pd.DataFrame({
‘DateTime’: pd.to_datetime([‘2024-01-01T12:00’, ‘2024-02-01T15:00’, ‘2024-03-01T18:00’], utc=True)
})

# Converting UTC to Eastern Time
df[‘DateTime’] = df[‘DateTime’].dt.tz_convert(‘US/Eastern’)

print(df)

Example 58: Detecting and Replacing Outliers with Quantiles

Detecting outliers and replacing them can prevent skewed analyses.

df = pd.DataFrame({
‘Values’: [1, 2, 2, 3, 4, 10, 100]
})

q_low = df[‘Values’].quantile(0.01)
q_hi = df[‘Values’].quantile(0.99)

df[‘Values’] = df[‘Values’].apply(lambda x: q_hi if x > q_hi else (q_low if x < q_low else x))

print(df)

Example 59: Pivoting Data for Better Visualization

Pivoting tables to transform data for better analysis and visualization.

df = pd.DataFrame({
‘Date’: [‘2024-01-01’, ‘2024-01-01’, ‘2024-01-02’],
‘Category’: [‘Fruits’, ‘Vegetables’, ‘Fruits’],
‘Sales’: [100, 150, 200]
})

pivot_df = df.pivot_table(values=’Sales’, index=’Date’, columns=’Category’, fill_value=0)

print(pivot_df)

Example 60: Collapsing Multiple Columns into One

Consolidating multiple columns into one for a streamlined dataset.

df = pd.DataFrame({
‘R1’: [1, 0, 1],
‘R2’: [0, 1, 0],
‘R3’: [1, 1, 0]
})

# Collapsing multiple columns into a single column ‘Responses’
df[‘Responses’] = df.apply(lambda row: ”.join(row.values.astype(str)), axis=1)

print(df)

Example 61: Encoding Categorical Variables with Label Encoding

Label encoding is useful for converting text labels into a numeric form so they can be used in machine learning algorithms.

from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
‘Category’: [‘red’, ‘blue’, ‘green’, ‘blue’, ‘red’]
})

encoder = LabelEncoder()
df[‘Category_Encoded’] = encoder.fit_transform(df[‘Category’])

print(df)

Example 62: Generating Dummy Variables

Creating dummy variables from categorical data is a common step for statistical modeling and machine learning.

df = pd.DataFrame({
‘Color’: [‘Red’, ‘Blue’, ‘Green’, ‘Blue’, ‘Red’]
})

dummies = pd.get_dummies(df[‘Color’], prefix=’Color’)

df = pd.concat([df, dummies], axis=1)

print(df)

Example 63: Converting Continuous Data into Categorical Data

Sometimes, it’s useful to bin continuous data into discrete intervals.

df = pd.DataFrame({
‘Temperature’: [72, 85, 90, 100, 65]
})

df[‘Temp_Category’] = pd.cut(df[‘Temperature’], bins=[60, 70, 80, 90, 100], labels=[“Cool”, “Mild”, “Warm”, “Hot”])

print(df)

Example 64: Combining Multiple DataFrames

Merging multiple dataframes is crucial when combining different data sources.

df1 = pd.DataFrame({
‘A’: [‘A0’, ‘A1’, ‘A2’, ‘A3’],
‘B’: [‘B0’, ‘B1’, ‘B2’, ‘B3’]
})

df2 = pd.DataFrame({
‘A’: [‘A4’, ‘A5’, ‘A6’, ‘A7’],
‘B’: [‘B4’, ‘B5’, ‘B6’, ‘B7’]
})

df_combined = pd.concat([df1, df2], ignore_index=True)

print(df_combined)

Example 65: Handling Missing Data in Time Series

Time series data often requires special handling of gaps and missing data.

ts_df = pd.DataFrame({
‘Dates’: pd.date_range(‘20240101’, periods=6),
‘Values’: [1, np.nan, np.nan, 4, 5, 6]
})

ts_df[‘Values’].fillna(method=’ffill’, inplace=True) # Forward fill

print(ts_df)

Example 66: Cleaning Text Data Using Regular Expressions

Regular expressions are powerful for text manipulations, such as removing unwanted characters.

df = pd.DataFrame({
‘Text’: [‘Example#1’, ‘Data@Science’, ‘100%True’]
})

df[‘Text’] = df[‘Text’].str.replace(r'[\W_]+’, ”, regex=True)

print(df)

Example 67: Correcting Index after Merging or Other Operations

Resetting the DataFrame index ensures that the indexing remains consistent after various transformations.

df = pd.concat([df1, df2]).reset_index(drop=True)

print(df)

Let’s dive into more data cleaning examples using Pandas to further enhance your ability to handle various types of data preparation challenges. These examples can be useful in broadening the scope of your IT training programs, providing learners with practical skills for real-world data scenarios.

Example 61: Encoding Categorical Variables with Label Encoding

Label encoding is useful for converting text labels into a numeric form so they can be used in machine learning algorithms.

python

from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
‘Category’: [‘red’, ‘blue’, ‘green’, ‘blue’, ‘red’]
})

encoder = LabelEncoder()
df[‘Category_Encoded’] = encoder.fit_transform(df[‘Category’])

print(df)

Example 62: Generating Dummy Variables

Creating dummy variables from categorical data is a common step for statistical modeling and machine learning.

python

df = pd.DataFrame({
'Color': ['Red', 'Blue', 'Green', 'Blue', 'Red']
})

dummies = pd.get_dummies(df[‘Color’], prefix=‘Color’)

df = pd.concat([df, dummies], axis=1)

print(df)

Example 63: Converting Continuous Data into Categorical Data

Sometimes, it’s useful to bin continuous data into discrete intervals.

python

df = pd.DataFrame({
'Temperature': [72, 85, 90, 100, 65]
})

df[‘Temp_Category’] = pd.cut(df[‘Temperature’], bins=[60, 70, 80, 90, 100], labels=[“Cool”, “Mild”, “Warm”, “Hot”])

print(df)

Example 64: Combining Multiple DataFrames

Merging multiple dataframes is crucial when combining different data sources.

python

df1 = pd.DataFrame({
'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3']
})

df2 = pd.DataFrame({
‘A’: [‘A4’, ‘A5’, ‘A6’, ‘A7’],
‘B’: [‘B4’, ‘B5’, ‘B6’, ‘B7’]
})

df_combined = pd.concat([df1, df2], ignore_index=True)

print(df_combined)

Example 65: Handling Missing Data in Time Series

Time series data often requires special handling of gaps and missing data.

python

ts_df = pd.DataFrame({
'Dates': pd.date_range('20240101', periods=6),
'Values': [1, np.nan, np.nan, 4, 5, 6]
})

ts_df[‘Values’].fillna(method=‘ffill’, inplace=True) # Forward fill

print(ts_df)

Example 66: Cleaning Text Data Using Regular Expressions

Regular expressions are powerful for text manipulations, such as removing unwanted characters.

python

df = pd.DataFrame({
'Text': ['Example#1', 'Data@Science', '100%True']
})

df[‘Text’] = df[‘Text’].str.replace(r'[\W_]+’, , regex=True)

print(df)

Example 67: Correcting Index after Merging or Other Operations

Resetting the DataFrame index ensures that the indexing remains consistent after various transformations.

python

df = pd.concat([df1, df2]).reset_index(drop=True)

print(df)

Example 68: Aggregating Data Using Custom Functions

Custom aggregation functions can be applied during grouping for more specific calculations.

df = pd.DataFrame({
‘Group’: [‘X’, ‘X’, ‘Y’, ‘Y’],
‘Data’: [1, 2, 3, 4]
})

def custom_agg(x):
return x.max() – x.min()

grouped = df.groupby(‘Group’)[‘Data’].agg(custom_agg)

print(grouped)

Example 69: Filtering Data Based on Statistical Metrics

Using statistical measures like standard deviation can help in identifying and filtering outliers.

df = pd.DataFrame({
‘Data’: [1, 2, 2, 3, 4, 10, 100]
})

mean_val = df[‘Data’].mean()
std_dev = df[‘Data’].std()

filtered_df = df[(df[‘Data’] > (mean_val – 2 * std_dev)) & (df[‘Data’] < (mean_val + 2 * std_dev))]

print(filtered_df)

Example 70: Normalizing Data for Machine Learning

Normalization is a common preprocessing step for many machine learning algorithms.

from sklearn.preprocessing import MinMaxScaler

data = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]
scaler = MinMaxScaler()

normalized_data = scaler.fit_transform(data)

normalized_df = pd.DataFrame(normalized_data, columns=[‘X’, ‘Y’])

print(normalized_df)

Example 71: Removing Rows with Outliers Using IQR

This method involves calculating the interquartile range (IQR) to identify outliers and then filtering them out.

import pandas as pd

df = pd.DataFrame({
‘Values’: [10, 12, 12, 13, 12, 11, 14, 13, 15, 102, 12, 14, 14, 17, 11, 13, 12, 10, 3, 12, 14, 13, 12, 15, 10, 15, 12, 10, 14, 13, 15]
})

Q1 = df[‘Values’].quantile(0.25)
Q3 = df[‘Values’].quantile(0.75)
IQR = Q3 – Q1

df_cleaned = df[~((df[‘Values’] < (Q1 – 1.5 * IQR)) | (df[‘Values’] > (Q3 + 1.5 * IQR)))]

print(df_cleaned)

import pandas as pd

df = pd.DataFrame({
‘Values’: [10, 12, 12, 13, 12, 11, 14, 13, 15, 102, 12, 14, 14, 17, 11, 13, 12, 10, 3, 12, 14, 13, 12, 15, 10, 15, 12, 10, 14, 13, 15]
})

Q1 = df[‘Values’].quantile(0.25)
Q3 = df[‘Values’].quantile(0.75)
IQR = Q3 – Q1

df_cleaned = df[~((df[‘Values’] < (Q1 – 1.5 * IQR)) | (df[‘Values’] > (Q3 + 1.5 * IQR)))]

print(df_cleaned)

Example 72: Converting a Series of Lists into a DataFrame

If you have a series where each entry is a list, you can convert this into a DataFrame for easier manipulation.

data = pd.Series([[‘red’, ‘blue’], [‘green’, ‘yellow’], [‘white’]])
df = pd.DataFrame(data.tolist(), columns=[‘Color1’, ‘Color2’])

print(df)

Example 73: Dealing with Missing Time Data in Time Series

For time series data with missing timestamps, you can reindex the series to fill in the missing periods.

ts_index = pd.date_range(‘2024-01-01′, periods=6, freq=’D’)
ts_data = pd.Series([1, None, None, 2, 3, 4], index=ts_index)

# Reindexing with a complete date range
complete_index = pd.date_range(‘2024-01-01’, ‘2024-01-06’)
ts_data = ts_data.reindex(complete_index)

print(ts_data)

Example 74: Concatenating DataFrames with Different Columns

Concatenate two DataFrames that have different sets of columns.

df1 = pd.DataFrame({‘A’: [1, 2], ‘B’: [3, 4]})
df2 = pd.DataFrame({‘A’: [5, 6], ‘C’: [7, 8]})

df_combined = pd.concat([df1, df2]).fillna(0)

print(df_combined)

Example 75: Dropping Columns with a High Percentage of Missing Values

Drop columns from a DataFrame that have a high percentage of missing values.

df = pd.DataFrame({
‘A’: [1, 2, np.nan, 4],
‘B’: [np.nan, np.nan, np.nan, np.nan],
‘C’: [1, 2, 3, 4]
})

threshold = 0.75 # 75% missing values
df = df.dropna(thresh=len(df) * (1 – threshold), axis=1)

print(df)

Example 76: Mapping External Data to a DataFrame

Use a mapping from external data to add a new column to a DataFrame.

mapping = {1: ‘apple’, 2: ‘banana’, 3: ‘cherry’}
df = pd.DataFrame({‘Numbers’: [1, 2, 1, 3]})

df[‘Fruits’] = df[‘Numbers’].map(mapping)

print(df)

Example 77: Calculating Rolling Averages in Time Series

Calculate rolling averages to smooth out time series data.

ts_df = pd.DataFrame({
‘Dates’: pd.date_range(start=’2024-01-01′, periods=5, freq=’D’),
‘Values’: [1, 2, 3, 4, 5]
})

ts_df[‘Rolling_Mean’] = ts_df[‘Values’].rolling(window=3).mean()

print(ts_df)

Example 78: Filling Missing Values with Predictive Imputation

Predict missing values based on other columns, using a simple linear model for demonstration.

from sklearn.linear_model import LinearRegression

df = pd.DataFrame({
‘A’: [1, 2, 3, 4, 5],
‘B’: [2, np.nan, 6, 8, 10]
})

model = LinearRegression()
X = df.loc[df[‘B’].notna(), [‘A’]]
y = df.loc[df[‘B’].notna(), ‘B’]
model.fit(X, y)

# Predicting the missing ‘B’ values
df.loc[df[‘B’].isna(), ‘B’] = model.predict(df.loc[df[‘B’].isna(), [‘A’]])

print(df)

Example 79: Unstacking Data for Wider Format

Convert index levels to columns to create a wider format DataFrame from a multi-index series.

df = pd.DataFrame({
‘Year’: [2024, 2024, 2025, 2025],
‘Month’: [1, 2, 1, 2],
‘Value’: [100, 200, 300, 400]
})

df.set_index([‘Year’, ‘Month’]).unstack().fillna(0)

print(df)

Example 80: Cleaning and Extracting Numeric Data from Strings

Extract numeric data embedded in string data, which is common in real-world datasets.

df = pd.DataFrame({
‘Raw_Data’: [‘123 kg’, ‘456 lbs’, ‘789 kg’, ‘no data’]
})

df[‘Weight’] = df[‘Raw_Data’].str.extract(‘(\d+)’).astype(float).fillna(0)

Example 81: Standardizing Numeric Data with Z-Scores

Standardizing data involves scaling the features so they have the properties of a standard normal distribution with a mean of zero and a standard deviation of one.

import pandas as pd
from scipy.stats import zscore

df = pd.DataFrame({
‘Scores’: [90, 82, 70, 98, 95, 85]
})

df[‘Z_Scores’] = zscore(df[‘Scores’])

print(df)

Example 82: Stripping Whitespace from All String Columns

Automatically trim whitespace from all string columns in a DataFrame, which is useful for cleaning up textual data.

df = pd.DataFrame({
‘Name’: [‘ Alice ‘, ‘Bob ‘, ‘ Charlie’],
‘City’: [‘ New York’, ‘Los Angeles ‘, ‘San Francisco ‘]
})

df = df.applymap(lambda x: x.strip() if isinstance(x, str) else x)

print(df)

Example 83: Merging DataFrames on Multiple Keys

Merge two DataFrames based on more than one column to ensure precise data integration.

df1 = pd.DataFrame({
‘First Name’: [‘John’, ‘Jane’, ‘Alice’],
‘Last Name’: [‘Doe’, ‘Doe’, ‘Smith’],
‘Age’: [28, 34, 24]
})

df2 = pd.DataFrame({
‘First Name’: [‘John’, ‘Jane’, ‘Alice’],
‘Last Name’: [‘Doe’, ‘Doe’, ‘Smith’],
‘Profession’: [‘Engineer’, ‘Doctor’, ‘Artist’]
})

df_merged = pd.merge(df1, df2, on=[‘First Name’, ‘Last Name’])

print(df_merged)

Example 84: Converting Data Types in Bulk

Convert multiple columns to different data types efficiently, which is common after data importation.

df = pd.DataFrame({
‘A’: [‘1’, ‘2’, ‘3’],
‘B’: [‘4’, ‘5’, ‘6’],
‘C’: [‘7.1’, ‘8.2’, ‘9.3’]
})

convert_dict = {‘A’: int, ‘B’: float, ‘C’: float}
df = df.astype(convert_dict)

print(df.dtypes)

Example 85: Filtering Data Using Regular Expressions

Use regular expressions to filter rows based on complex patterns in textual data.

df = pd.DataFrame({
‘Email’: [‘john.doe@example.com’, ‘jane.smith@foo.org’, ‘alice@example.com’]
})

# Filter emails not containing ‘example’
df_filtered = df[df[‘Email’].str.contains(‘example’, regex=True)]

print(df_filtered)

Example 86: Filling Missing Values in a Categorical Series

Fill missing values in categorical data using the mode (the most frequent value).

df = pd.DataFrame({
‘Favorite Color’: [‘Blue’, ‘Red’, ‘Green’, ‘Blue’, None]
})

df[‘Favorite Color’].fillna(df[‘Favorite Color’].mode()[0], inplace=True)

print(df)

Example 87: Creating Indicator Variables for Missing Data

Create binary indicator variables that denote whether a value was missing in the original dataset.

df = pd.DataFrame({
‘Income’: [55000, 48000, None, 53000]
})

df[‘Income_Missing’] = df[‘Income’].isnull().astype(int)

print(df)

Example 88: Splitting and Expanding a DataFrame from One Column

Split a single string column into multiple columns and expand the DataFrame accordingly.

df = pd.DataFrame({
‘Data’: [‘John|28|Engineer’, ‘Jane|34|Doctor’, ‘Alice|24|Artist’]
})

df[[‘Name’, ‘Age’, ‘Profession’]] = df[‘Data’].str.split(‘|’, expand=True)

print(df)

Example 89: Aggregating with Custom Functions After Filtering

Use custom functions to perform complex aggregations on filtered data.

df = pd.DataFrame({
‘Product’: [‘Apples’, ‘Oranges’, ‘Bananas’, ‘Apples’, ‘Oranges’],
‘Sales’: [100, 150, 200, 90, 140]
})

# Custom aggregation after filtering
def custom_aggregation(series):
return series.max() – series.min()

result = df[df[‘Product’] == ‘Apples’][‘Sales’].agg(custom_aggregation)

print(“Sales difference for Apples:”, result)

Example 90: Rotating Data from Wide to Long Format

Rotate a DataFrame from a wide format to a long format, often needed for time series and panel data analysis.

df = pd.DataFrame({
‘ID’: [1, 2],
‘2024_Sales’: [100, 200],
‘2025_Sales’: [150, 250]
})

df_long = pd.wide_to_long(df, stubnames=’Sales’, i=’ID’, j=’Year’, sep=’_’)

print(df_long)

 

Welcome to Coding Filly, your go-to destination for all things tech! We are a passionate team of tech enthusiasts dedicated to providing insightful and inspiring content to empower individuals in the world of technology.

Subscribe

Subscribe my Newsletter for new blog posts, tips & new photos. Let's stay updated!

Cooding Filly – All Right Reserved. Designed and Developed by Filly Coder

Are you sure want to unlock this post?
Unlock left : 0
Are you sure want to cancel subscription?
-
00:00
00:00
Update Required Flash plugin
-
00:00
00:00