Introduction to Pandas
Pandas is an open-source Python library providing high-performance, easy-to-use data structures, and data analysis tools. It’s an essential tool for data scientists and analysts working in Python.
Installation
First, ensure that Python and pip (Python’s package installer) are installed on your system. You can install Pandas using pip:
pip install pandas
Basic Concepts
Data Structures: Series and DataFrame
- Series: A one-dimensional labeled array capable of holding any data type.
- DataFrame: A two-dimensional labeled data structure with columns of potentially different types.
Importing Pandas
Before you can use Pandas, you need to import it. The library is typically imported using the alias pd
:
import pandas as pd
Creating DataFrames
You can create a DataFrame from various data sources, such as a list, dictionary, or even a CSV file.
# Creating a DataFrame from a dictionary
data = {
‘Name’: [‘John’, ‘Anna’, ‘Peter’, ‘Linda’],
‘Age’: [28, 34, 29, 32],
‘City’: [‘New York’, ‘Paris’, ‘Berlin’, ‘London’]
}
df = pd.DataFrame(data)
print(df)
Reading Data
Pandas supports reading data from many sources including CSV files, Excel files, SQL databases, and more.
# Reading data from a CSV file
df = pd.read_csv(‘path/to/your/file.csv’)
Data Manipulation
Viewing Data
df.head(n)
– Displays the first n rows of the DataFrame.df.tail(n)
– Displays the last n rows of the DataFrame.
Selecting Data
- Select a column:
df['column_name']
- Select rows by position:
df.iloc[position]
- Select rows by index:
df.loc[index]
Filtering Data
You can use conditions to filter rows.
# Filtering data
filtered_df = df[df[‘Age’] > 30]
Adding and Dropping Columns
- Add a new column:
df['new_column'] = value
- Drop a column:
df.drop('column_name', axis=1, inplace=True)
Basic Data Analysis
Pandas provides several methods to perform basic data analysis:
df.describe()
– Provides a quick overview of the numerical data in your DataFrame.df.mean()
– Calculates the mean of each numerical column.df.corr()
– Calculates the correlation between columns.
Visualizing Data
Pandas can also be used in conjunction with libraries like Matplotlib and Seaborn for data visualization.
import matplotlib.pyplot as plt
# Plotting a histogram
df[‘Age’].plot(kind=’hist’)
plt.show()
Example 1: Working with Time Series Data
Time series data is common in many fields, such as finance, economics, and even in web analytics. Here’s how you can handle time series data in Pandas:
import pandas as pd
# Creating a time series dataset
dates = pd.date_range(‘20230101’, periods=6)
df = pd.DataFrame(data = range(6), index=dates, columns=[‘Value’])
print(df)
# Resampling – aggregating data by month
monthly_resampled_data = df.resample(‘M’).mean()
print(monthly_resampled_data)
This example demonstrates creating a time series data set, indexing it with dates, and then resampling it by month to aggregate the data.
Example 2: Merging DataFrames
Merging data is a common operation when dealing with relational data. Here’s how you can merge two DataFrames:
import pandas as pd
# Creating two DataFrames
df1 = pd.DataFrame({
‘EmployeeID’: [‘1’, ‘2’, ‘3’],
‘Name’: [‘John Doe’, ‘Jane Smith’, ‘Emily Davis’]
})
df2 = pd.DataFrame({
‘EmployeeID’: [‘1’, ‘2’, ‘4’],
‘Position’: [‘Data Scientist’, ‘Software Engineer’, ‘Product Manager’]
})
# Merging DataFrames
merged_df = pd.merge(df1, df2, on=’EmployeeID’)
print(merged_df)
This example merges two DataFrames on the EmployeeID
column, which is common to both DataFrames, showcasing a simple inner join operation.
Example 3: Handling Missing Data
Handling missing data is crucial in data preprocessing. Pandas provides convenient methods to deal with it:
import pandas as pd
import numpy as np
# Creating a DataFrame with missing values
df = pd.DataFrame({
‘A’: [1, 2, np.nan],
‘B’: [5, np.nan, np.nan],
‘C’: [1, 2, 3]
})
# Filling missing values with a placeholder
df.fillna(value=0, inplace=True)
print(df)
# Dropping any rows with missing values
# df.dropna(inplace=True)
This snippet demonstrates how to handle missing data by either filling NaN values with a placeholder or dropping rows with missing values.
Example 4: Grouping and Aggregating Data
Grouping data based on certain criteria and then applying aggregation functions is a powerful data analysis technique.
import pandas as pd
# Creating a DataFrame
data = {
‘Company’: [‘Google’, ‘Google’, ‘Microsoft’, ‘Microsoft’, ‘Facebook’, ‘Facebook’],
‘Employee’: [‘Sam’, ‘Charlie’, ‘Amy’, ‘Vanessa’, ‘Carl’, ‘Sarah’],
‘Sales’: [200, 120, 340, 124, 243, 350]
}
df = pd.DataFrame(data)
# Grouping by company and calculating total sales
grouped = df.groupby(‘Company’).sum()
print(grouped)
Example 5: Advanced Grouping with Aggregations
Pandas allows for sophisticated grouping operations, combining them with various aggregations to summarize data effectively.
import pandas as pd
# Sample dataset
data = {
‘Department’: [‘Sales’, ‘Sales’, ‘HR’, ‘HR’, ‘IT’, ‘IT’],
‘Employee’: [‘John’, ‘Jane’, ‘Mike’, ‘Sally’, ‘Emma’, ‘Tom’],
‘Salary’: [70000, 80000, 50000, 60000, 75000, 82000],
‘Join Date’: [‘2019-01-23’, ‘2020-02-10’, ‘2018-05-14’, ‘2019-07-02’, ‘2020-11-29’, ‘2021-01-30’]
}
df = pd.DataFrame(data)
# Converting ‘Join Date’ to datetime
df[‘Join Date’] = pd.to_datetime(df[‘Join Date’])
# Grouping by Department and applying multiple aggregation functions to Salary
aggregated_df = df.groupby(‘Department’)[‘Salary’].agg([‘mean’, ‘min’, ‘max’])
print(aggregated_df)
This example demonstrates grouping the DataFrame by the Department
column and then applying multiple aggregation functions to the Salary
column to find the mean, minimum, and maximum salaries within each department.
Example 6: Filtering after Grouping
Filtering groups based on certain criteria after you’ve grouped data can be very useful for narrowing down your data analysis to specific conditions.
high_pay = aggregated_df[aggregated_df[‘mean’] > 65000]
print(high_pay)
This snippet filters the previously aggregated DataFrame to only show departments where the average salary is greater than $65,000.
Example 7: Grouping and Filtering with Complex Conditions
Pandas allows for complex grouping and filtering operations, including using custom functions to apply more intricate conditions.
# Group by Department and filter out groups with less than 2 employees
filtered_df = df.groupby(‘Department’).filter(lambda x: len(x) > 1)
print(filtered_df)
Here, the data is grouped by Department
, and only those departments with more than one employee are retained. This kind of operation is useful for filtering data based on the size of the group.
Example 8: Grouping by Multiple Columns
You can also group data by multiple columns, which allows for a more granular analysis of the dataset.
# Adding a ‘Year’ column for the year of the join date
df[‘Year’] = df[‘Join Date’].dt.year
# Grouping by both Department and Year
grouped_multiple = df.groupby([‘Department’, ‘Year’]).mean()
print(grouped_multiple)
This example groups the data first by Department
and then by Year
(extracted from the Join Date
column), calculating the average salary for each group. It demonstrates how to perform a more detailed analysis by grouping data based on multiple criteria
Example 9: Grouping with Custom Aggregate Functions
Pandas allows you to define custom aggregate functions, which can be applied to grouped data for more specific analysis.
import pandas as pd
# Sample dataset
data = {
‘Category’: [‘A’, ‘A’, ‘B’, ‘B’, ‘C’, ‘C’],
‘Values’: [100, 150, 200, 250, 300, 350]
}
df = pd.DataFrame(data)
# Custom aggregation function to compute the range (max – min)
def range_func(group):
return group.max() – group.min()
# Applying the custom function to groups
range_df = df.groupby(‘Category’)[‘Values’].agg(range_func).reset_index(name=’Range’)
print(range_df)
This example demonstrates how to apply a custom function to grouped data, calculating the range of values within each category.
Example 10: Conditional Filtering with query()
The query()
method provides a more readable way to perform conditional filtering, especially useful for complex conditions.
import pandas as pd
import numpy as np
# Sample DataFrame
df = pd.DataFrame({
‘Name’: [‘John’, ‘Anna’, ‘Peter’, ‘Linda’],
‘Age’: [28, 34, 29, 32],
‘Salary’: [50000, 60000, np.nan, 52000]
})
# Using query() to filter data
filtered_df = df.query(‘Age > 30 & Salary > 50000’)
print(filtered_df)
This snippet filters the DataFrame for individuals over 30 years of age with a salary greater than 50,000, showcasing how query()
can simplify complex filtering logic.
Example 11: Complex Grouping and Sorting
Combining grouping with sorting can reveal patterns or trends within specific segments of the data.
This example calculates the average values for each category, then sorts the results in descending order to quickly identify the categories with the highest average values.# Grouping by ‘Category’ and calculating average values, then sorting by ‘Values’
grouped_sorted = df.groupby(‘Category’)[‘Values’].mean().sort_values(ascending=False).reset_index()
print(grouped_sorted)
Example 12: Multi-level Indexing in Grouping
When grouping by multiple columns, you can work with multi-level indices, which allows for more detailed analysis at different levels of granularity.
# Sample data including a ‘Subcategory’ column
data = {
‘Category’: [‘A’, ‘A’, ‘A’, ‘B’, ‘B’, ‘B’],
‘Subcategory’: [‘X’, ‘X’, ‘Y’, ‘X’, ‘Y’, ‘Y’],
‘Values’: [10, 20, 15, 25, 30, 35]
}
df = pd.DataFrame(data)
# Grouping by multiple columns
multi_level_df = df.groupby([‘Category’, ‘Subcategory’]).sum()
print(multi_level_df)
# Accessing a specific group
print(multi_level_df.loc[(‘A’, ‘X’)])
This example demonstrates multi-level grouping and how to access data within a specific group, offering a nuanced approach to analyzing hierarchical data.
Example 13: Handling Missing Data
Pandas provides multiple options for handling missing data, including filling missing values with a specific value, or interpolating the values based on other data points.
import pandas as pd
import numpy as np
# Creating a DataFrame with missing values
df = pd.DataFrame({
‘A’: [1, 2, np.nan, 4, 5],
‘B’: [5, np.nan, np.nan, 8, 9],
‘C’: [10, 11, 12, np.nan, 14]
})
# Filling missing values with a specific value
df_filled = df.fillna(value=0)
# Interpolating missing values
df_interpolated = df.interpolate()
print(“Filled missing values:\n”, df_filled)
print(“\nInterpolated missing values:\n”, df_interpolated)
Example 14: Removing Duplicates
Removing duplicate records is essential to ensure the integrity of your data analysis. Pandas makes it easy to remove duplicates.
# Sample DataFrame with duplicate rows
df = pd.DataFrame({
‘A’: [‘foo’, ‘bar’, ‘foo’, ‘bar’, ‘foo’, ‘bar’, ‘foo’, ‘foo’],
‘B’: [‘one’, ‘one’, ‘two’, ‘three’, ‘two’, ‘two’, ‘one’, ‘three’],
‘C’: [1, 2, 3, 4, 5, 6, 7, 8],
‘D’: [9, 10, 11, 12, 13, 14, 15, 16]
})
# Removing duplicate rows
df_no_duplicates = df.drop_duplicates()
print(df_no_duplicates)
Example 15: Replacing Values
In some cases, you might need to replace values in your dataset, either due to errors or to standardize the data.
# Replacing values in a DataFrame
df = pd.DataFrame({
‘A’: [1, 2, 3, 4, 5],
‘B’: [5, 6, 7, 8, 9],
‘C’: [‘one’, ‘two’, ‘three’, ‘four’, ‘five’]
})
# Replacing a value
df_replaced = df.replace(5, 500)
print(df_replaced)
Data cleaning is a critical step in the data analysis process, ensuring the accuracy and completeness of the data before any analysis is performed. Below are examples that cover various data cleaning tasks using Pandas, suitable for enhancing the curriculum of your IT training programs.
Example 13: Handling Missing Data
Pandas provides multiple options for handling missing data, including filling missing values with a specific value, or interpolating the values based on other data points.
import pandas as pd
import numpy as np
# Creating a DataFrame with missing values
df = pd.DataFrame({
‘A’: [1, 2, np.nan, 4, 5],
‘B’: [5, np.nan, np.nan, 8, 9],
‘C’: [10, 11, 12, np.nan, 14]
})
# Filling missing values with a specific value
df_filled = df.fillna(value=0)
# Interpolating missing values
df_interpolated = df.interpolate()
print(“Filled missing values:\n”, df_filled)
print(“\nInterpolated missing values:\n”, df_interpolated)
Example 14: Removing Duplicates
Removing duplicate records is essential to ensure the integrity of your data analysis. Pandas makes it easy to remove duplicates.
# Sample DataFrame with duplicate rows
df = pd.DataFrame({
‘A’: [‘foo’, ‘bar’, ‘foo’, ‘bar’, ‘foo’, ‘bar’, ‘foo’, ‘foo’],
‘B’: [‘one’, ‘one’, ‘two’, ‘three’, ‘two’, ‘two’, ‘one’, ‘three’],
‘C’: [1, 2, 3, 4, 5, 6, 7, 8],
‘D’: [9, 10, 11, 12, 13, 14, 15, 16]
})
# Removing duplicate rows
df_no_duplicates = df.drop_duplicates()
print(df_no_duplicates)
Example 15: Replacing Values
In some cases, you might need to replace values in your dataset, either due to errors or to standardize the data.
# Replacing values in a DataFrame
df = pd.DataFrame({
‘A’: [1, 2, 3, 4, 5],
‘B’: [5, 6, 7, 8, 9],
‘C’: [‘one’, ‘two’, ‘three’, ‘four’, ‘five’]
})
# Replacing a value
df_replaced = df.replace(5, 500)
print(df_replaced)
Example 16: Normalizing Data
Data normalization is a technique often applied as part of the data preprocessing step to standardize the ranges of independent variables or features of data.
# Example DataFrame
df = pd.DataFrame({
‘A’: [1, 2, 3],
‘B’: [100, 200, 300],
‘C’: [1000, 2000, 3000]
})
# Normalizing the DataFrame
df_normalized = (df – df.min()) / (df.max() – df.min())
print(df_normalized)
Example 17: Handling Outliers
Outliers can significantly skew your data analysis. Identifying and handling outliers is therefore crucial.
# Identifying outliers with IQR
df = pd.DataFrame({
‘Data’: [10, 12, 12, 13, 12, 11, 14, 13, 15, 102, 12, 14, 14, 17, 11, 13, 12, 10, 3, 12, 14, 13, 12, 15, 10, 15, 12, 10, 14, 13, 15]
})
Q1 = df[‘Data’].quantile(0.25)
Q3 = df[‘Data’].quantile(0.75)
IQR = Q3 – Q1
# Defining outliers
outliers = df[(df[‘Data’] < (Q1 – 1.5 * IQR)) | (df[‘Data’] > (Q3 + 1.5 * IQR))]
# Handling outliers
df_no_outliers = df[~((df[‘Data’] < (Q1 – 1.5 * IQR)) | (df[‘Data’] > (Q3 + 1.5 * IQR)))]
print(“Outliers:\n”, outliers)
print(“\nData without outliers:\n”, df_no_outliers)
Example 18: Converting Data Types
Sometimes, you’ll need to convert the data type of a column to perform certain operations or analyses correctly.
import pandas as pd
# Sample DataFrame
df = pd.DataFrame({
‘A’: [‘1’, ‘2’, ‘3’],
‘B’: [‘4.5’, ‘5.5’, ‘6.5’],
‘C’: [‘7’, ‘8’, ‘nine’]
})
# Converting columns to numeric, errors=’coerce’ will convert non-convertible values to NaN
df[‘A’] = pd.to_numeric(df[‘A’])
df[‘B’] = pd.to_numeric(df[‘B’])
df[‘C’] = pd.to_numeric(df[‘C’], errors=’coerce’)
print(df.dtypes)
print(df)
Example 19: Renaming Columns
Renaming columns can be necessary for clarity, consistency, or to match specific formatting requirements.
# Renaming columns for clarity or consistency
df.rename(columns={‘A’: ‘Column1’, ‘B’: ‘Column2’, ‘C’: ‘Column3’}, inplace=True)
print(df.columns)
Example 20: Dropping Columns and Rows
Dropping unnecessary columns and rows helps focus the analysis on relevant data.
# Dropping a column
df.drop(‘Column3’, axis=1, inplace=True)
# Dropping a row by index
df.drop(index=1, inplace=True)
print(df)
Example 19: Renaming Columns
Renaming columns can be necessary for clarity, consistency, or to match specific formatting requirements.
# Renaming columns for clarity or consistency
df.rename(columns={‘A’: ‘Column1’, ‘B’: ‘Column2’, ‘C’: ‘Column3’}, inplace=True)
print(df.columns)
Example 20: Dropping Columns and Rows
Dropping unnecessary columns and rows helps focus the analysis on relevant data.
# Dropping a column
df.drop(‘Column3’, axis=1, inplace=True)
# Dropping a row by index
df.drop(index=1, inplace=True)
print(df)
Example 21: Filling Missing Data with Forward/Backward Fill
In time series data, you might want to fill missing values by carrying forward the previous value or backfilling the next value.
# Sample DataFrame with missing values
df = pd.DataFrame({
‘Time’: pd.date_range(‘20240101′, periods=6, freq=’D’),
‘Value’: [1, np.nan, np.nan, 4, np.nan, 6]
})
# Forward fill
df[‘Forward Fill’] = df[‘Value’].fillna(method=’ffill’)
# Backward fill
df[‘Backward Fill’] = df[‘Value’].fillna(method=’bfill’)
print(df)
Example 22: Detecting and Filtering Outliers
Detecting and filtering outliers is often necessary to prevent skewed analysis.
# Using Z-score to detect outliers
from scipy import stats
import numpy as np
df = pd.DataFrame({
‘Data’: np.random.normal(0, 1, 100).tolist() + [10, -10, 12]
})
df[‘Z-score’] = np.abs(stats.zscore(df[‘Data’]))
# Filtering outliers, typically Z-score > 3 are considered outliers
df_no_outliers = df[df[‘Z-score’] < 3]
print(df_no_outliers)
Example 23: String Operations for Text Data
Cleaning text data often involves string operations, such as stripping whitespace, lowercasing, or finding and replacing substrings.
# Sample DataFrame with text data
df = pd.DataFrame({
‘Text’: [‘ Example text ‘, ‘ANOTHER TEXT’, ‘third-Text ‘, ‘FoUrtH_tExT’]
})
# Stripping whitespace
df[‘Text’] = df[‘Text’].str.strip()
# Lowercasing
df[‘Text’] = df[‘Text’].str.lower()
# Replacing dashes and underscores with spaces
df[‘Text’] = df[‘Text’].str.replace(‘-‘, ‘ ‘).str.replace(‘_’, ‘ ‘)
print(df)
Example 24: Splitting Columns
Often, dataset columns contain combined information that needs to be split into separate columns for better analysis.
import pandas as pd
# Sample DataFrame with combined information in one column
df = pd.DataFrame({
‘Name_Age’: [‘John_34’, ‘Doe_45’, ‘Jane_29’]
})
# Splitting the ‘Name_Age’ column into ‘Name’ and ‘Age’ columns
df[[‘Name’, ‘Age’]] = df[‘Name_Age’].str.split(‘_’, expand=True)
# Converting ‘Age’ to numeric type
df[‘Age’] = pd.to_numeric(df[‘Age’])
print(df)
Example 25: Aggregating Text Data
When dealing with text data, you might need to aggregate strings in a group, such as concatenating names or comments.
# Sample DataFrame with text data
df = pd.DataFrame({
‘Group’: [‘A’, ‘A’, ‘B’, ‘B’, ‘C’],
‘Comments’: [‘Good’, ‘Better’, ‘Best’, ‘Worse’, ‘Bad’]
})
# Aggregating text data by group
aggregated_comments = df.groupby(‘Group’)[‘Comments’].agg(‘ ‘.join).reset_index()
print(aggregated_comments)
Example 26: Dealing with Date and Time
Date and time data often require cleaning and formatting to be useful for analysis or as features in a model.
# Sample DataFrame with inconsistent date formats
df = pd.DataFrame({
‘DateTime’: [‘2024-01-01’, ‘1/2/2024’, ‘January 3, 2024’, ‘2024.01.04’, ‘2024/01/05’]
})
# Converting to consistent datetime format
df[‘DateTime’] = pd.to_datetime(df[‘DateTime’])
print(df)
Example 27: Normalizing Text Data
Text data often comes in various forms. Normalizing text to a standard form can be crucial for text analysis or natural language processing tasks.
# Sample DataFrame with text in various cases and spacing
df = pd.DataFrame({
‘Text’: [‘ FIRST entry’, ‘Second Entry ‘, ‘THIRD\tENTRY’]
})
# Normalizing text: strip spaces, replace tabs, and lowercase
df[‘Normalized Text’] = df[‘Text’].str.lower().str.replace(‘\s+’, ‘ ‘, regex=True).str.strip()
print(df[[‘Text’, ‘Normalized Text’]])
# Sample DataFrame with text in various cases and spacing
df = pd.DataFrame({
‘Text’: [‘ FIRST entry’, ‘Second Entry ‘, ‘THIRD\tENTRY’]
})
# Normalizing text: strip spaces, replace tabs, and lowercase
df[‘Normalized Text’] = df[‘Text’].str.lower().str.replace(‘\s+’, ‘ ‘, regex=True).str.strip()
print(df[[‘Text’, ‘Normalized Text’]])
Example 28: Encoding Categorical Variables
For machine learning models, categorical variables need to be encoded as numerical data. Pandas offers a straightforward way to achieve this.
# Sample DataFrame with categorical variable
df = pd.DataFrame({
‘Category’: [‘A’, ‘B’, ‘A’, ‘C’, ‘B’, ‘A’, ‘C’]
})
# Encoding categorical variables using get_dummies()
df_encoded = pd.get_dummies(df, columns=[‘Category’])
print(df_encoded)
Example 29: Cleaning and Tokenizing Text
When preparing text data for analysis or machine learning, cleaning and tokenizing the text can be essential steps.
# Assuming installation of NLTK: pip install nltk
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize
# Sample DataFrame with text
df = pd.DataFrame({
‘Text’: [‘This is a sample sentence.’, ‘Another sample, with punctuation!’]
})
# Function to clean and tokenize text
def clean_and_tokenize(text):
# Lowercasing and removing punctuation
tokens = word_tokenize(text.lower())
# Filtering out non-alphabetic tokens
words = [word for word in tokens if word.isalpha()]
return words
# Applying the function to the text column
df[‘Tokens’] = df[‘Text’].apply(clean_and_tokenize)
print(df[[‘Text’, ‘Tokens’]])
# Assuming installation of NLTK: pip install nltk
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize
# Sample DataFrame with text
df = pd.DataFrame({
‘Text’: [‘This is a sample sentence.’, ‘Another sample, with punctuation!’]
})
# Function to clean and tokenize text
def clean_and_tokenize(text):
# Lowercasing and removing punctuation
tokens = word_tokenize(text.lower())
# Filtering out non-alphabetic tokens
words = [word for word in tokens if word.isalpha()]
return words
# Applying the function to the text column
df[‘Tokens’] = df[‘Text’].apply(clean_and_tokenize)
print(df[[‘Text’, ‘Tokens’]])
Example 30: Filling Missing Time Series Data
Filling missing values in time series data can be approached differently compared to regular data due to the sequential nature of the data.
# Sample time series DataFrame with missing values
df = pd.DataFrame({
‘Date’: pd.date_range(start=’2024-01-01′, periods=6, freq=’D’),
‘Value’: [1, np.nan, np.nan, 4, np.nan, 6]
}).set_index(‘Date’)
# Filling missing values using interpolation
df[‘Interpolated’] = df[‘Value’].interpolate(method=’time’)
print(df)
Example 31: Removing Rows Based on Column Threshold
Remove rows that have less than a certain number of non-NA values.
import pandas as pd
import numpy as np
# Creating a DataFrame with missing values
df = pd.DataFrame(np.random.rand(5, 3), columns=[‘A’, ‘B’, ‘C’])
df.iloc[1:3, 1] = np.nan
# Removing rows with less than 2 non-NA values
df_thresh = df.dropna(thresh=2)
print(df_thresh)
Example 32: Filling Missing Values with Column Mean
Replace missing values with the mean of their respective column.
# Assuming df is defined as before
df[‘B’] = df[‘B’].fillna(df[‘B’].mean())
print(df)
Example 33: Categorical Data Ordinal Encoding
Convert categorical data into an ordinal encoding when the order matters.
# Creating a DataFrame with categorical data
df = pd.DataFrame({
‘Grade’: [‘Excellent’, ‘Good’, ‘Poor’, ‘Good’, ‘Excellent’]
})
# Mapping categories to numbers considering order
grade_map = {‘Poor’: 1, ‘Good’: 2, ‘Excellent’: 3}
df[‘Grade’] = df[‘Grade’].map(grade_map)
print(df)
Example 34: Resetting Index After Row Operations
Reset the DataFrame index after operations like sorting or filtering that may leave gaps in the index.
# Filtering and then resetting index
df_filtered = df[df[‘Grade’] > 1].reset_index(drop=True)
print(df_filtered)
Example 35: Applying a Function to Clean Data
Apply a custom function to a column to perform complex data cleaning operations.
# Example function to clean text data
def clean_text(text):
return text.strip().lower().replace(” “, “_”)
df[‘Cleaned’] = df[‘Grade’].apply(lambda x: clean_text(str(x)))
print(df)
Example 36: Combining str
Methods with applymap()
Use str
methods combined with applymap()
for element-wise operations on a DataFrame of strings.
# Sample DataFrame with text
df = pd.DataFrame({
‘Text1’: [‘ Python’, ‘Pandas ‘, ‘ Data ‘],
‘Text2’: [‘Cleaning’, ‘ Techniques’, ‘Examples’]
})
df = df.applymap(lambda x: x.strip().lower())
print(df)
Example 37: Filtering Rows Using str.contains
Filter rows based on whether a column’s text contains a specific substring.
# Filtering rows where Text1 contains ‘python’ or ‘data’
df_filtered = df[df[‘Text1’].str.contains(‘python|data’)]
print(df_filtered)
Example 38: Expanding and Collapsing Lists
Expand lists into rows or columns, and collapse rows or columns into lists.
# Expanding lists in a column into multiple rows
df = pd.DataFrame({
‘Data’: [[1, 2], [3, 4], [5, 6]]
})
df_expanded = df.explode(‘Data’)
# Collapsing rows into a list
df_collapsed = df_expanded.groupby(level=0).agg(list)
print(df_expanded)
print(df_collapsed)
Example 39: Deduplicating Data Based on a Subset of Columns
Remove duplicate rows based on specific columns, keeping the first occurrence.
# Creating a DataFrame with potential duplicate rows
df = pd.DataFrame({
‘A’: [‘foo’, ‘bar’, ‘foo’, ‘bar’],
‘B’: [‘one’, ‘one’, ‘two’, ‘three’],
‘C’: [1, 2, 3, 4]
})
# Removing duplicates based on columns A and B
df_deduplicated = df.drop_duplicates(subset=[‘A’, ‘B’])
print(df_deduplicated)
Example 40: Converting Between Time Zones
Convert datetime objects from one time zone to another.
# Sample DataFrame with datetime in UTC
df = pd.DataFrame({
‘DateTimeUTC’: pd.date_range(‘2024-01-01′, periods=3, freq=’D’, tz=’UTC’)
})
# Converting from UTC to Eastern Time
df[‘DateTimeEastern’] = df[‘DateTimeUTC’].dt.tz_convert(‘US/Eastern’)
print(df)
Example 41: Splitting a String Column into Multiple Columns
Sometimes, data comes in a single string column that you want to split into multiple columns based on a delimiter.
import pandas as pd
df = pd.DataFrame({
‘Name_Score’: [‘John Doe_90’, ‘Jane Smith_85’, ‘Alex Lee_88’]
})
# Splitting the ‘Name_Score’ column into ‘Name’ and ‘Score’ columns
df[[‘Name’, ‘Score’]] = df[‘Name_Score’].str.split(‘_’, expand=True)
df[‘Score’] = pd.to_numeric(df[‘Score’])
print(df)
Example 42: Converting Strings to Title Case
When dealing with text data, you might need to standardize the case formatting.
df[‘Name’] = df[‘Name’].str.title()
print(df)
Example 43: Removing Leading and Trailing Characters from String Columns
Cleaning up strings in DataFrame columns by removing leading and trailing spaces or specific characters.
df[‘Name’] = df[‘Name’].str.strip()
# Assuming another column ‘Remarks’ with extra spaces
df[‘Remarks’] = [‘ Excellent ‘, ‘ Good ‘, ‘ Average ‘]
df[‘Remarks’] = df[‘Remarks’].str.strip()
print(df)
Example 44: Replacing Missing Values with the Median
For numerical columns, replacing missing values with the median can sometimes be more appropriate than the mean.
df = pd.DataFrame({
‘Scores’: [88, 92, np.nan, 85, 90, np.nan, 87]
})
df[‘Scores’].fillna(df[‘Scores’].median(), inplace=True)
print(df)
Example 45: Converting Boolean Columns to Integer
You might need to convert boolean values to integers (1 and 0) for analysis or machine learning models.
df = pd.DataFrame({
‘Passed’: [True, False, True, False]
})
df[‘Passed’] = df[‘Passed’].astype(int)
print(df)
Example 46: Extracting Date Components
Extracting year, month, and day into separate columns can be useful for further analysis or feature engineering.
df = pd.DataFrame({
‘DateTime’: pd.to_datetime([‘2024-01-01’, ‘2024-04-15’, ‘2024-07-20’])
})
df[‘Year’] = df[‘DateTime’].dt.year
df[‘Month’] = df[‘DateTime’].dt.month
df[‘Day’] = df[‘DateTime’].dt.day
print(df)
Example 47: Binning Numerical Data
Creating categorical bins for numerical data can simplify analysis and visualization.
df = pd.DataFrame({
‘Age’: [22, 25, 19, 35, 29, 40]
})
df[‘Age Group’] = pd.cut(df[‘Age’], bins=[0, 20, 30, 40], labels=[‘0-20′, ’21-30′, ’31-40’])
print(df)
Example 48: Dropping Rows with Any or All Columns Empty
Sometimes, you’ll want to remove rows that are entirely empty or have any missing values.
df = pd.DataFrame({
‘A’: [1, np.nan, 3],
‘B’: [4, np.nan, np.nan]
})
df.dropna(how=’all’, inplace=True) # Drops rows where all columns are NaN
df.dropna(how=’any’, inplace=True) # Drops rows where any column is NaN
print(df)
Example 49: Mapping Values to Categories
Map numerical or string values to categories for readability or analysis.
df = pd.DataFrame({
‘Rating’: [1, 2, 3, 4, 5]
})
rating_map = {1: ‘Poor’, 2: ‘Fair’, 3: ‘Good’, 4: ‘Very Good’, 5: ‘Excellent’}
df[‘Rating Category’] = df[‘Rating’].map(rating_map)
print(df)
Example 50: Aggregating Data After Grouping
After grouping data, you might want to perform multiple aggregation operations for deeper insights.
df = pd.DataFrame({
‘Department’: [‘Sales’, ‘HR’, ‘IT’, ‘Sales’, ‘HR’, ‘IT’],
‘Employee’: [‘John’, ‘Doe’, ‘Jane’, ‘Smith’, ‘Alex’, ‘Lee’],
‘Salary’: [70000, 60000, 80000, 75000, 58000, 90000]
})
aggregated = df.groupby(‘Department’)[‘Salary’].agg([‘mean’, ‘min’, ‘max’])
print(aggregated)
Example 51: Converting Category Types for Efficiency
For large datasets with categorical data, converting columns to category type can save memory and speed up operations.
import pandas as pd
df = pd.DataFrame({
‘Product’: [‘Apple’, ‘Banana’, ‘Carrot’, ‘Apple’, ‘Banana’, ‘Carrot’]
})
df[‘Product’] = df[‘Product’].astype(‘category’)
print(df.dtypes)
Example 52: Applying Conditions to Subset Data
Applying conditions to subset data is essential for focusing analysis on relevant segments.
df = pd.DataFrame({
‘Sales’: [150, 200, 250, 300, 350, 400],
‘Region’: [‘East’, ‘West’, ‘East’, ‘West’, ‘East’, ‘West’]
})
# Get data where Sales are greater than 300
high_sales_df = df[df[‘Sales’] > 300]
print(high_sales_df)
Example 53: Correcting Typos in Data Entries
Correcting typos or inconsistent entries is a common need in text data cleaning.
df = pd.DataFrame({
‘Name’: [‘John’, ‘Jahn’, ‘Jane’, ‘Jone’, ‘John’]
})
# Correcting typos in the ‘Name’ column
corrections = {
‘Jahn’: ‘John’,
‘Jone’: ‘Jane’
}
df[‘Name’] = df[‘Name’].replace(corrections)
print(df)
Example 54: Standardizing Date Formats
Handling different date formats to ensure consistency in time series data.
df = pd.DataFrame({
‘Date’: [’01-01-2024′, ‘2024/02/01’, ‘March 3, 2024’]
})
# Standardizing date formats
df[‘Date’] = pd.to_datetime(df[‘Date’])
print(df)
Example 56: Aggregating and Transforming Data Simultaneously
Use transform()
to apply an aggregation function but keep the original DataFrame’s shape.
df = pd.DataFrame({
‘Group’: [‘A’, ‘A’, ‘B’, ‘B’, ‘C’, ‘C’],
‘Data’: [100, 150, 200, 250, 300, 350]
})
df[‘Average’] = df.groupby(‘Group’)[‘Data’].transform(‘mean’)
print(df)
Example 57: Handling Time Zones in DateTimes
Converting DateTime objects to handle time zones properly.
df = pd.DataFrame({
‘DateTime’: pd.to_datetime([‘2024-01-01T12:00’, ‘2024-02-01T15:00’, ‘2024-03-01T18:00’], utc=True)
})
# Converting UTC to Eastern Time
df[‘DateTime’] = df[‘DateTime’].dt.tz_convert(‘US/Eastern’)
print(df)
Example 58: Detecting and Replacing Outliers with Quantiles
Detecting outliers and replacing them can prevent skewed analyses.
df = pd.DataFrame({
‘Values’: [1, 2, 2, 3, 4, 10, 100]
})
q_low = df[‘Values’].quantile(0.01)
q_hi = df[‘Values’].quantile(0.99)
df[‘Values’] = df[‘Values’].apply(lambda x: q_hi if x > q_hi else (q_low if x < q_low else x))
print(df)
Example 59: Pivoting Data for Better Visualization
Pivoting tables to transform data for better analysis and visualization.
df = pd.DataFrame({
‘Date’: [‘2024-01-01’, ‘2024-01-01’, ‘2024-01-02’],
‘Category’: [‘Fruits’, ‘Vegetables’, ‘Fruits’],
‘Sales’: [100, 150, 200]
})
pivot_df = df.pivot_table(values=’Sales’, index=’Date’, columns=’Category’, fill_value=0)
print(pivot_df)
Example 60: Collapsing Multiple Columns into One
Consolidating multiple columns into one for a streamlined dataset.
df = pd.DataFrame({
‘R1’: [1, 0, 1],
‘R2’: [0, 1, 0],
‘R3’: [1, 1, 0]
})
# Collapsing multiple columns into a single column ‘Responses’
df[‘Responses’] = df.apply(lambda row: ”.join(row.values.astype(str)), axis=1)
print(df)
Example 61: Encoding Categorical Variables with Label Encoding
Label encoding is useful for converting text labels into a numeric form so they can be used in machine learning algorithms.
from sklearn.preprocessing import LabelEncoder
df = pd.DataFrame({
‘Category’: [‘red’, ‘blue’, ‘green’, ‘blue’, ‘red’]
})
encoder = LabelEncoder()
df[‘Category_Encoded’] = encoder.fit_transform(df[‘Category’])
print(df)
Example 62: Generating Dummy Variables
Creating dummy variables from categorical data is a common step for statistical modeling and machine learning.
df = pd.DataFrame({
‘Color’: [‘Red’, ‘Blue’, ‘Green’, ‘Blue’, ‘Red’]
})
dummies = pd.get_dummies(df[‘Color’], prefix=’Color’)
df = pd.concat([df, dummies], axis=1)
print(df)
Example 63: Converting Continuous Data into Categorical Data
Sometimes, it’s useful to bin continuous data into discrete intervals.
df = pd.DataFrame({
‘Temperature’: [72, 85, 90, 100, 65]
})
df[‘Temp_Category’] = pd.cut(df[‘Temperature’], bins=[60, 70, 80, 90, 100], labels=[“Cool”, “Mild”, “Warm”, “Hot”])
print(df)
Example 64: Combining Multiple DataFrames
Merging multiple dataframes is crucial when combining different data sources.
df1 = pd.DataFrame({
‘A’: [‘A0’, ‘A1’, ‘A2’, ‘A3’],
‘B’: [‘B0’, ‘B1’, ‘B2’, ‘B3’]
})
df2 = pd.DataFrame({
‘A’: [‘A4’, ‘A5’, ‘A6’, ‘A7’],
‘B’: [‘B4’, ‘B5’, ‘B6’, ‘B7’]
})
df_combined = pd.concat([df1, df2], ignore_index=True)
print(df_combined)
Example 65: Handling Missing Data in Time Series
Time series data often requires special handling of gaps and missing data.
ts_df = pd.DataFrame({
‘Dates’: pd.date_range(‘20240101’, periods=6),
‘Values’: [1, np.nan, np.nan, 4, 5, 6]
})
ts_df[‘Values’].fillna(method=’ffill’, inplace=True) # Forward fill
print(ts_df)
Example 66: Cleaning Text Data Using Regular Expressions
Regular expressions are powerful for text manipulations, such as removing unwanted characters.
df = pd.DataFrame({
‘Text’: [‘Example#1’, ‘Data@Science’, ‘100%True’]
})
df[‘Text’] = df[‘Text’].str.replace(r'[\W_]+’, ”, regex=True)
print(df)
Example 67: Correcting Index after Merging or Other Operations
Resetting the DataFrame index ensures that the indexing remains consistent after various transformations.
df = pd.concat([df1, df2]).reset_index(drop=True)
print(df)
Let’s dive into more data cleaning examples using Pandas to further enhance your ability to handle various types of data preparation challenges. These examples can be useful in broadening the scope of your IT training programs, providing learners with practical skills for real-world data scenarios.
Example 68: Aggregating Data Using Custom Functions
Custom aggregation functions can be applied during grouping for more specific calculations.
df = pd.DataFrame({
‘Group’: [‘X’, ‘X’, ‘Y’, ‘Y’],
‘Data’: [1, 2, 3, 4]
})
def custom_agg(x):
return x.max() – x.min()
grouped = df.groupby(‘Group’)[‘Data’].agg(custom_agg)
print(grouped)
Example 69: Filtering Data Based on Statistical Metrics
Using statistical measures like standard deviation can help in identifying and filtering outliers.
df = pd.DataFrame({
‘Data’: [1, 2, 2, 3, 4, 10, 100]
})
mean_val = df[‘Data’].mean()
std_dev = df[‘Data’].std()
filtered_df = df[(df[‘Data’] > (mean_val – 2 * std_dev)) & (df[‘Data’] < (mean_val + 2 * std_dev))]
print(filtered_df)
Example 70: Normalizing Data for Machine Learning
Normalization is a common preprocessing step for many machine learning algorithms.
from sklearn.preprocessing import MinMaxScaler
data = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]
scaler = MinMaxScaler()
normalized_data = scaler.fit_transform(data)
normalized_df = pd.DataFrame(normalized_data, columns=[‘X’, ‘Y’])
print(normalized_df)
Example 71: Removing Rows with Outliers Using IQR
This method involves calculating the interquartile range (IQR) to identify outliers and then filtering them out.
import pandas as pd
df = pd.DataFrame({
‘Values’: [10, 12, 12, 13, 12, 11, 14, 13, 15, 102, 12, 14, 14, 17, 11, 13, 12, 10, 3, 12, 14, 13, 12, 15, 10, 15, 12, 10, 14, 13, 15]
})
Q1 = df[‘Values’].quantile(0.25)
Q3 = df[‘Values’].quantile(0.75)
IQR = Q3 – Q1
df_cleaned = df[~((df[‘Values’] < (Q1 – 1.5 * IQR)) | (df[‘Values’] > (Q3 + 1.5 * IQR)))]
print(df_cleaned)
import pandas as pd
df = pd.DataFrame({
‘Values’: [10, 12, 12, 13, 12, 11, 14, 13, 15, 102, 12, 14, 14, 17, 11, 13, 12, 10, 3, 12, 14, 13, 12, 15, 10, 15, 12, 10, 14, 13, 15]
})
Q1 = df[‘Values’].quantile(0.25)
Q3 = df[‘Values’].quantile(0.75)
IQR = Q3 – Q1
df_cleaned = df[~((df[‘Values’] < (Q1 – 1.5 * IQR)) | (df[‘Values’] > (Q3 + 1.5 * IQR)))]
print(df_cleaned)
Example 72: Converting a Series of Lists into a DataFrame
If you have a series where each entry is a list, you can convert this into a DataFrame for easier manipulation.
data = pd.Series([[‘red’, ‘blue’], [‘green’, ‘yellow’], [‘white’]])
df = pd.DataFrame(data.tolist(), columns=[‘Color1’, ‘Color2’])
print(df)
Example 73: Dealing with Missing Time Data in Time Series
For time series data with missing timestamps, you can reindex the series to fill in the missing periods.
ts_index = pd.date_range(‘2024-01-01′, periods=6, freq=’D’)
ts_data = pd.Series([1, None, None, 2, 3, 4], index=ts_index)
# Reindexing with a complete date range
complete_index = pd.date_range(‘2024-01-01’, ‘2024-01-06’)
ts_data = ts_data.reindex(complete_index)
print(ts_data)
Example 74: Concatenating DataFrames with Different Columns
Concatenate two DataFrames that have different sets of columns.
df1 = pd.DataFrame({‘A’: [1, 2], ‘B’: [3, 4]})
df2 = pd.DataFrame({‘A’: [5, 6], ‘C’: [7, 8]})
df_combined = pd.concat([df1, df2]).fillna(0)
print(df_combined)
Example 75: Dropping Columns with a High Percentage of Missing Values
Drop columns from a DataFrame that have a high percentage of missing values.
df = pd.DataFrame({
‘A’: [1, 2, np.nan, 4],
‘B’: [np.nan, np.nan, np.nan, np.nan],
‘C’: [1, 2, 3, 4]
})
threshold = 0.75 # 75% missing values
df = df.dropna(thresh=len(df) * (1 – threshold), axis=1)
print(df)
Example 76: Mapping External Data to a DataFrame
Use a mapping from external data to add a new column to a DataFrame.
mapping = {1: ‘apple’, 2: ‘banana’, 3: ‘cherry’}
df = pd.DataFrame({‘Numbers’: [1, 2, 1, 3]})
df[‘Fruits’] = df[‘Numbers’].map(mapping)
print(df)
Example 77: Calculating Rolling Averages in Time Series
Calculate rolling averages to smooth out time series data.
ts_df = pd.DataFrame({
‘Dates’: pd.date_range(start=’2024-01-01′, periods=5, freq=’D’),
‘Values’: [1, 2, 3, 4, 5]
})
ts_df[‘Rolling_Mean’] = ts_df[‘Values’].rolling(window=3).mean()
print(ts_df)
Example 78: Filling Missing Values with Predictive Imputation
Predict missing values based on other columns, using a simple linear model for demonstration.
from sklearn.linear_model import LinearRegression
df = pd.DataFrame({
‘A’: [1, 2, 3, 4, 5],
‘B’: [2, np.nan, 6, 8, 10]
})
model = LinearRegression()
X = df.loc[df[‘B’].notna(), [‘A’]]
y = df.loc[df[‘B’].notna(), ‘B’]
model.fit(X, y)
# Predicting the missing ‘B’ values
df.loc[df[‘B’].isna(), ‘B’] = model.predict(df.loc[df[‘B’].isna(), [‘A’]])
print(df)
Example 79: Unstacking Data for Wider Format
Convert index levels to columns to create a wider format DataFrame from a multi-index series.
df = pd.DataFrame({
‘Year’: [2024, 2024, 2025, 2025],
‘Month’: [1, 2, 1, 2],
‘Value’: [100, 200, 300, 400]
})
df.set_index([‘Year’, ‘Month’]).unstack().fillna(0)
print(df)
Example 80: Cleaning and Extracting Numeric Data from Strings
Extract numeric data embedded in string data, which is common in real-world datasets.
df = pd.DataFrame({
‘Raw_Data’: [‘123 kg’, ‘456 lbs’, ‘789 kg’, ‘no data’]
})
df[‘Weight’] = df[‘Raw_Data’].str.extract(‘(\d+)’).astype(float).fillna(0)
Example 81: Standardizing Numeric Data with Z-Scores
Standardizing data involves scaling the features so they have the properties of a standard normal distribution with a mean of zero and a standard deviation of one.
import pandas as pd
from scipy.stats import zscore
df = pd.DataFrame({
‘Scores’: [90, 82, 70, 98, 95, 85]
})
df[‘Z_Scores’] = zscore(df[‘Scores’])
print(df)
Example 82: Stripping Whitespace from All String Columns
Automatically trim whitespace from all string columns in a DataFrame, which is useful for cleaning up textual data.
df = pd.DataFrame({
‘Name’: [‘ Alice ‘, ‘Bob ‘, ‘ Charlie’],
‘City’: [‘ New York’, ‘Los Angeles ‘, ‘San Francisco ‘]
})
df = df.applymap(lambda x: x.strip() if isinstance(x, str) else x)
print(df)
Example 83: Merging DataFrames on Multiple Keys
Merge two DataFrames based on more than one column to ensure precise data integration.
df1 = pd.DataFrame({
‘First Name’: [‘John’, ‘Jane’, ‘Alice’],
‘Last Name’: [‘Doe’, ‘Doe’, ‘Smith’],
‘Age’: [28, 34, 24]
})
df2 = pd.DataFrame({
‘First Name’: [‘John’, ‘Jane’, ‘Alice’],
‘Last Name’: [‘Doe’, ‘Doe’, ‘Smith’],
‘Profession’: [‘Engineer’, ‘Doctor’, ‘Artist’]
})
df_merged = pd.merge(df1, df2, on=[‘First Name’, ‘Last Name’])
print(df_merged)
Example 84: Converting Data Types in Bulk
Convert multiple columns to different data types efficiently, which is common after data importation.
df = pd.DataFrame({
‘A’: [‘1’, ‘2’, ‘3’],
‘B’: [‘4’, ‘5’, ‘6’],
‘C’: [‘7.1’, ‘8.2’, ‘9.3’]
})
convert_dict = {‘A’: int, ‘B’: float, ‘C’: float}
df = df.astype(convert_dict)
print(df.dtypes)
Example 85: Filtering Data Using Regular Expressions
Use regular expressions to filter rows based on complex patterns in textual data.
df = pd.DataFrame({
‘Email’: [‘[email protected]’, ‘[email protected]’, ‘[email protected]’]
})
# Filter emails not containing ‘example’
df_filtered = df[df[‘Email’].str.contains(‘example’, regex=True)]
print(df_filtered)
Example 86: Filling Missing Values in a Categorical Series
Fill missing values in categorical data using the mode (the most frequent value).
df = pd.DataFrame({
‘Favorite Color’: [‘Blue’, ‘Red’, ‘Green’, ‘Blue’, None]
})
df[‘Favorite Color’].fillna(df[‘Favorite Color’].mode()[0], inplace=True)
print(df)
Example 87: Creating Indicator Variables for Missing Data
Create binary indicator variables that denote whether a value was missing in the original dataset.
df = pd.DataFrame({
‘Income’: [55000, 48000, None, 53000]
})
df[‘Income_Missing’] = df[‘Income’].isnull().astype(int)
print(df)
Example 88: Splitting and Expanding a DataFrame from One Column
Split a single string column into multiple columns and expand the DataFrame accordingly.
df = pd.DataFrame({
‘Data’: [‘John|28|Engineer’, ‘Jane|34|Doctor’, ‘Alice|24|Artist’]
})
df[[‘Name’, ‘Age’, ‘Profession’]] = df[‘Data’].str.split(‘|’, expand=True)
print(df)
Example 89: Aggregating with Custom Functions After Filtering
Use custom functions to perform complex aggregations on filtered data.
df = pd.DataFrame({
‘Product’: [‘Apples’, ‘Oranges’, ‘Bananas’, ‘Apples’, ‘Oranges’],
‘Sales’: [100, 150, 200, 90, 140]
})
# Custom aggregation after filtering
def custom_aggregation(series):
return series.max() – series.min()
result = df[df[‘Product’] == ‘Apples’][‘Sales’].agg(custom_aggregation)
print(“Sales difference for Apples:”, result)
Example 90: Rotating Data from Wide to Long Format
Rotate a DataFrame from a wide format to a long format, often needed for time series and panel data analysis.
df = pd.DataFrame({
‘ID’: [1, 2],
‘2024_Sales’: [100, 200],
‘2025_Sales’: [150, 250]
})
df_long = pd.wide_to_long(df, stubnames=’Sales’, i=’ID’, j=’Year’, sep=’_’)
print(df_long)
Example 91: Detecting Changes in a Column
Detect changes from one value to another in a DataFrame column, which can be useful for identifying transitions or state changes.
import pandas as pd
df = pd.DataFrame({
‘Status’: [‘Open’, ‘Open’, ‘Closed’, ‘Closed’, ‘Open’]
})
df[‘Changed’] = df[‘Status’].diff().ne(0).astype(int)
print(df)
Example 92: Grouping by Multiple Conditions
Create groups based on multiple conditions to perform customized aggregations.
df = pd.DataFrame({
‘Category’: [‘A’, ‘B’, ‘A’, ‘A’, ‘B’],
‘Value’: [100, 200, 150, 120, 180],
‘Type’: [‘X’, ‘X’, ‘Y’, ‘Y’, ‘X’]
})
result = df.groupby([‘Category’, df[‘Type’].eq(‘X’)]).mean()
print(result)
Example 93: Removing Characters Except Alphabets and Numbers
Clean a text column by removing all characters except alphabets and numbers.
df = pd.DataFrame({
‘Text’: [‘Hello!@#’, ‘$$World$$’, ‘**Python**’]
})
df[‘Cleaned_Text’] = df[‘Text’].str.replace(‘[^\w\s]’, ”, regex=True)
print(df)
Example 94: Generating Frequency Tables
Generate a frequency table for a categorical variable, which is essential for data analysis and understanding distributions.
df = pd.DataFrame({
‘Grade’: [‘A’, ‘B’, ‘A’, ‘C’, ‘B’, ‘B’, ‘A’, ‘C’]
})
frequency = df[‘Grade’].value_counts()
print(frequency)
Example 95: Handling Seasonal Data
Decompose a time series into seasonal, trend, and residual components.
import pandas as pd
import numpy as np
from statsmodels.tsa.seasonal import seasonal_decompose
# Generate sample data
np.random.seed(0)
dates = pd.date_range(start=’2024-01-01′, periods=100)
data = np.random.rand(100) + np.linspace(0, 10, 100) # Increasing trend
df = pd.DataFrame({‘Value’: data}, index=dates)
# Decompose data
result = seasonal_decompose(df[‘Value’], model=’additive’, period=12)
result.plot()
Example 96: Calculating Weighted Average
Compute the weighted average of a column, useful for customized statistical calculations.
df = pd.DataFrame({
‘Values’: [10, 20, 30, 40],
‘Weights’: [1, 2, 3, 4]
})
weighted_average = np.average(df[‘Values’], weights=df[‘Weights’])
print(“Weighted Average:”, weighted_average)
Example 97: Removing Sparse Columns
Drop columns from a DataFrame that have a high percentage of zero values, often considered sparse.
df = pd.DataFrame({
‘A’: [0, 0, 0, 1, 0],
‘B’: [1, 2, 3, 4, 5],
‘C’: [0, 0, 0, 0, 0]
})
threshold = 0.8 # 80% zeros
df = df.loc[:, (df == 0).mean() < threshold]
print(df)
Example 98: Filling Missing Dates in a Time Series
Fill missing dates in a time series dataset, ensuring continuity in the data.
dates = pd.date_range(start=’2024-01-01′, end=’2024-01-10′)
data = pd.Series([1, np.nan, np.nan, 4, 5, 6, np.nan, 8, 9, 10], index=dates)
filled_data = data.resample(‘D’).ffill() # Forward fill
print(filled_data)
Example 99: Pivoting a DataFrame for Custom Summaries
Pivot data for custom summaries to better understand relationships between variables.
df = pd.DataFrame({
‘Product’: [‘Apple’, ‘Banana’, ‘Apple’, ‘Banana’, ‘Apple’],
‘City’: [‘NY’, ‘NY’, ‘SF’, ‘SF’, ‘SF’],
‘Sales’: [100, 200, 150, 300, 400]
})
pivot_table = df.pivot_table(values=’Sales’, index=’Product’, columns=’City’, aggfunc=’sum’)
print(pivot_table)
Example 100: Random Sampling of Rows
Perform random sampling of rows in a DataFrame, useful for creating subsets for training and testing.
df = pd.DataFrame({
‘Data’: range(100)
})
sampled_df = df.sample(n=10, random_state=42) # Sample 10 rows
print(sampled_df)