🐼 Pandas
Data Cleaning
Data cleaning handles missing values (NaN), duplicates, type conversions, and outliers. It is an essential preprocessing step before any analysis or ML model training.
● Intermediate
📖 Based on: Python for Data Analysis — Wes McKinney, Data Science and Analytics with Python
1 · Core Concepts
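Data cleaning covers four recurring tasks. Missing values (NaN) are either dropped with dropna() or filled with fillna(). Duplicate rows are removed with drop_duplicates(). Columns stored with the wrong type are converted with astype() or pd.to_numeric(). Outliers are detected and then flagged, clipped, or removed, depending on the analysis. Doing this before any analysis or ML model training keeps bad records from skewing the results.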
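The four tasks can be sketched on a small, deliberately messy table. The data below is invented for illustration, and the choices made (median fill for missing ages, a 1.5 × IQR rule for outliers) are assumptions for this sketch, not the only reasonable options.

```python
import pandas as pd

# Hypothetical messy data, invented for this sketch.
raw = pd.DataFrame({
    "name": ["Alice", "Bob", "Bob", "Charlie", "Diana"],
    "age": ["25", "30", "30", None, "28"],  # numbers stored as strings, one missing
    "salary": [50000.0, 60000.0, 60000.0, 70000.0, 900000.0],  # last value looks suspicious
})

# 1. Duplicates: drop rows that exactly repeat an earlier row.
clean = raw.drop_duplicates().reset_index(drop=True)

# 2. Type conversion: parse age as a number; anything unparseable becomes NaN.
clean["age"] = pd.to_numeric(clean["age"], errors="coerce")

# 3. Missing values: fill the missing age with the column median (an assumption).
clean["age"] = clean["age"].fillna(clean["age"].median())

# 4. Outliers: flag salaries outside 1.5 * IQR beyond the quartiles.
q1, q3 = clean["salary"].quantile([0.25, 0.75])
iqr = q3 - q1
clean["salary_outlier"] = ~clean["salary"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

print(clean)
```

Flagging outliers in a separate column, rather than deleting them, keeps the decision about what to do with them visible and reversible.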
2 · Code Examples
Python
import pandas as pd
import numpy as np

# Example DataFrame
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "Diana"],
    "age": [25, 30, 35, 28],
    "salary": [50000, 60000, 70000, 55000],
    "dept": ["Engineering", "Marketing", "Engineering", "Marketing"]
})
print(df)
▶ Output
      name  age  salary         dept
0    Alice   25   50000  Engineering
1      Bob   30   60000    Marketing
2  Charlie   35   70000  Engineering
3    Diana   28   55000    Marketing
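Before cleaning anything, it usually helps to check what actually needs cleaning. The following is a minimal sketch of that inspection step on the same example DataFrame, rebuilt here so the snippet runs on its own. The variable names `missing_per_column` and `duplicate_rows` are chosen for illustration.

```python
import pandas as pd

# Same example DataFrame as above, rebuilt so this snippet is self-contained.
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "Diana"],
    "age": [25, 30, 35, 28],
    "salary": [50000, 60000, 70000, 55000],
    "dept": ["Engineering", "Marketing", "Engineering", "Marketing"],
})

# Count of missing values in each column.
missing_per_column = df.isna().sum()

# Number of rows that exactly repeat an earlier row.
duplicate_rows = int(df.duplicated().sum())

print(missing_per_column)
print("duplicate rows:", duplicate_rows)
print(df.dtypes)  # column types, to spot numbers stored as strings
```

This example table is already clean, so both counts come out as zero. On real data, these three checks point to which cleaning steps are needed.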
3 · Common Patterns
Textbook Insight
Pandas is built on NumPy and provides high-performance, easy-to-use data structures. The two primary structures are Series (1D) and DataFrame (2D). Always prefer vectorized operations over loops.
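To make the loop-versus-vectorized advice concrete, here is a small sketch that computes the same result both ways. The 10% raise is a made-up calculation used only for illustration.

```python
import pandas as pd

df = pd.DataFrame({"salary": [50000, 60000, 70000, 55000]})

# Loop version: Python-level iteration, one value at a time.
raised_loop = []
for value in df["salary"]:
    raised_loop.append(value * 1.10)

# Vectorized version: one expression applied to the whole column at once.
raised_vec = df["salary"] * 1.10

print(raised_vec)
```

Both versions produce the same numbers. The vectorized form is shorter and pushes the arithmetic down into NumPy, which typically matters more as the number of rows grows.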
4 · Best Practices
Performance Tip
Use .apply() sparingly: it calls a Python function once per row or element, so an equivalent vectorized operation is usually faster. Use the category dtype for string columns with few unique values to save memory.
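The memory effect of the category dtype can be checked directly with memory_usage(deep=True). The sketch below uses a synthetic column of 100,000 values with only two distinct strings. The size of the saving depends on the data and on the pandas version, so no specific numbers are claimed here.

```python
import pandas as pd

# Synthetic column: 100,000 rows but only two distinct values.
dept = pd.Series(["Engineering", "Marketing"] * 50_000)

# Category dtype stores each distinct string once plus a small integer code per row.
dept_cat = dept.astype("category")

string_bytes = dept.memory_usage(deep=True)
category_bytes = dept_cat.memory_usage(deep=True)

print("as strings: ", string_bytes, "bytes")
print("as category:", category_bytes, "bytes")
```

The benefit shrinks as the number of unique values approaches the number of rows, so this conversion is best reserved for low-cardinality columns.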