In machine learning, categorical data usually has to be encoded as numbers before a model can use it. Two common techniques for this are label encoding and one-hot encoding.
Label Encoding
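Label encoding assigns a unique integer to each category. This method is simple and works well for ordinal data, where categories have a natural order, provided the integers are assigned in that order. Note that scikit-learn's LabelEncoder numbers categories by their sorted order (alphabetical for strings), which may not match the intended ranking.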
Example in Python:
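    from sklearn.preprocessing import LabelEncoder

    # `data` is assumed to be an existing pandas DataFrame containing 'category_column'.
    label_encoder = LabelEncoder()
    # Each distinct category is mapped to an integer code (sorted order).
    data['encoded_column'] = label_encoder.fit_transform(data['category_column'])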
Benefits:
- Efficient and straightforward for ordinal data.
- Keeps the feature in a single column, so it adds no extra dimensions, unlike one-hot encoding.
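To illustrate how the integer codes relate to category order, here is a minimal sketch comparing LabelEncoder's sorted numbering with an explicit order passed to scikit-learn's OrdinalEncoder. The `size` column, its values, and the ranking small < medium < large are assumptions made up for this example.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical ordinal feature; the values and their ranking are assumptions.
df = pd.DataFrame({"size": ["small", "large", "medium", "small"]})

# LabelEncoder sorts the classes, so strings are numbered alphabetically:
# large=0, medium=1, small=2.
label_codes = LabelEncoder().fit_transform(df["size"])

# OrdinalEncoder accepts an explicit category order per feature:
# small=0, medium=1, large=2.
ordinal = OrdinalEncoder(categories=[["small", "medium", "large"]])
ordinal_codes = ordinal.fit_transform(df[["size"]]).ravel()

print(label_codes.tolist())    # [2, 0, 1, 2]
print(ordinal_codes.tolist())  # [0.0, 2.0, 1.0, 0.0]
```

When the ranking matters to the model, stating the order explicitly avoids relying on how the category names happen to sort.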
One-Hot Encoding
One-hot encoding transforms a categorical variable into a set of binary columns, one per category. Each row has a 1 in the column for its category and 0 in all the others. This is ideal for nominal data, where categories have no intrinsic order.
Example in Python:
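    import pandas as pd

    # `data` is assumed to be an existing DataFrame containing 'category_column'.
    # The column is replaced by one indicator column per category.
    data = pd.get_dummies(data, columns=['category_column'])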
Benefits:
- Avoids unintended ordinal relationships.
- Suitable for algorithms that would otherwise read integer codes as magnitudes, such as linear models and distance-based methods.
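As a small worked illustration of the binary matrix described above, the following sketch encodes a made-up `color` column with pandas. The column name and values are assumptions chosen for the example.

```python
import pandas as pd

# Hypothetical nominal feature; the column name and values are assumptions.
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# dtype=int requests 0/1 columns (recent pandas versions default to booleans).
encoded = pd.get_dummies(df, columns=["color"], dtype=int)

print(encoded.columns.tolist())  # ['color_blue', 'color_green', 'color_red']
print(encoded.values.tolist())   # [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]
```

Each row contains exactly one 1, and the number of new columns equals the number of distinct categories, which is the dimensionality cost to weigh against label encoding.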