There are two main reasons for working with text in pandas:
1. Canonicalization: Convert data that has multiple formats into a standard form.
- By manipulating text, we can join tables with mismatched string labels.
2. Extract information into a new feature.
- For example, we can extract date and time features from text.
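As a rough illustration of the second point, the sketch below pulls date and time features out of free-form text. The `logs` Series and its "timestamp followed by an event name" layout are made-up assumptions for illustration, not data from this post.

```python
import pandas as pd

# Hypothetical log lines: a 19-character timestamp, a space, then an event name.
logs = pd.Series([
    "2023-05-24 14:30:00 login",
    "2023-05-25 09:05:10 logout",
])

# Slice out the timestamp text and parse it into real datetimes.
timestamps = pd.to_datetime(logs.str[:19], format="%Y-%m-%d %H:%M:%S")

# Derive new features from the parsed timestamps and the remaining text.
hour = timestamps.dt.hour
day = timestamps.dt.day
event = logs.str[20:]
```

Once the text is parsed into datetimes, the `.dt` accessor exposes the individual components as ordinary numeric columns.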
Python String Methods
# In pandas (Series), each Python string method has a .str equivalent
# s.lower(_) in python
ser.str.lower(_)
# s.upper(_) in python
ser.str.upper(_)
# s.replace(_)
ser.str.replace(_)
# s.split(_)
ser.str.split(_)
# s[1:4]
ser.str[1:4]
# '_' in s
ser.str.contains(_)
# len(s)
ser.str.len()
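To make the mapping above concrete, here is a small sketch that runs each `.str` method on a toy Series. The sample values in `ser` are assumptions chosen for illustration only.

```python
import pandas as pd

# Hypothetical sample values, not from the original tables.
ser = pd.Series(["Alameda County", "St. Mary's", "de witt"])

lowered = ser.str.lower()                          # like s.lower()
uppered = ser.str.upper()                          # like s.upper()
replaced = ser.str.replace(" ", "_", regex=False)  # like s.replace(" ", "_")
parts = ser.str.split(" ")                         # like s.split(" "), one list per row
sliced = ser.str[1:4]                              # like s[1:4]
has_county = ser.str.contains("County", regex=False)  # like "County" in s
lengths = ser.str.len()                            # like len(s)
```

Each call returns a new Series aligned with the original index, so the results can be assigned straight back as DataFrame columns.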
Canonicalization
# Show the two tables
display(county_and_state), display(county_and_pop);
# eliminate whitespace, punctuation, and unnecessary text
def canonicalize_county(county_name):
    return (
        county_name
        .lower()
        .replace(' ', '')
        .replace('&', 'and')
        .replace('.', '')
        .replace('county', '')
        .replace('parish', '')
    )
canonicalize_county("St. John the Baptist")
# returns 'stjohnthebaptist'
# Then apply it to every row in both DataFrames
county_and_pop['clean_county_python'] = county_and_pop['County'].map(canonicalize_county)
county_and_state['clean_county_python'] = county_and_state['County'].map(canonicalize_county)
display(county_and_state), display(county_and_pop);
Result:
In pandas, we can do the same thing with the vectorized .str methods:
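def canonicalize_county_series(county_series):
    return (
        county_series
        .str.lower()
        .str.replace(' ', '')
        .str.replace('&', 'and')
        # regex=False treats '.' literally; as a regex, '.' matches any
        # character, and older pandas versions defaulted to regex=True
        .str.replace('.', '', regex=False)
        .str.replace('county', '')
        .str.replace('parish', '')
    )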
county_and_pop['clean_county_pandas'] = canonicalize_county_series(county_and_pop['County'])
county_and_state['clean_county_pandas'] = canonicalize_county_series(county_and_state['County'])
display(county_and_pop), display(county_and_state);
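With a canonical key in both tables, the join that motivated this work becomes possible. The sketch below is one way it might look; the rows, the placeholder population numbers, and the `clean_county` column name are assumptions for illustration, not the post's actual data.

```python
import pandas as pd

def canonicalize_county_series(county_series):
    # Same cleaning steps as above, with literal (non-regex) replacement.
    return (
        county_series
        .str.lower()
        .str.replace(' ', '', regex=False)
        .str.replace('&', 'and', regex=False)
        .str.replace('.', '', regex=False)
        .str.replace('county', '', regex=False)
        .str.replace('parish', '', regex=False)
    )

# Hypothetical rows with mismatched labels; populations are placeholders.
county_and_state = pd.DataFrame({
    "County": ["De Witt County", "Lac qui Parle County", "St. John the Baptist Parish"],
    "State": ["IL", "MN", "LA"],
})
county_and_pop = pd.DataFrame({
    "County": ["DeWitt", "Lac Qui Parle", "St John the Baptist"],
    "Population": [100, 200, 300],
})

# Joining on the raw labels finds no matches.
naive = county_and_state.merge(county_and_pop, on="County")

# Joining on the canonical key lines the rows up.
county_and_state["clean_county"] = canonicalize_county_series(county_and_state["County"])
county_and_pop["clean_county"] = canonicalize_county_series(county_and_pop["County"])
merged = county_and_state.merge(
    county_and_pop, on="clean_county", suffixes=("_state", "_pop")
)
```

Keeping the original `County` columns (suffixed here) alongside the cleaned key makes it easy to spot-check that rows were matched correctly.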