Python String Method

There are two main reasons for working with text in pandas:

1. Canonicalization: Convert data that has multiple formats into a standard form.

By manipulating text, we can join tables with mismatched string labels.

2. Extract information into a new feature.

For example, we can extract date and time features from text.
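
As a rough illustration of the second point, the sketch below pulls a date and a time out of free text and turns them into features. The log lines, the timestamp layout, and the column names are assumptions made up for this example; they are not from the original notes.

```python
import pandas as pd

# Hypothetical log lines; the timestamp layout is an assumption for illustration.
logs = pd.Series([
    "2023-05-24 09:15:02 login ok",
    "2023-05-25 17:40:59 login failed",
])

# Named groups in the pattern become column names in the extracted DataFrame.
pattern = r"(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2})"
parts = logs.str.extract(pattern)

# Combine the two text pieces into a datetime column, then derive new features.
timestamp = pd.to_datetime(parts["date"] + " " + parts["time"],
                           format="%Y-%m-%d %H:%M:%S")
features = pd.DataFrame({
    "day_of_week": timestamp.dt.day_name(),
    "hour": timestamp.dt.hour,
})
print(features)
```

Here `str.extract` does the text work, and the `.dt` accessor takes over once the strings have become real datetimes.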

Python String Methods

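# Each comment shows the plain-Python operation;
# the line below it is the pandas Series equivalent.
# `_` stands for the arguments you would pass.
# s.lower() in Python
ser.str.lower()

# s.upper() in Python
ser.str.upper()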

# s.replace(_)
ser.str.replace(_)

# s.split(_) 
ser.str.split(_)

# s[1:4]
ser.str[1:4]

# '_' in s 
ser.str.contains(_)

# len(s)
ser.str.len()
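
To make the mapping above concrete, here is a small runnable sketch that applies each Series method to a made-up two-element Series. The sample strings and variable names are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical sample data, chosen only to show each method's effect.
ser = pd.Series(["Hello World", "Data 100"])

lowered = ser.str.lower()                           # like s.lower()
uppered = ser.str.upper()                           # like s.upper()
replaced = ser.str.replace("o", "0", regex=False)   # like s.replace("o", "0")
pieces = ser.str.split(" ")                         # like s.split(" ")
sliced = ser.str[1:4]                               # like s[1:4]
has_data = ser.str.contains("Data", regex=False)    # like "Data" in s
lengths = ser.str.len()                             # like len(s)

print(pd.DataFrame({
    "lowered": lowered,
    "sliced": sliced,
    "has_data": has_data,
    "lengths": lengths,
}))
```

Each call returns a new Series of the same length, so the results can be assigned straight back to DataFrame columns.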

Canonicalization

# Show the two tables
display(county_and_state), display(county_and_pop);

# eliminate whitespace, punctuation, and unnecessary text
def canonicalize_county(county_name):
    return (
        county_name
            .lower()
            .replace(' ', '')
            .replace('&', 'and')
            .replace('.', '')
            .replace('county', '')
            .replace('parish', '')
    )

canonicalize_county("St. John the Baptist")

# returns 'stjohnthebaptist'

# Then apply the function to every row in both DataFrames
county_and_pop['clean_county_python'] = county_and_pop['County'].map(canonicalize_county)
county_and_state['clean_county_python'] = county_and_state['County'].map(canonicalize_county)
display(county_and_state), display(county_and_pop);

Result: see https://ds100.org/course-notes/regex/regex.html#python-string-methods

In pandas, we can do the same thing with the vectorized .str methods:

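def canonicalize_county_series(county_series):
    # regex=False makes every replacement literal. Without it, older pandas
    # versions treat '.' as a regex that matches any character.
    return (
        county_series
            .str.lower()
            .str.replace(' ', '', regex=False)
            .str.replace('&', 'and', regex=False)
            .str.replace('.', '', regex=False)
            .str.replace('county', '', regex=False)
            .str.replace('parish', '', regex=False)
    )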

county_and_pop['clean_county_pandas'] = canonicalize_county_series(county_and_pop['County'])
county_and_state['clean_county_pandas'] = canonicalize_county_series(county_and_state['County'])

display(county_and_pop), display(county_and_state);
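
Once both tables share a canonical key, the join that motivated the cleanup becomes possible. The sketch below assumes two tiny stand-in tables; the `State` and `Population` columns, the placeholder values, and the `clean_county` column name are invented for illustration and are not the data from the notes.

```python
import pandas as pd

def canonicalize_county_series(county_series):
    # Same steps as above, with literal (non-regex) replacements.
    return (
        county_series
            .str.lower()
            .str.replace(" ", "", regex=False)
            .str.replace("&", "and", regex=False)
            .str.replace(".", "", regex=False)
            .str.replace("county", "", regex=False)
            .str.replace("parish", "", regex=False)
    )

# Hypothetical stand-in tables with mismatched county labels.
county_and_state = pd.DataFrame({
    "County": ["De Witt County", "Lac qui Parle County", "St. John the Baptist Parish"],
    "State": ["IL", "MN", "LA"],
})
county_and_pop = pd.DataFrame({
    "County": ["DeWitt", "Lac Qui Parle", "St John the Baptist"],
    "Population": [100, 200, 300],  # placeholder numbers, not real populations
})

# Joining on the raw labels finds no matches.
raw_merge = county_and_state.merge(county_and_pop, on="County")

# Joining on the canonical key matches every row.
county_and_state["clean_county"] = canonicalize_county_series(county_and_state["County"])
county_and_pop["clean_county"] = canonicalize_county_series(county_and_pop["County"])
merged = county_and_state.merge(
    county_and_pop, on="clean_county", suffixes=("_state", "_pop")
)
print(len(raw_merge), len(merged))
```

Keeping the original `County` columns next to the cleaned key makes it easy to spot-check that unrelated names were not collapsed into the same value.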
