Data Cleaning Review

Computer Science 🌋/Machine Learning🐼

Data Cleaning Review

KB0129 2023. 5. 28. 05:05

Bad Data

All of these are commonly seen in the real world:

Zeros replace missing values
Spelling inconsistent(esp with human-entered data)
Rows are duplicated
Inconsistent date formats (e.g. 10/9/15 vs. 9/10/15)
Units not specified

Rectangular Data

Easy to manipulate, visualize, and combine,

Tables (DataFrames):

Each labeled column has values of the same type.
Manipulated using group, sort, join, etc.
Formal description of data transformations is called relational algebra.

Matrices:

All values have the same type.

Keys:

Primary key: the column (or set of columns) that determines the values in the remaining columns.
- Unique for each row & 1-to-1 with entities.
- Ensures that the row can be identified, even after appending more data. E.g., SSN
- Is an email address a good primary key? -> Yes
Foreign key: a column containing values that are primary keys for other rows.
- A foreign key serves as a reference to a row
- Joining tables expands the reference with values from the referenced row.
- The referenced row can be in the same table or a different table.

Tidy Data:

Every variable has its own column
Every observation has its own row
Every value has its own cell

Pipe:

# same with td.drop(columns=['iso2', 'iso3'])
def drop_iso(df):
	return df.drop(columns=['iso2', 'iso3'])
td.pipe(drop_iso)

# tidy function
def tidy_up(df):
	return pd.melt(df, id_vars=['country', 'year])
td.pipe(tidy_up)

Now want to splite sex with number!

def split_entry(df):
	codes = df['entry].str.split('_').str[-1]
	return (df.assign(sex=codes[0], agecode=code.str[1:]).drop(columns=['entry']))

Missing Values

If possible, replace missing values with the true value that was removed.

leg_df['religion'] = [x['bio'].get('religion') for x in legislators]
leg_df