KB0129 2023. 5. 28. 05:05

Bad Data

All of these are commonly seen in the real world:

  • Zeros replace missing values
  • Spelling inconsistent(esp with human-entered data)
  • Rows are duplicated
  • Inconsistent date formats (e.g. 10/9/15 vs. 9/10/15)
  • Units not specified

Rectangular Data

Easy to manipulate, visualize, and combine,

Tables (DataFrames):

  • Each labeled column has values of the same type.
  • Manipulated using group, sort, join, etc.
  • Formal description of data transformations is called relational algebra.

Matrices:

  • All values have the same type.

Keys:

  • Primary key: the column (or set of columns) that determines the values in the remaining columns.
    • Unique for each row & 1-to-1 with entities.
    • Ensures that the row can be identified, even after appending more data. E.g., SSN
    • Is an email address a good primary key? -> Yes
  • Foreign key: a column containing values that are primary keys for other rows.
    • A foreign key serves as a reference to a row
    • Joining tables expands the reference with values from the referenced row.
    • The referenced row can be in the same table or a different table.

Tidy Data:

  • Every variable has its own column
  • Every observation has its own row
  • Every value has its own cell

Tidy
Tidy example
Not tidy example


Pipe:

# same with td.drop(columns=['iso2', 'iso3'])
def drop_iso(df):
	return df.drop(columns=['iso2', 'iso3'])
td.pipe(drop_iso)

# tidy function
def tidy_up(df):
	return pd.melt(df, id_vars=['country', 'year])
td.pipe(tidy_up)

 

Before
After

Now want to splite sex with number!

def split_entry(df):
	codes = df['entry].str.split('_').str[-1]
	return (df.assign(sex=codes[0], agecode=code.str[1:]).drop(columns=['entry']))

Split

 


Missing Values

  • If possible, replace missing values with the true value that was removed.
leg_df['religion'] = [x['bio'].get('religion') for x in legislators]
leg_df

before
After