Bad Data
All of these are commonly seen in the real world:
- Zeros replace missing values
- Spelling is inconsistent (especially with human-entered data)
- Rows are duplicated
- Inconsistent date formats (e.g. 10/9/15 vs. 9/10/15)
- Units not specified
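Some of these problems can be checked for mechanically. The sketch below uses a made-up table: the column names (city, temp_c) and the decision to treat 0 as a missing-value code are assumptions for illustration, not part of any real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical data: a temperature of 0 stands in for "not recorded".
df = pd.DataFrame({
    "city": ["Berkeley", "berkeley ", "Oakland", "Oakland"],
    "temp_c": [18.5, 0.0, 21.0, 21.0],
})

# Inconsistent spelling: strip whitespace and normalize capitalization.
df["city"] = df["city"].str.strip().str.title()

# Zeros replacing missing values: mark them as NaN.
# Only safe when 0 is not a plausible real measurement.
df["temp_c"] = df["temp_c"].replace(0, np.nan)

# Duplicated rows: keep the first copy of each exact duplicate.
df = df.drop_duplicates().reset_index(drop=True)
```

Whether a zero is a real value or a placeholder depends on the data, so this kind of replacement should follow a look at the documentation or the distribution of the column.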
Rectangular Data
Easy to manipulate, visualize, and combine.
Tables (DataFrames):
- Each labeled column has values of the same type.
- Manipulated using group, sort, join, etc.
- Formal description of data transformations is called relational algebra.
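As a rough illustration of group, sort, and join in pandas, here is a minimal sketch on two made-up tables. The table and column names (sales, stores, store_id, amount, city) are hypothetical.

```python
import pandas as pd

# Hypothetical tables for illustration.
sales = pd.DataFrame({
    "store_id": [1, 1, 2, 2],
    "amount": [10.0, 5.0, 7.0, 1.0],
})
stores = pd.DataFrame({
    "store_id": [1, 2],
    "city": ["Berkeley", "Oakland"],
})

# Group: total sales per store.
totals = sales.groupby("store_id", as_index=False)["amount"].sum()

# Join: attach each store's city through the shared store_id column.
joined = totals.merge(stores, on="store_id", how="left")

# Sort: largest total first.
result = joined.sort_values("amount", ascending=False).reset_index(drop=True)
```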
Matrices:
- All values have the same type.
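The difference between a table and a matrix shows up in the types. In this small sketch (made-up column names), each DataFrame column keeps its own type, while converting the numeric columns to a NumPy array gives a single type for every value.

```python
import numpy as np
import pandas as pd

# Hypothetical table with columns of different types.
table = pd.DataFrame({
    "name": ["a", "b"],
    "count": [1, 2],
    "score": [0.5, 1.5],
})

# Table: one type per column.
column_types = table.dtypes

# Matrix: one type for all values. Integers are upcast to floats here
# so that both columns fit in a single array.
matrix = table[["count", "score"]].to_numpy()
```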
Keys:
- Primary key: the column (or set of columns) that determines the values in the remaining columns.
  - Unique for each row and 1-to-1 with entities.
  - Ensures that the row can be identified, even after appending more data. E.g., SSN
  - Is an email address a good primary key? -> Usually yes: it is unique per account, though addresses can change over time.
- Foreign key: a column containing values that are primary keys for other rows.
  - A foreign key serves as a reference to a row.
  - Joining tables expands the reference with values from the referenced row.
  - The referenced row can be in the same table or a different table.
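The key properties above can be checked directly in pandas. This is a minimal sketch with two hypothetical tables, customers (primary key customer_id) and orders (foreign key customer_id). All names and values are invented for illustration.

```python
import pandas as pd

# Hypothetical tables: customer_id is the primary key of customers
# and a foreign key in orders.
customers = pd.DataFrame({
    "customer_id": [101, 102],
    "email": ["ada@example.com", "grace@example.com"],
})
orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "customer_id": [101, 102, 101],
    "total": [20.0, 35.0, 12.5],
})

# A primary key is unique and present in every row.
assert customers["customer_id"].is_unique
assert customers["customer_id"].notna().all()

# Every foreign key value should refer to an existing primary key.
assert orders["customer_id"].isin(customers["customer_id"]).all()

# Joining expands each reference with the referenced row's values.
# validate="many_to_one" makes pandas raise an error if customer_id
# turns out not to be unique in customers.
expanded = orders.merge(
    customers, on="customer_id", how="left", validate="many_to_one"
)
```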
Tidy Data:
- Every variable has its own column
- Every observation has its own row
- Every value has its own cell
Pipe:
# same as td.drop(columns=['iso2', 'iso3'])
def drop_iso(df):
    return df.drop(columns=['iso2', 'iso3'])

td.pipe(drop_iso)
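import pandas as pd

# tidy function: melt the wide columns into one row per (country, year, entry)
def tidy_up(df):
    # var_name='entry' names the melted column so that split_entry can find it
    return pd.melt(df, id_vars=['country', 'year'], var_name='entry')

td.pipe(tidy_up)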
Next, split each entry into its sex code and its age code:
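def split_entry(df):
    # keep the piece after the last underscore:
    # its first character is the sex, the rest is the age code
    codes = df['entry'].str.split('_').str[-1]
    return (df.assign(sex=codes.str[0], agecode=codes.str[1:])
              .drop(columns=['entry']))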
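To see the three steps chained together, here is a minimal end-to-end sketch on a tiny made-up table. The column names new_sp_m014 and new_sp_f1524 and all values are hypothetical stand-ins for the real td data, and the sketch assumes the melted column is named entry (with cases as an illustrative name for the value column).

```python
import pandas as pd

# Hypothetical wide table standing in for td.
td = pd.DataFrame({
    "country": ["A", "B"],
    "iso2": ["AA", "BB"],
    "iso3": ["AAA", "BBB"],
    "year": [2000, 2000],
    "new_sp_m014": [1, 3],
    "new_sp_f1524": [2, 4],
})

def drop_iso(df):
    return df.drop(columns=["iso2", "iso3"])

def tidy_up(df):
    # One row per (country, year, entry) instead of one column per entry.
    return pd.melt(df, id_vars=["country", "year"],
                   var_name="entry", value_name="cases")

def split_entry(df):
    # "new_sp_m014" -> "m014" -> sex "m", agecode "014"
    codes = df["entry"].str.split("_").str[-1]
    return (df.assign(sex=codes.str[0], agecode=codes.str[1:])
              .drop(columns=["entry"]))

# Each pipe call passes the previous result to the next function.
tidy = td.pipe(drop_iso).pipe(tidy_up).pipe(split_entry)
```

Chaining with pipe keeps each step a small, testable function and makes the order of transformations readable from left to right.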
Missing Values
- If possible, replace missing values with the true value that was removed.
leg_df['religion'] = [x['bio'].get('religion') for x in legislators]
leg_df
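The pattern above works because dict.get returns None when a key is absent, and pandas treats None as a missing value. A minimal sketch, assuming legislators is a list of nested records (the two records here are invented):

```python
import pandas as pd

# Hypothetical records: the second one has no 'religion' entry in its bio.
legislators = [
    {"name": "A", "bio": {"religion": "Example"}},
    {"name": "B", "bio": {}},
]

leg_df = pd.DataFrame({"name": [x["name"] for x in legislators]})

# Recover the value from the original records where it exists;
# .get returns None instead of raising KeyError when it does not.
leg_df["religion"] = [x["bio"].get("religion") for x in legislators]

# Rows that are still missing after going back to the source.
still_missing = leg_df["religion"].isna()
```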