Goals of EDA
- Data Types: What kinds of data do we have?
- Granularity: How fine/coarse is each datum?
- Scope: How (in)complete are the data?
- Temporality: How are the data situated in time?
- Faithfulness: How accurately do the data describe the world?
Data Types
- Nominal Data: categories without natural ordering
- Ordinal Data: categories with natural ordering
- Numerical Data: amounts or quantities
- Computational data types: int, float, string, boolean, etc.
- Statistical data types: nominal, numeric, etc.
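The gap between computational and statistical types can be sketched in pandas. This is a minimal illustration under assumptions: the toy DataFrame and its column names (`zip_code`, `rating`, `price`) are hypothetical and not from any real dataset.

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "zip_code": [94704, 94705, 94704],    # stored as int, but statistically nominal
    "rating": ["low", "high", "medium"],  # stored as string, but statistically ordinal
    "price": [12.5, 30.0, 18.25],         # numerical
})

# Computational types as pandas stores them.
print(df.dtypes)

# Encode the statistical types explicitly.
df["zip_code"] = df["zip_code"].astype("category")
df["rating"] = pd.Categorical(
    df["rating"], categories=["low", "medium", "high"], ordered=True
)

# An ordered categorical supports rank comparisons; a nominal one does not.
above_low = df[df["rating"] > "low"]
print(above_low)
```

Averaging `zip_code` would run without error on the original int column, which is why it helps to record the statistical type rather than rely on the storage type.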
Granularity
- Some data will include summaries as records
- Sampling, averaging -> aggregate data!
- What does each record represent? (business, restaurant, location)
- What is the primary key?
- What would you find by grouping by a given column (or set of columns)?
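The granularity questions above can be checked directly in pandas. The sketch below assumes a hypothetical `inspections` table with made-up columns and values; it tests candidate primary keys and shows how grouping coarsens the granularity.

```python
import pandas as pd

# Hypothetical inspection records; one row is meant to be one inspection.
inspections = pd.DataFrame({
    "business_id": [1, 1, 2, 3, 3, 3],
    "date": ["2023-01-05", "2023-03-10", "2023-01-07",
             "2023-02-01", "2023-02-01", "2023-04-12"],
    "score": [90, 85, 100, 70, 70, 88],
})

# A column can be a primary key only if its values are unique.
business_id_is_key = inspections["business_id"].is_unique

# Try a composite key: no (business_id, date) pair may repeat.
composite_is_key = not inspections.duplicated(
    subset=["business_id", "date"]
).any()

# Grouping changes what each record represents: one row per business.
per_business = inspections.groupby("business_id").agg(
    n_inspections=("score", "size"),
    mean_score=("score", "mean"),
)
print(business_id_is_key, composite_is_key)
print(per_business)
```

Here neither candidate is a key, because business 3 has two rows on the same date. That is a prompt to ask whether those rows are duplicates or separate events.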
Scope
- Is it a sample? Or census?
- What time frame?
Temporality
- What does each date/time field mean: when the described event happened?
- When the data were collected or entered into the system?
- Or when the data were copied?
- Time zones
- Daylight saving time
- Regional Formatting
- Are there strange zero/null values?
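A few of these temporality checks can be sketched in pandas. The timestamps below are made up, and treating them as UTC is an assumption for illustration; real data should document its own time zone.

```python
import pandas as pd

# Hypothetical raw timestamps, including a missing value and a Unix-epoch value.
raw = pd.Series([
    "2023-01-15 12:00:00",
    "1970-01-01 00:00:00",
    None,
    "2023-07-15 12:00:00",
])

# Parse to datetimes; missing or unparseable entries become NaT.
ts = pd.to_datetime(raw, errors="coerce")
n_missing = ts.isna().sum()

# The Unix epoch often signals a zero or default value rather than a real event.
suspicious = ts == pd.Timestamp("1970-01-01")

# Assumption: the raw values are UTC. Convert to a local zone to see the
# daylight saving shift between the January and July rows.
local = ts.dt.tz_localize("UTC").dt.tz_convert("America/Los_Angeles")
print(local)
```

The January row lands at 04:00 local time and the July row at 05:00, even though both are 12:00 UTC. This is the kind of offset that quietly distorts hourly summaries.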
Faithfulness
- Do the data violate obvious dependencies?
- Were the data entered by hand?
- Did the data entry form provide default values?
- Are there signs of deliberate data falsification?
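Some faithfulness checks can be expressed as simple assertions about the data. The sketch below uses a hypothetical `patients` table with invented values; it flags records that violate an obvious dependency and looks for a value repeated often enough to suggest a form default.

```python
import pandas as pd

# Hypothetical records; column names and values are illustrative only.
patients = pd.DataFrame({
    "birth_date": pd.to_datetime(
        ["1990-05-01", "2001-01-01", "1985-07-20", "2001-01-01"]
    ),
    "visit_date": pd.to_datetime(
        ["2020-03-01", "2019-06-15", "1980-01-01", "2021-09-09"]
    ),
    "age_at_visit": [29, 18, -5, 20],
})

# Dependency checks: a visit cannot precede birth, and age cannot be negative.
visit_before_birth = patients["visit_date"] < patients["birth_date"]
negative_age = patients["age_at_visit"] < 0
print(patients[visit_before_birth | negative_age])

# A value that repeats unusually often may be a default from the entry form.
birth_counts = patients["birth_date"].value_counts()
most_common_birth_date = birth_counts.index[0]
print(most_common_birth_date, birth_counts.iloc[0])
```

A repeated value is only a hint, not proof of a default; whether `2001-01-01` is suspicious depends on how the data were entered.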