Schemas
A schema describes all relations and their attribute names and types.
Granularity (what does one record in each table represent?)
Primary and foreign keys
Representation
CREATE TABLE users( id INTEGER PRIMARY KEY, name TEXT )
CREATE TABLE orders( item TEXT PRIMARY KEY, price NUMERIC, name TEXT )
GROUP BY and HAVING
# SQL
SELECT max(name), legs, weight FROM animals GROUP BY legs, weight HAVIN..

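To make the GROUP BY / HAVING idea concrete, here is a minimal sketch using Python's built-in sqlite3 module. The animals rows and the HAVING condition (count(*) > 1) are assumptions made up for illustration; the HAVING clause in the excerpt above is cut off, so this is not necessarily the same query.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical animals table; name is the primary key.
cur.execute(
    "CREATE TABLE animals (name TEXT PRIMARY KEY, legs INTEGER, weight INTEGER)"
)
cur.executemany(
    "INSERT INTO animals VALUES (?, ?, ?)",
    [
        ("cat", 4, 10),
        ("dog", 4, 10),
        ("ferret", 4, 2),
        ("penguin", 2, 10),
        ("emu", 2, 10),
        ("hippo", 4, 1500),
    ],
)

# A primary key must be unique, so a second 'cat' row is rejected.
try:
    cur.execute("INSERT INTO animals VALUES ('cat', 4, 9)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# GROUP BY forms one group per (legs, weight) pair;
# HAVING then filters whole groups (here: groups with more than one row).
rows = cur.execute(
    """
    SELECT max(name), legs, weight
    FROM animals
    GROUP BY legs, weight
    HAVING count(*) > 1
    ORDER BY legs, weight
    """
).fetchall()
conn.close()

print(rows)
```

WHERE filters rows before grouping, while HAVING filters groups after aggregation, which is why count(*) can appear in HAVING.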
Goals of EDA
Data Types: What kinds of data do we have?
Granularity: How fine/coarse is each datum?
Scope: How (in)complete are the data?
Temporality: How are the data situated in time?
Faithfulness: How accurately do the data describe the world?
Data Type:
Nominal Data: categories without natural ordering
Ordinal Data: categories with natural ordering
Numerical Data: amounts or quantities
Compu..

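A few of these EDA questions can be checked directly in pandas. The sketch below uses a small made-up table (the column names and values are hypothetical) to show one way of marking a column as ordinal and of probing granularity and scope.

```python
import pandas as pd

# Hypothetical data: one nominal, one ordinal, one numerical column.
df = pd.DataFrame({
    "state": ["CA", "NY", "CA"],           # nominal: no natural order
    "rating": ["low", "high", "medium"],   # ordinal: low < medium < high
    "births": [120, 95, 130],              # numerical: a quantity
})

# Tell pandas about the ordering so comparisons and max() make sense.
df["rating"] = pd.Categorical(
    df["rating"], categories=["low", "medium", "high"], ordered=True
)
highest = df["rating"].max()

# Granularity: does each row represent exactly one state?
one_row_per_state = not df.duplicated(subset=["state"]).any()

# Scope: how many values are missing overall?
n_missing = int(df.isna().sum().sum())

print(df.dtypes)
print(highest, one_row_per_state, n_missing)
```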
Bad Data
All of these are commonly seen in the real world:
Zeros replace missing values
Inconsistent spelling (especially with human-entered data)
Duplicated rows
Inconsistent date formats (e.g. 10/9/15 vs. 9/10/15)
Units not specified
Rectangular Data
Easy to manipulate, visualize, and combine.
Tables (DataFrames): Each labeled column has values of the same type. Manipulated using group, sort, join..

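Several of the problems listed under Bad Data can be handled with a few pandas calls. This is a rough sketch on made-up data; treating 0 as a placeholder for a missing temperature is an assumption that has to be justified for each real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical human-entered readings with typical problems.
raw = pd.DataFrame({
    "city": ["Berkeley", "berkeley ", "Oakland", "Oakland"],
    "temp": [61.0, 0.0, 58.0, 58.0],
})

clean = raw.copy()
# Inconsistent spelling: trim whitespace and normalize capitalization.
clean["city"] = clean["city"].str.strip().str.title()
# Zeros standing in for missing values (assumption): mark them as NaN.
clean["temp"] = clean["temp"].replace(0.0, np.nan)
# Duplicated rows: keep the first copy only.
clean = clean.drop_duplicates().reset_index(drop=True)

# Ambiguous dates: the same string means different days
# depending on which format we assume.
as_month_first = pd.to_datetime("10/9/15", format="%m/%d/%y")
as_day_first = pd.to_datetime("10/9/15", format="%d/%m/%y")

print(clean)
print(as_month_first.date(), as_day_first.date())
```

Passing an explicit format avoids silently guessing between month-first and day-first dates.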
Python list:
Pandas: The word "index" refers to the collection of labels for each row.
groupby: Harder Questions
What was the most popular male name during each year in the data?
What are the three states with the most babies born?
With groupby, we can approach these questions easily.
# average of percent, grouped by Party
df['%'].groupby(df['Party']).mean()
# return minimum value, grouped by Party
df['%'].group..

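As a sketch of how groupby answers the first question, the example below uses tiny made-up tables. The column names Year, Sex, Name, and Count and all of the values are assumptions, not the actual dataset.

```python
import pandas as pd

# Same pattern as the excerpt: aggregate one column, grouped by another.
elections = pd.DataFrame({"Party": ["A", "A", "B"], "%": [40.0, 50.0, 30.0]})
mean_pct = elections["%"].groupby(elections["Party"]).mean()

# Hypothetical baby-name counts.
babies = pd.DataFrame({
    "Year": [2000, 2000, 2000, 2001, 2001],
    "Sex": ["M", "M", "F", "M", "M"],
    "Name": ["Jacob", "Michael", "Emily", "Jacob", "Michael"],
    "Count": [300, 320, 250, 310, 290],
})

# Most popular male name in each year:
# 1) keep male rows, 2) find the row label of the largest Count per year,
# 3) look those rows up again to read off the name.
males = babies[babies["Sex"] == "M"]
top_idx = males.groupby("Year")["Count"].idxmax()
top_male = males.loc[top_idx].set_index("Year")["Name"]

print(mean_pct)
print(top_male)
```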
Feature engineering is the process of transforming raw features into more informative features that can be used in modeling or EDA tasks.
Feature Functions
As the number of features grows, we can capture arbitrarily complex relationships. Suppose we wish to develop a model to predict a vehicle's fuel efficiency ("mpg") as a function of its horsepower ("hp"). Glancing at the plot below, we see tha..

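One common feature function is a power of an existing feature. The sketch below compares a linear fit of mpg on hp with a fit that also uses hp squared. The data are synthetic and noise-free, and the choice of a squared feature is an assumption for illustration; the excerpt is cut off before it says which transformation it uses.

```python
import numpy as np

# Synthetic (made-up) data with a curved hp-to-mpg relationship.
hp = np.array([50.0, 100.0, 150.0, 200.0, 250.0])
mpg = 60.0 - 0.35 * hp + 0.0007 * hp**2

def design(hp, degree):
    """Design matrix with columns hp**0, hp**1, ..., hp**degree."""
    return np.column_stack([hp**d for d in range(degree + 1)])

def fit_mse(X, y):
    """Least-squares fit of y on X; returns the mean squared error."""
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.mean((X @ theta - y) ** 2))

mse_linear = fit_mse(design(hp, 1), mpg)      # features: 1, hp
mse_quadratic = fit_mse(design(hp, 2), mpg)   # features: 1, hp, hp^2

print(mse_linear, mse_quadratic)
```

The model stays linear in its parameters; only the features change, which is why ordinary least squares still applies.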
import re
text = "Moo"
pattern = r"<[^>]+>"
re.sub(pattern, '', text) # returns 'Moo'
Notice the r preceding the regular expression pattern; this specifies that the regular expression is a raw string. Raw strings do not recognize escape sequences. This makes them useful for regular expressions, which often contain literal '\' characters.
data = {"HTML": ["Moo",
                 "Link",
                 "Bold text"]}
html_data = pd.Data..

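Here is a self-contained sketch of stripping HTML tags with re.sub and with the pandas equivalent, Series.str.replace. The HTML strings below are hypothetical stand-ins, since the original markup did not survive in the excerpt above.

```python
import re
import pandas as pd

# A raw string keeps backslashes literal: r"\n" is two characters.
raw_len = len(r"\n")     # backslash + 'n'
cooked_len = len("\n")   # a single newline character

# '<', then one or more characters that are not '>', then '>'.
pattern = r"<[^>]+>"

# Hypothetical HTML snippets (assumed for illustration).
html_data = pd.DataFrame({"HTML": [
    "<div><td valign='top'>Moo</td></div>",
    "<a href='http://example.com'>Link</a>",
    "<b>Bold text</b>",
]})

# Plain Python: remove tags from one string.
single = re.sub(pattern, "", html_data["HTML"][0])

# pandas: apply the same substitution to every row of the column.
stripped = html_data["HTML"].str.replace(pattern, "", regex=True)

print(single)
print(stripped.tolist())
```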
Regex: Regular Expression
Regexes are useful in many applications beyond data science.
# For example, Social Security Numbers (SSNs)
r"[0-9]{3}-[0-9]{2}-[0-9]{4}"
# Regular Expression Syntax
# 3 of any digit, then a dash,
# then 2 of any digit, then a dash,
# then 4 of any digit
# result: '[0-9]{3}-[0-9]{2}-[0-9]{4}'
Basics Regex Syntax Questions! Convenient Regex

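The SSN pattern above can be tried out with Python's re module. The sample strings are made up for illustration.

```python
import re

# 3 digits, dash, 2 digits, dash, 4 digits.
ssn_pattern = r"[0-9]{3}-[0-9]{2}-[0-9]{4}"

# Made-up text containing one SSN-shaped string and one phone number.
text = "My SSN is 123-45-6789 and my phone is 555-1234."

# findall returns every non-overlapping match in the text.
matches = re.findall(ssn_pattern, text)

# fullmatch requires the whole string to fit the pattern.
is_valid = re.fullmatch(ssn_pattern, "123-45-6789") is not None
is_invalid = re.fullmatch(ssn_pattern, "12-345-6789") is None

print(matches, is_valid, is_invalid)
```

Note that the pattern only checks the shape of the string; it cannot tell whether a number was actually issued.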
There are two main reasons for working with text in pandas:
1. Canonicalization: Convert data that has multiple formats into a standard form. By manipulating text, we can join tables with mismatched string labels.
2. Extract information into a new feature. For example, we can extract date and time features from text.
Python String Methods
# In pandas (Series)
# s.lower(_) in Python
ser.str.lower(_)
# s..

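Both uses can be sketched with pandas string methods. The tables, the canonicalize helper, and the log line below are hypothetical and only meant to illustrate the two ideas.

```python
import pandas as pd

# 1. Canonicalization: the same counties, labeled differently (made-up data).
left = pd.DataFrame({
    "County": ["De Witt County", "St. Martin Parish"],
    "Population": [16000, 52000],
})
right = pd.DataFrame({
    "County": ["DeWitt", "St Martin"],
    "State": ["IL", "LA"],
})

def canonicalize(ser):
    """Hypothetical helper: lowercase, then drop spaces, periods, suffixes."""
    return (
        ser.str.lower()
        .str.replace(" ", "", regex=False)
        .str.replace(".", "", regex=False)
        .str.replace("county", "", regex=False)
        .str.replace("parish", "", regex=False)
    )

# Join on the canonical key instead of the raw labels.
merged = left.assign(key=canonicalize(left["County"])).merge(
    right.assign(key=canonicalize(right["County"])),
    on="key",
    suffixes=("_left", "_right"),
)

# 2. Extraction: pull date parts out of a made-up log line.
log = pd.Series(["[26/Jan/2014:10:47:58 -0800] GET /index.html"])
parts = log.str.extract(r"\[(?P<day>\d+)/(?P<month>\w+)/(?P<year>\d+)")

print(merged[["key", "Population", "State"]])
print(parts)
```

Each named group in the pattern becomes a column of the extracted DataFrame.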