pandas

Computer Science 🌋/Machine Learning🐼

SQL in Pandas Review

Schemas: A schema describes all relations and their attribute names and types. Granularity (what does one record in each table represent?) Primary and foreign keys. Representation: CREATE TABLE users( id INTEGER PRIMARY KEY, name TEXT ); CREATE TABLE orders( item TEXT PRIMARY KEY, price NUMERIC, name TEXT ); GROUP BY and HAVING # SQL SELECT max(name), legs, weight FROM animals GROUP BY legs, weight HAVIN..
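
As a rough pandas counterpart to the GROUP BY query in this excerpt, the sketch below groups a small table by `legs` and `weight`. The rows are invented for illustration, and because the HAVING condition is cut off in the excerpt, the filter used here (keep groups with more than one row) is only an assumption.

```python
import pandas as pd

# Hypothetical rows; only the column names come from the SQL excerpt.
animals = pd.DataFrame({
    "name": ["cat", "dog", "emu", "ant", "bee"],
    "legs": [4, 4, 2, 6, 6],
    "weight": [10, 10, 40, 1, 1],
})

# SELECT max(name), legs, weight FROM animals GROUP BY legs, weight
grouped = animals.groupby(["legs", "weight"], as_index=False)["name"].max()

# HAVING-style filter. The real condition is truncated in the excerpt,
# so "more than one row per group" is an assumed stand-in.
big_groups = animals.groupby(["legs", "weight"]).filter(lambda g: len(g) > 1)

print(grouped)
print(big_groups)
```

`groupby(...).filter(...)` keeps the original rows of every group that passes the test, which is the closest pandas analogue to HAVING when the rows themselves are still needed.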

Text Fields Review

Text Fields and Data Cleaning / EDA Extract quantitative values from text: dates, times, positions, etc. Determine if missing values are denoted # split time_str = first.split('[')[1].split(' ', 1)[0] # '26/Jan/2014:10:47:58' day, month, rest = time_str.split('/') # ['26', 'Jan', '2014:10:47:58'] year, hour, minute, second = rest.split(':') # ['2014', '10', '47', '58'] year, month, day, hour, mi..
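
The split-based extraction in this excerpt can be tried end to end with the sketch below. The log line is hypothetical: only the timestamp `26/Jan/2014:10:47:58` appears in the excerpt. The `datetime.strptime` call is an alternative not shown in the excerpt, added for comparison.

```python
from datetime import datetime

# Hypothetical log line; only the timestamp is taken from the excerpt.
first = '127.0.0.1 - - [26/Jan/2014:10:47:58 -0800] "GET /index.html HTTP/1.1" 200 2585'

# Pull out the text between '[' and the first space after it.
time_str = first.split('[')[1].split(' ', 1)[0]    # '26/Jan/2014:10:47:58'
day, month, rest = time_str.split('/')             # ['26', 'Jan', '2014:10:47:58']
year, hour, minute, second = rest.split(':')       # ['2014', '10', '47', '58']

# Alternative: let the standard library parse the same string in one step.
parsed = datetime.strptime(time_str, "%d/%b/%Y:%H:%M:%S")

print(year, month, day, hour, minute, second)
print(parsed)
```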

EDA Review

Goals of EDA Data Types: What kinds of data do we have? Granularity: How fine/coarse is each datum? Scope: How (in)complete are the data? Temporality: How are the data situated in time? Faithfulness: How accurately do the data describe the world? Data Type: Nominal Data: categories without natural ordering Ordinal Data: categories with natural ordering Numerical Data: amounts or quantities Compu..

Data Cleaning Review

Bad Data: All of these are commonly seen in the real world: Zeros replace missing values. Spelling is inconsistent (esp. with human-entered data). Rows are duplicated. Date formats are inconsistent (e.g. 10/9/15 vs. 9/10/15). Units are not specified. Rectangular Data: Easy to manipulate, visualize, and combine. Tables (DataFrames): Each labeled column has values of the same type. Manipulated using group, sort, join..
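
A minimal sketch of how three of the problems listed in this excerpt (zeros standing in for missing values, inconsistent spelling, duplicated rows) might be handled in pandas. The table and its column names are made up, and treating `0` as "missing" is an assumption that only makes sense when zero is not a valid reading.

```python
import numpy as np
import pandas as pd

# Hypothetical data showing three of the issues from the list.
df = pd.DataFrame({
    "city": ["Berkeley", "berkeley ", "Oakland", "Oakland"],
    "temp": [61, 0, 58, 58],
})

# Inconsistent spelling: strip whitespace and normalize capitalization.
df["city"] = df["city"].str.strip().str.title()

# Zeros replacing missing values: mark them as NaN (assumes 0 is never a real reading).
df["temp"] = df["temp"].replace(0, np.nan)

# Duplicated rows: keep the first copy of each.
clean = df.drop_duplicates().reset_index(drop=True)

print(clean)
```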

Pandas part 2 Review

Python list: Pandas: The word "index" refers to the collection of labels for each row. groupby: Harder Questions: What was the most popular male name during each year in the data? What are the three states with the most babies born? With groupby, these questions are easy to approach. # average of percent, grouped by Party df['%'].groupby(df['Party']).mean() # return the minimum value, grouped by Party df['%'].group..
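
The two groupby calls in this excerpt can be run against a small table like the one below. The values are invented; only the column names `Party` and `%` come from the excerpt.

```python
import pandas as pd

# Hypothetical data with the two columns used in the excerpt.
df = pd.DataFrame({
    "Party": ["A", "B", "A", "B"],
    "%": [40.0, 60.0, 50.0, 20.0],
})

# Average of '%', grouped by Party (the form used in the excerpt).
mean_by_party = df["%"].groupby(df["Party"]).mean()

# Minimum of '%', grouped by Party (an equivalent, more common spelling).
min_by_party = df.groupby("Party")["%"].min()

print(mean_by_party)
print(min_by_party)
```

Both spellings return a Series indexed by the group label, so `mean_by_party["A"]` reads off a single group's result.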

Pandas part 1 Review

Question: weird = pd.DataFrame({1:["topdog","botdog"], "1":["topcat","botcat"]}) weird Try to predict the output of the following: weird[1] weird["1"] weird[1:] Name --> [ ] --> Series (Single Column Selection) List --> [ ] --> DataFrame (Multiple Column Selection) Numeric Slice --> [ ] --> DataFrame (Multiple Row Selection) Answer: weird[1] weird["1"], weird[['1']], weird['1'] weird[1:] # bool..
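
The three selection rules in this excerpt can be checked directly. The sketch below rebuilds the `weird` DataFrame from the question and records what each form of `[ ]` returns.

```python
import pandas as pd

weird = pd.DataFrame({1: ["topdog", "botdog"], "1": ["topcat", "botcat"]})

col_int = weird[1]       # name -> Series: the column labeled with the integer 1
col_str = weird["1"]     # name -> Series: the column labeled with the string "1"
as_frame = weird[["1"]]  # list -> DataFrame with a single column
rows = weird[1:]         # numeric slice -> DataFrame of rows from position 1 onward

print(type(col_int).__name__, col_int.tolist())
print(type(col_str).__name__, col_str.tolist())
print(type(as_frame).__name__, as_frame.shape)
print(type(rows).__name__, rows.shape)
```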

Feature Engineering

Feature engineering is the process of transforming raw features into more informative features that can be used in modeling or EDA tasks. Feature Functions: As the number of features grows, we can capture arbitrarily complex relationships. Suppose we wish to develop a model to predict a vehicle's fuel efficiency ("mpg") as a function of its horsepower ("hp"). Glancing at the plot below, we see tha..
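
To illustrate the idea of feature functions in this excerpt, here is a small sketch that adds a squared-horsepower feature before fitting a least-squares model. The data are synthetic and deliberately generated from a curve, which is an assumption; the excerpt is cut off before it says what the plot shows. The helper `fit_and_predict` is a hypothetical name.

```python
import numpy as np

# Synthetic data, invented for illustration only (not taken from the post).
hp = np.array([50.0, 100.0, 150.0, 200.0, 250.0])
mpg = 60 - 0.3 * hp + 0.0005 * hp**2

def fit_and_predict(features, target):
    """Least-squares fit of target on the feature matrix; returns the fitted values."""
    weights, *_ = np.linalg.lstsq(features, target, rcond=None)
    return features @ weights

# Raw features: intercept and hp.
linear_features = np.column_stack([np.ones_like(hp), hp])
# Engineered features: intercept, hp, and hp squared.
quadratic_features = np.column_stack([np.ones_like(hp), hp, hp**2])

linear_error = np.mean((mpg - fit_and_predict(linear_features, mpg)) ** 2)
quadratic_error = np.mean((mpg - fit_and_predict(quadratic_features, mpg)) ** 2)

print(linear_error, quadratic_error)
```

The model stays linear in its weights; only the feature matrix changes, which is what lets a linear fit follow a curved relationship.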

Canonicalization

import re text = "Moo" pattern = r"<[^>]+>" re.sub(pattern, '', text) # returns 'Moo' Notice the r preceding the regular expression pattern; this specifies that the regular expression is a raw string. Raw strings do not recognize escape sequences. This makes them useful for regular expressions, which often contain literal '\' characters. data = {"HTML": ["Moo", \ "Link", \ "Bold text"]} html_data = pd.Data..
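
A runnable sketch of tag stripping with `re.sub` and with the pandas `.str.replace` accessor. The HTML strings are hypothetical stand-ins, since the markup in the excerpt did not survive; the column name `HTML` follows the excerpt.

```python
import re
import pandas as pd

# A raw string keeps backslashes literal: r"\n" is two characters, "\n" is one.
assert len(r"\n") == 2 and len("\n") == 1

# Matches '<', then one or more characters that are not '>', then '>'.
pattern = r"<[^>]+>"

# Hypothetical input; the original markup is not recoverable from the excerpt.
text = "<div><td valign='top'>Moo</td></div>"
stripped = re.sub(pattern, "", text)

# The same pattern applied to a whole column.
html_data = pd.DataFrame({"HTML": [
    "<div><td valign='top'>Moo</td></div>",
    "<a href='http://example.com'>Link</a>",
    "<b>Bold text</b>",
]})
html_data["clean"] = html_data["HTML"].str.replace(pattern, "", regex=True)

print(stripped)
print(html_data["clean"].tolist())
```

This pattern is fine for simple canonicalization, but it is not a general HTML parser.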

Regex

Regex: Regular Expression. Regexes are useful in many applications beyond data science. # For example, Social Security Numbers (SSNs) r"[0-9]{3}-[0-9]{2}-[0-9]{4}" # Regular Expression Syntax # 3 of any digit, then a dash, # then 2 of any digit, then a dash, # then 4 of any digit # result: '[0-9]{3}-[0-9]{2}-[0-9]{4}' Basic Regex Syntax Questions! Convenient Regex
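
The SSN pattern from this excerpt can be exercised with Python's `re` module. The sample sentence and the number in it are made up for illustration.

```python
import re

# 3 digits, a dash, 2 digits, a dash, 4 digits (pattern from the excerpt).
ssn_pattern = r"[0-9]{3}-[0-9]{2}-[0-9]{4}"

# Hypothetical text containing one SSN-shaped string and one phone number.
text = "My SSN is 123-45-6789 and my phone is 555-1234."

matches = re.findall(ssn_pattern, text)

# fullmatch checks that an entire string has the SSN shape.
is_valid = re.fullmatch(ssn_pattern, "123-45-6789") is not None
is_invalid = re.fullmatch(ssn_pattern, "12-345-6789") is not None

print(matches, is_valid, is_invalid)
```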

Python String Method

There are two main reasons for working with text in pandas: 1. Canonicalization: Convert data that has multiple formats into a standard form. By manipulating text, we can join tables with mismatched string labels. 2. Extract information into a new feature. For example, we can extract date and time features from text. Python String Methods # In pandas (Series) # s.lower(_) in python ser.str.lower(_) # s..
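
A short sketch of canonicalization using the `.str` accessor next to the equivalent plain-Python string methods. The county labels are invented examples of mismatched string labels.

```python
import pandas as pd

# Hypothetical labels that refer to the same county but are formatted differently.
ser = pd.Series(["  Alameda County", "alameda county ", "ALAMEDA COUNTY"])

# pandas: vectorized string methods on a Series.
canonical = ser.str.lower().str.strip().str.replace(" county", "", regex=False)

# Plain Python: the same methods applied one string at a time.
canonical_py = [s.lower().strip().replace(" county", "") for s in ser]

print(canonical.tolist())
print(canonical_py)
```

Once both tables carry the same canonical label, they can be joined on that column.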

KB0129
Posts tagged 'pandas'