To compute this, we can start with the 2019 example:

tb_census_df["recompute incidence 2019"] = tb_census_df["TB cases 2019"] / tb_census_df["2019"] * 100000
tb_census_df.head(5)

# recompute incidence for all years
for year in [2019, 2020, 2021]:
    tb_census_df[f"recompute incidence {year}"] = tb_census_df[f"TB cases {year}"] / tb_census_df[f"{year}"] * 100000
tb_census_df.head(5)

# useful to explore the hundredths ..
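To sanity-check the recomputed rates, one option is to compare them against the incidence the CDC already reports. A minimal sketch, assuming the reported rates sit in columns named like "TB incidence 2019" (those column names are an assumption here, not confirmed by the snippet above):

# Compare recomputed incidence against the reported incidence, year by year.
# Columns such as "TB incidence 2019" are assumed for illustration.
for year in [2019, 2020, 2021]:
    diff = (
        tb_census_df[f"recompute incidence {year}"]
        - tb_census_df[f"TB incidence {year}"]
    )
    # small differences in the hundredths are expected from rounding in the source data
    print(year, diff.abs().max())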
Gather more data: Census

# 2010s census data
census_2010s_df = pd.read_csv("data/nst-est2019-01.csv", header=3, thousands=",")
census_2010s_df = (
    census_2010s_df
    .reset_index()
    .drop(columns=["index", "Census", "Estimates Base"])
    .rename(columns={"Unnamed: 0": "Geographic Area"})
    .convert_dtypes()  # "smart" converting of columns, use at your own risk
    .dropna()          # we'll introduce this next time
) ..
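The thousands="," argument tells pandas to strip commas before parsing numbers, so population counts load as integers rather than strings. A small, self-contained sketch of that behavior (the toy data below is made up, not from the census file):

import io
import pandas as pd

raw = 'State,2019\nCalifornia,"39,512,223"\nTexas,"28,995,881"\n'

# Without thousands=",", the population counts stay as strings ("object" dtype)
str_df = pd.read_csv(io.StringIO(raw))
# With thousands=",", the commas are stripped and the column parses as integers
num_df = pd.read_csv(io.StringIO(raw), thousands=",")

print(str_df["2019"].dtype, num_df["2019"].dtype)  # object vs. int64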
# check out the first three lines:
with open("data/cdc_tuberculosis.csv", "r") as f:
    i = 0
    for row in f:
        print(row)
        i += 1
        if i >= 3:
            break

※ Python's print() prints each string (including its newline), and then adds an additional newline on top of that.

# We can use the repr() function to return the raw string with all special characters
with open("data/cdc_tuberculosis.csv", "r") as f:
    i = 0
    for row in f: ..
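Here is a tiny standalone sketch (using a made-up row) of why print() produces blank lines between rows and how repr() exposes the trailing newline:

row = "U.S. jurisdiction,TB cases 2019,TB cases 2020\n"

print(row)        # the string's own newline prints, then print() adds another -> a blank line follows
print(repr(row))  # shows the raw string, so the trailing "\n" escape is visible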
Data Cleaning (Data Wrangling)
Data cleaning is the process of transforming raw data to facilitate subsequent analysis. It is often used to address issues like:
- Unclear structure or formatting
- Missing or corrupted values
- Unit conversions

Exploratory Data Analysis (EDA)
EDA is the process of understanding a new dataset. It is an open-ended, informational analysis that involves familiarizing ourselves with the variables ..
# This 'str' operation splits each candidate's full name at each
# blank space, then takes just the candidate's first name
elections["First Name"] = elections["Candidate"].str.split().str[0]
elections.head(5)

# Here, we'll only consider `babynames` data from 2020
babynames_2020 = babynames[babynames["Year"] == 2020]
babynames_2020.head()

Now, we are ready to join the two tables with pd.merge().

merged = pd.merge( ..
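The merge call itself is cut off above; a plausible sketch of the join, assuming we want to match each candidate's first name against the baby name records (the specific keyword arguments are illustrative, not taken from the original):

# Join the two tables on the candidate's first name vs. the recorded baby name.
# left_on / right_on let us merge on columns that have different names in each table.
merged = pd.merge(
    left=elections,
    right=babynames_2020,
    left_on="First Name",   # column in `elections`
    right_on="Name",        # column in `babynames_2020`
)
merged.head()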
# Find the total number of baby names associated with each sex for each year in the data
babynames.groupby(["Year", "Sex"])[["Count"]].agg(sum).head(6)

# The 'pivot_table' method is used to generate a Pandas pivot table
import numpy as np
babynames.pivot_table(index="Year", columns="Sex", values="Count", aggfunc=np.sum).head(5)

# We can also include multiple values in the index or columns of o..
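As a sketch of that last point, one way to put multiple value columns into a pivot table is to pass a list to values; the result then has hierarchical (value, Sex) columns. The choice of "max" as the aggregation here is only for illustration:

# Pivot on Year x Sex while keeping two value columns at once.
# The result has MultiIndex columns: ("Count", "F"), ("Count", "M"), ("Name", "F"), ("Name", "M")
babynames.pivot_table(
    index="Year",
    columns="Sex",
    values=["Count", "Name"],
    aggfunc="max",   # per-group maximum of each value column
).head(5)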
Numpy

bella_counts = babynames[babynames["Name"] == "Bella"]["Count"]

# Average number of babies named Bella each year
np.mean(bella_counts)

# Max number of babies named Bella born in a given year
max(bella_counts)

.shape & .size

# return a tuple containing the number of rows and columns
babynames.shape

# return the total number of elements in a structure, equivalent to the number of rows times ..
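A quick illustration of the two attributes on a small made-up frame:

import pandas as pd

df = pd.DataFrame({"Name": ["Bella", "Alex", "Sam"], "Count": [421, 106, 25]})

df.shape  # (3, 2): 3 rows, 2 columns
df.size   # 6: 3 rows * 2 columns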
Conditional Selection

# Ask yourself: why is :9 the correct slice to select the first 10 rows?
babynames_first_10_rows = babynames.loc[:9, :]

# Notice how we have exactly 10 elements in our boolean array argument
babynames_first_10_rows[[True, False, True, False, True, False, True, False, True, False]]

To make things easier, we can instead provide a logical condition as an input to .loc or [] ..
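For example, a comparison on a column produces the boolean Series for us, so we never have to write the True/False list by hand. The particular condition below is just one possible choice:

# Boolean condition: keep only rows where the name was given to fewer than 10 babies
logical_operator = babynames["Count"] < 10
babynames[logical_operator].head()

# The same condition works inside .loc, which also lets us select specific columns
babynames.loc[babynames["Count"] < 10, ["Name", "Count"]].head()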
Data cleaning: Data cleaning corrects issues in the structure and formatting of data, including missing values and unit conversions.

Exploratory data analysis (EDA): EDA describes the process of transforming raw data into insightful observations. It is an open-ended analysis of transforming, visualizing, and summarizing patterns in data.

# 'pd' is the conventional alias for Pandas, as 'np' is for NumPy ..
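For reference, the standard import statements that establish these aliases:

import numpy as np
import pandas as pd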