import re
text = "<div><td>Moo</td></div>"
pattern = r"<[^>]+>"
re.sub(pattern, '', text) # returns 'Moo'

Notice the r preceding the regular expression pattern; this specifies that the regular expression is a raw string. Raw strings do not recognize escape sequences. This makes them useful for regular expressions, which often contain literal '\' characters.

data = {"HTML": ["<div><td>Moo</td></div>", \
                 "<a>Link</a>", \
                 "<b>Bold text</b>"]}
html_data = pd.Data..
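A minimal sketch of the raw-string point above: in a regular string "\n" collapses to one newline character, while in a raw string the backslash and the 'n' stay as two separate characters.

```python
# Raw vs. regular strings: a raw string keeps the backslash literally,
# so r"\n" is two characters instead of one newline.
regular = "a\nb"   # 'a', newline, 'b' -> 3 characters
raw = r"a\nb"      # 'a', '\', 'n', 'b' -> 4 characters

print(len(regular))  # 3
print(len(raw))      # 4
```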
Regex: Regular Expressions

Regex are useful in many applications beyond data science.

# For example, Social Security Numbers (SSNs):
r"[0-9]{3}-[0-9]{2}-[0-9]{4}"

# Regular expression syntax:
# 3 of any digit, then a dash,
# then 2 of any digit, then a dash,
# then 4 of any digit
# result: '[0-9]{3}-[0-9]{2}-[0-9]{4}'

Basics
Regex Syntax
Questions!
Convenient Regex
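To see the SSN pattern in action, here is a small sketch (the example strings are hypothetical) using re.fullmatch, which only succeeds when the whole string fits the pattern:

```python
import re

# The SSN pattern from the notes: ddd-dd-dddd.
ssn_pattern = r"[0-9]{3}-[0-9]{2}-[0-9]{4}"

# fullmatch returns a Match object on success, None on failure.
print(bool(re.fullmatch(ssn_pattern, "123-45-6789")))  # True
print(bool(re.fullmatch(ssn_pattern, "123-456-789")))  # False: groups are 3-2-4
```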
There are two main reasons for working with text in pandas:
1. Canonicalization: convert data that has multiple formats into a standard form. By manipulating text, we can join tables with mismatched string labels.
2. Extract information into a new feature. For example, we can extract date and time features from text.

Python String Methods
# In Python         # In pandas (Series)
# s.lower()         ser.str.lower()
# s..
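A quick sketch of the Python-vs-pandas correspondence, on a toy Series (not the notes' dataset): each .str method applies the matching Python string method element-wise.

```python
import pandas as pd

# Toy Series to illustrate element-wise string methods.
ser = pd.Series(["Alice SMITH", "Bob JONES"])

# s.lower() on each element:
print(ser.str.lower().tolist())         # ['alice smith', 'bob jones']

# s.split() on each element, then take the first token:
print(ser.str.split().str[0].tolist())  # ['Alice', 'Bob']
```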
To compute this, for the 2019 example:

tb_census_df["recompute incidence 2019"] = tb_census_df["TB cases 2019"] / tb_census_df["2019"] * 100000
tb_census_df.head(5)

# recompute incidence for all years
for year in [2019, 2020, 2021]:
    tb_census_df[f"recompute incidence {year}"] = tb_census_df[f"TB cases {year}"] / tb_census_df[f"{year}"] * 100000
tb_census_df.head(5)

# useful to explore the hundredths ..
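The loop above can be sketched on toy data (the column names mirror the notes; the numbers are made up): incidence is cases per 100,000 people.

```python
import pandas as pd

# Toy stand-in for tb_census_df: case counts and population by year.
df = pd.DataFrame({
    "TB cases 2019": [100],
    "2019": [1_000_000],   # population
})

# incidence = cases / population * 100,000
for year in [2019]:
    df[f"recompute incidence {year}"] = (
        df[f"TB cases {year}"] / df[f"{year}"] * 100_000
    )

print(df["recompute incidence 2019"].iloc[0])  # 10.0
```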
Gather more data: Census

# 2010s census data
census_2010s_df = pd.read_csv("data/nst-est2019-01.csv", header=3, thousands=",")
census_2010s_df = (
    census_2010s_df
    .reset_index()
    .drop(columns=["index", "Census", "Estimates Base"])
    .rename(columns={"Unnamed: 0": "Geographic Area"})
    .convert_dtypes()  # "smart" converting of columns, use at your own risk
    .dropna()          # we'll introduce this next time
)..
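The thousands="," option is doing real work here. A minimal sketch on an in-memory CSV (hypothetical state and population): the commas inside the quoted number are stripped during parsing, so the column arrives as an integer dtype rather than a string.

```python
import io
import pandas as pd

# A tiny CSV where the population value contains thousands separators.
csv = io.StringIO('state,pop\nCalifornia,"39,500,000"\n')

# thousands="," tells the parser to drop the separators while reading.
df = pd.read_csv(csv, thousands=",")

print(df["pop"].iloc[0])  # 39500000
```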
Row 0 is a rollup record: the granularity of record 0 differs from the rest of the records (states).

# the sum of all state cases
tb_df.sum(axis=0)
# If we sum over all rows, we should get 2x the total cases for each year,
# since the rollup record already contains the total.

# check out the column types
tb_df.dtypes

The commas cause all TB cases to be read as the object datatype, or storage type (close to the Python string datatype),..
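One common fix for comma-formatted object columns (a sketch on toy data, not the notes' exact code) is to strip the separators and cast:

```python
import pandas as pd

# Comma-formatted counts read in as strings (object dtype).
ser = pd.Series(["1,234", "56,789"])

# Strip the commas, then convert to integers so arithmetic works.
cleaned = ser.str.replace(",", "", regex=False).astype(int)

print(cleaned.sum())  # 58023
```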
# check out the first three lines:
with open("data/cdc_tuberculosis.csv", "r") as f:
    i = 0
    for row in f:
        print(row)
        i += 1
        if i >= 3:
            break

※ Python's print() prints each string (including its newline), and an additional newline on top of that.

# We can use the repr() function to return the raw string with all special characters
with open("data/cdc_tuberculosis.csv", "r") as f:
    i = 0
    for row in f:..
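The repr() idea can be shown without the file (the row here is a hypothetical CSV line): repr renders special characters explicitly instead of interpreting them.

```python
# repr() shows the trailing newline as the escape sequence \n
# instead of printing it as a line break.
row = "value1,value2\n"   # a hypothetical CSV row

print(row)        # prints the row, then an extra blank line
print(repr(row))  # 'value1,value2\n'  (the \n shown literally, in quotes)
```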
Data Cleaning (Data Wrangling)

Data cleaning is the process of transforming raw data to facilitate subsequent analysis. It is used to address issues like:
- Unclear structure or formatting
- Missing or corrupted values
- Unit conversions

Exploratory Data Analysis (EDA)

EDA is the process of understanding a new dataset. It is an open-ended, informational analysis that involves familiarizing ourselves with the variabl..
# This 'str' operation splits each candidate's full name at each
# blank space, then takes just the candidate's first name
elections["First Name"] = elections["Candidate"].str.split().str[0]
elections.head(5)

# Here, we'll only consider `babynames` data from 2020
babynames_2020 = babynames[babynames["Year"] == 2020]
babynames_2020.head()

Now we are ready to join the two tables with pd.merge().

merged = pd.merge(..
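The merge being set up above can be sketched on toy frames (the frame and column names here are hypothetical stand-ins for the elections/babynames tables): rows pair up wherever the left key matches the right key.

```python
import pandas as pd

# Toy stand-ins for the two tables being joined.
left = pd.DataFrame({"First Name": ["Andrew", "Bob"], "Party": ["D", "R"]})
right = pd.DataFrame({"Name": ["Andrew", "Bob"], "Count": [100, 50]})

# Join on differently named key columns via left_on / right_on.
merged = pd.merge(left=left, right=right,
                  left_on="First Name", right_on="Name")

print(merged.shape)  # (2, 4): both key columns are kept
```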
# Find the total number of baby names associated with each sex for each year in the data
babynames.groupby(["Year", "Sex"])[["Count"]].agg(sum).head(6)

# The `pivot_table` method is used to generate a pandas pivot table
import numpy as np
babynames.pivot_table(index="Year", columns="Sex", values="Count", aggfunc=np.sum).head(5)

# This includes multiple values in the index or columns of o..
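A self-contained sketch of the pivot-table call on toy data (made-up counts, same column names as above): the index becomes the rows, the columns argument fans out into one column per category, and aggfunc combines the values in each cell.

```python
import pandas as pd

# Toy frame with the same Year / Sex / Count schema as babynames.
df = pd.DataFrame({
    "Year":  [2020, 2020, 2021, 2021],
    "Sex":   ["F", "M", "F", "M"],
    "Count": [10, 20, 30, 40],
})

# Rows = Year, columns = Sex, cells = sum of Count.
table = df.pivot_table(index="Year", columns="Sex",
                       values="Count", aggfunc="sum")

print(table.loc[2020, "F"])  # 10
```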