# Pandas: The data analysis and manipulation tool

## About
```python
import pandas as pd

# Treat all of these strings as NaN when reading the CSV
na_values = ['NA', 'Missing', 'Null']
df = pd.read_csv('abc.csv', na_values=na_values)
```
## Pandas options
| Command | Description |
|---|---|
| `pd.set_option('display.max_columns', 10)` | Show up to 10 columns in a Jupyter notebook |
| `pd.set_option('display.max_rows', 40)` | Show up to 40 rows in a Jupyter notebook |
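These options apply for the whole session and can be read back with `pd.get_option`; a minimal sketch:

```python
import pandas as pd

# Limit how many columns/rows are rendered in a notebook
pd.set_option('display.max_columns', 10)
pd.set_option('display.max_rows', 40)

print(pd.get_option('display.max_columns'))  # 10
```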
## DataFrame attributes
| Command | Description |
|---|---|
| `df.shape` | tuple of (rows, columns), e.g. `(64, 11)` |
| `df.columns` | Index of all column labels |
| `df.index` | Index of all row labels |
| `df.dtypes` | data types of all columns |
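A quick sketch of these attributes on a toy frame (the `age`/`name` columns are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'age': [25, 30, 35], 'name': ['Ann', 'Bob', 'Cid']})

print(df.shape)           # (3, 2)
print(list(df.columns))   # ['age', 'name']
print(list(df.index))     # [0, 1, 2]
print(df.dtypes)          # age is int64, name is object
```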
## Pandas methods

### Informational
| Method | Description |
|---|---|
| `df.info()` | dtype and non-null count of each column |
| `df.head(10)` | first 10 rows (5 if no argument is passed) |
| `df.tail(10)` | last 10 rows (5 if no argument is passed) |
| `df.set_index('10_201909', inplace=False)` | set a column as the index |
| `df.sort_index(ascending=False, inplace=False)` | sort rows by index in descending order |
| `df.reset_index(inplace=True)` | reset the index to the default integer index |
| `df.rename(columns={'col1': 'val1', 'col2': 'val2'}, inplace=True)` | rename only the listed columns |
| `df.apply(len)` | apply `len` to each column (Series) as a whole, returning each column's length |
| `df.applymap(len)` | apply `len` to every individual value of the DataFrame (deprecated since pandas 2.1; use `df.map`) |
| `df.drop(columns=['col1', 'col2'], index=2, inplace=False)` | drop two columns and the row with index label 2 |
| `df.drop(index=df[filt].index)` | drop all rows matched by a boolean filter `filt` |
| `df.append(df2, ignore_index=True, sort=False)` | no `inplace` argument here, so reassign the result to `df`; removed in pandas 2.0, use `pd.concat([df, df2], ignore_index=True)` |
| `df.sort_values(by=['col_name', 'col_2'], ascending=[False, True])` | sort rows by `col_name` (descending), breaking ties with `col_2` (ascending) |
| `df.nlargest(10, 'col_name')` | DataFrame with the rows holding the 10 largest values of `col_name` |
| `df.nsmallest(10, 'col_name')` | DataFrame with the rows holding the 10 smallest values of `col_name` |
| `df.replace('NA', np.nan, inplace=True)` | replace every `'NA'` string in the whole DataFrame with NaN |
| `df.isna()` | boolean DataFrame with `True` for NaN values |
| `df.fillna('other_str')` | replace all NaN with `'other_str'` |
| `df.to_csv('path/modified.csv')` | export the DataFrame to a CSV file |
| `df.dropna(axis='index', how='any', subset=['email'])` | drop rows where `email` is NaN; with several subset columns, `how='any'` drops a row if any of them is NaN, `how='all'` only if all of them are |
| `df['col_name'].sort_values()` | return a sorted Series |
| `df['col_name'].nlargest(10)` | highest 10 values of `col_name`, returned as a Series |
| `df['10_201909'].value_counts()` | frequency of each unique value in the `10_201909` column |
| `df['age'].astype(float)` | NaN is a float, so a numeric column with missing values cannot hold plain ints; converting such columns to float is the safe choice |
| `df['age'].mean()` | mean of the `age` column |
| `df['age'].unique()` | array of all unique values in the `age` column |
| `df['age'].replace('text', 0, inplace=True)` | replace `'text'` with `0` |
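A few of the methods above on a toy frame (the data is made up for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'name': ['Ann', 'Bob', 'Cid', 'Dee'],
                   'age': [40, np.nan, 25, 25]})

# Fill the missing value, keep the column as float (NaN forced it to float anyway)
df['age'] = df['age'].fillna(0).astype(float)

top = df.nlargest(2, 'age')        # two rows with the largest age
counts = df['age'].value_counts()  # frequency of each unique age

# Sort by age descending, break ties by name ascending
df_sorted = df.sort_values(by=['age', 'name'], ascending=[False, True])
```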
### Accessing

#### Columns

| Command | Description |
|---|---|
| `df['10_201909']` | access a single column as a Series |
| `df[['10_201909', '10_201910', '10_201912']]` | access multiple columns as a DataFrame |
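A minimal sketch of column access (the `a`/`b`/`c` column names are made up):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})

s = df['a']           # single column -> Series
sub = df[['a', 'c']]  # list of columns -> DataFrame
```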
#### Rows

| Command | Description |
|---|---|
| `df.iloc[1]` | access the second (single) row; integer positions start at 0 |
| `df.iloc[[0, 1, 3, 7]]` | access multiple rows |
| `df.iloc[[2, 5], 1]` | third and sixth rows of the second column |
| `df.iloc[[1, 5], [3, 8]]` | access by multiple row and column positions |
| `df.loc[[0, 1, 3], ['10_201909', '10_201910', '10_201912']]` | access by labels instead of positions |
| `df.loc[0:21, '10_201909']` | rows 0 to 21 of the `10_201909` column; label slicing is inclusive of both ends |
| `df.loc[0:5, '10_201909':'10_201912']` | slice rows and columns at once |
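A minimal sketch contrasting position-based `iloc` with label-based `loc` (toy data):

```python
import pandas as pd

df = pd.DataFrame({'a': [10, 20, 30, 40], 'b': [1, 2, 3, 4]})

row = df.iloc[1]            # second row, by integer position
cells = df.iloc[[0, 2], 1]  # first & third rows of the second column
sl = df.loc[0:2, 'a']       # label slice: rows 0..2 INCLUSIVE
```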
## How to

### Expand a DataFrame column of lists into multiple rows

Input:

| | days | name |
|---|---|---|
| 0 | [1, 3, 5, 7] | John |
| 1 | [2, 4] | Eric |

Output:

| | days | name |
|---|---|---|
| 0 | 1 | John |
| 1 | 3 | John |
| 2 | 5 | John |
| 3 | 7 | John |
| 4 | 2 | Eric |
| 5 | 4 | Eric |
```python
import numpy as np

# Repeat each name once per element of its days list,
# then flatten the day lists into a single column
lens = [len(item) for item in df['days']]
pd.DataFrame({"name": np.repeat(df['name'].values, lens),
              "days": np.hstack(df['days'])})
```
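Since pandas 0.25 the same expansion is available directly via `DataFrame.explode`; a minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'days': [[1, 3, 5, 7], [2, 4]],
                   'name': ['John', 'Eric']})

out = df.explode('days').reset_index(drop=True)
# exploded values keep object dtype; cast if numbers are needed
out['days'] = out['days'].astype(int)
```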