CTRL K

Pandas Complete Reference: Cheat Sheet and Statistical Functions

Mr Nanko,8/30/2018

数据基础操作
数据高级处理
统计分析
- Pandas 数据对象的轴参数
- 统计函数速查表
总结
参考

Numpy 和 Pandas 是最受欢迎的 Python 数据科学和分析库。Numpy 用于较低级别的科学计算，而 Pandas 构建于 Numpy 之上，专为 Python 中的实际数据分析而设计。本文整理了 Pandas 最常用的功能和统计函数，方便快速查阅。

数据基础操作

导入数据

任何类型的数据分析都从获取某些数据开始。Pandas 为您提供了很多将数据导入 Python 工作簿的选项：


pd.read_csv(filename) # From a CSV file
pd.read_table(filename) # From a delimited text file (like TSV)
pd.read_excel(filename) # From an Excel file
pd.read_sql(query, connection_object) # Reads from a SQL table/database
pd.read_json(json_string) # Reads from a JSON formatted string, URL or file.
pd.read_html(url) # Parses an html URL, string or file and extracts tables to a list of dataframes
pd.read_clipboard() # Takes the contents of your clipboard and passes it to read_table()
pd.DataFrame(dict) # From a dict, keys for columns names, values for data as lists

探索数据

将数据导入 Pandas 数据帧后，可以使用这些方法来了解数据的外观：


df.shape() # Prints number of rows and columns in dataframe
df.head(n) # Prints first n rows of the DataFrame
df.tail(n) # Prints last n rows of the DataFrame
df.info() # Index, Datatype and Memory information
df.describe() # Summary statistics for numerical columns
df[col].value_counts(dropna=False) # Views unique values and counts for a column
df.apply(pd.Series.value_counts) # Unique values and counts for all columns
df.mean() # Returns the mean of all columns
df.corr() # Returns the correlation between columns in a DataFrame
df.count() # Returns the number of non-null values in each DataFrame column
df.max() # Returns the highest value in each column
df.min() # Returns the lowest value in each column
df.median() # Returns the median of each column
df.std() # Returns the standard deviation of each column

选择数据

通常，您可能需要选择单个元素或数据的某个子集来检查它或执行进一步分析。这些方法会派上用场：


df[col] # Returns column with label col as Series
df[[col1, col2]] # Returns Columns as a new DataFrame
df[col].iloc[0] # Selection by position (selects first element of a column)
df[col].loc[0] # Selection by index (selects element at index 0 of a column)
df.iloc[0,:] # First row
df.iloc[0,0] # First element of first column

数据清理

如果您正在使用真实世界的数据，您可能需要清理它。这些是一些有用的方法：


df.columns = ['a','b','c'] # Renames columns
pd.isnull() # Checks for null Values, Returns Boolean Array
pd.notnull() # Checks for non-null Values, Returns Boolean Array
df.dropna() # Drops all rows that contain null values
df.dropna(axis=1) # Drops all columns that contain null values
df.dropna(axis=1,thresh=n) # Drops all rows have have less than n non null values
df.fillna(x) # Replaces all null values with x
df[col].fillna(df[col].mean()) # Replaces all null values in a column with the column mean
df[col].astype(float) # Converts the datatype of a column to float
df[col].replace(1,'one') # Replaces all values equal to 1 with 'one' in a column
df[col].replace([1,3],['one','three']) # Replaces all 1 with 'one' and 3 with 'three' in a column
df.rename(columns=lambda x: x + 1) # Mass renaming of columns
df.rename(columns={'old_name': 'new_ name'}) # Selective renaming
df.set_index('column_one') # Changes the index
df.rename(index=lambda x: x + 1) # Mass renaming of index

数据高级处理

过滤、排序和分组

过滤、排序和分组数据的方法：


df[df[col] > 0.5] # Rows where the col column is greater than 0.5
df[(df[col] > 0.5) & (df[col] < 0.7)] # Rows where 0.5 < col < 0.7
df.sort_values(col1) # Sorts values by col1 in ascending order
df.sort_values(col2,ascending=False) # Sorts values by col2 in descending order
df.sort_values([col1,col2], ascending=[True,False]) # Sorts values by col1 in ascending order then col2 in descending order
df.groupby(col) # Returns a groupby object for values from one column
df.groupby([col1,col2]) # Returns a groupby object values from multiple columns
df.groupby(col1)[col2].mean() # Returns the mean of the values in col2, grouped by the values in col1 (mean can be replaced with almost any function from the statistics section)
df.pivot_table(index=col1, values=[col2,col3], aggfunc=mean) # Creates a pivot table that groups by col1 and calculates the mean of col2 and col3
df.groupby(col1).agg(np.mean) # Finds the average across all columns for every unique column 1 group
df.apply(np.mean) # Applies a function across each column
df.apply(np.max, axis=1) # Applies a function across each row

加入和组合

组合两个数据帧的方法：


df1.append(df2) # Adds the rows in df1 to the end of df2 (columns should be identical)
pd.concat([df1, df2],axis=1) # Adds the columns in df1 to the end of df2 (rows should be identical)
df1.join(df2,on=col1,how='inner') # SQL-style joins the columns in df1 with the columns on df2 where the rows for col1 have identical values

写入数据

当您通过分析生成结果时，有几种方法可以导出数据：


df.to_csv(filename) # Writes to a CSV file
df.to_excel(filename) # Writes to an Excel file
df.to_sql(table_name, connection_object) # Writes to a SQL table
df.to_json(filename) # Writes to a file in JSON format
df.to_html(filename) # Saves as an HTML table
df.to_clipboard() # Writes to the clipboard

统计分析

Pandas 数据对象的轴参数

在使用统计函数时，了解 Pandas 三个数据对象的轴参数很重要：

Series: 没有轴参数
DataFrame: "index" (axis=0, default)，"columns" (axis=1)
Panel: "items" (axis=0)，"major" (axis=1, default)，"minor" (axis=2)

统计函数速查表

以下是 Pandas 中常用的统计函数及其描述：

函数	英文描述	中文描述
count	Number of non-null observations	观测值的个数
sum	Sum of values	求和
mean	Mean of values	求平均值
mad	Mean absolute deviation	平均绝对方差
median	Arithmetic median of values	中位数
min	Minimum	最小值
max	Maximum	最大值
argmin	Calculate the index position (integer) that can get the minimum value	计算能够获取到最小值的索引位置（整数）
argmax	Calculate the index position where the maximum value can be obtained	计算能够获取到最大值的索引位置
idxmin	Row index of each column minimum	每列最小值的行索引
idxmax	Row index of the maximum value per column	每列最大值的行索引
mode	Mode	众数
abs	Absolute Value	绝对值
prod	Product of values	乘积
std	Bessel-corrected sample standard deviation	标准差
var	Unbiased variance	方差
sem	Standard error of the mean	标准误
skew	Sample skewness (3rd moment)	偏度系数
kurt	Sample kurtosis (4th moment)	峰度
quantile	Sample quantile (value at %)	分位数
cumsum	Cumulative sum	累加
cumprod	Cumulative product	累乘
cummax	Cumulative maximum	累最大值
cummin	Cumulative minimum	累最小值
cov()	covariance	协方差
corr()	correlation	相关系数
rank()	rank by values	排名
pct_change()	time change	时间序列变化

总结

本文整理了 Pandas 最常用的操作和统计函数，涵盖了从数据导入、探索、清理到分析的完整流程。这些方法在日常数据分析工作中经常用到，建议收藏备查。

虽然我学过 Pandas，但有时候需要用的时候一时间无法想到具体的函数名或用法。这份备忘单让数据分析工作更加高效！

参考