
Pandas Complete Reference: Cheat Sheet and Statistical Functions

Mr Nanko,

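NumPy and Pandas are the most popular Python libraries for data science and analysis. NumPy handles lower-level scientific computing, while Pandas is built on top of NumPy and designed for practical data analysis in Python. This post collects the most commonly used Pandas features and statistical functions for quick reference.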

Basic Data Operations

Importing Data

Every kind of data analysis starts with getting some data. Pandas gives you many options for importing data into your Python workbook:

    import pandas as pd

    pd.read_csv(filename)  # From a CSV file
    pd.read_table(filename)  # From a delimited text file (like TSV)
    pd.read_excel(filename)  # From an Excel file
    pd.read_sql(query, connection_object)  # Reads from a SQL table/database
    pd.read_json(path_or_buffer)  # From a JSON path, URL, or file-like object (wrap literal JSON text in io.StringIO)
    pd.read_html(url)  # Parses an HTML URL, file, or file-like object and extracts tables to a list of DataFrames
    pd.read_clipboard()  # Takes the contents of your clipboard and passes it to read_csv()
    pd.DataFrame(dict)  # From a dict: keys become column names, values are lists of column data
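As a small illustration of two of the readers above, the sketch below parses CSV text from an in-memory buffer instead of a file and builds the same table from a dict. The column names and values are made up for the example.

```python
import io

import pandas as pd

# Hypothetical sample data; a CSV file path could be passed instead of the buffer.
csv_text = "name,score\nalice,0.9\nbob,0.4\n"

# read_csv accepts a path, a URL, or a file-like object.
df_from_csv = pd.read_csv(io.StringIO(csv_text))

# The same table built from a dict: keys become column names, values the column data.
df_from_dict = pd.DataFrame({"name": ["alice", "bob"], "score": [0.9, 0.4]})

print(df_from_csv)
print(df_from_dict)
```

Reading from a buffer like this is also a convenient way to try out parser options before pointing them at a real file.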

Exploring Data

Once the data is in a Pandas DataFrame, these methods help you see what it looks like:

    df.shape  # Number of rows and columns (an attribute, not a method)
    df.head(n)  # Returns the first n rows of the DataFrame
    df.tail(n)  # Returns the last n rows of the DataFrame
    df.info()  # Index, datatype and memory information
    df.describe()  # Summary statistics for numerical columns
    df[col].value_counts(dropna=False)  # Views unique values and counts for a column
    df.apply(pd.Series.value_counts)  # Unique values and counts for all columns
    df.mean(numeric_only=True)  # Returns the mean of each numeric column
    df.corr(numeric_only=True)  # Returns the correlation between numeric columns
    df.count()  # Returns the number of non-null values in each column
    df.max()  # Returns the highest value in each column
    df.min()  # Returns the lowest value in each column
    df.median(numeric_only=True)  # Returns the median of each numeric column
    df.std(numeric_only=True)  # Returns the standard deviation of each numeric column
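The following sketch runs a few of these exploration calls on a tiny, invented DataFrame with one missing value, to show how missing data affects counts and means. The column names are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value.
df = pd.DataFrame({
    "city": ["Oslo", "Lima", "Oslo", "Pune"],
    "temp": [3.5, 18.0, np.nan, 27.5],
})

print(df.shape)        # shape is an attribute, so no parentheses
print(df.head(2))
df.info()
print(df.describe())   # summary statistics for the numeric column

# dropna=False keeps NaN as its own category in the counts.
city_counts = df["city"].value_counts(dropna=False)

# count() ignores missing values, so "temp" has one fewer than "city".
non_null = df.count()

# numeric_only=True skips the text column when averaging.
means = df.mean(numeric_only=True)
```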

Selecting Data

You will often need to select a single element or a subset of the data to inspect it or analyze it further. These methods come in handy:

    df[col]  # Returns the column with label col as a Series
    df[[col1, col2]]  # Returns the columns as a new DataFrame
    df[col].iloc[0]  # Selection by position (selects the first element of a column)
    df[col].loc[0]  # Selection by index label (selects the element with index label 0)
    df.iloc[0, :]  # First row
    df.iloc[0, 0]  # First element of the first column
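The difference between `iloc` and `loc` only becomes visible when the index labels do not match the row positions. The sketch below uses an invented frame with a reversed integer index to make that difference explicit.

```python
import pandas as pd

# Hypothetical frame whose index labels differ from the row positions.
df = pd.DataFrame(
    {"a": [10, 20, 30], "b": [1.5, 2.5, 3.5]},
    index=[2, 1, 0],
)

first_by_position = df["a"].iloc[0]   # first row in storage order
first_by_label = df["a"].loc[0]       # row whose index label is 0

first_row = df.iloc[0, :]             # Series holding the first row
top_left = df.iloc[0, 0]              # scalar at row 0, column 0
subset = df[["a", "b"]]               # a list of labels returns a DataFrame

print(first_by_position, first_by_label)
```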

Cleaning Data

If you work with real-world data, you will probably need to clean it. These are some useful methods:

    df.columns = ['a', 'b', 'c']  # Renames columns
    pd.isnull(df)  # Checks for null values, returns a Boolean DataFrame
    pd.notnull(df)  # Checks for non-null values, returns a Boolean DataFrame
    df.dropna()  # Drops all rows that contain null values
    df.dropna(axis=1)  # Drops all columns that contain null values
    df.dropna(axis=1, thresh=n)  # Drops all columns that have fewer than n non-null values
    df.fillna(x)  # Replaces all null values with x
    df[col].fillna(df[col].mean())  # Replaces all null values in a column with the column mean
    df[col].astype(float)  # Converts the datatype of a column to float
    df[col].replace(1, 'one')  # Replaces all values equal to 1 with 'one' in a column
    df[col].replace([1, 3], ['one', 'three'])  # Replaces all 1 with 'one' and 3 with 'three' in a column
    df.rename(columns=lambda x: x + 1)  # Mass renaming of columns (here, numeric labels shifted by 1)
    df.rename(columns={'old_name': 'new_name'})  # Selective renaming
    df.set_index('column_one')  # Changes the index
    df.rename(index=lambda x: x + 1)  # Mass renaming of index
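To make the `dropna` variants concrete, here is a minimal sketch on an invented frame with missing values. It also shows that these methods return new objects by default, so the result has to be assigned back to take effect. All names and values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data with missing values.
df = pd.DataFrame({
    "x": [1.0, np.nan, 3.0],
    "y": [np.nan, np.nan, 6.0],
    "z": [7, 8, 9],
})

null_mask = df.isnull()                     # Boolean frame, True where a value is missing
rows_complete = df.dropna()                 # only rows without any null
cols_complete = df.dropna(axis=1)           # only columns without any null
cols_thresh = df.dropna(axis=1, thresh=2)   # columns with at least 2 non-null values

# Cleaning methods return a new object; assign the result to keep it.
df["x"] = df["x"].fillna(df["x"].mean())
df = df.rename(columns={"z": "count"})
df["count"] = df["count"].replace([7, 9], [70, 90])

print(df)
```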

Advanced Data Processing

Filtering, Sorting, and Grouping

Methods for filtering, sorting, and grouping data:

    import numpy as np

    df[df[col] > 0.5]  # Rows where the col column is greater than 0.5
    df[(df[col] > 0.5) & (df[col] < 0.7)]  # Rows where 0.5 < col < 0.7
    df.sort_values(col1)  # Sorts values by col1 in ascending order
    df.sort_values(col2, ascending=False)  # Sorts values by col2 in descending order
    df.sort_values([col1, col2], ascending=[True, False])  # Sorts values by col1 in ascending order, then col2 in descending order
    df.groupby(col)  # Returns a groupby object for values from one column
    df.groupby([col1, col2])  # Returns a groupby object for values from multiple columns
    df.groupby(col1)[col2].mean()  # Returns the mean of the values in col2, grouped by the values in col1 (mean can be replaced with almost any function from the statistics section)
    df.pivot_table(index=col1, values=[col2, col3], aggfunc='mean')  # Creates a pivot table that groups by col1 and calculates the mean of col2 and col3
    df.groupby(col1).agg('mean')  # Finds the average across all columns for every unique col1 group
    df.apply(np.mean)  # Applies a function across each column
    df.apply(np.max, axis=1)  # Applies a function across each row
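The sketch below applies the filtering, sorting, and grouping calls to a small set of invented sales records. The column names (`region`, `units`, `price`) are assumptions chosen for the example.

```python
import pandas as pd

# Hypothetical sales records.
df = pd.DataFrame({
    "region": ["north", "south", "north", "south"],
    "units": [5, 3, 8, 1],
    "price": [0.6, 0.55, 0.9, 0.65],
})

# Combine conditions with & and wrap each one in parentheses.
mid_price = df[(df["price"] > 0.5) & (df["price"] < 0.7)]

# Sort by region ascending, then by units descending within each region.
ordered = df.sort_values(["region", "units"], ascending=[True, False])

# Mean of one column per group.
units_by_region = df.groupby("region")["units"].mean()

# Pivot table with the aggregation named as a string.
pivot = df.pivot_table(index="region", values=["units", "price"], aggfunc="mean")

print(units_by_region)
print(pivot)
```

Passing the aggregation as a string such as `"mean"` avoids depending on a bare `mean` name being defined in the surrounding scope.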

Joining and Combining

Methods for combining two DataFrames:

    pd.concat([df1, df2])  # Adds the rows of df2 to the end of df1 (columns should be identical); replaces df1.append(df2), which was removed in pandas 2.0
    pd.concat([df1, df2], axis=1)  # Adds the columns of df2 to the end of df1 (rows should be identical)
    df1.join(df2, on=col1, how='inner')  # SQL-style join of the columns in df1 with the columns in df2, matching df1[col1] against the index of df2
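Because `DataFrame.append` was removed in pandas 2.0, row-wise combination is shown here with `pd.concat`. The sketch also illustrates that `join` with `on=` matches a column of the calling frame against the index of the other frame. The frames and the `lookup` name are invented for the example.

```python
import pandas as pd

# Hypothetical frames for illustration.
df1 = pd.DataFrame({"key": ["a", "b"], "left": [1, 2]})
df2 = pd.DataFrame({"key": ["c"], "left": [3]})
lookup = pd.DataFrame({"right": [10, 30]}, index=["a", "c"])

# Stack rows; ignore_index=True renumbers the result from 0.
stacked = pd.concat([df1, df2], ignore_index=True)

# Place frames side by side; rows are aligned on the index.
side_by_side = pd.concat([df1, lookup.reset_index(drop=True)], axis=1)

# join matches stacked["key"] against the index of lookup.
joined = stacked.join(lookup, on="key", how="inner")

print(stacked)
print(joined)
```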

Writing Data

Once your analysis has produced results, there are several ways to export the data:

    df.to_csv(filename)  # Writes to a CSV file
    df.to_excel(filename)  # Writes to an Excel file
    df.to_sql(table_name, connection_object)  # Writes to a SQL table
    df.to_json(filename)  # Writes to a file in JSON format
    df.to_html(filename)  # Saves as an HTML table
    df.to_clipboard()  # Writes to the clipboard
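A short round-trip sketch of three of these writers, using in-memory targets so that no files are created. The data and the table name `items` are made up; in real use a file path or a database connection of your own would take their place.

```python
import io
import sqlite3

import pandas as pd

df = pd.DataFrame({"id": [1, 2], "label": ["x", "y"]})  # hypothetical data

# Writers accept a path or a file-like object; index=False omits the row index.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# With no path given, to_json returns the JSON document as a string.
json_text = df.to_json(orient="records")

# to_sql works with a plain sqlite3 connection; "items" is a made-up table name.
conn = sqlite3.connect(":memory:")
df.to_sql("items", conn, index=False)
round_trip = pd.read_sql("SELECT * FROM items", conn)
conn.close()

print(csv_text)
print(json_text)
```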

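Statistical Analysis

Axis Parameter of Pandas Data Objects

When using the statistical functions, it helps to understand the axis parameter of the Pandas data objects. A Series has a single axis, so no axis argument is needed. For a DataFrame, axis=0 ("index", the default) computes one result per column, and axis=1 ("columns") computes one result per row. The third data object, Panel, was removed in pandas 0.25.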
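A minimal sketch of how the axis argument changes the direction of a reduction on a DataFrame. The exam scores and names are invented for the example.

```python
import pandas as pd

# Hypothetical scores: one row per student, one column per exam.
df = pd.DataFrame(
    {"exam1": [80, 90], "exam2": [70, 100]},
    index=["ann", "ben"],
)

# axis=0 (the default) reduces down the rows: one result per column.
per_exam = df.mean(axis=0)

# axis=1 reduces across the columns: one result per row.
per_student = df.mean(axis=1)

# A Series has a single axis, so no axis argument is needed.
overall_exam1 = df["exam1"].mean()

print(per_exam)
print(per_student)
```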

Statistical Functions Cheat Sheet

The following are commonly used statistical functions in Pandas and their descriptions:

| Function | Description |
| --- | --- |
| count | Number of non-null observations |
| sum | Sum of values |
| mean | Mean of values |
| mad | Mean absolute deviation (removed in pandas 2.0) |
| median | Arithmetic median of values |
| min | Minimum |
| max | Maximum |
| argmin | Integer position of the minimum value |
| argmax | Integer position of the maximum value |
| idxmin | Index label of the minimum value (per column for a DataFrame) |
| idxmax | Index label of the maximum value (per column for a DataFrame) |
| mode | Mode |
| abs | Absolute value |
| prod | Product of values |
| std | Bessel-corrected sample standard deviation |
| var | Unbiased variance |
| sem | Standard error of the mean |
| skew | Sample skewness (3rd moment) |
| kurt | Sample kurtosis (4th moment) |
| quantile | Sample quantile (value at %) |
| cumsum | Cumulative sum |
| cumprod | Cumulative product |
| cummax | Cumulative maximum |
| cummin | Cumulative minimum |
| cov | Covariance |
| corr | Correlation |
| rank | Rank by values |
| pct_change | Percentage change between consecutive elements |
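A few of the less obvious entries in the table are sketched below on an invented Series: position versus label lookups, the default `n - 1` divisor in `var`, and a manual replacement for `mad`, which is no longer available from pandas 2.0 onward. The values are hypothetical.

```python
import pandas as pd

s = pd.Series([2.0, 4.0, 4.0, 8.0], index=["a", "b", "c", "d"])  # hypothetical values

position_of_max = s.argmax()   # integer position
label_of_max = s.idxmax()      # index label
running_total = s.cumsum()
change = s.pct_change()        # first element has no predecessor, so it is NaN

# std and var divide by n - 1 by default (ddof=1); ddof=0 gives the population version.
sample_var = s.var()
population_var = s.var(ddof=0)

# mad was removed in pandas 2.0; the mean absolute deviation can be computed directly.
mean_abs_dev = (s - s.mean()).abs().mean()

print(position_of_max, label_of_max, mean_abs_dev)
```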

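Summary

This post collects the most commonly used Pandas operations and statistical functions, covering the full workflow from importing, exploring, and cleaning data through to analysis. These methods come up constantly in day-to-day data analysis, so consider bookmarking the page for reference.

Although I have studied Pandas, I sometimes cannot recall the exact function name or usage at the moment I need it. This cheat sheet makes data analysis work more efficient!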

© 2026 Mr Nanko. CC BY-NC 4.0