CTRL K
Pandas Complete Reference: Cheat Sheet and Statistical Functions
Mr Nanko,
Table of Contents
Numpy 和 Pandas 是最受欢迎的 Python 数据科学和分析库。Numpy 用于较低级别的科学计算,而 Pandas 构建于 Numpy 之上,专为 Python 中的实际数据分析而设计。本文整理了 Pandas 最常用的功能和统计函数,方便快速查阅。
数据基础操作
导入数据
任何类型的数据分析都从获取某些数据开始。Pandas 为您提供了很多将数据导入 Python 工作簿的选项:
pd.read_csv(filename) # From a CSV file
pd.read_table(filename) # From a delimited text file (like TSV)
pd.read_excel(filename) # From an Excel file
pd.read_sql(query, connection_object) # Reads from a SQL table/database
pd.read_json(json_string) # Reads from a JSON formatted string, URL or file.
pd.read_html(url) # Parses an html URL, string or file and extracts tables to a list of dataframes
pd.read_clipboard() # Takes the contents of your clipboard and passes it to read_table()
pd.DataFrame(dict) # From a dict, keys for columns names, values for data as lists探索数据
将数据导入 Pandas 数据帧后,可以使用这些方法来了解数据的外观:
df.shape() # Prints number of rows and columns in dataframe
df.head(n) # Prints first n rows of the DataFrame
df.tail(n) # Prints last n rows of the DataFrame
df.info() # Index, Datatype and Memory information
df.describe() # Summary statistics for numerical columns
df[col].value_counts(dropna=False) # Views unique values and counts for a column
df.apply(pd.Series.value_counts) # Unique values and counts for all columns
df.mean() # Returns the mean of all columns
df.corr() # Returns the correlation between columns in a DataFrame
df.count() # Returns the number of non-null values in each DataFrame column
df.max() # Returns the highest value in each column
df.min() # Returns the lowest value in each column
df.median() # Returns the median of each column
df.std() # Returns the standard deviation of each column选择数据
通常,您可能需要选择单个元素或数据的某个子集来检查它或执行进一步分析。这些方法会派上用场:
df[col] # Returns column with label col as Series
df[[col1, col2]] # Returns Columns as a new DataFrame
df[col].iloc[0] # Selection by position (selects first element of a column)
df[col].loc[0] # Selection by index (selects element at index 0 of a column)
df.iloc[0,:] # First row
df.iloc[0,0] # First element of first column数据清理
如果您正在使用真实世界的数据,您可能需要清理它。这些是一些有用的方法:
df.columns = ['a','b','c'] # Renames columns
pd.isnull() # Checks for null Values, Returns Boolean Array
pd.notnull() # Checks for non-null Values, Returns Boolean Array
df.dropna() # Drops all rows that contain null values
df.dropna(axis=1) # Drops all columns that contain null values
df.dropna(axis=1,thresh=n) # Drops all rows have have less than n non null values
df.fillna(x) # Replaces all null values with x
df[col].fillna(df[col].mean()) # Replaces all null values in a column with the column mean
df[col].astype(float) # Converts the datatype of a column to float
df[col].replace(1,'one') # Replaces all values equal to 1 with 'one' in a column
df[col].replace([1,3],['one','three']) # Replaces all 1 with 'one' and 3 with 'three' in a column
df.rename(columns=lambda x: x + 1) # Mass renaming of columns
df.rename(columns={'old_name': 'new_ name'}) # Selective renaming
df.set_index('column_one') # Changes the index
df.rename(index=lambda x: x + 1) # Mass renaming of index数据高级处理
过滤、排序和分组
过滤、排序和分组数据的方法:
df[df[col] > 0.5] # Rows where the col column is greater than 0.5
df[(df[col] > 0.5) & (df[col] < 0.7)] # Rows where 0.5 < col < 0.7
df.sort_values(col1) # Sorts values by col1 in ascending order
df.sort_values(col2,ascending=False) # Sorts values by col2 in descending order
df.sort_values([col1,col2], ascending=[True,False]) # Sorts values by col1 in ascending order then col2 in descending order
df.groupby(col) # Returns a groupby object for values from one column
df.groupby([col1,col2]) # Returns a groupby object values from multiple columns
df.groupby(col1)[col2].mean() # Returns the mean of the values in col2, grouped by the values in col1 (mean can be replaced with almost any function from the statistics section)
df.pivot_table(index=col1, values=[col2,col3], aggfunc=mean) # Creates a pivot table that groups by col1 and calculates the mean of col2 and col3
df.groupby(col1).agg(np.mean) # Finds the average across all columns for every unique column 1 group
df.apply(np.mean) # Applies a function across each column
df.apply(np.max, axis=1) # Applies a function across each row加入和组合
组合两个数据帧的方法:
df1.append(df2) # Adds the rows in df1 to the end of df2 (columns should be identical)
pd.concat([df1, df2],axis=1) # Adds the columns in df1 to the end of df2 (rows should be identical)
df1.join(df2,on=col1,how='inner') # SQL-style joins the columns in df1 with the columns on df2 where the rows for col1 have identical values写入数据
当您通过分析生成结果时,有几种方法可以导出数据:
df.to_csv(filename) # Writes to a CSV file
df.to_excel(filename) # Writes to an Excel file
df.to_sql(table_name, connection_object) # Writes to a SQL table
df.to_json(filename) # Writes to a file in JSON format
df.to_html(filename) # Saves as an HTML table
df.to_clipboard() # Writes to the clipboard统计分析
Pandas 数据对象的轴参数
在使用统计函数时,了解 Pandas 三个数据对象的轴参数很重要:
- Series: 没有轴参数
- DataFrame:
"index"(axis=0, default),"columns"(axis=1) - Panel:
"items"(axis=0),"major"(axis=1, default),"minor"(axis=2)
统计函数速查表
以下是 Pandas 中常用的统计函数及其描述:
| 函数 | 英文描述 | 中文描述 |
|---|---|---|
| count | Number of non-null observations | 观测值的个数 |
| sum | Sum of values | 求和 |
| mean | Mean of values | 求平均值 |
| mad | Mean absolute deviation | 平均绝对方差 |
| median | Arithmetic median of values | 中位数 |
| min | Minimum | 最小值 |
| max | Maximum | 最大值 |
| argmin | Calculate the index position (integer) that can get the minimum value | 计算能够获取到最小值的索引位置(整数) |
| argmax | Calculate the index position where the maximum value can be obtained | 计算能够获取到最大值的索引位置 |
| idxmin | Row index of each column minimum | 每列最小值的行索引 |
| idxmax | Row index of the maximum value per column | 每列最大值的行索引 |
| mode | Mode | 众数 |
| abs | Absolute Value | 绝对值 |
| prod | Product of values | 乘积 |
| std | Bessel-corrected sample standard deviation | 标准差 |
| var | Unbiased variance | 方差 |
| sem | Standard error of the mean | 标准误 |
| skew | Sample skewness (3rd moment) | 偏度系数 |
| kurt | Sample kurtosis (4th moment) | 峰度 |
| quantile | Sample quantile (value at %) | 分位数 |
| cumsum | Cumulative sum | 累加 |
| cumprod | Cumulative product | 累乘 |
| cummax | Cumulative maximum | 累最大值 |
| cummin | Cumulative minimum | 累最小值 |
| cov() | covariance | 协方差 |
| corr() | correlation | 相关系数 |
| rank() | rank by values | 排名 |
| pct_change() | time change | 时间序列变化 |
总结
本文整理了 Pandas 最常用的操作和统计函数,涵盖了从数据导入、探索、清理到分析的完整流程。这些方法在日常数据分析工作中经常用到,建议收藏备查。
虽然我学过 Pandas,但有时候需要用的时候一时间无法想到具体的函数名或用法。这份备忘单让数据分析工作更加高效!
参考
© 2026 Mr Nanko. CC BY-NC 4.0