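Pandas Study Notes
import pandas as pd
Pandas Data Structures
Series
A Series is a one-dimensional data structure: an array of values paired with an index.
Building a Series from a list-like (here a range)
ser_obj = pd.Series(range(10, 15))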
<class 'pandas.core.series.Series'>
0 10
1 11
2 12
3 13
4 14
dtype: int32
Getting the values
print(type(ser_obj.values))  # <class 'numpy.ndarray'>
print(ser_obj.values)
<class 'numpy.ndarray'>
[10 11 12 13 14]
Getting the index
print(type(ser_obj.index))  # <class 'pandas.core.indexes.range.RangeIndex'>
print(ser_obj.index)
<class 'pandas.core.indexes.range.RangeIndex'>
RangeIndex(start=0, stop=5, step=1)
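Note that an Index object is immutable
# An Index object is immutable, so item assignment raises TypeError
ser_obj.index[0] = 2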
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-53-ce46badf9dd7> in <module>()
----> 1 ser_obj.index[0] = 2
G:\Anaconda3\lib\site-packages\pandas\core\indexes\base.py in __setitem__(self, key, value)
1668
1669 def __setitem__(self, key, value):
-> 1670 raise TypeError("Index does not support mutable operations")
1671
1672 def __getitem__(self, key):
TypeError: Index does not support mutable operations
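The traceback above comes from assigning to a single position of the index. A minimal sketch, assuming the same `ser_obj` as above, of catching that error and of the usual workaround, which is to replace the index as a whole. The flag name `mutated` and the new labels are made up for illustration.

```python
import pandas as pd

ser_obj = pd.Series(range(10, 15))

# Item assignment on an Index raises TypeError because Index is immutable.
try:
    ser_obj.index[0] = 2
    mutated = True
except TypeError:
    mutated = False

# Replacing the whole index is allowed; these labels are illustrative only.
ser_obj.index = ['a', 'b', 'c', 'd', 'e']
print(mutated)
print(ser_obj.index.tolist())
```

Swapping the whole index keeps the values untouched and only changes how they are labeled.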
Previewing the data
print(ser_obj.head(3))
0 10
1 11
2 12
dtype: int32
Getting a value by its index label
print(ser_obj[0])  # 10
10
The index-to-value mapping is preserved in the results of array operations
print(ser_obj > 12)
print(ser_obj[ser_obj > 12])
0 False
1 False
2 False
3 True
4 True
dtype: bool
3 13
4 14
dtype: int32
Complete code
# Build a Series from a list
Building a Series from a dict (note: the dict keys automatically become the index)
year_data = {2001: 17.8, 2002: 20.1, 2003: 16.5}
ser_obj2 = pd.Series(year_data)
<class 'pandas.core.series.Series'>
2001 17.8
2002 20.1
2003 16.5
dtype: float64
Getting the values
print(type(ser_obj2.values))  # <class 'numpy.ndarray'>
print(ser_obj2.values)
<class 'numpy.ndarray'>
[ 17.8 20.1 16.5]
Getting the index
print(type(ser_obj2.index))  # Int64Index here; pandas 2.0+ reports Index with dtype int64
print(ser_obj2.index)
<class 'pandas.core.indexes.numeric.Int64Index'>
Int64Index([2001, 2002, 2003], dtype='int64')
Previewing the data (head() with no argument returns the first 5 rows, which here is the whole Series)
print(ser_obj2.head())
2001 17.8
2002 20.1
2003 16.5
dtype: float64
Getting a value by its index label
print(ser_obj2[2001])  # 17.8
17.8
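Because the dict keys become the index, an explicit `index=` argument can select and reorder them. A small sketch, assuming the same `year_data` dict as above; the name `ser_sub` and the extra key 2004 are illustrative only.

```python
import pandas as pd

year_data = {2001: 17.8, 2002: 20.1, 2003: 16.5}

# index= picks keys in the given order; 2004 is not in the dict, so it becomes NaN.
ser_sub = pd.Series(year_data, index=[2003, 2002, 2004])
print(ser_sub)
```

Keys that are left out of `index=` (2001 here) are simply dropped.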
Complete code
# Build a Series from a dict (note: the dict keys automatically become the index)
DataFrame
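A DataFrame is a table. Where a Series represents a one-dimensional array, a DataFrame is a two-dimensional one, comparable to an Excel spreadsheet. A DataFrame can also be viewed as a collection of Series that share one index.
Building a DataFrame from an ndarray
import numpy as np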
[[ 0.7346628 -1.13733651 0.72853785 0.38743511]
[ 0.49549724 3.96998008 1.13567695 -0.21425912]
[ 0.22094222 0.7766603 0.46086182 0.33199643]
[-0.46279419 0.85898771 0.41993259 -0.61997791]
[-0.83296535 1.19450707 -1.45531366 -0.13990243]]
0 1 2 3
0 0.734663 -1.137337 0.728538 0.387435
1 0.495497 3.969980 1.135677 -0.214259
2 0.220942 0.776660 0.460862 0.331996
3 -0.462794 0.858988 0.419933 -0.619978
4 -0.832965 1.194507 -1.455314 -0.139902
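The code cell for this step is cut off in these notes after its first line. The sketch below is one plausible way to produce output of the shape shown above, assuming a 5x4 array of standard-normal samples; the seed and the names `rng`, `array` and `df_obj` are assumptions, so the numbers will differ from the listing.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)        # seed chosen only for repeatability
array = rng.standard_normal((5, 4))   # 5 rows, 4 columns of N(0, 1) samples
df_obj = pd.DataFrame(array)          # default integer row and column labels

print(array)
print(df_obj.head())
```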
Building a DataFrame from a dict
dict_data = {'A': 1.,
{'A': 1.0, 'B': Timestamp('2018-03-16 00:00:00'), 'C': 0 1.0
1 1.0
2 1.0
3 1.0
dtype: float32, 'D': array([3, 3, 3, 3]), 'E': [Python, Java, C++, C#]
Categories (4, object): [C#, C++, Java, Python]}
A B C D E
0 1.0 2018-03-16 1.0 3 Python
1 1.0 2018-03-16 1.0 3 Java
2 1.0 2018-03-16 1.0 3 C++
3 1.0 2018-03-16 1.0 3 C#
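Only the first line of the dict survives in these notes, but the printed dict above shows what each key held. A hedged reconstruction, assuming the values were built with `pd.Timestamp`, `pd.Series`, `np.array` and `pd.Categorical` as the output suggests:

```python
import numpy as np
import pandas as pd

dict_data = {
    'A': 1.,                                              # scalar, broadcast to every row
    'B': pd.Timestamp('2018-03-16'),                      # scalar timestamp, also broadcast
    'C': pd.Series(1, index=range(4), dtype='float32'),   # sets the row index 0..3
    'D': np.array([3] * 4),
    'E': pd.Categorical(['Python', 'Java', 'C++', 'C#']),
}
df_obj2 = pd.DataFrame(dict_data)
print(df_obj2)
```

Each key becomes a column, and every column keeps its own dtype.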
Getting a column by its column label
print(df_obj2['A'])
0 1.0
1 1.0
2 1.0
3 1.0
Name: A, dtype: float64
<class 'pandas.core.series.Series'>
0 1.0
1 1.0
2 1.0
3 1.0
Name: A, dtype: float64
Getting a row by its row label (.loc)
print(df_obj2.loc[0])
A 1
B 2018-03-16 00:00:00
C 1
D 3
E Python
Name: 0, dtype: object
<class 'pandas.core.series.Series'>
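`.loc` looks rows up by label, which only coincides with position when the index is the default 0, 1, 2, ... A small sketch with a made-up frame `df` whose labels differ from the positions, to show the difference from the positional `.iloc`:

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3]}, index=[10, 20, 30])

row_by_label = df.loc[10]       # the row whose label is 10
row_by_position = df.iloc[0]    # the first row, whatever its label
print(row_by_label.equals(row_by_position))
```

Here `df.loc[0]` would raise a `KeyError`, because no row carries the label 0.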
Adding a column
df_obj2['F'] = df_obj2['D'] + 4
A B C D E F
0 1.0 2018-03-16 1.0 3 Python 7
1 1.0 2018-03-16 1.0 3 Java 7
2 1.0 2018-03-16 1.0 3 C++ 7
3 1.0 2018-03-16 1.0 3 C# 7
Deleting a column
del df_obj2['F']
A B C D E
0 1.0 2018-03-16 1.0 3 Python
1 1.0 2018-03-16 1.0 3 Java
2 1.0 2018-03-16 1.0 3 C++
3 1.0 2018-03-16 1.0 3 C#
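`del` removes the column in place. When the original frame should stay intact, `DataFrame.drop` returns a new frame instead. A minimal sketch on a made-up one-column frame (the names `df` and `trimmed` are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'D': [3, 3, 3, 3]})
df['F'] = df['D'] + 4               # new column computed from an existing one

trimmed = df.drop(columns=['F'])    # returns a new frame; df itself is unchanged
print(list(df.columns), list(trimmed.columns))
```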
Complete code
import numpy as np
[[ 0.23758715 -1.13751056 -0.0863061 -0.71309414]
[ 0.08129935 1.32099551 -0.27057527 0.49270974]
[ 0.96111551 1.08307556 1.5094844 0.96117055]
[-0.31003598 1.33959047 -0.42150857 -1.20605423]
[ 0.12655879 -1.01810288 -1.34025171 0.98758417]]
0 1 2 3
0 0.237587 -1.137511 -0.086306 -0.713094
1 0.081299 1.320996 -0.270575 0.492710
2 0.961116 1.083076 1.509484 0.961171
3 -0.310036 1.339590 -0.421509 -1.206054
4 0.126559 -1.018103 -1.340252 0.987584
{'A': 1.0, 'B': Timestamp('2018-03-16 00:00:00'), 'C': 0 1.0
1 1.0
2 1.0
3 1.0
dtype: float32, 'D': array([3, 3, 3, 3]), 'E': [Python, Java, C++, C#]
Categories (4, object): [C#, C++, Java, Python]}
A B C D E
0 1.0 2018-03-16 1.0 3 Python
1 1.0 2018-03-16 1.0 3 Java
2 1.0 2018-03-16 1.0 3 C++
3 1.0 2018-03-16 1.0 3 C#
0 1.0
1 1.0
2 1.0
3 1.0
Name: A, dtype: float64
<class 'pandas.core.series.Series'>
0 1.0
1 1.0
2 1.0
3 1.0
Name: A, dtype: float64
A 1
B 2018-03-16 00:00:00
C 1
D 3
E Python
Name: 0, dtype: object
<class 'pandas.core.series.Series'>
A B C D E G
0 1.0 2018-03-16 1.0 3 Python 7
1 1.0 2018-03-16 1.0 3 Java 7
2 1.0 2018-03-16 1.0 3 C++ 7
3 1.0 2018-03-16 1.0 3 C# 7
A B C D E
0 1.0 2018-03-16 1.0 3 Python
1 1.0 2018-03-16 1.0 3 Java
2 1.0 2018-03-16 1.0 3 C++
3 1.0 2018-03-16 1.0 3 C#
Pandas Data Operations
import pandas as pd
Series indexing
ser_obj = pd.Series(range(5), index=['a', 'b', 'c', 'd', 'e'])
a 0
b 1
c 2
d 3
e 4
dtype: int32
Row indexing
# Row indexing
0
Slicing works either by position or by label. A position slice excludes its end point, while a label slice includes it.
# Slice by position
b 1
c 2
dtype: int32
# Slice by label
b 1
c 2
d 3
dtype: int32
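The two outputs above differ in length because of how the end point is treated. A sketch that makes the choice explicit with `.iloc` and `.loc`, assuming the same `ser_obj`; the original cells are truncated, so the exact expressions used there are not known.

```python
import pandas as pd

ser_obj = pd.Series(range(5), index=['a', 'b', 'c', 'd', 'e'])

by_position = ser_obj.iloc[1:3]    # positions 1 and 2; the end position is excluded
by_label = ser_obj.loc['b':'d']    # labels 'b' through 'd'; the end label is included
print(by_position)
print(by_label)
```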
Non-contiguous selection likewise works either by position or by label
# Non-contiguous selection, form 1 (by position)
a 0
c 2
e 4
dtype: int32
# Non-contiguous selection, form 2 (by label)
a 0
e 4
dtype: int32
Boolean indexing
# Boolean indexing
a False
b False
c False
d True
e True
dtype: bool
d 3
e 4
dtype: int32
d 3
e 4
dtype: int32
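The non-contiguous and boolean selections shown above can be sketched as follows, assuming the same `ser_obj` and a threshold of 2 (inferred from the printed mask). The variable names are illustrative.

```python
import pandas as pd

ser_obj = pd.Series(range(5), index=['a', 'b', 'c', 'd', 'e'])

picked_by_position = ser_obj.iloc[[0, 2, 4]]   # rows a, c, e
picked_by_label = ser_obj.loc[['a', 'e']]      # rows a, e

mask = ser_obj > 2                             # boolean Series aligned on the index
filtered = ser_obj[mask]                       # same result as ser_obj.loc[mask]
print(picked_by_position, picked_by_label, mask, filtered, sep='\n')
```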
DataFrame indexing
import numpy as np
| | a | b | c | d |
---|---|---|---|---|
0 | 0.983790 | 1.063804 | 0.854634 | -1.269025 |
1 | 0.161653 | -0.904602 | -1.840041 | 0.138183 |
2 | -1.256608 | -1.740634 | -1.653686 | -0.412524 |
3 | 0.165782 | 1.116089 | 0.065008 | -1.693706 |
4 | 1.313987 | 0.734437 | -0.625647 | -1.738446 |
Column indexing
# Column indexing
<class 'pandas.core.series.Series'>
0 0.983790
1 0.161653
2 -1.256608
3 0.165782
4 1.313987
Name: a, dtype: float64
Row indexing
# Row indexing
<class 'pandas.core.series.Series'>
a 0.983790
b 1.063804
c 0.854634
d -1.269025
Name: 0, dtype: float64
Non-contiguous indexing
# Non-contiguous column selection
| | a | c |
---|---|---|
0 | 0.983790 | 0.854634 |
1 | 0.161653 | -1.840041 |
2 | -1.256608 | -1.653686 |
3 | 0.165782 | 0.065008 |
4 | 1.313987 | -0.625647 |
# Non-contiguous row selection
| | a | b | c | d |
---|---|---|---|---|
1 | 0.161653 | -0.904602 | -1.840041 | 0.138183 |
3 | 0.165782 | 1.116089 | 0.065008 | -1.693706 |
Mixed indexing
# Mixed indexing with loc
0 -1.018941
1 0.089275
2 -2.210780
Name: a, dtype: float64
0 -1.018941
2 -2.210780
4 1.435787
Name: a, dtype: float64
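The DataFrame indexing cells in this part are truncated, so the following is a sketch of selections that would give results of the shapes shown above, not the original code. The seeded random data and all variable names are assumptions.

```python
import numpy as np
import pandas as pd

df_obj = pd.DataFrame(np.random.default_rng(0).standard_normal((5, 4)),
                      columns=['a', 'b', 'c', 'd'])

col_a = df_obj['a']                 # one column -> Series
row_0 = df_obj.loc[0]               # one row by label -> Series
cols_ac = df_obj[['a', 'c']]        # several columns -> DataFrame
rows_13 = df_obj.loc[[1, 3]]        # several rows by label -> DataFrame
a_first3 = df_obj.loc[0:2, 'a']     # label slice (end included) plus a column label
a_every2 = df_obj.iloc[0:5:2, 0]    # positional slice with step 2 plus a column position
print(a_first3)
print(a_every2)
```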
Arithmetic and alignment
Series
Alignment
s1 = pd.Series(range(10, 13), index=range(3))
s1:
0 10
1 11
2 12
dtype: int32
s2:
0 20
1 21
2 22
3 23
4 24
dtype: int32
# Series alignment in arithmetic
0 30.0
1 32.0
2 34.0
3 NaN
4 NaN
dtype: float64
0 30.0
1 32.0
2 34.0
3 22.0
4 23.0
dtype: float64
0 30.0
1 32.0
2 34.0
3 -1.0
4 -1.0
dtype: float64
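The three results above are consistent with a plain addition, an addition with `fill_value=-1`, and an addition followed by `fillna(-1)`. A sketch under that assumption; the definition of `s2` is inferred from its printed values, and the result names are illustrative.

```python
import pandas as pd

s1 = pd.Series(range(10, 13), index=range(3))
s2 = pd.Series(range(20, 25), index=range(5))

plain = s1 + s2                            # labels 3 and 4 exist only in s2 -> NaN
filled_first = s1.add(s2, fill_value=-1)   # missing s1 entries count as -1 before adding
filled_after = (s1 + s2).fillna(-1)        # add first, then replace the NaN results
print(plain, filled_first, filled_after, sep='\n')
```

The difference between the last two is the order of operations: `fill_value` patches the inputs, while `fillna` patches the result.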
DataFrame
Alignment
import numpy as np
df1:
a b
0 1.0 1.0
1 1.0 1.0
df2:
a b c
0 1.0 1.0 1.0
1 1.0 1.0 1.0
2 1.0 1.0 1.0
# DataFrame alignment
| | a | b | c |
---|---|---|---|
0 | 2.0 | 2.0 | NaN |
1 | 2.0 | 2.0 | NaN |
2 | NaN | NaN | NaN |
df1.add(df2, fill_value=0)  # addition; unmatched entries are filled with 0
| | a | b | c |
---|---|---|---|
0 | 2.0 | 2.0 | 1.0 |
1 | 2.0 | 2.0 | 1.0 |
2 | 1.0 | 1.0 | 1.0 |
df1 - df2  # unmatched entries show up as NaN
| | a | b | c |
---|---|---|---|
0 | 0.0 | 0.0 | NaN |
1 | 0.0 | 0.0 | NaN |
2 | NaN | NaN | NaN |
df1.sub(df2, fill_value=2)  # subtraction; unmatched entries are filled with 2 (fill first, then compute)
| | a | b | c |
---|---|---|---|
0 | 0.0 | 0.0 | 1.0 |
1 | 0.0 | 0.0 | 1.0 |
2 | 1.0 | 1.0 | 1.0 |
df3 = df1 + df2
df3.fillna(100)  # replace the NaN left by unmatched entries with 100
| | a | b | c |
---|---|---|---|
0 | 2.0 | 2.0 | 100.0 |
1 | 2.0 | 2.0 | 100.0 |
2 | 100.0 | 100.0 | 100.0 |
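The DataFrame alignment results above can be checked with a short script. This sketch assumes `df1` and `df2` were all-ones frames of shape 2x2 and 3x3, as their printed form suggests; the result names are illustrative.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.ones((2, 2)), columns=['a', 'b'])
df2 = pd.DataFrame(np.ones((3, 3)), columns=['a', 'b', 'c'])

summed = df1 + df2                     # NaN wherever df1 has no matching row or column
added = df1.add(df2, fill_value=0)     # missing df1 cells count as 0
subbed = df1.sub(df2, fill_value=2)    # missing df1 cells count as 2, so 2 - 1 = 1
patched = (df1 + df2).fillna(100)      # compute first, then overwrite NaN with 100
print(summed, added, subbed, patched, sep='\n')
```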
Applying functions
pandas objects can be passed directly to NumPy ufuncs.
# NumPy ufunc functions
| | 0 | 1 | 2 | 3 |
---|---|---|---|---|
0 | -0.938212 | -2.487779 | -1.805374 | -1.130723 |
1 | -0.533441 | 0.196536 | -1.094895 | -1.819312 |
2 | -3.233318 | 0.255510 | -1.560183 | -2.404621 |
3 | -1.956924 | -2.947539 | -1.640760 | -0.757321 |
4 | 0.198618 | 0.344484 | -0.893815 | -0.498036 |
np.abs(df)  # absolute value (many other NumPy functions work the same way)
| | 0 | 1 | 2 | 3 |
---|---|---|---|---|
0 | 0.938212 | 2.487779 | 1.805374 | 1.130723 |
1 | 0.533441 | 0.196536 | 1.094895 | 1.819312 |
2 | 3.233318 | 0.255510 | 1.560183 | 2.404621 |
3 | 1.956924 | 2.947539 | 1.640760 | 0.757321 |
4 | 0.198618 | 0.344484 | 0.893815 | 0.498036 |
Using apply on rows or columns
# Use apply on rows or columns
0 0.198618
1 0.344484
2 -0.893815
3 -0.498036
dtype: float64
df.apply(lambda x: x.max(), axis=1)  # compare across columns (the max of each row)
0 -0.938212
1 0.196536
2 0.255510
3 -0.757321
4 0.344484
dtype: float64
df.apply(lambda x: x.max(), axis=0)  # compare down the rows (the max of each column)
0 0.198618
1 0.344484
2 -0.893815
3 -0.498036
dtype: float64
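The `axis` argument is easy to mix up, so here is a tiny worked example on a made-up 2x2 frame where the results can be verified by eye. The frame and the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[-1.0, 2.0], [3.0, -4.0]])        # small illustrative frame

abs_df = np.abs(df)                                  # ufunc applied element-wise
col_max = df.apply(lambda col: col.max())            # axis=0 (default): one value per column
row_max = df.apply(lambda row: row.max(), axis=1)    # axis=1: one value per row
print(abs_df, col_max, row_max, sep='\n')
```

With `axis=0` the function receives each column in turn; with `axis=1` it receives each row.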
Using applymap on every element (pandas 2.1 and later deprecate applymap in favor of DataFrame.map)
# Use applymap on every element
| | 0 | 1 | 2 | 3 |
---|---|---|---|---|
0 | -0.94 | -2.49 | -1.81 | -1.13 |
1 | -0.53 | 0.20 | -1.09 | -1.82 |
2 | -3.23 | 0.26 | -1.56 | -2.40 |
3 | -1.96 | -2.95 | -1.64 | -0.76 |
4 | 0.20 | 0.34 | -0.89 | -0.50 |
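The table above shows every value formatted to two decimal places. A sketch of an element-wise call that does this, assuming a `'%.2f'` format was used (the original cell is truncated). The helper `format_2dp` is hypothetical, and the version check is there because `applymap` is deprecated in newer pandas in favor of `DataFrame.map`.

```python
import pandas as pd

df = pd.DataFrame([[-0.938212, 0.196536], [0.25551, -2.404621]])

def format_2dp(x):
    """Format one number as a string with two decimal places."""
    return '%.2f' % x

# DataFrame.map is the newer name (pandas 2.1+); fall back to applymap on older versions.
elementwise = df.map if hasattr(df, 'map') else df.applymap
formatted = elementwise(format_2dp)
print(formatted)
```

The result holds strings, not floats, so it is suited to display and not to further arithmetic.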
Sorting
Series
Sorting by index & sorting by value
# Build a Series with an unordered index
2 10
1 13
5 12
3 25
4 14
dtype: int64
# Sort by index (descending here)
5 12
4 14
3 25
2 10
1 13
dtype: int64
# Sort by value
2 10
5 12
1 13
4 14
3 25
dtype: int64
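The three Series listings above can be reproduced with `sort_index` and `sort_values`. This sketch takes the values and labels from the first listing; the name `s4` and the `ascending=False` argument are inferred from the output, not taken from the truncated cells.

```python
import pandas as pd

s4 = pd.Series([10, 13, 12, 25, 14], index=[2, 1, 5, 3, 4])

by_index_desc = s4.sort_index(ascending=False)   # labels 5, 4, 3, 2, 1
by_value = s4.sort_values()                      # values 10, 12, 13, 14, 25
print(by_index_desc)
print(by_value)
```

Both methods return a new Series and leave `s4` in its original order.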
DataFrame
Sorting by index & sorting by value
df4 = pd.DataFrame(np.random.randn(3, 4),
| | 1 | 4 | 2 | 3 |
---|---|---|---|---|
1 | 0.948112 | 0.076323 | 0.089607 | 0.091737 |
3 | -1.254556 | 1.483504 | 0.468995 | 0.286249 |
2 | -0.806738 | -0.842388 | -1.127489 | -0.020803 |
# Sort by row index (descending here)
| | 1 | 4 | 2 | 3 |
---|---|---|---|---|
3 | -1.254556 | 1.483504 | 0.468995 | 0.286249 |
2 | -0.806738 | -0.842388 | -1.127489 | -0.020803 |
1 | 0.948112 | 0.076323 | 0.089607 | 0.091737 |
# Sort by column labels
| | 1 | 2 | 3 | 4 |
---|---|---|---|---|
1 | 0.948112 | 0.089607 | 0.091737 | 0.076323 |
3 | -1.254556 | 0.468995 | 0.286249 | 1.483504 |
2 | -0.806738 | -1.127489 | -0.020803 | -0.842388 |
# Sort by the values of a column (column 1 here)
| | 1 | 4 | 2 | 3 |
---|---|---|---|---|
3 | -1.254556 | 1.483504 | 0.468995 | 0.286249 |
2 | -0.806738 | -0.842388 | -1.127489 | -0.020803 |
1 | 0.948112 | 0.076323 | 0.089607 | 0.091737 |
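The DataFrame sorting cells are truncated, so the calls below are inferred from the row and column order in the tables above. The seeded random data, `ascending=False`, `axis=1` and `by=1` are all assumptions in that sense.

```python
import numpy as np
import pandas as pd

df4 = pd.DataFrame(np.random.default_rng(0).standard_normal((3, 4)),
                   index=[1, 3, 2], columns=[1, 4, 2, 3])

rows_desc = df4.sort_index(ascending=False)   # row labels 3, 2, 1
cols_sorted = df4.sort_index(axis=1)          # column labels 1, 2, 3, 4
by_col_1 = df4.sort_values(by=1)              # rows ordered by the values in column 1
print(rows_desc, cols_sorted, by_col_1, sep='\n')
```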
Handling missing data
Generating the data
df_data = pd.DataFrame([np.random.randn(3), [1., np.nan, np.nan],
| | 0 | 1 | 2 |
---|---|---|---|
0 | 1.089477 | -0.486706 | -0.322284 |
1 | 1.000000 | NaN | NaN |
2 | 4.000000 | NaN | NaN |
3 | 1.000000 | NaN | 2.000000 |
Boolean mask (True for NaN, False for non-NaN)
# isnull
| | 0 | 1 | 2 |
---|---|---|---|
0 | False | False | False |
1 | False | True | True |
2 | False | True | True |
3 | False | True | False |
Dropping rows or columns that contain NaN
# dropna
0 1 2
0 1.089477 -0.486706 -0.322284
0
0 1.089477
1 1.000000
2 4.000000
3 1.000000
Filling NaN values
# fillna
| | 0 | 1 | 2 |
---|---|---|---|
0 | 1.089477 | -0.486706 | -0.322284 |
1 | 1.000000 | -100.000000 | -100.000000 |
2 | 4.000000 | -100.000000 | -100.000000 |
3 | 1.000000 | -100.000000 | 2.000000 |
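The missing-data steps above can be sketched end to end. The frame below mirrors the NaN pattern of the printed `df_data`, with fixed numbers in the first row in place of the random ones; the result names are illustrative.

```python
import numpy as np
import pandas as pd

df_data = pd.DataFrame([[1.1, -0.5, -0.3],
                        [1., np.nan, np.nan],
                        [4., np.nan, np.nan],
                        [1., np.nan, 2.]])

mask = df_data.isnull()              # True where a value is NaN
full_rows = df_data.dropna()         # axis=0 (default): drop rows that contain NaN
full_cols = df_data.dropna(axis=1)   # drop columns that contain NaN
filled = df_data.fillna(-100.)       # replace every NaN with -100
print(mask, full_rows, full_cols, filled, sep='\n')
```

None of these calls modifies `df_data`; each returns a new object.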
Computing and describing statistics
Common statistical computations
df_obj = pd.DataFrame(np.random.randn(5, 4), columns=['a', 'b', 'c', 'd'])
| | a | b | c | d |
---|---|---|---|---|
0 | 0.145119 | -2.398595 | 0.640806 | 0.696701 |
1 | -0.877139 | -0.261616 | -2.211734 | 0.140729 |
2 | -0.644545 | 0.523667 | -1.460002 | -0.341459 |
3 | 1.369260 | 1.039981 | 0.164075 | 0.380755 |
4 | 0.089507 | -0.371051 | 1.348191 | -0.828315 |
df_obj.sum()
a 0.082203
b -1.467614
c -1.518663
d 0.048410
dtype: float64
df_obj.max()
a 1.369260
b 1.039981
c 1.348191
d 0.696701
dtype: float64
df_obj.min(axis=1)
0 -2.398595
1 -2.211734
2 -1.460002
3 0.164075
4 -0.828315
dtype: float64
Descriptive statistics
df_obj.describe()
| | a | b | c | d |
---|---|---|---|---|
count | 5.000000 | 5.000000 | 5.000000 | 5.000000 |
mean | 0.016441 | -0.293523 | -0.303733 | 0.009682 |
std | 0.878550 | 1.311906 | 1.484695 | 0.602578 |
min | -0.877139 | -2.398595 | -2.211734 | -0.828315 |
25% | -0.644545 | -0.371051 | -1.460002 | -0.341459 |
50% | 0.089507 | -0.261616 | 0.164075 | 0.140729 |
75% | 0.145119 | 0.523667 | 0.640806 | 0.380755 |
max | 1.369260 | 1.039981 | 1.348191 | 0.696701 |
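Since the frame above holds random numbers, the reductions are easier to verify on a small frame with round values. A sketch with a made-up `df_small`, showing the default column-wise reduction, the row-wise variant with `axis=1`, and the rows that `describe()` returns:

```python
import pandas as pd

df_small = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [4.0, 5.0, 9.0]})

col_sum = df_small.sum()          # axis=0 (default): one total per column
row_min = df_small.min(axis=1)    # axis=1: one minimum per row
summary = df_small.describe()     # count, mean, std, min, quartiles, max per column
print(col_sum, row_min, summary, sep='\n')
```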