In [11]: df Out[11]: data1 data2 key1 key2 0 -0.2047081.393406 a one 10.4789430.092908 a two 2 -0.5194390.281746 b one 3 -0.5557300.769023 b two 41.9657811.246435 a one
In [21]: df.groupby('key1').mean() Out[21]: data1 data2 key1 a 0.7466720.910916 b -0.5375850.525384
In [22]: df.groupby(['key1', 'key2']).mean() Out[22]: data1 data2 key1 key2 a one 0.8805361.319920 two 0.4789430.092908 b one -0.5194390.281746 two -0.5557300.769023
In [23]: df.groupby(['key1', 'key2']).size() Out[23]: key1 key2 a one 2 two 1 b one 1 two 1 dtype: int64
注意,任何分组关键词中的缺失值,都会被从结果中除去。
10.1.1 对分组进行迭代
GroupBy对象支持迭代,可以产生一组二元元组(由分组名和数据块组成)。看下面的例子:
1 2 3 4 5 6 7 8 9 10 11 12 13
In [24]: for name, group in df.groupby('key1'): ....: print(name) ....: print(group) ....: a data1 data2 key1 key2 0 -0.2047081.393406 a one 10.4789430.092908 a two 41.9657811.246435 a one b data1 data2 key1 key2 2 -0.5194390.281746 b one 3 -0.5557300.769023 b two
对于多重键的情况,元组的第一个元素将会是由键值组成的元组:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
In [25]: for (k1, k2), group in df.groupby(['key1', 'key2']): ....: print((k1, k2)) ....: print(group) ....: ('a', 'one') data1 data2 key1 key2 0 -0.2047081.393406 a one 41.9657811.246435 a one ('a', 'two') data1 data2 key1 key2 10.4789430.092908 a two ('b', 'one') data1 data2 key1 key2 2 -0.5194390.281746 b one ('b', 'two') data1 data2 key1 key2 3 -0.555730.769023 b two
当然,你可以对这些数据片段做任何操作。有一个你可能会觉得有用的运算:将这些数据片段做成一个字典:
1 2 3 4 5 6 7
In [26]: pieces = dict(list(df.groupby('key1')))
In [27]: pieces['b'] Out[27]: data1 data2 key1 key2 2 -0.5194390.281746 b one 3 -0.5557300.769023 b two
In [30]: for dtype, group in grouped: ....: print(dtype) ....: print(group) ....: float64 data1 data2 0 -0.2047081.393406 10.4789430.092908 2 -0.5194390.281746 3 -0.5557300.769023 41.9657811.246435 object key1 key2 0 a one 1 a two 2 b one 3 b two 4 a one
In [32]: s_grouped = df.groupby(['key1', 'key2'])['data2']
In [33]: s_grouped Out[33]: <pandas.core.groupby.SeriesGroupBy object at 0x7faa30c78da0>
In [34]: s_grouped.mean() Out[34]: key1 key2 a one 1.319920 two 0.092908 b one 0.281746 two 0.769023 Name: data2, dtype: float64
###通过字典或Series进行分组
除数组以外,分组信息还可以其他形式存在。来看另一个示例DataFrame:
1 2 3 4 5 6 7 8 9 10 11 12 13 14
In [35]: people = pd.DataFrame(np.random.randn(5, 5), ....: columns=['a', 'b', 'c', 'd', 'e'], ....: index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])
In [36]: people.iloc[2:3, [1, 2]] = np.nan # Add a few NA values
In [37]: people Out[37]: a b c d e Joe 1.007189 -1.2962210.2749920.2289131.352917 Steve 0.886429 -2.001637 -0.3718431.669025 -0.438570 Wes -0.539741 NaN NaN -1.021228 -0.577087 Jim 0.1241210.3026140.5237720.0009401.343810 Travis -0.713544 -0.831154 -2.370232 -1.860761 -0.860757
In [39]: by_column = people.groupby(mapping, axis=1)
In [40]: by_column.sum() Out[40]: blue red Joe 0.5039051.063885 Steve 1.297183 -1.553778 Wes -1.021228 -1.116829 Jim 0.5247121.770545 Travis -4.230992 -2.405455
In [44]: people.groupby(len).sum() Out[44]: a b c d e 30.591569 -0.9936080.798764 -0.7913742.119639 50.886429 -2.001637 -0.3718431.669025 -0.438570 6 -0.713544 -0.831154 -2.370232 -1.860761 -0.860757
将函数跟数组、列表、字典、Series混合使用也不是问题,因为任何东西在内部都会被转换为数组:
1 2 3 4 5 6 7 8 9
In [45]: key_list = ['one', 'one', 'one', 'two', 'two']
In [46]: people.groupby([len, key_list]).min() Out[46]: a b c d e 3 one -0.539741 -1.2962210.274992 -1.021228 -0.577087 two 0.1241210.3026140.5237720.0009401.343810 5 one 0.886429 -2.001637 -0.3718431.669025 -0.438570 6 two -0.713544 -0.831154 -2.370232 -1.860761 -0.860757
In [51]: df Out[51]: data1 data2 key1 key2 0 -0.2047081.393406 a one 10.4789430.092908 a two 2 -0.5194390.281746 b one 3 -0.5557300.769023 b two 41.9657811.246435 a one
In [52]: grouped = df.groupby('key1')
In [53]: grouped['data1'].quantile(0.9) Out[53]: key1 a 1.668413 b -0.523068 Name: data1, dtype: float64
如果要使用你自己的聚合函数,只需将其传入aggregate或agg方法即可:
1 2 3 4 5 6 7 8
In [54]: defpeak_to_peak(arr): ....: return arr.max() - arr.min() In [55]: grouped.agg(peak_to_peak) Out[55]: data1 data2 key1 a 2.1704881.300498 b 0.0362920.487276
In [56]: grouped.describe() Out[56]: data1 \ count mean std min25% 50% 75% key1 a 3.00.7466721.109736 -0.2047080.1371180.4789431.222362 b 2.0 -0.5375850.025662 -0.555730 -0.546657 -0.537585 -0.528512 data2 \ max count mean std min25% 50% key1 a 1.9657813.00.9109160.7122170.0929080.6696711.246435 b -0.5194392.00.5253840.3445560.2817460.4035650.525384 75% max key1 a 1.3199201.393406 b 0.6472030.769023
## Add tip percentage of total bill In [58]: tips['tip_pct'] = tips['tip'] / tips['total_bill']
In [59]: tips[:6] Out[59]: total_bill tip smoker day time size tip_pct 016.991.01 No Sun Dinner 20.059447 110.341.66 No Sun Dinner 30.160542 221.013.50 No Sun Dinner 30.166587 323.683.31 No Sun Dinner 20.139780 424.593.61 No Sun Dinner 40.146808 525.294.71 No Sun Dinner 40.186240
In [60]: grouped = tips.groupby(['day', 'smoker'])
注意,对于表10-1中的那些描述统计,可以将函数名以字符串的形式传入:
1 2 3 4 5 6 7 8 9 10 11 12 13 14
In [61]: grouped_pct = grouped['tip_pct']
In [62]: grouped_pct.agg('mean') Out[62]: day smoker Fri No 0.151650 Yes 0.174783 Sat No 0.158048 Yes 0.147906 Sun No 0.160113 Yes 0.187250 Thur No 0.160298 Yes 0.163863 Name: tip_pct, dtype: float64
如果传入一组函数或函数名,得到的DataFrame的列就会以相应的函数命名:
1 2 3 4 5 6 7 8 9 10 11 12
In [63]: grouped_pct.agg(['mean', 'std', peak_to_peak]) Out[63]: mean std peak_to_peak day smoker Fri No 0.1516500.0281230.067349 Yes 0.1747830.0512930.159925 Sat No 0.1580480.0397670.235193 Yes 0.1479060.0613750.290095 Sun No 0.1601130.0423470.193226 Yes 0.1872500.1541340.644685 Thur No 0.1602980.0387740.193350 Yes 0.1638630.0393890.151240
In [64]: grouped_pct.agg([('foo', 'mean'), ('bar', np.std)]) Out[64]: foo bar day smoker Fri No 0.1516500.028123 Yes 0.1747830.051293 Sat No 0.1580480.039767 Yes 0.1479060.061375 Sun No 0.1601130.042347 Yes 0.1872500.154134 Thur No 0.1602980.038774 Yes 0.1638630.039389
In [66]: result = grouped['tip_pct', 'total_bill'].agg(functions)
In [67]: result Out[67]: tip_pct total_bill count mean max count mean max day smoker Fri No 40.1516500.187735418.42000022.75 Yes 150.1747830.2634801516.81333340.17 Sat No 450.1580480.2919904519.66177848.33 Yes 420.1479060.3257334221.27666750.81 Sun No 570.1601130.2526725720.50666748.17 Yes 190.1872500.7103451924.12000045.35 Thur No 450.1602980.2663124517.11311141.19 Yes 170.1638630.2412551719.19058843.11
In [68]: result['tip_pct'] Out[68]: count mean max day smoker Fri No 40.1516500.187735 Yes 150.1747830.263480 Sat No 450.1580480.291990 Yes 420.1479060.325733 Sun No 570.1601130.252672 Yes 190.1872500.710345 Thur No 450.1602980.266312 Yes 170.1638630.241255
跟前面一样,这里也可以传入带有自定义名称的一组元组:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
In [69]: ftuples = [('Durchschnitt', 'mean'),('Abweichung', np.var)]
In [70]: grouped['tip_pct', 'total_bill'].agg(ftuples) Out[70]: tip_pct total_bill Durchschnitt Abweichung Durchschnitt Abweichung day smoker Fri No 0.1516500.00079118.42000025.596333 Yes 0.1747830.00263116.81333382.562438 Sat No 0.1580480.00158119.66177879.908965 Yes 0.1479060.00376721.276667101.387535 Sun No 0.1601130.00179320.50666766.099980 Yes 0.1872500.02375724.120000109.046044 Thur No 0.1602980.00150317.11311159.625081 Yes 0.1638630.00155119.19058869.808518
In [71]: grouped.agg({'tip' : np.max, 'size' : 'sum'}) Out[71]: tip size day smoker Fri No 3.509 Yes 4.7331 Sat No 9.00115 Yes 10.00104 Sun No 6.00167 Yes 6.5049 Thur No 6.70112 Yes 5.0040
In [72]: grouped.agg({'tip_pct' : ['min', 'max', 'mean', 'std'], ....: 'size' : 'sum'}) Out[72]: tip_pct size minmax mean std sum day smoker Fri No 0.1203850.1877350.1516500.0281239 Yes 0.1035550.2634800.1747830.05129331 Sat No 0.0567970.2919900.1580480.039767115 Yes 0.0356380.3257330.1479060.061375104 Sun No 0.0594470.2526720.1601130.042347167 Yes 0.0656600.7103450.1872500.15413449 Thur No 0.0729610.2663120.1602980.038774112 Yes 0.0900140.2412550.1638630.03938940
In [73]: tips.groupby(['day', 'smoker'], as_index=False).mean() Out[73]: day smoker total_bill tip size tip_pct 0 Fri No 18.4200002.8125002.2500000.151650 1 Fri Yes 16.8133332.7140002.0666670.174783 2 Sat No 19.6617783.1028892.5555560.158048 3 Sat Yes 21.2766672.8754762.4761900.147906 4 Sun No 20.5066673.1678952.9298250.160113 5 Sun Yes 24.1200003.5168422.5789470.187250 6 Thur No 17.1131112.6737782.4888890.160298 7 Thur Yes 19.1905883.0300002.3529410.163863
In [74]: deftop(df, n=5, column='tip_pct'): ....: return df.sort_values(by=column)[-n:]
In [75]: top(tips, n=6) Out[75]: total_bill tip smoker day time size tip_pct 10914.314.00 Yes Sat Dinner 20.279525 18323.176.50 Yes Sun Dinner 40.280535 23211.613.39 No Sat Dinner 20.291990 673.071.00 Yes Sat Dinner 10.325733 1789.604.00 Yes Sun Dinner 20.416667 1727.255.15 Yes Sun Dinner 20.710345
现在,如果对smoker分组并用该函数调用apply,就会得到:
1 2 3 4 5 6 7 8 9 10 11 12 13 14
In [76]: tips.groupby('smoker').apply(top) Out[76]: total_bill tip smoker day time size tip_pct smoker No 8824.715.85 No Thur Lunch 20.236746 18520.695.00 No Sun Dinner 50.241663 5110.292.60 No Sun Dinner 20.252672 1497.512.00 No Thur Lunch 20.266312 23211.613.39 No Sat Dinner 20.291990 Yes 10914.314.00 Yes Sat Dinner 20.279525 18323.176.50 Yes Sun Dinner 40.280535 673.071.00 Yes Sat Dinner 10.325733 1789.604.00 Yes Sun Dinner 20.416667 1727.255.15 Yes Sun Dinner 20.710345
In [77]: tips.groupby(['smoker', 'day']).apply(top, n=1, column='total_bill') Out[77]: total_bill tip smoker day time size tip_pct smoker day No Fri 9422.753.25 No Fri Dinner 20.142857 Sat 21248.339.00 No Sat Dinner 40.186220 Sun 15648.175.00 No Sun Dinner 60.103799 Thur 14241.195.00 No Thur Lunch 50.121389 Yes Fri 9540.174.73 Yes Fri Dinner 40.117750 Sat 17050.8110.00 Yes Sat Dinner 30.196812 Sun 18245.353.50 Yes Sun Dinner 30.077178 Thur 19743.115.00 Yes Thur Lunch 40.115982
In [78]: result = tips.groupby('smoker')['tip_pct'].describe()
In [79]: result Out[79]: count mean std min25% 50% 75% \ smoker No 151.00.1593280.0399100.0567970.1369060.1556250.185014 Yes 93.00.1631960.0851190.0356380.1067710.1538460.195059 max smoker
No 0.291990 Yes 0.710345
In [80]: result.unstack('smoker') Out[80]: smoker count No 151.000000 Yes 93.000000 mean No 0.159328 Yes 0.163196 std No 0.039910 Yes 0.085119 min No 0.056797 Yes 0.035638 25% No 0.136906 Yes 0.106771 50% No 0.155625 Yes 0.153846 75% No 0.185014 Yes 0.195059 max No 0.291990 Yes 0.710345 dtype: float64
In [81]: tips.groupby('smoker', group_keys=False).apply(top) Out[81]: total_bill tip smoker day time size tip_pct 8824.715.85 No Thur Lunch 20.236746 18520.695.00 No Sun Dinner 50.241663 5110.292.60 No Sun Dinner 20.252672 1497.512.00 No Thur Lunch 20.266312 23211.613.39 No Sat Dinner 20.291990 10914.314.00 Yes Sat Dinner 20.279525 18323.176.50 Yes Sun Dinner 40.280535 673.071.00 Yes Sat Dinner 10.325733 1789.604.00 Yes Sun Dinner 20.416667 1727.255.15 Yes Sun Dinner 20.710345
In [95]: states = ['Ohio', 'New York', 'Vermont', 'Florida', ....: 'Oregon', 'Nevada', 'California', 'Idaho']
In [96]: group_key = ['East'] * 4 + ['West'] * 4
In [97]: data = pd.Series(np.random.randn(8), index=states)
In [98]: data Out[98]: Ohio 0.922264 New York -2.153545 Vermont -0.365757 Florida -0.375842 Oregon 0.329939 Nevada 0.981994 California 1.105913 Idaho -1.613716 dtype: float64
In [99]: data[['Vermont', 'Nevada', 'Idaho']] = np.nan
In [100]: data Out[100]: Ohio 0.922264 New York -2.153545 Vermont NaN Florida -0.375842 Oregon 0.329939 Nevada NaN California 1.105913 Idaho NaN dtype: float64
In [101]: data.groupby(group_key).mean() Out[101]: East -0.535707 West 0.717926 dtype: float64
我们可以用分组平均值去填充NA值:
1 2 3 4 5 6 7 8 9 10 11 12 13
In [102]: fill_mean = lambda g: g.fillna(g.mean())
In [103]: data.groupby(group_key).apply(fill_mean) Out[103]: Ohio 0.922264 New York -2.153545 Vermont -0.535707 Florida -0.375842 Oregon 0.329939 Nevada 0.717926 California 1.105913 Idaho 0.717926 dtype: float64
另外,也可以在代码中预定义各组的填充值。由于分组具有一个name属性,所以我们可以拿来用一下:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
In [104]: fill_values = {'East': 0.5, 'West': -1}
In [105]: fill_func = lambda g: g.fillna(fill_values[g.name])
In [106]: data.groupby(group_key).apply(fill_func) Out[106]: Ohio 0.922264 New York -2.153545 Vermont 0.500000 Florida -0.375842 Oregon 0.329939 Nevada -1.000000 California 1.105913 Idaho -1.000000 dtype: float64
10.3.4 示例:随机采样和排列
假设你想要从一个大数据集中随机抽取(进行替换或不替换)样本以进行蒙特卡罗模拟(Monte Carlo simulation)或其他分析工作。“抽取”的方式有很多,这里使用的方法是对Series使用sample方法:
1 2 3 4 5 6 7 8 9
## Hearts, Spades, Clubs, Diamonds suits = ['H', 'S', 'C', 'D'] card_val = (list(range(1, 11)) + [10] * 3) * 4 base_names = ['A'] + list(range(2, 11)) + ['J', 'K', 'Q'] cards = [] for suit in ['H', 'S', 'C', 'D']: cards.extend(str(num) + suit for num in base_names)
In [115]: df Out[115]: category data weights 0 a 1.5615870.957515 1 a 1.2199840.347267 2 a -0.4822390.581362 3 a 0.3156670.217091 4 b -0.0478520.894406 5 b -0.4541450.918564 6 b -0.5567740.277825 7 b 0.2533210.955905
然后可以利用category计算分组加权平均数:
1 2 3 4 5 6 7 8 9 10
In [116]: grouped = df.groupby('category')
In [117]: get_wavg = lambda g: np.average(g['data'], weights=g['weights'])
In [118]: grouped.apply(get_wavg) Out[118]: category a 0.811643 b -0.122262 dtype: float64
顺着上一个例子继续,你可以用groupby执行更为复杂的分组统计分析,只要函数返回的是pandas对象或标量值即可。例如,我可以定义下面这个regress函数(利用statsmodels计量经济学库)对各数据块执行普通最小二乘法(Ordinary Least Squares,OLS)回归:
1 2 3 4 5 6 7
import statsmodels.api as sm defregress(data, yvar, xvars): Y = data[yvar] X = data[xvars] X['intercept'] = 1. result = sm.OLS(Y, X).fit() return result.params
In [130]: tips.pivot_table(index=['day', 'smoker']) Out[130]: size tip tip_pct total_bill day smoker Fri No 2.2500002.8125000.15165018.420000 Yes 2.0666672.7140000.17478316.813333 Sat No 2.5555563.1028890.15804819.661778 Yes 2.4761902.8754760.14790621.276667 Sun No 2.9298253.1678950.16011320.506667 Yes 2.5789473.5168420.18725024.120000 Thur No 2.4888892.6737780.16029817.113111 Yes 2.3529413.0300000.16386319.190588
In [131]: tips.pivot_table(['tip_pct', 'size'], index=['time', 'day'], .....: columns='smoker') Out[131]: size tip_pct smoker No Yes No Yes time day Dinner Fri 2.0000002.2222220.1396220.165347 Sat 2.5555562.4761900.1580480.147906 Sun 2.9298252.5789470.1601130.187250 Thur 2.000000 NaN 0.159744 NaN Lunch Fri 3.0000001.8333330.1877350.188937 Thur 2.5000002.3529410.1603110.163863
In [132]: tips.pivot_table(['tip_pct', 'size'], index=['time', 'day'], .....: columns='smoker', margins=True) Out[132]: size tip_pct smoker No Yes All No Yes All time day Dinner Fri 2.0000002.2222222.1666670.1396220.1653470.158916 Sat 2.5555562.4761902.5172410.1580480.1479060.153152 Sun 2.9298252.5789472.8421050.1601130.1872500.166897 Thur 2.000000 NaN 2.0000000.159744 NaN 0.159744 Lunch Fri 3.0000001.8333332.0000000.1877350.1889370.188765 Thur 2.5000002.3529412.4590160.1603110.1638630.161301 All 2.6688742.4086022.5696720.1593280.1631960.160803
In [133]: tips.pivot_table('tip_pct', index=['time', 'smoker'], columns='day', .....: aggfunc=len, margins=True) Out[133]: day Fri Sat Sun Thur All time smoker Dinner No 3.045.057.01.0106.0 Yes 9.042.019.0 NaN 70.0 Lunch No 1.0 NaN NaN 44.045.0 Yes 6.0 NaN NaN 17.023.0 All 19.087.076.062.0244.0
In [134]: tips.pivot_table('tip_pct', index=['time', 'size', 'smoker'], .....: columns='day', aggfunc='mean', fill_value=0) Out[134]: day Fri Sat Sun Thur time size smoker Dinner 1 No 0.0000000.1379310.0000000.000000 Yes 0.0000000.3257330.0000000.000000 2 No 0.1396220.1627050.1688590.159744 Yes 0.1712970.1486680.2078930.000000 3 No 0.0000000.1546610.1526630.000000 Yes 0.0000000.1449950.1526600.000000 4 No 0.0000000.1500960.1481430.000000 Yes 0.1177500.1245150.1933700.000000 5 No 0.0000000.0000000.2069280.000000 Yes 0.0000000.1065720.0656600.000000 ... ... ... ... ... Lunch 1 No 0.0000000.0000000.0000000.181728 Yes 0.2237760.0000000.0000000.000000 2 No 0.0000000.0000000.0000000.166005 Yes 0.1819690.0000000.0000000.158843 3 No 0.1877350.0000000.0000000.084246 Yes 0.0000000.0000000.0000000.204952 4 No 0.0000000.0000000.0000000.138919 Yes 0.0000000.0000000.0000000.155410 5 No 0.0000000.0000000.0000000.121389 6 No 0.0000000.0000000.0000000.173706 [21 rows x 4 columns]
In [138]: data Out[138]: Sample Nationality Handedness 01 USA Right-handed 12 Japan Left-handed 23 USA Right-handed 34 Japan Right-handed 45 Japan Left-handed 56 Japan Right-handed 67 USA Right-handed 78 USA Left-handed 89 Japan Right-handed 910 USA Right-handed
In [139]: pd.crosstab(data.Nationality, data.Handedness, margins=True) Out[139]: Handedness Left-handed Right-handed All Nationality Japan 235 USA 145 All 3710
crosstab的前两个参数可以是数组或Series,或是数组列表。就像小费数据:
1 2 3 4 5 6 7 8 9 10 11
In [140]: pd.crosstab([tips.time, tips.day], tips.smoker, margins=True) Out[140]: smoker No Yes All time day Dinner Fri 3912 Sat 454287 Sun 571976 Thur 101 Lunch Fri 167 Thur 441761 All 15193244