Group By and Aggregation with Pyspark
• 6 min read
"Group By and Aggregation with Pyspark"
- toc:true- branch: master- badges: true- comments: true
- author: David Kearney
- categories: [pyspark, jupyter]
- description: Group By and Aggregation with Pyspark
- title: Group By and Aggregation with Pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import countDistinct, avg,stddev
# Load data from a CSV
file_location = "/FileStore/tables/df_panel_fix.csv"
df = spark.read.format("CSV").option("inferSchema", True).option("header", True).load(file_location)
display(df.take(5))
_c0 | province | specific | general | year | gdp | fdi | rnr | rr | i | fr | reg | it |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Anhui | 147002.0 | null | 1996 | 2093.3 | 50661 | 0.0 | 0.0 | 0.0 | 1128873 | East China | 631930 |
1 | Anhui | 151981.0 | null | 1997 | 2347.32 | 43443 | 0.0 | 0.0 | 0.0 | 1356287 | East China | 657860 |
2 | Anhui | 174930.0 | null | 1998 | 2542.96 | 27673 | 0.0 | 0.0 | 0.0 | 1518236 | East China | 889463 |
3 | Anhui | 285324.0 | null | 1999 | 2712.34 | 26131 | null | null | null | 1646891 | East China | 1227364 |
4 | Anhui | 195580.0 | 32100.0 | 2000 | 2902.09 | 31847 | 0.0 | 0.0 | 0.0 | 1601508 | East China | 1499110 |
df.printSchema()
root
-- _c0: integer (nullable = true)
-- province: string (nullable = true)
-- specific: double (nullable = true)
-- general: double (nullable = true)
-- year: integer (nullable = true)
-- gdp: double (nullable = true)
-- fdi: integer (nullable = true)
-- rnr: double (nullable = true)
-- rr: double (nullable = true)
-- i: double (nullable = true)
-- fr: string (nullable = true)
-- reg: string (nullable = true)
-- it: integer (nullable = true)
df.groupBy("province")
Out[8]: <pyspark.sql.group.GroupedData at 0x7f939a0aada0>
df.groupBy("province").mean().show()
+------------+--------+------------------+------------------+---------+------------------+------------------+--------------------+--------------------+--------------------+------------------+
province|avg(_c0)| avg(specific)| avg(general)|avg(year)| avg(gdp)| avg(fdi)| avg(rnr)| avg(rr)| avg(i)| avg(it)|
+------------+--------+------------------+------------------+---------+------------------+------------------+--------------------+--------------------+--------------------+------------------+
Guangdong| 65.5|1123328.0833333333| 312308.0| 2001.5|15358.781666666668| 1194950.25|0.011261261250000001| 0.0| 0.0| 3099014.25|
Hunan| 161.5| 824676.9166666666| 480788.3333333333| 2001.5| 4765.891666666666| 132110.25| 0.0| 0.07291666666666667| 0.0| 3215128.5|
Shanxi| 281.5| 577540.4166666666| 351680.0| 2001.5| 2817.210833333333|38628.833333333336| 0.0| 0.0| 0.0|1983718.3333333333|
Tibet| 317.5|189219.91666666666|165365.33333333334| 2001.5|170.42666666666665| 839.75| 0.03030303033333333| 0.15583333333333335| 0.20278090583333333|1174175.5833333333|
Hubei| 149.5| 595463.25| 391326.5| 2001.5| 4772.503333333333| 149713.25| 0.045045045| 0.11386386375000002| 0.06230392158333333| 2904659.75|
Tianjin| 305.5| 76884.16666666667| 126636.0| 2001.5|2528.6650000000004|250173.33333333334| 0.0| 0.0| 0.0| 831028.4166666666|
Beijing| 17.5| 581440.8333333334| 412825.0| 2001.5| 4673.453333333333|257369.33333333334| 0.0| 0.3613053613636364| 0.29545454545454547|1175965.4166666667|
Heilongjiang| 125.5|1037878.1666666666| 315925.3333333333| 2001.5| 4041.241666666667| 82719.33333333333| 0.0| 0.0| 0.03931203927272728|3230451.1666666665|
Liaoning| 209.5| 1111002.75|185280.83333333334| 2001.5| 5231.135000000001| 285925.3333333333| 0.11469534044444446| 0.0| null|2628358.4166666665|
Henan| 137.5| 955407.4166666666| 673392.5| 2001.5| 7208.966666666667| 94426.0| 0.0| 0.04| 0.08602150533333335|3671970.6666666665|
Anhui| 5.5| 643984.1666666666|159698.83333333334| 2001.5|3905.8700000000003| 70953.08333333333| 0.0| 0.0| 0.08845208836363637|2649674.4166666665|
Xinjiang| 329.5| 345334.3333333333| 412906.0| 2001.5|1828.8966666666665| 4433.083333333333| 0.0| 0.0| 0.0| 2251012.0|
Fujian| 41.5|246144.16666666666|140619.33333333334| 2001.5|4864.0233333333335| 374466.4166666667| 0.1366666666666667|0.049999999999999996| 0.09999999999999999| 1274116.75|
Jiangxi| 185.5| 592906.3333333334| 458268.6666666667| 2001.5| 2460.7825| 103735.25| 0.0| 0.1491841490909091|0.042727272727272725| 1760613.25|
Jilin| 197.5| 711132.25| 348186.0| 2001.5|2274.8541666666665|41226.583333333336| 0.0| 0.0| 0.0|2136634.9166666665|
Chongqing| 29.5| 561854.1111111111| 151201.4| 2001.5| 2477.7125|41127.833333333336| 0.09677419400000001| 0.0| 0.0|1636146.4166666667|
Shaanxi| 245.5| 387167.1666666667| 386760.5| 2001.5| 2658.034166666667|50892.583333333336|0.002840909090909091| 0.0| 0.07386363636363637|2474031.4166666665|
Sichuan| 293.5| 1194640.5| 707032.8333333334| 2001.5| 5377.79|62197.166666666664| 0.00818181818181818| 0.00818181818181818| 0.2|4016479.5833333335|
Yunnan| 341.5| 802151.1666666666| 200426.0| 2001.5| 2604.054166666667|17048.333333333332| 0.0| 0.0| 0.0|3165418.9166666665|
Gansu| 53.5| 498930.9166666667| 382092.6666666667| 2001.5|1397.8325000000002| 5295.5| 0.11111111120000002| 0.088974359| 0.13038461533333334| 2045347.0|
+------------+--------+------------------+------------------+---------+------------------+------------------+--------------------+--------------------+--------------------+------------------+
only showing top 20 rows
df.groupBy("reg").mean().show()
+-------------------+------------------+------------------+------------------+---------+------------------+------------------+--------------------+--------------------+--------------------+------------------+
reg| avg(_c0)| avg(specific)| avg(general)|avg(year)| avg(gdp)| avg(fdi)| avg(rnr)| avg(rr)| avg(i)| avg(it)|
+-------------------+------------------+------------------+------------------+---------+------------------+------------------+--------------------+--------------------+--------------------+------------------+
Southwest China| 214.3| 648086.8070175438| 327627.0| 2001.5|2410.3988333333336|25405.083333333332| 0.01764440930612245|0.053185448081632655| 0.13679739081632653| 2424971.4|
East China|183.78571428571428|517524.90476190473|230217.37142857144| 2001.5| 7126.732976190476|414659.03571428574| 0.08284508739240506| 0.05701117448101268| 0.09036240282278483|1949130.4761904762|
Northeast China| 177.5| 953337.7222222222|283130.72222222225| 2001.5| 3849.076944444444| 136623.75| 0.03686635942857143| 0.0| 0.02275960168421053|2665148.1666666665|
North China| 179.5|506433.57446808513|334689.14285714284| 2001.5| 4239.038541666667|169600.58333333334| 0.0| 0.15428824051724138| 0.11206896551724138|1733718.7291666667|
Northwest China| 216.7|324849.06666666665|293066.73333333334| 2001.5|1340.0261666666668|15111.133333333333|0.022847222240000003|0.033887245249999996|0.048179240615384616| 1703537.75|
South Central China| 115.5| 690125.8333333334| 382414.8888888889| 2001.5| 5952.826944444445|281785.59722222225|0.014928879322033899| 0.07324349771186443| 0.06797753142372882| 2626299.875|
+-------------------+------------------+------------------+------------------+---------+------------------+------------------+--------------------+--------------------+--------------------+------------------+
# Count
df.groupBy("reg").count().show()
+-------------------+-----+
reg|count|
+-------------------+-----+
Southwest China| 60|
East China| 84|
Northeast China| 36|
North China| 48|
Northwest China| 60|
South Central China| 72|
+-------------------+-----+
# Max
df.groupBy("reg").max().show()
+-------------------+--------+-------------+------------+---------+--------+--------+------------------+-----------+-------------------+--------+
reg|max(_c0)|max(specific)|max(general)|max(year)|max(gdp)|max(fdi)| max(rnr)| max(rr)| max(i)| max(it)|
+-------------------+--------+-------------+------------+---------+--------+--------+------------------+-----------+-------------------+--------+
Southwest China| 347| 3937966.0| 1725100.0| 2007|10562.39| 149322| 0.181818182| 0.84| 0.75|10384846|
East China| 359| 2213991.0| 1272600.0| 2007|25776.91| 1743140| 1.214285714| 0.53| 0.6| 7040099|
Northeast China| 215| 3847672.0| 1046700.0| 2007| 9304.52| 598554| 0.516129032| 0.0|0.21621621600000002| 7968319|
North China| 311| 2981235.0| 1023453.0| 2007|13607.32| 527776| 0.0|0.794871795| 0.6| 7537692|
Northwest China| 335| 2669238.0| 1197400.0| 2007| 5757.29| 119516|0.5555555560000001| 0.5| 1.05| 6308151|
South Central China| 167| 3860764.0| 1737800.0| 2007|31777.01| 1712603| 0.27027027| 0.4375| 0.6176470589999999|10533312|
+-------------------+--------+-------------+------------+---------+--------+--------+------------------+-----------+-------------------+--------+
# Min
df.groupBy("reg").min().show()
+-------------------+--------+-------------+------------+---------+--------+--------+--------+-------+------+-------+
reg|min(_c0)|min(specific)|min(general)|min(year)|min(gdp)|min(fdi)|min(rnr)|min(rr)|min(i)|min(it)|
+-------------------+--------+-------------+------------+---------+--------+--------+--------+-------+------+-------+
Southwest China| 24| 18829.0| 18700.0| 1996| 64.98| 2| 0.0| 0.0| 0.0| 176802|
East China| 0| 8964.0| 0.0| 1996| 1169.73| 22724| 0.0| 0.0| 0.0| 489132|
Northeast China| 120| 80595.0| 19360.0| 1996| 1137.23| 19059| 0.0| 0.0| 0.0| 625471|
North China| 12| 35084.0| 32119.0| 1996| 1121.93| 13802| 0.0| 0.0| 0.0| 303992|
Northwest China| 48| 32088.0| 2990.0| 1996| 184.17| 247| 0.0| 0.0| 0.0| 178668|
South Central China| 60| 54462.0| 0.0| 1996| 389.68| 29579| 0.0| 0.0| 0.0| 147897|
+-------------------+--------+-------------+------------+---------+--------+--------+--------+-------+------+-------+
# Sum
df.groupBy("reg").sum().show()
+-------------------+--------+-------------+------------+---------+------------------+--------+------------------+------------------+-------------------+---------+
reg|sum(_c0)|sum(specific)|sum(general)|sum(year)| sum(gdp)|sum(fdi)| sum(rnr)| sum(rr)| sum(i)| sum(it)|
+-------------------+--------+-------------+------------+---------+------------------+--------+------------------+------------------+-------------------+---------+
Southwest China| 12858| 3.6940948E7| 9501183.0| 120090|144623.93000000002| 1524305| 0.864576056| 2.606086956| 6.70307215|145498284|
East China| 15438| 4.3472092E7| 8057608.0| 168126| 598645.57|34831359| 6.544761904| 4.503882784000002| 7.138629823000002|163726960|
Northeast China| 6390| 3.4320158E7| 5096353.0| 72054| 138566.77| 4918455| 1.032258064| 0.0|0.43243243200000003| 95945334|
North China| 8616| 2.3802378E7| 7028472.0| 96072| 203473.85| 8140828| 0.0| 4.474358975| 3.25| 83218499|
Northwest China| 13002| 1.9490944E7| 8792002.0| 120090| 80401.57| 906668|1.1423611120000001|1.7621367529999998| 2.505320512|102212265|
South Central China| 8316| 4.968906E7| 1.3766936E7| 144108|428603.54000000004|20288563| 0.88080388| 4.321366365000001| 4.010674354000001|189093591|
+-------------------+--------+-------------+------------+---------+------------------+--------+------------------+------------------+-------------------+---------+
# Max it across everything
df.agg({'specific':'max'}).show()
+-------------+
max(specific)|
+-------------+
3937966.0|
+-------------+
grouped = df.groupBy("reg")
grouped.agg({"it":'max'}).show()
+-------------------+--------+
reg| max(it)|
+-------------------+--------+
Southwest China|10384846|
East China| 7040099|
Northeast China| 7968319|
North China| 7537692|
Northwest China| 6308151|
South Central China|10533312|
+-------------------+--------+
df.select(countDistinct("reg")).show()
+-------------------+
count(DISTINCT reg)|
+-------------------+
6|
+-------------------+
df.select(countDistinct("reg").alias("Distinct Region")).show()
+---------------+
Distinct Region|
+---------------+
6|
+---------------+
df.select(avg('specific')).show()
+-----------------+
avg(specific)|
+-----------------+
583470.7303370787|
+-----------------+
df.select(stddev("specific")).show()
+---------------------+
stddev_samp(specific)|
+---------------------+
654055.3290782663|
+---------------------+
from pyspark.sql.functions import format_number
specific_std = df.select(stddev("specific").alias('std'))
specific_std.show()
+-----------------+
std|
+-----------------+
654055.3290782663|
+-----------------+
specific_std.select(format_number('std',0)).show()
+---------------------+
format_number(std, 0)|
+---------------------+
654,055|
+---------------------+
df.orderBy("specific").show()
+---+---------+--------+-------+----+--------+------+-----------+----+----+-------+---------------+-------+
_c0| province|specific|general|year| gdp| fdi| rnr| rr| i| fr| reg| it|
+---+---------+--------+-------+----+--------+------+-----------+----+----+-------+---------------+-------+
28|Chongqing| null| null|2000| 1791.0| 24436| null|null|null| null|Southwest China|1022148|
109| Hebei| null| null|1997| 3953.78|110064| null|null|null| null| North China| 826734|
24|Chongqing| null| null|1996| 1315.12| 21878| null|null|null| null|Southwest China| 176802|
25|Chongqing| null| null|1997| 1509.75| 38675| null|null|null| null|Southwest China| 383402|
268| Shanghai| 8964.0| null|2000| 4771.17|316014| 0.0| 0.0|0.44|2224124| East China|1212473|
269| Shanghai| 9834.0| null|2001| 5210.12|429159| 0.0| 0.0|0.44|2947285| East China|1053917|
312| Tibet| 18829.0| null|1996| 64.98| 679|0.181818182| 0.0| 0.0| 27801|Southwest China| 306114|
270| Shanghai| 19985.0| null|2002| 5741.03|427229| 0.0| 0.0|0.44|3380397| East China|1572208|
271| Shanghai| 23547.0| null|2003| 6694.23|546849| 0.0|0.53| 0.0|4461153| East China|2031496|
313| Tibet| 25185.0| null|1997| 77.24| 63|0.181818182| 0.0| 0.0| 33787|Southwest China| 346368|
273| Shanghai| 29943.0| null|2005| 9247.66|685000| 0.0|0.53| 0.0| null| East China|2140461|
272| Shanghai| 29943.0| null|2004| 8072.83|654100| 0.0|0.53| 0.0| null| East China|2703643|
216| Ningxia| 32088.0| null|1996| 202.9| 2826| null|null|null| 90805|Northwest China| 178668|
305| Tianjin| 35084.0| null|2001| 1919.09|213348| 0.0| 0.0| 0.0| 942763| North China| 688810|
228| Qinghai| 37976.0| null|1996| 184.17| 576| null|null|null| 73260|Northwest China| 218361|
302| Tianjin| 39364.0| null|1998| 1374.6|211361| null|null|null| 540178| North China| 361723|
274| Shanghai| 42928.0| null|2006|10572.24|710700| 0.0|0.53| 0.0|8175966| East China|2239987|
217| Ningxia| 44267.0| null|1997| 224.59| 671| null|null|null| 102083|Northwest China| 195295|
303| Tianjin| 45463.0| null|1999| 1500.95|176399| 0.0| 0.0| 0.0| 605662| North China| 422522|
314| Tibet| 48197.0| null|1998| 91.5| 481| 0.0|0.24| 0.0| 3810|Southwest China| 415547|
+---+---------+--------+-------+----+--------+------+-----------+----+----+-------+---------------+-------+
only showing top 20 rows
df.orderBy(df["specific"].desc()).show()
+---+------------+---------+---------+----+--------+-------+-----------+-----------+-------------------+--------+-------------------+--------+
_c0| province| specific| general|year| gdp| fdi| rnr| rr| i| fr| reg| it|
+---+------------+---------+---------+----+--------+-------+-----------+-----------+-------------------+--------+-------------------+--------+
299| Sichuan|3937966.0|1725100.0|2007|10562.39| 149322| null| null| null| 8508606| Southwest China|10384846|
143| Henan|3860764.0|1737800.0|2007|15012.46| 306162| 0.0| 0.0| 0.0| 8620804|South Central China|10533312|
131|Heilongjiang|3847672.0|1046700.0|2007| 7104.0| 208508| 0.0| 0.0|0.21621621600000002| 4404689| Northeast China| 7968319|
215| Liaoning|3396397.0| 599600.0|2007| 9304.52| 598554|0.516129032| 0.0| null|10826948| Northeast China| 5502192|
167| Hunan|3156087.0|1329200.0|2007| 9439.6| 327051| 0.0| 0.4375| 0.0| 6065508|South Central China| 8340692|
119| Hebei|2981235.0| 694400.0|2007|13607.32| 241621| 0.0| 0.5| 0.0| 7891198| North China| 7537692|
155| Hubei|2922784.0|1263500.0|2007| 9333.4| 276622| 0.0|0.111111111| 0.0| 5903552|South Central China| 7666512|
251| Shaanxi|2669238.0|1081000.0|2007| 5757.29| 119516| 0.03125| 0.0| 0.8125| 4752398| Northwest China| 6308151|
203| Jilin|2663667.0|1016400.0|2007| 4275.12| 76064| 0.0| 0.0| 0.0| 3206892| Northeast China| 4607955|
347| Yunnan|2482173.0| 564400.0|2007| 4772.52| 39453| 0.0| 0.0| 0.0| 4867146| Southwest China| 6832541|
298| Sichuan|2225220.0|1187958.0|2006| 8690.24| 120819| 0.0| 0.0| 0.55| 4247403| Southwest China| 7646885|
11| Anhui|2213991.0| 178705.0|2007| 7360.92| 299892| 0.0| 0.0| 0.324324324| 4468640| East China| 7040099|
287| Shanxi|2189020.0| 661200.0|2007| 6024.45| 134283| null| null| null| 5978870| North China| 5070166|
263| Shandong|2121243.0| 581800.0|2007|25776.91|1101159| 0.0| 0.0| 0.0|16753980| East China| 6357869|
191| Jiangxi|2045869.0|1272600.0|2007| 4820.53| 280657| 0.0| 0.41025641| 0.0| 3898510| East China| 4229821|
83| Guangxi|2022957.0|1214100.0|2007| 5823.41| 68396|0.205128205| 0.0|0.23076923100000002| 4188265|South Central China| 6185600|
142| Henan|2018158.0|1131615.0|2006|12362.79| 184526| 0.0| 0.0| 0.0| 6212824|South Central China| 7601825|
59| Gansu|2010553.0|1039400.0|2007| 2703.98| 11802| null| 0.0| 1.05| 1909107| Northwest China| 5111059|
95| Guizhou|1956261.0|1239200.0|2007| 2884.11| 12651| 0.0| 0.0| 0.7105263159999999| 2851375| Southwest China| 5639838|
214| Liaoning|1947031.0| 179893.0|2006| 8047.26| 359000|0.516129032| 0.0| null| 6530236| Northeast China| 4605917|
+---+------------+---------+---------+----+--------+-------+-----------+-----------+-------------------+--------+-------------------+--------+
only showing top 20 rows
This post includes code adapted from Spark and Python for Big Data udemy course and Spark and Python for Big Data notebooks.