"Group By and Aggregation with Pyspark"

  • toc:true- branch: master- badges: true- comments: true
  • author: David Kearney
  • categories: [pyspark, jupyter]
  • description: Group By and Aggregation with Pyspark
  • title: Group By and Aggregation with Pyspark

Read CSV and inferSchema

from pyspark.sql import SparkSession
from pyspark.sql.functions import countDistinct, avg,stddev


# Load data from a CSV
file_location = "/FileStore/tables/df_panel_fix.csv"
df = spark.read.format("CSV").option("inferSchema", True).option("header", True).load(file_location)
display(df.take(5))
_c0provincespecificgeneralyeargdpfdirnrrrifrregit
0Anhui147002.0null19962093.3506610.00.00.01128873East China631930
1Anhui151981.0null19972347.32434430.00.00.01356287East China657860
2Anhui174930.0null19982542.96276730.00.00.01518236East China889463
3Anhui285324.0null19992712.3426131nullnullnull1646891East China1227364
4Anhui195580.032100.020002902.09318470.00.00.01601508East China1499110
df.printSchema()
root -- _c0: integer (nullable = true) -- province: string (nullable = true) -- specific: double (nullable = true) -- general: double (nullable = true) -- year: integer (nullable = true) -- gdp: double (nullable = true) -- fdi: integer (nullable = true) -- rnr: double (nullable = true) -- rr: double (nullable = true) -- i: double (nullable = true) -- fr: string (nullable = true) -- reg: string (nullable = true) -- it: integer (nullable = true)

Using groupBy for Averages and Counts

df.groupBy("province")
Out[8]: <pyspark.sql.group.GroupedData at 0x7f939a0aada0>
df.groupBy("province").mean().show()
+------------+--------+------------------+------------------+---------+------------------+------------------+--------------------+--------------------+--------------------+------------------+ province|avg(_c0)| avg(specific)| avg(general)|avg(year)| avg(gdp)| avg(fdi)| avg(rnr)| avg(rr)| avg(i)| avg(it)| +------------+--------+------------------+------------------+---------+------------------+------------------+--------------------+--------------------+--------------------+------------------+ Guangdong| 65.5|1123328.0833333333| 312308.0| 2001.5|15358.781666666668| 1194950.25|0.011261261250000001| 0.0| 0.0| 3099014.25| Hunan| 161.5| 824676.9166666666| 480788.3333333333| 2001.5| 4765.891666666666| 132110.25| 0.0| 0.07291666666666667| 0.0| 3215128.5| Shanxi| 281.5| 577540.4166666666| 351680.0| 2001.5| 2817.210833333333|38628.833333333336| 0.0| 0.0| 0.0|1983718.3333333333| Tibet| 317.5|189219.91666666666|165365.33333333334| 2001.5|170.42666666666665| 839.75| 0.03030303033333333| 0.15583333333333335| 0.20278090583333333|1174175.5833333333| Hubei| 149.5| 595463.25| 391326.5| 2001.5| 4772.503333333333| 149713.25| 0.045045045| 0.11386386375000002| 0.06230392158333333| 2904659.75| Tianjin| 305.5| 76884.16666666667| 126636.0| 2001.5|2528.6650000000004|250173.33333333334| 0.0| 0.0| 0.0| 831028.4166666666| Beijing| 17.5| 581440.8333333334| 412825.0| 2001.5| 4673.453333333333|257369.33333333334| 0.0| 0.3613053613636364| 0.29545454545454547|1175965.4166666667| Heilongjiang| 125.5|1037878.1666666666| 315925.3333333333| 2001.5| 4041.241666666667| 82719.33333333333| 0.0| 0.0| 0.03931203927272728|3230451.1666666665| Liaoning| 209.5| 1111002.75|185280.83333333334| 2001.5| 5231.135000000001| 285925.3333333333| 0.11469534044444446| 0.0| null|2628358.4166666665| Henan| 137.5| 955407.4166666666| 673392.5| 2001.5| 7208.966666666667| 94426.0| 0.0| 0.04| 0.08602150533333335|3671970.6666666665| Anhui| 5.5| 643984.1666666666|159698.83333333334| 2001.5|3905.8700000000003| 70953.08333333333| 0.0| 0.0| 0.08845208836363637|2649674.4166666665| Xinjiang| 329.5| 345334.3333333333| 412906.0| 2001.5|1828.8966666666665| 4433.083333333333| 0.0| 0.0| 0.0| 2251012.0| Fujian| 41.5|246144.16666666666|140619.33333333334| 2001.5|4864.0233333333335| 374466.4166666667| 0.1366666666666667|0.049999999999999996| 0.09999999999999999| 1274116.75| Jiangxi| 185.5| 592906.3333333334| 458268.6666666667| 2001.5| 2460.7825| 103735.25| 0.0| 0.1491841490909091|0.042727272727272725| 1760613.25| Jilin| 197.5| 711132.25| 348186.0| 2001.5|2274.8541666666665|41226.583333333336| 0.0| 0.0| 0.0|2136634.9166666665| Chongqing| 29.5| 561854.1111111111| 151201.4| 2001.5| 2477.7125|41127.833333333336| 0.09677419400000001| 0.0| 0.0|1636146.4166666667| Shaanxi| 245.5| 387167.1666666667| 386760.5| 2001.5| 2658.034166666667|50892.583333333336|0.002840909090909091| 0.0| 0.07386363636363637|2474031.4166666665| Sichuan| 293.5| 1194640.5| 707032.8333333334| 2001.5| 5377.79|62197.166666666664| 0.00818181818181818| 0.00818181818181818| 0.2|4016479.5833333335| Yunnan| 341.5| 802151.1666666666| 200426.0| 2001.5| 2604.054166666667|17048.333333333332| 0.0| 0.0| 0.0|3165418.9166666665| Gansu| 53.5| 498930.9166666667| 382092.6666666667| 2001.5|1397.8325000000002| 5295.5| 0.11111111120000002| 0.088974359| 0.13038461533333334| 2045347.0| +------------+--------+------------------+------------------+---------+------------------+------------------+--------------------+--------------------+--------------------+------------------+ only showing top 20 rows
df.groupBy("reg").mean().show()
+-------------------+------------------+------------------+------------------+---------+------------------+------------------+--------------------+--------------------+--------------------+------------------+ reg| avg(_c0)| avg(specific)| avg(general)|avg(year)| avg(gdp)| avg(fdi)| avg(rnr)| avg(rr)| avg(i)| avg(it)| +-------------------+------------------+------------------+------------------+---------+------------------+------------------+--------------------+--------------------+--------------------+------------------+ Southwest China| 214.3| 648086.8070175438| 327627.0| 2001.5|2410.3988333333336|25405.083333333332| 0.01764440930612245|0.053185448081632655| 0.13679739081632653| 2424971.4| East China|183.78571428571428|517524.90476190473|230217.37142857144| 2001.5| 7126.732976190476|414659.03571428574| 0.08284508739240506| 0.05701117448101268| 0.09036240282278483|1949130.4761904762| Northeast China| 177.5| 953337.7222222222|283130.72222222225| 2001.5| 3849.076944444444| 136623.75| 0.03686635942857143| 0.0| 0.02275960168421053|2665148.1666666665| North China| 179.5|506433.57446808513|334689.14285714284| 2001.5| 4239.038541666667|169600.58333333334| 0.0| 0.15428824051724138| 0.11206896551724138|1733718.7291666667| Northwest China| 216.7|324849.06666666665|293066.73333333334| 2001.5|1340.0261666666668|15111.133333333333|0.022847222240000003|0.033887245249999996|0.048179240615384616| 1703537.75| South Central China| 115.5| 690125.8333333334| 382414.8888888889| 2001.5| 5952.826944444445|281785.59722222225|0.014928879322033899| 0.07324349771186443| 0.06797753142372882| 2626299.875| +-------------------+------------------+------------------+------------------+---------+------------------+------------------+--------------------+--------------------+--------------------+------------------+
# Count
df.groupBy("reg").count().show()
+-------------------+-----+ reg|count| +-------------------+-----+ Southwest China| 60| East China| 84| Northeast China| 36| North China| 48| Northwest China| 60| South Central China| 72| +-------------------+-----+
# Max
df.groupBy("reg").max().show()
+-------------------+--------+-------------+------------+---------+--------+--------+------------------+-----------+-------------------+--------+ reg|max(_c0)|max(specific)|max(general)|max(year)|max(gdp)|max(fdi)| max(rnr)| max(rr)| max(i)| max(it)| +-------------------+--------+-------------+------------+---------+--------+--------+------------------+-----------+-------------------+--------+ Southwest China| 347| 3937966.0| 1725100.0| 2007|10562.39| 149322| 0.181818182| 0.84| 0.75|10384846| East China| 359| 2213991.0| 1272600.0| 2007|25776.91| 1743140| 1.214285714| 0.53| 0.6| 7040099| Northeast China| 215| 3847672.0| 1046700.0| 2007| 9304.52| 598554| 0.516129032| 0.0|0.21621621600000002| 7968319| North China| 311| 2981235.0| 1023453.0| 2007|13607.32| 527776| 0.0|0.794871795| 0.6| 7537692| Northwest China| 335| 2669238.0| 1197400.0| 2007| 5757.29| 119516|0.5555555560000001| 0.5| 1.05| 6308151| South Central China| 167| 3860764.0| 1737800.0| 2007|31777.01| 1712603| 0.27027027| 0.4375| 0.6176470589999999|10533312| +-------------------+--------+-------------+------------+---------+--------+--------+------------------+-----------+-------------------+--------+
# Min
df.groupBy("reg").min().show()
+-------------------+--------+-------------+------------+---------+--------+--------+--------+-------+------+-------+ reg|min(_c0)|min(specific)|min(general)|min(year)|min(gdp)|min(fdi)|min(rnr)|min(rr)|min(i)|min(it)| +-------------------+--------+-------------+------------+---------+--------+--------+--------+-------+------+-------+ Southwest China| 24| 18829.0| 18700.0| 1996| 64.98| 2| 0.0| 0.0| 0.0| 176802| East China| 0| 8964.0| 0.0| 1996| 1169.73| 22724| 0.0| 0.0| 0.0| 489132| Northeast China| 120| 80595.0| 19360.0| 1996| 1137.23| 19059| 0.0| 0.0| 0.0| 625471| North China| 12| 35084.0| 32119.0| 1996| 1121.93| 13802| 0.0| 0.0| 0.0| 303992| Northwest China| 48| 32088.0| 2990.0| 1996| 184.17| 247| 0.0| 0.0| 0.0| 178668| South Central China| 60| 54462.0| 0.0| 1996| 389.68| 29579| 0.0| 0.0| 0.0| 147897| +-------------------+--------+-------------+------------+---------+--------+--------+--------+-------+------+-------+
# Sum
df.groupBy("reg").sum().show()
+-------------------+--------+-------------+------------+---------+------------------+--------+------------------+------------------+-------------------+---------+ reg|sum(_c0)|sum(specific)|sum(general)|sum(year)| sum(gdp)|sum(fdi)| sum(rnr)| sum(rr)| sum(i)| sum(it)| +-------------------+--------+-------------+------------+---------+------------------+--------+------------------+------------------+-------------------+---------+ Southwest China| 12858| 3.6940948E7| 9501183.0| 120090|144623.93000000002| 1524305| 0.864576056| 2.606086956| 6.70307215|145498284| East China| 15438| 4.3472092E7| 8057608.0| 168126| 598645.57|34831359| 6.544761904| 4.503882784000002| 7.138629823000002|163726960| Northeast China| 6390| 3.4320158E7| 5096353.0| 72054| 138566.77| 4918455| 1.032258064| 0.0|0.43243243200000003| 95945334| North China| 8616| 2.3802378E7| 7028472.0| 96072| 203473.85| 8140828| 0.0| 4.474358975| 3.25| 83218499| Northwest China| 13002| 1.9490944E7| 8792002.0| 120090| 80401.57| 906668|1.1423611120000001|1.7621367529999998| 2.505320512|102212265| South Central China| 8316| 4.968906E7| 1.3766936E7| 144108|428603.54000000004|20288563| 0.88080388| 4.321366365000001| 4.010674354000001|189093591| +-------------------+--------+-------------+------------+---------+------------------+--------+------------------+------------------+-------------------+---------+
# Max it across everything
df.agg({'specific':'max'}).show()
+-------------+ max(specific)| +-------------+ 3937966.0| +-------------+
grouped = df.groupBy("reg")
grouped.agg({"it":'max'}).show()
+-------------------+--------+ reg| max(it)| +-------------------+--------+ Southwest China|10384846| East China| 7040099| Northeast China| 7968319| North China| 7537692| Northwest China| 6308151| South Central China|10533312| +-------------------+--------+
df.select(countDistinct("reg")).show()
+-------------------+ count(DISTINCT reg)| +-------------------+ 6| +-------------------+
df.select(countDistinct("reg").alias("Distinct Region")).show()
+---------------+ Distinct Region| +---------------+ 6| +---------------+
df.select(avg('specific')).show()
+-----------------+ avg(specific)| +-----------------+ 583470.7303370787| +-----------------+
df.select(stddev("specific")).show()
+---------------------+ stddev_samp(specific)| +---------------------+ 654055.3290782663| +---------------------+

Choosing Significant Digits with format_number

from pyspark.sql.functions import format_number
specific_std = df.select(stddev("specific").alias('std'))
specific_std.show()
+-----------------+ std| +-----------------+ 654055.3290782663| +-----------------+
specific_std.select(format_number('std',0)).show()
+---------------------+ format_number(std, 0)| +---------------------+ 654,055| +---------------------+

Using orderBy

df.orderBy("specific").show()
+---+---------+--------+-------+----+--------+------+-----------+----+----+-------+---------------+-------+ _c0| province|specific|general|year| gdp| fdi| rnr| rr| i| fr| reg| it| +---+---------+--------+-------+----+--------+------+-----------+----+----+-------+---------------+-------+ 28|Chongqing| null| null|2000| 1791.0| 24436| null|null|null| null|Southwest China|1022148| 109| Hebei| null| null|1997| 3953.78|110064| null|null|null| null| North China| 826734| 24|Chongqing| null| null|1996| 1315.12| 21878| null|null|null| null|Southwest China| 176802| 25|Chongqing| null| null|1997| 1509.75| 38675| null|null|null| null|Southwest China| 383402| 268| Shanghai| 8964.0| null|2000| 4771.17|316014| 0.0| 0.0|0.44|2224124| East China|1212473| 269| Shanghai| 9834.0| null|2001| 5210.12|429159| 0.0| 0.0|0.44|2947285| East China|1053917| 312| Tibet| 18829.0| null|1996| 64.98| 679|0.181818182| 0.0| 0.0| 27801|Southwest China| 306114| 270| Shanghai| 19985.0| null|2002| 5741.03|427229| 0.0| 0.0|0.44|3380397| East China|1572208| 271| Shanghai| 23547.0| null|2003| 6694.23|546849| 0.0|0.53| 0.0|4461153| East China|2031496| 313| Tibet| 25185.0| null|1997| 77.24| 63|0.181818182| 0.0| 0.0| 33787|Southwest China| 346368| 273| Shanghai| 29943.0| null|2005| 9247.66|685000| 0.0|0.53| 0.0| null| East China|2140461| 272| Shanghai| 29943.0| null|2004| 8072.83|654100| 0.0|0.53| 0.0| null| East China|2703643| 216| Ningxia| 32088.0| null|1996| 202.9| 2826| null|null|null| 90805|Northwest China| 178668| 305| Tianjin| 35084.0| null|2001| 1919.09|213348| 0.0| 0.0| 0.0| 942763| North China| 688810| 228| Qinghai| 37976.0| null|1996| 184.17| 576| null|null|null| 73260|Northwest China| 218361| 302| Tianjin| 39364.0| null|1998| 1374.6|211361| null|null|null| 540178| North China| 361723| 274| Shanghai| 42928.0| null|2006|10572.24|710700| 0.0|0.53| 0.0|8175966| East China|2239987| 217| Ningxia| 44267.0| null|1997| 224.59| 671| null|null|null| 102083|Northwest China| 195295| 303| Tianjin| 45463.0| null|1999| 1500.95|176399| 0.0| 0.0| 0.0| 605662| North China| 422522| 314| Tibet| 48197.0| null|1998| 91.5| 481| 0.0|0.24| 0.0| 3810|Southwest China| 415547| +---+---------+--------+-------+----+--------+------+-----------+----+----+-------+---------------+-------+ only showing top 20 rows
df.orderBy(df["specific"].desc()).show()
+---+------------+---------+---------+----+--------+-------+-----------+-----------+-------------------+--------+-------------------+--------+ _c0| province| specific| general|year| gdp| fdi| rnr| rr| i| fr| reg| it| +---+------------+---------+---------+----+--------+-------+-----------+-----------+-------------------+--------+-------------------+--------+ 299| Sichuan|3937966.0|1725100.0|2007|10562.39| 149322| null| null| null| 8508606| Southwest China|10384846| 143| Henan|3860764.0|1737800.0|2007|15012.46| 306162| 0.0| 0.0| 0.0| 8620804|South Central China|10533312| 131|Heilongjiang|3847672.0|1046700.0|2007| 7104.0| 208508| 0.0| 0.0|0.21621621600000002| 4404689| Northeast China| 7968319| 215| Liaoning|3396397.0| 599600.0|2007| 9304.52| 598554|0.516129032| 0.0| null|10826948| Northeast China| 5502192| 167| Hunan|3156087.0|1329200.0|2007| 9439.6| 327051| 0.0| 0.4375| 0.0| 6065508|South Central China| 8340692| 119| Hebei|2981235.0| 694400.0|2007|13607.32| 241621| 0.0| 0.5| 0.0| 7891198| North China| 7537692| 155| Hubei|2922784.0|1263500.0|2007| 9333.4| 276622| 0.0|0.111111111| 0.0| 5903552|South Central China| 7666512| 251| Shaanxi|2669238.0|1081000.0|2007| 5757.29| 119516| 0.03125| 0.0| 0.8125| 4752398| Northwest China| 6308151| 203| Jilin|2663667.0|1016400.0|2007| 4275.12| 76064| 0.0| 0.0| 0.0| 3206892| Northeast China| 4607955| 347| Yunnan|2482173.0| 564400.0|2007| 4772.52| 39453| 0.0| 0.0| 0.0| 4867146| Southwest China| 6832541| 298| Sichuan|2225220.0|1187958.0|2006| 8690.24| 120819| 0.0| 0.0| 0.55| 4247403| Southwest China| 7646885| 11| Anhui|2213991.0| 178705.0|2007| 7360.92| 299892| 0.0| 0.0| 0.324324324| 4468640| East China| 7040099| 287| Shanxi|2189020.0| 661200.0|2007| 6024.45| 134283| null| null| null| 5978870| North China| 5070166| 263| Shandong|2121243.0| 581800.0|2007|25776.91|1101159| 0.0| 0.0| 0.0|16753980| East China| 6357869| 191| Jiangxi|2045869.0|1272600.0|2007| 4820.53| 280657| 0.0| 0.41025641| 0.0| 3898510| East China| 4229821| 83| Guangxi|2022957.0|1214100.0|2007| 5823.41| 68396|0.205128205| 0.0|0.23076923100000002| 4188265|South Central China| 6185600| 142| Henan|2018158.0|1131615.0|2006|12362.79| 184526| 0.0| 0.0| 0.0| 6212824|South Central China| 7601825| 59| Gansu|2010553.0|1039400.0|2007| 2703.98| 11802| null| 0.0| 1.05| 1909107| Northwest China| 5111059| 95| Guizhou|1956261.0|1239200.0|2007| 2884.11| 12651| 0.0| 0.0| 0.7105263159999999| 2851375| Southwest China| 5639838| 214| Liaoning|1947031.0| 179893.0|2006| 8047.26| 359000|0.516129032| 0.0| null| 6530236| Northeast China| 4605917| +---+------------+---------+---------+----+--------+-------+-----------+-----------+-------------------+--------+-------------------+--------+ only showing top 20 rows