"A minimal example of using Pyspark for Linear Regression"

toc: true
branch: master
badges: true
comments: true
author: David Kearney
categories: [pyspark, jupyter]
description: A minimal example of using PySpark for Linear Regression
title: PySpark Regression with Fiscal Data

Bring in needed imports

# Only col and IntegerType are used below; a wildcard import of
# pyspark.sql.functions would shadow Python builtins such as sum and max
from pyspark.sql.functions import col
from pyspark.sql.types import IntegerType

Load data from CSV

#collapse-hide

# Load data from a CSV
file_location = "/FileStore/tables/df_panel_fix.csv"
df = (
    spark.read.format("csv")
    .option("inferSchema", True)
    .option("header", True)
    .load(file_location)
)
# display() is a Databricks notebook helper; df.show(5) is the plain PySpark equivalent
display(df.take(5))

df.createOrReplaceTempView("fiscal_stats")

sums = spark.sql("""
select year, sum(it) as total_yearly_it, sum(fr) as total_yearly_fr
from fiscal_stats
group by 1
order by year asc
""")

sums.show()
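
The query groups rows by the first selected column (group by 1 is positional shorthand for group by year) and sums it and fr within each year. As a rough illustration outside Spark, a statement of the same shape can be run against an in-memory SQLite table using Python's standard library. The three rows below are made-up toy values, not taken from the dataset, and the grouping column is named explicitly.

```python
import sqlite3

# Toy rows (year, it, fr): hypothetical values for illustration only.
toy_rows = [(1996, 10, 1), (1996, 20, 2), (1997, 5, 3)]

conn = sqlite3.connect(":memory:")
conn.execute("create table fiscal_stats (year integer, it integer, fr integer)")
conn.executemany("insert into fiscal_stats values (?, ?, ?)", toy_rows)

# Same aggregation as the Spark SQL query, grouping by the year column.
yearly_sums = conn.execute("""
select year, sum(it) as total_yearly_it, sum(fr) as total_yearly_fr
from fiscal_stats
group by year
order by year asc
""").fetchall()
conn.close()

print(yearly_sums)  # [(1996, 30, 3), (1997, 5, 3)]
```

The two 1996 rows collapse into one output row, which is what the Spark query does per year across provinces.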

Describing the Data

df.describe().toPandas().transpose()
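
describe() reports count, mean, stddev, min, and max per column, where count covers non-null values only (so a count below 360 signals missing entries). The sketch below recomputes the _c0 row of the summary in plain Python, assuming _c0 is a row index running from 0 to 359, which the reported count, min, and max are consistent with. It also suggests that stddev is the sample standard deviation (dividing by n - 1).

```python
import statistics

# Assumption: _c0 is simply the row index 0..359.
values = list(range(360))

summary = {
    "count": len(values),
    "mean": statistics.mean(values),
    "stddev": statistics.stdev(values),  # sample standard deviation (n - 1)
    "min": min(values),
    "max": max(values),
}
print(summary)
```

The resulting mean of 179.5 and stddev of roughly 104.0673 match the _c0 row in the describe output.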

Cast Data Type

# gdp and the rate columns rnr, rr, and i hold fractional values, so they are cast to
# double; an integer cast would truncate them (most rates would become 0)
df2 = df.withColumn("gdp",col("gdp").cast("double")) \
.withColumn("specific",col("specific").cast(IntegerType())) \
.withColumn("general",col("general").cast(IntegerType())) \
.withColumn("year",col("year").cast(IntegerType())) \
.withColumn("fdi",col("fdi").cast(IntegerType())) \
.withColumn("rnr",col("rnr").cast("double")) \
.withColumn("rr",col("rr").cast("double")) \
.withColumn("i",col("i").cast("double")) \
.withColumn("fr",col("fr").cast(IntegerType()))
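
An integer cast drops the fractional part, so it is lossy for columns such as gdp (2093.3 in the first row) and for the rate columns rnr, rr, and i, whose values mostly lie between 0 and 1. The plain-Python sketch below mimics that truncation. to_int_or_none is a hypothetical helper, not a Spark function, and its None branch is an assumption about how an unparseable entry such as the #REF! seen in fr is handled (in non-ANSI mode a Spark cast yields null for such values).

```python
def to_int_or_none(value):
    """Hypothetical stand-in for an integer cast: truncate, or None if unparseable."""
    try:
        return int(float(value))  # int() truncates toward zero
    except (TypeError, ValueError):
        return None

# Values taken from the sample rows and the summary table, plus a missing entry.
samples = [2093.3, 0.84, 1.214285714, "#REF!", None]
truncated = [to_int_or_none(v) for v in samples]
print(truncated)  # [2093, 0, 1, None, None]
```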

Print the Schema

df2.printSchema()

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

assembler = VectorAssembler(inputCols=['gdp', 'fdi'], outputCol="features")
train_df = assembler.transform(df2)

train_df.select("specific", "year").show()

Linear Regression in PySpark

lr = LinearRegression(featuresCol = 'features', labelCol='it')
lr_model = lr.fit(train_df)

trainingSummary = lr_model.summary
print("Coefficients: " + str(lr_model.coefficients))
print("RMSE: %f" % trainingSummary.rootMeanSquaredError)
print("R2: %f" % trainingSummary.r2)
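
The coefficients printed above are the per-feature slopes for gdp and fdi; the intercept is stored separately on the model. As a minimal sketch of what an unregularised least-squares fit computes (which is what LinearRegression should reduce to when regParam is left at its default of 0.0), the code below solves the normal equations in plain Python. The helper names and the toy data are hypothetical; the data is generated exactly from y = 1 + 2*x1 + 3*x2 so the recovered values can be checked.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b] so row operations update both sides together.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for pivot_col in range(n):
        pivot = max(range(pivot_col, n), key=lambda r: abs(m[r][pivot_col]))
        m[pivot_col], m[pivot] = m[pivot], m[pivot_col]
        for r in range(pivot_col + 1, n):
            factor = m[r][pivot_col] / m[pivot_col][pivot_col]
            for c in range(pivot_col, n + 1):
                m[r][c] -= factor * m[pivot_col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def fit_ols(features, labels):
    """Return (intercept, coefficients) minimising the sum of squared errors."""
    rows = [[1.0] + list(f) for f in features]  # leading 1.0 models the intercept
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, labels)) for i in range(k)]
    beta = solve_linear_system(xtx, xty)
    return beta[0], beta[1:]

# Toy data (made up), generated exactly from y = 1 + 2*x1 + 3*x2.
toy_features = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
toy_labels = [1 + 2 * x1 + 3 * x2 for x1, x2 in toy_features]

intercept, coefficients = fit_ols(toy_features, toy_labels)
print(intercept, coefficients)  # approximately 1.0 and [2.0, 3.0]
```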

from pyspark.ml.evaluation import RegressionEvaluator

# Score the rows the model was fitted on; no held-out split is used here
lr_predictions = lr_model.transform(train_df)
lr_predictions.select("prediction","it","features").show(5)

lr_evaluator = RegressionEvaluator(
    predictionCol="prediction", labelCol="it", metricName="r2")

print("R Squared (R2) on training data = %g" % lr_evaluator.evaluate(lr_predictions))
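
For reference, the two metrics used in this post can be written out by hand. The sketch below uses the standard definitions, RMSE as the square root of the mean squared error and R2 as one minus the ratio of residual to total sum of squares. The labels and predictions are made-up toy values, not model output.

```python
import math

def rmse(labels, predictions):
    """Root mean squared error."""
    squared = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared) / len(squared))

def r2(labels, predictions):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_label = sum(labels) / len(labels)
    ss_res = sum((y - p) ** 2 for y, p in zip(labels, predictions))
    ss_tot = sum((y - mean_label) ** 2 for y in labels)
    return 1.0 - ss_res / ss_tot

# Toy values (hypothetical): only the last prediction is off, by 1.
toy_labels = [1.0, 2.0, 3.0]
toy_predictions = [1.0, 2.0, 4.0]

toy_rmse = rmse(toy_labels, toy_predictions)  # sqrt(1/3), about 0.577
toy_r2 = r2(toy_labels, toy_predictions)      # 1 - 1/2 = 0.5
print(toy_rmse, toy_r2)
```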

print("numIterations: %d" % trainingSummary.totalIterations)
print("objectiveHistory: %s" % str(trainingSummary.objectiveHistory))
trainingSummary.residuals.show()

# test_df is never defined in this notebook, so the fitted model is applied to train_df
predictions = lr_model.transform(train_df)
predictions.select("prediction","it","features").show()
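
Measuring performance on unseen rows requires a train/test split, which this notebook does not create. In PySpark the usual tool would be DataFrame.randomSplit with a list of weights and a seed. The plain-Python sketch below shows the idea of a seeded per-row random assignment; random_split is a hypothetical helper, and the 70/30 ratio and the seed are arbitrary choices.

```python
import random

def random_split(rows, train_fraction, seed):
    """Assign each row to train or test independently, with a fixed seed."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test

row_ids = list(range(360))  # 360 matches the row count in the summary table
train_ids, test_ids = random_split(row_ids, train_fraction=0.7, seed=42)
print(len(train_ids), len(test_ids))
```

Because each row is assigned independently, the split sizes are only approximately 70/30. A model would then be fitted on the training part and evaluated on the test part.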

from pyspark.ml.regression import DecisionTreeRegressor
dt = DecisionTreeRegressor(featuresCol ='features', labelCol = 'it')
dt_model = dt.fit(train_df)
dt_predictions = dt_model.transform(train_df)
dt_evaluator = RegressionEvaluator(
    labelCol="it", predictionCol="prediction", metricName="rmse")
rmse = dt_evaluator.evaluate(dt_predictions)
print("Root Mean Squared Error (RMSE) on training data = %g" % rmse)
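
A regression tree is built from splits that each pick a feature threshold and predict the mean label on either side. The sketch below fits a single such split (a depth-1 tree, or stump) on one feature by minimising the total squared error, which is equivalent to the variance-reduction criterion commonly used for regression trees. It is a simplified illustration with made-up data and hypothetical helper names, not Spark's implementation.

```python
def fit_stump(xs, ys):
    """Best single split on one feature: (threshold, left_mean, right_mean)."""
    best = None
    cuts = sorted(set(xs))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(cuts, cuts[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]

def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean

# Toy data (hypothetical): two clearly separated groups.
toy_x = [1, 2, 3, 10, 11, 12]
toy_y = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]

stump = fit_stump(toy_x, toy_y)
print(stump)  # (6.5, 1.0, 5.0)
```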

from pyspark.ml.regression import GBTRegressor
gbt = GBTRegressor(featuresCol = 'features', labelCol = 'it', maxIter=10)
gbt_model = gbt.fit(train_df)
gbt_predictions = gbt_model.transform(train_df)
gbt_predictions.select('prediction', 'it', 'features').show(5)


gbt_evaluator = RegressionEvaluator(
    labelCol="it", predictionCol="prediction", metricName="rmse")
rmse = gbt_evaluator.evaluate(gbt_predictions)
print("Root Mean Squared Error (RMSE) on training data = %g" % rmse)
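
Gradient boosting builds its ensemble one tree at a time, each new tree fitted to the errors left by the trees before it; maxIter=10 above caps the ensemble at ten such rounds. The sketch below shows the idea for squared loss in plain Python. It is deliberately simplified (single-split stumps instead of deeper trees, a mean prediction as the starting point, one feature, made-up data), and the helper names and the 0.1 learning rate are assumptions for illustration.

```python
import math

def fit_stump(xs, targets):
    """Best single split on one feature: (threshold, left_mean, right_mean)."""
    best = None
    cuts = sorted(set(xs))
    for lo, hi in zip(cuts, cuts[1:]):
        threshold = (lo + hi) / 2
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((t - left_mean) ** 2 for t in left)
               + sum((t - right_mean) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]

def rmse(ys, predictions):
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys))

def boost(xs, ys, n_rounds=10, learning_rate=0.1):
    """Fit stumps to residuals round by round; return predictions and RMSE history."""
    predictions = [sum(ys) / len(ys)] * len(ys)  # start from the mean label
    history = [rmse(ys, predictions)]
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        threshold, left_mean, right_mean = fit_stump(xs, residuals)
        predictions = [
            p + learning_rate * (left_mean if x <= threshold else right_mean)
            for x, p in zip(xs, predictions)
        ]
        history.append(rmse(ys, predictions))
    return predictions, history

# Toy data (hypothetical) with a non-linear step pattern.
toy_x = [1, 2, 3, 4, 5, 6, 7, 8]
toy_y = [1.0, 1.0, 2.0, 2.0, 6.0, 6.0, 9.0, 9.0]

_, history = boost(toy_x, toy_y)
print([round(h, 3) for h in history])
```

The training RMSE shrinks with every round, which is also why a training-set RMSE tends to look optimistic for boosted models.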

This post includes code adapted from the Udemy course "Spark and Python for Big Data" and from the Spark and Python for Big Data notebooks.

Output of display(df.take(5)):
_c0	province	specific	general	year	gdp	fdi	rnr	rr	i	fr	reg	it
0	Anhui	147002.0	null	1996	2093.3	50661	0.0	0.0	0.0	1128873	East China	631930
1	Anhui	151981.0	null	1997	2347.32	43443	0.0	0.0	0.0	1356287	East China	657860
2	Anhui	174930.0	null	1998	2542.96	27673	0.0	0.0	0.0	1518236	East China	889463
3	Anhui	285324.0	null	1999	2712.34	26131	null	null	null	1646891	East China	1227364
4	Anhui	195580.0	32100.0	2000	2902.09	31847	0.0	0.0	0.0	1601508	East China	1499110

Output of df.describe().toPandas().transpose():
summary	count	mean	stddev	min	max
_c0	360	179.5	104.06728592598157	0	359
province	360	None	None	Anhui	Zhejiang
specific	356	583470.7303370787	654055.3290782663	8964.0	3937966.0
general	169	309127.53846153844	355423.5760674793	0.0	1737800.0
year	360	2001.5	3.4568570586927794	1996	2007
gdp	360	4428.653416666667	4484.668659976412	64.98	31777.01
fdi	360	196139.38333333333	303043.97011891654	2	1743140
rnr	294	0.0355944252244898	0.16061503029299648	0.0	1.214285714
rr	296	0.059688621057432424	0.15673351824073453	0.0	0.84
i	287	0.08376351662369343	0.1838933104683607	0.0	1.05
fr	295	2522449.0034013605	3491329.8613106664	#REF!	9898522
reg	360	None	None	East China	Southwest China
it	360	2165819.2583333333	1769294.2935487411	147897	10533312