Jamie Thomson

Thoughts, about stuff


Running Spark on Ubuntu on Windows subsystem for Linux


In my day job at dunnhumby I’m using Apache Spark a lot, so when Windows 10 gained the ability to run Ubuntu, a Linux distro, I thought it would be fun to see if I could run Spark on it. My earlier efforts in November 2016 were thwarted (something to do with enumerating network connections), so when Microsoft released the Windows 10 Creators Update I thought I’d give it another bash (pun intended). Happily, this time it worked, and in this blog post I’ll explain how I got it running in case anyone wants to do the same. If you skip the blurb and just run the steps, it should take you about 10 minutes if you’re quick.

Enable Windows subsystem for Linux

There are countless sites that will tell you how to do this (here’s one) but basically you need to turn on Windows Subsystem for Linux (Beta) in Control Panel->Windows Features.

Then open up a PowerShell window and run

lxrun /install

which will install all the bits.

Once it’s installed, run

bash

to launch a Linux bash prompt.

Download and install Spark

Once at the prompt run the following commands:


#refresh the package index, then install the Java runtime environment (JRE)
sudo apt-get update
sudo apt-get install openjdk-8-jre-headless
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-amd64/jre
#download spark, visit https://spark.apache.org/downloads.html if you want a different version
wget http://d3kbcqa49mib13.cloudfront.net/spark-2.1.0-bin-hadoop2.7.tgz
#untar into /opt and set a symlink so that /opt/spark points at the versioned folder
sudo tar -xvzf spark-2.1.0-bin-hadoop2.7.tgz -C /opt
sudo ln -s spark-2.1.0-bin-hadoop2.7 /opt/spark
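
The symlink command above passes a relative target (spark-2.1.0-bin-hadoop2.7) rather than a full path, which can look like a mistake. It works because a relative symlink target is resolved against the directory the link lives in, not the directory you happen to be in. Here is a minimal sketch that reproduces the same step in a scratch directory so you can try it without sudo. The demo_dir and link_target names are made up for illustration, and the temporary directory simply stands in for /opt.

```shell
# Throwaway sketch: reproduce the symlink step in a scratch directory (no sudo needed).
demo_dir=$(mktemp -d)                              # hypothetical stand-in for /opt
mkdir "$demo_dir/spark-2.1.0-bin-hadoop2.7"        # stands in for the untarred folder
ln -s spark-2.1.0-bin-hadoop2.7 "$demo_dir/spark"  # same relative form as the real command

# The link stores the relative name exactly as typed...
link_target=$(readlink "$demo_dir/spark")
echo "$link_target"

# ...and it is resolved relative to the link's own directory, so it still
# points at the sibling folder regardless of the current working directory.
[ -d "$demo_dir/spark/" ] && echo "link resolves to a directory"
```

The upshot is that /opt/spark ends up pointing at /opt/spark-2.1.0-bin-hadoop2.7, so a later upgrade only needs the link to be recreated.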

That’s it. Now to actually use it.

Run some Spark code

Run:

/opt/spark/bin/spark-shell

and you’ll be launched into the Spark REPL, which is actually the Scala REPL preloaded with some Spark stuff (Spark is written in Scala, hence it’s considered the canonical language for doing Spark development).

Let’s try the “Hello World” for Spark:

sc.parallelize(Array(1,2,3,4,5)).collect()

If it works you should see this:

scala> sc.parallelize(Array(1,2,3,4,5)).collect()
res5: Array[Int] = Array(1, 2, 3, 4, 5)

Not in itself particularly exciting, but given I’m running this on Windows and it actually works I think that’s pretty cool.

OK let’s do something more interesting.


//declared with var rather than val because df gets reassigned further down
var df = spark.read.json("/opt/spark/examples/src/main/resources/people.json")
df.show()

Cool! If you’re already familiar with Spark then you can stop here but if you want to know a bit more about its APIs, keep reading.

Now let’s synthesize a bigger dataset and get its cardinality:


for (a <- 1 until 10){
  df = df.union(df)
}
df.count()
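
To see where the resulting count comes from: people.json holds three records, each df.union(df) doubles the row count, and 1 until 10 excludes the 10, so the loop body runs nine times. A quick sanity check of that arithmetic back at the bash prompt (plain shell, no Spark involved; the rows variable is just for illustration):

```shell
rows=3                      # people.json holds three records
for i in $(seq 1 9); do     # Scala's `1 until 10` runs 9 times (10 is excluded)
  rows=$((rows * 2))        # df.union(df) doubles the row count
done
echo "$rows"                # prints 1536, i.e. 3 * 2^9
```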


A dataset of 1536 rows is a bit more interesting, and we can do a few things with it:


df.filter(df("age") > 21).count()
df.groupBy("name").count().show()
//combine a few functions together
//this fluent API approach is what I've come to love about Spark
df.withColumn("is over 21", df("age") > 21).limit(5).show()

Notice that when the Spark shell started up it printed a link to the Spark UI (served on port 4040 by default).

Browse to that URL to view all the stats that Spark keeps about everything that it executes.

If you want to persist the data as a database table there is a simple saveAsTable() method that allows you to do that, which is then easy to consume using table():


val df = spark.read.json("/opt/spark/examples/src/main/resources/people.json")
df.write.saveAsTable("savetabledemo")
spark.table("savetabledemo").show()

And if fluent APIs aren’t your thing there’s still good old SQL:


spark.sql("select count(*) from savetabledemo").show()

Prefer Python to Scala?

If you prefer Python to Scala then you’ll be glad to know you can use that too. First, back at the bash prompt (type :quit to leave the Scala REPL), create a symlink to the Python interpreter, then launch pyspark, the Python REPL for Spark:


ln -s /usr/bin/python3.5 python
/opt/spark/bin/pyspark
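
If pyspark still cannot find a python executable (the symlink above is created in whatever directory you are in, which may not be on your PATH), an alternative worth trying is Spark’s PYSPARK_PYTHON environment variable, which tells pyspark which interpreter to use. This is a sketch rather than something from my original run, and it assumes a python3 executable is on your PATH:

```shell
# Assumption: a python3 executable is available on the PATH.
# PYSPARK_PYTHON tells Spark which Python interpreter to launch.
export PYSPARK_PYTHON=python3
# Then start the REPL exactly as before:
#   /opt/spark/bin/pyspark
```

Like the JAVA_HOME export earlier, this only lasts for the current bash session.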

From there you can run the same operations as before, just using Python syntax instead.


sc.parallelize([1,2,3,4,5]).collect()
df = sqlContext.read.json("/opt/spark/examples/src/main/resources/people.json")
df.show()
df.filter(df.age > 21).count()
df.groupBy("name").count().show()
df.withColumn("is over 21", df.age > 21).limit(5).show()
df.write.saveAsTable('savetabledemo_python')
sqlContext.table('savetabledemo_python').show()
sqlContext.sql('select count(*) from savetabledemo_python').show()

That’s probably enough for now. Hope this helps.

@jamiet

Written by Jamiet

April 23, 2017 at 7:50 pm
