Jamie Thomson

Thoughts, about stuff

PySpark starter for ten

leave a comment »

I just threw this together and I’m putting it here mainly in case I need it later. It might come in handy for others too…

So you have a new Spark installation against a yarn cluster, you want to run something simple on it (akin to hello World) to see if it does anything. Try copying and pasting this into your bash shell:

echo "from pyspark import SparkContext, HiveContext, SparkConf" > sparking.py
echo "conf = SparkConf().setAppName('sparking')" >> sparking.py
echo 'conf.set("spark.sql.parquet.binaryAsString", "true")' >> sparking.py
echo "sc = SparkContext(conf=conf)" >> sparking.py
echo "sqlContext = HiveContext(sc)" >> sparking.py
echo "l = [('Alice', 1)]" >> sparking.py
echo "rdd = sc.parallelize(l)" >> sparking.py
echo "for x in rdd.take(10):" >> sparking.py
echo " print x" >> sparking.py
spark-submit --master yarn --deploy-mode cluster --supervise --name "sparking" sparking.py

If it runs you should see something like this at the end of the yarn log:

Log Type: stdout
Log Upload Time: Thu Jan 05 14:56:09 +0000 2017
Log Length: 13
('Alice', 1)

Written by Jamiet

January 5, 2017 at 3:07 pm

Posted in Uncategorized

Tagged with , ,

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: