Archive for the ‘Uncategorized’ Category
I just threw this together and I’m putting it here mainly in case I need it later. It might come in handy for others too…
So you have a new Spark installation running against a YARN cluster and you want to run something simple on it (akin to Hello World) to see if it does anything. Try copying and pasting this into your bash shell:
echo "from pyspark import SparkContext, HiveContext, SparkConf" > sparking.py
echo "conf = SparkConf().setAppName('sparking')" >> sparking.py
echo 'conf.set("spark.sql.parquet.binaryAsString", "true")' >> sparking.py
echo "sc = SparkContext(conf=conf)" >> sparking.py
echo "sqlContext = HiveContext(sc)" >> sparking.py
echo "l = [('Alice', 1)]" >> sparking.py
echo "rdd = sc.parallelize(l)" >> sparking.py
echo "for x in rdd.take(10):" >> sparking.py
echo "    print x" >> sparking.py
spark-submit --master yarn --deploy-mode cluster --supervise --name "sparking" sparking.py
If it runs you should see something like this at the end of the yarn log:
Log Type: stdout
Log Upload Time: Thu Jan 05 14:56:09 +0000 2017
Log Length: 13
('Alice', 1)
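The Log Length of 13 is a handy sanity check: presumably it is the 12 characters that Python prints for the tuple plus a trailing newline. Here’s a minimal sketch in plain Python (no Spark required; the variable names are mine and not part of the job) that reproduces the count:

```python
# Rebuild the single line the job writes to stdout.
row = ('Alice', 1)
line = repr(row) + "\n"   # print adds the trailing newline

print(line, end="")       # ('Alice', 1)
print(len(line))          # 13
```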
I’ve been doing lots of Apache Spark development using Python (aka PySpark) recently, specifically Spark SQL (aka the dataframes API), and one thing I’ve found very useful for testing purposes is being able to create a dataframe from literal values. The documentation at pyspark.sql.SQLContext.createDataFrame() covers this pretty well; however, the code there describes how to create a dataframe containing more than one column like so:
l = [('Alice', 1)]
sqlContext.createDataFrame(l).collect()
# returns [Row(_1=u'Alice', _2=1)]
sqlContext.createDataFrame(l, ['name', 'age']).collect()
# returns [Row(name=u'Alice', age=1)]
For simple testing purposes I wanted to create a dataframe that has only one column so you might think that the above code could be amended simply like so:
l = [('Alice')]
sqlContext.createDataFrame(l).collect()
sqlContext.createDataFrame(l, ['name']).collect()
but unfortunately that throws an error:
TypeError: Can not infer schema for type: <type 'str'>
The reason is simple: ('Alice', 1) is a tuple whereas ('Alice') is a string.
type(('Alice',1)) # returns tuple
type(('Alice')) # returns str
The latter causes an error because createDataFrame() cannot infer a schema from a bare string; it expects each element to be a tuple (or a similar row-like object).
There is a very easy fix which will be obvious to any half-decent Python developer. Unfortunately that’s not me, so I didn’t stumble on the answer immediately. It’s possible to create a one-element tuple by including an extra comma like so:
type(('Alice',)) # returns tuple
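Put another way, it’s the comma that makes the tuple, not the parentheses. Here’s a quick sketch in plain Python (no Spark needed; the variable names are just for illustration) that shows the difference:

```python
# Parentheses alone only group an expression; the comma creates the tuple.
not_a_tuple = ('Alice')
one_tuple = ('Alice',)
also_one_tuple = 'Alice',   # the parentheses are optional

print(type(not_a_tuple).__name__)    # str
print(type(one_tuple).__name__)      # tuple
print(len(one_tuple))                # 1
print(one_tuple == also_one_tuple)   # True
```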
hence the earlier failing code can be adapted to this:
l = [('Alice',)]
sqlContext.createDataFrame(l).collect()
# returns [Row(_1=u'Alice')]
sqlContext.createDataFrame(l, ['name']).collect()
# returns [Row(name=u'Alice')]
It took me far longer than it should have done to figure that out!
Here is another snippet that creates a dataframe from literal values without letting Spark infer the schema (behaviour which, I believe, is deprecated anyway):
from pyspark.sql.types import *
schema = StructType([StructField("foo", StringType(), True)])
l = [('bar1',),('bar2',),('bar3',)]
sqlContext.createDataFrame(l, schema).collect()
# returns: [Row(foo=u'bar1'), Row(foo=u'bar2'), Row(foo=u'bar3')]
or, if you don’t want to use the one-element tuple workaround that I outlined above and would rather just pass a list of strings:
from pyspark.sql.types import *
from pyspark.sql import Row
schema = StructType([StructField("foo", StringType(), True)])
l = ['bar1','bar2','bar3']
rdd = sc.parallelize(l).map(lambda x: Row(x))
sqlContext.createDataFrame(rdd, schema).collect()
# returns [Row(foo=u'bar1'), Row(foo=u'bar2'), Row(foo=u'bar3')]
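A third option, sketched here on the assumption that the list is small enough to reshape on the driver: wrap each string in a one-element tuple with a list comprehension. That gives the same shape of list that the one-element tuple snippet above passes to createDataFrame(). The helper name to_rows is mine, purely for illustration:

```python
def to_rows(values):
    """Wrap each value in a one-element tuple (hypothetical helper)."""
    return [(v,) for v in values]

l = ['bar1', 'bar2', 'bar3']
rows = to_rows(l)
print(rows)  # [('bar1',), ('bar2',), ('bar3',)]
```

The resulting rows list can then be handed to sqlContext.createDataFrame(rows, schema) in the same way as the literal list of tuples.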
HMRC will, if you ask them, send you a form called SA302 which shows your tax calculation for a given year. As I’m self-employed and thus have to submit a Self Assessment tax return every year I find this to be very useful.
To order the SA302 calculation, telephone 01619319070 or 08453000627; it should take a couple of minutes at most. You will need your National Insurance Number.
My latest SA302 arrived today and I’ve spent this evening compiling an Excel workbook containing all my SA302 data for the past five years (fun fun fun). I’ve found this to be very very useful so figured I should make the workbook available to anyone else in a similar position. The workbook can be accessed via your web browser (no Excel installation required) here: TaxAnalysis.xlsx. It already has some (fake) numbers filled in for years 2009-10 through to 2013-14:
all you need to do is replace the fake numbers with your own and voila, you’ll get some nice charts like these showing you useful information about how much you earned and how much tax you paid over those years:
Hope this is useful! If so, please do let me have some feedback. Thanks!
In November 2012 my family and I moved into the London Borough of Hounslow and, as I am expecting to be here for a very long time, I decided to avail myself of some information pertaining to how the council spends its money. All expenditure is published on the council website at Council spending over £500. It’s great that this information exists and is published; however, the format in which it is published isn’t particularly useful to folks like me who want to analyse and drill into the data in greater detail. What we get is a PDF file (rubbish) and a CSV file (better) per month:
Why is PDF rubbish? Because the data is static: we can’t explore it, reshape it or drill into it. The data is presented in whatever format the person who produced the PDF decides. This is all bad bad bad. CSV on the other hand (which stands for comma-separated values) is better because it contains only raw data and there’s no pre-determined presentation of the data. One can take the monthly CSV files and collate them into a single view that allows exploration and comparison of the data, and that is exactly what I have done: I have taken all available data (from April 2012 onwards) and published it online at All London Borough of Hounslow Supplier expenditure over 500GBP since April 2012.
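For anyone who’d rather do the collation in code than in Excel, here’s a rough sketch in Python of combining monthly CSV files and totalling spend per supplier and per month. The column names (Supplier, Amount) and the sample rows are made up for illustration; the council’s real files will have their own column names and will probably need some cleaning first.

```python
import csv
import io
from collections import defaultdict

# Made-up stand-ins for two monthly CSV files (not real council data).
monthly_files = {
    "2012-04": "Supplier,Amount\nSupplier A,600.00\nSupplier B,1250.50\n",
    "2012-05": "Supplier,Amount\nSupplier A,900.00\nSupplier C,510.25\n",
}

def collate(files):
    """Merge every month's rows into one list, tagging each row with its month."""
    rows = []
    for month, text in sorted(files.items()):
        for record in csv.DictReader(io.StringIO(text)):
            rows.append({
                "Month": month,
                "Supplier": record["Supplier"],
                "Amount": float(record["Amount"]),
            })
    return rows

def total_by(rows, key):
    """Sum the Amount column grouped by the given column."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key]] += row["Amount"]
    return dict(totals)

rows = collate(monthly_files)
by_supplier = total_by(rows, "Supplier")
by_month = total_by(rows, "Month")

# Largest suppliers first, as in a "top ten" chart.
top = sorted(by_supplier.items(), key=lambda kv: kv[1], reverse=True)[:10]
for supplier, amount in top:
    print(f"{supplier}: {amount:.2f}")
```

With real files you would read each month from disk instead of from the in-memory strings, but the grouping logic would stay much the same.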
The publishing format is a Microsoft Excel workbook; however, you do not need Excel installed in order to view it, only a web browser. You do have the option to download the workbook to take advantage of the greater power of Excel and do your own analysis.
Putting the data into Excel enables us to provide summaries and visualisations over the data such as expenditure per month:
Top ten expenditures per external Supplier, Expense Type & Service Area:
All the charts are attached to objects called slicers that make them interactive. Here’s an example of a slicer:
Clicking on a Supplier will cause the charts to display data for only that Supplier (you can select multiple Suppliers by holding down the CTRL button).
Similar to Slicers are Timelines which enable us to show data for a particular month or groups of months:
Importantly, I shall be adding new data to this Excel workbook as and when it becomes available, so check back every month to see the data changing.
The first month for which data was available was April 2012 hence when April 2013 rolls around we can start to do year-on-year comparisons and that is when the information might start to reveal some interesting spending trends of the council.
If you’re interested in the council’s absolute total expenditure since April 2012 I show that on the first sheet:
Finally, having access to all this data enables us to discover interesting facts such as how much the council has spent with a particular chauffeur supplier:
If you find any other interesting titbits hidden in this corpus do let me know!
I encourage you to take a look and if you have any feedback please leave a comment below.
Helen, Bonnie and I recently moved into our new house in Hanworth Park and with the new house I inherited a substantial vegetable plot (the estate agent called it an orchard but given there’s only one tree in it that’s rather grandiose) that the previous owner has clearly put lots and lots of work into, as you can see:
Check out my PhotoSynth of it here: http://bit.ly/jtveggiepatch.
When we bought the house I resolved that I would try and maintain this in the same way and hopefully we could become slightly more self-sufficient in the process; anyone who knows me knows that I am not in the slightest bit green-fingered so this is actually a rather daunting task. Undeterred I ventured down to the veggie patch this morning to make my first harvest of some beetroot that the previous owner had kindly left for us. Here is my first crop:
Not exactly a bumper crop but I am hoping I can get a good few jars of pickled beetroot outta this little lot and perhaps keep some aside to be roasted with our Christmas lunch. I have a recipe for pickling beetroot from Miles Collins (see: How to Pickle Beetroot) and I ventured out today to get all the ingredients. Tomorrow is pickling day – check back later to see how I get on!
Recently I had my subscription to Hotmail Plus auto-renewed and I started to consider what I was actually getting for my money so I visited http://billing.microsoft.com to find out. After some clicking around I stumbled across this:
Let’s break this down. For £14.99 per year I get:
- 10GB of storage (as far as I know normal Hotmail is to-all-intents-and-purposes unlimited)
- No ads (Outlook.com hardly ever shows you ads now anyway and if they do they’re not the horrible banner ad type)
- Feature tips and product info (Don’t remember ever getting feature tips and besides, I don’t think I’m in great need of them)
- Larger attachments (I never send attachments anyway if I can help it and if I do Outlook.com’s allowance is both hefty and more than adequate)
- Exemption from account expiration (I guess that might be useful were I ever to suffer from a near-fatal medical condition although if I did I suspect I’d have bigger worries on my plate than an expiring Hotmail account)
As such, thanks but no thanks. Cancelled. Thankfully the redesigned http://billing.microsoft.com makes that rather easy:
Are you still using Hotmail Plus? One word for you…Stop!