Sequence averages in Scala

I’ve been learning Scala and decided to put together a C# to Scala cheat sheet. All is going pretty well but then I got stuck on the equivalent of Average.

Enumerable.Average in .NET calculates a mean average from your sequence by summing up all the values and counting them in a single pass then returning the sum divided by the count in a floating point format (or decimal).

The problem

Given that Scala has nothing built-in there are more than a few suggestions online that boil down to:

val average = seq.sum / seq.length

This has a few problems:

  1. Visiting a sequence twice can be inefficient
  2. Sum can overflow as it is the same type as the sequence
  3. Applied to an integer without casting it returns an integer average

A solution

Scala provides a useful high-order function called foldLeft. Its job is to take an initial state and a function then keep applying the function with each value to the state. So one more efficient solution to the problem is:

val average = seq.foldLeft((0.0, 1)) ((acc, i) => ((acc._1 + (i - acc._1) / acc._2), acc._2 + 1))._1

How does this work?

What we do here is calculate an average as we go, adding the new weighted average each time.

It achieves this by setting up a tuple to contain our initial state with (0.0, 1). This specifies our starting average of 0.0 and our starting position of 1.

The next part specifies the function that takes that state as acc (for accumulator) and the next value in the sequence as i and calculates our rolling average for each value and increases the position as it goes along.

Finally at the end of our call we specify ._1 which tells the compiler we want the first value from the tuple – the average – as we no longer care about the position.

If you wanted to make this function more reusable you could do this:

def average(s: Seq[Int]): Double = s.foldLeft((0.0, 1)) ((acc, i) => ((acc._1 + (i - acc._1) / acc._2), acc._2 + 1))._1

Be aware you might need multiple overloads for each numeric sequence type you want to be able to average given the lack of a common numeric trait that allows for the subtraction and division.

Precision and rounding

There is some slight variance in results between this approach and the total / count due to rounding precision. If you wanted to preserve that you could always add and then divide at the end still in a single pass much like .NET does but with Scala’s foldLeft rather than a foreach.

def average(s: Seq[Int]): Double = { val t = s.foldLeft((0.0, 0)) ((acc, i) => (acc._1 + i, acc._2 + 1)); t._1 / t._2 }

[)amien

Optimizing Sum, Count, Min, Max and Average with LINQ

LINQ is a great tool for C# programmers letting you use familiar syntax with a variety of backend systems without having to learn another language or paradigm for many query operations.

Ensuring that the queries still perform well can be a bit of a chore and one set that fails quite badly are the aggregate operations when you want more than one.

Multiple sequential queries (bad)

var count = db.Invoices.Count();
var total = db.Invoices.Sum(i => i.Paid);
var average = db.Invoices.Average(i => i.Paid);

Will issue three separate requests. There is nothing a LINQ provider can do to optimize that pattern as they are three discrete statements.

Background

If we wanted these values by country we could do this in LINQ:

var a = db.Invoices.GroupBy(i => i.Country)
          .Select(g => new { Country = g.Key,
                             Count = g.Count(),
                             Total = g.Sum(i => i.Paid),
                             Average = g.Average(i => i.Paid) });

Which gets us everything in a single statement broken down by country. In SQL this is:

SELECT Country, Count(*), Sum(Paid), Average(Paid)
    FROM Invoices GROUP BY Country

Many data sources including SQL are happy to provide aggregate values without a group by so how do we generate that from LINQ?

In the absence of a Group method that doesn’t take a property we need to fake it and because of the way many LINQ providers optimize out parts of the tree we can:

Single optimized query (good)

Replacing the property in a GroupBy with a constant value gives us an optimized single query:

var a = db.Invoices.GroupBy(i => 1)
          .Select(g => new { Count = g.Count(),
                             Total = g.Sum(i => i.Paid),
                             Average = g.Average(i => i.Paid) });

Here are the providers I’ve tried:

  • LINQ to Objects (Works although constant is likely evaluated)
  • LINQ to SQL (Works although passes 1 parameter to SQL)
  • Entity Framework 6 (Works although query is a little obscure)
  • Elasticsearch (Works and optimizes out totally)

Count+Where optimizations

If we are performing counts with a predicate or against a where we can also optimize these.

var high = db.Invoices.Count(i => i.Paid >= 1000);
var low = db.Invoices.Where(i => i.Paid < 1000).Count();
var sum = db.Invoices.Sum(i => i.Paid);

Then we can express this as:

var a = db.Invoices.GroupBy(g => 1)
          .Select(g => new { High = g.Count(i => i.Paid >= 1000),
                             Low = g.Count(i => i.Paid < 1000),
                             Sum = g.Sum(i => i.Paid) });

[)amien

What to do before your iTunes Match subscription expires

At $25 a year the iTunes Match service can be a little tough to swallow given all it does is synchronize your music across iTunes especially when other file-sharing services are cheaper and more general purpose (OneDrive, Mega, DropBox etc).

One important thing to know however before you let your subscription lapse or cancel is that once it’s gone all your cloud-backed-up music will be unavailable.

That means if you don’t still have a local copy of the track your ripped from CD/download from anywhere but iTunes you’re going to be digging through backups or have to re-rip or repurchase it.

There is a simple way to download all your missing music before your subscription expires though.

Steps to download all your iTunes Match tracks

  1. Start up iTunes
  2. Create a new Smart Playlist with the criteria (as shown in screenshot)
    1. iCloud Status is matched
    2. Location is not on this computer
      iTunes Smart Playlist for finding Cloud-only tracks
  3. Save this Smart Playlist as say “iTunes Match Download”
  4. Browse to this Smart Playlist and select one song
  5. Select all with Ctrl-A (Windows) or Cmd-A (Mac)
  6. Consider the total size at the bottom of the screen in terms of whether you have this disk space or bandwidth allowance.
  7. Right click on the items and choose Download

This may take a while. You can see the status by opening the Downloads window.

If the downloads stop or fail for any reason just repeat steps 4-6 as your new playlist will keep shrinking as files are now available on your computer.

Enjoy!

[)amien

Setup an Ubuntu server at Digital Ocean

This post is part of a series on Turbocharging WordPress with NGINX and HipHopVM but also serves as a standalone beginners guide to getting an Ubuntu server ready at Digital Ocean.

Virtual machines are called Droplets at Digital Ocean so hit Create then:

1. Give it a name

Give your server a name. This has no bearing on the name your customers see and is only for initially connecting to it/in the Digital Ocean dashboard.

2. Select size

A popular blog should have no problem with the $10 a month 1GB/30GB/2TB option but I run a few sites so went for the next one up with more CPU and RAM.

You can scale up later although you won’t get the extra disk space as it can’t resize the disk. Given static storage like Amazon S3 is cheap and integrates with their CloudFront CDN this isn’t a problem.

3. Select region

Is your audience focused in a specific area?

  • Yes (e.g. a real estate site) then choose the closest server to them
  • No – choose the US East Coast like New York for good global coverage

4. Select Image

Here you select which distribution of Linux, which version and which CPU architecture you want to use.

For this guide I’m using Ubuntu 14.04 x64. In theory you could use alternative distributions or versions but you’re on your own.

Do not select x32 as HipHopVM is only supported on 64-bit architectures.

4. Add optional SSH keys

SSH keys let you automatically sign in without a password – the security being a key file on your computer instead. It’s worth learning how to use this but is outside the scope of this article so just use the normal password for now.

5. Settings

Leave the defaults on unless you want to pay extra for their backup service. Personally I like to use a WordPress plugin that backs up to S3 called UdraftPlus.

6. Hit Create Droplet

Within 60 seconds you should have a fresh virtual machine ready to go.

Connect to your Ubuntu virtual machine/droplet

You will need ssh (preinstalled on a Mac, Windows users should check out Putty) Check the IP address shown on your droplet’s page then:

ssh root@107.170.26.154

You should confirm the fingerprint the first time by typing yes then be rewarded with a Welcome message and a cursor to type new commands into!

Update your Ubuntu virtual machine

Even though the version of Ubuntu you chose is quite up to date there will be a few updates to apply, thankfully this is very easy.

sudo apt-get update

Tells the package manager (known as “apt”) to go find out about all the updates. It doesn’t yet install them though, to do that we need to wait until it’s finished then type:

sudo apt-get upgrade

You’ll need to confirm this with Y and wait a little bit. This upgrades a lot of the packages and applications. Once complete then:

sudo apt-get dist-upgrade

This tells apt to upgrade the core operating system as well. Again confirm with Y and wait a little bit. Once this one is complete you’ll need to reboot your machine with:

sudo reboot now

You’re now ready to reconnect and starting installing packages to make your virtual machine do something useful!

[)amien