Notes on Edward Tufte’s Presenting Data and Information  

Here are my notes from today’s event by renowned statistician Edward Tufte, author of The Visual Display of Quantitative Information and Envisioning Information. They are primarily for my own reference but perhaps of interest to others.

A dramatic start

No announcement, no preamble. The lights went out and a visually striking video showing a representation of music started. Conversations were immediately hushed and devices put away. An effective technique to get attention and signal an absolute start.

Charts and tables

Sorting: Find a sort for your data that makes sense. Treat it as another axis and don’t waste it with the alphabet.

Sparse columns: Remove sparsely populated columns from tables. Special events should be specially annotated.

Linking lines: Always annotate them to describe the interaction. Prefer verbs over nouns, as the nouns merely restate the taxonomy.

Information does not fit in a tree. The web is successful because Tim Berners-Lee understood this and made links the connections between content. “Vague, but exciting”


Content is not clean. Data that shows behavior in a perfect way has likely been manipulated.

Human beings over-detect clusters and conspiracies. They find links between unrelated events, especially in sequences (serial correlation). Sports commentators, given any series of scores, will develop a false narrative to explain it. They’ll find a reason for 7 wins in a row even though random data produces such sequences.

Self-monitoring is a farce because people can’t keep their own score. Once something is measured it becomes a target and will be subsequently gamed and fudged as needed.

You can make many models that fit the data you are given. A model may work well for past and current data, but how long it keeps working is highly variable. This effect is referred to as shrinkage; no model lasts forever.

Big data is not a substitute for traditional data collection and analysis. Google famously treated it as one when they created Google Flu Trends, which tried to spot the spread of flu from search terms; it has been seriously criticized by Forbes and the New York Times.


Do not jump to conflict or character assassination. Your motives are likely no better (or worse).

How many nice comments wiped out a bad one? Ten… a hundred?

There is evil in the world but it probably does not exist in your day-to-day life.

A deck of slides

A deck is inefficient. It is easy for the presenter but hard for the audience, who are waiting for something they can use (“a diamond in the swamp”). Slow reveals further reduce the information density, and people will check out when it gets low.

Prefer spatially adjacent data (a document) over temporally stacked (slides). The often-cited limit of 7±2 items was for temporal retention so limiting a page to this number of items is actually the opposite of what that research was telling us. We can cope with much more data if it is all on-page together.

Meetings and presentations

Do not be afraid of paper.

Prepare a document in advance but do not send it; instead spend 30 minutes at the start of the meeting reading it in silence (known as a study hall). People can read faster than you can talk, they can go back and forth as needed and skip what they already know, and latecomers are less disruptive. Amazon famously uses this with its 6-page narrative memo system.

Never go meta in your presentation – stick to the content. Respect your audience and do not presume to know them, or you may find yourself pandering or lowering your expectations. Instead present the data to the best of your ability. Many complicated things are explained to millions of people all the time. You can’t teach if you have low expectations. Negativity and positivity are self-fulfilling.

Does your audience understand and trust you? Credibility is eroded not just by lying but by cherry picking. Evidence of cherry picking includes data that looks too good to be true and hiding the source of the data behind excuses such as copyright, proprietary information or other secrets. Why would a conclusion be open when the data needs to be secret? It’s likely a misrepresentation of the data for their own ends.

Note a few words when somebody asks you a question to make sure your answer stays on topic. If you don’t know the answer be honest but suggest where you would start looking for the answer. Never heckle or waste time correcting minutiae.

Doctor’s trip

A trip to the doctor’s office is a presentation. Write down your list before you go in and make them listen: they normally interrupt after 22 seconds and want to consider each item individually. If you let them, you’ll give up before you reach the end of your list and they may not see the connected pattern of the whole.


Every document needs an abstract. It should spell out as simply as possible:

  1. What the problem is
  2. Who cares
  3. What the solution is

If you can’t write this then you don’t have a document and you’re not saying anything.


Real scientists use LaTeX. There are thousands of templates, including official ones for well-known journals. Online tools like Overleaf can reduce the barrier to entry. LaTeX source looks like this:

\title{My presentation matters}
 Sample of LaTeX
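That line only sets the title. For context, a minimal complete document (an illustrative sketch of my own, not from the talk) that would compile with pdflatex or on Overleaf looks like this:

\documentclass{article}
\title{My presentation matters}
\begin{document}
\maketitle
Body text goes here.
\end{document}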

R is another alternative but it’s considered hard even by people who use LaTeX.


We are taught to read to extract facts to pass exams at school. We need to practice reading for enjoyment, reading to spot new information, to extract what we want, to form new opinions and ideas, to loot & hack.

Immediately skip words you don’t understand: there won’t be a test – you’re not at school.


Design does not belong to ‘other people’. Support thinking with analytical design and do whatever it takes to explain the data.

Why do bird books use illustrations? Because the authors want to help you spot the birds: with art they can exaggerate the differences as well as produce a generic version of each bird.

Nature magazine has some of the best designed visualizations around. Openness, pride and space constraints all help. (DNA only got 1.5 pages) The New York Times also often produces interesting visualizations of data.

User interface

Use the ideas proven by large successful sites on the web. Do not be swayed by arguments that your users won’t understand. Millions of users already do.

Touch is the next generation of user interface. It allows the chrome (interface junk) to be jettisoned. No scrollbars, no buttons, no cursor, no zoom. Pure information experiences, and this came not from academia, finance or medicine but from the consumer space.

“The future of interface design… is information design.” Edward Tufte – Seattle, August 4 2015

The original UI metaphors at Xerox PARC on the Alto were around a single document. Instead we have application-owned silos of data. The elegance was lost because companies want to control the content you create with their tools. They isolate your content so they can profit.

Hierarchies are still used for web design because they mimic the organization paying the bill. Organizations see themselves this way and do not focus on how their customers work and what they need. Famous examples include the Treasury Department burying tax forms seven levels deep despite them being a top user request, and the XKCD strip about university web sites. People on the inside have a skewed perspective of what the outside looks like.

The density of user interfaces is increasing which allows for richer visualizations especially when combined with animation or video. It is hard to get right.

Time window events with Apache Spark Streaming  

If you’re working with Spark Streaming you might run into an interesting problem if you want to output an event based on a number of messages within a specific time period.

For example: I want to send a security alert if I see 10 DDOS attempts to an IP address in a 5 minute window.


reduceByKeyAndWindow allows us to choose the IP address for the key and 5 minutes for the window. If we wanted to then collect the sourceIp and the timestamp it would look like this:

val messageLimit = 10
val messageWindow = Minutes(5)
val ssc = new StreamingContext(conf, Minutes(1)) // conf is your existing SparkConf

// ... create the DStream of messages from the Kafka consumer (e.g. via KafkaUtils), then:
    .filter(m => m.securityType == "DDOS")
    .map(m => m.targetIp -> Seq((m.timestamp, m.sourceIp)))
    .reduceByKeyAndWindow({ (x, y) => x ++ y }, messageWindow)
    .filter(g => g._2.length >= messageLimit)
    .foreachRDD(m => m.foreach(createAlertEvent))



The problem is your event will fire many times as the stateless RDD is re-run every batch period.

The simplest solution would be to make the batch interval the same as your message window size but that causes more problems, namely:

  • Your job can’t perform any other triggers on the source data at a shorter interval
  • You won’t know about these alerts until some time after they happen (in this case 5 minutes)

Keeping external state would be terrible, and neither Spark counters (accumulators) nor globals are much use here.


We need to do two things:

  1. Stop the RDD re-running and instead use streaming state. We can do this by using the reduceByKeyAndWindow overload that allows us to specify the inverse function for removing data as it goes out of window.
  2. Introduce a small amount of in-RDD state that can be used to identify when the event is cleared and when it should fire again.

Let us assume there is a class to handle part 2 named WindowEventTrigger that provides add and remove methods as well as a boolean triggerNow flag that identifies when the event should re-fire. Our RDD body would now look like this:

    .filter(m => m.securityType == "DDOS")
    .map(m => m.targetIp -> new WindowEventTrigger((m.timestamp, m.sourceIp), messageLimit))
    .reduceByKeyAndWindow(_ add _, _ remove _, messageWindow)
    .filter(t => t._2.triggerNow) // only alert when the trigger has just fired
    .foreachRDD(m => m.foreach(createAlertEvent))

How this works is actually quite simple. We have a case class called WindowEventTrigger that we map into the stream for each incoming message. It then:

  1. Tracks incoming messages and if it hits the level sets the flag and makes note of the event
  2. Tracks outgoing messages and resets when the event that caused the trigger goes out of window

By switching to this stateful windowed reduce, Spark will need to persist state in case executors go down or it is necessary to shuffle data between them. Ensure your StreamingContext object has a checkpoint folder set to reliable storage such as HDFS.
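For example, using the ssc context from the earlier snippet (the checkpoint path here is just a placeholder):

// reduceByKeyAndWindow with an inverse function requires checkpointing to be enabled
ssc.checkpoint("hdfs:///checkpoints/ddos-alerts")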

WindowEventTrigger class

Here is the WindowEventTrigger class for your enjoyment.

case class WindowEventTrigger[T] private(eventsInWindow: Seq[T], triggerNow: Boolean, private val lastTriggeredEvent: Option[T], private val triggerLevel: Int) {
  def this(item: T, triggerLevel: Int) = this(Seq(item), false, None, triggerLevel)

  // Combine state from two windows and fire the first time the trigger level is reached
  def add(incoming: WindowEventTrigger[T]): WindowEventTrigger[T] = {
    val combined = eventsInWindow ++ incoming.eventsInWindow
    val shouldTrigger = lastTriggeredEvent.isEmpty && combined.length >= triggerLevel
    val triggeredEvent = if (shouldTrigger) combined.drop(triggerLevel - 1).headOption else lastTriggeredEvent
    new WindowEventTrigger(combined, shouldTrigger, triggeredEvent, triggerLevel)
  }

  // Remove state that has aged out of the window; reset once the triggering event itself has gone
  def remove(outgoing: WindowEventTrigger[T]): WindowEventTrigger[T] = {
    val reduced = eventsInWindow.filterNot(y => outgoing.eventsInWindow.contains(y))
    val triggeredEvent = if (lastTriggeredEvent.isDefined && outgoing.eventsInWindow.contains(lastTriggeredEvent.get)) None else lastTriggeredEvent
    new WindowEventTrigger(reduced, false, triggeredEvent, triggerLevel)
  }
}
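A quick standalone check of the semantics (a sketch of my own with made-up values, not part of the streaming job):

val first = new WindowEventTrigger(("10.0.0.1", 1L), 2)   // one event, below the trigger level of 2
val second = new WindowEventTrigger(("10.0.0.2", 2L), 2)  // a second event arrives in the window
val fired = first add second
println(fired.triggerNow)                 // true: the level was reached for the first time
println((fired remove first).triggerNow)  // false: events ageing out never re-fire the trigger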

Happy streaming,


Table per hierarchy in Azure Table Storage  

If you’re coming from an ORM background to Azure Table Storage you might be wondering how to map class hierarchies to tables.

Documentation on the topic is hard to find unless you know the magic class name EntityResolver, which you can find by digging into the Azure Storage Client for .NET source code.

Let’s say we have a basic blog style system (minimal fields shown):

public class Content {
  public string Id { get; set; }
  public string Title { get; set; }
}

public class BlogPost : Content {
  public List<string> Topics { get; set; }
}

public class Page : Content {
  public string Slug { get; set; }
}

The trick is to create an instance of EntityResolver<T> where T is your base class, e.g. Content. Strangely EntityResolver’s signature requires that T implements new(), so you can’t make your base class abstract.

Firstly we need to add to our base class some kind of identifier for the type – in ORM terms this is referred to as a discriminator. Then we’d override that in the subtypes to ensure new instances get the correct type set on insertion.

Let’s say we want to store all of these in a single table called ‘content’; we would typically write a small helper class to handle the cloud table setup and storage. With the discriminator added to the base class and overridden in each subtype, the classes now look like this:

public class Content {
  public string Id { get; set; }
  public string Title { get; set; }
  public virtual string ContentType { get; set; }
}

public class BlogPost : Content {
  public List<string> Topics { get; set; }
  public override string ContentType {
    get { return "blog"; }
    set { }
  }
}

public class Page : Content {
  public string Slug { get; set; }
  public override string ContentType {
    get { return "page"; }
    set { }
  }
}

With just that change you can actually start inserting rows into Azure Table Storage but querying them back will always result in Content types and saving those back will result in data loss!

We can however help the CloudTable client materialize the correct results by creating an EntityResolver:

EntityResolver<Content> contentResolver = (partitionKey, rowKey, timestamp, properties, etag) => {
    var contentType = properties["ContentType"].StringValue;
    switch (contentType) {
        case "blog": return new BlogPost();
        case "page": return new Page();
        default: throw new NotSupportedException(String.Format("Unknown ContentType '{0}'", contentType));
    }
};
Which is then passed into operations that materialize results. Note that some signatures don’t accept a resolver, so find one that does even if it means supplying a default OperationContext. For example:

var query = table.CreateQuery<Content>().Where(c => c.PartitionKey == yearMonth);
var results = table.ExecuteQuery(query.AsTableQuery(), contentResolver, myRequestOptions, myOperationContext);

Given that these entity resolvers are essential to correctly materializing your results without data loss, it’s worth wrapping the CloudTable client with the necessary setup, table creation and entity resolver.
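A minimal sketch of what such a wrapper might look like (the ContentTable class and ForMonth method are illustrative names of my own, not part of any official API; it assumes the contentResolver shown above and that Content ultimately derives from TableEntity):

// using System.Collections.Generic; using System.Linq;
// using Microsoft.WindowsAzure.Storage;
// using Microsoft.WindowsAzure.Storage.Table;
// using Microsoft.WindowsAzure.Storage.Table.Queryable;
public class ContentTable {
  private readonly CloudTable table;
  private readonly EntityResolver<Content> resolver;

  public ContentTable(CloudStorageAccount account, EntityResolver<Content> contentResolver) {
    table = account.CreateCloudTableClient().GetTableReference("content");
    table.CreateIfNotExists();
    resolver = contentResolver;
  }

  // Every read goes through the resolver so rows always materialize as the correct subtype
  public IEnumerable<Content> ForMonth(string yearMonth) {
    var query = table.CreateQuery<Content>().Where(c => c.PartitionKey == yearMonth).AsTableQuery();
    return table.ExecuteQuery(query, resolver);
  }
}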


Quality of SSL protection for US financial institutions  

Troy Hunt put together a list of top Australian banks and their SSL rating using the Qualys SSL Server Test that reveals the somewhat depressing state of SSL security of various banks down under.

This got me wondering how US financial institutions stack up and I thought I’d share:

Update Nov 2015: Lots of great progress by many of the institutions, with the exceptions of KeyBank (still showing the POODLE vulnerability), Union (needs to support newer tech), Mint (lacking overall, considering they’re a tech company) and Citibank (being lame by blacklisting SSL Labs).