
Random tips for PowerShell, Bash & AWS  

Now that I’m freelance again, I find myself solving a variety of unusual issues, many of which I could find no online solutions for.

Given these no doubt plague other developers, let’s share!

Pass quoted args from BAT/CMD files to PowerShell

Grabbing args from a batch/command file is easy – just use %* – but have you ever tried passing them on to PowerShell like this:

powershell "Something" "%*"

Unfortunately, if one of your arguments has quotes around it (a filename with a space, perhaps) then it becomes two separate arguments, e.g. "My File.txt" becomes My and File.txt.

PowerShell will only preserve them if you use the -File option (to run a .ps1 script), but that requires a relaxed policy via Set-ExecutionPolicy and so is a no-go for many people.

Given you can’t make PowerShell do the right thing with the args, the trick here is to not pass them as args at all!

SET MYPSARGS=%*
...
powershell -Command "Something $env:MYPSARGS"
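
For example, a complete wrapper might look something like this – a minimal sketch where the Write-Host body is just a placeholder for whatever you actually want to run:

@ECHO OFF
REM Capture all the batch file args so PowerShell can read them from the environment
SET MYPSARGS=%*
powershell -NoProfile -Command "Write-Host Received: $env:MYPSARGS"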

Get Bash script path as Windows path

While Cygwin ships with cygpath to convert /c/something to c:\Something etc., the MSYS, MSYS2, GitHub Desktop and Git for Windows Bash shells do not have it.

However, you can get it another way:

#!/bin/sh
# switch to the directory this script lives in
pushd "$(dirname "$0")" > /dev/null
# -W asks MSYS-derived shells to print the working directory as a Windows path
WINPWD="$(pwd -W)"
popd > /dev/null
echo "$WINPWD"

This works by switching the working directory to the one the script is in ($(dirname "$0")), then capturing the output of the print-working-directory command using the -W option, which reports it in Windows format. It then pops the working directory to make sure it goes back to where it was.

Note that this still uses forward slashes as the directory separator – a lot of software is okay with that, but some older apps and tools are not.
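
If you need backslashes, a bash parameter expansion can swap them – a minimal sketch building on the WINPWD variable above:

# replace every forward slash in the captured path with a backslash
WINPATH="${WINPWD//\//\\}"
echo "$WINPATH"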

JSON encoding in API Gateway mapping templates

Using Amazon’s AWS Lambda you’ll also find yourself touching API Gateway, and while most of it is great, the mapping templates are quite deficient in that they do not encode output by default, despite you specifying the MIME types.

All of Amazon’s example templates are exploitable via JSON injection. Just put a double-quote in a field and start writing your own JSON payload.

Amazon needs to fix this and make it encode by default, as other templating systems such as ASP.NET Razor have done.

Until then, some recommend the Amazon-provided $util.escapeJavaScript(); however, while it encodes " as \", it also produces illegal JSON by encoding ' as \'.

The mapping language is Apache Velocity Template Language (VTL), and while it is not extensible, the fine print reveals that it internally uses Java strings and does not sandbox us. This lets us use Java’s replace functionality:

#set($i = $input.path('$'))
{
  "safeString": "$i.unsafeString.replace("\"","\\\"")"
}
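
If the input might also contain backslashes, the same trick can chain Java’s replace calls – a sketch (with a hypothetical field name) that escapes backslashes first so the quote escaping isn’t doubled up:

#set($i = $input.path('$'))
{
  "safeString": "$i.unsafeString.replace("\\","\\\\").replace("\"","\\\"")"
}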

Show active known IPs on local network

I’m surprised more people don’t know how useful arp -a is, especially if you pipe it into ping…

Bash

arp -a | grep -o '[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}' | xargs -L1 ping -c 1 -t 1 | sed -n -e 's/^.*bytes from //p'

PowerShell

(arp -a) -match "dynamic" | Foreach { ping -n 1 -w 1000 ($_ -split "\s+")[1] } | where { $_ -match "Reply from " } | % { $_.replace("Reply from ","") }

Wrapping up

I just want to mention that if you are doing anything on a command line – be it Bash, OS X, PowerShell or Command/Batch – then SS64 is a site worth visiting, as they have great docs on many of these things!

[)amien

Monitoring URLs for free with Google Cloud Monitor  

As somebody who runs a few sites I like to keep an eye on them and make sure they’re up and responding correctly.

My go-to for years has been Pingdom, but this year they gutted their free service so that you can now only monitor every 5 minutes.

The free service with Pingdom wasn’t great to start with – limited alerting options and only a single endpoint – so I went searching for something better, as $15 a month to monitor a couple of personal low-volume sites is not money well spent.

Google Cloud

I’ve played with the Google Cloud Platform offerings for a while and, like many other platforms, it includes a monitoring component called, unsurprisingly, Google Cloud Monitoring.

Right now it’s in beta and free, and it’s based on StackDriver, which Google acquired in 2014. I can imagine more integration and services will continue to come through, as they have a complete product that also monitors AWS.

Uptime checks

Screenshot showing uptime check options

You can create HTTP/HTTPS/TCP/UDP checks, and while it was designed to monitor the services you’re running on Google Cloud, it will happily take arbitrary URLs to services running elsewhere.

Checks can be run every 1, 5, 10 or 15 minutes, use custom ports, and look for specific strings in the response, as well as set custom headers and specify authentication credentials.

Each URL is monitored and its performance reported from six geographical locations: east, central and west USA, plus one in Europe, one in Asia and one in Latin America. For example:

damieng.com/

  • Virginia responded with 200 (OK) in 357 ms
  • Oregon responded with 200 (OK) in 377 ms
  • Iowa responded with 200 (OK) in 330 ms
  • Belgium responded with 200 (OK) in 673 ms
  • Singapore responded with 200 (OK) in 899 ms
  • Sao Paulo responded with 200 (OK) in 828 ms

Alerting policies

Here’s where Google’s offering really surprised me: alerting options not just for SMS and email but also for HipChat, Slack, Campfire and PagerDuty. You can specify a number of them together and mix and match them with different uptime checks.

Screenshot of alerting policy options

Incidents

Like Pingdom, if the endpoint being monitored goes down an incident is opened that you can write details (comments) into; also like Pingdom, the incident is closed once the endpoint starts responding again.

Graph & dashboard

The cloud monitoring product has a configurable dashboard that, like the rest of the product, is really geared around monitoring Google Cloud specific services, but the uptime monitoring component can still provide some value there.

You can download the JSON for a graph, and there is an API as well as iframe-embeddable sharing functionality.

Final thoughts

I’m very impressed with this tool given the lack of limitations in a free product and will be using it for a bunch of my sites – bearing in mind, however, that it currently has no SLA!

Any other recommendations for free URL monitoring?

[)amien

Notes on Edward Tufte’s Presenting Data and Information  

Photograph of Envisioning Information

Here are my notes from today’s event by renowned statistician Edward Tufte – author of The Visual Display of Quantitative Information and Envisioning Information – primarily for my own reference, but perhaps of interest to others.

A dramatic start

No announcement, no preamble. The lights went out and a visually striking video showing a representation of music started. Conversations were immediately hushed and devices put away. An effective technique to get attention and signal an absolute start.

Charts and tables

Sorting: Find a sort for your data that makes sense. Treat it as another axis and don’t waste it with the alphabet.

Sparse columns: Remove sparsely populated columns from tables. Special events should be specially annotated.

Linking lines: Always annotate them to describe the interaction. Prefer verbs over nouns, as nouns are merely a taxonomy.

Information does not fit in a tree. The web is successful because Tim Berners-Lee understood this and made links the interconnectedness between content. “Vague, but exciting”

Data

Content is not clean. Data that shows behavior in a perfect way has likely been manipulated.

Human beings over-detect clusters and conspiracies. They find links between unrelated events especially in sequences (serial correlation). Sports commentators given any series of scores will develop a false narrative to explain it. They’ll find a reason for 7 wins in a row despite random data producing such sequences.

Self-monitoring is a farce because people can’t keep their own score. Once something is measured it becomes a target and will be subsequently gamed and fudged as needed.

You can make many models to fit any data you are given. It may work well for the past and current data but how far it will last is highly variable. This effect is referred to as shrinkage – no model lasts forever.

Big data is not a substitute for traditional data collection and analysis. Google famously thought otherwise when they created Google Flu Trends, which tried to spot the spread of flu based on search terms. It has been seriously criticized by Forbes and the New York Times.

Conflict

Do not jump to conflict or character assassination. Your motives are likely no better (or worse).

How many nice comments wiped out a bad one? Ten… a hundred?

There is evil in the world but it probably does not exist in your day-to-day life.

A deck of slides

A deck is inefficient. It is easy for the presenter but hard for the audience, who are waiting for something they can use (“a diamond in the swamp”). Slow reveals further reduce the information density, and people will check out when it gets low.

Prefer spatially adjacent data (a document) over temporally stacked (slides). The often-cited limit of 7±2 items was for temporal retention so limiting a page to this number of items is actually the opposite of what that research was telling us. We can cope with much more data if it is all on-page together.

Meetings and presentations

Do not be afraid of paper.

Prepare a document in advance but do not send it; instead, spend 30 minutes at the start of the meeting reading it in silence (known as a study hall). People can read faster than you can talk, can go back and forth as needed, and can skip what they already know – and latecomers are less disruptive. Amazon famously uses this with its 6-page narrative memo system.

Never go meta in your presentation – stick to the content. Respect your audience and do not presume to know them, or you may find yourself pandering or having low expectations. Instead, present the data to the best of your ability. Many complicated things are explained to millions of people all the time. You can’t teach if you have low expectations. Negativity and positivity are self-fulfilling.

Does your audience understand and trust you? Credibility is eroded not just by lying but by cherry-picking. Evidence of cherry-picking includes data too good to be true and hiding the source of the data behind excuses such as copyright, proprietary information or other secrets. Why would a conclusion be open when the data needs to be secret? It’s likely a misrepresentation of the data for their own means.

Note down a few words when somebody asks you a question to make sure your answer stays on topic. If you don’t know the answer, be honest, but suggest where you would start looking for one. Never heckle or waste time correcting minutiae.

Doctor’s trip

A trip to the doctor’s office is a presentation. Write down your list before you go in and make them listen: they normally interrupt after 22 seconds and consider each item individually. Otherwise you’ll give up before you reach the end of your list and they may not see the connected pattern of the whole.

Documents

Every document needs an abstract. It should spell out as simply as possible:

  1. What the problem is
  2. Who cares
  3. What the solution is

If you can’t write this then you don’t have a document and you’re not saying anything.

LaTeX

Real scientists use LaTeX. There are thousands of templates, including official ones for well-known journals. Online tools like Overleaf can reduce the barrier to entry. LaTeX code looks like this:

\documentclass{article}
\title{My presentation matters}
\begin{document}
\maketitle
\section*{Introduction}
A sample of LaTeX.
\end{document}

R is another alternative, but it’s considered hard even by people who use LaTeX.

Reading

At school we are taught to read to extract facts in order to pass exams. We need to practice reading for enjoyment, reading to spot new information, to extract what we want, to form new opinions and ideas, to loot & hack.

Immediately skip words you don’t understand: there won’t be a test – you’re not at school.

Design

Design does not belong to ‘other people’. Support thinking with analytical design and do whatever it takes to explain the data.

Why do bird books use illustrations? Because the authors want to help you spot the birds and using art they exaggerate the differences as well as produce a generic version of the bird.

Nature magazine has some of the best-designed visualizations around. Openness, pride and space constraints all help (DNA only got 1.5 pages). The New York Times also often produces interesting visualizations of data.

User interface

Use the ideas proven by large successful sites on the web. Do not be swayed by arguments that your users won’t understand. Millions of users already do.

Touch is the next generation of user interface. It allows the chrome (interface junk) to be jettisoned: no scroll bars, no buttons, no cursor, no zoom. Pure information experiences – and this came not from academia, finance or medicine but from the consumer space.

“The future of interface design… is information design.” – Edward Tufte, Seattle, August 4 2015

The original UI metaphors at Xerox PARC on the Alto were built around a single document. Instead we have application-owned silos of data. The elegance was lost because companies want to control the content you create with their tools. They isolate your content so they can profit.

Hierarchies are still used for web design because they mimic the organization paying the bill. Organizations see themselves this way and do not focus on how and what their customers need. Famous examples include the Treasury Department burying tax forms 7 levels deep despite their being a top user request, and the XKCD strip about university web sites. People on the inside have a skewed perspective of what the outside is.

The density of user interfaces is increasing which allows for richer visualizations especially when combined with animation or video. It is hard to get right.

Time window events with Apache Spark Streaming  

If you’re working with Spark Streaming you might run into an interesting problem if you want to output an event based on a number of messages within a specific time period.

For example: I want to send a security alert if I see 10 DDoS attempts against an IP address in a 5-minute window.

reduceByKeyAndWindow

reduceByKeyAndWindow allows us to choose the IP address for the key and 5 minutes for the window. If we then wanted to collect the sourceIp and the timestamp it would look like this:

val messageLimit = 10
val messageWindow = Minutes(5)
val scc = new StreamingContext(conf, Minutes(1))

// ... set up the Kafka consumer via KafkaUtils
kafkaConsumer
    .flatMap(parseSecurityMessage)
    .filter(m => m.securityType == "DDOS")
    .map(m => m.targetIp -> Seq((m.timestamp, m.sourceIp)))
    .reduceByKeyAndWindow({ (x, y) => x ++ y }, messageWindow)
    .filter(g => g._2.length >= messageLimit)
    .foreachRDD(m => m.foreach(createAlertEvent))

scc.start()
scc.awaitTermination()

Problem

The problem is that your event will fire many times, as the stateless RDD is re-run every batch period.

The simplest solution would be to make the batch interval the same as your message window size but that causes more problems, namely:

  • Your job can’t perform any other triggers on the source data at a shorter interval
  • You won’t know about these alerts until some time after they happen (in this case 5 minutes)

External storage would be terrible, and neither Spark counters nor globals are much use here.

Solution

We need to do two things:

  1. Stop the RDD re-running and instead use streaming state. We can do this by using the reduceByKeyAndWindow overload that allows us to specify the inverse function for removing data as it goes out of window.
  2. Introduce a small amount of in-RDD state that can be used to identify when the event is cleared and when it should fire again.

Let us assume there is a class to handle part 2 named WindowEventTrigger that provides add and remove methods as well as a boolean triggerNow flag that identifies when the event should re-fire. Our RDD body would now look like this:

kafkaConsumer
    .flatMap(parseSecurityMessage)
    .filter(m => m.securityType == "DDOS")
    .map(m => m.targetIp -> new WindowEventTrigger((m.timestamp, m.sourceIp), messageLimit))
    .reduceByKeyAndWindow(_ add _, _ remove _, messageWindow)
    .filter(_._2.triggerNow)
    .foreachRDD(m => m.foreach(createAlertEvent))

How this works is actually quite simple. We have a case class called WindowEventTrigger that we map into the stream for each incoming message. It then:

  1. Tracks incoming messages and if it hits the level sets the flag and makes note of the event
  2. Tracks outgoing messages and resets when the event that caused the trigger goes out of window

By switching to the stateful form of reduceByKeyAndWindow (the overload with an inverse function), Spark needs to persist state in case executors go down or data has to be shuffled between them. Ensure your StreamingContext has a checkpoint folder set to reliable storage such as HDFS.
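
For example – a one-line sketch where the checkpoint path is hypothetical:

// checkpointing is required when using the inverse-function overload of reduceByKeyAndWindow
scc.checkpoint("hdfs:///checkpoints/security-alerts")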

WindowEventTrigger class

Here is the WindowEventTrigger class for your enjoyment.

case class WindowEventTrigger[T] private(
    eventsInWindow: Seq[T],
    triggerNow: Boolean,
    private val lastTriggeredEvent: Option[T],
    private val triggerLevel: Int) {

  def this(item: T, triggerLevel: Int) = this(Seq(item), false, None, triggerLevel)

  // Combine with messages entering the window; fire when the count first reaches the trigger level
  def add(incoming: WindowEventTrigger[T]): WindowEventTrigger[T] = {
    val combined = eventsInWindow ++ incoming.eventsInWindow
    val shouldTrigger = lastTriggeredEvent.isEmpty && combined.length >= triggerLevel
    val triggeredEvent = if (shouldTrigger) combined.drop(triggerLevel - 1).headOption else lastTriggeredEvent
    new WindowEventTrigger(combined, shouldTrigger, triggeredEvent, triggerLevel)
  }

  // Drop messages leaving the window; re-arm once the event that caused the trigger has gone
  def remove(outgoing: WindowEventTrigger[T]): WindowEventTrigger[T] = {
    val reduced = eventsInWindow.filterNot(y => outgoing.eventsInWindow.contains(y))
    val triggeredEvent = if (lastTriggeredEvent.isDefined && outgoing.eventsInWindow.contains(lastTriggeredEvent.get)) None else lastTriggeredEvent
    new WindowEventTrigger(reduced, false, triggeredEvent, triggerLevel)
  }
}
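
As a quick sanity check outside of Spark, the trigger behaves like this (message values hypothetical):

val a = new WindowEventTrigger("msg1", 2)
val b = new WindowEventTrigger("msg2", 2)

val combined = a.add(b)        // window now holds 2 messages, hitting the level, so triggerNow is true
val later = combined.remove(a) // "msg1" slides out of the window; triggerNow is false again
val reset = later.remove(b)    // the event that fired has now left, so the trigger is re-armed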

Happy streaming,

[)amien