Random tips for PowerShell, Bash & AWS

Now that I am freelancing again, I find myself solving unusual issues, many of which have no solutions online.

Given these no doubt plague other developers, let’s share!

Pass quoted args from BAT/CMD files to PowerShell

Grabbing args from a batch/command file is easy – use %* – but have you ever tried passing them on to PowerShell like this:

powershell "Something" "%*"

Unfortunately, if one of your arguments has quotes around it (a filename containing a space, perhaps), it becomes two separate arguments – e.g. "My File.txt" becomes My and File.txt.

PowerShell will only preserve the quoting if you use the -File option (to run a .PS1 file), and that requires a relaxed policy via Set-ExecutionPolicy, so it is a no-go for many people.

Given you can’t make PowerShell do the right thing with the args, the trick here is to not pass them as args at all!

SET MYPSARGS=%*
...
powershell -Command "Something $env:MYPSARGS"
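
To make the round trip concrete, here is a minimal sketch, assuming a hypothetical wrapper called build.cmd and using Write-Output just to show what arrives:

REM build.cmd - hypothetical wrapper; CMD only expands %VAR%, so
REM $env:MYPSARGS passes through untouched and PowerShell expands it itself
SET MYPSARGS=%*
powershell -NoProfile -Command "Write-Output ('Received: ' + $env:MYPSARGS)"

Running build.cmd "My File.txt" /v prints Received: "My File.txt" /v – the quoting survives intact because CMD never re-tokenizes it.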

Get Bash script path as Windows path

While Cygwin ships with cygpath to convert /c/something to c:\something etc., MSYS Bash shells do not have it. You can, however, get the same result another way:

#!/bin/sh
pushd "$(dirname "$0")" > /dev/null
if command -v cygpath > /dev/null; then
  WINPWD="$(cygpath -a -w .)"
else
  WINPWD="$(pwd -W)"
fi
popd > /dev/null
echo "$WINPWD"

This solution works by switching the working directory to the one the script lives in ("$(dirname "$0")") and then capturing the working directory in Windows format – via cygpath where it exists, or pwd with the -W option otherwise. Finally, it pops the working directory to get back to where it started.

Note that pwd -W still uses forward slashes as the directory separator. Many tools and apps are okay with that, but some older ones are not.
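
If you need to feed one of those older tools, a Bash parameter expansion can swap the separators – a small sketch continuing from the script above:

WINPWD="${WINPWD//\//\\}"   # replace each / with \
echo "$WINPWD"              # e.g. C:\Users\me\project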

JSON encoding in API Gateway mapping templates

If you use Amazon’s AWS Lambda, you’ll also find yourself touching API Gateway. While most of it is great, the mapping templates are deficient in that they do not encode output by default, despite knowing the MIME type of the output.

All of Amazon’s example templates are exploitable via JSON injection. Just put a double-quote in a field and start writing any JSON payload.

Amazon must fix this – encode by default, as other templating systems such as ASP.NET Razor have done. Until then, some recommend the Amazon-provided $util.escapeJavaScript(); however, while it encodes " as \", it also produces illegal JSON by encoding ' as \'.

The mapping language is Apache Velocity Template Language (VTL), and while it is not extensible, the fine print reveals that it internally uses Java strings and does not sandbox them, which lets us use Java’s replaceAll functionality:

#set($i = $input.path('$'))
{
   "safeString": "$i.unsafeString.replaceAll("\""", "\\""")
}
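
For example, if a hypothetical payload arrived with unsafeString set to say "hi", the template above would now emit valid JSON:

{
   "safeString": "say \"hi\""
}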

Show active known IPs on the local network

I’m surprised more people don’t know how useful arp -a is, especially if you pipe it into ping…

Bash

arp -a | grep -o '[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}' | xargs -L1 ping -c 1 -t 1 | sed -n -e 's/^.*bytes from //p'

PowerShell

(arp -a) -match "dynamic" | Foreach { ping -n 1 -w 1000 ($_ -split "\s+")[1] } | where { $_ -match "Reply from " } | % { $_.replace("Reply from ","") }

Wrapping up

I just want to mention that if you are doing anything on a command line – be it Bash, OS X, PowerShell, or Command/Batch – then SS64 is a site worth visiting, as they have great docs on many of these things!

[)amien

Monitoring URLs for free with Google Cloud Monitor

As somebody who runs a few sites, I like to keep an eye on them and make sure they’re up and responding correctly.

My go-to for years has been Pingdom, but this year they gutted their free service (update 2021: it’s totally gone, and Pingdom is now owned by SolarWinds… yeah, the people who got hacked and unknowingly distributed a back door to all their customers), so maybe not that service.

The free service with Pingdom also had limited alerting options and could only monitor a single endpoint, and $15 a month to monitor a couple of personal low-volume sites is not money well spent. So I went looking for something better.

Google Cloud

I’ve played with the Google Cloud Platform offerings for a while and, like many other platforms, theirs includes a monitoring component called, unsurprisingly, Google Cloud Monitoring.

It’s currently free while in beta and is based on Stackdriver, which Google acquired in 2014. I can imagine more integration and services will continue to come through, as they have a complete product that also monitors AWS.

Uptime checks

Screenshot showing uptime check options

You can create HTTP, HTTPS, TCP, or UDP checks, and while they are designed to monitor the services you’re running on Google Cloud, they will happily take arbitrary URLs for services running elsewhere.

Checks can run every 1, 5, 10, or 15 minutes, use custom ports, look for specific strings in the response, and set custom headers and authentication credentials.

Each URL is monitored and reported from six geographical locations: three in the USA (east, central, and west), plus Europe, Asia, and Latin America. For example:

damieng.com

  • Virginia responded with 200 (OK) in 357 ms
  • Oregon responded with 200 (OK) in 377 ms
  • Iowa responded with 200 (OK) in 330 ms
  • Belgium responded with 200 (OK) in 673 ms
  • Singapore responded with 200 (OK) in 899 ms
  • Sao Paulo responded with 200 (OK) in 828 ms

Alerting policies

Here’s where Google’s offering surprised me. It has alerting options for SMS and email, obviously, but also HipChat, Slack, Campfire, and PagerDuty. You can combine them, mixing and matching different channels with different uptime checks, etc.

Screenshot of alerting policy options

Incidents

Like Pingdom, if the monitored endpoint goes down, an incident is opened. You can add details (comments) to the incident, and, like Pingdom, the incident is closed once the endpoint starts responding again.

Graph & dashboard

The Cloud Monitoring product has a configurable dashboard geared around monitoring Google Cloud-specific services, but there is an uptime monitoring component that still provides some value.

You can download the JSON for a graph, and there is an API as well as iframe sharing functionality.

Final thoughts

I’m very impressed with this tool given how few limitations the free product has. I am using it for my sites, but be aware it has no SLA right now!

Any other recommendations for free URL monitoring?

[)amien

Notes on Edward Tufte’s Presenting Data and Information

Photograph of Envisioning Information

Here are my notes from today’s event by renowned statistician Edward Tufte – author of The Visual Display of Quantitative Information and Envisioning Information – primarily for my own reference, but perhaps of interest to others.

A dramatic start

No announcement, no preamble. The lights went out, and a visually striking video showing a representation of music started. Conversations were immediately hushed, and devices put away. An effective technique to get attention and signal an absolute start.

Charts and tables

  • Sorting: Find a sort for your data that makes sense. Treat it as another axis, and don’t waste it with the alphabet.
  • Sparse columns: Remove sparsely populated columns from tables. Special events should be specially annotated.
  • Linking lines: Always annotate them to describe the interaction; prefer verbs over nouns from a taxonomy.

Information does not fit in a tree. The web is successful because Tim Berners-Lee understood this and made links the interconnections between content. ("Vague, but exciting" was the comment written on his original proposal.)

Data

Content is not clean. Data that shows behaviour in a perfect way has likely been manipulated.

Human beings over-detect clusters and conspiracies. They find links between unrelated events, especially in sequences (serial correlation). Sports commentators, given any series of scores, will develop a false narrative to explain it. They’ll find a reason for 7 wins in a row despite random data producing such sequences.

Self-monitoring is a farce because people can’t keep their score. Once something is measured, it becomes a target to be gamed and fudged as needed.

You can make many models fit any given data. A model may work well for past and current data, but how long it continues to hold is highly variable. This is referred to as shrinkage – no model lasts forever.

Big data is not a substitute for traditional data collection and analysis. Google famously thought it was when they created Google Flu Trends, which tried to spot the spread of flu based on search terms. It has been seriously criticized by Forbes and The New York Times.

Conflict

Do not jump to conflict or character assassination. Your motives are likely no better (or worse).

How many nice comments does it take to wipe out a bad one? Ten… a hundred?

Evil exists in the world, but it probably does not exist in your day-to-day life.

A deck of slides

A deck is inefficient. It is easy for the presenter but hard for the audience, who are waiting for something they can use – "a diamond in the swamp". Slow reveals further reduce the information density, and people will check out when it gets low.

Prefer spatially adjacent data (a document) over temporally stacked (slides). The often-cited limit of 7±2 items was for temporal retention, so limiting a page to this number of items is the opposite of what that research was telling us. We can cope with much more data if it is all on-page together.

Meetings and presentations

Do not be afraid of paper.

Prepare a document in advance, but do not send it ahead; instead, spend 30 minutes at the start of the meeting reading it in silence (known as a study hall). People can read faster than you can talk, can go back and forth as needed, and can skip what they already know – and latecomers are less disruptive. Amazon famously uses this with its 6-page narrative memo system.

Never go meta in your presentation – stick to the content. Respect your audience and do not presume to know them, or you may find yourself pandering or having low expectations. Instead, present the data to the best of your ability. Many complicated things are explained to millions of people all the time. You can’t teach if you have low expectations. Negativity and positivity are self-fulfilling.

Does your audience understand and trust you? Credibility is eroded not just by lying but by cherry-picking. Evidence of cherry-picking includes data too good to be true and hiding the source of the data behind excuses such as copyright, proprietary, or secrets. Why would a conclusion be open when the data needs to be secret? It’s likely a misrepresentation of the data for their own means.

Note down a few words when somebody asks you a question to make sure your answer stays on topic. If you don’t know the answer, be honest and suggest where you would start looking for it. Never heckle or waste time correcting minutiae.

Doctor’s trip

A trip to the doctor’s office is a presentation. Write down your list before you go in, and make them listen: doctors normally interrupt after 22 seconds and consider each item individually. That way you’ll give up before you reach the end of your list, and they may not see the connected pattern of the whole.

Documents

Every document needs an abstract. It should spell out as simply as possible:

  1. What the problem is
  2. Who cares
  3. What the solution is

If you can’t write this, then you’re not saying anything.

LaTeX

Real scientists use LaTeX. There are thousands of templates, including official ones for well-known journals, and online tools like Overleaf can reduce the barrier to entry. LaTeX code looks like this:

\documentclass{article}
\title{My presentation matters}
\begin{document}
\maketitle
\section*{Introduction}
A sample of LaTeX.
\end{document}

R is another alternative, considered hard even by people who use LaTeX.

Reading

At school we are taught to read to extract facts to pass exams. We need to practice reading for enjoyment: reading to spot new information, to extract what we want, to form new opinions and ideas, to loot and hack.

Immediately skip words you don’t understand: there won’t be a test – you’re not at school.

Design

Design does not belong to ‘other people’. Support thinking with analytical design and do whatever it takes to explain the data.

Why do bird books use illustrations? Because the authors want to help you spot the birds: using art, they can exaggerate the differences and draw a generic version of the bird.

Nature magazine has some of the best-designed visualizations around – openness, pride, and space constraints all help (DNA only got 1.5 pages). The New York Times also often produces interesting visualizations of data.

User interface

Use the ideas proven by large successful sites on the web. Do not be swayed by arguments that your users won’t understand. Millions of users already do.

Touch is the next generation of user interface. It allows the chrome (interface junk) to be jettisoned: no more scroll bars, no buttons, no cursor, no zoom. Pure information experiences came not from academia, finance, or medicine but from the consumer space.

"The future of interface design… is information design." – Edward Tufte, Seattle, August 4, 2015

The original UI metaphors on the Alto at Xerox PARC were built around a single document. Instead, we now have application-owned silos of data. The elegance was lost because companies want to control the content you create with their tools; they isolate your content so they can profit.

Hierarchies are still used for web design because they mimic the organization paying the bill. Organizations see themselves this way and do not focus on how and what their customers need. Famous examples include the Treasury Department burying tax forms 7 levels deep despite their being a top user request, and the XKCD strip about university websites. People on the inside have a skewed perspective of what the outside is.

The density of user interfaces is increasing. This allows for richer visualizations, especially when combined with animation or video. It is hard to get right.

[)amien

Time window events with Apache Spark Streaming

If you’re working with Spark Streaming, you might run into an interesting problem if you want to output an event based on multiple messages within a specific time period.

For example, I want to send a security alert if I see 10 DDOS attempts to an IP address in a five-minute window.

reduceByKeyAndWindow

reduceByKeyAndWindow allows us to choose the IP address for the key and 5 minutes for the window. If we then want to collect the sourceIp and the timestamp for each message, it looks like this:

val messageLimit = 10
val messageWindow = Minutes(5)
val scc = new StreamingContext(conf, Minutes(1))

// ... set up Kafka consumer via KafkaUtils
kafkaConsumer
    .flatMap(parseSecurityMessage)
    .filter(m => m.securityType == "DDOS")
    .map(m => m.targetIp -> Seq((m.timestamp, m.sourceIp)))
    .reduceByKeyAndWindow({(x, y) => x ++ y}, messageWindow)
    .filter(g => g._2.length >= messageLimit)
    .foreachRDD(m => m.foreach(createAlertEvent))

scc.start()
scc.awaitTermination()

Problem

The problem is your event fires many times as the stateless RDD is re-run every batch period.

The simplest solution would be to make the batch interval the same as your message window size, but that causes more problems, namely:

  • Your job can’t perform any other triggers on the source data at a shorter interval
  • You won’t know about these alerts until some time after they happen (in this case 5 minutes)

Keeping this state externally would be terrible for performance, and neither Spark accumulators nor globals are much use here.

Solution

We need to do two things:

  1. Stop the RDD re-running and instead use the streaming state. We can do this by using the reduceByKeyAndWindow overload that allows us to specify the inverse function for removing data as it goes out of the window.
  2. Introduce a small amount of in-RDD state used to identify when the event is clear and when it should fire again.

Let us assume we have a class to handle part 2 named WindowEventTrigger that provides add and remove methods and a boolean triggerNow flag that identifies when the event should re-fire. Our RDD body would now look like this:

kafkaConsumer
    .flatMap(parseSecurityMessage)
    .filter(m => m.securityType == "DDOS")
    .map(m => m.targetIp -> new WindowEventTrigger((m.timestamp, m.sourceIp), messageLimit))
    .reduceByKeyAndWindow(_ add _, _ remove _, messageWindow)
    .filter(_._2.triggerNow)
    .foreachRDD(m => m.foreach(createAlertEvent))

How this works is quite simple. We have a case class called WindowEventTrigger that we map into the stream for each incoming message. It then:

  1. Tracks incoming messages - if it hits the level, sets the flag, and notes the event
  2. Tracks outgoing messages - and resets when the event that caused the trigger leaves the window

By switching to the stateful form of `reduceByKeyAndWindow` (the overload with an inverse function), Spark needs to persist state in case executors go down or data has to be shuffled between them. Ensure your StreamingContext object has a checkpoint folder set to reliable storage like HDFS.
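
A one-line sketch, assuming the scc context from above (the HDFS path is illustrative):

scc.checkpoint("hdfs://namenode:8020/checkpoints/ddos-alerts")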

WindowEventTrigger class

Here is the WindowEventTrigger class for your utilisation.

case class WindowEventTrigger[T] private(eventsInWindow: Seq[T], triggerNow: Boolean, private val lastTriggeredEvent: Option[T], private val triggerLevel: Int) {
  def this(item: T, triggerLevel: Int) = this(Seq(item), false, None, triggerLevel)

  // Called as events enter the window: combine the sequences and fire once
  // when the count first reaches the trigger level.
  def add(incoming: WindowEventTrigger[T]): WindowEventTrigger[T] = {
    val combined = eventsInWindow ++ incoming.eventsInWindow
    val shouldTrigger = lastTriggeredEvent.isEmpty && combined.length >= triggerLevel
    // Remember the event that tipped us over the level so we can reset when it leaves
    val triggeredEvent = if (shouldTrigger) combined.drop(triggerLevel - 1).headOption else lastTriggeredEvent
    new WindowEventTrigger(combined, shouldTrigger, triggeredEvent, triggerLevel)
  }

  // Called as events leave the window: drop them, and clear the trigger once the
  // event that fired it has gone so the trigger can fire again.
  def remove(outgoing: WindowEventTrigger[T]): WindowEventTrigger[T] = {
    val reduced = eventsInWindow.filterNot(y => outgoing.eventsInWindow.contains(y))
    val triggeredEvent = if (lastTriggeredEvent.isDefined && outgoing.eventsInWindow.contains(lastTriggeredEvent.get)) None else lastTriggeredEvent
    new WindowEventTrigger(reduced, false, triggeredEvent, triggerLevel)
  }
}
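
To see the trigger behaviour outside Spark, here is a small illustrative walkthrough – the event values and the trigger level of 2 are made up for the demonstration:

// Two single-event triggers sharing a trigger level of 2
val a = new WindowEventTrigger("evt1", 2)
val b = new WindowEventTrigger("evt2", 2)

val fired = a.add(b)        // two events reach the level, so fired.triggerNow is true
val reset = fired.remove(b) // "evt2" (the event that fired it) leaves the window,
                            // so the trigger clears and may fire again later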

Happy streaming,

[)amien