Data Visualization: It Matters How We Show Graphs

Should we show graphs using horizontal lines like this:

Or using “slanted” lines like this?

Notice that these two graphs show the same data, just drawn differently.

I’ll argue that it does matter which way we show graphs, and that the correct graphical representation depends on the data the graph is showing.

First, keep in mind that the graph shows values for points in time. For example, in both graphs the value at 5:40 was measured to be approximately 750,000.

There are two different types of measurements. Let’s call them directly measured values and derived values.

The simplest kind is “directly measured values”. Think of this as measuring the current temperature or the gas gauge in your car. The value is measured directly as a snapshot in time. Here, the slanted graphs make the most sense, because you actually did measure the values at particular points in time (5:35, 5:40, 5:45 etc. in the graph above). The “slanted graphs” represent the best first-order approximation of how the real-world values would have changed over the time period.

For derived values, think of sales for a quarter or the average bytes per second over a 5-minute period (which is what both graphs above show). Derived values are not measured directly but represent a value measured over the entire preceding measurement period. Therefore a graph with horizontal lines is the best representation for derived values.

For some reason, I’ve seen people get emotional and upset about this. In defense of slanted graphs: “The slanted graphs look better.” And against them: “We have only measured these N different values represented by the horizontal lines. The slanted graphs are cheating and pretending we have more data than we actually do.”

But in my opinion they each have their use case, depending on whether the values are directly measured or derived from actually measured values.

The Calculus Consideration

Do you remember calculus? Differentiating and integrating? If you don’t, you may want to skip this, but it is very relevant for this discussion, and it is the reason why I called the second category of values above “derived values”.

Let’s take the example of a graph showing bytes/second, as both graphs above do. The way this is actually measured is that we ask a server every 5 minutes “how many bytes have you sent in total since boot?” The value for an interval is then the value at the end of the interval minus the value at the beginning of the interval (measured in octets or bytes), divided by the interval length (in seconds), allowing us to arrive at e.g. 750,000 bytes/second.
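
The calculation above can be sketched in a few lines of Python. This is only an illustration: the function name `rates_from_counter` and the sample counter values are made up for the example, and it assumes the counter never wraps or resets between polls.

```python
from typing import List, Tuple

def rates_from_counter(
    samples: List[Tuple[float, int]]
) -> List[Tuple[float, float, float]]:
    """Turn (time_in_seconds, total_bytes) samples into
    (interval_start, interval_end, bytes_per_second) tuples."""
    intervals = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        if t1 <= t0:
            raise ValueError("sample times must be strictly increasing")
        # End value minus start value, divided by the interval length.
        intervals.append((t0, t1, (c1 - c0) / (t1 - t0)))
    return intervals

# Hypothetical counter readings, polled every 300 seconds (5 minutes).
samples = [(0, 1_000_000_000), (300, 1_225_000_000), (600, 1_450_000_000)]
print(rates_from_counter(samples))
# [(0, 300, 750000.0), (300, 600, 750000.0)]
```

Note that each result belongs to a whole interval (start and end), not to a single point in time, which is exactly why the horizontal line fits it.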

In calculus terms: “64 bit In” in bytes/second is the derivative of received bytes with respect to time or, more precisely, the average of that derivative over each measurement interval.

Now, also using calculus, getting the total number of transmitted bytes for a period must then be the integral of bytes/second, or the area under the bytes/second graph for that period. That will only be true if derived values are shown using horizontal graphs.

To illustrate, let’s assume the total number of bytes received by a server at time 10:00:00 is 1,000,000. Ten seconds later, we measure 1,000,200. 200 bytes in 10 seconds is 20 bytes/second. If we draw that as horizontal lines, the line from 10:00:00 to 10:00:10 will have the value 20 bytes/second over the entire period, allowing us to calculate the underlying 200 bytes we received during the period as the area: 20 bytes/second * 10 seconds (see how the units work out too?). If we used slanted graphs, we would show 20 at the end of the interval but something else for the rest of the interval, so the area under the graph would differ from 200, and that is just wrong.
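
To make the area argument concrete, here is a small Python sketch that compares the two ways of drawing. The helper names `step_area` and `slanted_area` and the second and third intervals are assumptions added for illustration; only the first interval (20 bytes/second for 10 seconds) comes from the example above. The slanted version assumes each value is plotted at the end of its interval and joined to the next with a straight line.

```python
def step_area(intervals):
    """Area under horizontal lines: each rate holds for its whole interval."""
    return sum(rate * (end - start) for start, end, rate in intervals)

def slanted_area(intervals):
    """Area under straight lines joining (interval_end, rate) points."""
    points = [(end, rate) for _, end, rate in intervals]
    return sum(
        (r0 + r1) / 2 * (t1 - t0)  # trapezoid between two plotted points
        for (t0, r0), (t1, r1) in zip(points, points[1:])
    )

# (start_s, end_s, bytes_per_second); the counter read 1,000,000 at 0 s,
# 1,000,200 at 10 s, 1,000,500 at 20 s and 1,000,600 at 30 s.
intervals = [(0, 10, 20.0), (10, 20, 30.0), (20, 30, 10.0)]

# Bytes actually received between 10 s and 30 s: 300 + 100 = 400.
print(step_area(intervals[1:]))  # 400.0
print(slanted_area(intervals))   # 450.0 (the slanted line spans 10 s to 30 s)
```

The horizontal version recovers the 400 bytes the counter says were received; the slanted version reports 450 for the same 20 seconds.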

A graph with horizontal lines is the only correct representation of a value that is actually another underlying measured value differentiated with respect to time.

The argument for slanted graphs being the best first-order approximation for directly measured values is actually also from calculus. If for whatever reason you’re still going to use horizontal graphs, then the measurement time should be shown at the middle of the interval. So if you measure e.g. a temperature every 5 minutes and you measure 21 °C at time 5:40, showing a horizontal line with value 21 from 5:37:30 to 5:42:30 is more correct than showing that same horizontal line from 5:35 to 5:40 (or – horror – from 5:40 to 5:45). That is because you measured 21 at time 5:40 and assume it had that value “around that time”.
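
The centering rule is simple enough to write down as code. The following Python sketch uses hypothetical helper names (`centered_segment`, `hms`) and represents times as seconds since midnight; it only illustrates the arithmetic and is not tied to any particular charting tool.

```python
def centered_segment(measured_at: float, interval: float, value: float):
    """Horizontal segment for a directly measured value, centered on the
    time of measurement. Times and interval are in seconds."""
    half = interval / 2
    return (measured_at - half, measured_at + half, value)

def hms(seconds: float) -> str:
    """Format seconds since midnight as H:MM:SS."""
    seconds = int(seconds)
    return f"{seconds // 3600}:{seconds % 3600 // 60:02d}:{seconds % 60:02d}"

# 21 degrees measured at 5:40, with one measurement every 5 minutes (300 s).
start, end, value = centered_segment(5 * 3600 + 40 * 60, 300, 21.0)
print(hms(start), hms(end), value)  # 5:37:30 5:42:30 21.0
```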

In Summary

So which type is better? It depends.

For directly measured values, slanted graphs from measurement (time, value) to (time, value) are best. You can sort of use horizontal lines, but then the horizontal line segment should have the measurement time at its center.

For derived values, horizontal lines are the only correct way to go. Ideally the value should be shown as belonging to the entire period, e.g. in mouse hover, and not shown as being for the beginning or end of the period.

I have never seen any program or monitoring tool being able to do all of this. But I do think that they are all wrong. They don’t respect the data. Do you disagree?

Finally there are the softer considerations: “But people like to see slanted graphs. They look nicer and are everywhere”. Yeah, sure, a usability survey might show that people prefer slanted graphs. In my opinion, that doesn’t change that they are simply wrong for derived data.

Metadata for beginners

Many people I’ve spoken to seem to think that they don’t have anything to hide, and as long as the government isn’t listening in on the actual phone conversations, then they’re fine with it. As you might guess, I’m not. This slide from the 30th Chaos Communication Congress (30C3) hits the nail on the head.

Director of National Intelligence James Clapper lies to US Congress – without consequences?


So, James Clapper, US Director of National Intelligence, lies to Congress. First he calls the lie the “least untruthful” answer he could publicly provide, and then cites a momentary memory failure. Seven congressmen take issue with James Clapper’s testimony, but the Obama administration is unlikely to turn against the director.

See: Republicans demand consequences for ‘willful lie’ by intelligence chief | World news | theguardian.com

Let me recap: James Clapper, a retired lieutenant general in the United States Air Force (you’d think he knows right from wrong, truth from lie), lies under oath to US Congress and it is not likely to have any consequences for him.

Initially I was astounded, but after a while I’m sadly less surprised.

What kind of a message does that send?

If guys like him lie willfully under oath, what does that say about their credibility when not under oath?

Big Brother sees all but can’t keep a secret?

Two news items from this week have me quite uneasy.

The NSA is basically listening in on every US citizen. For the sake of argument, let me assume that they get everything. So far I don’t think we’re quite there yet here in Denmark.

Data held by the Danish police has been hacked. We’re not sure exactly what the hackers have had access to, but we do know they’ve had at least read+write access to all driver’s license data and read access to the Schengen Information System, a large European database on police and judicial co-operation. They’ve been lurking around in there undetected for 6 months. Do you believe that is all they’ve had access to? In 2011, the Pentagon admitted that 24,000 files had been hacked, too.

So Big Brother is watching us. This is not hearsay, but documented fact at least in the US. Also, now, we know Big Brother cannot keep its own secrets.

Yikes. Either of these two news stories is bad enough individually. But this is a nasty combination.

Getting older?

The other day, I was cooking, and it was time to set the table.

Suddenly I found myself standing in the storage/utility room. And I had no idea why I was there. “Peter, you need to set the table! Get back on track!” – I told myself.

So I went back into the kitchen. Looked at the table: What was missing? Ah, drinks. OK, glasses, plastic cup for my daughter, pitcher of water – check. “Hey, I’d like a Coke”, so I opened the fridge. No Cokes. Should probably put some in the fridge for next time I want a Coke. I went to the utility room, and suddenly it hit me: That is what I was doing in the storage room! Getting Cokes for the fridge!

Man, I think I’m… What was it? … Yes, I’m getting older!

Watches – Oh – Watches

I love watches.

Being a techie, I really can’t accept a watch that isn’t accurate. The accuracy of a quartz-crystal based watch is the minimum. I wish I could get a Rolex, Omega or other really nice looking watch, but I just can’t accept the accuracy I’ll get from a watch like that. So all the beautiful Swiss watches are out for me. But there are alternatives:


Great Courses at Coursera

I’m currently taking a Cryptology course at Stanford University via Coursera. It came recommended by Bruce Schneier on Security: Free Cryptography Class, and I find it a great way to expand my knowledge. I really appreciate the level. Just enough for it to be challenging and stimulating, but also not too hard or too much work, so I can still fit it in with family and work.

And in addition, the courses are free!

Thanks, Coursera and participating universities for making this possible.

Check it out! There are courses in:

  • Computer Science
  • Mathematics and Statistics
  • Society, Networks, and Information
  • Economics, Finance, and Business
  • Humanities and Social Sciences
  • Healthcare, Medicine, and Biology

All provided by professors from top-notch universities in the US.

I’ve only tried the Cryptology course, but it rocks!

Peter