Natural Sort

Why doesn’t every programming language have “Natural Sorting” built in, out of the box? Natural sorting is the way humans sort, where number substrings are sorted numerically, everything else alphabetically:

Sorted “Asciibetically” (normal computer sort) Sorted “Naturally” (what humans prefer)
foo1bar
foo10bar
foo2bar
foo1bar
foo2bar
foo10bar

Notice how 10 comes before 2 in asciibetical “normal computer sorting” ? Haven’t we all seen user interfaces that like that? Its just plain wrong. 🙁

Dave Koelle’s Alphanum Algorithm sorts naturally, but instead of analyzing each array element O(log(N)) times, I present a modified Perl version that allows for Schwartzian transforms, yielding huge performance improvements.

Improvements to Alphanum

The problem is that the sorting algorithm in many languages sorts by calling a user-provided comparison function taking two parameters, a and b. Such comparisons are done O(n log(n)) times (for typical sort algorithms mergesort and timsort)

Alphanum works by splitting input strings into string and number sequences and then sorting them individually as strings and numbers respectively. This splitting and partial sorting is performance heavy, and being done O(n log(n)) times. A Schwartzian transform allows that to be done exactly n times, which gives a huge performance improvement.

So instead of converting string23string into

array: [ "string", 23, "string"]

it is converted to

string: "string0000000000000023string"

in a pre-processing step and then a normal asciibetical-comparison sort is performed, which is much faster.

On my machine, sorting 100 strings is 4 times faster, sorting 100,000 strings is 10 times faster and sorting 1,000,000 strings is 12 times faster.

What Natural Sorting isn’t – I18n

While I was investigating to write this post, I discovered UTS #10: Unicode Collation Algorithm. Man, sorting (or collation) is complicated, especially if you take internationalization (i18n) into account. I hadn’t considered that z < ö in Swedish, but ö < z in German, even though I live 22km from Sweden and my wife is German. (Which reminds me that anybody who thinks time calculation is simple doesn’t have a clue, but that is a topic for another post another day ;-))

However, the Unicode Collation Algorithm doesn’t consider Natural Sorting to be important enough to make it on their list. It is explicitly mentioned as a Non-goal: Numeric formatting: numbers composed of a string of digits or other numerics will not necessarily sort in numerical order

Hopefully I didn’t mess up and my modified versions of Alphanum will work hand-in-hand with any i18n/locale-aware sorting the user does.

To-do

I’ve written a perl improvement, but I’m going to need a  javascript version and a MySql version too.  A javascript starting point can be found on Dave Koelle’s page, and I found a MySql based natsort. They just need to be modified similarly. Patches welcome! 🙂

Leave a Reply

Your email address will not be published. Required fields are marked *