The sharding parable


Jack, an OR/MS analyst, was recounting to a computer scientist colleague, Brett, the conversation he had recently had with his doctor. Jack explained that a “Donabedianite” is someone who thinks outcomes are the best criterion for deciding how to do things, and added, “I discovered that I am one of those Donabedianites! And I also had to rethink why, many years ago, in the company I helped start, my piling system worked better than my colleagues’ filing systems.”

“Yeah, you’ve told me that story before,” Brett nodded.

“True,” Jack acknowledged, “but now there’s a sequel. I told the story about my old company to a very tidy friend of mine, and she replied, ‘BS! Your so-called piling system wasn’t a system at all, and it couldn’t have been better than a filing system. Stop kidding yourself!’ So I had to go over whether my claim was right, and if so, why.

“And,” Jack continued, “it made me realize something. Of course my friend was right, in principle – a good filing system should always beat piling. It’s more compact, more difficult for some careless person to ruin by knocking over a pile, better organized – or at least it should be. But here’s the problem: You know how, when you go to a library and actually go into the stacks yourself, you often find interesting books you hadn’t known about, because they’re near what you were looking for? You get context from location. A piling system works the same way.”

Brett nodded again.

“But my colleagues’ filing systems were created by the admins, not the managers, and they were designed for efficiency of storage rather than efficiency of retrieval! It’s like what happens when movers pack your books. You had things next to related things on the shelves, but what movers care about is getting items to fit snugly in the boxes, quickly. They pack by size, not subject or author. So when you unpack, and for some time afterward, you often end up having to search through all the books to find one you want!”

“Right,” Brett groaned. He’d had this experience a few times, too.

“So, for example,” Jack continued, “let’s say I wanted to find the minutes of a decision meeting from a year and a half ago, and I’m not sure whether it was the executive committee or the board. In the filing system, ‘board minutes’ and ‘executive committee minutes’ would be far apart. In the piling system, they’d be close together. Now, if the admin had realized what I would want to retrieve, she might have used ‘Minutes – Board’ and ‘Minutes – Executive Committee’ as titles, and then they would be close together in the files. So the point here is that designing a filing system for efficient retrieval by a user is different from designing it for efficient storage by the filer, and you have to know quite a bit about the user’s likely needs to get that system right.”
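Jack's point about titles can be checked with a small sketch. It assumes an alphabetized drawer; apart from the two kinds of minutes, every folder name here is made up for illustration, and the titles use a plain hyphen as a separator.

```python
# Hypothetical folder titles; only the two kinds of minutes come from the story.
storage_titles = [
    "Board Minutes",
    "Budget 2011",
    "Contracts",
    "Executive Committee Minutes",
    "Facilities",
]
retrieval_titles = [
    "Budget 2011",
    "Contracts",
    "Facilities",
    "Minutes - Board",
    "Minutes - Executive Committee",
]


def drawer_distance(titles, first, second):
    """Return how many folders apart two titles sit in an alphabetized drawer."""
    ordered = sorted(titles)
    return abs(ordered.index(first) - ordered.index(second))


far = drawer_distance(storage_titles, "Board Minutes", "Executive Committee Minutes")
near = drawer_distance(retrieval_titles, "Minutes - Board", "Minutes - Executive Committee")
print(far, near)
```

With the shared "Minutes" prefix the two folders sit side by side; without it, unrelated folders fall between them, and the gap grows with the size of the drawer.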

Brett now looked thoughtful. “I’m not sure you realize how right you are,” he frowned. “Do you know what sharding is?”

“No,” Jack admitted.

“It’s one of the hot new ideas from the big data folks,” Brett explained. “When a chunk of data comes in, they store it the way those movers pack books: break it apart into little pieces and store the pieces wherever they fit snugly into the data warehouse. They call those little pieces shards. They don’t retain any information about which data elements came in together. They just rely on the power and speed of their processors to find related items by brute-force linear search. With current processors, that works really well with, say, a hundred-gigabyte data set, or even a terabyte data set. But we have some 10-petabyte data sets out there, and bigger ones on the way. This sharding could end up causing some serious problems, and I’m not sure a lot of my data science colleagues realize it!”
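The trade-off Brett describes can be sketched in a few lines. This is only an illustration of his account, not a model of any real sharded database; the record layout and the `batch_id` field (a hypothetical marker for items that arrived together) are assumptions.

```python
from collections import defaultdict

# Hypothetical records: (batch_id, payload). batch_id marks which items arrived together.
records = [(i % 1000, f"item-{i}") for i in range(100_000)]


def find_by_scan(records, batch_id):
    """Brute force: examine every record, as when no grouping information is kept."""
    return [payload for b, payload in records if b == batch_id]


def build_index(records):
    """Retain the 'came in together' information as a lookup table (one pass)."""
    index = defaultdict(list)
    for b, payload in records:
        index[b].append(payload)
    return index


index = build_index(records)
scan_result = find_by_scan(records, 42)   # touches all 100,000 records on every query
index_result = index[42]                  # a single dictionary lookup after indexing
print(len(scan_result), scan_result == index_result)
```

Both approaches return the same items. The scan repeats work proportional to the whole data set for each query, which is the cost Brett worries about as data sets grow, while the index pays that cost once and then answers each query directly.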

Now it was Jack’s turn to look thoughtful. “I wonder how this problem might affect applications like IBM’s attempt to use their Watson search and retrieval package in healthcare,” he mused. “It worked for Jeopardy!, but there you have a category stated in advance for every question. In some healthcare applications you’d have context to start with, too, but in others you wouldn’t. When you do have context, say, when a doctor enters some symptoms and asks for suggested diagnoses and tests to distinguish them, a program like Watson could work well and be helpful. But if you’re speculating about whether there’s a relationship among some conditions nobody thought were related – say, sleep apnea and Type 2 diabetes, which is a possible connection some researchers are exploring now, but seemed like a wild shot five years ago – how well will Watson work for that?”

“Good question,” Brett grinned. “And if they do get it to work, they could have another problem. Remember Deep Blue, the chess-playing program that beat Garry Kasparov about 15 years ago?”

“Yeah,” Jack said. “With some help from human experts to pick opening lines Kasparov didn’t like, as I recall – at least according to Kasparov.”

“Well,” Brett went on, “with or without human experts, it did win, but it required an enormous array of hardware and resources. Within a couple of years, Fritz was doing even better – and that was a program that ran on an ordinary PC and cost less than $100. Efficient pattern recognition, especially if it’s tuned to the way users view the world, generally beats the best brute-force search, even with super-processors!”

Doug Samuelson is president and chief scientist of InfoLogix, Inc., in Annandale, Va., and principal decision scientist with Great-Circle Technologies in Chantilly, Va.