My notes and other stuff


Paper: Can We Ever Escape From Data Overload?

Data overload is a generic and tremendously difficult problem that has only grown with each new wave of technological capabilities. As a generic and persistent problem, it raises three questions in need of answers: Why is data overload so difficult to address? Why has each wave of technology exacerbated, rather than resolved, data overload? And how are people, as adaptive, responsible agents in context, able to cope with it?

Those are huge questions; this paper does a fantastic job of tackling them, and it's one of my favorites (I have a lot of favorites, I know). It was written by David Woods, Emily Patterson, and Emilie Roth: Can We Ever Escape from Data Overload? A Cognitive Systems Diagnosis.

It's the kind of paper I summarize as a reminder; it's worth reading fully at least once, and my overview here can't do justice to the 40 pages of the original. My summary is derived from a shorter version published in Cognition, Technology & Work (2002), for which I could find no open-access link.

The paper focuses on the idea that a lot of incident reports and investigations contain something like "although all of the necessary data was physically available, it was not operationally effective. No one could assemble the separate bits of data to see what was going on." It starts from the idea of a data availability paradox:

On one hand, all participants in a field of activity recognise that having greater access to data is a benefit in principle. On the other hand, these same participants recognise how the flood of available data challenges their ability to find what is informative or meaningful for their goals and tasks.

The authors note that a lot of systems were intended to help users but in fact end up demanding even more capacity at the times when users are busiest: I need data when alarms are ringing, but when alarms are ringing, I'm also the least able to slowly think, focus, and analyze.

Data overload is classified into 3 categories:

  1. clutter / too much data
    In the 80s, people tried measuring the bandwidth of what we could process and reducing the amount of data shown, down to the number of pixels on a display. This wasn't successful, partly because designers often reduced the information on one display by making people navigate across many. Relevance is also context-sensitive: what you remove may be exactly what turns out to matter. In the end the approach was judged meaningless because "people re-represent problems, redistribute cognitive work, and develop new strategies and expertise as they confront clutter and complexity." Dynamic filtering mechanisms requiring user input don't necessarily help either, because you only know what to filter once you know what to look for.
  2. workload bottleneck
    There are too many sources of data to look at. A lot of work has been done to have automation assist in the analysis. Two categories are given: solutions that strongly rely on the automated analysis being correct (filters, summarisers, automated search term selectors), and solutions that only weakly rely on it (indexing, clustering, highlighting, organizing). This type of solution is considered "necessary but not sufficient", and risks breakdowns in the collaboration between humans and machines.
  3. finding significance in data
    Significance is inherently contextual. People have implicit expectations of where useful data is likely to be located and of what it should look like. There's a relation between the viewer and the scene that must be taken into account, and catering to this need is somewhat of an open problem.
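To make the strong/weak reliance distinction in the second category concrete, here's a toy sketch of my own (not from the paper). The `looks_relevant` keyword check stands in for any automated analysis; a filter that deletes what the analysis rejects strongly depends on that analysis being right, while a highlighter that merely adds emphasis leaves wrongly-judged data recoverable.

```python
def looks_relevant(message: str, keywords: set[str]) -> bool:
    """Deliberately naive stand-in for an automated relevance analysis."""
    return any(word in message.lower() for word in keywords)

def filter_messages(messages: list[str], keywords: set[str]) -> list[str]:
    """Strong reliance: anything the analysis misjudges is gone for good."""
    return [m for m in messages if looks_relevant(m, keywords)]

def highlight_messages(messages: list[str], keywords: set[str]) -> list[tuple[str, bool]]:
    """Weak reliance: everything stays visible; the analysis only adds emphasis."""
    return [(m, looks_relevant(m, keywords)) for m in messages]

messages = [
    "Pump A pressure nominal",
    "Coolant temperature rising",
    "Valve 7 reports anomalous flow",  # relevant, but matches no keyword
]
keywords = {"temperature", "pressure"}

filtered = filter_messages(messages, keywords)        # the valve report is lost
highlighted = highlight_messages(messages, keywords)  # it survives, just unemphasised
```

The failure modes differ exactly as the paper's categories suggest: when the analysis is wrong, the filter silently discards data, while the highlighter degrades gracefully.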

This last point, the focus on context sensitivity, is what is explored further next. An example is one where error codes for an alarm have a corresponding description, but the specific meaning depends on what else is going on, what else could be going on, what has gone on, and what the observer expects or intends to happen. The significance of a piece of data depends on all of that surrounding context.

The authors say it is a myth that information is something in the world that does not depend on the point of view of the observer, and that it is (or often is) independent of the context in which it occurs. In fact, data by itself has no significance; it is raw material, and informativeness is a property of the relationship between the data and the observer.

The authors then give a further set of subcategories describing how people focus on what's interesting:

  1. perceptual organization
    Rather than everything sitting flat in a perceptual field, things are hierarchical and grouped. You don't count 300 hues of blue; you see the sky. You don't need to show less data, you need to show it with better organization.
  2. control of attention
    Attention is not permanently fixed on one thing; we have to be able to focus on and process new information. Sometimes it's distracting, sometimes it's relevant. Reorienting to new elements implicitly means losing focus on whatever you were attending to before. In the real world this is often handled implicitly by keeping a focal point while maintaining awareness through peripheral vision and hearing. To feel the extent of this, the authors suggest an experiment: try performing a task while wearing goggles that block peripheral vision. So automation or information that wishes to better control and direct attention should ideally have some understanding of what the human is trying to do, in order to mediate the stimuli, place them where they make sense, and know when redirecting attention is worth it.
  3. Anomaly-based processing
    We do not respond to absolute levels but to contrasts and change; meaning lies in contrasts. An event may be an expected part of an abnormal situation and therefore draw little attention. In another context, the absence of change may itself be unexpected and grab attention, because the reference conditions are changing.
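As a rough illustration of anomaly-based processing (my sketch, not the paper's), a monitor can flag readings by their contrast with a rolling baseline of recent values rather than by their absolute level: a steadily high signal raises no alert, while a modest value that breaks the recent pattern does. The window and tolerance values here are arbitrary.

```python
from collections import deque

def contrast_alerts(readings, window=5, tolerance=10.0):
    """Yield (index, value) for readings that deviate from the recent baseline."""
    recent = deque(maxlen=window)  # rolling window of the last few readings
    for i, value in enumerate(readings):
        if recent:
            baseline = sum(recent) / len(recent)
            if abs(value - baseline) > tolerance:
                yield i, value  # flagged for its contrast, not its level
        recent.append(value)

# A steady 90 is never flagged, even though it is "high" in absolute terms;
# a jump from ~20 to 45 is flagged because it breaks the expected pattern.
steady_high = [90, 91, 90, 92, 90, 91]
sudden_jump = [20, 21, 19, 20, 45, 21]

assert list(contrast_alerts(steady_high)) == []
assert list(contrast_alerts(sudden_jump)) == [(4, 45)]
```

This only captures the "contrast with recent history" half of the point; the paper's subtler case, where an absence of change is itself the anomaly because the reference conditions are moving, would need a model of expected change, not just past values.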

There's a large section in the paper on the tricks technical solutions use to work around context sensitivity; these tricks are limited and brittle, but the section is worth a look.

And finally, solutions! These are, unfortunately, non-trivial; otherwise everyone would already know about this sort of idea. They're nonetheless good guidelines to keep in mind.

I'll let the authors conclude:

Ultimately, solving data overload problems requires both new technology and an understanding of how systems of people supported by various artifacts extract meaning from data. Our design problem is less – can we build a visualisation or an autonomous machine, and more – what would be useful to visualise and how to make automated and intelligent systems team players. A little more technology, by itself, is not enough to solve generic and difficult problems like data overload – problems that exist at the intersections of cognition, collaboration and technology.