Paper: Can We Ever Escape From Data Overload?
Data overload is a generic and tremendously difficult problem that has only grown with each new wave of technological capabilities. As a generic and persistent problem, three observations are in need of explanation: Why is data overload so difficult to address? Why has each wave of technology exacerbated, rather than resolved, data overload? How are people, as adaptive responsible agents in context, able to cope with the challenge of data overload?
Those are huge questions; this paper does a really fantastic job of tackling them, and it's one of my favorites (I have a lot of favorites, I know). It was written by David Woods, Emily Patterson, and Emilie Roth: Can We Ever Escape from Data Overload? A Cognitive Systems Diagnosis.
It's the kind of paper I summarize as a reminder; it's worth reading in full at least once, and my overview here can't do justice to the 40 pages of the original. My summary is derived from a shorter version published in Cognition, Technology & Work (2002), for which I could find no open-access link.
The paper focuses on the idea that a lot of incident reports and investigations contain something like "although all of the necessary data was physically available, it was not operationally effective. No one could assemble the separate bits of data to see what was going on." It starts with the idea of the data availability paradox:
On one hand, all participants in a field of activity recognise that having greater access to data is a benefit in principle. On the other hand, these same participants recognise how the flood of available data challenges their ability to find what is informative or meaningful for their goals and tasks.
One thing the authors mention is that a lot of systems were intended to help users, but in fact end up demanding even more capacity precisely when users are busiest: I need data when alarms are ringing, but when alarms are ringing, I'm also the least able to slowly think, focus, and analyze.
Data overload is classified into three categories:
- clutter / too much data
In the 80s, people tried measuring the bandwidth of what we could process and reducing the amount of data shown, down to the number of pixels on a display. This wasn't successful, partly because designers often reduced the information on one display by making people navigate across many. Relevance is often context-sensitive, and what you remove may turn out to be relevant. In the end the approach was also judged meaningless because "people re-represent problems, redistribute cognitive work, and develop new strategies and expertise as they confront clutter and complexity." Dynamic mechanisms requiring user input don't necessarily help, because you only know what to filter once you know what to look for.
- workload bottleneck
There are too many sources of data to look at. A lot of work has been done to have automation assist in the analysis. Two categories are given: a) solutions that strongly rely on the analysis being correct (filters, summarisers, automated search term selectors), and b) solutions that only weakly rely on it (indexing, clustering, highlighting, organizing). This type of solution is considered "necessary but not sufficient", and is at risk of breakdowns in the collaboration structures between humans and machines.
- finding significance in data
Significance is inherently contextual. People have implicit expectations of where useful data is likely to be located and to know what it should look like. There's a relation between the viewer and the scene that must be taken into account, and it's somewhat of an open problem to cater to this need.
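The strong/weak-reliance distinction above can be sketched in a few lines. This is my own toy illustration, not from the paper; the records and the "analysis" (a keyword guess) are hypothetical:

```python
# Sketch of the strong/weak distinction: both automations use the same
# (possibly wrong) guess about what matters, but they fail differently.
records = ["disk full on node-3", "cron job finished", "disk full on node-7"]
guess = "cron"  # the automation's (mistaken) model of what matters

# Strong reliance: a filter -- if the guess is wrong, the data is gone.
filtered = [r for r in records if guess in r]

# Weak reliance: highlighting -- a wrong guess mis-emphasizes, but nothing
# is lost; the human can still scan past the emphasis.
highlighted = [(guess in r, r) for r in records]

print(filtered)     # only the cron line survives the filter
print(highlighted)  # every record is still there, with a (wrong) emphasis flag
```

The filter makes the wrong analysis unrecoverable, while highlighting degrades gracefully, which is why the paper treats the weakly-reliant family as the safer bet.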
This last point, the focus on context sensitivity, is what is further explored next. An example is one where error codes for an alarm have a corresponding description, but the specific meaning depends on what else is going on, what else could be going on, what has gone on, and what the observer expects or intends to happen. The significance of a piece of data depends on:
- other related data;
- how the set of related data can vary within a larger context;
- the goals and expectations of the observer;
- the state of the problem-solving process and the stance of others.
The authors say it is a myth that information is something in the world, independent of the point of view of the observer and (often) of the context in which it occurs. In fact, data by itself has no significance; it is a raw material, and informativeness is a property of the relationship between the data and the observer.
So there's a further set of subcategories describing how people focus on what's interesting:
- perceptual organization
Rather than everything being flat in a perceptual field, things are hierarchical and grouped. You don't count 300 hues of blue; you see the sky. You don't need to show less data, you need to show it with better organization.
- control of attention
Attention is not permanently fixed on one thing; we have to be able to focus on and process new information. Sometimes it's distracting, sometimes it's relevant. Reorienting to new elements implicitly means losing focus on whatever you were attending to before. In the real world this is often handled implicitly by keeping a focal point while maintaining awareness through peripheral vision and hearing. One suggested experiment to see the extent of this is to try working with a limited field of vision, using goggles that block peripheral vision. So automation or information systems that wish to better control and direct attention should ideally have some understanding of what the human is trying to do, in order to mediate stimuli, place them where they make sense, and know when redirecting attention is worth it.
- anomaly-based processing
We do not respond to absolute levels but rather to contrasts and change. Meaning lies in contrasts. An event may be an expected part of an abnormal situation, and therefore draw little attention. But in another context, the absence of change may be unexpected and grab attention because reference conditions are changing.
There's a large section in the paper on the tricks technical solutions use to work around context sensitivity. These tricks are limited and brittle, but worth a look:
- reduce available data: usually done by hiding data behind extra displays or menus. This breaks down because some of the relevant material may get hidden; the tool then works at cross-purposes with what the operator intends, and increases cognitive costs rather than reducing them.
- only show what's "important": think of log messages with INFO, WARNING, and ERROR, and only showing the most critical data. This, once again, can omit important data (even if people can call up the relevant lower-level information). The problem is to help people recognise or explore what might be relevant to examine without already knowing that it is relevant, which is better done through organization.
- the machine will compute what is important for you: automation that tries to act intelligently has to be conceived as a teammate to be truly useful. Automation can generally be a poor teammate, with limited sensitivity to contextual cues. So this, too, easily breaks down when stressed.
- use syntactic or statistical properties as cues to semantic content: the correlation is weak. "Sort by relevance" is often clumsy, and making it work reliably is non-trivial. The way the correlation is computed is also often opaque, which makes it hard for operators to trust.
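The severity-filter trick ("only show what's important") can be sketched against its organization-based alternative. This is a hypothetical example of my own, not from the paper; the log records are invented:

```python
# Hypothetical log records: (severity, request_id, message).
logs = [
    ("INFO",  "req-1", "retrying connection to db (attempt 3)"),
    ("INFO",  "req-2", "request served"),
    ("ERROR", "req-1", "query timed out"),
]

# Filtering by severity shows the failure but hides why it happened.
critical_only = [r for r in logs if r[0] == "ERROR"]

# Organizing by request keeps the explanatory INFO line next to the error,
# without the reader needing to know in advance that it was relevant.
by_request = {}
for severity, req, msg in logs:
    by_request.setdefault(req, []).append((severity, msg))

print(critical_only)        # the timeout, stripped of its context
print(by_request["req-1"])  # the retries that explain the timeout
```

The filter answers "what is broken?" but not "why?"; the grouping lets the relevant low-level data be discovered rather than requested.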
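The weak syntactic-cue correlation is also easy to sketch. A minimal "sort by relevance" toy (my own invented documents, not from the paper) ranks by term frequency, and the syntactic cue diverges from what an operator would actually find useful:

```python
# A minimal "sort by relevance" sketch: rank documents by how often the
# query term appears. Term frequency (a syntactic property) stands in for
# semantic relevance, and the two can diverge badly.
docs = {
    "incident report": "alarm alarm alarm test of the alarm panel",
    "runbook":         "when an alarm fires, check the pump pressure first",
}

def score(text, term):
    # Count exact whitespace-separated occurrences of the term.
    return text.split().count(term)

ranked = sorted(docs, key=lambda name: score(docs[name], "alarm"), reverse=True)
print(ranked)  # the repetitive test log outranks the actionable runbook
```

The repetitive alarm-test document wins on the cue while the runbook carries the meaning, which is the paper's point about the cue/content correlation being weak and opaque.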
And finally, solutions! These are, unfortunately, non-trivial; otherwise everyone would already know about this sort of idea. Still, they're good guidelines to keep in mind:
- Organisation precedes selectivity: effective systems will have elaborate indexing schemes that map onto models of the structure of the content being explored; they will need to provide multiple perspectives to users and allow them to shift perspectives fluently. I like to think of this like "show me the system the way a support engineer cares about it", or "show me what it looks like from the point of view of someone in a given team", rather than having to dig into selecting elements to build this vision on-the-spot by yourself.
- Positive selectivity enhances a portion of the structured field: positive metaphors ("spotlight", "peaked distribution across a field") help focus on a part of the data, whereas negative ones ("filters", "gatekeepers") tend to be weaker. We tend to default to negative ones (they're cheaper to compute), but they hinder the ability to shift focus to otherwise non-selected elements. Better cognitive results are expected from positive metaphors than from negative ones.
- You must deal with context sensitivity: solutions to data overload should help practitioners put data into context. Basically, it helps to put the context "in the world" rather than having to carry it all in your head. Examples are showing related data, using model-based displays, automatically extracting higher-level events (e.g. a device's states) from raw data, and comparing current anomalies to regular trends.
- Observability is more than mere data availability: "Observability refers to processes involved in extracting useful information. [...] The critical test of observability is when the display suite helps practitioners notice more than what they were specifically looking for or expecting. If a display only shows us what we expect to see or ask for, then it is merely making data available."
- Design of conceptual spaces: You must depict relationships in a field of reference. "With a frame of reference comes the potential for concepts of neighbourhood, near/far, sense of place and a frame for structuring relations between entities." It's a prerequisite to having more than data availability.
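The "extract higher-level events" idea from the context-sensitivity guideline can be sketched very simply. This is my own hypothetical example (invented pump readings and threshold), combining that guideline with the earlier point that meaning lies in contrasts:

```python
# Hypothetical sketch: derive a higher-level device state from raw samples,
# then report only the transitions -- contrasts, not absolute values.
samples = [("10:00", 1450), ("10:01", 1460), ("10:02", 80), ("10:03", 75)]

def state(rpm):
    # Invented threshold separating "running" from "stalled".
    return "running" if rpm > 500 else "stalled"

events, previous = [], None
for t, rpm in samples:
    s = state(rpm)
    if s != previous:          # emit an event only when the state changes
        events.append((t, s))
        previous = s

print(events)  # two events instead of four raw readings
```

Four raw readings collapse into two meaningful events, and the display carries the context ("the pump stalled at 10:02") instead of the operator's head.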
I'll let the authors conclude:
Ultimately, solving data overload problems requires both new technology and an understanding of how systems of people supported by various artifacts extract meaning from data. Our design problem is less – can we build a visualisation or an autonomous machine, and more – what would be useful to visualise and how to make automated and intelligent systems team players. A little more technology, by itself, is not enough to solve generic and difficult problems like data overload – problems that exist at the intersections of cognition, collaboration and technology.