Information is not (simply) organized data

Many job titles in technical communication, such as ‘information architect’, reference the concept of information at some level. At the same time, our current society is defined by the data economy, which is based on accumulating massive amounts of data and refining them to provide value. Information and data are such closely related concepts that they are sometimes used interchangeably. However, the two are clearly distinct, and this piece discusses the relationship between these concepts.

This discussion starts with a default explanation which I find insufficient. I then discuss the details which complicate the relationship between data and information. My expanded model on this relationship is derived from these considerations. This model also involves a more philosophical take on the subject. Finally, I will discuss the practical effects of this model in relation to technical communication.

Please note that I will repeatedly refer to the generic notion of ‘systems’. In this context, it refers to humans and other organisms that can process information as well as computers and other artificial information processors.

The Simple Model

The standard definitions treat data as a factual feed which turns to information when it is processed. According to such a model, information consists of appropriately processed data. Such data has been, for example, differentiated, isolated, or combined in ways related to its context of use.

Technically, ‘data’ is the plural form of ‘datum’. It consists of a set of such data points, which are the most basic facts relative to the observational capacities of the system that collects the data. Data must be specified in relation to collecting systems because there is no objective measurement for a quantum of data. For example, the bits that computers use are based on Shannon’s mathematical model for information, where one bit corresponds to a restricted set of alternatives being halved to approach the correct one. Each binary value (1 or 0) halves the number of possible commands that are compatible with this data. Computers were designed so that their data feeds correspond to this model. What proportion of the input that a human eye registers, for example, would correspond to such a unit?
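The halving described above can be made concrete with a minimal Python sketch. It assumes a hypothetical set of eight equally likely commands (the names are invented for illustration) and shows how each received bit discards half of the remaining candidates, so that three bits single out one command.

```python
import math

# Hypothetical command set, assumed for illustration only.
COMMANDS = ["load", "store", "add", "sub", "jump", "call", "ret", "halt"]

def narrow(candidates, bits):
    """Halve the candidate list once per bit: 0 keeps the lower half, 1 the upper half."""
    remaining = list(candidates)
    for bit in bits:
        middle = len(remaining) // 2
        remaining = remaining[middle:] if bit else remaining[:middle]
    return remaining

def bits_needed(alternatives):
    """Number of binary choices needed to single out one of the given number of alternatives."""
    return math.ceil(math.log2(alternatives))
```

Calling `narrow(COMMANDS, [1])` leaves four candidates, and each further bit halves the list again. The sketch only works this neatly because the set of alternatives is fixed in advance, which is the point about data being relative to the collecting system.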

This model treats information as data which has been processed to be useful. As such, information consists of data which has been selected, organized, and so on to match situational requirements. Since no other conditions apply, such processing suffices to turn data to information. Thus, all appropriately processed data is information. How it is processed must fit situational requirements, though. This involves the portions that hold significance for the system in question being separated from the data feed during processing.

The relationship between data and information is presented approximately like this in the majority of available resources on the subject. It should be noted that such renditions almost certainly simplify the model on which they are based. After enough repetition that replicates the same errors, though, the simplified model takes on a life of its own. It then becomes what many ground their conceptions on instead of the original. This makes it worthwhile to occasionally address the model’s blind spots.

Challenges to the Simple Model

The largest issue in the default model presented above is how this simplified presentation implies that such changes are unidirectional. Only unprocessed feeds are treated as data and all processed feeds are treated as information. It would be more sensible to relate such matters to the systems that process such things. Additionally, not all information derived from data is reducible to basic operations such as selecting, and situational significance should be defined in more detail.

Feeds between Systems

If information is just processed data, does typing on a computer feed it data or information and is the corresponding text on the screen data or information?

If the standard model presented above holds, data which has been processed to become information cannot be treated as simple data. Instead, it is conveyed forward as information. There is no issue with the concept of conveyed information. This text, for example, conveys content which has been processed into information to its readers. However, do these words qualify as information to the systems that convey it such as computers and eyes? This seems unlikely. If this is the case, the same content can be both raw data and information depending on the system that processes it.

Feeds between systems that cannot treat a given set of data as information consist of raw data from their perspective, even when this involves content which other systems would treat as information. As such, the information related to a set of data should be treated as such in relation to suitable systems. The information also cannot consist of said data in the way that the standard model proposes because the data used for communication between systems can change while the information remains constant. An alternative to this constitution-based model is presented below.
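One way to illustrate data changing while the information stays constant is text encoding. The following minimal sketch assumes a hypothetical message and a hypothetical receiving system that can only read UTF-8. The same sentence is carried by two different byte sequences, and the second one is unreadable to that receiver.

```python
# Hypothetical message, assumed for illustration only.
MESSAGE = "The meeting starts at noon."

# Two different sets of data that carry the same content.
as_utf8 = MESSAGE.encode("utf-8")
as_utf16 = MESSAGE.encode("utf-16")  # includes a byte order mark

def readable_as_utf8(raw: bytes) -> bool:
    """Return True if a UTF-8-only reader can decode the bytes at all."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```

Both byte sequences decode to the same sentence with the matching decoder, while `readable_as_utf8(as_utf16)` fails. Whether this captures everything the argument above means by information is a separate question. The sketch only shows that the bytes and the content can vary independently.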

Inferences

What kind of processing is involved when inferences are used to derive information from data?

If information is based on the raw data being differentiated and organized, the information derived through inferences must be a part of the data itself.

This is clearly not the case. The most obvious examples are conditional inferences: if the conditions apply, the entailed scenario must be the case. How is the information that expresses this relationship to be isolated from the data feed itself? For example, a collection of facts related to directly observable conditions would have to express that ‘if you take your bicycle to be serviced, then the experience of riding it will be better’. No amount of paired data points on serviced bicycles and changes in the experience of riding them can express this claim by themselves. Instead, one must rely on generalizations to construe it on that basis.

The same logical form can certainly also be expressed in ways which do not involve abstract operators such as conditions. The above example is logically equivalent to the claim that ‘you do not take your bicycle to be serviced or the experience of riding it will be better’. On the surface, this relationship has now been expressed in a way which data may convey. All these formulations require the use of negations, though. Negations require the opposite state of affairs to be available and the mutual opposition of the two to be observed. The involved opposition is semantically determined and thus cannot be a part of the data itself. Even if the data contained two states of affairs where one always applies and where both never apply simultaneously, this may change when the sample size increases because their being mutually exclusive within a sample may be a coincidence. Since negations are necessary to express conditions in other ways, reformulations do not suffice to make data alone enough as the constituent for (all) information.
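The equivalence between a conditional and a disjunction can be checked mechanically. The following minimal sketch assumes that P stands for ‘the bicycle is serviced’ and Q for ‘the experience of riding it is better’ (labels chosen for illustration) and compares truth tables. In propositional logic, ‘if P then Q’ matches ‘not P or Q’, whereas a disjunction with the negation placed on Q instead fails the check.

```python
from itertools import product

def implies(p, q):
    """Material conditional: false only when p is true and q is false."""
    return not (p and not q)

def not_p_or_q(p, q):
    """Disjunctive reformulation of the conditional; note that it needs a negation."""
    return (not p) or q

def p_or_not_q(p, q):
    """Disjunction with the negation on the other term; this is the converse claim."""
    return p or (not q)

def equivalent(f, g):
    """Compare two two-place formulas over all four truth assignments."""
    return all(f(p, q) == g(p, q) for p, q in product([False, True], repeat=2))
```

Here `equivalent(implies, not_p_or_q)` holds and `equivalent(implies, p_or_not_q)` does not. Every formulation that passes the check still contains a `not`, which is the observation that the argument above relies on.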

If generalizations are necessary to derive such rules from data, though, there is a risk of errors occurring. This entails that the same processes may derive both information and disinformation from data. As data is factual by definition, the latter option is impossible if these processes only involve using the source data as a constituent for information.

In essence, then, deriving information from data may also involve methods which are not compatible with information consisting of collections of the source data. In this case, the relationship between the two cannot correspond to the simplified standard model.

The Definition of Significance

If information must be processed data that has situational significance to the system, what is such situational significance grounded in?

For this concept of significance to be relevant, it must be specifiable in terms of factors that can be assessed independently from the outside. Otherwise, it is merely an arbitrary consideration, and judgements on whether it applies are primarily based on a sense of whether given pieces of content are informative in a given situation.

The standard model uses several mostly interchangeable expressions such as ‘useful’, ‘relevant’, or ‘applicable’ to describe this relationship to a situation. None of these expressions specify which kinds of factors provide such situational value. At most, they hint at a direction.

The expanded model below attempts to provide a more substantive definition of situational significance.

The Expanded Model

I propose that data consists of all states of affairs that are registered by the input channels of data-processing systems as a result of interactions between these systems and their environments. The outputs of other similar systems are a part of such systems’ environments. This distinction between the system and its environment is unique to each system. Importantly, such data remains (just) data instead of being promoted to the status of information regardless of the processing that it undergoes.

Information as a Relative Property of Data Clusters

Information is presented here as a relative property of clusters of data that meet specific conditions. These clusters of data have this property in relation to those systems which can process the information in question. As with the simple model, the primary conditions for information involve data being isolated, organized, and otherwise processed. However, in this case, the definition does not limit suitable methods of processing to arranging the data. Inferences and other generative methods which engender novel content are allowed. When information is not bound to the source data, it can be conveyed with newly produced data. The other conditions are that there are systems which can process the information and that it has the kind of significance specified below to them.

Qualifying for a System’s Use

That some systems may process the data is a precondition for it having significance to them. This is also a more fundamental part of the concept of information, though. Information is always the product of some lifeform (organic, mechanical, or digital) for its use. It is an instrument which lets the lifeform process the kinds of states of affairs which data could never express as though they were conveyed by data. Data is limited to immediately available states of affairs. Information allows concepts, abstract states of affairs, and wider wholes than an individual may observe directly to be conveyed. What kind of data can express that it is equally futile to chase both literal and figurative unicorns? The kind that can carry this information to you, too, and this is only one example of such data. Feel free to imagine a picture that would do the same.

In this respect, it is important to clarify the relationship between people and all the data processors that we employ. These systems are an extension of humankind in the same way as eyeglasses, clothes, or prostheses, for example, can be extensions of an individual’s body. Another point of reference is the ways in which other species artificially control their environments, such as beaver dams. These systems have their own means of processing information where the information in question is not always available to be processed by humans as such. These cases involve information specific to these systems which they then often process further to make it suitable for human use. Complementing one’s capabilities like this through changes to one’s environment is discussed in books such as Being There (Clark, 1996) and The Extended Phenotype (Dawkins, 1982).

Significance in Service of Guiding Action

Significance connects information to different situations and this situationality provides the contexts to which the conveyed content is related when it is processed. As was mentioned above, information is not limited to immediately available states of affairs. This is why there must be some foundation to which available information can be related when it is retrieved from the data used to convey it. For example, if someone is waving their hand back and forth horizontally, the significance of this can be a request to either move away or to move in one of those directions, usually that of the palm. The information conveyed in this manner requires relating the message to situational factors.

In this case, the significance of content that is delivered in a format that the system can use consists of the content letting the system (1) interact with its environment in a way that (2) has a non-incidental foundation (3) for the utility that it provides to the system. Each of these three factors is discussed below.

Interactions

Why does a contribution to the ability to interact act as the foundation for significance? Should it not be possible that content can have significance to a system based on factors that are purely internal to it? For example, people can find significance in extremely theoretical contents where any action-guiding potential appears vanishing at best. If this is the case, the condition that significance involves support to interactions cannot be treated as being entailed by its definition.

However, theoretical contents also shape attitudes towards reality, and actions reflect these differences in attitude. For example, Hegel’s absolute idealism may constitute some of the most theoretical content available. One’s belief in it can nevertheless affect teaching methods, for example. As the case of Francis Fukuyama shows, such beliefs can also affect one’s attitude towards the state of society and thus political behaviour and consumption habits. Also note how such claims are subject to such uncertainty that the information in question conveys that they exist rather than that the states of affairs that they describe exist. From this perspective, even a negative reception to such beliefs being conveyed affects behaviour, whether in the form of just a personal reaction or in the form of inventing dialectical materialism as an objection to absolute idealism.

The reason that interactions are essential to significance, on the other hand, is that this is necessary to gain utility. Inaction will have no results. Even though the kinds of actions required in different circumstances and the kinds of benefits thus incurred may vary, even internally self-sustaining systems must be protected from external disturbances. No form of static protection is permanent. Maintaining such protections will eventually require measures.

Non-incidentality

This condition relates to the notion that information must be truthful. For example, Floridi (2005) defines (semantic) information in a way where being true is a necessary condition. Even though information is thus not required to be constitutionally true because it does not consist of just reorganized data, this model still requires it to be true as a part of its definition. Misinformation, on the other hand, is not considered a form of information despite its name. It is content that is conveyed in a similar manner but which does not fulfill all the conditions that the content having the status of information requires.

A non-incidental relationship between behaviour guided by a piece of information and the incurred benefits is essential because it bridges availability and utility. Were this relationship allowed to be based on independent circumstances, the difference between non-information and information and the change from one to the other could depend on factors like an unexpected change in weather. This is simply unacceptable. For example, the claim that a customer will arrive at a meeting half an hour late can be based on the speaker’s prior experiences of said customer’s actions. By coincidence, the reason for the customer being half an hour late in that instance is that their train had been delayed by bad weather. This would make the original claim true and able to direct behaviour in a helpful manner. This would depend on the weather at that time rather than the data on which the generalization was based, though.

The sense at play here for ‘non-incidental relationship’ is that the state of affairs that the information expresses is in a direct causal relationship to the utility acquired through behaviour that is guided by said information. If the example presented above had involved the customer arriving late because of their own indifference or negligence rather than the weather, the information that predicted their late arrival would have been based on the correct reasons. This would make the value in terms of saved time and effort derived from preparing for the later arrival be based on the proper information. Receiving that value would thus be non-incidental.

System-specific Utility

Utility as such is hardly easier to assess from the outside than significance. The concept of utility does provide a path which lets you connect situational significance to details that can be assessed. The alternative proposed here involves a biological model that can be extended to non-organic systems by either treating them as extensions of organic systems or by drawing parallels to functional similarities.

Such utility is still treated as situational. Such situations do not involve strict temporal limitations, though. Several nested situations can be interwoven, and utility being situational only requires that it can be associated with one of such situations that is defined relative to some sensible grounding factor. An example of such a defining principle would be the duration of some project to bundle related circumstances, or even a single lifetime.

Organic Systems

In the case of organic systems, such utility consists of optimal wellbeing. Wellbeing, on the other hand, should be understood as an organic system’s equilibrium state in relation to biometric variables. For example, Damasio (2018) emphasizes how all organic systems are based on homeostasis, which consists of a sustainable equilibrium that the organism strives towards. Once none of the available equilibrium points can be achieved, the result is termination. It is not a matter of optimizing individual factors but rather of reaching a sustainable overall state and maintaining it. I will not be commenting on the relative superiority of different points of equilibrium since there is no obvious, univocal measure for such a thing. As such, this is a matter of fulfilling minimal conditions rather than reaching for an upper limit.

The relationship between information and a system being in a state of homeostasis is often indirect. This relationship can be conveyed in particular by changes both in risk factors that apply to a system and in available opportunities. Both risks and opportunities involve probability-based anticipated changes in the factors that contribute to homeostasis. Risks represent threats which move the system away from the cluster of points of homeostatic equilibrium. Because the acceptable range for each involved factor is limited, the points of equilibrium related to them cluster. Opportunities, on the other hand, represent safeguards that either shift the system towards the acceptable range on some axis or anchor it within said range. Such anchoring involves a surplus of available relevant resources to compensate for factors that would move the system away from that range. In practice, then, hunger is easier to keep in check when food is available, for example.
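The relationship between acceptable ranges, risks, and opportunities can be sketched in a few lines of Python. The variables, ranges, and amounts below are hypothetical and chosen only for illustration. The sketch treats homeostasis as the minimal condition that every variable lies within its acceptable range, and treats risks and opportunities as anticipated changes to those variables.

```python
# Hypothetical variables and acceptable ranges, assumed for illustration only.
ACCEPTABLE_RANGES = {
    "energy": (20.0, 100.0),
    "temperature": (36.0, 38.0),
}

def in_equilibrium(state):
    """Minimal condition: every variable lies within its acceptable range."""
    return all(
        low <= state[name] <= high
        for name, (low, high) in ACCEPTABLE_RANGES.items()
    )

def apply_change(state, changes):
    """Return a new state with the anticipated changes added to the affected variables."""
    updated = dict(state)
    for name, delta in changes.items():
        updated[name] += delta
    return updated

# A system close to the lower edge of its energy range.
STATE = {"energy": 25.0, "temperature": 37.0}
RISK = {"energy": -10.0}         # a threat that pushes the system out of range
OPPORTUNITY = {"energy": 30.0}   # a surplus that anchors the system within range
```

Applying `RISK` to `STATE` alone breaks the equilibrium condition, whereas applying `OPPORTUNITY` first leaves enough surplus to absorb the same risk. This mirrors the point that hunger is easier to keep in check when food is available.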

Non-organic Systems

In this model, non-organic systems are either treated as extensions of organic systems or as functionally analogous in relevant respects based on their level of autonomy. Systems which are completely subservient to the interests of their creators are tools, regardless of how they process data or information and of how they can often proceed with such tasks independently. For now, all systems produced by humans belong in this group. Genuinely self-directed systems, on the other hand, would belong in the latter group. They would have to be such that any behaviour that serves the benefit or other demands of their creators can be bypassed. Such a system can act against such designs.

In the case of systems which act as extensions of organic systems, the utility-related condition is fulfilled by the effect on the wellbeing of the responsible organic systems. Such systems are generally built to allow them to convey information in a format which is intelligible to their creators.

The wellbeing of genuinely autonomous systems does not rely on the same factors as that of organic systems. You can still distinguish between their internal state and external influences, though. This suffices for the concept of homeostasis to apply despite it being primarily a biological notion. Non-organic systems must still ensure sufficient access to energy and maintenance for their means to protect themselves from harmful influences, for example.

Practical Consequences

For the presented expanded model to have value, it must have some notable practical consequences. Two ways in which the relativity of information in the expanded model is relevant are presented below.

Relating

In the expanded model, information is dependent on two factors: the data which conveys it and compatibility with the interpretation methods available to receivers. As such, the same data can either be information-bearing or not relative to a system based on whether the information can be recognized by said system. In the simple model, the difference between data that does not constitute information and data that does only goes one way.

Accordingly, available information can be processed (only) when available interpretation methods are co-ordinated with the information-bearing data.

This observation lets you approach targeting content at audiences from a new perspective.

Targeting and Communication

Usually, targeting is intended to help communicate the recorded information. This conception involves a presupposition that this is a matter of formulating content in such a way that the conveyed information can become knowledge with minimal effort. In this context, knowledge is to be understood as information which is internally available to a system and which the system can relate to a task that it recognizes. In essence, whether the information is conveyed is not questioned. At most, this is a matter of the effort required to acquire the information. This assumes that the information is always available and that this is only a matter of degree.

Targeting and Informing

The expanded model allows the possibility that content does not manifest to systems as information at all if it does not fit their capacities. You can demonstrate this principle by looking for images of complex mathematical formulae outside your field of expertise — or code or grammatological texts. The disconnect occurs one step earlier. The expected practical effects of this include challenges in recall and recognizing its connection to an application later. If the required information has not been conveyed, there is also no corresponding memory trace. At most, any memory traces would concern the data itself, and such memories would be less reliable and harder to process.

Accounting for this risk when content is formulated in a targeted manner requires gradually building the required capacities if some members of the target audience might lack them. In this respect, it can prove worthwhile to account for the difference between the expected target audience and the possibly larger potential target audience. This simply involves details such as clarifying key concepts as part of the content. Field-specific acronyms in particular can be expected to be transparent enough to only require a single clarification that only shows the source of the initials. Often just making the expression explicit is not a sufficient clarification, and when the number of acronyms in particular increases, you should make sure that they are different enough. Related additional considerations can be found here, for example.

The Effects of Structure

For other people, structured content generally appears similar to non-structured content with the same layout. Content producers are the main beneficiaries from structured documentation. This lets them use inherently divided up content which can be managed en masse. This kind of mass management involves, for example, controlling the layout of specific kinds of elements or whole documents with style sheets and updating reused parts by editing their source element.

From the perspective of non-organic systems, the difference is on another level entirely. Even though artificial intelligences that process natural languages have developed immensely, content that is provided in structured form is by far easier to process, even without the use of state-of-the-art AI. The majority of these use cases do not require any language-processing artificial intelligences. The structured format lets you add targeted metadata and isolate key sections. The associated hierarchical structure also clarifies the relationships between parts. For example, it is vital that a system that processes personal data correctly identifies the difference between an address, a date of birth, and a visitation date. When content is appropriately structured, the system need not check the format of the content itself. Additionally, different dates, for example, have the same content format, and thus correctly recognizing them without structure would require further suitable presuppositions that support such organizing.
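The point about dates sharing the same content format can be shown with a minimal Python sketch. The record, the field names, and the date pattern are hypothetical and assumed only for illustration. In the flat text, a format check finds two dates but cannot say which is which. In the structured record, the field name carries that distinction.

```python
import re
from datetime import date

# Hypothetical personal data, assumed for illustration only.
FLAT_TEXT = "Jane Doe, 12 Example Street, 1990-04-12, 2024-03-05"

STRUCTURED = {
    "name": "Jane Doe",
    "address": "12 Example Street",
    "date_of_birth": "1990-04-12",
    "visitation_date": "2024-03-05",
}

DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")

def dates_in_flat_text(text):
    """Format check only: finds the dates but not their roles."""
    return DATE_PATTERN.findall(text)

def date_of_birth(record):
    """The field name identifies the role, so no guessing about order is needed."""
    return date.fromisoformat(record["date_of_birth"])
```

With the flat text, deciding which of the two matches is the date of birth requires an extra presupposition, such as the earlier date or the first position. With the structured record, `date_of_birth(STRUCTURED)` needs no such assumption.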

In this respect, a structured format thus acts as a means to express information in a manner that suits such systems. The structure acts both as a (for now) necessary condition to convey information and as a channel for further information on the content itself. Applying the expanded model thus emphasizes how only structured content qualifies as information for such systems. The practical consequences of this include that the structured format is required for reliable transfer of information between two artificial systems. Feeds between them are possible in the absence of the structured format but each system must re-interpret the data that it receives to turn it into information that it may process. This alternative is both unnecessarily complex and less reliable.

Summary

The relationship between data and information is generally presented with a simplified model, according to which information is situationally significant organized data. This standard model muddles the relationship between data and information because it states that information is constituted by organized data. However, since the data used to convey the same information can change and since information content can include details that inherently cannot be conveyed by data, this cannot be the case. The details for the concept of significance must also be specified separately before it can be used as part of this definition.

An alternative to this standard model is a slightly expanded model. According to this expanded model, information is not constituted by data but rather is a property of organized sets of data. Data will be data but it can also convey information. This property is relative as well. A set of data has an information value only in relation to those systems which can recognize it when they process such feeds.

This expanded model also defines significance as the interactions that the system has with its environment being provided with non-incidentally beneficial guidance. The expanded model focuses on interactions because the system can achieve no utility otherwise. Non-incidentality requires that there exists a direct link between the information being available and the utility that it provides. This involves the utility gained from acting according to the information relying on the kinds of circumstances that the information is based on.

In this context, utility consists of factors which support the homeostatic state of the information-processing system or its users. Homeostasis consists of an equilibrium which allows the system to sustain itself based on the required factors having values within acceptable ranges. Whether this minimal condition is fulfilled can be assessed from the outside. When systems that are utterly subservient to other systems and only act as tools are involved, this condition is tied to the information that they process benefitting the main systems that make use of such systems.

The differences between these models manifest in how the significance of targeting and structuring is to be understood. In terms of targeting, the condition that the formulation is compatible with the receiver’s ability to identify the conveyed information entails that non-targeted outputs convey no information at all. This is not simply a matter of their inability to further process conveyed information. This also affects the format in which memories of the content are retained. Memory format, on the other hand, affects durability, reliability, and applicability. The best way to target content at artificial systems, on the other hand, requires a structured format.