Data Is Not Data, and Neither Is Data
Much of what I have sent out thus far has tended to be stuff that I've worked on before, that I felt deserved another viewing, or that I just happen to be extra proud of. I do however work on new stuff as well, and today I wanted to share some short, popular pieces that show what I'm working on right now. I've recently started up a research center on organizational datafication, and below I try to explain why the issue of data is so important, and also often far too uncritically approached. So happy reading, and hope your data stays classy! Kind regards, Alf
On What We Are Given
I am, for my sins, deeply engaged with the major Danish research initiative ADD (Algorithms, Data, and Democracy), a project that could be described as being burdened with a name that might obfuscate more than it reveals. On the surface, things seem quite ordinary, even mundane. Starting from the end, we all feel we know what democracy is, at least to a functional degree. The word comes to us from Ancient Greek, dēmokratia, derived from the word for "people" and the word for "power" or "rule". Rule by the people then, for better or worse. Simple concept, even if it might create various issues in practice. When it comes to the first word, algorithm, we may feel that while the term in itself isn't all that complicated, we are less sure as to whether we fully understand it. Sure, we might know it comes from Muhammad ibn Musa al-Khwarizmi, or to be more precise, Muhammad, the son of Musa, born in Khwarizm (today known as Khiva, Uzbekistan), whose name was later Latinized as Algoritmi as his work on the Hindu-Arabic numeral system was published in Europe (under the title Algoritmi de numero Indorum). Sure, we know it has something to do with calculations and the methods thereof, and frankly, that's where most of us leave things to the nerds and the geeks. We don't fully know how, but we do know that algorithms can, if properly tended, create strange and wonderful things, to the point that we're even a little scared of them.
Take these two words together – such as in the notion of "algorithmic democracy" – and hackles start rising. We accept that algorithms may help in the world, but combining them with democracy sounds… wrong. It smacks of automated voting, manipulation, computers taking over, and various other unpleasant things. One is a thing we like, the other a thing we are a little unsure of, and together they raise more questions than they answer. Which, if you think about it, is pretty perfect for a research project – something to like, something to doubt, and lots of questions to go round.
The astute reader will by now have noticed that I in this somewhat facetious deconstruction of the project's name have not touched upon the middle word at all, the little word "data". This is the only one of the three terms that has a basis in Latin, as it comes from the verb dare, which means "to give". A thing given or granted, then, is a datum, and the plural of this is data. Thus, data means something like "things given to us". In medieval times, philosophy started using this term to indicate things that were "a given", i.e. true for the purpose of argument or reasoning (sometimes put in the form "data rerum"). Over time, science adopted this use, and the etymology got muddled. Data became a mass noun, used quite broadly indeed, and as this happened, the term got questioned less and less. Today we use it as part of a little hierarchy, in which data is assumed to be the raw material (as in "data is the new oil"), which can be structured into information, and in context turned into knowledge. It seems all so very tidy!
Now, let's test this in practice. The following is data: 48, 12, 9, 28, 18, 24, 22, 52
Presented in this form it is nigh-on useless, unless you are really pressed for lottery numbers, or looking for an unlikely password. We can add something to it, however, and it becomes a little more interesting. That string of data is in fact the ages in a group of people. We now have some information about this group, such as the fact that the majority of said group are adults, at least in the sense that they can vote. We also know that two are children, and that none are old-age pensioners. Granted, not the most thrilling information, but still. I can now add yet another dimension to this, and tell you that the string of numbers describes the ages in my family, i.e. the ages of me and my partner, and the ages of our children. You now have some knowledge about my family – you're welcome. So far, so simple, right?
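For the programmatically inclined, the step from data to information can be sketched in a few lines of Python. This is only an illustration: the numbers are the ones given above, but the category labels and the two age thresholds are my own assumptions, chosen to mirror the reasoning about voters and pensioners, and they are exactly the kind of pre-made categories this essay is about.

```python
from collections import Counter

# The "raw" data from the text: just numbers, with no category attached.
ages = [48, 12, 9, 28, 18, 24, 22, 52]

# Assumed thresholds (not part of the data itself): who counts as an
# adult or a pensioner is a decision made before any number is read.
VOTING_AGE = 18
PENSION_AGE = 67


def categorize(age: int) -> str:
    """Place a single age into one of three assumed categories."""
    if age < VOTING_AGE:
        return "child"
    if age >= PENSION_AGE:
        return "pensioner"
    return "adult"


# The numbers only become information once the categories are applied.
counts = Counter(categorize(age) for age in ages)

for label in ("child", "adult", "pensioner"):
    print(label, counts[label])
```

Change either threshold and the very same numbers yield different information, which is rather the point.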
Some questions remain, however. What was the data before it became information? If you answer data, you'd be correct, but only in a general sense. The string of numbers I presented (or gave, as it were) could have been just random ones I made up. They weren't, as I had the information of our ages at hand, but does that mean that this was information for me yet data for you, at the same time? Was it thus only data from some perspectives but information from others? What if I lied? Was it still data to you, even when I knew it was just random numbers? Going a bit further: I wrote that we "can add something to it", in that I revealed that the numbers represented ages. Was that something data? It was a category, unusable in and of itself (Consider the question: "What is the median age of dragons?" – acquiring the data necessary for turning that question into information and knowledge is, sadly, not possible.), so it would seem to be. But where did it come from? It clearly pre-existed the data in the series, and it is likely that the category and the data that can populate it did not emerge as separate entities. Rather, we started to pay attention to ages, and the data and the category that created such information emerged simultaneously.
So it would seem that data isn't just data. More to the point, what we talk about as data is what can populate categories that we have decided are important, interesting, and apposite for specific phenomena. Consider for instance the data we tend to get about our children's classes in school: How many students there are, and their gender split. The former data is to us information about whether the class is "small" or "big", and is used in particular to ensure that the class is not "too big". This also means that schools know that the data is not allowed to go above a certain threshold – if a school says it has a class with 50 students, this will likely be illegal and cause a storm of protests from parents. So whilst the category "class size" might seem like one that could contain data points from 1 to a thousand or more, the data is in fact closely curated to be in a fairly limited span – in Denmark between 24 and 28, with some outliers. So can data and information when it comes to class size really be separated? Then we have the question of genders in class. Without even thinking about it, we work from the assumption that the "correct" data for this category is something approximating an even split, such as 14+14 in a class of 28. Schools are, again, well aware of the curation demands for this, and so ensure that classes only rarely skew heavily in their gender split. So the assumed data in the statement "There are 13 boys and 14 girls in my child's class" is in fact affected by the knowledge assumptions of said category. Further again, why this specific data? Size and gender, and for most parents, relatively little beyond this. It is for instance very unusual to get data about class happiness, about the noise levels in class in dB, or the average reading speed among the students. All these would be data about the class, but for various reasons only a very limited amount of data is considered important enough to collect.
In some cases this can be due to the difficulty of measuring it – such as in the case of happiness. In other cases this can be due to fears of a backlash – no parents want their children to be in a noisy class. A seminal essay of second-wave feminism, written by Carol Hanisch in 1969, was titled The Personal Is Political, and emphasized the politics underlying much of what was considered personal or private. Today, we may need to open up to the fact that data, far from being just the neutral basis for information, is political as well.
To some, this comes as no surprise. Much of what has been discussed about algorithms, data, and democracy has been very attuned to questions about privacy and biases, often with the assumption that when it comes to data, less is better. If Big Tech has less data about us, they won't be able to manipulate us in the same way – or at least so the story goes. This, however, ignores the lesson we should learn from the etymology of the word "data". It stands for that which is given, the assumptions made regarding what is important, the manner in which we give ourselves to the world. We are (mostly) happy to give our gender and age, as these given categories have been with us as definitional from the moment we learnt to speak – the first things we teach a child to communicate to others are their name and age. Few consider that this is not in fact data about us, but information. The categories here precede us, and as we are born we are assigned a gender and have our age recorded. Not so much as data, but as part of an information system, ready to categorize us, to treat us as given. The smallest aberration from this, and the system gets a hiccup, pushing away the data that does not fit in. Consider for instance the case of a young man in my extended family. In the Danish data systems he is well-categorized, with all the right names and codes. There is however something to him, a data point, that does not fit the information structure of the Danish state. This something happens to be a functional uterus, currently occupied with nurturing a new life to fruition. The existing information systems in Denmark lack the capacity to add in this data, as man-with-uterus is not a category that can be chosen. In effect, this data about him in the system becomes non-data, as there is no way to capture it therein. His surplus of data is ignored, left behind, treated as not given at all.
Why does all this matter? Simply put, data is a far more complex category than we tend to realize. We treat it as something akin to oil or water, a free-flowing resource that is always already ready for use. In reality, data is often a choice – we choose the data we gather based on categories we may or may not understand, and the data that we don't choose, we don't think much about. So we learn some things about classes in school, or people in the healthcare system, but only what the pre-made structures allow us to know. That another world is possible, with different data structures, tends to go forgotten.
On Data Deserts
One of the reasons people are interested in algorithms and data is that they fear that they, or more precisely their data, are being misused. This is a particularly contemporary fear, for although humans have feared being cheated or tricked for as long as we've had social relations, such shenanigans used to be mostly material and out in the open. Today, we live in fear that someone, somewhere, is doing things they shouldn't with our data, and even more gallingly, making money while doing so. While we earlier might have been cheated by a tradesman or a shopkeeper, today we fear being taken advantage of by a data broker, selling what they know about us to the highest bidder. A lot of things conspire to bolster this fear. We have a discourse in which data is likened to gold or oil, making it sound like e.g. the data regarding just how I like my salads dressed would have innate worth. As even a small amount of gold or oil has value, surely this must mean that all of our data is valuable as well? On this note, my shoe size is 45, and you can have this data point for free. You're welcome. To continue, the existence of data brokers implies a market for said data, and this in turn suggests the potential for competition. Thus in our minds our data is not only innately valuable, it might hold special value for just the right buyer, who is prepared to pay top dollar for my dressing preferences.
This notion of data and the value thereof has created no small amount of debate in society, as people are keen to ensure that their data does not get utilized in an unfair manner. There is thus no end to either ideas about governance principles for data, or technical solutions for ensuring people have control over their own. Frameworks and blockchains and wallets, oh my! Particular attention has in these debates been placed on the data of the young, who are assumed not to know any better, and those with limited digital skills. Discourses of protection have become quite prevalent, with the public presented as virtuous maidens at constant risk of being defiled by the data-hungry ogres of Big Tech. As entertaining as this discourse can be at times, it presents a complex issue in a way that is very black and white, and handily marginalizes other, more problematic issues.
I say this, for while the debate about privacy and data protection has been raging, another, possibly equally important issue has received little to no attention, to the detriment of both the debate and much of society. I say this because I am convinced of one thing:
There is something worse than having your data tracked and traded, and that is for your data not to matter at all.
In our digital society, our data is important, and thus deserves protection. No-one would question this. Yet at the same time, in a digital society the visibility of an individual or a group is innately connected to whether they are seen/noted in the data, and a key step in this is for there to be data to see and note. This is an issue that has interested me for a long time. I can remember, as a child, reading about something known as "food deserts", and being very confused as I read that this could describe e.g. an urban area that had many fast-food outlets, but no grocery stores. To a child this sounds something like a nugget utopia, and thus far from a desert, but the sociologists etc. who introduced the term wanted to point out that if your suburb is structured in a way where feeding your children cheeseburgers is markedly easier and cheaper than getting to a grocery store and buying the ingredients for a nourishing meal, we shouldn't wonder at (or condemn) the fact that many took the easier and cheaper option. The notion of desert had been used before, e.g. in "cultural desert", but it was this that caught my notice as a child.
Later on, as a researcher, I became interested in the elderly. In particular, I became interested in the strange tendency among entrepreneurship and innovation scholars to disregard them, and through this the fact that quite few startup entrepreneurs seemed to consider them an interesting potential market. In researching this odd phenomenon I interviewed a senior executive at one of the world's biggest market research companies, and asked him about the granularity with which they could collect and process data about my son. The answer was remarkable, in that he stated that they collected thousands of data points about people like my son, and created hundreds of categories and personas through which he could be understood. I always felt this to be a tad excessive, seeing as my son at the time was dead broke, with a number of years of being dead broke ahead of him. I then turned the question to my mother, who at the time was still alive. Upon hearing her age, he excused himself and stated that he was aware that this was an issue with their collection methods, and that work was underway to correct this, but as things stood, my mother was too old for them to collect any data at all, let alone generate profiles from this. My mother had aged out of being a data source.
I tell this small reminiscence to illustrate what we all know but which is almost completely ignored in the public discourse on data: There is nothing impartial or equitable about the way we collect data about people, groups, or fields. Rather, the way in which we collect data is a deeply politicized process, where the perceived importance of the person or the field in question affects both the quality and the quantity of the data collected. Again, there is something worse than having your data harvested, and that is to be ignored for your data altogether. This is why I have tried to work with introducing the notion of data deserts into the public discourse; to show that just like a large US city can be beset by food deserts and the problems these bring, we are moving towards a society with data deserts where digitalization leads to marginalization and exclusion. A basic form of this can already be seen in the way urban and rural areas are positioned in what might be called "data opposition". Here, urban areas, simply due to the fact that it is far easier and cheaper to collect data about most everything within them, will always seem more dynamic, more successful, more… well, everything. Rural areas, as problems in them will appear more acute due to smaller samples, simply have no way to compete. As digitalization and datafication continue, these oppositions can only become more pointed.
What kind of data deserts can we thus expect? The elderly, and in particular the rural elderly, already constitute one. Even though I am the last person to claim that the elderly by necessity would have less digital savvy than other groups (my mother, for instance, was nigh-on addicted to her tablet and her preferred social media), there is a marked group among the elderly who either by choice or necessity do not have things such as smartphones and computers. This not only makes it more difficult for them to interact with an increasingly digital societal apparatus, but also decreases the amount of data we have about the elderly in general. Some behavioral patterns connected to not wanting to be a burden and having to live frugally can enhance this tendency. As they become less and less visible due to lack of data, this can lead to insufficient attention being paid to e.g. health or mental health issues.
This is however not the only instance. In the center that I lead at SDU, we have started a project on datafication in Greenland, which shows another face of this issue. A country such as Greenland will normally not have the same data structures in place as e.g. mainland Denmark, simply due to different needs and a marked difference in context. However, issues of legislation and institutional pressure are creating increased demands to adopt datafication structures from elsewhere, including structures defined by Big Tech. The risk here, then, is that datafication starts making Greenland less of a data desert, but does so not from the needs and wants of the local context, but through a kind of data colonialism – raising the question of whether it is the real Greenland or a virtual simulacrum that becomes datafied.
Another issue can already be seen among small and medium-sized companies. Organizations that have acted quickly to adapt to datafied, algorithmic technologies stand a chance to not only be seen as more innovative, but to be valued on the fullness of data they produce, creating an issue for companies that cannot generate the same level of data intensity, thus being potentially undervalued. A worst-case scenario would see entire parts of an industry becoming data poor in relative measure, and thus at risk of acquisition or other competitive pressures.
Back on an individual level, we might also see that groups that for one reason or another are less datafied and thus present as less data-intensive from the perspective of the state – here we might imagine groups such as neurodiverse youth or adults who adopt a less technological lifestyle – become ever more marginalized (as if they weren't already) as their data becomes diluted in the digital deluge. What might look to some as an attempt not to be part of something that is damaging to you (e.g. social media), may create strange externalities and yet more silences for the already silenced.
What, then, is to be done? Should we build digital oases in Greenland and/or Ærø? Are we to force data about people into the general digital circulation of society? Well, doubtful, and no. These are not issues where there are simple and quick solutions, but rather conundra regarding the datafication of society where we need a robust public debate and a research-based understanding of what it means when data inequality creates new kinds of phenomena – such as that of digital deserts.