Tuesday, April 16, 2013

How to use Big Data

(Translated from Comment utiliser le Big Data)

On January 7, 2013, Stéphane Grumbach and Stéphane Frénot published an article in Le Monde that develops what is often said about Big Data: "Les données, puissance du futur" ("Data, the power of the future").

It is true that the Internet gives powerful publishing tools to the institutions that produce statistics; it is also true that the observations collected by computer processes allow novel uses. One should of course be aware of the new possibilities and the new dangers that this entails.

The authors of this article, however, handle the semantic bombs that are the words "data" and "information" with too little care. Phrases such as "to digitize everything", "information society", "mass of data" and "a resource little different from commodities such as coal and iron ore" are actually deceptive: by encouraging us to judge data by their volume, they slide down the slope of "information theory."
Shannon, who likened the information a message brings to the logarithm of its length after compression, said "meaning does not matter." The impressive strength of this statement cannot hide its absurdity.
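
To see concretely why a volume-based measure is blind to meaning, here is a minimal Python sketch. It relies on a deliberately crude stand-in for Shannon's measure, the entropy of single-character frequencies, which is an assumption of this illustration and not something the article discusses; the sample sentence and the function name are likewise invented for the example.

```python
import math
import random
from collections import Counter


def entropy_bits_per_char(text):
    """Empirical entropy of the single-character frequencies, in bits per character."""
    counts = Counter(text)
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


# A sentence that means something to a reader (illustrative sample only).
message = "it is better to collect a few well chosen data"

# The same characters in a shuffled order: meaningless to a reader.
rng = random.Random(0)  # fixed seed so the run is repeatable
chars = list(message)
rng.shuffle(chars)
scrambled = "".join(chars)

h_message = entropy_bits_per_char(message)
h_scrambled = entropy_bits_per_char(scrambled)

print(f"meaningful: {h_message:.4f} bits/char")
print(f"scrambled:  {h_scrambled:.4f} bits/char")
```

Under this simplified model the two strings carry the same quantity of "information", because shuffling leaves the character frequencies untouched, yet only one of them can be interpreted.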

Here is what the professional practice of statistics taught me:
  1. "Data" are in fact selective observations: they are not "given" by nature but defined a priori by an observer, so that their measurement can then be "given" to the computer.
  2. "Information" gives the mind of the person who receives it an "inner form" that provides the ability to act. This ability can be realized, however, only if the data are interpreted, which requires assuming a causal link between the concepts whose measurements were observed.
  3. Since even the sharpest data analysis does no more than explore correlations, one must master the theory of the observed domain in order to infer causation from correlation.
A few words on the last point: a theory is the treasure of previous interpretations, condensed in the form of causal relationships between concepts, and this treasure must be kept free of the dogmatism, pedantry and narrowness that are the diseases of theory.
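
The third point can be illustrated with a small simulation. This is only a sketch under invented assumptions: a hidden common cause `z` drives two observed quantities `x` and `y`, neither of which acts on the other. The variable names, the noise level and the sample size are all hypothetical choices made for the example.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def simulate(n=5000, seed=0):
    """Draw a hidden cause z and two observations x, y that each depend on z only."""
    rng = random.Random(seed)
    z = [rng.gauss(0, 1) for _ in range(n)]
    x = [zi + rng.gauss(0, 0.5) for zi in z]
    y = [zi + rng.gauss(0, 0.5) for zi in z]
    return z, x, y


z, x, y = simulate()

# What a purely exploratory analysis sees: x and y move together.
raw = pearson(x, y)

# What a theory of the domain allows: knowing that z drives both,
# remove its contribution and look at what is left.
rx = [xi - zi for xi, zi in zip(x, z)]
ry = [yi - zi for yi, zi in zip(y, z)]
adjusted = pearson(rx, ry)

print(f"correlation of x and y:          {raw:.3f}")
print(f"after removing the common cause: {adjusted:.3f}")
```

The data alone show a strong correlation between `x` and `y`. Only the knowledge that `z` is their common cause, which the analysis cannot supply by itself, reveals that acting on `x` would leave `y` unchanged.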

Anyone who ignores the theory will inevitably fall, as happened to me, into one of the naive errors that experienced theorists identified long ago. The observation that produces data is itself based on a (sometimes implicit) theory that supplied its concepts, and of which it is important to have at least an inkling.

The way intelligence agencies operate shows that interpretation (which they call "synthesis") is much more important than the collection of data: it is better to collect a few well-chosen data that one is able to interpret than to be crushed by a massive collection.

If we neglect this, Big Data will bring only confusion. It is dangerous to place value only in the volume of computer storage and the power of data processing. On the other hand, if one knows how to use them, data will indeed be a resource and therefore, as Grumbach and Frénot state, something at stake.
