Ultimately, big data is a practice of inductive reasoning (described in
this article), so a basic rule applies: adequate observation is needed
to form a true conclusion.
Technically, that can never be fully achieved, as David Hume argued a long
time ago (also see that article). Still, it's the best we
can do when facing the real world, and so far we've
got some pretty satisfying results. Thanks to the ever-shining foundations of
logic, we have continuously developed very sophisticated methods that further
extend the power of data processing, such as statistics and its many
branches. And big data is no exception.
Like earlier practices of inductive reasoning, big data requires a large
amount of observation to get a model good enough for a specific problem.
Yet gathering that much observation is itself a difficult
enough problem. In big data science, even a tiny neural network
requires a fairly large training set. In statistics, if I want
a valid conclusion, I need a high enough confidence level with a
small enough margin of error, which takes a large amount of data. See this table.
It beats thousands of words.
No need to check the numbers very closely. It's obvious that,
at a given confidence level, a larger sample size leads to
a smaller margin of error; equivalently, for a given margin of error,
a larger sample supports a higher confidence level. Also, the conclusion
generated could cover a larger population, which indicates it is more
likely to be universal.
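To make the relationship concrete, here is a minimal sketch, not the method behind the table. It assumes the textbook normal approximation for a surveyed proportion, with the worst case p = 0.5 and no finite-population correction. The function name margin_of_error and its default values are illustrative assumptions.

```python
from math import sqrt
from statistics import NormalDist


def margin_of_error(n, confidence=0.95, p=0.5):
    """Half-width of a normal-approximation confidence interval
    for a proportion p estimated from n independent observations."""
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return z * sqrt(p * (1 - p) / n)


# The margin shrinks as the sample grows (roughly with 1 / sqrt(n)).
for n in (100, 1_000, 10_000):
    print(n, round(margin_of_error(n), 4))
```

Because the margin shrinks only with the square root of n, halving it takes about four times as many observations, which is one reason the amount of observation is such a hard problem.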
The same idea can be extracted from this image.
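(The power of a test, labeled on the y axis, is the probability of
detecting an effect that really exists. It is related to the confidence
level but not the same thing; like precision, it rises with sample size.)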
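As a rough illustration of how power grows with sample size, the sketch below computes the power of a one-sided z-test with a known standard deviation. This is an assumed, simplified setting and not necessarily the test behind the image. The function name z_test_power and the default effect size, sigma and alpha are illustrative choices.

```python
from math import sqrt
from statistics import NormalDist


def z_test_power(n, effect=0.2, sigma=1.0, alpha=0.05):
    """Power of a one-sided z-test: the probability of rejecting the
    null hypothesis when the true mean shift equals `effect`."""
    std_normal = NormalDist()
    # Rejection threshold for the test statistic under the null.
    z_crit = std_normal.inv_cdf(1 - alpha)
    # Expected value of the test statistic when the effect is real.
    shift = effect * sqrt(n) / sigma
    return 1 - std_normal.cdf(z_crit - shift)


# Power climbs toward 1 as more observations are collected.
for n in (25, 100, 400):
    print(n, round(z_test_power(n), 3))
```

With no real effect (effect=0), the function returns alpha itself, the false-alarm rate; more data help only when there is something to find.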
In fact, outside big data or statistics, everything people deal with
in the real world, big or small, requires a similar
strategy. Take Johannes Kepler's discovery of the laws of planetary motion, for instance.
The book Deep Learning with PyTorch dedicates a section to his
story in order to explain what "learning" means. The
first step of his learning was that he "got lots of good
data". Frankly, this is the beginning of every learning process. (In
natural science, of course. In theoretical fields, like mathematics or philosophy,
we need deductive reasoning, which is a completely different approach.)
"Data" here, can be seen as roughly equivalent to information. Although
the details of how data are processed into information are omitted here,
the key is that a sufficient amount of data is a necessary condition for
information. And "true conclusion" stands for knowledge, for knowledge is the
collection of a large number of true conclusions. In his book,
Tolerance, van Loon argued for the importance of information, with the
story of Socrates, in this paragraph:
"Provided that man remains on good terms with his own conscience,
he can well do without the approbation of his friends, without
money, without a family or even a home. But no one
can possibly reach the right conclusions without a thorough examination of
all the pros and cons of every problem. People must be
given a chance to discuss all questions with complete freedom and
without interference on the part of the authorities."
Although I can't tell whether it's a quote by Socrates himself
(van Loon states so, but I can't find any confirmation), the
truth is in the words, not the name. Data are externally
existing objects, specifically the attributes or features of other objects. Information
is the perspective I gain from the data I have collected. Before
I can evaluate the quality and effectiveness of each piece of
data, quantity is the first thing I need to focus
on. Although "more data" doesn't necessarily mean "more truth", as there
is always a theoretical possibility that the extra data contain more false
information, collecting more is still the logical option. If some data do not stick
to the truth, only by collecting more data from other sources
can I adjust the information I derive from them and counteract the
misleading effect to a greater extent. Simply put, the
more data I have, the less likely I am to be trapped.
The key here is to understand that, while "good/evil/beneficial/harmful" is people's
evaluation or judgement of data, data themselves are externally existing objects,
independent of anyone's subjective view. People form their evaluations or judgements based
on their moral criteria, gained from personal experience, social surroundings, education,
or orders from superiors, but that doesn't change the external objects at
all. I should focus only on the external objects, not on other people's
subjective views, no matter who these people are or how many of them
feel the same way.
That's pretty much the gist of the aforementioned paragraph. Because communication
is vital to the exchange of data, and thus to the spread of information,
people are doomed to be left in ignorance once it is blocked or censored,
and are therefore extremely easy to manipulate. A line in George Orwell's
novel 1984 goes to the heart of this issue: "Who controls the past
controls the future. Who controls the present controls the past."
On a different topic, this could also explain why
people with a broader vision tend to be wiser. They can
collect more data within their scanning range, and thus have more
raw material to feed to their cogitation and cognition, metaphorically speaking,
the neural network in their heads. So people need higher ground,
where they can benefit from a wider view.
My reasonable course of action, as far as I can see, is to
collect as much data as possible, to get a (probably) more thorough
understanding of all things. Whenever I need to see through a
problem, more data are always needed to find a more advisable
answer. As an old Chinese proverb goes: "Listen to both sides
and you will be enlightened, heed only one side and you
will be benighted."