Truth Concealed and Revealed
Logicians and mathematicians inspect the state of matter, the train of thought on paper, for speech is too fleeting to grasp. The brain is only capable of remembering a measly list of items. We see that paper is an extension of the brain.
Both speech and information are fleeting. The same way writing freezes speech, data can freeze information. The underlying bits may not make sense to any sane human, but they are, at a higher level, a meaningful manifestation of their source. A Google search is an expression of someone’s needs or desire, a social media post carries the user’s pretentious intent to impress, and a workout record is an indicator of someone’s health. Leveraging the strengths of data in a proper way can indeed reveal the model that generates these data in the first place. Big data, as we call them now, yield more general and overarching patterns that could be used to model all kinds of facets of society.
Several decades ago, data existed, but in a cumbersome and static form. To gain insights, one had to leaf through the hefty pages of an encyclopedia. No one knew what the local majority thought about the recent murder around the corner, nor the deepest secrets people dared not tell. These details might seem irrelevant and trivial, but the real problem was that they lacked a good way to be replicated and propagated.
As computers advanced, information gained a better way to spread. The amount of information is growing at an unprecedented speed. By now, data have permeated everyday life to the point of being abundant and extravagantly redundant. Now is the time for us to inspect and learn as mathematicians do, with all those data.
Our intuition is itself an incredible feat. As Malcolm Gladwell touts in his bestseller Blink, our brain has a special gift for judging things that are not immediately observable or statistically collectable in real life. Yet sometimes intuition fails hopelessly. Big data changed this. Data are measured and collected at an accelerating rate. It seems that, along with Moore’s Law, data collection is also being cranked up.
The job of data scientists is to find patterns and, hopefully, the causes behind them. In Everybody Lies, Seth Stephens-Davidowitz proposes four major powers that we can derive from big data. They are elaborated below.
Offering new types of data
Betting on a horse was a mysterious art before big data came into play, and few mastered it. A young disruptor, Seder, approached the problem differently. He began using data, instead of intuition or unreliable feelings, to predict each horse’s potential. Did he make it? It turns out that the horse he deemed promising did indeed succeed. By measuring horses’ internal organs, Seder eventually figured out that the size of the left ventricle was the key indicator.
First, and perhaps most important, if you are going to try to use new data to revolutionize a field, it is best to go into a field where old methods are lousy… The second lesson is that, when trying to make predictions, you needn’t worry too much about why your models work.
Language has been a research topic since the last century, and how words are used is still revealing. For example, according to texts from the period, before the Civil War the phrase “the United States are” was more frequent than its singular counterpart “the United States is.” Word usage can thus reflect how divided a country is. Words can also be used to predict gender and age patterns. Besides, sentiment analysis shows that happy and positive content is more likely to go viral on the Internet.
We can learn that new data work especially well in fields where old methods are lousy. This means you do not want to bring data to an already saturated subject. On top of that, be open to every opportunity to collect unconventional data. That could pay off really well.
Providing honest data
Everyone lies on surveys, a tendency called social desirability bias. Official data usually reveal more truth than shady questionnaires, because people consistently give wrong or tweaked answers that make themselves look better.
This is the second power of Big Data: certain online sources get people to admit things they would not admit anywhere else. They serve as a digital truth serum.
Google, as the dominant search engine, provides a huge volume of data for digging and crunching.
- Sex: About 2.5% of male Facebook users identify as gay, which is roughly consistent with surveys. On closer analysis, though, the share of openly gay men varies from state to state, and migration between states cannot fully account for the differences, so something fishy is going on. From Google search data, we can conclude that a considerable number of men are still in the closet, especially in intolerant states.
- The Internet: Differing political views can provoke unpleasant controversy or even fights, both online and offline, but where are you more likely to run into the other side? One study estimated a probability of about 45% that two people visiting the same news site hold different political views, which is higher than in typical offline settings: 41.6% for coworkers, 40.3% for neighbors, 37% for family members, and 34.7% for friends. So the internet brings together people of different political views rather than segregating them.
- Facebook Friends & Customers: Given people’s propensity to make themselves look good on social media, their posts are often far from the truth. People dislike such pretension, but data tell us that users keep coming back to check on their friends’ lives. A great business is built on secrets, either about nature or about people, said Peter Thiel, author of Zero to One. We could even go so far as to say that this social media application is built on lies. Likewise, Zuckerberg learned a hard fact in college: people can decry something as awful, yet they will stick with it. Netflix learned the same lesson, and it recommends movies based not on what customers add to their wishlists (serious foreign films and documentaries) but on what they consistently watch (usually lowbrow and shoddy comedies).
What’s the moral? Truth can be hurtful, but it gives helpful advice. You know you’re not alone in embarrassing situations, because others are running the same searches. It reveals the people who are truly suffering. And it leads from problems to solutions. By the way, never compare your Google searches to someone else’s social media posts. That hurts.
Allowing us to zoom in on small subsets
With big data at our disposal, we are able to zoom in on, say, specific geographies, while keeping the sample large enough to be statistically meaningful. This comes from the ubiquity of big data, i.e. they are collected seamlessly from every nook and cranny.
Is America genuinely a land of opportunity? In other words, what are your chances of getting rich when your parents are not? (Let’s superficially assume that success and opportunity can be measured by financial wealth here…) Measured by the probability that a person born poor ends up rich, the US scores 7.5%, the UK 9.0%, Denmark 11.7% and Canada 13.5%. This looks bad, but zooming in on individual cities gives some figures more in line with the country’s title, such as San Jose at 12.9% and Washington at 10.5%, whereas others are less lucky, such as Chicago at 6.5% and Charlotte at 4.4%. A small survey would sample only a few people in any of these cities, but big data have made it possible to zoom in.
And where do successful people come from? Again, rich data from Wikipedia can help. It turns out that “success abounding” counties, whose rate of success is sometimes 20 times higher than others’, have something in common. Many contain a college town and thus offer residents early exposure to innovation. Big cities like NYC and LA are also found to be more likely to produce successful natives.
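To see why sample size matters when zooming in, here is a minimal sketch of the usual normal-approximation margin of error for a proportion. The sample sizes (200 and 200,000 people per city) are made up for illustration and are not taken from the book; only the 12.9% figure comes from the text above.

```python
import math

def margin_of_error(p: float, n: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error for a proportion p observed among n people."""
    return z * math.sqrt(p * (1 - p) / n)

# Hypothetical sample sizes: a small survey versus large-scale records for one city.
p = 0.129  # the San Jose figure quoted above
for n in (200, 200_000):
    print(f"n={n}: about +/- {margin_of_error(p, n) * 100:.2f} percentage points")
```

Under these assumed numbers, the small survey is uncertain by several percentage points, which is as large as the gaps between some of the cities, while the large sample narrows the uncertainty to a fraction of a point.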
A doppelganger search is often used to predict the behavior or taste of a target individual by zooming in on a small subset of people exceedingly similar to the target. Twitter, with its gigantic pool of emotions and attitudes, is a good place to filter people’s likes and dislikes, so doppelgangers are likely to be found there. Services like Amazon also take advantage of this approach to recommend products. When data are immense, we can zoom in on subsets that are more conclusive for our purposes.
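As a rough illustration of the doppelganger idea, the sketch below ranks people by how similar their tastes are to a target. Everything in it is hypothetical: the names, the ratings, and the choice of cosine similarity as the measure. Real services do not publish their methods in this form, so treat it as one possible reading, not as how Amazon or anyone else does it.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two rating vectors (1.0 means identical direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def find_doppelgangers(target, profiles, k=2):
    """Return the names of the k profiles most similar to the target."""
    ranked = sorted(
        profiles.items(),
        key=lambda item: cosine_similarity(target, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]

# Made-up ratings from 1 to 5 for (sci-fi, romance, documentaries, comedy).
profiles = {
    "ana": [5, 1, 4, 2],
    "ben": [1, 5, 1, 4],
    "cho": [3, 2, 5, 1],
    "dev": [2, 4, 1, 5],
}
target = [5, 1, 5, 1]
print(find_doppelgangers(target, profiles))  # ['ana', 'cho']
```

Once the doppelgangers are found, whatever they liked that the target has not yet seen becomes a candidate recommendation.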
Allowing us to do many causal experiments
In the spirit of randomized controlled experiments, engineers can easily throw multiple tweaked versions of a new feature at users. They do A/B testing, as they call it, to detect even tiny effects. People are mostly unaware of it, and they can fall prey.
But there is a dark side to A/B testing.
Companies design, step by step, better versions of their product to keep users hooked, while users often have no idea why they like it. For concrete examples, you can check out this article to learn more about why you should shun it from a neurological point of view. I shan’t spell out the details here.
Nature, from time to time, offers conditions that amount to such causal experiments. These experiments are not easy to construct artificially. When they do happen naturally, however, valuable insights can be drawn. Some notable conclusions are: advertisements are powerful and effective; doctors can be motivated by monetary incentives; prisoners held in harsher conditions are more prone to commit crimes after release; and prestigious schools add little to students’ success, since intrinsic drive is what most students rely on to succeed.
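To make the A/B testing described above a little more concrete, here is a minimal sketch of one standard way to compare two versions, a two-proportion z-test. The click counts are invented, and the test is a textbook choice on my part, not something the book prescribes.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates B - A."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("no variation in outcomes; the test is undefined")
    z = (p_b - p_a) / se
    # Two-sided tail probability of a standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical experiment: version B lifts the click rate from 1.00% to 1.04%.
small = two_proportion_z_test(100, 10_000, 104, 10_000)
large = two_proportion_z_test(10_000, 1_000_000, 10_400, 1_000_000)
print("10 thousand users per version:", small)
print("1 million users per version:", large)
```

With these made-up counts, the same 0.04-point lift is indistinguishable from noise among ten thousand users per version but clearly detectable among a million, which is why companies with huge audiences can chase even minuscule effects.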
You have become another victim of one of the most diabolical aspects of “the curse of dimensionality.”
Despite the might of big data in the digital age, numbers are seductive. Ardent data analysts gather as many data as possible, but the problem is that they fail to see the big picture. By and large, the problems left unresolved today are not readily observable or measurable in terms of data, and they require critical thinking, intangible knowledge and meticulous scrutiny to tackle.
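One common reading of the curse of dimensionality quoted above is that, if you test enough unrelated variables against a single outcome, some of them will appear to predict it purely by chance. The sketch below simulates this with coin flips. The function name, the number of days and the number of variables are all my own assumptions for illustration.

```python
import random

def best_chance_match(n_days, n_variables, seed=0):
    """Highest share of days on which any purely random variable 'predicts' a random outcome."""
    rng = random.Random(seed)
    # The outcome to predict, e.g. whether something went up (1) or down (0) each day.
    outcome = [rng.randint(0, 1) for _ in range(n_days)]
    best = 0.0
    for _ in range(n_variables):
        # Each candidate variable is just an independent sequence of coin flips.
        guess = [rng.randint(0, 1) for _ in range(n_days)]
        accuracy = sum(g == o for g, o in zip(guess, outcome)) / n_days
        best = max(best, accuracy)
    return best

print("1 variable:", best_chance_match(50, 1))
print("1000 variables:", best_chance_match(50, 1000))
```

None of the simulated variables carries any real information, yet the best of a thousand will look noticeably better than a coin toss, which is exactly the trap of gathering ever more variables without a bigger picture.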
Seth Stephens-Davidowitz (2017). Everybody Lies: Big Data, New Data, and What the Internet Can Tell Us About Who We Really Are. Dey Street Books.