An Impact-Oriented Knowledge Distribution

If an individual’s knowledge could be mapped in such a way that the x-axis represented situations and the y-axis the outcome, what would an effective distribution look like? Paul Graham distills two traits from this hypothetical distribution: Wisdom & Intelligence.

“Wise and smart are both ways of saying someone knows what to do. The difference is that ‘wise’ means one has a high average outcome across all situations, and ‘smart’ means one does spectacularly well in a few… the graph of the wise person would be high overall, and the graph of the smart person would have high peaks… [intelligence and wisdom are related because] they’re the two different senses in which the same curve can be high.” (Paul Graham)

There seem to be trade-offs when choosing between investing into intelligence vs wisdom recipes: “Wisdom seems to come largely from curing childish qualities, and intelligence largely from cultivating them.” [1]

However, it is not necessarily a zero-sum game, because wisdom and intelligence are not mutually exclusive. The ideal case would be to acquire knowledge whose utility does not diminish (on average) when applied to other knowledge. Exploiting this seems to be the essence of an effective knowledge distribution.

This post is meant to create dialogue around what an effective knowledge distribution looks like. It is intended to focus mostly on knowledge of skill, but occasionally strays from this in order to create scaffolding for interpretation. This approach has three layers of ideals, and is based on three premises.

The three layers of ideals:
(1) The top layer aspires to Annihilate Myth.
(2) The bottom layer is to practice Bayesianism.
(3) This Knowledge Distribution is an attempt to coherently interpret what lies in between.

Three Premises:
(1) Rare and valuable skills are important because they have a high probability of alleviating disadvantageous bottlenecks.
(2) Knowledge acquisition should have the goal of improving predictions and coherence.
(3) Of all the useful skills, the most general carry the most aggregate power.

Premise #1 relies on the intelligence of financial and scientific markets, because those markets tell us what is rare and valuable by revealing the largest bottlenecks. Therefore, while looking at bottlenecks may not be a perfectly accurate way to find the most rare and valuable skills, it is a useful one.

Premise #2 describes two things. The first is prediction. Think of prediction as the study of generalizing from data.

Prediction is an important goal because its quality comes from the underlying models, and vice versa. Prediction quality and model quality are one and the same: “The substance of a model is the control it exerts on anticipation” (Eliezer Yudkowsky). The creation of precise models is often called Science.

The second goal of premise #2 is to holistically improve a logically consistent interpretation of oneself and surrounding reality. This isn’t any different from Science, just in a less precise context.

Therefore premise #2 describes two different kinds of prediction: (1) a detailed visualization of how something specifically works, or how all its details can be interpreted with the greatest unified logical consistency, and (2) drawing insights from a general class of similar entities. Improving these two kinds of prediction, called the inside view and the outside view, is the goal of Bayesianism.

Premise #3 is a philosophical approach to prioritizing the value of skills. It is based on the question: of all the useful things we can learn, which are the most general? Knowledge is power [2]. Knowledge that is both pragmatic and broadly applicable can claim a lot of territory. This generalization can happen on two sides of a bottleneck: (1) Which core skills have the largest need for breadth in societal acquisition? (2) Which technologies aspire to generalize? #1 is the reaction to markets. #2 is a subset of #1, and provides a teleology. Therefore #1 is the source of imprecise business terms such as “Data Science,” while #2 is a pruning for purpose. For #2, think singularity education, but centered around skills that exist on the margin.

Prerequisite (1): The Bias-Variance Tradeoff (read me!)
A model can have two types of error: error due to “Bias” and error due to “Variance.” There is a tradeoff between a model’s ability to minimize these two errors, and understanding this Bias-Variance Tradeoff is necessary when evaluating a model’s fit. A model is under-fitted if it has high bias and low variance, and over-fitted if it has low bias and high variance.
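To make the tradeoff concrete, here is a minimal numerical sketch (the data, noise level, and polynomial degrees are illustrative choices of mine, not from any source): fit a rigid model and a flexible model to many noisy resamples of the same underlying function, then measure the bias and variance of each model’s prediction at a single test point.

```python
import numpy as np

rng = np.random.default_rng(0)
f = np.sin                        # the "true" underlying function
x_train = np.linspace(0, np.pi, 10)
x_test = np.pi / 3                # the point at which we decompose the error

def bias_variance(degree, noise=0.3, trials=500):
    """Fit a polynomial of the given degree to many noisy resamples
    and return (bias^2, variance) of its prediction at x_test."""
    preds = []
    for _ in range(trials):
        y = f(x_train) + rng.normal(0, noise, x_train.size)
        coefs = np.polyfit(x_train, y, degree)
        preds.append(np.polyval(coefs, x_test))
    preds = np.array(preds)
    return (preds.mean() - f(x_test)) ** 2, preds.var()

b0, v0 = bias_variance(degree=0)  # rigid constant model: under-fitted
b5, v5 = bias_variance(degree=5)  # flexible model: lower bias, higher variance
# The constant model has high bias but low variance;
# the flexible model has low bias but high variance.
print(b0 > b5, v5 > v0)
```

The point is not the specific numbers but the direction of the inequality: increasing model flexibility trades bias for variance.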


[Figure omitted: a one-dimensional illustration of Bias (inaccuracy) and Variance (imprecision).] These words can be used interchangeably (high bias → inaccurate, high variance → imprecise). [3]

Prerequisite (2): Reference Class Forecasting
There are two kinds of prediction: the inside view and the outside view. “An inside view forecast is generated by focusing on the case at hand, by considering the plan and the obstacles to its completion, by constructing scenarios of future progress, and by extrapolating current trends.  An outside view essentially ignores the details of the case at hand, and involves no attempt at detailed forecasting of the future history of the project.  Instead, it focuses on the statistics of a class of cases chosen to be similar in relevant respects to the present one (a reference class).  The case at hand is also compared to other members of the class, in an attempt to assess its position in the distribution of outcomes for the class.” (Kahneman, Lovallo)

This is important here because it adds another dimension in which one can have precision and accuracy: toward an inside view, and toward an outside view. That is to say, the precision and accuracy of the inside view depend on an intuitive prediction of a detailed visualization of how something works, whereas the outside view ignores the details of how something works and instead uses statistics of roughly similar phenomena to guide predictions. Therefore, insofar as our intuition is flawed, “overcoming bias comes down to having an outside view overrule a [bias within the] inside view. Then our questions become: what are valid outside views, and what will motivate us to apply them?” (Robin Hanson)

This approach becomes more accurate with many outside views (not just one reference class), weighted by how relevant each is to what is being predicted. Once you’ve arrived at a qualitative and quantitative judgement, you can, in some cases, adjust it using the inside view. (lukeprog)
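As a toy sketch of that procedure (the reference classes, base rates, relevance weights, and adjustment below are all invented for illustration): take a relevance-weighted average of several reference-class base rates, then apply a bounded inside-view correction at the end.

```python
# Hypothetical reference classes for estimating how long a project will take
# (base rates in months); weights encode judged relevance to the case at hand.
reference_classes = [
    {"name": "software rewrites",     "base_rate": 9.0,  "relevance": 0.6},
    {"name": "team's past projects",  "base_rate": 6.0,  "relevance": 0.3},
    {"name": "industry averages",     "base_rate": 12.0, "relevance": 0.1},
]

# Outside view: relevance-weighted average of the base rates.
total_w = sum(rc["relevance"] for rc in reference_classes)
outside_view = sum(rc["base_rate"] * rc["relevance"]
                   for rc in reference_classes) / total_w

# Inside view: a small adjustment from case-specific details
# (e.g. "our team already built half the system").
inside_adjustment = -1.5
estimate = outside_view + inside_adjustment

print(round(outside_view, 2), round(estimate, 2))  # 8.4 6.9
```

The design choice worth noting: the outside view anchors the estimate, and the inside view only nudges it, which is the ordering Kahneman and Lovallo recommend.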


So given the three premises in the Introduction, what is the best approach for constructing an effective knowledge distribution?

One should aim for the simplest description possible, but no simpler. This is known as Occam’s Razor, and is also described as “the simplest explanation that fits the facts.” A simple model is a clear model, and a clear model is one that is interpretable. This means that while it isn’t necessarily very accurate (it may carry bias), it is precise (low variance). The word ‘precise’ here can refer either to a model with low complexity, or to a distribution of data-points that are tightly clustered.

The goal of simplicity is useful when accruing epistemic rationality. This is because “the more complex a proposition is, the more evidence is required to argue for it,” and because “from a Bayesian perspective, you need an amount of evidence roughly equivalent to the complexity of the hypothesis just to locate the hypothesis in theory-space.” A simpler hypothesis carries a smaller burden of proof. This may be necessary for an interpretation of a knowledge distribution to be coherent.
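A toy way to see the “locating the hypothesis” point: singling out one hypothesis from a space of N equally likely alternatives takes log2(N) bits of evidence, so a hypothesis drawn from a larger (more complex) space carries a larger burden of proof before the arguing even starts. The example below is a simplification of the Bayesian picture, not a full treatment.

```python
import math

def bits_to_locate(hypothesis_space_size: int) -> float:
    """Bits of evidence needed just to single out one hypothesis
    among equally likely alternatives."""
    return math.log2(hypothesis_space_size)

# A simpler hypothesis lives in a smaller space of alternatives,
# so it carries a smaller burden of proof.
print(bits_to_locate(8))     # 3.0
print(bits_to_locate(1024))  # 10.0
```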

The argument for simplicity is not just the smaller burden of proof. It also provides a useful heuristic for model selection. Given how easy it is to generate new mechanisms, accounts, theories, and abstractions, we need to vet them to find those that are useful; those that have worked thus far. If something is too far away from what we currently understand, then there is nothing that we know a lot about near to it. Staying close enough to what we already know, for sufficient testing, is a necessary (but not sufficient) heuristic when searching for the simplest explanation that fits the facts.

So how should we test near to what we currently know, and then apply it far into the unknown? “There are a bazillion possible abstractions we could apply to the world.  For each abstraction, the question is not whether one can divide up the world that way, but whether it carves nature at its joints, giving useful insight not easily gained via other abstractions.  We should be wary of inventing new abstractions just to make sense of things far away; we should insist they first show their value nearby” (Robin Hanson). A model that carves nature at its joints in essence aims for the simplest description possible. However, this is an epistemic ideal, and doesn’t recognize complexity tradeoffs. Finding a balance between simplicity (higher bias) and necessary complexity (lower bias) makes for an effective filter when accruing instrumental knowledge.

Learning favors Occam’s razor because a clearer model allows for easier integration of new information. To see this, consider two models, one with two features (Model A) and another with three features (Model B). When a new feature is added to Model A, it has fewer interaction combinations than when that same feature is added to Model B. This makes it easier to see, in Model A, how the new feature will transform its interpretation (regardless of whether the new feature increases coherence or creates internal contradictions relative to the original model). Therefore, when learning, the preference for Model A should be increased due to its simplicity.
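The combinatorics behind this can be made explicit. Counting only interactions up to third order (an arbitrary cutoff chosen for illustration), a new feature added to the two-feature Model A introduces fewer new interaction terms than the same feature added to the three-feature Model B:

```python
from math import comb

def new_interactions(n_existing: int, max_order: int = 3) -> int:
    """Interaction terms a new feature introduces when added to a model
    with n_existing features, counting orders 2 through max_order."""
    # A new order-(k+1) interaction pairs the new feature with k existing ones.
    return sum(comb(n_existing, k) for k in range(1, max_order))

print(new_interactions(2))  # Model A: 2 pairwise + 1 three-way = 3
print(new_interactions(3))  # Model B: 3 pairwise + 3 three-way = 6
```

The gap widens combinatorially as models grow, which is why each extra feature makes a model disproportionately harder to interpret.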

Alternatively, imprecise observations make it difficult to distinguish between signal and noise because of larger variance in the underlying data. A precise dataset has a tight cluster, and so there is a more well-defined boundary between what is contradictory and what is not. Therefore someone with imprecise observations (causing the distribution to be more spread out) may not even realize when a piece of information contradicts the existing observations.
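A minimal sketch of why a tight cluster gives a sharper contradiction boundary (the data and the three-standard-deviation threshold are illustrative assumptions): the same new observation stands out against a precise dataset but disappears into an imprecise one.

```python
import statistics

def is_contradictory(observations, new_point, threshold=3.0):
    """Flag new_point as contradictory if it lies more than `threshold`
    standard deviations from the existing cluster."""
    mu = statistics.mean(observations)
    sigma = statistics.stdev(observations)
    return abs(new_point - mu) > threshold * sigma

precise = [10.0, 10.1, 9.9, 10.05, 9.95]  # tight cluster, mean 10
imprecise = [7.0, 13.0, 9.0, 12.0, 9.0]   # same mean, high variance

print(is_contradictory(precise, 12.0))    # True: stands out against low noise
print(is_contradictory(imprecise, 12.0))  # False: lost in the spread
```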

These two examples describe how one can more readily interpret a new feature or data-point with precise models. However, the importance of accuracy differs between them. In the second example, accuracy is irrelevant: a precise dataset has a tight cluster, regardless of where the cluster is located. In the first example of incorporating a new feature, accuracy is not irrelevant, but is traded off for simplicity.

This trade-off between simplicity (high bias) and necessary complexity (low bias) has been the central thread of this section. Simplicity is more interpretable because it creates a clear model of how the world works. However, more complexity is sometimes needed for improved accuracy. The instrumental insight comes from whether the value of improving accuracy is greater than the burden of its additional complexity.


The Importance of Precision & Accuracy
“If you write in an unclear way about big ideas, you produce something that seems tantalizingly attractive to inexperienced but intellectually ambitious students. Till one knows better, it’s hard to distinguish something that’s hard to understand because the writer was unclear in his own mind from something like a mathematical proof that’s hard to understand because the ideas it represents are hard to understand. To someone who hasn’t learned the difference, traditional philosophy seems extremely attractive: as hard (and therefore impressive) as math, yet broader in scope.” PG

This is why both accuracy and precision are important when speaking philosophically. Precision is important because without it, words would break down when pushed too hard. For example, it doesn’t seem too far off to define math as the study of terms that have precise meanings; words that cannot be broken. Accuracy is important when speaking philosophically because it defines how much can be projected onto reality.

Accuracy & Precision are also important for generalizability. Accuracy is important because it comes with the removal of systematic errors, which make it more difficult to see into central causes, the ones connected to the most biases. Precision is important for generalizability because it is related to reproducibility, which is “the degree to which repeated measurements under unchanged conditions will give the same result” (Wikipedia). Therefore precision matters for generalizability insofar as it is the test that defines the boundaries within which something can be generalized. Achieving precision, accuracy, and simplicity gives an abstraction that “carves nature at its joints.”

Accuracy & Precision are also important for well-defined systems. “An ordinary worker builds things a certain way out of habit; a master craftsman can do more because he grasps the underlying principles. The trend is clear: the more general the knowledge, the more admirable it is.” PG

A knowledge distribution achieves maximum value by focusing on things that are generalizable across settings. However, when working within a well-defined system, one should value things that are generalizable within the scope of the system. This provides models that don’t break down at edge cases, and may lend itself to shortcuts.

Precision is also important for well-defined systems, because it removes or prevents internal contradictions. To see this, consider a visualization where the entire possible knowledge of a system is represented as some nebulous abstract space. As we gain knowledge of the system, this space fills with pieces of ‘substance’. A well-defined system is one in which a large portion of its nebulous abstract ultimate knowledge space is filled with an aggregate of ‘substance’ pieces. There is therefore a lot of relative contiguity between the ‘substance’ pieces (contiguity: the state of bordering or being in direct contact with something). The surface area of contiguity is proportional to the potential for internal contradiction, because there are more places where one substance piece could overlap with another, yielding a contradiction. These boundaries must therefore be precisely defined. This makes precision important for well-defined systems because it removes or prevents the overlap of these substance pieces.

How to maximize an aggregate yield from learning
Consider a generalization that there are two kinds of things someone can learn. The first kind values content as an end in itself (for example: names and dates in a history class). The second values content in proportion to its expected future associative power. If you think of knowledge acquisition as being similar to investing money, then the first kind of learning is like keeping your money in a safe in your basement; inflation (i.e. neural decay) will cause it to depreciate. The second kind of learning is like investing your money into WealthFront (which you can now do with a minimum of only $500). Because this kind of content has higher future associative power, accumulating it allows for easier accumulation of other knowledge in the future, somewhat like compound interest. Therefore maximizing an aggregate yield from learning is an attempt at accumulating knowledge at what is colloquially called an exponential rate. It attempts to do this by focusing on knowledge whose utility persists when applied to other knowledge.
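The investing analogy can be stated as simple arithmetic. Treating neural decay as a negative annual rate and associative power as a positive one (the 10% rates below are made up for illustration), the same unit of knowledge diverges sharply over a decade:

```python
def knowledge_after(years: int, rate: float) -> float:
    """Value of one unit of knowledge after `years` at an annual rate:
    negative for decay (inert facts), positive for compounding
    (knowledge that makes further knowledge cheaper to acquire)."""
    return (1 + rate) ** years

inert = knowledge_after(10, rate=-0.10)        # "money in a safe": decays
associative = knowledge_after(10, rate=+0.10)  # "invested": compounds
print(round(inert, 2), round(associative, 2))  # 0.35 2.59
```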

In other words, it relies on the power of association. Let’s call an association a connection between content. This connection can come from similarity, contiguity, or contrast. On a high level, there seem to be two kinds of association: conceptual similarity (like a metaphor), and spatial/temporal proximity (like remembering the relative position of chess pieces in a game). Let’s call a “hook” a set of associations that can be used to hang general sets of facts.

In addition, note that an association (as I would like to use it here) is different from a pointer. A pointer would be something like: Bias X points to Situations X, Y, and Z. An association would be something like: I have a trigger for recognizing Bias X in Situations X, Y, and Z, and can therefore overcome Bias X. The difference is in familiarity, and that difference is what distinguishes knowledge from skill. This distinction is important because knowledge is only a stepping stone; one should focus on strengthening the association of the skill.

What are qualitative measures for the strength of an association? The strength of conceptual associations seems proportional to the available metaphors. The strength of spatial/temporal proximity associations, like an n-dimensional jigsaw puzzle, seems proportional to how precisely it is guided into place by the proximal associations.

So what is a justification for “what” the most general thing to learn is?

Pragmatically Maximize Aggregate Yield
There are two frames here. The first is theoretical, and says that we should focus on knowledge whose utility persists when applied to other knowledge. The second frame is pragmatic. Ideally this frame is constructed by reifying the theoretical frame, thus creating The Praxis (practice, as distinguished from theory). The minimal test of utility for any attempt at creating a praxis is whether it causes people who read about it to do anything differently afterwards. The scaffolding for this Praxis has at least two layers: foundational traits, and a distribution.

Three basic traits:

  1. It’s holistic in scope. Holistic scope means that it contains three categories: (1) personal development (what CFAR and LW are teaching), (2) the acquisition of rare and valuable skills (which is both found and rewarded in some way by Capitalism, and includes both academic and non-academic metrics of success), and (3) skill application (which is what 80,000 Hours is trying to align).
  2. It’s centered around the benefit of oneself, one’s communities, and the world. You have to start with yourself before you can help others. The mutual support of others gives strength and endurance towards a vision. An effective vision is one that helps the world in scalable ways. Optimize QALYs.
  3. It recognizes returns on investment. From one view, this comes from the expected value of return in consideration of the risk x benefit of all options. The other view is about how these things compound over time. The value of some entity comes from the integration of that entity’s value over its entire life. It is therefor useful to recognize and then focus on the pieces that have larger multiplicative effects. For example, going deep into knowledge that can also be used to improve your life.

The T-distribution:
In the beginning of this essay there is an interpretation of wisdom being measured as an average outcome of knowing what to do across all situations, and intelligence being measured as the ability to do spectacularly well in only a few situations.

This view of intelligence & wisdom seems to be a more general view of the classic (and inevitable) T-shaped distribution of effective learning. Cultivating a T-shaped knowledge distribution allows for the benefits of both deep knowledge and generalizable mental models. Consider two seemingly incompatible perspectives on career choice:

“Advancing to the cutting edge in a field is an act of “small” thinking, requiring you to focus on a narrow collection of subjects for a potentially long time. Once you get to the cutting edge, however, and discover a mission in the adjacent possible, you must go after it with zeal: a “big” action. In general, deep over wide learning provides more total skill development in the long run.” -So Good They Can’t Ignore You by Cal Newport

“I think many people would benefit from considering a broader array of possible career paths than is normally considered; from keeping options open and learning as they go; from going outside their comfort zone and challenging themselves to gain new skills; and from generally focusing, early on, on personal development and learning.” -Holden Karnofsky on Altruistic Career Choices

The first approach tries to maximize the impact of technical projects by first excelling in a specific domain, and then applying that skill to the adjacent possible for impact. This can be summarized as “Think Small, Act Big.” The second approach is more about pushing your comfort zone to gain new skills, thereby becoming a generalist. Both emphasize gaining valuable skills, just within different scopes. Finding domain areas where you can work towards both simultaneously is what is meant by an effective T-shaped knowledge distribution. This approach is therefore based around the question: of all the useful things we can learn, which are the most general? The rest of this essay attempts to use the T-distribution as a framework for either answering this question directly, or articulating a precise question for further exploration.

Of all the useful things to learn, which are the most general?
When speaking generally, it sounds cliché. But the answer is: Science for knowledge, Technology for reification, and Art for our sanity. To speak more specifically than this overlooks the importance that idiosyncratic interests play in the role of learning:

“There can’t be too much compulsion here. No amount of discipline can replace genuine curiosity. So cultivating intelligence seems to be a matter of identifying some bias in one’s character—some tendency to be interested in certain types of things—and nurturing it. Instead of obliterating your idiosyncrasies in an effort to make yourself a neutral vessel for the truth, you select one and try to grow it from a seedling into a tree.” (Paul Graham)

My attempt at an answer: As a technologically optimistic utilitarian, the most rare and valuable skills to generally understand are computer science, statistics, social psychology, and study design. These fields are broadly applicable and gain more leverage as tech expands.

Engineering is useful because it is a break from ideals towards managing trade-offs. Specifically, programming (for its feedback mechanisms, computational abilities, networks, and existence as pure text) allows you to reify solutions across domains.

Statistics is useful because it allows for a scalable interpretation of reality. The most successful future software/APIs won’t attempt to replace human functions, but complement them. This kind of software is inherently data-driven, given the complementary strengths of silicon technology. It is therefore an effective way for an expert intuition to create a very large lever.

Social Psychology is useful because it provides a framework for recognizing and understanding systematic errors that people make. This is helpful to economics because it improves on the model of the rational agent. It is helpful to individuals because we have to understand a bias before we can overcome it.

Study Design is useful because it can help us discover unknown needs/interventions, and calibrate the process of scaling in irregular contexts. The hidden existence and future influence of these secret models is an effective truth.

How much knowledge should you acquire from each domain?
One should acquire knowledge in a domain with the goal of discovering utility in its applications to the adjacent possible. The adjacent possible are all the things we are currently ignorant of, but that we are proximally close to by some measure (such as metaphorical proximity: how similar a new thing is to something we already know).

Consider all known knowledge as a collection of sand in a sandpit. Now imagine the distribution of the sand as being represented by the wisdom & intelligence distribution. Research in this sense is the process of drizzling new sand onto the sandpit in a precise location (thus creating an upper bound of an ‘intelligence spike’). Over time, as new grains of sand fall and roll down the hill, they land and claim the adjacent possible. This metaphor creates the security that low-hanging fruit will inevitably be discovered, and holds an ideal of wanting to “level out” the reified knowledge distribution.

Alternatively, knowledge can be carried from an auxiliary domain back to a primary domain. Here, sand is moving in the other direction: up the hill. This can be effective because staying within a single field brings diminishing returns on associative gains of new information. This is true if a particular topic of study becomes easier as you gain deeper knowledge of it. It seems like it would become easier because the new information is more precisely guided by an abundance of proximal associations. For example, I’m sure epidemiology could learn a lot from how sustainability organizations conduct experimental studies (and vice versa).

However, there is a trade-off here. Diminishing returns may only exist when taking the frame of a Bigger Picture. “As knowledge has grown more specialized, there are more and more types of work in which people have to make up new things” (PG). It is as if the knowledge distribution rises with a fractal nature, creating more surface area for ‘intelligence spikes’ as fields become specialized. This feels intuitively correct because “as our knowledge grows, so does the shore of our ignorance” (Marcelo Gleiser) [4]. Therefore, the surface area bordering the adjacent possible becomes larger with height. Is performance unbounded in proportion to the task’s exposure to the adjacent possible? And given that there is increasingly larger room for spikes as the knowledge distribution rises, is it worth being wise?

I think so, because “connections between your depth area and other areas, and linking those novel far connections are a key source of creativity. Think of a physicist looking at a gecko foot, thinking about the deep principles at work, and inventing sticky gloves. Or a fashion designer who knows math and makes patterns based on generative mathematical rules. In other words, I hypothesize that T-shaped knowledge is an efficient way to have creative insights, especially when you’re deep in a field where other people might lack breadth (e.g. fashion).” (Dan Greene)

Why is this Impact-oriented?
The claim for this approach being impact-oriented is overly-ambitious. However, I feel justified in making the claim because the argument is about being both general and precise, and therefor may make for a starting point for asking the question: what is an impact-oriented knowledge distribution?

In fact, the above approach doesn’t even attempt to be centered around impact. Consider a useful distinction in ways of thinking: Can I solve this social problem using these frameworks? vs. Can I use these frameworks to solve some social problem? The above approach uses the latter way of thinking. Again, it’s an attempt to answer the question: of all the useful things we can learn, which are the most general? Therefore, this entire approach may be biased insofar as it is not centered around impact, but on generalizability.

This feels justified because (1) generalizability of tools seems powerful insofar as it can claim a lot of territory, and (2) finding the most general skills within a pragmatic space allows for a simpler argument (which I argued for in the discussion of simplicity above), and simplicity may be necessary for this argument to even be coherent.

I also like the title because I hope for it to be provocative. And it feels, at least for me, to be useful insofar as the coherence adds philosophical weight to evolve this framework further. Similar to the weight of Nietzsche’s eternal recurrence. This motivation is useful for flow. IAmA Flow Junkie. Accrue the happy things.

Impact-oriented considerations from Silicon Valley industry conformity
In addition, the above perspective is strongly biased by the data-driven Silicon Valley bubble. And “It’s not at all obvious that the way people solve problems in Silicon Valley is going to help solve social problems. That’s not to say that there is a clearly defined notion of ‘how people solve problems in Silicon Valley’, but you will converge towards solutions that look like solutions to other Silicon Valley problems if you’re solving them with other people in that environment.[5]” (Nick Borg)


[1] “Recipes for wisdom, particularly ancient ones, tend to have a remedial character. To achieve wisdom one must cut away all the debris that fills one’s head on emergence from childhood, leaving only the important stuff. Both self-control and experience have this effect: to eliminate the random biases that come from your own nature and from the circumstances of your upbringing respectively. That’s not all wisdom is, but it’s a large part of it. Much of what’s in the sage’s head is also in the head of every twelve year old. The difference is that in the head of the twelve year old it’s mixed together with a lot of random junk.

“The path to intelligence seems to be through working on hard problems. You develop intelligence as you might develop muscles, through exercise. But there can’t be too much compulsion here. No amount of discipline can replace genuine curiosity. So cultivating intelligence seems to be a matter of identifying some bias in one’s character—some tendency to be interested in certain types of things—and nurturing it. Instead of obliterating your idiosyncrasies in an effort to make yourself a neutral vessel for the truth, you select one and try to grow it from a seedling into a tree.

In short, “the path to wisdom is through discipline, and the path to intelligence through carefully selected self-indulgence. Wisdom is universal, and intelligence idiosyncratic. And while wisdom yields calmness, intelligence much of the time leads to discontentment.” (PG)

[2] Knowledge is not power. It wants power. It’s used most by the groups that are (or feel) weak. In its extreme, ethics is no more than an attempt by the weak to gain power over the strong. This is why the well-being of the many has always been the alibi of tyrants.

Reify it with technology. Use knowledge as fuel. Then you get power.

[3] “Accuracy and precision are defined in terms of systematic and random errors.” Accuracy refers to how close a measurement is to the true value. “Precision, related to reproducibility, is the degree to which repeated measurements under unchanged conditions show the same result. For example, if an experiment contains a systematic error, then increasing the sample size generally increases precision but does not improve accuracy, whereas eliminating the systematic error improves accuracy but does not change precision.” (Wikipedia)

[4]  “As our knowledge grows, so does the shore of our ignorance” (Marcelo Gleiser). This is interesting because ignorance is not consumed by knowledge, but increased. I think more information leads to greater perceived ignorance because the truth is that reality greatly exceeds our awareness of it. 

So as someone becomes more knowledgeable, there is more novel ignorance. And if curiosity seeks to annihilate itself, does curiosity increase with knowledge? If so, then seek more novelty by allowing curiosity to annihilate itself.

[5] An industry is a group of problems roughly in the same class; learning about your industry is different than learning to solve your problems.

Thanks to Nick Borg, Dan Greene, and Alessandro Gagliardi for helping evolve this.
