What is the shape of all knowledge? How has knowledge grown over time? Can we anticipate ways in which knowledge will grow? Knowledge is certainly hierarchically structured but it’s also richly scaffolded and networked with concepts and ideas linking within and across domains.
We can attempt to answer some of these questions by computationally exploring the Network of Knowledge that is the Wikipedia: The largest indexed collection of crowdsourced knowledge, currently made up of a staggering 40 million articles written in close to 300 languages.
Now, completely characterizing large-scale networks is not a simple business. As a start, we focused on a small, specific piece of the pie inspired by an intriguing claim about the Wikipedia’s structure: As xkcd and others have suggested, if you repeatedly follow the first link presented in an article, you’ll always end up at Philosophy. We wondered if this was true, and how the first-link flow through articles might look in aggregate.
We took a snapshot of the English Wikipedia in November of 2014, and extracted the network generated by taking only the first link present in each article. We then examined this “First Link Network” in detail and we provide some of our main results below. We note that the present version of the Wikipedia differs in microstructure, and that some of the specific observations we make will only apply to November 2014. That said, there is a general robustness to the flow of concepts. Philosophy really is a popular destination.
If you’d like to delve into our findings, please visit our project’s site. We supply visualizations, Python notebooks, data, and more. Our primary paper appeared in the Journal of Computational Science and can be downloaded here; the (free and equivalent) arXiv version is here.
Like the Wikipedia, the First Link Network connects inventions, places, figures, objects, and events across space and time. Take the page about bananas. It has a first link to fruit, which itself has a first link to botany. The corresponding path of links connects bananas to fruit and eventually to biology, science, and philosophy. Or train, which leads to rail transport, then conveyance of passengers and goods, then goods, economics, and on up, again, to philosophy.
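To make the idea of a first-link path concrete, here is a minimal sketch in Python. The dictionary is a hypothetical toy map built from the examples above, with the intermediate steps compressed; it is not the real data, and the function name follow_first_links is ours for illustration only.

```python
# Hypothetical toy map: article -> the first link in that article.
# Real chains are longer; intermediate steps are compressed here.
first_link = {
    "Banana": "Fruit",
    "Fruit": "Botany",
    "Botany": "Biology",
    "Biology": "Science",
    "Science": "Philosophy",
    "Train": "Rail transport",
    "Rail transport": "Goods",
    "Goods": "Economics",
    "Economics": "Philosophy",
}


def follow_first_links(start, first_link):
    """Follow first links from `start`, stopping at a dead end or a repeat."""
    path, seen = [], set()
    current = start
    while current is not None and current not in seen:
        path.append(current)
        seen.add(current)
        # Articles with no recorded first link end the walk.
        current = first_link.get(current)
    return path


print(" -> ".join(follow_first_links("Banana", first_link)))
print(" -> ".join(follow_first_links("Train", first_link)))
```

In this toy map both walks end at Philosophy only because the map was written that way; on the real network the endpoint is something you would measure, not assume.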
To understand this structure, we used a simple method, taken from studies of other kinds of flow networks such as rivers and search processes on the web. Imagine a coffee roaster walking along every path with an open bag of coffee beans. As the roaster walks along all possible paths, a bean spills at each article the roaster visits. The roaster is careful not to revisit an article, and thereby avoids getting stuck in a loop. Once the roaster has finished exploring, articles along many paths end up with more beans: one bean for each path. The number of beans at each article signifies influence.
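The roaster's walk can be sketched in a few lines of Python. This is one reading of the description above, not necessarily the exact procedure used in the paper: we assume one walk starts at every article, that the starting article also receives a bean, and that a walk stops as soon as it would revisit an article. The name bean_walk and the toy network are hypothetical.

```python
from collections import Counter


def bean_walk(first_link):
    """Count beans per article: one bean for every walk that passes through it."""
    beans = Counter()
    for start in first_link:
        seen = set()  # articles already visited on this walk
        current = start
        while current is not None and current not in seen:
            beans[current] += 1  # spill one bean at this article
            seen.add(current)
            current = first_link.get(current)
    return beans


# Hypothetical toy network: A -> B -> C <-> D, plus E -> C.
toy = {"A": "B", "B": "C", "C": "D", "D": "C", "E": "C"}
beans = bean_walk(toy)
for article, count in beans.most_common():
    print(article, count)
```

Here C and D sit on a two-article loop that every walk eventually reaches, so they collect a bean from all five walks, while A and E collect only their own. Walking every path from scratch is simple but slow on millions of articles; a real analysis would want to reuse counts along shared paths.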
Why the bean walk? It turns out the roaster’s walk is useful for isolating influence in a network with cycles. Unlike other measures, beans never spill into a dead end. We unleashed the roaster on Wikipedia and uncovered a structure resembling a social movement.
The 99 Percent
We found that articles consistently flow from specific to broader topics—similar to the banana’s path from fruit to biology, with most paths converging on a few topics. Whether we measured accumulation (total bean count), direct links, or influence, each was concentrated in a handful of articles: 99% had very few, while 1% held the overwhelming majority.
What are these anchor topics? By direct links (in-degree), topics made of many parts, such as sports and nations, dominated. The United States, Canada, U.K., and Germany are among the top nations; football and soccer are among the top sports. The many parts—players, leagues, and strategies—link to football, the anchor.
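Counting direct links is the simplest of the three measures: an article's in-degree is the number of articles whose first link points to it. A minimal sketch, assuming the network is stored as a dictionary from each article to its first link; the entries below are hypothetical stand-ins for the "many parts" mentioned above, not real data.

```python
from collections import Counter

# Hypothetical toy map: article -> the first link in that article.
first_link = {
    "Player": "Football",
    "League": "Football",
    "Strategy": "Football",
    "Football": "Sport",
}

# Each value is the target of one first link, so counting values gives in-degree.
in_degree = Counter(first_link.values())

for article, count in in_degree.most_common():
    print(article, count)
```

Articles that never appear as a value, such as Player here, have an in-degree of zero and simply do not show up in the counter.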
When we measured accumulation and influence, the dominant articles fit in one of three categories: academic disciplines (left hand), abstract notions (middle hand), and modern topics (right hand).
You’ll find a handful of articles floating in a word cloud. Philosophy, as suspected, is especially dominant, ranking first in influence by two orders of magnitude. After Philosophy come broad areas of knowledge: biology, health care, and “web page”.
And this tidy structure emerges from the independent decisions of millions of Wikipedia authors. From porcupines to Abe Lincoln, the many articles naturally converge at a few anchors.
Closed paths of links are also common. For example, many articles live on closed loops in the network related to the calendar. The longest chain of articles spans 365 links: a chain of Orthodox Liturgics for each day of the year. Today’s list of Saints links to tomorrow’s and so on. This pattern popped up again among the longest chains, which are organized by day, year, or decade.
On the other extreme, articles linking back to one another in groups of two or three mark close associations (e.g., photography and photograph). Some are synonyms; others are more nuanced associations, not in your thesaurus.
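Because every article has at most one first link, loops like these are easy to find: follow links from each article and check whether the walk runs back into itself. The sketch below is one standard way to do this, not necessarily the method used in the paper. The function name find_cycles and the toy entries (including the "Camera" article and the three "Day" articles) are hypothetical.

```python
def find_cycles(first_link):
    """Return each closed loop in a network where every article has one out-link."""
    state = {}  # article -> "active" during the current walk, "done" afterwards
    cycles = []
    for start in first_link:
        path = []
        current = start
        # Walk until we hit a dead end or an article we have seen before.
        while current is not None and current not in state:
            state[current] = "active"
            path.append(current)
            current = first_link.get(current)
        # Running into an article from this same walk closes a new loop.
        if current is not None and state.get(current) == "active":
            cycles.append(path[path.index(current):])
        for article in path:
            state[article] = "done"
    return cycles


# Hypothetical toy network with a pair and a three-article loop.
toy = {
    "Photography": "Photograph",
    "Photograph": "Photography",
    "Camera": "Photography",
    "Day 1": "Day 2",
    "Day 2": "Day 3",
    "Day 3": "Day 1",
}
for cycle in find_cycles(toy):
    print(len(cycle), cycle)
```

Each article is visited once, so the search scales linearly with the number of articles. Articles that merely lead into a loop, like Camera in the toy map, are not reported as part of it.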
The above is hopefully a useful taste of what we’ve done. Again, we welcome you to dig into our paper and explore our project’s site. Many groups have been carrying out interesting work on the Wikipedia’s structure, the community of editors, and the science of science in general, and we are sure there will be many exciting discoveries to come.