The semantic future of the web

The web is built on data—my data, your data, data from small companies, data from big companies, and so forth. We might hand over data like an email address and in return, we might get access to other data, perhaps exclusive content for a new video game or a weekly newsletter. This constant exchange of data allows for collaboration and communication on a scale that never existed prior to the web.

Much of the data currently changing hands can be viewed as human-centric. We have news articles, blogs, e-commerce, forums, video platforms, social media, and Q&A sites providing us data to read, watch, and otherwise consume. We are not the only consumers of the web, though: search engines, voice assistants, pricing bots, and even link preview bots perform a staggering number of requests every day. Computer systems like these are playing an ever-growing role in data consumption.

Tim Berners-Lee coined the term “Semantic Web” for a web that works more like a global database that computer systems can understand than a series of separate web pages. In turn, this could allow deeper integrations between different computer systems and greater decentralization of data. The data here is not just from large corporations; it can be your data or my data, data that we control and manage ourselves through our own websites.

Unfortunately, we are not at this stage of a full data utopia. Large amounts of data are not publicly available, and data that is available is often locked behind proprietary APIs that you need to pay to access.

Moving from where we are now to a full Semantic Web is not something that can happen overnight. We have been building web pages for years on HTML, CSS, and JavaScript, optimally designed for a human viewing experience. To extract reliable data from HTML currently, computer systems need to be able to process unstructured data, then establish context and meaning. We humans can determine the context and meaning from viewing the page, but machines have to perform additional processing to get that same context. Encoding structured data directly in the page removes that extra processing. Several formats exist for encoding structured data, including Open Graph, Microdata, RDFa, and JSON-LD.

Open Graph, created by Facebook, is a popular format for holding specific types of structured data. Facebook uses it to generate link previews from page metadata, which gives website developers control over what is displayed based on how the page is described in that metadata. Since its creation, other social media sites have also adopted Open Graph for generating link previews.
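To make the link preview idea concrete, here is a minimal sketch of how a preview bot might read Open Graph metadata using only Python's standard library. The sample page, the `OpenGraphParser` class, and the `link_preview` function are hypothetical names invented for this illustration; real preview bots are likely to handle many more cases, such as fallbacks and multiple images.

```python
from html.parser import HTMLParser


class OpenGraphParser(HTMLParser):
    """Collect <meta property="og:..." content="..."> tags from a page."""

    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("property") or ""
        if name.startswith("og:") and "content" in attributes:
            # Keep the first value seen for each property.
            self.properties.setdefault(name, attributes["content"])


def link_preview(html):
    """Return the fields a simple link preview might need (None if missing)."""
    parser = OpenGraphParser()
    parser.feed(html)
    og = parser.properties
    return {
        "title": og.get("og:title"),
        "description": og.get("og:description"),
        "image": og.get("og:image"),
        "url": og.get("og:url"),
    }


# Hypothetical page, for illustration only.
SAMPLE_PAGE = """
<html><head>
<title>Fallback title</title>
<meta property="og:title" content="The semantic future of the web">
<meta property="og:description" content="How structured data could change the web.">
<meta property="og:image" content="https://example.com/cover.png">
<meta property="og:url" content="https://example.com/semantic-web">
</head><body></body></html>
"""

preview = link_preview(SAMPLE_PAGE)
print(preview["title"])
```

The bot never has to guess which heading or image best represents the page; the developer has already declared it in the metadata.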

Microdata, RDFa, and JSON-LD, however, are a bit different as, by themselves, they only represent different formats for storing data in a web page. Computers can parse these standardized structures, but unless a system knows the type of data being represented, it will not actually understand the data. What is missing here is a shared vocabulary so that two different computer systems can understand each other.

A joint effort by Google, Microsoft, Yahoo, and Yandex proposed a solution called Schema.org to promote structured data in web pages with a common vocabulary. For search engines, this structured data can help provide richer information in the search results. While Schema.org does not describe every type of object, nor does it intend to, it does create a solid foundation to describe many common objects: books, events, locations, medical conditions, movies, organizations, and people. For areas that it does not cover, alternative vocabularies can be used to describe that specialized data. Through its popularity for enhancing SEO, Schema.org has an ever-growing user base, which in turn helps grow the Semantic Web.
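As a rough sketch of how a format and a vocabulary work together, the example below embeds a JSON-LD block that uses the Schema.org `Book` type in a hypothetical page, extracts it with Python's standard library, and then interprets it. The page, the book, the `JsonLdParser` class, and the `describe` function are all invented for illustration. Parsing the JSON is the easy part; it is the shared `@type` that tells the consumer what the properties mean.

```python
import json
from html.parser import HTMLParser


class JsonLdParser(HTMLParser):
    """Collect the contents of <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self._in_json_ld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_json_ld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_json_ld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self._in_json_ld = False
            self.items.append(json.loads("".join(self._buffer)))


def describe(item):
    """Interpret an item only if its vocabulary type is one we understand."""
    if item.get("@type") == "Book":
        author = item.get("author", {}).get("name", "unknown author")
        return f'Book: {item.get("name")} by {author}'
    return f'Unrecognized type: {item.get("@type")}'


# Hypothetical page and book, for illustration only.
SAMPLE_PAGE = """
<html><head>
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Book",
  "name": "An Example Book",
  "author": {"@type": "Person", "name": "Jane Doe"}
}
</script>
</head><body><h1>An Example Book</h1></body></html>
"""

parser = JsonLdParser()
parser.feed(SAMPLE_PAGE)
for item in parser.items:
    print(describe(item))
```

An item with a type the consumer has never seen still parses without error, but it falls through to the "unrecognized" branch, which is exactly the gap a common vocabulary is meant to close.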

A Semantic Web may change not only how we think about searching for information online but also who controls the information. Imagine every website not just being a wall of content but a graph of inter-related topics and ideas. There would not need to be a central spot where data is stored and controlled by a single entity, helping avoid some concerns about censorship and bias while simultaneously improving privacy and control over the data that we share.

For example, take a site like Facebook. It maintains mountains of information about people and businesses, with various relationships between different entities from comments, reactions, and shares. This data is part of the Facebook ecosystem; it effectively “belongs” to Facebook. In a future where data is in our own control, sites like Facebook could just be the visual representation of the existing network, built on a Semantic Web. The data we declare public on our website is what can be viewed, giving us full control over what is shared. This also means that we are not locked in to a service like Facebook. You are free to move to other “front ends,” as the data is yours and you maintain it.

It might seem strange that an organization like Facebook would ever want to give up its data. However, with stricter laws being passed, such as GDPR in the EU and CCPA in California, it may be just a matter of time until Facebook is forced to.

As new technologies are built to take advantage of this data, they will also provide new tools and experiences for users. While the algorithms behind search engines are complex, they currently provide results for queries that have already been specifically answered. If you asked for “all songs before 1995 that failed domestically but were well-received worldwide,” you would be unlikely to get results because no one has yet answered that question. The data for such a query exists on the web; however, it is not readily available due to how search works. With a web built on data, obscure queries like this could turn up results by combining different datasets across several sites.
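The sketch below shows, in miniature, what combining datasets for the songs query might look like. Everything in it is an assumption made for illustration: the two "sites," the record fields, the `urn:example:` identifiers, the songs themselves, and the thresholds that stand in for "failed domestically" and "well-received worldwide." A real system would first have to discover, fetch, and reconcile this data.

```python
# Hypothetical records, invented for illustration only.
# Site A publishes release years and domestic chart peaks (None = never charted).
site_a = [
    {"id": "urn:example:song:1", "name": "Song One", "year": 1989, "domestic_peak": None},
    {"id": "urn:example:song:2", "name": "Song Two", "year": 1992, "domestic_peak": 3},
    {"id": "urn:example:song:3", "name": "Song Three", "year": 1998, "domestic_peak": None},
    {"id": "urn:example:song:4", "name": "Song Four", "year": 1990, "domestic_peak": None},
]

# Site B publishes how many other countries each song reached the top 10 in.
site_b = [
    {"id": "urn:example:song:1", "top10_countries": 12},
    {"id": "urn:example:song:2", "top10_countries": 9},
    {"id": "urn:example:song:3", "top10_countries": 7},
    {"id": "urn:example:song:4", "top10_countries": 1},
]


def failed_at_home_hit_abroad(releases, reception, before_year, min_countries):
    """Join two datasets on a shared id and apply the three query conditions."""
    abroad = {row["id"]: row["top10_countries"] for row in reception}
    matches = []
    for song in releases:
        released_in_time = song["year"] < before_year
        failed_domestically = song["domestic_peak"] is None
        well_received = abroad.get(song["id"], 0) >= min_countries
        if released_in_time and failed_domestically and well_received:
            matches.append(song["name"])
    return matches


matches = failed_at_home_hit_abroad(site_a, site_b, before_year=1995, min_countries=5)
print(matches)
```

Neither site answers the question on its own. The answer only appears once both datasets describe the same songs in a way a machine can join.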

The ability to query more complex data can especially help researchers and data scientists, who could potentially combine vast amounts of public data with their own private research data to discover new and interesting things. Additionally, it may help those training machine learning models, as specific data sets could be crafted that may have been impossible to acquire otherwise.

Changes to support a Semantic Web are not something that can happen overnight—we are talking years of small steps and incremental improvements. Even if most websites had rich structured data in their markup, many new tools and technologies would need to be built to leverage it. For example, Berners-Lee has been working on Solid as a method to allow users greater control over their own data, building upon key concepts of a Semantic Web.

Like many other concepts, the Semantic Web does have its critics. One, Cory Doctorow, goes as far as to call it “a pipe-dream, founded on self-delusion, nerd hubris, and hysterically inflated market opportunities.” That comment is not without merit, as there are several potential problems that need to be considered.

With the number of websites on the web and the vast number of types that may need to be represented, there is a huge amount of data that would need to be understood for any sufficiently complex query. Schema.org has 841 types by itself but only scratches the surface of all data that could be represented. When looking at specific industries and the data that they might publicly share, there could be hundreds of vocabularies with thousands of types in each.

Beyond the sheer amount of data, there is the question of how to even classify some of it. Debates could rage on about the most mundane things, like whether “a washing machine was a kitchen appliance or a household cleaning device.”

Then the Semantic Web needs to handle duplicate data which, unfortunately, might not be any easier than trying to de-duplicate unstructured data. A single item could be represented in two or more different vocabularies and may have different properties defined in each. A global identifier for data may help in specific circumstances; however, it will not fully solve the problem.
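To illustrate both the benefit and the limit of a global identifier, here is a small sketch that merges records sharing an identifier. The records, the field names, the `urn:example:` identifier, and the `merge_by_identifier` function are all hypothetical, and the merge rule (later values overwrite earlier ones) is just one simple assumption among many possible choices.

```python
def merge_by_identifier(records):
    """Merge records that share an "identifier"; leave the rest untouched.

    When two records disagree on a property, the later record wins.
    That is a simplification; a real system would need a conflict policy.
    """
    merged = {}
    unmatched = []
    for record in records:
        key = record.get("identifier")
        if key is None:
            # Nothing to match on, so this record cannot be de-duplicated here.
            unmatched.append(dict(record))
            continue
        merged.setdefault(key, {}).update(record)
    return list(merged.values()) + unmatched


# Hypothetical records describing the same product in different ways.
records = [
    {"identifier": "urn:example:item:42", "name": "Example Washer", "category": "kitchen appliance"},
    {"identifier": "urn:example:item:42", "title": "Example Washer", "capacity_kg": 8},
    {"title": "Example Washer (8 kg)"},
]

result = merge_by_identifier(records)
print(len(result))
```

The two records that carry the identifier collapse into one, although the merged record still holds both `name` and `title` because each source used a different property. The third record describes the same item but has no identifier, so it survives as a duplicate.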

The credibility of data is another key concern with a Semantic Web. When we research information currently, there are many different factors that we may consider when determining if the information we read can be trusted. Additionally, we might verify what we find across multiple different sites. Systems would need to deal not only with factually incorrect data but also with inconsistency in the data that they do find.

Maybe the biggest problem, though, is not a technical one but a human one. Web developers or other people interested in these types of technologies might go out of their way to add data to their pages and websites, but would your parents want to manage their own data like that? Your neighbours? Your friends? Even if tools are built for the average person, what is to say they would even want to use them? For them, the Semantic Web might be dead on arrival.

We are still a long way off from some form of a Semantic Web. While in many ways we are definitely stepping towards it, the full data utopia will rely on many aspects falling perfectly into place. It is unlikely to be a data revolution but rather an evolution of how the web operates now. As we step forward though, we will undoubtedly discover new uses for the data and start developing the technologies that can utilize it.
