The semantic future of the web
The web is built on data: my data, your data, data from small companies, data from big companies, and so forth. We might hand over an email address and, in return, get access to other data, perhaps exclusive content for a new video game or a weekly newsletter. This constant exchange of data allows for collaboration and communication on a scale that never existed before the web.
Much of the data currently changing hands can be viewed as human-centric. We have news articles, blogs, e-commerce, forums, video platforms, social media, and Q&A sites providing us data to read, watch, and otherwise consume. We are not the only consumers of the web, though: search engines, voice assistants, pricing bots, and even link preview bots perform a staggering number of requests every day, and computer systems like these are playing an ever-growing role in data consumption.
Tim Berners-Lee coined the term “Semantic Web” for a web that works less like a series of separate pages and more like a global database that computer systems can understand. In turn, this could allow deeper integration between different computer systems and greater decentralization of data. The data here is not just from large corporations; it can be your data or my data, data that we control and manage ourselves through our own websites.
Unfortunately, we are not at this stage of a full data utopia. Large amounts of data are not publicly available at all, and much of the data that is available is locked behind proprietary APIs that charge for access.
Building a Semantic Web
To move from where we are now to a full Semantic Web is not something that can happen overnight. We have been building web pages for years on HTML, CSS, and JavaScript, designed first and foremost for a human viewing experience. To extract reliable data from HTML today, computer systems need to process unstructured content and then establish its context and meaning. We humans can determine that context and meaning just by viewing the page, but machines have to perform additional processing to reach the same understanding. Directly encoding structured data in the page removes much of that complexity. There are many solutions for encoding structured data, including Open Graph, Microdata, RDFa, and JSON-LD.
Open Graph, created by Facebook, is a popular format for holding specific types of structured data. Facebook uses it to generate link previews from page metadata, which gives website developers control over how their pages are presented when shared, based on how those pages are described in the metadata. Since its creation, other social media sites have also adopted Open Graph for generating link previews.
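As a rough illustration of what a link preview bot does with that metadata, here is a minimal Python sketch using the widely available requests and BeautifulSoup libraries; the URL and the handling are simplified assumptions, not how any particular platform actually implements it.

```python
# Minimal sketch of how a link-preview bot might read Open Graph metadata.
# Assumes the `requests` and `beautifulsoup4` packages; the URL is a placeholder.
import requests
from bs4 import BeautifulSoup

def fetch_open_graph(url):
    """Return a dict of og:* properties found in the page's <meta> tags."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    preview = {}
    for tag in soup.find_all("meta"):
        prop = tag.get("property", "")
        if prop.startswith("og:"):
            preview[prop] = tag.get("content", "")
    return preview

# Example usage (hypothetical URL):
# fetch_open_graph("https://example.com/article")
# -> {"og:title": "...", "og:description": "...", "og:image": "..."}
```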
Microdata, RDFa, and JSON-LD, however, are a bit different: by themselves, they are only different formats for storing data in a web page. Computers can parse these standardized structures, but unless they know what type of data is being represented, they will not actually understand it. What is missing is a shared vocabulary so that two different computer systems can understand each other.
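JSON-LD, for example, is typically embedded in a script tag, and a parser can pull it out without knowing what any of it means. The sketch below, under the same assumptions as the previous example, recovers the structure but not the meaning.

```python
# Sketch: extracting embedded JSON-LD blocks from a page.
# The parser recovers the structure, but without a shared vocabulary it
# has no idea whether the data describes a person, a product, or a recipe.
import json
import requests
from bs4 import BeautifulSoup

def extract_json_ld(url):
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    blocks = []
    for script in soup.find_all("script", type="application/ld+json"):
        try:
            blocks.append(json.loads(script.string or ""))
        except json.JSONDecodeError:
            pass  # ignore malformed blocks
    return blocks
```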
A joint effort by Google, Microsoft, Yahoo, and Yandex proposed a solution called Schema.org to promote structured data in web pages with a common vocabulary. For search engines, this structured data can help provide richer information in search results. While Schema.org does not describe every type of object, nor does it intend to, it does create a solid foundation for describing many common ones: books, events, locations, medical conditions, movies, organizations, and people. For areas it does not cover, alternative vocabularies can describe that specialized data. Through its popularity for enhancing SEO, Schema.org has an ever-growing user base, which in turn helps grow the Semantic Web.
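To show what that shared vocabulary looks like in practice, here is a small illustrative example of a Schema.org-typed JSON-LD block, written as a Python dict; all of the names and URLs are invented for the example.

```python
# Illustrative only: a Schema.org "Person" described in JSON-LD, as it might
# appear inside <script type="application/ld+json"> on a personal website.
# All names and URLs are made up for the example.
person = {
    "@context": "https://schema.org",
    "@type": "Person",
    "name": "Jane Doe",
    "jobTitle": "Web Developer",
    "url": "https://example.com",
    "sameAs": [
        "https://example.social/@janedoe"  # links this record to the same entity elsewhere
    ],
}

# Because "@type" and the property names come from a vocabulary both sides share,
# a system reading this block knows it is looking at a person, not just an
# arbitrary bag of key/value pairs.
```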
Data could change how we use the web
A Semantic Web may change not only how we think about searching for information online but also who controls that information. Imagine every website not as a wall of content but as a graph of inter-related topics and ideas. There would be no need for a central place where data is stored and controlled by a single entity, which helps avoid some concerns about censorship and bias while improving privacy and control over the data we choose to share.
For example, take a site like Facebook. It maintains mountains of information about people and businesses, with relationships between entities drawn from comments, reactions, and shares. This data is part of the Facebook ecosystem; it effectively “belongs” to them. In a future where data is in our own control, sites like Facebook could simply be a visual representation of the existing network, built on a Semantic Web. Only the data we declare public on our own websites could be viewed, giving us full control over what is shared. It also means we are not locked in to a service like Facebook: we are free to move to other “front ends,” as the data is ours and we maintain it.
It might seem strange that an organization like Facebook would ever give up its data. However, with stricter privacy laws being passed, such as the GDPR in the EU and the CCPA in California, it may be only a matter of time before it is forced to.
As new technologies are built to take advantage of this data, it will also provide new tools and experiences for users. While the algorithms behind search engines are complex, they currently provide results for queries that have already been specifically answered somewhere. If you asked for "all songs before 1995 that failed domestically but were well-received worldwide," you would be unlikely to get results, because no one has yet answered that exact question. The data for such a query exists on the web; it just is not readily available because of how search currently works. With a web built on data, obscure queries like this could turn up results by combining datasets from several sites.
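Something close to this is already possible against open structured datasets. As a rough sketch, the query below asks the public Wikidata SPARQL endpoint for songs published before 1995; the "failed domestically but well-received worldwide" half of the question would still depend on someone publishing that data in a structured form, and the item and property identifiers used here are only illustrative.

```python
# Sketch: querying structured data with SPARQL against the public Wikidata endpoint.
# The identifiers (Q7366 = "song", P31 = "instance of", P577 = "publication date")
# are taken from Wikidata; adjust them if the vocabulary differs.
import requests

SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"

query = """
SELECT ?song ?songLabel ?date WHERE {
  ?song wdt:P31 wd:Q7366 ;        # items that are songs
        wdt:P577 ?date .          # with a publication date
  FILTER(?date < "1995-01-01"^^xsd:dateTime)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 10
"""

response = requests.get(
    SPARQL_ENDPOINT,
    params={"query": query, "format": "json"},
    headers={"User-Agent": "semantic-web-demo/0.1"},  # polite identification
    timeout=30,
)
for row in response.json()["results"]["bindings"]:
    print(row["songLabel"]["value"], row["date"]["value"])
```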
The ability to query more complex data could especially help researchers and data scientists, who could combine vast amounts of public data with their own private research data to discover new and interesting things. It may also help those training machine learning models, since specific data sets could be assembled that would have been impossible to acquire otherwise.
Still barriers to be overcome
Changes to support a Semantic Web are not something that can happen overnight—we are talking years of small steps and incremental improvements. Even if most websites had rich structured data in their markup, many new tools and technologies would need to be built to leverage it. For example, Berners-Lee has been working on Solid as a method to allow users greater control over their own data, building upon key concepts of a Semantic Web.
Like many other concepts, the Semantic Web has its critics. One, Cory Doctorow, goes as far as to call it “a pipe-dream, founded on self-delusion, nerd hubris, and hysterically inflated market opportunities.” That comment is not without merit, as there are several potential problems to consider.
With the number of websites on the web and the vast number of types that may need to be represented, there is a huge amount of data to understand for any sufficiently complex query. Schema.org has 841 types by itself, yet it only scratches the surface of all the data that could be represented. Looking at specific industries and the data they might publicly share, there could be hundreds of vocabularies with thousands of types each.
Beyond the sheer amount of data is deciding how to even classify some of it. Debates could rage on about the most mundane things like whether “a washing machine was a kitchen appliance or a household cleaning device.”
Then the Semantic Web needs to handle duplicate data, which, unfortunately, might not be any easier than trying to de-duplicate unstructured data. A single item might be represented in two or more different vocabularies, each defining different properties. A global identifier for data may help in specific circumstances, but it will not fully solve the problem.
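As an illustrative sketch only (all field names, identifiers, and values below are made up), merging records from different vocabularies on a shared identifier might look something like this, with conflict resolution remaining the genuinely hard part:

```python
# Illustrative sketch: merging duplicate records from different vocabularies
# by grouping them on a shared identifier. Field names and data are made up.
from collections import defaultdict

records = [
    {"id": "https://example.org/id/washing-machine-42", "vocab": "shop", "price": 499},
    {"id": "https://example.org/id/washing-machine-42", "vocab": "review-site", "rating": 4.2},
    {"id": "https://example.org/id/fridge-7", "vocab": "shop", "price": 899},
]

merged = defaultdict(dict)
for record in records:
    # Later records overwrite earlier ones on conflicting keys; a real system
    # would need a policy for resolving such conflicts, which is the hard part.
    merged[record["id"]].update(record)

for identifier, data in merged.items():
    print(identifier, data)
```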
The credibility of data is another key concern for a Semantic Web. When we research information today, there are many factors we may weigh when deciding whether what we read can be trusted, and we might verify what we find across multiple sites. Systems would need to deal not only with factually incorrect data but also with inconsistencies in the data they do find.
Maybe the biggest problem, though, is not a technical one but a human one. Web developers and others interested in these technologies might go out of their way to add data to their pages and websites, but would your parents want to manage their own data like that? Your neighbours? Your friends? Even if tools are built for the average person, what is to say they would even want to use them? For them, the Semantic Web might be dead on arrival.
We are still a long way off from some form of a Semantic Web. While in many ways we are definitely stepping towards it, the full data utopia will rely on many aspects falling perfectly into place. It is unlikely to be a data revolution but rather an evolution of how the web operates now. As we step forward though, we will undoubtedly discover new uses for the data and start developing the technologies that can utilize it.
Tags: semantic web
10 Comments
Quoting Doctorow’s article from 2001… Come on, that’s neither still relevant nor correct considering modern semantic web projects. There have been solutions proposed and implemented that tackle all of the issues mentioned. Could be better researched e.g. by interviewing someone from the semantic web community (github.com/linkeddata, forum.solidproject.org)
So humans make the effort to structure the web's data so it is easier for bots to take advantage of?
What about developers making the effort so that bots can better structure the web's data by themselves? (ex: https://textoptimizer.com/e )
Interesting blog.
I was just thinking to myself how cumbersome sites built on frameworks have become, and what a breeze it is at times to come across a simple site running simple code. My CPU and data plans are thankful for it. And yet, reading this article, I thought to myself: take for example those pricing bots. They're as effective as they are because commercial businesses build their sites using well-established e-commerce frameworks. That essentially means the data is already structured to an extent for these bots. They either go directly to known sites or they prompt search engines, visit the resulting list of sites, and know exactly what to look for because everything is in specified elements with specified IDs. Google has been doing the same thing for some time now for the sake of their spiders, getting coders to follow a certain structural standard.
I think frameworks are the future of this concept of virtual readers and data consumers.
These virtual consumers are going to form a web within the web. And if we can access this web and track our public data, then I imagine this brings a whole new reality to the digitisation of human personalities. Of course this also means advertisers and so forth have access to a much more intimate you.
The problem that semantic web technology right now has is that it technically works, but it’s not user-friendly. The JSON-LD processor in javascript is heavy and slow, so unlikely to be shipped in a Webapp. I’ve looked at some of the solid apps and they look and feel like technical prototypes, not MVPs. GraphQL and OpenAPI are getting great adoption, through documentation and tooling, something I’d have liked to see for the Hydra community group.
It seems that the Apps around the semantic web are designed for a network effect that isn’t there (yet?)
Take a look at Microsoft Collections (runs as an extension in Edge) for an interesting application that uses the semantic web. Also see how Amazon is using JSON-LD to engineer the audio presentations of recipes on its devices. It is interesting, too, that many recipe web sites now provide JSON-LD data. Other interesting efforts are the embedded content components from alltrails.com, Google Maps, youtube.com, twitter.com and map stories.
Wikidata should be mentioned here as a growing repository (1 billion entries) of open and structured data. It has semantic-web-friendly interfaces (RDF, SPARQL). It acts as:
1. a structured data repository
2. a heavily multilingual database, with labels and synonyms in every language
3. a hub for the identifiers of entries in many different databases.
Also, initiatives like the Linked Open Data Cloud (https://lod-cloud.net/) can provide good entry points for newcomers to this (fantastic!) field.
Thanks for the article
Good summary of the situation. Have you ever heard of IEML by Pierre Levy? In my eyes, it's the best answer to the problem so far. See more on his blog: pierrelevyblog.com/2020/07/28/ieml-grammar-short-version/
Too bad the article cites Cory Doctorow's writing, which is completely misleading and founded on false premises. Most of the problems the writing blames metadata for (misinformation, misinterpretation, bad/unreliable information) are endemic to the Web and to human communication in general. It is also talking about a utopia, which I personally am not interested in. I'm more into improving things.
One thing Linked Open Data does very well is tracking provenance, which can help in assessing the quality of sources.
One thing it does very badly is being efficient enough to operate at scale. Verbose serialization languages such as JSON-LD don’t help in this direction.
Some people get drunk on LOD and try to apply it to all their information repositories and internal system communications, and that most likely turns into an unreadable and unprocessable mess. I have been there too. Rather than joining the LOD AA, though, I still think that LOD has a place in making the information world better: when it is used to connect foreign systems. It is best used as a lingua franca to bridge institutions that don't speak the same language.
As for the semantic web being user-friendly, I don't understand the point. I guess many people use Facebook every day without having a clue what OpenGraph is. The Facebook interface keeps the nerdy stuff out of their sight and just gives them the functionality they need. This is what LOD lacks: an interface for specific groups of users with different competencies.
First: It is important to look at requirements for business models for data ecosystems, as the economy largely runs on money, like it or not! That means examining the requirements for search, trust, payments, terms and conditions, privacy and confidentiality, resilience and so forth.
Second: AI is improving and so are the assumptions as to what is needed for information to be machine interpretable. We are slowly learning how to make computers think like we do, to understand things like we do, and to learn like we do. This points to the likely emergence of the Sentient Web, and federated services as the future of Web search. The Sentient Web subsumes the Semantic Web.