Security, authenticity and reliability of data for the Semantic Web: a Critical Review

Rebecca Flaherty & Stephen Gray

Abstract

As the web approaches maturity as a research tool, trust in the provenance of data and in the authenticity of authorship is of crucial importance. The web has a large and ever-increasing audience, both human and machine. Data can not only be read from a huge array of sources but also be widely and repeatedly redistributed. Users and agents (machines and software such as browsers which 'navigate' online data) need to be able to trust in the authenticity, objectivity and origins of data. For this to be possible, a system for evaluating the value of data will need to be established.

The intention of this paper is first to define how the value of a web resource is currently established, and how this affects the usability of the web as a reliable tool for academic research. Possible solutions to this problem using data models, information distribution methods, mark-up, authentication and security methods for the Semantic Web will be reviewed and critically assessed.

A case study scenario will be established for academic writing, to illustrate the issues outlined above and how they might be addressed using semantic web technologies. Current thinking around these technologies will be explored, as will the standards that need to be established in order to facilitate World Wide Web Consortium director Tim Berners-Lee's vision of a Semantic Web.

Once established, the Semantic Web has the potential to provide solutions through the development of a 'web of trust', digital signatures and semantic mark-up languages. While these would allow information-gathering agents to become more 'intelligent' in their selection of data, the implementation of the semantic web will be hampered if the issues of security and privacy and of making information more accessible and understandable by machine are not addressed.

Introduction

Robert is a PhD student who supplements his income by lecturing at the university. He has been asked to present a lecture on a subject with which he is unfamiliar, as the subject deals with emerging technologies. He begins web-based research, as published material is as yet unavailable. However, as he is in a trusted position, the information he gives out needs to be reliable, accurate and trustworthy. How does Robert ensure that the results of his research are of sufficient quality?

Assessing the Authenticity of a Web Resource

The Internet provides a vast amount of information on almost any subject. If the Web is to be used as a tool for academic research, the problem

lies not in finding information, but in sifting through it and winnowing out the most valid and useful material…. Unfortunately, a great deal of the content of cyberspace… is inaccurate, poorly substantiated, unbalanced, or even tainted by deliberate political or commercial slant.

(Jelovsek, 2004)

The reliability of information on the web can be assessed in a variety of ways. Has the author or sponsor of the information provided any credentials, and can they be verified? If the contents of a page are under copyright this can also be checked. Reputability might also play a role: information appearing on the website of a university or other trusted institution will hold more weight than the same information published in a blog or forum. Reputation is at present the most convenient form of determining the value of web resources; there are many university-sponsored search sites, such as Athens.ac.uk, providing access to academic journals, articles and other resources. Information, however, can appear through its presentation to be from a reputable source while actually having nothing to do with the institution, and therefore the domain name should be checked and verified.

On a magazine-based website, Robert finds an article which was written by a lecturer he knows at another university. He emails the lecturer to confirm his authorship before using the information. Robert is able to establish a limited amount of credibility in the article through its use of references, its appearance of objectivity and the magazine's reputation. However, how could he have ensured its authenticity had he not known the author personally?

The accuracy and objectivity of the source are key considerations. Verification of the information that a website contains can give an indication of its reliability through a web search for supporting articles. References and links to other sources provided on the page can be an indication of the source's credibility. The political, commercial or other agenda of the site should be investigated, as this can affect the reliability of information types ranging from news coverage to product advertisement, and in the extreme take the form of propaganda.

Bias on the web can be insidious and far reaching…. We might be able to hold bias in check if we could all judge the content of web sites by some objective definitions. But the process of asserting quality is subjective, and is a fundamental right upon which many more things hang.

(Berners-Lee, 1999 p.135)

The age of the data on a web page can be a factor. If the page has been updated recently it is more likely that information is up to date; older pages may contain information that was accurate when the site was produced, but is no longer relevant. Whether the information is from a primary or secondary source affects its validity: is the information first-hand, or is it a forum discussion?

Robert found that manually searching for verification data was a time-consuming task, and involved a variety of searches, not only for the information required, but also on authors, institutions and their credentials.

The Semantic Web may offer a means for information to be searched for, analysed and authenticated by machine, potentially helping Robert in his research.

How the Semantic Web might provide Authentication

What is the Semantic Web?

If HTML and the Web made all the online documents look like one huge book, RDF, schema, and inference languages will make all the data in the world look like one huge database.

(Berners-Lee, 1999 p.201)

The Semantic Web is a concept that seeks to unite all data across the Web by defining an inference language. This language will act as a bridge across mark-up and natural languages. The Semantic Web is, essentially, a syntactic framework based on relational database ER modelling and it is quickly becoming the springboard for many emerging Internet technologies.

Applications of the Semantic Web

Recently, the technology has been used on many popular social networking sites, such as 'del.icio.us', 'Flickr' and 'Livejournal', to add categorical data 'tags' to a file or document such as a blog entry ('Livejournal'), URI 'bookmarks' ('del.icio.us') or image ('Flickr').

As this method of marking up data is adopted more widely, it becomes important to ensure that the data is marked up in a consistent and accurate way. If a machine is to understand the meaning of data and automatically follow links through the vast amount of information available, it must also be able to evaluate the authenticity of that information to assess its value to the current task.

To do this, it must be able to establish the origin of the data; that the true origin is the same as the claimed origin; and, if the data has been changed since it originated, why and by whom. Only once these have been established can the data's worth be assessed: "Being able to ask 'Why?' is important. It allows the user to trace back to the assumptions that were made, and the rules and data used." (Berners-Lee, 1999, p.207)
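As a rough illustration of these three checks, the sketch below models a provenance record in Python. The record layout, the field names and the `verify_provenance` helper are assumptions made for illustration and are not drawn from any Semantic Web standard; a SHA-256 digest stands in for whatever integrity mechanism a real system might use, and the origin and editor names are invented.

```python
import hashlib
from dataclasses import dataclass, field


def digest(text: str) -> str:
    """Return the SHA-256 hex digest of a piece of text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


@dataclass
class Revision:
    editor: str      # who changed the data
    reason: str      # why it was changed
    new_digest: str  # digest of the data after the change


@dataclass
class ProvenanceRecord:
    claimed_origin: str   # origin stated by the publisher (hypothetical)
    original_digest: str  # digest published by that origin
    revisions: list = field(default_factory=list)


def verify_provenance(content: str, record: ProvenanceRecord) -> dict:
    """Report the claimed origin, whether the content matches, and who changed it."""
    if record.revisions:
        expected = record.revisions[-1].new_digest
    else:
        expected = record.original_digest
    return {
        "origin": record.claimed_origin,
        "matches_record": digest(content) == expected,
        "changes": [(r.editor, r.reason) for r in record.revisions],
    }


original = "The Semantic Web extends the current Web."
revised = "The Semantic Web extends the current Web with machine-readable meaning."

record = ProvenanceRecord("example.ac.uk", digest(original))
record.revisions.append(
    Revision("editor@example.ac.uk", "expanded definition", digest(revised))
)

report = verify_provenance(revised, record)
tampered = verify_provenance(revised + " Buy now!", record)
```

Here `report` confirms that the revised text matches the latest recorded revision and lists who changed it and why, while `tampered` fails the match. A digest alone shows only that content is unchanged; tying it to the claimed origin would additionally require a digital signature.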

Trust and the Semantic Web

Central to the way the Semantic Web will address these issues is the notion of trust:

The degree to which an agent considers an assertion to be true for a given context. While the term 'trust' is often used to denote a very high degree of confidence, there is an associated risk of the assertions being wrong.

(Reagle, 2002)

Very little is written about trust and proof in the architecture of the Semantic Web, and yet it is arguably one of the most important aspects. Trust, however, is difficult to establish, particularly in the disembodied online realm.

This mirrors real-world situations, such as banking and voting, which have been replicated or trialled online (The Electoral Commission, 2006). There are countless cases of rigged elections in which voters could not trust that their vote would be properly counted after being placed in the ballot box, and recent UK elections have suffered from postal voting corruption (The Guardian, 2005). For high street banks, older technologies such as cheques also suffer from the same risks of misplaced trust (The Daily Telegraph, 2006). However, we can usually build up enough trust for each of these transactions, through signals such as official or corporate marks and buildings, uniformed staff and a perception that other people in similar situations are engaging in the same transaction.

Building trust in the corresponding online situations is more difficult: the signals available are one-dimensional and less transparent - for example, a bank might have maintained a branch in the same building for decades, but its URL could point to a different server every day without its customers being aware of it - meaning that the provenance of data is hidden from most human users.

Robert receives an RDF Site Summary (RSS) feed of news relevant to his research from a colleague. He knows that he can trust that information, because in context the information comes from a trusted source. He can then make an informed judgement on the reliability of the content.

A crucial underpinning for the Semantic Web, as envisaged by Tim Berners-Lee, is a 'web of trust', a "mesh of statements about who will trust statements of what form when they are signed with what keys. This is where the meat is, the real mirroring of society in technology."

(Berners-Lee, 1999, p.209)

In order to achieve this, systems must be developed to enable machines to match or better the human ability to judge reputability alongside existing means of authentication:

Much research has focused on authentication of resources, including work on digital signatures and public keys. Confidence in the source or author of a document is important, but trust, in this sense, ignores many important points. Just because a person can confirm the source of documents does not have any explicit implication about trusting the content of those documents.

(Golbeck, J., Parsia, B. & Hendler, J., 2003, p.1)

Social Networking and the 'Web of Trust'

Some possibilities are demonstrated by the most widespread of the Semantic Web applications to have emerged so far: social networking, FOAF (Friend-Of-A-Friend) and RSS. Applications such as 'Flickr', 'del.icio.us' and 'Livejournal' apply the principles of social networks and structures to create virtual communities bound by recommendation, reputation and shared interests. These principles allow webs of trust to develop in a way that mimics real-world social networks, with no central authority to validate trustworthiness.

FOAF extends this concept by providing "a way to describe people and relationships to computers." (Morten Frederickson, 2006). It is "an ontological vocabulary for describing people and their relationships. There are millions of FOAF files online - some created by individuals, and others output as a standardized way of sharing data from some centralized social network websites." (Golbeck, J., Parsia, B. & Hendler, J., 2003, p.3). The more links and 'friends' that are attached to a certain piece of data, the higher its confidence value and trustworthiness becomes.
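The accumulation idea can be sketched in a few lines of Python. The `knows` statements below are invented sample data, and counting inbound links is only one simple assumption about how a confidence value might be derived; FOAF itself is a descriptive vocabulary and does not define such a score.

```python
from collections import Counter

# Hypothetical FOAF-style statements: (person, "knows", other person).
knows = [
    ("alice", "knows", "robert"),
    ("carol", "knows", "robert"),
    ("dave", "knows", "robert"),
    ("alice", "knows", "carol"),
    ("alice", "knows", "carol"),  # a duplicate statement should not count twice
]


def inbound_links(statements):
    """Count how many distinct statements name each person as the one known."""
    return Counter(obj for _subj, _pred, obj in set(statements))


confidence = inbound_links(knows)

# People ordered from most to least endorsed.
ranked = [name for name, _count in confidence.most_common()]
```

In this sample, `robert` is named by three different people and `carol` by one, so an agent using this naive measure would rank statements associated with `robert` more highly. The measure says nothing about whether the endorsers themselves are reliable.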

However, the applications of social networking are limited and FOAF runs the risk of exposing data to those who might misuse it. Many FOAF pages contain personal data such as telephone numbers and email addresses, which could expose their owners to 'spam' and identity theft.

Robert used his real-world social network to check his first source, but can an online equivalent provide similar checks? As his topic is very new, it is mentioned far more in blogs and news reports than in conference papers or journals. He finds many blogs referring to a relevant recent news story on the site of a national broadsheet newspaper. Keen to engage his students with up-to-the-minute material, Robert traces the original source on the newspaper's website and downloads a copy. However, he misses the subsequent revelation that the article was itself based on a blog entry that was a fabrication. The newspaper later retracted the story and sacked the journalist responsible, but, to Robert, the information he found on the web seemed current, reliable and supported by enough links from a network of blogs to be trusted.

While this assumption would normally be as safe as Robert needed it to be in the context of his research, this scenario shows that an entire mutually dependent system of trust can exist in a 'fossilised' state.

Public Keys for Authentication of Sources

One way in which the authorship of a document could be ascertained on the Semantic Web is through the use of digital signatures. OpenPGP (Pretty Good Privacy) is one method of signing documents. Rather than relying on a centralised certification authority, each entity distributes its own public key, and if it trusts a source it signs that source's public key with its own. This then builds into a 'web of trust':

Each PGP user maintains a 'keyring' of public keys with associated trust values. Two important variations of trust are: (a) trust to introduce a new public key and corresponding associated identity, and (b) trust that a given key for the purposes of signing messages [is] from an identified party.

(Klyne, 2002)

Therefore, the machines that are used to gather data can use digital signatures to assess the value of a source as well as the content it provides.

This PGP system of signing documents would allow Robert or his user agent to keep a list of the public keys of known and trusted sources. The signature on a document can then be checked against the known keys, and the document's worth calculated accordingly. This would save Robert the task of searching through untrustworthy sources, as the user agent can be told to ignore data from unknown or untrustworthy sources.
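A minimal sketch of such a user agent follows, in Python. It performs no cryptography: the fingerprints, trust values, `Document` type and threshold are all hypothetical, and the sketch assumes that signature verification has already been carried out by an OpenPGP implementation, leaving only the keyring lookup to illustrate.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical keyring: key fingerprint -> trust value between 0.0 and 1.0.
keyring = {
    "FPR-LECTURER": 0.9,  # a colleague Robert knows personally
    "FPR-MAGAZINE": 0.6,  # a publication with a good reputation
}


@dataclass
class Document:
    title: str
    signer: Optional[str]  # fingerprint of the verified signing key, None if unsigned


def score(doc: Document, ring: dict) -> float:
    """Return the trust value of the signing key, or 0.0 if unknown or unsigned."""
    if doc.signer is None:
        return 0.0
    return ring.get(doc.signer, 0.0)


def select_trusted(docs, ring, threshold=0.5):
    """Keep only documents whose signer meets the threshold, most trusted first."""
    kept = [d for d in docs if score(d, ring) >= threshold]
    return sorted(kept, key=lambda d: score(d, ring), reverse=True)


results = [
    Document("Forum post", None),
    Document("Magazine article", "FPR-MAGAZINE"),
    Document("Lecture notes", "FPR-LECTURER"),
    Document("Anonymous blog", "FPR-UNKNOWN"),
]

trusted_titles = [d.title for d in select_trusted(results, keyring)]
```

With this sample data the agent would return the lecture notes and the magazine article, in that order, and silently drop the unsigned forum post and the blog signed with an unknown key.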

Tim Berners-Lee claims that this assertion of trust using PGP can be refined to a finer 'granularity' (Dumbill, 2000). A user can define not only which keys are trusted, but also the domain within which to trust them.

The Semantic Web is not intended to emulate an inference engine, but the granularity of trust can be improved from the current 'trusted' or 'not trusted' to a more nuanced spectrum.
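One way to picture this finer granularity is a policy keyed on both the signing key and the subject domain. The Python sketch below is an assumption about how such a policy might be represented; the fingerprints, domain names, numeric values and grade boundaries are invented for illustration.

```python
# Hypothetical policy: (key fingerprint, subject domain) -> trust between 0.0 and 1.0.
policy = {
    ("FPR-LECTURER", "computing"): 0.9,
    ("FPR-LECTURER", "medicine"): 0.2,
    ("FPR-NEWSPAPER", "current-affairs"): 0.7,
}


def trust_in(fingerprint: str, domain: str, rules: dict, default: float = 0.0) -> float:
    """Look up how far a key is trusted for statements within one domain."""
    return rules.get((fingerprint, domain), default)


def label(value: float) -> str:
    """Map a numeric trust value onto a coarse, human-readable grade."""
    if value >= 0.8:
        return "high"
    if value >= 0.5:
        return "moderate"
    if value > 0.0:
        return "low"
    return "unknown"


grades = {
    "lecturer on computing": label(trust_in("FPR-LECTURER", "computing", policy)),
    "lecturer on medicine": label(trust_in("FPR-LECTURER", "medicine", policy)),
    "newspaper on computing": label(trust_in("FPR-NEWSPAPER", "computing", policy)),
}
```

The same key thus receives different grades in different domains, and a key with no entry for a domain is reported as unknown rather than as untrusted.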

Data and the Semantic Web

It is difficult to predict the evolution of a mark-up language:

Where for example a library of congress schema talks of an 'author', and a British Library talks of a 'creator', a small bit of RDF would be able to say that for any person x and any resource y, if x is the (LoC) author of y, then x is the (BL) creator of y. This is the sort of rule which solves the evolvability problems.

(Berners-Lee, 1998)
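The rule in this quotation can be sketched with plain Python sets standing in for an RDF store. The `loc:` and `bl:` prefixes, the URNs and the second record are hypothetical; real RDF would use full URIs and a dedicated toolkit, so this is an illustration of the inference step only.

```python
# Hypothetical predicate names standing in for the two library schemas.
LOC_AUTHOR = "loc:author"
BL_CREATOR = "bl:creator"

# Triples are (resource y, predicate, person x), following the quotation.
triples = {
    ("urn:example:weaving-the-web", LOC_AUTHOR, "Tim Berners-Lee"),
    ("urn:example:some-report", BL_CREATOR, "A. N. Other"),
}

# Rule: if x is the (LoC) author of y, then x is the (BL) creator of y.
rules = {LOC_AUTHOR: BL_CREATOR}


def apply_rules(facts: set, mapping: dict) -> set:
    """Return the facts plus every triple implied by the predicate mapping."""
    inferred = {
        (resource, mapping[predicate], person)
        for resource, predicate, person in facts
        if predicate in mapping
    }
    return facts | inferred


def creators_of(resource: str, facts: set) -> set:
    """Query using only the British Library vocabulary."""
    return {p for r, pred, p in facts if r == resource and pred == BL_CREATOR}


closed = apply_rules(triples, rules)
before = creators_of("urn:example:weaving-the-web", triples)
after = creators_of("urn:example:weaving-the-web", closed)
```

Before the rule is applied, a query phrased in the British Library vocabulary finds no creator for the first resource; afterwards it finds the person recorded under the Library of Congress term, without either schema having to change.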

Rather than being stored in human readable form, data for the Semantic Web must be stored in a machine-readable format and served as human-readable when required. This would be a fundamental shift in the way content is currently perceived by traditional web developers.

Different sites and databases will be able to name data using different conventions; this could be through cultural convention, language differences or the personal preference of the programmer. Superficially this would appear a great obstacle to interoperability, but the Semantic Web will overcome this through the use of inference languages.

This framework of 'dictionaries', which will allow machines to query other machines to ascertain the meaning of data, will also allow them to build maps and networks of trust. By creating an ability to 'understand' data, Semantic Web technologies will also permit user agents to 'understand' its origins and explore its provenance.

Critical Reflection

The Semantic Web promises solutions to the problems Robert faced when searching for, authenticating and evaluating the ever-expanding amount of data available online.

The web of trust and PGP can be seen as mutually reinforcing methods of ascertaining the provenance of data. PGP can be used to authenticate the identity of nodes within a web of trust, reducing the potential for a web to be misused. In turn, PGP propagates through a web of trust, with its certificates gaining value as they accumulate endorsements. An advantage of both a web of trust and PGP is that they are decentralised systems. However, at present there is not a decentralised system for the implementation of PGP.

However, both are systems of accumulation, so a new user faces the difficulty not only of establishing their own trustworthiness, but also of assessing the value of different webs without a trusted starting point. The accumulation of trust through the signing of PGP certificates enables various levels of 'trustworthiness' to be established, and might facilitate a ranking system, with an entity gaining credibility as it accumulates endorsing signatures.

PGP will allow a gradation of trust to be formed, rather than a black or white, trusted or not trusted, system. User agents could interpret certificates in shades of grey: a certificate might be trusted within certain domains or for certain transactions but not others. This enables a user to specify to an agent the domain within which a certificate is to be trusted. A disadvantage of this is that unsigned data, or data with unknown signatures, which could be relevant and trustworthy, is ignored by the user agent. An alternative would be for such data to be provided to the user but flagged as questionable, or for an agent to investigate its provenance.

However, progress towards the 'critical mass' of acceptance and implementation that will lead to the creation of the Semantic Web is not inevitable. The mere fact that it is still largely a vision rather than a reality nearly a decade after it was first conceived emphasises that the difficulties are not trivial. These difficulties are both technological and societal.

Standardisation recommendations for mark-up languages were initiated in 1995 by the World Wide Web Consortium (Berners-Lee, T., Connolly, D., 1995). Despite this, they have never been fully adopted by the web at large. Arguably there are two main concerns for the development of the Semantic Web: it will require further non-patented, decentralised standards to be widely implemented, and there must be significant changes in the way information is perceived and stored.

None of the authentication techniques described in this paper will be able to provide comprehensive and trustworthy search results until the foundations of the Semantic Web are in place.

References