In Search of Truth for Knowledge-based Systems

Oct 5, 2014

I find myself coming back to Stefano Mazzocchi’s blog post about epistemology in knowledge-based systems. The upshot for me is that since an absolute truth doesn’t appear to exist, a truly useful search engine for the semantic web should calculate probabilities of truthfulness for each statement published, akin to Bayesian probabilities.

I was thinking of an approach that uses a PageRank-style algorithm to compute a trustworthiness score for each domain (or rather, subdomain). The score would lie between zero and one. A trustworthiness of 1 could never be reached (except by domains publishing only tautologies); it would mean that a statement picked at random from that domain is true with probability 100%, i.e. that every statement the domain provides is true. The algorithm can be thought of as an iterative method that updates the trustworthiness of each domain on each iteration, depending on how many of the statements on that domain agree or disagree with comparable statements on all other domains. For example, if most statements on domain A agree with statements on other domains that already have positive trustworthiness, the trustworthiness of domain A will increase.
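To make the idea concrete, here is a minimal sketch of such an iterative update in Python. Everything in it is my own assumption: statements are (subject, predicate, object) triples, `agreement` is some function returning a similarity in [-1, 1] (zero meaning "incomparable"), and the specific update rule (a damped move toward the trust-weighted average agreement, mapped from [-1, 1] into [0, 1]) is just one arbitrary choice among many.

```python
def update_trust(trust, statements, agreement, iterations=20, learning_rate=0.1):
    """Iteratively refine per-domain trustworthiness scores.

    trust:      dict mapping domain -> score in (0, 1)
    statements: dict mapping domain -> list of statements
    agreement:  function (statement, statement) -> similarity in [-1, 1],
                where 0 means the statements are incomparable
    """
    for _ in range(iterations):
        new_trust = {}
        for domain, stmts in statements.items():
            support = 0.0  # trust-weighted agreement from other domains
            weight = 0.0   # total trust of domains with comparable statements
            for s in stmts:
                for other, other_stmts in statements.items():
                    if other == domain:
                        continue
                    for t in other_stmts:
                        a = agreement(s, t)
                        if a != 0.0:  # only comparable statements count
                            support += a * trust[other]
                            weight += trust[other]
            if weight > 0:
                # map average agreement from [-1, 1] into a [0, 1] target
                target = 0.5 + 0.5 * (support / weight)
                new_trust[domain] = ((1 - learning_rate) * trust[domain]
                                     + learning_rate * target)
            else:
                new_trust[domain] = trust[domain]  # nothing comparable: keep score
        trust = new_trust
    return trust
```

With two domains asserting the Earth is round and one asserting it is flat, the outlier's score drops over the iterations while the agreeing domains hold or gain, which is the qualitative behaviour described above. Note this naive version is O(statements²) per iteration; a real engine would need an index to find comparable statements.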

Computing to what degree two statements agree is also a non-trivial task. We need a similarity measure that returns zero when two statements are incomparable (i.e. completely independent, like “The capital of the USA is Washington” and “It was raining in France on 1st October 2014”), one for the similarity of any statement to itself, and minus one for two contradictory statements. However, this requires a certain knowledge about the predicates in the statements. For example, the two statements “Bob is-married-to Alice” and “Bob is-married-to Eve” are contradictory in most cultures: usually, only one of them can be true. But “Bob has-friend Alice” and “Bob has-friend Charlie” may both be true statements. Also, the statement “Earth has-population 7 billion” should be more similar to “Earth has-population 7.1 billion” than to “Earth has-population 1 billion”, but how much more similar?
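One possible shape for such a measure, again as a hypothetical sketch: statements are (subject, predicate, object) triples, the set of “functional” predicates (those that admit only one true object per subject) is hand-curated knowledge I'm assuming we have, and the numeric case uses a relative-difference formula chosen arbitrarily for illustration.

```python
# Assumed predicate knowledge: these predicates admit at most one true
# object per subject (in most cultures, for is-married-to).
FUNCTIONAL = {"is-married-to", "is-capital-of"}

def similarity(s, t):
    """Similarity of two (subject, predicate, object) statements.

    Returns 1.0 for identical statements, -1.0 for contradictions,
    0.0 for incomparable statements, and intermediate values when
    the objects are numeric.
    """
    if s == t:
        return 1.0
    if s[0] != t[0] or s[1] != t[1]:
        return 0.0  # different subject or predicate: incomparable
    # Same subject and predicate, different objects:
    if isinstance(s[2], (int, float)) and isinstance(t[2], (int, float)):
        big = max(abs(s[2]), abs(t[2]))
        if big == 0:
            return 1.0
        # one arbitrary choice: relative difference mapped into [-1, 1)
        return 1.0 - 2.0 * abs(s[2] - t[2]) / big
    if s[1] in FUNCTIONAL:
        return -1.0  # at most one object can be true: contradiction
    return 0.0  # multi-valued predicate: both statements may be true
```

Under this particular formula, 7 billion vs. 7.1 billion scores about 0.97 while 7 billion vs. 1 billion scores about -0.71, so the ordering from the example holds; whether those magnitudes are the right answer to “how much more similar?” is exactly the open question.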

Should statements concerning numeric literals carry, in addition to a unit, an optional 95% confidence interval? Probably yes. Also, domains should be able to express their own probability for each statement they publish; from that, in combination with the domain’s trustworthiness, the search engine would calculate its overall probability for the statement. Interestingly, YAGO seems to be moving in that direction.
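How to combine the two numbers is left open above; one simple (entirely assumed) rule would treat the domain’s trustworthiness as the weight given to its self-reported probability, falling back to an uninformative 0.5 prior for the rest:

```python
def overall_probability(domain_trust, published_prob=1.0):
    """One hypothetical combination rule (not from the post): a mixture of
    the domain's own probability and a "don't know" prior of 0.5, weighted
    by how much we trust the domain."""
    return domain_trust * published_prob + (1 - domain_trust) * 0.5
```

A fully trusted domain’s probability is passed through unchanged, while a domain with zero trustworthiness tells us nothing and we stay at 0.5. A proper Bayesian treatment would be more involved, but the interface is the point here.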

Depending on the details of the algorithm, it would probably also be sensitive to initialization. This seems like a good thing, since it would avoid favouring the view voiced by the most domains (e.g. someone publishing the same wrong statements on hundreds of domains) and instead allow us to assign a higher initial trustworthiness to domains like wikipedia.org that already represent a consensus among many contributors. However, the people setting these initialization scores would have a disproportionate influence on what is considered trustworthy information and what is not, in effect determining what the system will eventually output as probably true. So should .gov domains have a higher initial trustworthiness? This seems reasonable in the area of statistics, but less so for material published during an election campaign.

While the details are immensely more complicated for statements than for pages, all in all, the fundamental problem seems to be similar to the one that Google and other search engines are already facing today: how much to trust each domain on the web.