Google Similarity Distance


The Google Similarity Distance is a measure proposed by Rudi Cilibrasi and Paul Vitányi that estimates the semantic similarity between search terms (queries). It is computed one pair of queries at a time, so it can be applied to any number of queries.

The method uses the World Wide Web as its corpus and a search engine such as Google to measure the similarity between terms. It is based on page counts: how many pages the search engine returns for each term alone, and how many it returns for both terms together. Terms that often appear on the same pages are treated as related.

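Using the formula for the Normalized Google Distance (NGD) given below, one can compute the distance between two terms. A value of 0 means the terms always occur together (they are treated as identical), while values around 1 or higher mean the terms are unrelated.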

               max(log(f[x]), log(f[y])) - log(f[xy])
NGD(x, y) =  ------------------------------------------
               log(N) - min(log(f[x]), log(f[y]))


x - Query 1
y - Query 2
f[x] - Search Results Count of [Query 1]
f[y] - Search Results Count of [Query 2]
f[xy] - Search Results Count of [Query 1, Query 2], i.e. pages containing both queries

N - Total number of pages indexed by the search engine
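
As a rough illustration, the formula can be written as a small Python function. This is only a sketch: the function name `ngd`, its argument names, and the handling of a zero co-occurrence count are assumptions made for this example, and the counts in the usage lines are illustrative values, not live search results.

```python
import math


def ngd(fx: float, fy: float, fxy: float, n: float) -> float:
    """Normalized Google Distance computed from raw result counts.

    fx, fy -- result counts for Query 1 and Query 2 on their own
    fxy    -- result count for pages containing both queries
    n      -- total number of pages indexed by the search engine
    """
    if fx <= 0 or fy <= 0:
        raise ValueError("f[x] and f[y] must be positive")
    if n <= min(fx, fy):
        raise ValueError("N must be larger than the individual counts")
    if fxy <= 0:
        # Assumption for this sketch: terms that never co-occur
        # are treated as infinitely far apart.
        return math.inf

    log_fx, log_fy = math.log(fx), math.log(fy)
    numerator = max(log_fx, log_fy) - math.log(fxy)
    denominator = math.log(n) - min(log_fx, log_fy)
    return numerator / denominator


# Illustrative counts only (not fetched from any search engine).
distance = ngd(fx=46_700_000, fy=12_200_000, fxy=2_630_000, n=8_058_044_651)
print(round(distance, 3))  # 0.443
```

The base of the logarithm does not matter, because it cancels between the numerator and the denominator.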


Once the NGD has been calculated for every pair of queries, the values can be used to draw a similarity graph in which closely related queries are linked.
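
One possible way to build such a graph is sketched below: compute the NGD for every pair of queries and keep an edge only when the distance falls under a chosen threshold. The helper `similarity_edges`, the threshold value, and all of the counts are hypothetical and made up for this example; in practice the counts would come from a search engine.

```python
import math
from itertools import combinations


def ngd(fx, fy, fxy, n):
    """Normalized Google Distance from raw result counts."""
    if fxy <= 0:
        return math.inf  # assumption: no co-occurrence means unrelated
    log_fx, log_fy = math.log(fx), math.log(fy)
    return (max(log_fx, log_fy) - math.log(fxy)) / (math.log(n) - min(log_fx, log_fy))


def similarity_edges(counts, pair_counts, n, threshold):
    """Return (query1, query2, distance) edges with distance <= threshold."""
    edges = []
    for x, y in combinations(sorted(counts), 2):
        fxy = pair_counts.get(frozenset((x, y)), 0)
        d = ngd(counts[x], counts[y], fxy, n)
        if d <= threshold:
            edges.append((x, y, d))
    # Closest pairs first.
    return sorted(edges, key=lambda edge: edge[2])


# Made-up counts, for illustration only.
counts = {"cat": 1000, "dog": 1200, "piano": 800}
pair_counts = {
    frozenset(("cat", "dog")): 600,
    frozenset(("cat", "piano")): 10,
    frozenset(("dog", "piano")): 12,
}
N = 1_000_000

edges = similarity_edges(counts, pair_counts, N, threshold=0.5)
for x, y, d in edges:
    print(f"{x} -- {y}: {d:.3f}")
```

With these made-up numbers only "cat" and "dog" end up connected, because they co-occur far more often than either does with "piano". The resulting edge list can be passed to any graph-drawing or clustering tool.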

The Google Similarity Distance is useful in tasks such as automated machine learning, pattern recognition, and clustering of unknown objects.



Source : The Google Similarity Distance - IEEE
For full content : The Google Similarity Distance - PASCAL EPrints

Search Keywords: Google, Google Similarity Distance, Normalised Google Distance, Automated Machine Learning
