Clusty, a search engine that clusters its results
I intended to write about this four days ago, when I first saw this post at Google Blogoscoped, but I didn’t have the time until now. In it, Philipp Lenssen, the author of the blog, argues that “diverse Google results are good” because they cover the different intents behind a search: the same query can mean different things to two different users. He gives the example of the query [google blog], which might yield at least three different classes of results: “1) the user wants to see the official blog by Google Inc., 2) the user heard about Google’s blogging platform and wants to find Blogger.com, or 3) the user is looking for an independent blog covering Google.”
He goes on to make the case with other examples and concludes that, even though results can be diverse, there is an optimization problem: the ranking benefits the pages that do a better job at search engine optimization (particularly those whose owners are aware that it matters). In the comments, he also compares the overall results of other search engines, specifically Yahoo and Microsoft’s Live.com, and argues that Google’s results are better for the keywords [google blog].
I agree that, in a general sense, diverse results benefit a broader population of users, and it is in this context that I want to introduce Clusty, arguing that its results are even better than Google’s. This engine goes beyond crawling and indexing pages: it also queries other search engines, combines their results, and groups them into clusters. The idea of a cluster here is to maximize the similarity of the web pages within a cluster and the dissimilarity across different clusters, so that each cluster is as coherent as possible as a group of similar pages.
By doing so, Clusty makes the different classes of results more prominent and gives the user the opportunity to narrow the results down to the specific class he or she is looking for. Let’s go back to the [google blog] example. In Clusty, the query returns the different classes as expected (the official Google blog, the Google Blog Search tool, and several independent blogs about Google) and, more interestingly, lists the clusters it found on the left, allowing the user to specify which ‘kind’ of result he or she wants. Note that the engine uses a soft clustering algorithm, so each result can belong to more than one cluster for the same query.
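To make the "similar within, dissimilar across" idea concrete, here is a minimal sketch in Python. It is not Clusty's actual method, which I don't know: the snippets, the cluster names, the bag-of-words vectors, and the cosine similarity metric are all assumptions chosen for illustration. It simply measures the average similarity of pages inside the same cluster and compares it with the average similarity of pages from different clusters.

```python
import math
from collections import Counter
from itertools import combinations


def vectorize(text):
    """Turn a snippet into a bag-of-words vector (word -> count)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mean(values):
    values = list(values)
    return sum(values) / len(values) if values else 0.0


def cohesion_and_separation(clusters):
    """Return (mean similarity within clusters, mean similarity across clusters)."""
    vecs = {name: [vectorize(s) for s in docs] for name, docs in clusters.items()}
    within = mean(
        cosine(a, b)
        for docs in vecs.values()
        for a, b in combinations(docs, 2)
    )
    across = mean(
        cosine(a, b)
        for (_, docs1), (_, docs2) in combinations(vecs.items(), 2)
        for a in docs1
        for b in docs2
    )
    return within, across


# Hypothetical snippets for the [google blog] query, grouped by hand.
clusters = {
    "official": [
        "official google blog news",
        "google official blog announcements",
    ],
    "independent": [
        "independent blog covering google",
        "unofficial blog covering google news",
    ],
}

within, across = cohesion_and_separation(clusters)
print(f"within clusters: {within:.2f}, across clusters: {across:.2f}")
```

A good grouping is one where the first number is clearly higher than the second; a clustering algorithm is, roughly speaking, a search for the grouping that widens that gap.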
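The "soft" part can be sketched in a few lines of Python. Again, this is only an illustration under my own assumptions: the cluster names, the keyword sets, the result titles, and the simple word-overlap rule are all hypothetical and not how Clusty actually decides membership. The point is only that a result is assigned to every cluster it matches rather than to a single best one.

```python
def soft_assign(results, cluster_terms, min_overlap=1):
    """Map each result title to ALL clusters whose keywords it shares.

    A hard clustering would keep only one cluster per result; here a
    result may end up in several clusters, or in none at all.
    """
    assignments = {}
    for title in results:
        words = set(title.lower().split())
        assignments[title] = sorted(
            name
            for name, terms in cluster_terms.items()
            if len(words & terms) >= min_overlap
        )
    return assignments


# Hypothetical clusters and keywords for the [google blog] query.
cluster_terms = {
    "Official": {"official"},
    "Blog Search": {"search"},
    "Independent": {"unofficial", "independent"},
}

results = [
    "Official Google Blog",
    "Google Blog Search",
    "Unofficial blog on Google search news",
]

assignments = soft_assign(results, cluster_terms)
for title, names in assignments.items():
    print(title, "->", names)
```

In this toy data the last title lands in both the "Blog Search" and "Independent" clusters, which is the behavior described above: one result, more than one cluster.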

A more generic example highlights this feature even further. Take the query [star], for instance, which could mean that the user is looking for at least one of the following: 1) someone famous, 2) a constellation of some sort, 3) some band, or even 4) something related to Star Wars or Star Trek. These are all real clusters that Clusty presented for this query.
I haven’t been using this search engine on an everyday basis yet but, overall, it is quite powerful, seems to be very consistent, and has a lot of other features. To name a few related to the topic of this post: searching within a cluster, highlighting a page’s clusters, and narrowing the results by top- or second-level domain. I actually found myself playing with it for a while just to see how it organizes the different ‘kinds’ of pages into clusters, without any human intervention*.
*I’m assuming the engine is using the notion of cluster as an unsupervised learning process that groups documents based on their similarity (according to some metric). See the Wikipedia article on Data Clustering for more details.
Labels: Geek Talk, Information Retrieval
