Sunday, July 8, 2012


Scalable Clustering of News Search Results


1. INTRODUCTION 

Three types of clustering.
1. Offline Clustering
2. Incremental Clustering
3. Realtime Clustering

Standard Clustering Algorithms 

1. Hierarchical agglomerative clustering (HAC) algorithm
2. Partitional algorithms such as K-means

Various matrix factorization techniques

1. Singular value decomposition
2. Non-negative matrix factorization
3. Local non-negative matrix factorization
4. Concept decomposition

2. ARCHITECTURE OF THE SYSTEM

1. The offline batch clustering algorithm is run on a regular basis, multiple times a day, on the news corpus, which is limited to a full month of news articles.
2. The clustering algorithm assigns a cluster ID to every document presented to it.

3. OFFLINE CLUSTERING 

1. The similarity between all pairs of articles is first computed via Locality Sensitive Hashing (LSH).
2. Using the pairwise similarities produced by the LSH step, a similarity graph is then constructed over the corpus, in which a pair of articles shares an edge if their similarity meets a threshold.
3. A correlation clustering algorithm [1] is then run on top of the similarity graph to produce the final clustering (a high-level sketch of the pipeline follows).
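
To make the flow of the offline phase concrete, here is a rough, hypothetical sketch of the pipeline. The helper functions (minhash_signature, candidate_pairs, build_similarity_graph, pivot_correlation_clustering) are illustrative names sketched in the subsections below, not the paper's actual code, and duplicate removal is omitted for brevity.

def offline_clustering(article_features, article_vectors, threshold=0.3):
    # article_features: doc_id -> set of integer feature IDs (for Minhash/LSH)
    # article_vectors:  doc_id -> sparse TF-IDF vector (for cosine edge weights)
    signatures = {doc_id: minhash_signature(feats)
                  for doc_id, feats in article_features.items()}
    pairs = candidate_pairs(signatures)                     # LSH candidate pairs
    graph = build_similarity_graph(pairs, article_vectors, threshold)
    return pivot_correlation_clustering(graph)              # doc_id -> cluster ID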

Feature Vector Generation 


1. The different types of features that were extracted are:
       1. TF-IDF
       2. Wikipedia Topics
       3. Part of Speech Tagging

2. An important component of the LSH method is the generation of Minhash signatures for each article.
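
As a small illustration of the TF-IDF part of feature generation, the sketch below uses scikit-learn's TfidfVectorizer; the Wikipedia-topic and part-of-speech features would be added alongside these vectors and are not shown, and the parameter values are illustrative.

from sklearn.feature_extraction.text import TfidfVectorizer

def tfidf_features(article_texts):
    # Fit a TF-IDF model over the corpus and return sparse document vectors.
    vectorizer = TfidfVectorizer(stop_words="english", max_features=50000)
    matrix = vectorizer.fit_transform(article_texts)  # shape: (num_docs, num_terms)
    return vectorizer, matrix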

Minhash Signature Generation 


1. Minhash signatures are a succinct representation of each article, computed such that the probability that two articles have the same Minhash value equals the Jaccard similarity between them.
2. Minhash signatures can also be used to quickly detect exact duplicates.
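
A minimal sketch of Minhash signature generation, assuming each article is already represented as a set of integer feature IDs (e.g. hashed shingles or terms); the constants and names are illustrative, not the paper's implementation.

import random

NUM_HASHES = 100
MAX_HASH = (1 << 61) - 1  # large Mersenne prime used as the modulus

random.seed(42)
# Random affine hash functions h(x) = (a*x + b) mod MAX_HASH
PERMS = [(random.randrange(1, MAX_HASH), random.randrange(0, MAX_HASH))
         for _ in range(NUM_HASHES)]

def minhash_signature(feature_ids):
    # For each hash function, keep the minimum hash value over the article's features.
    return [min((a * f + b) % MAX_HASH for f in feature_ids) for a, b in PERMS]

def estimated_jaccard(sig1, sig2):
    # The fraction of agreeing positions estimates the Jaccard similarity of the two feature sets.
    return sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)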

Duplicate Detection 


1. If an article is a duplicate of another, it is unnecessary to include both in the clustering process.
2. One article per group of duplicates is sufficient to determine the cluster memberships of the rest.
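
One simple way to realize this, assuming exact duplicates share identical Minhash signatures, is to group articles by signature and keep a single representative per group (a sketch, not the paper's exact procedure):

from collections import defaultdict

def deduplicate(signatures):
    # signatures: doc_id -> Minhash signature (list of ints)
    groups = defaultdict(list)
    for doc_id, sig in signatures.items():
        groups[tuple(sig)].append(doc_id)
    representatives = {docs[0] for docs in groups.values()}
    # Map each dropped duplicate to the representative whose cluster it will inherit.
    duplicates_of = {d: docs[0] for docs in groups.values() for d in docs[1:]}
    return representatives, duplicates_of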

Locality Sensitive Hashing 


1. Locality Sensitive Hashing can be used to quickly eliminate pairs of articles that share very few features from the pairwise similarity computation.
2. A similarity graph is then constructed using the output of the LSH process.
3. Each article is represented as a node in the graph, and an edge exists between two nodes if their similarity exceeds the threshold.
4. The edges are also weighted by the cosine similarity measure (this step is sketched below).
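
A minimal sketch of the banding form of LSH over the Minhash signatures, followed by construction of the thresholded, cosine-weighted similarity graph. The band/row counts, the threshold value, and the sparse-vector representation are illustrative assumptions.

import math
from collections import defaultdict
from itertools import combinations

def candidate_pairs(signatures, bands=20, rows=5):
    # Split each signature into bands; articles that agree on every row of at
    # least one band land in the same bucket and become a candidate pair.
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].add(doc_id)
    pairs = set()
    for docs in buckets.values():
        pairs.update(combinations(sorted(docs), 2))
    return pairs

def cosine(u, v):
    # Cosine similarity for sparse vectors stored as {feature: weight} dicts.
    dot = sum(w * v.get(f, 0.0) for f, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def build_similarity_graph(pairs, vectors, threshold=0.3):
    # Keep an edge only when the cosine similarity meets the threshold;
    # the edge weight is that similarity.
    graph = defaultdict(dict)
    for u, v in pairs:
        sim = cosine(vectors[u], vectors[v])
        if sim >= threshold:
            graph[u][v] = graph[v][u] = sim
    return graph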

Correlation Clustering 

1. The similarity graph is then fed into a correlation clustering algorithm to partition the graph into clusters.
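
The paper cites a specific correlation clustering algorithm [1]; as an illustration only, here is the standard pivot (KwikCluster) heuristic, which partitions the similarity graph by repeatedly picking a random unclustered node and grouping it with its unclustered neighbours.

import random

def pivot_correlation_clustering(graph, seed=0):
    # graph: node -> {neighbour: weight}; only nodes with at least one edge appear,
    # so articles with no edges would each end up as singleton clusters elsewhere.
    rng = random.Random(seed)
    nodes = list(graph)
    rng.shuffle(nodes)
    assignment, cluster_id = {}, 0
    for pivot in nodes:
        if pivot in assignment:
            continue
        assignment[pivot] = cluster_id
        for neighbour in graph[pivot]:
            if neighbour not in assignment:
                assignment[neighbour] = cluster_id
        cluster_id += 1
    return assignment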

Evaluation 


4. INCREMENTAL CLUSTERING

1. The offline clustering phase produces a set of clusters of similar/relevant documents.
2. These clusters are then taken as ground truth for constructing a classifier for incremental clustering.
3. Offline clustering is usually scheduled to run at an interval of a couple of hours.
4. Three simple classifiers are considered for incremental clustering:


Static: A standard classifier that must be re-trained on the latest set of documents at a shorter time interval.

Semi-adaptive: A classifier that is capable of creating a new class for news articles that are not close “enough” to any of the existing clusters. The closeness threshold is a value that requires careful tuning.

Fully-adaptive: A classifier that is not only capable of creating new classes but also of updating itself to cope with the evolution of the news in the existing clusters. In addition, the classifier is able to remove classes corresponding to “submerging” stories. Compared to the semi-adaptive classifier, this classifier is more flexible but more likely to suffer from sub-optimality, as it is sensitive to the order in which news articles arrive at the system.

5. As long as offline clustering can be carried out frequently enough, the static classifier is perhaps the best choice, as it is simple to implement, easy to deploy, and requires no maintenance effort. Otherwise, the semi- or fully-adaptive classifiers, which are harder to maintain, can be used instead (the semi-adaptive idea is sketched below).
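
As a rough illustration of the semi-adaptive idea (not the paper's actual classifier), a nearest-centroid classifier can open a new cluster whenever the best similarity falls below a tuned threshold; the centroid representation and the threshold value are assumptions.

import numpy as np

class SemiAdaptiveCentroidClassifier:
    def __init__(self, centroids, threshold=0.4):
        # centroids: integer cluster_id -> unit-normalised vector from the offline clustering
        self.centroids = dict(centroids)
        self.threshold = threshold
        self.next_id = max(self.centroids, default=-1) + 1

    def assign(self, vector):
        vector = vector / (np.linalg.norm(vector) + 1e-12)
        best_id, best_sim = None, -1.0
        for cid, centroid in self.centroids.items():
            sim = float(vector @ centroid)
            if sim > best_sim:
                best_id, best_sim = cid, sim
        if best_sim < self.threshold:
            # Not close "enough" to any existing cluster: create a new one.
            best_id = self.next_id
            self.centroids[best_id] = vector
            self.next_id += 1
        return best_id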

5. REALTIME CLUSTERING 


1. Realtime clustering is vital for adjusting the granularity of the clustering at query time.
2. In our experiments we used hierarchical agglomerative clustering (HAC) to incorporate these similarity measures and compared the proposed similarity measures with the standard cosine similarity (a sketch of HAC over a result set follows).
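
A minimal sketch of query-time HAC over the documents in a result set, using SciPy's average-linkage clustering with cosine distance as a stand-in for the similarity measures discussed here; the cut threshold is an illustrative value that controls the granularity.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def realtime_hac(doc_vectors, distance_threshold=0.5):
    # doc_vectors: dense (num_docs, num_features) array for the search results.
    X = np.asarray(doc_vectors, dtype=float)
    Z = linkage(X, method="average", metric="cosine")
    # Cut the dendrogram at the given cosine distance to obtain flat cluster labels.
    return fcluster(Z, t=distance_threshold, criterion="distance")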

Meta-Clustering and Textual Matching 





1. BM25 [14] is used as a proxy for textual matching between the query and the document.
2. Given a query q and a document d, the BM25 score can be computed in its standard form as

   BM25(q, d) = sum over terms t in q of  IDF(t) * tf(t, d) * (k1 + 1) / ( tf(t, d) + k1 * (1 - b + b * |d| / avgdl) )

where tf(t, d) is the frequency of term t in d, |d| is the length of d, avgdl is the average document length in the corpus, and k1 and b are free parameters.
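
The same formula as a small Python sketch (one common BM25 variant; the parameter defaults k1 = 1.2 and b = 0.75 are illustrative):

import math

def bm25_score(query_terms, doc_terms, doc_freq, num_docs, avg_doc_len,
               k1=1.2, b=0.75):
    # doc_freq: term -> number of documents in the corpus containing the term.
    doc_len = len(doc_terms)
    tf = {}
    for t in doc_terms:
        tf[t] = tf.get(t, 0) + 1
    score = 0.0
    for t in query_terms:
        if t not in tf:
            continue
        idf = math.log(1.0 + (num_docs - doc_freq[t] + 0.5) / (doc_freq[t] + 0.5))
        norm = tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * doc_len / avg_doc_len))
        score += idf * norm
    return score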

6. OFFLINE CLUSTERING PERFORMANCE 