Given a set of documents, a common task is to group them according to topic or subject. A human agent can create a hierarchy of subjects and assign each document to its related subject.
However, a clustering algorithm can create this structure automatically and often more precisely. We can apply hierarchical clustering algorithms to group the documents and build the hierarchy among them at the same time.
We don’t need to define the subjects or the topic hierarchy beforehand. Instead, the clustering process discovers them.
Still, human agents need to provide some inputs to the algorithm:
- a criterion to measure the similarity between documents (a distance function)
- a level at which to cut the hierarchy, or a minimum cluster size
- labels for the discovered clusters
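
To make these inputs concrete, here is a minimal sketch in plain Python. It is only an illustration under several assumptions that are not part of the text above: documents are represented as bag-of-words counts, the distance function is cosine distance, clusters are merged with average linkage, and the hierarchy is cut with a distance threshold. The names `cosine_distance`, `cluster_documents`, and `cut_distance` are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def cosine_distance(a: Counter, b: Counter) -> float:
    """Return 1 - cosine similarity of two bag-of-words vectors (assumed metric)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat an empty document as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def cluster_documents(docs, cut_distance):
    """Agglomerative clustering with average linkage (an assumed choice).

    Merging stops once the two closest clusters are farther apart than
    cut_distance, which plays the role of the level where we cut the hierarchy.
    Returns a list of clusters, each a sorted list of document indices.
    """
    vectors = [Counter(doc.lower().split()) for doc in docs]
    clusters = [[i] for i in range(len(docs))]  # start with one cluster per document

    def linkage(c1, c2):
        # Average pairwise distance between the members of two clusters.
        total = sum(cosine_distance(vectors[i], vectors[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Find the closest pair of clusters.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        if linkage(clusters[i], clusters[j]) > cut_distance:
            break  # remaining clusters are too dissimilar to merge
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]

    return [sorted(c) for c in clusters]


docs = [
    "the cat chased the mouse",
    "the cat and the mouse played",
    "stock markets fell sharply today",
    "stock markets rose today",
]
clusters = cluster_documents(docs, cut_distance=0.6)
print(clusters)
```

With this toy input, the two documents about the cat end up in one cluster and the two about stock markets in another. Naming those clusters (for example, "animals" and "finance") is still left to a person, which corresponds to the third input in the list. A lower `cut_distance` yields more, smaller clusters, and a higher one yields fewer, larger clusters.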