Week 5 & 6: Finding k in k-means, Dataset 🔬
Clustering Word Embeddings
This week, my primary focus was on clustering word embeddings to identify similar words within a dataset. The challenge was determining the optimal number of clusters, k, since it was not known beforehand. To address this, I employed three common methods: the Elbow Method, the Silhouette Score, and the Gap Statistic. Here’s a detailed breakdown of each approach:
1. Elbow Method: The Elbow Method involves running k-means clustering on the dataset for a range of k values and plotting the within-cluster sum of squares (WCSS) against the number of clusters. The optimal k sits at the ‘elbow point’ of the curve, where the WCSS begins to decrease at a noticeably slower rate. I implemented this method in two steps (a minimal sketch follows the list below):
- Converted the text data into TF-IDF features.
- Computed the WCSS for different values of k and plotted the results to locate the elbow point.
2. Silhouette Score: The Silhouette Score measures how similar a point is to its own cluster compared to other clusters. A high average score indicates well-separated, well-defined clusters. I calculated silhouette scores for a range of k values and plotted them to determine the optimal number of clusters.
3. Gap Statistic: The Gap Statistic compares the total within-cluster variation for different numbers of clusters with its expected value under a null reference distribution of the data (also sketched after this list). I used this method to statistically validate the optimal number of clusters.
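To make the first method concrete, here is a minimal Elbow Method sketch along the lines described above. The sample sentences and the k range are illustrative placeholders, not the project dataset:

# Elbow Method sketch: plot WCSS (scikit-learn's `inertia_`) against k and look
# for the bend in the curve. The documents below are placeholders for the real data.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

documents = ["the cat sat on the mat", "dogs chase cats", "stocks fell sharply",
             "the market rallied today", "cats and dogs are pets", "bond yields rose"]

# Step 1: convert the text data into TF-IDF features
X = TfidfVectorizer().fit_transform(documents)

# Step 2: compute WCSS for a range of k values and plot it
wcss = []
k_values = range(2, 6)   # illustrative range; the real scan was wider
for k in k_values:
    kmeans = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=42)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)   # within-cluster sum of squares

plt.plot(list(k_values), wcss, marker='o')
plt.xlabel('Number of clusters k')
plt.ylabel('WCSS')
plt.title('Elbow Method')
plt.show()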
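The Gap Statistic has no ready-made scikit-learn implementation, so here is a simplified sketch of my understanding of the procedure: compare log(WCSS) on the real data with log(WCSS) on uniform reference samples drawn from the bounding box of the features. The standard-error correction from the original paper is omitted, and all variable names and sizes are illustrative:

# Simplified Gap Statistic sketch: gap(k) = mean_b[log(W_kb*)] - log(W_k), where
# W_k is the WCSS on the data and W_kb* the WCSS on the b-th uniform reference
# sample drawn from the data's bounding box. A larger gap suggests a better k.
import numpy as np
from sklearn.cluster import KMeans

def gap_statistic(X, k_values, n_refs=10, random_state=42):
    rng = np.random.default_rng(random_state)
    mins, maxs = X.min(axis=0), X.max(axis=0)
    gaps = []
    for k in k_values:
        # WCSS on the real data
        log_wk = np.log(KMeans(n_clusters=k, n_init=10,
                               random_state=random_state).fit(X).inertia_)
        # Expected WCSS under the null (uniform) reference distribution
        ref_log_wks = []
        for _ in range(n_refs):
            ref = rng.uniform(mins, maxs, size=X.shape)
            ref_log_wks.append(np.log(KMeans(n_clusters=k, n_init=10,
                                             random_state=random_state).fit(ref).inertia_))
        gaps.append(np.mean(ref_log_wks) - log_wk)
    return gaps

X = np.random.default_rng(0).normal(size=(60, 20))   # placeholder embeddings
k_values = range(2, 8)
gaps = gap_statistic(X, k_values)
print("best k by gap statistic:", list(k_values)[int(np.argmax(gaps))])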
After determining the optimal number of clusters using these methods, I performed k-means clustering on the word embeddings. Of the three, the Silhouette Score gave the best output, so here is the summarized code for the clustering process based on it:
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def cluster_embeddings(embeddings):
    # Scan candidate k values (silhouette needs 2 <= k <= n_samples - 1)
    silhouette_scores = []
    for k in range(2, len(embeddings)):
        kmeans = KMeans(n_clusters=k, init='k-means++', max_iter=300, n_init=10, random_state=42)
        kmeans.fit(embeddings)
        score = silhouette_score(embeddings, kmeans.labels_)
        silhouette_scores.append(score)

    # The scan starts at k=2, so shift the argmax index back up by 2
    optimal_k = np.argmax(silhouette_scores) + 2

    # Refit with the best k and return the final clustering
    kmeans = KMeans(n_clusters=optimal_k, init='k-means++', max_iter=300, n_init=10, random_state=42)
    kmeans.fit(embeddings)
    cluster_centers = kmeans.cluster_centers_
    cluster_labels = kmeans.labels_
    return cluster_centers, cluster_labels
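For context, here is a hypothetical usage sketch; `word_vectors` and `words` are illustrative stand-ins for the project’s actual embedding matrix and vocabulary:

# Illustrative stand-ins: a random (n_words, dim) matrix and a toy vocabulary.
import numpy as np

rng = np.random.default_rng(0)
word_vectors = rng.normal(size=(50, 100))
words = [f"word_{i}" for i in range(50)]

centers, labels = cluster_embeddings(word_vectors)
for word, label in zip(words, labels):
    print(word, "-> cluster", label)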
The optimal number of clusters was determined, and the cluster labels were printed for further analysis.
Generating 100 Triples with Textual Predicates
In addition to clustering embeddings, I generated 100 triples with textual predicates that are not available in the DBpedia ontology. This task involved creating subject-predicate-object triples where the predicate is a textual description not found in the existing DBpedia ontology. The purpose was to enrich the ontology with new, meaningful relationships extracted from various textual sources. I achieved this by passing candidate predicate strings into a SPARQL query and checking whether they already exist in the DBpedia ontology (https://dbpedia.org/ontology/); a sketch of this check follows below.
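Here is a minimal sketch of that check, assuming the SPARQLWrapper library and the public DBpedia endpoint; the second predicate name is a made-up example, not one from the dataset:

# Ask DBpedia whether a candidate predicate IRI is already used in the dbo: namespace.
from SPARQLWrapper import SPARQLWrapper, JSON

def predicate_in_dbpedia(local_name: str) -> bool:
    sparql = SPARQLWrapper("https://dbpedia.org/sparql")
    sparql.setReturnFormat(JSON)
    # ASK returns a single boolean: does any triple use this predicate IRI?
    sparql.setQuery(f"ASK {{ ?s <http://dbpedia.org/ontology/{local_name}> ?o }}")
    return sparql.query().convert()["boolean"]

print(predicate_in_dbpedia("birthPlace"))           # expected True: an existing dbo property
print(predicate_in_dbpedia("composedTheMusicFor"))  # likely False: a candidate textual predicate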
The generation of these triples involved:
- Extracting meaningful relationships from textual data.
- Ensuring the predicates were unique and not present in the existing DBpedia ontology.
- Structuring the triples in a standardized format for easy integration into the ontology.
Scaling the Pipeline
While attempting to generate a large number of triples from the dataset’s sentences, I decided to scale the pipeline using Spark NLP. This approach would improve the efficiency and speed of the pipeline, enabling it to process a substantial volume of triples from various articles. However, I ran into a Java dependency conflict in the notebook environment, which prevented Spark NLP from running. I still need to work out how to integrate Spark NLP within the notebook, since it requires a compatible Java runtime.
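As a first debugging step, here is a minimal environment-check sketch; it assumes the spark-nlp and pyspark Python packages are installed, and the Java requirement noted in the comment reflects my understanding rather than something verified in this notebook:

# Check the Java setup before starting Spark NLP (which runs on the JVM, so a
# compatible Java runtime must be visible to the notebook).
import os
import subprocess

import sparknlp  # assumes the spark-nlp and pyspark packages are installed

# `java -version` prints to stderr; an error here points to a missing or incompatible JDK
print(subprocess.run(["java", "-version"], capture_output=True, text=True).stderr)
print("JAVA_HOME =", os.environ.get("JAVA_HOME"))

spark = sparknlp.start()  # starts a SparkSession configured for Spark NLP
print("Spark NLP version:", sparknlp.version())
print("Apache Spark version:", spark.version)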
Thank you.