This material was a teaching aid for a crash course I gave at work on cosine similarity. Cosine similarity is a blunt instrument for comparing two texts: if the two texts have many words in common, they are assumed to be similar. The ultimate goal is to plug two texts into a function and get out an easy-to-understand number that describes how similar they are, and cosine similarity is one way to skin that cat.
Please note that there are plenty of much faster implementations of cosine similarity; this one was written for educational purposes.
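To make the idea concrete before the verbose version, here is a minimal sketch that compares two tiny texts using raw word counts. The function name quick_cosine and the sample sentences are made up for illustration; this is not the implementation used in the rest of the post, and it does no stop-word filtering.

```python
from collections import Counter
import math
import re

def quick_cosine(text1, text2):
    """Cosine similarity of two texts using raw word counts (illustrative only)."""
    a = Counter(re.findall(r"\w+", text1.lower()))
    b = Counter(re.findall(r"\w+", text2.lower()))
    # Only words present in both texts contribute to the dot product.
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(sum(c * c for c in b.values()))
    # Guard against empty texts, where the norm is zero.
    return dot / norm if norm else 0.0

print(quick_cosine("the cat sat", "the cat ran"))   # two of three words shared
print(quick_cosine("the cat sat", "a dog barked"))  # no shared words
```

The first pair shares two of its three words and scores about 0.67; the second pair shares nothing and scores 0.0.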
Stop Words
One way to improve your results from cosine similarity is to clean the text first. This involves removing punctuation, making all of the letters the same case, and removing stop words, which are words that we know, or assume, are not of interest. Different applications require different sets of stop words, and different sets of stop words can produce wildly different results in terms of text similarity. Here is one set that I found reading this blog.
    determiners = ["a", "an", "another", "any",
                   "certain", "each", "every",
                   "her", "his", "its", "my",
                   "no", "not", "our", "some", "that",
                   "the", "their", "this", "your"]

    coordinating_conjunctions = ["and", "but", "or", "yet", "for", "nor", "so"]

    prepositions = ["as", "aboard", "about", "above",
                    "across", "after", "against",
                    "along", "around", "at", "before",
                    "behind", "below", "beneath",
                    "beside", "between", "beyond", "but",
                    "by", "down", "during", "except",
                    "following", "for", "from", "in",
                    "inside", "into", "like", "minus",
                    "near", "next", "of", "off", "on",
                    "onto", "opposite", "out", "outside",
                    "over", "past", "plus", "round", "since",
                    "than", "through", "to", "toward",
                    "under", "underneath", "unlike",
                    "until", "up", "upon", "with", "without"]

    stop_words = determiners + coordinating_conjunctions + prepositions
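The cleaning steps described above (strip punctuation, lowercase, drop stop words) can be sketched on a single sentence. The names demo_clean and demo_stop_words are hypothetical, and the stop list here is deliberately tiny so the effect is easy to see; it is not the list from this post.

```python
import re

# Hypothetical, deliberately tiny stop list for illustration only.
demo_stop_words = {"the", "a", "of", "to"}

def demo_clean(text, stop_words=None):
    """Lowercase, strip punctuation, and optionally drop stop words."""
    # \w+ keeps runs of letters, digits, and underscores, so punctuation is discarded.
    words = [w.lower() for w in re.findall(r"\w+", text)]
    if stop_words is not None:
        words = [w for w in words if w not in stop_words]
    return words

sentence = "The law of life: change is the law."
print(demo_clean(sentence))
print(demo_clean(sentence, demo_stop_words))
```

Without the stop list all eight words survive; with it, "the" and "of" drop out and only "law", "life", "change", "is", "law" remain. Swapping in a different stop list changes which words are left to be counted, which is why the choice can move the final score so much.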
A Verbose Implementation
This implementation of cosine similarity was adapted from this SO post. I added stop words and lots of print statements.
    from collections import Counter
    import math
    import re

    WORD = re.compile(r'\w+')

    def clean_text( text, stop_words=None ):
        # Echo the raw text, then split into lowercase words without punctuation
        print("\nINPUT\n")
        print(text)
        words = WORD.findall( text )
        words = [ word.lower() for word in words ]
        if stop_words is not None:
            words = [ word for word in words if word not in stop_words ]
        words.sort()
        return words

    def text_to_vector( text, stop_words=None ):
        # The "vector" is a mapping from each word to its count
        vector = Counter( clean_text( text, stop_words ) )
        print("\nWORD VECTOR\n")
        # sorted() runs on Python 2 and 3; calling .sort() on keys() fails on Python 3
        for key in sorted(vector):
            print("{} <- {}".format(vector[key],key))
        print("\n"+"-"*5)
        return vector

    def verbose_cosine( text1, text2, stop_words=None ):
        vector1 = text_to_vector( text1, stop_words )
        vector2 = text_to_vector( text2, stop_words )

        # Only words that appear in both texts contribute to the numerator
        intersection = sorted( set(vector1.keys()) & set(vector2.keys()) )
        print("\nINTERSECTION\n")
        for item in intersection:
            print("{} | {} <- {}".format(vector1[item], vector2[item], item))
        print("\n"+"-"*5)

        numerator = sum([vector1[i] * vector2[i] for i in intersection])
        print("\nNUMERATOR: sum of the products of the counts in the intersection\n")
        print(numerator)
        print("\n"+"-"*5)

        sum1 = sum([vector1[i]**2 for i in vector1.keys()])
        sum2 = sum([vector2[i]**2 for i in vector2.keys()])
        denominator = math.sqrt(sum1) * math.sqrt(sum2)
        print("\nDENOMINATOR: product of the square roots of the sum of the squares of the counts\n")
        print(denominator)
        print("\n"+"-"*5)

        # An empty text gives a zero denominator; report zero similarity
        similarity = None
        if not denominator:
            similarity = 0.0
        else:
            similarity = float(numerator) / denominator
        print("\n{}".format( similarity ))
        return similarity
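Note that this implementation outputs $\cos\theta$, which in general takes values between -1 and 1, but some users want to report just the angle $\theta$, which would range from 0 to $\pi$. Because word counts are never negative, the values produced here actually fall between 0 and 1, so $\theta$ ranges from 0 to $\pi/2$.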
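If the angle is what you want to report, the similarity can be converted with the standard library's math.acos. The helper name similarity_to_angle is hypothetical, and the clamping step is my own precaution against floating-point rounding, not something the original implementation does.

```python
import math

def similarity_to_angle(similarity):
    """Convert a cosine similarity to the angle theta, in radians."""
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    clamped = max(-1.0, min(1.0, similarity))
    return math.acos(clamped)

print(similarity_to_angle(1.0))  # same direction: angle of zero
print(similarity_to_angle(0.0))  # no shared words: pi / 2
```

Use math.degrees() on the result if degrees read more naturally to your audience.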
A Quick Example
Here’s a tiny data set of quotes, stored as fdr, jfk, and djt.
    fdr = "The only thing we have to fear is fear itself. The test of our progress is not whether we add more to the abundance of those who have much; it is whether we provide enough for those who have too little. It is common sense to take a method and try it. If it fails, admit it frankly and try another. But above all, try something."

    jfk = "My fellow Americans, ask not what your country can do for you, ask what you can do for your country. Forgive your enemies, but never forget their names. Change is the law of life. And those who look only to the past or present are certain to miss the future."

    djt = "All of the women on The Apprentice flirted with me - consciously or unconsciously. That's to be expected. What separates the winners from the losers is how a person reacts to each new twist of fate. Sometimes your best investments are the ones you don't make."
Using this data and the stop words above, verbose_cosine() should produce the following:
    verbose_cosine( fdr, jfk, stop_words )

    INPUT

    The only thing we have to fear is fear itself. The test of our progress is not whether we add more to the abundance of those who have much; it is whether we provide enough for those who have too little. It is common sense to take a method and try it. If it fails, admit it frankly and try another. But above all, try something.

    WORD VECTOR

    1 <- abundance
    1 <- add
    1 <- admit
    1 <- all
    1 <- common
    1 <- enough
    1 <- fails
    2 <- fear
    1 <- frankly
    3 <- have
    1 <- if
    4 <- is
    5 <- it
    1 <- itself
    1 <- little
    1 <- method
    1 <- more
    1 <- much
    1 <- only
    1 <- progress
    1 <- provide
    1 <- sense
    1 <- something
    1 <- take
    1 <- test
    1 <- thing
    2 <- those
    1 <- too
    3 <- try
    3 <- we
    2 <- whether
    2 <- who

    -----

    INPUT

    My fellow Americans, ask not what your country can do for you, ask what you can do for your country. Forgive your enemies, but never forget their names. Change is the law of life. And those who look only to the past or present are certain to miss the future.

    WORD VECTOR

    1 <- americans
    1 <- are
    2 <- ask
    2 <- can
    1 <- change
    2 <- country
    2 <- do
    1 <- enemies
    1 <- fellow
    1 <- forget
    1 <- forgive
    1 <- future
    1 <- is
    1 <- law
    1 <- life
    1 <- look
    1 <- miss
    1 <- names
    1 <- never
    1 <- only
    1 <- present
    1 <- those
    2 <- what
    1 <- who
    2 <- you

    -----

    INTERSECTION

    4 | 1 <- is
    1 | 1 <- only
    2 | 1 <- those
    2 | 1 <- who

    -----

    NUMERATOR: sum of the products of the counts in the intersection

    9

    -----

    DENOMINATOR: product of the square roots of the sum of the squares of the counts

    67.8306715284

    -----

    0.13268333922104286
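As a sanity check, the numbers in this output can be reproduced by hand from the two WORD VECTOR listings. The sketch below only redoes the arithmetic; the variable names are mine, and the tallies in the comments were read off the FDR and JFK listings above.

```python
import math

# Sums of squared counts, read off the two WORD VECTOR listings.
# FDR: 23 words appear once, 4 twice, 3 three times, "is" 4 times, "it" 5 times.
fdr_sum_sq = 23 * 1**2 + 4 * 2**2 + 3 * 3**2 + 4**2 + 5**2
# JFK: 19 words appear once, 6 twice.
jfk_sum_sq = 19 * 1**2 + 6 * 2**2

# Products of counts for the shared words: is, only, those, who.
numerator = 4 * 1 + 1 * 1 + 2 * 1 + 2 * 1

denominator = math.sqrt(fdr_sum_sq) * math.sqrt(jfk_sum_sq)
print(fdr_sum_sq, jfk_sum_sq)   # 107 43
print(numerator)                # 9
print(denominator, numerator / denominator)
```

The sums of squares come to 107 and 43, so the denominator is the square root of 107 times the square root of 43, which agrees with the 67.83 shown above, and 9 divided by that gives the reported similarity of roughly 0.133.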