Doc2vec with Gizmodo articles


As part of a Cornell Tech project with Chartbeat last semester, I used Python’s Scrapy to scrape articles, along with all their comments, from gizmodo.com. For this blog post, I scraped 8,000 articles from gizmodo.com, trained a doc2vec model, and reduced the resulting vectors to make neat 2D and 3D visualizations. I wanted to see whether the reduced doc2vec vectors would fall into clusters that “made sense”.

Doc2vec, an extension of word2vec, is an unsupervised learning method that learns fixed-length vector representations of longer chunks of text (“docs”). Doc2vec uses the same single-hidden-layer neural network architecture as word2vec, but also takes into account whatever “doc” you are using: it uses the same context window as word2vec, but concatenates the doc vector with the word vectors in that context window when predicting the target word. The diagram from Mikolov et al. below shows this architecture.

[Figure: doc2vec architecture diagram from Mikolov et al.]

Just like word2vec, this method learns the semantics of the text rather than just grouping documents that share the same words. I used gensim’s implementation, whose documentation can be seen here. I used each Gizmodo article’s text as a document and trained the doc2vec model on 8,000 documents.
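As a rough sketch of that training step — assuming the scraped articles are available as a list of (title, text) pairs called `articles`, which is my naming, not the original code’s — gensim’s API looks roughly like this:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.utils import simple_preprocess

# One TaggedDocument per article, tagged with its title so its vector can
# be looked up later. `articles` is a hypothetical list of (title, text)
# pairs produced by the scraper.
documents = [TaggedDocument(words=simple_preprocess(text), tags=[title])
             for title, text in articles]

# 300 dimensions to match the vectors reduced with tSNE later; the other
# hyperparameters here are guesses, not the values used for this post.
model = Doc2Vec(documents, vector_size=300, window=5, min_count=5,
                workers=4, epochs=20)
```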

After training doc2vec, I did a couple of manual checks to see whether the articles deemed “similar” by cosine similarity (the closer the score is to 1, the closer the two article vectors are to each other) were actually related, but I noticed that most articles did not have high similarity scores. I plotted the distribution of cosine similarity scores between all articles below; most articles had a score of .4 or .5.

[Figure: histogram of pairwise cosine similarity scores across all articles.]
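A minimal way to reproduce a histogram like that, assuming `model` is the trained doc2vec model from the sketch above (gensim 4 naming; older releases expose the document vectors through `model.docvecs` instead):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics.pairwise import cosine_similarity

# Stack every article's learned vector into one matrix.
vecs = np.array([model.dv[tag] for tag in model.dv.index_to_key])
sims = cosine_similarity(vecs)

# Keep the upper triangle so each pair is counted once, without self-pairs.
scores = sims[np.triu_indices_from(sims, k=1)]

plt.hist(scores, bins=50)
plt.xlabel("cosine similarity between article pairs")
plt.ylabel("count")
plt.show()
```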

As an example, Drunk Guy Arrested for Kicking a Pepper Robot matched with You Could Creepily Dress Your Pepper Robot Up Like a Doll with a score of .49 (both articles are about the Pepper Robot). Occasionally a pair scored higher, like Nikon D810 Review The Ultimate Adventure Camera matching with Sony A7S Adventure Tested Ultimate Low-Light Mirrorless with a score of .77. The titles alone don’t make that pair look especially similar, but the two articles are structured and written almost identically.

I tried different methods to reduce the 300-dimensional vectors, and tSNE worked best. Unlike methods such as MDS, tSNE doesn’t preserve distances directly; it converts similarities between vectors into probabilities and then preserves those probabilities. In general, tSNE did not preserve the correct similarities, but it did preserve similarities between articles that had high similarity scores. I think tSNE didn’t do well on all articles because the articles don’t separate cleanly into categories. This is unlike, say, MNIST data, which works well with tSNE because the data noticeably falls into 10 clusters.
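For reference, here is a minimal tSNE reduction with scikit-learn, as a stand-in for whatever implementation was actually used (the perplexity value is an assumption):

```python
from sklearn.manifold import TSNE

# Reduce the 300-dimensional article vectors to 2D. The cosine metric
# matches how article similarity was measured above; perplexity is a guess.
tsne = TSNE(n_components=2, metric="cosine", perplexity=30, random_state=0)
coords_2d = tsne.fit_transform(vecs)
```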

For the purposes of this visualization, I used only 1,000 articles. At first I chose these articles randomly, but after tSNE the visualizations looked like blobs, and articles weren’t next to their true 300-dimensional neighbors. So I found each article’s most similar pair and kept only the articles with the top 1,000 similarity scores. In this case, tSNE preserved the highest similarities and actually revealed some clusters in two and three dimensions! The visualizations are shown below. To make them, I adapted code from the talented Christopher Olah’s blog post on MNIST.
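One way that filtering step could look, reusing the `sims` matrix from the histogram sketch (this is my reconstruction, not the original code):

```python
import numpy as np
from sklearn.manifold import TSNE

# Score each article by its single best match, ignoring self-similarity.
np.fill_diagonal(sims, -1.0)
best_scores = sims.max(axis=1)

# Keep the 1,000 articles with the strongest best match, then rerun tSNE
# on just those vectors.
top_idx = np.argsort(best_scores)[-1000:]
coords_2d = TSNE(n_components=2, metric="cosine",
                 random_state=0).fit_transform(vecs[top_idx])
```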

Gizmodo articles come with meta tags that include “keywords”. I named each cluster using the keywords that appear most often within it; all keywords are listed in order of frequency (high to low). The 2D keywords are listed below in their respective colors. Clustering in 3D revealed slightly different categories, which are listed above the 3D graphic.
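Naming a cluster this way boils down to a frequency count, roughly like the following (`keywords_by_article` and `cluster_tags` are hypothetical names for the scraped meta-tag data and the article tags in one cluster):

```python
from collections import Counter

# Count every keyword attached to the articles in one cluster, then label
# the cluster with its most frequent keywords.
counts = Counter(kw for tag in cluster_tags
                 for kw in keywords_by_article[tag])
print([kw for kw, _ in counts.most_common(5)])
```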

(Click a point to view similar articles. You can zoom in and out of the graphic.)

[Interactive 2D visualization]

3D keywords are listed below in their respective colors:

(Drag your mouse over the points to see the closest articles. You can also zoom in and spin the hairball around.)

[Interactive 3D visualization]

All my code is here. The code for the scraper is here. In the future, I’d like to see how training on more data would affect the performance of doc2vec and tSNE. Because I originally built the scraper to extract comments from articles, I’d also like to see whether a particular user tends to comment on articles that doc2vec deems similar.

Inna Shteinbuk