Sentiment analysis of stock tweets

Having previously wired up a simple Spring app to consume Twitter's stream of tweets about last year's Rugby World Cup (mostly just to experiment with the event-driven programming model in Spring and Reactor), I thought, on a whim, why not see if I can find some nice sentiment analysis tools to analyse the tweets? That way, rather than just counting the tweets about a given topic, I could also analyse whether they were positive or not.
Now, that probably sounded like a fairly glib comment. And to be honest, it was: sentiment analysis is very hard, and the last time I looked most efforts were not up to much. Added to that, to make it actually effective, you need some pretty specific training data - for example, a model trained on this blog and then applied to another sort of text, say tweets, is most likely not going to perform well. Tweets are particularly different, as people use different language, grammar and colloquialisms on Twitter (in part due to the 140-character limit) compared to normal writing.
But still, I had my laptop on my commute home on the train, so I figured why not see if there are any simple sentiment analysis libraries that I could just drop in and run the tweets through. Sure, the resulting scores would likely be way off, but it would be an interesting experiment to see how easy it was (and, once done, we could then look for a decent training set to re-train the model so it was more accurate at analysing tweets).
A quick google later and I came across Stanford's CoreNLP (Natural Language Processing) library, via the snappily titled "Twitter Sentiment Analysis in less than 100 lines of code!" (which seemed just as flippant as my original suggestion, so seemed like a good fit!). Surprisingly, it was actually just as easy as I had hoped it might have been! The libraries are nicely available in the Maven repo, come with a pre-trained model (albeit trained on film reviews) and are written in Java. A lot of the code is taken from the approach outlined in the above article and the Stanford CoreNLP sample class, but it's pretty simple, and having set it all up on my commute I managed to process a few thousand tweets last night and analyse their sentiment (producing wildly inaccurate sentiment scores - but who's to know, right?!).
(I switched to streaming stock-related tweets - mostly just so I could include references to Eddie Murphy in Trading Places)
Updating our dependencies

I will skip the normal app setup and Twitter connection stuff, as I was just building this on top of the app I had previously done for the RWC (which already connected to the Twitter streaming API and persisted info to Redis).
All we need to do here is add the two Stanford dependencies. I also added a dependency for Twitter's open-source library, which provides tweet cleanup/processing stuff and is really just used here to extract "cashtags" (like a hashtag, but starting with a $, used on Twitter to indicate stock symbols, e.g. $GOOGL).
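To make the cashtag idea concrete, here is a minimal sketch of the kind of extraction involved. It is a plain-regex stand-in, not the Twitter library's actual rules: the class name `CashtagSketch`, the method name and the pattern (a `$` followed by one to six letters) are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical, simplified stand-in for a cashtag extractor (not the Twitter library's rules).
class CashtagSketch {

    // Assumption: a cashtag is '$' plus 1-6 letters, not glued to a preceding
    // word character or '$', and not followed by another letter or digit.
    private static final Pattern CASHTAG =
            Pattern.compile("(?<![\\w$])\\$([A-Za-z]{1,6})(?![A-Za-z0-9])");

    // Returns the symbols found in the tweet, upper-cased and without the '$'.
    static List<String> extractCashtags(String tweet) {
        List<String> symbols = new ArrayList<>();
        Matcher matcher = CASHTAG.matcher(tweet);
        while (matcher.find()) {
            symbols.add(matcher.group(1).toUpperCase(Locale.ROOT));
        }
        return symbols;
    }
}
```

Under these assumptions, a tweet mentioning "$GOOGL" and "$aapl" yields both symbols, while a price such as "$20" is ignored because no letter follows the dollar sign.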
Spring configuration

Next up, as we are using Spring, it's super easy to add the configuration so that Spring manages our Stanford NLP objects and injects them into the service class that will hold the code to analyse the sentiment.
Now we have told Spring to manage the main Stanford class we need and the simple Twitter Extractor class. For the StanfordCoreNLP class we are passing in some properties for what text analysis we want to use (this can usually be done with a properties file, but I was feeling lazy so did it programmatically - you can see details of which Annotators are available here: http://stanfordnlp.github.io/CoreNLP/annotators.html )
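As a rough sketch of the programmatic route, the properties might be built like this. The class and method names are hypothetical and the exact annotator list is an assumption; the block only uses java.util.Properties, and the resulting object is what would be handed to the StanfordCoreNLP constructor inside the Spring configuration.

```java
import java.util.Properties;

// Hypothetical helper: builds the properties that would be passed to the StanfordCoreNLP constructor.
class PipelineProperties {

    static Properties sentimentProperties() {
        Properties props = new Properties();
        // Assumed annotator chain: tokenize and ssplit break the text into tokens
        // and sentences, parse builds the trees, and sentiment scores those trees.
        props.setProperty("annotators", "tokenize, ssplit, parse, sentiment");
        return props;
    }
}
```

The same key and value could equally live in a properties file on the classpath; building them in code just keeps everything in one place.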
Next up, based on the code examples we have seen, we need a little bit of code to analyse a piece of text and return a score, so I created a simple Spring service called SentimentService that I later wire into my event listener.
That's mostly it, really: in my event listener, instead of just persisting the tweet along with its labels, I also run the analysis and save the score.
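The interesting decision in a service like this is how to turn per-sentence results into one score per tweet. The sketch below assumes each sentence has already been given a class from 0 (very negative) to 4 (very positive) and that the tweet takes the score of its longest sentence. Both the `ScoredSentence` holder and the longest-sentence rule are assumptions for illustration, not necessarily what SentimentService does, and the call into the NLP library itself is deliberately left out.

```java
import java.util.List;

// Hypothetical sketch of collapsing per-sentence sentiment classes into a single tweet score.
class TweetSentimentSketch {

    // Hypothetical holder: one sentence plus the class predicted for it
    // (0 = very negative, 2 = neutral, 4 = very positive).
    static final class ScoredSentence {
        final String text;
        final int sentimentClass;

        ScoredSentence(String text, int sentimentClass) {
            this.text = text;
            this.sentimentClass = sentimentClass;
        }
    }

    // Assumed rule: the longest sentence decides the tweet's score;
    // with no sentences at all we fall back to neutral (2).
    static int scoreOfLongestSentence(List<ScoredSentence> sentences) {
        int score = 2;
        int longest = 0;
        for (ScoredSentence sentence : sentences) {
            if (sentence.text.length() > longest) {
                longest = sentence.text.length();
                score = sentence.sentimentClass;
            }
        }
        return score;
    }
}
```

Averaging the sentence classes would be an equally reasonable choice; the longest-sentence rule simply avoids a short "Wow!" outweighing the main statement.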
(Analysis of a couple thousand tweets - an average score plus number of tweets for each symbol)
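The per-symbol summary (average score plus tweet count) can be sketched as a small in-memory aggregator. This is an illustrative stand-in, assuming integer scores and using a plain map rather than Redis; the class and method names are hypothetical.

```java
import java.util.IntSummaryStatistics;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory stand-in for the per-symbol stats persisted by the app.
class SymbolStats {

    // One running summary (count, sum, min, max) per stock symbol, sorted by symbol.
    private final Map<String, IntSummaryStatistics> bySymbol = new TreeMap<>();

    // Record one scored tweet against a symbol.
    void record(String symbol, int score) {
        bySymbol.computeIfAbsent(symbol, key -> new IntSummaryStatistics()).accept(score);
    }

    // Number of tweets seen for the symbol (0 if none).
    long count(String symbol) {
        IntSummaryStatistics stats = bySymbol.get(symbol);
        return stats == null ? 0 : stats.getCount();
    }

    // Average score for the symbol (0.0 if none).
    double average(String symbol) {
        IntSummaryStatistics stats = bySymbol.get(symbol);
        return stats == null ? 0.0 : stats.getAverage();
    }
}
```

An event listener would call `record` once per cashtag found in each tweet, after scoring the tweet text.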
As always, all the code is available on GitHub, so feel free to fork it and have a play yourself (and if you manage to find a training set to accurately analyse tweets then let me know!).