The Twitter streaming APIs are a very efficient way of having the tweets you’re interested in pushed to you. For example, you can use the filter endpoint to receive tweets matching your filter (author, hashtag, keywords, etc.), but for this project I was more interested in the sample endpoint, which sends out about 1% of all public tweets. This endpoint does, however, have some limitations:
- A set of credentials (app/user combination) can only have a single connection open (any further connection attempts will terminate the previous ones). So in order to use it I would either need to have each visitor authenticate with the app in order to create their own streaming connection, or build some sort of server-side proxy.
- The API response is actually quite large, and combined with the hundreds of tweets received per second this results in a large amount of data being retrieved (during testing on a Friday morning I was seeing a fairly consistent 2 Mbps from the API).
Here’s a quick example of the streaming API data (capturing the stream for about 5 seconds resulted in 1.3 MB of data; I’ve shown just the first ~1000 lines here, a sample of the sample you could say):
Here are a few things to note:
- There is a lot of metadata about tweets included which I don’t need.
- There are quite a few native retweets, which include the retweeted text prefixed with RT in the new tweet. Should they be excluded, or should the retweet count towards the word count?
- There are many different languages. In order to have something meaningful for myself (I only speak English fluently, plus a couple of other European languages poorly at best), I decided to only process English tweets.
All of this meant that it made sense to build a simple back-end service/proxy that created a single streaming connection, processed this data and fed a far more condensed amount of data out to the browser(s). I chose to build something with node.js.
First we need to get the data out of the streaming API. I found an npm module called node-tweet-stream that worked with the filter endpoint, and with a little butchery was able to hook it up to the sample API instead.
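The per-message handling can be sketched like this (the stream wiring through the patched node-tweet-stream module is assumed, and `parseTweet` is a hypothetical helper name, not the module’s API):

```javascript
// Hypothetical helper: turn one raw message from the sample stream into the
// fields we care about. The sample stream interleaves real tweets with
// keep-alives and delete notices, so anything without a text field is skipped.
function parseTweet(raw) {
  var tweet;
  try {
    tweet = JSON.parse(raw);
  } catch (e) {
    return null; // keep-alive or partial line
  }
  if (!tweet || !tweet.text) return null;
  return { text: tweet.text, lang: tweet.lang };
}

// Assumed wiring (node-tweet-stream patched to hit the sample endpoint):
// stream.on('tweet', function (tweet) { console.log(tweet.text); });
```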
I often use Heroku for hosting small things like this, and Heroku encourages you to store as much of the application configuration as possible in the environment rather than in your application code repository. To manage this in my Ruby projects I use dotenv, which lets me keep such configuration in a `.env` file locally (excluded from source control). I was very pleased to find such functionality also exists for node. A quick install of the dotenv npm module and a simple require and it was working here.
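For reference, loading the configuration in the app is a one-liner; the tiny parser below is only a sketch of roughly what dotenv does under the hood (the real module handles quoting and other edge cases):

```javascript
// In the app itself, loading is a one-liner:
// require('dotenv').load();  // .config() in newer versions of the module

// Simplified sketch of what it does: read KEY=value lines into an object
// (which dotenv then merges into process.env).
function parseDotenv(contents) {
  var env = {};
  contents.split('\n').forEach(function (line) {
    var match = line.match(/^\s*(\w+)\s*=\s*(.*)$/);
    if (match) env[match[1]] = match[2];
  });
  return env;
}
```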
Logging things out to the console is great for debugging but of no real use beyond that. To get the data out to a browser I started to build a simple express app, as I’d had some experience with this before, but something reminded me of web sockets and socket.io, so I thought I’d try playing with them. Again, all that was required was another install/require and a couple of extra lines, and now we have tweets being proxied through to the browser(s). The code was now looking like this:
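The server side looked roughly like the sketch below. The express/socket.io calls in the comments are the libraries’ standard setup; `condense` is a hypothetical name for the payload-trimming step:

```javascript
// Hypothetical payload builder: strip the huge API response down to the
// little the browser actually needs before emitting it over the socket.
function condense(tweet) {
  return { text: tweet.text, lang: tweet.lang };
}

// Assumed server wiring (standard express + socket.io setup):
// var app = require('express')();
// var http = require('http').Server(app);
// var io = require('socket.io')(http);
//
// stream.on('tweet', function (tweet) {
//   io.emit('tweet', condense(tweet)); // broadcast to every connected browser
// });
//
// http.listen(process.env.PORT || 3000);
```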
The main reason for proxying the data was to reduce the amount sent out to the browsers, so now it was time to take those massive responses and reduce them to some word lists. Again I found a couple of great npm modules to help with this: keyword-extractor for extracting the important words (or more accurately, excluding the non-important words), and franc for determining the language of the tweet (keyword-extractor only works with English, much like my brain).
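A stripped-down version of that step might look like this. The real code delegates to the keyword-extractor and franc modules; this toy version uses a tiny stop-word list just to show the shape of the operation:

```javascript
// Toy stand-in for keyword-extractor: drop short words and common stop words.
var STOP_WORDS = ['a', 'an', 'the', 'is', 'it', 'to', 'of', 'and', 'in', 'on'];

function extractKeywords(text) {
  return text
    .toLowerCase()
    .split(/[^a-z']+/) // crude tokenisation
    .filter(function (word) {
      return word.length > 2 && STOP_WORDS.indexOf(word) === -1;
    });
}

// With franc, the language gate would look something like:
// if (franc(tweet.text) === 'eng') { words = extractKeywords(tweet.text); }
```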
While writing this I noticed that the Twitter response actually contains a `lang` field, negating the need to use franc. I hadn’t noticed this at the time, oh well!
Plugging these in, along with some exclusions of my own (links, retweets, replies), gives us the final code (find it on GitHub) that was deployed to Heroku:
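Those extra exclusions boil down to a couple of checks per tweet; a sketch (the function name is hypothetical, and the exact rules in the real code may differ):

```javascript
// Hypothetical sketch of the exclusions: skip native and manual retweets and
// replies entirely, and strip links out of the remaining text.
function cleanTweet(tweet) {
  if (tweet.retweeted_status) return null;        // native retweet
  if (/^RT\b/.test(tweet.text)) return null;      // manual "RT @user ..." retweet
  if (tweet.text.charAt(0) === '@') return null;  // reply
  return tweet.text.replace(/https?:\/\/\S+/g, '').trim();
}
```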
So with fewer than 50 lines of code we have live tweets being parsed for words and those word lists being sent out to the browser. Now let’s get the browser to render them.
Firstly we’ll use socket.io to connect to the web socket and start grabbing the words as they come in.
I’m using the underscore.js library here to get access to some simple helper functions.
And there we go: the words are being spat out to the browser’s console, but of course this is of no practical use. Let’s count the occurrences and display them visually. We’ll do this by throwing the words and their counts into an object and then displaying the most popular ones periodically.
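The counting step can be sketched in a few lines (the names here are illustrative, not the exact ones in the final code): tally each word in a plain object, then sort the keys by count when it’s time to render.

```javascript
// Tally incoming words and pull out the most popular for rendering.
var word_counts = {};

function addWords(words) {
  words.forEach(function (word) {
    word_counts[word] = (word_counts[word] || 0) + 1;
  });
}

function topWords(n) {
  return Object.keys(word_counts)
    .sort(function (a, b) { return word_counts[b] - word_counts[a]; })
    .slice(0, n);
}

// In the page this runs on a timer, e.g.:
// socket.on('words', addWords);
// setInterval(function () { render(topWords(50)); }, 1000);
```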
There are a few things to explain here:
- A `scale` transform is being used instead of `font-size` to change the size of the words, as this results in a GPU-accelerated transform, which we can then enhance with transitions with very little impact on performance.
- The created DOM nodes are being cached in the `text_nodes` object so we don’t have to recreate them each time or try to find them.
- A frame number is used to note when the elements were last updated so that it’s easy to remove any words that are no longer popular.
- The colour of the words is randomised using `hsla()`, as this only requires a single number to be generated (the hue) instead of the multiple numbers required by `rgb()`.
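Put together, one update pass looks roughly like the sketch below. The helper names are illustrative, and the DOM calls in the comments are the standard API; the scale range is an assumption:

```javascript
// Illustrative helpers for the rendering pass.
function randomColour() {
  // Only the hue is random; saturation, lightness and alpha stay fixed.
  return 'hsla(' + Math.floor(Math.random() * 360) + ', 80%, 60%, 0.7)';
}

function scaleFor(count, max_count) {
  // Map a word's count to a transform such as "scale(3.00)";
  // the 1x-5x range here is an assumed choice.
  return 'scale(' + (1 + 4 * (count / max_count)).toFixed(2) + ')';
}

// Assumed DOM update, using the cached nodes and frame counter:
// var node = text_nodes[word] || createNode(word);
// node.style.transform = scaleFor(count, max_count);
// node.dataset.frame = frame; // mark as updated this frame
```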
This works great, but it counts occurrences since you first loaded the page. I wanted it to only consider the most recent words (let’s say the last 5 minutes), so I needed to store the word lists in such a way that I could easily and quickly remove the older ones. I could have stored the time of each occurrence of each word, but that would get complicated. I decided instead to store the word occurrences in several different objects (I called them buckets), with the one being incremented rotated every few seconds. The `render` method would then only use the buckets covering the last 5 minutes’ worth of occurrences.
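The bucket idea can be sketched in a few lines (names and the 10-second/30-bucket rotation are assumptions): increments always go into the newest bucket, a timer shifts a fresh bucket in and drops the oldest, and counting sums across whatever buckets remain.

```javascript
// Hypothetical sketch of the rotating buckets. With a 10-second rotation
// and 30 buckets, the counts cover roughly the last 5 minutes.
var BUCKET_COUNT = 30;
var buckets = [{}];

function addWord(word) {
  var current = buckets[0];
  current[word] = (current[word] || 0) + 1;
}

function rotateBuckets() {
  buckets.unshift({});                               // fresh bucket becomes current
  if (buckets.length > BUCKET_COUNT) buckets.pop();  // drop the oldest
}

function totalCount(word) {
  return buckets.reduce(function (sum, bucket) {
    return sum + (bucket[word] || 0);
  }, 0);
}

// setInterval(rotateBuckets, 10 * 1000);
```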
And there we have the (more or less) finished code, and here it is running:
There’s still a few things I’d like to improve when I can:
- The positioning of the words is random, which often results in excessive overlapping; the translucency helps with that, but it can still be quite bad.
- It would be nice to have it be a little more customisable, maybe the source being a hashtag, a user or your timeline instead of the sample stream.
It was fun to spend a couple of hours playing around with some new things, everyone needs to be able to do that occasionally.
What new technologies are you most excited about playing with?
Originally published on theparticlelab.com on 25th January 2015