Alexander Gromnitsky's Blog

Twitter stats using gnuplot, json & make

Twitter allows users to download a subset of their activities as a zip archive. Unfortunately, there are no useful visualizations available for the provided data, except for a simple list of tweets with date filtering.

For example, I expected to find the following, but there was no sign of either:

  1. a graph of activities over time;
  2. a list of: i. the most popular tweets; ii. the users to whom I reply the most.

Inside the archive there is a data/tweet.js file that contains an array (assigned to a global variable) of "tweet" objects:

window.YTD.tweet.part0 = [ {
  "tweet" : {
    "retweeted" : false,
    "source" : "<a href=\"http://twitter.com\" rel=\"nofollow\">Twitter Web Client</a>",
    "favorite_count" : "2",
    "id" : "12345",
    "created_at" : "Sat Jun 23 16:52:42 +0000 2012",
    "full_text" : "hello",
    "lang" : "en",
    ...
  }
}, ...]

The array itself is already valid JSON; only the variable assignment in front of it is not. Hence it's trivial to convert the file to proper JSON for filtering with the json(1) tool.
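To make the conversion concrete, here is a minimal sketch in plain JavaScript (Node). It assumes the file always starts with an assignment followed by a JSON array, as in the excerpt above; the function name parseTweetJs and the sample string are made up for illustration and are not part of the archive or of json(1).

```javascript
// Sketch: turn the text of data/tweet.js into an array of objects.
// Assumption: the file looks like `window.YTD.tweet.part0 = [ ... ]`,
// i.e. a variable assignment followed by a JSON array.
function parseTweetJs(text) {
  // Everything before the first "[" is the assignment; drop it.
  const start = text.indexOf("[");
  if (start === -1) throw new Error("no JSON array found");
  return JSON.parse(text.slice(start));
}

// Hypothetical, shortened sample of the file contents.
const sample =
  'window.YTD.tweet.part0 = [ {\n' +
  '  "tweet" : { "id" : "12345", "lang" : "en" }\n' +
  '} ]';

const tweets = parseTweetJs(sample);
console.log(tweets[0].tweet.lang); // prints "en"
```

The makefile below does the same thing with sed and cat: it deletes the first line and puts "[{" back in its place.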

Say we want a list of the top 5 languages in which the tweets were written. A small makefile:

$ cat lang.mk
lang: tweets.json
    json -a tweet.lang < $< | $(aggregate) | $(sort)
tweets.json: $(i)
    unzip -qc $< data/tweet.js | sed 1d | cat <(echo [{) - > $@

aggregate = awk '{r[$$0] += 1} END {for (k in r) print k, r[k]}'
sort = sort -k2 -n | column -t
SHELL := bash -o pipefail

yields:

$ make -f lang.mk i=1.zip | tail -5
cs   16
und  286
ru   333
en   460
uk   1075

(1.zip is the archive that Twitter permits us to download.)
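For readers who don't speak awk, the aggregate and sort variables can be approximated in plain JavaScript. This is only a sketch of the same counting idea, not what the makefile runs; the names aggregate and top and the sample tweets are hypothetical.

```javascript
// Count how often each key occurs, like the awk `aggregate` one-liner.
function aggregate(keys) {
  const counts = new Map();
  for (const k of keys) counts.set(k, (counts.get(k) || 0) + 1);
  return counts;
}

// Sort [key, count] pairs ascending by count and keep the last n,
// roughly like `sort -k2 -n | tail -n`.
function top(counts, n) {
  return [...counts].sort((a, b) => a[1] - b[1]).slice(-n);
}

// Made-up sample data in the shape of the archive's tweet objects.
const tweets = [
  { tweet: { lang: "uk" } },
  { tweet: { lang: "en" } },
  { tweet: { lang: "uk" } },
  { tweet: { lang: "ru" } },
  { tweet: { lang: "en" } },
  { tweet: { lang: "uk" } },
];

const langs = top(aggregate(tweets.map(t => t.tweet.lang)), 2);
for (const [lang, n] of langs) console.log(lang, n);
// prints:
// en 2
// uk 3
```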

To draw activity bars, the same technique is applied: we extract a date from each tweet object & aggregate the results by day:

2020-12-31 5
2021-01-03 10
2021-01-04 5

This can be fed to gnuplot:

$ make -f plot.mk i=1.zip activity.svg

This makefile has an embedded gnuplot script:

$ cat plot.mk
include lang.mk

%.svg: dates.txt
    cat <(echo "$$plotscript") $< | gnuplot - > $@

dates.txt: tweets.json
    json -e 'd = new Date(this.tweet.created_at); p = s => ("0"+s).slice(-2); this.tweet.date = [d.getFullYear(), p(d.getMonth()+1), p(d.getDate())].join`-`' -a tweet.date < $< | $(aggregate) > $@

export define plotscript =
set term svg background "white"
set grid

set xdata time
set timefmt "%Y-%m-%d"
set format x "%Y-%m"

set xtics rotate by 60 right

set style fill solid
set boxwidth 1

plot "-" using 1:2 with boxes title ""
endef
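The dates.txt recipe packs the date handling into a one-liner, so here is the same bucketing step spelled out as a standalone sketch. One deliberate difference: the recipe uses local-time getters (getFullYear and friends), while this sketch uses toISOString(), which is always UTC, to keep the result independent of the machine's time zone. It also relies on Node's Date accepting Twitter's created_at format, just as the recipe does. The function names and sample tweets are hypothetical.

```javascript
// Convert Twitter's created_at string to a YYYY-MM-DD day (in UTC).
function day(createdAt) {
  return new Date(createdAt).toISOString().slice(0, 10);
}

// Count tweets per day; the result maps "YYYY-MM-DD" to a number.
function countByDay(tweets) {
  const counts = {};
  for (const t of tweets) {
    const d = day(t.tweet.created_at);
    counts[d] = (counts[d] || 0) + 1;
  }
  return counts;
}

// Made-up sample data; only created_at matters here.
const tweets = [
  { tweet: { created_at: "Sat Jun 23 16:52:42 +0000 2012" } },
  { tweet: { created_at: "Sat Jun 23 18:00:00 +0000 2012" } },
  { tweet: { created_at: "Sun Jun 24 09:15:00 +0000 2012" } },
];

const perDay = countByDay(tweets);
for (const [d, n] of Object.entries(perDay)) console.log(d, n);
// prints:
// 2012-06-23 2
// 2012-06-24 1
```

Each printed line has the "date count" shape that the gnuplot script reads via `using 1:2`.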

Listing the users to whom one replies the most is quite simple:

$ cat users.mk
users: tweets.json
    json -e 'this.users = this.tweet.entities.user_mentions.map( v => v.screen_name).join`\n`' -a users < $< | $(aggregate) | $(sort)

include lang.mk
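The json -e expression in users.mk flattens every tweet's user_mentions into one screen name per line before counting. A rough standalone equivalent, assuming each tweet carries an entities.user_mentions array as the recipe implies, could look like this; mentionCounts and the sample names are made up.

```javascript
// Count how often each screen name is mentioned across all tweets.
// Returns [screen_name, count] pairs sorted ascending by count.
function mentionCounts(tweets) {
  const counts = new Map();
  for (const t of tweets) {
    for (const m of t.tweet.entities.user_mentions) {
      counts.set(m.screen_name, (counts.get(m.screen_name) || 0) + 1);
    }
  }
  return [...counts].sort((a, b) => a[1] - b[1]);
}

// Made-up sample data; a tweet without mentions contributes nothing here.
const tweets = [
  { tweet: { entities: { user_mentions: [{ screen_name: "alice" }, { screen_name: "bob" }] } } },
  { tweet: { entities: { user_mentions: [{ screen_name: "alice" }] } } },
  { tweet: { entities: { user_mentions: [] } } },
];

const users = mentionCounts(tweets);
for (const [name, n] of users) console.log(name, n);
// prints:
// bob 1
// alice 2
```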

I'm not much of a tweeter:

$ make -f users.mk i=1.zip | tail -5
<redacted>       41
<redacted>       49
<redacted>       60
<redacted>       210
<redacted>       656

Printing the most popular tweets is more cumbersome. We need to:

  1. calculate the rating of each tweet (by such a complex formula as favorite_count + retweet_count);
  2. sort all the tweet objects;
  3. slice N tweet objects.

A Make recipe for it is a little too long to show here, but you can grab a makefile that contains the recipe + all the recipes shown above.
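To be clear, the following is not that recipe, just a sketch of the three steps listed above in plain JavaScript. It assumes retweet_count is stored as a string like favorite_count is in the excerpt at the top, hence the Number() conversions; rating, mostPopular and the sample tweets are hypothetical.

```javascript
// Step 1: rating = favorite_count + retweet_count.
// Assumption: both counts are strings (as favorite_count is in the archive).
function rating(t) {
  return Number(t.tweet.favorite_count) + Number(t.tweet.retweet_count);
}

// Steps 2 and 3: sort by rating, highest first, and slice the first n.
// The input array is copied so the caller's order is left untouched.
function mostPopular(tweets, n) {
  return [...tweets].sort((a, b) => rating(b) - rating(a)).slice(0, n);
}

// Made-up sample data.
const tweets = [
  { tweet: { id: "a", favorite_count: "2", retweet_count: "1" } },
  { tweet: { id: "b", favorite_count: "0", retweet_count: "0" } },
  { tweet: { id: "c", favorite_count: "5", retweet_count: "4" } },
];

const best = mostPopular(tweets, 2);
for (const t of best) console.log(t.tweet.id, rating(t));
// prints:
// c 9
// a 3
```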


Tags: ойті
Authors: ag