A Review Of The Contents Of My Twitter Archive

2022-11-15
privacy

With Twitter seemingly dying because of recent events, it's a good idea to save your data (the "archive") just in case. But what is in that archive? What can one infer from it about Twitter's internals? I took a look at what's in there and tried to guess why it's there, what the platform knows about me, and what it chose to store in the archive.

Introduction

There are reasons to ask for your archive on Twitter. Perhaps the site, with all of your data, will go down, so it's useful to archive your data. It may be years or perhaps decades of your data! Or you're interested in what they store about you -- maybe a few distant posts you no longer remember.

In either case, you can ask for an archive of your data. This takes 24 hours or longer (supposedly "to protect your account", more likely an artificial throttle), but you'll get a ZIP file with some stuff in it. There are some reports about this not being possible any more, depending on your circumstances and the precise moon phase.

When you get your archive and unzip it, you get a bunch of folders, with JSON data (actually it's JS, as the filename suffixes suggest; more on this later) and media in several directories. I was curious what these really contain. While there are some tools to help you get to the useful data, I was more interested in what data is really there.

I was also wondering how much this reveals about how they store data about you. It turns out the layout gives you quite some insight into how their databases are structured, at least to the extent they are willing to expose. In a previous instance, with a different service, I got back an insanely wide CSV file, where each column had a title (like a database field) and stored exactly one value. Twitter, on the other hand, gives you basically JSON files. A lot of them.

The index

What's even better is there’s a README.txt! It's almost 50 kilobytes long, and it's pretty extensive (although in some cases it just points to the developer documentation for further details). It contains sections about (probably) all files delivered, with some description. Someone actually thought this through, at least to a large extent.

I'll go through some of the .js files in the archive. They contain JSON data (with a JavaScript variable defined, just for good measure, so technically it's more like a JavaScript snippet, which also throws a curve ball to tools like jq, but we'll ignore that detail for now).
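For the curious, the wrapper is easy to peel off. Here is a minimal sketch, assuming each file has the shape `<variable> = <JSON>`. The variable name in the sample is made up for illustration, and `load_archive_js` is a hypothetical helper, not something shipped with the archive:

```python
import json


def load_archive_js(text: str):
    """Parse archive .js content of the assumed form '<variable> = <JSON>'."""
    # Drop everything up to and including the first '=' (the assignment).
    _, sep, payload = text.partition("=")
    if not sep:
        raise ValueError("no assignment found; is this really an archive .js file?")
    # Tolerate an optional trailing semicolon before handing over to the JSON parser.
    return json.loads(payload.strip().rstrip(";"))


# Made-up sample; a real file would be read with open(path, encoding="utf-8").read().
sample = 'window.example.part0 = [ { "verified" : false } ]'
print(load_archive_js(sample))
```

Once the prefix is gone, the rest can be handled as ordinary JSON.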

tweets.js

Nothing spectacular here. One interesting bit is about editability of tweets:

"editTweetIds" : [
     "1589186620443734017"
   ],
   "editableUntil" : "2022-11-06T09:53:10.000Z",
   "editsRemaining" : "5",
   "isEditEligible" : false
}

I really love this bit, per tweet:

"id" : "1589710308087586816",
"id_str" : "1589710308087586816",

So... each ID is stored as a number (represented as string) as well as a string (representing a number). You know, just in case! I can imagine at some point they had to change the IDs to string to avoid overflows on some platforms, but removing the field was no longer an option. So they added the same value but explicitly as a string. Correction: it looks like all numbers are represented as strings (i.e. they are in quotes) in all files. Makes you wonder why id_str is here? Guess: it was added for reasons I wrote earlier, then they switched everything to strings, but by then it was too late to remove this duplication?
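One possible refinement of the "overflow" guess (still a guess, the archive itself does not say): JavaScript numbers are 64-bit floats, which can represent every integer exactly only up to 2^53, and tweet IDs are well beyond that. The sketch below uses Python's float, which is the same kind of 64-bit double, to check whether an ID quoted in this post would survive being parsed as a plain JSON number. `survives_float64` is a hypothetical helper name.

```python
def survives_float64(id_str: str) -> bool:
    """True if the integer round-trips through a 64-bit float unchanged."""
    value = int(id_str)
    return int(float(value)) == value


print(2 ** 53)                                   # up to here every integer fits exactly
print(survives_float64("2232194276"))            # a user ID, comfortably small
print(survives_float64("1589186620443734017"))   # the tweet ID from editTweetIds
```

Some large IDs do survive the round trip by luck, which would make a numeric-only field an unpleasant source of intermittent bugs.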

The media items referenced have an HTTP (not HTTPS) link in a field media_url, but also an HTTPS link in media_url_https. Both actually work. It makes you wonder why this is done, but at this point I think it's for legacy support.

The media, i.e. images, in the downloadable archive are of limited size. The full size is available online, so they do have it on their servers... When you ask for “your data”, shouldn’t they, you know, return that? What’s in the archive is provably not all that they have. Is limiting the archive size a fair tradeoff? Was that the intention? Note: the tool mentioned above can get around this now, by explicitly downloading the originals. Which may be slow if you have many images.

The media structure can hold one media item. If there are more images, those are in extended_entities. This, again, suggests that originally one could only upload one image and when they started supporting more they did not / could not change the data layout, so they added a new, more flexible field instead.
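To read a tweet's images without caring which field they ended up in, something like the following would do. It is a sketch under assumptions: I assume the single-item structure lives under `entities.media` and the multi-item one under `extended_entities.media`, and the sample tweet is entirely made up.

```python
def media_items(tweet: dict) -> list:
    """Return all media of a tweet, preferring the newer, more complete field."""
    # Assumed layout: tweet["extended_entities"]["media"] holds every item,
    # tweet["entities"]["media"] holds at most the first one.
    extended = tweet.get("extended_entities", {}).get("media")
    if extended:
        return extended
    return tweet.get("entities", {}).get("media", [])


# Made-up tweet with two images; only the first shows up in the legacy field.
tweet = {
    "entities": {"media": [{"id_str": "1"}]},
    "extended_entities": {"media": [{"id_str": "1"}, {"id_str": "2"}]},
}
print(len(media_items(tweet)))
```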

The field possibly_sensitive is always null or false for me: perhaps it's used when a tweet was flagged?

Finally, there's this construct:

"user_mentions" : [
      {
        "name" : "Privacy Badger",
        "screen_name" : "PrivacyBadger",
        "indices" : [
          "49",
          "63"
        ],
        "id_str" : "2232194276",
        "id" : "2232194276"
}

This particular one appeared in one of my tweets and it gives a fascinating insight into how they represent and reproduce tweets. In the above case I mentioned @PrivacyBadger in a tweet, in the substring 49-63 of the text. They looked up that user at the time and recorded the ID and the place of the mention separately. When showing the tweet, they can now style that slice properly and link it to that account, as opposed to some random text that is not a mention but just happens to begin with @.
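The rendering side of this could look roughly like the sketch below. It assumes the indices are plain character offsets with an exclusive end (63 - 49 is exactly the length of "@PrivacyBadger"), and the tweet text is a made-up stand-in, not my actual tweet. `split_mentions` is a hypothetical helper.

```python
def split_mentions(text: str, mentions: list) -> list:
    """Split text into (segment, user_id) pairs; user_id is None for plain text."""
    parts, pos = [], 0
    for m in sorted(mentions, key=lambda m: int(m["indices"][0])):
        # The archive stores the offsets as strings, so convert them first.
        start, end = (int(i) for i in m["indices"])
        if start > pos:
            parts.append((text[pos:start], None))
        parts.append((text[start:end], m["id_str"]))
        pos = end
    if pos < len(text):
        parts.append((text[pos:], None))
    return parts


mention = {"screen_name": "PrivacyBadger", "indices": ["49", "63"], "id_str": "2232194276"}
# Stand-in text: 49 filler characters, the real mention, then a fake one.
text = "x" * 49 + "@PrivacyBadger" + " and @nobody"
for segment, user_id in split_mentions(text, [mention]):
    print(repr(segment), user_id)
```

Only the recorded slice gets a user ID attached; "@nobody" stays plain text because nothing in user_mentions points at it.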

As a side note: since the screen name is stored here for good, it's highly unlikely that they update every tweet that mentions this user if and when that user changes screen name. Does this mean that an old tweet has a permanent record of the then-used name of mentioned users? Or is this updated for display every time upon display? If so, why would you store screen name to begin with?

tweet-headers.js

This is a derived dataset (a sort of euphemism in this case). It contains the tweet ID, the user ID (mine, duh) and a timestamp. I doubt they produce this file for the purposes of archival only, so it's likely they have this stored separately internally. It's basically a database index: tuples of [user, tweet, timestamp]. However, in order to screw with consistency, the timestamp is not in ISO format in this case, but rather RFC 2822. I know, spotting this is a hint of OCD, but still... it's true!
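If you want to compare timestamps across files, both formats can be normalised with the standard library. A small sketch: the ISO value is the editableUntil stamp quoted earlier, while the RFC 2822 string is a made-up rendering of the same instant, not copied from my archive.

```python
from datetime import datetime
from email.utils import parsedate_to_datetime


def parse_iso(stamp: str) -> datetime:
    # fromisoformat() only accepts a trailing 'Z' from Python 3.11 on,
    # so spell the UTC offset out explicitly.
    return datetime.fromisoformat(stamp.replace("Z", "+00:00"))


def parse_rfc2822(stamp: str) -> datetime:
    # Returns a timezone-aware datetime when the string carries an offset.
    return parsedate_to_datetime(stamp)


iso = parse_iso("2022-11-06T09:53:10.000Z")
rfc = parse_rfc2822("Sun, 06 Nov 2022 09:53:10 +0000")  # made-up example value
print(iso == rfc)
```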

like.js

It is an index to tweets you liked. It tries to be simple: the only data fields are tweetID, fullText (take that, database normalization!) and the expanded URL which, of course, one could deduce from the tweet ID, but it’s spelled out in full nonetheless. Although it’s using a different URL scheme, god only knows why, like:

"expandedUrl" : "https://twitter.com/i/web/status/1588250309918593024"

account.js

It contains the basics: email, username (not display name), account ID, create timestamp, display name. For some strange reason the IP used for sign-up is in a different file (account-creation-ip.js) which suggests they added this field later and/or it’s stored separately.

account-timezone.js

This is as simple as it gets:

"timeZone" : "Amsterdam"

Actually, it’s even simpler than it gets. There’s no country information, or actual timezone, despite what the filename suggests. I guess there’s only “one real Amsterdam”... But if I was living in Rome or Budapest or London, the bets would be even more off about where that really is.

verified.js

"verified" : false

That’s it for me. I guess there are more fields if you’re actually verified. I’d be really curious if those fields are different if you are verified the “old style” versus by paying that infamous $8. I don't think I’ll ever know.

profile.js

Your profile is not your account: fields here are different (bio, website, images). Of course, your website link is a t.co shortcut, most likely to keep tabs on accesses to it…

twitter-shop.js

I confess I do not know what this is, beyond what I found in an announcement: "Today, we’re launching a beta experiment". Has this ever taken off?

twitter-circle-*

I have no idea what this really is and why it is useful - the only use I saw was when I was invited to “make money fast” schemes, along with 49 others.

personalization.js

Okay, now this looks interesting. I surely personalized things? It turns out: no, I haven’t, but the system inferred a bunch of things about me anyway.

The field p13nData (because, as the trend goes, words are too long for machines to read, so personalizationData was too hard to write?) includes languages, where it inferred what languages I speak. I can only guess this is based on who I follow, and what language they write in, because otherwise it’s quite off the mark… That, or the “AI” behind this is quite dumb.

genderinfo says male. Ok, not a bad guess! 

More interestingly, interests lists a large number of topics I may have an interest in. I wonder how this list is compiled? From the engineering point of view, the archive contains no hints at when I may have been interested in these topics, or how much by any definition. It’s just a list of “names” and whether those are disabled (uhm, what does that even mean?)

This list of “interests” contains stock symbols as well as magical strings such as “Cheez-It”, “Eric Idle”, “Events”, “National debt of the United States”, “Pramila Jayapal” (?), “Salad”, “Shoes”, “Weather” and “Weather” (yes, twice) amongst many other things. I can only deduce that some procedure (can only be automatic, right? like “AI”) tags tweets internally, and checks if I have interacted with them at least… once? Twice? Who knows.
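Spotting the doubled entries does not need eyeballing. A quick sketch, assuming each interest is an object with a name and a disabled flag; the key names `name` and `isDisabled` and the flag values in the sample are my assumptions for illustration:

```python
from collections import Counter


def duplicate_names(interests: list) -> list:
    """Return the interest names that occur more than once, sorted."""
    counts = Counter(item["name"] for item in interests)
    return sorted(name for name, n in counts.items() if n > 1)


# Tiny made-up excerpt in the assumed shape.
sample = [
    {"name": "Salad", "isDisabled": False},
    {"name": "Weather", "isDisabled": False},
    {"name": "Weather", "isDisabled": False},
]
print(duplicate_names(sample))
```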

Next up, there’s a list audienceAndAdvertisers and lookalikeAdvertisers. I have no idea what a lookalike advertiser is, but this list is long. As opposed to advertisers, catalogAudienceAdvertisers or even doNotReachAdvertisers, which are empty. It’s a mystery, or one of those "it seemed like a good idea at the time" events.

Oh, but shows has a lot of entries! Again, no idea how this is populated, but the system thinks I’m interested in “America's Got Talent”, “Extant” (?), and of course “<82><82><83><9C><83><82><83><93><83><90><83><83><83><97>” which is probably my favourite, but we’ll never know since no other metrics are exposed, only the existence of such an interest.

Finally, inferredAgeInfo claims "age" : "13-54" which is not too bad, but also not very useful unless you’re the 19th century marketer who has this age bracket in the index of their yellow pages.

The field birthDate is empty for me. Go figure! Have I been successful in not ever telling them?

protected-history.js

I’d love to know what this is for. Maybe I’m not important enough to have anything useful here?

The README says:

- protectedAt: Date and time the "Protect your Tweets" setting was used in the last six months.
- action: Whether the account is protected or unprotected.

I'm sure this means something to someone out there.

phone-number.js

Yep, it is what the name suggests. I'm not sure any more why I gave my phone number away, perhaps it was mandatory at some point (I vaguely remember something along these lines, just not sure if it was Twitter or something else). I'm pretty sure there's no way to ask them to forget this number -- though I wonder if one can change it to a random one...? Would they keep history? If they did, would they tell you...?

saved-search.js

It seems I (accidentally? knowingly?) searched for something and then pushed the button “save search”. No idea why I would ever do that. But, because this is important, it’s saved in a record in their database.

I looked around and the only way to “unsave” this search on the web UI seems to be to search again for this very thing, in which case the “…” menu shows a “Delete saved search” option. Which you only see if you ever press this button on the exact search that was saved. I did not find a list of saved searches in the UI.

email-address-change.js

Again, this is roughly what one would expect, containing fields like changedAt, changedFrom and changedTo.

ip-audit.js

This is supposed to have the IP addresses you used to log in to your account ("IP address associated with the login" according to the README). It is just a timestamp + IP list. A couple of notes:

- Unsurprisingly, this only has IPv4 addresses.
- More interestingly, it has entries only for the last two months, but for those days more than 4 logins per day on average. I'm pretty sure I have not "logged in" this many times, so the label is a bit of a misnomer.
- Even more interestingly, about a quarter of the IPs listed there are from the 10/8 block, i.e. RFC1918 space, which by definition cannot be a public address. So I wonder what this is... the only plausible explanation is that it's the local IP address my device (phone) had, while I was on 4G or something. But then it makes zero sense to log this.
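Counting the private addresses is straightforward with Python's ipaddress module. A sketch with made-up sample addresses (the non-private ones come from the 192.0.2.0/24 documentation range); `is_rfc1918` and `private_share` are hypothetical helpers:

```python
import ipaddress

# The three private IPv4 blocks defined by RFC 1918.
RFC1918 = [
    ipaddress.ip_network(block)
    for block in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]


def is_rfc1918(ip: str) -> bool:
    """True if the address falls into one of the RFC 1918 blocks."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in RFC1918)


def private_share(ips) -> float:
    """Fraction of addresses that are RFC 1918 private (0.0 for an empty list)."""
    ips = list(ips)
    return sum(is_rfc1918(ip) for ip in ips) / len(ips) if ips else 0.0


sample = ["10.12.34.56", "192.0.2.1", "192.0.2.2", "192.0.2.3"]  # made up
print(private_share(sample))
```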

ni-devices.js

Some info about devices you're using, or have used before. It may be of some value, although I don't exactly believe in this being totally correct: at a minimum "deviceVersion: Operating system version associated with the device" is just simply not true. Also, token per device is in the data, but I found no description in the README; it seems like a permanent ID of some kind.

Note: in this file the date format is yet again different than usual (yyyy.dd.mm). Argh!
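Parsing this variant needs an explicit format string, since day and month are swapped compared to what one would expect. A sketch assuming the value is a bare date exactly in the yyyy.dd.mm shape; the sample value is made up, and if the real field also carries a time portion the format string has to be extended accordingly:

```python
from datetime import date, datetime


def parse_device_date(stamp: str) -> date:
    """Parse a 'yyyy.dd.mm' string (day before month) into a date."""
    return datetime.strptime(stamp, "%Y.%d.%m").date()


print(parse_device_date("2022.15.11").isoformat())  # made-up sample value
```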

lists-member.js

Is this a list of lists that you were put on, vs. the ones you asked to be put on?

Conclusion - This Is Not The End

There are far more files than I mentioned in this post. They cover more aspects of the functionality, and presumably they are filled in with some data if you've exercised those features. I'm sure they would give more insight into internals and data they have about you, but I'd only suggest digging into them if you're really fascinated by the kind of stuff I mentioned here.