NBC News has publicly released a database of deleted tweets from their investigation into how Russian Twitter Trolls may have influenced the 2016 US election. You can read the results of NBC's analysis in their published stories; the focus of this post is on how you can explore the data on your own, using open source data analysis tools. We'll show how to get started with the data and hopefully inspire you to dig into it yourself.
1. Neo4j Sandbox And Neo4j Browser
NBC News has released the data as a Neo4j Database and CSV files that can be used with your favorite data analysis tools. But the easiest way to get started with the data is by using Neo4j Sandbox. Neo4j Sandbox allows you to spin up a private hosted instance of Neo4j pre-populated with interesting datasets.
Use Neo4j Sandbox to spin up private hosted Neo4j instances pre-populated with interesting datasets.
Once you've launched your Russian Twitter Trolls sandbox instance, you'll have access to Neo4j Browser, the query workbench for Neo4j that lets you interact with the database.
Use Neo4j Browser to visually explore the database.
2. Query With Cypher
Cypher, the query language for graphs, is a great way to explore the database. It can be used from within Neo4j Browser or from an application built with one of Neo4j's client drivers.
Cypher uses graph pattern matching to allow users to express complex graph patterns to match against the graph. This allows for answering questions like:
What are the most commonly used hashtags by the Trolls?
MATCH (t:Troll)-[:POSTED]->(tw:Tweet)-[:HAS_TAG]->(ht:Hashtag)
RETURN ht.tag, COUNT(tw) AS num
ORDER BY num DESC
What Troll accounts have the most followers?
MATCH (u:Troll) WHERE EXISTS(u.followers_count)
RETURN u.screen_name AS screen_name, u.followers_count AS followers
ORDER BY followers DESC LIMIT 50
What tweets contain the word "fraud"?
MATCH (t:Troll)-[:POSTED]->(tw:Tweet)
WHERE tw.text CONTAINS "fraud"
OPTIONAL MATCH p=(tw)-[:HAS_TAG|HAS_LINK|MENTIONS|IN_REPLY_TO]-(a)
RETURN * LIMIT 50
Find inferred relationships: what Trolls are retweeting other Trolls?
MATCH p=(:Troll)-[:POSTED]->(:Tweet)<-[:RETWEETED]-(:Tweet)<-[:POSTED]-(:Troll)
RETURN p LIMIT 10
Further ideas for querying:
- What are the most commonly used applications by the Trolls to post tweets?
- What locations do the Troll accounts list in their profiles?
- What tweets had the most retweets that were not from other Russian Troll accounts?
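The queries in this section can also be run from application code. The following is a minimal sketch, assuming the official Neo4j Python driver (the `neo4j` package) is installed. The connection URI, the credentials, and the helper names `top_hashtags` and `main` are placeholders invented for this example, and the `LIMIT $limit` parameter is an addition to the hashtag query.

```python
# The hashtag query from this section, with the row limit passed as a parameter.
TOP_HASHTAGS = """
MATCH (:Troll)-[:POSTED]->(tw:Tweet)-[:HAS_TAG]->(ht:Hashtag)
RETURN ht.tag AS tag, count(tw) AS num
ORDER BY num DESC
LIMIT $limit
"""


def top_hashtags(session, limit=10):
    """Run the query on an open session and return (tag, count) pairs."""
    result = session.run(TOP_HASHTAGS, limit=limit)
    return [(record["tag"], record["num"]) for record in result]


def main():
    # Imported here so the helper above loads even without the driver installed.
    from neo4j import GraphDatabase

    # Placeholder connection details: replace with the values for your sandbox.
    uri = "bolt://localhost:7687"
    auth = ("neo4j", "<password>")

    driver = GraphDatabase.driver(uri, auth=auth)
    try:
        with driver.session() as session:
            for tag, num in top_hashtags(session):
                print(tag, num)
    finally:
        driver.close()
```

Calling `main()` prints one hashtag per line with its tweet count. Keeping the query in a function that takes a session makes it easy to reuse the same connection for the other questions listed above.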
3. Fill In Missing Data
Due to the way the data was collected, there are some missing pieces. For example, some of the users are missing profile information and some tweets are missing metadata like number of likes and retweets.
Missing Profile Information
For example, the user @TEN_GOP is missing profile information in the database as this wasn't captured:
MATCH (u:Troll) WHERE u.screen_name = "TEN_GOP"
RETURN u.id, u.screen_name, u.description, u.location, u.name
╒════════════╤═══════════════╤═══════════════╤════════════╤════════╕
│"u.id"      │"u.screen_name"│"u.description"│"u.location"│"u.name"│
╞════════════╪═══════════════╪═══════════════╪════════════╪════════╡
│"4224729994"│"TEN_GOP"      │""             │""          │""      │
└────────────┴───────────────┴───────────────┴────────────┴────────┘
We can reconstruct the Twitter profile URL for @TEN_GOP:
https://twitter.com/TEN_GOP
but because these accounts have been suspended by Twitter, all we see is:
The Russian Troll accounts were suspended by Twitter, removing their data from Twitter.com and Twitter's API.
We can check web caches, such as Internet Archive, to find cached versions of these pages, which we may then be able to scrape. Internet Archive has an API for checking for cached versions of pages, for example:
http://archive.org/wayback/available?url=http://twitter.com/TEN_GOP
{
  "url": "http://twitter.com/TEN_GOP",
  "archived_snapshots": {
    "closest": {
      "status": "200",
      "available": true,
      "url": "http://web.archive.org/web/20170818065026/https://twitter.com/TEN_GOP",
      "timestamp": "20170818065026"
    }
  }
}
This response shows that the profile page for @TEN_GOP has been captured by Internet Archive. The snapshot is available at the url listed under closest.
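The same check can be scripted. Here is a minimal sketch using only the Python standard library. It assumes the response has the shape shown above, and the function names are invented for this example.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

AVAILABILITY_ENDPOINT = "http://archive.org/wayback/available"


def availability_url(page_url):
    """Build the availability-check URL for a page."""
    return AVAILABILITY_ENDPOINT + "?" + urlencode({"url": page_url})


def closest_snapshot(response):
    """Return the closest snapshot URL from a decoded response, or None."""
    closest = response.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"]
    return None


def find_snapshot(page_url, timeout=10):
    """Query the API over the network and return the snapshot URL, or None."""
    with urlopen(availability_url(page_url), timeout=timeout) as resp:
        return closest_snapshot(json.load(resp))
```

Calling `find_snapshot("http://twitter.com/TEN_GOP")` makes the request shown above. Parsing is kept in its own function, so it can be tried on a saved response without network access.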
Missing Tweet Information
We can also reconstruct the tweet URLs for tweets in the database that have missing information to check against caches:
MATCH (u:Troll)-[:POSTED]->(t:Tweet) WHERE t.text = ""
RETURN "https://twitter.com/" + u.screen_name + "/status/" + t.id AS tweet_url LIMIT 10
╒═══════════════════════════════════════════════════════════╕
│"tweet_url"                                                │
╞═══════════════════════════════════════════════════════════╡
│"https://twitter.com/SCOTTGOHARD/status/781651098398494720"│
├───────────────────────────────────────────────────────────┤
│"https://twitter.com/SCOTTGOHARD/status/780602260401299456"│
├───────────────────────────────────────────────────────────┤
│"https://twitter.com/WarfareWW/status/783649582064467968"  │
├───────────────────────────────────────────────────────────┤
│"https://twitter.com/WarfareWW/status/783642593137754114"  │
├───────────────────────────────────────────────────────────┤
│"https://twitter.com/WarfareWW/status/756033388423897088"  │
├───────────────────────────────────────────────────────────┤
│"https://twitter.com/WarfareWW/status/794918302585909250"  │
├───────────────────────────────────────────────────────────┤
│"https://twitter.com/WarfareWW/status/787416487346708481"  │
├───────────────────────────────────────────────────────────┤
│"https://twitter.com/WarfareWW/status/794189517653680132"  │
├───────────────────────────────────────────────────────────┤
│"https://twitter.com/WarfareWW/status/797080157135761409"  │
├───────────────────────────────────────────────────────────┤
│"https://twitter.com/WarfareWW/status/781515670379003904"  │
└───────────────────────────────────────────────────────────┘
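If you are working from the CSV files rather than the database, the same URL reconstruction is easy to do in code. The sketch below assumes rows of (screen name, tweet id) pairs; the function names are invented for this example.

```python
def tweet_url(screen_name, tweet_id):
    """Rebuild the public URL of a tweet from its author and id."""
    return f"https://twitter.com/{screen_name}/status/{tweet_id}"


def urls_to_check(rows):
    """Turn (screen_name, tweet_id) rows into a de-duplicated list of URLs.

    Rows missing either value are skipped, since no URL can be built for them.
    """
    seen = set()
    urls = []
    for screen_name, tweet_id in rows:
        if not screen_name or not tweet_id:
            continue
        url = tweet_url(screen_name, tweet_id)
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls
```

The resulting list can then be passed, one URL at a time, to whichever web cache you want to query.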
Unfurling URLs
Many of the tweets contain URLs that use link shortener services, so it's not clear what pages they are actually sharing:
MATCH (t:Troll)-[:POSTED]->(tw:Tweet)-[:HAS_LINK]->(u:URL)
WHERE u.expanded_url CONTAINS "bit.ly"
RETURN u.expanded_url LIMIT 10
╒═══════════════════════╕
│"u.expanded_url"       │
╞═══════════════════════╡
│"http://bit.ly/2eeMnZR"│
├───────────────────────┤
│"http://bit.ly/2dCn9qP"│
├───────────────────────┤
│"http://bit.ly/2ctTjGN"│
├───────────────────────┤
│"http://bit.ly/2eAOBnf"│
├───────────────────────┤
│"http://bit.ly/2awlrUs"│
├───────────────────────┤
│"http://bit.ly/2aAtdyN"│
├───────────────────────┤
│"http://bit.ly/29UHsyx"│
├───────────────────────┤
│"http://bit.ly/2cOskmM"│
├───────────────────────┤
│"http://bit.ly/2cOskmM"│
├───────────────────────┤
│"http://bit.ly/2cOskmM"│
└───────────────────────┘
We can use tools like cURL to unfurl the links and find the final destination URLs:
$ curl -Ls -w '%{url_effective}' -o /dev/null http://bit.ly/2eeMnZR
http://ksnt.com/2016/10/27/early-voting-more-good-signs-for-clinton-in-key-states/?utm_source=twitterfeed&utm_medium=twitter
Online tools like unfurlr allow us to accomplish the same thing, but can also inspect page content and spoof user agents.
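To unfurl many links at once, the redirect chain can also be followed in code. This is a rough sketch using the Python standard library; it is not how cURL or unfurlr are implemented. The names `http_hop` and `unfurl` and the default limit of ten redirects are choices made for this example.

```python
import urllib.error
import urllib.request
from urllib.parse import urljoin

REDIRECT_CODES = {301, 302, 303, 307, 308}


class _NoRedirect(urllib.request.HTTPRedirectHandler):
    """Stop urllib from following redirects so each hop can be inspected."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None


def http_hop(url, timeout=10):
    """Return the URL this one redirects to, or None if it does not redirect."""
    opener = urllib.request.build_opener(_NoRedirect)
    try:
        with opener.open(url, timeout=timeout):
            return None
    except urllib.error.HTTPError as err:
        location = err.headers.get("Location")
        if err.code in REDIRECT_CODES and location:
            return urljoin(url, location)
        raise


def unfurl(url, hop=http_hop, max_redirects=10):
    """Follow redirects one hop at a time and return the final URL."""
    seen = {url}
    for _ in range(max_redirects + 1):
        target = hop(url)
        if target is None:
            return url
        if target in seen:
            raise ValueError(f"redirect loop at {target}")
        seen.add(target)
        url = target
    raise ValueError(f"more than {max_redirects} redirects")
```

`unfurl("http://bit.ly/2eeMnZR")` follows the chain over the network. The `hop` argument is injectable, so the loop logic can be exercised with a plain dictionary of known redirects.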
Further ideas for enriching the data:
- Supplementing the data with Google Knowledge Graph
- Checking web caches like archive.is and Internet Archive for cached versions of the deleted tweets
- Searching other social media platforms for usernames that have been reused
4. Graph Algorithms
Graph algorithms are a way to apply analytics to the entire graph to further enhance our understanding of the data. These algorithms fall into three categories:
- Centrality: What are the most important nodes in the network? Centrality algorithms include PageRank, Betweenness Centrality, and Closeness Centrality.
- Community detection: How can the graph be partitioned? Community detection and clustering algorithms include Union Find, Louvain, Label Propagation, and Connected Components.
- Pathfinding: What are the shortest paths or best routes available given cost? Pathfinding algorithms include Minimum Weight Spanning Tree, All Pairs and Single Source Shortest Path, and Dijkstra.
PageRank is a recursive graph algorithm that defines the importance of a node proportional to the importance and number of connected nodes in the graph. Image source: Wikipedia.
We can run these algorithms in Neo4j with Cypher using the Neo4j Graph Algorithms procedures. For example, here's how to run PageRank on the Troll retweet graph:
CALL algo.pageRank(
  "MATCH (t:Troll) RETURN id(t) AS id",
  "MATCH (r1:Troll)-[:POSTED]->(:Tweet)<-[:RETWEETED]-(:Tweet)<-[:POSTED]-(r2:Troll) RETURN id(r2) AS source, id(r1) AS target",
  {graph:'cypher'})
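To build intuition for what the procedure computes, here is a small textbook-style PageRank by power iteration in plain Python. It is an illustrative sketch, not the Neo4j implementation, so its scores may not match the procedure's output numerically. Edges are (source, target) pairs pointing from the retweeting account to the retweeted one, as in the query above; the damping factor and iteration count are conventional defaults chosen for this example.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Score the nodes of a directed graph given (source, target) edge pairs."""
    nodes = sorted({node for edge in edges for node in edge})
    if not nodes:
        return {}

    out_links = {node: [] for node in nodes}
    for source, target in edges:
        out_links[source].append(target)

    # Start from a uniform distribution that sums to 1.
    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no outgoing edges is spread over all nodes.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        base = (1.0 - damping + damping * dangling) / len(nodes)
        new_rank = {node: base for node in nodes}
        for source in nodes:
            targets = out_links[source]
            for target in targets:
                new_rank[target] += damping * rank[source] / len(targets)
        rank = new_rank
    return rank
```

An account that is retweeted by many others, or by a few highly ranked accounts, ends up with the highest score.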
For more examples of running graph algorithms on the Russian Troll dataset, see the Neo4j Browser guide for the Russian Twitter Trolls Neo4j Sandbox instance.
Graph algorithms ideas:
- What are the most influential Troll accounts?
- Can you find communities in the graph based on interactions using community detection algorithms?
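As a starting point for the community question, the simplest partition is connected components, which can be computed with a union-find structure. The sketch below is plain Python run on whatever interaction pairs you export from the graph; it is not the Neo4j procedure, and the function name is invented for this example.

```python
def connected_components(edges):
    """Group nodes into components given undirected (a, b) interaction pairs."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        root = node
        while parent[root] != root:
            root = parent[root]
        # Path compression: point every node on the path straight at the root.
        while parent[node] != root:
            parent[node], node = root, parent[node]
        return root

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), set()).add(node)
    # Largest components first, ties broken alphabetically.
    return sorted(groups.values(), key=lambda group: (-len(group), sorted(group)))
```

Connected components only tell you which accounts are linked at all. Algorithms such as Louvain or Label Propagation go further and split a single large component into denser groups.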
5. Graph Visualization
Data visualization is often the best way to make sense of the results of graph algorithms. There are a number of open source tools for visualizing graph data, each with their own pros and cons. Tools such as Gephi, vis.js, and Semiotic are commonly used for building interactive graph visualizations.
For those familiar with data visualization, graph data brings a unique set of challenges. Often the most important features of graph visualization are:
- Binding node size to the importance, or centrality, of the node in the graph
- Grouping the nodes together in clusters. Many graph visualization tools use a force-directed layout to surface clusters; however, we can also use community detection algorithms and bind the communities to node color to show clusters
- Showing relationship thickness proportional to a property, or weight, of the relationship
Visualizing the Russian Troll retweet/reply network. Node size is proportional to PageRank, color shows the result of a community detection algorithm, and relationship thickness is determined by the number of retweets between the Trolls.
The image above shows the results of running PageRank and community detection algorithms on the Russian Troll retweet graph, visualized using a vis.js wrapper library called neovis.js.
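The three bindings described above can be prepared as plain data before it reaches a visualization library. The sketch below shapes the results into node and edge records in the style used by vis.js, where `value` drives scaling and `group` drives color. It is an assumption-laden illustration, not the neovis.js configuration; the function name and input shapes are invented for this example.

```python
def to_vis_network(pagerank, community, retweets):
    """Shape analysis results into vis.js-style node and edge records.

    pagerank:  mapping of account name -> PageRank score (node size)
    community: mapping of account name -> community id (node color group)
    retweets:  mapping of (source, target) -> retweet count (edge thickness)
    """
    nodes = [
        {"id": name, "label": name, "value": score, "group": community.get(name)}
        for name, score in sorted(pagerank.items())
    ]
    edges = [
        {"from": source, "to": target, "value": count}
        for (source, target), count in sorted(retweets.items())
    ]
    return {"nodes": nodes, "edges": edges}
```

Serializing the result with `json.dumps` gives a payload that a front-end page can load and render.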
Further Ideas:
Can you imagine other types of inferred networks that can be extracted from the graph? How would you express those graphs using Cypher? Can you build an interactive graph visualization using one of the tools mentioned above to visualize that graph?
6. Natural Language Processing
Natural language processing (NLP) is the process of making sense of text data. Common NLP tasks include part-of-speech tagging, entity extraction, word similarity, and sentiment analysis. There are a number of open source tools for performing NLP tasks, such as Stanford's CoreNLP tools and NLTK in Python, and even some tools designed specifically for working with Twitter data, such as CMU's Twitter Part-of-Speech Tagger and a crowd-sourced tool for finding hashtag definitions.
Entity extraction on the Tweet data involves extending the graph model by annotating tweets that contain entities.
You can even run NLP tasks using Cypher directly in Neo4j using an extension such as the GraphAware neo4j-nlp procedures. For those comfortable with Python tooling for NLP, entity extraction on Twitter data can also be run using Neo4j and Python.
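To make the annotation idea concrete, here is a deliberately simple sketch that tags tweets by looking up names from a hand-made list (a gazetteer). Real entity extraction with CoreNLP, NLTK, or neo4j-nlp uses trained models and will find far more. The function name, the input shapes, and the entity types are assumptions made for this example.

```python
import re


def tag_entities(tweets, gazetteer):
    """Return (tweet_id, entity, entity_type) rows for gazetteer matches.

    tweets:    mapping of tweet id -> tweet text
    gazetteer: mapping of entity name -> entity type, e.g. {"Clinton": "PERSON"}
    """
    # Match whole words only, ignoring case, so "Clinton" does not hit "Clintonville".
    patterns = {
        name: re.compile(r"(?<!\w)" + re.escape(name) + r"(?!\w)", re.IGNORECASE)
        for name in gazetteer
    }
    rows = []
    for tweet_id, text in tweets.items():
        for name, pattern in patterns.items():
            if pattern.search(text):
                rows.append((tweet_id, name, gazetteer[name]))
    return rows
```

Each returned row could then be written back to the graph as a relationship from the tweet to an entity node; the label and relationship names are yours to choose.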
NLP ideas:
- Entity extraction: what are the most common people, organizations, and places mentioned in the tweets? Are certain groups of Trolls talking about certain entities?
- Sentiment analysis: are the Trolls talking positively about anything? Or do they focus only on spreading negativity?
We hope you're excited to explore the data. Share anything interesting you find with us on Twitter at @neo4j.
Six Ways To Explore The Russian Twitter Trolls Database In Neo4j was originally published in Hacker Noon on Medium.