Weekly Reports: Sept. 5 – Sept. 11, 2016

  • Monday
    • Holiday!
  • Tuesday
    • Created graphs for daily frequencies of tweets and relevant tweets for Nigerian and South African election data
  • Wednesday
    • Integrated Stanford’s CoreNLP library with Spark again for use with Jupyter and the Toree Spark kernel. It accepts an RDD of (Date, String) pairs for evaluation.
      • Fought a lot with the EC2 instances on Amazon (as launched by spark-ec2 scripts) to get them to use Java 1.8, which is necessary for CoreNLP.
    • Reviewed a bunch of papers for the HCIL CHI Clinic
  • Thursday
    • START meeting for progress and tasking
    • CHI paper clinic
  • Friday
    • Finished running raw tweet -> relevance filtering -> community identification -> per-community sentiment analysis pipeline for both the 2015 Nigerian and 2014 South African elections.
      • The communities identified by Label Propagation in Spark comprise one large, central community plus many smaller ones. Most of the smaller communities seem unrelated to the primary event of interest (a sketch of this step follows this week’s notes).
        • For example, South Africa has one large group of >20k users focused on the election and several small groups focused on elections in, or users based in, other countries, including Botswana, the US, and India.
        • Nigeria also has one major community and one smaller community relevant to Biafra, a Nigerian secessionist group/state, but the other 8 top groups are not related to Nigeria.
      • These results suggest the relevance filtering step we have in place is not adequate for disambiguation of election information. The mechanism we have for identifying relevant communities seems to work well, however, so maybe the current pipeline should include an additional verification step where a human selects a few communities of interest.
      • It’s also worth noting that identifying a single major community is not especially useful: since our goal is identifying differences in sentiment across communities, having one dominant community inhibits that goal. Our results also show these large communities have relatively static average daily sentiment across the election cycle (according to CoreNLP). A different method for identifying communities may be necessary.
        • Alternatives include topic similarity, hashtag usage, or actually extracting full friend/follower lists.
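
As referenced above, here is a minimal sketch of the community-identification step. The original ran Spark’s Label Propagation from Scala; this version assumes the Python GraphFrames package, and the Parquet file names and column layouts (an `id` column for vertices, `src`/`dst` for edges) are placeholders, not the real pipeline’s paths.

```python
# Hypothetical input files; the real pipeline builds the interaction
# graph from the relevance-filtered election tweets.
from pyspark.sql import SparkSession
from graphframes import GraphFrame

spark = SparkSession.builder.appName("election-communities").getOrCreate()

vertices = spark.read.parquet("users.parquet")        # column: id
edges = spark.read.parquet("interactions.parquet")    # columns: src, dst

g = GraphFrame(vertices, edges)
communities = g.labelPropagation(maxIter=10)

# Community sizes: we expect one large central community and a long tail
communities.groupBy("label").count().orderBy("count", ascending=False).show(10)
```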

Weekly Reports: August 29 – September 4, 2016

  • Monday
    • Finished query expansion for Nigeria 2011 election using Gnip sample
    • Submitted the job to pull Nigeria 2011 election with SME + expanded keywords from Gnip
  • Tuesday
    • Finished evaluating relevance judgments using hand-labeled data for Nigeria 2015 and South Africa 2014 elections
      • Agreement using Cohen’s Kappa ≈ 0.75
      • Classification accuracy is high
        • 0.90 for Nigeria and 0.89 for South Africa using a linear SVM from Scikit-learn with CountVectorizer and a TF-IDF transformation (a sketch of this setup follows this week’s notes)
      • A pooled model over both elections performs surprisingly well, with a score of 0.90
        • Transferring the South Africa model to Nigeria and vice versa does not perform well, though, suggesting relatively disjoint sets of important keywords
  • Wednesday
    • Developed code for reading Excel data with Python and shunting it to Scala with Spark, so I can build a relevance classifier for the judged tweets from Nigeria and South Africa (a sketch of the handoff also follows this week’s notes)
    • Tested Spark’s ML library for learning to identify relevant tweets from human labels.
      • Currently, Spark’s ML library (not MLlib) only contains Naive Bayes, decision trees, and a few others, which don’t perform quite as well as the linear SVM I built in Python. I should compare these results with MLlib’s SVM implementation.
  • Thursday
    • Implemented a notebook for applying a classifier trained on human-judged tweets to the larger data sets from Nigeria and South Africa during their election periods. Note that I only applied this to the overlap experiment data, which covers a few days on either side of the election event rather than months around the event.
  • Friday
    • Applied the relevance classifier from human tweets to the overlap-experiment tweets for Nigeria and South Africa
      • Results are unclear. In South Africa, query expansion yields a larger number of relevant tweets but lower density, while in Nigeria we see the opposite. Further, the relative difference between SME and news keywords is inconsistent across the two countries: in South Africa, news keywords performed better, but in Nigeria, SME keywords performed better.
      • Central accounts do seem to be consistently better at finding relevant tweets
      • South Africa:
        • Relevant: 194,628; Irrelevant: 1,910,315; Relevant Fraction: 0.0924
        • umd-central-accounts: Counted 83,550; Relevant 36,651 (0.4387)
        • umd-sme-keywords: Counted 258,482; Relevant 67,964 (0.2629)
        • umd-sme-keywords-expanded: Counted 918,366; Relevant 147,113 (0.1602)
        • umd-news-keywords: Counted 380,696; Relevant 145,884 (0.3832)
        • umd-news-keywords-expanded: Counted 1,589,471; Relevant 155,154 (0.0976)
      • Nigeria:
        • Relevant: 5,298,568; Irrelevant: 1,790,816; Relevant Fraction: 0.7473
        • umd-central-accounts: Counted 1,560,307; Relevant 1,384,520 (0.8873)
        • umd-sme-keywords: Counted 4,904,290; Relevant 3,574,459 (0.7288)
        • umd-sme-keywords-expanded: Counted 6,529,544; Relevant 4,921,017 (0.7537)
        • umd-news-keywords: Counted 4,424,309; Relevant 3,092,892 (0.6991)
        • umd-news-keywords-expanded: Counted 6,244,873; Relevant 4,622,833 (0.7403)
    • Read “The death of bin Laden: How Russian and US media frame counterterrorism”
      • Trying to determine what the vocabulary of user response to terrorism tells us
      • This paper is more about how Russian and US newspapers framed Osama bin Laden’s death: Russian papers highlighted negative aspects (extrajudicial killing, US global interests, political opportunities for Obama, doubts about whether the death was real, etc.), while US papers focused more on success, US exceptionalism, etc.
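
For reference, here is a minimal scikit-learn sketch of the Tuesday relevance classifier: a linear SVM over CountVectorizer counts with a TF-IDF transformation, plus Cohen’s Kappa for inter-annotator agreement. The file name and column layout (`text`, `relevant`, and two coder columns) are assumptions for illustration, not the original code.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import accuracy_score, cohen_kappa_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Assumed columns: text, relevant (0/1), coder1, coder2
labeled = pd.read_csv("labeled_tweets.csv")

# Inter-annotator agreement between the two hand-labelers
print("Kappa:", cohen_kappa_score(labeled["coder1"], labeled["coder2"]))

X_train, X_test, y_train, y_test = train_test_split(
    labeled["text"], labeled["relevant"], test_size=0.2, random_state=42)

model = Pipeline([
    ("counts", CountVectorizer()),   # raw token counts
    ("tfidf", TfidfTransformer()),   # TF-IDF reweighting
    ("svm", LinearSVC()),            # linear SVM classifier
])
model.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))
```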
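And a hedged sketch of Wednesday’s Excel-to-Spark handoff: read the judged tweets with pandas, hand them to Spark as a DataFrame, and persist them as Parquet so the Scala side can pick them up. Paths and column names are placeholders.

```python
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("judged-tweets").getOrCreate()

# pandas handles the Excel parsing (needs the openpyxl or xlrd engine)
judged = pd.read_excel("judged_tweets.xlsx")

# Convert to a Spark DataFrame and write Parquet for the Scala/Spark classifier
sdf = spark.createDataFrame(judged[["tweet_id", "text", "relevant"]])
sdf.write.mode("overwrite").parquet("judged_tweets.parquet")
```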

Weekly Reports: June 6 – June 12, 2016

  • Monday
    • Started running RTTBurst on all of 2015.
      • It’s very slow: after 4 days, it had only processed through January 15.
    • Generated a set of misogyny-related tweets
      • Keywords were taken from Hatebase; K. Preston and K. Stanley, “‘What’s the worst thing…?’ Gender-directed insults,” Sex Roles, vol. 17, no. 3–4, pp. 209–219, 1987; and S. Hewitt, T. Tiropanis, and C. Bokhove, “The Problem of Identifying Misogynist Language on Twitter (and Other Online Social Spaces),” in Proceedings of the 8th ACM Conference on Web Science, 2016, pp. 333–335.
      • Seems heavy on pornographic material
      • Filtered to remove retweets and non-English tweets
      • Built a CSV sample of tweets for hand-labeling
  • Tuesday
    • Posted terrorism notebooks to github
    • Meeting about SES + Twitter
  • Wednesday
    • Meeting on START relevance project
    • Worked on dissertation
    • Developed a KL-divergence Scala implementation for Spark
      • Available under spark-twitter-nb here
  • Thursday
    • Worked on dissertation
  • Friday
    • Finished a draft of my dissertation intro
  • Saturday
    • Ran KL divergence comparing the Boston, both Paris, Brussels, and Nigerian Kano bombing collections against the 1% Twitter sample over the same time periods.
      • Top 100 divergent keywords across terror attacks are listed here (a sketch of the ranking step follows this week’s notes)
    • Built a sampling mechanism for Gnip activity data that samples a given number of tweets per rule (also sketched after this week’s notes)
  • Sunday
    • Hand-labeled 1000 misogynistic tweets to determine distribution of insults, quotes, music lyrics, self-identification, and other types.
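
As referenced above, here is a pure-Python sketch of the KL-divergence keyword ranking; the actual implementation is in Scala for Spark, and the additive-smoothing constant here is my own choice. Each term is scored by its contribution to D_KL(event || baseline), and the top contributors are the "divergent keywords."

```python
import math
from collections import Counter

def top_divergent_terms(event_counts, baseline_counts, k=100, alpha=0.01):
    """Rank terms by their contribution to D_KL(event || baseline),
    with additive smoothing over the union vocabulary."""
    vocab = set(event_counts) | set(baseline_counts)
    p_total = sum(event_counts.values()) + alpha * len(vocab)
    q_total = sum(baseline_counts.values()) + alpha * len(vocab)
    contrib = {}
    for w in vocab:
        p = (event_counts.get(w, 0) + alpha) / p_total
        q = (baseline_counts.get(w, 0) + alpha) / q_total
        contrib[w] = p * math.log(p / q)
    return sorted(contrib.items(), key=lambda kv: kv[1], reverse=True)[:k]

# Toy example: words over-represented in the event corpus rank highest
event = Counter("explosion police suspect explosion".split())
baseline = Counter("weather game police music".split())
print(top_divergent_terms(event, baseline, k=3))
```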
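And a hedged sketch of the per-rule Gnip sampler, using reservoir sampling so each rule keeps at most k tweets in a single pass; the `gnip.matching_rules` field layout is assumed from Gnip’s Activity Streams format, and the file path is a placeholder.

```python
import json
import random

k = 100
reservoirs = {}  # rule tag -> up-to-k sampled tweets
seen = {}        # rule tag -> number of matching tweets seen so far

with open("gnip_activities.json") as infile:
    for line in infile:
        tweet = json.loads(line)
        for rule in tweet.get("gnip", {}).get("matching_rules", []):
            tag = rule.get("tag") or rule.get("value")
            seen[tag] = seen.get(tag, 0) + 1
            bucket = reservoirs.setdefault(tag, [])
            if len(bucket) < k:
                bucket.append(tweet)
            else:
                # Standard reservoir step: keep each item with prob k/seen
                j = random.randrange(seen[tag])
                if j < k:
                    bucket[j] = tweet
```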

Weekly Reports: May 30 – June 5, 2016, Expanding Twitter Terrorism Research

Apologies for the hiatus. Last week was my first full week back, since I was in Germany for the ICWSM 2016 conference and presenting at the HCIL Symposium the previous two weeks. It was my first time at ICWSM, in Germany, and presenting at the HCIL Symposium, and all of it was amazing!

I don’t have the usual day-by-day breakdown of my research, but I will instead post a general overview of my work from last week.

My ICWSM paper was on Twitter’s response to terrorist attacks in Western countries, and I focused specifically on the Boston Marathon bombing, Sydney Hostage Crisis, and Charlie Hebdo attacks (my poster is available here: ICWSM16_Poster_Portrait). Since writing the paper though, two additional tragic events occurred: the Paris November attacks, and the Brussels airport attacks. It made sense to apply the same analyses from my ICWSM paper to these new cases and see if the same behaviors were observed.

I also wanted to experiment with some of the new technology that supports interactive analyses on “big data,” so I began working with Anaconda, Apache Toree, and Bokeh-Scala to see if I could duplicate my original analyses directly on the big NSF-funded cluster we have on campus at the University of Maryland.

To these ends, I built a pair of Jupyter notebooks (using the Apache Toree Spark kernel) that run on our cluster, read data directly from HDFS, analyze it with Spark, and produce graphics using Bokeh.

I’ve made these notebooks and the original ICWSM analysis available on Github. Feel free to modify and play with the data and analysis!

  • ICWSM 2016 Analytics
  • Paris November Attacks
  • Brussels Transit Attacks

Social Media Analytics During Crises

I had the great opportunity to run a tutorial on social media analytics during crises at the 2016 HCIL Symposium at UMD this year.
As with my previous talk at MITH on Twitter + Ferguson, I wanted to give a talk that was informative about tools but also hands-on enough that attendees could see some easy analytics they could modify to answer their own questions.

The notebooks include data acquisition from Reddit, Facebook, and Twitter, and you can view them directly on Github here: https://github.com/cbuntain/TutorialSocialMediaCrisis

This material includes:

Material Overview

Tutorial Introduction

  • Terror Data sets
    • Boston Marathon
      • 15 April 2013, 14:49 EDT -> 18:49 UTC
    • Charlie Hebdo
      • 7 January 2015, 11:30 CET -> 10:30 UTC
    • Paris Nov. attacks
      • 13 November 2015, 21:20 CET -> 20:20 UTC (until 23:58 UTC)
    • Brussels
      • 22 March 2016, 7:58 CET -> 6:58 UTC (and 08:11 UTC)

Data Acquisition

  • Topic 1: Introducing the Jupyter Notebook
    • Jupyter notebook gallery
  • Topic 2: Data sources and collection
    • Notebook: T02 – DataSources.ipynb
    • Data sources:
      • Twitter
      • Reddit
      • Facebook
  • Topic 3: Parsing Twitter data
    • Notebook: T03 – Parsing Twitter Data.ipynb
    • JSON format
    • Python json.load (a minimal parsing sketch follows this list)
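
For Topic 3, here is a minimal parsing sketch, assuming the common one-JSON-object-per-line layout for collected tweets (the file name is a placeholder):

```python
import json

tweets = []
with open("tweets.json") as infile:
    for line in infile:
        line = line.strip()
        if line:
            tweets.append(json.loads(line))  # one tweet object per line

# Standard Twitter JSON fields
print(tweets[0]["user"]["screen_name"], tweets[0]["text"])
```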

Data Analytics

  • Notebook: T04-08 – Twitter Analytics.ipynb
  • Topic 4: Simple frequency analysis
    • Top hashtags
    • Most common keywords
    • Top URLs
    • Top images
    • Top users
    • Top languages
    • Most retweeted tweet
  • Topic 5: Geographic information systems
    • General plotting
    • Country plotting
    • Images from target location
  • Topic 6: Sentiment analysis
    • Subjectivity/objectivity with TextBlob (a sketch follows this outline)
  • Topic 7: Other content analysis
    • Topics in relevant data
  • Topic 8: Network analysis
    • Building interaction networks
    • Central accounts
    • Visualization
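
The TextBlob piece of Topic 6 is small enough to sketch inline; polarity and subjectivity both come straight from TextBlob’s sentiment property (the sample text is a stand-in for a real tweet):

```python
from textblob import TextBlob

blob = TextBlob("This is a sample tweet about the event.")
print("Polarity:", blob.sentiment.polarity)          # range [-1.0, 1.0]
print("Subjectivity:", blob.sentiment.subjectivity)  # range [0.0, 1.0]
```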

Weekly Reports: May 2 – May 8, 2016

  • Monday
    • START status presentation slides
  • Tuesday
    • Finished START status slides
    • Interesting preliminary result about Twitter:
      • Previous research suggests terror attacks and crises do not increase the population of social media users. These events alter the distribution of topics discussed but do not drive more posts on the platform.
      • In looking at Nigerian and South African elections, major developments in these events do seem to increase the volume of messages posted to the platform. More research is necessary here since we only have an N of 2 right now.
  • Wednesday
    • START status slides
    • Finished draft of the ICWSM poster
  • Thursday
    • SIGIR camera ready draft
  • Friday
    • Worked on building a portable Anaconda installation for my HCIL symposium tutorial on social media during crises.

Weekly Reports: Apr. 25 – May 1, 2016

  • Monday
    • Industry day
    • Working on job talk
  • Tuesday
    • START meeting
    • Pulled 28 million tweets from France around the November Paris attacks
  • Wednesday
    • Built a sample of tweets from Nigeria and South Africa surrounding their elections. These samples will be used for relevance judgments, to determine the distribution of relevant tweets during these times (maybe), and to build a validation set for automated relevance-selection algorithms.
  • Thursday
    • Explored surveyed reports of negative security experiences and their correlation with crime data.
      • Used Uniform Crime Reporting (UCR) data provided by the DoJ and FBI for arrest and offense data (United States Department of Justice. Federal Bureau of Investigation. Uniform Crime Reporting Program Data: County-Level Detailed Arrest and Offense Data, 2012. ICPSR35019-v1. Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributor], 2014-06-12. http://doi.org/10.3886/ICPSR35019.v1)
      • Counted up the number of users who have and have not had negative security experiences (e.g., had personally identifying information stolen, been the victim of an online scam, etc.), assigned any respondent who had such an experience a +1, and any respondent who had not a -1.
      • Summed these +/-1 values over the FIPS state/county data in the survey to determine how many such instances occurred in a given geographic area.
      • Then compared these geolocated sums with the UCR data for correlation (a sketch of this analysis follows this week’s notes).
      • At the county level, a moderate correlation exists between robbery and these experiences (Pearson’s r = 0.455), with the sum over all crime slightly weaker (r = 0.424). At the state level, robbery has a stronger correlation (r = 0.7), and the general crime sum is again weaker (r = 0.589).
  • Friday
    • Compared language distributions in Twitter for Nigeria, South Africa, the Ivory Coast, and Tunisia using Twitter’s 1% public sample stream and Gnip
      • Each country had five sets to compare: three sets extracted using geolocation (tweets posted from within the country) via Twitter’s 1% sample and Gnip’s bounding-box functionality, and two sets extracted from Gnip using seed keywords I generated manually for two events
      • The geolocation data seems relatively consistent across languages, but the seeded data is skewed towards English
  • Saturday
    • Built word clouds across four African countries using both geolocated tweets and tweets extracted using manually seeded keywords.
    • Can see results here
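
As referenced above, a minimal pandas sketch of Thursday’s survey-vs-UCR analysis. The file names and columns (fips, negative_experience, robbery) are placeholders, not the real survey schema.

```python
import pandas as pd
from scipy.stats import pearsonr

survey = pd.read_csv("survey.csv")    # one row per respondent
ucr = pd.read_csv("ucr_county.csv")   # county-level UCR crime counts

# +1 if the respondent reported any negative security experience, -1 otherwise
# (assumes a boolean negative_experience column)
survey["score"] = survey["negative_experience"].map({True: 1, False: -1})

# Sum scores per FIPS county code, then join against the crime data
county_scores = survey.groupby("fips")["score"].sum().reset_index()
merged = county_scores.merge(ucr, on="fips")

r, p = pearsonr(merged["score"], merged["robbery"])
print(f"Pearson's r = {r:.3f} (p = {p:.3g})")
```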

Weekly Reports: Apr. 18 – Apr. 24, 2016

  • Monday
    • RecSys paper work
    • Re-ran CREDBANK analysis against PHEME data set
      • Few overlapping features. Differentiating between true and false in PHEME is more difficult than in CREDBANK
      • The CREDBANK random forest got ROC AUC = 0.91, but PHEME could only reach around 0.64 (a sketch of this comparison follows this week’s notes)
      • Verified accounts and status counts were the only significantly different features beyond those found in CoCo
      • PHEME ROC AUC
  • Tuesday
    • Meeting at START
    • Met with Jen re: RecSys paper
    • Continued RecSys paper work
  • Wednesday
    • Submitted paper to RecSys on analyzing credibility in the CREDBANK data set
  • Thursday
    • Brenna’s practice talk
    • BBL by Dr. O’Shea and Dr. Scanlon from Edinburgh and the Open University respectively
    • Made summary slides for Gary that discuss our Boston Marathon work
  • Friday
    • Worked on job talk
    • NGS2 tech discussion
    • Gave an interview about inferring personal characteristics from social media
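
As referenced above, a minimal sketch of the random-forest / ROC-AUC evaluation; the synthetic features below stand in for the real per-event CREDBANK and PHEME feature matrices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the per-event feature matrix and true/false labels
features, labels = make_classification(n_samples=1000, n_features=20,
                                       random_state=42)

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=42)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# Score by the predicted probability of the positive (credible) class
scores = clf.predict_proba(X_test)[:, 1]
print("ROC AUC:", roc_auc_score(y_test, scores))
```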

Weekly Reports: Apr. 4 – Apr. 17, 2016

I spent these two weeks preparing for and attending the WWW conference in Montreal (very awesome!) and writing a paper for RecSys.

At WWW, I presented a paper at #Microposts on overlaps between social media and survey work around the Boston Marathon Bombing in 2013 (“Comparing Social Media and Traditional Surveys Around the Boston Marathon Bombing”). This paper received an honorable mention for the best paper award. I also had the pleasure of chairing a session of the Workshop on Modeling Social Media.

Overall, WWW16 was a great experience once again, and I look forward to next year in Australia.

Weekly Report: Mar. 28 – Apr. 3, 2016

  • Monday
    • Booked Germany flight
    • Reached out to Jimmy about Deb Roy @ MIT
    • Emailed IC postdoc program about transferring advisors
  • Tuesday
    • START meeting on status
      • Fixed a bug in access to the Gnip data in S3
    • Ran new Gnip jobs
      • Tunisia soldier attacks
      • South African elections
      • Nigerian elections
      • Cote D’Ivoire + Simone Gbagbo sentencing
  • Wednesday
    • Attended Jen’s CLIP talk
    • Went to the Social Media and Demographic Methods workshop at PAA2016
  • Thursday
    • Worked on my IC postdoc application
    • Started the RecSys short paper on reinterpreting credibility in the CREDBANK data set
  • Friday
    • Added code to my credibility analysis notebook for extracting tweet polarity and subjectivity/objectivity (using the TextBlob package)
      • Density over time
      • Polarity over time
      • Subjectivity over time
    • Ran t-tests on the differences in network density, polarity, and objectivity (a sketch follows this week’s notes)
      • Median density for credible events is significantly lower than for non-credible events (T-Test Statistic: -2.036, p-Value: 0.0457)
      • Median polarity is higher for credible events (T-Test Statistic: 2.699, p-Value: 0.0088)
      • Credible events contain more subjective content than non-credible events (T-Test Statistic: 2.436, p-Value: 0.0176)
        • This was surprising and counterintuitive to me.
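
As referenced above, a minimal sketch of the Friday t-tests; the synthetic per-event values stand in for the real median polarity, density, and subjectivity measurements per event.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Synthetic stand-ins for per-event median polarity in each group
credible = rng.normal(0.10, 0.05, 50)
noncredible = rng.normal(0.05, 0.05, 50)

# Independent two-sample t-test between the groups
t_stat, p_value = stats.ttest_ind(credible, noncredible)
print(f"T-Test Statistic: {t_stat:.3f}, p-Value: {p_value:.4f}")
```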