Weekly Reports: Sept. 5 – Sept. 11, 2016

  • Monday
    • Holiday!
  • Tuesday
    • Created graphs for daily frequencies of tweets and relevant tweets for Nigerian and South African election data
  • Wednesday
    • Integrated Stanford’s CoreNLP library with Spark again for use with Jupyter and the Toree Spark kernel. It accepts an RDD of (Date, String) pairs for evaluation.
      • Fought a lot with the EC2 instances on Amazon (as launched by spark-ec2 scripts) to get them to use Java 1.8, which is necessary for CoreNLP.
    • Reviewed a bunch of papers for the HCIL CHI Clinic
  • Thursday
    • START meeting for progress and tasking
    • CHI paper clinic
  • Friday
    • Finished running raw tweet -> relevance filtering -> community identification -> per-community sentiment analysis pipeline for both the 2015 Nigerian and 2014 South African elections.
      • The communities identified by Label Propagation in Spark have a large, central community with many smaller communities. Most of these smaller communities seem unrelated to the primary event of interest.
        • For example, South Africa has one large group of >20k users that is focused on the election and several small groups focused on elections in other countries (or based in other countries). Data includes Botswana, the US, and India.
        • Nigeria also has one major community and one smaller community relevant to Biafra, a Nigerian secessionist group/state, but the other 8 top groups are not related to Nigeria.
      • These results suggest the relevance filtering step we have in place is not adequate for disambiguation of election information. The mechanism we have for identifying relevant communities seems to work well, however, so maybe the current pipeline should include an additional verification step where a human selects a few communities of interest.
      • It’s also worth noting that identifying a single major community is not particularly useful. Since our goal is identifying differences in sentiment across communities, having one dominant community inhibits this goal. Our results also show these large communities have relatively static average daily sentiment across the election cycle (according to CoreNLP). A different method for identifying communities may be necessary.
        • Alternatives include topic similarity, hashtag usage, or actually extracting full friend/follower lists.
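The Label Propagation step above ran through Spark in practice, but the idea can be illustrated with a minimal stdlib sketch on a toy mention graph. The node IDs, edge list, and synchronous update rule here are illustrative stand-ins, not the actual pipeline code:

```python
from collections import Counter, defaultdict

def label_propagation(edges, iterations=10):
    """Toy synchronous label propagation: each node adopts the most
    common label among its neighbors; ties broken by smallest label."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    labels = {node: node for node in graph}  # every node starts in its own community
    for _ in range(iterations):
        new_labels = {}
        for node, neighbors in graph.items():
            counts = Counter(labels[n] for n in neighbors)
            top = max(counts.values())
            new_labels[node] = min(l for l, c in counts.items() if c == top)
        if new_labels == labels:  # converged
            break
        labels = new_labels
    return labels

# Two triangles joined by a single bridge edge -> two communities
edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
communities = label_propagation(edges)
```

On a real retweet/mention graph, the same dynamic is what produces the one giant community plus many small satellites described above.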

Weekly Reports: August 29 – September 4, 2016

  • Monday
    • Finished query expansion for Nigeria 2011 election using Gnip sample
    • Submitted the job to pull Nigeria 2011 election with SME + expanded keywords from Gnip
  • Tuesday
    • Finished evaluating relevance judgments using hand-labeled data for Nigeria 2015 and South Africa 2014 elections
      • Agreement using Cohen’s Kappa ~= 0.75
      • Classification accuracy is high
        • 0.90 for Nigeria and 0.89 for South Africa using linear SVM from Scikit with CountVectorizer and TF-IDF transformation
      • A pooled model trained on both elections performs surprisingly well, with a score of 0.90
        • Transferring the South Africa model to Nigeria and vice versa does not perform well, though, suggesting the two elections’ important keywords are relatively disjoint
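For reference, the Cohen's kappa figure above reduces to a few lines of stdlib Python. This is a minimal sketch, not the code actually used for the evaluation:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items:
    observed agreement corrected for chance agreement."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both annotators pick the same label
    expected = sum((counts_a[k] / n) * (counts_b[k] / n)
                   for k in counts_a.keys() | counts_b.keys())
    return (observed - expected) / (1 - expected)

# Hypothetical relevance judgments from two annotators (1 = relevant)
a = [1, 1, 0, 1, 0, 0, 1, 1]
b = [1, 0, 0, 1, 0, 1, 1, 1]
kappa = cohens_kappa(a, b)
```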
  • Wednesday
    • Developed code for reading Excel data with Python and shunting it to Scala with Spark, so I can build a relevance classifier for the judged tweets for Nigeria and South Africa
    • Tested Spark’s ML library for learning to identify relevant tweets from human labels.
      • Currently, Spark’s ML library (NOT MLlib) only contains Naive Bayes, decision trees, and a few others, which don’t perform quite as well as the linear SVM I built in Python. I should compare these results with MLlib’s SVM implementation.
  • Thursday
    • Implemented a notebook for applying a classifier trained on human-judged tweets to the larger data sets from Nigeria and South Africa during their election periods. Note that I only applied this to the overlap experiment data, which covers a few days on either side of the election event rather than months around the event.
  • Friday
    • Applied the relevance classifier from human tweets to the overlap-experiment tweets for Nigeria and South Africa
      • Results are unclear. In South Africa, query expansion yields a larger number of relevant tweets but lower density; in Nigeria, we see the opposite. Further, the relative performance of SME versus news keywords is inconsistent between the two countries: in South Africa, news keywords performed better, but in Nigeria, SME keywords did.
      • Central accounts do seem to be consistently better at finding relevant tweets
      • South Africa:
        • Relevant: 194,628, Irrelevant: 1,910,315, Relevant Percent: 0.092437
        • umd-central-accounts, Counted: 83,550
          • Relevant Count: 36,651
          • Relevant Percent: 0.438671
        • umd-sme-keywords, Counted: 258,482
          • Relevant Count: 67,964
          • Relevant Percent: 0.262935
        • umd-sme-keywords-expanded, Counted: 918,366
          • Relevant Count: 147,113
          • Relevant Percent: 0.160190
        • umd-news-keywords, Counted: 380,696
          • Relevant Count: 145,884
          • Relevant Percent: 0.383203
        • umd-news-keywords-expanded, Counted: 1,589,471
          • Relevant Count: 155,154
          • Relevant Percent: 0.097614
      • Nigeria:
        • Relevant: 5,298,568, Irrelevant: 1,790,816, Relevant Percent: 0.747288
        • umd-central-accounts, Counted: 1,560,307
          • Relevant Count: 1,384,520
          • Relevant Percent: 0.887338
        • umd-sme-keywords, Counted: 4,904,290
          • Relevant Count: 3,574,459
          • Relevant Percent: 0.728843
        • umd-sme-keywords-expanded, Counted: 6,529,544
          • Relevant Count: 4,921,017
          • Relevant Percent: 0.753654
        • umd-news-keywords, Counted: 4,424,309
          • Relevant Count: 3,092,892
          • Relevant Percent: 0.699068
        • umd-news-keywords-expanded, Counted: 6,244,873
          • Relevant Count: 4,622,833
          • Relevant Percent: 0.740261
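The per-source density figures above are simple ratios of classifier-relevant tweets to tweets counted for that source. A small sketch that reproduces the South Africa numbers from the raw counts:

```python
# Raw (relevant, counted) pairs from the South Africa overlap experiment
sources = {
    "umd-central-accounts":       (36651, 83550),
    "umd-sme-keywords":           (67964, 258482),
    "umd-sme-keywords-expanded":  (147113, 918366),
    "umd-news-keywords":          (145884, 380696),
    "umd-news-keywords-expanded": (155154, 1589471),
}

def relevant_percent(relevant, counted):
    """Fraction of a source's tweets the classifier marked relevant."""
    return relevant / counted

densities = {name: round(relevant_percent(r, c), 6)
             for name, (r, c) in sources.items()}
```

Ranking `densities` highest-first shows why central accounts stand out: they trade raw volume for a much higher fraction of relevant tweets.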
    • Read “The death of bin Laden: How Russian and US media frame counterterrorism”
      • Trying to determine what the vocabulary of user response to terrorism tells us
      • This paper is more about how Russian and US newspapers framed Osama bin Laden’s death (Russian papers highlighted negative aspects like extrajudicial killing, US global interests, political opportunities for Obama, and whether the death was real; US papers focused more on success, US exceptionalism, etc.)

Weekly Reports: May 30 – June 5, 2016, Expanding Twitter Terrorism Research

Apologies for the hiatus. Last week was my first full week back: I spent the previous two weeks in Germany for the ICWSM 2016 conference and presenting at the HCIL Symposium. It was my first time at ICWSM, my first time in Germany, and my first time presenting at the HCIL Symposium, and all of it was amazing!

I don’t have the usual day-by-day breakdown of my research, but I will instead post a general overview of my work from last week.

My ICWSM paper was on Twitter’s response to terrorist attacks in Western countries, and I focused specifically on the Boston Marathon bombing, Sydney Hostage Crisis, and Charlie Hebdo attacks (my poster is available here: ICWSM16_Poster_Portrait). Since writing the paper though, two additional tragic events occurred: the Paris November attacks, and the Brussels airport attacks. It made sense to apply the same analyses from my ICWSM paper to these new cases and see if the same behaviors were observed.

I also wanted to experiment with some of the new technology that supports interactive analyses on “big data,” so I began working with Anaconda, Apache Toree, and Bokeh-Scala to see if I could duplicate my original analyses directly on the big NSF-funded cluster we have on campus at the University of Maryland.

To these ends, I built a pair of Jupyter notebooks (using the Apache Toree Spark kernel) that runs on our cluster, reads data directly from HDFS, analyzes it with Spark, and produces graphics using Bokeh.

I’ve made these notebooks and the original ICWSM analysis available on Github. Feel free to modify and play with the data and analysis!

ICWSM 2016 Analytics

Paris November Attacks

Brussels Transit Attacks

Weekly Reports: May 2 – May 8, 2016

  • Monday
    • START status presentation slides
  • Tuesday
    • Finished START status slides
    • Interesting preliminary result about Twitter:
      • Previous research suggests terror attacks and crises do not increase the population of social media users. These events alter the distribution of topics discussed but do not drive more posts on the platform.
      • In looking at Nigerian and South African elections, major developments in these events do seem to increase the volume of messages posted to the platform. More research is necessary here since we only have an N of 2 right now.
  • Wednesday
    • START status slides
    • Finished draft of the ICWSM poster
  • Thursday
    • SIGIR camera ready draft
  • Friday
    • Worked on building a portable Anaconda installation for my HCIL symposium tutorial on social media during crises.

Weekly Reports: Apr. 25 – May 1, 2016

  • Monday
    • Industry day
    • Working on job talk
  • Tuesday
    • START meeting
    • Pulled 28 million tweets from France around the November Paris attacks
  • Wednesday
    • Built a sample of tweets from Nigeria and South Africa surrounding elections. These samples will be used for relevance judgments and to determine the distribution of relevant tweets during these times (maybe) and for building a validation set for automated relevance selection algorithms.
  • Thursday
    • Explored survey reports of negative security experiences and their correlation with crime data.
      • Used Uniform Crime Reporting (UCR) data provided by the DoJ and FBI for arrest and offense data (United States Department of Justice. Federal Bureau of Investigation. Uniform Crime Reporting Program Data: County-Level Detailed Arrest and Offense Data, 2012. ICPSR35019-v1. Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributor], 2014-06-12. http://doi.org/10.3886/ICPSR35019.v1)
      • Counted the number of respondents who have and have not had negative security experiences (e.g., had personally identifying information stolen, been the victim of an online scam, etc.), assigning +1 to any respondent who had such an experience and -1 to any who had not.
      • Summed these +/-1 values over the FIPS state/county data in the survey to determine how many such instances occurred in a given geographic area.
      • Then compared these geolocated sums with the UCR data for correlation.
      • At the county level, a moderate correlation exists between robbery and these experiences (Pearson’s r=0.455), with the sum over all crime slightly weaker (r=0.424). At the state level, robbery correlates more strongly (r=0.7), and the general crime sum is moderate (r=0.589).
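The sum-and-correlate procedure above can be sketched in stdlib Python. The FIPS codes and responses below are made-up illustrations, not the survey data:

```python
from collections import defaultdict
from math import sqrt

def sum_by_fips(responses):
    """responses: iterable of (fips_code, had_negative_experience).
    Each respondent contributes +1 if they reported a negative security
    experience and -1 otherwise; sums are grouped by FIPS area."""
    totals = defaultdict(int)
    for fips, had_experience in responses:
        totals[fips] += 1 if had_experience else -1
    return dict(totals)

def pearson_r(xs, ys):
    """Pearson's r between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical: two respondents with, one without, a negative experience
totals = sum_by_fips([("24031", True), ("24031", True),
                      ("24031", False), ("51059", False)])
```

The geolocated sums are then aligned with UCR counts per FIPS area and fed to `pearson_r` as the two sequences.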
  • Friday
    • Compared language distributions in Twitter for Nigeria, South Africa, the Ivory Coast, and Tunisia using Twitter’s 1% public sample stream and Gnip
      • Each country had five sets to compare: three sets extracted using geolocation (tweets posted from the country) from Twitter’s 1% and Gnip’s bounding box functionality, and two sets extracted from Gnip using a set of seed keywords relevant to two events I generated manually
      • The geolocation data seems relatively consistent across languages, but the seeded data is skewed towards English
  • Saturday
    • Built word clouds across four African countries using both geolocated tweets and tweets extracted using manually seeded keywords.
    • Can see results here

Weekly Reports: Apr. 18 – Apr. 24, 2016

  • Monday
    • RecSys paper work
    • Re-ran CREDBANK analysis against PHEME data set
      • Few overlapping features. Differentiating between true and false in PHEME is more difficult than in CREDBANK
      • CREDBANK random forest got ROCAUC=0.91, but PHEME could only get around 0.64
      • Verified accounts and status counts were the only significantly different features beyond those found in CoCo
      • PHEME ROCAUC
  • Tuesday
    • Meeting at START
    • Met with Jen re: RecSys paper
    • Continued RecSys paper work
  • Wednesday
    • Submitted paper to RecSys on analyzing credibility in the CREDBANK data set
  • Thursday
    • Brenna’s practice talk
    • BBL by Dr. O’Shea and Dr. Scanlon from Edinburgh and the Open University respectively
    • Made summary slides for Gary that discuss our Boston Marathon work
  • Friday
    • Worked on job talk
    • NGS2 tech discussion
    • Gave an interview about inferring personal characteristics from social media

Weekly Report: Mar. 28 – Apr. 3, 2016

  • Monday
    • Booked Germany flight
    • Reached out to Jimmy about Deb Roy @ MIT
    • Emailed IC postdoc program about transferring advisors
  • Tuesday
    • START meeting on status
      • Fixed a bug in access to the Gnip data in S3
    • Ran new Gnip jobs
      • Tunisia soldier attacks
      • South African elections
      • Nigerian elections
      • Cote D’Ivoire + Simone Gbagbo sentencing
  • Wednesday
    • Attended Jen’s CLIP talk
    • Went to the Social Media and Demographic Methods workshop at PAA2016
  • Thursday
    • Worked on my IC postdoc application
    • Started the RecSys short paper on reinterpreting credibility in the CREDBANK data set
  • Friday
    • Added code to my credibility analysis notebook for extracting tweet polarity and subjectivity/objectivity (using the TextBlob package)
      • Density over time
      • Polarity over time
      • Subjectivity over time
    • Ran t-tests on the difference between network density, polarity, and objectivity
      • Median density for credible events is significantly lower than for non-credible events (T-Test Statistic: -2.03626214044, p-Value: 0.0457402402121)
      • Median polarity is higher for credible events (T-Test Statistic: 2.69919034379, p-Value: 0.00881886216241)
      • Credible events contain more subjective content than non-credible events (T-Test Statistic: 2.43566952861, p-Value: 0.017572535262)
        • This was surprising and counterintuitive to me.
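The statistic behind these comparisons reduces to a few lines of stdlib Python. This sketch uses Welch's unequal-variance form as an assumption; the actual analysis may have used a pooled-variance Student's t or a library such as SciPy, which also supplies the p-value:

```python
from math import sqrt

def welch_t(sample_a, sample_b):
    """Welch's two-sample t statistic (unequal variances).
    Returns only the statistic; the p-value additionally requires the
    t-distribution CDF (e.g. scipy.stats.ttest_ind reports both)."""
    na, nb = len(sample_a), len(sample_b)
    ma, mb = sum(sample_a) / na, sum(sample_b) / nb
    # Sample variances with Bessel's correction
    va = sum((x - ma) ** 2 for x in sample_a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in sample_b) / (nb - 1)
    return (ma - mb) / sqrt(va / na + vb / nb)
```

Here each sample would be the per-event medians (density, polarity, or subjectivity) for credible versus non-credible events.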

Weekly Report: Mar. 21 – Mar. 27, 2016

  • Monday
    • Met with Erin, Sarvesh, and Ben regarding START + Gnip
    • Got access to Amazon AWS account for START work
      • This included developing instructions for assuming a role assigned by another AWS account to your account. So complicated.
    • Ran analysis code on Gnip data to determine language distributions, GPS coordinates of postings, and topics across relevant tweets in Cote D’Ivoire, Nigeria, South Africa, and Tunisia
      • This data was divided among three sets: tweets that come from the target country, tweets that mention a central account (pulled from analysis of relevant tweets in the 1% stream), and tweets that mention specific keywords
      • Ivory Coast soldier protest keywords:
        • (contains:divoire OR “ivory coast” OR abidjan) contains:soldier protest
      • Nigeria Kano Bombing
        • contains:bokoharam OR (kano nigeria) OR (mosque kano) OR (mosque blasts)
      • South African miner protest
        • (strike platinum) OR (amcu workers)
      • Tunisian crackdown
        • contains:تونس OR (contains:tunisia crackdown mosques) OR (contains:tunisia militants kill soldiers)
      • Very few GPS-coded tweets are present in the sets of relevant tweets
          • Cote D’Ivoire: 0, Nigeria: 572, South Africa: 31, Tunisia: 434
            • Note about Tunisia: all of those tweets are geolocated to Tunisia as a whole rather than a more specific location
      • Based on the topics extracted, the process of identifying common tokens and bigrams in tweets from the 1% stream that match a Lucene-like query for the event and using these unigrams and bigrams as input for Gnip rules does work.
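The token-counting process described above can be sketched as follows; the stop-word list and whitespace tokenizer are simplified stand-ins for whatever the real pipeline used:

```python
from collections import Counter

STOPWORDS = {"the", "a", "an", "at", "in", "of", "to", "and", "is", "rt"}

def top_terms(tweets, n=10):
    """Count unigrams and bigrams (minus stop words) across tweets that
    matched the event query; the most frequent terms become candidate
    inputs for Gnip PowerTrack rules."""
    unigrams, bigrams = Counter(), Counter()
    for text in tweets:
        tokens = [t for t in text.lower().split() if t not in STOPWORDS]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams.most_common(n), bigrams.most_common(n)

# Hypothetical tweets matching a miner-protest query
tweets = ["strike at the platinum mine",
          "platinum mine workers strike",
          "union backs the strike"]
unis, bis = top_terms(tweets)
```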
  • Tuesday
    • Attended DARPA’s NGS2 proposer day
  • Wednesday
    • Prepared for meetings
    • Reviewed survey data about socioeconomic status (SES) and Internet usage/privacy concerns
    • Reviewed the intel community post-doc fellowship listings and down-selected to a small list of interesting ones
  • Thursday
    • Met with Elissa about the SES survey data
      • Discussed areas of interest in descriptive statistics, prediction, and cross-data set correlation
    • Met with Jen
      • Lots of good feedback on post-docs, exploring human ability to predict SES from social media data
      • Going to explore a Recsys short paper on credibility

Weekly Report: Mar. 14 – Mar. 20, 2016

  • Monday
    • Streamlined the emigration pattern notebooks to ensure consistent results across Syria, Turkey, Ukraine, and Greece
      • Interesting result: Turkey has far more Twitter users than Ukraine; with a little less than twice the population, Turkey has an order of magnitude more Twitter users.
  • Tuesday
    • START + Twitter meeting
    • Worked on the SMDR paper some more
      • Generated new emigration path figures for Syria, Greece, Ukraine, and Turkey
      • Turkey has a lot more movement than other countries, but this may be an effect of having many more Twitter users.
    • Generated new emigration maps for SMDR paper
      • Maps for Syria, Greece, Ukraine, and Turkey (images in the original post)
    • Extended the Syria emigration map using Jen’s more complete data set she pulled from Twitter
  • Wednesday
    • Finished a draft of the SMDR paper
    • Ran the first PowerTrack job
      • It is surprising how much data is generated with a fairly limited keyword set. Searching for tweets that contained either “boston” or “marathon” over a 9-day period produced about 16 million tweets.
  • Thursday
    • Used data from GDELT to find articles related to a set of conflicts in Africa
      • Downloaded content of these articles and counted the most common words (less stop words) to generate potential search terms for use in the Twitter trial
      • Keywords for “south africa mining protest police”
        • “africa”, “south”, “african”, “platinum”, “strike”, “union”, “industry”, “wage”, “workers”, “companies”, “police”, “mining”
      • Keywords for “tunisia protest terrorism constitution”
        • “tunisia”, “tunisian”, “algeria”, “political”, “terrorist”, “attack”, “terrorism”, “parties”, “algerian”, “security”, “people”, “islamic”
      • Keywords for “ivory coast soldiers protest”
        • “ivory”, “coast”, “soldiers”, “government”, “protests”, “i.coast”, “cote”, “demands”, “gbagbo”
      • Keywords for “suicide bomber terrorists mosque nigeria”
        • “nigeria”, “nigerian”, “haram”, “boko”, “mosque”, “bomb”, “kano”, “killed”, “attack”, “terrorist”, “terror”
    • Running the interest profiles generated from this data in my query expansion code I built for TREC
  • Friday
    • Extended my TREC query expansion code to generate bigrams as well as unigrams
    • START meeting to discuss alternate methods for keyword extraction
    • Finalized and submitted the SMDR paper
    • Generated graphs of Twitter usage and language distributions on Twitter’s 1% stream for Tunisia and Cote D’Ivoire. All data is generated from 1% stream between 1 April 2013 and 31 December 2015.
      • Tunisia Frequency: Tunisian Tweet Frequency
        • Big spike in Tunisia seems related to One Direction. -_-
      • Tunisia Languages: Tunisian Language Distribution
        • I was surprised by the amount of French in Tunisia since the official language is Arabic. Wikipedia says French is the language of “commerce and education” though, which is consistent with these results.
      • Cote D’Ivoire Frequency: Cote D'Ivoire Tweet Frequency
        • Unclear what the spikes are here.
      • Cote D’Ivoire Languages: Cote D'Ivoire Language Distribution
        • French is by far the most popular language in which to tweet in Cote D’Ivoire.
  • Saturday
    • Expanded the graphs of Twitter usage and language distributions on Twitter’s 1% stream to Nigeria and South Africa
      • Nigeria Frequency: Nigeria Tweet Frequency
      • Nigeria Languages: Nigeria Language Distribution
      • South Africa Frequency: South Africa Tweet Frequency
      • South Africa Languages: South Africa Language Distribution
    • Nigeria and South Africa primarily use English on Twitter.
    • In all countries, there appears to be a significant drop in tweets in early May of 2015. I think this is an artifact of something in Twitter’s sampling but am unsure.
    • Also plotted the tweet locations for each of these four countries. Looks like most tweets come from coastal regions.
      • Cote D’Ivoire: Cote D'Ivoire Tweet Locations
      • Nigeria: Nigeria Tweet Locations
      • South Africa: South Africa Tweet Locations
      • Tunisia: Tunisia Tweet Locations
  • Sunday
    • Started Gnip PowerTrack jobs to extract conflict-relevant tweets from Nigeria, Cote D’Ivoire, South Africa, and Tunisia
      • Ivory Coast and Tunisia have relatively few tweets compared to Nigeria and South Africa

Weekly Report: Mar. 7 – Mar. 13, 2016

  • Monday
    • Re-ran Coco on Credbank data with the following order of tweet classification: retweet, (url, hashtag, media, user_mention), tweet
      • That is, a tweet is tagged as only a retweet, a tag for a type of entity, or a regular tweet if it does not match any of the previous types.
      • With this data set, credible events have >4x as many URLs, 2x as many retweets, and a higher percentage of media posts.
        • Seems non-credible data has more user mentions and hashtags though
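That classification precedence can be sketched as a small function; the field names follow Twitter's JSON entities but are simplified, and this is not the actual Coco code:

```python
def classify_tweet(tweet):
    """Tag each tweet with exactly one type, checked in priority order:
    retweet first, then the first matching entity type, then plain tweet."""
    if tweet.get("retweeted_status"):
        return "retweet"
    entities = tweet.get("entities", {})
    for kind in ("urls", "hashtags", "media", "user_mentions"):
        if entities.get(kind):
            return kind
    return "tweet"
```

The early returns are what make the tagging mutually exclusive: a retweet containing a URL counts only as a retweet, so the per-type percentages sum to 100%.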
  • Tuesday
    • Met with Erin, Ben, and Sarvesh about INSPIRE status and tasking
    • Talked with Erin about the Social Media Demographic Research workshop and Polnet
      • SMDR -> migration in social media and what it can tell us
      • Polnet -> communities discussing contentious topics and the demographic differences between them. If we know one community is relatively deprived compared to the other from social science research, can we support or corroborate this with social media?
    • Finalized the ICWSM paper, “Evaluating Public Response to the Boston Marathon Bombing and Other Acts of Terrorism through Twitter”
      • Submitted the proofed copy to AAAI
    • Began outlining the extended abstract for the Social Media Demographic Research workshop
  • Wednesday
    • Started extracting tweets in Greece
    • Started the TREC gating code with the expanded token set
    • Updated website
  • Thursday
    • Met with START about spinning up AWS account access for INSPIRE
    • Attended HCIL’s BBL
    • Continued work on SMDR paper
  • Friday
    • Verified that query expansion has no major effect on gating TREC systems.
      • With query expansion, scores are very close to gating without query expansion (average increase is slightly worse for ELG and slightly better for nCG)
    • More work on SMDR paper
    • Started extracting Greece users for emigration research
    • Paperwork for ICWSM/WWW travel
    • Put together a page on the HCIL site for my workshop on social media analytics here