Weekly Reports: August 29 – September 4, 2016

  • Monday
    • Finished query expansion for Nigeria 2011 election using Gnip sample
    • Submitted the job to pull Nigeria 2011 election with SME + expanded keywords from Gnip
  • Tuesday
    • Finished evaluating relevance judgments using hand-labeled data for Nigeria 2015 and South Africa 2014 elections
      • Agreement using Cohen’s Kappa ~= 0.75
      • Classification accuracy is high
        • 0.90 for Nigeria and 0.89 for South Africa using a linear SVM from scikit-learn with CountVectorizer and a TF-IDF transformation
      • A pooled model trained on both elections performs surprisingly well, with a score of 0.90
        • Transferring a model from South Africa to Nigeria, or vice versa, does not perform very well though, suggesting the important keywords are relatively disjoint
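
A minimal sketch of the Tuesday setup, assuming scikit-learn is installed. The tweets, labels, and annotator judgments below are made-up toy data rather than the hand-labeled sets, and the helper name build_model is hypothetical; only the pipeline shape (CountVectorizer, then TF-IDF, then a linear SVM) and the use of Cohen's Kappa come from the notes above.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import cohen_kappa_score
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy tweets and labels (1 = relevant, 0 = irrelevant).
texts = [
    "vote counting continues in the election",
    "election results announced by the commission",
    "candidates campaign before the election vote",
    "polling stations open for the vote",
    "great football match last night",
    "new phone review and unboxing",
    "traffic is terrible this morning",
    "recipe for a quick dinner",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]


def build_model():
    """Bag-of-words counts -> TF-IDF weights -> linear SVM."""
    return Pipeline([
        ("counts", CountVectorizer()),
        ("tfidf", TfidfTransformer()),
        ("svm", LinearSVC(random_state=0)),
    ])


model = build_model().fit(texts, labels)
predictions = model.predict(["election vote results", "football match review"])

# Inter-annotator agreement for two hypothetical annotators.
annotator_a = [1, 1, 0, 0, 1, 0, 1, 0]
annotator_b = [1, 1, 0, 0, 1, 0, 0, 1]
kappa = cohen_kappa_score(annotator_a, annotator_b)

print(list(predictions), kappa)
```

The transfer check would follow the same pattern: fit build_model() on one country's labeled tweets and call score() on the other country's, while the pooled variant fits on the concatenation of both.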
  • Wednesday
    • Developed code for reading Excel data with Python and shunting it to Scala with Spark, so I can build a relevance classifier for the judged tweets for Nigeria and South Africa
    • Tested Spark’s ML library for learning to identify relevant tweets from human labels.
      • Currently, Spark’s ML library (NOT MLlib) only contains Naive Bayes, decision trees, and a few others, which don’t perform quite as well as the linear SVM I built in Python. Should compare these results with MLlib’s SVM implementation.
  • Thursday
    • Implemented a notebook for applying a classifier trained on human-judged tweets to the larger data sets from Nigeria and South Africa during their election periods. Note that I only applied this to the overlap experiment data, which covers a few days on either side of the election event rather than months around the event.
  • Friday
    • Applied the relevance classifier trained on human-judged tweets to the overlap-experiment tweets for Nigeria and South Africa
      • Results are unclear. In South Africa, query expansion yields a larger number of relevant tweets but a lower density, whereas in Nigeria, expansion increases both the number and the density. Further, the relative difference between SMEs and news is inconsistent between South Africa and Nigeria. In South Africa, news performed better, but in Nigeria, SMEs performed better.
      • Central accounts do seem to be consistently better at finding relevant tweets
      • South Africa:
        • Relevant: 194,628, Irrelevant: 1,910,315, Relevant Percent: 0.092437
        • umd-central-accounts, Counted: 83550.0
          • Relevant Count: 36651
          • Relevant Percent: 0.438671
        • umd-sme-keywords, Counted: 258482.0
          • Relevant Count: 67964
          • Relevant Percent: 0.262935
        • umd-sme-keywords-expanded, Counted: 918366.0
          • Relevant Count: 147113
          • Relevant Percent: 0.160190
        • umd-news-keywords, Counted: 380696.0
          • Relevant Count: 145884
          • Relevant Percent: 0.383203
        • umd-news-keywords-expanded, Counted: 1589471.0
          • Relevant Count: 155154
          • Relevant Percent: 0.097614
      • Nigeria:
        • Relevant: 5,298,568, Irrelevant: 1,790,816, Relevant Percent: 0.747288
        • umd-central-accounts, Counted: 1560307.0
          • Relevant Count: 1384520
          • Relevant Percent: 0.887338
        • umd-sme-keywords, Counted: 4904290.0
          • Relevant Count: 3574459
          • Relevant Percent: 0.728843
        • umd-sme-keywords-expanded, Counted: 6529544.0
          • Relevant Count: 4921017
          • Relevant Percent: 0.753654
        • umd-news-keywords, Counted: 4424309.0
          • Relevant Count: 3092892
          • Relevant Percent: 0.699068
        • umd-news-keywords-expanded, Counted: 6244873.0
          • Relevant Count: 4622833
          • Relevant Percent: 0.740261
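
For reference, the "Relevant Percent" values above are fractions (relevant count divided by counted tweets), not percentages. A small sketch of that arithmetic using the South Africa counts from this report; the function names are illustrative and not taken from the actual notebook.

```python
# (counted, relevant) pairs copied from the South Africa figures above.
south_africa = {
    "umd-central-accounts": (83550, 36651),
    "umd-sme-keywords": (258482, 67964),
    "umd-sme-keywords-expanded": (918366, 147113),
    "umd-news-keywords": (380696, 145884),
    "umd-news-keywords-expanded": (1589471, 155154),
}


def relevant_fraction(counted, relevant):
    """Density of relevant tweets among those matched by a rule set."""
    return relevant / counted


def rank_by_density(rulesets):
    """Rule-set names sorted from highest to lowest relevant density."""
    return sorted(
        rulesets,
        key=lambda name: relevant_fraction(*rulesets[name]),
        reverse=True,
    )


for name in rank_by_density(south_africa):
    counted, relevant = south_africa[name]
    print(f"{name}: {relevant_fraction(counted, relevant):.6f}")
```

The same dictionary shape works for the Nigeria counts, which makes the expanded versus unexpanded comparison a matter of comparing two entries.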
    • Read “The death of bin Laden: How Russian and US media frame counterterrorism”
      • Trying to determine what the vocabulary of user response to terrorism tells us
      • This paper is more about how Russian and US newspapers framed Osama bin Laden’s death (Russian papers highlighted negative aspects like extrajudicial killings, US global interests, political opportunities for Obama, whether the death was true, etc. US papers were more about success, US exceptionalism, etc.)