Weekly Reports: Sept. 5 – Sept. 11, 2016

  • Monday
    • Holiday!
  • Tuesday
    • Created graphs for daily frequencies of tweets and relevant tweets for Nigerian and South African election data
      • Nigeria: Nigeria
      • South Africa: South Africa
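The daily frequency counts behind graphs like these could be computed roughly as follows. This is a minimal sketch, not the code used for the report: the `(day, is_relevant)` record shape, the `daily_counts` name, and the sample rows are all assumptions made for illustration.

```python
from collections import Counter
from datetime import date


def daily_counts(tweets):
    """Count all tweets and relevant tweets per calendar day.

    `tweets` is an iterable of (day, is_relevant) pairs. This record shape is
    hypothetical; the report does not describe the actual data layout.
    """
    total = Counter()
    relevant = Counter()
    for day, is_relevant in tweets:
        total[day] += 1
        if is_relevant:
            relevant[day] += 1
    # One row per day, sorted chronologically, ready to plot as two series.
    return [(day, total[day], relevant[day]) for day in sorted(total)]


# Invented sample records, not real election data.
sample = [
    (date(2015, 3, 27), True),
    (date(2015, 3, 27), False),
    (date(2015, 3, 28), True),
    (date(2015, 3, 28), True),
    (date(2015, 3, 28), False),
]

rows = daily_counts(sample)
for day, n_all, n_relevant in rows:
    print(day.isoformat(), n_all, n_relevant)
```

Each row holds the day, the total tweet count, and the relevant tweet count, so the two series can be drawn on the same axes.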
  • Wednesday
    • Integrated Stanford’s CoreNLP library into Spark again for use with Jupyter and the Toree Spark kernel. The integration accepts an RDD of (Date, String) pairs for evaluation.
      • Fought a lot with the EC2 instances on Amazon (as launched by spark-ec2 scripts) to get them to use Java 1.8, which is necessary for CoreNLP.
    • Reviewed a bunch of papers for the HCIL CHI Clinic
  • Thursday
    • START meeting for progress and tasking
    • CHI paper clinic
  • Friday
    • Finished running the raw tweet -> relevance filtering -> community identification -> per-community sentiment analysis pipeline for both the 2015 Nigerian and 2014 South African elections.
      • Label Propagation in Spark identifies one large, central community and many smaller communities. Most of these smaller communities seem unrelated to the primary event of interest.
        • For example, South Africa has one large group of >20k users that is focused on the election and several small groups focused on elections in other countries (or based in other countries). These other countries include Botswana, the US, and India.
        • Nigeria also has one major community and one smaller community relevant to Biafra, a Nigerian secessionist group/state, but the other 8 top groups are not related to Nigeria.
      • These results suggest the relevance filtering step we have in place does not adequately disambiguate election information: tweets about elections in other countries still pass the filter. The mechanism we have for identifying relevant communities seems to work well, however, so maybe the current pipeline should include an additional verification step where a human selects a few communities of interest.
      • It’s also worth noting that identifying a single major community is not very useful. Since our goal is to identify differences in sentiment across communities, having a single large community works against that goal. Our results also show that these large communities have relatively static average daily sentiment across the election cycle (according to CoreNLP). A different method for identifying communities may be necessary.
        • Alternatives include topic similarity, hashtag usage, or actually extracting full friend/follower lists.
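For reference, the community identification step discussed above can be illustrated with a small label propagation sketch in plain Python. It is not Spark's implementation and not the pipeline's code: the user-to-user edge list, the asynchronous update order, the smallest-label tie-break, and the names `label_propagation` and `communities_by_size` are all assumptions made for illustration. Ranking communities by size is included because it would give a human reviewer a short list to choose communities of interest from.

```python
from collections import Counter


def label_propagation(edges, max_iters=20):
    """Assign each node the most common label among its neighbors.

    `edges` is an iterable of undirected (user_a, user_b) pairs. How the real
    user graph was built is not stated in the report, so this is hypothetical.
    """
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    # Every node starts in its own community.
    labels = {node: node for node in neighbors}

    for _ in range(max_iters):
        changed = False
        # Asynchronous updates in a fixed order keep the result deterministic.
        for node in sorted(neighbors):
            counts = Counter(labels[n] for n in neighbors[node])
            best = max(counts.values())
            # Break ties by taking the smallest label.
            new_label = min(lab for lab, c in counts.items() if c == best)
            if new_label != labels[node]:
                labels[node] = new_label
                changed = True
        if not changed:
            break
    return labels


def communities_by_size(labels):
    """Group nodes by label and return the groups, largest first."""
    groups = {}
    for node, label in labels.items():
        groups.setdefault(label, []).append(node)
    members = [sorted(group) for group in groups.values()]
    return sorted(members, key=lambda m: (-len(m), m))


# Invented toy graph: one triangle of users and one separate pair.
toy_edges = [("a", "b"), ("b", "c"), ("a", "c"), ("x", "y")]
toy_labels = label_propagation(toy_edges)
ranked = communities_by_size(toy_labels)
print(ranked)
```

On the toy graph the triangle and the pair end up as two separate communities, with the triangle ranked first. On real data, a reviewer could inspect only the first few entries of `ranked` rather than every small community.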