Princeton Election Consortium

A first draft of electoral history. Since 2004

FRS178: Outline and Reading

Freshman Seminar 178, Spring 2017
Statistics, Journalism, and the Public Interest
Instructor: Sam Wang, Neuroscience/Molecular Biology/Law and Public Affairs

Location and time: Thursdays 1:30pm-4:20pm, Neuroscience A59.
Bring your laptop computer.
Extra class: Monday, May 15th, during reading period.
Course Assistant: Ben Deverett

Seminar title and description: Modern life is complex, and so is the news. News organizations contend with many pressures, including the need to present information in a storylike manner even when reality is quantitatively nuanced; the presence of competing chatter from other information sources which are often inaccurate; and a culture in which individual citizens believe themselves to be in possession of a “better” truth. Math and statistics provide tools to solve these problems. This seminar will introduce students to the practice of journalism using math. Students will study the basic tools of data analysis, and apply those tools to telling stories in an interesting, but also quantitatively honest manner. Examples of problems to be considered include opinion poll analysis; assessment of autism risk; how people form false beliefs; the replication crisis in social science; and the use of Big Data in reporting.

We hope to assemble a mix of students, each of whom has some experience in one of the following three domains: (a) statistics, (b) computer programming, or (c) journalism or storytelling. Participants will explore how statistical analysis – and not just the reporting of a statistic – can guide good journalism. The class will divide into teams of two or three students each. Each team will pursue a project for the semester. The project can be the writing of an original article, the design of an interactive app for consumption by news readers, or some other project that illustrates the intersection of data with the public interest.

Each week, we will do at least two of the following: (a) cover a major concept in data and storytelling, and use the concept to introduce relevant code and statistical ideas; (b) have a discussion, led by students or by an external visitor; and (c) hold a computer lab to work with data. Readings will be taken from statistical textbooks and from news outlets such as the New York Times, the Washington Post, Science, and Nature. In some sessions, student teams will be expected to run and modify programs written in Python and/or MATLAB. Sometimes we will use Microsoft Excel.

For those who will do more computing, optional help sessions are available from PICSiE.

Course credit policy:

20% – Attendance & class participation.
20% – Project #1, due March 30th. Data journalism or public-interest piece on a topic that combines statistics with the public interest.
20% – Homework.
40% – Project #2, due May 15th. An original computer program or app, or a data journalism feature.

Students are expected to make one presentation during the term. This presentation can be on one of their projects, or another topic.

FRS178 Outline
Version 1.2 – February 9th, 2017


Session 1, February 9: Overview of statistics, data science, and the public interest
SW: Course policies. What does it mean to do data journalism? What should we do with single examples? How can we combine multiple examples? How is variability quantified? What is an outlier? Common yardsticks of variation: confidence intervals, risk ratios and effect sizes as ways to compare risks and treatment efficacy. Examples: Autism, brain training, elections.

Lab: Install and use Anaconda, a Python programming environment. Introduction to Python.

(optional) February 15: Making plots in Python
Ben Deverett will lead a campuswide tutorial on how to make plots in Python. Lewis Library room 347, Visualization Lab.

Session 2, February 16: Matching numbers to the story
Visitor: Brian Kernighan, Computer Science, on How to Lie With Numbers.
SW: (continue from Session 1)
Lab: Data scraping exercise. Example: Twitter API.

Session 3, February 23: Data sources and scraping
SW: A few project ideas.
Lab: Data scraping, continued. Example: Huffington Post polling API.
Students: start forming small teams.
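
A polling API typically returns JSON, so the step after scraping is turning that response into numbers. The sketch below assumes a made-up response format: the field names (`pollster`, `candidate_1`, `candidate_2`) and the values are illustrative only and are not taken from the Huffington Post or Twitter APIs. A real lab exercise would fetch the text over the network and use the field names documented by the API.

```python
import json
from statistics import median

# Hypothetical API response. Field names and numbers are invented for
# illustration and do not come from any real polling API.
SAMPLE_RESPONSE = """
[
  {"pollster": "Pollster A", "candidate_1": 48, "candidate_2": 44},
  {"pollster": "Pollster B", "candidate_1": 45, "candidate_2": 46},
  {"pollster": "Pollster C", "candidate_1": 47, "candidate_2": 45}
]
"""


def margins_from_json(text):
    """Parse a JSON list of polls and return candidate_1 minus candidate_2."""
    records = json.loads(text)
    return [record["candidate_1"] - record["candidate_2"] for record in records]


def median_margin(text):
    """Summarize a set of polls by the median margin, which resists outliers."""
    return median(margins_from_json(text))


if __name__ == "__main__":
    print("margins:", margins_from_json(SAMPLE_RESPONSE))
    print("median margin:", median_margin(SAMPLE_RESPONSE))
```

The median is used here rather than the mean because a single unusual poll moves it less, which connects to the Session 1 question of what to do with outliers.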

Session 4, March 2: Data scraping clinic
Visitor: Alexey Svyatkovskiy, PICSiE, on data scraping.
SW is traveling.


Session 5, March 9: Project ideas / student project workshopping
Students: Discuss project ideas and try them out in lab.
Demo: Publishing climate/census/economic data via StatX, from Excel or an API.

Session 6, March 16: Probability and outliers
SW: Probability distributions. Systematic errors. Turning margins into probabilities: is this a good idea? Is the margin of error a useful concept? When is it useful, and when not?
Communicating about political races. Costs and benefits. Example: U.S. political prediction and its failures.
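
One way to turn a margin into a probability can be sketched as follows. This is a minimal illustration, not the method used in class: it assumes the error on a candidate's lead is normally distributed, that `margin_sd` is the standard deviation of the lead itself (not of a single candidate's share), and that sampling error and systematic error are independent so they add in quadrature. The function names are hypothetical.

```python
import math


def combined_sd(sampling_sd, systematic_sd):
    """Combine independent error sources in quadrature (an assumption)."""
    return math.hypot(sampling_sd, systematic_sd)


def win_probability(margin, margin_sd):
    """Probability that the true margin is above zero under a normal error model.

    margin    -- estimated lead in percentage points (positive = ahead)
    margin_sd -- standard deviation of that lead, in the same units
    """
    if margin_sd <= 0:
        raise ValueError("margin_sd must be positive")
    z = margin / margin_sd
    # Standard normal cumulative distribution, written with the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


if __name__ == "__main__":
    lead = 2.0
    sampling_only = win_probability(lead, 3.0)
    with_systematic = win_probability(lead, combined_sd(3.0, 3.0))
    print(f"sampling error only:   {sampling_only:.2f}")
    print(f"with systematic error: {with_systematic:.2f}")
```

Adding a systematic error term pulls the probability toward 50%, which is one reason a confident-looking probability can fail when all polls are off in the same direction.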

March 20-24 – Spring Break


Session 7, March 30: Regression and machine learning
SW: Linear regression. Fitting to models. Overfitting. Machine learning.
Resource: The page on machine learning and Scientific Python includes tutorials and other resources.
Project #1 is due.
Homework #1: Google Correlate.
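
As a small illustration of linear regression, the sketch below fits a straight line by ordinary least squares using only the Python standard library. The function names are hypothetical. Computing the error separately on points that were held out of the fit is one simple check for overfitting: a model that does much worse on held-out data than on its training data has likely fit noise.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations in x, and of cross-deviations in x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def mean_squared_error(xs, ys, slope, intercept):
    """Average squared gap between the observed y values and the fitted line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys)) / len(xs)


if __name__ == "__main__":
    # Toy data lying exactly on y = 2x + 1, used only to show the calls.
    train_x, train_y = [0, 1, 2, 3], [1, 3, 5, 7]
    slope, intercept = fit_line(train_x, train_y)
    print("slope:", slope, "intercept:", intercept)
    print("held-out error:", mean_squared_error([4, 5], [9, 11], slope, intercept))
```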

Session 8, April 6: Significance testing and variability
SW: Introduction to the idea of a probability that an event occurred by chance. What does “occurred by chance” mean? Frequentism. Common significance tests. Priors and Bayesian reasoning.
Homework #1 is due.
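
To make "occurred by chance" concrete, here is a sketch of one common significance test, an exact two-sided binomial test, written with the standard library. It assumes the null hypothesis is a fixed success probability (a fair coin by default) and counts as "at least as extreme" every outcome that is no more likely than the observed one. The function name is hypothetical.

```python
from math import comb


def two_sided_binomial_p(successes, trials, p_null=0.5):
    """Exact two-sided p-value for observing `successes` out of `trials`.

    Sums the null probabilities of all outcomes that are no more likely
    than the observed outcome.
    """
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    probs = [
        comb(trials, k) * p_null**k * (1.0 - p_null) ** (trials - k)
        for k in range(trials + 1)
    ]
    observed = probs[successes]
    # The small tolerance keeps outcomes tied with the observed one in the sum.
    total = sum(p for p in probs if p <= observed * (1.0 + 1e-9))
    return min(1.0, total)


if __name__ == "__main__":
    p = two_sided_binomial_p(60, 100)
    print(f"60 heads in 100 flips of a fair coin: p = {p:.3f}")
```

The p-value is the probability of data this extreme if the null hypothesis is true. It is not the probability that the null hypothesis is true, which is where priors and Bayesian reasoning come in.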

Session 9, April 13: Bayesian thinking and the replication crisis
SW: Should a single study ever be believed? The replication crisis in psychology and other social sciences. The difficulty with p-values. Priors. Bayesian reasoning.
Student presentations. Show one of the following: your data set, your calculation, or your front end.
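
The question of whether a single study should be believed can be framed with Bayes' rule. The sketch below is an illustration under simple assumptions: a fraction `prior` of tested hypotheses are true, a true effect is detected with probability `power`, and a false one passes the significance threshold with probability `alpha`. The function name and the example numbers are hypothetical.

```python
def prob_finding_is_true(prior, power, alpha):
    """Probability that a statistically significant finding reflects a true effect.

    prior -- fraction of tested hypotheses that are actually true
    power -- chance a true effect reaches significance
    alpha -- chance a null effect reaches significance (false positive rate)
    """
    for name, value in (("prior", prior), ("power", power), ("alpha", alpha)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    true_positives = power * prior
    false_positives = alpha * (1.0 - prior)
    if true_positives + false_positives == 0:
        raise ValueError("no significant findings are possible with these inputs")
    return true_positives / (true_positives + false_positives)


if __name__ == "__main__":
    # Illustrative inputs: 1 in 10 hypotheses true, 80% power, 5% threshold.
    print(f"{prob_finding_is_true(0.10, 0.80, 0.05):.2f}")
```

With these illustrative inputs, roughly a third of significant findings would be false even though every study used the conventional 5% threshold. Lower power or a smaller prior makes the fraction worse.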


Session 10, April 20: Cognitive biases
SW: Cognitive biases. Examples: false beliefs and journalists who follow a “narrative.”
Visitor: Joe Stephens, Princeton Journalism Program.
Homework #2 is due.

Session 11, April 27: Reporting on Science
Visitor: Joe Palca, NPR.
Student progress reports.

Session 12, May 4: Multiple causation
SW: Gene-environment effects. The genetics of intelligence. Examples: autism, neuropsychiatric disorders, and normal personality.
Student progress report or lab workshopping.
Homework #3 is due.

Field trip, Monday, May 8th: New York Times
We will go to Manhattan to visit the New York Times newsroom. Stops: Sunday Review, The Upshot.

Session 13, May 12: Student project demonstrations


FRS 178 Data Project Ideas

Project: Citizen engagement with news articles
The challenge: How to connect citizens efficiently with the people in a news story, both the journalists who wrote it and the political figures it mentions.
The solution: a web app to provide optimal contact information.

Given a news URL (here is an example), extract the text of the article, identify the journalists who wrote the article and the political figures who are mentioned, and provide optimal ways to contact those people. This exercise requires data scraping, matching to people, and a well-thought-out strategy for reaching people (i.e. who is best reached by email? who is reached in person? by phone?).

In the example case, key figures would include Emmarie Huetteman, Yamiche Alcindor, Senator Susan Collins, Senator Lisa Murkowski, Betsy DeVos, and others. Return the most effective way of reaching those people, which is email for reporters (Huetteman and Alcindor), Washington and home-state phone numbers and physical addresses for Senators, and some other unknown route for others. Note that optimizing the contact path is nontrivial – it requires locating email and Twitter addresses for reporters, and recognizing what category each person falls into. Finally, display the information so that the user can do something with it immediately.

Bonus 1: connect with relevant factual resources to provide background for article topic.
Bonus 2: make it robust and easy to update.
Needed components: data scraping, name identification, building a database of political actors.
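
The matching step can be sketched as follows. This is a minimal illustration, not a design for the project: the `Person` class, the tiny hard-coded database, and plain substring matching are all assumptions, and a real version would need a much larger database and more careful name identification. The names and contact routes are the ones given in the example above. No actual contact details are included.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Person:
    name: str
    category: str  # "reporter", "senator", or "other"


# Preferred route per category, as described in the project text.
PREFERRED_ROUTE = {
    "reporter": "email",
    "senator": "Washington and home-state phone numbers and physical addresses",
}

# Hypothetical starter database built from the example article's key figures.
PEOPLE = [
    Person("Emmarie Huetteman", "reporter"),
    Person("Yamiche Alcindor", "reporter"),
    Person("Susan Collins", "senator"),
    Person("Lisa Murkowski", "senator"),
    Person("Betsy DeVos", "other"),
]


def find_people(article_text, people=PEOPLE):
    """Return the known people whose full name appears in the article text."""
    text = article_text.lower()
    return [person for person in people if person.name.lower() in text]


def contact_plan(article_text, people=PEOPLE):
    """Map each person found in the article to the preferred contact route."""
    return {
        person.name: PREFERRED_ROUTE.get(person.category, "unknown")
        for person in find_people(article_text, people)
    }


if __name__ == "__main__":
    # Made-up snippet standing in for scraped article text.
    snippet = "By Emmarie Huetteman. Senator Susan Collins commented on Betsy DeVos."
    for name, route in contact_plan(snippet).items():
        print(f"{name}: {route}")
```

Keeping the category-to-route table separate from the list of people is one way to address Bonus 2: new people can be added without touching the contact logic.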

Project: Find-A-Town-Hall
A plug-in for the competitive district finder at the Princeton Election Consortium. That application finds competitive Congressional districts near the user. An event finder would locate activist events related to the district, and return them to the user. Note that there are people who are developing Town Hall databases, so there is some information available online. For robustness it would be best to rely on multiple online databases. Also note that the PEC project has been developed into, so that is another possible partner.

Project: Factville (advanced)
The challenge: A proliferation of false information on sites masquerading as news. These sites spring up and go viral so quickly that fact-checking organizations cannot keep up.
The solution: Create a social media game that allows people to score points by critiquing false statements, and triggering replies.
Needed components: a short list of accepted information sources for checking (Wikipedia, Washington Post, New York Times).