- Proxy variables
- Thanks Google for figuring out my commute
- How racist are we, really?
- How web sites measure us
- Status update No. 2 for your project (due by next class)
- Read about visualizations, find a good visualization and a bad visualization (due by next class)
The things we do
Most journalistic endeavors try to find something dramatic: a politician who is corrupt, an executive who is racist, a company that is evil. Or the flip side: the most honest politician, the most equitable executive, the most angelic company. But the available data is rarely that direct. There is no easy measure of evil or corruption, of honesty or equity. If such data were public and obvious, it would already have been trumpeted.
So much of the most insightful data analysis and journalism relies on things that are implied from data.
Every morning and every evening, Google's Android system helpfully tells me how long it's going to take to get home:
How? I've never set a configuration option about where I live or where I work.
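One plausible answer is that the phone's own location history acts as a proxy for an address book. The sketch below is purely illustrative and is not Google's actual method: the `pings` list, the place labels, and the hour cutoffs are all made-up assumptions. It guesses "home" as the place seen most often overnight and "work" as the place seen most often during weekday office hours.

```python
from collections import Counter
from datetime import datetime

# Hypothetical location log: (timestamp, coarse place label).
pings = [
    ("2014-11-17 02:10", "A"),
    ("2014-11-17 10:30", "B"),
    ("2014-11-17 14:45", "B"),
    ("2014-11-17 23:50", "A"),
    ("2014-11-18 03:20", "A"),
    ("2014-11-18 11:15", "B"),
    ("2014-11-18 19:30", "C"),
]

def guess_home_and_work(pings):
    """Guess (home, work) from when each place shows up in the log."""
    night, office = Counter(), Counter()
    for stamp, place in pings:
        t = datetime.strptime(stamp, "%Y-%m-%d %H:%M")
        if t.hour >= 22 or t.hour < 6:
            # Assumed "asleep at home" window: 10 p.m. to 6 a.m.
            night[place] += 1
        elif t.weekday() < 5 and 9 <= t.hour < 17:
            # Assumed "at work" window: weekdays, 9 a.m. to 5 p.m.
            office[place] += 1
    home = night.most_common(1)[0][0] if night else None
    work = office.most_common(1)[0][0] if office else None
    return home, work

home, work = guess_home_and_work(pings)
print("home:", home, "work:", work)
```

Nobody typed in an address here. The guess comes entirely from when and where the phone was seen.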
The power of metadata
In the wake of the Snowden-NSA revelations, Kieran Healy wrote this illustrative article: Using Metadata to find Paul Revere
I have been asked by my superiors to give a brief demonstration of the surprising effectiveness of even the simplest techniques of the new-fangled Social Networke Analysis in the pursuit of those who would seek to undermine the liberty enjoyed by His Majesty’s subjects. This is in connection with the discussion of the role of “metadata” in certain recent events and the assurances of various respectable parties that the government was merely “sifting through this so-called metadata” and that the “information acquired does not include the content of any communications”. I will show how we can use this “metadata” to find key persons involved in terrorist groups operating within the Colonies at the present time. I shall also endeavour to show how these methods work in what might be called a relational manner.
…Rest assured that we only collected metadata on these people, and no actual conversations were recorded or meetings transcribed. All I know is whether someone was a member of an organization or not. Surely this is but a small encroachment on the freedom of the Crown’s subjects. I have been asked, on the basis of this poor information, to present some names for our field agents in the Colonies to work with. It seems an unlikely task.
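| | StAndrewsLodge | LoyalNine | NorthCaucus | LongRoomClub | TeaParty | Bostoncommittee | LondonEnemies |
|---|---|---|---|---|---|---|---|
| Adams.John | 0 | 0 | 1 | 1 | 0 | 0 | 0 |
| Adams.Samuel | 0 | 0 | 1 | 1 | 0 | 1 | 1 |
| Allen.Dr | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| Appleton.Nathaniel | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| Ash.Gilbert | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| Austin.Benjamin | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| Austin.Samuel | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| Avery.John | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| Baldwin.Cyrus | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| Ballard.John | 0 | 0 | 1 | 0 | 0 | 0 | 0 |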
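A membership table like this already encodes a social network. As a minimal sketch, using only a few rows from the table above, two people can be linked by the number of organizations they share. That is the same as multiplying the person-by-group matrix by its transpose. The variable and function names are my own, chosen for illustration.

```python
groups = ["StAndrewsLodge", "LoyalNine", "NorthCaucus", "LongRoomClub",
          "TeaParty", "Bostoncommittee", "LondonEnemies"]

# A few rows of the membership table (1 = member, 0 = not a member).
membership = {
    "Adams.John":         [0, 0, 1, 1, 0, 0, 0],
    "Adams.Samuel":       [0, 0, 1, 1, 0, 1, 1],
    "Appleton.Nathaniel": [0, 0, 1, 0, 0, 1, 0],
    "Ash.Gilbert":        [1, 0, 0, 0, 0, 0, 0],
    "Austin.Benjamin":    [0, 0, 0, 0, 0, 0, 1],
}

def shared_groups(a, b):
    """Number of organizations that persons a and b both belong to."""
    return sum(x * y for x, y in zip(membership[a], membership[b]))

# Build person-to-person links, keeping only pairs that share a group.
people = list(membership)
links = {
    (a, b): shared_groups(a, b)
    for i, a in enumerate(people)
    for b in people[i + 1:]
    if shared_groups(a, b) > 0
}

for (a, b), n in sorted(links.items(), key=lambda kv: -kv[1]):
    print(f"{a} -- {b}: {n} shared group(s)")
```

No conversation was recorded, yet the links already show who sits in the middle of the network.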
Via Kieran Healy:
Racial preference and dating
What people say: I'm not opposed to dating interracially
What people appear to do: Engage less with users of different races
Via OKCupid's dating data blog: Race and Attraction, 2009 – 2014
One interesting thing is to compare what you see above with what those same users have told us about their racial attitudes. Answers to match questions have been getting significantly less biased over time:
Another article from OKCupid: How Your Race Affects The Messages You Get
As you can see, the races all match each other roughly evenly: good news. It means all other things being equal, two people, of whatever race, should have the same chance to have a successful relationship. But now let’s look at the table of how individuals actually reply to each other’s messages. First we’ll examine messages sent by men to women:
Patterns of interaction
The pair used a hefty data set from Facebook as their lab: 1.3 million Facebook users, selected randomly from among all users who are at least 20 years old, with from 50 to 2,000 friends, who list a spouse or relationship partner in their profile. That makes for a lot of social connections to analyze, roughly 379 million nodes and 8.6 billion links. The data was used anonymously.
Why use fancy network analysis when you can just look at metadata and timing? Even if you don't tell Facebook your relationship status, or write wall posts like, "IM GOING 2 BREAKUP W/ U!!", who you happen to interact with, and when, and for how long, is a good enough indicator.
From AllFacebook.com: Facebook Knows That Your Relationship Will End In A Week
As the service’s engineers built more and more tools that could uncover such insights, Zuckerberg sometimes amused himself by conducting experiments. For instance, he concluded that by examining friend relationships and communications patterns he could determine with about 33 percent accuracy who a user was going to be in a relationship with a week from now. To deduce this he studied who was looking at which profiles, who your friends were friends with, and who was newly single, among other indicators.
From Quartz: We know when Dzhokhar Tsarnaev sleeps
Before Boston Marathon bomber Dzhokhar Tsarnaev was captured, his Twitter account was discovered. The content of the tweets didn't say much, but the metadata of the tweets…OK, the metadata didn't say much either, except that the tweeting behavior seemed like it would be that of a college student. Via Quartz:
That’s our visualization of tweets by @J_tsar, a Twitter account that has been linked to Dzhokhar, one of the alleged Boston bombers. The darker the pink, the more tweets. What it tells us, quite mundanely, is that Dzhokhar stays up late, often smoking weed, and sleeps past noon. Like so many other college students.
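A chart like that needs nothing but timestamps. The sketch below bins tweets by hour of day and prints a crude text histogram. The timestamps are invented for illustration, and the Quartz graphic also split the counts by day, which this version skips.

```python
from collections import Counter
from datetime import datetime

# Made-up timestamps (local time), standing in for a real account's tweets.
tweet_times = [
    "2013-04-01 01:12", "2013-04-01 02:40", "2013-04-01 14:05",
    "2013-04-02 00:30", "2013-04-02 01:55", "2013-04-02 23:10",
    "2013-04-03 02:20", "2013-04-03 15:45",
]

# Count tweets per hour of day, ignoring the date.
by_hour = Counter(
    datetime.strptime(t, "%Y-%m-%d %H:%M").hour for t in tweet_times
)

# Text stand-in for the heat map: more marks = more tweets in that hour.
for hour in range(24):
    print(f"{hour:02d}:00 {'#' * by_hour[hour]}")

# Hours with no tweets at all hint at when the account owner sleeps.
quiet_hours = [h for h in range(24) if by_hour[h] == 0]
```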
Fraudsters, by definition, do not tell their marks, "Hey, I'm trying to defraud you." So Sift Science looks at how fraudsters operate, such as when they choose to visit a site and what address they email from:
At Sift Science, we analyze a lot of data. We distill fraud signals in real-time from terabytes of data and more than a billion global events per month. Previously, we discovered that the U.S. has more fraud than Nigeria and solved the mystery of Doral, FL. At our “Cats N’ Hacks” Hackathon last week, I decided to put some of our fraud signals to the test. Working with our Machine Learning Engineer, Keren Gu, we discovered some interesting fraud patterns
Measuring Google users' satisfaction with search results
How does Google's search engine even know whether the billions of links it suggests every day are any good? The sheer volume of results makes it infeasible to survey users about their satisfaction, i.e. there's no easy way for users to tell Google what is good. So Google simply measures how users behave after they click through to a page:
Steven Levy’s excellent book In the Plex describes how Google engineers figured out how to improve search results by mining their user behavior data (bold added):
"… Google could see how satisfied users were. … The best sign of their happiness was the "long click" – this occurred when someone went to a search result, ideally the top one, and did not return. That meant Google has successfully fulfilled the query. But unhappy users were unhappy in their own ways, most telling were the “short clicks” where a user followed a link and immediately returned to try again. "If people type something and then go and change their query, you could tell they aren’t happy," says Patel. "If they go to the next page of results, it’s a sign they’re not happy." Often called pogosticking, this refers to the behavior of users that click on a result, then "pogostick" back and forth between the search results and different websites, searching for satisfaction.
(btw, I highly recommend "In The Plex", one of the best books about how modern information technology is developed.)
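The long-click idea can be sketched in a few lines. This is an illustration under stated assumptions, not Google's implementation: the `Click` record, the example URLs, and the 10-second cutoff are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

SHORT_CLICK_SECONDS = 10  # arbitrary cutoff for this sketch, not a Google figure

@dataclass
class Click:
    result_url: str
    seconds_until_return: Optional[float]  # None means the user never came back

def classify(click):
    """Label a click as 'long', 'short', or 'other' from dwell time alone."""
    if click.seconds_until_return is None:
        return "long"   # went to the result and did not return
    if click.seconds_until_return < SHORT_CLICK_SECONDS:
        return "short"  # bounced straight back to the results page
    return "other"      # came back, but only after a while

clicks = [
    Click("a.example/page", None),
    Click("b.example/page", 3.0),
    Click("b.example/page", 4.5),
    Click("a.example/page", 95.0),
]

labels = [classify(c) for c in clicks]
print(labels)
```

The user never rates anything. Satisfaction is inferred from whether they come back, and how quickly.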
How webpages measure us
Chartbeat is an analytics service that not only tells you who visits your page, but how they interact with it:
See exactly where your readers are actively engaging with your stories. And where you’re losing their attention. Scroll Depth measures how far down the page your audience is reading so you can adjust your homepage content accordingly. We count pixels, so you get data.
Using the Network panel from the Chrome browser's web inspector, we see that a "ping" script is activated every 30 seconds or so, and the variables sent include x and y.
On Facebook, scrolling down a wall or your newsfeed will trigger a fetching script:
The data included in that fetch request includes everything from the numerical ID of the current page, 5281959998 for the NYTimes, to
9), to the browser that I'm using:
Try "Liking" a page and see what data you send. This is what happened when I "Liked" the New York Times:
So Facebook can track not just what we explicitly do…but implicit behavior, such as how far down we scroll on someone's page, when each request fires (i.e. how long we've been on the page), and when we leave the page. These are all factors that might indicate how much we "Like" a page (or someone), regardless of whether we actually hit the "Like" button.
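To make the idea concrete, here is a minimal sketch of what a site could infer from periodic pings like the ones described above. The tuple layout, the 30-second spacing, and the page height are assumptions for illustration, not the actual payload of Chartbeat or Facebook.

```python
# Hypothetical pings from one page view: (seconds since page load, scroll y in pixels).
pings = [(0, 0), (30, 400), (60, 1250), (90, 1250), (120, 900)]

def summarize_visit(pings, page_height_px):
    """Summarize implicit behavior from a list of timed scroll positions."""
    first, last = pings[0][0], pings[-1][0]
    deepest = max(y for _, y in pings)
    return {
        # The visit lasted at least from the first ping to the last one.
        "seconds_on_page_at_least": last - first,
        "deepest_scroll_px": deepest,
        "share_of_page_reached": deepest / page_height_px,
    }

visit = summarize_visit(pings, page_height_px=2500)
print(visit)
```

When the pings stop, the site also learns roughly when the reader left, without the reader clicking anything.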
Measuring by proxy
In statistics, a proxy or proxy variable is a variable that is not in itself directly relevant, but that serves in place of an unobservable or immeasurable variable. In order for a variable to be a good proxy, it must have a close correlation, not necessarily linear or positive, with the variable of interest.
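The "close correlation, not necessarily positive" condition can be checked with a few lines of code whenever the variable of interest can be observed for at least a small sample. The numbers below are invented. Note that Pearson's r only captures linear relationships, so it is one check among several rather than a full test of a proxy.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Invented sample where the hard-to-measure variable happens to be known.
variable_of_interest = [1, 2, 3, 4, 5]
candidate_a = [10, 8, 6, 4, 2]  # moves in the opposite direction
candidate_b = [3, 1, 4, 1, 5]   # only loosely related

r_a = pearson(variable_of_interest, candidate_a)
r_b = pearson(variable_of_interest, candidate_b)
print(f"candidate A: r = {r_a:.2f}")
print(f"candidate B: r = {r_b:.2f}")
```

Candidate A is the better proxy even though its correlation is negative, because what matters is how close the relationship is, not its sign.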
Do judges let a bad mood cloud their judgment? Who knows? How do you judge a judge's bad mood to begin with? So let's look at when they've last eaten, versus whether or not they grant parole. Via the NYT's Economix blog: Up for Parole? Better Hope You’re First on the Docket
A new paper finds that experienced parole judges in Israel granted freedom about 65 percent of the time to the first prisoner who appeared before them on a given day. By the end of a morning session, the chance of release had dropped almost to zero.
After the same judge returned from a lunch break, the first prisoner once again had about a 65 percent chance at freedom. And once again the odds declined steadily.
Note: This is just one paper, and the correlation is not undisputed. I cite it only as an example of how to use variables that are easily measurable – lunchtime, time of judgments – as a potential proxy for a variable that cannot be easily measured, e.g. the mood of a judge.
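The comparison behind a finding like this is simple to sketch: group rulings by where they fell in a session and compute the share granted at each position. The rulings below are invented and are not the paper's data.

```python
from collections import defaultdict

# Invented rulings: (order of the case within a session, parole granted?)
rulings = [
    (1, True), (1, True), (1, False),
    (2, True), (2, False), (2, False),
    (3, False), (3, False), (3, True),
    (4, False), (4, False), (4, False),
]

def grant_rate_by_position(rulings):
    """Share of paroles granted at each position in the session."""
    totals = defaultdict(lambda: [0, 0])  # position -> [granted, heard]
    for position, granted in rulings:
        totals[position][0] += int(granted)
        totals[position][1] += 1
    return {pos: g / n for pos, (g, n) in sorted(totals.items())}

rates = grant_rate_by_position(rulings)
for position, rate in rates.items():
    print(f"case #{position} in session: {rate:.0%} granted")
```

The judge's mood never appears in the data. Position in the session stands in for it.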
We saw in the first lesson how the speed of cop cars can be indirectly measured by looking at what time they passed by toll booths:
Via the Sun-Sentinel's Pulitzer series:
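The arithmetic behind the toll-booth approach is distance divided by elapsed time. The sketch below is my own illustration, not the Sun-Sentinel's code, and the distance and timestamps are made up.

```python
from datetime import datetime

def average_mph(miles_between, passed_first, passed_second):
    """Average speed between two toll booths, from the two pass times."""
    fmt = "%Y-%m-%d %H:%M:%S"
    t1 = datetime.strptime(passed_first, fmt)
    t2 = datetime.strptime(passed_second, fmt)
    hours = (t2 - t1).total_seconds() / 3600
    if hours <= 0:
        raise ValueError("second timestamp must be later than the first")
    return miles_between / hours

# Hypothetical: two booths 12 miles apart, passed 8 minutes apart.
speed = average_mph(12.0, "2011-10-01 08:00:00", "2011-10-01 08:08:00")
print(f"average speed: {speed:.0f} mph")
```

This gives an average over the whole stretch, so the car's top speed along the way was at least that high.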
Global warming or what?
As you have probably heard, people disagree about why the Earth is warming or if it's something to even be concerned about. The potential impact of global climate change can be considered a variable that's hard to measure.
So Reuters looked at a more benign dataset – how often U.S. coastal sensors detected flood-level waters:
A Reuters analysis of more than 25 million hourly readings from nearly 70 tide gauges around the United States shows that at most locations, the mean sea level has risen steadily in recent decades. Flooding has increased, too, as measured by the number of days a year that readings exceeded flood thresholds set by the National Weather Service at the 25 gauges with data spanning five decades or more.
Their interactive graphic:
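The measure Reuters describes, days per year with a reading above the flood threshold, could be computed along these lines. This is a sketch with invented readings and a made-up threshold. Real thresholds are set per gauge by the National Weather Service.

```python
from collections import defaultdict

# Invented hourly readings for one gauge: (ISO date, hour, water level in feet).
readings = [
    ("1970-03-01", 5, 2.1), ("1970-03-01", 6, 2.4),
    ("1970-09-12", 14, 3.2),
    ("2010-01-05", 9, 3.1), ("2010-01-05", 10, 3.4),
    ("2010-06-20", 22, 3.3),
    ("2010-10-02", 3, 2.9),
]
FLOOD_THRESHOLD_FT = 3.0  # made-up value for this sketch

def flood_days_per_year(readings, threshold):
    """Count distinct days per year with at least one reading above threshold."""
    days = defaultdict(set)
    for date, _hour, level in readings:
        if level > threshold:
            days[date[:4]].add(date)  # a set counts each day only once
    return {year: len(dates) for year, dates in sorted(days.items())}

flood_days = flood_days_per_year(readings, FLOOD_THRESHOLD_FT)
print(flood_days)
```

Years with no flood days simply do not appear in the result, so a real analysis would fill those in with zeros before charting a trend.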
via Center for Investigative Reporting and Tampa Bay Times, "America's Worst Charities"
From their first story, America's 50 worst charities rake in nearly $1 billion for corporate fundraisers
The worst charity in America operates from a metal warehouse behind a gas station in Holiday.
Every year, Kids Wish Network raises millions of dollars in donations in the name of dying children and their families.
But there's no direct metric for "being a bad charity". So this is what CIR/TBT uses as a proxy:
The United States is home to roughly 1.6 million tax-exempt organizations.
That's far too many to examine closely. So the Tampa Bay Times and The Center for Investigative Reporting used data collected by the nonprofit charity tracker GuideStar USA to narrow the pool to the 5,800 charities nationwide that report paying professional solicitation companies to raise donations.
We focused on these charities because relying heavily on for-profit fundraisers is one of the most inefficient ways to collect donations. Regulators and industry experts widely consider the practice a red flag for bad charities.
To tell the stories of America's worst charities, reporters started in California, Florida and New York, the largest states that require charities to disclose the results of their professional fundraising campaigns.
These states capture the fundraising activities of thousands of charities across the country, and in many cases record the donations raised and the cash paid to fundraisers in every state where a charity solicits donations.
Reporters zeroed in on charities that consistently kept less than 33 cents of every dollar donated. Watchdogs generally flag charities as wasteful if they keep less than 65 cents of every dollar raised.
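The two cutoffs in that passage translate directly into a small calculation. The charities and dollar figures below are invented. Only the 33-cent and 65-cent thresholds come from the story, and the story also required that a charity fall short consistently, which a single-year sketch like this one does not check.

```python
# Invented figures: (charity, dollars raised by solicitors, dollars paid to solicitors).
campaigns = [
    ("Charity A", 1_000_000, 900_000),
    ("Charity B", 500_000, 200_000),
    ("Charity C", 250_000, 50_000),
]

WORST_CUTOFF = 0.33     # reporters' bar: kept less than 33 cents per dollar
WATCHDOG_CUTOFF = 0.65  # watchdogs' bar: kept less than 65 cents per dollar

def share_kept(raised, paid_to_solicitors):
    """Fraction of each donated dollar the charity itself kept."""
    return (raised - paid_to_solicitors) / raised

def label(share):
    if share < WORST_CUTOFF:
        return "worst-list candidate"
    if share < WATCHDOG_CUTOFF:
        return "wasteful by watchdog standard"
    return "ok"

results = {name: label(share_kept(raised, paid))
           for name, raised, paid in campaigns}
for name, verdict in results.items():
    print(f"{name}: {verdict}")
```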
Dollars for Docs
Drug companies are sometimes alleged to reward doctors financially for prescribing their drugs. Drug prescription data is not available, so we can't directly measure whether speaker and consulting fees affect prescribing habits.
What companies claim:
Pharma companies often say their physician salesmen are chosen for their expertise. Glaxo, for example, said it selects “highly qualified experts in their field, well-respected by their peers and, in the case of speakers, good presenters.”
What companies sometimes let slip through:
Kentucky’s medical board placed Dr. Van Breeding on probation from 2005 to 2008. In a stipulation filed with the board, Breeding admits unethical and unprofessional conduct. Reviewing 23 patient records, a consultant found Breeding often gave addictive pain killers without clear justification. He also voluntarily relinquished his Florida license.
New York’s medical board put Dr. Tulio Ortega on two years’ probation in 2008 after he pleaded no contest to falsifying records to show he had treated four patients when he had not. Louisiana’s medical board, acting on the New York discipline, also put him on probation this year.
Yet during 2009 and 2010, Hastik made $168,658 from Lilly, Glaxo and AstraZeneca. Ortega was paid $110,928 from Lilly and AstraZeneca. Breeding took in $37,497 from four of the firms. Hastik declined to comment, and Breeding and Ortega did not respond to messages.
Be careful when choosing a proxy variable. The ones mentioned so far in this lesson are effective because they examine how people and organizations actually behave.
But when it comes to looking at results, the relationship to an organization's quality or "goodness" is not so direct. Take the case of teachers, who might receive bonuses if their students do well on standardized tests, or surgeons who are admonished if they have a low rate of patient survival.
In other words, to measure performance, the proxies are things that are the results of what teachers and surgeons do, not necessarily what they actually do:
- How well the students do on standardized tests
- Survival rate of patients after surgery
- Post-operation costs
- New York State issues "report cards" for cardiac surgeons.
- Is More Information Better? The Effects of “Report Cards” on Health Care Providers
…and that indirection should make some sense, because how do you measure the act of teaching otherwise? Sometimes, results are the only metric we might have.
These proxy metrics do not directly measure how a teacher or surgeon acts; they measure the results that the teacher or surgeon is purportedly responsible for.
A surgeon with a relatively high death rate, or high post-operation costs, may simply be dealing with high-risk patients. In fact, if a surgeon is at the top of his or her field, we might expect that they receive the sickest of patients, ones who don't have a high chance of survival to begin with.
And with teachers, their impact can depend on a variety of factors beyond their control, such as the student's success in previous grades, or at-home support.
Via the Washington Post, Meet Ashley, a great teacher with a bad ‘value-added’ score
How can one explain Ashley’s shockingly low score, however? As a principal who has always availed himself of data when evaluating teachers, I would sit down and have a conversation about the test results so that I could put them in context. Here is what we know about the context of Ashley’s score:
- This year, Ashley’s score was based on her two eighth grade classes, not the results of her Regents-level classes
- The two eighth grade classes were different curricula: one was an Algebra course and the other was a Math 8 course.
- The Algebra 8 course is geared towards the Regents exam, which is a high-school level assessment that is beyond the mathematical level of the NYS Math 8 examination. Ninety one percent of Ashley’s students in this class passed the Regents Algebra 1 examination. There is different content on the Math 8 exam, which can make it a challenge for some of our weaker Algebra students. In fact, of the students who took the Algebra course, one-quarter of them passed the Regents examination but scored below proficiency on the Math 8 exam.
- In the two weeks prior to the three-day administration of the Math 8 exam in April 2012, students in Ashley’s class had one week of vacation followed by three days of English testing. In the two weeks leading to the beginning of the Math 8 exam, Ashley saw her class only three times.
Rather than place the student results in context, the State issued a blind judgment based on data that was developed through unproven and invalid calculations. These scores are then distributed with an authority and “scientific objectivity” that is simply unwarranted. Along the way, teacher reputations and careers will be destroyed.
Note: There is more going on here than just questionable proxy variables. Maybe the variables aren't bad ones, but opponents of the State metrics would argue that proper context is not given…and of course, insufficient context can plague a variety of analyses.
It was happening again, and Dr. Richard Dal Col could hardly believe it. An emergency cardiac patient, yet another "salvage case," was dead, this time before surgery could even begin. Enraged and frightened, Dr. Dal Col stormed from the operating room into the administrator's office of St. Peter's Hospital in Albany.
"We've got to do something!" he recalls shouting in his anger at the system. "They're going to pull my license if this continues."
A month later, when the New York State Health Department released its annual report card of cardiac surgery, one of the first of its kind in the nation, St. Peter's had the highest mortality rate of any hospital in the state. A year later, Dr. Dal Col had the worst record of any heart surgeon in the state – the beginning of a period of anguish and self-doubt that has only eased as he has moved off the bottom of the list.
The year 1993, when the ranking was announced, was far better for Dr. Jeffrey Gold at the New York Hospital-Cornell Medical Center, where he enjoys life at the top of the list. Dr. Gold was No. 3 the year Dr. Dal Col was last of the 87 heart surgeons listed by name. In the 1995 report he is No. 1.
"How does it feel to be the Willie Mays of heart surgery?" he was asked by CBS. Last week, a man needing heart surgery called from upstate. "Doctor," he said, "I'm 47 years old, I have two young children, I want to live – and I want the best."
The story of Dr. Gold and Dr. Dal Col illustrates how the elite world of New York cardiac surgeons has struggled to adjust to a new set of rules aimed at improving care. The changes have led to turmoil within the insular cardiac community, forcing at least 21 low-performing doctors out of heart surgery in New York. At the same time, doctors suspect that some high-risk surgical patients are pressured to go out of state.
"There's a lot of talk that the sickest patients are going out of state," Barbara A. DeBuono, the New York State Health Commissioner, said. "The fact is that a person in cardiac arrest is not going to be flown to Ohio."
Dr. Omoigui points out that severely ill heart patients in western New York may simply more often be referred to Cleveland by doctors concerned about the high cardiac mortality rates reported at upstate hospitals in recent years.
Other criticism of the report centers on what is termed risk-factor inflation. Under the state system, doctors report patients' risk factors – like age, the pumping capacity of the heart, previous heart attacks – which are weighed so that doctors who take on severely ill patients are not unfairly penalized. In some cases, the death of a patient with many risk factors will count on a doctor's record as only half a mortality.
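One common way to weigh risk factors is to compare the deaths a surgeon actually had with the deaths that would be expected given how sick the patients were. The sketch below assumes each patient already has a predicted risk of death from some risk model. Those predicted risks, the surgeons, and the outcomes are all invented, and this is not presented as the state's exact formula.

```python
# Invented cases per surgeon: (predicted risk of death, patient died?).
cases = {
    "Surgeon X": [(0.02, False), (0.03, False), (0.05, True), (0.02, False)],
    "Surgeon Y": [(0.40, True), (0.50, False), (0.30, False), (0.60, True)],
}

def raw_mortality(patients):
    """Plain share of patients who died, ignoring how sick they were."""
    return sum(died for _, died in patients) / len(patients)

def observed_to_expected(patients):
    """Deaths that happened divided by deaths the risk model predicted."""
    observed = sum(died for _, died in patients)
    expected = sum(risk for risk, _ in patients)
    return observed / expected

for surgeon, patients in cases.items():
    print(f"{surgeon}: raw {raw_mortality(patients):.0%}, "
          f"observed/expected {observed_to_expected(patients):.2f}")
```

Surgeon Y looks worse on raw mortality but better once patient risk is counted. The same arithmetic shows why risk-factor inflation matters: overstating a patient's risk raises the expected deaths and makes the ratio look better.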
This update should include:
- The first 200 words of your story, i.e. the lede and nutgraf
- What you will have finished by Friday for your story
- What you will be doing on Monday for your story
In your GitHub repo, with the filename cool-viz.md, find and link (and screengrab):
- A visualization you found compelling
- A visualization you think is ineffective.
Write a graf on each. Try to use the readings listed below as points of comparison.
And no maps.
Some clips for inspiration and for thought
- Good Ol’ Excel Is The Ultimate Data Visualization Tool (In Most Cases)
- A Big Article About Wee Things - A compilation by ProPublica’s Lena Groeger; small images can make big impact
- How The Rainbow Color Map Misleads - More color is not always better
- The NYT’s Amanda Cox on Winning the Internet
- PowerPoint Does Rocket Science–and Better Techniques for Technical Reports - How ugly formatting can obscure the important information in technical presentations.
- Megan Jaegerman’s brilliant news graphics