Hacking Breaking News | Matthew Conlen (mathisonian)

Matthew Conlen is a human-computer interaction researcher working at Midjourney. He is the creator of Realtime, an automated data-journalism platform, and Idyll, a markup language for writing interactive multimedia documents.

Matthew has previously worked with The New York Times, NASA Jet Propulsion Laboratory, Our World in Data, FiveThirtyEight, and others. He received his Ph.D. from the University of Washington where he was advised by Jeffrey Heer in the Interactive Data Lab.


Hacking Breaking News

December, 2013

Even though I'm no longer working for a news organization, I can't help but continue to spend time thinking about how technology can be used both to build better tools for journalists and communities, and to disseminate information more efficiently.

While there seems to be some debate about whether or not technology is ruining journalism, I've found that there tends to be a sweet spot where tech is combined with human curation and oversight. This has the potential to result in something superior to either approach on its own. (More modern uses of Twitter and the app Circa are both going in the right direction here.)

So, I set out to work on a project taking advantage of some of the already existing, curated content sources that are available for free to anybody with an internet connection.

project

For the project definition, I came up with two questions that I would try to answer:

Given a near real-time stream of breaking news headlines, can we algorithmically determine the physical location of the events as they happen, along with some key terms or players associated with the stories?

If so, how can that information be used?

automating breaking news

To answer the first question I used the following strategy:

Watch the BreakingNews Twitter account for any updates.
Use the AlchemyAPI to process the tweets and extract entities.
If any of those entities are locations, use the Google Geocoding API to turn the location names into latitude-longitude bounding boxes.
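
The steps above can be sketched roughly as follows. This is an illustration, not the project's actual code: the entity shape (`{ type, text }`), the set of location types, and all function names are assumptions. The request URL and the `geometry.viewport` field follow the JSON format of Google's Geocoding API. The network calls themselves are left out.

```javascript
// Entity types treated as places. This set is an assumption for illustration.
const LOCATION_TYPES = new Set(["City", "Country", "StateOrCounty", "Region"]);

// Split extracted entities into place names to geocode and other key terms.
function splitEntities(entities) {
  const places = [];
  const terms = [];
  for (const entity of entities) {
    (LOCATION_TYPES.has(entity.type) ? places : terms).push(entity.text);
  }
  return { places, terms };
}

// Build a Geocoding API request URL for a place name.
function geocodeUrl(placeName, apiKey) {
  const params = new URLSearchParams({ address: placeName, key: apiKey });
  return "https://maps.googleapis.com/maps/api/geocode/json?" + params.toString();
}

// Reduce a geocoding response to a bounding box, or null if nothing matched.
function toBoundingBox(response) {
  if (response.status !== "OK" || response.results.length === 0) return null;
  const { southwest, northeast } = response.results[0].geometry.viewport;
  return {
    south: southwest.lat,
    west: southwest.lng,
    north: northeast.lat,
    east: northeast.lng,
  };
}
```

Each place name from `splitEntities` would be sent to the URL built by `geocodeUrl`, and the parsed JSON response passed to `toBoundingBox`.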

This is a great example of the power of open APIs. I was able to build all of that functionality in about 100 lines of JavaScript by taking advantage of the hard work that others have made available for free.

applying the results

The second question is open-ended, but this was my first attempt:

We have a computer program that knows what the most 'important' story is right now, and where that thing is happening.

Given that, we can create something to get updates on the story that are:

1. happening at the location of the event
2. happening in real-time
3. reported by a reputable source

To do this I turn once again to Twitter. It is straightforward to use the results of the geocoding API to cover points (1) & (2).
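
As a rough sketch of points (1) and (2): Twitter's streaming `statuses/filter` endpoint accepts a `locations` parameter made of longitude,latitude pairs with the southwest corner first, so a geocoded box has to be reordered before it is sent. The box shape (`south`, `west`, `north`, `east`) and the function names below are assumptions for illustration, and boxes that cross the antimeridian are not handled.

```javascript
// Format a bounding box for the streaming API's `locations` parameter:
// "west,south,east,north" (longitude before latitude, southwest corner first).
function toLocationsParam(box) {
  return [box.west, box.south, box.east, box.north].join(",");
}

// Check whether a tweet's GeoJSON point ([longitude, latitude]) falls inside
// the box. Tweets without exact coordinates are rejected.
function isInsideBox(tweet, box) {
  if (!tweet.coordinates) return false;
  const [lng, lat] = tweet.coordinates.coordinates;
  return lat >= box.south && lat <= box.north && lng >= box.west && lng <= box.east;
}
```

The client-side check in `isInsideBox` is optional. It is a stricter filter that keeps only tweets carrying an exact point.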

Figuring out (3), which tweets are reliable, is the more difficult (and ill-defined) problem, especially since the decisions have to be made in real time as the Twitter stream continually delivers more tweets.

I don't claim to have solved (3), but in my attempt a Twitter user is considered reputable if they meet any of the following criteria:

Are associated with a news organization in their bio
Have certain keywords in their bio
Have over N followers

Hopefully this provides a rough estimate of reliability.

To check for an association with a news organization, I keep a big list of news orgs' Twitter accounts to check against (and anybody can submit a pull request to add a new one!).

The keywords that I use were just determined by trial-and-error, and are in no way scientific, but they do seem to improve results.
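
A minimal sketch of this "any of the three criteria" check might look like the following. It is not the project's actual code. It assumes that "associated with a news organization" means the bio @-mentions an account from the list, and the function name and config shape are hypothetical. `description` and `followers_count` are fields of Twitter's user object.

```javascript
// A user is treated as reputable if ANY of the three criteria holds.
// `newsOrgs` is a Set of lowercase screen names; `keywords` is an array of
// lowercase bio keywords; `minFollowers` is the follower threshold N.
function isReputable(user, { newsOrgs, keywords, minFollowers }) {
  const bio = (user.description || "").toLowerCase();

  // Criterion 1: the bio mentions a known news organization's account.
  const mentions = bio.match(/@\w+/g) || [];
  if (mentions.some((m) => newsOrgs.has(m.slice(1)))) return true;

  // Criterion 2: the bio contains one of the hand-picked keywords.
  if (keywords.some((k) => bio.includes(k))) return true;

  // Criterion 3: the account has more than N followers.
  return user.followers_count > minFollowers;
}
```

Because each check is cheap and the function returns at the first match, it can be applied to every tweet as it arrives from the stream.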

Old interface streaming a reporter on-site at the Kenya mall shooting

results

The result of all this is a working website that can, with no oversight on my part, keep up to date on important news happening around the world and provide real-time context and updates from people tweeting at the site of the event.

I call the project Onsite.

Current interface showing people celebrating a sporting event

I've open-sourced all of the code, and there is a rough demo available online.

Not surprisingly, the site is most effective when there is a large news story breaking, and is relatively unassuming otherwise. The two images of the program in action (a sporting event and the Kenya mall attack) show this to some extent.

There is plenty of possible future work with stuff like this, including possibly making an API to provide others with realtime and meta-tagged breaking news stories. What else would you do with it?

I'm on Twitter: @mathisonian