- The conference is here!! We are now in Saskatoon from June 18 – 21.
With the Stanley Cup playoffs well underway, this issue of the Data-Driven Journalism series is dedicated to hockey. Specifically, we explore how to use command line tools to unearth interesting patterns in the teams who won a Stanley Cup in the past.
Although the analysis presented here is hockey-specific, keep in mind that the tools you will learn are broadly applicable to any field of data journalism, and can enhance your reporting. For instance, these tools allow you to investigate climate extremes when reporting on global warming, investigate trends in authorship within scientific journals, and study the impact of social class on survival on the Titanic, just to name a few.
For those of you “Canadians” who don’t follow hockey, let’s start with some context. Each year, after the regular season, the top ranking teams battle each other in a tournament. The winning team gets to take home the coveted Stanley Cup, a trophy sometimes referred to as the Holy Grail of hockey.
As you will see below, some surprising conclusions emerge from this analysis, including the fact that the 1918-1919 season did not have a Stanley Cup winner; read on to find out why. Let’s get started with the analysis.
Set up your workspace
Let’s launch our Terminal and set up our workspace. If this is the first article you have read from the Data Journalism series, or if you need a refresher, read this article first. With the Terminal open, use the command mkdir (make directory) to create a folder on your Desktop, where we will perform today’s analysis:
Then navigate to that folder using the command cd (change directory):
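The two commands themselves are not shown above. A minimal sketch of what they could look like, assuming a folder named hockey-analysis (a hypothetical name; any name works):

```shell
# Create the folder on the Desktop; -p avoids an error if it already exists
# ("hockey-analysis" is a hypothetical name chosen for this sketch)
mkdir -p ~/Desktop/hockey-analysis

# Move into the new folder and confirm the current location
cd ~/Desktop/hockey-analysis
pwd
```

Every file created in the rest of the analysis then lands in this folder.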
Obtain the data
Since we are interested in teams who won the Stanley Cup in the past, we need to retrieve data to that effect. As it turns out, the Hockey Databank Project keeps track of these numbers. Every year, they release an updated dataset that outlines statistics about players, teams, awards and more (for details, please visit their website).
From the Hockey Databank Project, we can download the data file teams.csv, which contains a line for every NHL team and every year they played, as well as various statistics (including the number of wins and losses, amount of penalty time incurred, and whether they won the Stanley Cup). To download this file, we can use the curl command as we discussed last time:
curl -o data.csv http://robertaboukhalil.com/data/hockey/teams.csv
This simply tells your computer to download the file teams.csv from my server and save it as data.csv in the current folder on your computer. Once the download is complete, we can preview what the file contains by running the head command on it (head data.csv).
You should see the first few lines of the file:
From this preview, we can deduce that the data file is arranged in comma-separated columns, where column 7 is the division rank of the team and column 8 indicates the playoff status of the team. According to the documentation of the Hockey Databank Project, column 8 contains SC if the team won the Stanley Cup, SCSF if the team lost during the semi-finals, and so on. Also note that column 19 indicates the team name.
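Counting columns by eye is error-prone in a wide file. One way to double-check the positions is to print the header row with one numbered column name per line. The sketch below uses a small made-up file, sample.csv, with only four columns, so its column numbers differ from those of the real data file:

```shell
# Toy file for illustration only; the real data file has many more columns
printf 'year,rank,playoff,name\n1993,1,SC,Example Team\n' > sample.csv

# Take the header row, put each column name on its own line, then number the lines
head -n 1 sample.csv | tr ',' '\n' | cat -n
```

Run against data.csv instead of sample.csv, the same pipeline should list the rank, playoff status and team name next to the column numbers used in this article.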
Now that we have our data, let’s find out which team has won the most Stanley Cups so far.
Which teams have won the most Stanley Cups so far?
Here’s the game plan for answering this question. We need to find all the lines in our data file that relate to teams who won the Stanley Cup. Then we want to count how many times each one of these teams won the cup. First, let’s list all the lines in the file that contain winning Stanley Cup events. Since we expect to see “SC,” in the lines where that happens, let’s first use the grep command to list only the lines that match that pattern:
grep "SC," data.csv
Note that in the output, there should only be lines with “SC,” on them:
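The comma in the pattern matters. A small sketch on made-up rows (sample.csv and its team names are hypothetical; only the codes SC and SCSF come from the article) shows why searching for “SC” alone would also pick up semi-finalists:

```shell
# Made-up rows for illustration: year,rank,playoff,name
cat > sample.csv <<'EOF'
1993,1,SC,Team A
1993,2,SCSF,Team B
1994,1,SC,Team C
1994,3,,Team D
EOF

# "SC" also matches the semi-final code SCSF
grep -c "SC" sample.csv     # counts 3 lines
# "SC," requires a comma right after SC, so SCSF is excluded
grep -c "SC," sample.csv    # counts 2 lines
```

The -c option makes grep print the number of matching lines instead of the lines themselves.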
Next, we would like to only list the team name from each line so that we can later identify how many of each we saw. We can use the cut command to isolate column 19 of the file, which contains the team name. To do so, we’ll take the output from grep and use the pipe (the vertical line character) to pass it to the cut command, as we discussed in a previous post:
grep "SC," data.csv | cut -d "," -f 19
Briefly, the cut command requires you to specify which delimiter (-d) defines a column (in our case a comma), and which column to cut out of the file (in our case column 19). Note that, despite the name, cut does not modify your original file. When you press enter, you should see the names of all teams who won the Stanley Cup.
As expected, some teams won the Stanley Cup several times, and we would like to count how many times each team won. To do so, we first sort the list of names so that identical team names end up on neighboring lines, which makes them easier for the computer to count. To sort our list, we use the aptly-named sort command:
grep "SC," data.csv | cut -d "," -f 19 | sort
Notice how we now see islands of team names. Next, we use a special command called uniq -c, which identifies our islands of teams, compresses them into one line, and counts how many of them there were:
grep "SC," data.csv | cut -d "," -f 19 | sort | uniq -c
We’re almost there; the last step is to sort this list by the number of times a team has won the Stanley Cup:
grep "SC," data.csv | cut -d "," -f 19 | sort | uniq -c | sort
As you can see, the Montreal Canadiens won the Stanley Cup a total of 24 times, followed by the Toronto Maple Leafs and the Detroit Red Wings, who each won the Cup 11 times.
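The final sort above compares lines as text, which works here because uniq -c pads its counts to the same width. A slightly more explicit variant, sketched below on made-up rows (sample.csv and the team names are hypothetical, not real results), asks sort to compare the counts as numbers and to list the largest first:

```shell
# Made-up rows for illustration: year,rank,playoff,name
cat > sample.csv <<'EOF'
1990,1,SC,Team A
1991,2,SC,Team B
1992,1,SC,Team A
1993,1,SCSF,Team B
1994,1,SC,Team A
EOF

# The name is column 4 in this toy file (column 19 in the article's data file)
# sort -n compares numerically; -r reverses the order so the top team comes first
grep "SC," sample.csv | cut -d "," -f 4 | sort | uniq -c | sort -nr
```

In this toy example Team A is listed first with a count of 3, followed by Team B with a count of 1.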
What division rank do Stanley Cup winners hold?
To illustrate how powerful the command line can be, let’s ask a slightly different question: for the teams who won the Stanley Cup, what was their rank in their division at the end of the season? As it turns out, we can use almost exactly the same code as before, and only change column 19 to column 7, which indicates the team’s rank within its division:
grep "SC," data.csv | cut -d "," -f 7 | sort | uniq -c | sort
From the output of this command, we conclude that approximately 60% of Stanley Cup winners were teams that ranked 1st in their division, and nearly 90% were teams ranked either 1st or 2nd in their division.
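Turning the counts into percentages means dividing each count by the total number of winners. A minimal sketch of that arithmetic, using a made-up list of ranks (ranks.txt is hypothetical and does not reflect the real data):

```shell
# Made-up division ranks of four hypothetical Cup winners, one per line
printf '1\n1\n2\n3\n' > ranks.txt

# Total number of winners (tr strips the padding some systems add to wc output)
total=$(wc -l < ranks.txt | tr -d ' ')

# Count each rank, then express the count as a share of the total
sort ranks.txt | uniq -c | awk -v total="$total" \
  '{ printf "rank %s: %d of %d (%.0f%%)\n", $2, $1, total, 100 * $1 / total }'
```

For the real analysis, the ranks would come from the grep and cut steps above rather than from a hand-written file.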
Was the Stanley Cup ever not awarded?
As you may remember, the 2004-2005 hockey season did not take place due to the NHL Lockout, and therefore the Stanley Cup was not awarded that year. The next question we will tackle is: Was there any other year during which the Stanley Cup was not awarded?
According to the Hockey Databank Project, the data spans the seasons between 1913 and 2013. Therefore, we should expect to find lines with “SC,” for all years in between. Here’s the game plan for this challenge: we can make two files containing lists of years, one that contains the years in which the Stanley Cup was awarded, and another that contains all numbers between 1913 and 2013. To answer our question, we simply need to compare the two files, and the differences will identify the years during which the Stanley Cup was not awarded.
First, we extract the list of years we have on file during which teams have won the Stanley Cup (the year is saved as column 1):
grep "SC," data.csv | cut -d "," -f 1
We then save this result to a file called SC_years using the greater-than symbol (>), which directs the output of the command to a file instead of on screen:
grep "SC," data.csv | cut -d "," -f 1 > SC_years
Next, we generate the second file that contains all numbers between 1913 and 2013. To do so, there is the command seq (sequence) that will come in handy:
seq 1913 2013 > all_years
Now all that’s left is to compare these two files using the command diff (difference):
diff SC_years all_years
The output on your screen should show the years 1918 and 2004. This indicates that the Stanley Cup was not awarded in the 2004-2005 season, as expected, but also not in the 1918-1919 season. A bit of research reveals that this was the year of the Spanish Flu, during which the Stanley Cup series was canceled after several players and managers caught the flu.
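The diff comparison relies on both files listing the years in the same order. An alternative sketch, under the assumption that the order of the extracted years is not guaranteed, sorts the list first and uses the comm command to print only the missing years. The file names and the years below are made up for illustration:

```shell
# Made-up list of award years with a gap at 2004, deliberately out of order
printf '2003\n2002\n2006\n2005\n' > SC_years_sample
seq 2002 2006 > all_years_sample

# comm expects sorted input, so sort the extracted years first
sort SC_years_sample > SC_sorted

# -13 hides lines unique to the first file and lines common to both,
# leaving only the years that appear in the second file alone
comm -13 SC_sorted all_years_sample
```

Here the only line printed is 2004, the year missing from the first list.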
The last word
To conclude, we have seen how the command line is a very powerful tool for analyzing datasets, and I hope this encourages you to give it a try and maybe even use this in your own reporting.
If you enjoyed reading this article and following along with the exercises, I would encourage you to check out Adventures in Data Science with Bash, a book I’m writing where each chapter is a self-contained adventure, similar to the one we did today, that spans a variety of topics, including finding the hottest day in Vegas, and how much tip a NYC taxi driver can expect on average. Sign up on the website to receive a free chapter.
By day, Robert Aboukhalil is a computational biologist; by night, he is an entrepreneur and science communicator. He is currently pursuing a Ph.D. in computational biology at Cold Spring Harbor Laboratory and is the Editor-in-Chief of Technophilic Magazine.
By Kristina Campbell
The past few years have seen the rise of an intriguing therapy: fecal microbiota transplantation (FMT). Infusing a patient’s guts with another person’s feces has gone from being a questionable treatment of last resort to one that has significant scientific backing. Indeed, for cases of recurrent C. difficile infection, FMT is up to 90% effective.
The mechanisms underlying the treatment’s efficacy are not well understood. However, FMT is thought to produce positive results by seeding the gut with a new, healthier community of microorganisms.
The success of FMT in treating C. difficile has been so dramatic that people have begun to use it independently, without any medical guidance, for other health issues that have been linked to the microbiome. These include Crohn’s disease, ulcerative colitis, obesity, even multiple sclerosis and autism. Some of these uses may eventually prove to be appropriate, but as far as the scientific community is concerned, the jury is still out. The risks of FMT are not fully known.
Journalist David Wild, in a recent editorial, reported that an estimated 10,000 people are doing this procedure at home, dwarfing the number of regulated FMT procedures done under doctor supervision. Wild argues that this is a public health problem.
The emergence of this trend is not surprising; for many patients with diseases such as Crohn’s or ulcerative colitis, the standard treatments are ineffective. For these individuals, self-medicating with FMT may offer some small hope for relief from symptoms that have a profound negative impact on quality of life. Yet they may be ill-equipped to perform the procedure properly and safely.
Regulation remains a thorny issue because FMT has not been approved for any applications other than C. difficile. Wild, in his editorial, calls for scientists to initiate a harm minimization approach that would involve establishing “a network of supervised FMT clinics that bypasses regulatory requirements.”
The official advice – as in a position paper from the Canadian Association of Gastroenterology – is that more controlled studies on FMT are urgently needed to assess the approach’s efficacy in treating conditions other than C. difficile.
Ideally, those considering FMT would help address this urgent need for scientific data by participating in clinical trials. A less controlled, but potentially useful, way to accelerate research on FMT is to involve these FMT DIYers in citizen science research from their own homes.
In the meantime, doctors can do their part by taking patients seriously when they talk about FMT in the clinic. In anecdotal accounts of ‘best case’ situations, some patients report that they received very valuable support from their general practitioners (GPs) after being open about doing FMT independently. Thus, while a GP cannot endorse the therapy, he/she can be well educated on the treatment and provide practical and emotional support to patients.
There’s light at the end of the tunnel. FMT studies continue to be published: for example, one Canadian and one European study have explored FMT in the treatment of ulcerative colitis, and two reports (one from the US and one from China) have looked at FMT for Crohn’s disease. Treatment efficacy was variable in these studies. This may be due to the heterogeneity of these diseases, suggesting that FMT may only work in a subset of patients. These studies have also provided hints that some stool donors seem to be ‘pooperstars’ with the special ability to induce remission in these patients. There is even some indication that children between the ages of 8 and 15 might be ideal donors. In time, scientists will have a clearer picture of FMT’s risks and benefits for different individuals.
Kristina Campbell, the “Intestinal Gardener”, writes for the Gut Microbiota for Health Experts Exchange.
by Kimberly Moynahan
When I tell people I am a freelance writer, their minds usually turn to journalism and they respond with “Who do you write for? Maybe I’ve read some of your work.”
And while every fibre of my body would love to be able to respond with a casual, “Oh, you know, Scientific American, the BBC, National Geographic … the usual places,” that hasn’t happened for me.
Not everyone is cut out to pitch ideas, cover a beat, write on spec, conduct interviews, travel to faraway places and write long feature stories.
However, it turns out there is a world of work out there just screaming for people who have strong writing skills and the ability to convey STEM subjects in a clear and engaging way.
It’s not all glamorous work; most of the time your name won’t even be associated with what you write. At times it may not even be “science writing” per se. But it does one thing that freelance journalism sometimes fails to do – it pays a professional hourly rate.
So I thought I’d share some of the unexpected jobs that have crossed my desk in the last few years. My thought is that, if you are a struggling freelancer, there may be things here you didn’t know you could do with your science communications background.
And even if you are in journalism, some of these might provide income between commissions. They all require specific skillsets and writing styles, but you might be surprised at what you’re already good at.
I know I was.
Professional web writing doesn’t mean turning out mass-produced blog posts to build up someone else’s content. While there is a market for that, it pays pennies. No, I’m talking about working with web developers and designers to write high quality pages for commercial sites.
Not much different from translating science, it takes a certain skill to wade through all of the corporate-speak the company typically provides and make it crisp, enthusiastic, and web-friendly. And if the topic is scientific, that’s icing on the cake.
Researching and Writing White Papers
One thing people with scientific backgrounds are often able to do well is parse large volumes of information and tease out central tenets and truths. That, combined with good research skills, can make a science writer an ideal candidate for turning out white papers.
I have to say, this is my favourite freelance work by far. Interpretive writing involves writing informational panels and labels for museums, science and nature centres, zoos, and the like.
It can also include writing audio scripts and screen text for digital interactives. I’ve even been involved in developing a storyline for an educational video game.
Interpretive writers work with experts who provide the core material and graphic designers who make it all look good. In some cases, I’ve served as both science researcher and writer, making the work all the more interesting and fun.
Marketing and Advertising Copy
I thought I had left this field 20 years ago, but I was recently contacted by a marketing firm that needed someone with a science or medical background to write copy for pharmaceutical products. I took the work and I like it because the projects are quick (2-4 hours over a couple of days) and I can fit them in between bigger jobs.
I’ll start by saying I don’t get paid money for this work. I exchange writing projects with a group of fiction authors for the purpose of professional critique. Part of my contribution is keeping a sharp eye out for scientific errors or misunderstandings in their work.
They are never surprised to see things like, “Salamanders are amphibians, not reptiles” in the margins of their manuscripts or a complete description of how the nervous system works in their written critiques. In fact, this work was the impetus for my “Friday Fiction Facts” blog series.
In return, these fine writers provide invaluable feedback on my creative writing projects. I can safely say, I have learned more about how to clearly communicate scientific ideas from these sometimes “science-averse” people than I have at any workshop. As a result, I consider my work for them as paid work.
A whole lot of other opportunities have crossed my desk over the years, most of which I’ve passed over due to the insultingly low pay being offered. I write for a living so everything I’ve listed above has paid at my professional rate.
Finally, I want to mention that several of the contracts I’ve enjoyed were a result of my CSWA membership. More interestingly (and tellingly) one came through PWAC and the client told me it was “impossible” to find good science writers.
So yes people, there is a market.
Now I’d be interested to hear from you about unusual or unexpected freelance work you’ve picked up.
Kimberly Moynahan writes on the natural sciences and reflects on that uneasy space in the Venn diagram where humans and wildlife overlap, both physically and emotionally. Her work can be found on her blog, Endless Forms Most Beautiful.
By Barry Shell
“You don’t start out writing good stuff. You start out writing crap and thinking it’s good stuff, and then gradually you get better at it. That’s why I say one of the most valuable traits is persistence.” — Octavia Butler
Every now and then people ask me: “What ever happened to that novel you were writing?” My last blog post for the CSWA was about my NaNoWriMo project. National Novel Writing Month is an annual project that brings people together from all over the world with a common purpose – to write a novel in a month. The blog post was written on day four of the 30 day project and I was very optimistic about the process. It ended up being a thoroughly enjoyable experience. Every writer should try it, and yes, I did finish the novel.
But it’s a first draft. As we all know, editing is probably the most important part of any writing project, but another VERY important part is just getting a first draft down on paper, especially when that draft has to be at least 50,000 words. Mine ended up totalling 50,940, but the ending still needs a bit of work. Actually the first page needs work, but that will happen during editing.
I’ve barely looked at the novel since it was finished. It sits under the coffee table, a half-ream of double sided print in a red 3-ring binder. That’s something the NaNoWriMo people recommended we do at the end of the month: make sure to print out your novel. It’s wonderful to hold the physical artifact in your hands and know that, yes, you did it. Parts are great. Other parts — Ouch. But some bits are quite good. Inspiring, even.
The month of steady writing was intense. Being retired, I had loads of time to devote to the project. But many participants I met were students taking a full load of courses, or young mothers with families. All types of people of all different ages were writing novels last November. It seemed that most of them were writing fantasy, horror or some other classic genre and most of those who came to the organized events and write-ins were women in their 30s.
The local VancoWriMo group gathered numerous times throughout that month. Collectively, the Vancouver group produced almost 24,000,000 words during those 30 days. At the 50,000-word minimum, that works out to about 480 novels. But at most only 30 to 50 people ever showed up to any of these events. Usually we met at Vancouver coffee shops or bars and just sat there writing. One of the veterans might suddenly announce, “Word war in 5 minutes.” You would try to get yourself ready, maybe get another cup of coffee, or a beer. Then there would be a countdown like for a rocket launch. You would note your word count at that moment and start typing. Word wars typically lasted 20 or 30 minutes. At the end, everyone would yell out how many words they just wrote. When the goal is quantity not quality, this is a great way to write. And it’s perfect for doing that initial word dump for a first draft. It certainly gets the creative juices flowing. At times we would take a break and order food and chat about our plot holes and other things.
There were meet-ups in the suburbs but I only went to the ones in Downtown Vancouver. These were coordinated via Facebook and Twitter. One time, we met at 10:00 a.m. on a Sunday at the Waterfront SkyTrain station. There were about 30 of us. Everyone got a ticket and we headed down to the platform. We passed up the first SkyTrain as it was one of the old ones, and we all wanted to be in one of the newer cars. The idea was to write as we rode all the way to the end of the line, and all the way back again. From Waterfront you loop around through Burnaby and New Westminster, then back around to VCC/Clark station. That takes an hour. Oddly, a SkyTrain car half filled with people typing away madly on laptops did not get as many curious stares as we’d expected. It’s simply uncanny how fast time goes by when you are immersed in writing your own novel on SkyTrain. You put your head down and when you look up you are already at Metrotown. At VCC/Clark everyone got up for a stretch. The car sits there for about three minutes. And then we were off for another hour of writing. Indeed, I did get some inspiration from a few of the real-life characters and random conversations overheard on the train that day. It worked. At the end we all went for lunch in the food court just outside Waterfront station.
Most days I spent about 3 hours physically writing, sometimes a bit more, usually at home, once in the woods in Stanley Park, but most often in various coffee shops around town. Yes indeed, I was one of those idiots nursing a coffee for two hours writing a novel in a cafe. I managed to keep ahead of the daily word count targets to the point that I built up enough of a surplus to take a day off once in a while. But most days I wrote. It was very different from the kind of writing I used to do as a staff writer at SFU—each day was an adventure. I’d go to bed at night wondering what would happen next in the story. More than once I surprised myself with a plot twist or a new subplot. Other times I could see the whole thing unfold and just slogged through the process of getting it all down.
I did not look at the novel for the entire month of December. I didn’t even print it out until Jan 15. Then in February I began to rework it into a stage play for a sort of musical/opera. In fact, I knew this was a possibility when I began the project in November, so I purposely created Phil, a character in the novel who was a struggling artist/musician. During the course of the novel Phil wrote five songs—not the music, just the words, but it was a great way to force myself to write some songs. I’m now in the process of learning Sibelius, a musical notation program so I can write out the melody and chords for these songs.
In discussing the stage play with my musical collaborators we have simplified the plot and improved the story, making it more realistic by including deeper, more subtle interactions and relations among the characters. These ideas will eventually drive the novel editing process, which, as one NaNoWriMo veteran recommended, is best done when the Spring rains end and you can spend hours in the sun, perhaps at the beach, editing. With any luck, some day I will reach a point where I feel comfortable letting others read it.
Congratulations to this year’s CSWA Book Award Winners!
The Canadian Science Writers’ Association / Association canadienne des rédacteurs scientifiques is pleased to announce the winners in the 2014 Science in Society Book Awards competition. The winners will each be presented with a certificate and a $1000 cash prize during an awards dinner held in conjunction with the CSWA’s 44th annual conference in Saskatoon, SK, hosted by the University of Saskatchewan, 18-21 June 2015.
Winner for the 2014 Science in Society Children/ Middle Grades Book Award competition:
The Fly by Elise Gravel, Penguin Random House.
The first in a series of humorous books about “disgusting creatures”, The Fly is a look at the common housefly. It covers such topics as the hair on the fly’s body (requires a lot of shaving), its ability to walk on the ceiling (it’s pretty cool, but it’s hard to play soccer up there), and its really disgusting food tastes (garbage juice soup followed by dirty diaper with rotten tomato sauce, for example).
Elise Gravel is an award-winning author and illustrator from Québec. She is winner of the Governor General’s Award for Children’s Illustration in French, and is well known in Québec for her original, wacky picture books. Having completed her studies in graphic design, Elise found herself quickly swept up into the glamorous world of illustration. Her old design habits drive her to work a little text here and there into her drawings and she loves to handle the design of her assignments from start to finish. She is inspired by social causes and likes projects that can handle a good dose of eccentricity.
Winner for the 2014 Science in Society General Book Award competition:
Bee Time by Mark L. Winston, Harvard University Press.
Being among bees is a full-body experience, Mark Winston writes—from the low hum of tens of thousands of insects and the pungent smell of honey and beeswax, to the sight of workers flying back and forth between flowers and the hive. The experience of an apiary slows our sense of time, heightens our awareness, and inspires awe. Bee Time presents Winston’s reflections on three decades spent studying these creatures, and on the lessons they can teach about how humans might better interact with one another and the natural world.
Like us, honeybees represent a pinnacle of animal sociality. How they submerge individual needs into the colony collective provides a lens through which to ponder human societies. Winston explains how bees process information, structure work, and communicate, and examines how corporate boardrooms are using bee societies as a model to improve collaboration. He investigates how bees have altered our understanding of agricultural ecosystems and how urban planners are looking to bees in designing more nature-friendly cities.
The relationship between bees and people has not always been benign. Bee populations are diminishing due to human impact, and we cannot afford to ignore what the demise of bees tells us about our own tenuous affiliation with nature. Toxic interactions between pesticides and bee diseases have been particularly harmful, foreshadowing similar effects of pesticides on human health. There is much to learn from bees in how they respond to these challenges. In sustaining their societies, bees teach us ways to sustain our own.
Mark L. Winston has had a distinguished career researching, teaching, writing and commenting on bees and agriculture, environmental issues, and science policy. He was a founding faculty member of the Banff Centre’s Science Communication programme, and consults widely on utilizing dialogue to develop leadership and communication skills, focus on strategic planning, inspire organisational change, and thoughtfully engage public audiences with controversial issues. Winston’s work has appeared in numerous books, commentary columns for The Vancouver Sun, The New York Times, The Sciences, Orion magazine and frequently on CBC Radio and Television and National Public Radio. He currently is a Fellow in Simon Fraser University’s Centre for Dialogue, and a Professor of Biological Sciences.
Short List for the 2014 Science in Society Children/ Middle Grades Book Award competition:
Zoobots by Helaine Becker, Kids Can Press.
Starting from Scratch by Sarah Elton, Owl Kids Books.
It’s Catching by Jennifer Gardy, Owl Kids Books.
The Fly by Elise Gravel, Penguin Random House.
If by David J. Smith, Kids Can Press.
Short List for the 2014 Science in Society General Book Award competition:
The End of Memory by Jay Ingram, Harper Collins Publishers Ltd.
Canadian Spacewalkers: Hadfield, MacLean and Williams Remember the Ultimate High Adventure by Bob McDonald, Douglas & McIntyre.
Pain and Prejudice: What Science can Learn about Work from the People Who Do It by Karen Messing, Between the Lines (BTL).
Is that a Fact? by Dr Joe Schwarcz, ECW Press.
Bee Time by Mark L. Winston, Harvard University Press.