by Robert Aboukhalil

It is often said that data science is 80 percent data preparation and 20 percent science. This was demonstrated in the previous installment of my Data-Driven Journalism series, where I used basic command line tools to find the coldest and warmest days in Montreal over the last decade. In this post, I introduce a few more command line tools and demonstrate their use for analyzing a dataset about the passengers aboard the Titanic.

The Titanic


Using this data, we will answer the following questions:

  1. What is the percentage of passengers who survived?
  2. Were first-class passengers more likely to survive?
  3. Was the “women first” code applied?

Set up your workspace

Let’s get started! (To refresh your memory, see the previous installment.) Open up a Terminal and create a folder on your desktop to store today’s analysis using the ‘make directory’, or ‘mkdir’, command:

mkdir ~/Desktop/datajournalism102/

Remember that the tilde character (~) is just a shortcut to the current user’s home directory.

Next, navigate to the folder you just created using the ‘change directory’, or ‘cd’, command:

cd ~/Desktop/datajournalism102/

To download the Titanic dataset, use the curl command:

     curl -o titanic.txt

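To understand how curl works, let’s unpack this command, reading from right to left: the last item is the address of the file to download, and the -o option before it tells curl to save the download in a file named titanic.txt instead of printing it to the screen.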


To catch a glimpse of what the dataset holds, use the head command:

head titanic.txt

You should see the first 10 lines of the file:

Note that there are 11 columns in this dataset, each separated by a comma. Luckily, the first line tells us what each column refers to:


Column Contents
1 Entry number
2 Passenger class (1st, 2nd, 3rd)
3 Survival status (1 if survived; 0 otherwise)
4 Passenger name
5 Passenger age
6 Port where they embarked
7 Destination
8 Room number
9 Ticket number
10 Lifeboat
11 Gender
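
To see the column numbers for yourself, you can split the header line onto separate lines and number them. This is a minimal sketch: sample.txt and its three column names are made up so that the example runs on its own, and you would point the same commands at titanic.txt instead.

```shell
# Made-up sample with a three-column header (not the real dataset)
printf '%s\n' 'id,class,survived' '1,1st,1' > sample.txt

# Take the header line, put each column name on its own line, and number them
columns=$(head -n 1 sample.txt | tr ',' '\n' | nl)
echo "$columns"
```

Here head -n 1 keeps only the first line, tr swaps every comma for a line break, and nl adds the line numbers.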


Although there were over 2,200 passengers on the Titanic, keep in mind that this dataset only has information about 1,313 of them.

Analysis 1: Percentage of survivors

To calculate how many passengers died and how many survived, we need to extract the third column and count the number of passengers with either a 1 or a 0. There’s a very important command called cut that will allow you to extract columns from a file:

cut -f3 -d, titanic.txt

This will extract the third column (-f3) and will define columns as being delimited by commas (-d,).


When you run the command above, you should see many lines of 1s and 0s. To keep only the lines with a 1 in them, we can pass the output to grep, another very important command:

cut -f3 -d, titanic.txt | grep 1

Note that the pipe symbol | allows you to chain commands by executing a command on the result of another one. In the example above, we execute grep on the output of cut. This is convenient because it allows us to perform complex operations without saving results from intermediary steps.

If you run this command, only the lines with 1s in them will show. Now let’s use another command, wc (word count), to count the number of lines with 1s. Although you can use wc to count the number of words, wc -l will tell you how many lines are in a file:

   cut -f3 -d, titanic.txt | grep 1 | wc -l

This should return 449 passengers.
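
As a shortcut, grep can do the counting itself: its -c option prints the number of matching lines instead of the lines themselves. The sketch below compares both approaches on a tiny made-up file (sample.txt is an assumption for illustration, not the real dataset).

```shell
# Made-up sample in the same layout as titanic.txt
# (header line first; column 3 is the survival status)
printf '%s\n' \
  'id,class,survived' \
  '1,1st,1' \
  '2,1st,0' \
  '3,3rd,1' > sample.txt

# Two equivalent ways to count the survivors
with_wc=$(cut -f3 -d, sample.txt | grep 1 | wc -l)
with_c=$(cut -f3 -d, sample.txt | grep -c 1)

echo "wc -l: $with_wc, grep -c: $with_c"
```

Both counts should agree; which one you use is a matter of taste.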

Likewise, we can count the number of passengers in this dataset that did not survive (864):

cut -f3 -d, titanic.txt | grep 0 | wc -l

From this analysis, we would conclude that ~66 percent of passengers did not survive. This is very similar to what other sources have reported.
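
The percentage itself can also be computed on the command line. This is one possible sketch using awk, a tool not otherwise covered in this post; the variable names are made up, and the two counts are the ones reported above.

```shell
survived=449   # count from the "grep 1" pipeline
died=864       # count from the "grep 0" pipeline

# awk handles the decimal arithmetic that plain shell arithmetic lacks
pct_died=$(awk -v s="$survived" -v d="$died" \
  'BEGIN { printf "%.1f", 100 * d / (s + d) }')

echo "$pct_died percent did not survive"
```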


Analysis 2: Survival by passenger class

Next, we’d like to know the percentage of passengers who died from each passenger class (1st class, 2nd class, 3rd class). First, we use grep to extract 1st class passengers, and then wc to count how many of them there were:

  grep 1st titanic.txt | wc -l

Note the difference between this grep command and the previous. Since the grep command does not come after the pipe symbol |, we must specify the source that we’re “grepping” on, in this case the titanic.txt file.

From the output of this command, we conclude there were 323 first class passengers. To calculate how many of them did not survive, we extract the third column and count the number of 0s:

grep 1st titanic.txt | cut -f3 -d, | grep 0 | wc -l

There were 130 first class passengers who died (40 percent).

Repeating the same analysis for the other two classes shows that a much greater proportion of 3rd class passengers died, compared to 1st and 2nd class passengers:

1st class    40% of passengers died
2nd class    58% of passengers died
3rd class    81% of passengers died
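
Rather than typing the pipeline three times, you could wrap it in a shell loop. The sketch below is one way to do that, run on a small made-up file so that it works on its own; sample.txt, its rows and the variable names are assumptions for illustration, and you would replace sample.txt with titanic.txt for the real analysis.

```shell
# Made-up sample rows in the same layout as titanic.txt
# (column 2 = class, column 3 = survival status); not real data
printf '%s\n' \
  'id,class,survived' \
  '1,1st,1' \
  '2,1st,0' \
  '3,2nd,0' \
  '4,3rd,0' \
  '5,3rd,0' \
  '6,3rd,1' > sample.txt

# Same grep | cut | grep | wc pipeline as above, once per class
summary=$(for class in 1st 2nd 3rd; do
  total=$(grep "$class" sample.txt | wc -l)
  died=$(grep "$class" sample.txt | cut -f3 -d, | grep 0 | wc -l)
  awk -v c="$class" -v d="$died" -v t="$total" \
    'BEGIN { printf "%s class: %d of %d died (%.0f%%)\n", c, d, t, 100 * d / t }'
done)

echo "$summary"
```

As in the commands above, grep matches the class label anywhere on the line, so this relies on "1st", "2nd" and "3rd" not appearing in other columns.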


Analysis 3: the “women first” code

With such a dataset, we can even attempt to guess whether the ‘women first’ code was applied on the Titanic.

Using everything we have learned so far, we can build the following command to count the number of female passengers who survived:

grep female titanic.txt | cut -f3 -d, | grep 1 | wc -l

Here, grep will only select lines from titanic.txt that contain “female” in them, cut will select the 3rd column from the remaining lines, and the second grep will only show lines with 1s in them. Finally wc -l is used to count the number of 1s.

This should output 307. Next, we obtain the total number of female passengers:

grep female titanic.txt | cut -f3 -d, | wc -l

That should give 463 (66 percent survived). Doing the same for men yields 142 survivors out of 850 (17 percent).
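
One pitfall when repeating this for men: the word "female" contains "male", so a plain grep male also matches every woman's row. One way around this is grep's -w option, which only matches whole words. The sketch below shows the difference on a small made-up file (sample.txt and its rows are assumptions for illustration, not the real dataset).

```shell
# Made-up sample rows; the last column is the gender
printf '%s\n' \
  'id,class,survived,gender' \
  '1,1st,1,female' \
  '2,1st,0,male' \
  '3,3rd,1,female' \
  '4,3rd,0,male' \
  '5,3rd,1,male' > sample.txt

naive=$(grep -c male sample.txt)     # matches "male" inside "female" too
men=$(grep -cw male sample.txt)      # whole-word match: men only
men_survived=$(grep -w male sample.txt | cut -f3 -d, | grep -c 1)

echo "naive: $naive, men: $men, men who survived: $men_survived"
```

If the naive count equals the total number of passengers, that is a sign the pattern is matching both genders.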

Clearly, a much greater proportion of women survived the Titanic accident, but the staggering discrepancy between survival rates suggests some other effect could be at play. In fact, it seems there was confusion on the ship: when Captain Smith ordered his officers to let women and children go first, some of them understood that only women and children could go and therefore prevented men from boarding the lifeboats.

Looking further, scientists have recently combined survival data from many shipwrecks and concluded that “in contrast to the Titanic, […] the survival rate for men is basically double that for women.”

The last word

In my last post about data science for journalists, we learned the basics of using the command line with commands such as cd, mkdir, ls, head and sort.

In this post, I covered a few more of the key commands such as curl, cut and grep. With all this knowledge in hand, you should now be well equipped to do some data analysis of your own! To get you started, here is a link where you will find a long list of publicly available datasets to play with:


By day, Robert Aboukhalil is a computational biologist; by night, he is an entrepreneur and science communicator. He is currently pursuing a Ph.D. in computational biology at Cold Spring Harbor Laboratory and is the Editor-in-Chief of Technophilic Magazine.


by Kimberly Moynahan

Well here we are, finally in the dog days of summer. We’ve earned this after a long record-breaking winter and a much delayed spring here in southern Ontario.

The dog days are named for the binary Dog Star, Sirius, also known as Alpha Canis Majoris, the brightest star in the night sky and the largest star in the constellation Canis Major. Among ancient Romans, the hottest days of the year were associated with the first heliacal rising of Sirius – the day when Sirius first becomes visible on the eastern horizon just before sunrise. This year that occurred on August 7th.

Most people, unaware of the astral connection, associate the dog days of summer with their own earthbound canines, sprawled belly down on cool kitchen tiles or retreating to damp excavations under the porch. For humans, the dog days are a time to shelter from the heat in air conditioned interiors or to take to the water – be it the local pool, the lake or the ocean.

But, while our urge is to siesta away the steamy days, one small creature is working diligently, recovering from last winter and preparing for the next.

This is the season of summer honey.

• Honeycomb (photo: Flickr User bionicgrrrl CC BY-NC 2.0.)

By the time the dog days hit us, honeybees have long since used up last year’s store of honey and have been working ceaselessly since the first blooms of early spring – those of maples, pussy willows and bugloss, to name a few – to feed their young and grow their ranks.

Recently, they made the decision of whether or not to divide the colony and swarm. Some hives did and some did not, depending on how crowded the colony had become. Now settled into their permanent digs until next summer, they are making honey from the nectar sources in their range.

Honeybees in hive (photo Flickr User: dni777 CC BY-SA 2.0)

If you are only familiar with the mass-produced product in squeezable plastic bears, you may think that there is just one honey and that it is of uniform golden colour and singular mild flavour. That’s what I used to believe until I started frequenting farmers’ markets and talking to beekeepers – and most importantly, tasting the honey. As it turns out, in nature, no two honeys are alike.

The uniform colour and flavour of commercial honey is a result of raising bees near single crops such as clover or canola, feeding bees cane sugar or corn syrup, mixing many sources of honey to an “average” flavour and sweetness, and, in some cases, ultra-filtering it to remove all traces of pollen – a practice that removes some of the distinct flavours and also helps prevent the honey from crystallizing, leading to the misguided belief that “good” honey should remain liquid forever.

• Honey bear bottles (photo: Flickr User: karen CC BY-NC-ND 2.0)

But like wine, coffee and tea, real honey is variable, taking on the characteristics of its environment or “terroir” – that is, the collection of elements that affect plants and the make-up of their nectar, such as season, climate, soil, weather, geology and human activity.

The taste of honey has several components. The most obvious is its perceived sweetness. This largely depends on the ratio of fructose to glucose; the more fructose, the sweeter the honey. Most mono-flower honeys – or varietals, as they are called – are similar in sweetness, but some, such as orange-blossom honey, have a much stronger sweet-factor.

In addition to sweetness, different varietals have unique identifiable flavours and aromas, the subtleties of each dependent on the chemical makeup of the nectar and the final product as made by the bees.

For instance, two major components responsible for the odor of Linden [also known as Basswood] honey are, terpenes linden ether (3,9-epoxy-1,4(8)-p-menthadiene) that has a flowery, mint-like odor, and cis-rose oxide that has a powerful, green, geranium type odor. (The Honey Traveler)

Eighty percent of Canada’s honey crop comes from the vast prairies of Alberta, Saskatchewan and Manitoba. Much of this is a mono-floral by-product of the canola industry, where more than 300,000 colonies of honeybees pollinate the world’s largest crop of canola seed. Smaller commercial operations in the east focus their efforts on apple orchards and blueberry crops. These are the sources of plastic bear honey.

However, twenty percent of Canada’s honeybee colonies are owned by hobbyists and small producers and this is where you begin to see the delectable array of honey that’s possible to produce in this country.

Most people are probably unaware that every temperate Canadian region offers a dazzling range of luminous, astonishing honeys. (The National Post)

• Honeybees on sunflower (photo: Flickr User: Sue Reynolds CC BY-SA 2.0)

When bees are surrounded by an abundance of summer wildflowers, berries, and a variety of agricultural and cover crops, the flavour and aroma of summer honey can be delightful and complex.

“Harvested from mid-July to August, the honey is made from the nectar of white clover, raspberry bush, alfalfa, mint and wild flowers. There is nothing cloying or rich about it. Though delicate, it is full-flavoured. It trickles down the throat softly without a blast of sweetness hitting my molars, or any trace of bitterness catching in my throat. Divine.” (Montreal Gazette)

So, if you have never tried real summer honey, take a trip to your nearest farmers’ market and buy yourself a bit of that succulent sunlight. I promise, if you do, you’ll never go back to the bear.


Kimberly Moynahan writes on the natural sciences and reflects on that uneasy space in the Venn diagram where humans and wildlife overlap, both physically and emotionally. Her work can be found on her blog, Endless Forms Most Beautiful.


By Allison MacLachlan

I’ve just finished reading The Bees by Laline Paull. It’s a new novel that manages to present a well-researched and fascinating portrait of the honeybee through an anthropomorphized story.

I hadn’t thought much about bees before in this level of detail — I just suggested The Bees as my book club’s August pick because it sounded interesting and had been well reviewed. But taking such a deep dive into the world of bees has made me newly attuned to their beauty, organization, and amazing efficiency.

The Bees, a new novel that has garnered considerable attention, is being described as the Watership Down of bees. Image source:

Somewhat serendipitously, recent weeks have seen considerable buzz about bees in the news.

The Globe and Mail reported on what appeared to be a precipitous decline in Canada’s honeybee population, possibly due to the use of neonicotinoids — a group of pesticides that farmers often apply to corn, canola, vegetables, and flowers.

This systemic pesticide travels to all parts of a plant, meaning insects pick it up when they gather nectar or pollen. Neonicotinoids can weaken a bee’s immune system, disorient them in flight, and cause reproductive problems, slowing the growth of a colony. This, in turn, means fewer insects to pollinate plants and help crops grow.

More recent analyses suggest the bee decline is less pronounced than initially signaled, and the issue may only be real in Ontario.

Despite the media attention of late, concerns about a declining supply of honeybees aren’t new. The issue was flagged even a decade ago, and scientists estimate that domesticated honeybee numbers may have been slipping for 50 years. In the past, threats like parasites and viruses were considered more significant.

We’re hearing about a bee decline this summer particularly because Ontario is the first province to propose regulating neonicotinoids. As the producer of most of Canada’s corn – attractive to bees because wind deposits pollen on corn – Ontario may soon require commercial growers to obtain licenses to use neonicotinoids.

Insects that transfer pollen and nectar from one plant to another, thereby fertilizing the plant, help 30 percent of the world’s crops grow. Image source: Wikimedia Commons.

In the United States, worrisome bee declines have recently led the federal government to set up a new Pollinator Health Task Force. Declines have also been observed in the United Kingdom and China, as well as in Europe, which put limits on the use of three common pesticides from the neonicotinoid family last year in a short-term trial.

Limiting use of neonicotinoids would certainly have economic implications: sold under a long list of different brand names, they make up 40 percent of the insecticide market, and corn is a key contributor to Ontario’s economy.

The Bees includes the threat of pesticides as a major plot point: foragers become coated in a strange, grey film from a neighbouring field. It throws off their finely tuned navigational sense and is often fatal. The issue is woven deftly into the story and, thanks in part to the anthropomorphism, the effect is to instill real sympathy for bees’ predicament.

It’s unique to find a work of fiction that also functions in many ways as a great piece of science writing. Yes, some fictional liberties are taken (for instance, like Watership Down’s rabbits, the bees talk). But I think credit is due for simply making readers more keenly aware of an issue in current science, and piquing interest in further exploring the facts.

Taking a close look at one uncommonly appreciated insect has been interesting for other reasons. In The Bees, I encountered several highlights of bee biology and behaviour I didn’t know about before. Each fascinated me. Humans have gathered a few insights already from bees. Each of these struck me as an idea that we can appreciate and gain more from:

Economize resources

Honeycomb is made of perfect repeating hexagons, mathematically proven to be the most compact structure and, therefore, the most efficient use of wax and energy to construct.

Use of hexagon is maximized in honeycomb for efficient use of work and resources. Image source: Wikimedia Commons.

Focus energy

Forget multi-tasking. Within their colonies, each bee has one distinct purpose such as communication, construction, or defense, and channels immense energy into their one crucial niche.

Listen and give

Bees communicate using pheromones, vibrations, and dance. They must focus equally on taking in information that will help them succeed, such as the location of good pollen sources, and openly sharing what they know.

Cast a wide net

Bees forage for pollen in a large network of areas within about 3 kilometres of their hive, and are always open to exploring new terrain if it will deliver good results.

Adapt and change course

Bees observe closely to find out which plants have good pollen supplies, but they don’t cling onto stale possibilities. In case of bad weather, hazards, or poor results, bees swiftly change course — a mindset possibly worth adopting as we think about the challenges with neonicotinoids.


Allison MacLachlan (@a_maclachlan) earned her M.Sc. in science writing from MIT in 2011. She lives in Toronto, works at Owlkids, and enjoys writing about health, biology, and psychology.



By: Pamela Lincez

Since its discovery by Canadians Frederick Banting and Charles Best in 1921, insulin has been the primary and most sustainable therapy for the treatment of type 1 diabetes.

The autoimmune destruction of beta cells in the pancreas eliminates the patient’s ability to produce insulin to regulate blood sugar (glucose) levels. Most people must take about 1,450 insulin injections a year. Without Banting and Best’s discovery of insulin, patients would be completely helpless in treating their disease.

Over the course of almost a century, science and technology have forged incredible advances in the manufacturing and production of insulin and in personal glucose management, with the delivery of insulin through wearable insulin pumps. The design and wearability of insulin pumps have come a long way since the first ‘backpack’ insulin pump designed by Dr. Arnold Kadesh in 1963: pumps are now fashionable, even appearing in the swimsuit competitions of Miss USA pageants!

Insulin pumps offer immense relief for patients as they present an alternative to incessant insulin injections, but until the past year, insulin pumps have required human intervention. When the pump signals a change in blood glucose levels, the patient must push a button to release insulin. In a sense, patients are still not free from monitoring their disease. The artificial pancreas, a closed-loop insulin-monitoring system in which patients would not need to operate their pump, has been a dream for many researchers over the past decade. Over the last year, this has materialized as a viable technology.

In November 2013, Medtronic launched the MiniMed Veo Paradigm system in the US. This artificial pancreas-like system was the first insulin pump device to come close to an artificial pancreas: it could provide continuous glucose monitoring and direct the delivery of insulin when glucose levels wavered near a pre-set threshold. Medtronic’s system, with its automatic response at a pre-set threshold, is a transformative step towards a true artificial pancreas, but it is still not a fully artificial system that works without relying on the human brain for intervention.

This past year, the American Diabetes Association’s journal, Diabetes Care, published results from clinical trials testing the feasibility and safety of a true wearable artificial pancreas system known as the Diabetes Assistant (DiAs). The DiAs is a smartphone device that communicates wirelessly with insulin pumps and epitomizes the features of a true artificial pancreas. It is basically a portable pancreas-iPhone hybrid.

Stepping up the game in ‘hands-free’ glucose monitoring, Dr. Steven Russell from Massachusetts General Hospital reported this past June, in the New England Journal of Medicine, on the success of a bionic pancreas device. This device uses a removable sensor under the skin to automatically monitor glucose levels and two automatic pumps that output either insulin or glucagon as needed.

The DiAs and the new bionic pancreas are truly exceptional and revolutionary advancements in insulin management; however, these technologies still do not solve the underlying autoimmune disease that will plague a patient’s life.

We are now arriving at a time where scientists and doctors are on a mission to prevent disease and circumvent the need for continuous insulin injections. Researchers across the world have toyed with many therapeutic prospects as insulin replacements: surgical strategies like islet and immune cell transplants, personalized and regenerative beta cell therapy and immunotherapy injections of inflammatory cytokines.

The difficulty in successfully treating type 1 diabetes lies in the complexity of the disease, as little is known about the events leading to the onset of disease. Specific genes and environmental factors, enterovirus infections, or combinations thereof have been implicated as the culprits driving disease susceptibility, suggesting that ‘diabetes has gone viral’, yet a distinct target for a vaccine or cure still does not exist.

As the race to find a cure for type 1 diabetes continues, exciting technological and scientific advancements are emerging. Even filmmakers are jumping on board with the excitement. This past year, a 4-minute teaser video was released for the upcoming documentary film The Human Trial. From what I’ve seen in the teaser, there may not be a Matthew McConaughey Oscar winning performance, but I do anticipate a lot of drama. The film follows the rivalry between the top research labs, the role of the FDA, and Big Pharma. Maybe the road to a cure in a clinical trial is not so glamorous or dramatic, but as a scientist, I do appreciate the interest in documenting groundbreaking research as it happens in real-time.

From the murky canine insulin concoction Banting and Best discovered, to Dr. Kadesh’s backpack pump and the revolutionary DiAs pump currently being tested in clinical trials to an upcoming documentary film – type 1 diabetes research is entering an era that will bring us closer to a cure.

Pam is a PhD candidate in Microbiology and Immunology at UBC in Vancouver. She is wrapping up her research this year on a new target for type 1 diabetes therapy. Her undergraduate studies in Biochemistry and Biotechnology, her work in various research labs from academia to industry and participation at a variety of Science conferences have exposed her to a diversity in scientific thought. Her participation in the Banff Science Communications Program and many Science Outreach programs have inspired her to communicate science from all fields and share her love for perfectly awkward science on her Perfectly Awkward Science website. She is, as her Twitter handle @PamLincez describes, a futurist, realist, optimist and traveler.


by Sarah Boon

‘Rewilding’ is a popular buzzword these days. Ecology books from 2013 that discuss rewilding include Canadian JB MacKinnon’s Once and Future World, American Emma Marris’ Rambunctious Garden, and Feral by the UK’s George Monbiot. Rewilding has even gone mainstream, with the city of Vancouver developing a plan for rewilding park spaces, and Parks Canada talking about rewilding Banff National Park by reintroducing bison.

Muybridge Buffalo galloping [Credit: Wikipedia]

But what exactly is rewilding, and what does it have to do with you?

Rewilding is a form of restoration that aims to restore wilderness from its current managed state into a wilder version of itself. Unlike most restoration efforts, which focus on restoring an ecosystem to a historic baseline, most often the pre-Columbian period (just prior to 1492) in North America, rewilding often considers a prehistoric timeframe.

Rewilding was first introduced to the public in 2005, when a group of ecologists published a scientific paper in Nature outlining their ideas for prehistoric rewilding at the continent scale. This controversial – and out of this world – strategy is called Pleistocene rewilding, and involves restoring megafauna that lived 10,000-13,000 years ago. While we can’t bring back the now-extinct mammoths, sabre-toothed cats, giant tortoises, and short-faced bears of that era, we can substitute other species such as the Asian elephant, various species of large cat (bobcat, lynx, cougar, jaguar), the Bolson tortoise, and the grizzly bear.

Tusker debarking a tree in Kabini [Credit: Wikipedia]

Why would we want to do this? Pleistocene rewilding advocates suggest that wilderness has been in a continual state of decline since that fateful moment 13,000 years ago when humans began to hunt megafauna. These animals were instrumental as a food source, in physically shaping landscapes, and in maintaining complex food webs and ecological interactions across continents. The mammoth, for example, is credited with controlling the spread of forests through a combination of eating trees (herbivory) and transporting seeds in its dung (seed dispersal). A modern-day example of the effects of elephant species on the landscape can be seen in Africa, where forest species are declining because of the concurrent decline in elephant populations and the loss of a key method of seed dispersal.

Rewilders argue that the 1492 baseline used in traditional restoration represents an environment that humans had already substantially impacted, and note that current environments are not synchronous with historic ones due to climate change and changes in ecological community structure. Rewilding – which focuses on key species rather than communities of species – is seen as a way to restore resilience and function to our planetary ecosystems so that they are better able to respond to challenges like climate change.

Supporters of rewilding also suggest that our relationship with the natural world will benefit both from the careful thought and consideration required to put rewilding plans into place, and from the more immediate relationship with nature we would develop. For example, if bison were reintroduced into Banff National Park, visitors would have to be alert for bison as well as other megafauna such as bears, making them more aware of – and more active participants in – the environment.

So what megafauna is Vancouver introducing into its parks to promote rewilding – the grizzly, perhaps? No, the rewilding plan that the city has in mind is on a much smaller scale, and designed more with the human element in mind. The plan is to create less tame (more wild?) natural spaces that the public can explore either on their own or with a guide to reconnect with nature away from manicured lawns, cell phone chatter, concrete and asphalt. Initiatives include developing a youth program called ‘Reflect Effect,’ designed to use art media to explore environmental themes and projects, and changing park mowing practices to support pollinators.

This is the kind of rewilding that you can practice in your own backyard or closest vacant lot: planting native species and non-invasive species adapted to your microclimate, providing habitat for birds, insects and butterflies, and being part of a city-wide network of rewilded urban spaces that help strengthen ecosystem resilience and response to human impacts such as air pollution, climate change, and development. All while giving people some nature to connect with – no mammoths required.

For more on rewilding, read this interview with George Monbiot, or watch this BBC documentary.

Sarah Boon has straddled the worlds of freelance writing/editing and academic science for the past 15 years. She blogs at Watershed Moments about the environment, science communication & policy, women in science and academic culture.
