A Data-to-Viz Walkthrough with Wine Production Numbers

We like wine here at Needle Stacker, and saw a few trends in that market that we’re highlighting below to explain what we do with data (in any sector, not just wine!), how we do it, and what the value is for our customers and the audiences they want to reach.

Executive summary

Finding, massaging, and formatting source data are all important steps before you can build an effective visualization. Publicly available data most often needs to be scrubbed and corroborated to ensure we reach sound conclusions that can convince seasoned industry pros. But it’s well worth the effort. At their best, pointed visualizations put into sharp relief trends that even experienced executives can learn from. This, in turn, contributes to targeted readership engagement that supports audience and revenue growth.

1. Finding the Data: Who’s Talking?

To get started with a new vertical, we will typically work closely with our customer to identify data sources relevant to their industry. That said, we’re also very well versed in quickly discovering source material by ourselves. Either way, the end result is a list of authoritative sources and data series that we can monitor for updates and additions.

In this example, after conducting advanced Google queries, searching through the trade press, and talking with some winemaker friends, we established our own “master list” of data sources in the wine and spirits sector. They run the gamut from customs and trade-tracking governmental organizations (usually at the country level) to national and international trade associations, market research firms, and even financial institutions such as Rabobank or SBV.

Today we will work with documents prepared by Trade Data & Analysis (TDA) for Wine Institute. They are a reputable source and the data is of broad interest to the industry. Wine production and vineyard acreage for 50+ countries give us more than enough datapoints to help show how countries have competed over the 2009-2012 period.

2. Massaging the Raw Data


 Getting the data ready in just 17 easy steps…


Several steps are necessary before it’s even possible to chart data. While they vary depending on the source’s file and data format, here’s how the process typically works:

  • Import/copy data from its source, which is rarely perfectly formatted for our purposes. Here we are working from scanned PDFs, i.e. numbers are lumped in one big image rather than actual text that you can copy/paste. Short of having to scan printed documents, this is the worst-case scenario, but we still often run into such “analog digital” documents, especially for archival data. Thankfully we’ve found optical character recognition (OCR) software that’s sophisticated enough to properly recognize tabular data and export it to Excel. (We don’t mean to pick on Wine Institute or TDA, as we appreciate the hard work they put into getting data for so many countries!) It is more common to find CSV or basic Excel sources, but even these require further processing.
  • Restructure the spreadsheet’s rows and columns so that each datapoint is on its own row, by year. This is necessary for anything but the most basic charts.
  • Reformat details such as capitalization, number formats, and units, in order to enhance readability. For instance we’ll be switching surface data from acres to hectares to consistently use the metric system since production data is expressed in liters. This will make the spreadsheet both easier to use by itself, and ready for export into a visualization tool such as Tableau.
  • Add columns for classification purposes, in this case grouping countries by continent and zone (Old World vs. New World). Use Excel’s Table functionality and the VLOOKUP formula to match each country to these categories. Muscle memory of Excel keyboard shortcuts is highly recommended!
  • Double-check that numbers add up consistently with the source. If one isn’t careful, it’s easy to make a mistake somewhere along the process and mess up the data.
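For readers who prefer scripting to spreadsheets, the restructuring, reformatting, classification, and double-checking steps above can be sketched in a few lines of Python. This is only an illustration, not the workflow we actually used (which was Excel-based): the sample rows, the `CLASSIFICATION` table, and all function names are hypothetical, and the numbers are made up.

```python
ACRES_PER_HECTARE = 2.47105  # 1 hectare is roughly 2.47105 acres

# Hypothetical wide-format rows, shaped like a table exported by OCR:
# one row per country, one column per year. Values are made up.
wide_rows = [
    {"country": "italy", "2009": 1000.0, "2010": 1100.0},
    {"country": "FRANCE", "2009": 900.0, "2010": 950.0},
]

# Hypothetical lookup table, playing the role of the VLOOKUP range.
CLASSIFICATION = {
    "Italy": ("Europe", "Old World"),
    "France": ("Europe", "Old World"),
}


def acres_to_hectares(acres):
    """Convert a surface expressed in acres to hectares."""
    return acres / ACRES_PER_HECTARE


def to_long(rows, value_name):
    """Unpivot the table so each (country, year) datapoint gets its own row."""
    long_rows = []
    for row in rows:
        country = row["country"].strip().title()  # normalize capitalization
        continent, zone = CLASSIFICATION.get(country, ("Unknown", "Unknown"))
        for key, value in row.items():
            if key == "country":
                continue  # every other column is a year
            long_rows.append({
                "country": country,
                "continent": continent,
                "zone": zone,
                "year": int(key),
                value_name: value,
            })
    return long_rows


def totals_match(wide, long, value_name, tolerance=1e-6):
    """Check that restructuring did not drop or duplicate any number."""
    wide_total = sum(v for row in wide for k, v in row.items() if k != "country")
    long_total = sum(row[value_name] for row in long)
    return abs(wide_total - long_total) <= tolerance


long_rows = to_long(wide_rows, "production")
assert totals_match(wide_rows, long_rows, "production")
```

Comparing totals before and after restructuring is a cheap guard against the kind of slip described in the last step, though it will not catch an OCR misread that was already present in the imported table.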

Phew! Now we have a usable spreadsheet that lends itself to analysis and charting. The takeaway is that there is a vast ocean of data out there in many sectors, but getting it ready for consumption can be a time-consuming and detail-oriented endeavor that should not be underestimated. Incidentally, in order to scale our visualization services we establish relationships with sources so that we don’t have to reinvent the wheel and start from scratch for every chart.

3. Formatting & Visualization

Visualization is not just meant to convey data but also to help draw insights and tell a story. Ideally we’d like to give the big picture as well as a sufficient level of detail. This would make the same graphic useful for a wider audience, as some readers will just need the highlights while others will want to dig into very specific aspects of the data.

The 10,000-foot perspective can be achieved by grouping countries by continent, or possibly using the New World vs. Old World distinction familiar to wine buffs. Using overlapping columns with two axes, we can show both production and surface trends over time. This proves useful as an initial approach to see Europe’s (deliberate, it turns out) shrinking but still dominant position, but you’ll see below that this remains quite a coarse-grained perspective. That said, many similar, relatively simple charts can be derived from one dataset, and often that’s exactly what we do.


A simpler chart: good for big trends, not much else


To drill down by country will require us to make some choices, as charting 50 countries over 4 years and two data series would just be visually overwhelming. We can for instance focus on the top 10 producers, as per the following chart:


So this column is Spanish wine production in 2011, and that one is South Africa’s vineyard surface in 2012 and… where was I?


This works to an extent, but by now this chart type has become really busy and it can be hard to make comparisons or see trends without squinting at specific parts of the chart. Moreover, the key ratio between production and surface is yield – how many liters of wine countries produce per hectare – and showing that would overload the chart even more. The pivot chart above remains valuable for people willing to play with its filters and get the exact data they’re interested in. If you have consultants or financial analysts in your audience, they’ll love it if you offer Excel downloads that combine clean source data with ready-made pivot tables. They can then further tinker with the data themselves or blend it with their own metrics.
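To make the two operations discussed here concrete, below is a minimal Python sketch of picking the top producers and computing yield in liters per hectare. The countries “A”, “B”, and “C”, their values, and the function names are all hypothetical; the actual charts were built with pivot tables rather than code.

```python
from collections import defaultdict

# Hypothetical long-format rows (values made up for illustration):
# production in liters, vineyard surface in hectares.
rows = [
    {"country": "A", "year": 2009, "production_l": 4_000_000, "surface_ha": 1_000},
    {"country": "A", "year": 2010, "production_l": 4_400_000, "surface_ha": 1_000},
    {"country": "B", "year": 2009, "production_l": 1_000_000, "surface_ha": 1_000},
    {"country": "B", "year": 2010, "production_l": 1_200_000, "surface_ha": 1_000},
    {"country": "C", "year": 2009, "production_l": 9_000_000, "surface_ha": 3_000},
    {"country": "C", "year": 2010, "production_l": 9_000_000, "surface_ha": 3_000},
]


def yield_l_per_ha(production_l, surface_ha):
    """Yield is simply production divided by planted surface."""
    if surface_ha <= 0:
        raise ValueError("surface must be positive")
    return production_l / surface_ha


def top_producers(rows, n):
    """Return the n countries with the largest total production over all years."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["country"]] += row["production_l"]
    return sorted(totals, key=totals.get, reverse=True)[:n]


top = top_producers(rows, 2)

# Yield for the selected countries in a single year.
yields_2010 = {
    row["country"]: yield_l_per_ha(row["production_l"], row["surface_ha"])
    for row in rows
    if row["year"] == 2010 and row["country"] in top
}
```

Note that the largest producer is not necessarily the highest-yielding one, which is part of why yield deserves its own view rather than a third series on an already crowded chart.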

However, as far as building a graphic ready to be consumed online by people who don’t have a degree in advanced spreadsheets, it’s time for a new approach.

After a few attempts, our proposal for this data set is the line chart below, which we authored in Tableau. The first version is a static screenshot; the second is its interactive embeddable version.

At a granular level, we’re able to display with each line at what rank each country was at the beginning and end of the 2009-12 period (e.g. France started and ended as the number 2 producer in volume behind Italy), what average values were over the period (e.g. the US produced an average 2.7 billion liters of wine over the period), and whether each value/country increased or decreased over the period thanks to color coding (e.g. China’s planted surface is a dark green because they added a lot of vineyard area).

Meanwhile, the big picture is much more readily apparent at a country vs. country and even continental level than with the previous chart type. Again, we go for minimalist production values in order to keep it legible and let the data express itself.
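The per-line details described above (rank at the start and end of the period, average over the period, and direction of change) boil down to a few simple calculations. Here is a rough Python sketch of them; Tableau handles this for us in practice, and the `summarize` function and the sample figures are hypothetical.

```python
def summarize(series, first_year, last_year):
    """Summarize each country's line.

    series maps country -> {year: value}. Rank 1 is the largest value.
    """
    def ranks(year):
        ordered = sorted(series, key=lambda c: series[c][year], reverse=True)
        return {country: pos for pos, country in enumerate(ordered, start=1)}

    start_rank = ranks(first_year)
    end_rank = ranks(last_year)

    summary = {}
    for country, values in series.items():
        change = values[last_year] - values[first_year]
        summary[country] = {
            "start_rank": start_rank[country],
            "end_rank": end_rank[country],
            "average": sum(values.values()) / len(values),
            # Drives the color coding: increase vs. decrease over the period.
            "direction": "up" if change > 0 else "down" if change < 0 else "flat",
        }
    return summary


# Hypothetical production figures, in arbitrary units.
production = {
    "A": {2009: 50, 2010: 48, 2011: 46, 2012: 44},
    "B": {2009: 45, 2010: 45, 2011: 47, 2012: 47},
    "C": {2009: 10, 2010: 12, 2011: 14, 2012: 16},
}

summary = summarize(production, 2009, 2012)
```

In this made-up example, country A starts first and ends second while B does the reverse, which is exactly the kind of crossover a line chart makes visible at a glance.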

Big picture, meet details


Using an Interactive Viz Comes with Pros and Cons

Pro: macro meets micro. The interactive below adds the benefit of overlaying granular data boxes as the user hovers over each data point. There is obviously way too much data to display by default, so this is a nice solution. You now get the benefit of seeing charted trends combined with the ability to zoom into detailed numbers that you would typically only obtain by consulting a spreadsheet or table.

Hover over me!

Con: not mobile friendly. Interactive embeds come with several drawbacks though. Complex embeds are usually not responsive, and they can be a bit sluggish to load on a slow connection. Hover is a mouse event that is hard to emulate on touch devices, and would only be practical on larger tablets anyway. As a result, such technology is best used in full width page layouts and in articles or website features that readers will be inclined to read on a desktop computer. Note that many websites use narrow columns for their body copy, which helps the readability of text-centric articles, but can become a straitjacket for more ambitious graphics. Designs should be flexible enough to adequately accommodate rich content types of various widths.

We don’t want to talk in absolutes, and sure, you can to some extent bend over backwards to offer some limited level of interactivity on mobile devices (leaving aside native mobile apps vs. web browsing, which is a whole conversation in its own right). But is the cost involved worth it to your audience, and will they actually engage? In most cases the costly complexities of getting browser-based interactivity to work on tiny touch screens that people consult on the run are not worth it. So for mobile audiences on smaller devices with iffy bandwidth, it’s probably better to stick to PNG screenshots optimized for smaller sizes.

A final consideration worth pondering for longer articles: interactive graphics tend to print poorly, if at all. The takeaway is to know your audience, the use cases for your content, and make the right trade-off analysis. You may for instance break down the data into highlights that can fit on static PNG charts and in an email newsletter, while keeping the more complex stuff to a website section that readers will want to use once at their desk.

How About a Map? Wouldn’t that look cool?

Geographical data seems a natural match for mapping, and it often is. In this case, a world map allows us to show data for all the countries in the original dataset – which we have ruled out in our charts above – and may look more pleasant than a dry spreadsheet. However, this raises challenges of its own: there is little available space in the middle of Europe; and production issues come into play, given that we’re trying to show a “movie” (i.e. how did the data evolve over time), rather than a still picture (i.e. a snapshot for a specific year). And that doesn’t even address the known evils of Mercator projections!

A map animated through time with a data overlay can tackle some of these issues, with the caveats that 1) interactivity then becomes essential for the visualization to convey its message, and 2) we’re not actually sure how to produce such types of charts at the time of this writing! Google Public Data Explorer provides a compelling example here, but their Fusion Tables authoring tool doesn’t seem to support the creation of such maps despite a feature request dating back to 2010. If you know of a practical solution that doesn’t require extensive custom coding (e.g. building it from the ground up with D3.js), let us know in the comments!

This brings up an important point: not all visualizations can be done in a cost-effective way, and amazing productions by the likes of Bloomberg or the New York Times have typically taken a whole team dozens of hours to produce. That type of expense rarely makes sense for mid-sized organizations, and arguably they’re loss leaders even for such news behemoths.

Below is a quick attempt to turn our chart into a map. We think you’ll agree that it conveys less data and takes more time to interpret, though at least it lets readers instantly locate countries. Adding a filter by continent is a quick way to add value to the interactive version of this map.

For further consideration, more specific maps showing where specific varietals are grown, not just by country but within them, could be a more fruitful exercise. It is a bit misleading to color the whole of Russia as a winemaking country when their production is concentrated in a small area in the southwest near Turkey. That sort of map would also really make the point that growing vines is by and large a matter for temperate climates with access to water.


from the Dept of Discarded Drafts

The bottom line is that just because you can produce a fancy graphic doesn’t mean you should. You always have to ask yourself: is this serving or impairing understanding? And is the cost/benefit ratio aligned with your goals and compatible with your means?

4. Wait, Is This Story Even True?

The charts above show very wide yield discrepancies among countries, with the most productive country showing more than 4 times the yield of the least productive one. Fascinating, but now is a good time to ask ourselves again whether this is even accurate. The first rule of outliers is to double-check your data’s accuracy. This is where you cannot be just a chart machine, but need to keep your analyst’s thinking cap firmly in place. A few guidelines help keep us in check:

  • Read the fine print. The Wine Institute’s documents include several footnotes, including the salient fact that acreage data includes vineyards used to produce table grapes or raisins, not just wine. There’s a good reason for that: the FAO and Eurostat source data aggregated by TDA is not split by use. For big table grape or raisin producers such as China, the US and Chile, this can significantly overstate vineyard surface for wine, as well as affect our yield calculation. Which brings us to our second point…
  • Check other sources. Other organizations such as the International Organisation of Vine and Wine (OIV) publish similar data sets (though less exhaustive in terms of countries), and it’s worth comparing some of the topline numbers to see whether our chart passes the smell test. We do find some significant discrepancies in Asia, which can again be explained by a footnote: Wine Institute excludes Chinese yellow wine, Japanese sake, and Korean rice/fruit wines from its numbers. Drilling down to the level of, say, Chile, we can tap their SAG agricultural agency, whose annual survey gives us about 125,000 ha of vineyard dedicated to winemaking. So if we only wanted to focus on the largest wine producers, it might be preferable to start from narrower but more pointed data sources for increased accuracy. At the expense of spending more time working through sources in a variety of languages, a sharper dataset could be built from the bottom up by sourcing data from the likes of Argentina’s Instituto Nacional de Vitivinicultura (INV), Germany’s Deutsche Weine, or the USDA.
  • Understand the subject matter. Again, we want to make sure our chart passes the smell test, looking at it from different angles. We find after cursory research that grape yields do vary widely because of many environmental and technical factors, and a dry country such as Spain indeed has significantly lower yields than rainy Germany. Another significant factor at play here is that vines typically take about 4-5 years to reach full yield, so a country like China that’s quickly ramping up vineyard surface will show lower yields for a while, as its newer vines don’t produce yet. A quick scan through the ranking by yield leads us to double-check Russia’s suspiciously high productivity, and other sources indeed size up that country’s vineyard surface at about 70,000 hectares rather than 45,000, which, assuming accurate numbers for Russian wine production, would lead to a significantly lower yield, more in line with third-party assessments.
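The Russian example above lends itself to a quick back-of-the-envelope check. Since yield is production divided by surface, revising only the surface estimate rescales the yield by the ratio of the two area figures, whatever the production number is. The short Python sketch below works through that arithmetic; the function name is ours, and it assumes the production figure stays unchanged.

```python
def yield_adjustment(reported_area_ha, corrected_area_ha):
    """Factor by which yield changes when only the area estimate is revised.

    yield = production / area, so with production held constant:
    corrected_yield / reported_yield = reported_area / corrected_area
    """
    if reported_area_ha <= 0 or corrected_area_ha <= 0:
        raise ValueError("areas must be positive")
    return reported_area_ha / corrected_area_ha


# Vineyard surface figures quoted in the text, in hectares.
factor = yield_adjustment(45_000, 70_000)

# Percentage drop in the computed yield after the correction.
reduction_pct = (1 - factor) * 100
```

With these two area figures, the corrected yield comes out at roughly 64% of the original estimate, i.e. about 36% lower, which is enough to move a country several places in a ranking by yield.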

Overall, data collection processes are rarely perfect, especially for large industry-wide aggregates at the country level. One should always assess to what extent the data at hand should be footnoted with caveats or corrected by cross-referencing weaker datapoints. There’s always a trade-off between the time spent vouching for data vs. the value of incremental precision – provided we’ve ruled out that a specific dataset is simply too polluted to be safely used.

In this case, the biases and limitations in the original dataset are likely to be constant through the observed period. Even if we know that the data could be better, it is reasonably safe to think it’s at least directionally correct. The map is not the territory and data should always be approached with common sense and healthy skepticism. What matters most is screening numbers and making informed decisions.

5. Taking Visualization Further to “Get the Data to Talk” and “Tell Me Something I Don’t Know”

Annotations or animations can direct readers to specific parts and help them grasp what’s going on without frowning at the chart or having their eyes wander aimlessly. The more complex the data, and the less obvious the insights, the more we recommend such light but effective hand-holding.

This type of step-by-step walk-through can be produced with a number of techniques, from low-tech animated GIFs to Tableau’s story points and YouTube videos with voice-overs. Choosing the best vehicle to let the data fully tell its story depends on the wealth of data at hand, its underlying narrative opportunities, and the time/money budget for production. It provides the opportunity to draw attention to specific parts of the chart, backed up by some extra data and explanations that are not necessarily apparent on the original chart. Just for the sake of illustration, here’s an animated mock-up cycling through a couple of highlights:


Pointing where to look and explaining underlying drivers

6. Diving Even Deeper: What Data Could Enlighten Your Audience?

Working with data often reveals the need… for more data! For example, Wine Institute also compiled wine consumption data by country, whose complement is of course exports. A full-fledged dashboard for the wine industry may include consumption flows in volume and value by country/region/varietal/wine type and by year, over at least a decade, given the long-term pace of agricultural byproducts. A visualization is often the starting point in an ongoing quest to figure out a market or trend. The end goal of that process is to deliver sound insights from which your audience will feel they really learned something useful.

This is where our incremental approach comes into play, by tapping a broad set of data sources for quick highlights, as well as providing ongoing updates of established longitudinal series for deeper inquiries. In the end this allows zooming and panning through any given data universe and always serves to put articles in perspective.

Please get in touch if you would like to discuss our approach to data and how we can help your organization roll out similar visualizations to grow reader engagement and open new revenue opportunities.
