The City of Edmonton recently released the 2009 Edmonton municipal census data as part of the Open Data initiative. The current catalogue doesn’t include results at the neighbourhood level (except in PDF). Edmonton blogger Mack Male recently shared a neighbourhood census file. He also gave an excellent map-based example of what could be done with such interesting data.
Attributes in the municipal census data
The municipal census is narrower in scope than the federal census, which was last completed in 2006. While the federal census gathers data on demographic attributes such as income, marital status, and ethnicity, the municipal census is restricted to the following: gender, age, dwelling unit ownership (owned or rented), structure type (single-detached house, duplex, row house, etc.), and dwelling unit and property type status (occupied, unoccupied, etc.).
The city provides some summary reports at the city, ward, and neighbourhood levels, but these raw tables have limited value for marketers (and residents). They are, however, a great starting point for analytics (which is what we do). This is the first in a series of posts that, I hope, will provide demographic insights into the municipal census data to help fuel marketing strategies for local Edmonton businesses and non-profits. Along the way, you’ll also get an introduction to what we mean by data-driven marketing.
Correlation in a nutshell
In statistics, correlation is a measure of the relation between two variables (e.g., height and weight, demand for a product and its price, etc.). Correlation coefficients can range from -1.00 to +1.00. A coefficient of -1.00 represents a perfect negative correlation: higher values of one variable correspond exactly to lower values of the other. A coefficient of +1.00 represents a perfect positive correlation: higher and lower values of one variable correspond exactly to higher and lower values of the other. A coefficient of 0.00 represents a lack of correlation: knowing the value of one variable does not help you predict the value of the other. (See the Wikipedia article or this detailed correlation primer for more information.)
The reason I’m going over all this is because the first analysis we did on the census data explored the correlations of the number of people in various age groups in Edmonton neighbourhoods.
Age group correlation table
The neighbourhood census data provides counts of male and female residents in successive five-year age groups that start at 0-4 and end at 85+. Not surprisingly, the correlation between the total number of males and the total number of females in a neighbourhood is around 0.98. The correlations between the number of males and females in individual age groups are also high, typically greater than 0.9. From this, we decided to ignore gender and just look at the correlation structure of the total population in the various age groups in each neighbourhood. With a little help from the conditional formatting feature of Microsoft Excel, we produced the following correlation table.
Each cell in the table shows the correlation coefficient for the age groups associated with the cell’s row and column. You can see that the table is symmetric with coefficients of 1.00 along the diagonal. The shading ranges from red (0.00 or no correlation) to green (1.00 or perfect correlation). This table really brings out the patterns in the distribution of age groups across neighbourhoods in Edmonton.
- No negative correlations. In general, if you find a relatively higher (lower) population of one age group in a neighbourhood, you are likely to find a relatively higher (lower) population of all other age groups in that neighbourhood. A negative correlation between age groups would indicate that these groups tended to cluster in distinct neighbourhoods, and perhaps suggest that they avoid each other.
- Birds of a feather flock together. The green band along the diagonal shows that there is a high correlation in the population of adjacent age groups. As an example, neighbourhoods that have relatively more (fewer) 5-9 year olds also have relatively more (fewer) 10-14 year olds. The same applies when comparing 55-59 year olds and 60-64 year-olds and so on. Adjacent age groups tend to cluster in Edmonton neighbourhoods.
- Family ties. The off-diagonal green cells show the high positive correlation between the population of children (under 19) and the population of their parents (age groups from 30-34 to 60-64). Because we combined the male and female populations, we can’t make any observations about the relative ages of mothers and fathers of the various child age groups. But, in general, it appears that the population of child age groups are most correlated with parent age groups that are 30 to 40 years older.
- Seniors and children. The reddish areas in the top right and bottom left of the table shows that there is a relatively low (or even no) correlation between the population of seniors and children in Edmonton neighbourhoods.
- Twenty somethings. The correlation between the population of 15-19 year olds and 20-24 year olds is surprisingly low; it’s the one off-diagonal element in the table that isn’t shaded green. But we know that this is an important life stage that involves moving out on your own, so this would likely create the break we observe in the table.
As I mentioned, we’re going to continue the analysis and discussion of the municipal census in a series of upcoming posts. Our first effort was relatively straightforward, and the findings should be quite intuitive. Conditionally formatted correlation tables in Excel are quick and often helpful. While you do have to be careful not to over-interpret the results (there are more powerful multivariate techniques that apply here), the approach throws useful light on new datasets. If any of you would be interested in a step-by-step of how to go about building these tables, leave a comment, and someone on the team will provide a follow-up post with the details.