Strategies for collecting and storing activity data

Topic
This guide discusses the factors that influence which activity data to gather and whether the data should be aggregated before storage.
The problem
Activity data sets are typically large and require processing to be useful. Decisions that need to be made when processing large data sets include selecting which data to retain (e.g. to analyse student transactions but not staff) and which data to aggregate (e.g. for the purposes in hand, do we require a record of every book borrowed or simply a count of books borrowed by each student?).
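These two decisions can be illustrated with a short sketch; the file name loans.csv and its columns (user_id, user_type, item_id) are hypothetical, and pandas is simply one convenient tool.

    # Sketch only: the file and column names are illustrative, not a prescribed schema
    import pandas as pd

    loans = pd.read_csv("loans.csv")

    # Select: keep student transactions, discard staff
    student_loans = loans[loans["user_type"] == "student"]

    # Aggregate: if individual item records are not needed for the purposes in hand,
    # reduce the data to a count of books borrowed by each student
    loans_per_student = student_loans.groupby("user_id").size().rename("loan_count")
    print(loans_per_student.head())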
If we are being driven by information requests or existing performance indicators, we will typically manipulate (select, aggregate) the raw data early. Alternatively, if we are searching for whatever the data might tell us, then maintaining granularity is essential (e.g. if you aggregate by time period, by event or by cohort, you may be burying vital clues). However, there is the added dimension of data protection: raw activity datasets probably contain links to individuals, so aggregation may be a good safeguard (though you may still need to throw away low-incidence groupings that could betray individual identity).
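The data protection point can be sketched in the same way. The fragment below reuses the hypothetical loans.csv, here assumed to carry a faculty column, and the suppression threshold of 5 is illustrative only: it aggregates away the link to individuals and then discards low-incidence groupings.

    # Sketch only: column names and the threshold are illustrative
    import pandas as pd

    loans = pd.read_csv("loans.csv", parse_dates=["loan_date"])

    # Aggregate away the link to individuals: loans per faculty per month
    monthly = (
        loans.groupby(["faculty", loans["loan_date"].dt.to_period("M")])
        .size()
        .rename("loan_count")
        .reset_index()
    )

    # Suppress low-incidence groupings that could betray individual identity
    safe = monthly[monthly["loan_count"] >= 5]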
The options
It is therefore important to weigh up the differences between two approaches, so as to avoid, on the one hand, losing useful data through aggregation or, on the other, unnecessarily using terabytes of storage.
Directed approach
Start with a pre-determined performance indicator or other statistical requirement, and selectively extract, aggregate and analyse a subset of the data accordingly; for example (a code sketch follows the list):
  • Analyse library circulation trends by time period or by faculty or …
  • Analyse VLE logs to identify users according to their access patterns (time of day, length of session).
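A minimal sketch of the second bullet might look as follows; vle_events.csv and its columns (user_id, session_id, timestamp) are hypothetical names for a VLE log extract.

    # Sketch only: file and column names are illustrative
    import pandas as pd

    events = pd.read_csv("vle_events.csv", parse_dates=["timestamp"])

    # Derive session length and start time for each session
    sessions = events.groupby(["user_id", "session_id"])["timestamp"].agg(["min", "max"])
    sessions["length_minutes"] = (sessions["max"] - sessions["min"]).dt.total_seconds() / 60
    sessions["hour_of_day"] = sessions["min"].dt.hour

    # A per-user access profile: typical session length and time of day
    profile = sessions.groupby("user_id")[["length_minutes", "hour_of_day"]].median()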
Exploratory approach
Analyse the full set (or sets) of available data in search of patterns using data mining and statistical techniques. This is likely to be an iterative process involving established statistical techniques (and tools), leading to cross-tabulation of discovered patterns; for example (a code sketch follows the list):
  • Discovery 1 – A very low proportion of lecturers never post content in the VLE
  • Discovery 2 – A very low proportion of students never download content
  • Discovery 3 – These groups are both growing year on year
  • Pattern – The vast majority of both groups are not based in the UK (and, surprisingly, there is very little subject area or course correlation between the lecturers and the students)
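Cross-tabulation of this kind of pattern might be sketched as follows; the data files and flag columns (never_posted, never_downloaded, uk_based) are hypothetical stand-ins for discoveries 1 to 3 above.

    # Sketch only: files and flag columns stand in for the discoveries above
    import pandas as pd

    lecturers = pd.read_csv("lecturer_activity.csv")
    students = pd.read_csv("student_activity.csv")

    # Cross-tabulate each discovered group against location to look for a pattern
    print(pd.crosstab(lecturers["never_posted"], lecturers["uk_based"], normalize="index"))
    print(pd.crosstab(students["never_downloaded"], students["uk_based"], normalize="index"))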
Additional resources
Directed approach – Library Impact Data Project (LIDP) had a hypothesis and went about collecting data to test it
Exploratory approach – Exposing VLE data was faced with around 40 million VLE event records covering 5 years and decided to investigate the patterns within them.
Recommender systems (a particular form of data mining used by supermarkets and online stores, among others) typically adopt the exploratory approach, looking for patterns using established statistical techniques – http://en.wikipedia.org/wiki/Recommender_system and http://en.wikipedia.org/wiki/Data_Mining