Data: Activity data to Enhance and Increase Open-access Usage (AEIOU)
Your data may contain a lot of noise, which makes processing less efficient and results less relevant. For example, filter out the robots and web crawlers that search engines use to index web sites, and exclude double clicks. Try using queries to identify unusually high frequencies of events generated by servers, and flag these.
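A minimal sketch of this kind of clean-up, written in Python rather than SQL. The event layout (IP address, item identifier, Unix timestamp), the function names and both thresholds are assumptions made for illustration, not values taken from the project.

```python
from collections import Counter

# Hypothetical event record: (ip_address, item_id, unix_timestamp_seconds).


def flag_noisy_ips(events, max_events_per_ip=1000):
    """Return the IPs whose total event count exceeds the threshold."""
    counts = Counter(ip for ip, _item, _ts in events)
    return {ip for ip, n in counts.items() if n > max_events_per_ip}


def drop_double_clicks(events, min_gap_seconds=5):
    """Drop an event if the same IP hit the same item less than
    min_gap_seconds after its previous hit on that item."""
    last_seen = {}
    kept = []
    for ip, item, ts in sorted(events, key=lambda e: e[2]):
        key = (ip, item)
        previous = last_seen.get(key)
        if previous is None or ts - previous >= min_gap_seconds:
            kept.append((ip, item, ts))
        # Updated for dropped events too, so a rapid burst collapses to one.
        last_seen[key] = ts
    return kept


def clean_events(events, max_events_per_ip=1000, min_gap_seconds=5):
    """Remove events from flagged IPs, then remove double clicks."""
    noisy = flag_noisy_ips(events, max_events_per_ip)
    remaining = [e for e in events if e[0] not in noisy]
    return drop_double_clicks(remaining, min_gap_seconds)
```

A fixed threshold is the simplest possible rule. In practice the cut-off would need tuning against real traffic, and flagged addresses are probably worth reviewing before they are discarded.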
Your activity data will grow very quickly, and things (e.g. SQL queries) that took a few milliseconds will take tens of seconds. Use open source technologies that are tuned for Big Data (e.g. Apache Solr and Mahout), or process data offline and optimise your database and code; see "Deploying a massively scalable recommender system with Apache Mahout".
Initially we used SQL queries to identify items that had been viewed or downloaded by users within specific 'windows' or session times (10, 15 and 30 minutes). Refinements were made to rank the list of recommended items by ascending time and by number of views and downloads. We have a requirement that the service should respond within 750 milliseconds; otherwise the client connection (from the repository) will time out and no recommended items are displayed. The connection timeout is configured at the repository and is intended to avoid delays when viewing items.
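The windowed approach can be sketched as a self-join on an events table, shown here with Python's built-in sqlite3 module. The table layout, the column names and the exact ranking rule (most co-occurrences first, then smallest time gap) are assumptions for illustration; the original queries are not reproduced here.

```python
import sqlite3

# Hypothetical schema: events(ip TEXT, item_id TEXT, ts INTEGER), ts in seconds.
WINDOW_QUERY = """
SELECT near_ev.item_id                    AS item_id,
       COUNT(*)                           AS hits,
       MIN(ABS(near_ev.ts - seed_ev.ts))  AS closest_gap
FROM events AS seed_ev
JOIN events AS near_ev
  ON near_ev.ip = seed_ev.ip
 AND near_ev.item_id <> seed_ev.item_id
 AND ABS(near_ev.ts - seed_ev.ts) <= :window
WHERE seed_ev.item_id = :item
GROUP BY near_ev.item_id
ORDER BY hits DESC, closest_gap ASC, near_ev.item_id ASC
LIMIT :limit
"""


def load_events(events):
    """Create an in-memory database holding (ip, item_id, ts) rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (ip TEXT, item_id TEXT, ts INTEGER)")
    conn.executemany("INSERT INTO events VALUES (?, ?, ?)", events)
    return conn


def recommend(conn, item, window_seconds=900, limit=5):
    """Items seen by the same IP within window_seconds of a hit on `item`.

    COUNT(*) counts pairs of events, so repeat hits by one IP add weight.
    """
    params = {"item": item, "window": window_seconds, "limit": limit}
    return [row[0] for row in conn.execute(WINDOW_QUERY, params)]
```

Because the join compares every event for an IP address with every other event for that address, the cost grows quickly for addresses with many events.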
Unsurprisingly, queries took longer to run as the data set grew (to over 150,000 events), and query time was noticeably influenced by the number of events per user (IP address). Filtering out IP addresses belonging to robots, optimising the database and increasing the timeout to 2 seconds temporarily overcame this problem.
However, it was clear that this would not scale and that other algorithms for generating recommended items may be required. A little research suggested that Apache Mahout's recommender / collaborative filtering techniques were worth exploring. We are currently testing recommenders based on item preferences determined either by whether or not an item has been viewed (a boolean preference) or by the total number of views per item. Item recommenders use item-to-item similarities, which must be pre-computed using one of a variety of algorithms (including correlations). An API also exists for testing the relevance of the recommended items, and we will be using this over the next few weeks to assess and refine the service.
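To illustrate what pre-computing item similarities from boolean preferences involves, here is a small sketch in plain Python. It does not use Mahout's API. It uses the Tanimoto (Jaccard) coefficient, one common similarity measure for viewed / not-viewed data; which measure the service will finally use is not stated here, and all function names are hypothetical.

```python
from collections import defaultdict


def build_item_users(views):
    """Map each item to the set of users (IPs) that viewed it at least once.

    `views` is an iterable of (user, item) pairs; repeat views are ignored,
    which is what makes the preference boolean.
    """
    item_users = defaultdict(set)
    for user, item in views:
        item_users[item].add(user)
    return item_users


def tanimoto(a, b):
    """Users who viewed both items divided by users who viewed either."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def precompute_similarities(item_users):
    """Offline step: score every pair of items, keeping non-zero scores."""
    items = sorted(item_users)
    sims = defaultdict(dict)
    for i, left in enumerate(items):
        for right in items[i + 1:]:
            score = tanimoto(item_users[left], item_users[right])
            if score > 0:
                sims[left][right] = score
                sims[right][left] = score
    return sims


def similar_items(sims, item, how_many=5):
    """Online step: look up the most similar items, ties broken by name."""
    ranked = sorted(sims.get(item, {}).items(), key=lambda kv: (-kv[1], kv[0]))
    return [other for other, _score in ranked[:how_many]]
```

The expensive pairwise comparison happens once, offline. Answering a request is then a dictionary lookup and a short sort, which is the property that matters when responses are subject to a tight timeout.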