Mark Hornick of Oracle’s Data Mining Technologies Group presented on the use of the Oracle Data Mining technology to drive recommendations at Oracle OpenWorld (OOW). The challenge in a show like Oracle OpenWorld is that there are thousands of sessions and attendees need help finding the sessions that will match their interests. Three groups were involved in this: Oracle Event Marketing, who plan and manage the event and help design the Schedule Builder; George P Johnson, who provided historical data and ran the Schedule Builder; and Oracle Data Mining Technologies, who handled data cleansing and modeling.
The business value of the project came from improving what attendees get out of the event by recommending the sessions that most closely match their profile (based on attendance and enrollment patterns from past years) while also increasing attendance at those sessions. The prediction goal was specific to an individual attendee: what should be recommended to a particular person, not blanket recommendations. All of this was delivered in the context of the OOW Schedule Builder. Given demand for sessions, the only way for an attendee to guarantee a spot was to pre-register. This meant that people needed to be recommended the sessions that would be best for them before the show started. Sessions are one-time entities, which creates a technical challenge: you must recommend them before anyone has attended, and you cannot use ratings because the session is over before any ratings are collected. However, OOW collects lots of data about shows and has history: who registered, who attended which sessions and so on. This year JavaOne was integrated and, while it had data too, that data had to be merged in.
The idea was to recommend sessions by relative preference: all the possible sessions, rank ordered. They had to assess the effectiveness of the recommendation algorithm and they wanted to identify the top N similar sessions for a given session. Generally you would collect data initially and not make recommendations for a while, but Oracle wanted to make recommendations immediately. To deal with this, data from previous years was used to train and deploy a model. As the pre-conference enrollment proceeded, this data was added to the mix and used to rebuild the recommendation model, essentially creating a blended model that uses both past years and the current year.
The methodology involved taking session data (mostly text including title and description), attendee data and attendance data (who attended what). When a new attendee registers and completes the registration survey this information is used to drive the model. The top 25 sessions would then be displayed when they logged on and this would be updated as they interacted. To see if these recommendations were any good they needed some success metrics. They assessed what percentage of people who enrolled found the session directly and what percentage found the session through a recommendation. They also assessed the model against a random recommendation.
Of course this is complicated by the fact that sessions are not the same year to year: they have no history and no future. They also don’t know who will attend until the session has been completed. To make this work they used session themes and attendee profiles, which are higher level projections from the data. For 2009 they had about 1,900 sessions with title, abstract and tracks, 34,500 attendees with basic information about products used and demographics, and the intersection details (who attended what). The methodology went like this:
- They clustered the 2009 sessions using text mining to identify session themes (the clusters)
- They built a classification model that predicts the clusters for attendees and scores each attendee for each cluster
- This allowed them to identify how likely it is that a 2010 attendee will like a particular 2009 session theme
- They then took the 2010 sessions and scored how similar each one is to each of these session themes
- These two vectors then allowed them to see how likely it is that a particular 2010 attendee will like a particular 2010 session
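The last step above can be sketched in a few lines. Mark did not say exactly how the two vectors were combined, so the dot product here is an assumption, and the theme scores, session IDs and function names are all made up for illustration:

```python
# Sketch only: combine an attendee's per-theme scores with each session's
# per-theme similarity. The dot product is an ASSUMPTION, not something
# stated in the talk; all data below is invented.

def score_session(attendee_theme_scores, session_theme_similarity):
    """Return the dot product of the two per-theme vectors."""
    return sum(a * s for a, s in zip(attendee_theme_scores, session_theme_similarity))

def top_n_sessions(attendee_theme_scores, sessions, n=25):
    """Rank sessions for one attendee, highest score first (ties broken by ID)."""
    scored = [
        (score_session(attendee_theme_scores, similarity), session_id)
        for session_id, similarity in sessions.items()
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [session_id for _, session_id in scored[:n]]

# Hypothetical attendee: likelihood of liking each of three 2009 themes.
attendee = [0.7, 0.2, 0.1]

# Hypothetical 2010 sessions: similarity of each session to the same themes.
sessions_2010 = {
    "S1": [0.9, 0.1, 0.0],
    "S2": [0.1, 0.8, 0.1],
    "S3": [0.0, 0.1, 0.9],
}

ranking = top_n_sessions(attendee, sessions_2010, n=2)
print(ranking)
```

With these invented numbers the attendee leans towards the first theme, so the session closest to that theme ranks first. The real system displayed the top 25.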
The core of this is the text mining. They used stemming to reduce words to their core (integrating to integrate) and then removed stop words (and, or, Oracle), all of which was handled by ODM. Term Frequency Inverse Document Frequency (TF-IDF) was used to measure how important a word is, based on how common the word is overall and how much it is used in a particular document (uncommon words used a lot in a description are important, for instance). For example, if a session description contains 100 words and 6 of these are the word “mining”, it has a TF of 0.06 (6/100). If 25 of 1850 sessions contain the word “mining” then the IDF is 4.3 (the natural log of 1850/25) and the TF-IDF score is 0.06*4.3 or 0.26. These words and measures are fed into a K-Means algorithm to find the clusters and identify the top 5 terms for each, e.g. “Intelligence Hyperion Essbase Business Performance” or “Database 11G Data Technology Features”. These cluster descriptions can be easily reviewed for business meaning.
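The TF-IDF arithmetic is easy to check by hand. This is a minimal sketch that assumes the natural log form of IDF, which is what the 4.3 figure implies; ODM's actual weighting may differ in detail, and the function names are mine:

```python
import math

def term_frequency(term_count, doc_length):
    """Share of the document's words that are this term."""
    return term_count / doc_length

def inverse_document_frequency(num_docs, docs_with_term):
    """Natural log of (total documents / documents containing the term).

    The log base is an assumption inferred from the numbers in the talk.
    """
    return math.log(num_docs / docs_with_term)

tf = term_frequency(6, 100)                # "mining" appears 6 times in 100 words
idf = inverse_document_frequency(1850, 25) # 25 of 1850 sessions mention "mining"
tfidf = tf * idf

print(round(tf, 2), round(idf, 1), round(tfidf, 2))  # 0.06 4.3 0.26
```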
They are using the model to recommend new sessions to new people: not standard recommendations to new people, or new products to existing people, but new to new. To evaluate the results they separated both attendance and session data into build and test data sets. The model is built using the build data from both and then tested using the test data from both: sessions from 2009 are scored for 2009 attendees and compared to the actual sessions attended, which should be among the sessions that were highly ranked for the attendee.
While attendees can attend as many sessions as they can fit, on average they attend about 7 sessions (though some attend many more, obviously). So typically an attendee will not attend most of the sessions recommended. For this reason Oracle also measures “enrichment”. When they get a hit (they would have recommended a session that was actually attended) they add to the enrichment score, and when they get a miss (the attendee attended something that was not recommended) they lower the enrichment a little. The results of this can be compared to a random recommendation to show that the model is adding value.
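The exact enrichment formula was not given, so the following is only a sketch of the idea. The reward of 1.0 per hit and penalty of 0.25 per miss are hypothetical weights, and the session IDs and the "model" recommendations are invented:

```python
import random

def enrichment(recommended, attended, hit_reward=1.0, miss_penalty=0.25):
    """Add for each attended session that was recommended, subtract a
    little for each attended session that was not. Weights are hypothetical."""
    recommended = set(recommended)
    score = 0.0
    for session in attended:
        if session in recommended:
            score += hit_reward
        else:
            score -= miss_penalty
    return score

def random_baseline(all_sessions, attended, n, trials=1000, seed=0):
    """Average enrichment when the n recommendations are picked at random."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += enrichment(rng.sample(all_sessions, n), attended)
    return total / trials

# Invented data: 200 sessions, one attendee who went to 7 of them.
all_sessions = [f"S{i}" for i in range(200)]
attended = ["S3", "S7", "S11", "S19", "S42", "S77", "S150"]
model_recs = [f"S{i}" for i in range(25)]  # stand-in for a model's top 25

model_score = enrichment(model_recs, attended)
baseline_score = random_baseline(all_sessions, attended, n=25)
print(model_score, baseline_score)
```

If the model score sits well above the random baseline across many attendees, the model is adding value, which is the comparison described above.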
In terms of results they had 600 or so sessions that had 40% or more enrollment from recommendations, with a significant number getting very high percentages from recommendations. Enrollment is one thing, but nearly 600 also got 30% of their attendees from recommendations. They also compared attendance based on recommendations, similar sessions, queries or direct attendance without enrollment. Only a small number of attendees used only recommendations, but a very large percentage had a significant share of their sessions driven by recommendations or similar sessions.
- Text mining is non-trivial
Consider phrases as well as words – “Oracle Data Mining” may be a better term than Oracle, Data and Mining
- Data quality matters
In POC and deployment things like data values, table structure etc. need to be matched
- Introducing a new dataset was both simple and challenging
The methodology worked fine but the new data had to be carefully mapped to the target schema
Minor corrections made thanks to feedback from Mark – he spoke fast and I missed a couple of things!