Risk management is necessary at Sun because new products are constantly being introduced, and each introduction brings challenges in getting information out to people. They also found the same problems recurring in different geographies and struggled to share information about problems and solutions between teams. By 2001 they had found over 300 user-developed applications supporting risk analysis and mitigation. Some of these were even being offered to clients, which meant global clients were getting inconsistent services. In 2001 they tried to replace these with a single in-house engine, but it did not have the flexibility needed and was technology-based rather than business-focused. In 2005 they decided to buy a rule engine and use it to deliver rule services that could be developed and evolved as fast as the old user-developed applications while improving governance and quality.
The new approach was a significant change for business users who were comfortable with their user-developed applications. It had to be delivered fast: 6 months was allowed, and the fact that it took only 5 months made people nervous. International obstacles were real, and they had to establish the new system as a source of active knowledge, not just content. Key success factors:
- Accuracy and quality are #1
- Timely automation and delivery of new rules – <24hrs
- Use-case agnostic rule results and services through an API
- Focus on raw data pre- and post-automation
It was important that they could replace their home-grown applications and increase speed to market. It was also important not to tie the system to specific use cases or situations but to provide something more independent and easily accessible that could become a platform for new applications. The system allows users to check an installation for accuracy, needed patches and versions, and so on, as well as get proactive notification of potential problems. It delivers many rule services, such as Bad/Withdrawn Patch analysis, Security Path analysis and Recommended Patches, that are easy to use and answer critical questions. In each case they have a database component (a list of withdrawn patches, for instance) and wrap rules around it, using telemetry and systems data to find the bad patches that are in use or relevant. The systems have an API, a web interface for users and a command line interface. There is a big focus on reusable components.
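The "database component plus rules" pattern described above can be illustrated with a minimal Python sketch. This is not Sun's implementation or the Blaze Advisor API; the patch IDs, host names, reasons and the `find_withdrawn_patches` function are all hypothetical, chosen only to show a reference list of withdrawn patches being checked against telemetry.

```python
from dataclasses import dataclass

# Hypothetical reference data standing in for the "database component":
# withdrawn patch IDs mapped to the reason they were withdrawn.
WITHDRAWN_PATCHES = {
    "100001-01": "withdrawn: replaced by a later revision",
    "100002-03": "withdrawn: known to cause outages",
}


@dataclass(frozen=True)
class Finding:
    """One withdrawn patch found installed on one host."""
    host: str
    patch_id: str
    reason: str


def find_withdrawn_patches(telemetry, withdrawn):
    """Apply the rule "an installed patch that is on the withdrawn list is a finding".

    telemetry: dict mapping host name -> list of installed patch IDs (assumed shape).
    withdrawn: dict mapping patch ID -> reason text.
    """
    findings = []
    for host in sorted(telemetry):
        for patch_id in telemetry[host]:
            reason = withdrawn.get(patch_id)
            if reason is not None:
                findings.append(Finding(host, patch_id, reason))
    return findings


if __name__ == "__main__":
    # Hypothetical telemetry for two hosts.
    sample = {
        "host-a": ["100001-01", "100009-02"],
        "host-b": ["100009-02"],
    }
    for finding in find_withdrawn_patches(sample, WITHDRAWN_PATCHES):
        print(finding.host, finding.patch_id, finding.reason)
```

Keeping the reference list separate from the rule means the list can be updated without touching the logic, which fits the stated goal of delivering new rules quickly. Because the function returns plain results rather than formatted output, the same check could sit behind an API, a web interface or a command line tool.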
The project focused on simplifying and easing the development of rules, enabling more complex decision-making and replacing all the existing systems. A global information model was critical to this. Besides Blaze Advisor for the rules, they used XML, XSD, XSLT and XQL to standardize interactions and MySQL for storing logs. It was important to them to have an internal revolution (new approach, new technology) that looked like a gentle evolution to users. Key to this was abstracting data using data services, representing everything in XML and understanding these data structures. The rules technology also allowed the subject matter experts to create the rules directly for use in the automated services, often working outside the rules engine, with updates being applied to the rule repository programmatically.
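As a rough illustration of representing everything in XML so that results stay use-case agnostic, the following Python sketch serializes rule findings to XML and reads them back using the standard library. The element and attribute names (`ruleResult`, `finding`, `host`, `patch`) and the sample values are assumptions made for this example, not the schema Sun used.

```python
import xml.etree.ElementTree as ET


def findings_to_xml(service, findings):
    """Serialize findings (a list of dicts) into an XML string.

    The element and attribute names are hypothetical, for illustration only.
    """
    root = ET.Element("ruleResult", {"service": service})
    for item in findings:
        node = ET.SubElement(
            root, "finding", {"host": item["host"], "patch": item["patch"]}
        )
        node.text = item["reason"]
    return ET.tostring(root, encoding="unicode")


def xml_to_findings(xml_text):
    """Parse the XML produced above back into a list of dicts."""
    root = ET.fromstring(xml_text)
    return [
        {"host": node.get("host"), "patch": node.get("patch"), "reason": node.text}
        for node in root.findall("finding")
    ]


if __name__ == "__main__":
    # Hypothetical finding from a withdrawn-patch check.
    data = [{"host": "host-a", "patch": "100001-01", "reason": "withdrawn"}]
    document = findings_to_xml("withdrawn-patch-analysis", data)
    print(document)
    print(xml_to_findings(document))
```

A consumer that understands this one structure does not need to know which rule service produced the result, which is the kind of decoupling the data services approach aims for.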
Sun has built a risk infrastructure that allows them to deliver new information globally within 24hrs. They now run 7Bn rules a month and have 9,000 active users. Benefits:
- 67% reduction in severe incidents
- 75% reduction in break/fix calls
- 10x reduction in unplanned outages
Check out http://www.sun.com/service/preventive/