Recent I.T. disruption
Thu, 13 Nov 2014 05:45:00 GMT
Dear Colleagues
Over the last few weeks we have unfortunately experienced disruption to I.T. services. I am extremely sorry for the impact this has had on both students and staff, and would like to offer a brief explanation of recent events and of what we are doing to guard against a recurrence.
Tuesday 7 October: loss of internet connection affecting all universities in the Yorkshire and Humber region. From around 10.00am until 3.00pm this affected people trying to access University I.T. services from off campus, and people on campus who were trying to access Unimail. The problem was caused by a JCB cutting through one of the main network fibre links in Leeds. In response to pressure from the I.T. Directors, Jisc, the body which manages the network, is reviewing its incident management processes and is bringing forward major upgrade works to prevent a recurrence.
UniLearn performance: there have been intermittent performance problems since the start of term. These relate primarily to a bug in the Blackboard software, which emerged following the summer upgrade, combined with high levels of usage of UniLearn. Blackboard is currently working on a fix to the software, and in the meantime we have switched off the module which was causing the problem. We are also introducing improvements to our testing procedures. I am pleased to say that UniLearn is now stable and usage by students is back to expected levels (indeed, the statistics for the last week show usage is higher than for the same period last year). In the new year, once the prerequisite SAN (Storage Area Network) upgrade is complete, we will be further improving the underlying infrastructure for UniLearn.
Thursday 30 October: loss of power affecting all I.T. services. The University has two Data Centres, DC1 and DC2, which share the load of the servers and data storage infrastructure. The initial power cut caused all systems in DC2 to fail because of a technical failure of the battery backup system (UPS).
At the same time there was a major failure of the core network switch in DC1. Although this was restored by 11.30am on the day of the power cut, several systems were left in an unstable state by the switch failure and the unavailability of DC2. The failure of the UPS in DC2 also meant that systems were not protected against a power surge, which subsequently caused several hardware faults on major systems. Unfortunately, almost every system was affected, so recovery took several days.
I am pleased to say that, thanks to the diligence of colleagues in Computing Services, no data has been lost from any of the systems. All systems were up and running last week, though Timetabling required a reinstallation by the software company, Scientia.
In the short term we will replace the battery backup system in DC2, and we are working on additional options to increase resilience. We have made changes to the configuration of systems in DC1 to minimise the risk of downtime, and colleagues in Estates are evaluating power capacity to DC1 with a view to procuring a battery backup system with automatic failover to a standby generator. In the longer term, the new state-of-the-art DC1 (scheduled for completion in December 2015) will incorporate an automatic battery backup and generator system.
Following feedback from a number of students and members of staff, we are also reviewing our communication procedures for major incidents which affect weekends. I am very conscious that members of the University require access to I.T. systems seven days a week, and that in the event of any problems they need a simple way of finding out what is happening.
Once again I wish to sincerely apologise for the recent disruption. Most of it has been outside our control, but that is of course no consolation to users of our I.T. systems. I would also add by way of reassurance that over the course of a year our I.T. systems are available over 99% of the time.
Sue White
Director of Computing and Library Services