Our Catalog of Ideas, Sweat, Inspiration, Observations and Mea Culpas for IT that just didn't work out, aka: It Works Awesome, Except It Didn't Work (for a moment, at least).
When our services are failing it hurts. It gives our stomachs knots. When we can't immediately tell our customers what the problem is, we want to pull out our hair.
We work hard every day planning and managing risk so that our services do not fail. But they do sometimes, and when ...
So much for a perfect game! We have had no outages since last August, but today we experienced an outage from 11:01:37 AM EST to 11:15:57 AM EST. The cause of the outage was an issue with one of our application firewalls which failed and unexpectedly resulted in ...
As part of our continued efforts to provide a secure SureMail environment we updated the configuration of a firewall on the Exchange Server today at approximately 4:20 AM EDT. This update went smoothly and our monitoring indicated no issues with the new configuration.
However, some SureMail clients did begin to experience connectivity issues at that time; this was due to a load-balancing problem caused by our updates. The issue was resolved at 8:20 AM EDT. No mail was lost; only connectivity to local clients was affected.
We are currently updating our monitoring to detect this type of load-balancing issue in the future to prevent further connection issues of this kind.
Should you have any questions about the above, please let us know.
- Your SureTech.com Solutions Team
On Friday afternoon, June 17th, our server administrators noticed that the hard disk on one of our web servers was showing signs of potential failure. In the process of transferring to a new web server using our backup drive, we discovered the backup drive was compromised as well. This double failure caused our sites to be down intermittently between 4:30 pm Friday and 2 am Saturday as we restored data from our most recent backups. This is the first hardware failure to impact operations in 8 years. We're happy that no backup data was lost, though please check any posts or event updates made between Friday, June 17 at 3 am and Saturday, June 18 at 2 am, as some of these edits may not have been retained. We apologize for any inconvenience this causes. If you have any questions, please do not hesitate to contact us.
Sincerely,
Your Solutions Team at SureTech.com
A switch was installed in our SureDesk™ Data Center on Sunday 5/8/2011 in order to expand capacity and reliability.
Monday morning (5/9), some SureDesk™ users experienced sluggishness and intermittent connection instability, which was determined to be caused by the new switch.
Configuration troubleshooting and failover systems were not responsive to a fix and the entire internet ...
We experienced a new failure mode today with one of our application firewalls. It started returning errors to some customer requests from the Internet at approximately 11:46 AM EST. The issue, while seriously affecting some customer organizations, was not detected by our multiple monitoring systems. However, based on the issues we were able to see, we restarted the affected application firewall at 12:09 PM EST. The resulting re-convergence of load balancing affected the other application firewall starting at approximately 12:12 PM EST and ending by 12:19 PM EST. All services returned to normal production availability via the originally affected application firewall by 12:22 PM EST. The aggregate time during which any customer organizations were affected by the issue was 36 minutes.
We are taking corrective actions to detect the memory fragmentation issue that caused inbound requests to fail on the affected application firewall. We will update our monitoring systems to alert us of this issue prior to inbound requests being rejected, so that we can remediate the issue without customer organizations being affected.
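For readers who want to check the arithmetic in the timeline above, here is a minimal sketch in Python. The clock times are taken from this notice; the function and variable names are ours for illustration only and do not describe any SureTech.com tooling. It assumes all times fall on the same day.

```python
from datetime import datetime

def minutes_between(start: str, end: str) -> int:
    """Return whole minutes between two same-day clock times such as '11:46 AM'."""
    fmt = "%I:%M %p"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)

# Times quoted in the incident timeline (all EST, same day).
first_errors = "11:46 AM"
firewall_restarted = "12:09 PM"
second_firewall_affected = "12:12 PM"
second_firewall_recovered = "12:19 PM"
all_services_normal = "12:22 PM"

# Total window in which any customer organization could have been affected.
total_impact = minutes_between(first_errors, all_services_normal)

# Time from the first errors until the affected firewall was restarted.
before_restart = minutes_between(first_errors, firewall_restarted)

# Time the second firewall was affected by the load-balancing re-convergence.
reconvergence = minutes_between(second_firewall_affected, second_firewall_recovered)

print(total_impact, before_restart, reconvergence)
```

The first value matches the 36-minute aggregate stated above; the other two simply break that window into the phases described in the timeline.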
Should you have any questions about the above, please let us know.
- Your SureTech.com Solutions Team
We strive to ensure that all our products are reliable and consistent. Whenever services are interrupted, we work hard to get to the bottom of the cause and find solutions so that such an event does not happen again.
On Friday, Nov 12 we experienced an issue with our SureFiles™ ...
We strive to ensure that all our products are reliable and consistent. Whenever services are interrupted, we work hard to get to the bottom of the cause and find solutions so that such an event does not happen again.
On Monday, August 2 we experienced an outage for selected clients from 3:27 p.m. to 5:20 p.m. EST.
The underlying cause:
The partial outage today was caused by a problem with one of the application firewalls. It failed in such a way as to 'lock' those sessions that had been using it, and it required an on-site intervention to correct the issue.
Steps taken to prevent reoccurrence:
We have taken action to prevent this failure mode from happening again, and also to make it possible to correct this issue remotely, so that if it ever occurs again, the associated downtime will be much shorter.
Network Stability:
Overall, the entire SureMail™ environment has enjoyed a 99.902% availability rate over the past 365 days, along with very few scheduled maintenance periods. The MAIL34 mailbox server has experienced an availability rate of 99.906% since it was brought into production approximately 7 months ago.
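To put those percentages in more familiar terms, the sketch below converts an availability rate into hours of downtime. It is an illustrative calculation only: the function name is ours, and the 210-day figure for MAIL34 is an assumption (7 months at roughly 30 days each), since the notice gives only an approximate production period.

```python
def downtime_hours(availability_percent: float, period_days: float) -> float:
    """Hours of downtime implied by an availability percentage over a period."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * period_days * 24.0

# SureMail environment: 99.902% over the past 365 days (figures from the notice).
suremail_downtime = downtime_hours(99.902, 365)

# MAIL34: 99.906% over roughly 7 months. The 210 days is our assumption.
mail34_downtime = downtime_hours(99.906, 210)

print(f"SureMail: about {suremail_downtime:.1f} hours of downtime in 365 days")
print(f"MAIL34: about {mail34_downtime:.1f} hours of downtime in ~210 days")
```

Under these assumptions, 99.902% over a year works out to between eight and nine hours of total downtime.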
Should you have any questions about the above, please let us know.
Best regards,
- The Technical Support Team at SureTech.com
Your MS Exchange is fully recovered after Tuesday's service outage. All data has been fully recovered to your mailbox with no loss of data.
The issue we experienced was precipitated by a problem with an HP Storage Area Network that caused a 12-drive enclosure to fail. Our work to recover from this was followed by an additional drive failure during the rebuild of the degraded array that hosted your mailbox, causing a catastrophic failure in the array.
Please note we regularly deal with upgrades, maintenance and occasional hardware failures transparently and without affecting your service. In this case, however, all data and data redundancy in the production environment was lost, causing us to rely on our disaster recovery backup and log systems.
According to this procedure we made your mailboxes available in a 'dial-tone' configuration where you were able to send and receive emails online, but not able to work with older data offline. Then, the original mailbox data was restored, and the 'dial-tone' emails were merged into the mailboxes providing a full data recovery as of 6:30am Wednesday (yesterday).
We fully realize the interruption this caused and will make additional changes to improve our ability to survive a similar enclosure failure in the future without a similar (or ideally any) service interruption.
We appreciate your patience and cooperation during the resolution of this issue. We continue to work to improve the way we manage all our services, including during emergencies, and we appreciate your feedback.
As always, if you have any questions, suggestions or need support, please call us at 609-688-1111.
- The Solutions Team
We strive to ensure that all our products are reliable and consistent. Whenever services are interrupted, we work hard to get to the bottom of the cause and find solutions so that such an event does not happen again.
On Monday, December 28 we experienced an outage for selected clients from 8:58 a.m. to 9:57 a.m. EST.
The underlying cause:
An error in the Storage Area Network (SAN) supporting mailboxes hosted on the MAIL34 server removed client access to the mailboxes. We troubleshot the issue and were able to bring the SAN and server back online within an hour.
Steps taken to prevent reoccurrence:
We have implemented additional monitoring of the SAN in order to be informed quickly of this specific condition, so that if this issue ever occurs again, the associated downtime will be much shorter.
We are researching this issue further in an effort to eliminate the possibility of it occurring again.
Network Stability:
Overall, the entire SureMail™ environment has enjoyed a 99.902% availability rate over the past 365 days, along with very few scheduled maintenance periods. The MAIL34 mailbox server has experienced an availability rate of 99.906% since it was brought into production approximately 7 months ago.
Should you have any questions about the above, please let us know.
Best regards,
- The Technical Support Team at SureTech.com
We strive to ensure that all our products are reliable and consistent. Whenever services are interrupted, we work hard to get to the bottom of the cause and find solutions so that such an event does not happen again.
On Tuesday, July 22 we experienced an outage for selected clients from approximately 5 a.m. to 11 a.m. EST.
The underlying cause:
An error in the behavior of clustering services took a number of mailbox stores offline, which prevented access to those mailboxes. The same event also introduced inconsistencies into the log files that are generated for these mailbox stores, which made bringing them back online a lengthy process with some element of risk. Once we had taken steps to ensure that incoming mail would continue to be accepted by our incoming mail servers, we made copies of all affected mailbox stores to ensure that existing data was secure before beginning the process of rebuilding the mailbox stores. The rebuild process is resource intensive, and to minimise the downtime for our customers we allocated additional hardware resources to the recovery process. Recovery of mailboxes began 3 hours after the initial problem and was complete 9 hours later. Other dependent services were brought up on completion of this work.
Steps taken to prevent reoccurrence:
Should you have any questions about the above, please let us know.
Best regards,
- The Technical Support Team at SureTech.com
At 1:49 AM EDT on 6/30/09, a brief power interruption in our Data Center appears to have severely damaged one of the four UPSs in one of our racks. (This UPS had not exhibited any symptoms of issues going into the power interruption.) The damaged UPS cut half of the rack's AC power supply. The infrastructure in that rack was designed to continue to function in this type of partial power outage, but several limitations in this design were exposed yesterday, resulting in the queuing of all inbound email of organizations using the Ultimate Anti-Spam Protector option, an outage of BlackBerry service, and issues with one of our two infrastructure monitoring systems. At 8:42 AM, we re-routed inbound email from the queues to the Ultimate Anti-Spam Protector service, and the inbound email resumed processing. Due to a configuration issue with the re-routing, some organizations' inbound email was 'bounced' back to the email sender instead of being successfully delivered. We were able to restore most email service by 9:30 AM. Some isolated issues with email and BlackBerry service remained until everything was fully resolved at 12:40 PM.
We have taken corrective actions so that the Ultimate Anti-Spam Protector processing and BlackBerry services will continue functioning if this type of partial power outage occurs again. We are in the process of updating the affected infrastructure monitoring system so that it too will operate properly during this type of issue. And we are replacing the affected UPS with a model that will provide our monitoring system with more diagnostic information, to help reduce the probability of a UPS-caused AC power outage occurring again.
We apologize for the service interruption, and we will build on the corrective actions we have already taken as we continue to strive to provide the highest possible service level on a proactive basis. If you have any questions or concerns, please do not hesitate to contact us at 1-800-882-8701.
- The SureTech.com Solutions Team
On June 22, 2009 we experienced a serious outage on our SureDesk™ systems. While attempting a minor stability upgrade, our system admins encountered an unfortunate, irreversible bug that crashed the connection service to our SureDesk™ Gold environment.
As it happens, we also had a parallel upgrade standing by for release this weekend. We were able to move it up to take effect today and include it when we restored service.
Service was down from 7:30am to 3:07pm and we sincerely regret the inconvenience to all affected SureDesk™ Gold users. Going forward, we have adjusted our upgrade policy for bugs to take fewer risks while system upgrades are also being rolled out. Please note SureDesk™ Platinum users were not affected. Our Gold services don’t have a fully redundant failover standby, which contributed to the difficulties in restoring service we saw today.
Also please note that, in addition to policy changes, we are streamlining workarounds in case this were to happen again (which we do NOT expect), including old-school “Terminal Services” access and streamlined local synchronization of SureFiles™. Feel free to contact us for more information.
On the good news Toot-Toot side, you should find a number of benefits from the upgrade now that we have suffered through the service interruption:
General reliability and performance improvements:
· Multi-monitor support
· Additional intelligent printing and reliability
· Graphics and color resolution improvements - certain video such as youtube.com now works better on the SureDesk™
Thanks for your patience and please let us know if we can do anything to be of help or if you need help restoring or upgrading your connection.
13 Hours and $1,400.00 To upgrade my Hard Drive?!?
We’ve always said that Managed Services for IT is usually a flawed business model. Pretty much the better job you do the less you make. Kinda like lawyers, I guess, except at least we talk about ...
American based, for American customers - and email.
That's pretty much the price of excellent service these days. If you outsource your service to a place that doesn't care about your customers ...
Xobni, which is inbox spelled backwards, is an absolutely terrific plug-in for Microsoft Outlook except for the small fact that it doesn't work...