Blog

Smartsheet will be offline on March 11, 2016 starting at 8:00 PM PST for approximately 15 minutes for system improvements.

The Identity Services team will perform upgrades to the Enterprise Directory backend this weekend (March 12 and March 13).
Precautions will be taken to mitigate any downtime of the Enterprise Directory, but there may be some instability of the service between 10 AM and 1 PM on Saturday and Sunday.

HPC Services engineer Michael Jennings gave a talk on the "Node Health Check (NHC)" on Feb 24, 2016 at the Stanford Conference and Exascale Workshop 2016 sponsored by the HPC Advisory Council. NHC, developed by Jennings, provides the framework and implementation for a highly reliable, flexible, extensible node health check solution. It is now widely recommended by major HPC job scheduler vendors and is in use at many large HPC sites and research institutions.

In this follow-up from his 2014 presentation at the Stanford HPC Advisory Council Conference, Michael will provide an update on the latest happenings with the LBNL NHC project, new features in the latest release, and a brief overview of the roadmap for future development.

About Michael Jennings

Michael has been a UNIX/Linux Systems Administrator and a C/Perl developer for 20 years and has authored or contributed to numerous open source software projects including Eterm, Mezzanine, RPM, Warewulf, and TORQUE. Additionally, he co-founded the Caos Foundation, creators of CentOS, and has been lead developer on 3 separate Linux distributions. He currently works as a Senior HPC Systems Engineer for the High Performance Computing Services group at Lawrence Berkeley National Laboratory and is the primary author/maintainer for the LBNL Node Health Check (NHC) project. He has also served for 2 years as President of SPXXL, the extreme-scale HPC users group.

Smartsheet will be offline on February 13, 2016 starting at 1:00 PM PST and until approximately 5 PM PST for system improvements.

We experienced an unscheduled outage of all authentication services including login.lbl.gov, ldap, OTP/pledge and other directory services on Christmas morning.  Network engineers resolved the outage.

 

Networking and Telecom will take numerous short (1-2 hour) outages and one longer outage on December 29-30. The longest of these, on December 29th, will impact AT&T mobile phone access, traditional telephones at offsite buildings and the guest house, and VPN. The window for this outage is based on the expected time for Facilities to complete electrical upgrades and may be extended if the work proves more challenging than anticipated.

In addition, there will be numerous after-hours outages on Friday, December 18th to facilitate the work on Dec 29-30.


Current scheduled work in the Infrastructure Services area for the Dec-2015 winter break includes:



  • Tue, 29-Dec (tentative), times TBA -- Shutdown of 50A-1156 e-power to install branch circuit metering.  We expect an outage of up to 2 hours for UPS-backed power and 1 hour for circuits that are just generator-backed during this installation.  Most IT systems in 50A-1156 should be on dual power feeds before this time, so we expect the outage to impact:

    • All AT&T mobile phone and wired circuits (including the Guest House and offsite buildings)

    • WiFi service to the Lab

    • Remote access VPN service

    • Some building networking, including Blackberry Gate, Bldgs 4, 48, 75, 75a, 75b, 85, and Keck Observatory

    • Some limited CPP equipment

  • Tue, 29-Dec, times TBA -- maintenance to primary border router er1-n1 and ALS Science DMZ router sr1-als.  We will have shifted traffic to the backup border router before this work, so the only outage impact should be for Science DMZ DTNs (managed by HPCS and ALS).

  • Tue, 29-Dec, times TBA -- maintenance to DNS anycast service.  This may cause short (< 1 minute) interruptions to DNS resolution using the 131.243.5.1 anycast address.

  • Tue, 29-Dec 0800 - 2200, and Wed, 30-Dec 0800 - 2200 -- maintenance to non-data center switches at LBL.  This will usually mean a ~ 5 minute outage per switch as it reboots into the new operating system, although some switches can take up to 30 minutes to upgrade their firmware and return to normal operation.

  • Tue, 29-Dec, times TBA -- maintenance to ir1-n1 and ir3-n2 routers and new larger power supplies for ir1-n1.  This will mean a 10-30 minute outage for all routing for all LBL networks other than those at the ALS, at JGI, or in Zone 3 (buildings 31, 62, 66, 67, 72, 74, 77, 83, 84, 85, 86, and Strawberry Canyon gate house).

  • Tue, 29-Dec or Wed, 30-Dec, times TBA -- maintenance to LBLnet services switch.  This may cause short (usually < 5 minutes) outages for network services such as ntp, dns, onestop, iprequest, etc.

  • Tue, 29-Dec or Wed, 30-Dec, times TBA -- uplink migration from end-of-life equipment for Buildings 1, 4, 14, 31, 55, 56W, 64, 77, and 86.  Expected 5-10 minute outages for these buildings as their uplinks move from old end-of-life equipment to current equipment.

  • Tue, 29-Dec or Wed, 30-Dec, times TBA -- maintenance to Bldg 84 switch.  Expected 10-30 minute outage while this equipment is replaced.

  • Tue, 29-Dec 0800 - 2200 -- replace primary Bldg 978 network switch.  Expected impact is several hours of downtime for most building network connections.

  • Wed, 30-Dec 0800 - 2200 -- replace primary Bldg 977 network switch.  Expected impact is several hours of downtime for most building network connections.

  • Wed, 30-Dec 1200 - 2200 -- replace primary JGI Bldg 100 non-sequencer network switch.  Expected impact is 2-3 hours of downtime for impacted 128.3.89.0 subnet.

  • Dates/times still being negotiated -- maintenance to IDM and TSC subnets.
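
Scripts that run during the DNS anycast maintenance listed above may see lookups fail for short periods. The following is a minimal sketch of one way to ride out such an interruption by retrying a lookup. The function name, the retry count, and the delay are illustrative assumptions, not a Lab-provided tool. It uses whatever resolver the operating system is configured with and does not specifically target the 131.243.5.1 anycast address.

```python
import socket
import time


def resolve_with_retry(hostname, attempts=5, delay_seconds=15.0,
                       resolver=socket.gethostbyname, sleep=time.sleep):
    """Resolve a hostname, retrying when the lookup fails.

    The defaults (5 attempts, 15 seconds apart) are assumptions chosen so
    that the retries span roughly one minute. The resolver and sleep
    parameters exist only so the behavior can be exercised without a network.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return resolver(hostname)
        except OSError as error:  # socket.gaierror is a subclass of OSError
            last_error = error
            if attempt < attempts:
                sleep(delay_seconds)
    # Every attempt failed: surface the most recent lookup error.
    raise last_error
```

A call such as resolve_with_retry("www.lbl.gov") returns the resolved IPv4 address as a string, or raises the last lookup error if every attempt fails.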

The scheduled power work of December 18, 2015 will impact authentication to Lab services, email delivery and access to eRoom and Sympa lists starting at 5:30 PM.  Authentication to Lab services and email delivery will be impacted for approximately 15 to 30 minutes;  access to eRoom and Sympa lists will be impacted for approximately 30 to 60 minutes.

After testing Zoom (a product chosen by ESnet last year) and comparing its functionality and cost to our prior offering, we have decided to make a change.  In December we will start deployment and training, which allows a one-month overlap with the current tool while we make the transition.  (We will continue to offer and recommend Google Hangouts when appropriate; it is easily accessed via Gmail or Google Calendar.)

Our documentation is being updated (go.lbl.gov/vc) and we are preparing to offer training (including one-on-one sessions with Help Desk staff members).  Zoom is also on all the video conferencing carts.


Key features


With our Zoom contract, we are acquiring one 1000-person Webinar license that can be assigned as needed.  Zoom also has a remote control feature that lets a participant share their screen and let someone else take control (great for troubleshooting or providing assistance in areas outside of video conferencing).  Up to 50 participants can join a meeting.
 
We will also have the ability to let up to 4 traditional room systems connect (1 system in each of 4 meetings or 4 systems in one meeting).  This is not needed as much anymore, but it is a useful feature.  We also understand that Zoom can connect to a video conference bridge, something we occasionally have to do with DOE.
 

Getting Access
 

We plan to integrate Zoom with our Single Sign-on system so your Berkeley Lab identity ("LDAP") can be used to authenticate as a host and allow us to auto-provision accounts.  We are also looking into offering toll-free international numbers (which we can assign at the host level).  We will have to recharge these costs for those of you who need this feature.

The Linux Foundation, the nonprofit organization dedicated to accelerating the growth of Linux and collaborative development, has announced an intent to form the OpenHPC Collaborative Project. This project will provide a new, open source framework to support the world's most sophisticated High Performance Computing environments. Warewulf, our cluster provisioning tool developed by IT's HPC architect Greg Kurtzer, is specified as the provisioning tool for the OpenHPC standard cluster building recipe.



Smartsheet will be offline for 90 minutes on November 7 at 1:00pm PST (2015-11-07 21:00 UTC) for system improvements. Smartsheet status can be checked here.
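
For readers who want to check the time conversion in the notice above, here is a small sketch that converts the stated local start time to UTC and computes the end of the 90-minute window. It assumes PST is a fixed UTC-8 offset on that date. The function name is illustrative only.

```python
from datetime import datetime, timedelta, timezone

# Assumption: PST is a fixed UTC-8 offset (daylight saving is not in effect).
PST = timezone(timedelta(hours=-8), name="PST")


def maintenance_window_utc(start_local, duration_minutes):
    """Return the (start, end) of a maintenance window converted to UTC."""
    start_utc = start_local.astimezone(timezone.utc)
    end_utc = start_utc + timedelta(minutes=duration_minutes)
    return start_utc, end_utc


start, end = maintenance_window_utc(datetime(2015, 11, 7, 13, 0, tzinfo=PST), 90)
print(start.isoformat())  # 2015-11-07T21:00:00+00:00
print(end.isoformat())    # 2015-11-07T22:30:00+00:00
```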

Fuze (a video conferencing application) will perform system maintenance on Friday, November 6, 2015 between 9:45pm and 10:30pm PST.  For details of the maintenance, please refer to the following link:

This outage has been resolved.

We are experiencing an outage on www.lbl.gov, today.lbl.gov, and newscenter.lbl.gov. The service provider is in the process of restoring those sites. We will update this page when new information is available.

This outage has been resolved.

Google Drive, Docs, Slides, and Sheets are experiencing partial outages today, Oct 9, 2015.  Google has not provided an estimated time to recovery.  You can monitor Google's Apps Status Dashboard at: http://www.google.com/appsstatus#hl=en&v=status

 

We are experiencing an outage of our leased lines to buildings 978 and 972.  Engineers are working with the vendor to diagnose the situation.  We do not have an estimated return to service at this time.

We continue to get reports of issues with the built-in calendar app for iPhones.  Two examples are mysterious notifications sent to people who are not even guests of an event, and syncing issues (updates or cancellations made on the web don't show up on the iPhone).  Our advice is to use the new Google app; it works and solves many of the issues we have seen over the past several years.

Reference our original article from March of this year

You can get the new Google Calendar app for iOS (and Android devices) here.