Commons (Confluence) Outage (May 26, 2016)
May 26, 2016
HPCS at Lustre Users Group 2016
Apr 03, 2016
Run on Lawrencium for Free - New Low Priority QoS
Mar 27, 2016
Archiving Files and Records Workshop on April 22
Mar 25, 2016
Smartsheet Scheduled Outage (March 11, 2016)
Mar 11, 2016
Enterprise Directory Scheduled Upgrade (March 12, 2016)
Mar 11, 2016
NHC Talk at Stanford Exascale Conference
Feb 20, 2016
Smartsheet Scheduled Outage (February 13, 2016)
Feb 12, 2016
Fri Dec 25 - Unscheduled outage of authentication services - resolved
Dec 25, 2015
Holiday Maintenance Outages 2015 December
Dec 18, 2015
Scheduled Power Work Impact on IT Systems (Friday, December 18, 2015)
Dec 15, 2015
ZOOM selected as Video Conferencing Tool
Nov 25, 2015
Commons will be offline on Thursday, May 26, 2016 starting at 5:00 PM for approximately 15 minutes for system improvements.
HPCS Storage Lead John White will be giving a talk at the annual Lustre Users Group 2016 conference being held in Portland, Oregon this week.
John's presentation will provide an introduction to the experiences and challenges involved in providing parallel storage to an HPC-focused, condo-style infrastructure. High Performance Computing Services at Lawrence Berkeley Lab serves as a middle ground at the institutional level, between grad-student-managed computation and the national-allocation-class computing offered by NERSC and XSEDE. Given our revenue sources, our infrastructure style is characterized by frequent small-scale buy-ins as well as infrequent 'large' grant funding. That challenge has shaped a nimble infrastructure that maximizes our customers' dollar/GB and dollar/FLOP but has led to unique requirements from Lustre, including upgrade paths that break traditional parallel file system rules. We focus on ease of management and, above all, a deep desire for uniformity across numerous Lustre instances to reach the goal of a true building-block infrastructure.
We are pleased to announce the “Low Priority QoS (Quality of Service)” pilot program, which allows users to run jobs on Lawrencium cluster resources at no charge when running at a lower priority.
This program, tested by Lawrencium Condo users, is now available to all Lawrencium users. We hope this will help users increase their productivity by allowing them to make use of available computing resources.
Two new QoSs, "lr_lowprio" and "mako_lowprio", have been added that allow users to run jobs requesting up to 64 nodes and 3 days of runtime. This includes all general-purpose partitions, such as lr2, lr3, lr4, and mako, and special-purpose partitions, such as lr_amd, lr_bigmem, lr_manycore, and mako_manycore. Jobs using these new QoSs are NOT subject to the usage recharge that we currently collect through the “lr_normal” and “mako_normal” QoSs; however, these QoSs do not get as high a priority as the general, debug, and condo QoSs, and they are subject to preemption by jobs submitted at normal priority.
This has two implications for you:
1. When the system is busy, any job submitted with a Low Priority QoS will yield to other jobs with higher priorities. If you are running debug, interactive, or other types of jobs that require quick turnaround, or if you have an important deadline to meet, you may still want to use the general QoSs.
2. Further, when the system is busy and higher-priority jobs are pending, the scheduler will automatically preempt jobs running with these low-priority QoSs. The preempted jobs are chosen by the scheduler, and we have no way to set selection criteria to control this behavior. Users can choose at submission time whether a preempted job should simply be killed or be automatically requeued after it is killed. Hence, we recommend that you have your application write periodic checkpoints so that it is able to restart from the last checkpoint. If you have a job that is not able to checkpoint/restart by itself, or that cannot be interrupted during its runtime, you may want to use the general QoSs.
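The checkpoint/restart pattern recommended above can be sketched in plain shell (a hypothetical illustration, not an HPCS-provided tool): the job records the index of its last completed step in a file, and a requeued run resumes from that point instead of starting over.

```shell
#!/bin/bash
# Hypothetical checkpoint/restart sketch for a preemptible job.
# CKPT stores the index of the last completed step; a requeued run
# reads it and resumes instead of redoing finished work.
CKPT=checkpoint.txt
STEPS=10

start=1
if [ -f "$CKPT" ]; then
    start=$(( $(cat "$CKPT") + 1 ))   # resume after the last completed step
fi

for (( step=start; step<=STEPS; step++ )); do
    echo "running step $step"         # real work would go here
    echo "$step" > "$CKPT"            # record progress after each step
done
```

A preempted and requeued job would then repeat only the unfinished steps; the same idea applies to application-level checkpoint files written by your own code.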
To submit jobs to this QoS, you will need to provide all the normal parameters, e.g., --partition=lr3, --account=ac_projectname, etc.; for the QoS, please use "--qos=lr_lowprio" or "--qos=mako_lowprio", and make sure the job requests no more than 64 nodes and 3 days of runtime. If you would like the scheduler to requeue the job in its entirety in the case that the job is preempted, please add "--requeue" to your srun or sbatch command; otherwise the job will simply be killed when preemption happens. An example job script should look like the following:
#!/bin/bash
#SBATCH --partition=lr3 ### other partition options: lr2, lr4, mako, etc.
#SBATCH --account=ac_projectname
#SBATCH --qos=lr_lowprio ### another QoS option: mako_lowprio
###SBATCH --requeue ### only needed if automatic requeue is desired
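Assuming the script above is saved as job.sh (a hypothetical filename for illustration), a low-priority submission from the command line could look like the following sketch:

```shell
# Hypothetical submission; "job.sh" is an assumed filename.
# --requeue asks the scheduler to requeue the whole job if it is preempted;
# omit it if a preempted job should simply be killed.
sbatch --requeue job.sh
```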
For condo users who have been helping us test these low-priority QoSs on the lr2 and mako partitions: your current associations with your “lr_condo” account have not changed, so you can continue to use them, but they are limited to the lr2 and mako partitions only. If you intend to use other partitions, you will need to change the account from “lr_condo” to “ac_condo”, e.g., “lr_nanotheory” -> “ac_nanotheory”. We will phase out associations connected to your “lr_condo” account in the next month without further notice, so please make the change now.
For more information about this program and how to use the low priority QoSs properly please check our online user guide.
The pilot program will run for two months (Mar 22 - May 22), and we will decide how to proceed from there based on usage and feedback.
Please forward your requests, questions, and comments to firstname.lastname@example.org during this pilot period.
Are you running out of space in your office, moving to a new office, or tasked with processing the records of retiring (or already retired) scientists? As part of National Records and Information Management Month, you’ll learn which files need to be kept, which can be archived, and which can be disposed of at a workshop sponsored by the Archives and Records Office on Apr. 22 from 10:30 am to noon in 50A-5132. For more info and to register, go here:
Smartsheet will be offline on March 11, 2016 starting at 8:00 PM PST for approximately 15 minutes for system improvements.
The Identity Services team will be performing upgrades to the Enterprise Directory backend this weekend (March 12 and March 13).
Precautions will be taken to mitigate any downtime of the Enterprise Directory, but there may be some instability of the service between 10 AM and 1 PM on Saturday and Sunday.
HPC Services engineer Michael Jennings gave a talk on the "Node Health Check (NHC)" on Feb 24, 2016 at the Stanford Conference and Exascale Workshop 2016 sponsored by the HPC Advisory Council. NHC, developed by Jennings, provides the framework and implementation for a highly reliable, flexible, extensible node health check solution. It is now widely recommended by major HPC job scheduler vendors and is in use at many large HPC sites and research institutions.
In this follow-up to his 2014 presentation at the Stanford HPC Advisory Council Conference, Michael provided an update on the latest happenings with the LBNL NHC project, new features in the latest release, and a brief overview of the roadmap for future development.
About Michael Jennings
Michael has been a UNIX/Linux Systems Administrator and a C/Perl developer for 20 years and has been author of or contributor to numerous open source software projects including Eterm, Mezzanine, RPM, Warewulf, and TORQUE. Additionally, he co-founded the Caos Foundation, creators of CentOS, and has been lead developer on 3 separate Linux distributions. He currently works as a Senior HPC Systems Engineer for the High Performance Computing Services group at Lawrence Berkeley National Laboratory and is the primary author/maintainer for the LBNL Node Health Check (NHC) project. He has also served for 2 years as President of SPXXL, the extreme-scale HPC users group.
Smartsheet will be offline on February 13, 2016 starting at 1:00 PM PST and until approximately 5 PM PST for system improvements.
We experienced an unscheduled outage of all authentication services including login.lbl.gov, ldap, OTP/pledge and other directory services on Christmas morning. Network engineers resolved the outage.
Networking and Telecom will take numerous short (1-2 hour) outages December 29-30, plus one longer outage on December 29th that will impact AT&T mobile phone access, traditional telephones at offsite buildings and the guest house, and VPN. The outage window for this outage is based on the expected time to complete electrical upgrades by Facilities and may be extended if the work proves more challenging than anticipated.
In addition, there will be numerous after hours outages on Friday December 18th to facilitate the work on Dec 29-30.
Current scheduled work in the Infrastructure Services area for the Dec-2015 winter break includes:
Tue, 29-Dec (tentative), times TBA -- Shutdown of 50A-1156 e-power to install branch circuit metering. We expect an outage of up to 2 hours for UPS-backed power and 1 hour for circuits that are just generator-backed during this installation. Most IT systems in 50A-1156 should be on dual power feeds before this time, so we expect the outage to impact:
All AT&T mobile phone and wired circuits (including the Guest House and offsite buildings)
WiFi service to the Lab
Remote access VPN service
Some building networking, including Blackberry Gate, Bldgs. 4, 48, 75, 75a, 75b, 85, and Keck Observatory
Some limited CPP equipment
Tue, 29-Dec, times TBA -- maintenance to primary border router er1-n1 and ALS Science DMZ router sr1-als. We will have shifted traffic to the backup border router before this work, so the only outage impact should be for Science DMZ DTN’s (managed by HPCS and ALS)
Tue, 29-Dec, times TBA -- maintenance to DNS anycast service. This may cause short (< 1 minute) interruptions to DNS resolution using the 184.108.40.206 anycast address.
Tue, 29-Dec 0800 - 2200, and Wed, 30-Dec 0800 - 2200 -- maintenance to non-data center switches at LBL. This will usually mean a ~ 5 minute outage per switch as it reboots into the new operating system, although some switches can take up to 30 minutes to upgrade their firmware and return to normal operation.
Tue, 29-Dec, times TBA -- maintenance to ir1-n1 and ir3-n2 routers and installation of new, larger power supplies for ir1-n1. This will mean a 10-30 minute outage for all routing for all LBL networks other than those at the ALS, at JGI, or in Zone 3 (buildings 31, 62, 66, 67, 72, 74, 77, 83, 84, 85, 86, and Strawberry Canyon gate house).
Tue, 29-Dec or Wed, 30-Dec, times TBA -- maintenance to LBLnet services switch. This may cause short (usually < 5 minutes) outages for network services such as ntp, dns, onestop, iprequest, etc.
Tue, 29-Dec or Wed, 30-Dec, times TBA -- uplink migration from end-of-life equipment for Buildings 1, 4, 14, 31, 55, 56W, 64, 77, and 86. Expected 5-10 minutes outages for these buildings as their uplinks move from old end-of-life equipment to current equipment.
Tue, 29-Dec or Wed, 30-Dec, times TBA -- maintenance to Bldg 84 switch. Expected 10-30 minute outage while this equipment is replaced.
Tue, 29-Dec 0800 - 2200 -- replace primary Bldg 978 network switch. Expected impact is several hours of downtime for most building network connections.
Wed, 30-Dec 0800 - 2200 -- replace primary Bldg 977 network switch. Expected impact is several hours of downtime for most building network connections.
Wed, 30-Dec 1200 - 2200 -- replace primary JGI Bldg 100 non-sequencer network switch. Expected impact is 2-3 hours of downtime for impacted 220.127.116.11 subnet.
Dates/times still being negotiated -- maintenance to IDM and TSC subnets.
The scheduled power work of December 18, 2015 will impact authentication to Lab services, email delivery and access to eRoom and Sympa lists starting at 5:30 PM. Authentication to Lab services and email delivery will be impacted for approximately 15 to 30 minutes; access to eRoom and Sympa lists will be impacted for approximately 30 to 60 minutes.
After testing ZOOM (a product chosen by ESnet last year) and comparing functionality and cost with our prior offering, we have decided to make a change. In December we will start deployment and training, which allows a one-month overlap with the current tool while we make the transition. (We will continue to offer and recommend Google Hangouts when appropriate; it is easily accessed via Gmail or Google Calendar.)
Our documentation is being updated (go.lbl.gov/vc) and we are preparing to offer training (including one-on-one sessions with Help Desk staff members). ZOOM is also on all the video conferencing carts.
With our Zoom contract, we are acquiring one 1,000-person webinar license that can be assigned as needed. Zoom also has a remote-control feature that lets a participant share their screen and let someone else take control (great for troubleshooting or providing assistance in areas outside of video conferencing). Up to 50 participants can join a meeting.
We will also have the ability to let up to 4 traditional room systems connect (1 system in each of 4 meetings, or 4 systems in one meeting). This is not needed as much anymore, but it is a useful feature. We also understand that Zoom can connect to a video conference bridge, something we occasionally have to do with DOE.
We plan to integrate Zoom with our Single Sign-on system so your Berkeley Lab identity ("LDAP") can be used to authenticate as a host and allow us to auto provision accounts. We are also looking into offering toll free international numbers (which we can assign at the host level). We will have to recharge these costs for those of you who need this feature.
The Linux Foundation, the nonprofit organization dedicated to accelerating the growth of Linux and collaborative development, has announced an intent to form the OpenHPC Collaborative Project. This project will provide a new, open source framework to support the world's most sophisticated High Performance Computing environments. Warewulf, our cluster provisioning tool developed by IT's HPC architect Greg Kurtzer, is specified as the provisioning tool in the OpenHPC standard cluster-building recipe.
Smartsheet will be offline for 90 minutes on November 7 at 1:00pm PST (2015-11-07 21:00 UTC) for system improvements. Smartsheet status can be checked here.