Shared Cluster and Storage Program

Shared Cluster Hosting Program - Hoffman2 Shared Cluster

Shared Storage Program - Purchased Storage

Shared Cluster Hosting Program - Hoffman2 Shared Cluster

The Shared Hoffman2 Cluster is made up of two main virtualized clusters that have been optimized for different research needs. The Research Virtual Shared Cluster is made up of Contributed cores purchased by individual research groups and Base cores purchased by IDRE to augment the Contributed cores. One benefit of contributing cores to the shared cluster is that a research group is guaranteed use of the number of cores it contributed, with the ability to use surplus cores from the entire Hoffman2 Cluster. Other benefits provided to research groups when they join the shared cluster include:

  1. Complete system administration for contributed cores
  2. Cluster access through a 10Gb network interconnect to the campus backbone
  3. High performance home and scratch storage space
  4. A dedicated data center facility for housing the cluster. This eliminates the need to perform expensive space, cooling, and electrical modifications to existing office or lab space
  5. The capability to run large parallel jobs that can take advantage of the cluster's InfiniBand interconnect

Research groups that have contributed cores to the Research Virtual Shared Cluster also have access to the features of the General Purpose Cluster. This gives them:

  1. Access to pooled licenses, allowing researchers to run larger commercial applications without the cost of buying additional licenses
  2. Access to additional commercial and open source applications
  3. Web access to the Hoffman2 Cluster through the UCLA Grid Portal

Base and Contributed Equipment Standards and Policies

All contributed hardware must be compatible with the base core architecture, processor type and speed, memory, disk space, and interconnect. This maximizes the effective management of the Hoffman2 Cluster and provides the highest level of computing service to shared cluster customers. IDRE provides full support in helping researchers specify and purchase their cores at the optimal price/performance point to meet these standards.

Once contributed, these cores become part of the entire Hoffman2 Cluster and are no longer physically linked to a given research group. Because cycles are pooled across all Base and Contributed cores, which may be in use by others, a number of cores equivalent to those contributed is made available within 24 hours of a request. In practice, the number of cores contributed by a research group is generally available much sooner. Jobs that run on the Research Virtual Shared Cluster have a 14-day upper limit on run time (with appropriate notification, longer runs may be accommodated).

While it is hard to give an exact number of additional cores available, in practice there are unused cores that can be made available within a reasonable period of time for researchers who require more cores than those they contributed.

With advance agreement, a very large job that requires a large segment of the entire shared cluster (the cores connected through the InfiniBand interconnect) can be accommodated, depending on current cluster usage and the consent of affected research groups.

Research Virtual Shared Cluster Hosting Costs

Research groups that contribute cores to the Hoffman2 Cluster agree to contribute their unused cycles to other researchers. They can regain full use of their contributed cores within 24 hours of submitting a job.

Users of the Research Virtual Shared Cluster, and users of the General Purpose Cluster, have the option of paying a one-time, per-terabyte charge for storage on the BlueArc storage system. This is a particularly important option for those who need more than the standard 20 GB of directory space per user on the Research Virtual Shared or General Purpose Cluster, or who want more permanent space for large data sets to avoid recurring upload and transfer times. Please see the IDRE Shared HPC Storage Program for further information.

Base and Contributed Equipment Renewals

After a period of three years all hardware within the shared cluster is evaluated for retention based on condition of equipment, cost to maintain, relative compute power and the ability to backfill with new systems. This is done to maintain a high performance and low maintenance system, while maximizing the utilization of data center space.

If the contributed cores can still be effectively maintained, those cores will remain inside the Hoffman2 Cluster and continue to be reevaluated on an annual basis. If the contributed cores can no longer be effectively maintained, upon mutual agreement, they will be redeployed for other uses or decommissioned.

The Campus General Purpose Cluster

UCLA faculty (and their students) who have not contributed cores run parallel jobs on the General Purpose Cluster.

The Campus General Purpose Cluster is the part of the Hoffman2 Cluster System provided as a high-performance computing resource for the entire UCLA campus and is available to UCLA students and faculty who:

  • Run primarily commercial applications and/or user-written, discipline-specific applications,
  • Have low-level or sporadic usage, and
  • Require a specific application, compiler, or visualization tool available only on the General Purpose or Applications Clusters.

Because resources in the General Purpose Cluster are limited, there are restrictions on the jobs that can be run (see the sketch following this list):

  • The maximum run time for a job is 24 hours. Jobs running longer than 24 hours will be killed by the scheduler.
  • Jobs are limited to a maximum of 128 cores. Jobs requesting more than 128 cores will not be scheduled.
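
This document does not spell out the scheduler's submission syntax, so the following is only a hedged sketch: it assumes a Grid Engine-style qsub with an h_rt wall-clock request and a parallel environment flag (the environment name dc* and the script name my_job.sh are hypothetical), and it simply refuses to submit a job that would exceed the General Purpose Cluster limits listed above.

  # Minimal sketch only -- the scheduler flags and parallel environment name are
  # assumptions; consult the cluster documentation for the exact submission syntax.
  import shlex
  import subprocess

  MAX_HOURS = 24    # jobs running longer than 24 hours are killed by the scheduler
  MAX_CORES = 128   # jobs requesting more than 128 cores will not be scheduled

  def submit(script_path: str, hours: int, cores: int) -> None:
      """Submit a batch script, first checking it against the posted limits."""
      if hours > MAX_HOURS:
          raise ValueError(f"run time {hours}h exceeds the {MAX_HOURS}-hour limit")
      if cores > MAX_CORES:
          raise ValueError(f"{cores} cores exceeds the {MAX_CORES}-core limit")
      cmd = [
          "qsub",
          "-l", f"h_rt={hours}:00:00",   # wall-clock limit in HH:MM:SS
          "-pe", "dc*", str(cores),      # parallel environment name is assumed
          script_path,
      ]
      print("submitting:", shlex.join(cmd))
      subprocess.run(cmd, check=True)

  # Example: a 12-hour, 64-core job, within both limits.
  # submit("my_job.sh", hours=12, cores=64)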

The Shared Hoffman2 Hardware and Software

The Hoffman2 Cluster has 64-bit nodes with an Ethernet network and an InfiniBand interconnect, with the following standard software suite:

  • Scheduler
  • Compilers: GCC and the best-performing compilers for C, C++, and Fortran 77, 90, and 95 on the current Shared Cluster architecture
  • Applications and Libraries in the Basic Software Suite

Certain applications are provided to ensure a base level of cluster usability. Every effort is made to maximize application usage to the extent permitted under license agreements. Where possible, software is provided that would not make sense for an individual research group to purchase on its own.

In addition to the Base and Contributed cores, the Hoffman2 Cluster includes the login nodes and the storage server. The Hoffman2 Cluster has both InfiniBand and gigabit Ethernet network switches and interconnects. The Ethernet interconnect is dedicated to traffic in and out of the storage system as well as various administrative functions and is used as the interconnect for the Applications Cluster. To maintain maximum parallel performance, InfiniBand is used strictly for inter-node, MPI-type communication across the Research Virtual Shared Cluster and the General Purpose Cluster.
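
As an illustration of the inter-node, MPI-type communication that the InfiniBand interconnect carries, here is a minimal sketch using mpi4py. mpi4py is chosen for brevity and is an assumption, not necessarily part of the Basic Software Suite; the same pattern applies to a compiled C or Fortran MPI code launched with mpirun/mpiexec.

  # Illustrative only: each MPI rank reports its host, and ranks 0 and 1 exchange
  # a message; on a multi-node job that exchange crosses nodes over InfiniBand.
  from mpi4py import MPI

  comm = MPI.COMM_WORLD
  rank = comm.Get_rank()                 # this process's id within the job
  size = comm.Get_size()                 # total number of MPI processes
  host = MPI.Get_processor_name()        # differs per node in a multi-node job

  print(f"rank {rank} of {size} running on {host}")

  if size > 1:
      if rank == 0:
          comm.send("ping", dest=1, tag=0)
          reply = comm.recv(source=1, tag=1)
          print("rank 0 received:", reply)
      elif rank == 1:
          msg = comm.recv(source=0, tag=0)   # receive first, then answer
          comm.send("pong", dest=0, tag=1)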

Shared Storage Program - Purchased Storage

For those that purchase storage from us, these are the terms and conditions under which we operate. Each group that purchases storage will be given a document specific to their request that must be signed by the responsible PI.

Please note that the high-performance storage provided by us is the only option for purchasing or using storage on the Hoffman2 shared cluster. We will neither mount any external file system onto Hoffman2 nor export Hoffman2's file system to the outside world.

Specifications under the agreement are:

Cost

The cost of providing one terabyte of storage services for a period of three years is 19.747 hours charged at the current campus PAIV rate, or approximately $1,500. These services include the physical storage space, administration of the users of your storage space, hardware and software upgrades and problem fixes, plus backup services (if applicable) as described below. Storage service charge for 1 terabyte: 19.747 hours (per terabyte) at the 2012 campus PAIV rate of $75.96 per hour = approximately $1,500. Labor rates and task hours will be adjusted periodically to reflect changes in campus rates and required labor.
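
As a quick check of the figures above, and as a hedged way to estimate the charge for larger purchases, the per-terabyte arithmetic works out as follows; the hours-per-terabyte figure and the 2012 PAIV rate come from the text and are adjusted periodically, so treat the numbers as illustrative.

  # Rates below are the 2012 figures quoted above and are subject to adjustment.
  HOURS_PER_TB = 19.747          # service hours charged per terabyte (3 years)
  PAIV_RATE = 75.96              # dollars per hour, 2012 campus PAIV rate

  def storage_cost(terabytes: int) -> float:
      """Approximate three-year storage service charge in dollars."""
      return terabytes * HOURS_PER_TB * PAIV_RATE

  print(round(storage_cost(1), 2))   # 1499.98, i.e. approximately $1,500
  print(round(storage_cost(5), 2))   # 7499.91 for a hypothetical 5 TB purchase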

Space Usage Duration

Space will be made available for a period of three years. At that time a review of the storage system hardware will be conducted to see if it continues to meet overall reliability standards. If it does, a follow-up review will be performed after an additional year. In no case will the storage hardware be kept for longer than four years (see below). If it does not, the hardware will need to be upgraded and the storage user will be charged at the then-current storage rate. We anticipate this rate will be the same as or lower than the current storage service rate (see above), although no guarantee can be made at this time. If the storage user does not or cannot pay the new storage fee, they will have to make arrangements to move their data off of the storage server within a maximum of 30 days. IDRE personnel are available to assist you with moving your data.

Backups

Tape backup of storage space purchased under this agreement is optional. The user can select, in one-terabyte increments, whether their purchased storage is backed up or not. Any storage that is backed up will have a strict limit of 1,000,000 files per terabyte. Purchased storage users can move storage between backed up and not backed up (see below) in one-terabyte and one-million-file increments as their needs demand by contacting the system administrators. Occasionally, purchased storage users will be required to log out of the system in order for their requests to be completed.
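
Because the 1,000,000-files-per-terabyte limit is easy to exceed with many small files, a quick count before requesting backup can help. This is only a sketch, and the path shown is hypothetical; point it at your own purchased-storage area.

  # Count regular files under a directory tree and compare the total against the
  # 1,000,000-files-per-backed-up-terabyte limit described above.
  import os

  def count_files(top: str) -> int:
      total = 0
      for _dirpath, _dirnames, filenames in os.walk(top):
          total += len(filenames)
      return total

  n = count_files("/path/to/your/purchased/storage")   # hypothetical path
  print(f"{n:,} files found")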

Additional information related to backups and recovery of your data:

  • Two versions of each file in your home directory storage are retained on our tape backup system: the most current version and the most current version preceding it. Most current is defined by the last modification date of the file. Once you delete a file you have 30 days to recover it. After 30 days no copy of a deleted file is left on the backup system.
  • Every effort will be made to restore lost data as quickly as possible. The amount of data, number of files, other activity on the backup system and workload of IDRE are all factors that can impact recovery time. We ask that you do not abuse this service, as the recovery procedure is disruptive to other ongoing operations.
  • As long as you continue to pay for your storage, your data will be retained on our storage server and backup system (as long as you do not delete it). At the termination of your storage agreement with us, you must remove your data within 30 days. After 30 days we will delete your data from our storage and backup systems, and there is no way to recover it once this happens.

Protecting Personal Information (PI) on Hoffman2

Pursuant to UCLA Policy 404, any Personal Information (PI) data stored on the Hoffman2 Cluster file systems must be protected.

Personal Information is defined as "an individual's first name or first initial, and last name, in combination with any one or more of the following: (1) Social Security number, (2) driver's license number or California identification card number, (3) account number, credit or debit card number, in combination with any required security code, access code, or password that would permit access to an individual's financial account, (4) medical information, and (5) health insurance information."

Since IDRE has responsibility for the Hoffman2 Cluster, it also has responsibility for ensuring that users of the Hoffman2 file systems understand what is required of them.

To that end, we need to know if you are storing any Personal Information on the Hoffman2 Cluster. If so, we strongly recommend that you remove it immediately. If this is not possible, then you must encrypt the data per policy guidelines. If you do decide to keep it, you must inform the IDRE Director, in writing, of what kind of Personal Information you have and why you must keep it on Hoffman2. If a security breach occurs and Personal Information is stolen AND it is not encrypted, then YOU, as the custodian of the data, are liable for the exposure.
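
The encryption methods that actually satisfy the policy are governed by UCLA Policy 404 and IDRE guidance, not by this document. Purely as a hedged illustration of file-level encryption at rest, here is a sketch using the third-party cryptography package; the file name is hypothetical.

  # Illustration only -- confirm acceptable encryption methods against policy
  # guidelines before relying on this. Requires: pip install cryptography
  from cryptography.fernet import Fernet

  # Generate a key once and store it somewhere other than the Hoffman2 file system.
  key = Fernet.generate_key()
  fernet = Fernet(key)

  # Encrypt a hypothetical file containing Personal Information before it is
  # placed on cluster storage; remove the plaintext copy afterwards.
  with open("subject_records.csv", "rb") as f:
      ciphertext = fernet.encrypt(f.read())
  with open("subject_records.csv.enc", "wb") as f:
      f.write(ciphertext)

  # Later, decrypt with the same key:
  # plaintext = Fernet(key).decrypt(open("subject_records.csv.enc", "rb").read())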