
Overview

ROSS (short for Research Object Storage System) is a multi-protocol, web-based storage system.  The primary 4.2 PB storage unit for ROSS resides on the UAMS campus.  In the coming months the UAMS unit will be connected to a 2 PB sister unit sited at the University of Arkansas in Fayetteville, allowing a limited amount of off-site, replicated storage.  ROSS currently has 23 storage nodes, with 15 at UAMS and 8 at UA Fayetteville.  Future expansions could increase the number of storage nodes and the total capacity.

...

If you wish to access an existing group namespace as an object user, please contact that namespace's administrator and ask to be added as an object user.

Using ROSS

What APIs are supported by ROSS?

ROSS supports multiple object protocol APIs.  The most common is S3, followed by Swift (from the OpenStack family).  Although ROSS can support the NFS protocol, our experiments show that the native NFS support is slower for many use cases than other options that simply emulate a POSIX file system using object protocols.  ROSS can also support HDFS (used in Hadoop clusters) and a couple of Dell/EMC proprietary protocols (ATMOS and CAS); however, we have not tested these, so the HPC Admins can offer only limited support for them.  The primary protocols that the HPC Admins support are S3 and Swift, with more experience in S3.  Later we describe our two favorite tools for accessing ROSS, but you are not required to use our suggested tools.
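As an illustration, here is a minimal Python sketch of connecting to ROSS over S3 with the boto3 library.  The endpoint URL, access key, and secret key below are placeholders - substitute the actual ROSS S3 endpoint and the object-user credentials issued to you.

    import boto3

    # Placeholder endpoint and credentials - replace with your ROSS S3
    # endpoint and the object-user keys issued to you.
    s3 = boto3.client(
        "s3",
        endpoint_url="https://ross.example.edu:9021",
        aws_access_key_id="YOUR_OBJECT_USER",
        aws_secret_access_key="YOUR_SECRET_KEY",
    )

    # List the buckets visible to this object user.
    for bucket in s3.list_buckets()["Buckets"]:
        print(bucket["Name"])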

How do I get access credentials?

blah

How do I create a bucket or container?

For your personal namespace, you can create buckets (the S3 term) or containers (the equivalent Swift term) from the administration API.

If it is a group namespace (e.g. lab, project, department), the namespace administrator can create buckets with certain characteristics for you.

In either case, object users can also use APIs and tools to create buckets or containers.  However, many of the ECS-specific features (e.g. replication, encryption, file access) are not available through the bucket-creation options of those tools and APIs.  If you need any of those options, it is better to create the bucket or container with ROSS's web interface.  Using the APIs is beyond the scope of this document.
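For example, a plain bucket can be created through the S3 API with a call like the following Python sketch.  The endpoint and bucket name are placeholders, and remember that the ECS-specific options mentioned above are only available in the web interface.

    import boto3

    s3 = boto3.client(
        "s3",
        endpoint_url="https://ross.example.edu:9021",  # placeholder ROSS endpoint
        aws_access_key_id="YOUR_OBJECT_USER",
        aws_secret_access_key="YOUR_SECRET_KEY",
    )

    # Creates a plain bucket; ECS-specific features (replication, encryption,
    # file access) must instead be enabled via the ROSS web interface.
    s3.create_bucket(Bucket="my-lab-data")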

What tools do you recommend for accessing ROSS?

The bottom line: any tool that uses any of the supported protocols could, in theory, work with ROSS.  A number of free and commercial tools work nicely with object stores like ROSS.  Like most things in life, each tool has pros and cons.  Here is some guidance on what we look for in tools.

Because ROSS is inherently parallel (remember the 23 storage nodes), the best performance is gained when operations proceed in parallel.  Keep that in mind if building your own scripts (e.g. using ECS's variant of s3curl) or when comparing tools.  More parallelism (i.e. multiple transfers happening simultaneously across multiple storage nodes) generally yields better performance, up to a limit.  Eventually the parallel transfers become limited by other factors such as memory or network bandwidth constraints.  Generally there is a 'sweet spot' in the number of parallel transfer threads that can run before they step on each other's toes.  The S3 protocol includes support for multipart upload, where a large object is broken into multiple pieces and the pieces are transferred in parallel.  Tools that support multipart upload will likely perform better than tools that don't.  In addition, tools that move entire directory trees in parallel will perform better than tools that move one object at a time.
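For instance, boto3's transfer configuration exposes both the multipart part size and the thread count, which makes it easy to experiment toward the sweet spot.  The endpoint is a placeholder, and the numbers below are illustrative starting points, not tuned recommendations.

    import boto3
    from boto3.s3.transfer import TransferConfig

    s3 = boto3.client(
        "s3",
        endpoint_url="https://ross.example.edu:9021",  # placeholder ROSS endpoint
        aws_access_key_id="YOUR_OBJECT_USER",
        aws_secret_access_key="YOUR_SECRET_KEY",
    )

    # Split uploads over 16 MB into 16 MB parts and move up to 8 parts at once.
    config = TransferConfig(
        multipart_threshold=16 * 1024 * 1024,
        multipart_chunksize=16 * 1024 * 1024,
        max_concurrency=8,
        use_threads=True,
    )
    s3.upload_file("large_dataset.tar", "my-lab-data", "large_dataset.tar",
                   Config=config)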

In experiments we have noticed that write times to ROSS are considerably slower than read times, and slower than writes to many POSIX file systems; reads, by contrast, are fast.  In other words, it takes longer to store new data into ROSS than to pull existing data out of ROSS.  We also notice (as is typical of most file systems) that transfers of large objects can go significantly faster than transfers of tiny objects.  Please keep these facts in mind when planning your use of ROSS - ROSS favors reads over writes and big things over little things.

In theory, a bucket can hold millions of objects.  We have noticed, though, that path-prefix searches slow down as the number of objects sharing that prefix grows - and that number is typically related to the total number of objects in the bucket.
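As a sketch, a paginated prefix listing over S3 looks like the following; the time it takes grows with the number of keys matching the prefix.  The endpoint, bucket, and prefix names are placeholders.

    import boto3

    s3 = boto3.client(
        "s3",
        endpoint_url="https://ross.example.edu:9021",  # placeholder ROSS endpoint
        aws_access_key_id="YOUR_OBJECT_USER",
        aws_secret_access_key="YOUR_SECRET_KEY",
    )

    # Walk all keys under the prefix, one page (up to 1000 keys) at a time.
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket="my-lab-data", Prefix="run42/"):
        for obj in page.get("Contents", []):
            print(obj["Key"], obj["Size"])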

After evaluating several tools, the HPC Admins settled on two tools, ecs-sync and rclone, as 'best of breed' for moving data between ROSS and Grace's cluster storage system, where the /home, /scratch, and /storage directories live.  The ecs-sync program is the more efficient and the faster of the two for bulk data moves.  It consumes fewer compute resources and less memory than rclone, and when properly tuned for the number of threads (i.e. when the sweet spot is found) it moves data significantly faster than rclone.  The rclone program has more features than ecs-sync, including ways to browse data in ROSS, to mount a ROSS bucket as if it were a POSIX file system, and to synchronize content using a familiar rsync-like command syntax.  While ecs-sync is great for fast, bulk moves, rclone works very well for nuanced access to ROSS and small transfers.
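As an illustration of typical rclone usage, assuming you have run 'rclone config' and defined an S3-type remote pointing at the ROSS endpoint (the remote name 'ross' and the paths below are placeholders, not site-specific values):

    # List buckets visible to the configured object user.
    rclone lsd ross:

    # Copy a directory tree to a bucket, with 16 parallel transfers.
    rclone copy /scratch/mydata ross:my-lab-data/run42 --transfers 16

    # Mount a bucket so it can be browsed like a POSIX file system.
    rclone mount ross:my-lab-data /home/me/ross-view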

...