Cloud Storage

Big Data and Cloud Computing (CC4053)

Eduardo R. B. Marques, DCC/FCUP

Cloud storage

We provide an overview of the main storage models used with cloud computing, along with example cloud service offerings, including:

Reference → most of the topics in these slides are also covered in chapters 2 and 3 of Cloud Computing for Science and Engineering.

Local file systems

Characteristics:

Advantage → most programs are designed to work with files, and use the file system hierarchy (directories) as a natural means to organize information - files/directories have no associated “data schema” though.

Disadvantages → shared access by multiple hosts is not possible, and the volume of data is bounded by the file system capacity. Distributed file systems, discussed next, can handle shared access and mitigate volume restrictions.

Local file systems (cont.)

Example: configuration of a VM boot disk in Compute Engine

Network attached storage

Characteristics:

Advantages → data sharing by multiple hosts, and typically higher storage capacity (installed at server hosts)

Example cloud services → NFS: Amazon EFS, Google Filestore, Azure Files (Azure Files also supports SMB).

Beyond shared storage, a note on parallel file systems → file systems such as Lustre or HDFS are designed for parallel computing using computer clusters. We will talk about HDFS later in the semester.

Distributed File Systems (cont.)

NFS client-server interaction

Image source: “Cloud Computing: Theory and Practice 2nd ed.” by D. C. Marinescu

Distributed File Systems (cont.)

Example: NFS configuration using Google Filestore

Buckets

An object store, also known as a bucket (in AWS and GCP), is a container for unstructured binary objects that can be accessed over the Internet without any logical association to a specific storage facility. An object in a bucket:

File systems vs. buckets

The logical simplicity of the bucket model allows easy replication with no synchronization constraints, and flexible/scalable use in diverse use cases (cloud apps, content delivery, backup, etc.). Example: public datasets of Landsat/Sentinel satellite imagery are available as buckets and used by Google Earth Engine.

Buckets in GCP

Example from Google Cloud Platform

Bucket creation in GCP

Basic parameters → a globally unique name, region placement, and storage class.

Bucket creation in GCP and billing

Parameters:

Billing:

Bucket creation in GCP and billing (cont.)

From Cloud Storage pricing, Feb. 2021

Access to GCP buckets

Access to GCP buckets using Python

# Imports the Google Cloud client library
from google.cloud import storage

# Instantiates a client
storage_client = storage.Client()

# The name for the new bucket
bucket_name = "my-new-bucket"

# Creates the new bucket
bucket = storage_client.create_bucket(bucket_name)

print("Bucket {} created.".format(bucket.name))

Reference: google-cloud-storage library

Access to GCP buckets using Python (cont.)

def download_blob(bucket_name, 
                  source_blob_name, 
                  destination_file_name):
    """Downloads a blob from the bucket."""
    # bucket_name = "your-bucket-name"
    # source_blob_name = "storage-object-name"
    # destination_file_name = "local/path/to/file"
    storage_client = storage.Client()
    bucket = storage_client.bucket(bucket_name)
    blob = bucket.blob(source_blob_name)
    blob.download_to_filename(destination_file_name)

Reference: google-cloud-storage library

Databases

Database → structured collection of data concerning real-world entities and their relationships. The structure of the data is often described by a data schema (also called a data model).

Database management system (DBMS) → software system responsible for managing data schemas and the associated data, providing an application interface through a query language, supporting concurrent transactions, and handling crash recovery.

Types of databases:

Relational databases - use of SQL

-- Table creation
CREATE TABLE CUSTOMER( 
  Id INT PRIMARY KEY AUTO_INCREMENT, 
  Login VARCHAR(16) UNIQUE NOT NULL,
  Name VARCHAR(64) NOT NULL,
  Phone INT NOT NULL,
  Country VARCHAR(64) NOT NULL );
-- Data changes
INSERT INTO CUSTOMER(Login,Name,Phone,Country)
VALUES('bdcc', 'Alonsus Big', 9299334, 'Portugal');
UPDATE CUSTOMER SET Phone = 93949444 WHERE Login='bdcc';
DELETE FROM CUSTOMER WHERE Country = 'Spain';
-- Queries
SELECT Login,Country FROM CUSTOMER WHERE Id='1234';
SELECT * FROM CUSTOMER 
WHERE Country = 'China' AND Phone IS NOT NULL
ORDER BY Login;

Relational databases - DBaaS offerings

General-purpose and popular DBMSs like MySQL, PostgreSQL, or Microsoft SQL Server are offered through DBaaS (Database as a Service) solutions such as Google Cloud SQL, Amazon RDS, or Azure SQL.

Other offerings are oriented towards scalable operation in cloud computing, e.g., Google Spanner or Amazon Aurora.

Databases and ACID transactions

Transaction = sequence of database operations

ACID guarantees are often conflicting, especially if a database must scale to accommodate large volumes of data or concurrent accesses, and/or provide high-performance access to data.

For example, the SQL standard defines several “isolation levels” related to relaxations of the isolation requirement, which in turn may compromise consistency and atomicity.
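To make atomicity and rollback concrete, here is a minimal sketch using Python's built-in `sqlite3` module (not tied to any cloud service); the table layout and the `transfer` helper are illustrative assumptions, not part of any standard API:

```python
import sqlite3

# In-memory database with a single accounts table.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER)")
db.execute("INSERT INTO account VALUES (1, 500), (2, 0)")
db.commit()

def transfer(conn, src, dst, amount):
    """Move amount from src to dst atomically: both updates happen or neither."""
    try:
        with conn:  # opens a transaction; commits on success, rolls back on error
            conn.execute("UPDATE account SET balance = balance - ? WHERE id = ?",
                         (amount, src))
            if conn.execute("SELECT balance FROM account WHERE id = ?",
                            (src,)).fetchone()[0] < 0:
                raise ValueError("insufficient funds")
            conn.execute("UPDATE account SET balance = balance + ? WHERE id = ?",
                         (amount, dst))
    except ValueError:
        pass  # transaction rolled back: no partial update is visible

transfer(db, 1, 2, 100)   # succeeds: 100 moves from account 1 to account 2
transfer(db, 1, 2, 1000)  # fails and is rolled back: balance would go negative
balances = dict(db.execute("SELECT id, balance FROM account"))
print(balances)  # {1: 400, 2: 100}
```

The `with conn:` block is what provides the "all or nothing" behaviour: the failed second transfer leaves no trace in the table.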

Databases and ACID transactions (cont.)

Two transactions involving money transfers between bank accounts: 1 → 2 with a value of 100 (left) and 1 → 3 with a value of 200 (right). If they are not properly isolated, we may have, for example, that the balance of account 1 is decremented by 200 rather than 300.

Databases and ACID transactions (cont.)

Concurrent transactions must appear as if they executed in sequence, i.e., the DBMS must ensure they are serializable.

In the example above with 4 transactions, where T2 and T3 are concurrent, the logical effect should be as if the transactions executed in one of the following orders: T1 → T2 → T3 → T4 or T1 → T3 → T2 → T4.

Distributed databases

Image source: Overview of Oracle sharding, Oracle corp.

To cope with scalable operation for large amounts of data and accesses, distributed databases employ techniques such as sharding (data is partitioned across different hosts, illustrated above) or replication (data is mirrored across different hosts).
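The routing side of sharding can be pictured with a short sketch: each key is hashed and mapped to one of N shard hosts, so the same key is always served by the same host. The host names below are hypothetical, and real systems typically use more elaborate schemes (e.g., consistent hashing or range partitioning):

```python
import hashlib

# Hypothetical shard hosts; a real deployment would obtain these from
# cluster metadata rather than a hard-coded list.
SHARDS = ["shard-0.db.example.com",
          "shard-1.db.example.com",
          "shard-2.db.example.com"]

def shard_for(key: str) -> str:
    """Deterministically map a key to one shard host by hashing."""
    h = int(hashlib.sha256(key.encode()).hexdigest(), 16)
    return SHARDS[h % len(SHARDS)]

# The same key always lands on the same shard, so reads find the
# data that writes placed there.
print(shard_for("customer:1234"))
```

Note that with plain modulo hashing, changing the number of shards remaps most keys; consistent hashing is the usual remedy.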

Distributed operation presents new challenges, though …

Distributed databases and the CAP theorem

CAP Theorem (also known as Brewer’s theorem) → an influential theoretical result for distributed databases that states that a system cannot guarantee all of the three following properties:

Distributed databases and the CAP theorem (cont.)

CAP? Choose two! Database systems fall under one of the CA, CP, or AP categories.

NoSQL databases

NoSQL database systems

Key-value stores define two abstract operations: put(k,v), which stores value v in association with key k, and get(k), which retrieves the value associated with key k. The types of keys and values are often arbitrary, i.e., they are treated as byte sequences by the storage system.
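The put/get abstraction is small enough to sketch in a few lines; the toy class below is purely illustrative (a real store would add persistence, replication, etc.):

```python
from typing import Optional

class KeyValueStore:
    """Toy in-memory key-value store exposing only put/get, the two
    abstract operations of the key-value model. Keys and values are
    raw bytes, mirroring stores that treat them as opaque sequences."""

    def __init__(self):
        self._data = {}

    def put(self, k: bytes, v: bytes) -> None:
        """Store value v in association with key k."""
        self._data[k] = v

    def get(self, k: bytes) -> Optional[bytes]:
        """Return the value associated with key k, or None if absent."""
        return self._data.get(k)

kv = KeyValueStore()
kv.put(b"user:42", b'{"name": "Alice"}')
print(kv.get(b"user:42"))  # b'{"name": "Alice"}'
```

Note that the store itself has no idea the stored value happens to be JSON; interpreting value structure is what distinguishes document stores, discussed next.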

Document stores are specialised key-value stores that operate on data, called documents, that have a “self-describing” hierarchical structure (e.g., as in XML, JSON). DB operations can be formulated in terms of the “self-describing” schema, sometimes using SQL-like languages. Examples → MongoDB, Google Firestore.

Graph databases organize data in terms of graphs, such that nodes represent entities and edges represent relationships between entities. Example → Neo4j.

Multi-paradigm NoSQL databases → ArangoDB and Azure Cosmos DB are examples that support key-value, document, or graph-based operations.

We only cover some examples of key-value stores next.

BigTable

BigTable from Google is a key-value cloud storage system for operating on large volumes of data. It underlies several Google services such as Google Earth and Google Analytics, and in fact other NoSQL databases like Google Firestore.

BigTable (cont.)

BigTable is a wide-column store, a special type of key-value store.

As in relational databases, BigTable supports the concepts of table, row, and column, but with significant differences. Table = { rows }, however:
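One way to picture BigTable's logical layout is as a nested map: row key → column family → column qualifier → timestamped cell versions. The nested-dict sketch below is only a mental model of that layout, not BigTable's actual implementation:

```python
# Illustrative nested-dict picture of the wide-column model:
# row key -> column family -> column qualifier -> [(timestamp, value), ...]
table = {}

def set_cell(row_key, family, qualifier, value, timestamp):
    """Append a timestamped cell version under the given row/family/qualifier."""
    cells = (table.setdefault(row_key, {})
                  .setdefault(family, {})
                  .setdefault(qualifier, []))
    cells.append((timestamp, value))
    cells.sort(reverse=True)  # keep the newest version first

set_cell("key_value", "cf", "experiment", b"exp1", 1)
set_cell("key_value", "cf", "experiment", b"exp2", 2)

# Reading a cell returns the most recent version by default.
latest = table["key_value"]["cf"]["experiment"][0][1]
print(latest)  # b'exp2'
```

Unlike a relational table, rows need not share the same columns: each row simply stores whatever qualifiers were written to it.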

BigTable (cont.)

Image source: “Cloud Computing: Theory and Practice 2nd ed.” by D. C. Marinescu

BigTable (cont.)

from google.cloud import bigtable

# Admin client (admin=True enables table management operations)
clientbt = bigtable.Client(admin=True)
instance = clientbt.instance('cloud-book-instance')
table = instance.table('book-table2')
table.create()
column_family = table.column_family('cf')
column_family.create()
# Write a few cells in one row, then commit the mutation
row = table.direct_row('key_value')
row.set_cell('cf', 'experiment', 'exp1')
row.set_cell('cf', 'date', '6/6/16')
row.set_cell('cf', 'link', 'http://some_location')
row.commit()
# Read the row back; cell values are returned as bytes
row_data = table.read_row('key_value') 
cells = row_data.cells['cf']

Reference: Python Client for Google Cloud Bigtable

In-memory key-value stores

# Memcached via pymemcache
from pymemcache.client.base import Client
mc = Client(('my-memcached-server.com', 11211))
mc.set('foo', 'bar')
result = mc.get('foo')

# Redis via redis-py
import redis
r = redis.Redis(host='localhost', port=6379, db=0)
r.set('foo', 'bar')
data = r.get('foo')

Memcached and Redis are two popular distributed in-memory key-value stores. (Redis also supports persistence and a number of other features beyond the raw key-value store functionality).

A typical use case → act as a cache for frequently accessed data in another database system. Both can be deployed using the Google Cloud Memorystore service.
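The cache use case follows the "cache-aside" pattern: look in the cache first and query the backing database only on a miss. The sketch below uses a plain dict in place of Memcached/Redis, and `query_database` is a hypothetical stand-in for a slow database call:

```python
# Cache-aside pattern sketch: the cache is consulted first, and the
# backing database is only queried on a miss. A plain dict stands in
# for Memcached/Redis; query_database is a hypothetical helper.
cache = {}
db_hits = 0

def query_database(key):
    """Stand-in for an expensive database query; counts its invocations."""
    global db_hits
    db_hits += 1
    return f"row-for-{key}"

def get(key):
    if key in cache:              # cache hit: no database round trip
        return cache[key]
    value = query_database(key)   # cache miss: fetch, then populate the cache
    cache[key] = value
    return value

get("user:1")   # miss -> hits the database
get("user:1")   # hit  -> served from the cache
print(db_hits)  # 1
```

In a real deployment the dict operations would become `mc.get`/`mc.set` (or `r.get`/`r.set`) calls, usually with an expiration time so stale entries are eventually evicted.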

In-memory key-value stores (cont.)

Image source: Deploying a Highly Available Distributed Caching Layer on Oracle Cloud Infrastructure using Memcached & Redis, Oracle white paper, 2018.

What database should I use?

It depends on your use case. The following fragment from the Google BigTable documentation offers some advice:

Data warehouses

Data warehouses → systems used for storing and querying large datasets. They are used by data analytics tools whose purpose is to run intensive queries over the data in a scalable manner.

In contrast to a traditional DBMS:

Example warehouse services → Google BigQuery, Amazon Redshift, and Azure Data Lake.

Google BigQuery

Essential characteristics:

BigQuery UI

Query example over a BigQuery public dataset

Access to Google BigQuery using Python

Query example

import google.cloud.bigquery as bq
client = bq.Client(project=projectId)
query = client.query('''
   SELECT count, 
          origin, org.name AS origin_name, 
          destination, dst.name AS destination_name
   FROM   `bdcc.vegas.flights` 
   JOIN   `bdcc.vegas.airports` org ON(origin=org.iata) 
   JOIN   `bdcc.vegas.airports` dst ON(destination=dst.iata)
   WHERE   org.state = 'NY'
   ORDER BY count DESC, origin, destination''')
pandas_df = query.to_dataframe()

Query example

Reference: Python Client for Google BigQuery

Access to Google BigQuery using Python (cont.)

import time
import google.cloud.bigquery as bq

client = bq.Client(project=projectId)
table = bq.Table(table_name)  # table_name: "project.dataset.table"
table.schema = (
   bq.SchemaField("movieId", "INTEGER", "REQUIRED"),
   bq.SchemaField("title",  "STRING", "REQUIRED"),
   bq.SchemaField("year", "INTEGER", "REQUIRED"),
   bq.SchemaField("imdbId", "INTEGER", "REQUIRED")
)
pandas_df = ...
load_job = client.load_table_from_dataframe(pandas_df, table)
while load_job.running():
   time.sleep(0.5) # wait for the load job to complete

Table creation example

Reference: Python Client for Google BigQuery

Google BigQuery - other features