Big Data and Cloud Computing (CC4053)
We provide an overview of the main storage models used with cloud computing, and example cloud service offerings including:
Reference → most of the topics in these slides are also covered in chapters 2 and 3 of Cloud Computing for Science and Engineering.
Characteristics:
Advantage → most programs are designed to work with files, and use the file system hierarchy (directories) as a natural means to organize information - files/directories have no associated “data schema” though.
Disadvantages → shared access by multiple hosts is not possible, and the volume of data is bounded by the file system capacity. Distributed file systems, discussed next, can handle shared access and mitigate volume restrictions.
Example: configuration of a VM boot disk in Compute Engine
Characteristics:
Advantages → data sharing by multiple hosts, and typically higher storage capacity (installed at server hosts)
Example cloud services → NFS: Amazon EFS, Google Filestore, Azure Files (Azure Files also supports SMB).
Beyond shared storage, a note on parallel file systems → file systems such as Lustre or HDFS are designed for parallel computing using computer clusters. We will talk about HDFS later in the semester.
NFS client-server interaction
Image source: “Cloud Computing: Theory and Practice 2nd ed.” by D. C. Marinescu
Example: NFS configuration using Google Filestore
An object store, also known as a bucket (in AWS and GCP), is a container for unstructured binary objects. It can be accessed over the Internet without any logical association to a specific storage facility. An object in a bucket is identified by a URI such as gs://bucket_name/object_name in Google Cloud Storage or s3://bucket_name/object_name in Amazon S3; objects may also have path-like names, e.g., gs://bucket_name/path/to/object.
The logical simplicity of the bucket model allows easy replication with no synchronization constraints, and flexible/scalable use in diverse use cases (cloud apps, content delivery, backup, etc.). Example: public data sets of Landsat/Sentinel satellite imagery are available as buckets, and used by Google Earth Engine.
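As a small illustration (not part of any cloud SDK — the helper name is our own), such a URI can be split into its scheme, bucket, and object components:

```python
def parse_object_uri(uri):
    """Split an object-store URI like gs://bucket/path/to/object
    into (scheme, bucket_name, object_name)."""
    scheme, rest = uri.split("://", 1)
    # everything after the first '/' is the object name
    bucket_name, _, object_name = rest.partition("/")
    return scheme, bucket_name, object_name

print(parse_object_uri("gs://bucket_name/path/to/object"))
# ('gs', 'bucket_name', 'path/to/object')
```

Note that the "path" is purely part of the object name: buckets have no real directory hierarchy.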
Example from Google Cloud Platform
Basic parameters → a globally unique name, region placement, and storage class.
Parameters:
Billing:
From Cloud Storage pricing, Feb. 2021
# Imports the Google Cloud client library
from google.cloud import storage
# Instantiates a client
storage_client = storage.Client()
# The name for the new bucket
bucket_name = "my-new-bucket"
# Creates the new bucket
bucket = storage_client.create_bucket(bucket_name)
print("Bucket {} created.".format(bucket.name))
Reference: google-cloud-storage library
from google.cloud import storage

def download_blob(bucket_name, source_blob_name, destination_file_name):
    """Downloads a blob from the bucket."""
    # bucket_name = "your-bucket-name"
    # source_blob_name = "storage-object-name"
    # destination_file_name = "local/path/to/file"
    storage_client = storage.Client()
    bucket = storage_client.bucket(bucket_name)
    blob = bucket.blob(source_blob_name)
    blob.download_to_filename(destination_file_name)
Reference: google-cloud-storage library
Database → structured collection of data concerning real-world entities and their relationships. The structure of the data is often described by a data schema (also called a data model).
Database management system (DBMS) → software system responsible for managing data schemas and the associated data, providing an application interface using a query language, supporting concurrent transactions, and handling crash recovery.
Types of databases:
-- Table creation
CREATE TABLE CUSTOMER(
  Id INT PRIMARY KEY AUTO_INCREMENT,
  Login VARCHAR(16) UNIQUE NOT NULL,
  Name VARCHAR(64) NOT NULL,
  Phone INT NOT NULL,
  Country VARCHAR(64) NOT NULL );
-- Data changes
INSERT INTO CUSTOMER(Login,Name,Phone,Country)
VALUES('bdcc','Alonsus Big', 9299334, 'Portugal');
UPDATE CUSTOMER SET Phone = 93949444 WHERE Login='bdcc';
DELETE FROM CUSTOMER WHERE Country = 'Spain';
-- Queries
SELECT Login,Country FROM CUSTOMER WHERE Id=1234;
SELECT * FROM CUSTOMER
WHERE Country = 'China' AND Phone IS NOT NULL
ORDER BY Login;
General-purpose and popular DBMSs like MySQL, PostgreSQL, or Microsoft SQL Server are offered through DBaaS (Database as a Service) solutions such as Google Cloud SQL, Amazon RDS, or Azure SQL.
Other offerings are oriented towards scalable operation in cloud computing, e.g., Google Spanner or Amazon Aurora.
Transaction = sequence of database operations
ACID guarantees are often conflicting, especially if a database must scale to accommodate large volumes of data or concurrent accesses, and/or provide high-performance access to data.
For example, the SQL standard defines several “isolation levels” related to relaxations of the isolation requirement, which in turn may compromise consistency and atomicity.
Two transactions involving money transfers between bank accounts: 1 → 2 with a value of 100 (left) and 1 → 3 with a value of 200 (right). If they are not properly isolated, we may find, for example, that the balance of account 1 is decremented by only 200 rather than 300.
Concurrent transactions must appear as if they executed in sequence, i.e., the DBMS must ensure they are serializable.
In the example above with 4 transactions, where T2 and T3 are concurrent, the logical effect should be as if we executed the transactions in one of the following orders: T1 → T2 → T3 → T4 or T1 → T3 → T2 → T4.
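Atomicity and isolation can be illustrated with Python's built-in sqlite3 module (a minimal sketch; the account table and the no-overdraft rule are our own example, not from the slides). A failed transfer is rolled back, leaving balances untouched:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account(id INTEGER PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO account VALUES (1, 500), (2, 0)")
conn.commit()

def transfer(src, dst, amount):
    """Move `amount` between accounts atomically:
    both updates commit together, or neither does."""
    try:
        with conn:  # opens a transaction; commits on success, rolls back on error
            conn.execute("UPDATE account SET balance = balance - ? WHERE id = ?",
                         (amount, src))
            conn.execute("UPDATE account SET balance = balance + ? WHERE id = ?",
                         (amount, dst))
            # enforce a no-overdraft rule inside the transaction
            (bal,) = conn.execute("SELECT balance FROM account WHERE id = ?",
                                  (src,)).fetchone()
            if bal < 0:
                raise ValueError("insufficient funds")
    except ValueError:
        pass  # the 'with' block already rolled both updates back

transfer(1, 2, 100)   # succeeds
transfer(1, 2, 1000)  # fails and is rolled back
print(conn.execute("SELECT balance FROM account ORDER BY id").fetchall())
# [(400,), (100,)]
```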
Image source: Overview of Oracle sharding, Oracle corp.
To cope with scalable operation for large amounts of data and accesses, distributed databases employ techniques such as sharding (data is partitioned across different hosts, illustrated above) or replication (data is mirrored across different hosts).
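The core idea of sharding can be sketched in a few lines (an illustrative toy, not how Oracle or any particular DBMS implements it): a stable hash of the key deterministically selects the host that stores the record.

```python
import hashlib

NUM_SHARDS = 4  # example value; one shard per host

def shard_for(key):
    """Deterministically map a key to one of NUM_SHARDS partitions.
    A stable hash is used because Python's built-in hash() is salted
    per process and would route keys differently on each run."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_SHARDS

# the same key always lands on the same shard,
# so reads can be routed to the right host
assert shard_for("customer:42") == shard_for("customer:42")
print({k: shard_for(k) for k in ["a", "b", "c", "d"]})
```

Real systems typically use consistent hashing or range partitioning instead of a plain modulo, so that adding a shard does not remap most keys.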
Distributed operation presents new challenges, though …
CAP Theorem (also known as Brewer’s theorem) → an influential theoretical result for distributed databases that states that a system cannot simultaneously guarantee all three of the following properties:
CAP? Choose two! Database systems fall under one of the CA, CP, or AP categories.
Key-value stores define two abstract operations: put(k,v), which stores value v in association with key k, and get(k), which retrieves the value associated with key k. The types of keys and values are often arbitrary, i.e., treated as byte sequences by the storage system.
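This interface is small enough to sketch in full (a toy in-memory version, purely for illustration):

```python
class KeyValueStore:
    """Toy in-memory key-value store exposing the two abstract
    operations; keys and values are opaque byte strings."""
    def __init__(self):
        self._data = {}

    def put(self, k, v):
        """Store value v in association with key k."""
        self._data[bytes(k)] = bytes(v)

    def get(self, k):
        """Return the value associated with key k, or None."""
        return self._data.get(bytes(k))

kv = KeyValueStore()
kv.put(b"user:1", b"alice")
print(kv.get(b"user:1"))  # b'alice'
print(kv.get(b"user:2"))  # None
```

Production key-value stores add what this sketch lacks: persistence, replication, sharding, and concurrent access.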
Document stores are specialised key-value stores that operate on data, called documents, that have a “self-describing” hierarchical structure (e.g., as in XML, JSON). DB operations can be formulated in terms of the “self-describing” schema, sometimes using SQL-like languages. Examples → MongoDB, Google Firestore.
Graph databases organize data in terms of graphs, such that nodes represent entities and edges represent relationships between entities. Example → Neo4j.
Multi-paradigm NoSQL databases → ArangoDB and Azure Cosmos are examples that support key, document, or graph-based operations.
We only cover some examples of key-value stores next.
BigTable from Google is a key-value cloud storage system for operating on large volumes of data. It underlies several Google services, like Google Earth and Google Analytics, and in fact other NoSQL databases like Google Firestore.
BigTable is a wide-column store, a special type of key-value store.
As in relational databases, BigTable supports the concepts of table, row, and column, but with significant differences. Table = { rows }, however:
Image source: “Cloud Computing: Theory and Practice 2nd ed.” by D. C. Marinescu
from google.cloud import bigtable

# Admin client: allows table and column-family management
clientbt = bigtable.Client(admin=True)
instance = clientbt.instance('cloud-book-instance')
table = instance.table('book-table2')
table.create()
# Create a column family to hold the cells
column_family = table.column_family('cf')
column_family.create()
# Write a row: several cells under the 'cf' family
row = table.row('key_value')
row.set_cell('cf', 'experiment', 'exp1')
row.set_cell('cf', 'date', '6/6/16')
row.set_cell('cf', 'link', 'http://some_location')
row.commit()
# Read the row back by its key
row_data = table.read_row('key_value')
cells = row_data.cells['cf']
Reference: Python Client for Google Cloud Bigtable
from pymemcache.client.base import Client

# Connect to a memcached server (default port 11211)
mc = Client('my-memcached-server.com')
mc.set('foo', 'bar')    # store a value
result = mc.get('foo')  # retrieve it

import redis

# Connect to a local Redis server (database 0)
r = redis.Redis(host='localhost', port=6379, db=0)
r.set('foo', 'bar')
data = r.get('foo')
Memcached and Redis are two popular distributed in-memory key-value stores. (Redis also supports persistence and a number of other features beyond the raw key-value store functionality).
A typical use case → act as a cache for frequently accessed data in another database system. Both can be deployed using the Google Cloud Memorystore service.
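The cache-aside pattern behind this use case can be sketched as follows. To keep the example self-contained and runnable, a plain dict stands in for the Memcached/Redis client, and the database lookup is simulated; the equivalent Redis calls are noted in comments.

```python
cache = {}  # stand-in for a Memcached/Redis client

def query_database(user_id):
    """Simulated slow database lookup (hypothetical data)."""
    return {"id": user_id, "name": "user-%d" % user_id}

def get_user(user_id):
    """Cache-aside pattern: try the cache first, fall back to the
    database on a miss, then populate the cache for future reads."""
    key = "user:%d" % user_id
    value = cache.get(key)        # with Redis: r.get(key)
    if value is None:
        value = query_database(user_id)
        cache[key] = value        # with Redis: r.set(key, value, ex=ttl)
    return value

first = get_user(1)   # miss: hits the database, fills the cache
second = get_user(1)  # hit: served from the cache
print(first == second)  # True
```

With a real cache, an expiration time (TTL) bounds staleness when the underlying database changes.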
Image source: Deploying a Highly Available Distributed Caching Layer on Oracle Cloud Infrastructure using Memcached & Redis, Oracle white paper, 2018.
It depends on your use case. The following fragment from the Google BigTable documentation offers some advice:
“Cloud Bigtable is not a relational database. It does not support SQL queries, joins, or multi-row transactions.”
“For in-memory data storage with low latency, consider Memorystore.”
Data warehouses → systems used for storing and querying large datasets. They are used by data analytics tools whose purpose is to run intensive queries over the data in a scalable manner.
In contrast to DBMSs:
Example warehouse services → Google BigQuery, Amazon Redshift, and Azure Data Lake.
Essential characteristics:
Query example over a BigQuery public dataset
Query example
import google.cloud.bigquery as bq
client = bq.Client(project=projectId)
query = client.query('''
SELECT count,
origin, org.name AS origin_name,
destination, dst.name AS destination_name
FROM `bdcc.vegas.flights`
JOIN `bdcc.vegas.airports` org ON(origin=org.iata)
JOIN `bdcc.vegas.airports` dst ON(destination=dst.iata)
WHERE org.state = 'NY'
ORDER BY count DESC, origin, destination''')
pandas_df = query.to_dataframe()
Reference: Python Client for Google BigQuery
import time
import google.cloud.bigquery as bq

client = bq.Client(project=projectId)
table = bq.Table(table_name)
table.schema = (
    bq.SchemaField("movieId", "INTEGER", "REQUIRED"),
    bq.SchemaField("title", "STRING", "REQUIRED"),
    bq.SchemaField("year", "INTEGER", "REQUIRED"),
    bq.SchemaField("imdbId", "INTEGER", "REQUIRED")
)
pandas_df = ...
load_job = client.load_table_from_dataframe(pandas_df, table)
while load_job.running():
    time.sleep(0.5)  # wait for the load job to finish
Table creation example
Reference: Python Client for Google BigQuery