Big Data and Cloud Computing

Module 2


EAD model



BDCC Assignment for module 2

Tips to handle EVENTS table


Classes


April 15th

    Data at scale (9:00 am recorded class), (11:30 am recorded class)
    To complement the Machine Learning recap: Nilsson's draft book - Chapter 1

    To read for the class of Wednesday, April 22nd:


April 20th

    Apache Beam (9:00 am recorded class), (11:30 am recorded class)

    Practical #1 (you will need a text file for this practical; I am using this one):
        (9:00 am recorded practical class)
        (the 11:30 am practical is included in the 11:30 am theoretical recording above)


April 22nd

    Worksheet #1: Parallel DBs vs. MapReduce (Proposed answers)
    Recap on Parallel Programming (up to slide 11) (9:00 am recorded class), (11:30 am recorded class)    
    To read:
        Software engineering for scientific big data analysis


April 27th

    Worksheet #2: Software engineering for scientific big data analysis (Proposed answers)
    ML pipeline using Apache Beam (9:00 am recorded class), (11:30 am recorded class)

    TensorFlow and Apache Beam
    Practical #2: full ML pipeline
        Colab Notebook to run Molecules ML pipeline


April 29th

    Recap on Parallel Programming (cont.) (from slide 11 to 28) (9:00 am recorded class), (11:30 am recorded class)


May 4th

    Practical #3: Multiprocessing in Python (9:00 am recorded class), (11:30 am recorded class)

    To read:
        Machine Learning in Python: Main Developments and Technology Trends in Data Science, Machine Learning, and Artificial Intelligence
        From Persistent Identifiers to Digital Objects to Make Data Science More Efficient


May 6th

    Worksheet #3: Tools for DS, ML (paper 1) + Digital Objects (paper 2) (Proposed answers)

    Recap on Parallel Programming (cont.) (from slide 29 to end) (9:00 am recorded class), (11:30 am recorded class)
    A View of Value and Veracity

    To read:
        A Scalable Data Science Workflow Approach for Big Data Bayesian Network Learning


May 11th

    Practical #4 (9:00 am recorded class), (11:30 am recorded class)


May 13th

    A View of Value and Veracity (cont.) (9:00 am recorded class), (11:30 am recorded class)


May 18th

    Practical #5 Revisiting our full ML pipeline (9:00 am recorded class), (11:30 am recorded class)
        Have you ever tried these tasks?


May 20th

    A View of Value and Veracity (cont. from slide 26 till the end) (9:00 am recorded class), (11:30 am recorded class)
    GPGPUs


May 25th

    GPGPUs (cont. from slide 14 till the end)
    Practical #6 (9:00 am recorded class), (11:30 am recorded class)



May 27th

    Worksheet #4: A Scalable Data Science Workflow Approach for Big Data Bayesian Network Learning (Proposed answers)
    Utilizing Database Architecture and Data Cleaning to Increase ‘Time to Science’ and Decrease Resources Needed in Research Data Science Workflows (to complement the slides) (work by Christine Kirkpatrick)
    atSNP Infrastructure, a Case Study for Searching Billions of Records While Providing Significant Cost Savings over Cloud Providers (to complement the slides) (work by Christopher Harrison)
    (9:00 am recorded class), (11:30 am recorded class)


Recommended Reading