Scaling up Data Science

Code	School	Level	Credits	Semesters
DATA3002	Computer Science	3	N/A	Full Year UK

Summary

“Big Data” refers to data whose volume, diversity and complexity require new technologies, algorithms and analyses to extract valuable knowledge, because it exceeds the normal processing capabilities of a single computer. The field of Big Data has many facets, such as databases, security and privacy, visualisation, computational infrastructure and data analytics/mining, some of which you will already have studied. This teaching block is about scaling those up, making use of multiple machines, possibly in the cloud, to produce data science solutions.

This module covers the following concepts:

1. Introduction to Big Data: the main principles behind distributed/parallel systems with data-intensive applications, and the key challenges of capturing, storing, searching, analysing and visualising the data.

2. Big Data frameworks and how to deal with big data: this includes the MapReduce programming model, as well as an overview of recent technologies (the Hadoop ecosystem and Apache Spark). You will then learn how to interact with the latest APIs of Apache Spark (RDDs, DataFrames and Datasets) to create distributed programs capable of dealing with big datasets (using Python and/or Scala).

3. Finally, we will dive into the data mining and machine learning part of the course, including data preprocessing approaches (to obtain quality data), distributed machine learning algorithms and data stream algorithms. To do so, you will use the machine learning library of Apache Spark (MLlib) to understand how some machine learning algorithms (e.g. Decision Trees, Random Forests, k-means) can be deployed at scale.
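
As an illustration of the MapReduce programming model named in item 2, the following is a minimal single-machine sketch in plain Python. It is not taken from the module materials, and it does not use Hadoop or Spark: the function names (map_phase, shuffle, reduce_phase, word_count) and the sample data are hypothetical, and each inner list merely stands in for a block of data that a real framework would hold on a separate machine.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def word_count(partitions):
    # Each partition is a hypothetical stand-in for data stored on one machine.
    mapped = chain.from_iterable(
        map_phase(line) for partition in partitions for line in partition
    )
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


partitions = [
    ["big data needs big clusters"],
    ["spark processes big data"],
]
counts = word_count(partitions)
print(counts)
```

In a distributed setting the map and reduce steps run in parallel on different machines and the shuffle moves data across the network, which is typically where much of the cost lies; this sketch only shows the shape of the computation.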

Learners should consider whether there are relevant problems in their workplace that require high-performance computing and related techniques to address Big Data problems (joining up with the on-the-job activity).

Target Students

Only available to those studying towards the Data Scientist Degree apprenticeship programme.

Classes

11 x 3 hours of distance-learning video content and lab resources, supported by ad-hoc drop-ins. 1 x 6-hour practical block release session.

Assessment

100% Assignment

Assessed by end of designated period

Educational Aims

This block aims to introduce the concepts required to deliver data science projects at scale, tackling problems which cannot be solved on a single computer. Learners will understand how to do this from a practical point of view as well as understanding the limitations of such approaches.

Learning Outcomes

Teaching Goal 1

Understand the principles that allow the processing of big data sets.

Teaching Goal 2

Understand the limitations of big data technologies for distributed processing.

Teaching Goal 3

Demonstrate practical skills required to implement big-data solutions using modern large-scale data and compute infrastructures.

Teaching Goal 4

Design and implement a data science software system that is efficient (in terms of cost and time), using techniques, tools and computational resources appropriate to the type and nature of the data being processed.

KSBs

K3. How data can be used systematically, through an awareness of key platforms for data and analysis in an organisation, including:

Data processing and storage, including on-premise and cloud technologies.
Database systems including relational, data warehousing & online analytical processing, “NoSQL” and real-time approaches; the pros and cons of each approach.
Data-driven decision making and the good use of evidence and analytics in making choices and decisions.

K4. How to design, implement and optimise analytical algorithms – as prototypes and at production scale – using:

Statistical and mathematical models and methods.
Advanced and predictive analytics, machine learning and artificial intelligence techniques, simulations, optimisation, and automation.
Applications such as computer vision and Natural Language Processing.
An awareness of the computing and organisational resource constraints and trade-offs involved in selecting models, algorithms and tools.
Development standards, including programming practice, testing, source control.

K5. The data landscape: how to critically analyse, interpret and evaluate complex information from diverse datasets:

Sources of data including but not exclusive to files, operational systems, databases, web services, open data, government data, news and social media.
Data formats, structures and data delivery methods including “unstructured” data.
Common patterns in real-world data.

S1. Identify and clarify problems an organisation faces, and reformulate them into Data Science problems. Devise solutions and make decisions in context by seeking feedback from stakeholders. Apply scientific methods through experiment design, measurement, hypothesis testing and delivery of results. Collaborate with colleagues to gather requirements.

S2. Perform data engineering: create and handle datasets for analysis. Use tools and techniques to source, access, explore, profile, pipeline, combine, transform and store data, and apply governance (quality control, security, privacy) to data.

S3. Identify and use an appropriate range of programming languages and tools for data manipulation, analysis, visualisation, and system integration. Select appropriate data structures and algorithms for the problem. Develop reproducible analysis and robust code, working in accordance with software development standards, including security, accessibility, code quality and version control.

S4. Use analysis and models to inform and improve organisational outcomes, building models and validating results with statistical testing: perform statistical analysis, correlation vs causation, feature selection and engineering, machine learning, optimisation, and simulations, using the appropriate techniques for the problem.

S5. Implement data solutions, using relevant software engineering architectures and design patterns. Evaluate Cloud vs. on-premise deployment. Determine the implicit and explicit value of data. Assess value for money and Return on Investment. Scale a system up/out. Evaluate emerging trends and new approaches. Compare the pros and cons of software applications and techniques.

S7. Develop and maintain collaborative relationships at strategic and operational levels, using methods of organisational empathy (human, organisation and technical) and build relationships through active listening and trust development.

S8. Use project delivery techniques and tools appropriate to their Data Science project and organisation. Plan, organise and manage resources to successfully run a small Data Science project, achieve organisational goals and enable effective change.

B1. An inquisitive approach: the curiosity to explore new questions, opportunities, data, and techniques; tenacity to improve methods and maximise insights; and relentless creativity in their approach to solutions.

B2. Empathy and positive engagement to enable working and collaborating in multi-disciplinary teams, championing and highlighting ethics and diversity in data work.

B3. Adaptability and dynamism when responding to varied tasks and organisational timescales, and pragmatism in the face of real-world scenarios.

B5. An impartial, scientific, hypothesis-driven approach to work, rigorous data analysis methods, and integrity in presenting data and conclusions in a truthful and appropriate manner.

Conveners

Dr Simon Kent

Last updated 07/01/2025.