Big Data
| Code | School | Level | Credits | Semesters |
| --- | --- | --- | --- | --- |
| COMP4107 | School of Computer Science | 4 | 10 | Spring China |
Summary
Prerequisites: COMP3055 Machine Learning
“Big Data” refers to data whose volume, diversity and complexity require new technologies, algorithms and analyses to extract valuable knowledge, because they exceed the normal processing capabilities of a single computer. The field of Big Data has many facets, such as databases, security and privacy, visualisation, computational infrastructure and data analytics/mining. This module covers the following topics:
1. Introduction to Big Data: introducing the main principles behind distributed/parallel systems with data-intensive applications, and identifying the key challenges of capturing, storing, searching, analysing and visualising data.
2. SQL Databases vs. NoSQL Databases: understanding the growing volume of data; relational database management systems (RDBMS); an overview of structured query languages (e.g. SQL); an introduction to NoSQL databases; understanding the difference between a relational DBMS and a NoSQL database; identifying when a NoSQL database is needed.
3. Big Data frameworks and how to deal with big data: this includes the MapReduce programming model, as well as an overview of recent technologies (the Hadoop ecosystem and Apache Spark). You will then learn how to interact with the latest APIs of Apache Spark (RDDs, DataFrames and Datasets) to create distributed programs capable of dealing with big datasets (using Python and/or Scala).
4. Finally, we will dive into the data mining and machine learning part of the course, including data preprocessing approaches (to obtain quality data), distributed machine learning algorithms and data stream algorithms. To do so, you will use the machine learning library of Apache Spark (MLlib) to understand how some machine learning algorithms (e.g. Decision Trees, Random Forests, k-means) can be deployed at scale.
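As a rough illustration of the MapReduce programming model named in topic 3, the sketch below counts words in plain Python on a single machine. It is not module material and does not use Hadoop or Spark; the function names (map_phase, shuffle, reduce_phase, word_count) and the sample input are hypothetical, chosen only to mirror the three conceptual stages of the model.

```python
from collections import defaultdict
from functools import reduce


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key (a framework would do this across nodes)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all counts for one word into a single total."""
    return key, reduce(lambda a, b: a + b, values)


def word_count(lines):
    """Run the three stages in sequence on an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


# Hypothetical sample input, for illustration only.
counts = word_count(["big data needs big tools", "data tools"])
print(counts)
```

In a distributed framework the map and reduce functions would run in parallel on partitions of the data, with the shuffle step moving intermediate pairs between machines; here everything runs sequentially to keep the structure visible.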
Target Students
Part II and III undergraduate students and MSc students in the School of Computer Science. This module is part of the AI, Modelling and Optimisation theme and the Operating systems and Networks theme in CS. Available to JYA/Erasmus students.
Classes
- One 2-hour lecture each week for 12 weeks
- One 1-hour computing session each week for 12 weeks
Activities may take place every teaching week of the Semester or only in specified weeks. It is usually specified above if an activity only takes place in some weeks of a Semester.
Assessment
- 50% Group project: Programming project in groups
- 50% Exam (2-hour): Two-hour written exam
Assessed by the end of the spring semester
Learning Outcomes
Knowledge and Understanding:
- Understand the role and importance of data, information and knowledge.
- Understand the principles of big data storage, retrieval and processing.
- Learn to use the key tools and new technologies of the big data ecosystem.
- Learn to define and achieve high quality big data sets.
- Understand the features of large-scale data computing frameworks.
- Learn the principles of machine learning algorithms capable of handling big data.
- Understand cloud computing components to provide big data services.
Intellectual Skills:
- Understand complex ideas and relate them to specific problems or questions in the area of big data.
- Be able to identify distributed solutions/approaches to handle big datasets with existing technologies.
Professional/Practical Skills:
- Hands-on experience with state-of-the-art technologies to handle big data.
Transferable/Key Skills:
- Experience in problem solving.
- Experience in working in groups.
- Retrieve information from appropriate sources (e.g. Spark API).
- Experience in working with Big Data ecosystems (e.g. Hadoop).
Conveners
- Dr Saeid Pourroostaei Ardakan