Undergraduate Course: Programming for Data Science at Scale (INFR11255)
Course Outline
School | School of Informatics |
College | College of Science and Engineering |
Credit level (Normal year taken) | SCQF Level 11 (Year 4 Undergraduate) |
Availability | Available to all students |
SCQF Credits | 10 |
ECTS Credits | 5 |
Summary | The Programming for Data Science at Scale course will utilise the paradigms of programming at large scale to equip students with the practical skills required to leverage large-scale computational resources across a distributed cluster of computers. |
Course description |
Delivery Method:
The course will be delivered through a combination of: (1) live lectures, (2) practical labs, (3) tutorials, and (4) an online discussion forum (Piazza).
Content/Syllabus:
The course will vary slightly from year to year, but will include many of the following topics:
- Introduction to large-scale data processing
- Data-parallel programming: functional collections
- Distributed Data-parallel programming
- Distributed Key-value processing
- Optimizing distributed data processing: Shuffling and partitioning
- Distributed Query processing
- Distributed Graph processing
- Distributed Tensor processing
As this is a practical course covering a large number of topics from separate areas, it is assessed by coursework only. For proper evaluation, students must be presented with real problems rather than "toy" ones that can be solved in a very limited time. The evaluation is based on the following components:
1) Quizzes - learning outcomes 1, 2, 3.
2) Programming assignment - learning outcome 2.
3) Design assignment - learning outcomes 1, 3, 4.
|
Entry Requirements (not applicable to Visiting Students)
Pre-requisites |
|
Co-requisites | |
Prohibited Combinations | |
Other requirements | None |
Information for Visiting Students
Pre-requisites | The nature of this course means that assessment is only possible while the course is running. Any students entitled to a resit (e.g., visiting students, resits for professional purposes, ordinary degree students, or students with null sits) would need to retake the course in the following academic year. |
High Demand Course? |
Yes |
Course Delivery Information
|
Academic year 2024/25, Available to all students (SV1)
|
Quota: None |
Course Start |
Semester 1 |
Timetable |
Learning and Teaching activities (Further Info) |
Total Hours: 100 (Lecture Hours 18, Seminar/Tutorial Hours 4, Supervised Practical/Workshop/Studio Hours 4, Feedback/Feedforward Hours 2, Programme Level Learning and Teaching Hours 2, Directed Learning and Independent Learning Hours 70)
|
Assessment (Further Info) |
Written Exam 0 %, Coursework 100 %, Practical Exam 0 %
|
Feedback |
The feedback provided to students will take several forms: (1) Q&A on the online forum, (2) self-feedback from the auto-graded quizzes and programming assignment, (3) collective feedback on the programming assignment, and (4) feed-forward during the tutorial session for the design assignment. |
No Exam Information |
Learning Outcomes
On completion of this course, the student will be able to:
- demonstrate an understanding of the concepts behind different large-scale programming models and their associated data models.
- construct and justify a formulation in terms of a programming model for a given problem and implement that formulation on top of an existing framework.
- identify how to decompose large problems into sub-problems and compose the results by applying appropriate programming models.
- present implementations and engage in professional dialogue with peers to identify and adapt those implementations to better meet requirements.
|
Reading List
The course will be self-contained with no required books.
A list of resources:
- Dean, Jeffrey, and Sanjay Ghemawat. "MapReduce: simplified data processing on large clusters." Communications of the ACM 51.1 (2008): 107-113.
- Zaharia, Matei, et al. "Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing." 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI 12). 2012.
- Armbrust, Michael, et al. "Spark SQL: Relational data processing in Spark." Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. 2015.
- Malewicz, Grzegorz, et al. "Pregel: a system for large-scale graph processing." Proceedings of the 2010 ACM SIGMOD International Conference on Management of data. 2010. |
Additional Information
Course URL |
https://opencourse.inf.ed.ac.uk/pdss |
Graduate Attributes and Skills |
Knowledge integration: This course will help students understand different types of data and programming models.
Problem solving: The students will develop their problem-solving skills by formulating a given problem in terms of an appropriate programming model.
Applying and critiquing: The students will learn how to implement a formulation of a given problem on top of an existing framework.
Critical and analytical thinking: The students will learn to identify how to break down large-scale problems into sub-problems and compose the results using the appropriate model. They will also gain experience in profiling and tuning an existing implementation to better meet requirements. |
Keywords | Large-scale programming, Functional programming, Query processing, Graph processing, Tensor processing |
Contacts
Course organiser | Dr Amir Shaikhha
Tel: (0131 6)50 4379
Email: amir.shaikhha@ed.ac.uk |
Course secretary | Miss Yesica Marco Azorin
Tel: (0131 6)50 5194
Email: ymarcoa@ed.ac.uk |