Undergraduate Course: Text Technologies for Data Science (INFR11145)
Course Outline
School | School of Informatics |
College | College of Science and Engineering |
Credit level (Normal year taken) | SCQF Level 11 (Year 4 Undergraduate) |
Availability | Available to all students |
SCQF Credits | 20 |
ECTS Credits | 10 |
Summary | This course teaches the basic technologies required for text processing, focussing mainly on information retrieval and text classification. It gives a detailed overview of information retrieval and describes how search engines work. It also covers basic knowledge of the main steps for text classification.
This is a highly practical course: at least 50% of what is taught will be implemented from scratch in coursework and labs, and students are required to complete a final project in small groups. All lectures, labs, and two pieces of coursework take place in Semester 1. The final group project is due early in Semester 2, by week 3 or 4. |
Course description |
Syllabus:
* Introduction to IR and text processing, system components
* Zipf, Heaps, and other text laws
* Pre-processing: tokenization, normalisation, stemming, stopping.
* Indexing: inverted index, boolean and proximity search
* Evaluation methods and measures (e.g., precision, recall, MAP, significance testing).
* Query expansion
* IR toolkits and applications
* Ranked retrieval and learning to rank
* Text classification: feature extraction, baselines, evaluation
* Web search
|
Entry Requirements (not applicable to Visiting Students)
Pre-requisites |
|
Co-requisites | |
Prohibited Combinations | Students MUST NOT also be taking Text Technologies for Data Science (UG) (INFR11229) |
Other requirements | MSc students must register for this course, while Undergraduate students must register for INFR11229 instead.
Maths requirements:
1. Linear algebra: Strong knowledge of vectors and matrices with all related mathematical operations (addition, multiplication, inverse, projections, etc.).
2. Probability theory: Discrete and continuous univariate random variables. Bayes rule. Expectation, variance. Univariate Gaussian distribution.
3. Calculus: Functions of several variables. Partial differentiation. Multivariate maxima and minima.
4. Special functions: Log, Exp, Ln.
Programming requirements:
1. Python and/or Perl, and a good knowledge of regular expressions
2. Shell commands (cat, sort, grep, sed, ...)
3. An additional programming language could be useful for the course project.
Team-work requirement:
The final course project is done in groups of 4-6 students. Working in a team on the project is a requirement. |
Information for Visiting Students
Pre-requisites | As above. No part time visiting students permitted. |
High Demand Course? | Yes |
Course Delivery Information
|
Academic year 2024/25, Available to all students (SV1)
|
Quota: None |
Course Start | Full Year |
Course Start Date | 16/09/2024 |
Timetable |
Learning and Teaching activities (Further Info) |
Total Hours: 200 (Lecture Hours 18, Supervised Practical/Workshop/Studio Hours 12, Summative Assessment Hours 2, Programme Level Learning and Teaching Hours 4, Directed Learning and Independent Learning Hours 164)
|
Assessment (Further Info) |
Written Exam 30%, Coursework 70%, Practical Exam 0%
|
Additional Information (Assessment) |
Exam 30%
Coursework 70%
Course Work 1 (10%): individual work covering the implementation of a basic search engine
Course Work 2 (20%): individual work covering IR evaluation and web search
Course Work 3 (40%): a group project, with 4-6 members per group
All of the coursework is heavy on system implementation, and thus being familiar with programming and software engineering is a pre-requisite. Python is required for implementation of Course Work 1 and Course Work 2. For Course Work 3, students are free to use the implementation language they prefer. |
Feedback | Not entered |
Exam Information |
Exam Diet | Paper Name | Hours & Minutes |
Main Exam Diet S2 (April/May) | Text Technologies for Data Science (INFR11145) | 2:120 |
Learning Outcomes
On completion of this course, the student will be able to:
- build basic search engines from scratch, and use IR tools for searching massive collections of text documents
- build feature extraction modules for text classification
- implement evaluation scripts for IR and text classification
- understand how web search engines (such as Google) work
- work effectively in a team to produce working systems
|
Reading List
"Introduction to Information Retrieval", C.D. Manning, P. Raghavan and H. Schütze
"Search Engines: Information Retrieval in Practice", W. Bruce Croft, Donald Metzler, Trevor Strohman
"Machine Learning in Automated Text Categorization", F. Sebastiani
"The Zipf Mystery"
Additional research papers and videos to be recommended during lectures |
Contacts
Course organiser | Dr Walid Magdy
Tel: (0131 6)51 5612
Email: wmagdy@inf.ed.ac.uk |
Course secretary | Miss Yesica Marco Azorin
Tel: (0131 6)50 5194
Email: ymarcoa@ed.ac.uk |
|
|