Natural Language Processing
CS4740 Fall 2023
(LING4744/COGST4740/CS5740)
We've ordered the links below so that what you will check most often throughout the semester comes first. Hence, information unlikely to change during the semester, such as policies, appears later in the list.
The course schedule, with links to lecture material (slides, recordings, readings) and assignment materials.
Announcements are in the lecture slides or, when timely broadcasting to the class is needed, on Ed Stem (Discussions). Please set your Ed notifications appropriately.
Need to reach us? Consult our office hours and staff contact info.
Resources
- Online (and free) textbook: Jurafsky and Martin, Speech and Language Processing, 3rd edition (draft)
- NLP at Cornell, including related courses
- Platforms used in the class: some notes
- CMS and CMS usage guide (written for another class, but useful)
- Gradescope: CS4740+5740 home; site for the additional 5740 assignments; guide to grouping on Gradescope: it's different than on CMS.
- Colab [site]: guide and hints
Policies
Note: We reserve the right to make necessary changes to any policy on this page if it would jeopardize the smooth running of the course to leave it as written. We aim to avoid making alterations, and will try to be as transparent as possible about key changes, e.g., by posting to Ed Discussions.
- Collaboration and preserving academic integrity
- Groups (a.k.a. "teams" or "partners") of two are allowed on all assignments except for CS5740 add-on assignments. You can partner with anyone in the class (regardless of whether they or you are registered for a grad or undergrad version of the class), but we strongly suggest that you let potential partners know (1) how much you intend to commit to teamwork (one case in which this comes up is if one partner is taking the course S/U and the other for a letter grade), and (2) what your preferred working hours are (morning, afternoon, night). You do not have to have the same partner on each assignment.
If you want to use git with your partner, you should use the Cornell COECIS GitHub, which allows you to create a private repository. Other versions of GitHub may make your private repository public without your knowledge.
If there is a need for a "group divorce" (some work was done jointly but the two of you no longer wish to work together), please contact intro-nlp-prof@cornell.edu for further instructions.
- Below, "you" means you and, if there is one, your official group partner.
Until all students' submissions' grades for the assignment have been posted (in case there are people with extensions and makeups) ...
(1) You must never look at, access or possess any portion of another group's program(s) in any form. (This includes lines of code written on a whiteboard or described verbally.)
(2) You must never show or share any portion of your program(s) in any form to anyone except a member of the course staff. As a consequence, do not post any part of your programs to Ed Discussions. (Posting error messages that contain snippets of code is OK.)
(3) You must not ask for or copy solutions from outside sources (such as StackOverflow or code autogenerators).
(4) You should specifically acknowledge by name all help you received, whether or not it was "legal" according to rules (1)-(3) above. This is also known as "citing your sources". Exception: you do not need to acknowledge the course staff (although we appreciate it if you do!).
Example: in an assignment file, the header could read "Sources/people consulted: discussed strategy for process_strings() with Claire Cardie and Hakim Weatherspoon".
Of particular note:
- The minimum penalty in this course for receiving unauthorized help (upon a guilty finding for an academic integrity violation): besides the mandated letter to the student's college, a negative score on the affected work (this is more than just a grade deduction, where one might retain some points). Hence, a student who submits fraudulent work receives less credit than a student who didn't turn the work in at all.
- The minimum penalty for giving unauthorized help (upon a guilty finding for an academic integrity violation): the mandated letter to the student's college. Please don't put your friends at risk by asking them for unauthorized help.
- We plan to use software-similarity checkers for each assignment.
If you turn in someone else's work for course credit, and forthrightly acknowledge you are doing so, you are not acting dishonestly and are not violating academic integrity, but that also does not show us you have learned anything. Thus, you may not receive grading credit, but you would not undergo academic integrity hearings. If, on the other hand, you violate academic integrity by claiming someone else's work as yours or by giving unauthorized help, then the academic integrity hearing process will be triggered, which can incur both grade penalties and storage of records by your College. For more on Cornell's policies and procedures, see the Dean of the Faculty's Academic Integrity Website.
- Use of generative AI (e.g., ChatGPT or Copilot)
Motivation for our rules: Imagine your boss walks into your office and asks you to code/fix a tagger right now. From what you have learned in this class, you should be able to fluidly respond right in front of your boss in a way demonstrating your mastery of the appropriate concepts.
- Do not use these technologies for your first or early drafts of code or responses, even if you intend to extensively post-edit later. (The issue: someone else's code "always looks right.")
- Advice: for your learning, strive to do the work without these technologies for as far along as possible.
- If you do use such technologies,
- do not use the output verbatim, but code/write it up "in your own words", to demonstrate your understanding of it.
- precisely specify what portions of the code you used it for, per the Collaboration and preserving academic integrity policy 5.1.2(4): "You should specifically acknowledge by name all help you received, whether or not it was 'legal'".
- Deadlines when SDS accommodations are not involved:
We do not have slip days, and there is no "you can submit late for a small penalty": you need to hit the submission deadlines. But if there are extenuating circumstances, please email intro-nlp-prof@cornell.edu and we can talk. (Still submit what you have before the deadline, so we have an indication of your progress at that point.)
- Exam conflicts:
Roughly two weeks before the exams, we will open a "pseudo-quiz" for submitting make-up exam requests. We cannot determine the dates/locations too far ahead of time, in part because there are many classes with currently-unknown demand that are looking to reserve rooms.
- Accommodations registered with the office of Student Disability Services (SDS):
The instructor(s) have online access to SDS letters regarding accommodations for exams and other course matters, and will honor these accommodations. But please help us help you by following the instructions that SDS provides.
Assignments: As recommended by the SDS office, we do ask that for each homework you let us know beforehand in a timely fashion if you wish to apply your SDS-approved accommodations: email intro-nlp-prof@cornell.edu. No need to notify us if you will not use your accommodations for a particular assignment.
Exams: Our course is part of the testing program organized by SDS. There is a good process, but it is involved: here are the details of the so-called "Alternate Testing Program" as applied to this class.
- Workload/grading:
- 4 programming assignments (with possible partial milestones) (can be done in pairs) = 17% each;
- Expect each of the roughly four connected programming assignments to take tens of hours, although this time is distributed over multiple weeks; to require writing code to massage raw-ish data into different formats and other accessory functions as well as to implement core algorithms; and to necessitate much independent examination of documentation.
- one evening midterm, Thu Oct 12, 2023 = 16%;
- one in-person final = 16%;
- but, since the exams test conceptual, individual-level knowledge, to receive a C- or above in the course, students must receive at least a C- on both exams.
- For each "4740" piece of graded coursework, we make preliminary score-to-letter-grade conversions. We do not use the same absolute cutoffs across different assignments/exams for grade levels because we adjust for the difficulty of each exam and assignment. Also, because you are not in competition with each other, student course grades are not dependent on how other students do. We do not report medians or means: as a wise former course-staff member said, "Reporting the median is guaranteed to make at least half the class feel bad", even if everyone did well.
- Students enrolled in CS5740 complete an additional component for each 4740 homework, to be done individually. Scores on these components are converted to "satisfactory", "borderline", and "unsatisfactory". If a student receives two "borderline"s or one "unsatisfactory" among the four homeworks, we reserve the right to lower the student's letter grade as computed for 4740 by the equivalent of a "level", for example, from a B to a B-.
- Regrade requests:
Communication regarding regrade requests must be done only in writing via the mechanisms of the relevant submission platform (e.g., CMS, Gradescope): given the number of staff involved in handling regrade requests, we need centralized records of all discussions.
We want to give grades that accurately represent our assessment of your understanding of the course material (although we also have time constraints). Hence, if you receive an incorrect score, you should absolutely bring it to our attention via the mechanisms just described.
However, we must explicitly mention an additional consequence of the importance of grade accuracy: if we notice that you have been assigned more points than you should have been, we are duty-bound to correct such scores downward to the correct value.
- S/U enrollment:
From our perspective, the only difference in how we view or treat S/U- vs. letter-grade students is assigning final course grades: we determine letter grades for all students as if all of you were taking the course for a letter grade, and then for S/U students make the conversion C- or better: S, D+ or below: U.
But as stated in our Collaboration Policy 5.a.1, S/U students are encouraged to set expectations with their assignment partners about their intended level of commitment to the teamwork.
- Auditing
Sitting in on the lectures is fine as long as there are enough physical seats. People who are auditing the course, either officially in Student Center or unofficially ("just sitting in"), should not submit any work, partner with officially registered non-auditors, or take any exams, and should not join office hours if the lines are too long (we need to conserve our grading and staff resources).
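For readers who like to see the arithmetic spelled out, here is a minimal sketch of the stated grade weights, the both-exams rule, and the S/U conversion. It is only an illustration, not the staff's actual grading procedure: the names (WEIGHTS, LETTERS, to_s_u, and so on) are hypothetical, the letter scale is assumed to be the usual A+ through F ordering, and no score-to-letter cutoffs are modeled because none are fixed in the policy.

```python
# Weights as stated in the Workload/grading policy (percent of the course grade).
# The dictionary keys are hypothetical labels chosen for this sketch.
WEIGHTS = {
    "assignment_1": 17,
    "assignment_2": 17,
    "assignment_3": 17,
    "assignment_4": 17,
    "midterm": 16,
    "final": 16,
}

# Assumed letter scale, ordered from best to worst; used only to compare with C-.
LETTERS = ["A+", "A", "A-", "B+", "B", "B-", "C+", "C", "C-", "D+", "D", "D-", "F"]


def at_least_c_minus(letter: str) -> bool:
    """Return True if `letter` is C- or better on the assumed scale."""
    return LETTERS.index(letter) <= LETTERS.index("C-")


def exams_allow_c_minus_or_above(midterm_letter: str, final_letter: str) -> bool:
    """Exam rule: a course grade of C- or above requires at least C- on both exams."""
    return at_least_c_minus(midterm_letter) and at_least_c_minus(final_letter)


def to_s_u(course_letter: str) -> str:
    """S/U conversion: C- or better maps to S; D+ or below maps to U."""
    return "S" if at_least_c_minus(course_letter) else "U"


# The six components account for the whole course grade: 4 * 17 + 16 + 16 = 100.
assert sum(WEIGHTS.values()) == 100

print(to_s_u("C-"), to_s_u("D+"))               # S U
print(exams_allow_c_minus_or_above("B", "D+"))  # False
```

The sketch deliberately stops at letter grades: the policy says cutoffs are adjusted per assignment and exam, so any numeric threshold here would be invented.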
- Prerequisites
This course is not only an introduction to natural language processing, but also satisfies the practicum/project requirement for CS majors, and the coursework is designed with that in mind.
- Strong programming skills are important. Three semesters of programming classes are strongly recommended (e.g., completion of CS3110). CS2110 may suffice if you individually could have successfully and easily completed the assignments by yourself.
- Python experience. PyTorch experience (as through CS4780) is not required, but some students report it being very helpful.
- Comfort with elementary probability.
- Clear understanding of matrix and vector operations.
- Familiarity with differentiation.
- Course description: This course constitutes an introduction to natural language processing (NLP), the goal of which is to enable computers to use human languages as input, output, or both. NLP is at the heart of many of today's most exciting technological achievements, including machine translation, automatic conversational assistants and Internet search. The course will introduce core problems and methodologies in NLP, including machine learning, problem design, and evaluation methods.
- Enrollment: the "preliminary course information document", which was the temporary home page before class started, contains waitlist/enrollment policies and alternate courses, among other things.
For S/U and auditing information, please see the "workload and grading" policy items 5.g.4 and 5.g.5, respectively.
Acknowledgments: the collaboration policies are drawn from those posted for CS1110 Spring 2022. Seagull images cropped from "is he talking to me" by Leonard J. Matthews, license CC BY-NC-SA 2.0; our images can be used under the same license.