LA3: Clustering and Frequent Itemsets

Important note

This assignment must be submitted individually. You are encouraged to
discuss and exchange solutions during the lab sessions or on Slack, but
you are not allowed to share code electronically. Plagiarism,
unauthorized collaboration, and other offenses under Concordia’s
Academic Code of Conduct will be handled firmly.

Preliminary comments

To submit this assignment, you will have to be familiar with Git and
GitHub. If you have never used these technologies, it is recommended to
go through the following tutorials:

In particular, you will have to be able to:

Clone a Git repository from GitHub: find the URL of a GitHub repository
and clone it using git clone <repo_url>.
Commit modifications to a local clone of a Git repository: git add <file> and git commit -m "message".
Push modifications from your local clone to the origin repository on GitHub: git push.

Assignment submission

You have to submit your assignment through GitHub classroom, using the following procedure:

1. Accept the assignment at https://classroom.github.com/a/UYpMRSbI. This will create your own copy
   of the assignment repository, located at http://github.com/tgteacher/bigdata-la3-your_github_username.
2. Clone your copy of the assignment repository on your computer, and
   implement the functions in answers/answer.py, following the instructions in the
   documentation strings. A skeleton of your answer file already exists in file answers/answer.py
   with the required syntax for each function.
3. Commit your solution to your local copy of the assignment repository.
4. Push your solution to your GitHub copy of the assignment repository.

Important: please make sure that the email address you use in Moodle is
added to your GitHub account (you can add multiple addresses to your
GitHub account).

You can repeat steps 3 and 4 as many times as you wish. Your assignment
will be graded based on a snapshot of your repository taken on the
submission deadline.

Evaluation

Grading

General Rule

Your assignment will be automatically graded through software tests.

The tests are already available in the tests directory. You
may want to run them as you implement your solution, to check that your
code passes them. To do so, install pytest and
run pytest tests from the base directory of the assignment.

Your grade will be determined from the number of passing tests as
returned by pytest. All tests will contribute equally to the final
grade. For instance, if 20 tests are evaluated, and your solution passes 18 tests, then your grade
will be 90%.
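
The grade computation described above can be sketched in a few lines. This is only an illustration of the arithmetic, not the actual grading script; the function name grade_percent is hypothetical.

```python
def grade_percent(passed: int, total: int) -> float:
    """Return the grade as a percentage, with every test weighted equally."""
    if total <= 0:
        raise ValueError("total must be a positive number of tests")
    if not 0 <= passed <= total:
        raise ValueError("passed must be between 0 and total")
    return 100.0 * passed / total


# The example from the text: 18 passing tests out of 20.
print(grade_percent(18, 20))  # 90.0
```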

This grading scheme is meant to be transparent and objective. However,
it is also radical and you should be very meticulous with your coding:
make a single syntax error in your answer file, such as a spurious
tab character, and all the tests will fail! To avoid this kind
of surprise, you are strongly encouraged to check the output of the
tests on Travis CI regularly.
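
As a quick local check before pushing, you could verify that your answer file at least parses. The sketch below is not part of the assignment or of the grading system; the helper name find_syntax_error and the two sample snippets are made up for illustration. It relies only on Python's built-in compile(), which reports a stray tab in the indentation as a syntax error.

```python
from typing import Optional


def find_syntax_error(source: str, filename: str = "answers/answer.py") -> Optional[str]:
    """Return a short description of the first syntax error, or None if the source compiles."""
    try:
        # compile() builds a code object without executing the code.
        compile(source, filename, "exec")
    except SyntaxError as err:  # also covers IndentationError and TabError
        return "{}: line {}: {}".format(type(err).__name__, err.lineno, err.msg)
    return None


good = "def f():\n    return 1\n"
# The third line is indented with a tab while the second line uses spaces.
bad = "def f():\n    x = 1\n\treturn x\n"

print(find_syntax_error(good))  # None
print(find_syntax_error(bad))   # a description of the indentation problem
```

To check your real file, pass its contents instead of the sample strings, for example open("answers/answer.py").read().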

Exceptions

The rules below aim to discourage cheating. They might sound a bit harsh,
but in general be cool: if you don’t aim to cheat, you probably won’t :)

You are not allowed to modify the tests to make them pass. Every deliberate
attempt to modify the tests will result in a grade of 0.
You must use the libraries mentioned in the instructions to
implement the assignment. Any attempt to implement the solution with a different
library, for instance Dask when Spark was expected, will result in a grade of 0.
You are not allowed to make the tests pass using a hard-coded solution. Your solution
must, in principle, apply to other similar datasets. Any hard-coded solution will receive
a grade of 0.
Any deliberate attempt to trick the grading system by making the tests pass
without providing a correct, non-hard-coded solution will receive a grade of 0.

Test environment and live feedback

Your code will be tested with Python 3.6 in an Ubuntu environment
provided by Travis CI. It is your responsibility to ensure that the
tests will pass in this environment. The following resources will help
you.

Python 3.6 is available in the computer labs and can be loaded using
module load python/3.6. You can check the version of Python that
you are using by running python --version. Computer labs can easily be
accessed remotely, using ssh.
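
The same version check can also be done from inside Python, which may be handy at the top of a scratch script. This is a minimal sketch: the helper name is_expected_python is hypothetical, and the (3, 6) default simply mirrors the version stated above.

```python
import sys
from typing import Sequence, Tuple


def is_expected_python(version_info: Sequence[int], expected: Tuple[int, int] = (3, 6)) -> bool:
    """Return True when the major and minor version numbers match the expected ones."""
    return tuple(version_info[:2]) == expected


# Warn when the running interpreter differs from the one used by the tests.
if not is_expected_python(sys.version_info):
    print("Warning: running Python {}.{}, but the tests use 3.6".format(*sys.version_info[:2]))
```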

It is strongly suggested that you run the disclosed tests before
submitting your assignment, using pytest as explained previously.

Live feedback on your assignment is provided through Travis CI.
You will have to sign in using
your GitHub account to see your assignment repository. Your grade will be determined
from the result of the tests executed in Travis CI.