Scalable Data Science

By Prof. Anirban Dasgupta & Prof. Sourangshu Bhattacharya | IIT Kharagpur

Learners enrolled: 2019

Consider the following example problems: One is interested in computing summary statistics (word count distributions) for a set of words which occur in the same document in entire Wikipedia collection (5 million documents). Naive techniques, will run out of main memory on most computers. One needs to train an SVM classifier for text categorization, with unigram features (typically ~10 million) for hundreds of classes. One would run out of main memory, if they store uncompressed model parameters in main memory. One is interested in learning either a supervised model or find unsupervised patterns, but the data is distributed over multiple machines. Communication being the bottleneck, naïve methods to adapt existing algorithms to such a distributed setting might perform extremely poorly. In all the above situations, a simple data mining / machine learning task has been made more complicated due to large scale of input data, output results or both. In this course, we discuss algorithmic techniques as well as software paradigms which allow one to develop scalable algorithms and systems for the common data science tasks.

INTENDED AUDIENCE : Computer Science and Engineering

PREREQUISITES : Algorithms, Machine Learning

INDUSTRY SUPPORT : Google, Microsoft, Facebook, Amazon, Flipkart, LinkedIn etc.

Summary

Course Status :	Completed
Course Type :	Elective
Duration :	8 weeks
Category :	Computer Science and Engineering
Credit Points :	2
Level :	Postgraduate
Start Date :	29 Jul 2019
End Date :	20 Sep 2019
Exam Date :	29 Sep 2019 IST

Note: This exam date is subjected to change based on seat availability. You can check final exam date on your hall ticket.

Page Visits

Course layout

Week 1 : Background: Introduction (30 mins) Probability: Concentration inequalities, (30 mins) Linear algebra: PCA, SVD (30 mins) Optimization: Basics, Convex, GD. (30 mins) Machine Learning: Supervised, generalization, feature learning, clustering. (30 mins)

Week 2 : Memory-efficient data structures: Hash functions, universal / perfect hash families (30 min) Bloom filters (30 mins) Sketches for distinct count (1 hr) Misra-Gries sketch. (30 min)

Week 3 : Memory-efficient data structures (contd.): Count Sketch, Count-Min Sketch (1 hr) Approximate near neighbors search: Introduction, kd-trees etc (30 mins) LSH families, MinHash for Jaccard, SimHash for L2 (1 hr)

Week 4 : Approximate near neighbors search: Extensions e.g. multi-probe, b-bit hashing, Data dependent variants (1.5 hr) Randomized Numerical Linear Algebra Random projection (1 hr)

Week 5 : Randomized Numerical Linear Algebra CUR Decomposition (1 hr) Sparse RP, Subspace RP, Kitchen Sink (1.5 hr)

Week 6 : Map-reduce and related paradigms Map reduce - Programming examples - (page rank, k-means, matrix multiplication) (1 hr) Big data: computation goes to data. + Hadoop ecosystem (1.5 hrs)

Week 7 : Map-reduce and related paradigms (Contd.) Scala + Spark (1 hr) Distributed Machine Learning and Optimization: Introduction (30 mins) SGD + Proof (1 hr)

Week 8 : Distributed Machine Learning and Optimization: ADMM + applications (1 hr) Clustering (1 hr) Conclusion (30 mins)

Books and references

J. Leskovec, A. Rajaraman and JD Ullman. Mining of Massive Datasets. Cambridge University Press, 2nd Ed.
Muthukrishnan, S. (2005). Data streams: Algorithms and applications. Foundations and Trends® in Theoretical Computer Science, 1(2), 117-236.
Woodruff, David P. ""Sketching as a tool for numerical linear algebra."" Foundations and Trends® in Theoretical Computer Science 10.1–2 (2014): 1-157.
Mahoney, Michael W. ""Randomized algorithms for matrices and data."" Foundations and Trends® in Machine Learning 3.2 (2011): 123-224.

Instructor bio

Anirban Dasgupta is currently an Associate Professor of Computer Science & Engineering at IIT Gandhinagar. Prior to this, he was a Senior Scientist at Yahoo! Labs Sunnyvale. Anirban works on algorithmic problems for massive data sets, large scale machine learning, analysis of large social networks and randomized algorithms in general. He did his undergraduate studies at IIT Kharagpur and doctoral studies at Cornell University. He has also received the Google Faculty Research Award (2015), the Cisco University grant (2016), and the ICDT Best Newcomer Award (2016).

Sourangshu Bhattacharya is an Assistant Professor in the Department of Computer Science and Engineering, IIT Kharagpur. He was a Scientist at Yahoo! Labs from 2008 to 2013, where he was working on prediction of Click-through rates, Ad-targeting to customers, etc on the Rightmedia display ads exchange. He was a visiting scholar at the Helsinki University of Technology from January - May 2008. He received the B.Tech. in Civil Engineering from I.I.T. Roorkee in 2001, M.Tech. in computer science from I.S.I. Kolkata in 2003, and Ph.D. in Computer Science from the Department of Computer Science & Automation, IISc Bangalore in 2008. He has many publications in top conferences and journals, including ICML, NIPS, WWW, ICDM, CIKM, etc. His current research interests include modeling influence in social networks, distributed machine learning, and representation learning.

Course certificate

The course is free to enroll and learn from. But if you want a certificate, you have to register and write the proctored exam conducted by us in person at any of the designated exam centres.
The exam is optional for a fee of Rs 1000/- (Rupees one thousand only).
Date and Time of Exams: 29th September 2019, Morning session 9am to 12 noon; Afternoon Session 2pm to 5pm.
Registration url: Announcements will be made when the registration form is open for registrations.
The online registration form has to be filled and the certification exam fee needs to be paid. More details will be made available when the exam registration form is published. If there are any changes, it will be mentioned then.
Please check the form for more details on the cities where the exams will be held, the conditions you agree to when you fill the form etc.

CRITERIA TO GET A CERTIFICATE

Average assignment score = 25% of average of best 6 assignments out of the total 8 assignments given in the course.
Exam score = 75% of the proctored certification exam score out of 100
Final score = Average assignment score + Exam score

YOU WILL BE ELIGIBLE FOR A CERTIFICATE ONLY IF AVERAGE ASSIGNMENT SCORE >=10/25 AND EXAM SCORE >= 30/75.

If one of the 2 criteria is not met, you will not get the certificate even if the Final score >= 40/100.
Certificate will have your name, photograph and the score in the final exam with the breakup.It will have the logos of NPTEL and IIT Kharagpur. It will be e-verifiable at nptel.ac.in/noc.
Only the e-certificate will be made available. Hard copies are being discontinued from July 2019 semester and will not be dispatched

SWAYAM Helpline / Support