Used Python's multiprocessing module and queues to run 6 processes on a Raspberry Pi. The processes used OpenCV's BackgroundSubtractorMOG2 and Canny edge detector algorithms to detect motion. Upon motion detection, videos are uploaded to AWS S3 and metadata to AWS DynamoDB. A Python-Flask web server running on the Pi provides live video streaming. Used TensorFlow Inception-V3 running on AWS EC2 to classify images in the videos. A web-based visualization interface built using JavaScript, HTML, and Google Charts provides access to the videos and metadata.
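The process-and-queue architecture can be sketched as follows. This is a minimal stand-in, not the project's code: `detect_motion` uses simple frame differencing instead of OpenCV's BackgroundSubtractorMOG2, frames are plain lists instead of camera images, and the S3/DynamoDB upload step is stubbed out as a queue event.

```python
import multiprocessing as mp

def detect_motion(prev, frame, threshold=30, min_changed=1):
    # Stand-in for MOG2: count pixels whose intensity changed by more
    # than `threshold` between consecutive frames.
    changed = sum(abs(a - b) > threshold for a, b in zip(prev, frame))
    return changed > min_changed

def worker(pairs_q, events_q):
    while True:
        item = pairs_q.get()
        if item is None:            # sentinel: shut down this worker
            break
        prev, frame = item
        if detect_motion(prev, frame):
            events_q.put(1)         # real system: save clip, upload to S3/DynamoDB

def run_demo(n_workers=2):
    pairs_q, events_q = mp.Queue(), mp.Queue()
    procs = [mp.Process(target=worker, args=(pairs_q, events_q))
             for _ in range(n_workers)]
    for p in procs:
        p.start()
    still = [0] * 100
    moved = [255] * 100
    for _ in range(5):
        pairs_q.put((still, still))   # unchanged frames: no motion
    pairs_q.put((still, moved))       # large change: one motion event
    for _ in procs:
        pairs_q.put(None)
    for p in procs:
        p.join()
    events = 0
    while not events_q.empty():
        events += events_q.get()
    return events

if __name__ == "__main__":
    print("motion events:", run_demo())
```

With this synthetic input the demo reports exactly one motion event; in the real pipeline each event would trigger the S3 video upload and DynamoDB metadata write.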
Created a log search tool using an inverted index. It supported full Boolean search queries. Queries were parsed using the Shunting-Yard algorithm, and Boolean search was evaluated using Python set operations. Two versions were implemented: a local version and a cloud version.
Local version: on a 100 MB log file, speeds faster than grep were achieved. The inverted index was created using a Python script, and the postings list was stored in a local instance of MongoDB. The input log file was split into smaller files; the postings list contained the filename and offset of each log entry.
Cloud version: the inverted index was created using MRJob and AWS EMR. The postings list was stored in MongoDB on AWS EC2. Each log entry was stored in AWS with a unique key.
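The core of both versions can be sketched with a toy in-memory postings list (the real tool stored postings in MongoDB): Shunting-Yard converts the infix Boolean query to postfix, and the postfix form is evaluated with Python set operations over offsets.

```python
from collections import defaultdict

PREC = {"NOT": 3, "AND": 2, "OR": 1}

def build_index(lines):
    # Postings map each token to the byte offsets of lines containing it.
    index, offsets, pos = defaultdict(set), set(), 0
    for line in lines:
        offsets.add(pos)
        for token in line.lower().split():
            index[token].add(pos)
        pos += len(line) + 1              # +1 for the newline
    return index, offsets

def to_postfix(tokens):
    # Shunting-Yard: infix Boolean query -> postfix (NOT binds tightest).
    out, ops = [], []
    for tok in tokens:
        if tok == "(":
            ops.append(tok)
        elif tok == ")":
            while ops[-1] != "(":
                out.append(ops.pop())
            ops.pop()
        elif tok in PREC:
            while ops and ops[-1] in PREC and PREC[ops[-1]] >= PREC[tok] and tok != "NOT":
                out.append(ops.pop())
            ops.append(tok)
        else:
            out.append(tok)
    out.extend(reversed(ops))
    return out

def search(query, index, offsets):
    # Evaluate postfix with set intersection, union, and complement.
    stack = []
    for tok in to_postfix(query.split()):
        if tok == "AND":
            b, a = stack.pop(), stack.pop(); stack.append(a & b)
        elif tok == "OR":
            b, a = stack.pop(), stack.pop(); stack.append(a | b)
        elif tok == "NOT":
            stack.append(offsets - stack.pop())   # complement over all lines
        else:
            stack.append(index.get(tok.lower(), set()))
    return stack.pop()
```

For example, on the lines `["ERROR disk full", "ok boot", "error network down"]`, the query `"error AND NOT network"` returns only the offset of the first line.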
In this experiment, housing rental advertisements from 5 cities were scraped from Craigslist using Python and deduplicated using text analytics. The study utilized a 2x2 factorial design to measure the prevalence of racial and/or social-status discrimination. Names were used to indicate race ('white'- and 'black'-sounding names), and the type of job mentioned in the email body was used to indicate social status. Responses were read using Python, and a test of proportions was used for statistical significance.
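The significance test can be sketched as a standard two-proportion z-test on response rates, implemented here with only the standard library. The counts below are made up for illustration; they are not the study's data.

```python
from math import sqrt, erf

def two_proportion_z(success_a, n_a, success_b, n_b):
    # Two-proportion z-test with a pooled standard error.
    p_a, p_b = success_a / n_a, success_b / n_b
    p_pool = (success_a + success_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Hypothetical counts: 60/200 replies in one condition vs 40/200 in the other.
z, p = two_proportion_z(60, 200, 40, 200)
```

With these illustrative counts the difference (30% vs 20%) comes out significant at the 5% level.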
Used Washington Metro rail passenger ridership data for the month of May 2012. Used D3/JavaScript and Tableau to create visualizations, including a chord diagram and small multiples built with D3/JavaScript.
In this Kaggle competition, we used an Extra Trees classifier and 16 engineered features to predict forest cover type from cartographic variables.
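A minimal sketch of the modeling step, using scikit-learn's ExtraTreesClassifier on synthetic data. The three "cartographic" features and the target below are invented for illustration; the competition model used 16 engineered features on the real dataset.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

rng = np.random.RandomState(0)
# Fake features standing in for cartographic variables
# (e.g. elevation, slope, distance to water).
X = rng.rand(300, 3)
# Fake binary "cover type" driven mostly by the first feature.
y = (X[:, 0] + 0.1 * rng.randn(300) > 0.5).astype(int)

clf = ExtraTreesClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)
train_acc = clf.score(X, y)
```

Extra Trees averages many fully randomized trees, which tends to reduce variance relative to a single decision tree at low tuning cost.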
Used SoftLayer to store the 270 GB dataset in HDFS and then used Spark for feature reduction and clustering. For each song in a cluster, the 100 most similar songs were computed using Euclidean distance and stored in Elasticsearch. Created a website using Python-Webpy that allowed users to search for songs; once a song was selected, the website played the next similar songs by querying Elasticsearch. The website streamed 30-second song clips from 7Digital using song IDs.
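The similar-song step can be sketched in NumPy: for each song's feature vector, rank the other songs in the cluster by Euclidean distance and keep the closest k. The real pipeline did this per cluster in Spark and wrote the results to Elasticsearch; this dense all-pairs version only illustrates the distance ranking.

```python
import numpy as np

def top_k_similar(features, k):
    # features: (n_songs, n_dims) array of per-song feature vectors.
    # Returns, per song, the indices of its k nearest neighbours
    # by Euclidean distance, excluding the song itself.
    diffs = features[:, None, :] - features[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    np.fill_diagonal(dists, np.inf)        # exclude self-matches
    return np.argsort(dists, axis=1)[:, :k]
```

The all-pairs distance matrix is O(n^2) in memory, which is why the project restricted the computation to songs within the same cluster.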
Trained 2 neural networks as part of the assignments for this course:
One with 4096 nodes in a single hidden layer
The second with 2 hidden layers of 1024 nodes each
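For scale, the parameter counts of the two fully connected architectures can be computed directly. The 784-dimensional input (28x28 images) and 10 output classes are assumptions for illustration; the hidden-layer sizes are the ones listed above.

```python
def dense_params(sizes):
    # Weights plus biases for each fully connected layer in the chain.
    return sum(i * o + o for i, o in zip(sizes, sizes[1:]))

net1 = dense_params([784, 4096, 10])         # one hidden layer of 4096
net2 = dense_params([784, 1024, 1024, 10])   # two hidden layers of 1024
```

Under these assumptions the single wide layer has roughly 3.26M parameters versus roughly 1.86M for the deeper, narrower network.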
Covered linear regression, logistic regression, nearest neighbors, Naive Bayes, decision trees, gradient descent, clustering, Gaussian mixture models, PCA, graphs, and expectation maximization, using Python and Scikit-Learn.
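As a sample of the Scikit-Learn workflow used in these assignments, here is one of the listed topics, clustering, on a synthetic 2-D dataset (the data is invented for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
# Two well-separated Gaussian blobs of 50 points each.
X = np.vstack([rng.randn(50, 2), rng.randn(50, 2) + 5])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
```

With blobs this far apart, k-means assigns each blob to its own cluster.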
Trained models on very large datasets using distributed infrastructure and parallel computation. Tools used: Python, Spark, MRJob, MapReduce, local Hadoop, and AWS EMR. Topics included linear and logistic regression using gradient descent, shortest path, PageRank, decision trees, SQL joins, and market basket analysis using Apriori.
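The single-machine kernel of one of those topics, linear regression by batch gradient descent, can be sketched in NumPy; the assignments scaled this same update out over partitions with Spark/MRJob. The toy data below is illustrative.

```python
import numpy as np

def fit_linear(X, y, lr=0.1, steps=2000):
    # Batch gradient descent on mean squared error.
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / len(y)   # gradient of 0.5 * MSE
        w -= lr * grad
    return w

# Toy usage: recover intercept 1 and slope 2 from noiseless data.
X_demo = np.c_[np.ones(50), np.linspace(0.0, 1.0, 50)]
y_demo = 1.0 + 2.0 * X_demo[:, 1]
w_hat = fit_linear(X_demo, y_demo)
```

In the distributed setting the gradient is a sum over records, so each partition computes its partial sum and the driver combines them before applying the update.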
Covered randomized experiments and experiment design, mainly using R.