Pandas

Investigating Fandango Movie Ratings

In October 2015, a data journalist named Walt Hickey analyzed movie ratings data and found strong evidence that Fandango’s rating system was biased and dishonest. He published his analysis in an article that is well worth reading as a piece of data journalism. Fandango displays a 5-star rating system on its website, where the minimum rating is 0 stars and the maximum is 5 stars. Hickey found a significant discrepancy between the number of stars displayed to users and the actual rating, which he was able to find in the HTML of the page.

Analyzing NYC High School Data

One of the most controversial issues in the U.S. educational system is the efficacy of standardized tests, and whether they’re unfair to certain groups. Investigating the correlation between SAT scores and demographics might be an interesting angle to take. We could correlate SAT scores with factors like race, gender, income, and more. The SAT, or Scholastic Aptitude Test, is an exam that U.S. high school students take before applying to college.

Demand Forecasting of Perishable Products

The objective of this project is to minimize wastage of meal kits in retail stores. Currently, this is done by tracking each individual item from the source to the point of sale, a cumbersome and labor-intensive process. To reach the objective using machine learning, the first step is to have an accurate forecast of demand. This project focuses on generating an accurate forecast for each individual item (46 unique items) in each store (47 unique stores).

Mobile App for Lottery Addiction

Many people start playing the lottery for fun, but for some this activity turns into a habit that eventually escalates into addiction. Like other compulsive gamblers, lottery addicts soon begin spending from their savings and loans, start to accumulate debts, and eventually engage in desperate behaviors like theft. In this project, we are going to contribute to the development of a mobile app by writing a couple of functions that are mostly focused on calculating probabilities.
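As a rough illustration of the kind of probability function such an app might need, here is a minimal sketch. The excerpt does not say which lottery the app targets, so the 6-out-of-49 format and the function names `one_ticket_probability` and `multi_ticket_probability` are assumptions made only for this example.

```python
from math import comb


def one_ticket_probability(numbers_drawn, pool_size):
    """Chance that a single ticket matches every drawn number.

    Assumes numbers are drawn without replacement and order does not matter,
    so every combination is equally likely.
    """
    return 1 / comb(pool_size, numbers_drawn)


def multi_ticket_probability(n_tickets, numbers_drawn, pool_size):
    """Chance of winning when holding n_tickets distinct tickets."""
    outcomes = comb(pool_size, numbers_drawn)
    if not 0 <= n_tickets <= outcomes:
        raise ValueError("n_tickets must be between 0 and the number of outcomes")
    return n_tickets / outcomes


# Hypothetical 6-out-of-49 lottery: 13,983,816 possible combinations.
print(f"{one_ticket_probability(6, 49):.10f}")
```

Showing the result as "1 in 13,983,816" rather than as a tiny decimal is probably the more persuasive form for an app aimed at discouraging play.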

AirBnB: Nearest Neighbors

AirBnB is a marketplace for short-term rentals that allows you to list part or all of your living space for others to rent. You can rent everything from a room in an apartment to your entire house on AirBnB. Because most of the listings are on a short-term basis, AirBnB has grown to become a popular alternative to hotels. The company itself has grown from its founding in 2008 to a $30 billion valuation in 2016 and is currently worth more than any hotel chain in the world.

Star Wars: A data exploration

Before the release of “Star Wars: The Force Awakens”, the team at FiveThirtyEight wanted to answer some questions about the Star Wars franchise. In particular, they were interested in one question: which movie is the best in the franchise? The team needed to collect data addressing this question. To do this, they surveyed Star Wars fans using the online tool SurveyMonkey. They received 835 total responses, which you can download from their GitHub repository.

Analyze Employee Exit Survey

In this project, we will work with exit surveys from employees of the Department of Education, Training and Employment (DETE) and the Technical and Further Education (TAFE) institute in Queensland, Australia. The objective of this project is to be able to answer the following questions: Are employees who only worked for the institutes for a short period of time resigning due to some kind of dissatisfaction? What about employees who have been there longer?

Building a Spam Filter with Naive Bayes

In this project, we’re going to build a spam filter for SMS messages using the multinomial Naive Bayes algorithm. Our goal is to write a program that classifies new messages with an accuracy greater than 80% — so we expect that more than 80% of the new messages will be classified correctly as spam or ham (non-spam). To train the algorithm, we’ll use a dataset of 5,572 SMS messages that are already classified by humans.
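To make the approach concrete, here is a minimal sketch of a multinomial Naive Bayes classifier with Laplace (additive) smoothing, written in plain Python. It is not the project's actual code: the tokenizer, the function names, the smoothing value `alpha=1.0`, and the four toy messages are all assumptions made for illustration, and the real project trains on the 5,572-message dataset instead.

```python
import math
import re
from collections import Counter

LABELS = ("spam", "ham")


def tokenize(message):
    """Lowercase the message and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", message.lower())


def train(messages, alpha=1.0):
    """Count words per label from an iterable of (label, text) pairs."""
    word_counts = {label: Counter() for label in LABELS}
    label_counts = Counter()
    for label, text in messages:
        label_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocabulary = set(word_counts["spam"]) | set(word_counts["ham"])
    return {
        "word_counts": word_counts,
        "label_counts": label_counts,
        "vocabulary": vocabulary,
        "alpha": alpha,
    }


def classify(model, message):
    """Return the label with the higher log-posterior (ties go to ham)."""
    total_messages = sum(model["label_counts"].values())
    vocab_size = len(model["vocabulary"])
    scores = {}
    for label in LABELS:
        counts = model["word_counts"][label]
        denominator = sum(counts.values()) + model["alpha"] * vocab_size
        # Log prior; assumes each label appears at least once in training.
        score = math.log(model["label_counts"][label] / total_messages)
        for word in tokenize(message):
            if word in model["vocabulary"]:  # words never seen are skipped
                score += math.log((counts[word] + model["alpha"]) / denominator)
        scores[label] = score
    return "spam" if scores["spam"] > scores["ham"] else "ham"


# Hypothetical toy training set, only to show the calls.
TOY_MESSAGES = [
    ("spam", "Win a free prize now"),
    ("spam", "Free cash claim your prize"),
    ("ham", "Are we still meeting for lunch"),
    ("ham", "See you at home tonight"),
]
model = train(TOY_MESSAGES)
print(classify(model, "claim your free prize"))
```

Summing log-probabilities instead of multiplying raw probabilities avoids numeric underflow on longer messages, and the smoothing term keeps a word that appears under only one label from driving the other label's probability to zero.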

Finding the best markets to advertise an e-learning product

In this project, we’ll aim to find the two best markets to advertise our product in. We’re working for an e-learning company that offers courses on programming. Most of our courses are on web and mobile development, but we also cover many other domains, like data science and game development. To avoid spending money on organizing a survey, we’ll first try to make use of existing data to determine whether we can reach any reliable result.