From Hackernews

Exploratory Analysis of Hacker News Posts

Hacker News is a social news website focused on computer science and entrepreneurship. It was started by the startup incubator Y Combinator. Posts are voted and commented on, much as on Reddit, and posts that reach the top of the Hacker News listings get more visitors as a result.

In this project we are interested in posts whose titles begin with either Ask HN or Show HN. Users submit “Ask HN” posts to ask the Hacker News community a specific question, and “Show HN” posts to showcase a project or product. We want to answer two questions:

  • Which of the two post types receives more comments on average?
  • At what time of day do Ask HN posts receive the most comments on average?
from csv import reader

# Read the CSV file into a list of lists and store it in the variable hn.
# The with statement closes the file once it has been read.
with open('hacker_news.csv') as opened_file:
    read_file = reader(opened_file)
    hn = list(read_file)
# Display the first 5 rows of the list hn

hn[0:5]
[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'],
 ['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01']]
# Assign the first row, which contains the column names, to a variable
# called header, then remove that row from hn

header = hn[0]
hn = hn[1:]
print(header)
print(hn[0:5])
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    
    if title.lower().startswith('ask hn'):
        
        ask_posts.append(row)

    elif title.lower().startswith('show hn'):
        
        show_posts.append(row)
    
    else:
        
        other_posts.append(row)
        
print("The number of 'Ask Posts' in the list:", len(ask_posts))
print("The number of 'Show Posts' in the list:", len(show_posts))
print("The number of 'General Posts' in the list:", len(other_posts))
The number of 'Ask Posts' in the list: 1744
The number of 'Show Posts' in the list: 1162
The number of 'General Posts' in the list: 17194
# Compute the total and average number of comments for 'Ask Posts',
# then do the same for 'Show Posts'

total_ask_comments = 0

for row in ask_posts:
    comments = row[4]
    comments = int(comments)
    total_ask_comments += comments
    
avg_ask_comments = total_ask_comments/len(ask_posts)

total_show_comments = 0

for row in show_posts:
    comments = row[4]
    comments = int(comments)
    total_show_comments += comments
    
avg_show_comments = total_show_comments/len(show_posts)

print("The average number of comments on 'Ask Posts' is:", avg_ask_comments)
print("The average number of comments on 'Show Posts' is:", avg_show_comments)
The average number of comments on 'Ask Posts' is: 14.038417431192661
The average number of comments on 'Show Posts' is: 10.31669535283993

Based on the analysis above, ‘Ask Posts’ receive more comments on average than ‘Show Posts’ (about 14.04 versus 10.32 comments per post). One plausible reading is that the community readily responds to users who ask for help with their work or projects.
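The two averaging loops above share the same logic, so they could be folded into a single helper. The sketch below is only an illustration: `average_comments` and `sample_rows` are hypothetical names that do not appear in the original notebook, and the sample rows are the first three data rows printed earlier, not Ask or Show posts.

```python
def average_comments(posts, comments_index=4):
    """Return the mean of the num_comments column, or 0.0 for an empty list."""
    if not posts:
        return 0.0
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# First three data rows of hn, copied from the output shown earlier.
sample_rows = [
    ['12224879', 'Interactive Dynamic Video',
     'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte',
     '8/4/2016 11:52'],
    ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time',
     'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
     '39', '10', 'josep2', '1/26/2016 19:30'],
    ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke",
     'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
     '2', '1', 'vezycash', '6/23/2016 22:20'],
]

print(average_comments(sample_rows))  # (52 + 10 + 1) / 3 = 21.0
```

With the full data set, the same helper would be called as `average_comments(ask_posts)` and `average_comments(show_posts)`.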

import datetime as dt
result_list = []

for row in ask_posts:
    created = row[6]
    comments = int(row[4])
    
    result_list.append([created, comments])
    

counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    date = row[0]
    comment = row[1]
    date = dt.datetime.strptime(date, "%m/%d/%Y %H:%M")
    hour = dt.datetime.strftime(date, "%H")
    
    if hour not in counts_by_hour:
        
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comment
    else:
        
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comment
        
print(counts_by_hour)
print(comments_by_hour)
{'21': 109, '06': 44, '01': 60, '16': 108, '17': 100, '23': 68, '19': 110, '09': 45, '13': 85, '02': 58, '07': 34, '20': 80, '10': 59, '22': 71, '12': 73, '14': 107, '18': 109, '11': 58, '03': 54, '05': 46, '08': 48, '04': 47, '15': 116, '00': 55}
{'21': 1745, '06': 397, '01': 683, '16': 1814, '17': 1146, '23': 543, '19': 1188, '09': 251, '13': 1253, '02': 1381, '07': 267, '20': 1722, '10': 793, '22': 479, '12': 687, '14': 1416, '18': 1439, '11': 641, '03': 421, '05': 464, '08': 492, '04': 337, '15': 4477, '00': 447}
result_list[0:5]
[['8/16/2016 9:55', 6],
 ['11/22/2015 13:43', 29],
 ['5/2/2016 10:14', 1],
 ['8/2/2016 14:20', 3],
 ['10/15/2015 16:38', 17]]
# Compute the average number of comments per post for each hour
avg_by_hour = []

for hour in comments_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour] / counts_by_hour[hour]])

avg_by_hour
[['21', 16.009174311926607],
 ['06', 9.022727272727273],
 ['01', 11.383333333333333],
 ['16', 16.796296296296298],
 ['17', 11.46],
 ['23', 7.985294117647059],
 ['19', 10.8],
 ['09', 5.5777777777777775],
 ['13', 14.741176470588234],
 ['02', 23.810344827586206],
 ['07', 7.852941176470588],
 ['20', 21.525],
 ['10', 13.440677966101696],
 ['22', 6.746478873239437],
 ['12', 9.41095890410959],
 ['14', 13.233644859813085],
 ['18', 13.20183486238532],
 ['11', 11.051724137931034],
 ['03', 7.796296296296297],
 ['05', 10.08695652173913],
 ['08', 10.25],
 ['04', 7.170212765957447],
 ['15', 38.5948275862069],
 ['00', 8.127272727272727]]
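The per-hour counts, totals and averages above can also be built in one pass with `collections.defaultdict`, which removes the need for the `if hour not in ...` branch. This is a minimal sketch rather than the notebook's own code: the names `sample_results`, `counts`, `comments` and `avg_by_hour_sample` are assumptions, and the input is only the five rows printed by `result_list[0:5]` above.

```python
import datetime as dt
from collections import defaultdict

# [created_at, num_comments] pairs, copied from result_list[0:5] above.
sample_results = [
    ['8/16/2016 9:55', 6],
    ['11/22/2015 13:43', 29],
    ['5/2/2016 10:14', 1],
    ['8/2/2016 14:20', 3],
    ['10/15/2015 16:38', 17],
]

counts = defaultdict(int)    # number of posts created in each hour
comments = defaultdict(int)  # total comments on posts created in each hour

for created, num_comments in sample_results:
    # Parse the timestamp and keep only the zero-padded hour, e.g. '09'.
    hour = dt.datetime.strptime(created, "%m/%d/%Y %H:%M").strftime("%H")
    counts[hour] += 1
    comments[hour] += num_comments

avg_by_hour_sample = {hour: comments[hour] / counts[hour] for hour in counts}
print(avg_by_hour_sample)
```

Each of the five sample posts falls in a different hour, so every average here equals that post's own comment count.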
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
    
print(swap_avg_by_hour)
sorted_swap = sorted(swap_avg_by_hour, reverse = True)

print("Top 5 Hours for Ask Posts Comments")
for val in sorted_swap[:5]:
    date = val[1]
    date = dt.datetime.strptime(date, "%H")
    date = dt.datetime.strftime(date, "%H:%M")
    print("{}: {:.2f} average comments per post".format(date, val[0]))
[[16.009174311926607, '21'], [9.022727272727273, '06'], [11.383333333333333, '01'], [16.796296296296298, '16'], [11.46, '17'], [7.985294117647059, '23'], [10.8, '19'], [5.5777777777777775, '09'], [14.741176470588234, '13'], [23.810344827586206, '02'], [7.852941176470588, '07'], [21.525, '20'], [13.440677966101696, '10'], [6.746478873239437, '22'], [9.41095890410959, '12'], [13.233644859813085, '14'], [13.20183486238532, '18'], [11.051724137931034, '11'], [7.796296296296297, '03'], [10.08695652173913, '05'], [10.25, '08'], [7.170212765957447, '04'], [38.5948275862069, '15'], [8.127272727272727, '00']]
Top 5 Hours for Ask Posts Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post

The hour with the most comments per post is 15:00, with an average of 38.59 comments per post. That is about 62% more than the second-highest hour, 02:00, which averages 23.81 comments per post.
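The gap between the top two hours can be checked directly from the averages computed earlier. The sketch below copies the two highest values from `avg_by_hour` and ranks them with a `key` function instead of swapping columns; `top_hours`, `ranked` and `relative_increase` are illustrative names, not part of the original notebook.

```python
# The two highest hourly averages, copied from avg_by_hour above.
top_hours = [['02', 23.810344827586206], ['15', 38.5948275862069]]

# Sort by the average (second element) in descending order.
ranked = sorted(top_hours, key=lambda row: row[1], reverse=True)
(best_hour, best_avg), (second_hour, second_avg) = ranked[0], ranked[1]

# Relative increase of the best hour over the runner-up, in percent.
relative_increase = (best_avg - second_avg) / second_avg * 100
print("{}:00 is {:.0f}% above {}:00".format(best_hour, relative_increase, second_hour))
# prints: 15:00 is 62% above 02:00
```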

According to the data set documentation, the timestamps are in US Eastern Time, so 15:00 corresponds to 3:00 pm Eastern.
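If 12-hour labels are preferred in the ranking, the hour strings can be reformatted with `strftime`. This is a small sketch under two assumptions: `to_12_hour` is a hypothetical helper name, and the `%p` directive is locale-dependent, so the AM/PM text shown in the comments is what an English or C locale produces.

```python
import datetime as dt


def to_12_hour(hour_string):
    """Convert a zero-padded 24-hour string such as '15' to a 12-hour label."""
    parsed = dt.datetime.strptime(hour_string, "%H")
    # %I is the zero-padded 12-hour clock; strip the padding for readability.
    return parsed.strftime("%I:%M %p").lstrip("0")


print(to_12_hour("15"))  # 3:00 PM in an English or C locale
print(to_12_hour("02"))  # 2:00 AM in an English or C locale
```

The function only changes the label. It does not convert between time zones, so the result is still Eastern Time.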


Amol Kulkarni
Ph.D.

My research interests include application of Machine learning algorithms to the fields of Marketing and Supply Chain Engineering, Decision Theory and Process Optimization.