74. Data Analysis for BiliDash

This notebook covers the preliminary data analysis and probing (description phase) of the BiliDash data.

74.1. Time Series Data of UUPERS Account on bilibili.com

I have been tracking the number of fans, the number of views, coins, etc.

74.1.1. Correlation between Video Views and Fans

Some fraction of video views should convert into new fans. With the time series data, I can look for the correlation between the two.

The first idea is that the slope of views and the slope of fans must be related.

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import datetime

from pathlib import Path
# IPython shell escape (notebook only): list the files in the data directory
!ls data
# path = "data/"
# path = "http://dynamic.imresear.ch/server/track-data/uupers/"
path = Path("data")
uupers = "uupers.csv"
vids = ["21768006.csv", "22755224.csv"]
df_uupers = pd.read_csv( path / uupers)
df_vid1 = pd.read_csv(path / vids[0])
df_vid2 = pd.read_csv(path / vids[1])
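def str2dt(dt_str):
    """Convert a 'year-month-day-hour-minute-second' string, e.g. '2018-5-5-1-48-9', to a datetime object."""
    temp_dt = [int(i) for i in dt_str.split('-')]
    # Only the first five fields are used, so the seconds are dropped.
    return datetime.datetime( temp_dt[0], temp_dt[1], temp_dt[2], temp_dt[3], temp_dt[4] )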
df_uupers['dt'] = [ str2dt(i) for i in df_uupers['dt'] ]
df_vid1['dt'] = [ str2dt(i) for i in df_vid1['dt'] ]
df_vid2['dt'] = [ str2dt(i) for i in df_vid2['dt'] ]
df_uupers = df_uupers.set_index('dt').reset_index()
df_vid1 = df_vid1.set_index('dt').reset_index()
df_vid2 = df_vid2.set_index('dt').reset_index()
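
As an alternative to splitting the string by hand, the standard library can parse the same dash-separated format directly. This is a minimal sketch, assuming every timestamp has the six fields year-month-day-hour-minute-second, as in `2018-5-5-1-48-9`. The name `parse_dt` is hypothetical, and unlike `str2dt` it keeps the seconds.

```python
import datetime


def parse_dt(text):
    """Parse a 'Y-M-D-h-m-s' string such as '2018-5-5-1-48-9' (assumed format)."""
    # strptime accepts numeric fields without zero padding.
    return datetime.datetime.strptime(text, "%Y-%m-%d-%H-%M-%S")


example = parse_dt("2018-5-5-1-48-9")
```

If the tracker ever writes a different timestamp layout, only the format string would need to change.
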
plt.plot(df_uupers[' fans'])
# Columns of the video data (df_vid1, df_vid2):
# ['dt', 'view', 'danmaku', 'reply', 'favorite', 'coin', 'share',
#  'now_rank', 'his_rank', 'no_reprint', 'copyright']
total_views = df_vid1['view'].values + df_vid2['view'].values
total_coins = df_vid1['coin'].values + df_vid2['coin'].values
df_vid_tv = pd.DataFrame({"dt": df_vid1['dt'].values,"view":total_views, "coin":total_coins})
# Columns of df_uupers: ['dt', ' fans'] (note the leading space in ' fans')
df_vid_tv_slope = ( (df_vid_tv['view'].values)[1:] - (df_vid_tv['view'].values)[:-1] )
df_vid_tc_slope = ( (df_vid_tv['coin'].values)[1:] - (df_vid_tv['coin'].values)[:-1] )
df_uupers_slope = ( (df_uupers[' fans'].values)[1:] - (df_uupers[' fans'].values)[:-1] )
ratio_f = 20
ratio_c = 10

plt.plot( df_vid_tv_slope, 'k-', label='views' )
plt.plot( df_vid_tc_slope * ratio_c, 'b-.', label='coins'+'*'+str(ratio_c) )
plt.plot( df_uupers_slope *ratio_f, 'r--', label='fans'+'*'+str(ratio_f) )
plt.legend()
plt.tick_params(
    axis='x',          # changes apply to the x-axis
    which='both',      # both major and minor ticks are affected
    bottom=False,      # ticks along the bottom edge are off
    top=False,         # ticks along the top edge are off
    labelbottom=False) # labels along the bottom edge are off

Conclusion for now

  1. The conversion rate from views to coins is about 1/10

  2. The conversion rate from views to fans is about 1/20
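
The ratios above were tuned by eye until the curves overlapped. Here is a minimal sketch of estimating such a ratio numerically, assuming the increments are roughly proportional (target ≈ rate × views) and fitting a line through the origin by least squares. The arrays are synthetic stand-ins, not the tracked data, and `conversion_rate` is a hypothetical helper.

```python
import numpy as np


def conversion_rate(view_increments, target_increments):
    """Least-squares slope through the origin: target ≈ rate * views."""
    x = np.asarray(view_increments, dtype=float)
    y = np.asarray(target_increments, dtype=float)
    return float(np.dot(x, y) / np.dot(x, x))


# Synthetic increments (assumption, for illustration only):
# fans grow at 1/20 of the views, plus some noise.
rng = np.random.default_rng(0)
views = rng.integers(50, 500, size=200).astype(float)
fans = views / 20 + rng.normal(0.0, 1.0, size=200)

rate = conversion_rate(views, fans)
```

On the real data the same helper could be called on `df_vid_tv_slope` and `df_uupers_slope`, provided the two arrays have equal length and their timestamps are aligned.
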

74.2. Cross Correlation of Time Series

The estimates above were made by eye. Mathematically, the cross correlation of two time series measures how strongly they are correlated at each time lag.
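
A minimal sketch of that idea, assuming two equally sampled series of the same length. It standardises both series and uses `np.correlate` in `full` mode to get the correlation at each lag; the lag of the peak indicates by how many samples one series trails the other. The data is synthetic and `cross_correlation` is a hypothetical helper name.

```python
import numpy as np


def cross_correlation(a, b):
    """Return (lags, normalised cross correlation) of two equal-length series."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Standardise so that a perfect match at some lag gives a value near 1.
    a = (a - a.mean()) / a.std()
    b = (b - b.mean()) / b.std()
    n = len(a)
    # corr[k] = sum_n a[n + lag] * b[n] / n, for lag = -(n - 1) ... (n - 1)
    corr = np.correlate(a, b, mode="full") / n
    lags = np.arange(-(n - 1), n)
    return lags, corr


# Synthetic example (assumption): `fans` is `views` delayed by 3 samples.
rng = np.random.default_rng(1)
views = rng.normal(size=300)
fans = np.roll(views, 3)

lags, corr = cross_correlation(fans, views)
best_lag = int(lags[np.argmax(corr)])  # positive lag: fans trail views
```

For series with a strong trend, such as cumulative views, it is usually the increments rather than the raw counts that should be passed in, since a shared trend dominates the correlation at every lag.
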

74.3. Missing Videos

missing_vids_path = "/Users/leima/OneDrive - University of New Mexico/data/mybilividdata/failed_vid_data.csv"
missing_vids = np.genfromtxt(missing_vids_path, delimiter=',')
plt.plot(missing_vids)
plt.title('Missing Videos')
plt.ylabel('Video ID')
plt.xlabel('Ordered Position of Video ID')
# plt.tick_params(
#     axis='x',          # changes apply to the x-axis
#     which='both',      # both major and minor ticks are affected
#     bottom=False,      # ticks along the bottom edge are off
#     top=False,         # ticks along the top edge are off
#     labelbottom=False) # labels along the bottom edge are off


Calculate missing video rate

def missing_rate(missing_vids_in, start, win):
    """Return the fraction of video IDs in [start, start + win) that are missing."""
    # Count the missing IDs inside the half-open window, so that an ID on a
    # window boundary is counted in exactly one window.
    missing_vids_f = [ i for i in missing_vids_in if start <= i < start + win ]

    return len( missing_vids_f )/win
windowsize = 100000

missing_rate_list = []

missing_data_start = 0
missing_data_flag = missing_data_start
missing_data_end = max(missing_vids)

for i in np.arange(missing_data_start, missing_data_end - windowsize, windowsize ):
    missing_rate_list.append( missing_rate( missing_vids, missing_data_flag, windowsize ) )
    missing_data_flag = missing_data_flag + windowsize
plt.plot( missing_rate_list, 'k.' )
plt.title('Missing Video Rate')
plt.ylabel('Missing Video Rate')
plt.xlabel('Abstracted Time')
# plt.ylim([0,1])
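
The windowed rate can also be computed in one pass with `np.histogram`, which counts how many missing IDs fall into each fixed-width bin. This is a minimal sketch under the same assumption as the loop, namely that every integer in a window is a candidate video ID. The name `missing_rate_hist` is hypothetical and the IDs below are made up.

```python
import numpy as np


def missing_rate_hist(missing_ids, window):
    """Fraction of missing IDs in each consecutive window of `window` IDs."""
    ids = np.asarray(missing_ids, dtype=float)
    ids = ids[~np.isnan(ids)]  # np.genfromtxt yields NaN for unparsable cells
    n_windows = int(ids.max() // window) + 1
    edges = np.arange(n_windows + 1) * window  # 0, window, 2 * window, ...
    counts, _ = np.histogram(ids, bins=edges)
    return counts / window


# Made-up IDs: three in [0, 10), one in [10, 20), two in [20, 30).
rates = missing_rate_hist([1, 2, 3, 12, 25, 26], window=10)
```

Unlike the loop, this version also reports the final, partially filled window, so its last value tends to underestimate the rate.
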

The missing video id data rate is about


The reason is not yet known. It might be because the website does not assign every ID to a video, or because the videos were submitted and later deleted.