74. Data Analysis for BiliDash

This notebook covers the preliminary data analysis and probing (description phase) of the BiliDash data.

74.1. Time Series Data of UUPERS Account on bilibili.com

I have been tracking statistics such as the number of fans, the number of views, and coins.

74.1.1. Correlation between Video Views and Fans

Some fraction of video views should convert into new fans. With the time series data, I can estimate how the two are correlated.

The first idea is that the slope of views and the slope of fans (the increment between consecutive samples) should be related.

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import datetime

from pathlib import Path
%%bash
ls data
21768006.csv
22755224.csv
uupers.csv
# path = "data/"
# path = "http://dynamic.imresear.ch/server/track-data/uupers/"
path = Path("data")
uupers = "uupers.csv"
vids = ["21768006.csv", "22755224.csv"]
df_uupers = pd.read_csv( path / uupers)
df_vid1 = pd.read_csv(path / vids[0])
df_vid2 = pd.read_csv(path / vids[1])
def str2dt(dt_str):
    """
    Convert a 'year-month-day-hour-minute-second' string to a datetime object.

    Only the first five fields are used, so the seconds are dropped.
    """

    temp_dt = [int(i) for i in dt_str.split('-')]
    return datetime.datetime( temp_dt[0], temp_dt[1], temp_dt[2], temp_dt[3], temp_dt[4] )
    
    
df_uupers['dt'] = [ str2dt(i) for i in df_uupers['dt'] ]
df_vid1['dt'] = [ str2dt(i) for i in df_vid1['dt'] ]
df_vid2['dt'] = [ str2dt(i) for i in df_vid2['dt'] ]
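As an alternative sketch, the standard library's `datetime.strptime` can parse the same dash-separated timestamps in one call. This assumes every value in the `dt` column has the six fields year-month-day-hour-minute-second, as in a value like `2018-5-5-1-48-9`. The names `DT_FORMAT` and `parse_dt` are hypothetical and not part of the notebook. Unlike `str2dt`, this version keeps the seconds field.

```python
import datetime

# Assumed layout of the 'dt' column: year-month-day-hour-minute-second.
DT_FORMAT = "%Y-%m-%d-%H-%M-%S"


def parse_dt(dt_str):
    """Parse a dash-separated timestamp string into a datetime object."""
    return datetime.datetime.strptime(dt_str, DT_FORMAT)


# Fields do not need to be zero-padded.
parsed = parse_dt("2018-5-5-1-48-9")
print(parsed)
```

The same function could be applied to a column with `df['dt'].map(parse_dt)`.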
df_uupers = df_uupers.set_index('dt').reset_index()
df_vid1 = df_vid1.set_index('dt').reset_index()
df_vid2 = df_vid2.set_index('dt').reset_index()
plt.plot(df_uupers[' fans'])
[<matplotlib.lines.Line2D at 0x7fda469f4e80>]
../../_images/bilibili-data-analysis_12_1.png
df_vid1.columns
Index(['dt', 'view', 'danmaku', 'reply', 'favorite', 'coin', 'share',
       'now_rank', 'his_rank', 'no_reprint', 'copyright'],
      dtype='object')
total_views = df_vid1['view'].values + df_vid2['view'].values
total_coins = df_vid1['coin'].values + df_vid2['coin'].values
df_vid_tv = pd.DataFrame({"dt": df_vid1['dt'].values,"view":total_views, "coin":total_coins})
df_vid_tv['dt'].head(4)
0   2018-05-05 01:48:00
1   2018-05-05 01:48:00
2   2018-05-05 01:48:00
3   2018-05-05 01:48:00
Name: dt, dtype: datetime64[ns]
df_uupers['dt'].head(4)
0   2018-05-05 01:48:00
1   2018-05-05 01:48:00
2   2018-05-05 01:48:00
3   2018-05-05 01:48:00
Name: dt, dtype: datetime64[ns]
df_uupers.columns
Index(['dt', ' fans'], dtype='object')
# Increment between consecutive samples, used as a discrete slope
df_vid_tv_slope = np.diff(df_vid_tv['view'].values)
df_vid_tc_slope = np.diff(df_vid_tv['coin'].values)
df_uupers_slope = np.diff(df_uupers[' fans'].values)
ratio_f = 20
ratio_c = 10

plt.figure(figsize=(10,4))
plt.plot( df_vid_tv_slope, 'k-', label='views' )
plt.plot( df_vid_tc_slope * ratio_c, 'b-.', label='coins'+'*'+str(ratio_c) )
plt.plot( df_uupers_slope *ratio_f, 'r--', label='fans'+'*'+str(ratio_f) )
plt.tick_params(
    axis='x',          # changes apply to the x-axis
    which='both',      # both major and minor ticks are affected
    bottom=False,      # ticks along the bottom edge are off
    top=False,         # ticks along the top edge are off
    labelbottom=False) # labels along the bottom edge are off
plt.legend()
plt.show()
../../_images/bilibili-data-analysis_20_0.png

Conclusion for now

  1. The conversion rate from views to coins is about 1/10 (roughly one coin per 10 views)

  2. The conversion rate from views to fans is about 1/20 (roughly one new fan per 20 views)
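The two ratios above were read off the plot by rescaling the curves until they overlapped. As a rough cross-check, the ratio could instead be fitted by least squares through the origin, i.e. the value r that minimizes the squared difference between r * (view increments) and the other increments. This is only a sketch: the function name `conversion_rate` is hypothetical, and the sample numbers are made up for illustration, not taken from the tracked data.

```python
import numpy as np


def conversion_rate(view_increments, other_increments):
    """Least-squares ratio r such that other ~= r * views (no intercept)."""
    v = np.asarray(view_increments, dtype=float)
    o = np.asarray(other_increments, dtype=float)
    # Closed-form solution of min_r sum((o - r * v)**2)
    return float(v @ o / (v @ v))


# Made-up increments where fans grow at exactly 1/20 of the views
example_views = np.array([100.0, 200.0, 300.0])
example_fans = example_views / 20
rate = conversion_rate(example_views, example_fans)
print(rate)
```

With the arrays from this notebook, the call would be `conversion_rate(df_vid_tv_slope, df_uupers_slope)`, assuming both series are sampled at the same times and have the same length.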

74.2. Cross Correlation of Time Series

The estimates above were made by hand. A more systematic approach is the cross correlation, which measures how similar two time series are as a function of the lag between them.
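A minimal sketch of a normalized cross correlation with `np.correlate` is given below. It is not part of the original analysis: the function name `cross_correlation` is hypothetical, and the demo uses synthetic random increments in which the second series is a scaled copy of the first, delayed by three samples. Both inputs are standardized first, so the overall scale (such as a factor of 1/20) does not affect the result.

```python
import numpy as np


def cross_correlation(x, y):
    """Return (lags, corr); a peak at a positive lag means y trails x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Standardize so the result does not depend on offset or scale
    x = (x - x.mean()) / x.std()
    y = (y - y.mean()) / y.std()
    # np.correlate(y, x)[k] = sum_n y[n + k] * x[n]
    corr = np.correlate(y, x, mode="full") / min(len(x), len(y))
    lags = np.arange(-(len(x) - 1), len(y))
    return lags, corr


# Synthetic demo: "fans" follow "views" three samples later, scaled by 1/20
rng = np.random.default_rng(0)
view_increments = rng.normal(size=200)
fan_increments = np.roll(view_increments, 3) / 20

lags, corr = cross_correlation(view_increments, fan_increments)
best_lag = lags[np.argmax(corr)]
print(best_lag)
```

Applied to the slopes of views and fans, the position of the peak would indicate how many samples the fan count trails the views, and the peak height how strongly the two are related.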

74.3. Missing Videos

missing_vids_path = "/Users/leima/OneDrive - University of New Mexico/data/mybilividdata/failed_vid_data.csv"
missing_vids = np.genfromtxt(missing_vids_path, delimiter=',')
FileNotFoundError: /Users/leima/OneDrive - University of New Mexico/data/mybilividdata/failed_vid_data.csv not found.
plt.figure(figsize=(10,5))
plt.plot(missing_vids[:500],'k.')
plt.title('Missing Videos')
plt.ylabel('Video ID')
plt.xlabel('Ordered Position of Video ID')
# plt.tick_params(
#     axis='x',          # changes apply to the x-axis
#     which='both',      # both major and minor ticks are affected
#     bottom=False,      # ticks along the bottom edge are off
#     top=False,         # ticks along the top edge are off
#     labelbottom=False) # labels along the bottom edge are off

plt.show()
../../_images/bilibili-data-analysis_27_0.png

Calculate missing video rate

def missing_rate(missing_vids_in, start, win):
    """
    Fraction of the video IDs in the window [start, start + win) that are missing.
    """

    # Half-open window, so consecutive windows neither overlap nor skip the boundary ID
    missing_vids_f = [ i for i in missing_vids_in if start <= i < start + win ]

    return len( missing_vids_f )/win
windowsize = 100000

missing_rate_list = []

missing_data_start = 0
missing_data_flag = missing_data_start
missing_data_end = max(missing_vids)

for i in np.arange(missing_data_start, missing_data_end - windowsize, windowsize ):
    
    missing_rate_list.append( missing_rate( missing_vids, missing_data_flag, windowsize ) )
    
    missing_data_flag = missing_data_flag + windowsize
    
plt.figure(figsize=(10,5))
plt.plot( missing_rate_list, 'k.' )
plt.title('Missing Video Rate')
plt.ylabel('Missing Video Rate')
plt.xlabel('Abstracted Time')
# plt.ylim([0,1])
plt.show()
../../_images/bilibili-data-analysis_31_0.png

The average missing video ID rate is about

np.average(missing_rate_list)
0.4731796551724139

We do not yet know the reason. It might be because the website does not assign every video ID. It might also be because the videos were submitted and later deleted.
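The windowed missing rate computed above scans the full list of missing IDs once per window. As a sketch of an alternative, `np.histogram` can count the missing IDs in all windows in a single pass. The function name `windowed_missing_rate` is hypothetical, the sample IDs are made up, and only complete windows starting from ID 0 are kept, which may differ by one window from the loop above.

```python
import numpy as np


def windowed_missing_rate(missing_ids, window):
    """Missing rate in each full window [k * window, (k + 1) * window)."""
    ids = np.asarray(missing_ids, dtype=float)
    ids = ids[~np.isnan(ids)]
    if ids.size == 0:
        return np.array([])
    n_windows = int(ids.max() // window)
    if n_windows == 0:
        return np.array([])
    edges = np.arange(n_windows + 1) * window
    # np.histogram closes the last bin on the right, so drop IDs at or
    # beyond the final edge to keep every window half-open.
    ids = ids[ids < edges[-1]]
    counts, _ = np.histogram(ids, bins=edges)
    return counts / window


# Made-up IDs: three fall in [0, 10), two in [10, 20), one in [20, 30)
example_ids = [0, 3, 9, 10, 15, 27, 30]
rates = windowed_missing_rate(example_ids, window=10)
print(rates)
```

For the data in this notebook, the call would be `windowed_missing_rate(missing_vids, windowsize)`.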