py_script

Crawler and ETL task scripts for 影视宝.

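Conventions

  1. idl holds final data, odl holds intermediate processing tables, tmp holds temporary tables, and yxb holds raw data.
  2. Tables whose names start with view are database views.
  3. province refers to CCTV and provincial satellite channels; area refers to provincial terrestrial channels.
  4. Scheduled crontab jobs invoke the shell scripts.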
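
The schema and view naming conventions above can be expressed as a small lookup. This is only a minimal sketch: the helper name `describe_table` and the wording of the returned labels are assumptions for illustration and do not come from the repository.

```python
# Hypothetical helper illustrating the naming conventions; not part of the repository.
LAYERS = {
    "yxb": "raw data",
    "odl": "intermediate processing table",
    "tmp": "temporary table",
    "idl": "final data",
}


def describe_table(qualified_name):
    """Return (layer, is_view) for a name such as 'odl.ad_television'."""
    # Split 'schema.table' at the first dot; a name without a dot has an empty table part.
    schema, _, table = qualified_name.partition(".")
    layer = LAYERS.get(schema, "unknown")
    # By convention, tables whose names start with 'view' are database views.
    is_view = table.startswith("view")
    return layer, is_view
```

For example, `describe_table("odl.ad_television")` reports an intermediate processing table that is not a view.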

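Data sources

| Category | Contents | Source |
| --- | --- | --- |
| Ratings data | Audience segment, time, region, series title, TV station, rating, etc. | Manual import |
| Basic TV series data | Key creators (director, producer, screenwriter), lead cast, male and female leads, etc.; plot synopsis | Crawlers (Baidu, etc.) |
| TV series distribution and filing data | Series title, category, release date, etc. | |
| Uploaded TV series | Key creators (director, producer, screenwriter), lead cast, male and female leads, etc.; plot synopsis | User input |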

Processing scripts

| Method | File | Data flow |
| --- | --- | --- |
| Monthly ratings statistics | tmp_tv_avg_ratings_fatt0.py | tmp.ad_television_month -> tmp.month_channel_stat |
| Ratings for each TV series | tmp_tv_avg_ratings_stat.py | tmp.month_channel_stat -> tmp.tv_avg_ratings |
| Categories etc. of the TV series aired by each station | tmp_tv_category_stat.py | odl.ad_television + odl.ad_tv_lib -> tmp.tv_category_stat |
| Average rating of each station over the past year | tmp_year_channel_avg_ratings_stat_by_tv | odl.ad_television -> tmp.channel_avg_ratings |
| Import all-day + 52-city ratings data into odl | odl_ad_audience_cps_time.py | yxb.ad_audience_cps_time -> odl.ad_audience_cps_time |
| Incremental update of all-day + 52-city ratings data in odl | odl_ad_audience_cps_time_incr_update.py | yxb.ad_audience_cps_time -> odl.ad_audience_cps_time |
| Import ad_television + ad_rating + ad_tv_lib into odl.ad_television | odl_ad_television.py | yxb.adtelevision(2010,2011...) + yxb.adrating(2010,2011...) -> odl.ad_television |
| Incremental update of odl.ad_television | odl_ad_television_incr_update.py | Same as above |
| Insert new data into odl.ad_tv_lib | odl_ad_tv_lib | yxb.ad_tv_lib -> odl.ad_tv_lib |
| Update data in odl.ad_tv_lib | odl_ad_tv_lib_insert | Same as above |
| Insert new data into odl.ad_tv_record_distribution | odl_ad_tv_record_distribution.py | odl.dsj_gongshi (TV series filing data) and odl.faxing (TV series distribution data) -> odl.ad_tv_record_distribution |
| Incremental update of odl.ad_tv_record_distribution | odl_ad_tv_record_distribution_insert.py | Same as above |
| Insert new local station data | odl_area_ad_television.py | yxb.ad_television_tetv + yxb.ad_rating_tetv -> odl.area_ad_television |
| Incremental update of local station data | odl_area_ad_television_incr_update.py | Same as above |
| Load ad_tv_record_distribution from odl into idl | idl_ad_tv_record_distribution.py | odl.ad_tv_record_distribution -> idl.ad_tv_record_distribution |
| Load tv_avg_ratings from tmp into idl | idl_tv_avg_ratings_stat.py | tmp.tv_avg_ratings -> idl.tv_avg_ratings |
| Load tv_category_stat from tmp into idl | idl_tv_category_stat.py | tmp.tv_category_stat -> idl.tv_category_stat |
| Load channel_avg_ratings from tmp into idl | idl_year_channel_avg_ratings_stat.py | tmp.channel_avg_ratings -> idl.tv_channel_avg_ratings |
| Clean field contents: strip the text '报备机构' (filing organization), line breaks, etc. | odl_ad_tv_record_distribution_update_company_field.py | odl.ad_tv_record_distribution |
| Clean field contents: strip spaces, line breaks, etc. | odl_ad_tv_record_distribution_update_theme_field.py | odl.ad_tv_record_distribution |
| Tidy the category field | scrapy_category_clean.py | scrapy.tv_category_scrapy |
| Tidy the category field | scrapy_category_update.py | scrapy.tv_category_scrapy |
| Tidy the category field | scrapy_dianshiju_clean.py | scrapy.tv_category_scrapy |
| Split the multi-valued category data in tv_category_scrapy and store it in the category relation table | tv_category_relation.py | scrapy.tv_category_scrapy -> odl.tv_category_relation |
| | update_date.py | |
| | update_first_type.py | |
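
The step handled by tv_category_relation.py, splitting the multi-valued category field of scrapy.tv_category_scrapy into one row per category for odl.tv_category_relation, might look roughly like the sketch below. The separator set, the `tv_id` key, and both function names are assumptions made for illustration; the actual script's logic is not described in this README and may differ.

```python
import re

# Assumed separators between category values; the real data may use others.
_SEPARATORS = re.compile(r"[,,、/|;;\s]+")


def split_categories(raw):
    """Split one multi-valued category string into clean, de-duplicated values."""
    if not raw:
        return []
    seen = []
    for part in _SEPARATORS.split(raw):
        part = part.strip()
        # Skip empty fragments and keep only the first occurrence of each value.
        if part and part not in seen:
            seen.append(part)
    return seen


def to_relation_rows(tv_id, raw):
    """Build (tv_id, category) pairs, one per category value (hypothetical row shape)."""
    return [(tv_id, category) for category in split_categories(raw)]
```

Each returned pair would correspond to one row of the relation table; removing stray whitespace and line breaks during the split mirrors the cleaning steps listed for the other scripts.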