@@ -1,3 +1,51 @@
 # py_script

-Crawler and ETL task scripts for 影视宝
+Crawler and ETL task scripts for 影视宝
+
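+## Conventions
+1. `idl` holds final data, `odl` holds intermediate processing tables, `tmp` holds temporary tables, and `yxb` holds raw data.
+2. Tables whose names start with `view` are database views.
+3. `province` refers to CCTV and satellite channels (央卫视); `area` refers to provincial terrestrial channels.
+4. Scheduled crontab jobs invoke the shell scripts.
+5.
+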
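+## Data sources
+| Data category | Content | Source |
+|---------------|---------|--------|
+| Ratings data | Audience segment, time, region, series title, TV station, rating, etc. | Manual import |
+| Basic TV series data | Key creators (director, producer, screenwriter), lead cast (male lead, female lead, etc.), plot synopsis | Crawlers (Baidu and others) |
+| TV series distribution and filing data | Series title, category, release date, etc. | 广电总局 (state broadcasting regulator) crawler |
+| Uploaded TV series | Key creators (director, producer, screenwriter), lead cast (male lead, female lead, etc.), plot synopsis | User entry |
+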
+## Processing scripts
+| Task | File | Data flow |
+|------|------|-----------|
+| Monthly ratings statistics | tmp_tv_avg_ratings_fatt0.py | tmp.ad_television_month -> tmp.month_channel_stat |
+| Ratings for each TV series | tmp_tv_avg_ratings_stat.py | tmp.month_channel_stat -> tmp.tv_avg_ratings |
+| Categories etc. of the series aired by each TV station | tmp_tv_category_stat.py | odl.ad_television + odl.ad_tv_lib -> tmp.tv_category_stat |
+| Average rating per TV station over the past year | tmp_year_channel_avg_ratings_stat_by_tv | odl.ad_television -> tmp.channel_avg_ratings |
+| Import all-day + 52-city ratings data into odl | odl_ad_audience_cps_time.py | yxb.ad_audience_cps_time -> odl.ad_audience_cps_time |
+| Incremental update of all-day + 52-city ratings data in odl | odl_ad_audience_cps_time_incr_update.py | yxb.ad_audience_cps_time -> odl.ad_audience_cps_time |
+| Import ad_television + ad_rating + ad_tv_lib into odl.ad_television | odl_ad_television.py | yxb.ad_television_(2010,2011...) + yxb.ad_rating_(2010,2011...) -> odl.ad_television |
+| Incremental update of odl.ad_television | odl_ad_television_incr_update.py | Same as above |
+| Add data to odl.ad_tv_lib | odl_ad_tv_lib | yxb.ad_tv_lib -> odl.ad_tv_lib |
+| Update data in odl.ad_tv_lib | odl_ad_tv_lib_insert | Same as above |
+| Add data to odl.ad_tv_record_distribution | odl_ad_tv_record_distribution.py | odl.dsj_gongshi (TV series filing data) + odl.faxing (TV series distribution data) -> odl.ad_tv_record_distribution |
+| Incremental update of odl.ad_tv_record_distribution | odl_ad_tv_record_distribution_insert.py | Same as above |
+| Add local station data | odl_area_ad_television.py | yxb.ad_television_tetv + yxb.ad_rating_tetv -> odl.area_ad_television |
+| Incremental update of local station data | odl_area_ad_television_incr_update.py | Same as above |
+| Load ad_tv_record_distribution from odl into idl | idl_ad_tv_record_distribution.py | odl.ad_tv_record_distribution -> idl.ad_tv_record_distribution |
+| Load tv_avg_ratings from tmp into idl | idl_tv_avg_ratings_stat.py | tmp.tv_avg_ratings -> idl.tv_avg_ratings |
+| Load tv_category_stat from tmp into idl | idl_tv_category_stat.py | tmp.tv_category_stat -> idl.tv_category_stat |
+| Load channel_avg_ratings from tmp into idl | idl_year_channel_avg_ratings_stat.py | tmp.channel_avg_ratings -> idl.tv_channel_avg_ratings |
+| Clean field content: strip '报备机构' (filing organization), line breaks, etc. | odl_ad_tv_record_distribution_update_company_field.py | odl.ad_tv_record_distribution |
+| Clean field content: strip spaces, line breaks, etc. | odl_ad_tv_record_distribution_update_theme_field.py | odl.ad_tv_record_distribution |
+| Tidy the category field | scrapy_category_clean.py | scrapy.tv_category_scrapy |
+| Tidy the category field | scrapy_category_update.py | scrapy.tv_category_scrapy |
+| Tidy the category field | scrapy_dianshiju_clean.py | scrapy.tv_category_scrapy |
+| Split the multi-valued category data in tv_category_scrapy and store it in the category relation table | tv_category_relation.py | scrapy.tv_category_scrapy -> odl.tv_category_relation |
+| | update_date.py | |
+| | update_first_type.py | |
+
+
+
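The data-flow column above implies a run order: a script that reads a table should run after the script that writes it. The README does not say how the jobs are actually sequenced, so the following is only a minimal sketch of one way to derive an order from a few of the listed flows. The `FLOWS` mapping and the `run_order` helper are hypothetical names introduced for illustration and are not part of the repository.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# A few rows copied from the table: script -> (input tables, output table).
# This mapping is an illustrative assumption, not a file in the repository.
FLOWS = {
    "odl_ad_tv_lib": (["yxb.ad_tv_lib"], "odl.ad_tv_lib"),
    "tmp_tv_category_stat.py": (
        ["odl.ad_television", "odl.ad_tv_lib"],
        "tmp.tv_category_stat",
    ),
    "idl_tv_category_stat.py": (["tmp.tv_category_stat"], "idl.tv_category_stat"),
    "tmp_tv_avg_ratings_fatt0.py": (
        ["tmp.ad_television_month"],
        "tmp.month_channel_stat",
    ),
    "tmp_tv_avg_ratings_stat.py": (["tmp.month_channel_stat"], "tmp.tv_avg_ratings"),
    "idl_tv_avg_ratings_stat.py": (["tmp.tv_avg_ratings"], "idl.tv_avg_ratings"),
}


def run_order(flows):
    """Return the scripts ordered so that producers run before consumers."""
    # Map each output table to the script that writes it.
    producer = {out: script for script, (_, out) in flows.items()}
    # For each script, collect the scripts that produce its input tables.
    # Inputs with no known producer (for example raw yxb tables) are skipped.
    graph = {
        script: {producer[table] for table in inputs if table in producer}
        for script, (inputs, _) in flows.items()
    }
    return list(TopologicalSorter(graph).static_order())


ORDER = run_order(FLOWS)
print(ORDER)
```

The exact order among unrelated scripts is not fixed; only the producer-before-consumer constraint is guaranteed, which matches the yxb -> odl -> tmp -> idl layering described in the conventions.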