Contents
- SQL
- Python
- Open-ended business question (choose one of two)
- Full code
SQL
Problem statement
Using SQL, split products into five groups by their total GMV over the prior week: Top 20%, 20%-40%, 40%-60%, 60%-80%, and Bottom 20%. Then compute each group's week-over-week GMV change, i.e., the following week's GMV relative to the prior week's.
- The table is named tmp_data
- Fields: pID (product name), dID (sale date, YYYY-MM-DD format), gmv (GMV amount)
- Assume the data is continuous and complete, i.e., no product has a zero-sales day in either week
Implement in SQL; submit both the code and the computed results.
select min(dID), max(dID) from tmp_data;  # date-range check: 2022-10-29 to 2022-11-11

with last_week as (  # per-product GMV over the prior week, ranked descending
    select pID, sum(gmv) as gmv_,
           cume_dist() over (order by sum(gmv) desc) as rn
    from tmp_data
    where dID between '2022-10-29' and '2022-11-04'
    group by pID
),
labeled as (  # quintile label assigned from the prior week's ranking
    select pID, gmv_,
           (case when rn <= 0.2 then 'Top 20%'
                 when rn <= 0.4 then '20%-40%'
                 when rn <= 0.6 then '40%-60%'
                 when rn <= 0.8 then '60%-80%'
                 else 'Bottom 20%' end) as label
    from last_week
),
recent_week as (  # per-product GMV over the following week
    select pID, sum(gmv) as gmv_
    from tmp_data
    where dID between '2022-11-05' and '2022-11-11'
    group by pID
)
select l.label,  # groups stay fixed by the prior week's label, as the problem requires
       concat(round((sum(r.gmv_) - sum(l.gmv_)) / sum(l.gmv_) * 100, 2), '%') as wow
from labeled l
inner join recent_week r on r.pID = l.pID
group by l.label;

label       wow
Top 20% -0.72%
20%-40% -0.58%
40%-60% -0.65%
60%-80% -0.42%
Bottom 20% -0.46%
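As a quick cross-check, the same bucketing can be sketched in pandas; a minimal sketch, assuming the table has been loaded into a DataFrame df with the same three columns and dID stored as YYYY-MM-DD strings:

import pandas as pd

def weekly_gmv(frame, start, end):  # total GMV per product within [start, end]
    mask = (frame["dID"] >= start) & (frame["dID"] <= end)
    return frame.loc[mask].groupby("pID")["gmv"].sum()

last = weekly_gmv(df, "2022-10-29", "2022-11-04")    # prior week
recent = weekly_gmv(df, "2022-11-05", "2022-11-11")  # following week
buckets = ["Top 20%", "20%-40%", "40%-60%", "60%-80%", "Bottom 20%"]
label = pd.qcut(last.rank(method="first", ascending=False), 5, labels=buckets)  # quintiles of prior-week GMV, top sellers first
wow = (recent.groupby(label).sum() / last.groupby(label).sum() - 1) * 100  # week-over-week % change per group
print(wow.round(2).astype(str) + "%")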
Python
Problem statement
The attached dataset (conversion_data.csv) contains user conversion information from a platform. The ultimate business goal is to raise the conversion rate; the analysis team is expected to use data analysis, probability and statistics, or machine learning methods to give the business side actionable recommendations.
- In the table, grpID is the group identifier, base is the group's volume, cvr is the group's conversion rate (the core target), and feature1-feature28 are 28 observed candidate features
Implement in Python or R (your choice). The final deliverable should include comments, code, output, and charts that demonstrate the analysis approach and flow, and the notebook should be converted to PDF before submission.
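For the PDF export step, jupyter nbconvert --to pdf your_notebook.ipynb is the standard route (the filename here is a placeholder; the PDF exporter needs a LaTeX toolchain, and --to webpdf is an alternative that does not).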
To raise the conversion rate, the key is to identify which features influence cvr positively, which influence it negatively, and how large each effect is; amplifying the positive drivers and suppressing the negative ones is then what lifts cvr. The problem thus reduces to finding each feature's effect on cvr. Concretely: first compute the Pearson correlation between each feature and cvr and inspect the correlation heatmap, then build a model and examine how important each feature is to cvr under that model.
Correlation matrix & heatmap
Corr = df.corr()  # correlation matrix
cor = Corr.sort_values(by=['cvr'], axis=0, ascending=True)  # rows sorted ascending by correlation with cvr
sns.heatmap(Corr, vmax=.8, square=True)  # heatmap
plt.show()
base cvr Feature1 Feature2 Feature3 ... Feature24 Feature25 Feature26 Feature27 Feature28
Feature6 0.224114 -0.756728 -0.808393 -0.181685 0.288669 ... -0.352540 -0.726383 -0.769274 0.566497 0.109772
Feature27 0.713999 -0.703784 -0.678812 -0.267156 0.332391 ... -0.079622 -0.656632 -0.682684 1.000000 -0.010947
Feature3 0.478730 -0.534602 -0.517856 -0.988208 1.000000 ... 0.322946 -0.556814 -0.411315 0.332391 0.601913
Feature19 0.119104 -0.517936 -0.585010 -0.224186 0.281394 ... -0.565172 -0.476672 -0.562676 0.343562 -0.180092
Feature7 0.370844 -0.474294 -0.196186 0.110337 -0.039697 ... -0.158661 -0.108217 -0.143505 0.476188 -0.495868
Feature10 0.577230 -0.449998 -0.503034 -0.905145 0.891294 ... 0.310175 -0.551352 -0.474629 0.487161 0.658157
base 1.000000 -0.441421 -0.392978 -0.471431 0.478730 ... 0.068126 -0.376921 -0.375078 0.713999 0.178519
Feature18 0.077272 -0.203571 -0.236679 0.490290 -0.408194 ... -0.220825 -0.208170 -0.276678 0.254997 -0.086524
Feature28 0.178519 -0.183497 -0.214614 -0.640599 0.601913 ... 0.655401 -0.336748 -0.198847 -0.010947 1.000000
Feature22 0.004151 -0.177423 -0.279160 0.354691 -0.290391 ... -0.073605 -0.297561 -0.361559 0.181035 0.201477
Feature23 -0.160149 -0.108619 -0.093559 0.484322 -0.411309 ... 0.175318 -0.136582 0.068336 0.011303 -0.206764
Feature9 0.118227 -0.059924 -0.003391 0.244152 -0.222774 ... -0.589892 0.108395 -0.172524 0.066214 -0.267187
Feature24 0.068126 -0.018481 0.118036 -0.373320 0.322946 ... 1.000000 -0.054716 0.229332 -0.079622 0.655401
Feature14 -0.081404 0.026290 0.010087 -0.241803 0.230874 ... 0.541626 -0.077957 0.182030 -0.021916 0.184277
Feature16 0.258838 0.070469 0.312441 -0.116694 0.075583 ... -0.239778 0.400829 0.201863 -0.013232 -0.215665
Feature20 -0.196267 0.171468 0.081339 0.106536 -0.106389 ... -0.170723 0.123204 0.053974 0.066488 -0.168935
Feature12 -0.268578 0.175629 0.158582 0.893915 -0.871301 ... -0.477747 0.230034 -0.013711 0.031699 -0.575051
Feature5 0.342802 0.216372 0.380177 -0.070081 0.029075 ... 0.181418 0.354419 0.398978 -0.087208 -0.211474
Feature15 0.126423 0.227224 0.480034 -0.000748 -0.050945 ... -0.206836 0.557642 0.376151 -0.176728 -0.283002
Feature21 -0.058053 0.309183 0.314857 0.097627 -0.118735 ... -0.029046 0.338266 0.315115 -0.017281 -0.259533
Feature13 -0.580208 0.368426 0.340048 0.557234 -0.558035 ... 0.234146 0.247412 0.559932 -0.456297 -0.248982
Feature11 -0.553998 0.374911 0.410215 0.928251 -0.897625 ... -0.375576 0.470418 0.379569 -0.406848 -0.677061
Feature2 -0.471431 0.437308 0.420028 1.000000 -0.988208 ... -0.373320 0.469587 0.332699 -0.267156 -0.640599
Feature4 -0.408524 0.512194 0.762520 0.651733 -0.701479 ... -0.180543 0.750906 0.481900 -0.507596 -0.452813
Feature17 -0.406211 0.790212 0.959055 0.460516 -0.540032 ... -0.030287 0.986516 0.787931 -0.687834 -0.347838
Feature8 -0.408669 0.803202 0.961982 0.463937 -0.545872 ... -0.037030 0.986828 0.785132 -0.691376 -0.341361
Feature26 -0.375078 0.816414 0.830065 0.332699 -0.411315 ... 0.229332 0.793896 1.000000 -0.682684 -0.198847
Feature25 -0.376921 0.840333 0.963762 0.469587 -0.556814 ... -0.054716 1.000000 0.793896 -0.656632 -0.336748
Feature1 -0.392978 0.852769 1.000000 0.420028 -0.517856 ... 0.118036 0.963762 0.830065 -0.678812 -0.214614
cvr -0.441421 1.000000 0.852769 0.437308 -0.534602 ... -0.018481 0.840333 0.816414 -0.703784 -0.183497
The table above lists the correlation matrix rows sorted from smallest to largest correlation with cvr.
Judging by Pearson correlation, seven features have an absolute correlation with cvr above 0.7: Feature1, Feature25, Feature26, Feature8, Feature17, Feature6, and Feature27. We can use feature_selection to corroborate this.
feature_selection
Calling sklearn's feature_selection utilities directly also recovers these 7 most important features. In practice, however, we should not stop at exactly these 7; it is reasonable to loosen the cut somewhat, say to 9, 10, or 11 features, and then carry the selected set forward for modeling.
selector = SelectKBest(f_regression, k=9)  # keep the 9 best features by univariate F-test
z = selector.fit_transform(X, y)
filter = selector.get_support()            # boolean mask of the selected columns
features = np.array(X.columns)             # all feature names
print("Selected 9 features: {}".format(features[filter]))  # the selected features
Output
Selected 9 features: ['Feature1' 'Feature3' 'Feature6' 'Feature8' 'Feature17' 'Feature19' 'Feature25' 'Feature26' 'Feature27']
In fact, the results at this point already tell part of the story: combined with the specific business context, we can propose which metrics should be strengthened and which should be suppressed.
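To make the choice of k less arbitrary, one option is to score a few candidate values with cross-validation. A minimal sketch, assuming X and y as defined above and using a plain ridge regression as the probe model:

from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

for k in (7, 9, 11, 13):  # compare candidate feature counts by cross-validated R^2
    pipe_k = Pipeline([('select', SelectKBest(f_regression, k=k)), ('reg', Ridge())])
    scores = cross_val_score(pipe_k, X, y, cv=5, scoring='r2')
    print('k={}: mean R^2 = {:.3f}'.format(k, scores.mean()))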
Prediction model
A data scientist should use historical data to give forward-looking guidance. The typical scenario is that certain variables are known to change over a coming period, for example a plan to increase online ad spend over the next two weeks, so we need to predict how the conversion rate responds to those changes, which calls for a prediction model. For well-quantified tabular data like this, common practice is to start with a simple model to gauge performance and then move to tuned ensemble methods; below we build the prediction model with GBDT as the example.
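Following that advice, a quick linear baseline sets a reference point before the GBDT pipeline below; a sketch only, reusing the standardized df and the nine selected features:

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

feats = ['Feature1', 'Feature3', 'Feature6', 'Feature8', 'Feature17', 'Feature19', 'Feature25', 'Feature26', 'Feature27']
Xb, yb = df[feats], df.cvr
Xb_tr, Xb_te, yb_tr, yb_te = train_test_split(Xb, yb, test_size=0.3, random_state=0)
lin = LinearRegression().fit(Xb_tr, yb_tr)  # simple baseline model
print('baseline MAE:', mean_absolute_error(yb_te, lin.predict(Xb_te)), 'baseline R^2:', lin.score(Xb_te, yb_te))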
X, y = df[['Feature1', 'Feature3', 'Feature6', 'Feature8', 'Feature17', 'Feature19', 'Feature25', 'Feature26', 'Feature27']], df.cvr  # features and target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, shuffle=True, random_state=0)  # train/test split
less_cat_col = [col_name for col_name in X_train.columns if X_train[col_name].dtype == 'object' and X_train[col_name].nunique() < 10]  # low-cardinality categorical columns
more_cat_col = [col_name for col_name in X_train.columns if X_train[col_name].dtype == 'object' and X_train[col_name].nunique() >= 10]  # high-cardinality categorical columns
num_col = [col_name for col_name in X_train.columns if X_train[col_name].dtype in ['int64', 'float64']]  # numeric columns
# print(less_cat_col, more_cat_col, num_col)
less_cat_transform = Pipeline(steps=[('imputer', SimpleImputer(strategy='most_frequent')), ('encoder', OneHotEncoder(handle_unknown='ignore'))])  # mode imputation, then one-hot encoding
more_cat_transform = Pipeline(steps=[('imputer', SimpleImputer(strategy='most_frequent')), ('encoder', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1))])  # mode imputation, then ordinal encoding
num_transform = Pipeline(steps=[('imputer', SimpleImputer(strategy='mean')), ('scaler', StandardScaler())])  # mean imputation, then standardization
preprocessor = ColumnTransformer(transformers=[('less_cat', less_cat_transform, less_cat_col), ('more_cat', more_cat_transform, more_cat_col), ('num', num_transform, num_col)])  # bundle the preprocessing steps
model = GradientBoostingRegressor(n_estimators=300, learning_rate=0.05, max_depth=7, min_samples_leaf=3, random_state=0)  # model initialization
pipe = Pipeline(steps=[('preprocessing', preprocessor), ('model', model)])
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)
MAE = mean_absolute_error(y_test, y_pred)  # mean absolute error
score = pipe.score(X_test, y_test)  # R^2 on the test set
# print(pipe.named_steps['preprocessing']._feature_names_in)
# print(pipe.named_steps['preprocessing'].get_feature_names)
# print(pipe.named_steps['model'].feature_importances_)
print("mean_absolute_error: {}, and model score: {}".format(MAE, score))
with open(r"D:\項(xiàng)目\acxiom的面試\模型測(cè)試 A\有工作經(jīng)驗(yàn)\conversation.pickle", "wb") as model_file:  # persist the fitted pipeline
    pickle.dump(pipe, model_file)
The model's two key evaluation metrics are:
mean_absolute_error: 0.4527472563951145, and model score: 0.6817745289721121
In fact, for a sample this small, a model score approaching 0.7 is already decent; the follow-up focus should be on accumulating data and building a pipeline for feature quantification.
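One natural next step (GridSearchCV is imported in the full script below but never used) is to tune the GBDT's hyperparameters inside the pipeline. A sketch, assuming pipe, X_train, y_train, X_test, y_test from above; the grid values are illustrative:

from sklearn.model_selection import GridSearchCV

param_grid = {'model__n_estimators': [100, 300, 500],
              'model__learning_rate': [0.01, 0.05, 0.1],
              'model__max_depth': [3, 5, 7]}  # keys target the 'model' step of the pipeline
search = GridSearchCV(pipe, param_grid, cv=5, scoring='neg_mean_absolute_error', n_jobs=-1)
search.fit(X_train, y_train)
print('best params:', search.best_params_)
print('test R^2:', search.best_estimator_.score(X_test, y_test))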
Open-ended business question (choose one of two)
Problem statement
Note: the questions are open-ended; you are free to propose assumptions and set the problem's boundaries.
- How would you tier all of a brand's consumers by value? How would you find the core high-value consumers? How would you discover high-potential consumers?
- In a major e-commerce promotion, how would you target coupons to different users? How would you test whether the strategy lifts the business?
A brief answer to the first question. First, define customer value and the dimensions along which it is assessed. Absent prior domain knowledge, the classic RFM model is a good starting point, covering the three most common dimensions of customer value: R (Recency, time since the last transaction), F (Frequency, transaction frequency), and M (Monetary, transaction amount). All three are computed over a time window, so agree on the span up front, e.g. one month, three months, half a year, one year, or three years; fixing the window keeps the scores comparable. The problem then becomes a clustering problem: segment customers into high-, medium-, and low-value groups, then examine each segment's commonalities and differences, which dimensions are weak and which are strong, and run targeted marketing that shores up the weak dimensions and reinforces the strong ones (a code sketch follows below).
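A minimal sketch of the RFM-plus-clustering idea, assuming a hypothetical transactions table transactions.csv with columns customer_id, order_date, and amount:

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

tx = pd.read_csv('transactions.csv', parse_dates=['order_date'])  # hypothetical transaction log
now = tx['order_date'].max()
rfm = tx.groupby('customer_id').agg(
    recency=('order_date', lambda d: (now - d.max()).days),  # R: days since last purchase
    frequency=('order_date', 'count'),                       # F: number of purchases
    monetary=('amount', 'sum'),                              # M: total spend
)
X_rfm = StandardScaler().fit_transform(rfm)  # put R/F/M on a common scale
rfm['segment'] = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_rfm)  # 3 value tiers
print(rfm.groupby('segment').mean())  # profile each segment to label high/mid/low value

Which cluster is "high value" is read off the profile: low recency, high frequency, high monetary.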
Full code
# -*- encoding: utf-8 -*-
'''
@Project : conversation
@Desc : improving the conversion rate
@Time : 2023/02/24 19:53:24
@Author : 帥帥de三叔,zengbowengood@163.com
'''
import pickle
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import SelectKBest,f_regression
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
conversation_data = pd.read_csv(r"D:\項(xiàng)目\acxiom的面試\模型測(cè)試 A\有工作經(jīng)驗(yàn)\conversation_data.csv", delimiter=',', index_col='grp_ID')
df = (conversation_data - conversation_data.mean()) / conversation_data.std()  # standardize
X, y = df.drop(labels=['cvr'], axis=1), df.cvr  # features and target
Corr = df.corr()  # correlation matrix
cor = Corr.sort_values(by=['cvr'], axis=0, ascending=True)  # rows sorted ascending by correlation with cvr
sns.heatmap(Corr, vmax=.8, square=True)  # heatmap
plt.show()
selector = SelectKBest(f_regression, k=9)  # select 9 features for modeling
z = selector.fit_transform(X, y)
filter = selector.get_support()
features = np.array(X.columns)  # all feature names
print("Selected 9 features: {}".format(features[filter]))  # the selected features
X, y = df[['Feature1', 'Feature3', 'Feature6', 'Feature8', 'Feature17', 'Feature19', 'Feature25', 'Feature26', 'Feature27']], df.cvr  # features and target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, shuffle=True, random_state=0)  # train/test split
less_cat_col = [col_name for col_name in X_train.columns if X_train[col_name].dtype == 'object' and X_train[col_name].nunique() < 10]  # low-cardinality categorical columns
more_cat_col = [col_name for col_name in X_train.columns if X_train[col_name].dtype == 'object' and X_train[col_name].nunique() >= 10]  # high-cardinality categorical columns
num_col = [col_name for col_name in X_train.columns if X_train[col_name].dtype in ['int64', 'float64']]  # numeric columns
# print(less_cat_col, more_cat_col, num_col)
less_cat_transform = Pipeline(steps=[('imputer', SimpleImputer(strategy='most_frequent')), ('encoder', OneHotEncoder(handle_unknown='ignore'))])  # mode imputation, then one-hot encoding
more_cat_transform = Pipeline(steps=[('imputer', SimpleImputer(strategy='most_frequent')), ('encoder', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1))])  # mode imputation, then ordinal encoding
num_transform = Pipeline(steps=[('imputer', SimpleImputer(strategy='mean')), ('scaler', StandardScaler())])  # mean imputation, then standardization
preprocessor = ColumnTransformer(transformers=[('less_cat', less_cat_transform, less_cat_col), ('more_cat', more_cat_transform, more_cat_col), ('num', num_transform, num_col)])  # bundle the preprocessing steps
model = GradientBoostingRegressor(n_estimators=300, learning_rate=0.05, max_depth=7, min_samples_leaf=3, random_state=0)  # model initialization
pipe = Pipeline(steps=[('preprocessing', preprocessor), ('model', model)])
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)
MAE = mean_absolute_error(y_test, y_pred)  # mean absolute error
score = pipe.score(X_test, y_test)  # R^2 on the test set
# print(pipe.named_steps['preprocessing']._feature_names_in)
# print(pipe.named_steps['preprocessing'].get_feature_names)
# print(pipe.named_steps['model'].feature_importances_)
print("mean_absolute_error: {}, and model score: {}".format(MAE, score))
with open(r"D:\項(xiàng)目\acxiom的面試\模型測(cè)試 A\有工作經(jīng)驗(yàn)\conversation.pickle", "wb") as model_file:  # persist the fitted pipeline
    pickle.dump(pipe, model_file)