Question
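For a course I'm currently taking, I'm trying to build a dummy dataset of transactions, customers, and products, to showcase machine learning use cases in a web-shop environment as well as a financial dashboard; unfortunately, we were not given any dummy data. I figured this would be a good way to improve my R knowledge, but I'm having serious difficulty implementing it.
The idea is that I specify some parameters/rules (arbitrary/fictitious, but suitable for demonstrating a clustering algorithm). I'm basically trying to hide a pattern and then recover it with machine learning (not part of this question). The pattern I'm hiding is based on the product adoption life cycle, to show how different customer types can be identified for targeted marketing purposes.
I'll demonstrate what I'm looking for. I'd like to keep it as realistic as possible. I tried to do so by drawing each customer's number of transactions and other characteristics from a normal distribution; I'm completely open to other ways of doing this.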
Here is how far I've come. First, build a customer table:
    # Define customer types and their respective probabilities
    CustomerTypes <- c("EarlyAdopter", "Pragmatists", "Conservatives", "Dealseekers")
    PropCustTypes <- c(.10, .45, .30, .15)  # Probability of being in each group.

    set.seed(1)  # Set seed to make reproducible
    Customers <- data.frame(
      ID = 1:10000,
      CustomerType = sample(CustomerTypes, size = 10000, replace = TRUE, prob = PropCustTypes),
      NumBought = rnorm(10000, 3, 2)  # Number of transactions to generate, open to alternative solutions?
    )
    Customers$NumBought[Customers$NumBought < 0] <- 0  # Floor NumBought at 0
Next, generate a table of products to choose from:
    Products <- data.frame(
      ID = 1:50,
      DateReleased = rep(as.Date("2012-12-12"), 50) + rnorm(50, 0, 8000),
      SuggestedPrice = rnorm(50, 50, 30))
    # Floor the product price at $10
    Products$SuggestedPrice[Products$SuggestedPrice < 10] <- 10
    # Move any release date earlier than 2013-04-10 (1 year ago) up to that date
    Products$DateReleased[Products$DateReleased < as.Date("2013-04-10")] <- as.Date("2013-04-10")
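Now I'd like to generate n transactions per customer (the number is in the customer table above), based on the following parameters for each of the variables that are currently relevant.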
    Parameters <- data.frame(
      CustomerType     = c("EarlyAdopter", "Pragmatists", "Conservatives", "Dealseeker"),
      BySearchEngine   = c(0.10, .40, 0.50, 0.6),   # Probability of coming through channel X
      ByDirectCustomer = c(0.60, .30, 0.15, 0.05),
      ByPartnerBlog    = c(0.30, .30, 0.35, 0.35),
      Timeliness       = c(1, 6, 12, 12),           # Average # of months between purchase & release date.
      Discount         = c(0, 0, 0.05, 0.10),       # Average discount incurred when purchasing.
      stringsAsFactors = FALSE)

    Parameters
       CustomerType BySearchEngine ByDirectCustomer ByPartnerBlog Timeliness Discount
    1  EarlyAdopter            0.1             0.60          0.30          1     0.00
    2   Pragmatists            0.4             0.30          0.30          6     0.00
    3 Conservatives            0.5             0.15          0.35         12     0.05
    4    Dealseeker            0.6             0.05          0.35         12     0.10
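The idea is that for "EarlyAdopters" (on average, normally distributed) 10% of transactions carry the label "BySearchEngine", 60% "ByDirectCustomer", and 30% "ByPartnerBlog". These values need to be mutually exclusive: in the final dataset a transaction cannot have been obtained through both PartnerBlog and SearchEngine. The options are:

    ObtainedBy <- c("SearchEngine", "DirectCustomer", "PartnerBlog")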
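In addition, I'd like to generate a normally distributed discount variable using the means above. For simplicity, the standard deviation could be mean/5.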
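One way to draw the mutually exclusive channel and the discount for a single customer type is sketched below. It is only a sketch: the helper `draw_channel_discount()` and its argument names are hypothetical, and truncating negative discount draws at zero is an assumption that the question does not specify.

```r
# Hypothetical helper: draws the channel and discount of n transactions
# for one customer type.
#   probs         named vector of channel probabilities (should sum to 1)
#   mean_discount average discount of the type; sd is mean/5 as suggested
draw_channel_discount <- function(n, probs, mean_discount) {
  # sample() picks exactly one channel per transaction, so the channels
  # are mutually exclusive by construction
  channel <- sample(names(probs), size = n, replace = TRUE, prob = probs)
  discount <- rnorm(n, mean = mean_discount, sd = mean_discount / 5)
  discount <- pmax(discount, 0)  # assumption: a discount cannot be negative
  data.frame(ReferredBy = channel, Discount = discount,
             stringsAsFactors = FALSE)
}

set.seed(1)
# Values taken from the "Conservatives" row of the Parameters table
probs <- c(SearchEngine = 0.50, DirectCustomer = 0.15, PartnerBlog = 0.35)
tx <- draw_channel_discount(1000, probs, mean_discount = 0.05)

# The observed shares should be close to probs
round(prop.table(table(tx$ReferredBy)), 2)
```

For a type whose average discount is 0, the standard deviation is also 0, so every draw is exactly 0.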
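Next comes the trickiest part for me: I'd like to generate these transactions according to a few rules:
- Spread fairly evenly across days, perhaps slightly more on weekends;
- Spread out between 2006 and 2014;
- Each customer's transactions spread out over the years;
- A customer cannot buy a product that has not been released yet.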
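The date rules above can be sketched as weighted sampling over a calendar, followed by a lookup of how many products exist on each sampled date. The 20% weekend uplift, the small `products` table, and the choice to put the first product on sale on the first day are all assumptions made for illustration.

```r
set.seed(1)

# All candidate days in the requested window (assumed to run to end of 2014)
all_days <- seq(as.Date("2006-01-01"), as.Date("2014-12-31"), by = "day")

# as.POSIXlt()$wday is 0 for Sunday and 6 for Saturday, independent of locale
is_weekend <- as.POSIXlt(all_days)$wday %in% c(0, 6)
day_weight <- ifelse(is_weekend, 1.2, 1.0)  # assumption: weekends 20% busier

n_tx <- 5000
purchase_date <- sample(all_days, size = n_tx, replace = TRUE, prob = day_weight)

# Hypothetical product table, sorted by release date
products <- data.frame(ID = 1:50,
                       DateReleased = sort(sample(all_days, 50)))
products$DateReleased[1] <- all_days[1]  # assumption: one product from day one

# Number of products already released on each purchase date
n_available <- findInterval(as.numeric(purchase_date),
                            as.numeric(products$DateReleased))

# Pick uniformly among the first n_available rows of the sorted table
product_row <- ceiling(runif(n_tx) * n_available)

transactions <- data.frame(DateOfPurchase = purchase_date,
                           ProductID = products$ID[product_row])
head(transactions)
```

Because `products` is sorted by release date, any row index up to `n_available` refers to a product that was already on sale on the purchase date.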
Other parameters:
    YearlyMax <- 1  # ? How would I specify this, a growing number would be even nicer?
    DailyMax <- 1   # Same question? Likely dependent on YearlyMax
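One possible reading of a growing `YearlyMax` is a cap that depends on the calendar year and is enforced after candidate transactions have been drawn. The sketch below assumes a hypothetical `yearly_max()` rule (one more allowed purchase every three years) and a made-up candidate table; neither comes from the question.

```r
set.seed(1)

# Hypothetical candidate transactions before any cap is applied
window <- seq(as.Date("2006-01-01"), as.Date("2014-12-31"), by = "day")
cand <- data.frame(CustomerID = sample(1:20, 300, replace = TRUE),
                   Date = sample(window, 300, replace = TRUE))
cand$Year <- as.integer(format(cand$Date, "%Y"))

# Assumption: the cap starts at 1 in 2006 and grows by 1 every three years
yearly_max <- function(year) 1 + (year - 2006) %/% 3

# Number each customer's transactions within a year in date order
cand <- cand[order(cand$CustomerID, cand$Date), ]
cand$nth <- ave(seq_len(nrow(cand)), cand$CustomerID, cand$Year,
                FUN = seq_along)

# Keep only transactions that fit under that year's cap
kept <- cand[cand$nth <= yearly_max(cand$Year), ]
nrow(kept)
```

A daily cap could be handled the same way by grouping on `CustomerID` and `Date` instead of `CustomerID` and `Year`.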
The result for CustomerID 2 would be:
    Transactions <- data.frame(
      ID = c(1, 2),
      CustomerID = c(2, 2),                              # The customer that bought the item.
      ProductID = c(51, 100),                            # Products chosen to approach the customer type's Timeliness average
      DateOfPurchase = c("2013-01-02", "2012-12-03"),    # Date chosen to mimic the Timeliness average
      ReferredBy = c("DirectCustomer", "SearchEngine"),  # See above, follows the proportions previously identified.
      GrossPrice = c(50, 52.99),                         # Based on product price, no real restrictions other than using it for my financial dashboard.
      Discount = c(0.02, 0.0))                           # Chosen to mimic the customer type's discount behavior.

    Transactions
      ID CustomerID ProductID DateOfPurchase     ReferredBy GrossPrice Discount
    1  1          2        51     2013-01-02 DirectCustomer      50.00     0.02
    2  2          2       100     2012-12-03   SearchEngine      52.99     0.00
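I'm getting more confident writing R code, but I'm having difficulty writing code that respects the global parameters (the daily distribution of transactions, a yearly maximum number of transactions per customer) while keeping the various linkages lined up:
- Timeliness: how quickly people buy after release
- ReferredBy: how did this customer reach my website?
- Discount: how much discount did the customer get (indicating how sensitive the customer is to discounts)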
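For the Timeliness linkage, one option is to draw the lag between release and purchase from a distribution that cannot go negative, so a purchase never precedes the release. The sketch below assumes an exponential lag whose mean equals the type's Timeliness; the function name `draw_purchase_dates()` is hypothetical, and the result is not yet clipped to the 2006 to 2014 window.

```r
set.seed(1)

# Timeliness (mean months between release and purchase) per customer type,
# copied from the Parameters table
timeliness_months <- c(EarlyAdopter = 1, Pragmatists = 6,
                       Conservatives = 12, Dealseeker = 12)

# Hypothetical helper. Assumption: the lag is exponential, so it is never
# negative and its mean equals mean_months.
draw_purchase_dates <- function(n, release_date, mean_months) {
  days_per_month <- 30.44  # average length of a month in days
  lag_days <- rexp(n, rate = 1 / (mean_months * days_per_month))
  release_date + round(lag_days)
}

release_date <- as.Date("2013-04-10")
early <- draw_purchase_dates(2000, release_date,
                             timeliness_months[["EarlyAdopter"]])
late <- draw_purchase_dates(2000, release_date,
                            timeliness_months[["Conservatives"]])

# Average lag in months: roughly 1 for early adopters, 12 for conservatives
mean(as.numeric(early - release_date)) / 30.44
mean(as.numeric(late - release_date)) / 30.44
```

A clustering algorithm could later pick up on exactly this difference in lag between customer types.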
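This leaves me unsure whether I should write a for loop over the customer table, generating transactions for each customer, or whether I should take a different route. Any contribution is greatly appreciated. Alternative dummy datasets are welcome too, although I'm eager to solve this in R. I'll keep this post updated as I progress.
My current pseudocode:
- Assign customers to customer types with sample()
- Generate Customers$NumBought transactions
- ... still thinking?
EDIT: I've generated the transaction table; now I "just" need to fill it with the right data: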
    Tr <- data.frame(
      ID = 1:sum(Customers$NumBought),
      CustomerID = NA,
      DateOfPurchase = NA,
      ReferredBy = NA,
      GrossPrice = NA,
      Discount = NA)
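A loop-free way to fill the `CustomerID` column is to repeat each customer ID as many times as that customer buys. The sketch below uses a small hypothetical `customers` table and rounds `NumBought` to whole numbers, which is an assumption made so the value can be used as a repeat count.

```r
set.seed(1)

# Hypothetical small customer table
types <- c("EarlyAdopter", "Pragmatists", "Conservatives", "Dealseekers")
customers <- data.frame(
  ID = 1:100,
  CustomerType = sample(types, 100, replace = TRUE,
                        prob = c(.10, .45, .30, .15)),
  stringsAsFactors = FALSE)

# Assumption: round to whole transactions and floor at 0
customers$NumBought <- pmax(round(rnorm(100, 3, 2)), 0)

# One row per transaction: repeat each customer ID NumBought times
tr <- data.frame(CustomerID = rep(customers$ID, times = customers$NumBought))
tr$ID <- seq_len(nrow(tr))

# Carry the customer type along so type-specific parameters
# can be looked up per row without a loop
tr$CustomerType <- customers$CustomerType[match(tr$CustomerID, customers$ID)]
head(tr)
```

With the type available on every row, the remaining columns can be drawn in vectorised calls per customer type rather than in a for loop over customers.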
Recommended answer
Very roughly: set up a database of days, with the number of visits on each day:
    # customerRate is the expected number of visits per day
    days <- data.frame(day = 1:8000, customerRate = XtotalNumberOfVisits / 8000)
    # you could change the customerRate to reflect promotions, time since launch, ...
    days$nVisits <- rpois(8000, days$customerRate)
Then catalogue the visits:
    # One row per visit, tagged with the day it happened on
    visits <- data.frame(id = seq_len(sum(days$nVisits)),
                         day = rep(days$day, times = days$nVisits))
    visits$customerType <- sample(4, nrow(visits), replace = TRUE, prob = XmyWeights)
    visits$nPurchases <- rpois(nrow(visits), XpurchaseRate[visits$customerType])
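To make the two steps above concrete, here is a self-contained sketch with made-up values standing in for the `X` parameters. The numbers for `total_visits`, `type_weights`, and `purchase_rate`, and the rule that every 6th and 7th day is a 20% busier weekend, are assumptions chosen only so the code runs.

```r
set.seed(1)

n_days <- 8000
total_visits <- 40000                    # stands in for XtotalNumberOfVisits
type_weights <- c(.10, .45, .30, .15)    # stands in for XmyWeights
purchase_rate <- c(2, 1, 0.5, 0.8)       # stands in for XpurchaseRate

days <- data.frame(day = 1:n_days)

# Assumption: every 6th and 7th day is a weekend with 20% more traffic
weekend <- (days$day %% 7) %in% c(6, 0)
w <- ifelse(weekend, 1.2, 1)
days$customerRate <- total_visits * w / sum(w)  # rates add up to total_visits
days$nVisits <- rpois(n_days, days$customerRate)

# One row per visit
visits <- data.frame(id = seq_len(sum(days$nVisits)),
                     day = rep(days$day, times = days$nVisits))
visits$customerType <- sample(4, nrow(visits), replace = TRUE,
                              prob = type_weights)
visits$nPurchases <- rpois(nrow(visits), purchase_rate[visits$customerType])

# Average purchases per visit by customer type, close to purchase_rate
round(tapply(visits$nPurchases, visits$customerType, mean), 2)
```

Scaling the weights so the daily rates sum to `total_visits` keeps the overall volume fixed while shifting traffic toward weekends.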
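Any variable prefixed with X is a parameter of the process. Depending on the other columns you have, you would continue in the same way to generate the transaction database, by parameterizing the relative likelihoods among the available objects. Alternatively, you could generate a visits database that includes a key for every product available on that day: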
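    productRelease <- data.frame(id = X, releaseDay = sort(X))  # i.e. df is sorted by releaseDay
    visits <- data.frame(id = seq_len(sum(days$nVisits)),
                         day = rep(days$day, times = days$nVisits))
    visits$customerType <- sample(4, nrow(visits), replace = TRUE, prob = XmyWeights)
    # Number of products released so far on each day
    # (assumes the first product is released on day 1, so the lengths match)
    days$productsAvailable <- rep(1:nrow(productRelease),
                                  times = diff(c(productRelease$releaseDay, nrow(days) + 1)))
    # Repeat each visit once per product available on that day
    visits <- visits[rep(seq_len(nrow(visits)), times = days$productsAvailable[visits$day]), ]
    # Within each visit, number the products 1..n (row index into productRelease)
    visits$prodID <- with(visits, ave(rep(1L, length(id)), id, FUN = cumsum))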
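Then you can decide on a function that gives, for each row, the probability that the customer buys that item (based on the date, customer, and product). Then fill in the purchases via `visits$didTheyPurchase <- runif(nrow(visits)) < XmyProbability`.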
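As an illustration of such a probability function, the sketch below lets interest decay with product age, faster for some customer types than for others. The function `purchase_prob()`, the `half_life_days` values, and `base_prob` are all hypothetical; the visit table is generated only so the example is self-contained.

```r
set.seed(1)

# Hypothetical visit-by-product rows: one row per product seen during a visit
n <- 2000
visits <- data.frame(day = sample(1:8000, n, replace = TRUE),
                     customerType = sample(4, n, replace = TRUE))
# Release day is on or before the visit day, and never before day 1
visits$releaseDay <- pmax(1, visits$day - rpois(n, 200))

# Assumption: purchase probability halves every half_life_days of product age
half_life_days <- c(30, 180, 365, 365)  # one value per customer type
base_prob <- 0.3                        # probability on the release day

purchase_prob <- function(day, releaseDay, customerType) {
  age <- day - releaseDay
  base_prob * 0.5 ^ (age / half_life_days[customerType])
}

visits$prob <- with(visits, purchase_prob(day, releaseDay, customerType))

# Bernoulli draw per row: TRUE with probability visits$prob
visits$didTheyPurchase <- runif(nrow(visits)) < visits$prob

# Share of rows that ended in a purchase, by customer type
round(tapply(visits$didTheyPurchase, visits$customerType, mean), 3)
```

Comparing `runif()` against a per-row probability is the same device as in the answer, with `visits$prob` playing the role of `XmyProbability`.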
Sorry, there may be typos since I typed this directly, but I hope this gives you an idea.