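Exploring Used-Car Sales Data Visually in R: Preprocessing, Smoothed Density Plots, and Geospatial Visualization (with Self-Test Questions)

2022-07-24 11:17 Author: 拓端tecdat

Full text: http://tecdat.cn/?p=27546

Original source: the 拓端數據部落 (Tecdat) WeChat official account

This article works with car-sales listings collected by a web crawler. The collection was later extended into a data set covering all used-vehicle and dealer listings in the country. The data are scraped again every few months and contain most of the relevant information provided about each sale, including price, condition, manufacturer, latitude/longitude, and 18 other columns. For machine learning (ML) projects, consider feature engineering on the location columns (e.g. long/lat).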

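Question #1 How many observations are there in the data set?

# We can get the number of observations by counting the rows.
nrow(vposts)
## [1] 34677
# Alternatively, we can take the dimensions of the data set and read off the number of rows (observations).
dim(vposts)
## [1] 34677    27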

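Question #2 What are the names of the variables? What is the class of each variable?

# Class of every column, as a named vector
unlist( sapply(X = vposts, FUN = class) )
print.table( sapply(X = vposts, FUN = class) )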

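Question #3 What is the average price of all the vehicles? The median price? The deciles? Display these on a plot of the distribution of vehicle prices.

Let's start with some exploration of the data for this question.

library(lattice)
densityplot(vposts$price, main = "Price", xlab = "Price")

# Prices such as 9999999, 30002500, 600030000 and 600030000 are clearly suspicious.
# Let's look at every car priced at 100,000 or more.
idx = which( vposts$price >= 100000 & !is.na(vposts$price) )
length( idx )

# Order those records by price and show their header and price.
idx = idx[ order(vposts[ idx, "price" ]) ]
vposts[ idx, c("header", "price") ]

Some of these cars really are very expensive, for example the Mercedes-Benz G63 AMG, Bentley Mulsanne, Maserati 3500 GT, and Porsche GT.

The two records priced at 600030000 are 1968 and 1969 Pontiac GTOs. Reading the post bodies shows that these are offers for custom-built GTOs priced between $6,000 and $30,000.

The 2001 Honda Accord priced at 9999999 looks like a deliberately misleading price, since the poster also failed to fill in several other fields.

Many cars are also listed for $1. Posting the lowest possible price is a common advertising tactic: most people sort listings from lowest to highest price, so these ads appear near the top more often. Most of them are misleading dealer ads, some are for car parts, and some are financing offers. There are too many of them to clean by hand, so we exclude them.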

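# Postings priced at exactly 1, then a random sample of 60 of them
idx = which( vposts$price == 1 & !is.na(vposts$price) )
idx = sample(x = idx, size = 60, replace = FALSE)
densityplot(vposts$price[ idx ])

# 5th and 99th percentiles of all prices
quantile(x = vposts$price, probs = c(0.05, 0.99), na.rm = TRUE)

# Deciles and density of the prices selected by idx
dec = quantile(x = vposts$price[ idx ], probs = seq(from = 0.1, to = 0.9, by = 0.1) )
plot( density(vposts$price[ idx ]) )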
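One way to show the mean, median, and deciles on the price distribution is sketched below. This is a minimal illustration rather than the article's own code: it runs on simulated prices (the `price` vector is made up), and the trimming bounds of 500 and 100,000 are assumptions borrowed from the filters used in later questions.

```r
# Simulated stand-in for vposts$price (assumption: not the real data).
set.seed(1)
price = c(round(rlnorm(1000, meanlog = 9, sdlog = 0.8)), 1, 1, 9999999, 600030000, NA)

# Keep plausible prices only; the 500 and 100000 bounds are assumptions.
ok = !is.na(price) & price >= 500 & price <= 100000
p = price[ok]

avg = mean(p)
med = median(p)
dec = quantile(p, probs = seq(from = 0.1, to = 0.9, by = 0.1))

# Density of the trimmed prices with the summary statistics overlaid.
plot(density(p), main = "Price", xlab = "Price")
abline(v = dec, col = "grey", lty = 3)   # deciles
abline(v = avg, col = "red", lwd = 2)    # mean
abline(v = med, col = "blue", lwd = 2)   # median
legend("topright", legend = c("mean", "median", "deciles"),
       col = c("red", "blue", "grey"), lty = c(1, 1, 3), lwd = c(2, 2, 1))
```

Trimming before summarizing matters here because a handful of junk prices in the hundreds of millions would otherwise dominate the mean.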

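Question #4 What are the different categories of vehicles, i.e. the type variable/column? What is the proportion of each category?

names( table(vposts$type, useNA = "ifany") )

# Proportion of each type, counting missing values as their own category
t = sort( round( x = prop.table( x = table(vposts$type, useNA = "ifany") ), digits = 4) )
t

dotplot(x = sort(t), xlim = c(-0.05, 1.05), cex = 1.5)

# Proportions among the records where the type is known
t = prop.table( x = table(vposts$type[ !is.na(vposts$type) ], useNA = "ifany") )
dotplot(x = sort(t), xlim = c(-0.05, 1.05), cex = 1.5)

Nearly half of the records are missing the vehicle type.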
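The difference between proportions that count missing values and proportions among known types can be seen on a small example. The sketch below uses a made-up `type` vector (an assumption, not the real column) and the lattice `dotplot()` method for tables.

```r
library(lattice)

# Toy stand-in for vposts$type (assumption: made-up values).
type = factor(c("sedan", "SUV", "sedan", NA, "truck", NA, "sedan", NA, "SUV", NA))

# Proportions with missing values counted as their own category ...
withNA = prop.table(table(type, useNA = "ifany"))
# ... and proportions among the records where the type is known.
known = prop.table(table(type[!is.na(type)]))

round(sort(withNA), digits = 4)
dotplot(sort(known), xlim = c(-0.05, 1.05), cex = 1.5, xlab = "Proportion")
```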

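Question #5 Display the relationship between fuel type and vehicle type. Does this depend on transmission type?

The overall mosaic plot below, split by transmission type, shows that gas vehicles dominate across every vehicle type and transmission type. It is worth noting, however, that trucks have a higher share of diesel than other vehicle types, as do buses with automatic transmissions.

These same relationships may be easier to see in a dot plot than in the mosaic plot.

tbl = tbl[ row.order, col.order ]
mosaicplot(tbl ...

dotplot(
prop.table( ...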
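Since the plotting calls above are cut off, here is a self-contained sketch of the same idea on simulated data. All column values are made up, and the share of diesel among trucks is deliberately inflated so that the plots have something to show; none of this reproduces the article's actual table.

```r
library(lattice)

# Simulated stand-ins for the fuel, type and transmission columns (assumption).
set.seed(2)
n = 500
type = sample(c("sedan", "SUV", "truck", "bus"), n, replace = TRUE)
transmission = sample(c("automatic", "manual"), n, replace = TRUE, prob = c(0.8, 0.2))
# Make diesel more likely for trucks so the plot has something to show.
fuel = ifelse(type == "truck" & runif(n) < 0.4, "diesel", "gas")

tbl = table(type, fuel, transmission)

# Share of each fuel within every type/transmission combination.
share = prop.table(tbl, margin = c(1, 3))

mosaicplot(tbl, main = "Fuel by vehicle type and transmission", color = TRUE)
dotplot(share, auto.key = list(columns = 2), xlab = "Proportion")
```

Passing `margin = c(1, 3)` normalizes within each type/transmission pair, which is what makes the fuel shares comparable across panels.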

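Question #6 How many different cities are represented in the data set?

length( levels(vposts$city) )

Question #7 Visually display how the number/proportion of “for sale by owner” and “for sale by dealer” varies across cities.

It is somewhat suspicious that every city has about 5,000 observations and that the split within each city is almost a perfect 50/50.

Note that we also created a new variable, ownerDealer, in the setup section.

table(vposts$city)

# Owner/dealer split within each city (columns sum to 1)
prop.table(table(vposts$ownerDealer, vposts$city), margin = 2)

plot( table(vposts$city, vposts$ownerDealer, u ...

plot( prop.table(table(vposts$city ...

The bar charts convey the key information. Since we are interested in the share of for-sale-by-owner posts within each city, a dot plot is probably the best way to see it.

The tables and plots show very clearly that the split between owner posts and dealer posts is almost a perfect 50/50 and does not appear to vary across cities at all.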
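The setup section that creates `ownerDealer` is not shown, so the sketch below is only one plausible construction, assuming a logical `byOwner` column like the one used later in the article. The data are simulated.

```r
# Toy stand-ins for vposts$city and vposts$byOwner (assumption: made-up data).
set.seed(3)
city = factor(sample(c("boston", "chicago", "denver"), 300, replace = TRUE))
byOwner = sample(c(TRUE, FALSE), 300, replace = TRUE)

# One plausible way to build ownerDealer from byOwner (assumption: the
# article's real definition may differ).
ownerDealer = factor(ifelse(byOwner, "owner", "dealer"), levels = c("owner", "dealer"))

counts = table(city, ownerDealer)
# margin = 1 gives the owner/dealer split within each city (rows sum to 1).
shares = prop.table(counts, margin = 1)

barplot(t(shares), beside = TRUE, legend.text = TRUE,
        ylab = "Proportion of posts", main = "Owner vs. dealer posts by city")
```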

Question #8 What is the largest price for a vehicle in this data set? Examine it and fix the value. Now examine the new highest value for price.

We saw in Question 3 above that the price data have many problems.

# Let's use a rounded average price.
newPrice = round( mean(vposts$price[idx]), digits = -3)
# Let's see whether we can find a suitable point estimate from the data set itself.
idx = ( vposts$maker == "cadillac" & vposts$year %in% c(2002) &
        vposts$price < 999999 & vposts$price > ... ignore.case = TRUE) & ...

# The average price estimate is below the posted prices of $2,500 or $3,000, so use the lower $2,500.
round( mean(vposts$price[idx]), digits = -3)

There is still more that needs fixing.
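The repair strategy described above, replacing an implausible price with a rounded average of comparable listings, can be sketched on a toy data frame. The rows and the choice of "same maker and year" as the definition of comparable are assumptions made for illustration.

```r
# Toy data frame standing in for vposts (assumption: made-up rows).
posts = data.frame(
  maker = c("honda", "honda", "honda", "honda", "ford"),
  year  = c(2001, 2001, 2001, 2001, 2010),
  price = c(9999999, 2400, 3100, 2800, 15000)
)

worst = which.max(posts$price)

# Comparable listings: same maker and year, excluding the suspicious row.
comparable = posts$maker == posts$maker[worst] &
             posts$year  == posts$year[worst] &
             seq_len(nrow(posts)) != worst & !is.na(posts$price)

# Replace the implausible price with the comparables' mean, rounded to the nearest 1000.
posts$price[worst] = round(mean(posts$price[comparable]), digits = -3)

# The new maximum should now be a believable price.
max(posts$price)
```

After each fix it is worth looking at the new maximum again, since the next-largest value may be just as suspicious.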

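Question #9 What are the three most common makes of cars in each city for “sale by owner” and for “sale by dealer”? Are they similar or quite different?

cities = levels(vposts$city)
# The inner function could be written in one line, but that is hard to read, so we break it into steps.
#      names( head( sort( table(vposts$maker[ vposts$city == x & vposts$byOwner == y ]), decreasing = TRUE), 3) )
makeByCityByOwner = lapply(X = c(TRUE, FALSE), FUN = function(y){
    sapply(X = cities, FUN = function(x){
        # Count makers for this city and seller type, then keep the top three names
        counts = table(vposts$maker[ vposts$city == x & vposts$byOwner == y ])
        names( head( sort(counts, decreasing = TRUE), 3) )
    })
})
names(makeByCityByOwner) = c("Owner", "Dealer")
makeByCityByOwner

# Check whether the top make in each city matches between owners and dealers
makeByCityByOwner$Owner[ 1, ] == makeByCityByOwner$Dealer[ 1, ]

Within each city, two of the top three makes sold by owners are also among the top three sold by dealers. The top-ranked make for owners matches the top-ranked make for dealers in every city except SacTown.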

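Question #10 Visually compare the distribution of the age of cars for different cities and for “sale by owner” and “sale by dealer”. Provide an interpretation of the plots, i.e., what are the key conclusions and insights?

The 2022 Honda Odyssey has “only 117102 miles”, so the year is probably a typo for 2002; let's fix it that way.

The Jeep with year = 4 is probably a 2004, since it has an “AM/FM cassette player-multi CD player”.

vposts$year[ vposts$year == 2022 & !is.na(vposts$year) ] = 2002
# Drop the record with year 1900; logical negation is safe even if nothing matches
vposts = vposts[ !(vposts$year == 1900 & !is.na(vposts$year)), ]
vposts$year[ vposts$year == 4 & !is.na(vposts$year) ] = 2004
vposts$age = 2016 - vposts$year
histogram( ~ age | byOwner, data = vposts)

# Restrict to cars less than 25 years old
idx = ( vposts$age < 25 & !is.na(vposts$age) )
histogram( ~ age | byOwner, data = vposts[ idx, ])

# By city, owners and dealers do not differ much at all across cities.
histogram( ~ age | byOwner + city, data = vposts[ idx, ])

Cars sold by owners tend to be older than cars sold by dealers. However, this appears to vary by city.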
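The year repairs and the age histogram can be tried end to end on simulated data. In the sketch below the data frame, the `seller` labels, and the three planted bad years are all made up; only the reference year 2016 and the three repairs follow the article.

```r
library(lattice)

# Toy stand-in for vposts (assumption: simulated data with three bad years planted).
set.seed(5)
n = 400
posts = data.frame(
  year    = c(sample(1990:2015, n - 3, replace = TRUE), 2022, 4, 1900),
  byOwner = sample(c(TRUE, FALSE), n, replace = TRUE),
  city    = factor(sample(c("boston", "chicago"), n, replace = TRUE))
)

# Repair the two obvious typos.
posts$year[posts$year == 2022 & !is.na(posts$year)] = 2002
posts$year[posts$year == 4 & !is.na(posts$year)] = 2004
# Logical negation is safe even when nothing matches; x[-which(cond), ] would
# drop every row in that case.
posts = posts[!(posts$year == 1900 & !is.na(posts$year)), ]

# Age relative to 2016, the reference year used in the article's code.
posts$age = 2016 - posts$year
posts$seller = factor(ifelse(posts$byOwner, "owner", "dealer"))

histogram(~ age | seller + city, data = posts)
```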

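Question #11 Plot the locations of the posts on a map. What do you notice?

We can conclude that the locations of the people selling used cars (and/or of the cars themselves) tend to cluster fairly tightly around the major cities.

There are several possible explanations for locations far from any of the major cities. For example, the posters may have been traveling when they placed the ad. In general, though, the poster's location is in the same city where they are trying to sell the vehicle.

The plot can be improved further by using the alpha parameter to control the transparency of the plotted points, which gives a better view of the density and of the spillover into other areas.

library(maps)
map('state', mar = c(0,0,0,0))
invisible(
lapply( 1:le ... col.palette[x] )
}
)
)
legend("bott ... pch = 15, cex = 0.9)

points(x = locationByCity[[ "sac" ]]$l ... col.palette[1] )
points(x = locationByCity[[ "sfbay" ]]$l ... col.palette[5] )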
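The effect of transparency can be sketched without the map background. The example below is an illustration under stated assumptions: the coordinates are simulated, the `locationByCity` and `col.palette` objects are rebuilt from scratch rather than taken from the article, the alpha value of 0.2 is arbitrary, and a blank base plot replaces the state map so that no extra package is needed.

```r
# Simulated longitude/latitude for two cities (assumption: made-up points).
set.seed(6)
locationByCity = list(
  sac   = data.frame(long = rnorm(500, -121.5, 0.4), lat = rnorm(500, 38.6, 0.3)),
  sfbay = data.frame(long = rnorm(500, -122.3, 0.4), lat = rnorm(500, 37.8, 0.3))
)

# adjustcolor() adds an alpha channel; 0.2 here is an arbitrary choice.
col.palette = c(sac   = adjustcolor("red",  alpha.f = 0.2),
                sfbay = adjustcolor("blue", alpha.f = 0.2))

# Empty plotting region, then one semi-transparent layer of points per city.
plot(NULL, xlim = c(-124, -120), ylim = c(36.5, 40),
     xlab = "Longitude", ylab = "Latitude")
for (city in names(locationByCity)) {
  points(x = locationByCity[[city]]$long, y = locationByCity[[city]]$lat,
         pch = 15, cex = 0.9, col = col.palette[city])
}
legend("bottomleft", legend = names(locationByCity),
       col = c("red", "blue"), pch = 15, cex = 0.9)
```

With semi-transparent points, dense areas appear darker, and the overlap between neighboring cities shows up as a blend of the two colors.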

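Question #12 Summarize the distribution of fuel type, drive, transmission, and vehicle type.

Note that in the dot plot below the distributions are nearly identical across panels, except for the middle column, where fuel type = "gas", which shows some variation. We can therefore essentially drop fuel type from the plot and subset to fuel type = "gas" to consider the relationships among the remaining three variables.

dotplot( table(vposts$drive, vposts$transmission, vposts$type) )
dotplot( table(vposts$type, vposts$fuel, vposts$drive, vposts$transmission),
         auto.key = list(co ... ) )

# For almost all of the data, fuel = gas.
table(vposts$fuel, useNA = "ifany")

We see that, with fuel == "gas", automatic transmission is the most common across all types of cars, followed by manual. Among rear-wheel-drive vehicles, the share of manuals is indeed higher for coupes and convertibles than for other types, which makes sense because coupes and convertibles tend to be sports cars. Among four-wheel-drive vehicles, the share of off-road vehicles is higher.

dotplot(
prop.table( table(vposts$type ...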
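The subsetting step described above can be sketched as follows. The data are simulated and the level names are assumptions; the point is only to show restricting to gas vehicles and then normalizing the transmission split within each type/drive combination.

```r
library(lattice)

# Simulated stand-in for vposts (assumption: made-up data).
set.seed(7)
n = 1000
posts = data.frame(
  fuel  = sample(c("gas", "diesel", "hybrid"), n, replace = TRUE, prob = c(0.9, 0.07, 0.03)),
  drive = sample(c("fwd", "rwd", "4wd"), n, replace = TRUE),
  transmission = sample(c("automatic", "manual"), n, replace = TRUE, prob = c(0.85, 0.15)),
  type  = sample(c("sedan", "coupe", "SUV", "truck"), n, replace = TRUE)
)

# Gas dominates, so restrict to it and study the other three variables.
gas = subset(posts, fuel == "gas")
tbl = table(gas$type, gas$drive, gas$transmission,
            dnn = c("type", "drive", "transmission"))

# Transmission split within each type/drive combination.
share = prop.table(tbl, margin = c(1, 2))
dotplot(share, auto.key = list(columns = 2), xlab = "Proportion")
```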

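Question #13 Is there a relationship between odometer reading and the age of the car? Between odometer reading and price? Interpret the results.

We should spend some time cleaning the odometer readings. For example, the maximum odometer reading, 1234567890, comes from a post that is just an advertisement. For simplicity, though, we note that the 99th percentile of the odometer readings is 2.6 × 10^5, so we trim the data at 500,000, which retains almost the entire distribution.

The vast majority of the data do appear to trend upward. Note, however, the dense shading between roughly 5 and 20 years of age where the odometer readings are low.

quantile(vposts$odometer, probs = 0.99, na.rm = TRUE)

As the smoothed scatter plot below shows, there is a generally negative relationship between odometer reading and price. Note, however, that some very expensive cars have low odometer readings, and many of these are antique cars.

idx = ( vposts$odometer < 500000 & vposts$price >= 500 & vposts$price <= 100000 &
        !is.na(vposts$odometer) & !is.na(vposts$price) )
smoothScatter(x = vposts$odometer[idx], y = vposts$price[idx])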
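The trimming and the smoothed scatter plot can be reproduced on simulated data. In this sketch the relationship between age, odometer, and price is invented, one junk odometer value is planted on purpose, and a Spearman correlation is added as a one-number summary that the article itself does not compute.

```r
# Simulated stand-in for vposts (assumption: made-up data with one junk odometer value).
set.seed(8)
n = 2000
age = sample(0:30, n, replace = TRUE)
odometer = pmax(0, round(12000 * age + rnorm(n, 0, 20000)))
price = pmax(500, round(30000 * exp(-odometer / 120000) + rnorm(n, 0, 1500)))
odometer[1] = 1234567890

# The 99th percentile is robust to a few junk values, unlike the maximum.
quantile(odometer, probs = 0.99, na.rm = TRUE)

# Same filter as the article: odometer < 500,000 and price in [500, 100,000].
idx = odometer < 500000 & price >= 500 & price <= 100000 &
      !is.na(odometer) & !is.na(price)

smoothScatter(x = odometer[idx], y = price[idx], xlab = "Odometer", ylab = "Price")

# A single-number summary of the direction of the relationship.
cor(odometer[idx], price[idx], method = "spearman")
```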

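Question #14 Identify the “old” (classic) cars. What manufacturers made these? What is the price distribution for these?

Based on the first smoothScatter plot below, we treat cars more than 35 years old as “classic” cars.

The table below shows that Chevrolet and Ford account for more than 50% of the “classic” cars. In particular, there are few Japanese “classic” cars, because the United States did not begin importing Japanese cars on a large scale until the oil crises of the 1970s, and our cutoff for “classic” cars is around 1970.

Comparing the price distribution of “classic” cars with that of all cars, the “classic” cars appear to have more density at higher prices: most are priced below $40,000, whereas most of the overall data fall below $20,000.

idx = ( vposts$price >= 500 & vposts$price <= 100000 &
        !is.na(vposts$price) & !is.na(vposts$age) )
smoothScatter(x = vposts$age[idx], y = vposts$price[idx])

# Look at the makers and the price distribution of the classic cars
idx = ( vposts$age >= 35 & !is.na(vposts$age) )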
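What follows the `idx` definition is not shown in the source, so the sketch below is a guess at the kind of summary that answers the question: a table of makers among the classic cars, with cumulative shares, and an overlay of the two price densities. The data are simulated and the maker names are placeholders.

```r
# Simulated stand-in for vposts (assumption: made-up data).
set.seed(9)
n = 1500
posts = data.frame(
  age   = sample(0:60, n, replace = TRUE),
  maker = sample(c("chevrolet", "ford", "toyota", "honda"), n, replace = TRUE),
  price = round(runif(n, 500, 60000))
)

# "Classic" cars: at least 35 years old, the cutoff used in the article.
idx = posts$age >= 35 & !is.na(posts$age)

# Makers of classic cars, most common first, with cumulative share.
makers = sort(table(posts$maker[idx]), decreasing = TRUE)
cumsum(prop.table(makers))

# Compare price distributions: classic cars vs. all cars.
plot(density(posts$price[idx]), main = "Price: classic vs. all", xlab = "Price")
lines(density(posts$price), lty = 2)
legend("topright", legend = c("classic", "all"), lty = c(1, 2))
```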

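Question #15 I have omitted one important variable in this data set. What do you think it is? Can we derive this from the other variables?

When searching for a car on a website, the usual criteria are year, make, and model, in that order. Year and make (i.e. maker) are already separate variables in the data set. Note, however, that the variable called header in the data set contains the year, make, and model. So if we can parse the text string of each header to extract the model, we can derive our own separate variable for model.

head(vposts$header, 20)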
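A first attempt at pulling the model out of the header might look like the sketch below. It assumes headers follow a "year make model ..." layout and that the maker is already known for each row; the example headers are made up, `extractModel` is a hypothetical helper, and real headers will need more than one pattern.

```r
# Made-up headers in a "year make model ..." layout (assumption: real headers are messier).
header = c("2004 honda accord ex", "1969 Pontiac GTO", "2010 toyota camry le sedan", "nice truck")
maker  = c("honda", "pontiac", "toyota", NA)

# Hypothetical helper: return the word that follows "<4-digit year> <maker>",
# or NA when the maker is unknown or the pattern is absent.
extractModel = function(h, m) {
  if (is.na(m)) return(NA_character_)
  p = sprintf("^[0-9]{4}[[:space:]]+%s[[:space:]]+([[:alnum:]-]+).*$", m)
  if (!grepl(p, h, ignore.case = TRUE)) return(NA_character_)
  tolower(sub(p, "\\1", h, ignore.case = TRUE))
}

model = mapply(extractModel, header, maker, USE.NAMES = FALSE)
model
```

Tabulating the extracted values by maker is a quick way to spot misspellings and abbreviations that need a second pass.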

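Question #16 Display how condition and odometer are related. Also how condition and price are related. And condition and age of the car. Provide a brief interpretation of what you find.

# Print the existing condition labels, one quoted label per line
conditions = levels(vposts$condition)
conditions = sprintf('"%s",\n', conditions)
cat(conditions)

# We will build the new categories on the most common existing ones.
sort(table(vposts$condition))

vposts$ ...
sane_odo = subset( ...
boxplot(odo ... = "Miles")

# Make a second plot to show the distributions better.
boxplot( ... "Miles")

# Now we can see that the highest odometer readings appear to be in the "fair" and "good" conditions, which is a little surprising. It is possible that people overstate the condition when the odometer is high, to make the car sound more attractive. The most spread-out distribution is for "salvage", which makes sense because salvage cars may be very old or may be new cars that were damaged.
sane_price = subset(vposts, price < 2e5)
price_by_cond = split(sane_price$price, sane_price$new_cond)
boxplot(price_by_cond, co ...

age_by_cond = split(sane_age$age, sane_age$new_cond)
boxplot(age_by_cond, col = " ...

The price and age distributions do not show anything too surprising. Price and condition appear to be directly related: “like new” cars are sometimes offered at extremely high prices, which is uncommon for cars in worse condition. Age and condition are inversely related: older cars tend to be in worse condition.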
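The step that collapses the raw condition labels into a handful of categories is cut off in the code above. The sketch below shows one way it could be done with a named lookup vector. The mapping, the category names, and the 500,000 odometer cutoff are all assumptions for illustration; only the `new_cond` column name and the split-then-boxplot pattern come from the article's code.

```r
# Toy stand-in for vposts (assumption: made-up rows and free-text conditions).
posts = data.frame(
  condition = c("excellent", "like new", "good", "fair", "very good", "salvage",
                "mint", "parts only", "good", NA),
  odometer  = c(60000, 12000, 110000, 180000, 90000, 150000, 8000, 210000, 1234567890, 70000)
)

# Hypothetical mapping from raw labels to a small ordered set of categories.
lookup = c("mint" = "like new", "like new" = "like new",
           "excellent" = "excellent", "very good" = "good", "good" = "good",
           "fair" = "fair", "salvage" = "salvage", "parts only" = "salvage")
posts$new_cond = factor(unname(lookup[as.character(posts$condition)]),
                        levels = c("like new", "excellent", "good", "fair", "salvage"))

# Drop implausible odometer readings and unmapped conditions before plotting.
sane_odo = subset(posts, odometer < 500000 & !is.na(new_cond))
odo_by_cond = split(sane_odo$odometer, sane_odo$new_cond)
boxplot(odo_by_cond, ylab = "Miles", col = "grey")
```

Fixing the factor levels in a meaningful order keeps the boxes sorted from best to worst condition instead of alphabetically.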

Self-Test Questions

Question #1 How many observations are there in the data set?

Question #2 What are the names of the variables? and what is the class of each variable?

Question #3 What is the average price of all the vehicles? the median price? and the deciles? Display these on a plot of the distribution of vehicle prices.

Question #4 What are the different categories of vehicles, i.e. the type variable/column? What is the proportion for each category?

Question #5 Display the relationship between fuel type and vehicle type. Does this depend on transmission type?

Question #6 How many different cities are represented in the dataset?

Question #7 Visually display how the number/proportion of “for sale by owner” and “for sale by dealer” varies across city?

Question #8 What is the largest price for a vehicle in this data set? Examine this and fix the value. Now examine the new highest value for price.

Question #9 What are the three most common makes of cars in each city for “sale by owner” and for “sale by dealer”? Are they similar or quite different?

Question #10 Visually compare the distribution of the age of cars for different cities and for “sale by owner” and “sale by dealer”. Provide an interpretation of the plots, i.e., what are the key conclusions and insights?

Question #11 Plot the locations of the posts on a map? What do you notice?

Question #12 Summarize the distribution of fuel type, drive, transmission, and vehicle type. Find a good way to display this information.

Question #13 Plot odometer reading and age of car? Is there a relationship? Similarly, plot odometer reading and price? Interpret the result(s). Are odometer reading and age of car related?

Question #14 Identify the “old” cars. What manufacturers made these? What is the price distribution for these?

Question #15 I have omitted one important variable in this data set. What do you think it is? Can we derive this from the other variables? If so, sketch possible ideas as to how we would compute this variable.

Question #16 Display how condition and odometer are related. Also how condition and price are related. And condition and age of the car. Provide a brief interpretation of what you find.

posts by people selling vehicles. The important variable that I did not give you was the model/type of the vehicle being sold. This is very important for determining the price of the vehicle. For example, a new Volvo V60 has a suggested price of $35,000, but a new S60 has a price of $43,000, and the new Toyota Yaris and Avalon are $15,000 and $32,000 respectively - a factor of 2. So we need to determine the model of the vehicle.

We also want to verify some of the data and fix it if possible. And we also want to be able to programmatically extract other information from the posts if it is present.

Extract the price being asked for the vehicle from the body column, if it is present, and check if it agrees with the actual price in the price column.
Extract a Vehicle Identification Number (VIN) from the body, if it is present. We could use this to both identify details of the car (year it was built, type and model of the car, safety features, body style, engine type, etc.) and also use it to get historical information about the particular car. Add the VIN, if available, to the data frame. How many postings include the VIN?
Extract phone numbers from the body column, and again add these as a new column. How many posts include a phone number?
Extract email addresses from the body column, and again add these as a new column. How many posts include an email address?
Find the year in the description or body and compare it with the value in the year column.
Determine the model of the car, e.g., S60, Boxster, Cayman, 911, Jetta. This includes correcting mis-spelled or abbreviated model names. You may find the agrep() function useful. You should also use statistics, i.e., counts to see how often a word occurs in other posts and if such a spelling is reasonable, and whether this model name has been seen with that maker often.
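A starting point for the extraction tasks above is sketched here. The post bodies are made up, `firstMatch` is a hypothetical helper, and each regular expression is a first approximation that would need to be adapted after looking at what it matches in the real data.

```r
# Made-up post bodies (assumption: real posts are far messier).
body = c("Asking $4,500 obo. VIN 1HGCM82633A004352. Call 530-555-0147.",
         "Clean title, email seller@example.com, price 12000",
         "No contact details here")

# Hypothetical helper: first match of a pattern in each string, or NA when there is none.
firstMatch = function(pattern, x, ...) {
  m = regexpr(pattern, x, ...)
  out = rep(NA_character_, length(x))
  out[m > 0] = regmatches(x, m)
  out
}

# VINs are 17 characters drawn from digits and capital letters except I, O and Q.
vin   = firstMatch("\\b[A-HJ-NPR-Z0-9]{17}\\b", body, perl = TRUE)
phone = firstMatch("\\(?[0-9]{3}\\)?[-. ]?[0-9]{3}[-. ][0-9]{4}", body)
email = firstMatch("[[:alnum:]._%+-]+@[[:alnum:].-]+\\.[[:alpha:]]{2,}", body)

# Price written with a dollar sign; strip "$" and "," before converting.
priceTxt = firstMatch("\\$[0-9][0-9,]*", body)
price = as.numeric(gsub("[$,]", "", priceTxt))

data.frame(vin, phone, email, price)
```

Note that the second body states a price without a dollar sign, so this pattern misses it. That is the kind of case where a second strategy, such as looking for a number after the word "price" or "asking", would be needed.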

When doing these questions, you will very likely have to iterate by developing a regular expression, seeing what results it gives you, and adapting it. Furthermore, you will probably have to use two or more strategies when looking for a particular piece of information. This is expected; the data are not nice and regularly formatted.

Modeling

Pick two models of cars, each for a different car maker, e.g., Toyota or Volvo. For each of these, separately explore the relationship between the price being asked for the vehicle, the number of miles (odometer), age of the car and condition. Does location (city) have an effect on this? Use a statistical model to be able to suggest the appropriate price for such a car given its age, mileage, and condition. You might consider a linear model, k-nearest neighbors, or a regression tree.

You need to describe why the method you chose is appropriate, what assumptions are needed and how reasonable they are, and how well it performs and how you determined this. Would you use it if you were buying or selling this type of car?
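As one possible shape for this exercise, the sketch below fits a linear model on simulated listings for a single car model and checks it on a held-out set. Everything about the data is invented, including the coefficients used to generate prices, so the numbers carry no meaning; the point is the workflow of fitting, validating, and predicting.

```r
# Simulated listings for a single model (assumption: made-up data and coefficients).
set.seed(10)
n = 300
cars = data.frame(
  age       = sample(1:20, n, replace = TRUE),
  condition = factor(sample(c("fair", "good", "excellent"), n, replace = TRUE),
                     levels = c("fair", "good", "excellent"))
)
cars$odometer = pmax(0, round(12000 * cars$age + rnorm(n, 0, 15000)))
cars$price = 25000 - 700 * cars$age - 0.03 * cars$odometer +
             1500 * (as.integer(cars$condition) - 1) + rnorm(n, 0, 1200)

# Hold out a test set to judge predictive accuracy.
test = sample(n, 60)
fit  = lm(price ~ age + odometer + condition, data = cars[-test, ])
pred = predict(fit, newdata = cars[test, ])
rmse = sqrt(mean((cars$price[test] - pred)^2))

# Suggested price for a hypothetical 8-year-old car with 95,000 miles in good condition.
newcar = data.frame(age = 8, odometer = 95000,
                    condition = factor("good", levels = levels(cars$condition)))
predict(fit, newdata = newcar, interval = "prediction")
```

The prediction interval is arguably more useful than the point estimate when negotiating a price, and the held-out error gives a rough sense of how far off a suggested price might be.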

Useful Functions

strsplit(), grep(), grepl(), gregexpr(), sub(), gsub().
agrep(), adist(), nchar(), substring()
The stringi and stringr packages.
