利用python清洗.log文件并保存為.csv文件

1?數(shù)據(jù)處理思路
????一、數(shù)據(jù)的讀取與寫入
????(1)讀取.log數(shù)據(jù)
????(2)將.log數(shù)據(jù)寫入.xlsx文件并保存至當(dāng)前文件夾
????二、數(shù)據(jù)清洗
????(1)去除表格中多余的工作簿
????(2)去除表格中多余的數(shù)據(jù)
????????????①去除多余的列
????????????②去除多余的行
????????????③去除數(shù)據(jù)中的字符“?”并將其替換成nan
????????????④將數(shù)據(jù)為nan的用其前一個數(shù)據(jù)替換

2 數(shù)據(jù)處理步驟及代碼
2.1?讀取.log文件數(shù)據(jù)
讀取.log文件的時,讀取有用數(shù)據(jù),即前36*6=216列,代碼如下:
# (1)讀取原始數(shù)據(jù)
raw_data = open("./data/測試.log", "r", encoding="UTF-8")
count = 0
datalist = []
for i in raw_data:
????datalist.append(tuple(i.split( )[0:218]))
raw_data.close()
2.2?將.log數(shù)據(jù)寫入.xlsx文件并保存至當(dāng)前文件夾
將文件寫入到.xlsx文件中,并創(chuàng)建一個空的sheet工作簿,利用for循環(huán)將代碼寫入到sheet工作簿,并保存當(dāng)前文件下,形成規(guī)范的.xlsx數(shù)據(jù)文件,代碼如下:
# (2)將原始數(shù)據(jù)寫入.xlsx文件并保存至當(dāng)前文件夾
data_over = openpyxl.Workbook()
sheet = data_over.create_sheet()
for i in range(0, len(datalist)):
????data = datalist[i]
????for j in range(0, 217):
????????sheet.cell(row=(i+1), column=(j+1), value=data[j])
data_over.save("./raw_data.xlsx")
2.3?去除多余的工作簿
由于csv文件格式的狀態(tài)下,業(yè)已包含了sheet工作簿,如果不去除該工作簿,利用pd.read_csv()時,就是出現(xiàn)DataFrame的內(nèi)容為None,其代碼如下:
# (3)去除表格中多余的工作簿,df為dataframe
sheet_name = "Sheet"
workbook = openpyxl.load_workbook("raw_data.xlsx")
worksheet = workbook[sheet_name]
workbook.remove(worksheet)
workbook.save("data_cleansing.xlsx")
2.4?去除表格中多余的title
對數(shù)據(jù)進(jìn)行預(yù)分析,發(fā)現(xiàn)了數(shù)據(jù)中含有多余的title,這對后續(xù)數(shù)據(jù)的分析將造成影響。針對這種情況,一般有兩種方法進(jìn)行處理,一是采用drop函數(shù),去除多余的行;二是用loc函數(shù)對數(shù)據(jù)進(jìn)行多次讀取。前者適用于超大數(shù)據(jù),很簡便,后者常用于少量數(shù)據(jù)分析。這里采用第一種方法,代碼如下:
# (4)去除表格中多余的title
df = pd.read_excel("data_cleansing.xlsx")
df.to_csv("data_cleansing.csv",encoding="UTF-8")
df = pd.read_csv("./data_cleansing.csv",index_col=0,low_memory=False,sep=",")
new_df = df.drop(index=[84, 237], inplace=False)
new_df.to_csv("data_cleansing_drop.csv")
2.5?去除數(shù)據(jù)中的字符“?”并用前一行對應(yīng)的列值填充
數(shù)據(jù)預(yù)覽中,發(fā)現(xiàn)數(shù)據(jù)中包含“?”字符串,故利用numpy中的nan特殊字符取替換掉“?”,后續(xù)就可以用相關(guān)函數(shù)去替換nan,經(jīng)分析,數(shù)據(jù)的值與時間具有強(qiáng)相關(guān),所以在此次填寫nan中,采用前一行對應(yīng)的列值經(jīng)行補(bǔ)充,函數(shù)為:method="ffill",代碼如下:
# 將含有"?"的位置替換成nan值
new_df[new_df == "?"] = np.nan
# 用前一行對應(yīng)的值補(bǔ)填空缺值
new_df_nan = new_df.fillna(method="ffill")
new_df_nan.to_csv("data_cleansing_over.csv")

3 完整代碼
import pandas as pd
import numpy as np
import openpyxl
def data_processing():
"""
????包含數(shù)據(jù)清洗(去除不需要的字段和錯誤數(shù)據(jù))、數(shù)據(jù)轉(zhuǎn)存至Excel、數(shù)據(jù)庫等。
????一、原始數(shù)據(jù)的讀取與寫入
????(1)讀取原始數(shù)據(jù)
????(2)將原始數(shù)據(jù)寫入.xlsx文件并保存至當(dāng)前文件夾
????二、原始數(shù)據(jù)清洗
????(3)去除表格中多余的工作簿
????(4)去除表格中多余的title
????"""
????# (1)讀取原始數(shù)據(jù)
????raw_data = open("./data/測試.log", "r", encoding="UTF-8")
????datalist = []
????for i in raw_data:
????????datalist.append(tuple(i.split( )[0:218]))
????raw_data.close()
????# (2)將原始數(shù)據(jù)寫入.xlsx文件并保存至當(dāng)前文件夾
????data_over = openpyxl.Workbook()
????sheet = data_over.create_sheet()
????for i in range(0, len(datalist)):
????????data = datalist[i]
????????for j in range(0, 217):
????????????sheet.cell(row=(i+1), column=(j+1), value=data[j])
????data_over.save("./raw_data.xlsx")
????# (3)去除表格中多余的工作簿
????sheet_name = "Sheet"
????workbook = openpyxl.load_workbook("raw_data.xlsx")
????worksheet = workbook[sheet_name]
????workbook.remove(worksheet)
????workbook.save("data_cleansing.xlsx")
????# (4)去除表格中多余的title,df為dataframe
????df = pd.read_excel("data_cleansing.xlsx")
????df.to_csv("data_cleansing.csv",encoding="UTF-8")
????df = pd.read_csv("./data_cleansing.csv",index_col=0,low_memory=False,sep=",")
????new_df = df.drop(index=[84, 237], inplace=False)
????new_df.to_csv("data_cleansing_drop.csv")
????# 將含有"?"的位置替換成nan值
????new_df[new_df == "?"] = np.nan
????# 用前一行對應(yīng)的值補(bǔ)填空缺值
????new_df_nan = new_df.fillna(method="ffill")
????new_df_nan.to_csv("data_cleansing_over.csv")
????return None
if __name__ == "__main__":
????# 數(shù)據(jù)處理
????data_processing()

看完記得一鍵三連喲~
