记录在实际操作过程中遇到遇到的一些有意思的使用法.
pandas
Serise
获取最小值的 index
Serise.idxmin()
DataFrame
csv 读取扩充
# 读取unix log
popcon = pd.read_csv('../data/popularity-contest', sep=' ', )
# 下载网站数据读取
weather_mar2012 = pd.read_csv(url, skiprows=15, index_col='Date/Time', parse_dates=True, encoding='latin1', header=True)
自定义 dataframe 的显示
# Make the graphs a bit prettier, and bigger
pd.set_option('display.mpl_style', 'default')
# This is necessary to show lots of columns in pandas 0.12.
# Not necessary in pandas 0.13.
pd.set_option('display.width', 5000)
pd.set_option('display.max_columns', 60)
对不标准数据集的读取
raw_dataset = pd.read_csv(dataset_path, names=column_names,na_values = "?", comment='\t',sep=" ", skipinitialspace=True)
pop 函数的返回值
train_labels = train_dataset.pop('MPG')
对 dataframe 手动拆分为训练集和测试集
train_dataset = dataset.sample(frac=0.8,random_state=0)
test_dataset = dataset.drop(train_dataset.index)
频次统计
complaints['Complaint Type'].value_counts()
对日期进行聚类
纵向聚类
问题简述:对一年中的 每个星期一\星期二... 进行聚类,如此,一年划分为 7 类
berri_bikes.index.weekday
# 注:当index为日期时,day\weekday\month\year
# 使用非index无法进行该操作 AttributeError: 'Series' object has no attribute 'year'
berri_bikes.loc[:,'weekday'] = berri_bikes.index.weekday
weekday_counts = berri_bikes.groupby('weekday').aggregate(sum)
横向聚类
问题简述:对一年中的每个星期进行聚类,如此,一年划分的类的数目为该年的星期数
bikes.resample('M').apply(np.sum)
该方法不限于 index 为日期
字符串操作
问题简述: 对 Weather 进行筛选,摘出包含 Snow 的行
weather_description = weather_2012['Weather']
is_snowing = weather_description.str.contains('Snow')
数据清洗
# 在读取数据时进行清洗过滤
# 清洗na数据,更改Incident Zip列的类型
na_values = ['NO CLUE', 'N/A', '0']
requests = pd.read_csv('../data/311-service-requests.csv', na_values=na_values, dtype={'Incident Zip': str})
numpy
matplotlib
# seaborn
# 将Dataframe中的指定列(复数级)两两绘制图像
sns.pairplot(train_dataset[["MPG", "Cylinders", "Displacement", "Weight"]], diag_kind="kde")
- Trends - A trend is defined as a pattern of change.
sns.lineplot - Line charts are best to show trends over a period of time, and multiple lines can be used to show trends in more than one group. - Relationship - There are many different chart types that you can use to understand relationships between variables in your data.
sns.barplot - Bar charts are useful for comparing quantities corresponding to different groups.
sns.heatmap - Heatmaps can be used to find color-coded patterns in tables of numbers.
sns.scatterplot - Scatter plots show the relationship between two continuous variables; if color-coded, we can also show the relationship with a third categorical variable.
sns.regplot - Including a regression line in the scatter plot makes it easier to see any linear relationship between two variables.
sns.lmplot - This command is useful for drawing multiple regression lines, if the scatter plot contains multiple, color-coded groups.
sns.swarmplot - Categorical scatter plots show the relationship between a continuous variable and a categorical variable. - Distribution - We visualize distributions to show the possible values that we can expect to see in a variable, along with how likely they are.
sns.distplot - Histograms show the distribution of a single numerical variable.
sns.kdeplot - KDE plots (or 2D KDE plots) show an estimated, smooth distribution of a single numerical variable (or two numerical variables).
sns.jointplot - This command is useful for simultaneously displaying a 2D KDE plot with the corresponding KDE plots for each individual variable.
seaborn
X 轴日期合并
问题
由于 x 轴上的日期太过密集(年-月),所以显示很不好看
解决方案
将日期再分类,以年为单位
from datetime import datetime
plt.figure(figsize=(12,6))
date_ticks = museum_data.index #x轴坐标列表
g = sns.lineplot(data=museum_data)
g.set_xticks(date_ticks[::12])
# 核心代码
# g.set_xticklabels(labels = [foo.year for foo in [datetime.strptime(text, '%Y-%m-%d') for text in date_ticks[::12]]]) # 对于不标准的时间格式,先进行格式化
g.set_xticklabels(labels = [foo.split('-')[0] for foo in date_ticks[::12]]) # 对于标准的时间格式,直接使用分割
# 注:直接分割的效率远远高于格式化
g.set_xlabel('Date')
# Add title
plt.title("Monthly Visitors to Los Angeles City Museums")