Notes on some interesting usage patterns I have run into in day-to-day work.

pandas

Series

Get the index of the minimum value

Series.idxmin()
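
A minimal sketch of idxmin on made-up data (the Series and its labels are hypothetical). It returns the index label of the smallest value, not its integer position:

```python
import pandas as pd

# Hypothetical data: temperatures labelled by weekday name
temps = pd.Series([21.5, 18.0, 19.2, 23.1], index=["Mon", "Tue", "Wed", "Thu"])

min_label = temps.idxmin()    # index label of the smallest value
min_value = temps[min_label]  # look the value up by that label
print(min_label, min_value)
```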

DataFrame

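More ways to read CSV files

# Read a Unix log (space-separated)
popcon = pd.read_csv('../data/popularity-contest', sep=' ')
# Read data downloaded from a website
# header=0: the first row left after skiprows holds the column names (header takes a row number, not a bool)
weather_mar2012 = pd.read_csv(url, skiprows=15, index_col='Date/Time', parse_dates=True, encoding='latin1', header=0)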
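
To see how skiprows, header, index_col and parse_dates interact, here is a small sketch that reads from an in-memory string instead of a URL. The file contents are invented for illustration; only the read_csv parameters mirror the call above.

```python
import io

import pandas as pd

# Hypothetical file: two metadata lines, then the real header row
raw = io.StringIO(
    "Station,Demo\n"
    "Generated,2012\n"
    "Date/Time,Temp (C)\n"
    "2012-03-01 00:00,-5.5\n"
    "2012-03-01 01:00,-5.7\n"
)

weather = pd.read_csv(
    raw,
    skiprows=2,             # drop the metadata lines
    header=0,               # first remaining row is the header
    index_col="Date/Time",  # use the timestamp column as the index
    parse_dates=True,       # parse the index into datetimes
)
print(weather)
```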

Customizing DataFrame display

# Make the graphs a bit prettier, and bigger
# (the pandas option 'display.mpl_style' has been removed; a matplotlib style does the same job)
import matplotlib.pyplot as plt
plt.style.use('ggplot')

# This is necessary to show lots of columns in pandas 0.12.
# Not necessary in pandas 0.13.
pd.set_option('display.width', 5000)
pd.set_option('display.max_columns', 60)

Reading non-standard datasets

raw_dataset = pd.read_csv(dataset_path, names=column_names, na_values='?', comment='\t', sep=' ', skipinitialspace=True)
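
A sketch of what each parameter does, on two invented rows in the same shape (space-separated values, "?" for missing data, and a tab followed by a free-text name that should be ignored). The column names and values are assumptions for illustration.

```python
import io

import pandas as pd

# Hypothetical rows: repeated spaces between fields, '?' for a missing value,
# and a tab followed by a quoted name that we want to discard
raw = io.StringIO(
    '18.0 8   307.0 130.0\t"chevrolet chevelle malibu"\n'
    '25.0 4   98.00 ?\t"ford pinto"\n'
)
column_names = ["MPG", "Cylinders", "Displacement", "Horsepower"]

df = pd.read_csv(
    raw,
    names=column_names,     # the file has no header row
    na_values="?",          # treat '?' as missing
    comment="\t",           # ignore everything from the tab onward
    sep=" ",
    skipinitialspace=True,  # swallow the extra spaces after each separator
)
print(df)
```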

Return value of pop

# pop removes the 'MPG' column from train_dataset in place and returns it as a Series
train_labels = train_dataset.pop('MPG')
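
A quick check of this behaviour on a tiny, made-up frame: after pop, the column is gone from the DataFrame and lives on only in the returned Series.

```python
import pandas as pd

# Hypothetical two-row dataset
train_dataset = pd.DataFrame({"MPG": [18.0, 25.0], "Cylinders": [8, 4]})

train_labels = train_dataset.pop("MPG")  # mutates train_dataset

print(list(train_dataset.columns))
print(train_labels.tolist())
```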

Manually splitting a DataFrame into training and test sets

train_dataset = dataset.sample(frac=0.8, random_state=0)
test_dataset = dataset.drop(train_dataset.index)
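
A small sketch of this split on invented data. One assumption worth stating: the drop step only removes exactly the sampled rows when the index labels are unique, so reset the index first if they are not.

```python
import pandas as pd

# Hypothetical dataset with a default (unique) integer index
dataset = pd.DataFrame({"x": range(10), "y": range(10, 20)})

train_dataset = dataset.sample(frac=0.8, random_state=0)  # 80% of the rows, reproducible
test_dataset = dataset.drop(train_dataset.index)          # whatever was not sampled

print(len(train_dataset), len(test_dataset))
```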

Frequency counts

complaints['Complaint Type'].value_counts()
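
For reference, a sketch on made-up complaint data: value_counts returns a Series indexed by the distinct values, sorted by count in descending order.

```python
import pandas as pd

# Hypothetical complaint records
complaints = pd.DataFrame(
    {"Complaint Type": ["Noise", "Heating", "Noise", "Noise", "Heating", "Parking"]}
)

counts = complaints["Complaint Type"].value_counts()
print(counts)
```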

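Grouping by date

Vertical grouping

Problem: group all Mondays, all Tuesdays, ... of a year together, so the year falls into 7 groups.

berri_bikes.index.weekday
# Note: when the index holds dates (a DatetimeIndex), it exposes day / weekday / month / year directly.
# A non-index column does not: AttributeError: 'Series' object has no attribute 'year'
# For a datetime column, go through the .dt accessor instead, e.g. some_date_column.dt.year (name is illustrative)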

berri_bikes.loc[:,'weekday'] = berri_bikes.index.weekday

weekday_counts = berri_bikes.groupby('weekday').sum()
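
The whole weekday grouping can be sketched end to end on synthetic data (the column name and counts are invented). It also shows the .dt accessor, which is how a datetime column, as opposed to a datetime index, exposes the same attributes.

```python
import pandas as pd

# Hypothetical data: 14 consecutive days starting on Monday 2012-01-02
idx = pd.date_range("2012-01-02", periods=14, freq="D")
berri_bikes = pd.DataFrame({"Berri 1": range(14)}, index=idx)

# A DatetimeIndex exposes weekday directly (Monday=0 ... Sunday=6)
berri_bikes["weekday"] = berri_bikes.index.weekday
weekday_counts = berri_bikes.groupby("weekday").sum()

# A plain Series of datetimes needs the .dt accessor
dates = pd.Series(idx)
years = dates.dt.year

print(weekday_counts)
```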

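Horizontal grouping

Problem: group by each week of the year, so the number of groups equals the number of weeks in that year.

bikes.resample('W').sum()   # 'W' buckets by week; a month-based frequency would bucket by month instead

resample is not limited to a date index: a datetime column can be passed through the on= parameter.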
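
A minimal sketch of time-based resampling on synthetic data, using weekly buckets ('W', which by default ends each bucket on Sunday). The second half shows the same aggregation driven by a datetime column through the on parameter; the column name "when" is made up for the example.

```python
import pandas as pd

# Hypothetical data: one ride per day for three full Monday-to-Sunday weeks
idx = pd.date_range("2012-01-02", periods=21, freq="D")
bikes = pd.DataFrame({"Berri 1": 1}, index=idx)

# Resample on the datetime index
weekly = bikes.resample("W").sum()

# Same thing when the dates live in an ordinary column
events = bikes.reset_index().rename(columns={"index": "when"})
weekly_on = events.resample("W", on="when").sum()

print(weekly)
```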

String operations

Problem: filter on Weather and pick out the rows that contain Snow

weather_description = weather_2012['Weather']
is_snowing = weather_description.str.contains('Snow')
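
The resulting boolean Series can be used directly as a row mask. A sketch on invented weather descriptions follows; passing na=False is an optional extra that makes missing descriptions count as non-matches instead of producing NaN in the mask.

```python
import pandas as pd

# Hypothetical weather descriptions
weather_2012 = pd.DataFrame({"Weather": ["Fog", "Snow", "Rain,Snow Showers", "Clear"]})

is_snowing = weather_2012["Weather"].str.contains("Snow", na=False)
snow_rows = weather_2012[is_snowing]  # keep only the matching rows

print(snow_rows)
```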

Data cleaning

# Clean and filter while reading the data
# Treat placeholder strings as NA and read the Incident Zip column as str
na_values = ['NO CLUE', 'N/A', '0']
requests = pd.read_csv('../data/311-service-requests.csv', na_values=na_values, dtype={'Incident Zip': str})
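
A sketch of the effect on a few invented rows: the placeholder strings become missing values, and reading the zip column as str preserves leading zeros that a numeric dtype would drop.

```python
import io

import pandas as pd

# Hypothetical extract with messy zip codes
raw = io.StringIO(
    "Incident Zip,Complaint Type\n"
    "00083,Noise\n"
    "NO CLUE,Heating\n"
    "N/A,Parking\n"
    "0,Noise\n"
)

na_values = ["NO CLUE", "N/A", "0"]
requests = pd.read_csv(raw, na_values=na_values, dtype={"Incident Zip": str})

print(requests)
```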

numpy

matplotlib

# seaborn
# Plot every pairwise combination of the selected DataFrame columns
sns.pairplot(train_dataset[["MPG", "Cylinders", "Displacement", "Weight"]], diag_kind="kde")

Trends - A trend is defined as a pattern of change.
sns.lineplot - Line charts are best to show trends over a period of time, and multiple lines can be used to show trends in more than one group.
Relationship - There are many different chart types that you can use to understand relationships between variables in your data.
sns.barplot - Bar charts are useful for comparing quantities corresponding to different groups.
sns.heatmap - Heatmaps can be used to find color-coded patterns in tables of numbers.
sns.scatterplot - Scatter plots show the relationship between two continuous variables; if color-coded, we can also show the relationship with a third categorical variable.
sns.regplot - Including a regression line in the scatter plot makes it easier to see any linear relationship between two variables.
sns.lmplot - This command is useful for drawing multiple regression lines, if the scatter plot contains multiple, color-coded groups.
sns.swarmplot - Categorical scatter plots show the relationship between a continuous variable and a categorical variable.
Distribution - We visualize distributions to show the possible values that we can expect to see in a variable, along with how likely they are.
sns.histplot (formerly sns.distplot, which is deprecated) - Histograms show the distribution of a single numerical variable.
sns.kdeplot - KDE plots (or 2D KDE plots) show an estimated, smooth distribution of a single numerical variable (or two numerical variables).
sns.jointplot - This command is useful for simultaneously displaying a 2D KDE plot with the corresponding KDE plots for each individual variable.

seaborn

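Thinning out date labels on the x-axis

Problem

The dates on the x-axis (year-month) are too dense, so the axis is hard to read.

Solution

Regroup the dates and label the axis by year.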

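from datetime import datetime

import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(12,6))
date_ticks = museum_data.index  # list of x-axis tick positions

g = sns.lineplot(data=museum_data)
g.set_xticks(date_ticks[::12])  # keep one tick per 12 months

# Key step
# g.set_xticklabels(labels = [foo.year for foo in [datetime.strptime(text, '%Y-%m-%d') for text in date_ticks[::12]]])   # for non-standard date formats, parse them first
g.set_xticklabels(labels = [foo.split('-')[0] for foo in date_ticks[::12]])    # for standard date formats, just split the string
# Note: splitting the string directly is much faster than parsing it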

g.set_xlabel('Date')
# Add title
plt.title("Monthly Visitors to Los Angeles City Museums")
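
The label computation can be checked on its own, without drawing anything. The sketch below builds a hypothetical list of monthly date strings (standing in for the index of museum_data, which is assumed to hold 'YYYY-MM-DD' strings) and confirms that splitting and parsing give the same year labels.

```python
from datetime import datetime

# Hypothetical monthly tick strings covering two years
date_ticks = [f"{year}-{month:02d}-01" for year in (2014, 2015) for month in range(1, 13)]

yearly_ticks = date_ticks[::12]  # one tick per 12 months

# Cheap route: the year is the part before the first '-'
labels_split = [text.split("-")[0] for text in yearly_ticks]

# General route: parse the date, then read the year
labels_parsed = [str(datetime.strptime(text, "%Y-%m-%d").year) for text in yearly_ticks]

print(labels_split, labels_parsed)
```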