学习是件开心事

分区分表分库分片应用

sqlserver单表数据超过2000w后,索引查询的效率开始下降。当数据过亿后,就开始有慢sql出现了。如何对这种体量的数据库进行优化,成为开发最头疼的事情。

分区

把一张表的数据分成N个区块,在逻辑上看最终只是一张表,但底层是由N个物理区块组成的。
常见的分区方式有:

  • Range(范围,根据id大小分区,有热点问题,id越大代表越新,查询次数越多,id越小代表数据越陈旧,几乎无查询)
  • Hash取模(哈希,对id进行取模运算,将数据均匀分布到事先设计好的分区中,后续如果模值变动,需要数据迁移)
  • 按照时间拆分 (同range)
  • Range+Hash取模 (分区+分库的实现形式,单表保存id和group的映射关系,再根据group内id取模运算,均匀落库到指定的db中,后续扩容直接新增一个group组即可)

按照时间拆分的一个完整示例

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
-- 行程跟进分区函数 (按月分区) 不拆表
-- 2010-01-01至2018-01-01 为一个区
-- 2018-01-01至2030-12-01 每个月一个区
-- 2030-12-01至2050-01-01 为一个区
IF NOT EXISTS (SELECT top 1 1 FROM sys.partition_functions where name = 'activityDayTimeRange')
BEGIN
CREATE PARTITION FUNCTION activityDayTimeRange (smalldatetime)
AS RANGE RIGHT FOR VALUES ('2010-01-01'
,'2018-01-01','2018-02-01','2018-03-01','2018-04-01','2018-05-01','2018-06-01'
,'2018-07-01','2018-08-01','2018-09-01','2018-10-01','2018-11-01','2018-12-01'
,'2019-01-01','2019-02-01','2019-03-01','2019-04-01','2019-05-01','2019-06-01'
,'2019-07-01','2019-08-01','2019-09-01','2019-10-01','2019-11-01','2019-12-01'
,'2020-01-01','2020-02-01','2020-03-01','2020-04-01','2020-05-01','2020-06-01'
,'2020-07-01','2020-08-01','2020-09-01','2020-10-01','2020-11-01','2020-12-01'
,'2021-01-01','2021-02-01','2021-03-01','2021-04-01','2021-05-01','2021-06-01'
,'2021-07-01','2021-08-01','2021-09-01','2021-10-01','2021-11-01','2021-12-01'
,'2022-01-01','2022-02-01','2022-03-01','2022-04-01','2022-05-01','2022-06-01'
,'2022-07-01','2022-08-01','2022-09-01','2022-10-01','2022-11-01','2022-12-01'
,'2023-01-01','2023-02-01','2023-03-01','2023-04-01','2023-05-01','2023-06-01'
,'2023-07-01','2023-08-01','2023-09-01','2023-10-01','2023-11-01','2023-12-01'
,'2024-01-01','2024-02-01','2024-03-01','2024-04-01','2024-05-01','2024-06-01'
,'2024-07-01','2024-08-01','2024-09-01','2024-10-01','2024-11-01','2024-12-01'
,'2025-01-01','2025-02-01','2025-03-01','2025-04-01','2025-05-01','2025-06-01'
,'2025-07-01','2025-08-01','2025-09-01','2025-10-01','2025-11-01','2025-12-01'
,'2026-01-01','2026-02-01','2026-03-01','2026-04-01','2026-05-01','2026-06-01'
,'2026-07-01','2026-08-01','2026-09-01','2026-10-01','2026-11-01','2026-12-01'
,'2027-01-01','2027-02-01','2027-03-01','2027-04-01','2027-05-01','2027-06-01'
,'2027-07-01','2027-08-01','2027-09-01','2027-10-01','2027-11-01','2027-12-01'
,'2028-01-01','2028-02-01','2028-03-01','2028-04-01','2028-05-01','2028-06-01'
,'2028-07-01','2028-08-01','2028-09-01','2028-10-01','2028-11-01','2028-12-01'
,'2029-01-01','2029-02-01','2029-03-01','2029-04-01','2029-05-01','2029-06-01'
,'2029-07-01','2029-08-01','2029-09-01','2029-10-01','2029-11-01','2029-12-01'
,'2030-01-01','2030-02-01','2030-03-01','2030-04-01','2030-05-01','2030-06-01'
,'2030-07-01','2030-08-01','2030-09-01','2030-10-01','2030-11-01','2030-12-01'
,'2050-01-01')
END
GO
IF NOT EXISTS (SELECT top 1 1 FROM sys.partition_schemes where name = 'activityDayTimePartitionScheme')
BEGIN
CREATE PARTITION SCHEME activityDayTimePartitionScheme
AS PARTITION activityDayTimeRange ALL TO ([PRIMARY]);
end
GO
--将原表的主键id的聚簇索引改为非聚簇索引
IF EXISTS(SELECT top 1 1 from sysindexes WHERE id=object_id('activity_follow') AND name='PK_activity_follow_follow_Id')
BEGIN
ALTER TABLE dbo.activity_follow drop CONSTRAINT PK_activity_follow_follow_Id;
ALTER TABLE dbo.activity_follow ADD CONSTRAINT PK_activity_follow_Id PRIMARY KEY NONCLUSTERED (id)
WITH (STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF,
ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY];
END
GO
-- 为DayTime(分区列)创建聚簇索引
IF NOT EXISTS(SELECT top 1 1 from sysindexes WHERE id=object_id('activity_follow') AND name='IX_activity_follow_daytime_partition')
BEGIN
CREATE CLUSTERED INDEX IX_activity_follow_daytime_partition ON dbo.activity_follow (daytime)
WITH (STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF,
ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)
ON activityDayTimePartitionScheme(daytime);
END
GO

查询分区结果示例

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
select convert(varchar(50), ps.name
) as partition_scheme,
p.partition_number,
convert(varchar(10), ds2.name
) as filegroup,
convert(varchar(19), isnull(v.value, ''), 120) as range_boundary,
str(p.rows, 9) as rows
from sys.indexes i
join sys.partition_schemes ps on i.data_space_id = ps.data_space_id
join sys.destination_data_spaces dds
on ps.data_space_id = dds.partition_scheme_id
join sys.data_spaces ds2 on dds.data_space_id = ds2.data_space_id
join sys.partitions p on dds.destination_id = p.partition_number
and p.object_id = i.object_id and p.index_id = i.index_id
join sys.partition_functions pf on ps.function_id = pf.function_id
LEFT JOIN sys.Partition_Range_values v on pf.function_id = v.function_id
and v.boundary_id = p.partition_number - pf.boundary_value_on_right
WHERE i.object_id = object_id('fang_property_follow')--分区表名
and i.index_id in (0, 1)
order by p.partition_number

Range+hash的架构图
Image text
Image text

分表

是把一张表分成多个小表,适用如下场景: 单表的部分字段频繁的需要修改,可以将该类字段独立出来成一张表,实现单库的”读写分离”

分库

对于时效性不高的数据,可以通过读写分离缓解数据库压力,需要注意的是在业务分区上哪些业务是允许一定延迟的,以及数据同步问题

分片

在分布式存储系统中,数据需要分散存储在多台设备上,数据分片就是用来确定数据在多台存储设备上分布的技术。数据分片的目的是

  • 分布均匀,每台设备上的数据量要尽可能详尽
  • 负载均衡,每台设备上的请求量要尽可能相近
  • 扩缩容时产生的数据迁移尽可能少