Why is it needed?
(为什么需要它?)
When data is stored on disk-based storage devices, it is stored as blocks of data.
(当数据存储在基于磁盘的存储设备上时,它将存储为数据块。)
These blocks are accessed in their entirety, making them the atomic disk access operation. (这些块完全被访问,使它们成为原子磁盘访问操作。)
Disk blocks are structured in much the same way as linked lists; (磁盘块的结构与链表大致相同;)
both contain a section for data, a pointer to the location of the next node (or block), and both need not be stored contiguously. (两者都包含数据部分,指向下一个节点(或块)位置的指针,并且两者都不需要连续存储。)
Due to the fact that a number of records can only be sorted on one field, we can state that searching on a field that isn't sorted requires a Linear Search which requires N/2
block accesses (on average), where N
is the number of blocks that the table spans.
(由于许多记录只能在一个字段上排序,我们可以声明搜索未排序的字段需要线性搜索,这需要N/2
块访问(平均),其中N
是表格跨越的块数。)
If that field is a non-key field (ie doesn't contain unique entries) then the entire tablespace must be searched at N
block accesses. (如果该字段是非关键字段(即不包含唯一条目),则必须在N
块访问时搜索整个表空间。)
Whereas with a sorted field, a Binary Search may be used, which has log2 N
block accesses.
(而对于排序字段,可以使用二进制搜索,其具有log2 N
块访问。)
Also since the data is sorted given a non-key field, the rest of the table doesn't need to be searched for duplicate values, once a higher value is found. (此外,由于在给定非关键字段的情况下对数据进行了排序,因此一旦找到更高的值,则不需要搜索表的其余部分的重复值。)
Thus the performance increase is substantial. (因此,性能提高很大。)
What is indexing?
(什么是索引?)
Indexing is a way of sorting a number of records on multiple fields.
(索引是一种对多个字段中的多个记录进行排序的方法。)
Creating an index on a field in a table creates another data structure which holds the field value, and a pointer to the record it relates to. (在表中的字段上创建索引会创建另一个包含字段值的数据结构,以及指向与其相关的记录的指针。)
This index structure is then sorted, allowing Binary Searches to be performed on it. (然后对该索引结构进行排序,允许对其执行二进制搜索。)
The downside to indexing is that these indices require additional space on the disk since the indices are stored together in a table using the MyISAM engine, this file can quickly reach the size limits of the underlying file system if many fields within the same table are indexed.
(索引的缺点是这些索引需要磁盘上的额外空间,因为索引使用MyISAM引擎一起存储在表中,如果同一个表中的许多字段被索引,则此文件可以快速达到底层文件系统的大小限制。)
How does it work?
(它是如何工作的?)
Firstly, let's outline a sample database table schema;
(首先,让我们概述一个示例数据库表模式;)
Field name Data type Size on disk
id (Primary key) Unsigned INT 4 bytes
firstName Char(50) 50 bytes
lastName Char(50) 50 bytes
emailAddress Char(100) 100 bytes
Note : char was used in place of varchar to allow for an accurate size on disk value.
(注意 :使用char代替varchar以允许磁盘值的准确大小。)
This sample database contains five million rows and is unindexed. (此示例数据库包含五百万行并且未编入索引。)
The performance of several queries will now be analyzed. (现在将分析几个查询的性能。)
These are a query using the id (a sorted key field) and one using the firstName (a non-key unsorted field). (这些是使用id (已排序的键字段)的查询和使用firstName (非键未排序字段)的查询。)
Example 1 - sorted vs unsorted fields
(示例1 - 排序与未排序的字段)
Given our sample database of r = 5,000,000
records of a fixed size giving a record length of R = 204
bytes and they are stored in a table using the MyISAM engine which is using the default block size B = 1,024
bytes.
(给定我们的r = 5,000,000
个固定大小记录的样本数据库,给出R = 204
字节的记录长度,并使用MyISAM引擎将它们存储在表中,该引擎使用默认块大小B = 1,024
字节。)
The blocking factor of the table would be bfr = (B/R) = 1024/204 = 5
records per disk block. (该表的阻塞因子是bfr = (B/R) = 1024/204 = 5
每个磁盘块bfr = (B/R) = 1024/204 = 5
记录。)
The total number of blocks required to hold the table is N = (r/bfr) = 5000000/5 = 1,000,000
blocks. (保持表所需的块总数是N = (r/bfr) = 5000000/5 = 1,000,000
个块。)
A linear search on the id field would require an average of N/2 = 500,000
block accesses to find a value, given that the id field is a key field.
(鉴于id字段是关键字段,对id字段的线性搜索将需要平均N/2 = 500,000
块访问来找到值。)
But since the id field is also sorted, a binary search can be conducted requiring an average of log2 1000000 = 19.93 = 20
block accesses. (但由于id字段也被排序,因此可以进行二进制搜索,需要平均log2 1000000 = 19.93 = 20
块访问。)
Instantly we can see this is a drastic improvement. (我们可以立即看到这是一个巨大的进步。)
Now the firstName field is neither sorted nor a key field, so a binary search is impossible, nor are the values unique, and thus the table will require searching to the end for an exact N = 1,000,000
block accesses.
(现在firstName字段既没有排序也没有键字段,因此二进制搜索是不可能的,值也不是唯一的,因此该表将需要搜索到最后的确切N = 1,000,000
块访问。)
It is this situation that indexing aims to correct. (正是这种情况,索引旨在纠正。)
Given that an index record contains only the indexed field and a pointer to the original record, it stands to reason that it will be smaller than the multi-field record that it points to.
(假定索引记录仅包含索引字段和指向原始记录的指针,那么它就会小于它指向的多字段记录。)
So the index itself requires fewer disk blocks than the original table, which therefore requires fewer block accesses to iterate through. (因此索引本身比原始表需要更少的磁盘块,因此需要更少的块访问来迭代。)
The schema for an index on the firstName field is outlined below; (firstName字段上的索引的模式概述如下;)
Field name Data type Size on disk
firstName Char(50) 50 bytes
(record pointer) Special 4 bytes
Note : Pointers in MySQL are 2, 3, 4 or 5 bytes in length depending on the size of the table.
(注意 :MySQL中的指针长度为2,3,4或5个字节,具体取决于表的大小。)
Example 2 - indexing
(示例2 - 索引)
Given our sample database of r = 5,000,000
records with an index record length of R = 54
bytes and using the default block size B = 1,024
bytes.
(给定我们的r = 5,000,000
条记录的样本数据库,索引记录长度为R = 54
字节,并使用默认块大小B = 1,024
字节。)
The blocking factor of the index would be bfr = (B/R) = 1024/54 = 18
records per disk block. (索引的阻塞因子是bfr = (B/R) = 1024/54 = 18
每个磁盘块bfr = (B/R) = 1024/54 = 18
记录。)
The total number of blocks required to hold the index is N = (r/bfr) = 5000000/18 = 277,778
blocks. (保持索引所需的块总数为N = (r/bfr) = 5000000/18 = 277,778
个块。)
Now a search using the firstName field can utilize the index to increase performance.
(现在,使用firstName字段进行搜索可以利用索引来提高性能。)
This allows for a binary search of the index with an average of log2 277778 = 18.08 = 19
block accesses. (这允许索引的二进制搜索具有平均log2 277778 = 18.08 = 19
块访问。)
To find the address of the actual record, which requires a further block access to read, bringing the total to 19 + 1 = 20
block accesses, a far cry from the 1,000,000 block accesses required to find a firstName match in the non-indexed table. (要查找实际记录的地址,这需要进一步访问块读取,使总数达到19 + 1 = 20
块访问,与在非索引表中查找firstName匹配所需的1,000,000次块访问相差甚远。)
When should it be used?
(什么时候应该使用?)
Given that creating an index requires additional disk space (277,778 blocks extra from the above example, a ~28% increase), and that too many indices can cause issues arising from the file systems size limits, careful thought must be used to select the correct fields to index.
(鉴于创建索引需要额外的磁盘空间(上述示例额外增加277,778个块,增加约28%),并且太多索引可能导致文件系统大小限制引起的问题,必须仔细考虑选择正确的要索引的字段。)
Since indices are only used to speed up the searching for a matching field within the records, it stands to reason that indexing fields used only for output would be simply a waste of disk space and processing time when doing an insert or delete operation, and thus should be avoided.
(由于索引仅用于加速在记录中搜索匹配字段,因此,仅用于输出的索引字段仅仅是在执行插入或删除操作时浪费磁盘空间和处理时间,因此应该避免。)
Also given the nature of a binary search, the cardinality or uniqueness of the data is important. (同样考虑到二进制搜索的性质,数据的基数或唯一性很重要。)
Indexing on a field with a cardinality of 2 would split the data in half, whereas a cardinality of 1,000 would return approximately 1,000 records. (对基数为2的字段进行索引会将数据分成两半,而基数为1,000则会返回大约1,000条记录。)
With such a low cardinality the effectiveness is reduced to a linear sort, and the query optimizer will avoid using the index if the cardinality is less than 30% of the record number, effectively making the index a waste of space. (如此低的基数,有效性会降低到线性排序,如果基数小于记录数的30%,查询优化器将避免使用索引,从而有效地使索引浪费空间。)