【译】数据库基础:用 Go 从零开始写一个 SQL 数据库 —— 第一部分

【译】数据库基础:用 Go 从零开始写一个 SQL 数据库 —— 第一部分

In this series we'll write a rudimentary database from scratch in Go. Project source code is available on Github.

在这个系列文章中,我们将使用 Go 从零开始写一个很基础的数据库。项目源代码可在 Github 找到。

In this first post we'll build enough of a parser to run some simple CREATE, INSERT, and SELECT queries. Then we'll build an in-memory backend supporting TEXT and INT types and write a basic REPL.

在第一篇中,我们将构建解析器来运行一些简单的 CREATEINSERTSELECT 查询。然后,构建一个支持 TEXTINT 等后端内存的类型,并编写一个基本的 REPL。

We'll be able to support the following interaction:

项目完成后,我们将支持下面的交互:

$ go run *.go
Welcome to gosql.
# CREATE TABLE users (id INT, name TEXT);
ok
# INSERT INTO users VALUES (1, 'Phil');
ok
# SELECT id, name FROM users;
| id | name |
====================
| 1 |  Phil |
ok
# INSERT INTO users VALUES (2, 'Kate');
ok
# SELECT name, id FROM users;
| name | id |
====================
| Phil |  1 |
| Kate |  2 |
ok

The first stage will be to map a SQL source into a list of tokens (lexing). Then we'll call parse functions to find individual SQL statements (such as SELECT). These parse functions will in turn call their own helper functions to find patterns of recursively parseable chunks, keywords, symbols (like parenthesis), identifiers (like a table name), and numeric or string literals.

第一步是将 SQL 源映射为 token 列表(词法分析)。然后我们将调用解析函数来解析单个 SQL 语句(如 SELECT)。这些解析函数将依次调用一些辅助函数,以发现可递归解析的语句块、关键字、符号(如“括号”)、标识符(如“表名称”)和数字或字符串文字。

Then, we'll write an in-memory backend to do operations based on an AST. Finally, we'll write a REPL to accept SQL from a CLI and pass it to the in-memory backend.

然后,我们将编写一个内存后端来执行基于 AST 的操作,最后,我们会编写一个 REPL 来接收命令行下的 SQL 并将其传递到内存后端进行解析。


This post assumes a basic understanding of parsing concepts. We won't skip any code, but also won't go into great detail on why we structure the way we do.

本文是假定你对解析概念有了基本的了解之上的。我们不会跳过任何代码,但也不会详细讨论为何要这样写。

For a simpler introduction to parsing and parsing concepts, see this post on parsing JSON.

有关解析和解析概念的更简单介绍,请参阅该解析 json 文章


lexing

词法分析

The lexer is responsible for finding every distinct group of characters in source code: tokens. This will consist primarily of identifiers, numbers, strings, and symbols.

lexer(词法分析器)负责查找源码中的每一组不同的字符:tokens(令牌)。它主要由标识符、数字、字符串和符号组成。

The gist of the logic will be to iterate over the source string and collect characters until we find a delimiting character such as a space or comma. In this first pass, we'll pretend users don't insert delimiting characters into strings. Once we've reached a delimiting character, we'll "finalize" the token and decide whether it is valid or not.

主要逻辑是迭代源字符串并收集字符,直到找到分隔符(如空格、逗号)。在第一遍字符传递中,我们假设用户没有将分隔字符插入到字符串中。一旦遇到一个分隔符,我们将会”最终确定“并标记它是否有效。

First off, we'll define a few types and constants for use in lexer.go:

首先,我们在 lexer.go 中定义一些类型和常量:

package main

import (
    "fmt"
    "io"
    "strings"
)

type location struct {
    line uint
    col  uint
}

type keyword string

const (
    selectKeyword keyword = "select"
    fromKeyword   keyword = "from"
    asKeyword     keyword = "as"
    tableKeyword  keyword = "table"
    createKeyword keyword = "create"
    insertKeyword keyword = "insert"
    intoKeyword   keyword = "into"
    valuesKeyword keyword = "values"
    intKeyword    keyword = "int"
    textKeyword   keyword = "text"
)

type symbol string

const (
    semicolonSymbol  symbol = ";"
    asteriskSymbol   symbol = "*"
    commaSymbol      symbol = ","
    leftparenSymbol  symbol = "("
    rightparenSymbol symbol = ")"
)

type tokenKind uint

const (
    keywordKind tokenKind = iota
    symbolKind
    identifierKind
    stringKind
    numericKind
)

type token struct {
    value string
    kind  tokenKind
    loc   location
}

func (t *token) equals(other *token) bool {
    return t.value == other.value && t.kind == other.kind
}

func (t *token) finalize() bool {
    return true
}

Next we'll write out the main loop:

接着我们写出循环体的主要内容:

func lex(source io.Reader) ([]*token, error) {
    buf := make([]byte, 1)
    tokens := []*token{}
    current := token{}
    var line uint = 0
    var col uint = 0

    for {
        _, err := source.Read(buf)
        if err != nil && err != io.EOF {
            return nil, err
        }

        // Add semi-colon for EOF
        // 在结尾增加分号
        var c byte = ';'
        if err == nil {
            c = buf[0]
        }

        switch c {
        case '\n':
            line++
            col = 0
            continue
        case ' ':
            fallthrough
        case ',':
            fallthrough
        case '(':
            fallthrough
        case ')':
            fallthrough
        case ';':
            if !current.finalize() {
                return nil, fmt.Errorf("Unexpected token '%s' at %d:%d", current.value, current.loc.line, current.loc.col)
            }

            if current.value != "" {
                copy := current
                tokens = append(tokens, &copy)
            }

            if c == ';' || c == ',' || c == '(' || c == ')' {
                tokens = append(tokens, &token{
                    loc:   location{col: col, line: line},
                    value: string(c),
                    kind:  symbolKind,
                })
            }

            current = token{}
            current.loc.col = col
            current.loc.line = line
        default:
            current.value += string(c)
        }

        if err == io.EOF {
            break
        }
        col++
    }

    return tokens, nil
}

Last, we'll write a finalizer helper for each kind of fundemental token and check on each in a reasonable order.

最后,我们会给每个基本 token 编写终结辅助函数,并按一定的顺序检查所有 token。

Validating numbers

验证数字

Numbers are the most complex. So we'll refer to the PostgreSQL documentation (section 4.1.2.6) for what constitutes a valid number.

数字是最复杂的。因此我们参考 PostgreSQL 文档(4.1.2.6节)来确定有效数字是怎样的。

func (t *token) finalizeNumeric() bool {
    if len(t.value) == 0 {
        return false
    }

    periodFound := false
    expMarkerFound := false

    i := 0
    for i < len(t.value) {
        c := t.value[i]

        isDigit := c >= '0' && c <= '9'
        isPeriod := c == '.'
        isExpMarker := c == 'e'

        // Must start with a digit or period
        // 必须以数字或分隔符`.`开始
        if i == 0 {
            if !isDigit && !isPeriod {
                return false
            }

            periodFound = isPeriod
            i++
            continue
        }

        if isPeriod {
            if periodFound {
                return false
            }

            periodFound = true
            i++
            continue
        }

        if isExpMarker {
            if expMarkerFound {
                return false
            }

            // No periods allowed after expMarker
            // expMarker 后可以是非分隔符`.`
            periodFound = true
            expMarkerFound = true

            // expMarker must be followed by digits
            // expMarker 后必须是数字
            if i == len(t.value)-1 {
                return false
            }

            cNext := t.value[i+1]
            if cNext == '-' || cNext == '+' {
                i++
            }

            i++
            continue
        }

        if !isDigit {
            return false
        }

        i++
    }

    t.kind = numericKind
    return true
}

Validating strings

验证字符串

Strings must start and end with a single apostrophe. But once we identify this is a string, we'll rewrite the value dropping these for easier use by the rest of the project.

字符串一定是以单撇号开始与结束的。但是,一旦我们确定这是一个字符串,我们将会丢弃掉这些字符串定界符,以便其余真实字符串的解析。

func (t *token) finalizeString() bool {
    if len(t.value) == 0 {
        return false
    }

    if t.value[0] == '\'' && t.value[len(t.value)-1] == '\'' {
        t.kind = stringKind
        t.value = t.value[1 : len(t.value)-1]
        return true
    }

    return false
}

Validating symbols and keywords

验证符号和关键字

Symbols and keywords come from a fixed set of strings, so they're easy to compare against.

符号和关键字就是一组固定的字符串,因此它们很容易就能通过比较解析它们。

func (t *token) finalizeSymbol() bool {
    switch t.value {
    case "*":
        break
    case ";":
        break
    case "(":
        break
    case ")":
        break
    default:
        return false
    }

    t.kind = symbolKind
    return true
}

func (t *token) finalizeKeyword() bool {
    switch strings.ToLower(t.value) {
    case "select":
        break
    case "from":
        break
    case "as":
        break
    case "table":
        break
    case "create":
        break
    case "insert":
        break
    case "into":
        break
    case "values":
        break
    case "int":
        break
    case "text":
        break
    default:
        return false
    }

    t.value = strings.ToLower(t.value)
    t.kind = keywordKind
    return true
}

Validating identifiers

验证标识符

Now we can finish up the original finalize function and assume any token not matching one of these is a valid identifier.

现在,我们可以完成前面的 finalize 函数,并假设与其中有一个不能匹配的 token,则视为有效的标识符。

func (t *token) finalizeIdentifier() bool {
    t.kind = identifierKind
    return true
}

func (t *token) finalize() bool {
    if t.finalizeSymbol() {
        return true
    }

    if t.finalizeKeyword() {
        return true
    }

    if t.finalizeNumeric() {
        return true
    }

    if t.finalizeString() {
        return true
    }

    if t.finalizeIdentifier() {
        return true
    }

    return false
}

And that's it for the lexer! If you copy lexer_test.go from the main project, the tests should now pass.

这就是 lexer!如果你是从主项目中复制了 lexer_test.go 文件,则单元测试应该能通过。

AST model

AST 模型

At the highest level, an AST is a collection of statements:

从全局上看,AST 就是语句的集合:

package main

type Ast struct {
    Statements []*Statement
}

A statement, for now, is one of INSERT, CREATE, or SELECT:

现在,一条语句就是 INSERTCREATESELECT 中的一种:

type AstKind uint

const (
    SelectKind AstKind = iota
    CreateTableKind
    InsertKind
)

type Statement struct {
    SelectStatement      *SelectStatement
    CreateTableStatement *CreateTableStatement
    InsertStatement      *InsertStatement
    Kind                 AstKind
}

Insert

An insert statement, for now, has a table name and a list of values to insert:

目前,一条 insert 语句中需要有一个表名和插入的值列表:

type InsertStatement struct {
    table  token
    values *[]*expression
}

An expression is a literal token or (in the future) a function call or inline operation:

表达式就是一个字符 token(在以后)或者是 一个函数调用或者是一个内链操作:

type expressionKind uint

const (
    literalKind expressionKind = iota
)

type expression struct {
    literal *token
    kind    expressionKind
}

CREATE

A create statement, for now, has a table name and a list of column names and types:

现在,create 语句需要有表名和字段名及其类型的列表:

type columnDefinition struct {
    name     token
    datatype token
}

type CreateTableStatement struct {
    name token
    cols *[]*columnDefinition
}

SELECT

A select statement, for now, has a table name and a list of column names:

select 语句需要有表名和字段名:

type SelectStatement struct {
    item []*expression
    from token
}

And that's it for the AST.

以上都是 AST 中的内容。

Parsing

解析

The Parse entrypoint will take a list of tokens and attempt to parse statements, separated by a semi-colon, until it reaches the last token.

Parse 函数入口会获取一个 token 列表,并尝试解析由分号分隔的语句,直到最后一个 token。

In general our strategy will be to increment and pass around a cursor containing the current position of unparsed tokens. Each helper will return the new cursor that the caller should start from.

通常,我们的策略是递增并传递一个含有未解析 token 的位置的游标。辅助函数会返回调用方即将要解析的新游标。

package main

import (
    "errors"
    "fmt"
    "io"
)

func tokenFromKeyword(k keyword) token {
    return token{
        kind:  keywordKind,
        value: string(k),
    }
}

func tokenFromSymbol(s symbol) token {
    return token{
        kind:  symbolKind,
        value: string(s),
    }
}

func expectToken(tokens []*token, cursor uint, t token) bool {
    if cursor >= uint(len(tokens)) {
        return false
    }

    return t.equals(tokens[cursor])
}

func helpMessage(tokens []*token, cursor uint, msg string) {
    var c *token
    if cursor < uint(len(tokens)) {
        c = tokens[cursor]
    } else {
        c = tokens[cursor-1]
    }

    fmt.Printf("[%d,%d]: %s, got: %s\n", c.loc.line, c.loc.col, msg, c.value)
}

func Parse(source io.Reader) (*Ast, error) {
    tokens, err := lex(source)
    if err != nil {
        return nil, err
    }

    a := Ast{}
    cursor := uint(0)
    for cursor < uint(len(tokens)) {
        stmt, newCursor, ok := parseStatement(tokens, cursor, tokenFromSymbol(semicolonSymbol))
        if !ok {
            helpMessage(tokens, cursor, "Expected statement")
            return nil, errors.New("Failed to parse, expected statement")
        }
        cursor = newCursor

        a.Statements = append(a.Statements, stmt)

        atLeastOneSemicolon := false
        for expectToken(tokens, cursor, tokenFromSymbol(semicolonSymbol)) {
            cursor++
            atLeastOneSemicolon = true
        }

        if !atLeastOneSemicolon {
            helpMessage(tokens, cursor, "Expected semi-colon delimiter between statements")
            return nil, errors.New("Missing semi-colon between statements")
        }
    }

    return &a, nil
}

Parsing statements

解析语句

Each statement will be one of INSERT, CREATE, or SELECT. The parseStatement helper will call a helper on each of these statement types and return true if one of them succeeds in parsing.

每个语句都会是 INSERTCREATESELECT 中的一个。parseStatement 辅助函数会对每个语句类型调用对应的辅助函数,如果解析成功,则返回 true

func parseStatement(tokens []*token, initialCursor uint, delimiter token) (*Statement, uint, bool) {
    cursor := initialCursor

    // Look for a SELECT statement
    // 查找 SELECT 语句
    semicolonToken := tokenFromSymbol(semicolonSymbol)
    slct, newCursor, ok := parseSelectStatement(tokens, cursor, semicolonToken)
    if ok {
        return &Statement{
            Kind:            SelectKind,
            SelectStatement: slct,
        }, newCursor, true
    }

    // Look for a INSERT statement
    // 查找 INSERT  语句
    inst, newCursor, ok := parseInsertStatement(tokens, cursor, semicolonToken)
    if ok {
        return &Statement{
            Kind:            InsertKind,
            InsertStatement: inst,
        }, newCursor, true
    }

    // Look for a CREATE statement
    // 查找 CREATE 语句
    crtTbl, newCursor, ok := parseCreateTableStatement(tokens, cursor, semicolonToken)
    if ok {
        return &Statement{
            Kind:                 CreateTableKind,
            CreateTableStatement: crtTbl,
        }, newCursor, true
    }

    return nil, initialCursor, false
}

Parsing select statements

解析 select 语句

Parsing SELECT statements is easy. We'll look for the following token pattern:

解析 SELECT 语句很简单。我们将查找下方的 token 进行匹配:

1.`SELECT`
2.$expression [, ...]
3.`FROM`
4.$table-name`

Sketching that out we get:

我们大致可以得到下面的内容:

func parseSelectStatement(tokens []*token, initialCursor uint, delimiter token) (*SelectStatement, uint, bool) {
    cursor := initialCursor
    if !expectToken(tokens, cursor, tokenFromKeyword(selectKeyword)) {
        return nil, initialCursor, false
    }
    cursor++

    slct := SelectStatement{}

    exps, newCursor, ok := parseExpressions(tokens, cursor, []token{tokenFromKeyword(fromKeyword), delimiter})
    if !ok {
        return nil, initialCursor, false
    }

    slct.item = *exps
    cursor = newCursor

    if expectToken(tokens, cursor, tokenFromKeyword(fromKeyword)) {
        cursor++

        from, newCursor, ok := parseToken(tokens, cursor, identifierKind)
        if !ok {
            helpMessage(tokens, cursor, "Expected FROM token")
            return nil, initialCursor, false
        }

        slct.from = *from
        cursor = newCursor
    }

    return &slct, cursor, true
}

The parseToken helper will look for a token of a particular token kind.

parseToken 辅助函数会查找特定类型的 token 所对应的 kind 值。

func parseToken(tokens []*token, initialCursor uint, kind tokenKind) (*token, uint, bool) {
    cursor := initialCursor

    if cursor >= uint(len(tokens)) {
        return nil, initialCursor, false
    }

    current := tokens[cursor]
    if current.kind == kind {
        return current, cursor + 1, true
    }

    return nil, initialCursor, false
}

The parseExpressions helper will look for tokens separated by a comma until a delimiter is found. It will use existing helpers plus parseExpression.

parseExpressions 辅助函数会查找由逗号分隔的 token。它将使用现有的辅助函数以及 parseExpression 函数。

func parseExpressions(tokens []*token, initialCursor uint, delimiters []token) (*[]*expression, uint, bool) {
    cursor := initialCursor

    exps := []*expression{}
outer:
    for {
        if cursor >= uint(len(tokens)) {
            return nil, initialCursor, false
        }

        // Look for delimiter
        // 查找分隔符
        current := tokens[cursor]
        for _, delimiter := range delimiters {
            if delimiter.equals(current) {
                break outer
            }
        }

        // Look for comma
        // 查找逗号
        if len(exps) > 0 {
            if !expectToken(tokens, cursor, tokenFromSymbol(commaSymbol)) {
                helpMessage(tokens, cursor, "Expected comma")
                return nil, initialCursor, false
            }

            cursor++
        }

        // Look for expression
        // 查找表达式
        exp, newCursor, ok := parseExpression(tokens, cursor, tokenFromSymbol(commaSymbol))
        if !ok {
            helpMessage(tokens, cursor, "Expected expression")
            return nil, initialCursor, false
        }
        cursor = newCursor

        exps = append(exps, exp)
    }

    return &exps, cursor, true
}

The parseExpression helper (for now) will look for a numeric, string, or identifier token.

parseExpression 函数(现在)会查找数字、字符串或标识符 token。

func parseExpression(tokens []*token, initialCursor uint, _ token) (*expression, uint, bool) {
    cursor := initialCursor

    kinds := []tokenKind{identifierKind, numericKind, stringKind}
    for _, kind := range kinds {
        t, newCursor, ok := parseToken(tokens, cursor, kind)
        if ok {
            return &expression{
                literal: t,
                kind:    literalKind,
            }, newCursor, true
        }
    }

    return nil, initialCursor, false
}

And that's it for parsing a SELECT statement!

以上就是解析一条 SELECT 语句的内容!

--- 待续