Notes on Rewriting 不安全 in Golang - PaperCache

About 不安全

https://buaq.net https://f5.pm
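不安全 is an RSS reader I have long used to collect content on topics that interest me. Unlike public readers, it permanently stores the body text and images of every article. Good articles often vanish when their sites shut down some time later, and managing them purely through bookmarks is not very convenient, which is why 不安全 came about.
This rewrite happened mainly because the previous version was written in PHP about six years ago, including the backend crawler, the cache, and so on. I did not think it through much at the time, and a new feature forced in along the way left the code unrecognizable. This time I rewrote it in Go and also built an index service for article bodies with the recently released zinc. The biggest change for users is full-text search.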

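Supported features

[x] Full-text indexing
[x] Telegram bot push
[x] Twitter push
[x] Scheduled article crawling
[x] HTTP/SOCKS5 proxy support
[x] Cloudflare Worker proxy support
[x] Favorites
[x] Adding articles via API
[x] Better content extraction
[x] Automatic image upload to Baidu Cloud object storage
[ ] Tool sharing
[ ] Daily article ranking (compatible with WeChat official accounts)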

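zinc full-text indexing

Full-text indexing uses zinc, which relies on bluge as its underlying index engine and wraps it with a bulk API and a query syntax similar to ES.

There are a few pitfalls. First, the result size cannot be adjusted: every query is hard-coded to return up to 1000 results.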

❯ grep -n -r 1000 *
auth/GetUsers.go:19:    searchRequest := bluge.NewTopNSearch(1000, query).WithStandardAggregations()
auth/AuthMiddleware.go:42:    searchRequest := bluge.NewTopNSearch(1000, termQuery)
startup/Loadconfig.go:19:            return 10000
startup/Loadconfig.go:25:    return 10000
uquery/AllDocuments.go:11:    searchRequest := bluge.NewTopNSearch(1000, query)
uquery/MatchAllQuery.go:21:    searchRequest := bluge.NewTopNSearch(1000, query).WithStandardAggregations()
uquery/MatchQuery.go:28:    searchRequest := bluge.NewTopNSearch(1000, query).WithStandardAggregations()
uquery/MultiPhraseQuery.go:21:    searchRequest := bluge.NewTopNSearch(1000, query).WithStandardAggregations()
uquery/DateRangeQuery.go:12:    searchRequest := bluge.NewTopNSearch(1000, query).WithStandardAggregations()
uquery/QueryStringQuery.go:23:    searchRequest := bluge.NewTopNSearch(1000, finalQuery).WithStandardAggregations()
uquery/MatchPhraseQuery.go:22:    searchRequest := bluge.NewTopNSearch(1000, query).WithStandardAggregations()
uquery/WildcardQuery.go:22:    searchRequest := bluge.NewTopNSearch(1000, query).WithStandardAggregations()
uquery/FuzzyQuery.go:21:    searchRequest := bluge.NewTopNSearch(1000, query).WithStandardAggregations()
uquery/TermQuery.go:21:    searchRequest := bluge.NewTopNSearch(1000, query).WithStandardAggregations()
uquery/PrefixQuery.go:21:    searchRequest := bluge.NewTopNSearch(1000, query).WithStandardAggregations()

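The second pitfall is that there is no offset for adjusting where a query starts. These two problems are really one problem, and upstream currently has no plans to support pagination, so I patched it myself in the following file: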
pkg/core/index.go

func (index *Index) Search(iQuery v1.ZincQuery) (v1.SearchResponse, error) {
    var Hits []v1.Hit

    var searchRequest bluge.SearchRequest

    var err error

    ....
    
    writer := index.Writer

    reader, err := writer.Reader()
    if err != nil {
        log.Printf("error accessing reader: %v", err)
        return v1.SearchResponse{}, err
    }

    dmi, err := reader.Search(context.Background(), searchRequest)
    if err != nil {
        log.Printf("error executing search: %v", err)
        reader.Close()
        return v1.SearchResponse{}, err
    }

    // highlighter := highlight.NewANSIHighlighter()
    var count = 0
    // iterationStartTime := time.Now()
    next, err := dmi.Next()
    for err == nil && next != nil {
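        // Offset handling added here.
        count++
        // I added an Offset field to the iQuery struct so the frontend can pass it in; {offset:10,size:10} then behaves like LIMIT 10,10.
        if count <= iQuery.Offset {
            // Results are read one at a time via Next(). I don't know the internals well enough for a better approach, so for now the first Offset hits are skipped like this.
            next, err = dmi.Next()
            continue
        }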
        var result map[string]interface{}
        var id string
        var timestamp time.Time
        err = next.VisitStoredFields(func(field string, value []byte) bool {
            if field == "_source" {
                json.Unmarshal(value, &result)
                return true
            } else if field == "_id" {
                id = string(value)
                return true
            } else if field == "@timestamp" {
                timestamp, _ = bluge.DecodeDateTime(value)
                return true
            }
            return true
        })
        if err != nil {
            log.Printf("error accessing stored fields: %v", err)
        }

        hit := v1.Hit{
            Index:     index.Name,
            Type:      index.Name,
            ID:        id,
            Score:     next.Score,
            Timestamp: timestamp,
            Source:    result,
        }

        next, err = dmi.Next()
        // results = append(results, result)

        Hits = append(Hits, hit)
        // The iQuery struct already has a Size field, but it was never used to limit results, so add the check ourselves (len on a slice is O(1), so this is cheap).
        if len(Hits) >= iQuery.Size {
            break
        }
    }
    if err != nil {
        log.Printf("error iterating results: %v", err)
    }

    ....
    reader.Close()

    return resp, nil
}
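
The offset-plus-size idea can be looked at in isolation. Below is a minimal sketch, not zinc's or bluge's code: `iterator`, `makeDocs`, and `page` are hypothetical names standing in for a cursor that only exposes a Next() call. It skips the first `offset` items and then collects at most `size` items.

```go
package main

import "fmt"

// iterator is a hypothetical stand-in for a cursor that can only be
// advanced one result at a time. It is not the bluge API.
type iterator struct {
	items []string
	pos   int
}

// Next returns the next item and true, or "" and false when exhausted.
func (it *iterator) Next() (string, bool) {
	if it.pos >= len(it.items) {
		return "", false
	}
	item := it.items[it.pos]
	it.pos++
	return item, true
}

// makeDocs builds n fake document IDs: doc-0, doc-1, ...
func makeDocs(n int) []string {
	docs := make([]string, 0, n)
	for i := 0; i < n; i++ {
		docs = append(docs, fmt.Sprintf("doc-%d", i))
	}
	return docs
}

// page skips the first offset items, then collects up to size items.
func page(it *iterator, offset, size int) []string {
	if size <= 0 {
		return nil
	}
	hits := make([]string, 0, size)
	skipped := 0
	for len(hits) < size {
		item, ok := it.Next()
		if !ok {
			break // fewer results than requested
		}
		if skipped < offset {
			skipped++
			continue
		}
		hits = append(hits, item)
	}
	return hits
}

func main() {
	// Equivalent to LIMIT 10,10 over 25 results: doc-10 through doc-19.
	fmt.Println(page(&iterator{items: makeDocs(25)}, 10, 10))
}
```

Skipping this way still walks every result before the offset, so deep pages cost more than shallow ones. That is acceptable for a small personal index but worth keeping in mind.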

The third pitfall is that there is no third-party tool, like logstash for ES, that imports data directly, so I added a feature that reads the data in from MySQL.

func ReadFromDB() {
    var notes []Notes
    // Load all rows from the database.
    DB.Model(&Notes{}).Find(&notes)
    for _, note := range notes {
        log.Println("[*] Import ", note.Hash)
        // Read the article body from file and extract its plain text.
        note.Content = GetNoteContent(note)
        if ret, err := json.Marshal(&note); err == nil {
            var doc map[string]interface{}
            // Finally, convert the struct into a map.
            if err := json.Unmarshal(ret, &doc); err != nil {
                log.Print(err)
                continue
            }
            ImportData("buaqbatchImport", doc)
        }
    }

}
func ImportData(indexName string, doc map[string]interface{}) {
    if !core.IndexExists(indexName) {
        // This part is copied almost verbatim from zinc's own createDocument handler.
        newIndex, err := core.NewIndex(indexName)

        if err != nil {
            log.Print(err)
            return
        }
        core.ZINC_INDEX_LIST[indexName] = newIndex // Load the index in memory
    }
    index := core.ZINC_INDEX_LIST[indexName]
    docID := uuid.New().String()
    // Creating a document takes a UUID and a map.
    err := index.UpdateDocument(docID, &doc)
    if err != nil {
        log.Print(err)
    }

}
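
The struct-to-map step in the import code is a JSON round trip, which can be tried on its own. This is a small sketch under assumptions: `Note` is a hypothetical, trimmed-down stand-in for the real Notes model, and `toDoc` is an illustrative helper name, not something from zinc.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Note is a hypothetical, trimmed-down stand-in for the Notes model.
type Note struct {
	Hash    string `json:"hash"`
	Title   string `json:"title"`
	Content string `json:"content"`
}

// toDoc converts any JSON-serializable value into a generic map by
// marshaling it to JSON and unmarshaling the bytes back into a map.
func toDoc(v interface{}) (map[string]interface{}, error) {
	raw, err := json.Marshal(v)
	if err != nil {
		return nil, fmt.Errorf("marshal: %w", err)
	}
	var doc map[string]interface{}
	if err := json.Unmarshal(raw, &doc); err != nil {
		return nil, fmt.Errorf("unmarshal: %w", err)
	}
	return doc, nil
}

func main() {
	doc, err := toDoc(Note{Hash: "abc123", Title: "hello", Content: "plain text body"})
	if err != nil {
		fmt.Println("convert failed:", err)
		return
	}
	// Map keys come from the json struct tags.
	fmt.Println(doc["hash"], doc["title"])
}
```

The round trip costs an extra encode and decode per document, but it keeps the importer independent of the struct's fields, which is convenient for a one-off batch import.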

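Word segmentation

Search inevitably brings up word segmentation. I first tried the Go version of jieba.
However, it is still implemented in C and called through cgo, so the compiled binary depends on glibc and fails to run on some machines whose glibc is too old, even when built with xgo. I later found sego, which has worked well enough overall. sego is written in pure Go, so glibc versions are not a concern.

RSS fetching

RSS fetching uses gofeed, which officially claims to support the following versions. I have not hit any pitfalls so far.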

RSS 0.90
Netscape RSS 0.91
Userland RSS 0.91
RSS 0.92
RSS 0.93
RSS 0.94
RSS 1.0
RSS 2.0
Atom 0.3
Atom 1.0
JSON 1.0
JSON 1.1

Content extraction

I tried the following content-extraction libraries:

https://github.com/naibahq/go-readability/
https://github.com/ying32/readability
https://github.com/go-shiori/go-readability
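After trying them all, the shiori one turned out to be the best. The readability port in the old PHP version did not handle some WeChat official account pages and non-blog/non-news pages well. shiori's is the strongest I have used so far, ahead of the Node.js and Python readability versions, and I have not hit any pitfalls with it yet.

If I had to name a pitfall, it would be a potential XSS risk: it seems fairly permissive during extraction and does not strip many tags. I will tighten this up later when I have time.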

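Miscellaneous

I found a problem in the old version: images sometimes could not be cached, mainly because the IP had been banned by the target site. Content is now fetched through a Cloudflare Worker acting as a proxy.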
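
One way the Go side of such a setup might look is sketched below. Everything specific here is an assumption for illustration: the worker hostname is made up, and the Worker is assumed to fetch whatever address is passed in a `url` query parameter, which is not necessarily how the real one is written. `workerURL` is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"net/url"
)

// workerURL rewrites target so that it is fetched through a Cloudflare
// Worker. Assumption: the Worker reads the real address from the "url"
// query parameter and returns the upstream response body.
func workerURL(workerBase, target string) (string, error) {
	base, err := url.Parse(workerBase)
	if err != nil {
		return "", fmt.Errorf("parse worker base: %w", err)
	}
	// Reject targets that are not absolute URLs before proxying them.
	if _, err := url.ParseRequestURI(target); err != nil {
		return "", fmt.Errorf("parse target: %w", err)
	}
	q := base.Query()
	q.Set("url", target) // Encode() below percent-escapes the target
	base.RawQuery = q.Encode()
	return base.String(), nil
}

func main() {
	// Hypothetical worker hostname, for illustration only.
	proxied, err := workerURL("https://img-proxy.example.workers.dev/", "https://example.com/a.png?x=1")
	if err != nil {
		fmt.Println("build proxied URL failed:", err)
		return
	}
	fmt.Println(proxied)
}
```

The resulting address can be passed to a normal `http.Get`, so the crawler code does not need to know whether a request goes direct or through the Worker.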
zinc is still fairly CPU-hungry while building the index. Later on I may add a dedicated machine just for search.