Streaming Systems 阅读笔记

2020-03-03

Streaming 101

开头作者先批判了一番对 streaming 这个概念理解不到位的，称他们会 clouds what streaming really means，这里学了个新词儿 cloud，然后给出了他认为的 streaming system 的定义：

A type of data processing engine that is designed with infinite datasets in mind.

作者认为描述一个 dataset 有两方面：

cardinality: 数据集大小
- Bounded data：有界数据集
- Unbounded data：无界数据集
constitution：the constitution defines the ways one can interact with the data
- Table: A holistic view of a dataset at a specific point in time.
- Stream: An element-by-element view of the evolution of a dataset over time

然后说细节啊什么的都后面再讲，然后就开始说流计算的优劣，既然谈到流计算，作者又顺便科普了一下 lambda 架构

然后又扯到了 Kappa 架构，流计算系统如果想要取代批处理系统需要两点

Correctness: Strong consistency is required for exactly-once processing(MillWheel, spark Streaming, flink snapshot)
Tools for reasoning about time

其中提到了 Processing-time lag 和 Event-time skew 这俩概念

有界数据集处理过程比较确定，流批基本类似，看一下无界数据集的处理

对于批处理引擎处理无界数据，批处理引擎的设计并不是为了处理流数据，但是也不是不能做，实现一般都是分割成有限数据集，分割主要有在这么几种

对于流处理引擎，作者对流计算处理无界数据的场景大致分成了四类：

Windowing 这里要单独拿出来讲，window 主要分为三类：

其中 windows 使用场景又包括两类：

Windowing by processing time: buffers up incoming data into windows until some amount of processing time has passed

特点是

simple
judging window completeness is straight forward(no need to deal with “late” data)

Windowing by event time: when you need to observe a data source in finite chunks that reflect the times at which those events actually happened

特点是

Buffering： window 的生命周期会比 event time 的生命周期长（比如迟到事件）所以状态存储压力会大，当然作者也说了，设计的好的系统这个问题不是事儿（用 reduce 维护单状态等等）
Completeness: 该窗口内数据是否已经全部到达不可知（watermark）

streaming -> systems build with unbounded data (approximate/speculative results, cardinality/constitution)
streaming is in fact a strict superset of batch
two high-level concepts necessary for streaming systems: correctness and reasoning about time
differences between event time and processing time
major data processing approaches: time-agnostic, approximation, windowing by processing time, and windowing by event time

言：

当遇到你时，大脑连上CSGO都会掉帧。