Streaming Systems 阅读笔记
2020-03-03
Streaming 101
Terminology: What Is Streaming?
开头作者先批判了一番对 streaming 这个概念理解不到位的,称他们会 clouds what streaming really means, 这里学了个新词儿 cloud,然后给出了他认为的 streaming system 的定义:
A type of data processing engine that is designed with infinite datasets in mind.
作者认为描述一个 dataset 有两方面:
- cardinality: 数据集大小
- Bounded data: 有界数据集
- Unbounded data: 无界数据集
- constitution:the constitution defines the ways one can interact with the data
- Table: A holistic view of a dataset at a specific point in time.
- Stream: An element-by-element view of the evolution of a dataset over time
然后说细节啊什么的都后面再讲,然后就开始说流计算的优劣,既然谈到流计算,作者又顺便科普了一下 lambda 架构
- 批处理系统旁边有个流处理系统
- 执行同样的任务(当然具体的输入输出可能有区别)
- 流处理系统提供低延迟、近似精确的结果(提供的一致性保证不同)
- 批处理系统稍后提供更准确的结果
然后又扯到了 Kappa 架构,流计算系统如果想要取代批处理系统需要两点
- Correctness: Strong consistency is required for exactly-once processing(MillWheel, spark Streaming, flink snapshot)
- Tools for reasoning about time
Event Time vs Processing Time
- Event time: 事件发生时间
- Processing time:处理系统收到事件的时间
其中提到了 Processing-time lag 和 Event-time skew 这俩概念
Data Processing Patterns
有界数据集处理过程比较确定,流批基本类似,看一下无界数据集的处理
对于批处理引擎处理无界数据,批处理引擎的设计并不是为了处理流数据,但是也不是不能做,实现一般都是分割成有限数据集,分割主要有在这么几种
- FIXED WINDOWS: 按照固定时间或者固定数量分成窗口,然后依次处理这些窗口(滚动窗口)
- SESSIONS
对于流处理引擎,作者对流计算处理无界数据的场景大致分成了四类:
- Time-agnostic:处理过程完全和时间语义无关,仅仅考虑事件本身就好,比如 filtering 和 inner join 操作
- APPROXIMATION ALGORITHMS:近似 Top-N、流式 k-means 算法等等,输出即是结果
Windowing 这里要单独拿出来讲,window 主要分为三类:
- Fixed windows (tumbling windows)
- Sliding windows (hopping windows)
- Sessions
其中 windows 使用场景又包括两类:
Windowing by processing time: buffers up incoming data into windows until some amount of processing time has passed
特点是
- simple
- judging window completeness is straight forward(no need to deal with “late” data)
Windowing by event time: when you need to observe a data source in finite chunks that reflect the times at which those events actually happened
特点是
- Buffering: window 的生命周期会比 event time 的生命周期长(比如迟到事件)所以状态存储压力会大,当然作者也说了,设计的好的系统这个问题不是事儿(用 reduce 维护单状态等等)
- Completeness: 该窗口内数据是否已经全部到达不可知(watermark)
Summary
- streaming -> systems build with unbounded data (approximate/speculative results, cardinality/constitution)
- streaming is in fact a strict superset of batch
- two high-level concepts necessary for streaming systems: correctness and reasoning about time
- differences between event time and processing time
- major data processing approaches: time-agnostic, approximation, windowing by processing time, and windowing by event time