Garland

Streaming Systems: Reading Notes

Streaming 101

Terminology: What Is Streaming?

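The author opens by criticizing loose uses of the word "streaming", saying that they cloud what streaming really means (I picked up a new word here: "cloud" as a verb). He then gives his own definition of a streaming system: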

A type of data processing engine that is designed with infinite datasets in mind.

The author describes a dataset along two dimensions:

  1. cardinality: the size of the dataset
    • Bounded data: a dataset that is finite in size
    • Unbounded data: a dataset that is infinite in size (at least in theory)
  2. constitution: the constitution defines the ways one can interact with the data
    • Table: A holistic view of a dataset at a specific point in time.
    • Stream: An element-by-element view of the evolution of a dataset over time.
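
To make the table/stream distinction concrete for myself, here is a minimal sketch. It assumes a stream is just a sequence of (key, value) updates and that a table is what you get by folding a prefix of that stream into a dict. The names `Update` and `table_at` are made up for illustration and do not come from the book.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical stream element: a (key, new_value) update, in arrival order.
Update = Tuple[str, int]


def table_at(stream: Iterable[Update], n: int) -> Dict[str, int]:
    """Fold the first n updates of the stream into a table (a snapshot)."""
    table: Dict[str, int] = {}
    for i, (key, value) in enumerate(stream):
        if i >= n:
            break
        table[key] = value  # a later update for the same key overwrites the earlier one
    return table


updates: List[Update] = [("a", 1), ("b", 2), ("a", 3)]

snapshot_2 = table_at(updates, 2)  # the table after two updates
snapshot_3 = table_at(updates, 3)  # the table after three updates
print(snapshot_2, snapshot_3)
```

The same data shows up twice: `updates` is the element-by-element view, while each snapshot is the holistic view at one point in time.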

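He defers the details to later chapters and moves on to the strengths and weaknesses of stream processing. Since the topic is stream processing, he also gives a quick primer on the Lambda Architecture.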

From there he turns to the Kappa Architecture. For streaming systems to replace batch systems, they need two things:

  1. Correctness: Strong consistency is required for exactly-once processing (MillWheel, Spark Streaming, Flink snapshots)
  2. Tools for reasoning about time

Event Time vs Processing Time

This section introduces two concepts: processing-time lag and event-time skew.
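
A small sketch of how I read these two terms. It assumes each record carries both the time the event happened and the time the pipeline observed it. The `Event` class and its field names are hypothetical. The difference between the two timestamps can be read as lag (how late the event was processed) or as skew (how far behind the ideal the pipeline is in event time at that moment).

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Event:
    event_time: float       # when the event actually happened (seconds)
    processing_time: float  # when the pipeline observed it (seconds)


def processing_time_lag(e: Event) -> float:
    """Delay between an event occurring and being processed."""
    return e.processing_time - e.event_time


events: List[Event] = [
    Event(event_time=10.0, processing_time=10.5),
    Event(event_time=11.0, processing_time=14.0),  # delayed, e.g. by a network hiccup
    Event(event_time=12.0, processing_time=12.0),  # the ideal case: zero lag
]

lags = [processing_time_lag(e) for e in events]
print(lags)
```

In an ideal system every lag would be zero. In practice the values vary over time, which is why tools for reasoning about time are needed.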

Data Processing Patterns

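Processing bounded data is fairly straightforward and looks much the same on batch and streaming engines, so let's look at how unbounded data is handled.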

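For batch engines: they were not designed with unbounded data in mind, but they can still handle it. The usual approach is to slice the input into finite datasets, and there are two main ways to slice:

  1. Fixed windows
  2. Sessions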

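For streaming engines, the author groups the approaches to processing unbounded data into four broad categories:

  1. Time-agnostic
  2. Approximation algorithms
  3. Windowing by processing time
  4. Windowing by event time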

Windowing deserves a closer look on its own. There are three main kinds of window:

  1. Fixed windows
  2. Sliding windows
  3. Sessions
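
To check my understanding of window shapes, here is a minimal sketch of window assignment. It assumes the three kinds are fixed windows, sliding windows, and sessions, with integer timestamps and half-open windows `[start, end)`. The function names are my own and not from any particular engine.

```python
from typing import Iterable, List, Tuple

Window = Tuple[int, int]  # half-open interval [start, end)


def fixed_window(ts: int, size: int) -> Window:
    """The single fixed window of length `size` that contains ts."""
    start = ts - ts % size
    return (start, start + size)


def sliding_windows(ts: int, size: int, period: int) -> List[Window]:
    """All windows of length `size`, starting every `period`, that contain ts."""
    windows = []
    start = ts - ts % period  # latest window start that is <= ts
    while start > ts - size:  # the window [start, start + size) still covers ts
        windows.append((start, start + size))
        start -= period
    return sorted(windows)


def session_windows(timestamps: Iterable[int], gap: int) -> List[Window]:
    """Merge timestamps into sessions that end after `gap` of inactivity."""
    sessions: List[Window] = []
    for ts in sorted(timestamps):
        if sessions and ts < sessions[-1][1]:
            # ts falls inside the current session, so extend that session
            sessions[-1] = (sessions[-1][0], ts + gap)
        else:
            sessions.append((ts, ts + gap))
    return sessions


fixed = fixed_window(7, size=10)
sliding = sliding_windows(7, size=10, period=5)
sessions = session_windows([1, 2, 3, 10, 11, 30], gap=5)
print(fixed, sliding, sessions)
```

Fixed windows are the special case of sliding windows where the period equals the size. Sessions differ from both in that their boundaries depend on the data itself.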

Windowing can in turn be applied in two time domains:

Windowing by processing time: buffers up incoming data into windows until some amount of processing time has passed
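
A rough sketch of what this means in practice. It assumes each record is a tuple `(event_time, arrival_time, value)` with integer times, and that windows are fixed in size. The arrival times are written into the data instead of being read from a clock so that the example is deterministic. All names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Record = Tuple[int, int, str]  # (event_time, arrival_time, value)


def window_by_processing_time(records: List[Record], size: int) -> Dict[int, List[str]]:
    """Group values into fixed windows keyed by the time they arrived."""
    windows: Dict[int, List[str]] = defaultdict(list)
    for _event_time, arrival_time, value in records:
        start = arrival_time - arrival_time % size
        windows[start].append(value)
    return dict(windows)


records: List[Record] = [
    (1, 2, "a"),
    (12, 13, "b"),
    (3, 14, "c"),  # happened at t=3 but only arrived at t=14
]

by_arrival = window_by_processing_time(records, size=10)
print(by_arrival)
```

Record "c" lands in the window starting at 10 even though it happened at t=3. The window reflects when the data was observed, not when the event occurred.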

Characteristics:

Windowing by event time: when you need to observe a data source in finite chunks that reflect the times at which those events actually happened
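
A rough sketch of the event-time variant. It assumes each record is a tuple `(event_time, arrival_time, value)` with integer times and fixed-size windows. All names are hypothetical. The sketch processes a finite list, so it skips the hard part of a real system: deciding when a window has seen all of its data.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Record = Tuple[int, int, str]  # (event_time, arrival_time, value)


def window_by_event_time(records: List[Record], size: int) -> Dict[int, List[str]]:
    """Group values into fixed windows keyed by when the event happened."""
    windows: Dict[int, List[str]] = defaultdict(list)
    for event_time, _arrival_time, value in records:
        start = event_time - event_time % size
        windows[start].append(value)
    return dict(windows)


late_records: List[Record] = [
    (1, 2, "a"),
    (12, 13, "b"),
    (3, 14, "c"),  # arrives late, after "b", but belongs to the first window
]

by_event = window_by_event_time(late_records, size=10)
print(by_event)
```

Record "c" goes into the window starting at 0 because that is when it happened. The window for [0, 10) therefore has to stay open, or be reopened, well after t=10 in processing time.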

Characteristics:

Summary

The What, Where, When, and How of Data Processing
