- **Epistemic status:** #budding

A Data Pipeline is a software architecture pattern in which raw data is ingested from one or more sources and moved into a data store. Before the data lands in the store, it undergoes processing such as filtering, masking, and aggregation, which ensures the data is standardized and integrated properly. There are two main ways a data pipeline is implemented:

## Batch Processing

Batch processing is the method in which batches of data are sent to the data store at set time intervals. Batch jobs are often run during off-peak business hours to minimize the load on the affected resources, since they usually work with large data sets. Batch processing is optimal for jobs that don't require immediate analysis of a data set, such as monthly reports or software backups.

## Streams

Streaming is the method in which data is sent and processed continuously as it is produced. Streams are optimal for jobs that require immediate analysis, such as interactive applications and point-of-sale systems. A single action, such as adding an item to a checkout, is considered an “event”, while a group of related events is typically grouped into a “topic” or a “stream”. A minimal code sketch contrasting the two approaches follows the references below.

---

## References

- Sarfin, Rachel Levy. “Streaming Data Pipelines: Building a Real-Time Data Pipeline Architecture.” _Precisely_ (blog), April 9, 2021. <https://www.precisely.com/blog/big-data/streaming-data-pipelines-how-to-build-one>.
- Thomas, David, and Andrew Hunt. _The Pragmatic Programmer, 20th Anniversary Edition: Journey to Mastery_. Second edition. Boston: Addison-Wesley, 2019.
- “What Is a Data Pipeline | IBM.” Accessed October 25, 2022. <https://www.ibm.com/topics/data-pipeline>.
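
## Example: Batch vs. Stream Processing

A minimal, framework-free Python sketch of the two approaches described above. The record fields, the in-memory `DATA_STORE` list, and the `transform` steps (filter, mask, standardize) are illustrative assumptions of mine, not APIs from any of the sources referenced; a real pipeline would load into a warehouse or read from a message broker instead.

```python
from typing import Iterable, Iterator

DATA_STORE: list[dict] = []  # stand-in for a real warehouse or database


def transform(record: dict) -> dict | None:
    """Filter, mask, and standardize a single raw record."""
    if record.get("amount", 0) <= 0:          # filtering: drop invalid rows
        return None
    masked = dict(record)
    masked["card_number"] = "****" + record["card_number"][-4:]  # masking PII
    masked["amount"] = round(float(record["amount"]), 2)         # standardizing
    return masked


def run_batch(raw_records: Iterable[dict], batch_size: int = 3) -> None:
    """Batch: accumulate records, then process and load them together."""
    batch: list[dict] = []
    for record in raw_records:
        batch.append(record)
        if len(batch) == batch_size:
            DATA_STORE.extend(r for r in map(transform, batch) if r)
            batch.clear()                      # one load per batch/interval
    if batch:                                  # flush the final partial batch
        DATA_STORE.extend(r for r in map(transform, batch) if r)


def run_stream(events: Iterator[dict]) -> None:
    """Stream: process and load each event as soon as it arrives."""
    for event in events:
        if (row := transform(event)) is not None:
            DATA_STORE.append(row)


if __name__ == "__main__":
    raw = [
        {"user": "a", "card_number": "4111111111111111", "amount": 19.99},
        {"user": "b", "card_number": "5500000000000004", "amount": 0},  # filtered out
        {"user": "c", "card_number": "340000000000009", "amount": 5.5},
    ]
    run_batch(raw)          # e.g. a nightly load
    run_stream(iter(raw))   # e.g. point-of-sale events as they happen
    print(len(DATA_STORE), "rows loaded")
```

Both runners share the same `transform`; the only difference is when the load happens. `run_batch` waits until an interval's worth of records has accumulated, while `run_stream` writes each event the moment it arrives.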