For a refresher on the basics, see [Deep Learning Book Series 2.1: Scalars, Vectors, Matrices and Tensors](https://hadrienj.github.io/posts/Deep-Learning-Book-Series-2.1-Scalars-Vectors-Matrices-and-Tensors/).

## Tensors and their properties

A tensor describes an n-dimensional array of data; the number of dimensions is often called the *rank* or the *number of axes*. A rank-0 tensor is a scalar, a rank-1 tensor is a vector, and a rank-2 tensor is a matrix.
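A quick way to see these ranks in practice is to build a few tensors and inspect their shapes. The following minimal sketch uses PyTorch, but NumPy or TensorFlow would look almost identical:

```python
import torch

scalar = torch.tensor(3.14)             # rank 0: a single number, shape ()
vector = torch.tensor([1.0, 2.0, 3.0])  # rank 1: one axis, shape (3,)
matrix = torch.ones(2, 3)               # rank 2: rows and columns, shape (2, 3)
image  = torch.zeros(213, 320, 3)       # rank 3: height x width x color channels

for t in (scalar, vector, matrix, image):
    print(t.ndim, tuple(t.shape))
# 0 ()
# 1 (3,)
# 2 (2, 3)
# 3 (213, 320, 3)
```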
N-dimensional tensors are [ideal for machine learning applications](https://d2l.ai/chapter_preliminaries/linear-algebra.html) because they provide fast access to data via direct lookup, without decoding or further processing. Thanks to [well-known matrix mathematics](https://en.wikipedia.org/wiki/Tensor_algebra), computing with tensors is very efficient, which makes it possible to train deep learning models with millions or even billions of parameters. Many tensor operations, such as addition, subtraction, the [Hadamard (element-wise) product](https://en.wikipedia.org/wiki/Hadamard_product_%28matrices%29), and the [dot product](https://en.wikipedia.org/wiki/Dot_product), are efficiently implemented in standard machine learning libraries.

Storing data in the tensor format comes with a notable time/memory tradeoff, which is not uncommon in computer science. Storing encoded and compressed data reduces the required disk space to a minimum, but accessing that data requires decoding and decompressing it, which takes computational effort. For single files this is mostly irrelevant: the advantages of fast transfer and low storage requirements outweigh the access time. When training deep learning models, however, the data is accessed frequently, and algorithms fundamental to machine learning (such as convolution for image analysis) cannot operate on encoded data.

A well-encoded 320 x 213 pixel JPG image requires only around 13 KB of storage, whereas a float32 tensor of the same image data takes about 798 KB of memory, an increase of roughly 6,100%.

![A 320 x 213 example photo](https://lh4.googleusercontent.com/wCWGVe0SmkzzggJdi2h4FT4hHox7h75cpo59B5FHUAQCcZ2BAzXvnKFTWYrsa4tqU165NsvkQyzV9F9dp2dnPrwkPo-AEKkqx3Rh_P7vuZ2JUP6lDdToHlEf4vwmt8KaaOBxfOz_ThnM_t2kJSanzZMgvy1CVcCD6tupeTJae4DnEM48q91i3z6OX-ZSrw)

*320 x 213 color pixels stored as a JPG require only 13 KB of storage*

![The same image as a decoded tensor](https://lh5.googleusercontent.com/xa-Di8zfpmP1sdgrcAQF6XOKK0GjUeoRHK2EfbV-9N0zZNSd8nE414--4IaPiPjlhfQYSXBIp9e0ERVDK0FrbPhttm_dWPyAAed2vAsff1oA0Ji7hEhWOm-vAWWtEVmJ8aYslifs94EBame81GBf_ssAuZiWKDOGIKu77AC6-S3Vlpk2dDfQ0LChYfa9gg)

*The same image data stored in a 320 x 213 x 3 float32 tensor weighs 798 KB*
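The memory figure follows directly from the tensor's shape and data type. A short sanity check, using the shape from the example above:

```python
import torch

# The decoded RGB image as a float32 tensor: height x width x color channels
img = torch.zeros(213, 320, 3, dtype=torch.float32)

bytes_in_memory = img.numel() * img.element_size()  # 213 * 320 * 3 values * 4 bytes each
print(f"{bytes_in_memory / 1024:.1f} KB")           # 798.8 KB, versus about 13 KB for the JPG file
```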
To combine the advantages of both, special [data loader modules have been designed](https://datagy.io/pytorch-dataloader/) to preprocess the data for optimized usage in the tensor format.

Additional optimizations such as batching and sparse data formats exist to handle large amounts of data. Nevertheless, hardware requirements for (training) deep learning models remain high.

## Decisions for your data pipeline

Taking into account the above insights on specialized data structures, let's have a look at the decisions one has to make when designing a data pipeline.

First of all, even before starting to develop any machine learning models, make sure to store all relevant data in a **structured** and **accessible** way. For image data, this usually means some cloud storage with additional metadata attached to the files or stored in a separate database. For the data loader, it is important to have a structured list of the relevant files and their attached labels; this metadata will be used to download and preprocess the files. Keep in mind that at some point multiple machines may work on the same dataset in parallel, so they all need parallel access. For the training procedure itself, we want to cache the data directly on the training machine to avoid high transaction times and costs, since the data is accessed frequently. Even if you don't plan to train a machine learning model (yet), it may be worth storing relevant data, and potentially labels, that could be useful for supervised learning later.

In the next step, we convert the data into a useful tensor format. The tensor rank depends on the data type (see [examples in the workbook](https://colab.research.google.com/drive/1mg2WRO7_DIc1U_0Q1NO7ou6C4F89NuWY)) and, perhaps surprisingly, on the problem definition: you have to decide whether the model should interpret each piece of data (for example, a sentence) independently of the others, or which parts of the data are related to each other. A batch usually consists of a number of independent samples; the batch size is flexible and can be reduced down to a single sample at inference/testing time. The type of the tensor also depends on the data type and the normalization method (pixels can be represented as integers from 0 to 255 or as floating-point numbers from 0 to 1).

For smaller problems, it might be possible to load the full dataset into memory (in tensor format) and train directly on this data source, with the advantage of faster data loading during training and a low CPU load. For most practical problems, however, this is rarely possible, as even standard datasets easily surpass hundreds of gigabytes. For those cases, [asynchronous data loaders can run as threads on the CPU](https://www.tensorflow.org/guide/data_performance) and prepare the data in memory. Because this is a continuous process, it works even if the total amount of memory is smaller than the full dataset.
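As a concrete illustration of the type and batch decisions, here is a minimal PyTorch sketch that converts integer pixel values into normalized floats and adds a batch dimension (the channels-first layout is an assumption; it is simply what most PyTorch vision models expect):

```python
import torch

# A single RGB image with integer pixel values in [0, 255]
image_uint8 = torch.randint(0, 256, (213, 320, 3), dtype=torch.uint8)

# Convert to float32 in [0, 1] and move the channel axis first
image_float = image_uint8.permute(2, 0, 1).to(torch.float32) / 255.0

# Add a batch dimension: models consume batches of independent samples
batch = image_float.unsqueeze(0)  # shape (1, 3, 213, 320)
print(image_uint8.shape, batch.shape, batch.dtype)
```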
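To tie the loading, decoding, normalization, batching, and parallelism together, a simplified pipeline could look like the sketch below. It uses PyTorch's `Dataset` and `DataLoader`; the dataset class, file paths, and labels are placeholders for illustration, not part of the original article:

```python
import torch
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms
from PIL import Image

class JpegDataset(Dataset):
    """Hypothetical dataset: keeps only file paths and labels in memory
    and decodes each JPG into a float32 tensor on demand."""
    def __init__(self, file_paths, labels):
        self.file_paths = file_paths
        self.labels = labels
        self.to_tensor = transforms.ToTensor()  # uint8 [0, 255] -> float32 [0, 1]

    def __len__(self):
        return len(self.file_paths)

    def __getitem__(self, idx):
        img = Image.open(self.file_paths[idx]).convert("RGB")
        return self.to_tensor(img), self.labels[idx]

# Placeholder metadata; in practice this comes from your structured file/label listing
file_paths = ["img_0001.jpg", "img_0002.jpg"]
labels = [0, 1]

# Worker processes decode and batch images asynchronously while the model trains
loader = DataLoader(JpegDataset(file_paths, labels),
                    batch_size=32, shuffle=True, num_workers=4)

for images, targets in loader:
    pass  # feed each batch to the training step here
```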
Dataset decisions:

- structured format
- accessible storage
- labels and metadata
- tensor format (rank, batch size, type, normalization)
- loading data from disk to memory, parallelization

Scalars, vectors, matrices, and especially tensors are the basic building blocks of any machine learning dataset. Training a model starts with building a relevant dataset and data processing pipeline. This article provided an overview of optimized data structures and explained some of the relevant aspects of the tensor format. Hopefully, the decisions discussed for designing data pipelines can serve as a starting point for a deeper look into data processing for machine learning.

*Visit [the additional notebook](https://colab.research.google.com/drive/1mg2WRO7_DIc1U_0Q1NO7ou6C4F89NuWY) for practical examples of how to process different types of data.*