Data Sourcing: Questions for Data Source Teams When Building New Pipelines

Pradeep Vijayakumar
Dec 7, 2020

It is common practice in data pipeline or ETL projects to learn all the details about the source before any development begins. When we fail to ask some critical questions and start development anyway, the result is tedious rework or modification of the already-built pipeline at a later stage, once the additional details come to light.

Also, when it comes to new pipeline projects, the initial discussions don’t usually involve technical people such as a Data Engineer or Architect. So we wanted to document some basic questions to bring to discussions about a new source.

Questions/Checklists:

  1. Where is the data available? (E.g., FTP, SFTP, S3, email, Slack, SharePoint, etc.)
  2. What is the format of the data? (E.g., .csv, .xlsx, .txt, database tables, etc.)
  3. When is the data usually available for consumption? (E.g., 4 AM, real time, near real time, as soon as a dependent job completes.)
  4. How current is the data? Is there a column that identifies when a row was inserted or updated? (E.g., previous day, 1 hour ago, 1 minute ago, etc.)
  5. How frequently does the data get updated? Does historical data change? If yes, how far back can those changes reach? (E.g., an operational system can change the past 3 months of transactional data.)
  6. What is the grain of the data? Is there a business key? If not, how do we identify a unique row?
  7. What volume of data are we dealing with? How quickly does it grow? (E.g., thousands, millions, etc.)
  8. Is there a metadata document explaining the fields, columns, and tables, along with their data types and constraints such as NULL or default values?
  9. How do we get access to the data? Are there any restrictions?
  10. Does the table load depend on other processes?
  11. When can we get a sample of the data?
  12. Can we build an ETL process to extract data from this source on a scheduled basis?
  13. Is there any PII (Personally Identifiable Information) or other sensitive information in the data?
  14. How will we be notified in case of any issues with the loading process?
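
Once a sample of the data arrives (question 11), some of the answers can be checked rather than taken on faith. The following is a minimal Python sketch, not a definitive implementation, that tests three of them: whether the stated business key is unique (question 6), how recent the newest row is (question 4), and how far back each run should re-extract when history can change (question 5). The column names `order_id` and `updated_at`, the timestamp format, the sample rows, and the 90-day lookback (standing in for the "past 3 months" example) are all assumptions made for illustration.

```python
import csv
import io
from collections import Counter
from datetime import datetime, timedelta


def find_duplicate_keys(rows, key_columns):
    """Return key values that appear more than once (question 6: grain)."""
    counts = Counter(tuple(row[col] for col in key_columns) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}


def latest_update(rows, updated_column, fmt="%Y-%m-%d %H:%M:%S"):
    """Return the newest timestamp in the update column (question 4: freshness)."""
    return max(datetime.strptime(row[updated_column], fmt) for row in rows)


def extraction_window_start(last_run, lookback_days):
    """Start of the re-extraction window when history can change (question 5)."""
    return last_run - timedelta(days=lookback_days)


# Hypothetical sample standing in for a file received from the source team.
SAMPLE = """order_id,amount,updated_at
1001,25.00,2020-12-06 23:10:00
1002,40.50,2020-12-06 23:45:00
1002,41.00,2020-12-07 01:15:00
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))

# Assumed business key: order_id. Assumed lookback: 90 days.
duplicates = find_duplicate_keys(rows, ["order_id"])
newest = latest_update(rows, "updated_at")
window_start = extraction_window_start(datetime(2020, 12, 7, 4, 0), lookback_days=90)

print("Duplicate keys:", duplicates)
print("Newest update:", newest)
print("Re-extract from:", window_start)
```

In this made-up sample, `order_id` 1002 appears twice, so the assumed key does not identify a unique row on its own. That is exactly the kind of finding worth taking back to the source team before development starts.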

These are some of the basic questions we can ask a potential data source team. Some may not be relevant, depending on the source we are dealing with, and there may be additional questions we need to ask. I have purposely kept the questions generic so that anyone can make use of them.
