60 Terms Every Data Analyst Must Know

CODEX
9 min read · Feb 10, 2022

1. Algorithm — An algorithm is a set of step-by-step instructions given to a computer system so it can take input values and transform them into a usable result.

2. Artificial Intelligence — Artificial intelligence is intelligence demonstrated by machines. AI is the branch of computer science that develops systems able to perform tasks associated with human intelligence, such as speech recognition, visual perception, decision-making, and language translation.

3. Big data — Refers to very large volumes of data, both structured and unstructured. It is not the amount of data that matters, but how organizations use it to generate insights. Companies apply various tools, techniques, and resources to make sense of this data and derive effective business strategies.

4. Business Analytics — Business analytics is the practical methodology an organization follows to explore its data and extract insights. The methodology involves statistical analysis of the data, followed by interpretation in the business context.

5. Business Intelligence — Business intelligence is the set of strategies, applications, data, and technologies an organization uses to collect and analyze data and generate insights that reveal strategic business opportunities.

6. Classification — Classification is a supervised machine learning technique. It assigns a data point to one of a set of predefined categories based on its similarity to other, labelled data points.
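
To make this concrete, here is a minimal classification sketch (not part of the original glossary). It assumes scikit-learn is installed and uses its built-in Iris dataset; the 3-neighbour setting is just for illustration.

```python
# Classify iris flowers by similarity to their nearest labelled neighbours.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)                         # features and category labels
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = KNeighborsClassifier(n_neighbors=3)               # categorize by similarity to other points
model.fit(X_train, y_train)                               # learn from labelled examples
print("accuracy on unseen data:", model.score(X_test, y_test))
```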

7. Clustering — Clustering is an unsupervised learning method used to discover the inherent groupings in data. For example, grouping customers based on their purchasing behavior segments the customer base, and companies can then apply the appropriate marketing tactics to each segment to generate more profit.
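
As an illustration, the sketch below groups made-up "customer" data into three segments with k-means; it assumes scikit-learn is installed and the two features (e.g. annual spend and visit frequency) are invented for the example.

```python
# Discover customer segments with k-means clustering.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Stand-in "customer" data: 300 customers, two features each.
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:10])          # segment assigned to each of the first 10 customers
print(kmeans.cluster_centers_)      # centre of each discovered segment
```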

8. Computer vision — Computer vision is a field of computer science that enables computers (or devices) to visualize, process, and identify images and videos in much the same way human vision does. Applications include pedestrian, car, and road detection in smart (self-driving) cars, object recognition, object tracking, and motion analysis.

9. Dashboard — A dashboard is a data product: a graphical report of the analysis performed on a dataset, which may include various charts and infographics with insights. It is an information management tool used to visually track and analyze key performance indicators, metrics, and key data points.

10. Data — A piece of information that can be stored, processed, or analyzed. Data is the unit of information collected through observations.

11. Data Aggregation — Data aggregation refers to the collection of data from multiple sources into a common repository for the purpose of reporting and/or analysis.

12. Data Analysis — Data analysis is the process of collecting, modelling, and analyzing data to extract insights that support decision-making.

13. Data Architecture and design — Data architecture consists of the models, policies, standards, and rules that control which data is aggregated and how it is arranged, stored, integrated, and put to use in data systems. It has three phases: the conceptual representation of business entities; the logical representation of the relationships between those entities; and the physical construction of the system for functional support.

14. Data Cleaning — Data cleansing/scrubbing/cleaning is the process of revising data: correcting misspellings, removing duplicate entries, filling in missing values, and enforcing consistency. It is required because incorrect data can lead to bad analysis and wrong conclusions/insights.
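
A small cleaning sketch for illustration, assuming pandas is installed; the table and its errors are invented for the example.

```python
import pandas as pd

# A small, messy table: a misspelling, a duplicate row, and a missing value.
df = pd.DataFrame({
    "city":  ["London", "Lodnon", "Paris", "Paris"],
    "sales": [100, 100, None, 250],
})

df["city"] = df["city"].replace({"Lodnon": "London"})   # fix the incorrect spelling
df = df.drop_duplicates()                               # remove the duplicate entry
df["sales"] = df["sales"].fillna(df["sales"].mean())    # fill in the missing value
print(df)
```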

15. Data Collection — A systematic approach to gathering observations and measurements so that first-hand data is available for analysis.

16. Data Culture — A data culture is the collective behaviours and beliefs of people who value, practice, and encourage the use of data to improve decision-making.

17. Data-Driven Decisions — Making decisions using facts, metrics, and data to guide strategic business choices that align with your goals, objectives, and initiatives.

18. Data Engineering — Data engineering is the aspect of data science that focuses on the practical applications of data collection and analysis. It helps make data more useful and accessible for data consumers by sourcing, transforming, and analyzing data from each system.

19. Data Governance — Data governance is the set of processes and procedures an organization uses to manage, utilize, and protect its data.

20. Data Literacy — Data literacy is the ability to read, work with, analyze and communicate with data. It is a skill that empowers all levels of workers to ask the right questions of data and machines, build knowledge, make decisions, and communicate meaning to others.

21. Data Mart — It is a simple form of data warehouse focused on a single subject or line of business. With a data mart, teams can access data and gain insights faster, because they do not have to spend time searching within a more complex data warehouse or manually aggregating data from different sources.

22. Data Mining — Data mining is the practice of extracting useful information from structured and unstructured data taken from various sources. It usually involves mining for frequent patterns, associations, correlations, clusters, and predictive relationships.

23. Data Modelling — Data modelling is the process of creating a data model for an information system using formal techniques. It is used to define and analyze the data requirements needed to support business processes.

24. Data Pipelines — A collection of scripts or functions that pass data along in a series. The output of the first method becomes the input to the second. This process continues until the data is appropriately cleaned and transformed for whatever task a team is working on.
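
A toy pipeline for illustration only: the function names and the "readings" are invented, and in practice each step would read from and write to real systems.

```python
# A toy pipeline: the output of each step is the input to the next.
def extract():
    return [" 12 ", "7", None, "31"]                    # raw, messy readings

def clean(raw):
    return [r.strip() for r in raw if r is not None]    # drop missing values, trim whitespace

def transform(cleaned):
    return [int(value) for value in cleaned]            # cast to the type the analysis needs

def load(values):
    print("ready for analysis:", values)                # in practice: write to a database or file

load(transform(clean(extract())))                       # run the pipeline as a chain of calls
```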

25. Data Science — Data science is an interdisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data and apply knowledge and actionable insights from data across a broad range of application domains.

26. Data Storytelling — Data storytelling is the practice of building a narrative around a set of data and its accompanying visualizations to help convey the meaning of that data in a powerful and compelling fashion.

27. Data Visualization — The art of communicating data visually. This involves visual insights, infographics, graphs, plots, and data dashboards.

28. Data Warehouse — A data warehouse is a large collection of business data used to help an organization make decisions. It is a system used to do quick analysis of business trends using data from many sources.

29. Database — A structured collection of data, organized so that it is easily accessible to a computer. Databases are built and managed using database languages; the most common is SQL.

30. Dataset — A dataset (or data set) is a collection of data. A dataset is organized into some type of data structure. In a database, for example, a dataset might contain a collection of business data (names, salaries, contact information, sales etc.).

31. Deep learning — Deep learning is a branch of machine learning based on artificial neural networks (ANNs), which are loosely inspired by the human brain and can model arbitrary functions.

32. Descriptive Analysis — It looks at data statistically to tell you what happened in the past. Descriptive analytics helps a business understand how it is performing by providing context to help stakeholders interpret information. This can be in the form of data visualizations like graphs, charts, reports, and dashboards.

33. EDA — Exploratory Data Analysis. It is the phase of the data science pipeline focused on understanding the data and surfacing insights through visualization and statistical analysis.

34. ETL — An acronym for three database functions: extract, transform, and load. The three functions are combined into one tool that moves data from one database to another. Extract is the process of reading data from a source database; transform is the process of converting the extracted data into the form required by the target; load is the process of writing the data into the target database.
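
A minimal ETL sketch using Python's standard sqlite3 module; the table names, order amounts, and the 0.9 exchange rate are all made up for the example.

```python
import sqlite3

# Source database with raw order amounts in USD.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (amount_usd REAL)")
source.executemany("INSERT INTO orders VALUES (?)", [(10.0,), (25.5,)])

rows = source.execute("SELECT amount_usd FROM orders").fetchall()   # Extract
eur = [(round(amount * 0.9, 2),) for (amount,) in rows]             # Transform (made-up USD->EUR rate)

target = sqlite3.connect(":memory:")                                # Load into the target database
target.execute("CREATE TABLE orders_eur (amount_eur REAL)")
target.executemany("INSERT INTO orders_eur VALUES (?)", eur)
print(target.execute("SELECT amount_eur FROM orders_eur").fetchall())
```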

35. Feature Selection — Feature Selection is a process of choosing those features which are required to explain the predictive power of a statistical model and dropping out irrelevant features. This can be done by either filtering out less useful features or by combining features to make a new one.
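
For illustration, the sketch below keeps the two most informative features of the Iris dataset; it assumes scikit-learn is installed, and the choice of k=2 is arbitrary.

```python
# Keep only the k features most strongly related to the target.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)                       # 4 candidate features
selector = SelectKBest(score_func=f_classif, k=2)       # score every feature, keep the best 2
X_selected = selector.fit_transform(X, y)

print("kept feature indices:", selector.get_support(indices=True))
print("shape before:", X.shape, "after:", X_selected.shape)
```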

36. Hadoop — Hadoop is an open-source distributed processing framework used when we must deal with enormous volumes of data. It lets us apply parallel processing across a cluster of machines to handle big data.

37. Hyperparameter — A hyperparameter is a parameter whose value is set before training a machine learning or deep learning model. Different models require different hyperparameters and some require none. Hyperparameters should not be confused with the parameters of the model because the parameters are estimated or learned from the data.
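
A small sketch of the distinction, assuming scikit-learn and synthetic regression data; the alpha value of 1.0 is an arbitrary choice for the example.

```python
# alpha is a hyperparameter (chosen before fitting);
# the coefficients are parameters (estimated from the data).
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=100, n_features=3, noise=5.0, random_state=0)

model = Ridge(alpha=1.0)                                # hyperparameter, set before training
model.fit(X, y)

print("hyperparameter alpha:", model.alpha)
print("learned coefficients:", model.coef_)             # parameters learned from the data
```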

38. Hypothesis — A hypothesis is a possible view or assertion an analyst makes about the problem they are working on. It may or may not turn out to be true.

39. Inferential Statistics — In inferential statistics, we draw conclusions about a population by looking only at a sample of it. For example, before releasing a drug to the market, internal tests are done to check whether the drug is viable for release. Since we cannot test the whole population, we test a sample that best represents the population.

40. Insights — Insights are generated by statistical analysis of data, followed by business interpretation.

41. Interquartile Range — The IQR is a measure of variability based on dividing a ranked data set into quartiles; it is the difference between the third quartile (Q3) and the first quartile (Q1), i.e., the spread of the middle 50% of the data.
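
A quick sketch with made-up numbers, assuming NumPy is installed; the 1.5 × IQR fence is a common rule of thumb rather than the only possible choice.

```python
import numpy as np

values = np.array([3, 5, 7, 8, 9, 11, 13, 15, 40])      # 40 looks suspiciously extreme
q1, q3 = np.percentile(values, [25, 75])                # first and third quartiles
iqr = q3 - q1
print("Q1:", q1, "Q3:", q3, "IQR:", iqr)

# Rule of thumb: points beyond 1.5 * IQR from the quartiles are flagged as outliers.
outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]
print("flagged outliers:", outliers)
```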

42. Libraries — A library is a collection of related, pre-written code that lets you perform many tasks without writing the code yourself.

43. Machine learning — Machine learning is a field of computer science that uses statistical techniques to give computers the ability to "learn" from data. It is used to exploit the opportunities hidden in big data.

44. Mean — The mean is the average of all the numbers in a data set. It is sometimes used as a single-value summary of the whole data.

45. Median — The median of a set of numbers is the middle value when the numbers are sorted. When the count of numbers is even, the median is the average of the two middle values. The median is a measure of central tendency.

46. Metadata — Metadata is the data about data. It is administrative, descriptive, and structural data that identifies the assets.

47. Mode — Mode is the most frequent value occurring in the population. It is a metric to measure the central tendency, i.e., a way of expressing, in a (usually) single number, important information about a random variable or a population.
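
Tying together items 44, 45, and 47, here is a small sketch using only Python's standard statistics module; the sample values are made up, with 42 included to show how an extreme value pulls the mean but not the median.

```python
import statistics

values = [2, 3, 3, 5, 7, 10, 10, 10, 42]

print("mean:  ", statistics.mean(values))    # average value; pulled upward by the 42
print("median:", statistics.median(values))  # middle value of the sorted list
print("mode:  ", statistics.mode(values))    # most frequent value
```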

48. Model Selection — Model selection is the task of selecting a statistical model from a set of candidate models. Methods that can be used for choosing the model include exploratory data analysis and scientific methods.

49. Natural Language processing — Natural language processing is a field that aims to make computer systems understand human language. NLP consists of techniques to process, structure, and categorize raw text and to extract information from it. Example: chatbots.

50. Normal Distribution — The normal distribution is a probability distribution that describes how the values of a variable are spread. It is a symmetric distribution in which most observations cluster around the central peak and the probabilities of values further from the mean taper off equally in both directions; its graph is the familiar bell curve.
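
A sampling sketch, assuming NumPy is installed; the "heights" with mean 170 and standard deviation 10 are invented for the example, and the printed percentages approximate the well-known 68/95 rule.

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(loc=170, scale=10, size=100_000)      # made-up "heights": mean 170, sd 10

within_1sd = np.mean(np.abs(sample - 170) <= 10)
within_2sd = np.mean(np.abs(sample - 170) <= 20)
print(f"within 1 standard deviation: {within_1sd:.1%}")   # roughly 68%
print(f"within 2 standard deviations: {within_2sd:.1%}")  # roughly 95%
```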

51. Normalization — A set of data is said to be normalized when all the values have been adjusted to fall within a common range. We normalize data sets to make comparisons easier and more meaningful.
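
A minimal min-max normalization sketch in plain Python; the values are made up, and [0, 1] is just one common choice of target range.

```python
# Min-max normalization: rescale values into the common range [0, 1].
values = [120, 250, 180, 90, 300]

lo, hi = min(values), max(values)
normalized = [(v - lo) / (hi - lo) for v in values]
print(normalized)   # 90 becomes 0.0, 300 becomes 1.0, everything else falls in between
```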

52. Outlier — An outlier is a data point that is considered extremely far from other points. They are generally the result of exceptional cases or errors in measurement and should always be investigated early in a data analysis workflow.

53. Overfitting — Overfitting is a condition in which a model learns the training data too closely, including its noise, so it performs well on the training data but generalizes poorly to new data.

54. Predictive Analysis — Predictive analytics takes historical data and feeds it into a machine learning model that considers key trends and patterns. The model is then applied to current data to predict what will happen next.

55. Regression — Regression is a supervised machine learning problem. It focuses on how a target value changes as other values within a data set change. Regression problems generally deal with continuous variables, like how square footage and location affect the price of a house.
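
For illustration, the sketch below fits a line relating square footage to price; it assumes scikit-learn and NumPy are installed, and the five house prices are invented.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

sqft  = np.array([[600], [800], [1000], [1200], [1500]])         # feature: square footage
price = np.array([150_000, 190_000, 230_000, 270_000, 330_000])  # continuous target: price

model = LinearRegression().fit(sqft, price)
print("price per extra square foot:", model.coef_[0])
print("predicted price for 1,100 sqft:", model.predict([[1100]])[0])
```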

56. Standard deviation — The standard deviation shows how much the members of a group differ from the group's mean value: a low standard deviation means values cluster near the mean, while a high one means they are spread out.

57. Supervised learning — A subcategory of machine learning that uses labelled datasets to train algorithms for classification or prediction.

58. Text Analytics — Text analytics is the process of applying linguistic, machine learning, and statistical techniques to text data in order to extract useful information.

59. Underfitting — Underfitting happens when a model is too simple, or the data does not offer enough information, for the model to capture the underlying pattern, so it performs poorly even on the training data.

60. Unsupervised learning — In this technique there is no target or outcome variable to predict or estimate. The goal of unsupervised learning is to model the underlying structure or distribution of the data in order to learn more about it, or to segment the data into different groups based on its attributes.

CODEX

Transforming organizations into data-driven enterprises