
DataPipeline: Data Hub - LinkedIn


Authors: Mars Lan, Seyi Adebajo, Shirshanka Das

Translator: Zhang Yaran

As the world's largest professional social platform, LinkedIn has a data team that is constantly working to scale its infrastructure to meet the demands of an ever-growing big data ecosystem. As the data grows in volume and richness, it becomes increasingly challenging for data scientists and engineers to discover the available data assets, understand their provenance, and take appropriate action based on those insights.

To keep scaling productivity and innovation along with this data growth, we created a generalized metadata search and discovery tool, Data Hub.

I. Extending metadata

To improve the productivity of LinkedIn's data team, we previously developed and open-sourced WhereHows, a central metadata repository and portal for datasets. The types of metadata stored include both technical metadata (for example, location, schema, partitioning, ownership) and process metadata (for example, lineage, job execution, lifecycle information). WhereHows also has a search engine to help locate datasets of interest.

Since the initial release of WhereHows in 2016, the industry has paid growing attention to using metadata to improve the productivity of data scientists. Tools developed in this area include, for example, AirBnb's Dataportal, Uber's Databook, Netflix's Metacat, Lyft's Amundsen and, more recently, Google's Data Catalog. At LinkedIn, we have also been busy expanding our metadata collection to support new use cases while preserving fairness, privacy, and transparency. However, we came to realize that WhereHows could not keep up with our evolving metadata needs. Here is a summary of what we learned from scaling WhereHows:

1. Push is better than pull

While extracting metadata directly from its source seems like the most straightforward way to collect it, development and maintenance can quickly become a nightmare. It is more scalable to let individual metadata providers push information to a central repository via APIs or messages. This push-based approach also ensures that new and updated metadata is reflected in a more timely manner.

2. General is better than specific

WhereHows has a strong opinion about what the metadata for a dataset or a job should look like. This results in fixed APIs, data models, and storage formats, where a small change to the metadata model leads to cascading changes up and down the stack. A general-purpose architecture that keeps the storage and serving layers agnostic to the metadata model is far more scalable. This in turn allows us to focus on evolving a metadata model that suits our business without worrying about the lower layers of the stack.

3. Online is as important as offline

Once metadata is collected, it is natural to want to analyze it to extract value. A simple solution is to dump all the metadata into an offline system such as Hadoop, where arbitrary analyses can be run. However, we soon discovered that supporting offline analysis alone was not enough. Many use cases, such as access control and data privacy handling, must query the latest metadata online.

4. Relationships play an important role.

Metadata often conveys important relationships (for example, lineage, ownership, and dependencies) that enable powerful capabilities such as impact analysis, data rollup, better search relevance, and so on. It is critical to model all of these relationships as first-class citizens and to support efficient analytical queries over them.

5. A multi-center universe

We realized that modeling metadata around a single entity (the dataset) is not enough. There is a whole ecosystem of data, code, and people entities (datasets, data scientists, teams, code, microservice APIs, metrics, AI features, AI models, dashboards, notebooks, etc.) that need to be integrated into the metadata graph.

II. Meet Data Hub

About a year ago, armed with these learnings, we set out to redesign WhereHows from the ground up. We recognized that LinkedIn increasingly needs a unified search and discovery experience across all kinds of data entities, together with a metadata graph that connects them. As a result, we decided to expand the scope of the project and build a fully generalized metadata search and discovery tool, Data Hub, with the ambitious vision of connecting LinkedIn employees with the data that matters to them.

We split the monolithic WhereHows stack into two distinct stacks: a modular UI frontend and a generalized metadata architecture backend. The new architecture enables us to rapidly expand the scope of metadata collection beyond just datasets and jobs. As of this writing, Data Hub stores and indexes tens of millions of metadata records spanning 19 different entities, including datasets, metrics, jobs, charts, AI features, people, and groups. We also plan to onboard metadata for machine learning models and labels, experiments, dashboards, microservice APIs, and code in the near future.

III. Modular UI

The Data Hub web application is how most users interact with the metadata. The application is written with the Ember framework and runs on a Play middle tier. To keep development scalable, it leverages a variety of modern web technologies, including ES9, ES.Next, TypeScript, Yarn, and code quality tools such as Prettier and ESLint. The presentation, control, and data layers are modularized into packages so that specific views in the application are built from a combination of the relevant packages.

Component service framework

In applying this modular UI infrastructure, we built the Data Hub web application as a set of cohesive feature components grouped into installable packages. The package architecture is based on Yarn Workspaces and Ember add-ons, and is componentized using Ember components and services. You can think of this as a UI built from small building blocks (components and services) that are composed into larger building blocks (Ember add-ons and npm/Yarn packages), which in turn combine to form the Data Hub web application.
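As a rough sketch of what such a setup involves (the package name and directory layout below are illustrative assumptions, not taken from the Data Hub codebase), a Yarn Workspaces monorepo declares its member packages in a top-level package.json:

{
  "name": "datahub-web",
  "private": true,
  "workspaces": [
    "packages/*"
  ]
}

Each directory under packages/ can then hold an Ember add-on or a plain npm package, and the main application is assembled by depending on the relevant packages.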

With components and services at the core of the application, this framework allows us to separate different concerns and compose functionality within the application. In addition, the layered segmentation provides a highly customizable architecture that allows consumers to extend or trim their applications to use only the functionality relevant to their domain, or to build on new metadata models.

Interacting with Data Hub

At the highest level, the frontend provides three types of interaction: (1) search, (2) browse, and (3) view/edit metadata. Here are some screenshots of the application in action:

Screenshot of Data Hub App

Similar to a typical search engine experience, users can search for one or more types of entities by providing a list of keywords. They can further refine the results by filtering on a list of facets. Advanced users can also use operators such as OR, NOT, and regex to perform complex searches.

Data entities in Data Hub can be organized and browsed in a tree, where each entity is allowed to appear at multiple places in the tree. This enables users to browse the same catalog in different ways, for example, by physical deployment configuration or by business function. There is even a dedicated part of the tree that shows only "certified entities," which are curated through a separate governance process.

The final type of interaction, viewing and editing metadata, is also the most complex. Each data entity has a profile page that displays all the relevant metadata; for example, a dataset profile page may contain its schema, ownership, compliance, health, and lineage metadata. It can also show how the entity relates to other entities. For metadata that is editable, users can update it directly through the UI.

IV. Generalized metadata architecture

To fully realize the vision of Data Hub, we needed an architecture that can scale with the metadata. The scalability challenges come in four different forms:

1. Modeling: model all types of metadata and relationships in a developer-friendly manner.

2. Ingestion: ingest a large volume of metadata changes at scale, both through APIs and streams.

3. Serving: serve the collected raw and derived metadata, as well as a variety of complex queries against the metadata.

4. Indexing: index the metadata at scale, and automatically update the indexes when the metadata changes.

Metadata modeling

In short, metadata is "data that provides information about other data." In terms of metadata modeling, there are two different requirements:

1. Metadata is also data

To model metadata, we need a general-purpose modeling language.

2. Metadata is distributed

It is unrealistic to expect all metadata to come from a single source. For example, the system that manages the access control list (ACL) of the dataset is likely to be different from the system that stores schema metadata. A good modeling framework should allow multiple teams to develop their metadata models independently while presenting a unified view of all metadata related to data entities.

We chose to leverage Pegasus, an open-source data schema language created by LinkedIn. Pegasus is designed for general-purpose data modeling, so it works well for most metadata. However, because Pegasus does not provide an explicit way to model relationships or associations, we introduced a few custom extensions to support these use cases.

To demonstrate how to model metadata using Pegasus, let's look at a simple example illustrated by the following modified entity-relationship diagram (ERD).

The example contains three types of entities: users, groups, and datasets, represented by the blue circles in the diagram. We use arrows to denote the three types of relationships between these entities, namely OwnedBy, HasMember, and HasAdmin. In other words, a group has a single admin and multiple users as members, and a user can own one or more datasets.

Unlike a traditional ERD, we place the attributes of an entity and of a relationship directly inside the circle and below the relationship name, respectively. This lets us attach new types of components, called "metadata aspects," to the entities. Different teams can own and evolve different aspects of the same entity's metadata without interfering with one another, which fulfills the distributed metadata modeling requirement. Three types of metadata aspects, ownership, profile, and membership, are shown as green rectangles in the example above. The dotted lines denote which metadata aspects are associated with which entities; for example, a profile is associated with a user, an ownership aspect is associated with a dataset, and so on.

You may have noticed that the attributes of entities and relationships overlap with the metadata aspects; for example, the firstName attribute of User should be the same as the firstName field of the associated Profile. The reason for this duplication will be explained in the second half of this article.

In Pegasus, each entity, relationship, and metadata aspect is translated into a separate Pegasus schema file (PDSC). For brevity, we include only one model from each category here. First, let's take a look at the PDSC for the User entity:
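A minimal sketch of what such a PDSC file might look like is shown below; the namespace and exact field names are illustrative assumptions rather than the actual Data Hub schema.

{
  "type": "record",
  "name": "User",
  "namespace": "com.example.metadata.entity",
  "doc": "Data model for a User entity (illustrative sketch)",
  "fields": [
    { "name": "urn", "type": "string", "doc": "Globally unique URN identifying the user" },
    { "name": "firstName", "type": "string", "optional": true, "doc": "First name of the user" },
    { "name": "lastName", "type": "string", "optional": true, "doc": "Last name of the user" },
    { "name": "ldap", "type": "string", "optional": true, "doc": "LDAP handle of the user" }
  ]
}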

Each entity is required to have a globally unique ID in the form of a URN, which can be thought of as a typed GUID. The User entity has first name, last name, and LDAP attributes, each of which maps to an optional field in the user record.

Next is the PDSC model of the OwnedBy relationship:
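Again as an illustrative sketch (names are assumptions), a relationship schema of this shape carries source and destination URNs plus a custom "pairings" property:

{
  "type": "record",
  "name": "OwnedBy",
  "namespace": "com.example.metadata.relationship",
  "doc": "A relationship connecting a dataset to an owning user (illustrative sketch)",
  "fields": [
    { "name": "source", "type": "string", "doc": "URN of the source entity, here a dataset" },
    { "name": "destination", "type": "string", "doc": "URN of the destination entity, here a user" },
    { "name": "type", "type": "string", "optional": true, "doc": "Optional attribute describing the kind of ownership" }
  ],
  "pairings": [
    { "source": "com.example.metadata.entity.Dataset", "destination": "com.example.metadata.entity.User" }
  ]
}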

Each relationship model naturally contains source and destination fields that point to specific entity instances via their URNs. A model can optionally contain other attribute fields, such as "type" in this example. Here, we also introduce a custom property called "pairings" to restrict the relationship to specific pairs of source and destination URN types. In this case, the OwnedBy relationship can only be used to connect a dataset to a user.

Finally, you will find the model for the Ownership metadata aspect below. Here, we chose to model ownership as an array of records containing type and ldap fields. However, there is virtually no limit to how an aspect can be modeled, as long as it is a valid PDSC record. This makes it possible to satisfy the "metadata is also data" requirement mentioned earlier.
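An illustrative sketch of such an aspect, modeled as an array of owner records as described above (field and namespace names are assumptions):

{
  "type": "record",
  "name": "Ownership",
  "namespace": "com.example.metadata.aspect",
  "doc": "Ownership metadata aspect (illustrative sketch)",
  "fields": [
    {
      "name": "owners",
      "doc": "List of owners of the entity",
      "type": {
        "type": "array",
        "items": {
          "type": "record",
          "name": "Owner",
          "fields": [
            { "name": "type", "type": "string", "doc": "Category of ownership, e.g. developer or data steward" },
            { "name": "ldap", "type": "string", "doc": "LDAP handle of the owner" }
          ]
        }
      }
    }
  ]
}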

After all the models have been created, the next question is how to connect them together to form the proposed ERD. We will defer this discussion to the metadata indexing section later in this article.

Metadata ingestion

Data Hub provides two forms of metadata ingestion: direct API calls or a Kafka stream. The former is meant for metadata changes that require read-after-write consistency, while the latter is better suited for fact-oriented updates.

Data Hub's API is based on Rest.li, a scalable, strongly typed RESTful service framework used widely at LinkedIn. Because Rest.li uses Pegasus as its interface definition, all the metadata models defined in the previous section can be used verbatim. Gone are the days of multiple levels of model transformation from the API down to storage; the API and the models always stay in sync.

For Kafka-based ingestion, metadata producers are expected to emit a standardized Metadata Change Event (MCE), which contains a list of proposed changes to specific metadata aspects, keyed by the corresponding entity URN. The MCE format is Apache Avro, auto-generated from the Pegasus metadata models.
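To make the shape of an MCE concrete, here is a hypothetical event rendered as JSON; the field names are illustrative assumptions, not Data Hub's actual Avro schema, but they capture the idea of proposed aspect changes keyed by an entity URN:

{
  "urn": "urn:li:dataset:(urn:li:dataPlatform:hdfs,PageViewEvent,PROD)",
  "proposedChanges": [
    {
      "aspect": "Ownership",
      "value": {
        "owners": [ { "type": "developer", "ldap": "jsmith" } ]
      }
    }
  ]
}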

Using the same metadata models in both the API and the Kafka event schemas allows us to evolve the models easily without painstakingly maintaining the corresponding transformation logic. However, to achieve truly seamless schema evolution, we need to restrict all schema changes to be backward compatible. This is enforced at build time with added compatibility checks.

At LinkedIn, we tend to rely more heavily on the Kafka stream because it provides loose coupling between producers and consumers. We receive millions of MCEs from different producers every day, and that number is expected to grow exponentially as we expand the scope of our metadata collection. To build the streaming metadata ingestion pipeline, we use Apache Samza as the stream processing framework. The ingestion Samza job is purposely kept simple so that it achieves high throughput: it simply converts the Avro data back into Pegasus and invokes the corresponding Rest.li API to complete the ingestion.

Metadata serving

Once the metadata has been ingested and stored, it is important to serve the raw and derived metadata efficiently. Data Hub is designed to support four common types of queries against large amounts of metadata:

1. Document-oriented query

2. Graph-oriented query

3. Complex queries involving joins

4. Full-text retrieval

To achieve this, Data Hub needs to use multiple kinds of data systems, each specialized in scaling and serving a limited type of query. For example, Espresso is LinkedIn's NoSQL database, which is particularly well suited for document-oriented CRUD at scale. Similarly, Galene can easily index and serve full-text search at web scale. When it comes to non-trivial graph queries, it is no surprise that specialized graph DBs can perform orders of magnitude better than RDBMSs. It turns out, however, that graph structure is also a natural way to represent foreign-key relationships, allowing complex join queries to be answered efficiently.

Data Hub further abstracts the underlying data systems through a set of generic Data Access Objects (DAOs), such as a key-value DAO, a query DAO, and a search DAO. Swapping the implementation of a DAO does not require any change to the business logic in Data Hub. This enables us to use LinkedIn's proprietary storage technologies internally while open-sourcing Data Hub with reference implementations for popular open-source systems.

Another major benefit of the DAO abstraction is standardized change data capture (CDC). Regardless of the type of the underlying data store, any update operation through the key-value DAO automatically emits a Metadata Audit Event (MAE). Each MAE contains the URN of the corresponding entity, along with the before and after images of the specific metadata aspect. This enables a lambda architecture, where MAEs can be processed in batch or as a stream. Like MCEs, MAE schemas are automatically generated from the metadata models.
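For illustration, the MAE emitted for the ownership change sketched earlier might look roughly like this in JSON form (field names are again assumptions; the key point is the entity URN plus before and after images of the aspect):

{
  "urn": "urn:li:dataset:(urn:li:dataPlatform:hdfs,PageViewEvent,PROD)",
  "aspect": "Ownership",
  "oldValue": { "owners": [ { "type": "developer", "ldap": "previous_owner" } ] },
  "newValue": { "owners": [ { "type": "developer", "ldap": "jsmith" } ] }
}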

Metadata indexing

The last missing piece is the metadata indexing pipeline, the system that connects the metadata models together and creates the corresponding indexes in the graph database and the search engine to support efficient queries. This business logic is captured in the form of index builders and graph builders and is executed as part of a Samza job that processes MAEs. Each builder registers its interest in specific metadata aspects with the job and is invoked with the matching MAEs. The builder then returns a list of idempotent updates to be applied to the search index or the graph DB.

The metadata indexing pipeline is also highly scalable, as it can be partitioned by the entity URN of each MAE to support in-order processing per entity.

V. Conclusions and what's next

In this article, we introduced Data Hub, the latest development in our metadata journey at LinkedIn. The project comprises a modular UI frontend and a generalized metadata architecture backend.

For the past six months, Data Hub has been used weekly by more than 1,500 employees at LinkedIn for search, discovery, and a variety of specific actions. LinkedIn's metadata graph contains more than a million datasets, 23 data storage systems, 25k metrics, more than 500 AI features, and, most importantly, all LinkedIn employees, who are the creators, consumers, and operators of this graph.

We continue to improve Data Hub by adding more compelling user stories and relevance algorithms to the product. We also plan to add native support for GraphQL and to leverage the Pegasus Domain-Specific Language (PDL) to automate code generation in the near future. At the same time, we are actively working on sharing this evolution of WhereHows with the open-source community and will make an announcement once Data Hub is publicly released.

Source: LinkedIn Engineering, "Data Hub: A Generalized Metadata Search & Discovery Tool" by Mars Lan, Seyi Adebajo, and Shirshanka Das
