This article explains how to use .NET 5 together with .NET for Spark and ML.NET for big data and machine learning.
.NET 5 is designed to provide a unified runtime and framework, with consistent runtime behavior and developer experience across platforms. Microsoft has also released big data (.NET for Spark) and machine learning (ML.NET) tools that work with .NET and together provide a productive end-to-end experience. In this article, we introduce the basics of .NET for Spark, big data, ML.NET, and machine learning, examine their APIs and capabilities, and show you how to start building and consuming your own Spark jobs and ML.NET models.
What is big data?
Big data is an industry term that almost defines itself. It refers to large data sets, usually at the terabyte or even petabyte level, that serve as input for analysis intended to reveal patterns and trends in the data. The key difference between big data and traditional workloads is that big data is often too large, too complex, or too fast-changing for traditional databases and applications to handle. One popular way of characterizing big data is the "three Vs": Volume, Velocity, and Variety.
Big data solutions are tailored to accommodate high volumes and complex, varied data structures, and they manage velocity through both batch (at rest) and stream (in motion) processing.
Most big data solutions provide a way to store data in a data warehouse, which is usually a distributed cluster optimized for fast retrieval and parallel processing. Dealing with big data often involves multiple steps, as shown in the following figure:
.NET 5 developers who need to analyze and draw insights from large data sets can use a .NET implementation of the popular big data solution Apache Spark: .NET for Spark.
.NET for Spark
.NET for Spark is based on Apache Spark, an open source analytics engine for processing big data. It is designed to process large amounts of data in memory to provide better performance than other solutions that rely on persistent storage. It is a distributed system that processes workloads in parallel, and it provides support for loading, querying, processing, and outputting data.
Apache Spark supports Java, Scala, Python, R, and SQL. Microsoft created .NET for Spark to add support for .NET. The solution provides free, open, cross-platform tools for building big data applications in .NET-supported languages such as C# and F#, so that you can use existing .NET libraries while taking advantage of Spark features such as SparkSQL.
The following code shows a small but complete .NET for Spark application that reads a text file and outputs the word counts in descending order.
using Microsoft.Spark.Sql;

namespace MySparkApp
{
    class Program
    {
        static void Main(string[] args)
        {
            // Create a Spark session.
            SparkSession spark = SparkSession
                .Builder()
                .AppName("word_count_sample")
                .GetOrCreate();

            // Create initial DataFrame.
            DataFrame dataFrame = spark.Read().Text("input.txt");

            // Count words.
            DataFrame words = dataFrame
                .Select(Functions.Split(Functions.Col("value"), " ").Alias("words"))
                .Select(Functions.Explode(Functions.Col("words")).Alias("word"))
                .GroupBy("word")
                .Count()
                .OrderBy(Functions.Col("count").Desc());

            // Show results.
            words.Show();

            // Stop Spark session.
            spark.Stop();
        }
    }
}
Configuring .NET for Spark on a development machine requires several dependencies to be installed, including the Java SDK and Apache Spark. You can find the step-by-step getting started guide here (https://aka.ms/go-spark-net).
Spark for .NET can run in a variety of environments and can be deployed to the cloud. Deployment targets include Azure HDInsight, Azure Synapse, AWS EMR Spark, and Databricks. If the data is available as part of your project, you can submit it along with the other project files.
Big data is often used together with machine learning to gain insights from the data.
What is machine learning?
First, let's cover some basics of artificial intelligence and machine learning.
Artificial intelligence (AI) refers to computers imitating human intelligence and abilities, such as reasoning and finding meaning. Traditional AI techniques typically start with a system of rules or logic. As a simple example, consider the scenario of classifying something as "bread" or "not bread." At first it seems simple: a rule such as "if it has eyes, it's not bread" might suffice. However, you soon realize that many different features can mark something as bread or not bread, and the more features there are, the longer and more complex the chain of if statements becomes, as shown in the following figure:
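As a purely hypothetical illustration (the feature names below are invented for this example, not taken from any real system), a rule-based classifier quickly degenerates into a brittle chain of if statements:

// Hypothetical rule-based "bread or not bread" classifier. The feature names
// are invented for illustration; every new edge case forces another rule.
static class BreadRules
{
    public static bool IsBread(bool hasEyes, bool hasCrust, bool isSliced, bool containsFlour)
    {
        if (hasEyes) return false;             // "if it has eyes, it's not bread"
        if (!containsFlour) return false;      // most bread contains flour... but not all
        if (hasCrust || isSliced) return true; // crusty or sliced? probably bread
        return false;                          // everything else: guess and hope
    }
}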
As the example shows, traditional rule-based AI techniques are often hard to scale. This is where machine learning comes in. Machine learning (ML) is a subset of AI that finds patterns in past data and learns from experience in order to act on new data. ML allows computers to make predictions without being programmed with explicit logic rules, so you can use ML when you have a problem that is difficult or impossible to solve with rule-based programming. You can think of ML as "programming the unprogrammable."
To solve the "bread" and "non-bread" problems with ML, you provide bread and non-bread examples (as shown in the following figure) instead of implementing a long list of complex if statements. You pass these examples to an algorithm that finds a pattern in the data and returns a model, which you can then use to predict whether an image that has not been "seen" by the model is "bread" or "not bread".
The figure above shows another way to think about AI versus ML. AI takes rules and data as input and outputs the expected answers based on those rules. ML takes data and answers as input and outputs the rules that can be used to generalize to new data.
ML.NET
Microsoft released ML.NET, an open source, cross-platform ML framework for .NET developers, at Build in May 2019. Over the previous nine years, Microsoft teams had made extensive use of an internal version of the framework to implement popular ML-driven features; examples include Dynamics 365 fraud detection, PowerPoint Design Ideas, and Microsoft Defender antivirus threat protection.
ML.NET allows you to build, train, and consume ML models in the .NET ecosystem without needing a background in ML or data science. ML.NET runs anywhere .NET runs: Windows, Linux, macOS, on-premises, offline scenarios such as WinForms or WPF desktop applications, or any cloud, such as Azure. You can use ML.NET for a wide variety of scenarios.
ML.NET uses automated machine learning, or AutoML, to automate the process of building and training ML models and finding the best model for the scenario and data provided. You can use ML.NET's AutoML through the AutoML.NET API or through the ML.NET tooling, which includes Model Builder in Visual Studio and the cross-platform ML.NET CLI. In addition to training the best model, the ML.NET tooling generates the files and C# code needed to consume the model in an end-user .NET application, which can be any .NET application (desktop, web, console, and so on). All AutoML scenarios offer a local training option, and image classification additionally lets you take advantage of the cloud by using Azure ML from Model Builder for training.
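As a rough sketch of what the AutoML.NET API looks like: the CSV path, column layout, and DocInput class below are assumptions for illustration, and the exact API surface depends on the Microsoft.ML.AutoML package version you use.

using System;
using Microsoft.ML;
using Microsoft.ML.AutoML;
using Microsoft.ML.Data;

// Hypothetical input schema for a text-classification scenario.
public class DocInput
{
    [LoadColumn(0)] public string Text { get; set; }
    [LoadColumn(1)] public string Label { get; set; }
}

public static class AutoMlSketch
{
    public static void Run()
    {
        var mlContext = new MLContext();

        // "docs.csv" is a placeholder path for labeled training data.
        IDataView data = mlContext.Data.LoadFromTextFile<DocInput>(
            "docs.csv", hasHeader: true, separatorChar: ',');

        // Give AutoML a time budget (in seconds) to search trainers and pipelines,
        // then report the best multiclass classification run it found.
        // The label column defaults to "Label", matching DocInput.Label.
        var experiment = mlContext.Auto().CreateMulticlassClassificationExperiment(60);
        var result = experiment.Execute(data);

        Console.WriteLine($"Best trainer: {result.BestRun.TrainerName}");
    }
}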
You can learn more about ML.NET at Microsoft Docs at https://aka.ms/mlnetdocs.
ML combined with big data
Big data and ML work well together. Let's build a pipeline that uses both Spark for .NET and ML.NET to show how. Markdown is a popular language for writing documentation and creating static websites; it uses a less complex syntax than HTML but provides more formatting control than plain text. Here is an excerpt from a Markdown file in the .NET docs repository:
---
title: Welcome to .NET
description: Getting started with the .NET family of technologies.
ms.date: 12/03
ms.custom: "updateeachrelease"
---

# Welcome to .NET

See [Get started with .NET Core](core/get-started.md) to learn how to create .NET Core apps.

Build many types of apps with .NET, such as cloud, IoT, and games using free cross-platform tools...
The section between the dashes, called front matter, is metadata about the document described in YAML. The section beginning with a single hash sign (#) is a heading; two hash signs (##) denote a second-level heading. "Get started with .NET Core" is a hyperlink.
Our goal is to process a large number of documents, add metadata such as word count and estimated reading time, and automatically group similar articles together.
This is the pipeline we will build:
Establish a word count for each document
Estimate the reading time of each document (a minimal sketch of these first two steps follows the list)
Create a list of the top 20 words in each document, ranked by "TF-IDF," or term frequency-inverse document frequency (explained later)
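The word count and reading-time steps are simple heuristics. Here is a minimal sketch; the 200-words-per-minute figure and the helper names are assumptions, not the project's actual code.

using System;

static class ReadingStats
{
    // Split on whitespace to get a rough word count for a document.
    public static int CountWords(string text) =>
        text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries).Length;

    // Assume an average reading speed of roughly 200 words per minute.
    public static double EstimateReadingMinutes(string text, int wordsPerMinute = 200) =>
        (double)CountWords(text) / wordsPerMinute;
}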
The first step is to pull down the documentation repository that the application will process. You can use any repository and folder structure that contains Markdown files. The examples in this article use the .NET docs repository, which can be cloned from https://aka.ms/dot-net-docs.
After preparing the local environment for .NET and Spark, you can pull the project from https://aka.ms/spark-ml-example.
The solution folder contains a batch command (available in the repository) that you can use to run all steps.
Dealing with Markdown
The DocRepoParser project recursively traverses the subfolders of the repository to collect metadata about each document. The Common project contains several helper classes. For example, FilesHelper is used for all file I/O; it keeps track of where files are stored and their names, and provides services such as file reading to the other projects. Its constructor requires a tag (a number that uniquely identifies a workflow) and the path to the repo or top-level folder containing the documents. By default it creates a folder under the user's local application data folder, but you can override this if necessary.
MarkdownParser uses the Microsoft.Toolkit.Parsers library to parse the Markdown. The parser has two tasks: first, it must extract titles and subtitles; second, it must extract words. Markdown files are exposed as "blocks" representing headings, links, and other Markdown features. Blocks in turn contain "Inlines" that host the text. For example, this code parses a TableBlock by iterating over rows and cells to find the Inlines:
case TableBlock table:
    table.Rows.SelectMany(r => r.Cells)
        .SelectMany(c => c.Inlines)
        .ForEach(i => candidate = RecurseInline(i, candidate, words, titles));
    break;
This code extracts the text portion of the hyperlink:
case HyperlinkInline hyper:
    if (!string.IsNullOrWhiteSpace(hyper.Text))
    {
        words.Append(hyper.Text.ExtractWords());
    }
    break;
The result is a CSV file, as shown in the following figure:
This first step prepares the data to be processed. The next step uses a Spark for .NET job to determine the word count, reading time, and top 20 terms for each document.
Build Spark Job
The SparkWordsProcessor project runs the Spark job. Although it is a console project, it requires Spark to run; the runjob.cmd batch command submits the job on a correctly configured Windows machine. The pattern of a typical job is to create a session, or "application," execute some logic, and then stop the session:
var spark = SparkSession.Builder()
    .AppName(nameof(SparkWordsProcessor))
    .GetOrCreate();
RunJob();
spark.Stop();
You can easily read the file from the previous step by passing its path to the Spark session.
var docs = spark.Read().HasHeader().Csv(filesHelper.TempDataFile);
docs.CreateOrReplaceTempView(nameof(docs));
var totalDocs = docs.Count();
The docs variable resolves to a DataFrame. A data frame is essentially a table with a set of columns and a common interface for interacting with data regardless of its underlying source. One data frame can be referenced from another, and data frames can also be queried with SparkSQL. To reference a data frame from SQL, you must create a temporary view that gives it an alias. The CreateOrReplaceTempView method makes it possible to query the rows of the data frame like this:
SELECT * FROM docs
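For example, assuming the spark session and the docs temporary view created above, the same query can be issued from C#, and the result is just another data frame:

// Query the temporary view registered with CreateOrReplaceTempView.
DataFrame allDocs = spark.Sql("SELECT * FROM docs");
allDocs.Show();   // print the first rows to the console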
The totalDocs variable retrieves a count of all rows, that is, of all documents. Spark provides a Split function that breaks a string into an array, and an Explode function that turns each array item into a row:
var words = docs.Select(fileCol,
        Functions.Split(nameof(FileDataParse.Words).AsColumn(), " ")
            .Alias(wordList))
    .Select(fileCol,
        Functions.Explode(wordList.AsColumn())
            .Alias(word));
The query generates one row per word, or term. This data frame is the basis for computing the term frequency (TF): the count of each word in each document.
var termFrequency = words
    .GroupBy(fileCol, Functions.Lower(word.AsColumn()).Alias(word))
    .Count()
    .OrderBy(fileCol, count.AsColumn().Desc());
Spark has a built-in model for determining term frequency-inverse document frequency. For this example, you determine the term frequency manually to demonstrate how it is calculated. Terms occur with a particular frequency in each document. A document about wizards may have a high count of the word "wizard." The words "the" and "is" probably also appear frequently in the same document. To us, it is obvious that "wizard" is the more important word and provides more context; Spark, however, has to be taught how to recognize important terms. To determine what is truly important, we summarize the document frequency: the number of documents in the repo in which a word appears. This is a "group by word across all documents":
var documentFrequency = words
    .GroupBy(Functions.Lower(word.AsColumn()).Alias(word))
    .Agg(Functions.CountDistinct(fileCol).Alias(docFrequency));
Now it is time to do the math. A special equation calculates what is known as the inverse document frequency, or IDF: the natural logarithm of the total number of documents (plus one), divided by the document frequency of the word (plus one):
static double CalculateIdf(int docFrequency, int totalDocuments) =>
    Math.Log(totalDocuments + 1) / (docFrequency + 1);
Words that appear in all documents receive a lower value than words that appear less frequently. For example, given 1,000 documents, a word that appears in every document has an IDF of 0.003, compared to a word that appears in only a few documents (closer to 1). Spark supports user-defined functions, which you register like this:
spark.Udf().Register<int, int, double>(nameof(CalculateIdf), CalculateIdf);
Next, you can use this function to calculate the IDF of all words in the data frame:
var idfPrep = documentFrequency
    .Select(word.AsColumn(), docFrequency.AsColumn())
    .WithColumn(total, Functions.Lit(totalDocs))
    .WithColumn(inverseDocFrequency,
        Functions.CallUDF(nameof(CalculateIdf),
            docFrequency.AsColumn(),
            total.AsColumn()));
Using the document frequency data frame, this adds two columns: the first is the total number of documents, and the second calls your UDF to calculate the IDF. One more step remains to identify the "important words." An important word is one that appears rarely across all documents but frequently in the current document, and it is expressed as the TF-IDF, which is simply the product of TF and IDF. Consider "is," with an IDF of 0.002 and a frequency of 50 in a document: its TF-IDF is 0.1. Compare that with "wizard," with an IDF of 1 and a frequency of 10 in the same document: its TF-IDF is 10. This gives Spark a much better sense of importance than the raw word count alone.
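A quick check of the arithmetic in that example (plain C#, not part of the Spark job):

// TF-IDF is just term frequency multiplied by inverse document frequency.
double tfIdfIs     = 50 * 0.002; // common word "is":      0.1
double tfIdfWizard = 10 * 1.0;   // distinctive "wizard": 10.0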
So far, you have defined data frames with code; now let's try SparkSQL. To calculate the TF-IDF, you join the document frequency data frame, now carrying the IDF column, with the term frequency data frame and create a new column named termFreq_inverseDocFreq. Here is the SparkSQL:
var idfJoin = spark.Sql(
    $"SELECT t.File, d.word, d.{docFrequency}, d.{inverseDocFrequency}, t.count, " +
    $"d.{inverseDocFrequency} * t.count as {termFreq_inverseDocFreq} " +
    $"FROM {nameof(documentFrequency)} d " +
    $"INNER JOIN {nameof(termFrequency)} t ON t.word = d.word");
Explore the code to see how the final steps are implemented.
All the steps described so far only provide a template, or definition, for Spark. As with LINQ queries, the actual processing does not happen until the results are materialized (for example, when the total document count is computed). The final step calls Collect to process and return the results and write them to another CSV file, which you can then use as input for the ML model.
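Conceptually, the materialization step looks something like the following sketch. The idfJoin data frame comes from the join above; the output folder name is a placeholder, and the project itself writes the CSV after collecting the rows rather than necessarily using the built-in writer shown here.

using System.Linq;
using Microsoft.Spark.Sql;

// Collect() forces the deferred plan to execute and returns rows to the driver,
// where they can be written out as CSV for the ML.NET step.
Row[] rows = idfJoin.Collect().ToArray();

// Alternatively, Spark can persist a data frame directly as CSV:
idfJoin.Write()
    .Mode(SaveMode.Overwrite)
    .Option("header", true)
    .Csv("tfidf-output");   // hypothetical output folder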
Spark for .NET enables you to query and shape data. You built multiple data frames over the same data source and then joined them to gain insights about important terms, word count, and reading time. The next step is to apply ML to generate categories automatically.
Predict Categories
The final step is to categorize the documents. The DocMLCategorization project references the Microsoft.ML package for ML.NET. Whereas Spark works with data frames, ML.NET provides a similar concept: the data view.
This example uses a separate project for ML.NET so that the model can be trained as a separate step. In many scenarios you can reference ML.NET directly from your .NET for Spark project and run the ML as part of the same job.
First, you must annotate the class so that ML.NET knows which columns in the source data map to which properties of the class. The FileData class uses LoadColumn attributes, like this:
[LoadColumn(0)]
public string File { get; set; }

[LoadColumn(1)]
public string Title { get; set; }
You can then create a context for the model and load a data view from the file generated in the previous step:
var context = new MLContext(seed: 0);
var dataToTrain = context.Data.LoadFromTextFile<FileData>(
    path: filesHelper.ModelTrainingFile,
    hasHeader: true,
    allowQuoting: true,
    separatorChar: ',');
ML algorithms work best with numbers, so the text in the documents must be converted into numeric vectors. ML.NET provides the FeaturizeText method for this. In a single step, the model:
Detects the language
Tokenizes the text into individual words, or tokens
Normalizes the text so that variants of words are standardized and similarly cased
Converts the terms into consistent numeric values, or "feature vectors," ready for processing
The following code converts the columns into features and then creates a combined "Features" column that concatenates them:
var pipeline = context.Transforms.Text.FeaturizeText(
        nameof(FileData.Title).Featurized(), nameof(FileData.Title))
    .Append(context.Transforms.Text.FeaturizeText(
        nameof(FileData.Subtitle1).Featurized(), nameof(FileData.Subtitle1)))
    .Append(context.Transforms.Text.FeaturizeText(
        nameof(FileData.Subtitle2).Featurized(), nameof(FileData.Subtitle2)))
    .Append(context.Transforms.Text.FeaturizeText(
        nameof(FileData.Subtitle3).Featurized(), nameof(FileData.Subtitle3)))
    .Append(context.Transforms.Text.FeaturizeText(
        nameof(FileData.Subtitle4).Featurized(), nameof(FileData.Subtitle4)))
    .Append(context.Transforms.Text.FeaturizeText(
        nameof(FileData.Subtitle5).Featurized(), nameof(FileData.Subtitle5)))
    .Append(context.Transforms.Text.FeaturizeText(
        nameof(FileData.Top20Words).Featurized(), nameof(FileData.Top20Words)))
    .Append(context.Transforms.Concatenate(
        features,
        nameof(FileData.Title).Featurized(),
        nameof(FileData.Subtitle1).Featurized(),
        nameof(FileData.Subtitle2).Featurized(),
        nameof(FileData.Subtitle3).Featurized(),
        nameof(FileData.Subtitle4).Featurized(),
        nameof(FileData.Subtitle5).Featurized(),
        nameof(FileData.Top20Words).Featurized()));
At this point the data is properly prepared for training the model. The training is unsupervised, which means the model must infer information without labeled examples. You are not feeding sample categories into the model, so the algorithm must work out how the data is related by analyzing how the features cluster together. You will use the k-means clustering algorithm, which uses the features to compute the "distance" between documents and then "draws" boundaries around groups of documents. The algorithm involves randomization, so two runs will not produce the same results. The main challenge is determining the optimal cluster size for training: different document sets have different optimal numbers of categories, but the algorithm requires you to specify the number of clusters before training.
The code iterates over cluster counts from 2 to 20 to determine the optimal size. For each run, it takes the feature data and applies the algorithm, or trainer; it then transforms the existing data with the resulting model and evaluates the predictions to determine the average distance of documents within each cluster. The run with the smallest average distance is selected:
var options = new KMeansTrainer.Options
{
    FeatureColumnName = features,
    NumberOfClusters = categories,
};

var clusterPipeline = pipeline.Append(context.Clustering.Trainers.KMeans(options));
var model = clusterPipeline.Fit(dataToTrain);
var predictions = model.Transform(dataToTrain);
var metrics = context.Clustering.Evaluate(predictions);
distances.Add(categories, metrics.AverageDistance);
After training and evaluation, you can save the best model and use it to make predictions on the dataset. An output file is generated along with a summary that shows some metadata about each category and lists the titles that fall under it. The title is only one of several features, so sometimes you have to look at the details to make sense of a category. In local tests, documents such as tutorials ended up in one group, API documentation in another, and exceptions in their own group.
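Saving the winning model uses the standard ML.NET API. A minimal sketch, where "bestModel" stands for the model chosen from the cluster-size search above and the file name is a placeholder:

// Persist the best-performing model (and its input schema) as a single zip file.
context.Model.Save(bestModel, dataToTrain.Schema, "doc-categories.zip");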
The machine learning model is saved as a single zip file. The file can be included in other projects and used with a prediction engine to make predictions on new data. For example, you could create a WPF application that lets users browse a directory and then loads the trained model to classify the documents without having to train it first.
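A minimal consumption sketch follows. The FileData input class comes from the training project; the ClusterPrediction class, the file path, and the sample input are assumptions for illustration.

using System;
using Microsoft.ML;
using Microsoft.ML.Data;

var context = new MLContext();

// Load the trained model from the zip file produced by the training step.
ITransformer model = context.Model.Load("doc-categories.zip", out DataViewSchema inputSchema);

// A prediction engine scores one document at a time.
var engine = context.Model.CreatePredictionEngine<FileData, ClusterPrediction>(model);

var someFileData = new FileData { Title = "Get started with .NET" };   // populate the remaining fields
ClusterPrediction prediction = engine.Predict(someFileData);
Console.WriteLine($"Predicted category: {prediction.PredictedClusterId}");

// Hypothetical output class: ML.NET's k-means trainer emits a predicted cluster id
// ("PredictedLabel") and the distances to each centroid ("Score").
public class ClusterPrediction
{
    [ColumnName("PredictedLabel")]
    public uint PredictedClusterId { get; set; }

    [ColumnName("Score")]
    public float[] Distances { get; set; }
}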
What's the next step?
Spark for .NET was scheduled to reach general availability alongside .NET 5. Visit https://aka.ms/spark-net-roadmap to read the roadmap and feature launch plans. (Translator's note: .NET 5 has since shipped, and Spark for .NET was released along with it.)
This article focused on the local development experience. To take full advantage of the power of big data, you can submit your Spark jobs to the cloud. A variety of cloud services can host petabytes of data and provide dozens of cores of compute for your workloads. Azure Synapse Analytics is an Azure service designed to host large amounts of data, provide clusters for running big data jobs, and enable interactive exploration through chart-based dashboards. To learn how to submit Spark for .NET jobs to Azure Synapse, read the official documentation (https://aka.ms/spark-net-synapse).