How to use C and C++ in data science

2025-01-18 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

This article explains how to use C and C++ in data science. The method it introduces is simple, fast, and practical, so read on if the topic interests you.

Programming task

The programs you will write in this series:

Read data from a CSV file

Interpolate the data with a straight line (that is, f(x) = m ⋅ x + q)

Plot the results to an image file

This is a common situation that many data scientists encounter. The sample data is the first set of Anscombe's quartet, as shown in the following table. The quartet is a group of artificially constructed data sets that give the same results when fitted with a straight line, although their plots look very different. The data file is a text file that uses tabs as column delimiters and has a few header lines. This task uses only the first set (that is, the first two columns).

The C way

C is a general-purpose programming language and one of the most widely used languages today (according to the TIOBE index, the RedMonk programming language rankings, the Popularity of Programming Language index, and GitHub's State of the Octoverse). It is a fairly old language (born around 1973), and many successful programs have been written in it (the Linux kernel and Git are just two examples). It is also one of the languages closest to the inner workings of a computer, because it manipulates memory directly. It is a compiled language; therefore, the source code must be translated into machine code by a compiler. Its standard library is small and light on features, so other libraries have been developed to provide the missing functionality.

I use this language most often for number crunching, mainly because of its performance. I find it tedious to use because it requires a lot of boilerplate code, but it is well supported in a variety of environments. The C99 standard is a later revision of the language that adds some useful features and is well supported by compilers.

Along the way, I will introduce the necessary background on C and C++ programming so that both beginners and advanced users can follow along.

Installation

To develop with C99, you need a compiler. I usually use Clang, but GCC is another valid open source compiler. For linear fitting, I chose the GNU Scientific Library (GSL). I could not find any sensible library for plotting, so the program relies on an external program: Gnuplot. The example also uses a dynamic data structure, defined in the Berkeley Software Distribution (BSD), to store the data.

It is easy to install in Fedora:

    sudo dnf install clang gnuplot gsl gsl-devel

Code comments

In C99, a comment starts with // and runs to the end of the line; the compiler discards it. In addition, anything between /* and */ is also discarded.

    // This is a comment and will be ignored by the compiler
    /* This is also ignored */

Necessary libraries

A library is composed of two parts:

A header file, which contains the function declarations

A source file, which contains the function definitions

Header files are included in the source file, while the library's compiled code is linked against the executable. The header files required for this example are therefore:

    // Input/output functions
    #include <stdio.h>
    // The standard library
    #include <stdlib.h>
    // String manipulation functions
    #include <string.h>
    // BSD queue
    #include <sys/queue.h>
    // GSL scientific functions
    #include <gsl/gsl_fit.h>
    #include <gsl/gsl_statistics_double.h>

Main function

In C, the program must live inside a special function called main():

    int main(void) {
        ...
    }

This differs from Python, covered in the previous tutorial, which runs whatever code it finds in the source file.

Defining variables

In C, variables must be declared before they are used, and each one must be associated with a type. Whenever you want to use a variable, you must decide what kind of data to store in it. You can also declare that a variable is meant to hold a constant value; this is not required, but the compiler can benefit from the information. The following is from the fitting_C99.c program in the repository:

    const char *input_file_name = "anscombe.csv";
    const char *delimiter = "\t";
    const unsigned int skip_header = 3;
    const unsigned int column_x = 0;
    const unsigned int column_y = 1;
    const char *output_file_name = "fit_C99.csv";
    const unsigned int N = 100;

Arrays in C are not dynamic, in the sense that their length must be decided in advance (that is, before compilation):

    int data_array[1024];

Since you usually don't know how many data points the file contains, use a singly linked list. This is a dynamic data structure that can grow indefinitely. Luckily, BSD provides linked lists. Here is an example definition:

    struct data_point {
        double x;
        double y;

        SLIST_ENTRY(data_point) entries;
    };

    SLIST_HEAD(data_list, data_point) head = SLIST_HEAD_INITIALIZER(head);
    SLIST_INIT(&head);

This example defines a list of data_point structures, each holding an x value and a y value. The syntax is rather involved but intuitive, and describing it in detail would take too long.
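For readers who want to see the list macros in action, here is a minimal sketch built on the same struct. It assumes a platform that ships the BSD sys/queue.h header (glibc and the BSDs do). The helper names push_point, count_points, and free_points are hypothetical and do not come from fitting_C99.c.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/queue.h>

/* One (x, y) sample; SLIST_ENTRY adds the "next" pointer used by the list. */
struct data_point {
    double x;
    double y;
    SLIST_ENTRY(data_point) entries;
};

/* Declares `struct data_list`, the type of the list head. */
SLIST_HEAD(data_list, data_point);

/* Hypothetical helper: allocate a point and push it at the front of the list.
 * Returns 0 on success, -1 if malloc() fails. */
static int push_point(struct data_list *head, double x, double y)
{
    assert(head != NULL);

    struct data_point *datum = malloc(sizeof *datum);
    if (datum == NULL) {
        return -1;
    }
    datum->x = x;
    datum->y = y;
    SLIST_INSERT_HEAD(head, datum, entries);
    return 0;
}

/* Hypothetical helper: count the points by walking the list. */
static size_t count_points(struct data_list *head)
{
    size_t count = 0;
    struct data_point *datum;

    SLIST_FOREACH(datum, head, entries) {
        count += 1;
    }
    return count;
}

/* Hypothetical helper: release every node that push_point() allocated. */
static void free_points(struct data_list *head)
{
    while (!SLIST_EMPTY(head)) {
        struct data_point *datum = SLIST_FIRST(head);
        SLIST_REMOVE_HEAD(head, entries);
        free(datum);
    }
}
```

A caller would declare the head with struct data_list head = SLIST_HEAD_INITIALIZER(head); and pass &head to each helper. Because every insertion happens at the head, the most recently pushed point is always the first one visited.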

Printing output

To print to the terminal, you can use the printf() function, which works like Octave's printf() function (described in the first article):

    printf("# Anscombe's first set with C99 #\n");

The printf() function does not automatically add a newline at the end of the printed string, so you must add it yourself. The first argument is a string that can contain format specifiers for the other arguments passed to the function, such as:

    printf("Slope: %f\n", slope);

Reading data

Now comes the hard part. There are libraries for parsing CSV files in C, but none seemed stable or popular enough to be in the Fedora package repository. Rather than add a dependency for this tutorial, I decided to write this part myself. Again, going into the details would take too long, so I will explain only the general idea. For brevity, some lines of the source code are omitted, but you can find the complete example code in the repository.

First, open the input file:

    FILE *input_file = fopen(input_file_name, "r");

Then read the file line by line until an error occurs or the file ends:

    while (!ferror(input_file) && !feof(input_file)) {
        size_t buffer_size = 0;
        char *buffer = NULL;

        getline(&buffer, &buffer_size, input_file);

        ...
    }

The getline() function is a nice addition from the POSIX.1-2008 standard. It reads an entire line from the file and takes care of allocating the necessary memory. Each line is then split into tokens with the strtok() function. Loop over the tokens and select the columns you want:

    char *token = strtok(buffer, delimiter);

    while (token != NULL) {
        double value;
        sscanf(token, "%lf", &value);

        if (column == column_x) {
            x = value;
        } else if (column == column_y) {
            y = value;
        }

        column += 1;
        token = strtok(NULL, delimiter);
    }
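The token loop above can be wrapped in a small function that also reports whether both columns were actually numeric, which is handy for header lines. This is only a sketch: parse_columns is a hypothetical helper, not part of fitting_C99.c, and it assumes the two column indexes are different.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: split `line` on `delimiter` and store the values found
 * in columns `column_x` and `column_y`. Returns 1 if both columns were parsed
 * as numbers, 0 otherwise. Note that strtok() modifies `line` in place. */
static int parse_columns(char *line, const char *delimiter,
                         unsigned int column_x, unsigned int column_y,
                         double *x, double *y)
{
    unsigned int column = 0;
    int found_x = 0;
    int found_y = 0;

    assert(line != NULL && delimiter != NULL && x != NULL && y != NULL);

    for (char *token = strtok(line, delimiter); token != NULL;
         token = strtok(NULL, delimiter)) {
        double value;

        /* sscanf() returns the number of converted items: 1 means success. */
        if (sscanf(token, "%lf", &value) == 1) {
            if (column == column_x) {
                *x = value;
                found_x = 1;
            } else if (column == column_y) {
                *y = value;
                found_y = 1;
            }
        }
        column += 1;
    }
    return found_x && found_y;
}
```

Checking the return value of sscanf() is the main difference from the loop above: a token such as a column title leaves value untouched, so the function returns 0 instead of storing garbage.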

Finally, when the x and y values are selected, the new data point is inserted into the linked list:

    struct data_point *datum = malloc(sizeof(struct data_point));
    datum->x = x;
    datum->y = y;

    SLIST_INSERT_HEAD(&head, datum, entries);

The malloc() function dynamically allocates (reserves) some persistent memory for the new data point.
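As a complement to the reading loop, the sketch below shows one possible way to drive it from the return value of getline() and to reuse a single buffer for every line. It is an assumption-laden illustration, not the code from the repository: count_data_rows is a hypothetical helper, and the _POSIX_C_SOURCE definition is there because strict -std=c99 builds may otherwise hide getline().

```c
#define _POSIX_C_SOURCE 200809L /* exposes getline() in strict C99 builds */
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: count the data rows of an already opened stream,
 * ignoring the first `skip_header` lines. */
static size_t count_data_rows(FILE *input, unsigned int skip_header)
{
    char *buffer = NULL;    /* getline() allocates and grows this buffer */
    size_t buffer_size = 0;
    size_t row = 0;
    size_t data_rows = 0;

    assert(input != NULL);

    /* getline() returns -1 at end of file or on error, which ends the loop
     * without a separate feof()/ferror() test. */
    while (getline(&buffer, &buffer_size, input) != -1) {
        if (row >= skip_header) {
            data_rows += 1;
        }
        row += 1;
    }

    free(buffer); /* the same buffer served every line, so free it once */
    return data_rows;
}
```

With the three header lines used in this tutorial, count_data_rows(input_file, 3) would return the number of data points, which is the size needed for the arrays in the fitting step.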

Fitting data

The GSL linear fitting function gsl_fit_linear() expects plain arrays as input. Since you cannot know the size of the arrays in advance, you must allocate their memory manually:

    const size_t entries_number = row - skip_header - 1;

    double *x = malloc(sizeof(double) * entries_number);
    double *y = malloc(sizeof(double) * entries_number);

Then, traverse the linked list and save the relevant data to the arrays:

    SLIST_FOREACH(datum, &head, entries) {
        const double current_x = datum->x;
        const double current_y = datum->y;

        x[i] = current_x;
        y[i] = current_y;

        i += 1;
    }

Now that you are done with the linked list, clean it up. Always free memory that was manually allocated to prevent memory leaks. Memory leaks are bad, bad, bad. Every time memory is not released, a garden gnome loses its head:

    while (!SLIST_EMPTY(&head)) {
        struct data_point *datum = SLIST_FIRST(&head);

        SLIST_REMOVE_HEAD(&head, entries);

        free(datum);
    }
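The copy and the cleanup can also be merged into a single pass over the list. The following sketch shows that variant under the same struct definition. drain_to_arrays is a hypothetical helper and is not how fitting_C99.c does it.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/queue.h>

struct data_point {
    double x;
    double y;
    SLIST_ENTRY(data_point) entries;
};

SLIST_HEAD(data_list, data_point);

/* Hypothetical helper: copy up to `capacity` points into x[] and y[], then
 * free every node, copied or not. Returns the number of points copied. */
static size_t drain_to_arrays(struct data_list *head,
                              double *x, double *y, size_t capacity)
{
    size_t i = 0;

    assert(head != NULL && x != NULL && y != NULL);

    while (!SLIST_EMPTY(head)) {
        struct data_point *datum = SLIST_FIRST(head);

        if (i < capacity) {
            x[i] = datum->x;
            y[i] = datum->y;
            i += 1;
        }
        SLIST_REMOVE_HEAD(head, entries);
        free(datum);
    }
    return i;
}
```

Since the points were inserted at the head of the list, they come out in the reverse of the file order. That does not matter for a least-squares fit, which does not depend on the order of the points.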

Finally, finally! You can fit your data:

    gsl_fit_linear(x, 1, y, 1, entries_number,
                   &intercept, &slope,
                   &cov00, &cov01, &cov11, &chi_squared);
    const double r_value = gsl_stats_correlation(x, 1, y, 1, entries_number);

    printf("Slope: %f\n", slope);
    printf("Intercept: %f\n", intercept);
    printf("Correlation coefficient: %f\n", r_value);

Plotting

You must use an external program to plot. Therefore, save the fitted line to an external file:

    const double step_x = ((max_x + 1) - (min_x - 1)) / N;

    for (unsigned int i = 0; i < N; i += 1) {
        const double current_x = (min_x - 1) + step_x * i;
        const double current_y = intercept + slope * current_x;

        fprintf(output_file, "%f\t%f\n", current_x, current_y);
    }
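To make the sampling arithmetic easier to inspect, here is the same computation written into arrays instead of a file. It is a sketch only: sample_line and its parameter names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: fill out_x[] and out_y[] with n points of the line
 * y = intercept + slope * x, starting at (min_x - 1) and advancing in steps
 * of ((max_x + 1) - (min_x - 1)) / n, as in the loop above. */
static void sample_line(double min_x, double max_x,
                        double intercept, double slope,
                        double *out_x, double *out_y, size_t n)
{
    assert(n > 0 && out_x != NULL && out_y != NULL);

    const double start_x = min_x - 1.0;
    const double step_x = ((max_x + 1.0) - start_x) / (double)n;

    for (size_t i = 0; i < n; i += 1) {
        out_x[i] = start_x + step_x * (double)i;
        out_y[i] = intercept + slope * out_x[i];
    }
}
```

For example, with min_x = 4, max_x = 14, and n = 4, the step is 3 and the sampled x values are 3, 6, 9, and 12. The last sample therefore falls one step short of max_x + 1, which is harmless for plotting when n is as large as 100.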

The Gnuplot command to plot both files is:

    plot 'fit_C99.csv' using 1:2 with lines title 'Fit', 'anscombe.csv' using 1:2 with points pointtype 7 title 'Data'

Results

Before running the program, you must compile it:

    clang -std=c99 -I/usr/include/ fitting_C99.c -L/usr/lib/ -L/usr/lib64/ -lgsl -lgslcblas -o fitting_C99

This command tells the compiler to use the C99 standard, read the fitting_C99.c file, link against the gsl and gslcblas libraries, and save the result to fitting_C99. The output on the command line is:

    # Anscombe's first set with C99 #
    Slope: 0.500091
    Intercept: 3.000091
    Correlation coefficient: 0.816421
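The slope and intercept above can be cross-checked without GSL by applying the textbook least-squares formulas directly. The sketch below is not what gsl_fit_linear() does internally. fit_line is a hypothetical helper, and the data values are the published first set of Anscombe's quartet typed in by hand rather than read from anscombe.csv.

```c
#include <assert.h>
#include <stddef.h>

/* First set of Anscombe's quartet (11 points), typed in by hand. */
static const double anscombe_x[] = {10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5};
static const double anscombe_y[] = {8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
                                    7.24, 4.26, 10.84, 4.82, 5.68};

/* Hypothetical helper: ordinary least-squares fit of
 * y = intercept + slope * x using
 *   slope     = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2)
 *   intercept = mean_y - slope * mean_x */
static void fit_line(const double *x, const double *y, size_t n,
                     double *intercept, double *slope)
{
    double mean_x = 0.0, mean_y = 0.0;
    double s_xx = 0.0, s_xy = 0.0;

    assert(n > 1);

    for (size_t i = 0; i < n; i += 1) {
        mean_x += x[i];
        mean_y += y[i];
    }
    mean_x /= (double)n;
    mean_y /= (double)n;

    for (size_t i = 0; i < n; i += 1) {
        s_xx += (x[i] - mean_x) * (x[i] - mean_x);
        s_xy += (x[i] - mean_x) * (y[i] - mean_y);
    }
    assert(s_xx > 0.0); /* identical x values define no unique line */

    *slope = s_xy / s_xx;
    *intercept = mean_y - *slope * mean_x;
}
```

Calling fit_line(anscombe_x, anscombe_y, 11, &intercept, &slope) gives a slope of about 0.50009 and an intercept of about 3.00009, in agreement with the program output.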

This is the resulting image generated with Gnuplot:

The C++11 way

C++ is a general-purpose programming language and one of the most popular languages in use today. It was created as a successor to C (in 1983), with an emphasis on object-oriented programming (OOP). C++ is commonly regarded as a superset of C, so a C program should compile with a C++ compiler. This is not exactly true, because the two behave differently in some corner cases. In my experience, C++ needs less boilerplate than C, but its syntax is more difficult if you want to develop objects. The C++11 standard is a later revision that adds some useful features and is more or less supported by compilers.

Since C++ is largely compatible with C, I will only highlight the differences between the two. Anything I do not cover in this section is the same as in C.

Installation

The dependencies of the C++ example are the same as those of the C example. On Fedora, run:

    sudo dnf install clang gnuplot gsl gsl-devel

Necessary libraries

Libraries work the same way as in C, but the include directives are slightly different:

    #include <...>
    #include <...>

    extern "C" {
    #include <gsl/gsl_fit.h>
    #include <gsl/gsl_statistics_double.h>
    }

Since the GSL library is written in C, you must inform the compiler of this special case.

Defining variables

C++ supports more data types (classes) than C; for example, its string type has many more features than its C counterpart. Update the variable definitions accordingly:

    const std::string input_file_name("anscombe.csv");

For structured objects such as strings, you can define variables without using the = symbol.

Printing output

You can use the printf() function, but the cout object is more idiomatic. Use operator
