
Linux sed remove duplicates and get unique values



Removing duplicates and getting unique values is quite easy when the input data follows a convenient format, for example one value per line, or values separated by spaces that are easy to split onto separate lines.

A dilemma occurs when the data has no spaces between the words of the string and the words are separated by dashes instead.

So, how do you remove the duplicates, get the unique values and still retain the format of the raw data?

Take this raw data (just a sample string):
the-quick-brown-fox-jumps-over-the-lazy-dog-jumps-over-the-lazy-cat-jumps-over-the-rabbit

A first attempt at removing duplicates and getting unique values might be this command:

sort duplicates.txt | uniq (this works only when each value is on its own line, because sort and uniq compare whole lines)

Here duplicates.txt is assumed to contain the string shown above.

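Sample output:

the-quick-brown-fox-jumps-over-the-lazy-dog-jumps-over-the-lazy-cat-jumps-over-the-rabbit

The output is exactly the same as the input. Why? Because sort and uniq work line by line, and the whole string is a single line. The dashes join the words into one unbroken string, so there are no separate values for uniq to compare.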
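The behaviour described above can be checked with a small sketch. The variable names (line, unchanged, unique_words) are made up for illustration, and the sample string is the one from this post:

```shell
#!/bin/sh
# Sample string from the post (variable names here are illustrative only)
line='the-quick-brown-fox-jumps-over-the-lazy-dog-jumps-over-the-lazy-cat-jumps-over-the-rabbit'

# sort and uniq compare whole lines: one line in, the same line out
unchanged=$(printf '%s\n' "$line" | sort | uniq)

# Splitting on dashes first puts one word per line, which sort -u can deduplicate
unique_words=$(printf '%s\n' "$line" | tr '-' '\n' | sort -u | wc -l | tr -d ' ')

if [ "$unchanged" = "$line" ]; then
    echo "sort | uniq left the line unchanged"
fi
echo "unique words after splitting on dashes: $unique_words"
```

The first pipeline returns the input untouched, while the second one sees individual words and can count the distinct ones.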

For example, suppose the requirement is to remove the duplicates, get the unique values and keep the dashes in the final output. How can it be done?

Removing the duplicates, getting the unique values and still retaining the same format can be done in 3 steps.
  1. Remove the dashes from the raw data
  2. Remove the duplicates and get unique values
  3. Put back the dashes to the final output


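Here’s the shell script:

#!/bin/sh

# Input file holding the dash-separated string
xfile="file_with_dashes_and_duplicates.txt"

# Step 1: replace every dash with a space
sed 's/-/ /g' "$xfile" > "$HOME/file_with_no_dashes.txt"

# Step 2: put one word per line, sort and drop duplicates, then join back into one line
xargs -n1 < "$HOME/file_with_no_dashes.txt" | sort -u | xargs > "$HOME/file_no_duplicate.txt"
file="$HOME/file_no_duplicate.txt"

# Step 3: replace every space with a dash
sed 's/ /-/g' "$file" > "$HOME/final_output_no_duplicates_unique_content.txt"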

=======
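sed 's/-/ /g' == find each dash "-" and substitute it with a space (the single space between the second and third slashes); g (global) == apply to every match on the line, not only the first

sed 's/ /-/g' == find each space and substitute it with a dash; g (global) == every match on the line

xargs -n1 == print the words one per line, so that sort can compare them individually

sort -u == sort the words and keep one copy of each; note that the words come out in alphabetical order, not in their original order

> file_no_duplicate.txt == I/O redirection: write the output of the command to the specified file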
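As an alternative sketch (not the script from this post), the same three steps can be done in one pipeline without temporary files, using tr to split on dashes and paste to join the words back. The function names dedupe_sorted and dedupe_keep_order are hypothetical helpers invented for this example. The second one swaps sort -u for the awk idiom '!seen[$0]++', which keeps the first occurrence of each word in its original position:

```shell
#!/bin/sh
# Hypothetical helpers: read a dash-separated line on stdin,
# write the deduplicated, dash-separated line to stdout.

# Variant 1: unique words in alphabetical order (same result as sort -u in the script)
dedupe_sorted() {
    tr '-' '\n' | LC_ALL=C sort -u | paste -s -d '-' -
}

# Variant 2: unique words in their original order of first appearance
dedupe_keep_order() {
    tr '-' '\n' | awk '!seen[$0]++' | paste -s -d '-' -
}

sample='the-quick-brown-fox-jumps-over-the-lazy-dog-jumps-over-the-lazy-cat-jumps-over-the-rabbit'

printf '%s\n' "$sample" | dedupe_sorted
printf '%s\n' "$sample" | dedupe_keep_order
```

With the sample string, the first call prints brown-cat-dog-fox-jumps-lazy-over-quick-rabbit-the and the second prints the-quick-brown-fox-jumps-over-lazy-dog-cat-rabbit. Both helpers assume the input is a single line; a file with several lines would be merged into one line of output.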

Output of the above script at each stage (screenshots: the script, the raw file or input data, the data with no dashes, the data with no duplicates, and the final output).


Cheers. Enjoy Linux!!!
